[Request for discussion] Alpha code.json Project Inventory Schema #44

theresaanna · 2016-09-20T18:02:26Z

Thank you all so much for the feedback you've given on the draft schema and for the great discussion. It's been invaluable and is baked into the alpha version.

You'll find our alpha version of the project metadata schema below. We welcome your feedback as we iterate on it. We've been working to come up with something carefully considered and easy to comply with as soon as possible so that agencies can have the most time possible to prepare.

The schema needs to describe a vast and diverse universe of software, though we aren't the first folks to think about how to describe a software package. Some of you have mentioned other schemas that aim to solve similar problems. We've looked at them and have found one that we feel best matches the projects we aim to describe here.

Code for DC and DC employees extended a project schema created by BetaNYC to create their civic.json. We are extending (and slightly modifying) it to create our own schema. They have schema creation and validation tools that we hope to leverage as well. Thanks to the team for sharing their great work!

code.json:

{
    "agency": "DOABC",
    "projects": [
        {
            "status": "Alpha",
            "vcs": "git",
            "repository": "https://github.com/presidential-innovation-fellows/mygov",
            "homepage": "https://agency.gov/project-homepage",
            "downloadURL": "https://agency.gov/project/dist.tar.gz",
            "name": "mygov",
            "description": "A Platform for Connecting People and Government",
            "tags": [
              "platform",
              "government",
              "connecting",
              "people"
            ],
            "languages": [
              "java",
              "python"
            ],
            "contact": {
              "email": "project@agency.gov",
              "name": "Project Coordinator Name",
              "twitter": "https://twitter.com/projectname",
              "phone": "2025551313"
            },
            "partners": [
                {
                    "name": "DOXYZ",
                    "email": "project@doxyz.gov"
                }    
            ],
            "license": "https://path.to/license" OR null,
            "openSourceProject": 1,
            "governmentWideReuseProject": 0,
            "exemption": null,
            "updated": {
                "lastCommit": "2016-04-30",
                "metadataLastUpdated": "2016-04-13",
                "sourceCodeLastModified": "2016-04-12"
            }
        }
    ]
}

You can see our working document here, with some discussion that's already taken place.

Required fields:

Agency - The agency's acronym
Project:
- Name - The project name
- Description - A description of the project
- License - null or a URL to the project license
- Open Source Project: 0, indicating a closed source codebase, or 1, indicating an open source codebase
- Government-wide Reuse Project: 0 indicates that the project is not developed for government-wide reuse, 1 indicates that it is
- Tags: A list of keywords that describe the project
- Contact:
  - Email: Preferably an address dedicated to the project, though any address where a project contact can be reached is acceptable

Optional fields:

Project:
- Status: The list of accepted options can be found in the civic.json specification: "Ideation", "Alpha", "Beta", "Production", "Archival"
- VCS: The Version Control System that the project uses
- Repository: The project repository URL
- Homepage: The project homepage URL
- Download URL: The URL where a distribution of the project can be found
- Languages: A list of languages used in the codebase
- Contact
  - Name: The name of a contact for the project
  - Twitter: The URL of the project Twitter account
  - Phone: The phone number of the project contact
- Partners: A list of the acronyms of partner agencies involved in the project
- Exemption: The exemption that excuses the project from government-wide reuse, a number 1-5 corresponding to the exemptions listed here: https://sourcecode.cio.gov/Exceptions/
- Updated:
  - Last Commit: A timestamp of the last commit made on the project. Would ideally be populated dynamically.
  - Metadata Last Updated: The date that the code.json was last updated for this codebase
  - Source Code Last Modified: A field intended for closed-source software and software outside of a VCS. The date the source code or package was last updated

Note:

We differ from civic.json in our implementation of License. civic.json accepts only a URL. We also accept null in the event that there is no license.
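
As a rough illustration of the required fields and the null-or-URL rule for License, here is a minimal Python sketch of a check on a single project entry. The function name `check_project` and the error strings are made up for this example; this is not the civic.json validation tooling mentioned above, and it ignores the top-level Agency field.

```python
from urllib.parse import urlparse

# Required project keys, taken from the "Required fields" list in this post.
REQUIRED_PROJECT_KEYS = (
    "name", "description", "license", "openSourceProject",
    "governmentWideReuseProject", "tags", "contact",
)


def check_project(project):
    """Return a list of problems found in one project entry (hypothetical helper)."""
    problems = []
    for key in REQUIRED_PROJECT_KEYS:
        if key not in project:
            problems.append("missing required field: " + key)

    # License may be null (None in Python) or a URL string.
    license_value = project.get("license")
    if license_value is not None:
        parsed = urlparse(license_value) if isinstance(license_value, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("license must be null or a URL")

    # The two flags are 0 or 1 in the alpha schema.
    for flag in ("openSourceProject", "governmentWideReuseProject"):
        if flag in project and project[flag] not in (0, 1):
            problems.append(flag + " must be 0 or 1")

    # Contact requires at least an email.
    contact = project.get("contact")
    if isinstance(contact, dict) and not contact.get("email"):
        problems.append("contact.email is required")

    return problems
```

An empty list means the entry passed these particular checks; anything stricter (tag format, date format) is left out because those questions are still open.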

Remaining questions:

Are any of these fields likely to be overly burdensome for agencies to collect or difficult to interpret?
Are there any reasons we shouldn't accept two different data types for License: null and a string?
Should the Tags field accept any alphanumeric tag freeform, or should we define a set of tags? The former provides less clean data but maximum flexibility, while the latter keeps data clean and searches simpler, but is rigid.
Are there other fields we should include in Contact?
The value of Exemption would be a number between 1-5, corresponding to the exemptions in the policy. Is it a good idea to tie the Exemption field to the content of the policy? I assume this policy will change?
What format should we collect the Updated field's timestamps in?
Is there any other information we should collect about partner agencies?

emanuelfeld · 2016-09-21T00:45:30Z

Initial thoughts:

License

I'm reminded of this recent discussion relating to whether 18F's non-standard license was actually required.

Could additionally allowing an SPDX license identifier (which is what npm's package.json does) prod agencies to use standard licenses? This is something that could be baked into a code.json form (and probably auto-filled once given a license URL).

Agency

Looking at the top-level agency field, I'm concerned that there may be duplication/conflicts in cases where more than one agency is involved in a code project. In DC civic.json the partners field includes all parties involved, including the principal one.

Updated

@stvnrlly is (I think rightly) biased against fields that will be frequently updated. I believe all of these fields should be automatically generated. These are easy to neglect/mess up.

Contact

You may want to allow for additional contact URLs, outside of Twitter. DC Civic.json's contact object has a freeform URL attribute.

Tags

Defining taxonomies is a pain. I don't believe anyone can predict in advance the specific tags that would prove useful. On the other hand, there may be room for guidance on what makes a good tag. You often see people including every conjugation/singular/plural/geographic format/etc.

Unique IDs

If implemented with a view toward an API, I would like a way to discover new projects and monitor changes in existing ones (e.g. new partners, updated license, new repository URL). Could any of the required code.json fields serve as a unique ID?

stvnrlly · 2016-09-21T04:35:55Z

Cool! Very nice to see this happening, and it looks great. Overall, I agree with Emanuel's comments (especially when he says that he thinks that I'm right). Here are some additional thoughts:

Government-specific Elements

I'd like to encourage you to think about how this could be useful outside of the government, too. With the possible exception of the exemptions, I don't think that there's anything necessarily government-specific about this schema, so choosing a term besides agency could allow organizations like Code for DC to move to this standard and get one step closer to a shared standard.

Agency Info

Right now it's just a name, but in my opinion a URL is even better, as it provides some disambiguation and context. As such, making that an object with name and url could be helpful. There are probably some fun OMB codes that could go in there, too, but that doesn't seem useful right now (and might be a job for a separate API at a later date).

Multiple Projects

Making that an array is a great idea.

Binary States

I'd recommend using true and false instead of 0 and 1, as it may make more sense to a non-techie.

Exemptions

Since this is related to Government-Wide Reuse, why not combine them into a single object? Additionally, if multiple exemptions are possible, an array may fit better, and it may help to link directly to a URL for the exemption instead of a number proxy. If a URL isn't possible, naming the exemptions and including that along with the number (e.g. 1 - Law or Regulation) could help future-proof it.

License

The URL requirement in civic.json served as a nice forcing function, making OSS projects think about licensing, but that doesn't work as well for the government. In my opinion, this field exists mostly to answer the question of "Can I use this?" If I saw that a project didn't have a license, I wouldn't necessarily know if that's because it was public domain or because it was under copyright and unlicensed.

So, at the risk of making this much too complicated, I'd propose something like this:

"copyright": {
    "licensed": true,
    "license": [SPDX identifier],
    "copyrighted": true
}

Public domain status could then be indicated with "copyrighted": false, while still allowing for the project to also be marked as CC0.
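
To make the intent concrete, here is a rough Python sketch of how a consumer might read that object to answer "Can I use this?". The function name and the returned strings are assumptions for illustration only, not part of any proposal.

```python
def reuse_status(copyright_info):
    """Summarize the proposed "copyright" object as a short answer (hypothetical helper)."""
    copyrighted = copyright_info.get("copyrighted")
    licensed = copyright_info.get("licensed")
    license_id = copyright_info.get("license")

    if copyrighted is False:
        # Public domain; a license such as CC0 may still be listed alongside.
        if license_id:
            return "public domain, also marked " + license_id
        return "public domain"
    if licensed and license_id:
        return "licensed under " + license_id
    # Not marked public domain and no license given: reuse terms are unclear.
    return "no license listed; ask before reusing"
```

The point of the separate "copyrighted" flag shows up in the first branch: a missing license no longer has to be read as "unlicensed".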

Required Fields

There should be at least one required field that points to a location to learn more about the project, be it a homepage or a repository.

Fields that Don't Need a Human

There are a few things—like description, languages, and last commit—that could also be pulled from the GitHub API. In those situations, I think it's better to leave it out of the schema and let people pull that information directly from the source to reduce the number of places where that information needs to be maintained.

Here, it seems like there's a non-zero probability of non-GitHub projects being tracked, so there may be a good argument for keeping them in.
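
If those fields stay in, they could still be filled in by a script rather than by hand. As a rough illustration, the Python sketch below maps one repository object, shaped like what the GitHub REST API returns for a repository, onto a few of the alpha schema's keys. The function name is hypothetical, only a handful of response fields are used (`name`, `description`, `html_url`, `homepage`, `language`, `pushed_at`), and no network call is made here.

```python
def project_from_github_repo(repo):
    """Build a partial code.json project entry from a GitHub repository object."""
    language = repo.get("language")
    pushed_at = repo.get("pushed_at") or ""
    return {
        "name": repo.get("name"),
        "description": repo.get("description"),
        "vcs": "git",
        "repository": repo.get("html_url"),
        "homepage": repo.get("homepage"),
        # GitHub reports a single primary language; the schema wants a list.
        "languages": [language.lower()] if language else [],
        "updated": {
            # pushed_at is an ISO 8601 timestamp; keep only the date part.
            # It is the time of the last push, used here as a stand-in for lastCommit.
            "lastCommit": pushed_at[:10] or None,
        },
    }
```

Fields such as license, tags, and contact email are left out on purpose, since they would still need a human (or at least a convention) to get right.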

Ability to Complete

Just as a side note, I don't see anything in here that a project member isn't likely to know, which is great. That seems like an obvious thing, but I've definitely dealt with standards that stump me, and then it doesn't get filled out well.

jcastle-zz · 2016-09-22T13:10:59Z

@theresaanna schema looks good for a start. Think the optional element of URL should be mandatory. What's the point in identifying repos by name and not by location?

We need to finalize the schema soon because agencies will have to collect the metadata. They first have to consider where the code libraries are stored.

Does anyone know of a GitHub API that collects all org repo metadata and outputs it in a JSON format (or similar)? That would help jumpstart the metadata collection process.

IanLee1521 · 2016-09-22T16:48:07Z

@jcastle -- I have some JavaScript code that does this to visualize our (@LLNL) orgs on our http://software.llnl.gov page. You can find that here: https://github.com/LLNL/llnl.github.io/blob/master/js/github-dynamic.js

I also have some Python scripts I'll get pushed up today.

IanLee1521 · 2016-09-22T18:23:41Z

@theresaanna -- Is there a suggestion on what to put when the code is not directly from an agency? In my case would I put Lawrence Livermore National Laboratory and / or DOE ?

Based on GSA/code-gov-web#44

IanLee1521 · 2016-09-22T18:28:38Z

A sample I just did for a single project can be found here: https://github.com/LLNL/llnl.github.io/blob/master/_data/code.json

I'll work on a script to get it more fleshed out shortly.

david-a-wheeler · 2016-09-26T14:33:07Z

I would suggest adding "release date", that is, the date it was initially released to the public. This is interesting information for many reasons, and isn't always obvious from the version control information.

This is one of the fields captured here: http://www.dwheeler.com/government-oss-released/

theresaanna · 2016-09-27T15:17:52Z

@IanLee1521 Thanks so much for digging into this and trying it out!

@theresaanna -- Is there a suggestion on what to put when the code is not directly from an agency? In my case would I put Lawrence Livermore National Laboratory and / or DOE ?

That's an excellent question. My thinking is that we may want to add another field that would accommodate LLNL. I'm not sure what to call it, though. Do you think this is a good solution, and do you have an idea of what would be a suitable key?

IanLee1521 · 2016-09-27T17:15:56Z

@theresaanna -- Perhaps something like organization? I imagine that other agencies will have the same issue. Certainly DOE with the national labs. But also I would expect DOD would want to get subdivided to Army, Navy, Marines, Air Force, etc. Another example would be GSA -> 18F.

Another option would be to have that all included in a single field, something like:

"agency": "DOE // LLNL"
"agency": "GSA // 18F"

etc.

okamanda · 2016-09-27T17:49:51Z

Hello folks,

Thanks for keeping up the lively discussion on the alpha version of the schema. The specification for version 1.0 of the metadata schema is now available here: https://github.com/presidential-innovation-fellows/code-gov-web/blob/master/_draft_content/schema/specification_v1_0.md. Sample JSON files to be included soon.

mikecharles · 2016-09-27T17:50:31Z

With an organization you could even drill down to a specific level. For example, my organization would be:

DOC/NOAA/NWS/NCEP/CPC

If the org is parsed into levels, one could query a specific level to see how much code is being produced at that level:

noaa_code = find(organization[1] == 'NOAA')
doc_code = find(organization[0] == 'DOC')

Something like that...
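
A runnable version of that idea might look like the Python sketch below. It assumes a hypothetical "organization" key holding a slash-separated path; the helper names and the sample entries used with it are made up for illustration.

```python
def org_levels(organization):
    """Split a slash-separated organization path into its levels."""
    return [level.strip() for level in organization.split("/") if level.strip()]


def projects_at_level(projects, level, name):
    """Return projects whose organization has `name` at position `level`."""
    matches = []
    for project in projects:
        levels = org_levels(project.get("organization", ""))
        if len(levels) > level and levels[level] == name:
            matches.append(project)
    return matches
```

For example, `projects_at_level(projects, 1, "NOAA")` would pick out entries whose second level is NOAA. Because empty segments are dropped, a value written as "DOE // LLNL" splits into the same two levels as "DOE/LLNL".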

ctubbsii · 2016-09-27T19:31:06Z

I still have no idea what "built for government-wide reuse" means. All open source projects would be "world-wide reuse", so does "government-wide reuse" refer to closed-source, but government-shared projects? It seems to me that all government closed-source software should allow that, provided whatever project-specific prerequisites are met. Further, "for" implies intent. I'm not sure why intent matters. The whole point of open sourcing and inner sourcing is reuse. Aren't all software projects potentially reusable, regardless of initial intent? It seems to me that the whole point of this effort is to make it easier to share and reuse government-produced software, regardless of intent. This is not clear at all, and is prone to confusion.

If that field is kept, it really needs better documentation. That documentation should specifically address the circumstances under which that field has a particular value, and should explain why that value is necessary because it could not be deduced by other attributes (like, open source status/license/exemption/etc.).

bondsbw · 2016-09-28T00:55:47Z

I agree that true and false are much better than 1 and 0.

thecapacity · 2016-09-28T17:50:56Z

I particularly like @stvnrlly 's comments and wanted to record a few (some overlapping with the other commenters too) to weigh in.

Apologies for the quick list;

- I feel like the "vcs" key/value should be dropped. GitHub lets one pull over SSH, Git, or SVN, so it feels confusing (e.g. even in the example, someone might be unsure whether it should be "github" or "git", especially with other companies like Microsoft incorporating git into their tools).
  - This could also potentially be inferred from the URL.
- I feel like the "language" key/value should just be part of the tags, e.g. "python" might just be a tag rather than a "sub field tag". That would allow easier filtering, e.g. search: python+connecting (which is of course possible with multiple fields but harder to implement).
- The "partners" field feels like it's destined to be underutilized, such as when agencies don't know / care who's using their code and maybe won't maintain this. Also, ideally the VCS will track "forking", so this could be queried in real time vs. a static snapshot.
  - The Gov-wide reuse field also feels similarly “undefined”, e.g. an agency might not know if it’s reused.
- The "openSourceProject" field as a binary flag seems confusing (to me) - if the software is released as part of the Open Source release then isn’t this always True?
- I would change "updated" to "schema_updated" to make it clear "what's being updated" (per the earlier discussions).
- Lastly, and I know it was discussed before - I don't feel like "License" should be permitted to be null - and in fact I think it might be better to require this to be a file within the repository.
  - e.g. I think the License field could just be a "tag" vs. a specific field.

MikePulsiferDOL · 2016-10-03T13:53:44Z

I think it's important to think about scalability. How much of this can be automated as @stvnrlly suggested for a few fields? Maintaining the data.json file for DOL has been a nightmare of manual labor, especially when there are schema updates. Even CKAN can be hours and hours of clicking the days away.

IanLee1521 · 2016-10-06T15:24:25Z

@MikePulsiferDOL -- I'm in the process of getting some code released that would help with generating this JSON. It's making its way through our release process, which is taking its time... I'll update once I can push it to @LLNL.

IanLee1521 · 2016-10-06T15:34:07Z

@ctubbsii / @thecapacity -- I think that the two fields "openSourceProject" and "governmentWideReuseProject" are meant to encode three possible states of being for Code that is to be listed on Code.gov:

Open Source
Government-wide Reuse
Closed Source / Exempted / etc.

To @ctubbsii 's comment:

... so does "government-wide reuse" refer to closed-source, but government-shared projects? It seems to me that all government closed-source software should allow that ...

While I agree with the sentiment that all software should allow that, the fact is that until the Federal Source Code Policy there hasn't been any hard requirement for it to be available broadly across government, and the default has been "Closed Source". The policy makes the requirement for government-wide reuse.

mchogan · 2016-10-12T22:12:07Z

Consider adding a schema version identifier so that future parsers will know what fields to expect.

"openSourceProject": 1

Consider eliminating attributes that can be calculated. It should be possible to calculate openSourceProject using the license URL.

"governmentWideReuseProject": 0

Consider minimizing the number of attributes in the schema. Instead of trying to get the right answer on the first try, include an attribute called something like "optional": that accepts an array of JSON objects so that experimental attributes can be tested before graduating to required schema fields. With a schema version attribute a parser would know which fields are expected and that introspection is required for the optional fields.

"status": "Alpha",

Instead of status, consider asking for the release number. Usually a release or version number indicates alpha, beta, 1.0.0, etc. It might be worth recommending a standard like semver, used by Angular2 and other projects.

mchogan · 2016-10-12T22:43:43Z

It might be worth extending an existing package manager schema rather than creating a new one. For example...

JJediny · 2016-10-13T16:54:42Z

Comments on Current/Proposed as of 10/13

openSourceProject

Seems redundant and confusing: if the project uses an accepted open source license then it is true/1; if it doesn't, then false/0. Suggest removing.

agency

Using an agency acronym is dangerous, as some agencies internally can't even agree about their own (e.g. USFWS or FWS, USACE or ArmyCorps, etc.). While program/bureau codes are 'safer' for data quality, they are not intuitive. We have previously made the recommendation that the Government can and should create a reference mapping of agency domain names (e.g. @gsa.gov, @usfws.doi.gov, etc.) to their bureau/program. Using an agency's domain is far more stable and less likely to create a data cleaning nightmare, and frankly speaking the people doing the data entry likely already know their email address.
https://project-open-data.cio.gov/v1.1/schema/#bureauCode

license/language

Both of these attributes should implement/reference a controlled vocabulary to ensure consistency.

Comments on what's missing as of 10/13

Globally Unique Identifiers (GUID/UUID)

These are critical to establishing provenance to the canonical source of data. The whole point of them is that they can be generated in a distributed way yet remain statistically unique: the chance of anyone generating a duplicate GUID/UUID is realistically nil. Not using one (1) makes any parent/child relationships impossible, and (2) since no other unique identifiers are used, as titles change there is no way of knowing "is this project really that project", making it impossible to avoid or test for redundant/duplicative entries. See #56
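
A minimal sketch of what that could look like, using Python's standard `uuid` module. The "id" key and the `assign_ids` helper are assumptions for illustration; the schema under discussion does not define either.

```python
import uuid


def assign_ids(projects):
    """Give each project a random UUID under a hypothetical "id" key, once."""
    for project in projects:
        # Keep an existing identifier so it stays stable when names change.
        if "id" not in project:
            project["id"] = str(uuid.uuid4())
    return projects
```

Because the identifier is assigned once and never recomputed, renaming a project or moving its repository would not change it.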

isPartof (Parent/Child relationships)

As we have discovered in implementing data.json, the concept of a collection (i.e. the ability for one component of a project to reference its parent project) is critically important.
https://project-open-data.cio.gov/v1.1/schema/#isPartOf

contact.role

The contact field should allow/encourage multiple entries, but currently there is no concept of a contact's role (e.g. project manager, development lead, etc.). This has been a concern in Project Open Data because of personnel turnover and/or the want/need to direct people to a generic inbox for a program/team to complement the specific employee/POC for the project.

General Comments
JSON is great for having one file that contains a series of entries (more than one open source project). But it is less human-readable than its YAML derivative. Given that multiple YAML documents can be compiled into one JSON document, IMHO it is more practical for those responsible for data entry to use YAML, as either a code.yml file in the root dir of the repo and/or as an enhanced README, which is a single README.md file with YAML frontmatter to better manage structured data (this is the exact file format GitHub Pages uses to structure content for static websites in lieu of a CMS/database). It is then easy enough to generate a JSON file from all of those files as a crawl/collect/transform within a GitHub organization, for instance.

mattbailey0 · 2016-10-18T15:41:24Z

Thanks everyone. We've had a bit of a proliferation of issues related to the schema. Let's move this conversation to #41

IanLee1521 added a commit to LLNL/llnl.github.io that referenced this issue Sep 22, 2016

Created code.json sample

7d105ac

Based on GSA/code-gov-web#44

philipashlock mentioned this issue Oct 12, 2016

[Request for Discussion] Software inventory metadata schema and inventory collection #41

Open

mattbailey0 closed this as completed Oct 18, 2016

ToniBonittoGSA mentioned this issue Aug 18, 2019

Blog post - needs featured, social image GSA/digitalgov.gov#1235

Closed
