[Request for discussion] Alpha code.json Project Inventory Schema #44
Comments
Initial thoughts:
License
I'm reminded of this recent discussion relating to whether 18F's non-standard license was actually required. Could additionally allowing an SPDX license identifier (which is what npm's package.json does) prod agencies to use standard licenses? This is something that could be baked into a code.json form (and probably auto-filled once given a license URL).
Agency
Looking at the top-level agency field, I'm concerned that there may be duplication/conflicts in cases where more than one agency is involved in a code project. In DC civic.json the partners field includes all parties involved, including the principal one.
Updated
@stvnrlly is (I think rightly) biased against fields that will be frequently updated. I believe all of these fields should be automatically generated. These are easy to neglect/mess up.
Contact
You may want to allow for additional contact URLs, outside of Twitter. DC Civic.json's contact object has a freeform URL attribute.
Tags
Defining taxonomies is a pain. I don't believe anyone can predict in advance the specific tags that would prove useful. On the other hand, there may be room for guidance on what makes a good tag. You often see people including every conjugation/singular/plural/geographic format/etc.
Unique IDs
If implemented with a view toward an API, I would like a way to discover new projects and monitor changes in existing ones (e.g. new partners, updated license, new repository URL). Could any of the required code.json fields serve as a unique ID? |
Cool! Very nice to see this happening, and it looks great. Overall, I agree with Emanuel's comments (especially when he says that he thinks that I'm right). Here are some additional thoughts:
Government-specific Elements
I'd like to encourage you to think about how this could be useful outside of the government, too. With the possible exception of the exemptions, I don't think that there's anything necessarily government-specific about this schema, so choosing a term besides
Agency Info
Right now it's just a name, but in my opinion a URL is even better, as it provides some disambiguation and context. As such, making that an object with
Multiple Projects
Making that an array is a great idea.
Binary States
I'd recommend using
Tags
I'm strongly in favor of defining the tags or dropping it altogether, as people are often too creative for their own good. However, it may make sense to leave it freeform for an initial period and then reevaluate (1) if it's useful and (2) if certain tag themes are emerging.
Exemptions
Since this is related to Government-Wide Reuse, why not combine them into a single object? Additionally, if multiple exemptions are possible, an array may fit better, and it may help to link directly to a URL for the exemption instead of a number proxy. If a URL isn't possible, naming the exemptions and including that along with the number (e.g.
License
The URL requirement in
So, at the risk of making this much too complicated, I'd propose something like this:
Public domain status could then be indicated with
Required Fields
There should be at least one required field that points to a location to learn more about the project, be it a homepage or a repository.
Fields that Don't Need a Human
There are a few things, like
Here, it seems like there's a non-zero probability of non-GitHub projects being tracked, so there may be a good argument for keeping them in.
Ability to Complete
Just as a side note, I don't see anything in here that a project member isn't likely to know, which is great. That seems like an obvious thing, but I've definitely dealt with standards that stump me, and then it doesn't get filled out well. |
@theresaanna the schema looks good for a start. I think the optional URL element should be mandatory. What's the point in identifying repos by name and not by location? We need to finalize the schema soon because agencies will have to collect the metadata. They first have to consider where the code libraries are stored. Does anyone know of a GitHub API that collects all org repo metadata and outputs it in JSON format (or similar)? That would help jumpstart the metadata collection process. |
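As a rough sketch of the kind of collection asked about above: GitHub's REST API lists an organization's repositories as JSON at `GET /orgs/{org}/repos`. The Python below pages through that endpoint and reshapes each repository into an inventory entry. The output keys (`name`, `description`, `repository`, `license`, `updated`) are assumptions for illustration only, not fields confirmed by the schema, and unauthenticated requests are rate-limited.

```python
import json
import urllib.request


def fetch_org_repos(org, per_page=100):
    """Page through GET /orgs/{org}/repos and return the raw repository dicts."""
    repos, page = [], 1
    while True:
        url = f"https://api.github.com/orgs/{org}/repos?per_page={per_page}&page={page}"
        request = urllib.request.Request(
            url, headers={"Accept": "application/vnd.github+json"}
        )
        with urllib.request.urlopen(request) as response:
            batch = json.load(response)
        if not batch:  # an empty page means there is nothing left to fetch
            return repos
        repos.extend(batch)
        page += 1


def to_inventory_entry(repo):
    """Map one GitHub repository dict to a hypothetical inventory entry."""
    license_info = repo.get("license") or {}  # GitHub returns null when no license is detected
    return {
        "name": repo["name"],
        "description": repo.get("description"),
        "repository": repo["html_url"],
        "license": license_info.get("spdx_id"),
        "updated": repo.get("updated_at"),
    }
```

Calling `[to_inventory_entry(r) for r in fetch_org_repos("LLNL")]` would produce a list that can be dumped with `json.dumps`; anything the API cannot supply (exemptions, contacts) would still need a human.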
@jcastle -- I have some JavaScript code that does this to visualize our (@LLNL) orgs on our http://software.llnl.gov page. You can find that here: https://github.com/LLNL/llnl.github.io/blob/master/js/github-dynamic.js I also have some Python scripts I'll get pushed up today. |
@theresaanna -- Is there a suggestion on what to put when the code is not directly from an agency? In my case would I put |
A sample I just did for a single project can be found here: https://github.com/LLNL/llnl.github.io/blob/master/_data/code.json I'll work on a script to get it more fleshed out shortly. |
I would suggest adding "release date", that is, the date it was initially released to the public. This is interesting information for many reasons, and isn't always obvious from the version control information. This is one of the fields captured here: http://www.dwheeler.com/government-oss-released/ |
@IanLee1521 Thanks so much for digging into this and trying it out!
That's an excellent question. My thinking is that we may want to add another field that would accommodate LLNL. I'm not sure what to call it, though. Do you think this is a good solution, and do you have an idea of what would be a suitable key? |
@theresaanna -- Perhaps something like Another option would be to have that all included in a single field, something like:
etc. |
Hello folks, Thanks for keeping up the lively discussion on the alpha version of the schema. The specification for version 1.0 of the metadata schema is now available here: https://github.com/presidential-innovation-fellows/code-gov-web/blob/master/_draft_content/schema/specification_v1_0.md. Sample JSON files to be included soon. |
With an organization you could even drill down to a specific level. For example, my organization would be:
If the org is parsed into levels, one could query a specific level to see how much code is being produced at that level:
Something like that... |
I still have no idea what "built for government-wide reuse" means. All open source projects would be "world-wide reuse", so does "government-wide reuse" refer to closed-source but government-shared projects? It seems to me that all government closed-source software should allow that, provided whatever project-specific prerequisites are met. Further, "for" implies intent. I'm not sure why intent matters. The whole point of open sourcing and inner sourcing is reuse. Aren't all software projects potentially reusable, regardless of initial intent? It seems to me that the whole point of this effort is to make it easier to share and reuse government-produced software, regardless of intent. This is not clear at all, and is prone to confusion. If that field is kept, it really needs better documentation. That documentation should specifically address the circumstances under which that field has a particular value, and should explain why that value is necessary because it could not be deduced from other attributes (like open source status/license/exemption/etc.). |
I agree that |
I particularly like @stvnrlly 's comments and wanted to record a few (some overlapping with the other commenters too) to weigh in. Apologies for the quick list;
|
I think it's important to think about scalability. How much of this can be automated as @stvnrlly suggested for a few fields? Maintaining the data.json file for DOL has been a nightmare of manual labor, especially when there are schema updates. Even CKAN can be hours and hours of clicking the days away. |
@MikePulsiferDOL -- I'm in the process of getting some code released that would help with generating this JSON. It's making its way through our release process, which is taking its time... I'll update once I can push it to @LLNL. |
@ctubbsii / @thecapacity -- I think that the two fields "openSourceProject" and "governmentWideReuseProject" are meant to encode three possible states of being for Code that is to be listed on Code.gov:
To @ctubbsii 's comment:
While I agree with the sentiment that all software should allow that, the fact is that until the Federal Source Code Policy there was no hard requirement for code to be available broadly across government, and the default has been "Closed Source". The policy establishes the requirement for government-wide reuse. |
Consider adding a schema version identifier so that future parsers will know what fields to expect.
Consider eliminating attributes that can be calculated. It should be possible to calculate openSourceProject using the license URL.
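One way this calculation might look, as a sketch only: compare the license URL against a list of recognized open source license URLs. The three URLs below are real license pages, but the list itself is a stand-in (a real one would be much longer and maintained somewhere authoritative), and `is_open_source` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Illustrative subset only; not a complete or official list.
KNOWN_OPEN_LICENSE_URLS = {
    "https://opensource.org/licenses/MIT",
    "https://www.apache.org/licenses/LICENSE-2.0",
    "https://www.gnu.org/licenses/gpl-3.0.html",
}


def _key(url):
    """Reduce a URL to host + path so http/https and trailing slashes don't matter."""
    parts = urlparse(url.strip())
    return parts.netloc.lower() + parts.path.rstrip("/")


_KNOWN_KEYS = {_key(url) for url in KNOWN_OPEN_LICENSE_URLS}


def is_open_source(license_url):
    """Derive the openSourceProject value from the license URL (None means no license)."""
    if not license_url:
        return False
    return _key(license_url) in _KNOWN_KEYS
```

A custom or agency-specific license URL would come back `False` under this approach, which is one reason the lookup list matters more than the code.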
Consider minimizing the number of attributes in the schema. Instead of trying to get the right answer on the first try, include an attribute called something like "optional" that accepts an array of JSON objects, so that experimental attributes can be tested before graduating to required schema fields. With a schema version attribute, a parser would know which fields are expected and that introspection is required for the optional fields.
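A minimal sketch of how a parser might use these two ideas together, assuming a version string selects the expected fields and an `"optional"` array carries experimental attributes. The version label `"1.0"`, the field names, and `split_fields` are all hypothetical.

```python
# Hypothetical mapping of schema version -> fields a parser should expect.
EXPECTED_FIELDS = {
    "1.0": {"name", "repository", "license"},
}


def split_fields(project, schema_version):
    """Return (missing expected fields, experimental attributes) for one project."""
    expected = EXPECTED_FIELDS.get(schema_version)
    if expected is None:
        raise ValueError(f"unsupported schema version: {schema_version!r}")
    missing = sorted(expected - set(project))
    # Experimental attributes are not known in advance, so merge whatever is present.
    experimental = {}
    for item in project.get("optional", []):
        experimental.update(item)
    return missing, experimental
```

For example, a project with `"optional": [{"releaseDate": "2016-10-13"}]` and no `license` key would yield `(["license"], {"releaseDate": "2016-10-13"})`.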
Instead of status, consider asking for the release number. Usually a release or version number indicates alpha, beta, 1.0.0, etc. It might be worth recommending a standard like semver, used by Angular2 and other projects. |
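To illustrate how a status could be read off a semver-style version rather than entered by hand, here is a sketch. The status labels it returns (`"release"`, `"development"`, and so on) are assumptions, not values from the schema; the parsing follows semver's `MAJOR.MINOR.PATCH[-prerelease][+build]` shape.

```python
import re

# MAJOR.MINOR.PATCH with optional -prerelease and +build parts, per semver.
SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


def status_from_version(version):
    """Infer a coarse project status from a semver string."""
    match = SEMVER.match(version)
    if not match:
        return "unknown"  # not semver, so nothing can be inferred
    major, prerelease = int(match.group(1)), match.group(4)
    if prerelease:
        label = prerelease.split(".")[0].lower()
        return label if label in {"alpha", "beta", "rc"} else "pre-release"
    # semver treats 0.y.z as initial development
    return "development" if major == 0 else "release"
```

This only works for projects that actually follow the convention; anything else falls through to `"unknown"`, which may be an argument for keeping an explicit status as a fallback.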
It might be worth extending an existing package manager schema rather than creating a new one. For example... |
Comments on Current/Proposed as of 10/13
Seems redundant and confusing, if the project uses an accepted open source license then it is true/1 if it doesn't then false/0. Suggest removing
Using an agency acronym is dangerous, as some agencies can't even agree internally on their own (e.g. USFWS or FWS, USACE or ArmyCorps, etc.). While program/bureau codes are 'safer' for data quality, they are not intuitive. We have previously recommended that the Government create a reference mapping of agency domain names (e.g. @gsa.gov, @usfws.doi.gov, etc.) to their bureau/program. Using an agency's domain is far more stable and less likely to create a data-cleaning nightmare, and frankly the people doing the data entry likely already know their email address.
Both of these attributes should implement/reference a controlled vocabulary to ensure consistency.
Comments on what's missing as of 10/13
These are critical to establishing provenance to the canonical source of data. The whole point of them is that they can be generated in a distributed fashion yet remain statistically unique, so the chance of anyone generating a duplicate GUID/UUID is realistically nil. Not using one (1) makes any parent/child relationships impossible, and (2) leaves no other unique identifier, so as titles change there is no way to know "is this project really that project", making it impossible to avoid or test for redundant/duplicate entries. See #56
As we have discovered in implementing data.json, the concept of a collection (i.e. the ability for one component of a project to reference its parent project) is critically important.
The contact field should allow/encourage multiple entries, but currently there is no concept of a contact's role (e.g. project manager, development lead, etc.). This has been a concern in Project Open Data, given personnel turnover and the want/need to direct people to a generic inbox for a program/team to complement the specific employee/POC for the project.
General Comments |
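On the GUID/UUID point above, a small sketch of what distributed generation could look like: each project gets a random UUID once, and the identifier survives later title changes. The `id` key and both helper names are hypothetical; `uuid.uuid4()` is Python's standard random UUID generator.

```python
import uuid
from collections import Counter


def ensure_project_id(project):
    """Assign a random (version 4) UUID once; never regenerate an existing one."""
    if "id" not in project:
        project["id"] = str(uuid.uuid4())
    return project


def duplicate_ids(projects):
    """Return identifiers that appear more than once in an inventory."""
    counts = Counter(p["id"] for p in projects if "id" in p)
    return sorted(pid for pid, n in counts.items() if n > 1)
```

Because the identifier is independent of the title, renaming a project leaves `id` unchanged, and a child component could reference its parent by that value.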
Thanks everyone. We've had a bit of a proliferation of issues related to the schema. Let's move this conversation to #41 |
Thank you all so much for the feedback you've given on the draft schema and for the great discussion. It's been invaluable and is baked into the alpha version.
You'll find our alpha version of the project metadata schema below. We welcome your feedback as we iterate on it. We've been working to come up with something carefully considered and easy to comply with as soon as possible so that agencies can have the most time possible to prepare.
The schema needs to describe a vast and diverse universe of software, though we aren't the first folks to think about how to describe a software package. Some of you have mentioned other schemas that aim to solve similar problems. We've looked at them and have found one that we feel best matches the projects we aim to describe here.
Code for DC and DC employees extended a project schema created by BetaNYC and have created their civic.json. We are extending (and slightly modifying) it to create our own schema. They have schema creation and validation tools that we hope to leverage as well. Thanks to the team for sharing their great work!
code.json:
You can see our working document here, with some discussion that's already taken place.
Required fields:
Optional fields:
Note:
We differ from civic.json in our implementation of License. civic.json accepts only a URL. We also accept null in the event that there is no license.
Remaining questions: