Reflection on current project state, and a proposal for a metaschema #23
I do think it makes sense to explore this. A couple of points:
|
JSON-LD seems to be the recommended (Google, origins) standard. It can encompass Schema.org. The latest version, 1.1, is in draft and has some interactive examples on the spec. We could store a denormalized version of JSON-LD in the backend as well as a denormalized version of context/schema so we know what fields are valid without having to send an HTTP request for the schema. |
Too much fun to fully write on my phone!
Yeah, by adopting these two contexts, we've _already_ signed up for a lot of work
if it can't be handled in a mostly automated fashion. Automation, akin to
how schema-dts or pythreejs work, is the right path. Using the highest
fidelity canonical description (they both publish jsonld of their
meta-model) is probably the only way
to get this done, and will implicitly handle the multiplicity and
inheritance issues.
These would get you to dumb, but type-checkable, classes in many Jupyter
languages. But they would be derived from the canonical definition.
Combining the JSON-LD contexts from SDO/WADM one could derive a canonical
serialization format, and only have to explicitly handle conflicts like
Person, Organization and Dataset.
So that's Read and Print... what about Execute? Resolving these types is
another task, and should be pretty decoupled from the contexts we implement
as a concrete schema.
It would be folly to ignore actual graph implementations that expose a
"real" graph query language.
For example, a schema provider backed by a full-on graph store could just
build sparql/gremlin... rdfalchemy and sqlite would be enough for the
single user experience. Bigger deployments might already be graphql-aware,
like:
https://dgraph.io/
https://edgedb.com/
But would be harder to configure. At any rate, not only might you have
multiple kinds of storage, you might use more than one storage/resolver on
the same server for the same type. So this will take some serious thought,
especially when it comes to things like pagination across multiple sources.
Both extensible schema and even extensible types/unions are important. I
started on adding extensible schema (new types, which can reference/extend
existing ones) on my prototype:
https://github.com/deathbeds/jupyter-graphql/pull/3/files#diff-8a9380c1249ac99297a763e1f9a4ee77
A pip-installable extension can add some things (query, mutation, subs)
defined by graphene types.
Further (unpushed, for some reason) work adds an example of extending a
type by adding fields, and while it's really ugly, using Python's
`type("", (), {})` magic, it does work.
The first example I tried extends notebook metadata, adding
SlideShowMetaBase to the CellMetaData type. The contents plugin advertises
this as another entry_point. I would rather use a union or something, but
there's no multiple inheritance, so it would really only work in specific
cases.
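The runtime type-extension trick mentioned above can be sketched with plain Python classes standing in for graphene types; the class names and fields here are illustrative, not the actual prototype's code:

```python
# Sketch of extending an existing class with new fields at runtime using
# type(name, bases, namespace). Plain classes stand in for graphene types;
# CellMetaData and SlideShowMetaBase are illustrative names from the thread.

class CellMetaData:
    collapsed: bool = False

class SlideShowMetaBase:
    slide_type: str = "slide"

def extend_type(base, *mixins):
    """Build a new class that adds each mixin's fields to ``base``."""
    return type(base.__name__, (base, *mixins), {})

# An entry_point-advertised mixin gets folded into the base type:
ExtendedCellMetaData = extend_type(CellMetaData, SlideShowMetaBase)

meta = ExtendedCellMetaData()
print(meta.collapsed, meta.slide_type)  # fields from both classes
```

As noted above, this is effectively simulated multiple inheritance, so it only works when the mixins don't collide with the base class's fields.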
|
Here is a possible implementation story, inspired by a conversation with @dcharbon and looking again at the JSON LD spec.

User Stories

Users will want to click on a resource and see relevant metadata about it. They will want to be able to edit that metadata, as well as click on resources in the metadata to see metadata about those. Metadata providers want to be able to provide metadata for the user for certain resources. The metadata they provide will have different fields that likely come from some type specification for the type of object they are describing. As users edit the metadata, the providers need to be notified of these changes so that they can update where they store the metadata.

Data model

A resource in this context is a Linked Data node, as laid out in the JSON LD spec, so it has some `@id` identifying it. So as a Metadata Provider, you have to define a way to query yourself to see if you have metadata about a resource, and if you do, to return that in the expanded JSON LD syntax. You also have to define how to update yourself with an updated version of the metadata.

The metadata explorer will see what the active resource is and query each of the providers to see if it has data about that resource. The first that does will be displayed to the user. Primitive types will be displayed without links, but types that link to other IDs will be displayed as links. All existing fields are editable, but the user cannot add new fields. This removes the need to process the type at all to understand what all valid fields for it could be. The ability to add new fields from the UI could be added at a later date. If a user edits a field, the provider that had that metadata gets notified with the updated object.

This implementation allows us to integrate our existing graphql backend, but the core APIs would not depend on it, and would allow users to define other backends however they want to provide and persist metadata.
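A minimal Python sketch of that provider flow, with entirely hypothetical names: query each registered provider for the active resource, display the first answer, and route edits back to the provider that supplied the data.

```python
# Hypothetical sketch of the metadata-provider flow: the explorer asks each
# provider in turn for a resource, shows the first answer, and notifies the
# owning provider when the user edits a field.

class InMemoryProvider:
    def __init__(self, store):
        self.store = store  # maps resource IRI -> expanded JSON-LD dict

    def get_resource(self, iri):
        return self.store.get(iri)  # None if we know nothing about it

    def update_resource(self, data):
        self.store[data["@id"]] = data

def resolve(providers, iri):
    """Return (provider, metadata) for the first provider with data."""
    for provider in providers:
        data = provider.get_resource(iri)
        if data is not None:
            return provider, data
    return None, None

providers = [
    InMemoryProvider({}),  # knows nothing about this resource
    InMemoryProvider({"ex:ds1": {"@id": "ex:ds1", "name": "A dataset"}}),
]
owner, meta = resolve(providers, "ex:ds1")
meta["name"] = "Renamed dataset"  # user edits a field in the UI
owner.update_resource(meta)       # only the owning provider is notified
```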
The major technical hurdles I see here are creating proper editable UIs given arbitrary JSON LD nodes and communicating the proper structure of the nodes that the data provider should return. |
I like this idea - this is really the type of problem that JSON LD was invented to solve. Questions and thoughts:
|
I have seen the

```json
{
  "@context": "http://schema.org/",
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "telephone": "(425) 123-4567",
  "url": "http://www.janedoe.com"
}
```

There is also the standard
Sure. This is the interface I am thinking about:

```typescript
type LinkedData = {
  '@id': string,
  [prop: string]: any
}

interface IMetadataProvider {
  // Maybe this method is not required
  listResources(): Promise<Array<URL>>;
  getResource(resource: URL): Promise<LinkedData>;
  updateResource(data: LinkedData): Promise<void>;
}
```

Multiple providers would be useful if you already have your metadata stored somewhere and don't want to replicate it into a local GraphQL database. Instead, you can access your existing store however you like client side, as long as you can query it about resources. It also provides an abstraction layer over graphql, so if we want to move metadata storage into the real time data store, we can do so by implementing a new provider, without changing the metadata extension. |
there is an implementation of graphql server using json-ld concept: https://www.hypergraphql.org/ ... not sure yet if it could be helpful. |
About the GraphQL layer, one thing that we need to keep in mind is that it is strongly typed, so working with generic structures doesn't work very well. That is why I am investigating graphql-schema-org. |
As @xmnlab mentions, with GraphQL's typed schema, you can't extend the schema at run-time. I.e. If we wanted RDF's notion of "say anything about anything", then we'd need to look elsewhere for a solution. (Right?) So, maybe we should step back and say: "How extendable do we actually want this metadata service to be?" The way I see it we have a few options:
Option 1 is obviously not what we want. Option 2 is interesting... if we choose this option, there are of course many more questions to answer, but GraphQL could do this, as @bollwyvl has already shown (via Python's graphene). Option 3 is also interesting. It takes more of the semantic web mindset. @dcharbon has thoughts on how to go about this, which he has partially shared with us. @ellisonbg Which option above is most in line with your thoughts? |
I wrote up some notes explaining this idea more, to articulate how we might support arbitrary types of schemas. I think for now we are going to work on getting the current approach working with editing, and then eventually try to create a JupyterLab extension API for this kind of system: Goals:
JupyterLab Metadata Extension:
JupyterLab Metadata API:
JupyterLab Metadata Server:
|
Here's an example (trivial example, but it shows the point) of how controlled vocabularies are referenced in use cases, i.e., in an example integration of JupyterLab Metadata Service: https://github.com/Coleridge-Initiative/adrf-onto/blob/master/adrf.ttl Note that almost always there are multiple vocabularies being both blended and extended. |
Thank you, that's very helpful to see. I haven't used Turtle at all before now. It would also be helpful to see how that matches up with a particular instance of some data at some point. |
We're building out examples from the ADRF framework -- the NYU project which will use these data registry and metadata service features in Jupyter -- that use Turtle and JSON-LD interchangeably, depending on "what" is reading the file. Will share those with the project here. From an AI practitioner standpoint, I would expect my peers to use Turtle in human-curated definitions. Also, the wiki in that ADRF repo above links to more details and resources about Turtle, JSON-LD, other vocabularies, etc. |
Here's an example of a formal metadata description for a dataset, based on training data used in the Rich Context Competition:
That resolves (again, with ~7 lines of Py) into a graph:
This is a small example -- only a single entity represented -- although it shows how a knowledge graph looks.

The following presentation describes how to handle metadata for linked data and data catalogs in practice, leading into knowledge graph work for reproducible science: https://www.slideshare.net/tplasterer/dataset-catalogs-as-a-foundation-for-fair-data One of the better online specs/tutorials for how to handle this kind of metadata markup is at: Our WIP code example for NYU and knowledge graph work in social science research across US fed/state/local agencies is at https://github.com/Coleridge-Initiative/adrf-onto/

What's shown above is an example of how metadata about linked data for social science also applies in life sciences, etc. It doesn't illustrate the curation/data stewardship links (next on my TODO list). Even so, note that this level of governance will be applied to datasets and publications throughout the sciences, and given the push for compliance, data privacy, provenance, audits, etc., similar kinds of graph-based data governance are showing up in finance, healthcare, manufacturing, etc. I have a hunch that, talking with Capital One, Two Sigma, or Bloomberg, their use of metadata about datasets would look similar to this. Ongoing regulatory compliance efforts will push that point even further. That's why I'm urging Jupyter to consider more of this approach for the Metadata Service. |
Then, 2 more lines of Py:
Can transform that graph into JSON-LD, so that it's easily read by machines (without any special parsing beyond JSON) and also readily exchangeable via APIs:
Note that I've pretty-printed here to help visualize it, though this JSON-LD would compress during API calls. |
@ceteri Thanks for these examples, this is really helpful. So I could see the API as taking the ID of an entity and returning the flattened JSON-LD for that entity. Then we could generate a UI around these field mappings.

```js
console.log(myDataProvider.get('https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381'))
```

```json
{
  "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381",
  "@type": [
    "http://purl.org/dc/dcmitype/Dataset"
  ],
  "http://purl.org/dc/terms/alternative": [
    {
      "@language": "en",
      "@value": "EPESE"
    }
  ],
  "http://purl.org/dc/terms/description": [
    {
      "@language": "en",
      "@value": "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"
    }
  ],
  "http://purl.org/dc/terms/identifier": [
    {
      "@language": "en",
      "@value": "8481423"
    }
  ],
  "http://purl.org/dc/terms/publisher": [
    {
      "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#duke_univ"
    }
  ],
  "http://purl.org/dc/terms/title": [
    {
      "@language": "en",
      "@value": "Established Populations for Epidemiologic Studies of the Elderly Project"
    }
  ],
  "http://purl.org/pav/createdOn": [
    {
      "@type": "http://www.w3.org/2001/XMLSchema#date",
      "@value": "1993-02-01"
    }
  ],
  "http://purl.org/pav/curatedBy": [
    {
      "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#cornoni-huntley_j"
    }
  ],
  "http://xmlns.com/foaf/0.1/page": [
    {
      "@id": "https://www.ncbi.nlm.nih.gov/pubmed/8481423"
    }
  ]
}
```
|
Excellent, for example that could be rendered as nested tables in a reasonably compact way.
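As a sketch of that rendering step, a flattened JSON-LD node (abbreviated here, with made-up IRIs) can be split into literal fields, to show as text, and link fields, to show as clickable references:

```python
# Sketch: given one flattened JSON-LD node, split its properties into literal
# values (rendered as text) and references to other nodes (rendered as links).
# Keys starting with "@" are JSON-LD keywords and are skipped here.

def render_fields(node):
    literals, links = {}, {}
    for prop, values in node.items():
        if prop.startswith("@"):
            continue
        for v in values:
            if "@id" in v:      # reference to another node
                links.setdefault(prop, []).append(v["@id"])
            else:               # language-tagged or typed literal
                literals.setdefault(prop, []).append(v["@value"])
    return literals, links

node = {
    "@id": "ex:dataset381",
    "http://purl.org/dc/terms/title": [{"@language": "en", "@value": "EPESE"}],
    "http://purl.org/dc/terms/publisher": [{"@id": "ex:duke_univ"}],
}
literals, links = render_fields(node)
print(literals)  # {'http://purl.org/dc/terms/title': ['EPESE']}
print(links)     # {'http://purl.org/dc/terms/publisher': ['ex:duke_univ']}
```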
This would be one way to view a graph of metadata used to answer the questions that @ellisonbg enumerated: @tonyfast does that fit with what you're thinking too? |
Another general note about open standards for metadata about datasets used in science: I'd recommend reading the FAIR Data Principles, originally described in
The While that article does not specifically mention Jupyter, reading between the lines there's so much overlap between that widely accepted set of FAIR practices and intentions for the JupyterLab metadata service. |
We're probably in agreement, but there is a lot of different language being used to talk about this problem, so I cannot say exactly. I see the analysis project as a graph that can be searched. Its contents - virtual file systems - are enriched by type information & typology. Graph and structured databases will execute queries, and the query language bounds the questions we can ask. From this perspective, we can ask questions composed in the specified query language. For example, given a sqlite database of a couple hundred notebooks we could ask questions like: A graph database could answer different questions, and a more general tool would query all types of files as data. There could be so many ways to query, "blood sample please?".

At this point though, I think I'm stuck on: how did the types get there? Who is annotating information with metadata? Where is the knowledge coming from? How do we know what types are salient?

From the json-ld perspective, I think that directories should permit multiple contexts. A lot of the metadata desired in the diagram could be recorded as a stream of context information. Which I think raises the important question again: "How are the types being annotated?" |
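As an illustration of the kind of question such a store could answer, here is a hedged sketch against a made-up sqlite schema of notebook cells (the table layout and data are invented for this example):

```python
# Hypothetical sketch of querying notebooks stored in sqlite, in the spirit of
# "a sqlite database of a couple hundred notebooks". The cells table and its
# columns are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (notebook TEXT, cell_type TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [
        ("a.ipynb", "code", "import pandas as pd"),
        ("a.ipynb", "markdown", "# Analysis"),
        ("b.ipynb", "code", "import numpy as np"),
    ],
)

# e.g. "which notebooks import pandas?"
rows = conn.execute(
    "SELECT DISTINCT notebook FROM cells "
    "WHERE cell_type = 'code' AND source LIKE '%import pandas%'"
).fetchall()
print(rows)  # [('a.ipynb',)]
```

The point is that the query language bounds the questions: SQL answers table-shaped questions well, while a graph query language would answer link-traversal questions better.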
Based on some experiments with https://gist.github.com/tonyfast/443c7b5b23449ef9fe7b024538ff2261 |
@ceteri I see you have one piece of linked data here as an example. Do you have a larger set readily available? I am putting together a little prototype for the metadata explorer and would like to use some of your real-ish data if possible. EDIT: Ideally it would be good to have an example with multiple entities that link to each other. |
@saulshanabrook @tonyfast: last weekend at Sci Foo, @ellisonbg and I sketched out functionality for a minimum viable UI to demonstrate the Metadata Explorer. 1/ For example, let's say a dataset has been registered through the Data Explorer, e.g., a Jupyter notebook knows where to look up metadata for it. 2/ For example, the
We also must follow the metadata references to papers that cite usage of the dataset: For each author, there will be pages on ORCID, Google Scholar, ResearchGate, etc.: A dataset will also have a Data Provider, such as: 3/ 4/ Not all providers will have endpoint URIs to obtain just the metadata. We can build scrapers/gateways for enough of them to demo the UI. Also, there aren't a large number of these kinds of sites. We can also work with the providers to get additional endpoints available -- that's a current dialogue among scientific publishers. I've already talked with a Google eng manager about this, and they'd be interested, plus they have the eng resources to support it. It's not a lot of work either. 5/ The strategy is to follow these links as much as possible; that would implement a traversal of the graph shown above in #23 (comment). Next up, I'll develop a better sample file to use, in JSON-LD |
Footnote: one way to integrate with the scrapers/gateways would be to register a URI pattern; then, when a user clicks a link with that URI pattern in the metadata explorer UI, we use the scraper/gateway instead of simply doing the HTTP GET on the URI. For example, the HTML results on Google Dataset Search are basically a list of JSON, plus some JavaScript to render it. It's not hard to scrape that kind of response and build a metadata gateway for it. The other popular providers, such as ORCID and ResearchGate, have embedded metadata (aka "micro data") that we can scrape similarly. |
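One way this registration could be sketched, with an illustrative URI pattern and a stub scraper (nothing here reflects a real gateway API or the actual Dataset Search URL scheme):

```python
# Sketch of the footnote's idea: register scrapers/gateways against URI
# patterns; a click in the explorer dispatches to a matching scraper instead
# of a plain HTTP GET. The pattern and scraper below are illustrative stubs.
import re

SCRAPERS = []  # list of (compiled pattern, scraper function)

def register(pattern):
    def decorator(fn):
        SCRAPERS.append((re.compile(pattern), fn))
        return fn
    return decorator

@register(r"^https://datasetsearch\.example\.com/")
def dataset_search_gateway(url):
    # A real gateway would fetch the page and extract its embedded JSON.
    return {"@id": url, "source": "dataset-search-gateway"}

def fetch_metadata(url):
    for pattern, scraper in SCRAPERS:
        if pattern.match(url):
            return scraper(url)
    return None  # fall back to a plain HTTP GET (not implemented here)

result = fetch_metadata("https://datasetsearch.example.com/search?q=epese")
```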
This is a little demo I put together to show a simple linked data explorer #27 The core of it is a linked data provider that has a function that takes a URL and returns some linked data. We can hook this up to a server extension that serves up scraped data about certain URLs. |
@saulshanabrook @tonyfast @ellisonbg here's TTL for an example from the Rich Context Competition which includes 2 research papers, 2 datasets used by them, 7 authors, and the 2 data providers:
Here's that same metadata graph converted to JSON-LD (since GH doesn't yet support attaching JSON files?)
|
And the above, but compacted with this
|
Great :) I'll see your compacted context and raise you a default vocabulary -- now in the code that generates the JSON-LD above ^^^ |
I think having data in this form will really help these explorations.
Thanks for putting this together Paco and Nick!
--
Brian E. Granger
Principal Technical Program Manager, AWS AI Platform (brgrange@amazon.com)
On Leave - Professor of Physics and Data Science, Cal Poly
@ellisonbg on GitHub
|
Thanks for this! I updated my prototype with this data and we can walk around it: |
A further point I'd like to raise (I was a bit rushed doing that initial extension of @ceteri's stuff, as I was trying to get something out the door before traveling): I think it's pretty unlikely that most front-end tools will be building their own ontologies on the fly. Most of them will either be anchored in the Big N (haven't sussed out what those would be) like DC, OWL, PROV, etc., or would be a) semi-officially adopted by a Jupyter metadata service contract (e.g. a conformance suite), b) extended by an existing community of practice (e.g. NASA/ESA would probably have to collaborate on something), or c) defined on a per-application basis. At any rate, we'd probably end up with a well-known location, e.g. Anyhow, keep in mind that ontologies frequently just describe what Can Be Known, not what Must Be Known. For that, we would likely have to consider a further constraint language, e.g. SPIN, SWRL or the present new hotness, SHACL. |
What if we could know everything? If we start to know everything, then a complex graph forms that may be difficult to manage. Each schema describes facets of the problem. Users will traverse small cycles in a larger graph of knowledge. If we knew the schema ahead of time, could we assist authors in providing better markup? Ultimately, any semantic information is opaque to the compute. How is an author incentivized to include more knowledge in their computational documents? This notebook treats many schema as data in a dataframe. binder The dataframe represents a graph with over 10000 edges, including some information in multiple languages. With this information, how could we help scientists? |
I am going to close this for now, since we seem to have settled on displaying arbitrary JSON LD.
(EDIT March 8, 2019) Turns out what I proposed below is more-or-less RDF (glad to find out that people have already solved these sorts of problems for us!). We're now working on a proposal to offer an RDF/JSON-LD extendable schema served through GraphQL for JupyterLab. I'll update this post once we finish thinking through our next proposal.
Reflection on current project state
A main goal of this project is to expose the relevant parts of schema.org's schema as a GraphQL service (see #4). This is to facilitate the storage and retrieval of various "rich context" data within JupyterLab.
As of last week, we have a minimal working prototype of this, which contains a GraphQL server and an interface for querying the metadata of a dataset (i.e. a concrete implementation of this small part of schema.org). We also use this same GraphQL server to store users' comments on files in JupyterLab; more on that later.
The goal of that minimal working prototype was twofold: (1) to have something working to demo to the stakeholders of this project, and (2) to explore all relevant technologies by building a "vertical slice" of the software stack.
Now that we've built this vertical slice, I have formed some opinions on how we should change our approach. I hope this post can start a discussion!
Proposal for a metaschema
So, we already use two schemas in our minimal prototype:
Two points to this:
We would have to resort to `union`ing to precisely follow how schema.org allows properties to have "one or more types as its domain" -- it would be a mess. (For detail, see schema.org's data model.) One more note: It is expected that for any given object, most of its properties will be unused -- thus again, it will be tedious to pass around concrete JS objects with all those fields defined yet mostly unused.

So, that was a description of the problems we've realized. To overcome those problems... I have some ideas below. Most of the ideas below were inspired by @saulshanabrook in one of our meetings. I've merely tried to articulate them further. (Saul, is below more-or-less what you were thinking?)
I propose we come up with a "metaschema", i.e. a schema that describes schemas. Another way to see this is that we would not implement schema.org as a "hard-coded schema"; instead, it would be represented as data in the shape of the metaschema. Yet another way to say it: if you wanted to begin supporting a new part of schema.org (say, the FlightReservation type), you would do so by inserting data into the database to document the name of your new type ("FlightReservation") and to list out the properties it may have. This idea of a metaschema also makes it simple to support multiple schemas together. In general I believe the metaschema solves all the issues mentioned above:

- Data can be stored as tuples of `(object_id, property_name, value)`. To have an array of values, you just store multiple tuples with the same `(object_id, property_name)`. At query-time, you could collapse those into an array of values, if you so desired.
- Each `value` will really be a two-tuple of `(type, value)`. Every property's definition would contain an array of `type` specifiers, and we simply have some validator code to ensure you only assign a `(type, value)` tuple to a property if its `type` is appropriate (i.e. in the property's array of `type` specifiers).
- The validator would likewise reject any `(object_id, property_name, value)` for `property_name`s that don't exist on the given `object_id`.
- Each `object_id` is a two-tuple `(object_id, schema_id)`. Then objects from different schemas just coexist.

Well, at this point I hope I've given a lot to either agree or disagree with! Thoughts from the group? (Specifically @saulshanabrook @xmnlab @ellisonbg @bollwyvl)
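A minimal Python sketch of the metaschema idea under the tuple layout proposed here; the FlightReservation property list, type names, and values are all illustrative:

```python
# Minimal sketch of the metaschema idea: schemas are data, objects are
# (object_id, property_name, value) tuples, and values are (type, value)
# pairs checked against the property's declared type specifiers.

# Property definitions for a hypothetical "FlightReservation" type, stored
# as data rather than hard-coded into a schema.
PROPERTIES = {
    "FlightReservation": {
        "reservationNumber": ["Text"],
        "passengerCount": ["Integer", "Text"],  # multiple allowed types
    }
}

TUPLES = []  # the store: (object_id, property_name, (type, value))

def assign(object_id, obj_type, prop, typed_value):
    """Validate a (type, value) pair against the metaschema, then store it."""
    props = PROPERTIES[obj_type]
    if prop not in props:
        raise ValueError(f"{prop!r} is not a property of {obj_type}")
    value_type, _ = typed_value
    if value_type not in props[prop]:
        raise ValueError(f"{value_type!r} not allowed for {prop!r}")
    TUPLES.append((object_id, prop, typed_value))

assign("res-1", "FlightReservation", "reservationNumber", ("Text", "RXJ34P"))
assign("res-1", "FlightReservation", "passengerCount", ("Integer", 2))
# Multiple tuples with the same (object_id, property) act as an array:
assign("res-1", "FlightReservation", "passengerCount", ("Integer", 3))
```

Per the last bullet, `object_id` would really be an `(object_id, schema_id)` pair so objects from different schemas coexist; a plain string stands in for it here.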