Specify property-generator round-tripping algorithm #160

Closed
msporny opened this issue Sep 5, 2012 · 38 comments
@msporny
Member

msporny commented Sep 5, 2012

When using property generators, most authors will expect that a single term that is expanded to multiple properties via .expand() would result in the same single term if run through .compact() with the same @context. There are a number of proposed ways of achieving this round-tripping:

  1. Utilize a combinatorial algorithm that coalesces multiple property IRIs to a single property-generator term only if every one of the values for each IRI is exactly the same. Otherwise, do not compact any of the IRIs down to the property-generator term.
  2. When compacting, pick the first IRI in the property generator's '@id' array as the 'winning' IRI. All other IRIs associated with the property generator are either left fully expanded or are compacted using another matching term in the @context. This will leave extra data after round-tripping, which may be acceptable.
  3. When compacting, pick the first IRI in the property generator's '@id' array as the 'winning' IRI. For all other IRIs in the property generator's '@id' array, if the value matches the value associated with the first IRI, drop the value when compacting. This may require a deep search of values that are arrays. All other IRIs associated with the property generator are either left fully expanded or are compacted using another matching term in the @context.
  4. When compacting, if a '@compact' IRI is specified in the property generator, use that as the 'winning' IRI. For all other IRIs in the property generator's '@id' array, if the value matches the value associated with the '@compact' IRI, drop the value when compacting. This may require a deep search of values that are arrays. All other IRIs associated with the property generator are either left fully expanded or are compacted using another matching term in the @context.

For the sake of clarity, the context for #1, #2, and #3 would look like this:

{
  "term": {
    "@id": [
      "http://example.org/vocab#term1",
      "http://example.org/vocab#term2"
    ]
  }
}

The context for #4 would look like this:

{
  "term": {
    "@id": [
      "http://example.org/vocab#term1",
      "http://example.org/vocab#term2"
    ],
    "@compact": "http://example.org/vocab#term2"
  }
}
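
For concreteness, here is a minimal sketch of the all-values-equal check that proposal #1 implies (TypeScript; all names are illustrative, and JSON-serialization equality stands in for a proper structural comparison):

// Hypothetical sketch of proposal #1: compact multiple property IRIs down to a
// property-generator term only when every IRI carries exactly the same values.

type Values = unknown[];

// Naive deep equality for illustration only; a real comparison would need to
// ignore key order and handle node definitions/references properly.
function deepEqual(a: unknown, b: unknown): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

function tryCoalesce(
  node: Record<string, Values>,
  term: string,
  generatorIris: string[]
): Record<string, Values> {
  const [first, ...rest] = generatorIris;
  const reference = node[first];
  // Coalesce only if every IRI is present and its values match exactly.
  const allMatch =
    reference !== undefined &&
    rest.every((iri) => deepEqual(node[iri], reference));
  if (!allMatch) return node; // otherwise leave everything expanded

  const result: Record<string, Values> = { [term]: reference };
  for (const [key, value] of Object.entries(node)) {
    if (!generatorIris.includes(key)) result[key] = value;
  }
  return result;
}

If the check fails for any IRI, nothing is coalesced and the expanded properties are left for ordinary term-by-term compaction.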
@lanthaler
Member

There might be a much easier solution if we decide to change the "semantics" of property generators. We could leave everything as is and interpret the context

{
  "term": { 
     "@id": [ 
        "http://example.org/vocab#term1",
        "http://example.org/vocab#term2"
     ]
  }
}

as

{
  "term": "http://example.org/vocab#term1",
  "term": "http://example.org/vocab#term2"
}

and say that expansion will use all IRIs to create expanded output, i.e., term1 and term2. Compaction will, just as currently, choose the best-matching term for a given IRI. There might be multiple IRIs yielding the same term, but that's the author's intent. We could then say that duplicates have to be eliminated during compaction.

I think the advantage of this approach is that it is easy to implement and easy to understand/explain.
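
A minimal sketch of this fan-out interpretation of expansion (TypeScript; TermDefs and expandNode are illustrative names, not spec machinery):

// Sketch of the behavior described above: a property-generator term fans out
// to every IRI in its mapping during expansion.

type TermDefs = Record<string, string[]>; // term -> list of IRIs

function expandNode(
  node: Record<string, unknown>,
  defs: TermDefs
): Record<string, unknown[]> {
  const out: Record<string, unknown[]> = {};
  for (const [key, value] of Object.entries(node)) {
    const iris = defs[key] ?? [key]; // unknown keys pass through unchanged
    for (const iri of iris) {
      (out[iri] ??= []).push(value); // the same value under every IRI
    }
  }
  return out;
}

// expandNode({ term: "foo" },
//   { term: ["http://example.org/vocab#term1", "http://example.org/vocab#term2"] })
// yields "foo" under both term1 and term2.

Compaction would then, per the suggestion above, have to decide whether to eliminate the duplicates this fan-out produces.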

@gkellogg
Member

I think @lanthaler's approach probably is simplest, but it requires changing the term redefinition logic to allow for additive definitions within a single level of context processing.

@lanthaler
Member

Just to make it clear: the second code snippet should just illustrate how it will work; this should not and, in most languages, cannot be supported directly (JSON objects cannot have duplicate keys). If we choose to adopt this approach, the only thing we would need to decide is whether we eliminate duplicates during compaction (which might get tricky when it comes to lists) or not.

@linclark
Contributor

Gregg asked me to post here. I'll clarify how we would use property generators in Drupal. I am leaving out our use of language maps to keep things clear.

In my example, a site admin has created a content type that has two distinct fields, one field for tags and one field for related news items.

While they are distinct in Drupal's vocabulary, both map to schema:about.

{
  "@context": {
    "site": "http://mysite.com/",
    "field_tags": {
      "@id": ["site:vocab/field_tags", "http://schema.org/about"]
    },
    "field_related": {
      "@id": ["site:vocab/field_related", "http://schema.org/about"]
    }
  },
  "@id": "site:node/1",
  "field_tags": [
    {
      "@id": "site:term/this-is-tag"
    }
  ],
  "field_related": [
    {
      "@id": "site:node/this-is-related-news"
    }
  ]
}

So on expansion, this is what results:

[
{
  "@id": "http://mysite.com/node/1",
  "http://schema.org/about": [
  {
    "@id": "http://mysite.com/term/this-is-tag"
  }],
  "http://mysite.com/vocab/field_tags": [
  {
    "@id": "http://mysite.com/term/this-is-tag"
  }],
  "http://mysite.com/vocab/field_related": [
  {
    "@id": "http://mysite.com/node/this-is-related-news"
  }],
  "http://schema.org/about": [
  {
    "@id": "http://mysite.com/node/this-is-related-news"
  }]
}]

My understanding is that, with some compaction algorithms, the data would get mixed together when going back to compact form. For example:

  "field_tags": [
    {
      "@id": "site:term/this-is-tag"
    },
    {
      "@id": "site:node/this-is-related-news"
    }
  ],
  "field_related": [
    {
      "@id": "site:term/this-is-tag"
    },
    {
      "@id": "site:node/this-is-related-news"
    }
  ]

We would prefer to compact to approximately the same form as we started with, which only has one value for each alias.

@linclark
Contributor

I just finished listening to the audio from the last telecon. There was a question from Manu about why we care about property generation if we don't care about the representation in other RDF formats.

Being able to match properties using universal identifiers has value in and of itself, even if something can't be transformed cleanly to N-Triples.

For example, two sites might want to contribute data to the same central repository. They have different content types with different fields, but they can use common vocabularies to provide alignment.

For one site, the context would be:

{
  "@context": {
    "site": "http://one.com/",
    "central_repo": "http://central.org/",
    "field_about_modules": {
      "@id": ["site:vocab/field_about_modules", "central_repo:about"]
    }
  }
}

For the other site, the context would be:

{
  "@context": {
    "site": "http://two.com/",
    "central_repo": "http://central.org/",
    "field_tutorial_about": {
      "@id": ["site:vocab/field_tutorial_about", "central_repo:about"]
    }
  }
}

The central repository will only be looking for central_repo:about, so it will expand the JSON-LD. It doesn't have to convert it to another RDF format, however... it can just work with the expanded JSON object.

@gkellogg
Member

Lin, actually, your example would expand to use just a single key for http://schema.org/about:

While they are distinct in Drupal's vocabulary, both map to schema:about.

{
  "@context": {
    "site": "http://mysite.com/",
    "field_tags": {
      "@id": ["site:vocab/field_tags", "http://schema.org/about"]
    },
    "field_related": {
      "@id": ["site:vocab/field_related", "http://schema.org/about"]
    }
  },
  "@id": "site:node/1",
  "field_tags": [
    { "@id": "site:term/this-is-tag"}
  ],
  "field_related": [
    {"@id": "site:node/this-is-related-news"}
  ]
}

Results in:

[
{
  "@id": "http://mysite.com/node/1",
  "http://schema.org/about": [
    {"@id": "http://mysite.com/term/this-is-tag"},
    {"@id": "http://mysite.com/node/this-is-related-news"}
  ],
  "http://mysite.com/vocab/field_tags": [
    {"@id": "http://mysite.com/term/this-is-tag"}
  ],
  "http://mysite.com/vocab/field_related": [
    {"@id": "http://mysite.com/node/this-is-related-news"}
  ]
}]

This complicates the compaction algorithm, as it requires the combinatorial step of looking at all values across all IRI keys mapped to a property-generator term and extracting only those values that are common to all of them. So, for every property-generator term, look at every property defined by that term for common values and use the property-generator term only for those values. Potentially, more than one property-generator term could have overlapping property IRIs associated with it, which could mean that values would be replicated. Also, as noted before, the values could be nontrivial, including node definitions that are recursive and may or may not be equivalent.

This is really complicated, and has a big spec-smell to me.

@linclark
Contributor

For our use case, I don't think we would require the combinatorial step.

I could be wrong, but I believe any one of Manu's proposals would give us the values we want for "field_tags" and "field_related" in my example. And it seems that only the first requires the combinatorial algorithm.

@msporny
Member Author

msporny commented Sep 27, 2012

I had a good chat with Lin today about Drupal's use case. The first part of the discussion focused on trying to see if Drupal could use a pre-processing or post-processing step to achieve what property generators provide. After a bit of discussion, it became clear that any sort of pre- or post-processing step would add unnecessary complexity to the Drupal system.

The second part of the discussion concerned the question of why property generators were difficult. The key issue is that round-tripping is hard, for two reasons:

The first reason is that, upon expansion, it becomes impossible to tell which values came from a property generator and which ones did not. You are adding data to the graph without tracking where that data came from. This means that compacting back to what you had is a non-deterministic operation.

The second reason is that, upon expansion and then compaction again, even if you could end up with what you started with, you would still have to process each item being associated with the property-generator term to check for duplicates. This operation would be very expensive, as you would have to do a deep compare of each object against every other object being coalesced into the property generator.

During the conversation a new approach surfaced. What if we just marked all the things that were generated as a result of a property generator when going to expanded form? That is, if we have this:

{
  "@context": {
    "foo": {
      "@id": ["http://example.com/foo", "http://schema.org/foo"]
    }
  },
  "foo": "bar"
}

Expanded form would result in this:

{
  "http://example.com/foo": "bar",
  "http://schema.org/foo": {
    "@value": "bar",
    "@processor": {"source": "property-generator"}
  }
}

Re-compacting (with the same context) would just throw out anything that had a "@processor": {"source": "property-generator"} in it, giving you the original input.

The upsides to this approach are that 1) determining which terms came from a property generator is no longer non-deterministic, and 2) time complexity is reduced to O(N), where N is the number of property generator entries that have to be removed from the output.

The downside to this approach is that someone compacting may not want to remove property-generator-sourced terms. We could have an API option, like {removePropertyGeneratedValues: false}, on the compact() call in this case. We could also record the @context that contained the property generators and remove an item only if that @context is the one being used for compaction. I don't think we need to do either of these things, but perhaps others feel more strongly about it than I do.
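
A minimal sketch of the marker-based compaction step described above (TypeScript; the option name removePropertyGeneratedValues follows the suggestion in the previous paragraph, and all helpers are illustrative):

// During compaction, drop any value tagged with
// "@processor": {"source": "property-generator"}.

interface CompactOptions {
  removePropertyGeneratedValues?: boolean; // assumed to default to true
}

function isGeneratorSourced(v: unknown): boolean {
  return (
    typeof v === "object" &&
    v !== null &&
    (v as any)["@processor"]?.source === "property-generator"
  );
}

function stripGeneratedValues(
  node: Record<string, unknown>,
  opts: CompactOptions = {}
): Record<string, unknown> {
  if (opts.removePropertyGeneratedValues === false) return node; // opt out
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(node)) {
    if (!isGeneratorSourced(value)) out[key] = value;
  }
  return out;
}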

I think this proposal addresses all of the concerns that are related to this issue that the group has at the moment. Which corner-cases am I missing?

@lanthaler
Member

Manu, even though your proposal solves the problem for Drupal’s use case, I don’t think it solves it in general. Your proposal relies on the fact that some metadata is embedded to facilitate round-trippability. There’s no way to ensure that such metadata is always there. What if some server directly generates expanded JSON-LD? This would be a quite reasonable thing to do in M2M communication, as there’s no need for nice, short terms.

I’m still not sure if I really grasped Lin’s requirements. In her comment above she said:

In my example, a site admin has created a content type that has two distinct fields, one field for tags and one field for related news items.

While they are distinct in Drupal's vocabulary, both map to schema:about.

Why do two fields exist in Drupal to express the same concept!? Either it is the same, or it is not. Let's assume we have the following data:

fieldA: 1, 3, 5
fieldB: 2, 4, 6

both would map to schema:about: 1, 2, 3, 4, 5, 6

Is it OK to end up with

fieldA: 1, 2, 3, 4, 5, 6
fieldB: 1, 2, 3, 4, 5, 6

if the data is re-imported into the same system? I assume it wouldn't be OK; otherwise there wouldn't be two distinct fields in the first place, right? The problem is that in the context you implicitly state that fieldA == schema:about == fieldB.

All these problems could be solved easily by not using property-generators at all: include the relevant data multiple times. You would then end up with

fieldA: 1, 3, 5
fieldB: 2, 4, 6
schema:about: 1, 2, 3, 4, 5, 6

in the expanded output and round-tripping that would be a no-brainer.

I would like to stress that I really want to support Drupal's use cases and try to find a solution for their problems, but I think we should not do so at any price.

The discussions around this specific feature show that it is not even clear how this feature should work in the optimal case, not to mention the corner cases.

I would therefore prefer to put this feature on hold for the time being. The spec is purposely designed to ignore unknown data in the context. So the Drupal community could exploit that to introduce a proprietary extension to JSON-LD that satisfies their needs. Maybe something along the lines of

"fieldA": { "@id": "...", "otherIds": [ "..", ".." ] }

A simple preprocessing step could then just iterate over the data and duplicate all fieldA keys, using the IRIs in otherIds as keys.

PROPOSAL: Do not support property-generators in JSON-LD 1.0.

@linclark
Contributor

linclark commented Oct 1, 2012

Why do two fields exist in Drupal to express the *same* concept!?

Within Drupal, these two fields would be handled differently, as different concepts. Tags would be configured to autocomplete from a free-tagging vocabulary. Related News would be configured to autocomplete from nodes of the "news article" type. They could also be formatted in separate ways. The values would be stored in different database tables.

This is a distinction that we need to maintain internally and need to maintain in the deployment use case (moving content from site to site), but it is not a distinction that is important to other consumers. Therefore, the two values are exposed using separate properties for Drupal consumers, but the same property for search engine consumers.

I can ask the group whether we are OK with having property values repeated multiple times. However, many of the people anticipating this work in Drupal are concerned specifically with mobile. I'm not sure they will agree to doubling and tripling the size of the data.

Adding our own custom preprocessing as part of a library might be an option. It would make it hard to interface with multiple non-Drupal consumers, though, as they could not be expected to have a JSON-LD library with this customization... unless it becomes a widely adopted non-specced feature.

@lanthaler
Member

Within Drupal, these two fields would be handled differently, as different concepts. Tags would be configured to autocomplete from a free-tagging vocabulary. Related News would be configured to autocomplete from nodes of the "news article" type. They could also be formatted in separate ways. The values would be stored in different database tables.

OK, I see. Would it be possible to coerce the values to different datatypes, or are all of them IRIs? If it is possible, you could have two terms which expand to the same IRI but coerce the values to different types. After expanding, both would be combined into one property IRI. Compacting it again would separate the values:

http://bit.ly/U0Qo87

Would this solve your problem?

I can ask the group whether we are OK with having property values repeated multiple times. However, many of the people anticipating this work in Drupal are concerned specifically with mobile. I'm not sure they will agree to doubling and tripling the size of the data.

Do you have a sample document? I'm quite sure gzipping it would actually eliminate this overhead. On the other hand, the memory needed locally is exactly the same, and property generators could be misused for DoS attacks by sending relatively small payloads which expand to huge in-memory representations.

Adding our own custom preprocessing as part of a library might be an option. It would make it hard to interface with multiple non-Drupal consumers, though, as they could not be expected to have a JSON-LD library with this customization... unless it becomes a widely adopted non-specced feature.

Not sure I understand what you are saying here. Normal consumers wouldn't see your two Drupal-internal vocabularies but just schema:about, which I think would be fine. For the "deployment use case" you described above, both systems would indeed have to use a modified JSON-LD processor, but since both systems are under the same user's control (and both run Drupal, I assume) that wouldn't be a problem, would it?

@gkellogg
Member

gkellogg commented Oct 2, 2012

On the telecon today, I tried to outline what I think would allow for effective round-tripping of property generators without needing to introduce pragma-like data to the expanded output.

The basis of the proposal, common with most everything presented to date, is to associate multiple IRIs with a term:

{
  "@context": {
    "name": { "@id": ["http://schema.org/name", "http://purl.org/dc/terms/title]}
  },
  "name": "foo"
}

Expanding this document results in both IRIs being used as properties with the same value:

[{
  "http://schema.org/name": [{"@value": "foo"}],
  "http://purl.org/dc/terms/title": [{"@value": "foo"}]
}]

Compacting this expanded object with the same context effectively requires the following (before 2.3):

  • For each term pg in the active context which is a property generator:
    • Using the first expanded IRI p associated with the property generator:
      • Skip to the next property generator term unless p is a property of element.
      • For each node n which is a value of p in element:
        • For each expanded IRI pi associated with the property generator other than p
          • Skip to the next value in p if pi is not a property of element or n is not a value of pi (using nodeEquivalence).
        • Remove n as a value of all p and pi in element.
        • Add the result of performing the compaction algorithm on n to pg to output.

The nodeEquivalence algorithm would compare values, node definitions, node references, and lists of values or node definitions/references for equivalence. Comparing node definitions is somewhat complicated. The set of expanded values may have at most one node definition. Comparison of node definitions/references is done by comparing the node identifier (@id) only. The algorithm must also perform comparison of list values, which is a fairly simple recursive case.
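
A minimal sketch of such an equivalence test (TypeScript; hypothetical, and simplified to the three cases described above: lists compared element-wise, node definitions/references compared by @id only, and literal values):

// Equivalence of expanded JSON-LD values, per the description above.

type Expanded = Record<string, any>;

function nodeEquivalent(a: Expanded, b: Expanded): boolean {
  if ("@list" in a || "@list" in b) {
    const la = a["@list"];
    const lb = b["@list"];
    return (
      Array.isArray(la) && Array.isArray(lb) &&
      la.length === lb.length &&
      la.every((v, i) => nodeEquivalent(v, lb[i])) // simple recursive case
    );
  }
  if ("@id" in a || "@id" in b) {
    return a["@id"] === b["@id"]; // node definitions/references: @id only
  }
  return (
    a["@value"] === b["@value"] &&
    a["@type"] === b["@type"] &&
    a["@language"] === b["@language"]
  );
}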

Expansion is modified to ensure that node definitions are output only once, and node references otherwise. This requires that all anonymous node definitions which are values of property generator properties be assigned a Blank Node identifier using a well-known prefix. Detecting any Blank Node identifier in the input graph using this prefix must result in an error.

This algorithm should handle the case where simple terms are used alongside property generator terms, and only those terms having values that are common across all property generator IRIs are assigned to the property generator term on compaction.

@msporny
Member Author

msporny commented Oct 3, 2012

@dlongley and I looked at @gkellogg's proposal above and found the following issues:

  1. It doesn't prevent the need to do a deep comparison of all items in the array (against every other item in the array, O(n²)). That's effectively what the algorithm above does, but it then assigns a static value (bnode ID) to JSON objects in the array. If somebody changes the expanded output manually, the compaction algorithm will destroy the new information that was created (in the case where two items that were the same before end up being different after the change).
  2. In order to remedy (1), you have to do a deep comparison at the time that you're compacting. You can't assume that the expanded data didn't change. We could also decide that if two items in the array have the same @id, then you include both items in the output (don't merge).

(2) is the generalized solution to the problem (which we have discussed before and rejected because of the requirement to do the deep comparison). However, that is the correct solution to the particular problem in front of us - removing duplicates. There is no way to avoid a deep comparison between objects in the list - it's O(n^2).
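
For illustration, the O(n²) duplicate elimination under discussion could look like this sketch (TypeScript; deepEqual is a stand-in for a full structural comparison of expanded values):

// O(n^2) duplicate removal: each value is compared against every value kept
// so far.

function deepEqual(a: unknown, b: unknown): boolean {
  return JSON.stringify(a) === JSON.stringify(b); // illustration only
}

function removeDuplicates<T>(values: T[]): T[] {
  const out: T[] = [];
  for (const v of values) {
    if (!out.some((kept) => deepEqual(kept, v))) out.push(v);
  }
  return out;
}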

@dlongley came up with a different approach to this problem that doesn't create the duplicate problem, is far less complicated, but still solves the Drupal use case (and the general use case of sites using different IRIs to refer to data in expanded form). I'll put that proposal in below.

@msporny
Member Author

msporny commented Oct 3, 2012

Property Aliasing Proposal

The core problem that Drupal is attempting to solve is to ensure that Drupal sites can exchange data with one another while not having to tightly couple their context property IRIs. That is, each Drupal site has their own data internally, with their own property IRIs. Many of those property IRIs are specific to the drupal site. For example, the title of a tag may be mapped to http://mydrupalsiteA.org/vocabs/title or http://schema.org/title or http://purl.org/dc/terms/title. Site A might know about http://schema.org/title but not http://purl.org/dc/terms/title, Site B might know about http://schema.org/title and http://purl.org/dc/terms/title but not http://mydrupalsiteA.org/vocabs/title.

So the problem is: How do you export data such that Site A can map http://schema.org/title to "footitle", and site B can map http://purl.org/dc/terms/title to "bartitle"?

Property Generators

One approach was to use property generators to duplicate data. So, Site A would duplicate the same data (in expanded form) for http://schema.org/title and http://purl.org/dc/terms/title and http://mydrupalsiteA.org/vocabs/title. The downside with this approach is that, when compacting, you must coalesce the data back into a single property (if the consuming site is using a property generator for the same property). The coalesce step requires a deep comparison to de-duplicate data and is a very expensive O(n²) operation.

If we have to implement this, we can, but there is a better solution that doesn't duplicate data in the first place.

Property Aliasing

Property aliasing has the benefits of addressing the Drupal use case above while not duplicating data. In order to use a property alias, Site A would do this (same syntax as we have for property generators):

{
  "@context": {
    "footitle": { "@id": ["http://schema.org/title", "http://purl.org/dc/terms/title", "http://mydrupalsiteA.org/vocab/title"]}
  },
  "footitle": "baz"
}

Site B contacts Site A to get the data. Site A expands the data using its context. The new property aliasing feature would use the first IRI in the list to expand the data:

{
  "http://schema.org/title": "baz"
}

Site B would then use its own context to compact the data above and work with it:

{
  "@context": {
    "bartitle": { "@id": ["http://purl.org/dc/terms/title", "http://schema.org/title", "http://mydrupalsiteB.org/vocab/title"]}
  },
  "bartitle": "baz"
}

Site B could then communicate the same data back to Site A by following the same algorithm. Note that the expanded form is different this time around (because the first IRI in the list was different on Site B):

{
  "http://purl.org/dc/terms/title": "baz"
}

Site A could then compact the data above and work with it, like so:

{
  "@context": {
    "footitle": { "@id": ["http://schema.org/title", "http://purl.org/dc/terms/title", "http://mydrupalsiteA.org/vocab/title"]}
  },
  "footitle": "baz"
}

So, the data round-trips to exactly what one would expect without needing to duplicate data in expanded form and without the need for @processor statements.
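
A minimal sketch of the aliasing behavior described above (TypeScript; illustrative only: expansion keeps just the first IRI, and compaction matches a property against any IRI in a term's list without merging duplicates):

// Property aliasing: term -> [primary IRI, ...alternate IRIs].

type AliasDefs = Record<string, string[]>;

function expandAliased(node: Record<string, unknown>, defs: AliasDefs) {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(node)) {
    const iris = defs[key];
    out[iris ? iris[0] : key] = value; // the first IRI wins on expansion
  }
  return out;
}

function compactAliased(node: Record<string, unknown>, defs: AliasDefs) {
  const out: Record<string, unknown[]> = {};
  for (const [iri, value] of Object.entries(node)) {
    // A property compacts to a term if the IRI appears anywhere in its list.
    const term = Object.keys(defs).find((t) => defs[t].includes(iri));
    (out[term ?? iri] ??= []).push(value); // duplicates are grouped, not merged
  }
  return out;
}

Grouping without merging is what produces the ["baz", "baz", "baz"] output in the pathological case below.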

Pathological Cases

There is one class of pathological cases. Basically, this is when developers manually inject full URL values into compacted or expanded data. Let's look at the compact case:

{
  "@context": {
    "footitle": { "@id": ["http://schema.org/title", "http://purl.org/dc/terms/title", "http://mydrupalsiteA.org/vocab/title"]}
  },
  "footitle": "baz",
  "http://purl.org/dc/terms/title": "baz",
  "http://mydrupalsiteA.org/vocab/title": "baz"
}

The above would expand out to:

{
  "http://schema.org/title": "baz",
  "http://purl.org/dc/terms/title": "baz",
  "http://mydrupalsiteA.org/vocab/title": "baz"
}

and then compact to:

{
  "@context": {
    "footitle": { "@id": ["http://schema.org/title", "http://purl.org/dc/terms/title", "http://mydrupalsiteA.org/vocab/title"]}
  },
  "footitle": ["baz", "baz", "baz"]
}

Obviously, this is not preferable, but also keep in mind that the developer had to go out of their way to make this happen. If you continue to just use terms and not muck around with the expanded data, you're in good shape.

There is no duplicate removal for property aliasing because it's a 'grouping' mechanism, not a 'same as' mechanism. When you use a property alias, you're saying "Any of the following URLs should be grouped under term X". You're not saying "Any of the following URLs are owl:sameAs the other URLs". While I do admit that this is a bit of weasel wording, it prevents us from having to do a deep compare. We /could/ do a deep compare, but the assertion is that if people use the type of markup above in compact form, they're doing it wrong(tm).

This same line of reasoning applies to subject definitions that have the same @id, but different data. We don't merge in that case either.

Also note that this pathological problem exists in the other proposals as well.

@gkellogg
Member

gkellogg commented Oct 3, 2012

  1. It doesn't address the requirement to do a deep comparison of all items in the array (against every other item in the array, O(n²)). That's effectively what the algorithm above does, but it then assigns a static value (bnode ID) to JSON objects in the array. If somebody changes the expanded output, the algorithm, during compaction, will destroy the new information that was created (in the case where two items that were the same before end up being different after the change).

It addresses this by allowing only one value to be a node definition; the rest MUST be node references.

  2. In order to remedy (1), you have to do a deep comparison at the time that you're compacting. You can't assume that the expanded data didn't change. We could also decide that if two items in the array have the same @id, then you include both items in the output (don't merge).

The fact that we have node references only requires that we compare @id elements.

(2) is the generalized solution to the problem (which we have discussed before and rejected because of the requirement to do the deep comparison). However, that is the correct solution to the particular problem in front of us - removing duplicates. There is no way to avoid a deep comparison between objects in the list - it's O(n²).

If only one node can be a node definition, then there is nothing to compare deeply.

@dlongley came up with a different approach to this problem that doesn't create the duplicate problem, is far less complicated, but still solves the Drupal use case (and the general use case of sites using different IRIs to refer to data in expanded form). I'll put that proposal in below.

@lanthaler
Member

Gregg:

I don’t think that Drupal could live with the fact that the data is just under one IRI, I think it has to be under all IRIs. The problem is that site B might not use a property generator but just use a term for the single property it is interested in (which then contains just a node reference).

Manu:

I don't understand how your proposal should work. Let's make it simple by saying site A expands "term" to "A" and "X", so the expanded output would just contain a property "A". Site B understands "B" and "X". Your proposal would only work if the order of the IRIs in A's context is "X", "A" and in B's context "X", "B". All other combinations wouldn't support round-tripping. I find it very dangerous to rely on the order of the IRIs here.

@dlongley
Member

dlongley commented Oct 4, 2012

Markus,

A expands "term" to "A" and "X", so the expanded output would just contain a property "A". Site B understands "B" and "X".

This is a general problem; not one having to do with property generators per se. For example, suppose site A put "A", "B", and "C" in its property generator, but site B needed "D". The assumption here is that there's at least one shared property -- and that one is listed first. So when building your context, you put the most "public" or "general" property name first.

All that being said, I think we're probably just going to have to go with bnode generation + deep comparison. There are some issues with the proposal Manu and I put forward that would require including something like an @origin flag to keep track of where properties came from ... but that flag would be lost during compaction. It seems like the most readily available solution to this problem is doing deep comparisons to remove duplicates (which itself has drawbacks including maintaining read/write synchronicity between sites ... but that may not be a requirement).

@gkellogg
Member

gkellogg commented Oct 4, 2012

@dlongley, @davidlehn, @msporny and I had a discussion about the merits of either approach. As Dave says, I think we settled on a variation of my approach that uses node definitions for each expanded property, rather than a single node definition and one or more node references. This has the consequence of requiring deep node comparison to test for equivalence (although IMO, a node reference could still be used for comparison too).

This allows us to stay within the RDF data model and ensure that this is compatible with both from- and to-RDF.

There should be a warning that exchanging data with a service that does not understand all terms in a property generator could result in data which is not round-trippable. For example, if I use this to describe a personal profile document in both FOAF and schema.org and provide it to an application that only updates the FOAF part of the data, it will not round-trip:

{
  "@context": {
    "schema": "http://schema.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "name": {"@id": ["schema:name", "foaf:name"]},
    "email": {"@id": ["schema:email", "foaf:mbox"], "@type": "@id"}
    },
    "@id": "http://greggkellogg.net/foaf#me",
    "@type": ["schema:Person", "foaf:Person"],
    "name": "Gregg Kellogg",
    "email": "mailto:gregg@greggkellogg.net"
}

This would expand to the following:

[{
    "@id": "http://greggkellogg.net/foaf#me",
    "@type": [
      "http://schema.org/Person",
      "http://xmlns.com/foaf/0.1/Person"
    ],
    "http://schema.org/name": [{"@value": "Gregg Kellogg"}],
    "http://schema.org/email": [{"@id": "mailto:gregg@greggkellogg.net"}],
    "http://xmlns.com/foaf/0.1/name": [{"@value": "Gregg Kellogg"}],
    "http://xmlns.com/foaf/0.1/mbox": [{"@id": "mailto:gregg@greggkellogg.net"}]
}]

If I use a service that updates this document to add an additional foaf:mbox:

[{
    "@id": "http://greggkellogg.net/foaf#me",
    "@type": [
      "http://schema.org/Person",
      "http://xmlns.com/foaf/0.1/Person"
    ],
    "http://schema.org/name": [{"@value": "Gregg Kellogg"}],
    "http://schema.org/email": [{"@id": "mailto:gregg@greggkellogg.net"}],
    "http://xmlns.com/foaf/0.1/name": [{"@value": "Gregg Kellogg"}],
    "http://xmlns.com/foaf/0.1/mbox": [
      {"@id": "mailto:gregg@greggkellogg.net"},
      {"@id": "mailto:gregg@kellogg-assoc.com"},
    ]
}]

it will not result in something that can be compacted using the property generator term:

{
  "@context": {
    "schema": "http://schema.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "name": {"@id": ["schema:name", "foaf:name"]},
    "email": {"@id": ["schema:email", "foaf:mbox"], "@type": "@id"}
    },
    "@id": "http://greggkellogg.net/foaf#me",
    "@type": ["schema:Person", "foaf:Person"],
    "name": "Gregg Kellogg",
    "email": "mailto:gregg@greggkellogg.net",
    "schema:mbox": {"@id": "mailto:gregg@kellogg-assoc.com"}
}

In my mind, this is perfectly acceptable, and what I would want to have happen anyway.

@lanthaler
Member

The problem you describe, Gregg, could easily be solved by merging the data, but that would break Drupal’s requirements.

As I tried to explain in the last telecon, I find this quite problematic when looking at it from a consumer’s point of view. If I’m a consumer and state that termA maps to the IRIs X, Y, Z, I would expect it to select X independently of whether there’s an equivalent Y, Z.

What’s the relationship of the IRIs in a property generator? It’s definitely not an owl:sameAs, but the structure looks exactly like that and I find that very problematic. I will post my proposal in a separate comment after lunch as I don’t think it’s understandable from the minutes.

@lanthaler
Member

Separating compaction from expansion for property generators

The problem with all proposals so far has been in supporting round-tripping of the results. We tried hard to find a solution that restores the original input document when expanding a document and then compacting it again using the same context. This sounds easy in principle but is difficult to implement efficiently, as it requires comparing all data for equality, which has a computational complexity of O(n²). The other problem with all of the current proposals is that the form in which property generators are specified in the context suggests that all IRIs denote the same concept (as in owl:sameAs), which is not true.

In practice, however, the relationship between the IRIs is more like that of a subproperty to several superproperties. Typically you would use the feature to give a term a very specific IRI (likely from a proprietary vocabulary) and map the same term to other widely used vocabularies, to allow applications that don't know the proprietary vocabulary to "understand" the data nevertheless.

The solution I would thus like to propose explicitly separates the IRIs in the context to highlight that they are not equal. This should make it clear to developers that if they use that feature, their documents won't round-trip anymore as shown in the following example:

{
  "@context": {
    "myvocab": "http://example.com/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/",
    "name": {"@id": "myvocab:name", "@alsoExpandTo": [ "foaf:name", "schema:name" ] },
    "friends": {"@id": "myvocab:friend", "@alsoExpandTo": [ "foaf:knows", "schema:knows" ] }
    },
    "@id": "/people/markus",
    "name": "Markus Lanthaler",
    "friends": {
        "@id": "/people/gregg",
        "name": "Gregg Kellogg"
    }
}

expands to:

[
   {
      "@id": "/people/markus",
      "http://example.com/name": [{
         "@value": "Markus Lanthaler"
      }],
      "http://xmlns.com/foaf/0.1/name": [{
         "@value": "Markus Lanthaler"
      }],
      "http://schema.org/name": [{
         "@value": "Markus Lanthaler"
      }],
      "http://example.com/friend": [{
         "@id": "/people/gregg",
         "http://example.com/name": [{
            "@value": "Gregg Kellogg"
         }],
         "http://xmlns.com/foaf/0.1/name": [{
            "@value": "Gregg Kellogg"
         }],
         "http://schema.org/name": [{
            "@value": "Gregg Kellogg"
         }]
      }],
      "http://xmlns.com/foaf/0.1/knows": [{
         "@id": "/people/gregg",
         "http://example.com/name": [{
            "@value": "Gregg Kellogg"
         }],
         "http://xmlns.com/foaf/0.1/name": [{
            "@value": "Gregg Kellogg"
         }],
         "http://schema.org/name": [{
            "@value": "Gregg Kellogg"
         }]
      }],
      "http://schema.org/knows": [{
         "@id": "/people/gregg",
         "http://example.com/name": [{
            "@value": "Gregg Kellogg"
         }],
         "http://xmlns.com/foaf/0.1/name": [{
            "@value": "Gregg Kellogg"
         }],
         "http://schema.org/name": [{
            "@value": "Gregg Kellogg"
         }]
      }]
   }
]

compacting it again with

{
  "@context": {
    "myvocab": "http://example.com/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/",
    "name": {"@id": "myvocab:name", "@alsoExpandTo": [ "foaf:name", "schema:name" ] },
    "friends": {"@id": "myvocab:friend", "@alsoExpandTo": [ "foaf:knows", "schema:knows" ] }
    }
}

yields

{
  "@context": {
    "myvocab": "http://example.com/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/",
    "name": {"@id": "myvocab:name", "@alsoExpandTo": [ "foaf:name", "schema:name" ] },
    "friends": {"@id": "myvocab:friend", "@alsoExpandTo": [ "foaf:knows", "schema:knows" ] }
    },
   "@id": "/people/markus",
   "name": ["Markus Lanthaler"],
   "foaf:name": ["Markus Lanthaler"],
   "schema:name": ["Markus Lanthaler"],
   "friends": {
      "@id": "/people/gregg",
      "name": ["Gregg Kellogg"],
      "foaf:name": ["Gregg Kellogg"],
      "schema:name": ["Gregg Kellogg"]
   },
   "foaf:knows": {
      "@id": "/people/gregg",
      "name": ["Gregg Kellogg"],
      "foaf:name": ["Gregg Kellogg"],
      "schema:name": ["Gregg Kellogg"]
   },
   "schema:knows": {
      "@id": "/people/gregg",
      "name": ["Gregg Kellogg"],
      "foaf:name": ["Gregg Kellogg"],
      "schema:name": ["Gregg Kellogg"]
   }
}

So the additionally produced data stays there. It is then up to the application to decide what to do with it. Most likely an application would only extract the data it is interested in anyway, so I don't think this should be a big problem.

Pros of this approach:

  • compaction stays exactly the same, no additional overhead
  • clear separation of the main IRI and the additional IRIs (which should make it easier for developers to understand)
  • addresses Drupal's use case (correct me if I'm wrong)

Cons of this approach:

  • Introduction of an additional keyword, @alsoExpandTo, which should probably be renamed :-)
  • No round-tripping at all

@lanthaler
Member

Regardless of my proposal above, I did a little experiment to see if property generators are really worth the effort. The only advantage they seem to bring is saving bandwidth.

So I went to DBpedia and queried for all people that were born in Boston before 1950. Turns out there were 728 persons. I then went and constructed a JSON-LD document in which each of these persons is the friend of one other person:

{
  "@context": {
    "myvocab": "http://example.com/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/",
    "name": {"@id": "myvocab:name", "@alsoExpandTo": [ "foaf:name", "schema:name" ] },
    "friends": {"@id": "myvocab:friend", "@alsoExpandTo": [ "foaf:knows", "schema:knows" ] }
  },
  "@graph" : [
    {
      "@id": "/people/Robert_W._Upton",
      "name": "William Upton",
      "friends": {
        "@id": "/people/William_Stimpson",
        "name": "William Stimpson"
      }
    },
    {
      "@id": "/people/William_Stanley_Braithwaite",
      "name": "William Stanley Beaumont Braithwaite",
      "friends": {
        "@id": "/people/William_Phillips,_Jr.",
        "name": "William Phillips, Jr."
      }
    },
...

The document above contains a property generator (which proposal is chosen is irrelevant for this experiment). Thus every name would not just expand to http://example.com/name but also to http://xmlns.com/foaf/0.1/name and http://schema.org/name; same for friends. Consequently, the expanded document would be roughly three times as big.

To evaluate how useful property generators are I also created a second document without property generators that looks like this:

{
  "@context": {
    "myvocab": "http://example.com/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/",
    "name": {"@id": "myvocab:name" },
    "friends": {"@id": "myvocab:friend" }
  },
  "@graph" : [
    {
      "@id": "http://en.wikipedia.org/wiki/Robert_W._Upton",
      "name": "William Upton",
      "foaf:name": "William Upton",
      "schema:name": "William Upton",
      "friends": {
        "@id": "http://en.wikipedia.org/wiki/William_Stimpson",
        "name": "William Stimpson",
        "foaf:name": "William Stimpson",
        "schema:name": "William Stimpson"
      },
      "foaf:knows": {
        "@id": "http://en.wikipedia.org/wiki/William_Stimpson",
        "name": "William Stimpson",
        "foaf:name": "William Stimpson",
        "schema:name": "William Stimpson"
      },
      "schema:knows": {
        "@id": "http://en.wikipedia.org/wiki/William_Stimpson",
        "name": "William Stimpson",
        "foaf:name": "William Stimpson",
        "schema:name": "William Stimpson"
      }
    },
...

I then compared the size of each document as-is and after gzipping. Here are the numbers (in bytes):

                             |      small      |      medium     |     complete
                             | pretty  gzipped | pretty  gzipped |  pretty  gzipped
With property generators     |  1,175      390 | 10,035    1,910 |  72,017   11,526
Without property generators  |  3,371      478 | 37,399    3,214 | 274,856   21,281
-----------------------------|-----------------|-----------------|-----------------
Expanded                     |  6,839      496 | 81,611    3,884 | 611,684   27,296

complete means the complete document including all 728 persons; medium and small are shrunk-down versions of about 10k and 1k (= 4 + 4 persons). Please note that the expanded output still contains relative IRIs; if http://www.example.com/ is used as base, the expanded complete document would grow to 865,027 bytes.

Honestly, I was quite surprised to see that compression wasn't able to eliminate the repeated inclusion of properties with different values. This is probably due to the fact that we have quite few properties, with short values. I was also surprised to see how a 72k document that consumes 11k on the wire if transmitted compressed expands to over 600k in memory - quite an easy way to launch a DoS attack. So if we are really going to implement this, we will need to at least add a flag to disable it and clearly outline the security consequences of this feature.
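
For anyone wanting to reproduce this kind of measurement, here is a small sketch using Node's zlib (TypeScript; the file names are placeholders for the test documents described above):

// Compare raw and gzipped sizes of the three document variants.

import { gzipSync } from "zlib";
import { readFileSync } from "fs";

for (const file of [
  "with-generators.jsonld", // placeholder names
  "without-generators.jsonld",
  "expanded.jsonld",
]) {
  const raw = readFileSync(file); // Buffer, so .length is a byte count
  console.log(`${file}: ${raw.length} bytes raw, ${gzipSync(raw).length} gzipped`);
}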

@gkellogg
Member

gkellogg commented Oct 5, 2012


The problem you describe, Gregg, could easily be solved by merging the data, but that would break Drupal’s requirements.

It can also be solved by flattening. However, Drupal is covered because they just need to compact what was previously expanded, not deal with a change from an outside party to only part of the data; that's mostly a theoretical issue.

As I tried to explain in the last telecon, I find this quite problematic when looking at it from a consumer’s point of view. If I’m a consumer and state that termA maps to the IRIs X, Y, Z, I would expect it to select X independently of whether there’s an equivalent Y, Z.

The proposal is that you would get all of the data with either X, Y, or Z. I don't see what the problem is.

What’s the relationship of the IRIs in a property generator? It’s definitely not an owl:sameAs, but the structure looks exactly like that and I find that very problematic. I will post my proposal in a separate comment after lunch as I don’t think it’s understandable from the minutes.

Definitely not sameAs; it could be similar to subPropertyOf, and we discussed this. In fact, I don't think that there is any implied relationship.

@lanthaler
Member

As I tried to explain in the last telecon, I find this quite problematic when looking at it from a consumer’s point of view. If I’m a consumer and state that termA maps to the IRIs X, Y, Z, I would expect it to select X independently of whether there’s an equivalent Y, Z.

The proposal is that you would get all of the data with either X, Y, or Z. I don't see what the problem is.

No, you will only get the data that was expanded to X and Y and Z; otherwise that additional email address would be put under that term as well. I assume a developer would expect an or here instead, if all IRIs are just listed in an array.

@tidoust

tidoust commented Oct 5, 2012

[...]

Honestly I was quite surprised to see that compression wasn't able to eliminate the repeated inclusion of properties with different values. This is probably due to the fact that we have quite few properties with short values. I was also surprised to see how a 72k document, that consumes 11k on the wire if transmitted compressed, expands to over 600k in-memory - quite an easy way to launch a DoS attack. So if we are really going to implement this we will need to at least add a flag to disable it and clearly outline the security consequences of this feature.

We can certainly warn people about the consequences of expansion with regards to memory consumption, but I don't really get the DoS possibility. Without even using expansion, isn't it already trivial to generate a JSON-LD doc composed of the "abcdefghijkmlnopqrstuvwxyz" string repeated millions of times and to transmit a compressed version of it? Zipping such a >2MB document (about 100,000 lines) yields a 7KB file in the end. That sounds way more efficient than using expansion (for the purpose of attacking someone, that is).

@lanthaler
Member

Compression is negotiable by the client and the server. Just switch it off and limit the data that you are going to accept. I believe the uncompressed size is also transferred at the end of the gzip stream. On the other hand, you can’t just switch off property generators, as the data you would get out of expansion is completely different.

@linclark
Contributor

linclark commented Oct 5, 2012

Wouldn't expansion in general leave you open to DoS, then?

An attacker could provide an IRI that is extremely long for a term that is very short. HTTP doesn't specify an upper bound for URIs, and AFAIK, URIs up to 2080 chars are recognized by all browsers.

{
  "@context": {
    "a": "http://aaaaa...."
  },
  "a:p1": {"@id": "a:one"}
  ....
}

@lanthaler
Member

That’s true, but expansion doesn’t allow you to duplicate whole subtrees and is therefore much less “effective”.

@msporny
Member Author

msporny commented Oct 8, 2012

Markus, great work on getting the data together on size on the wire vs. size in memory for property generators. I do agree that there is a DoS possibility there and that we should do something about it. I suggest that we put in a maximum memory limit as an option. I also suggest that we put in a maximum processing limit as an option as well. I think we should leave the default unspecified because it's almost completely dependent on the environment and hardware on which you're operating.

The maximum memory limit is needed for the property generator feature. Units would be bytes?

The maximum processing limit is needed for the normalization feature. There are certain graph isomorphisms that are NP. Units would ideally be time taken inside a particular JSON-LD call, but I don't know if there is a very good way to get that data. Units would be milliseconds? Barring that, we could do number of 'ticks', which would be per-processor dependent.
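
A sketch of what these two limits might look like as API options (TypeScript; the option names and error behavior are assumptions based on the suggestion above, not spec text):

// Safety limits checked periodically while expanding.

interface ProcessorLimits {
  maxMemoryBytes?: number;  // cap on the size of the expanded output
  maxProcessingMs?: number; // cap on time spent inside a single call
}

function checkLimits(
  outputBytes: number,
  startedAt: number,
  limits: ProcessorLimits
): void {
  if (limits.maxMemoryBytes !== undefined && outputBytes > limits.maxMemoryBytes) {
    throw new Error("memory limit exceeded during expansion");
  }
  if (limits.maxProcessingMs !== undefined && Date.now() - startedAt > limits.maxProcessingMs) {
    throw new Error("processing limit exceeded");
  }
}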

All this said, it would be easier to DoS the site in other ways. However, I don't think we should allow ourselves to be the reason a site is DoS'd.

@msporny
Member Author

msporny commented Oct 8, 2012

Cons of this approach:

  • Introduction of an additional keyword; @alsoExpandTo, which should probably be renamed :-)
  • No round-tripping at all

@linclark, correct me if I'm wrong, but I don't think Drupal developers would be very happy if their data didn't round-trip, or if they found extra data in their compacted form.

I understand your concern, Markus. The way property generators are expressed makes it seem as if we are saying "these X properties are exactly the same". However, what we're really saying is "if you see property A, then expand to X, Y, and Z". There is no implied relationship between them... it's effectively a copy-and-paste operation. I think that's fairly easy to point out in the spec, and we have to carefully lay that out in tutorials as well.

I'm not convinced that @alsoExpandTo or anything else would imply that there is no rigorous semantic relationship between the expanded IRIs. If people see A expand to X, Y, and Z, some of them are going to think there is a relationship regardless of what the spec states or the keywords imply. There is a very weak relationship, but it's not owl:sameAs, and I'd even go as far as to say it's not subPropertyOf either. I think that the only relationship that's there is that one property was copy-pasted to other properties... nothing else is implied (either implicitly or explicitly).

The no-round-tripping con is a big issue if we don't want to confuse developers with round-tripped data. We can do better, even if it is computationally costly. That is, I'd rather warn folks of the dangers of property generators and let them decide whether or not they want to burn the CPU cycles on the feature. What we do shouldn't surprise developers, and I think that anything that doesn't round-trip the information cleanly (when using the same context) is going to really confuse people.

The best proposal I've seen so far is a hybrid: Use @gkellogg's proposal above, use subject definitions when expanded, generate bnode IDs so you know what to eliminate when compacting, and use deep object comparisons when compacting. This allows one to expand and compact cleanly and without error when the same context is used. It solves the Drupal use case and it solves the problem in a general way.

We will also need to place the following warnings in the spec:

  • Property generators can cause documents to grow dramatically in size and might trigger the memory fence for the processor.
  • Property generators only round-trip data correctly when the exact same property generator definitions are used to expand and compact.
  • Using property generators to expand and publish data for read-only use cases is safe.
  • Using property generators to expand and publish/receive data for read-write use cases is dangerous. This is because the sender of the data may not use the same property generators as the publishing site.

PROPOSAL: Adopt Gregg Kellogg's property generator algorithm when expanding/compacting with the following modifications; 1) use subject definitions everywhere when expanding, 2) generate bnode IDs for all subject definitions without an '@id', 3) use deep comparisons when eliminating subject definitions during compaction.

PROPOSAL: Add warning language to the JSON-LD Syntax and API specs noting the most problematic issues when working with property generators.

@lanthaler
Member

I understand your concern, Markus. The way property generators are expressed makes it seem as if we are saying "these X properties are exactly the same". However, what we're really saying is "if you see property A, then expand to X, Y, and Z". There is no implied relationship between them... it's effectively a copy-and-paste operation. I think that's fairly easy to point out in the spec, and we have to carefully lay that out in tutorials as well.

That's true for expansion, but not at all for compaction. For compaction it is "if the same value exists for all (not any) of these IRIs, use this term with that value". It's exactly that all that worries me, but perhaps it turns out that it's not an issue in practice.

Frankly, I still don't like this feature and how we are going to implement it, but I won't object to it... so let's proceed.

@lanthaler
Member

RESOLVED: Adopt Gregg Kellogg's property generator algorithm when expanding/compacting with the following modifications; 1) use subject definitions everywhere when expanding, 2) generate bnode IDs for all subject definitions without an @id, 3) use deep comparisons when eliminating subject definitions during compaction.

@lanthaler
Member

RESOLVED: Add warning language to the JSON-LD Syntax and API specs noting the most problematic issues when working with property generators.

@lanthaler
Member

RESOLVED: Add a non-normative note to tell developers that their implementations may have a feature that allows all but one node definition created by a property generator to be collapsed into a node reference.

@lanthaler
Member

I have a few more questions regarding property generators as they seem to break basically every single algorithm we have at the moment.

What should happen if a term that is a property generator is used as the value of @id or @type? Throw an error?

Do we need to relabel all blank nodes in expansion? I think yes.

What about compaction? Property generator terms should probably be preferred, but what if you later find that one doesn't apply, i.e., not all property IRIs of that term contain the value? The only potential solution I see at the moment for this problem is to do IRI compaction in two steps: first get a set of candidates for a specific IRI/value pair, then check the candidates according to their rank and choose the first one that applies.
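
A sketch of that two-step selection (TypeScript; Candidate, rank, and applies are illustrative stand-ins for the term-ranking machinery):

// Step 1 produces candidates for an IRI/value pair; step 2 takes the
// highest-ranked candidate whose constraints actually hold, e.g. a property
// generator term applies only if all of its property IRIs contain the value.

interface Candidate {
  term: string;
  rank: number;            // property generator terms would rank highest
  applies: () => boolean;  // deferred check against the actual data
}

function selectTerm(candidates: Candidate[], fallbackIri: string): string {
  const ranked = [...candidates].sort((a, b) => b.rank - a.rank);
  for (const c of ranked) {
    if (c.applies()) return c.term; // first applicable candidate wins
  }
  return fallbackIri; // no term applies: keep the IRI
}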

@msporny
Member Author

msporny commented Dec 5, 2012

Having not put much thought into it, here are my off-the-cuff responses (which may change once I think about it a bit more):

What should happen if a term that is a property generator is used as the value of @id or @type? Throw an error?

Yes, fatal error.

Do we need to relabel all blank nodes in expansion? I think yes

Yes, probably. It's probably a good idea to do so anyway so that people won't depend on the blank node IDs. We will need to keep a global mapping table around for old blank node IDs -> new blank node IDs, which is kinda annoying.

Do IRI compaction in two steps: first get a set of candidates for a specific IRI/value pair, then check the candidates according to their rank and choose the first one that applies.

Seems like a sensible first cut at the problem.

@gkellogg
Member

gkellogg commented Dec 5, 2012

Yes, I agree with Manu. Fortunately, we already have algorithms for renaming BNodes, so applying it to expansion shouldn't be a stretch, but it is something new.

lanthaler added a commit that referenced this issue Dec 6, 2012
This is more or less an exact copy of the example in issue #160. See #160 (comment)
lanthaler added a commit that referenced this issue Dec 6, 2012
lanthaler added a commit that referenced this issue Dec 7, 2012
lanthaler added a commit that referenced this issue Dec 10, 2012
lanthaler added a commit that referenced this issue Dec 10, 2012
…ldn't roundtrip cleanly

The reason is that some of the values explicitly overwrite the property generator's datatype/language.

This addresses #142 and #160.
lanthaler added a commit that referenced this issue Dec 10, 2012
lanthaler added a commit that referenced this issue Dec 11, 2012
lanthaler added a commit that referenced this issue Dec 11, 2012
lanthaler added a commit that referenced this issue Dec 11, 2012
lanthaler added a commit that referenced this issue Dec 13, 2012
lanthaler added a commit that referenced this issue Dec 13, 2012
@lanthaler
Member

RESOLVED: Rename all blank node identifiers when doing expansion.

lanthaler added a commit that referenced this issue Dec 20, 2012
lanthaler added a commit that referenced this issue Dec 20, 2012
@lanthaler
Member

I've updated all algorithms, unless I hear objections I will close this issue in 24 hours.
