Define exactly how (IRI) compaction is supposed to work #113

Closed
lanthaler opened this Issue May 5, 2012 · 21 comments

Comments

Projects
None yet
5 participants
Member

lanthaler commented May 5, 2012

In the context of issue #74 we were already discussing how compaction should work and came to the conclusion that:

The intent for how IRI conflicts are resolved when compacting/expanding: any conflicts between terms that use the same IRI will use the most specific solution when compacting (for example, when compacting "foo": "5" and having to pick between a term that specifies "xsd:integer" as the type and one that doesn't, the one that specifies "xsd:integer" is selected).
If there is no solution that is more specific than the other, then a lexicographical comparison is made between the terms in the @context and the lexicographically least term and it's associated @type and other information is used to expand the data.

We didn't go any further in describing it more specifically. This lead to the situation that Gregg and Dave implemented the algorithm completely different than I did. My strategy is to choose the term that eliminates the most expanded object forms (in lists, for arrays, every value is compacted separately now). If more terms achieve the same ranking according that criteria, the lexicographically least term is chosen.

Edit: I updated my algorithms based on the feedback, please read the algorithms in the comments below instead!

My IRI compaction algorithm looks like this:

  - calculate rank of full IRI
  - calculate rank for every term/prefix
  - chose term/compact IRI/IRI with highest rank, falling back to lexicographically least

The term rank calculation for a term (could also be a (compact) IRI) and a value works as follows:

  - set rank to 0
  - if it's a term that is defined in the context rank++ (we prefer terms)
  - if value is a `@list` object
    - if term has a list-container, rank++
    - check for every item in the list if the result of compaction would be a scalar, 
      if so rank++ else rank--
  - otherwise
    - if term has a list-container, return rank = false (will never be chosen)
    - if term has a set-container, rank++
    - check if the result of compacting value would be a scalar, if so rank++ else rank--

You can find the full implementation in Processor.php of JsonLD.

I think this algorithm is much easier to implement and much easier to understand than the current algorithm in the spec so I would propose to change it. The only open issue is whether a term with a mismatching type will ever match.

Consider for example the following:

{ 
    "http://example.org/term1": { 
        "@type": "http://example.org/different-datatype", 
        "@value": "v1" 
    }
}

Should this be compacted to

{ 
    "@context": { 
        "term1": { 
            "@id": "http://example.org/term1", 
            "@type": "http://example.org/datatype" 
     },
    "term1": { 
        "@type": "http://example.org/different-datatype", 
        "@value": "v1" 
    }
}

or not?

Another question is whether we compact to empty suffixes, so the example above could also be compacted to "term1:" (note that the suffix is empty which would drop the type coercion).

Owner

dlongley commented May 5, 2012

We'll have to look more closely at what happens between the two algorithms to see what the different outputs would be. I don't think that we should only consider "what removes the most expanded forms or IRIs" without considering the semantics behind what's going on. In other words, if given the option of compacting to a term with a type that doesn't match, or compacting to a prefix with a type that does or with no type at all, I think the prefix would be preferred. I think an application would be expecting to find only things of type X in a term with a coercion of type X. In my view, if they didn't care what the type was, they wouldn't have specified it along with the term. The same principle applies to language.

Also, we don't want to break up @lists, I don't think. So we should be careful there. If we agree that lists shouldn't be broken up, then there is something to be said about picking the term that results in the fewest number of expanded forms there; however, that might result in an overly complex algorithm (as opposed to what's in the spec now) with little gain, IMO.

Owner

dlongley commented May 5, 2012

For example, when compacting with this input:

[{
  "http://example.com/term": [
    {"@value": "foo", "@type": "type:A"},
    {"@value": "bar", "@type": "type:B"}
  ]
}]

And this context:

{
  "@context": {
    "term1": {"@id": "http://example.com/term", "@type": "type:A"},
    "ex": "http://example.com/"
  }
}

I think the output ought to be:

{
  "@context": {
    "term1": {"@id": "http://example.com/term", "@type": "type:A"},
    "ex": "http://example.com/"
  },
  "term1": "foo",
  "ex:term1": {"@value": "bar", "@type": "type:B"}
}

Perhaps the example seems even more obvious when using a language:

[{
  "http://example.com/question": [
    {"@value": "Do you speak German?", "@language": "en"},
    {"@value": "Sprichst du Deutsch?", "@language": "de"}
  ]
}]

And this context:

{
  "@context": {
    "english": {"@id": "http://example.com/question", "@language": "en"},
    "ex": "http://example.com/"
  }
}

I think the output ought to be:

{
  "@context": {
    "english": {"@id": "http://example.com/term", "@language": "en"},
    "ex": "http://example.com/"
  },
  "english": "Do you speak German?",
  "ex:question": {"@value": "Sprichst du Deutsch?", "@type": "type:B"}
}

Whereas, unless I'm mistaken, the algorithm above would result in:

{
  "@context": {
    "english": {"@id": "http://example.com/term", "@language": "en"},
    "ex": "http://example.com/"
  },
  "english": ["Do you speak German?", {"@value": "Sprichst du Deutsch?", "@language": "de"}]
}

Which I don't think is preferable. I think when you include a @type or a @language along with your term or prefix, you expect only those things with matching types or languages to be present. If you specifically say you want a null language, then I think you should get anything without a language as well.

Member

lanthaler commented May 6, 2012

In other words, if given the option of compacting to a term with a type that doesn't match, or compacting to a prefix with a type that does or with no type at all, I think the prefix would be preferred.

In my implementation the prefix with the matching type would be chosen but if it doesn't, the term would.

I think an application would be expecting to find only things of type X in a term with a coercion of type X.

This would be a reasonable design choice and I wouldn't be opposed to change my algorithm to do that. The question remains though if we should (mis-)use term definitions (i.e., terms that have type/language/container set) as prefixes in that case.

In my view, if they didn't care what the type was, they wouldn't have specified it along with the term. The same principle applies to language.

Agree.

Also, we don't want to break up @lists, I don't think. So we should be careful there.

Please note that I do NOT break up lists. I don't think that would even be possible without breaking the data.

If we agree that lists shouldn't be broken up, then there is something to be said about picking the term that results in the fewest number of expanded forms there; however, that might result in an overly complex algorithm (as opposed to what's in the spec now) with little gain, IMO.

I think the algorithm I outlined above is simple (and understandable) enough to achieve that.

A related question is what we do with properties in the input document that have multiple lists (sets of lists). Would it be OK to compact the first list to a property and put the others in other properties? Should all be put in a property without list-container? Should an exception be thrown?

Member

lanthaler commented May 6, 2012

Regarding your example, that's true. At the moment I prefer terms but I'm more than willing to change that behavior. Please note that also the following would be a valid output that would be even compacter (I'm not advocating this!):

{
  "@context": {
    "term1": {"@id": "http://example.com/term", "@type": "type:A"},
    "ex": "http://example.com/"
  },
  "term1": "foo",
  "term1:": {"@value": "bar", "@type": "type:B"}
}

The question remains though what should happen if no prefix exists. Should we keep the full IRI or use the term then?

{
  "@context": {
    "term1": {"@id": "http://example.com/term", "@type": "type:A"}
  },
  "term1": "foo",
  "http://example.com/term": {"@value": "bar", "@type": "type:B"}
}

And what happens with terms that fall back to the default language?

{
  "@context": {
   "@language": "it",
    "comment": { "@id": "http://example.com/term" },
    "ex": "http://example.com/"
  },
  "comment": [
    {"@value": "Do you speak German?", "@language": "en" },
    {"@value": "Sprichst du Deutsch?", "@langage": "de"}
}

Would that be correct or should the full IRI be used instead due to the (implicit) language mismatch?

Owner

dlongley commented May 8, 2012

I don't think we should do empty suffixes, it seems odd to me -- so I can see others considering it unexpected behavior.

I'll have to think more about what should happen with a default language. I can see it going either way right now, which means we might just go with whatever is easiest (probably what we currently have ... which is to reject the term because of the implicit language mismatch). I'm not sold either way at the moment, so maybe we need a few more examples that could highlight whether one way would seem more unnatural than the other.

Member

lanthaler commented May 8, 2012

Agree, empty suffixes seem odd to me as well.

I've updated my term rank implementation to not simply check if a expanded object form can be eliminated but to check if the term's type/language match.

So my IRI compaction algorithm still looks like this:

- calculate rank of full IRI
- calculate rank for every term/prefix
- chose term/compact IRI/IRI with highest rank, falling back to lexicographically least

But the term rank calculation for a term (could also be a (compact) IRI) has to be changed to:

- set rank to 0
- if it's a term that is defined in the context rank++ (we prefer terms)
- if value is a `@list` object
  - if term has a list-container, rank++
  - check for every item in the list if the term definition matches the item, 
    if so rank++ else rank--
- otherwise
  - if term has a list-container, rank = false (will never be chosen), return rank
  - if term has a set-container, rank++
  - if a non-null value was passed
    - if the value matches the term definition, rank++
    - otherwise rank-- **and** if the term has a type or language assigned
      rank = false (to never choose non-matching terms)
- return rank

The type/language match check is also quite simple:

- if value is an object
  - and has an @value property
    - if a @type property exists, return true if it matches the passed type, otherwise false
    - otherwise, if a @language property exists, return whether it matches the passed language
    - otherwise
      - if a type was passed, return false
      - if a language was passed and the value of @value is a string, return false
      - otherwise, return true
  - otherwise if value has an @id property
    - return true if the passed type equals @id, otherwise false
  - otherwise
    - return false if a type or language was passed, otherwise true
- otherwise (value is a scalar)
  - if a type was passed, return false
  - if value is a string and a language was passed, return false
  - return true

This code passes all the compaction/expansion tests we currently have: Travis build status

I think these algorithms are much easier to understand and to implement. If you are OK with it, I'll go ahead and update the spec accordingly.

Owner

dlongley commented May 8, 2012

Hmm, I wouldn't update the spec just yet; I'm not sure exactly how different the two algorithms we have right now are or which one is actually simpler. I'll take a closer look when I get a chance. Any thoughts, @gkellogg?

Owner

gkellogg commented May 8, 2012

I'm not sure it's simpler either, just different. And, don't have time to re-implement right now anyway, so I'll go with the flow for now.

The current algorithm is more strict on container-set and container-list; it seems like this one might allow some cross-over. I think there's value in being deterministic in going straight to a CURIE or IRI if there's no exact term match.

The main difference is requiring all @list members to have a type/language match, rather than the average of them.

Member

lanthaler commented May 9, 2012

The current algorithm is more strict on container-set and container-
list; it seems like this one might allow some cross-over.

No, it won't cross over except if the full IRI is coerced to a list and no prefixes are defined...bBut you have no choice then anyway.

I think
there's value in being deterministic in going straight to a CURIE or
IRI if there's no exact term match.

Don't understand what you are trying to say with this.

The main difference is requiring all list members to have a
type/language match, rather than the average of them.

Both algorithms calculate the sum of the members. Mine uses +1 for matches and -1 for mismatches whereas the yours adds something between 0 and 3 for every term - which I found quite difficult to reproduce on a sheet of paper e.g.

Owner

gkellogg commented May 9, 2012

At this point, I'd say just do what you want. I don't really have the energy to continue debating such changes. Certainly, if algorithms that we're using can be simplified, and still do the same thing, we should do this. However, it seems that the trend is for everyone to take the algorithms and instead of implementing them, come up with their own thing.

Perhaps we'd be better to focus on areas of the spec that are not done, rather than re-working those that are complete with multiple implementations.

Member

lanthaler commented Sep 25, 2012

There have been some further discussions in the 2012-09-18 telecon which yielded to the following action:

Manu Sporny: Okay - maybe Markus and I need to write the pseudocode for what we've discussed today, then we look at it as a group, then decide what we want to go with and include it in the spec.

Owner

msporny commented Sep 30, 2012

I spoke with @dlongley about this as he updated the JavaScript, PHP, and Python implementations last week based on the latest text in the spec. He did not agree that we should try to re-write the term ranking algorithm because the original approach he used when writing the IRI compaction algorithm was to do multiple rounds. He said that there are so many conditionals that you end up having a ton of rounds and the term ranking algorithm ends up being much easier than doing multiple rounds.

So, at this point, I agree with him as the re-write I proposed would end up further complicating the spec text.

Member

lanthaler commented Nov 19, 2012

Issue #172 is related to this.

Owner

msporny commented Nov 20, 2012

PROPOSAL 1: Clarify parts of the IRI compaction algorithm that need to change, but do not change the algorithm in any large way as it works and has been implemented by two different people.

PROPOSAL 2: Adopt Markus' proposed algorithm above for the IRI compaction algorithm.

Owner

msporny commented Nov 20, 2012

PROPOSAL 1: +1
PROPOSAL 2: -1

Contributor

tidoust commented Nov 20, 2012

PROPOSAL 1: +1
PROPOSAL 2: -1

Member

lanthaler commented Nov 20, 2012

Clarify parts of the IRI compaction algorithm that need to change, but do not change the algorithm in any large way as it works and has been implemented by two different people

What does this mean? Which parts can be discussed/clarified? Term ranking apparently not and that's the meat of the algorithm.

Member

lanthaler commented Nov 20, 2012

As I just found out, the one of the two implementations you mentioned is buggy as it collapses multiple lists into one: http://bit.ly/UQG2C7

Owner

gkellogg commented Nov 20, 2012

PROPOSAL 1: +1
PROPOSAL 2: -1

Member

lanthaler commented Nov 27, 2012

RESOLVED: When compacting lists, the most specific term that matches all of the elements in the list, taking into account the default language, must be selected.

lanthaler added a commit that referenced this issue Dec 20, 2012

Member

lanthaler commented Dec 20, 2012

I've updated all algorithms, unless I hear objections I will close this issue in 24 hours.

@lanthaler lanthaler closed this Dec 21, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment