More specific frame matching #110

Closed
gkellogg opened this issue Apr 26, 2012 · 28 comments

Comments

@gkellogg
Member

Currently, framing allows you to select subject definitions based on @type matching or duck typing for included properties. It allows value properties to be explicitly matched based on defining the property and excluding things that are undefined, but it does not allow you to be more specific about the types of values selected.

For example, DBPedia creates output in many languages, and it would be reasonable to frame to get just a specific language. This could be supported by adding more specific value matching. Consider the following:

Document:

{
  "rdfs:comment": [
    {"@value": "comment in english", "@language": "en"},
    {"@value": "commentar auf deutsch", "@language": "de"},
    {"@value": "日本語でのコメント", "@language": "ja"}
  ]
}

Frame:

{
  "@context": {},
  "rdfs:comment": {"@value": {}, "@language": "en"}
}

This might select only those values that match the specific language. Or, using a {} wildcard, only those values that have a defined language. The same could be used for @type: {}, when used in conjunction with @value. (We could also consider @language: false or @type: false to indicate that there is no language or type.)
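A minimal sketch of the language filtering floated above, written in Python for illustration. The helper name `filter_by_language` is hypothetical and not part of any JSON-LD API; it assumes a string means an exact language match, `{}` means any language is present, and `False` means no language at all.

```python
def filter_by_language(values, language):
    """Keep value objects whose @language fits the pattern.

    language may be a string (exact match), {} (any language present),
    or False (no @language at all). Hypothetical helper, for illustration.
    """
    selected = []
    for value in values:
        has_language = "@language" in value
        if language is False:
            keep = not has_language
        elif language == {}:
            keep = has_language
        else:
            keep = value.get("@language") == language
        if keep:
            selected.append(value)
    return selected


# The document from the example, plus one untagged value (an addition).
comments = [
    {"@value": "comment in english", "@language": "en"},
    {"@value": "commentar auf deutsch", "@language": "de"},
    {"@value": "日本語でのコメント", "@language": "ja"},
    {"@value": "untagged comment"},
]

english = filter_by_language(comments, "en")
tagged = filter_by_language(comments, {})
untagged = filter_by_language(comments, False)
```

Here `english` holds only the "en" value, `tagged` holds the three values that carry a language, and `untagged` holds the single value without one.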

Using the array syntax, multiple values could be selected:

{
  "@context": {},
  "rdfs:comment": [
    {"@value": {}, "@language": "en"},
    {"@value": {}, "@language": "de"}
  ]
}

This could be extended to say that for a property, the {} wildcard only selects resources, and to select a value requires an explicit {"@value": {}} to be a value. Personally, I think that if you use {"@type": {}}, it's clear that you only want values which are resources, and that {} can remain a wildcard matching both values and resources.

@dlongley
Member

We currently allow filtering on subjects with no @type using @type: [] (as opposed to @type: {} for subjects with any @type). We could do the same for @language and @type in @value ... or we could adopt @type: false for subjects instead of @type: []; however we'll still have the case where @type: []. If we adopt @language: [], then we can also shorten selecting for multiple languages, etc.

@lanthaler
Member

+1

@gkellogg
Member Author

Here are some proposed framing semantics:

The property value {} is a wild card, matching one or more values, subject references or subject definitions.

An empty array [] is a synonym for an array containing a wild card: [{}]. This is useful, as the input frame is expanded and will always be in array form.

Framing iterates over all property values looking for matches; right now it only looks at the first value. (Probably not default values.)

The property value {"@value": {}} matches any value with or without @language or @type.

The property value {"@value": {}, "@type": false} matches any value without a @type. (Same for @language).

The property value {"@value": {}, "@language": "en"} matches any value having a language of "en". Specified in array form, it matches any of the listed languages.

The property value {"@value": {}, "@type": "xsd:integer"} matches any value having a type of "xsd:integer". Specified in array form, it matches any of the listed types. (Presume matching of associated native types as well).

The property value {"@type": {}} matches any subject definition, with or without @type (note the absence of @value). This is the same as today.

The property value {"prop": {}} matches any subject definition having all of the specified properties, with pattern matching over their values as noted here.

To support frame-specific semantics, the expansion algorithm takes an internal boolean flag of frame defaulting to false. When set to true, it allows extended value syntax used in frames.

Future: consider the use of variables of the form "?var" to be used where existing patterns are, in addition to @id, which can perform SPARQL-like BGP matching.

@dlongley
Member

The property value {"@value": {}, "@type": false} matches any value without a @type. (Same for @language).

We should consider this instead:

The property value {"@value": {}, "@type": []} matches any value without a @type. (Same for @language).

This would also provide for this:

The property value {"@value": {}, "@type": ["ex:foo","ex:bar"]} matches any value with a @type of "ex:foo" OR "ex:bar".

The property value {"@value": {}, "@language": ["en","de"]} matches any value with a @language of "en" OR "de".
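The array-based variant can be sketched as follows. This is an illustration in Python under stated assumptions, not the spec's algorithm: `keyword_matches` and `value_matches` are hypothetical names, `{}` is read as "must be present", `[]` as "must be absent", a non-empty list as OR, and a non-wildcard `@value` pattern is compared as a plain scalar.

```python
def keyword_matches(actual, pattern):
    """Match one keyword (@type or @language) of a value object.

    actual is the keyword's value in the data, or None when it is absent.
    """
    if isinstance(pattern, dict) and not pattern:
        return actual is not None      # {}: any value, but it must exist
    if isinstance(pattern, list):
        if not pattern:
            return actual is None      # []: the keyword must be absent
        return actual in pattern       # ["a", "b"]: a OR b
    return actual == pattern           # plain string: exact match


def value_matches(value, pattern):
    """Return True when a value object fits a value pattern from a frame."""
    if "@value" not in value:
        return False                   # node objects are not values
    wanted = pattern.get("@value", {})
    if wanted != {} and value["@value"] != wanted:
        return False
    for key in ("@type", "@language"):
        if key in pattern and not keyword_matches(value.get(key), pattern[key]):
            return False
    return True


typed = {"@value": "5", "@type": "xsd:integer"}
plain = {"@value": "five"}
german = {"@value": "fünf", "@language": "de"}

no_type_ok = value_matches(plain, {"@value": {}, "@type": []})
either_language_ok = value_matches(german, {"@value": {}, "@language": ["en", "de"]})
```

Omitting `@type` from the pattern leaves it unconstrained, so `{"@value": {}}` still matches values with or without a type.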

@lanthaler
Member

The property value {} is a wild card, matching one or more values, subject references or subject definitions.

An empty array [] is a synonym for an array containing a wild card: [{}]. This is useful, as the input frame is expanded and will always be in array form

Does this mean {} and [] are equivalent?

The property value {"@value": {}, "@type": false} matches any value without a @type. (Same for @language).

Can't we use null to specify that a property shouldn't be there? I know it normally would get dropped during frame expansion but I think an internal flag as discussed last week could solve this issue!?

IMO, the following rule is inconsistent with the others; I would expect it to just match objects with an @type property (having an arbitrary value):

The property value {"@type": {}} matches any subject definition, with or without @type (note the absence of @value). This is the same as today.

@lanthaler
Member

If we are going to implement this (and I hope we will), we should also revisit how frame matching works in general. For example, I would say that the following frame (see playground)

{
  "@context": {
    "dc": "http://purl.org/dc/elements/1.1/",
    "ex": "http://example.org/vocab#"
  },
  "ex:contains": { "@type": "ex:Chapter" }
}

shouldn't produce an output like

{
    "@context": {
        "dc": "http://purl.org/dc/elements/1.1/",
        "ex": "http://example.org/vocab#"
    },
    "@graph": [{
        "@id": "http://example.org/library",               /------------------------------\
        "@type": "ex:Library",                       <----| this subject shouldn't be here |
        "ex:contains": null                                \------------------------------/
    }, {
        "@id": "http://example.org/library/the-republic",
        "@type": "ex:Book",
        "ex:contains": {
            "@id": "http://example.org/library/the-republic#introduction",
            "@type": "ex:Chapter",
            "dc:description": "An introductory chapter on The Republic.",
            "dc:title": "The Introduction"
        },
        "dc:creator": "Plato",
        "dc:title": "The Republic"
    }]
}

but just

{
    "@context": {
        "dc": "http://purl.org/dc/elements/1.1/",
        "ex": "http://example.org/vocab#"
    },
    "@graph": [{
        "@id": "http://example.org/library/the-republic",
        "@type": "ex:Book",
        "ex:contains": {
            "@id": "http://example.org/library/the-republic#introduction",
            "@type": "ex:Chapter",
            "dc:description": "An introductory chapter on The Republic.",
            "dc:title": "The Introduction"
        },
        "dc:creator": "Plato",
        "dc:title": "The Republic"
    }]
}

So, instead of just checking the existence of a property, framing should also check the value of a property and just include a subject if all specified properties pass the filter as specified in the frame.
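A rough sketch of that rule in Python, assuming a flattened node map keyed by @id in which linking properties hold lists of `{"@id": ...}` references. The function name `node_matches` and the data layout are assumptions made for this illustration, and only string @type patterns are handled.

```python
def node_matches(node, frame, nodes):
    """Deep match: every property named in the frame must pass its subframe."""
    for key, pattern in frame.items():
        if key == "@type":
            if node.get("@type") != pattern:
                return False
            continue
        # Follow references into the node map and test them against the subframe.
        targets = [
            nodes[ref["@id"]]
            for ref in node.get(key, [])
            if isinstance(ref, dict) and ref.get("@id") in nodes
        ]
        if not any(node_matches(target, pattern, nodes) for target in targets):
            return False
    return True


LIBRARY = "http://example.org/library"
BOOK = "http://example.org/library/the-republic"
CHAPTER = "http://example.org/library/the-republic#introduction"

nodes = {
    LIBRARY: {"@id": LIBRARY, "@type": "ex:Library",
              "ex:contains": [{"@id": BOOK}]},
    BOOK: {"@id": BOOK, "@type": "ex:Book", "dc:title": "The Republic",
           "ex:contains": [{"@id": CHAPTER}]},
    CHAPTER: {"@id": CHAPTER, "@type": "ex:Chapter",
              "dc:title": "The Introduction"},
}

frame = {"ex:contains": {"@type": "ex:Chapter"}}
matched = [node_id for node_id, node in nodes.items()
           if node_matches(node, frame, nodes)]
```

With this frame only the book matches: the library contains a book rather than a chapter, and the chapter has no ex:contains at all. The recursion follows the frame, not the data, so it stops at the frame's depth even if the graph has cycles.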

I would also argue that we don't need to embed everything by default but just if it is specified in the frame. So, an "ex:contains": {} should just embed library->book (not going as far as library->book->chapter), and chapter->book (playground example).

@dlongley
Member

Markus, I can see at least one potential issue with each of your two suggestions. Both of these ideas existed at one time or another in framing and were abandoned. Maybe there's some wiggle room with the first one, but not the second.

Regarding the first one, (multilevel subject matching):

If we want to be consistent, then we may need to deeply traverse the entire set of subframes in order to find matches. In the given example, the suggestion is to only go one level deeper into the frame (as that's the deepest level in the example). Specifically, we have a frame that has the property "ex:contains" and the value {"@type": "ex:Chapter"}. Currently, we match at a single level (without deep-diving into subframes), so any subject with the property "ex:contains" is matched, ignoring the value of "ex:contains". The suggestion is to also check the value of "ex:contains". But what if that value itself contains more subframes?

Should the match traverse the entire path only matching a subject that has those matching deep links? Or should we only go 1 level deep -- if so, why? Could we construct frames where that sort of behavior isn't actually what we want? Could we construct a frame for the use case where we wanted all subjects that have the property X but we only want their deeply-linked subjects if they have a certain property? For example, what if we wanted all of the libraries that contain books, but we only want the books if they contain "red-flagged" chapters (whatever that means)? Would this mean we'd have to run two different framing operations instead of just checking for null?

We could make this work, but we'd have to consider these things ... and it might turn out that it's easier to check for null than implement something complicated and then come up with a solution for the use cases we changed or eliminated by altering the algorithm.

Regarding the second suggestion (not auto-embedding data unspecified in the frame):

This won't work for things like hashing or digital signatures. It's important to keep all of the deep information in the graph for the subjects selected. You can't always know all of the information that was linked to, but if this information is undesirable in your results you can either easily drop it by using {"@embed": false} or simply ignore it in your code. The same can't be said in reverse; if you need all of the information about a particular subject but you don't know all of the properties ahead of time, this second suggestion will cause you to lose that information.

A simple use case is an application that takes RDFa from a webpage that includes details about an asset that is for sale. You want to filter that asset information and organize it in a particular way for your application, so you use a frame with {"@type": "foo:Asset"}. Some of the properties in the asset are used by your application, others are not. All of the properties will be displayed to a potential buyer. You also need to confirm that the asset details were digitally signed by an appropriate party. The signed information will contain everything (deeply) linked to by the asset subject, which with the current framing algorithm, you have in your result set. All you have to do is normalize your result and run the signature verification algorithm.

If we don't include deep-links by default, how do we cover this use case? Why not include deep links by default and use {@embed: false} (or set embed to false as the default in the framing options) in those cases where we know we don't want more information? The latter is what we do now and I currently think it's the best solution.

Note: We also have a flag to exclude properties that aren't explicitly in the frame (@explicit).

@dlongley
Member

An empty array [] is a synonym for an array containing a wild card: [{}]. This is useful, as the input frame is expanded and will always be in array form

Does this mean {} and [] are equivalent?

I don't think that they should be. I think [] should indicate "match nothing" and {} should indicate "match all".

The property value {"@value": {}, "@type": false} matches any value without a @type. (Same for @language).

Can't we use null to specify that a property shouldn't be there? I know it normally would get dropped during frame expansion but I think an internal flag as discussed last week could solve this issue!?

I think we can avoid the flag entirely by using [] to indicate "match nothing". Since we should also support using arrays to indicate OR matching (eg: "@type": ["foo:type1", "foo:type2"] means match either "foo:type1" or "foo:type2") , I think this would be simplest.

IMO, the following rule is inconsistent with the others; I would expect it to just match objects with an @type property (having an arbitrary value):

The property value {"@type": {}} matches any subject definition, with or without @type (note the absence of @value). This is the same as today.

I agree that it seems inconsistent. In a value construct, if you specify: {"@type": {}} then I think only values with a @type (any type) should match. If you specify {"@type": []} then I think only values without a @type should match. If you don't specify "@type" (you just use {}) then I think anything with or without a type can match.

@lanthaler
Member

Agree regarding {} and [].

If we want to be consistent, then we may need to deeply traverse the entire set of subframes in order to find matches

Yes, if the frame is that deep, that's what we need to do I think.

Should the match traverse the entire path only matching a subject that has those matching deep links?

Yes

Could we construct a frame for the use case where we wanted all subjects that have the property X but we only want their deeply-linked subjects if they have a certain property? For example, what if we wanted all of the libraries that contain books, but we only want the books if they contain "red-flagged" chapters (whatever that means)? Would this mean we'd have to run two different framing operations instead of just checking for null?

Either that or writing the frame so that all books are returned and the app checks whether they are "red-flagged" or not. I think the normal use case is to "filter" the graph and return it in a structure that is easy to process. By always returning all the data we leave the filtering completely to the application.. well almost, we filter for values and @type IRIs.

Edit: Another way to achieve what you are trying to do would be to leverage @default by setting it to null.

Regarding the second suggestion (not auto-embedding data unspecified in the frame): This won't work for things like hashing or digital signatures

I don't really buy this argument as it is a bit oversimplified I think. You will need to know what properties and up to which level the graph was signed by the other party and can't just assume that it included everything. What if there were some properties added just for SEO which are not signed, for example? Nevertheless, I see where you are coming from but I don't like the idea of having to add @embed everywhere in the frame as I already specify explicitly what I want to have included.

What about adding a flag like autoembed subgraph (default: false) to the framing algorithm? The result of {} for the library example would then not contain the book twice and the chapter three times by default but just

...
"@graph": [
    {
        "@id": "http://example.org/library",
        "@type": "ex:Library",
        "ex:contains": { "@id": "http://example.org/library/the-republic" }
    }, {
        "@id": "http://example.org/library/the-republic",
        "@type": "ex:Book",
        "ex:contains": { "@id": "http://example.org/library/the-republic#introduction" },
        "dc:creator": "Plato",
        "dc:title": "The Republic"
    }, {
        "@id": "http://example.org/library/the-republic#introduction",
        "@type": "ex:Chapter",
        "dc:description": "An introductory chapter on The Republic.",
        "dc:title": "The Introduction"
    }
]

Which is, I think, what the frame's author intended.
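To make the shape of that output concrete, here is a small Python sketch that turns a fully embedded tree into one entry per subject with nested subjects replaced by references. The function `to_references` is a hypothetical helper, and it assumes each property holds a single object or scalar, as in the library example.

```python
def to_references(node, out=None):
    """Collect node and its nested nodes, replacing embeds with references."""
    if out is None:
        out = {}
    # Reserve the slot first so parents are listed before their children.
    flat = out.setdefault(node["@id"], {})
    for key, value in node.items():
        if isinstance(value, dict) and "@id" in value:
            if len(value) > 1:  # an embedded node, not a bare reference
                to_references(value, out)
            flat[key] = {"@id": value["@id"]}
        else:
            flat[key] = value
    return out


library = {
    "@id": "http://example.org/library",
    "@type": "ex:Library",
    "ex:contains": {
        "@id": "http://example.org/library/the-republic",
        "@type": "ex:Book",
        "ex:contains": {
            "@id": "http://example.org/library/the-republic#introduction",
            "@type": "ex:Chapter",
            "dc:description": "An introductory chapter on The Republic.",
            "dc:title": "The Introduction",
        },
        "dc:creator": "Plato",
        "dc:title": "The Republic",
    },
}

graph = list(to_references(library).values())
```

`graph` then lists the library, the book, and the chapter once each, linked only by @id.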

@dlongley
Member

You will need to know what properties and up to which level the graph was signed by the other party and can't just assume that it included everything.

You can if that's the protocol. It's a very easy way of specifying what is part of the signed or hashed information in the page. Much more difficult ways include enumerating all properties and relationships that can appear in the signature -- not to mention how limiting that is with regard to extensibility. Specifying an additional field to allow enumerating more properties and relationships that are part of the signature is even more complex.

I may be "ok" with this being an option that is part of framing, but I actually think it should be the default behavior. We've been using framing in practice in PaySwarm and this is exactly what we want and need. If we don't want information embedded, we simply use some combination of @embed: false or @explicit: true -- that's why they're there. Frames can be very specific about limiting what you want back if you want them to be -- and less specific if you want to capture whatever is there, just in a particular structure.

@dlongley
Member

Nevertheless, I see where you are coming from but I don't like the idea of having to add @embed everywhere in the frame as I already specify explicitly what I want to have included.

You don't have to; just specify @embed: false in the options once (to set that as the new default behavior).

@lanthaler
Member

If you do that, then you would have to add @embed to every single match otherwise nothing would be included at all - or is that a bug? See playground.

Currently I look primarily at the playground to check how framing is supposed to work as I think the algorithm in the spec is not really up to date, is it?

@dlongley
Member

I believe the algorithm is up-to-date.

Also keep in mind that I said you could specify @embed = false in the options (which we currently don't have a UI for changing in the playground). However, yes, if you set @embed to false, then you have to specify what you want embedded in the frame by using @embed: true. I don't see that as a problem as it's exactly what's asked for, namely, only embed those specific things requested, don't just automatically embed everything.

@lanthaler
Member

Well, I could as well argue that this specifies that I would like to have everything down to chapters:

...
  "@type": "ex:Library",
  "ex:contains": {
    "@type": "ex:Book",
    "ex:contains": {
      "@type": "ex:Chapter"
...

I don't know if that's a bug, but even including "@embed": true everywhere doesn't actually change anything. See here.

That being said, I still think we should automatically embed everything that is specified in the frame, potentially several times, and stop at the depth specified by the frame (unless a special flag is set to also include the remaining subgraph).

@dlongley
Member

What we could do is automatically enable @embed if there is a subframe. I think that would make sense.

@dlongley
Member

Oh, and you just need to set @embed: true at the top-level in your example if you want to embed anything inside of the library subject.

@dlongley
Member

One more thing; one reason to not automatically enable embed would be if we adopted deep-filtering like suggested above. Then you might actually just want to filter out a particular subject that has the links in the frame -- and get a reference to it. However, I don't think that's really what framing should do. The point of framing is primarily to specify a particular structure, filtering is secondary but also important.

@gkellogg
Member Author

Perhaps this is not what we want, but the playground is consistent with the algorithm, which states that if embedOn is false to skip embedding, and add output to parent. At this point, output contains only the @id property, so the result is consistent with the algorithm, at least up to that point.

Putting "@embed": true in lower frames doesn't have an effect, as they are never processed.

We don't get down to 4.6.2.3.2, where we would otherwise embed the output, because embedOn is false. To perform further processing would suggest that 4.6 should be "And Then" instead of "Otherwise". What would the ramifications be of doing that?
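The short-circuit described here can be pictured with a simplified Python sketch. It is not the spec's algorithm: `frame_node` is a hypothetical function, `default_embed` stands in for the embed option, the node map uses single `{"@id": ...}` references, and the data is assumed to be acyclic.

```python
def frame_node(node, frame, nodes, default_embed=True):
    """Build the output for node under a simplified embed rule."""
    output = {"@id": node["@id"]}
    embed_on = frame.get("@embed", default_embed)
    if not embed_on:
        # Embedding is skipped: the output keeps only @id, and any
        # subframes below this point are never looked at.
        return output
    for key, value in node.items():
        if key == "@id":
            continue
        if isinstance(value, dict) and value.get("@id") in nodes:
            subframe = frame.get(key, {})
            output[key] = frame_node(nodes[value["@id"]], subframe,
                                     nodes, default_embed)
        else:
            output[key] = value
    return output


LIB = "http://example.org/library"
BOOK = "http://example.org/library/the-republic"
CHAPTER = "http://example.org/library/the-republic#introduction"

nodes = {
    LIB: {"@id": LIB, "@type": "ex:Library", "ex:contains": {"@id": BOOK}},
    BOOK: {"@id": BOOK, "@type": "ex:Book", "dc:title": "The Republic",
           "ex:contains": {"@id": CHAPTER}},
    CHAPTER: {"@id": CHAPTER, "@type": "ex:Chapter",
              "dc:title": "The Introduction"},
}

# "@embed": true only in the subframe: it is never reached.
skipped = frame_node(nodes[LIB], {"ex:contains": {"@embed": True}},
                     nodes, default_embed=False)

# "@embed": true at the top level as well: the book is now embedded.
embedded = frame_node(nodes[LIB],
                      {"@embed": True, "ex:contains": {"@embed": True}},
                      nodes, default_embed=False)
```

In this sketch `skipped` is just the library's @id, while `embedded` contains the book, whose own ex:contains falls back to a reference because no deeper subframe turns embedding on.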

@lanthaler
Member

What we could do is automatically enable @embed if there is a subframe. I think that would make sense.

Agree - I think :-)

Oh, and you just need to set @embed: true at the top-level in your example if you want to embed anything inside of the library subject.

I understand that (which would be equal to just removing it altogether).. but why do the @embeds in the subframes have no effect?

One more thing; one reason to not automatically enable embed would be if we adopted deep-filtering like suggested above. Then you might actually just want to filter out a particular subject that has the links in the frame -- and get a reference to it. However, I don't think that's really what framing should do. The point of framing is primarily to specify a particular structure, filtering is secondary but also important.

Not sure I understand what you are saying here.. Are you saying that in deep-filtering I would have to reconstruct the structure myself then and I just get the top-level subject back? I think the whole path down to the deepest subframe should be returned by default - nothing more, nothing less.

Having filtering restart at every level seems a bit odd to me - @type filtering works exactly the other way round. I don't know if we have the same idea of how @value matching is supposed to work if no matching value is found for a property. I would say that that should result in the whole subject not matching (same as when there's a @type mismatch). Therefore I would say it is only consistent to treat other objects (that have an @id) the same way.

@lanthaler
Member

I re-read the thread and tried to distill the open questions:

  1. Do we interpret {} as wildcard?
  2. Do we interpret [] as match nothing, i.e., property shouldn't be in subject?
  3. Do we interpret [ a, b ] as meaning the value of the subject's property should be a OR b?
  4. Should we add support for value matching (e.g. "@value": 5; edit: this includes @language and @type)?
  5. Should we do deep-filtering not just for value objects but for everything?
  6. Should we automatically include the whole subtree or go just as deep as the frame is (with a flag to explicitly embed the subtree)?

Furthermore we might need to discuss how @embed is supposed to work.

Please correct me if I forgot something.

@lanthaler
Member

Here's my opinion on all those points:

  1. Yes
  2. Yes
  3. Yes. a and b can either be scalars or objects but not other arrays.
  4. Yes
  5. Yes
  6. By default we should just go as deep as the frame is and we should have a flag like @embedChildren to switch to the other behavior. If that's on, we have to check for cycles.

Regarding @embed, I think that it should just be used to limit what's returned when doing deep-filtering. So a frame like @type: library -> @type: book, @embed: false -> @type: chapter would just include libraries which contain books which contain chapters but just embed libraries (neither books nor chapters). Apart from deep filtering I think this is in line with how it works currently.
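One possible reading of this, sketched in Python. Everything here is an assumption made for illustration: `frame_deep` is a hypothetical function, a non-embedded match is emitted as a bare `{"@id": ...}` reference, only properties named in the frame are output, and the extra nodes (`PAMPHLET`, `ANNEX`) are invented test data.

```python
def frame_deep(node, frame, nodes):
    """Return the framed node, or None when the deep filter rejects it."""
    output = {"@id": node["@id"]}
    for key, pattern in frame.items():
        if key == "@embed":
            continue
        if key == "@type":
            if node.get("@type") != pattern:
                return None
            output["@type"] = pattern
            continue
        framed = [frame_deep(nodes[ref["@id"]], pattern, nodes)
                  for ref in node.get(key, [])]
        framed = [child for child in framed if child is not None]
        if not framed:
            return None  # no value of this property passed the subframe
        output[key] = framed
    if frame.get("@embed", True) is False:
        # Filtering still happened above; only the output is cut back.
        return {"@id": node["@id"]}
    return output


LIB, ANNEX = "ex:library", "ex:annex"
BOOK, PAMPHLET, CHAPTER = "ex:the-republic", "ex:pamphlet", "ex:introduction"

nodes = {
    LIB: {"@id": LIB, "@type": "ex:Library",
          "ex:contains": [{"@id": BOOK}, {"@id": PAMPHLET}]},
    ANNEX: {"@id": ANNEX, "@type": "ex:Library",
            "ex:contains": [{"@id": PAMPHLET}]},
    BOOK: {"@id": BOOK, "@type": "ex:Book", "ex:contains": [{"@id": CHAPTER}]},
    PAMPHLET: {"@id": PAMPHLET, "@type": "ex:Book"},
    CHAPTER: {"@id": CHAPTER, "@type": "ex:Chapter"},
}

frame = {
    "@type": "ex:Library",
    "ex:contains": {
        "@type": "ex:Book",
        "@embed": False,
        "ex:contains": {"@type": "ex:Chapter"},
    },
}

framed = [frame_deep(node, frame, nodes) for node in nodes.values()]
results = [item for item in framed if item is not None]
```

Only the library that holds a book with a chapter survives the filter, and that book appears as a reference rather than an embedded subject.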

@gkellogg
Copy link
Member Author

I agree with your votes. For 4), I'd extend it to include {"@language": "en"} and {"@type": "xsd:string"}, with array representations for OR, and {} for wildcard (meaning, must exist).

@lanthaler
Member

Yeah, sorry.. that wasn't clear enough.. @value was just one example. Of course also @language and @type should be supported

@lanthaler
Member

I think we need to clarify one more thing:

  1. Does ... "property": {} mean that property must exist for a match?

My opinion: Yes

@dlongley
Member

  1. Do we interpret {} as wildcard?

Yes.

  2. Do we interpret [] as match nothing, i.e., property shouldn't be in subject?

Yes.

  3. Do we interpret [ a, b ] as meaning the value of the subject's property should be a OR b?

Yes.

  4. Should we add support for value matching (e.g. "@value": 5; edit: this includes @language and @type)?

Yes.

  5. Should we do deep-filtering not just for value objects but for everything?

I'm apprehensive about this. I haven't had the time to understand the repercussions of making this change yet.

  6. Should we automatically include the whole subtree or go just as deep as the frame is (with a flag to explicitly embed the subtree)?

I think we should automatically (by default) include all of the data that is linked to the subjects that match the frame (this means including the whole subtree). I think it should be an option to not include this information, by specifying @embed: false.

  7. Does ... "property": {} mean that property must exist for a match?

Currently, if you provide @type: "foo", then anything with the @type "foo", regardless of its other properties, will match. If you have a subject with @type "foo" and it doesn't have "property" then the output will use null for that property or the value of @default if given in the frame. If you do not provide a @type, then duck-typing is used; subjects that are missing "property" will not match. I believe this is the behavior we want.
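The behavior described in this paragraph can be sketched roughly as follows. This is a simplification in Python, not the actual implementation: `matches` and `fill_defaults` are hypothetical names, and @type is treated as a single string.

```python
def matches(subject, frame):
    """Type matching when the frame has @type, duck typing otherwise."""
    if "@type" in frame:
        # Other frame properties do not affect the match.
        return subject.get("@type") == frame["@type"]
    wanted = [key for key in frame if not key.startswith("@")]
    return all(key in subject for key in wanted)


def fill_defaults(subject, frame):
    """Add frame properties the subject lacks, using @default or None (null)."""
    output = dict(subject)
    for key, subframe in frame.items():
        if key.startswith("@") or key in output:
            continue
        default = subframe.get("@default") if isinstance(subframe, dict) else None
        output[key] = default
    return output


typed_without = {"@id": "ex:a", "@type": "foo"}
untyped_with = {"@id": "ex:b", "property": "x"}
untyped_without = {"@id": "ex:c", "other": "y"}

typed_frame = {"@type": "foo", "property": {}}
duck_frame = {"property": {}}

filled = fill_defaults(typed_without, typed_frame)
filled_default = fill_defaults(
    typed_without, {"@type": "foo", "property": {"@default": "n/a"}})
```

Here `typed_without` matches `typed_frame` despite lacking "property", and `filled` carries `"property": None`, which corresponds to null in the JSON output; under `duck_frame` only `untyped_with` matches.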

Furthermore we might need to discuss how @embed is supposed to work.

Yes, I think we need some more discussion here.

@lanthaler
Member

I just uploaded the latest version of my processor which supports everything we discussed above. I'll send a mail to the mailing list in a sec but wanted to add a reference in the relevant issues first. I also uploaded a modified version of the playground so that you can play with it.

I'm sure there are still some minor bugs in my implementation but I hope it helps nevertheless to keep the discussion on this going.

lanthaler added a commit that referenced this issue Aug 29, 2012
@lanthaler
Member

RESOLVED: Do not support .frame() in JSON-LD 1.0 API.

@gkellogg gkellogg added 1.1 and removed on-hold labels Sep 21, 2016
gkellogg added a commit that referenced this issue Oct 4, 2016
…framing algorithm in general, and implement more specific node matching, including deep matching, as described in #110 for multiple values for `@id` in #424.

Still requires more work on expansion and tests. Note that frame matching can now be quite expensive, if all features are used. Also, implementations may want to save some of the work done when matching to do the actual framing.
@gkellogg
Member Author

gkellogg commented Oct 4, 2016
