Feature: how to join values across data sources, with join conditions? #16

Closed
bjdmeest opened this issue Mar 25, 2021 · 19 comments

@bjdmeest
Member

bjdmeest commented Mar 25, 2021

Came from kg-construct/rml-core#1

It is currently possible to join values across data sources, but without join conditions (see e.g. https://github.com/RMLio/rml-fno-test-cases/blob/master/RMLFNOTC0009-CSV/mapping.ttl; see also https://kg-construct.slack.com/archives/C01QFSW77QF/p1615717859003600).

@bjdmeest bjdmeest self-assigned this Mar 25, 2021
@bjdmeest bjdmeest changed the title describe the current "it is possible to join values across data sources, but without join conditions" Feature: how to join values across data sources, with join conditions? Mar 26, 2021
@andimou

andimou commented Mar 26, 2021

I think this might be affected in the end by how we handle joins in RML

@samiscoding

Came from kg-construct/rml-core#1

it is currently possible to join values across data sources, but without join conditions (see eg https://github.com/RMLio/rml-fno-test-cases/blob/master/RMLFNOTC0009-CSV/mapping.ttl, see also https://kg-construct.slack.com/archives/C01QFSW77QF/p1615717859003600)

I think it goes back again to the basic definition of fnml:functionMap; is it correct to be able to define a rml:logicalSource for a rr:termMap different than the rml:logicalSource of the rr:triplesMap to which it belongs?

@pmaria
Collaborator

pmaria commented Oct 8, 2021

Does a Function Triples Map need a Logical Source?

The current use case for a LogicalSource definition on a FunctionTriplesMap seems to be:

The ability to generate values from a different source and use these values as the result of a Function Term Map.

An example of this is included in one of the proposed FnO test cases: RMLFNOTC009

However, since a FunctionTriplesMap doesn't generate values directly, but generates intermediate function execution triples expressed in FnO, the question of how to handle joins between a TriplesMap and a FunctionTriplesMap with a different LogicalSource arises.

As this is not the same type of join as a join on a RefObjectMap this join would have to be defined. Subsequently, this would require another specific type of join to be implemented by engines.

At the same time, we have a very similar mapping challenge for generating literal values by joining different logical sources: the join-on-literal challenge.

I believe it would be advantageous to come up with a solution that covers both generating literals from different LogicalSources using joins and generating function values from different LogicalSources.
As this solution would not be specific to functions, I think we should look for a solution in the definition of LogicalSources. (pinging @thomas-delva)

@samiscoding

Does a Function Triples Map need a Logical Source?

The current use case for a LogicalSource definition on a FunctionTriplesMap seems to be:

The ability to generate values from a different source and use these values as the result of a Function Term Map.

An example of this is included in one of the proposed FnO test cases: RMLFNOTC009

However, since a FunctionTriplesMap doesn't generate values directly, but generates intermediate function execution triples expressed in FnO, the question of how to handle joins between a TriplesMap and a FunctionTriplesMap with a different LogicalSource arises.

As this is not the same type of join as a join on a RefObjectMap this join would have to be defined. Subsequently, this would require another specific type of join to be implemented by engines.

At the same time, we have a very similar mapping challenge for generating literal values by joining different logical sources: the join-on-literal challenge.

I believe it would be advantageous to come up with a solution that covers both generating literals from different LogicalSources using joins and generating function values from different LogicalSources. As this solution would not be specific to functions, I think we should look for a solution in the definition of LogicalSources. (pinging @thomas-delva)

That's the reason I insist that we should consider the big picture while defining the fundamental concepts i.e. function triples map and function term map! Check the alternative definitions with an example in overview.md at the branch "function-alternative".

@andimou

andimou commented Oct 10, 2021

So your suggestion is to follow a similar approach as with rml:parentTriplesMap and have a functionTriplesMap that can optionally be combined with a join? Then a FunctionTriplesMap SHOULD have exactly one LogicalSource, and we define whether or not it is the same as the Logical Source in the same way we do for the Referencing Object Map?

@samiscoding

@andimou yes, exactly!

@thomas-delva

I believe it would be advantageous to come up with a solution that covers both generating literals from different LogicalSources using joins and generating function values from different LogicalSources.
As this solution would not be specific to functions, I think we should look for a solution in the definition of LogicalSources.

When working on RML fields I had in mind you could do something like :sourceC rml:joinOf :sourceA, :sourceB . and then source C would be a "virtual" logical source that has all the fields defined in sources A and B, and the data in source C would be a join of the data in sources A and B. Then source C could be used in a triples map to generate RDF from two joined sources in a very general way: generating IRIs or literals in a homogeneous way, mixing fields of both A and B to generate one RDF term, generating function values, etc. Looking back, this rml:joinOf idea seems a bit too general and too far from current RML, so perhaps it can be simplified. Just throwing it out there. :)
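
As a rough illustration of the rml:joinOf idea, the Python sketch below builds a "virtual" source C whose records carry the fields of both A and B. Several things here are assumptions: the idea as stated does not say how records are matched, so the sketch takes a hypothetical `on` key pair; the names `join_of`, `source_a`, `source_b` and the sample data are invented for illustration and are not part of RML.

```python
def join_of(source_a, source_b, on):
    """Return the records of a virtual source C: an inner join of A and B.

    `on` is a hypothetical (field_in_a, field_in_b) pair; the rml:joinOf
    idea does not define how records are matched.
    """
    key_a, key_b = on
    joined = []
    for rec_a in source_a:
        for rec_b in source_b:
            if rec_a[key_a] == rec_b[key_b]:
                # Prefix field names so fields of A and B cannot collide.
                merged = {f"a.{k}": v for k, v in rec_a.items()}
                merged.update({f"b.{k}": v for k, v in rec_b.items()})
                joined.append(merged)
    return joined


# Invented sample data.
source_a = [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]
source_b = [{"person": "1", "city": "Ghent"}]

source_c = join_of(source_a, source_b, on=("id", "person"))

# One term mixing fields of both A and B, as a template on source C would.
terms = [f"{rec['a.name']} lives in {rec['b.city']}" for rec in source_c]
```

Once source C exists, a triples map would not need to know that a join happened: templates, references and function inputs all read from the same flat record.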

In general I tend to agree FNML shouldn't need its own way to define joins. For the example in RMLFNOTC009 I wonder why one cannot just call grel:toUpperCase in the subject map of a new triples map and then join as usual with rr:parentTriplesMap. (This is a slight abuse as that subject map would generate literals, but as long as no invalid RDF triples are generated that should be fine imho.) (Disclaimer: I admit I'm not too up to date with the how and why of all FNML aspects.)

@bjdmeest
Member Author

bjdmeest commented Mar 2, 2022

I have the feeling the discussion is revolving around whether functions should or should not specify their own logical source, to solve exactly this issue. I think we first need to settle that before we can solve functions properly. That's why I made the following proposal.

Proposal

I'm purposely not specifying the relation with existing RML and R2RML constructs, nor exactly how to describe a function. Instead, I'm making a proposal where functions do not define their own logical sources and we can still join values across data sources.

TL;DR: functions are a special kind of term map / no logical source for functions / you specify input values for functions using term maps (so you can do nesting) / join conditions specify childterm and parentterm instead of child and parent (so you can put functions there) / referencingObjectMaps have a join result term to specify a new term based on values of the parent logical source instead of relying solely on the subject of the parent triples map

Definitions

  1. A Triples Map is something that generates RDF constructs (Triples, Quads, RDF*, ... 🤷‍♂️) from a Logical Source, using Term Maps. RDF constructs consist of RDF Terms.
  2. A Term Map is something that generates an RDF Term. It takes its values from the Logical Source of its Triples Map
  3. A Function Term Map is something that generates an RDF Term after executing a function. It cannot specify its own Logical Source, i.e., takes its values from the Logical Source of its Triples Map.
  4. A ReferencingMap is something that generates an RDF Term from a different Triples Map, called the Parent Triples Map, i.e., takes values from the Parent Triples Map's Logical Source, called the Parent Logical Source. It can use a JoinCondition and generates the Join Result Term.
  5. A Join Condition specifies how to join these logical sources. It consists of a Child Term (generating a Term, taking values from the original Logical Source) and a Parent Term (generating a Term, taking values from the Parent Logical Source). By default (i.e., when no Join Condition is specified), a full join is performed.
  6. A Join Result Term is a Term Map that generates the term from the values of the Parent Logical Source. By default (i.e., when no Join Result Term is specified), the Join Result Term is the Subject generated by the Parent Triples Map.
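
To make definitions 4 to 6 concrete, here is a minimal Python sketch of how a ReferencingMap could be evaluated. It is not taken from any engine. The function name `referencing_map`, the callables standing in for term maps, and the sample records are all hypothetical, and the sketch reads the default "full join" as pairing every child record with every parent record, which is only one possible interpretation.

```python
def referencing_map(child_records, parent_records, parent_subject,
                    child_term=None, parent_term=None, join_result_term=None):
    """Generate (child_record, term) pairs following definitions 4-6.

    child_term / parent_term: callables playing the role of the Child Term
    and Parent Term of a Join Condition. If either is missing, no Join
    Condition applies and every child is paired with every parent.
    join_result_term: callable on a parent record; defaults to the subject
    generated by the Parent Triples Map (parent_subject).
    """
    result_term = join_result_term or parent_subject
    has_condition = child_term is not None and parent_term is not None
    out = []
    for child in child_records:
        for parent in parent_records:
            if not has_condition or child_term(child) == parent_term(parent):
                out.append((child, result_term(parent)))
    return out


# Invented sample data and subject template.
people = [{"id": "1", "name": "Alice"}, {"id": "2", "name": "Bob"}]
cities = [{"person_id": "1", "city": "Ghent"}]


def city_subject(rec):
    return f"http://example.org/city/{rec['city']}"


# Join Condition with the default Join Result Term (the parent subject).
default_terms = referencing_map(
    people, cities, city_subject,
    child_term=lambda r: r["id"],
    parent_term=lambda r: r["person_id"],
)

# Same join, but the Join Result Term is built from another parent value.
literal_terms = referencing_map(
    people, cities, city_subject,
    child_term=lambda r: r["id"],
    parent_term=lambda r: r["person_id"],
    join_result_term=lambda r: r["city"].upper(),
)
```

Because the child term, parent term and join result term are all plain callables here, any of them could be a function term map without the function needing its own logical source.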

Results

Using these definitions, we can:

  • specify that a Function Term Map never has its own logical source, so we nicely separate concerns.
  • use functions anywhere in a join, e.g., lowercase both (1) child and (2) parent values, specify a (3) special comparison function that does fuzzy matching, and (4) transform other values from the parent logical source for the result
  • if we nest functions, we can do something like join values across sources
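
The four places where a function can appear in a join, numbered (1) to (4) in the list above, can be sketched in Python as follows. This is an illustration under assumptions: `fuzzy_equal`, its 0.8 threshold, the field names and the sample data are invented, and the fuzzy match uses `difflib.SequenceMatcher` from the standard library rather than any function defined by FnO or GREL.

```python
from difflib import SequenceMatcher


def fuzzy_equal(left, right, threshold=0.8):
    """Hypothetical comparison function: similarity ratio at or above a threshold."""
    return SequenceMatcher(None, left, right).ratio() >= threshold


def join_with_functions(children, parents):
    results = []
    for child in children:
        child_value = child["name"].lower()             # (1) function on the child value
        for parent in parents:
            parent_value = parent["label"].lower()      # (2) function on the parent value
            if fuzzy_equal(child_value, parent_value):  # (3) custom comparison function
                # (4) transform another parent value for the result
                results.append(parent["code"].strip().upper())
    return results


# Invented sample data; the first parent label contains a typo on purpose.
children = [{"name": "Ghent University"}]
parents = [
    {"label": "GHENT UNIVERSTY", "code": " ugent "},
    {"label": "KU Leuven", "code": "kul"},
]

matched = join_with_functions(children, parents)
```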

Diagrams

A function description (red = FnO stuff, green = FNML stuff, feel free to ignore those colors for now):

graph LR
    TM([TermMap])
    FM([FunctionTermMap]):::fnml
    TM -->|is-a| FM
    FM -->|execution| Ex([Execution]):::fnml
    FM -->|output| J(IRI):::fnml
    Ex -->|function| ExOM([fno:Function TermMap]):::fno
    Ex -->|parameterMap| ParamPOM([ParameterMap])
    ParamPOM -->|parameter| ParamPM(parameter):::fno
    ParamPOM -->|parameter value| ParamOM([parameter value TermMap])
    classDef fnml fill:#8F9
    classDef fno fill:#F89
    classDef rml fill:#89F
    classDef ls2 fill:#09F

A join description (dark blue === Parent Logical Source):

graph LR
    T3M([TriplesMap])
    T3M-->|predicateObjectMap| POM([PredicateObjectMap])
    POM -->|predicateMap| PM([PredicateMap])
    POM -->|objectMap| ROM([ReferencingObjectMap])
    ROM -->|parentTriplesMap| PT3M([TriplesMap]):::ls2
    ROM -->|joinCondition| JC([JoinCondition])
    ROM -->|joinResultTerm| JTM([TermMap]):::ls2
    JC -->|childTerm| ChTM([TermMap])
    JC -->|parentTerm| PaTM([TermMap]):::ls2
    classDef fnml fill:#8F9
    classDef fno fill:#F89
    classDef rml fill:#89F
    classDef ls2 fill:#09F

A join across sources example (result is "{childsource_value}{parentsource_value}"):

graph LR
    T3M([TriplesMap])
    T3M-->|predicateObjectMap| POM([rr:PredicateObjectMap])
    POM -->|objectMap| FM
    FM([FunctionTermMap])
    FM -->|execution| Ex([Execution])
    FM -->|output| J(grel:stringOut):::fno
    Ex -->|function| ExFn(grel:array_join):::fno
    Ex -->|parameterMap| ParamPOM([ParameterMap])
    ParamPOM -->|parameter| P1(grel:array_value):::fno
    ParamPOM -->|parameter value| O1("{childsource_value}"):::fno
    ParamPOM -->|parameter| P2(grel:array_value):::fno
    ParamPOM -->|parameter value| ROM([ReferencingObjectMap])
    ROM -->|parentTriplesMap| PT3M([TriplesMap]):::ls2
    ROM -->|joinCondition| JC([JoinCondition])
    ROM -->|joinResultTerm| JTM("{parentsource_value}"):::ls2
    JC -->|childTerm| ChTM([TermMap])
    JC -->|parentTerm| PaTM([TermMap]):::ls2
    classDef fnml fill:#8F9
    classDef fno fill:#F89
    classDef rml fill:#89F
    classDef ls2 fill:#09F
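
The last diagram can be read as the following Python sketch. It is an approximation, not the behavior of any engine: `array_join` is a stand-in that concatenates with an empty separator to match the stated result, the `id` and `ref` fields used in the join condition are assumptions (the diagram leaves the child and parent term maps unspecified), and the sample data is invented.

```python
def array_join(*values):
    """Stand-in for grel:array_join as used in the diagram: plain concatenation."""
    return "".join(values)


def joined_parent_values(child, parents, child_term, parent_term, join_result_term):
    """Referencing Object Map used as a parameter value: yields Join Result Terms."""
    return [
        join_result_term(parent)
        for parent in parents
        if child_term(child) == parent_term(parent)
    ]


# Invented sample data; "id" and "ref" are assumed join fields.
child_source = [{"id": "1", "childsource_value": "foo"}]
parent_source = [
    {"ref": "1", "parentsource_value": "bar"},
    {"ref": "2", "parentsource_value": "baz"},
]

results = []
for child in child_source:
    for parent_value in joined_parent_values(
        child, parent_source,
        child_term=lambda r: r["id"],
        parent_term=lambda r: r["ref"],
        join_result_term=lambda r: r["parentsource_value"],
    ):
        # The function receives one value from the child source and one
        # value obtained through the join with the parent source.
        results.append(array_join(child["childsource_value"], parent_value))
```

The function itself never touches a logical source: the join is resolved in the parameter value, and the function only sees the resulting values.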

@bjdmeest
Member Author

bjdmeest commented Mar 2, 2022

@samiscoding and @pmaria could you have a look at my proposal here? I have the feeling we need to fix this first before we can fix FNML :) (@dachafra putting you in the loop since you were gonna check FNML in any case ;) )

@pmaria
Collaborator

pmaria commented Mar 3, 2022

TL;DR: functions are a special kind of term map / no logical source for functions / you specify input values for functions using term maps (so you can do nesting) / join conditions specify childterm and parentterm instead of child and parent (so you can put functions there) / referencingObjectMaps have a join result term to specify a new term based on values of the parent logical source instead of relying solely on the subject of the parent triples map

Generally agree, although I don't see child and parent values as terms, rather as just "values".

Definitions

[...]
4. A ReferencingMap is something that generates an RDF Term from a different Triples Map, called the Parent Triples Map, i.e., takes values from the Parent Triples Map's Logical Source, called the Parent Logical Source. It can use a JoinCondition and generates the Join Result Term.

Do I understand it correctly that this is a new construct that is the generalization of a referencing object map?

5. A Join Condition specifies how to join these logical sources. It consists of a Child Term (generating a Term, taking values from the original Logical Source) and a Parent Term (generating a Term, taking values from the Parent Logical Source). By default (i.e., when no Join Condition is specified), a full join is performed.

A full join is not the current behavior when no join condition is specified for a referencing object map. What would be the use case for a full join?

Results

[...]

  • if we nest functions, we can do something like join values across sources

Could you give an example of what this would look like?

Generally speaking I would steer clear of joining within a function term map, because:

  • joins are complex as it is,
  • now we would conceptually have to join during the "evaluation of an expression", which is different from the usual referencing object map joins.

Pros of this approach:

  • separation of concerns with respect to the logical source: a function term map has no logical source
  • ability to generate values and terms using other sources
  • possible to combine expressions on both (or more) logical sources in a single result, e.g. template based on LS1 and LS2.

Downsides of this approach:

  • implementation, and I would say reasoning about the mapping, becomes complex because of conceptually different places to join.
  • must use functions to generate terms based on multiple sources

In general my preference would still be to have a more general way to join sources, such that:

  • it is possible to generate terms based on multiple sources from templates or any other possible future expression type
  • the join logic can be implemented in a single general way

@bjdmeest
Member Author

bjdmeest commented Mar 4, 2022

  • if we nest functions, we can do something like join values across sources

Could you give an example of what this would look like?

Is my final diagram a clarification? I can cook up some Turtle if you want.

Downsides of this approach:

  • implementation, and I would say reasoning about the mapping, becomes complex because of conceptually different places to join.
  • must use functions to generate terms based on multiple sources

I agree with your preference that joins can, and probably should, be solved more generally. My point was rather that this structure allows complex joins across functions, sources, and whatever else. If we can solve the joins somewhere else, we can always restrict the spec so that function terms cannot be referencing object maps. But I prefer a generic structure that is restricted later over a specific structure that is hard to expand later on.

In general my preference would still be to have a more general way to join sources, such that:

  • it is possible to generate terms based on multiple sources from templates or any other possible future expression type
  • the join logic can be implemented in a single general way

👍

@pmaria
Collaborator

pmaria commented Mar 4, 2022

  • if we nest functions, we can do something like join values across sources

Could you give an example of what this would look like?

Is my final diagram a clarification? I can cook up some Turtle if you want.

yeah I think so. So basically you do something like

someFunction(value_TM1, value_TM2_via_join, ... , value_TMX_via_join)

@bjdmeest
Member Author

bjdmeest commented Mar 4, 2022

yeah I think so. So basically you do something like

someFunction(value_TM1, value_TM2_via_join, ... , value_TMX_via_join)

exactly, way simpler represented than what I was trying 😅

@samiscoding

It is an interesting perspective to look at the problem, however,

  1. Trying an example, I see that it leads to longer and more complex mapping rules compared to previous proposals. I'm a big fan of precision at the expense of complexity but if we can find a simpler solution that covers the definition of the same concepts we should consider it!
  2. If I understand it correctly in this case one doesn't need to use "Fields" as discussed before instead of logicalSources, right?
  3. Based on this definition, there wouldn't be any concept of FunctionTriplesMap, right?
  4. I'm a bit confused by the concepts and syntaxes that you use from RML and R2RML. If we still want to reuse them then I see no reason to throw away previous proposals as we did during the Ghent meeting! Correct me if I'm wrong, wasn't the objection against our previous proposals in the meeting about not proposing it from scratch and reusing syntaxes? 😅

@bjdmeest
Member Author

It is an interesting perspective to look at the problem, however,

1. Trying an example, I see that it leads to longer and more complex mapping rules compared to previous proposals. I'm a big fan of precision at the expense of complexity but if we can find a simpler solution that covers the definition of the same concepts we should consider it!

Fully agree that it becomes more (too?) complex. The argument I mostly wanted to make was "We can keep source definition out of the function construct to allow joining values across data sources". It's very complex without additional constructs, but (i) it is currently possible and (ii) we can think of a better construct separate from functions :)

2. If I understand it correctly in this case one doesn't need to use "Fields" as discussed before instead of logicalSources, right?

True

3. Based on this definition, there wouldn't be any concept of FunctionTriplesMap, right?

We can steer away from linking function definitions with the triplesmap definition, but that's not completely cleared up yet, see kg-construct/rml-core#45 (comment)

4. I'm a bit confused by the concepts and syntaxes that you use from RML and R2RML. If we still want to reuse them then I see no reason to throw away previous proposals as we did during the Ghent meeting! Correct me if I'm wrong, wasn't the objection against our previous proposals in the meeting about not proposing it from scratch and reusing syntaxes? 😅

Huh, I had it completely the other way around, that it's confusing to reuse syntax and it would be better to make a clear distinction. Maybe we should clear that up with the community first.

@bjdmeest
Member Author

I removed the FnO label, as we decided that joins and functions are two complementary things that shouldn't be conflated.

@dachafra
Member

As it's a join issue, I'm going to move it to its corresponding repo

@dachafra dachafra transferred this issue from kg-construct/rml-core Aug 29, 2023
@elsdvlee elsdvlee transferred this issue from kg-construct/rml-jc Jan 26, 2024
@elsdvlee
Collaborator

Agreed with Ben to make a test case and verify if this issue can be solved using logical views.
