-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
statements table cannot have a primary key defined over it due to nulls #16
Comments
Note that one consequence of lack of PKs is lack of enforcement of triple duplication, e.g.
I think this lack of enforcement is a good thing. Different ontologies may have the same triple (via import modules etc; although having a named graph column would alleviate this #13). It would be tedious to filter these in advance. And it would be slow for the code to have to do dupe checks. The solution here is to put the burden on the client, to have good loading hygeine and to avoid loading the same thing multiple times. and to possibly do SQL UPDATES to remove dupes (this may not be super-trivial due to the fact that NULL != NULL in SQL... it's like a blank node!). Just wanted to flag this too. |
I still think ORMs are a bad idea but I do care about NULLs and primary keys. I don't mind reworking things if we can find a better design. I've tried lots of ways to represent RDF over the years. With RDFTab I decided to try lots of columns even though there are lots of NULLs -- call the current design "Schema 0". The core of the problem is that RDF objects come in four forms: ID, plain literal, typed literal, language literal. We need at least three columns: subject, predicate, object. We can add a graph column. The stanza column is an optimization to avoid chasing blank nodes across the whole table. If we use thick triples then stanza is not needed. I've given a fair bit of thought to a column with a hash of the triple, carefully designed for checking triple identity, but that seems complicated on a number of levels. We could use just these columns and avoid NULLs by always using JSON to represent the object: The next simplest thing is to add a datatype column and overload it -- call this "Schema 2":
For thick triples we could add another datatype for the case when the object is a nested JSON object. Do you have any other alternatives? For writing queries and working with results, I think Schema 1 is strictly worse than 0 and 2, so I want to discard Schema 1. Schema 0 and Schema 2 have different tradeoffs and annoyances, but it seems to me that they're roughly the same. One key difference could be indexing. For Schema 0 we build different indexes for IRI and literals, but for Schema 2 they'd be the same index. I don't know if that's better or worse, I haven't tested it. If you think Schema 2 is better on the whole, then I'd be OK with moving in that direction. |
Great analysis. I think I am 100% in alignment. I didn't think there was a
good solution when I wrote my ticket, but I am actually liking schema 2.
Mild preference for a non-CURIE for the two keyword values for datatype
(e.g. just `id` or `plain`) but that is very minor
…On Wed, May 12, 2021 at 5:56 AM James A. Overton ***@***.***> wrote:
I still think ORMs are a bad idea but I do care about NULLs and primary
keys. I don't mind reworking things if we can find a better design.
I've tried lots of ways to represent RDF over the years. With RDFTab I
decided to try lots of columns even though there are lots of NULLs -- call
the current design "Schema 0".
The core of the problem is that RDF objects come in four forms: ID, plain
literal, typed literal, language literal. We need at least three columns:
subject, predicate, object. We can add a graph column. The stanza column is
an optimization to avoid chasing blank nodes across the whole table. If we
use thick triples then stanza is not needed.
I've given a fair bit of thought to a column with a hash of the triple,
carefully designed for checking triple identity, but that seems complicated
on a number of levels.
We could use just these columns and avoid NULLs by always using JSON to
represent the object: {"object": "ex:foo"}, {"value": "Foo"}. I don't
like this because it's a pain to do the simplest thing and just query for
literals such as labels. Call this "Schema 1".
The next simplest thing is to add a datatype column and overload it --
call this "Schema 2":
- graph: ID or NULL?
- stanza: ID
- subject: ID
- predicate: ID
- object: ID or literal value
- datatype:
- ID (typed literal): xsd:integer
- @lang <https://github.com/lang> (language literal): @en-CA
- keyword for a plain literal (legacy?): :plain
- keyword for an ID: :id
For thick triples we could add another datatype for the case when the
object is a nested JSON object.
Do you have any other alternatives?
For writing queries and working with results, I think Schema 1 is strictly
worse than 0 and 2, so I want to discard Schema 1. Schema 0 and Schema 2
have different tradeoffs and annoyances, but it seems to me that they're
roughly the same.
One key difference could be indexing. For Schema 0 we build different
indexes for IRI and literals, but for Schema 2 they'd be the same index. I
don't know if that's better or worse, I haven't tested it.
If you think Schema 2 is better on the whole, then I'd be OK with moving
in that direction.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#16 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAMMOLCNWT5ITXCIOLNH6DTNJ3GNANCNFSM44XUZZQA>
.
|
What do you think of my proposal @beckyjackson, @lmcmicu, @ckindermann? |
I'm not sure I see the overhead issue with having an extra ID column as the primary key, and it would make inserting to the database more efficient (thus probably make it faster to generate a given DB), plus save on the space required to actually instantiate the primary key (as compared with the other proposals). But that aside, I think Schema 2 sounds reasonable. I'm not sure, though, that I understand the need for two different datatypes to cover these two cases:
In both cases these wouldn't these just be URIRefs? |
Further to the first part of my comment (regarding having an auto-incremented ID column as a primary key). After thinking on it a bit more I don't think it would actually save space, as we'd want to have a secondary index on the fields that need to be queried over. However secondary indexes can have nulls unproblematically. And this doesn't nullify my first observation that having an auto-incremented ID primary key would make insertions a lot faster (as opposed to the other proposals for a primary key), and it seems to me that this is an important consideration, regardless of what is decided regarding the index used for queries. |
Some initial thoughts on the auto-increment ID:
one possibility would be to make |
I wasn't looking at it from this perspective, since I am still learning as far as RDF goes. I was just speaking to the "overhead" point that was made initially. The added overhead, in and of itself, doesn't seem problematic to me. I just wanted to mention that, and that if we did this then we could use secondary indexes, which if I'm not mistaken can have nulls (and duplicates). I can't really speak to the other points about the RDF spec and about reification. That said, I think James's Schema 2 sounds reasonable. |
Thanks Michael, great points!
… |
Schema 2 sounds reasonable to me as well.
Even with PKs there is no straightforward way to avoid triple duplication as far as I can see. Consider the thick triples So,
seems the way to go! |
Thanks for your input, everyone. Nobody is objecting to Schema 2, so I'll make a plan to move towards that. |
If we're going to switch to Schema 2, what do we think about having very short one-letter column names?
Note that thick triples doesn't need a stanza, and Chris isn't using it for semantic-sql so far. |
No objections from me. |
For some applications it is desirable to to declare primary keys for tables. For example, sqlalchemy requires all mapped tables to have a PK if the ORM layer is to be used.
If we were to declare a primary key for statements we would need to either
The first has disadvantages, e.g. overhead of adding SEQUENCEs, divergence from RDF model
The second seems reasonable. However, many RDBMs do not allow nulls in a PK. sqlite is more lax here though. However, even though sqlite allows it, certain tools such as sqlalchemy ORM will not allow nulls in PK fields. By design, with
statements
, EITHER object OR value MUST be NULL.e.g. given:
if we try and do an insert with an rdfs:label (which has object NULL) we get:
sqlalchemy.orm.exc.FlushError: Can't update table statements using NULL for primary key value on column statements.object
This is OK if one doesn't want to use the sqlalchemy ORM layer (the lower-level base layer should work just fine), but it might be good to be aware of this limitation for future evolution
Aside: I used to be more of a computer science purist and decried ORMs as evil, impedance mismatch yadda yadda. But the sqlalchemy ORM is quite nice and I see the advantages of ORMs in solving more pedestrian problems...
I don't think we should change anything as I think the existing structure is nice and simple and good for certain kinds of queries.
I think the solution here is: if you want to do CRUD with an ORM, then define a separate schema in which object and value are combined into one field (e..g a basic "triples" table), and define a trivial bridge layer between rdftab.statements and your.triple (e.g populate one from the other via INSERT INTO .. AS SELECT ...). This is a separate concern from rdftab.
This issue can be immediately closed, just wanted to flag this if useful for planning future evolution.
The text was updated successfully, but these errors were encountered: