statements table cannot have a primary key defined over it due to nulls #16

cmungall opened this issue May 12, 2021 · 13 comments

@cmungall
Contributor

For some applications it is desirable to declare primary keys for tables. For example, sqlalchemy requires all mapped tables to have a PK if the ORM layer is to be used.

If we were to declare a primary key for statements we would need to either

  1. Add an additional statement id column and declare it PK
  2. Use a unique combination such as s,p,o,v as a composite PK

The first has disadvantages, e.g. the overhead of adding SEQUENCEs and divergence from the RDF model.

The second seems reasonable. However, many RDBMSs do not allow NULLs in a PK. SQLite is more lax here, but even though SQLite allows it, certain tools such as the sqlalchemy ORM will not allow NULLs in PK fields. By design, with statements, EITHER object OR value MUST be NULL.

e.g. given:

from sqlalchemy import Column, Text
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+; earlier: sqlalchemy.ext.declarative

Base = declarative_base()

class Statements(Base):
    """
    Represents an RDF triple
    """
    __tablename__ = 'statements'
    stanza = Column(Text, primary_key=True)
    subject = Column(Text, primary_key=True)
    predicate = Column(Text, primary_key=True)
    object = Column(Text, primary_key=True)
    value = Column(Text, primary_key=True)
    datatype = Column(Text, primary_key=True)
    language = Column(Text, primary_key=True)

if we try to insert an rdfs:label statement (which has a NULL object), we get:

sqlalchemy.orm.exc.FlushError: Can't update table statements using NULL for primary key value on column statements.object
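
For reference, a minimal way to trigger this with the model above (a sketch; the in-memory engine URL and the values are made up):

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("sqlite://")  # in-memory SQLite
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Statements(stanza="ex:a", subject="ex:a",
                           predicate="rdfs:label", value="a label"))
    session.commit()  # raises the FlushError shown above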

This is OK if one doesn't want to use the sqlalchemy ORM layer (the lower-level Core layer should work just fine), but it might be good to be aware of this limitation for future evolution.

Aside: I used to be more of a computer science purist and decried ORMs as evil, impedance mismatch yadda yadda. But the sqlalchemy ORM is quite nice and I see the advantages of ORMs in solving more pedestrian problems...

I don't think we should change anything as I think the existing structure is nice and simple and good for certain kinds of queries.

I think the solution here is: if you want to do CRUD with an ORM, then define a separate schema in which object and value are combined into one field (e.g. a basic "triples" table), and define a trivial bridge layer between rdftab.statements and your.triples (e.g. populate one from the other via INSERT INTO ... SELECT ..., as sketched below). This is a separate concern from rdftab.
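
A rough sketch of such a bridge in plain sqlite3 (the triples schema and the db path are illustrative; datatype and language are dropped for brevity):

import sqlite3

con = sqlite3.connect("db/pato.db")  # illustrative path (cf. db/pato.db below)
con.executescript("""
CREATE TABLE IF NOT EXISTS triples (
    subject   TEXT NOT NULL,
    predicate TEXT NOT NULL,
    object    TEXT NOT NULL,  -- IRI or literal value, never NULL
    PRIMARY KEY (subject, predicate, object)
);
-- OR IGNORE skips rows that would violate the composite PK
INSERT OR IGNORE INTO triples (subject, predicate, object)
SELECT subject, predicate, COALESCE(object, value)
FROM statements
WHERE object IS NOT NULL OR value IS NOT NULL;
""")
con.commit()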

This issue can be immediately closed; I just wanted to flag this in case it's useful for planning future evolution.

@cmungall
Contributor Author

Note that one consequence of the lack of PKs is that nothing prevents duplicate triples, e.g.

(venv) ~/repos/semantic-sql(main) $ sqlite3 db/pato.db "SELECT count(*) FROM statements"
73122
(venv) ~/repos/semantic-sql(main) $ ./bin/rdftab db/pato.db < owl/pato.owl 
(venv) ~/repos/semantic-sql(main) $ sqlite3 db/pato.db "SELECT count(*) FROM statements"
109683
(venv) ~/repos/semantic-sql(main) $ ./bin/rdftab db/pato.db < owl/pato.owl 
(venv) ~/repos/semantic-sql(main) $ sqlite3 db/pato.db "SELECT count(*) FROM statements"
146244

I think this lack of enforcement is a good thing. Different ontologies may contain the same triple (via import modules etc.; although having a named graph column would alleviate this, see #13). It would be tedious to filter these in advance, and it would be slow for the code to have to do dupe checks.

The solution here is to put the burden on the client: have good loading hygiene, avoid loading the same thing multiple times, and possibly run SQL DELETEs to remove dupes (this may not be super-trivial due to the fact that NULL != NULL in SQL... it's like a blank node!). Just wanted to flag this too.
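
For the record, one way around NULL != NULL when deduping in SQLite is GROUP BY, which (unlike =) treats NULLs as equal; a sketch against the current columns:

import sqlite3

con = sqlite3.connect("db/pato.db")
# Keep the first copy of each distinct row. GROUP BY treats NULLs as
# equal, so rows that differ only in which fields are NULL still
# group together correctly.
con.execute("""
DELETE FROM statements
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM statements
    GROUP BY stanza, subject, predicate, object, value, datatype, language
)
""")
con.commit()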

@jamesaoverton
Member

I still think ORMs are a bad idea but I do care about NULLs and primary keys. I don't mind reworking things if we can find a better design.

I've tried lots of ways to represent RDF over the years. With RDFTab I decided to try lots of columns even though there are lots of NULLs -- call the current design "Schema 0".

The core of the problem is that RDF objects come in four forms: ID, plain literal, typed literal, language literal. We need at least three columns: subject, predicate, object. We can add a graph column. The stanza column is an optimization to avoid chasing blank nodes across the whole table. If we use thick triples then stanza is not needed.

I've given a fair bit of thought to a column with a hash of the triple, carefully designed for checking triple identity, but that seems complicated on a number of levels.

We could use just these columns and avoid NULLs by always using JSON to represent the object: {"object": "ex:foo"}, {"value": "Foo"}. I don't like this because it's a pain to do the simplest thing and just query for literals such as labels. Call this "Schema 1".
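
To illustrate that pain (a sketch, assuming a SQLite build with the JSON1 functions; the rows are made up):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT);
INSERT INTO statements VALUES
  ('ex:a', 'rdfs:label', '{"value": "Foo"}'),
  ('ex:a', 'rdf:type',   '{"object": "owl:Class"}');
""")
# Even "give me the labels" needs JSON functions under Schema 1:
for row in con.execute("""
    SELECT subject, json_extract(object, '$.value')
    FROM statements
    WHERE predicate = 'rdfs:label'
"""):
    print(row)  # ('ex:a', 'Foo')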

The next simplest thing is to add a datatype column and overload it -- call this "Schema 2":

  • graph: ID or NULL?
  • stanza: ID
  • subject: ID
  • predicate: ID
  • object: ID or literal value
  • datatype:
    • ID (typed literal): xsd:integer
    • @lang (language literal): @en-CA
    • keyword for a plain literal (legacy?): :plain
    • keyword for an ID: :id
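
For concreteness, a Schema 2 sketch in SQLite with one example row per object form (the keywords and rows are illustrative, and graph is omitted):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE statements (
    stanza    TEXT NOT NULL,
    subject   TEXT NOT NULL,
    predicate TEXT NOT NULL,
    object    TEXT NOT NULL,  -- ID or literal value
    datatype  TEXT NOT NULL   -- ID, @lang tag, :plain, or :id
);
""")
con.executemany(
    "INSERT INTO statements VALUES (?, ?, ?, ?, ?)",
    [
        ("ex:a", "ex:a", "rdf:type", "owl:Class", ":id"),  # ID
        ("ex:a", "ex:a", "rdfs:label", "A", ":plain"),     # plain literal
        ("ex:a", "ex:a", "ex:size", "5", "xsd:integer"),   # typed literal
        ("ex:a", "ex:a", "rdfs:comment", "Eh", "@en-CA"),  # language literal
    ],
)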

For thick triples we could add another datatype for the case when the object is a nested JSON object.

Do you have any other alternatives?

For writing queries and working with results, I think Schema 1 is strictly worse than 0 and 2, so I want to discard Schema 1. Schema 0 and Schema 2 have different tradeoffs and annoyances, but it seems to me that they're roughly the same.

One key difference could be indexing. For Schema 0 we build different indexes for IRI and literals, but for Schema 2 they'd be the same index. I don't know if that's better or worse, I haven't tested it.
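
For example, under Schema 0 the IRI/literal split might be handled with partial indexes, while Schema 2 would need only one index (a sketch; index names and columns are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT, value TEXT);
-- Schema 0: separate partial indexes for IRI objects and literal values
CREATE INDEX statements_po ON statements(predicate, object) WHERE object IS NOT NULL;
CREATE INDEX statements_pv ON statements(predicate, value)  WHERE value  IS NOT NULL;
-- Schema 2 would instead use a single index on (predicate, object)
""")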

If you think Schema 2 is better on the whole, then I'd be OK with moving in that direction.

@cmungall
Contributor Author

cmungall commented May 12, 2021 via email

@jamesaoverton
Member

What do you think of my proposal @beckyjackson, @lmcmicu, @ckindermann?

@lmcmicu
Collaborator

lmcmicu commented May 18, 2021

I'm not sure I see the overhead issue with having an extra ID column as the primary key. It would make inserting into the database more efficient (and thus probably faster to generate a given DB), and it would save on the space required to actually instantiate the primary key (as compared with the other proposals).

But that aside, I think Schema 2 sounds reasonable. I'm not sure, though, that I understand the need for two different datatypes to cover these two cases:

  • ID (typed literal): xsd:integer
  • keyword for an ID: :id

In both cases, wouldn't these just be URIRefs?

@lmcmicu
Collaborator

lmcmicu commented May 18, 2021

Further to the first part of my comment (regarding having an auto-incremented ID column as a primary key): after thinking on it a bit more, I don't think it would actually save space, as we'd want to have a secondary index on the fields that need to be queried over. However, secondary indexes can have NULLs unproblematically. And this doesn't nullify my first observation that having an auto-incremented ID primary key would make insertions a lot faster (as opposed to the other proposals for a primary key); that seems to me an important consideration, regardless of what is decided regarding the index used for queries.
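
To make that concrete, the auto-increment variant in SQLite might look like this (a sketch reusing the Schema 0 columns; names are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- INTEGER PRIMARY KEY is SQLite's rowid alias: auto-assigned and
-- cheap to insert, since no uniqueness check runs on the data columns.
CREATE TABLE statements (
    id        INTEGER PRIMARY KEY,
    stanza    TEXT,
    subject   TEXT,
    predicate TEXT,
    object    TEXT,
    value     TEXT,
    datatype  TEXT,
    language  TEXT
);
-- A secondary index tolerates NULLs and duplicates without complaint.
CREATE INDEX statements_spo ON statements(subject, predicate, object);
""")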

@cmungall
Contributor Author

Some initial thoughts on the auto-increment ID:

  • we would be deviating from the RDF spec, and we would have to think through the semantics of operations like merge
  • how would this play with reification? If statements are reified in either RDF or OWL, a statement/axiom node is generated (this can be blank or an IRI in RDF, but in OWL it must be blank). There can be zero to many such nodes per triple.

One possibility would be to make statements map to RDF/OWL Statements, not triples, with an auto-reification step for any unreified triples, giving them a fresh blank node. This has some advantages, particularly for efficient querying of reified triples, and we have to deal with blank nodes for merge semantics anyway. But there are disadvantages here too: just take a look at the rdfstar mailing list to get an idea of the complexities that emerge when you try to introduce statements as first-class entities.

@lmcmicu
Collaborator

lmcmicu commented May 18, 2021

Some initial thoughts on the auto-increment ID:

* we would be deviating from the RDF spec, and we would have to think through the semantics of operations like merge

I wasn't looking at it from this perspective, since I am still learning as far as RDF goes. I was just speaking to the "overhead" point that was made initially; the added overhead, in and of itself, doesn't seem problematic to me. I also wanted to mention that if we did this then we could use secondary indexes, which (if I'm not mistaken) can have NULLs and duplicates. I can't really speak to the other points about the RDF spec and about reification.

That said, I think James's Schema 2 sounds reasonable.

@cmungall
Contributor Author

cmungall commented May 19, 2021 via email

@ckindermann
Collaborator

ckindermann commented May 19, 2021

Schema 2 sounds reasonable to me as well.

Note that one consequence of the lack of PKs is that nothing prevents duplicate triples [...].

Even with PKs there is no straightforward way to avoid triple duplication as far as I can see. Consider the thick triples D subclassOf ObjectIntersectionOf(A,B,C) and D subclassOf ObjectIntersectionOf(B,C,A). These triples are the same (in OWL). Yet, they would have different values for the PK. So, if a client manipulates such triples without maintaining a consistent order for elements in RDF lists (or sets in thick triples), then it is very easy to end up with an exponential number of duplicates.
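
A client-side canonicalization pass could at least catch the ordering case, e.g. sorting set-valued arguments before serializing thick triples (a sketch; the JSON shape is hypothetical, and a real implementation would sort only genuinely set-valued positions, not ordered RDF lists):

import json

def canonicalize(obj):
    """Recursively sort keys and set-valued arrays so that equivalent
    expressions serialize identically."""
    if isinstance(obj, dict):
        return {k: canonicalize(v) for k, v in sorted(obj.items())}
    if isinstance(obj, list):
        return sorted((canonicalize(v) for v in obj), key=json.dumps)
    return obj

a = {"intersectionOf": ["A", "B", "C"]}
b = {"intersectionOf": ["B", "C", "A"]}
assert json.dumps(canonicalize(a)) == json.dumps(canonicalize(b))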

So,

The solution here is to put the burden on the client: have good loading hygiene, avoid loading the same thing multiple times [...]

seems the way to go!

@jamesaoverton
Member

Thanks for your input, everyone. Nobody is objecting to Schema 2, so I'll make a plan to move towards that.

@jamesaoverton
Member

jamesaoverton commented May 25, 2021

If we're going to switch to Schema 2, what do we think about having very short one-letter column names?

letter  current
g       graph
s       subject
p       predicate
o       object
d       datatype
z       stanza

Note that thick triples don't need a stanza, and Chris isn't using it for semantic-sql so far.

@lmcmicu
Collaborator

lmcmicu commented May 26, 2021

No objections from me.
