Graph DB: OAK: Satisfy unmet requirements #516
Can you post (or slack) a link to the ontology you are having issues loading?
@cmungall It's not an ontology per se, but all of N3C's OMOP, represented as a single ontology: https://github.com/HOT-Ecosystem/n3c-owl-ingest/ The reason we need a single DB is that it needs to be able to walk across mappings between the various OMOP vocabs in a single call.
Can you slack or post a link to the .owl file?
I think I have an older version you sent me; it had a syntax error where no omoprel prefix was declared. Not sure if this was fixed.
@cmungall Sure; there's a link in the release, but it was broken anyway. Here: download link (5GB). I think we fixed the omoprel prefix issue.
@cmungall, you can see the raw (csv) data that TermHub is based on here, and here are the file sizes in kilobytes:
We retrieve updates to the frequently updated data through APIs rather than CSVs, so a data management solution (even a read-only one) would preferably allow API-based inserts/updates rather than only a data pipeline from CSV to database. We will also be doing more active CRUD originating from our web app, but have delayed building that, in part, until we decide on a long-term data management solution with better graph data support than Postgres provides. Also, our data-management solution will need to provide high-performance graph queries to support a data-dense, highly interactive visualization interface to combined vocabulary and value set data.
@cmungall Aw, sorry, I really thought this was shared openly. I just looked, though, and it shows that you have access under your @lbl.gov account; I was unable to change sharing settings in this particular folder. Maybe Shahim just gave you access? If for some reason it still won't let you download it, let me know and I'll upload another copy somewhere else for you.
Thanks for the link - I assume n3c.owl was the intended ontology. This one didn't have any syntax errors - thanks! I made a sqlite db of this - shall I deposit it somewhere for you (13G)? Is it OK to deposit in a public s3 bucket? I did this on my mac laptop; I had to increase memory to 48G for relation-graph to finish, but even this step is optional.

My next step is to make a Jupyter notebook demonstrating the functionality you feel is missing. (For example, I am not sure where you got the idea that OAK can only do a fixed set of queries; one of the main use cases is being able to quickly construct powerful queries that combine lexical search with graph traversal.) Where shall we start?
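For a flavor of what I mean, here is a minimal sketch of such a combined query, assuming the sqlite build lands at `n3c.db` (the path and search string are illustrative):

```python
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

adapter = get_adapter("sqlite:n3c.db")  # hypothetical path to the converted db

# Lexical step: search concept labels for a string.
for curie in adapter.basic_search("bisoprolol"):
    print(curie, adapter.label(curie))
    # Graph step: walk up the is-a hierarchy from each hit.
    for ancestor in adapter.ancestors(curie, predicates=[IS_A]):
        print("    ", ancestor, adapter.label(ancestor))
```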
The structure of the ontology is unusually flat, with the majority of terms being singletons. For example:

```xml
<!-- https://athena.ohdsi.org/search-terms/terms/1123893 -->
<owl:Class rdf:about="https://athena.ohdsi.org/search-terms/terms/1123893">
    <rdfs:label>bisoprolol and acetylsalicylic acid; systemic</rdfs:label>
    <terms:concept_class_id>ATC 5th</terms:concept_class_id>
    <terms:concept_code>C07FX04</terms:concept_code>
    <terms:domain_id>Drug</terms:domain_id>
    <terms:standard_concept>C</terms:standard_concept>
    <terms:valid_end_date>2099-12-31</terms:valid_end_date>
    <terms:valid_start_date>1970-01-01</terms:valid_start_date>
    <terms:vocabulary_id>ATC</terms:vocabulary_id>
</owl:Class>
```

Before we progress much further, does this reflect what you would expect? Note there is no logical axiom connecting this concept to any other, which seems very odd to me. If we look at the source record for this concept, we see relationships that I would expect to be mapped to subclass axioms. Note that OAK is only as good as the input that is provided to it; OAK is probably not going to be very useful on a largely flat ontology of singletons. I think it may be better to add OAK loaders directly from the raw CSVs (do these conform to the OMOP standard tables?) - this should be easy, but I don't know when we would get to it.
Successful conversion

Yes, converted? That's super appreciated, and also I'm surprised: on @ShahimEssaid's Windows PC, he ran out of memory even with ~100GB allocated, which makes me wonder what made the difference. I made a folder where you should have edit permissions to upload it: here.

Analysis of n3c.owl
Yes, I think I mentioned before that you can skip the robot step that normalizes to RDF/XML, because it's already normalized.
Thx! Uploaded
OAK is based on 25 years of experience working with a diverse range of use cases for ontologies, from basic science to clinical research. I would be surprised if there was something that was not either immediately possible or easy to implement using OAK as a building block.
Let's see if we can do this with the complete ontology!
It's not intended to become a database - in fact, it wraps databases. However, graphs are a central data structure in bioinformatics and central to everything we do in Monarch, so it's not surprising that OAK should be able to do a lot of this. As I say, I don't think the requirements between your Monarch-world hats and your TIMS/TH hats are so different, which is why I am not convinced we need different technology stacks.

Having said that, I often see a naive tendency for computer scientists to approach ontologies through a lens of graph theory and start producing network centrality statistics and whatnot, not realizing that an ontology graph is fundamentally structured differently than, say, a PPI network. But if folks want to do that kind of thing, there is a bridge between OAK and nx: you can export subgraphs to nx and do all your graph-theoretic work there. We also have bridges to cx, so you can easily upload your graphs to NDEx, view in Cytoscape... we have your graph needs covered! You might find this useful reading (for thinking about OAK in TH but also for Mondo): https://incatools.github.io/ontology-access-kit/guide/relationships-and-graphs.html
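Even without the dedicated bridge, you can hand-export a subgraph to nx via `relationships()`; a minimal sketch, again assuming a sqlite build at a hypothetical `n3c.db`:

```python
import networkx as nx
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

adapter = get_adapter("sqlite:n3c.db")  # hypothetical path

# relationships() yields (subject, predicate, object) triples.
g = nx.DiGraph()
for s, p, o in adapter.relationships(predicates=[IS_A]):
    g.add_edge(s, o, predicate=p)

# From here, any graph-theoretic work is plain networkx:
print(nx.number_weakly_connected_components(g))
```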
I want to emphasize that semantic-sql is just a layer over relational databases (with most support currently for sqlite, but likely pg soon), and RDBMSs are very good at being performant within whatever memory you can give them. The issues you have been having have been loader issues. While I appreciate these are frustrating (I know you also encountered frustrating issues when loading the edit version of Mondo), the good news is (a) it's on the roadmap to make the loaders more straightforward, bypassing the dependence on robot/owlapi (this has always been the memory issue you faced; it's not in semsql itself), and (b) you can write your own loaders by literally just preparing a few TSVs to bulk upload into sqlite/pg.
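For example, a minimal sketch of option (b): the `statements` column layout here follows rdftab, which semantic-sql builds on, so verify it against your semsql version; `edges.tsv` is a hypothetical 3-column subject/predicate/object file.

```python
import csv
import sqlite3

conn = sqlite3.connect("n3c.db")  # hypothetical target db
conn.execute(
    """CREATE TABLE IF NOT EXISTS statements
       (stanza TEXT, subject TEXT, predicate TEXT,
        object TEXT, value TEXT, datatype TEXT, language TEXT)"""
)

# Stream the TSV straight into the statements table, no robot/owlapi needed.
with open("edges.tsv") as f:
    rows = (
        (s, s, p, o, None, None, None)
        for s, p, o in csv.reader(f, delimiter="\t")
    )
    conn.executemany(
        "INSERT INTO statements VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )

conn.commit()
conn.close()
```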
I'll address some of the other use cases gradually. But I think the most fundamental conceptual mismatch is that OAK delegates anything to do with value set modeling to LinkML, which handles both intensional and extensional value sets. So right now these don't get stored in the sqlite along with the ontology (generally these are pretty small), but this shouldn't be an obstacle; once I understand the use case more, I can provide more examples. As far as transactions, as I mentioned, the OAK sql adapter can be layered on Pg, which has ACID transactions. But I think you mean basic CRUD operations? This is of course handled, and in fact we have a complete change model (KGCL), which I think you should be familiar with from Mondo...
@Sigfried Sorry for editing your post/title. It seemed like the discussion quickly became OAK-centric, so I created a separate issue just for graph DB requirements based on what you had here and elsewhere, and I updated your post to be more OAK-specific. I also looked at all the currently open graph-db-related issues, added missing ones to the milestone, and created a grouping issue. The advantage of the grouping issue in this situation is that it has subgroups.
Do you have any examples of value sets (both intensional and extensional) that I can add to the notebook?
Hey @cmungall, other than grabbing this example from the FHIR docs, I don't have anything handy right now. We could maybe pull something off of VSAC at some point for a better example: LOINC Serum Cholesterol
I'm not 100% sure that the extensional one is really an expansion of the intensional one; I'd expect to see the intensional parent in the expansion JSON somewhere, but I don't see it. In any case, if you look at the original post of this issue, you'll see that I organized it in a systematic way: I wanted to get at each requirement/issue, what your proposals were, and any problems identified with each proposal. I discussed this with Lisa and Tim briefly at the TIMS meeting today, but Shahim is out for 2 weeks, and we want to discuss with him before we get back to you more on this. Jupyter examples would be great, but I would feel bad if you spent much time on them and we ultimately determined that the burden of pros/cons and requirements favors another option over OAK.
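For reference, the FHIR shape behind my uncertainty: `compose` is the intensional definition and `expansion` is the extensional member list. A hedged sketch; the system/filter/code values are illustrative, not copied from the actual example:

```python
value_set = {
    "resourceType": "ValueSet",
    # Intensional: rules that decide membership.
    "compose": {
        "include": [{
            "system": "http://loinc.org",
            "filter": [{"property": "COMPONENT", "op": "=", "value": "Cholesterol"}],
        }]
    },
    # Extensional: the enumerated members an expansion produces.
    "expansion": {
        "contains": [{
            "system": "http://loinc.org",
            "code": "2093-3",
            "display": "Cholesterol [Mass/volume] in Serum or Plasma",
        }]
    },
}
```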
Are there any value sets that use the OMOP IDs directly? Ideally I would come up with a coherent example that demonstrates use of the ontology and value sets together. Or, if you like, we can just use the mappings to go from the OMOP IDs to LOINC IDs? I would have thought there was a place or service that would allow downloading value sets in bulk...
Transactionality / CRUD

It seems that this is primarily for value sets, rather than changes in the ontology itself. In that case, you can disregard my comment on KGCL; as you know from your work on Mondo, KGCL is for describing changes to an ontology. Note that with OAK, value sets are managed externally; they aren't stored in the same database as the ontology. The default way to handle CRUD is trivial updates on the objects and serialization to YAML. However, if you wanted value sets managed in the same SQL database, this is mere plumbing. [more later...]
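To illustrate the "trivial updates plus YAML serialization" pattern, a minimal sketch; `ValueSet` here is a stand-in dataclass, not the actual LinkML-generated class, and the CURIE is illustrative:

```python
from dataclasses import asdict, dataclass, field

import yaml  # pyyaml

@dataclass
class ValueSet:
    name: str
    members: list = field(default_factory=list)  # extensional members (CURIEs)

vs = ValueSet(name="type2-diabetes")
vs.members.append("OMOP:201826")  # create/update = plain object mutation

with open("type2-diabetes.yaml", "w") as f:
    yaml.safe_dump(asdict(vs), f)  # persist the whole value set
```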
Value sets examples

@cmungall Actually, I forgot earlier that I could have just used TermHub to get you some examples, and they use N3C OMOP as well. For bulk download of value sets, VSAC, as I mentioned earlier, works as well, but it's been a long time since I've used it. You can also sorta do that if you use the N3C enclave APIs, which is what we use to populate TermHub. Here's the value set / concept set I chose, "[DM]Type2 Diabetes Mellitus (v1)" (view on TermHub):

Value set functionality

@hlehmann17 wrote something in an email that I'll copy/paste here. We'll want to do these operations, and I wonder if they are best done in a graph DB, OAK, or just as well done in a traditional DB or other option (e.g. in-memory Python):
CRUD of value sets

I hear you, and TBH I don't know what @Sigfried's plan is regarding integrating value sets into the same graph or graphs as the N3C OMOP versions (and possibly other ontologies) that we want in our graph DB. They're on vacation until Wednesday but maybe they can elaborate at some point. @Sigfried

KGCL

Depending on how Siggie responds above, perhaps you are right and we don't need it. Also, I don't know anything about KGCL other than what it's for and what the acronym stands for.
This is great! This is very much the bread and butter of every bioinformatics ontology library written in the last few decades. I will add some examples to the notebook once I have time to find some suitable value sets. As you may be aware, the focus of Monarch is on phenotypic profile comparison - e.g., given a Mondo disease, find similar diseases or genes based on properties in common. This is implemented in OAK completely generically, so it works for any term set. I can also demonstrate some things that might not be on your radar yet...
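A rough sketch of what that generic comparison looks like; the method name is per the oaklib SemanticSimilarityInterface docs (worth verifying against your oaklib version), and both CURIEs are placeholders:

```python
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

adapter = get_adapter("sqlite:n3c.db")  # hypothetical path

# Compare two terms over the is-a graph; CURIEs are illustrative only.
sim = adapter.pairwise_similarity("OMOP:201826", "OMOP:443238", predicates=[IS_A])
print(sim.jaccard_similarity, sim.ancestor_id)
```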
Sorry, I've been on vacation. Back now. Immediate thoughts:
Updates from today's meetings:

LinkML enumerations

TIMS-specific requirements
Update from today's meeting with Chris M & Shahim: The plan is that TIMS will not use OAK and will not create an API for TermHub. It will stick to more FHIR-specific use cases and will try to implement using HAPI's relational DB. So now the plan is to continue using `networkx`.

The OP hasn't been updated to reflect the latest changes in understanding for each requirement, but my current thinking on each is:
OAK not currently planned. |
Overview
This is a discussion to determine whether OAK meets TermHub/TIMS requirements.
Pros/cons and other issues
In addition to whether OAK meets specific requirements, here's a general pros/cons list, as well as some issues we encountered in TermHub during a test run (I would bet that some of these issues are already resolved).
General pros/cons
Pros
Cons
Documentation: Additionally, I don't think there are docs that clearly address a lot of the use cases identified here, though OAK otherwise has good documentation and could be further updated.
Build time: I believe it takes 2-3 hours to build SemSql. Perhaps this will be much shorter if we skip the unnecessary robot normalization step. I wonder how long loading would take in Neo4j/JanusGraph; if similar, then this is a non-issue. It's not the hugest problem though.
TermHub OAK test run issues
Requirements list
Requirements 1-3 and 5 were originally posted in TIMS dev meeting notes Aug 14. Siggie added '4' here.
If a box is checked, we feel that OAK adequately addresses the requirement. It might also be nice if we as a team could give a rating of 1-10 on how well we think OAK addresses each one.
Requirements details
1. Transactionality
Solved when we get Postgres adapter.
Details
OAK options
PostgreSQL
What is the primary advantage over sqlite, given that sqlite is also ACID-compliant? Answer: PostgreSQL supports more complex transactions and provides a more robust concurrency model (source).
Problem: Adapter doesn't exist yet. But this may not be necessary anyway.
2. CRUD
Solved?: We just write directly to OAK in Postgres or sqlite?
Details
Need to be able to do CRUD (create, read, update, delete). Siggie also wrote: The value set data changes frequently -- by the minute or hour -- and needs to be up-to-date in the graph database.
OAK options
i. KGCL
Problems: Joe: Not sure yet; I need to understand this better to see if it meets our needs. I'm not familiar with KGCL, nor with how it might be used with OAK or whether it addresses our needs.
ii. Write directly to DB
If using Postgres or sqlite, we can write to the db (I believe only the `statements` table).
Implementation problems: Not a huge problem, but on our end we have to think about how we want to do this. We could have (a) 2 separate DBs and use OAK only for certain things, or (b) two linked DBs (e.g. TermHub's main DB, and OAK). Options to keep them in sync: (a) write some code that will periodically sync the two, (b) wrap all of our write functions such that they update both databases, or (c) some Postgres routines that automatically run when certain tables are updated.
Problems that apply to any option: Per my comments under "4. Query complexity" > "OAK options" > "iii. Integration w/ `cx`/`nx` packages": if we need `cx`/`nx`, I'm worried we'd have to re-render these additional structure(s) on every create, update, or delete.
3. Web API
Details
TIMS and TermHub aim to use the same database, so it needs to be deployed independently of each project and be internet-accessible.
OAK options
i. Can create one
Problem: Doesn't exist yet. We can also create our own; OAK doesn't need to provide it, but that's additional work.
4. Query complexity
Details
I (Joe) think that OAK mainly operates via pre-built functions, so it's possible there will be queries we want to do that OAK doesn't support.
Examples of problems
Currently we do our graph analysis using a graph tool -- NetworkX, which could be replaced by OAK -- but this forces our application to do most data retrieval in Postgres, including a lot of graph-like analysis, with a small part of the complex stuff done in the graph tool. (Joe: Not sure what the issue is here, per OAK or otherwise.)
OAK options
i. Pseudo query language
Chris wrote:
Problem: This is probably different from the power of a full query language (though IDK if we're sure what all of our use cases for that will be). I'm not seeing anything in the docs close to a Cypher/Gremlin/GraphQL query; maybe the closest is this page, but these are still method calls and Python manipulations.
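For example, what would be a one-line Cypher pattern ("descendants of X whose label matches Y") seems to come out as composed method calls; a hedged sketch, with placeholder path and CURIE:

```python
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

adapter = get_adapter("sqlite:n3c.db")  # hypothetical path
root = "OMOP:201826"                    # hypothetical CURIE

# "Query": is-a descendants of root whose label mentions insulin.
hits = [
    d for d in adapter.descendants(root, predicates=[IS_A])
    if "insulin" in (adapter.label(d) or "").lower()
]
```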
ii. OAK likely has what we need
Chris wrote:
Problem: Chris may be right, and I often trust his instincts, but this is somewhat going on faith. I should say that what we're doing is more in the realm of web applications and backend infrastructure than "basic science and clinical research", so there may well be different use cases. Additionally, "OAK as a building block" may be true, but I wonder whether some solution we would need to implement that includes OAK plus other things could simply be done by one single graph DB, which would be easier to implement and less fragile.
iii. Integration w/ `cx`/`nx` packages
Chris said: ...you can export subgraphs to nx and do all your graph theoretic work there. We also have bridges to cx, so you can easily upload your graphs to ndex...
Problems: First we need to learn more about `cx`/`nx` in order to evaluate. If one or both of these turns out to be a need, some additional concerns are: (i) it adds an extra step to our setup pipeline (not a big problem); (ii) web hosting: I worry that this could introduce memory issues or other implementation/stability problems; (iii) 2 APIs: one where we use OAK normally and another for this makes for more fragile/confusing code (not huge); (iv) CRUD: I fear this `cx`/`nx` approach would not work, because if we have users making small updates to our database, my guess is that we would have to do another full re-render in `cx`/`nx`. This problem scales with the number of users.
5.1 Data variety: Additional structures
Details
In addition to (i) edges for vocabulary structures, we need (ii) concept metadata, (iii) relationship metadata, (iv) full N3C/OMOP vocabulary tables, and (v) versions of these tables.
OAK options
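One thing that might cover (ii), concept metadata: if I'm reading the oaklib docs right, `entity_metadata_map` exposes annotation-property metadata. A hedged sketch (path and CURIE are placeholders):

```python
from oaklib import get_adapter

adapter = get_adapter("sqlite:n3c.db")             # hypothetical path
meta = adapter.entity_metadata_map("OMOP:201826")  # hypothetical CURIE

# Property keys depend on how the annotation properties in n3c.owl loaded;
# expect entries like concept_class_id, domain_id, vocabulary_id from the
# example class shown earlier in this thread.
for prop, values in meta.items():
    print(prop, values)
```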
5.2. Data variety: ValueSets
Details
We need value sets (OMOP/N3C, eventually ATLAS and VSAC) in the graph database.
OAK options
i. LinkML?
LinkML enumerations: https://linkml.io/linkml/intro/tutorial06.html
Chris said:
Problems: This means we have to use multiple interfaces to execute queries. It's not something we can't do, but it may be less optimal / harder to maintain than whatever we might do using a single graph db.
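For concreteness, my rough understanding of the intensional/extensional distinction as LinkML enums, rendered here as a plain dict dumped to YAML; the `permissible_values` and `reachable_from` keys follow the LinkML enum metamodel as I understand it from the tutorial above, and the codes (LOINC:2093-3, MONDO:0005148) are illustrative:

```python
import yaml  # pyyaml

enums = {
    "enums": {
        # Extensional: members enumerated directly.
        "SerumCholesterolVS": {
            "permissible_values": {
                "LOINC:2093-3": {
                    "description": "Cholesterol [Mass/volume] in Serum or Plasma"
                }
            }
        },
        # Intensional: members defined by a graph query over an ontology.
        "Type2DiabetesVS": {
            "reachable_from": {
                "source_ontology": "obo:mondo",
                "source_nodes": ["MONDO:0005148"],
                "relationship_types": ["rdfs:subClassOf"],
            }
        },
    }
}
print(yaml.safe_dump(enums, sort_keys=False))
```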
6. Solution for SemanticSQL memory issues
Solved?: Moot if we go the Postgres route.
Details
Couldn't create a SemSQL DB for N3C OMOP; ran out of memory even with 100GB, even when selecting only ~10% of relationship types. (We want a database we can use for graph analysis on really large data. Siggie: Couldn't get the support we needed to get it done. Joe: I think this was more technical limitations than support.)
OAK options
i. Skip robot normalization step
Chris mentioned this in this thread and on Slack. Joe highlighted on Slack where to do this in the SemSQL codebase.
Problems:
(i) Memory still high: Still required 48GB on Chris's machine, even with just ~10% of relationship types.
Possible solutions: Joe: Add an option to SemSql to just set up empty database tables and stream / load in data using the same methods we'd be using for "1. Transactionality / CRUD"?
(ii) No CLI option: Until SemSql adds this option (or programmatically determines whether normalization is necessary), this requires more than simply calling the package; it requires a custom development setup.
ii. Write own loaders using TSV
Chris said: you can write your own loaders by literally just preparing a few TSVs to bulk upload into sqlite/pg.
Joe: Need to learn how. Somehow read the ontology (using which tool?) and make a TSV in the format of the `statements` table? Then load it into sqlite somehow?
Related
`networkx` --> OAK #612