Missing ontology terms #6

Open
bschilder opened this issue Feb 29, 2024 · 14 comments

@bschilder

bschilder commented Feb 29, 2024

Hi again!

I've noticed something a bit strange when importing ontologies as ontology_DAG objects. Some terms that are available on the OLS are missing when I import the file with simona.

ont <- simona::import_ontology("http://purl.obolibrary.org/obo/uberon.owl")
sum(grepl("UBERON:0001155",ont@terms))
# [1] 0
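
A note on the check itself: `grepl()` does pattern (substring) matching, while `%in%` tests for an exact ID. The sketch below uses a made-up term vector with a hypothetical `EX:` prefix standing in for `ont@terms`, and a hypothetical helper `has_term()`. It only illustrates the difference between the two checks and is not part of simona.

```r
# Toy stand-in for ont@terms; the EX: identifiers are invented for illustration.
terms <- c("EX:00010", "EX:0002", "EX:0003")

# Hypothetical helper: TRUE only if `id` is present as an exact element.
has_term <- function(terms, id) {
  id %in% terms
}

exact_hit     <- has_term(terms, "EX:0001")                   # exact match
substring_hit <- any(grepl("EX:0001", terms, fixed = TRUE))   # substring match

# The two checks disagree here because "EX:0001" is a prefix of "EX:00010".
print(c(exact = exact_hit, substring = substring_hit))
```

For a yes/no membership test, the exact form is probably the safer of the two.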

I can confirm both are pulling from the same remote OWL file.
https://www.ebi.ac.uk/ols4/ontologies/uberon

I can also confirm that the term is searchable and not deprecated on OLS:
https://www.ebi.ac.uk/ols4/ontologies/uberon/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0001155

This doesn't seem to be specific to UBERON, as I've noticed similar issues with CL.
https://www.ebi.ac.uk/ols4/ontologies/cl

Do you have an idea of what might be going on here?

Thanks!
Brian

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3        ggdendro_0.1.23          
  [3] rstudioapi_0.15.0         jsonlite_1.8.8           
  [5] shape_1.4.6               magrittr_2.0.3           
  [7] GlobalOptions_0.1.2       fs_1.6.3                 
  [9] vctrs_0.6.5               memoise_2.0.1.9000       
 [11] ggtree_3.10.0             rstatix_0.7.2            
 [13] gh_1.4.0                  htmltools_0.5.7          
 [15] progress_1.2.3            curl_5.2.0               
 [17] broom_1.0.5               gridGraphics_0.5-1       
 [19] htmlwidgets_1.6.4         httr2_1.0.0              
 [21] lubridate_1.9.3           plotly_4.10.4            
 [23] cachem_1.0.8              networkD3_0.4            
 [25] igraph_2.0.1.1            mime_0.12                
 [27] lifecycle_1.0.4           iterators_1.0.14         
 [29] pkgconfig_2.0.3           Matrix_1.6-5             
 [31] R6_2.5.1                  fastmap_1.1.1            
 [33] shiny_1.8.0               clue_0.3-65              
 [35] digest_0.6.34             aplot_0.2.2              
 [37] colorspace_2.1-0          ggnewscale_0.4.10        
 [39] patchwork_1.2.0           S4Vectors_0.40.2         
 [41] rprojroot_2.0.4           grr_0.9.5                
 [43] ggpubr_0.6.0              timechange_0.3.0         
 [45] fansi_1.0.6               httr_1.4.7               
 [47] KGExplorer_0.99.0         abind_1.4-5              
 [49] compiler_4.3.1            here_1.0.1               
 [51] bit64_4.0.5               withr_3.0.0              
 [53] doParallel_1.0.17         backports_1.4.1          
 [55] orthogene_1.9.1           carData_3.0-5            
 [57] viridis_0.6.5             homologene_1.4.68.19.3.27
 [59] dendextend_1.17.1         maps_3.4.2               
 [61] ggsignif_0.6.4            MASS_7.3-60.0.1          
 [63] rappdirs_0.3.3            rjson_0.2.21             
 [65] scatterplot3d_0.3-44      piggyback_0.1.5          
 [67] tools_4.3.1               ape_5.7-1                
 [69] httpuv_1.6.14             glue_1.7.0               
 [71] rols_2.30.0               nlme_3.1-164             
 [73] promises_1.2.1            grid_4.3.1               
 [75] cluster_2.1.6             generics_0.1.3           
 [77] gtable_0.3.4              tidyr_1.3.1              
 [79] data.table_1.15.0         hms_1.1.3                
 [81] tidygraph_1.3.1           xml2_1.3.6               
 [83] car_3.1-2                 utf8_1.2.4               
 [85] BiocGenerics_0.48.1       foreach_1.5.2            
 [87] pillar_1.9.0              stringr_1.5.1            
 [89] yulab.utils_0.1.4         babelgene_22.9           
 [91] pals_1.9                  later_1.3.2              
 [93] circlize_0.4.15           dplyr_1.1.4              
 [95] treeio_1.26.0             lattice_0.22-5           
 [97] bit_4.0.5                 tidyselect_1.2.0         
 [99] ComplexHeatmap_2.18.0     gitcreds_0.1.2           
[101] gridExtra_2.3             IRanges_2.36.0           
[103] stats4_4.3.1              Biobase_2.62.0           
[105] matrixStats_1.2.0         visNetwork_2.1.2         
[107] stringi_1.8.3             yaml_2.3.8               
[109] lazyeval_0.2.2            ggfun_0.1.4              
[111] codetools_0.2-19          tibble_3.2.1             
[113] ggplotify_0.1.2           Polychrome_1.5.1         
[115] cli_3.6.2                 xtable_1.8-4             
[117] munsell_0.5.0             dichromat_2.0-0.1        
[119] Rcpp_1.0.12               mapproj_1.2.11           
[121] gprofiler2_0.2.2          png_0.1-8                
[123] parallel_4.3.1            simona_1.0.10            
[125] ellipsis_0.3.2            ggplot2_3.4.4            
[127] prettyunits_1.2.0         viridisLite_0.4.2        
[129] tidytree_0.4.6            scales_1.3.0             
[131] purrr_1.0.2               crayon_1.5.2             
[133] GetoptLong_1.0.5          rlang_1.1.3              
[135] rvest_1.0.3
@bschilder
Author

Regarding robot

Ok, so something else I've noticed: setting the path to robot myself (via this function, which downloads robot from https://github.com/ontodev/robot/releases) yields different results than running simona::import_ontology twice in a row (the first time with an error). I didn't realize that simona sets the path to robot after it fails the first time.

Cell Ontology example

Here's another example from the Cell Ontology.

Attempt 1

ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl.owl")
Parsing [Term] sections in the obo file [15950/15950]
remove 187 obsolete terms
There are more than one root:
  BFO:0000002, BFO:0000003, CL:0000015, CL:0000019, CL:0000021,
    and other 222 terms ...
  A super root (~~all~~) is added.
[CHEBI:36080 ~ PR:000000001 ~ CHEBI:36080]
Error: Found isolated rings (one path is listed above). Set `remove_rings = TRUE` to remove them.

Attempt 2

Here I download robot and set the path to it myself.

KGExplorer:::get_ontology_robot()
ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl.owl", remove_rings=TRUE)
grep("CL:0002494",ont@terms)
# [1] 0
Parsing [Term] sections in the obo file [15950/15950]
remove 187 obsolete terms
There are more than one root:
  BFO:0000002, BFO:0000003, CL:0000015, CL:0000019, CL:0000021,
    and other 222 terms ...
  A super root (~~all~~) is added.
Removed 749 terms in isolated rings.

But cardiocytes (CL:0002494) is indeed a term in the current CL:
https://www.ebi.ac.uk/ols4/ontologies/cl/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCL_0002494?lang=en

Attempt 3

This time, I'll try using the method of running simona::import_ontology twice.

try({
ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl.owl", remove_rings=TRUE)
})
ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl.owl", remove_rings=TRUE)
grep("CL:0002494",ont@terms)
# [1] 1213
Parsing [Term] sections in the obo file [15950/15950]
remove 187 obsolete terms
There are more than one root:
  BFO:0000002, BFO:0000003, CL:0000015, CL:0000019, CL:0000021,
    and other 222 terms ...
  A super root (~~all~~) is added.
Removed 749 terms in isolated rings.

Success!

So it seems my attempts to avoid the initial error with simona::import_ontology are actually causing more problems than they resolve. Would it be possible to have simona::import_ontology detect and install robot without producing an error the first time?

@bschilder
Author

bschilder commented Mar 2, 2024

That said, I'm still noticing missing terms when using the OBO file directly from the CL GitHub:
https://github.com/obophenotype/cell-ontology/releases

ont <- simona::import_ontology("https://github.com/obophenotype/cell-ontology/releases/download/v2024-02-13/cl-base.obo")
"CL:0002494" %in% ont@terms
# FALSE
Parsing [Term] sections in the obo file [2925/2925]
remove 186 obsolete terms
There are more than one root:
  CL:0000000, CL:0000014, CL:0000015, CL:0000018, CL:0000019,
    and other 188 terms ...
  A super root (~~all~~) is added.

@bschilder
Author

Any idea what might be going on here, @jokergoo? I'm in the process of publishing several papers that revolve around the use of simona and want to make sure there aren't any issues before we move forward.

jokergoo added a commit that referenced this issue Apr 4, 2024
@jokergoo
Owner

jokergoo commented Apr 4, 2024

The robot.jar path error has been fixed. I had forgotten to update the variable that stores the path after robot.jar is downloaded.
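
The kind of fix described above can be sketched as follows. This is only an illustration under stated assumptions: `robot_state`, `resolve_robot_jar()` and the fake `download` function are hypothetical names, not simona's actual internals, and nothing is really downloaded.

```r
# Hypothetical package-level state holding the cached path to robot.jar.
robot_state <- new.env()

# Return the cached path if there is one; otherwise "download" and cache it.
# `download` is a stand-in that just returns a fresh temporary file name.
resolve_robot_jar <- function(download = function() tempfile(fileext = ".jar")) {
  if (is.null(robot_state$robot_jar)) {
    path <- download()
    # Store the path after the download, so later calls see it.
    robot_state$robot_jar <- path
  }
  robot_state$robot_jar
}

first  <- resolve_robot_jar()
second <- resolve_robot_jar()

# Both calls agree because the path was stored on the first call.
print(identical(first, second))
```

Without the assignment inside the `if` block, a first call would see no path and every later call would have to download again.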

As for the missing terms, that was a stupid bug. The code should have been something like x[l], but I wrote x[!l], so many terms were missing.
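
To make the effect of that inversion concrete, here is a toy illustration with invented `EX:` identifiers and a made-up logical vector. The real vectors in simona are not shown in this thread, so this only demonstrates the general `x[l]` versus `x[!l]` behaviour.

```r
# Toy data: four terms and a logical vector marking the ones to keep.
x <- c("EX:0001", "EX:0002", "EX:0003", "EX:0004")
l <- c(TRUE, TRUE, FALSE, TRUE)

intended <- x[l]    # keeps the terms flagged TRUE
inverted <- x[!l]   # keeps only the terms flagged FALSE

print(intended)
print(inverted)
```

With the negation in place, every term that should have been kept is dropped, which matches the symptom of many missing terms.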

Both bugs are now fixed. Please update from GitHub.

> ont = import_obo("~/Downloads/cl-base.obo")
Parsing [Typedef] sections in the obo file [5/5]
remove 2 obsolete terms
Parsing [Term] sections in the obo file [2925/2925]
remove 186 obsolete terms
There are more than one root:
  CL:0000006, CL:0000034, CL:0000039, CL:0000048, CL:0000056,
    and other 102 terms ...
  A super root (~~all~~) is added.
> "CL:0002494" %in% ont@terms
[1] TRUE

And

> ont = import_ontology("http://purl.obolibrary.org/obo/cl.owl", remove_rings=TRUE)
`robot_jar` was not set. Download `robot.jar` from GitHub...
trying URL 'https://github.com/ontodev/robot/releases/download/v1.9.5/robot.jar'
Content type 'application/octet-stream' length 92575534 bytes (88.3 MB)
==================================================
downloaded 88.3 MB

Downloading http://purl.obolibrary.org/obo/cl.owl...
Converting file2fb870b88517_cl.owl to the obo format.
  '/usr/bin/java'  -jar '/private/var/folders/g3/f2y6rp510nxf3t5sj6h902bc0000gr/T/Rtmp4CqZ8g/robot_temp_2fb835ce2469.jar' convert --input '/private/var/folders/g3/f2y6rp510nxf3t5sj6h902bc0000gr/T/Rtmp4CqZ8g/file2fb870b88517_cl.owl' --format obo --output '/var/folders/g3/f2y6rp510nxf3t5sj6h902bc0000gr/T//Rtmp4CqZ8g/file2fb86229e2d1.obo.gz' --check false
Parsing [Typedef] sections in the obo file [315/315]
remove 2 obsolete terms
Parsing [Term] sections in the obo file [15950/15950]
remove 187 obsolete terms
There are more than one root:
  CL:0000006, CL:0000034, CL:0000037, CL:0000039, CL:0000048,
    and other 337 terms ...
  A super root (~~all~~) is added.
> grep("CL:0002494",ont@terms)
[1] 592

@bschilder
Author

Awesome! I'll try it out, thanks

@jokergoo
Owner

jokergoo commented Apr 4, 2024

Just wait, I found another bug...

@jokergoo
Owner

jokergoo commented Apr 4, 2024

I just found that I haven't handled the following tag in the obo file:

intersection_of

@jokergoo
Owner

jokergoo commented Apr 4, 2024

I would say the obo/owl formats are more complex than I thought... I am not an expert in this field. I have worked with GO very often, but not with other ontologies.

It seems that intersection_of does not provide subclass information, according to the EBI OLS website. If you use cl.obo rather than cl-base.obo (or the corresponding .owl file), all the subclasses will be there.

Currently, there are the following three ways to process ontology files.

  1. import_obo(): directly process the .obo file
  2. import_ontology(): if the input is .owl, it calls robot.jar to convert it to .obo internally, then uses import_obo() to import it.
  3. import_owl(): I have some R code that parses the XML file directly.
> ont1 = import_obo("~/Downloads/cl.obo", remove_cyclic_paths = TRUE, remove_rings = TRUE)
> ont2 = import_ontology("~/Downloads/cl.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)
> ont3 = import_owl("~/Downloads/cl.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)

And if you restrict the DAG objects to CL terms only:

> ont1 = dag_filter(ont1, terms = grep("^CL:", dag_all_terms(ont1), value = TRUE))
> ont2 = dag_filter(ont2, terms = grep("^CL:", dag_all_terms(ont2), value = TRUE))
> ont3 = dag_filter(ont3, terms = grep("^CL:", dag_all_terms(ont3), value = TRUE))

With the new GitHub version, you can also filter by namespace:

> ont1 = dag_filter(ont1, namespace = "CL")
> ont2 = dag_filter(ont2, namespace = "CL")
> ont3 = dag_filter(ont3, namespace = "CL")

Then the three objects ont1, ont2 and ont3 are basically the same:

> ont1
An ontology_DAG object:
  Source: cl, releases/2024-02-13
  2731 terms / 3812 relations
  Root: CL:0000000
  Terms: CL:0000000, CL:0000001, CL:0000005, CL:0000006, ...
  Max depth: 14
  Avg number of parents: 1.40
  Avg number of children: 1.42
  Aspect ratio: 35.91:1 (based on the longest distance from root)
                68.4:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition
> ont2
An ontology_DAG object:
  Source: cl, releases/2024-02-13
  2731 terms / 3812 relations
  Root: CL:0000000
  Terms: CL:0000000, CL:0000001, CL:0000005, CL:0000006, ...
  Max depth: 14
  Avg number of parents: 1.40
  Avg number of children: 1.42
  Aspect ratio: 35.91:1 (based on the longest distance from root)
                68.4:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition
> ont3
An ontology_DAG object:
  Source: Cell Ontology, 2024-02-13
  2731 terms / 3813 relations
  Root: CL:0000000
  Terms: CL:0000000, CL:0000001, CL:0000005, CL:0000006, ...
  Max depth: 14
  Avg number of parents: 1.40
  Avg number of children: 1.42
  Aspect ratio: 35.91:1 (based on the longest distance from root)
                68.4:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition
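
One way to check how close two imports are at the level of term IDs is a plain set comparison. The sketch below uses invented `EX:` vectors and a hypothetical helper `compare_term_sets()`. With real objects, the inputs would presumably come from `dag_all_terms()`, which is used earlier in this comment.

```r
# Hypothetical helper: summarise how two character vectors of term IDs differ.
compare_term_sets <- function(a, b) {
  list(
    only_in_a = setdiff(a, b),          # IDs present in a but not in b
    only_in_b = setdiff(b, a),          # IDs present in b but not in a
    n_shared  = length(intersect(a, b)) # number of IDs present in both
  )
}

# Toy inputs standing in for the term lists of two imported ontologies.
a <- c("EX:0001", "EX:0002", "EX:0003")
b <- c("EX:0002", "EX:0003", "EX:0004")

res <- compare_term_sets(a, b)
str(res)
```

Empty `only_in_a` and `only_in_b` entries would indicate identical term sets. Differences in relations, such as 3812 versus 3813 above, would need a separate comparison.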

You still need to update the package from GitHub. I made some small changes.

@bschilder
Author

bschilder commented Apr 4, 2024

Thanks for the updates @jokergoo
I'm also not an expert in constructing/parsing ontologies, but I use them quite a lot myself. cc'ing some people from Monarch/HPO with more expertise than me who might be able to help guide you.
@cmungall @pnrobinson @matentzn

@matentzn

matentzn commented Apr 4, 2024

"intersection_of" is OBO syntax for equivalent class statements. It is better not to handle these if you don't know exactly what they mean. It means "AND", so you can assume that all the intersections of a term together correspond to one big equivalent class statement joined by ANDs. Luckily for you, you can technically use this as an is_a, but this is really not what general tools should be doing.

There is a big push in OBO to make sure that the x-base.obo/owl files include all subclass statements, not just x.owl/obo. This is not yet true for all ontologies, but it is for CL and Mondo, for example.
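
As a rough illustration of keeping the two tags apart when reading an OBO stanza, the sketch below parses a made-up `[Term]` stanza. The `EX:` IDs and the helper `tag_values()` are hypothetical, and this is not how simona parses OBO files. It builds parents from `is_a` only and collects `intersection_of` lines separately, without interpreting them.

```r
# Invented stanza for illustration only.
stanza <- c(
  "[Term]",
  "id: EX:0000002",
  "name: example child",
  "is_a: EX:0000001 ! example parent",
  "intersection_of: EX:0000003 ! example genus",
  "intersection_of: part_of EX:0000004 ! example whole"
)

# Hypothetical helper: values of one tag, with trailing "! comment" removed.
tag_values <- function(lines, tag) {
  prefix <- paste0("^", tag, ": ")
  hits <- grep(prefix, lines, value = TRUE)
  trimws(sub(" !.*$", "", sub(prefix, "", hits)))
}

parents       <- tag_values(stanza, "is_a")             # asserted subclass links
intersections <- tag_values(stanza, "intersection_of")  # equivalence parts, kept apart

print(parents)
print(intersections)
```

Whether the `intersection_of` parts should ever be turned into parent links is the design question raised above. This sketch deliberately leaves them untouched.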

@cmungall
Contributor

cmungall commented Apr 6, 2024 via email

@bschilder
Author

Hey @jokergoo, thanks again for the updates. I just tried using the current dev version of simona and I'm running into a couple of issues. I think they're related to some of the changes meant to address the above issues.

no slot of name "alternative_terms" for this object of class "ontology_DAG"

First, the latest version of simona doesn't seem to be backward-compatible with ontology_DAG objects created with older versions. Using functions like simona::shortest_distances_via_NCA on older objects gives the error:

Error in term_to_node_id(dag, terms, strict = FALSE) : 
  no slot of name "alternative_terms" for this object of class "ontology_DAG"

It would be nice to handle older objects gracefully by not using slots that don't exist in the object.

Missing hasAlternativeId

Overall I'm getting far fewer missing IDs with the dev version of simona, which is great!
But there are still a couple of cases where this comes up, such as "CL:0000111" in the Cell Ontology. This is listed under hasAlternativeId:
https://ols.monarchinitiative.org/ontologies/upheno_patterns/terms?iri=http://purl.obolibrary.org/obo/CL_2000032

Is hasAlternativeId a slot that's available in the OBO/OWL file? If so, is it something you could consider in your mapping?

@jokergoo
Owner

@bschilder Did you use the version from GitHub?

> dag = import_obo("~/workspace/ontology/OBOFoundry/cl/cl-basic.obo")
> dag@alternative_terms["CL:0000111"]
  CL:0000111
"CL:2000032"
> term_to_node_id(dag, "CL:0000111")
[1] 2393
> term_to_node_id(dag, "CL:2000032")
[1] 2393

Quoting @bschilder: "Using functions like simona::shortest_distances_via_NCA on older objects"

You cannot use it on older objects because the definition of the ontology_DAG class has been changed. You need to regenerate it.

@bschilder
Author

bschilder commented Apr 15, 2024

Quoting @jokergoo: "@bschilder Did you use the version from GitHub?"

Yes, but it looks like you've made some additional changes since I last installed.
Currently you're on 1.1.14
I'm using 1.1.13.
Just updated to the newer version

Not sure where to find the exact version of the cl-basic.obo you're using, but here's an example that uses a version we can both access:

ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl/releases/2024-04-05/cl.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)

In my original report, this is how I was checking whether the term was available.

"CL:0000111" %in% ont@terms # FALSE
"CL:0000111" %in% ont@alternative_terms # FALSE
"CL:0000111" %in% names(ont@alternative_terms) # TRUE

But it seems @terms only includes the main IDs, not the alternative IDs. Is that intentional? Is there some unified way to grab all IDs, or do you recommend using unique(c(ont@terms, names(ont@alternative_terms))) to get the complete list? I have some use cases where I filter input terms to only those that the ontology_DAG will recognize, to avoid throwing errors.
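
A minimal sketch of that filtering idea is shown below. It uses plain vectors in place of the `@terms` and `@alternative_terms` slots: the `CL:0000111` to `CL:2000032` mapping is taken from this thread, while the rest of the values, including the `EX:9999999` query, are made up. The two vectors are combined with `c()` before `unique()`, because the second argument of `unique()` is `incomparables` rather than a second vector.

```r
# Stand-ins for ont@terms and ont@alternative_terms (names are alternative IDs).
terms <- c("CL:0000000", "CL:2000032")
alternative_terms <- c("CL:0000111" = "CL:2000032")

# All IDs the object should recognise: main IDs plus alternative IDs.
all_ids <- unique(c(terms, names(alternative_terms)))

# Keep only the query terms that are recognised.
query <- c("CL:0000111", "CL:2000032", "EX:9999999")
recognized <- query[query %in% all_ids]

print(recognized)
```

Whether simona offers an exported accessor for this combined list is not established in this thread, so the slot-based approach here is an assumption.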

I can confirm that other downstream functions are able to use the alt IDs. So things are looking good in this regard!

simona::shortest_distances_via_NCA(ont, terms = "CL:0000111")

[Screenshot 2024-04-15 at 15 52 47]

term_to_node_id isn't an exported function in simona. I think this is coming from an internal function accessible with simona:::term_to_node_id(). Is that correct?

> dag = import_obo("~/workspace/ontology/OBOFoundry/cl/cl-basic.obo")
> dag@alternative_terms["CL:0000111"]
  CL:0000111
"CL:2000032"
> term_to_node_id(dag, "CL:0000111")
[1] 2393
> term_to_node_id(dag, "CL:2000032")
[1] 2393

Quoting @jokergoo (in reply to "Using functions like simona::shortest_distances_via_NCA on older objects"): "You cannot use it on older objects because the definition of the ontology_DAG class has been changed. You need to regenerate it."

Ok, but if that's the case then it would be good to return an error that lets users know this. Otherwise, it's not obvious what the issue is. I only figured it out because I'm involved in this thread.
Another option would be to provide a way to update old ontology_DAG objects to the new version. I imagine these sorts of issues may pop up again as simona changes over time.
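
Both suggestions can be sketched with a toy S4 class. Everything here is hypothetical: `toy_dag`, `check_toy_dag()` and `upgrade_toy_dag()` are illustration names, and the real ontology_DAG class has more slots than shown. The sketch relies on `methods::.hasSlot()` to detect objects created before a slot was added.

```r
library(methods)

# Version 1 of a toy class, and an object created from it.
setClass("toy_dag", slots = c(terms = "character"))
old_obj <- new("toy_dag", terms = c("EX:0001", "EX:0002"))

# Version 2 adds a slot, similar to `alternative_terms` in this thread.
setClass("toy_dag", slots = c(terms = "character",
                              alternative_terms = "character"))

# Option 1: fail early with a message that explains what to do.
check_toy_dag <- function(x) {
  if (!.hasSlot(x, "alternative_terms")) {
    stop("This object was created with an older class definition; ",
         "please regenerate it.", call. = FALSE)
  }
  invisible(x)
}

# Option 2: rebuild the object under the current definition.
upgrade_toy_dag <- function(x) {
  if (.hasSlot(x, "alternative_terms")) return(x)
  new("toy_dag", terms = x@terms, alternative_terms = character(0))
}

new_obj <- upgrade_toy_dag(old_obj)
print(.hasSlot(new_obj, "alternative_terms"))
```

An upgrade function like this only works when sensible defaults exist for the new slots. Otherwise, regenerating the object from the source file, as suggested above, is probably the only safe route.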
