# OntoSearcher First Steps

### The goal of this document is to outline the exploratory first steps of working with your dataset. By the end, you will have selected several ontologies from the internet and compiled them into a single .json file. This file is used in the RDF conversion process. 

#### REQUIREMENTS:
* Your data, with each table stored in a separate .csv file.
* A BioPortal API key (if you don't have one, we'll walk through getting it when we get there).


### Before You Begin

#### Python and Jupyter Notebook

The OntoSearcher code is written in Python. These training documents were made in Jupyter Notebook, a tool for building interactive Python documents. You were given .html copies of both documents for ease of access, but you need Python installed to actually run the code. Jupyter Notebook is strongly recommended as well: it makes it easier to mirror the steps shown here and lets you see the results of each block of code individually. Depending on your organization, you may need to go through IT to install either one, along with any dependencies.

The process:
1. Install Python on your machine.
2. Install Jupyter Notebook on your machine.
3. Run the block of imports below (cell In [1]). If you get a ModuleNotFoundError, open a terminal window and use the command *pip install thatModule* to install the missing module. Repeat until no more errors occur. Warnings are typically fine.
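
Instead of discovering missing modules one error at a time, you can ask Python which ones it cannot find. The sketch below is not part of OntoSearcher; `missing_modules` is a helper name made up for this example, and it only checks top-level module names. Note that the name you pass to pip is not always identical to the import name.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]

# "json" ships with Python; "ontosearch" is the package imported in cell In [1].
required = ["json", "ontosearch"]
for name in missing_modules(required):
    print(f"Missing module: {name}")
```

If nothing is printed, every module in the list was found.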

<hr>

A key feature of RDF is that the final product can readily interface with other RDF datasets. To make that work, the data must use shared terminology, and to that end we rely on existing ontologies. However, there are *a lot* of ontologies, and they cover nearly every subject imaginable. This document outlines the process of finding a good group of ontologies that accurately covers the data in your dataset.

To begin, you need to pick a few relevant ontologies. You may already be familiar with some, such as eNanoMapper. If not (or if you want a firmer footing), you'll need to do some digging for additional ones. Online tools such as NCBO's BioPortal (https://bioportal.bioontology.org/) make it easy to browse ontologies that are relevant for our purposes. As an example, try going there and searching for **mouse**. At the time of writing, this returns matches in 70 ontologies. The first result, SNOMED CT, seems like a decent fit for defining the common research animal, but the other 69 ontologies may be just as good, so feel free to check the first few results for the best one. Run simple searches like this for terms that are good examples of your data, note which ontologies come up often, and you'll have a good place to start. Download the .owl files for those ontologies and store them together in one directory.
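
If you jot down which ontologies appear for each search, a few lines of Python can tally the ones that come up most often. This is only a sketch: `rank_ontologies` is a made-up helper, and the terms and ontology names below are placeholders typed in by hand, not real BioPortal results.

```python
from collections import Counter

def rank_ontologies(hits_by_term):
    """Count, for each ontology, how many search terms it appeared for."""
    counts = Counter()
    for ontologies in hits_by_term.values():
        counts.update(set(ontologies))  # count each ontology once per term
    return counts.most_common()

# Placeholder notes (hypothetical names); replace with your own search notes.
hits_by_term = {
    "term A": ["ONTO1", "ONTO2", "ONTO3"],
    "term B": ["ONTO2", "ONTO3"],
    "term C": ["ONTO3"],
}
print(rank_ontologies(hits_by_term))
# prints [('ONTO3', 3), ('ONTO2', 2), ('ONTO1', 1)]
```

The ontologies at the top of the list are the ones worth downloading first.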

Now, let's use OntoSearcher. First, we set everything up. We import some tools from OntoSearcher, load in one of our tables as a csv, and load in some starter ontologies. For the NaKnowBase, we found the following ones that looked like a good fit:
* EDAM_dev.owl
* enanomapper.owl
* ncit.owl
* npo.owl
* obi.owl
* SCTO.owl
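
Before loading anything, it can help to confirm that the directory really contains the files you expect. The sketch below uses only the standard library; the helper names are invented for this example, and `owlfiles/` is the directory name used in the cells that follow.

```python
from pathlib import Path

def list_owl_files(onto_dir):
    """Return the sorted names of the .owl files directly inside onto_dir."""
    return sorted(p.name for p in Path(onto_dir).glob("*.owl"))

def missing_owl_files(onto_dir, expected):
    """Return the expected file names that are not present in onto_dir."""
    present = set(list_owl_files(onto_dir))
    return [name for name in expected if name not in present]

expected = ["EDAM_dev.owl", "enanomapper.owl", "ncit.owl",
            "npo.owl", "obi.owl", "SCTO.owl"]
print(missing_owl_files("owlfiles/", expected))
```

An empty list means every starter ontology is in place.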

In [1]:
# import EPA OntoSearcher modules, and other packages
from ontosearch.onto import ontolister, ontocontext
from ontosearch.csv_importer import load_data
from ontosearch.find import matcher
from ontosearch.onto_api import bioportal_search, unpack_superclass
from ontosearch.rdf_print import term_lookup
from json import dump



In [2]:
material_csv = load_data("csvHolder/material.csv")

In [3]:
onto = ontolister(
    onto_dir='owlfiles/'
)

RUN: C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#P90 not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://purl.bioontology.org/ontology/npo#preferred_Name not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://www.w3.org/2004/02/skos/core#exactMatch not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://www.w3.org/2004/02/skos/core#altLabel not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://www.w3.org/2004/02/skos/core#prefLabel not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://purl.jp/bi



RUN: C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://www.geneontology.org/formats/oboInOwl#hasExactSynonym not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://www.geneontology.org/formats/oboInOwl#hasBroadSynonym not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#P90 not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://purl.org/dc/elements/1.1/title not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://www.w3.org/

Now that the hand-selected ontologies are loaded in, it's time to see how well they cover the data. The **matcher** method runs the dataset through the ontologies and returns structures holding the matched and unmatched terms.

In [4]:
match, unmatch = matcher(onto, material_csv)


number unmatched terms in MaterialID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in publication_DOI: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in CoreComposition: 46

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 8


RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in InnerDiameterUnit: 3

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in InnerDiameterUncertainty: 3

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 1
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_bin

RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in SurfaceAreaApproxSymbol: 4

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in SurfaceAreaUnit: 7

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary

RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in SDLow2: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in SDHigh2: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold va

RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ChargeApproxSymbol: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ChargeUnit: 3

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches:
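
The matcher log is long, so a per-column summary of the unmatched counts can be easier to read. The sketch below assumes you have captured the printed log as a string (for example with `contextlib.redirect_stdout`) and that the summary lines keep the wording shown above; `unmatched_counts` is a made-up helper, not an OntoSearcher function.

```python
import re

# Matches lines such as "number unmatched terms in CoreComposition: 46".
UNMATCHED_LINE = re.compile(r"^number unmatched terms in (\S+): (\d+)$")

def unmatched_counts(log_text):
    """Map each column name to the unmatched-term count printed for it."""
    counts = {}
    for line in log_text.splitlines():
        found = UNMATCHED_LINE.match(line.strip())
        if found:
            counts[found.group(1)] = int(found.group(2))
    return counts

# A few lines copied from the output above.
sample = """number unmatched terms in MaterialID: 1
RUN: EDAM_dev.owl | exact_binary
matches: 0
threshold value: 90
number unmatched terms in CoreComposition: 46
"""
# Show the columns with the most unmatched terms first.
for column, count in sorted(unmatched_counts(sample).items(), key=lambda kv: -kv[1]):
    print(f"{column}: {count}")
```

Columns with the highest counts are the ones where an additional ontology is most likely to help.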

According to the matcher output above, plenty of terms are still unmatched and need to be addressed. OntoSearcher has more tools to help with this. First, load in your BioPortal API key. (If you do not have one, you can get one by making an account at BioPortal and checking your account settings. I store mine in a text file and load it into the notebook for security purposes.) Next, we feed the unmatched terms into **bioportal_search()** to query BioPortal for potential matches in other ontologies. The method only searches a small random subset of the unmatched terms, which lets it return results quickly. Then, we use **unpack_superclass()** to see the results. Each result from unpack_superclass takes the form of 
* the term
* the match
* the ontology the match was found in
* the superclass of the match, so we have context

The context is important because some matches may be false positives. As an example, scientific databases may use **Purity** in the sense of *chemical* purity, but a search could return a literature ontology's term meant for *moral* purity. Never assume a term is correct without verifying the context matches your intent.

In [5]:
with open('bioportal_api_key.txt') as file:
    # strip the trailing newline so it is not treated as part of the key
    apiKey = file.readline().strip()

In [6]:
bioapi = bioportal_search(unmatch, apiKey, em=True, mode='col')

Running search in column mode
time elapsed: 109.50610050000068


In [7]:
unpack_superclass(bioapi, apiKey)

{'Supplier': [[['Supplier',
    'http://purl.bioontology.org/ontology/SNOMEDCT/774164004',
    'https://data.bioontology.org/ontologies/SNOMEDCT',
    'https://data.bioontology.org/ontologies/SNOMEDCT/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F774164004/parents'],
   ['Qualifier value',
    'http://purl.bioontology.org/ontology/SNOMEDCT/362981000']]],
 'LengthUnit': [[['LengthUnit',
    'http://purl.obolibrary.org/obo/UO_0000001',
    'https://data.bioontology.org/ontologies/RO',
    'https://data.bioontology.org/ontologies/RO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUO_0000001/parents'],
   ['Unit', 'http://purl.obolibrary.org/obo/UO_0000000']]],
 'Shape': [[['Shape',
    'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25677',
    'https://data.bioontology.org/ontologies/NCIT',
    'https://data.bioontology.org/ontologies/NCIT/classes/http%3A%2F%2Fncicb.nci.nih.gov%2Fxml%2Fowl%2FEVS%2FThesaurus.owl%23C25677/parents'],
   ['Spatial Qualifier',
    'h
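
The nested lists above are hard to scan, so it may help to flatten them into one row per suggestion. The sketch below assumes the shape visible in the output: each term maps to a list of hits, and each hit is a match entry (label, IRI, ontology URL, parents URL) followed by one or more superclass entries (label, IRI). `flatten_results` is a made-up helper, and the sample is copied from the first entry above.

```python
def flatten_results(results):
    """Return (term, ontology, match IRI, superclass label) rows."""
    rows = []
    for term, hits in results.items():
        for match, *superclasses in hits:
            _label, iri, ontology_url, _parents_url = match
            # The ontology acronym is the last path segment of its URL.
            ontology = ontology_url.rsplit("/", 1)[-1]
            for sup_label, _sup_iri in superclasses:
                rows.append((term, ontology, iri, sup_label))
    return rows

sample = {
    "Supplier": [[
        ["Supplier",
         "http://purl.bioontology.org/ontology/SNOMEDCT/774164004",
         "https://data.bioontology.org/ontologies/SNOMEDCT",
         "https://data.bioontology.org/ontologies/SNOMEDCT/classes/"
         "http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F774164004/parents"],
        ["Qualifier value",
         "http://purl.bioontology.org/ontology/SNOMEDCT/362981000"],
    ]],
}
for row in flatten_results(sample):
    print(row)
```

Reading the superclass column next to each term makes false positives, like the *moral* purity example, easier to spot.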

You can repeat the above steps as many times as you'd like: add or remove ontologies, then see what suggestions come up for whatever remains unmatched. You should also repeat the process on the other tables of your dataset so that your ontology collection covers them as well. Once you feel you've done as much as you can with this broad-strokes approach, we can call this phase finished and save the final ontology collection for use in the next steps (note that we modify our call to ontolister to use a different function, **ontocontext**). You're then ready to move on to the next steps, which are outlined in **nkb_rdf.ipynb**.
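
Repeating the process for every table can be scripted. The sketch below is an illustration rather than part of OntoSearcher: `run_on_all_tables` is a made-up helper that takes the loading and matching steps as arguments, so it makes no assumptions about what they return.

```python
from pathlib import Path

def run_on_all_tables(csv_dir, load, match):
    """Apply load() then match() to every .csv in csv_dir, keyed by table name."""
    results = {}
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        table = load(str(csv_path))
        results[csv_path.stem] = match(table)
    return results

# Intended use in this notebook (assumes `onto` from cell In [3]):
# per_table = run_on_all_tables("csvHolder", load_data, lambda t: matcher(onto, t))
```

Each value in the returned dictionary is whatever the matching step returned for that table, so `per_table["material"]` would hold the same pair that cell In [4] produced.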

In [8]:
finalOnto = ontolister(
    ontofunc=ontocontext,
    onto_dir='owlfiles/'
)

RUN: C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#P90 not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://purl.bioontology.org/ontology/npo#preferred_Name not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://www.w3.org/2004/02/skos/core#exactMatch not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://www.w3.org/2004/02/skos/core#altLabel not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://www.w3.org/2004/02/skos/core#prefLabel not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\EDAM_dev.owl
http://purl.jp/bi



RUN: C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://www.geneontology.org/formats/oboInOwl#hasExactSynonym not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://www.geneontology.org/formats/oboInOwl#hasBroadSynonym not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#P90 not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://purl.org/dc/elements/1.1/title not present in C:\Users\bbeach\OneDrive - Environmental Protection Agency (EPA)\Desktop\ontology test\owlfiles\npo.owl
http://www.w3.org/

In [9]:
with open('full_onto.json', 'w') as file:
    dump(finalOnto, file)
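
Since the saved file feeds the next phase, it is worth confirming that it reads back the same way it was written. The sketch below shows the idea on a small stand-in dictionary; `check_round_trip` is a made-up helper, and the stand-in data is hypothetical, not the real structure of finalOnto.

```python
import json
import os
import tempfile

def check_round_trip(obj, path):
    """Write obj to path as JSON, read it back, and report whether it is unchanged."""
    with open(path, "w") as f:
        json.dump(obj, f)
    with open(path) as f:
        return json.load(f) == obj

# Hypothetical stand-in; in the notebook you would pass finalOnto instead.
stand_in = {"example term": ["http://example.org/iri", "example label"]}
with tempfile.TemporaryDirectory() as tmp:
    print(check_round_trip(stand_in, os.path.join(tmp, "check.json")))  # prints True
```

A result of False usually means the object held something JSON cannot represent exactly, such as tuples (read back as lists) or non-string dictionary keys (read back as strings).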