## EPA NaKnowBase RDF Walkthrough
#### Version 3

If you have used version 1 or 2 of the tutorial, feel free to skip to the [appendix](#appendix) for a list of changes in methodology. Otherwise, read on.

In this notebook, we will convert the EPA NaKnowBase into RDF format. The NaKnowBase is a standard relational database, composed of multiple tables. We will add each table to the graph independently, thereby reducing semantic overlap errors. This process will include both automated and manual curation of URIs for the data. This notebook should be treated as a toolkit: use the examples within and you'll be able to match the process to your own data.

NOTE: This document was a collaborative effort, with code and text supplied by multiple individuals. Most of the resulting inconsistencies in terminology are minor, but one may cause slight confusion. The terms *Uniform Resource Identifier* and *Internationalized Resource Identifier*, or URI and IRI, have small but meaningful technical differences. For the purposes of this tutorial, both can be taken to mean "ID that the RDF uses for a term." When copying code blocks that use them, copy them as written, but don't worry about why one is chosen over the other in any given situation.


### Requirements:
* the .json file generated in Ontosearch First Steps
* your data, arranged into .csv files, each stored in its own folder with the same name as the file
    * example - /your_data/material/material.csv
* Python
    * Jupyter Notebook is recommended.

### Before You Begin

#### Example Folder Structure
![Example Folder Structure](environment.png)
Before you begin, your project folder should look something like this. Let's go through it.
* a folder holding your data (here, csvs)
    * inside: an additional folder for each data file, holding the data file
    * EX: /ontosearchProject/csvs/material/material.csv
* a folder holding the Ontosearcher code
    * Do not edit the contents of this folder. 
* your BioPortal API key (see the first tutorial)
    * make sure this key is stored in a filetype that doesn't add extra hidden formatting
    * .txt is a good choice
* the .json file holding the compiled ontologies you selected (see the first tutorial)
* a Jupyter Notebook to do your work in
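
The layout rule for the data folder (one subfolder per data file, with matching names) can be checked with a short script. The sketch below is only an illustration and is not part of Ontosearcher: the function name `check_data_layout` is hypothetical, and it assumes your data files are .csv files.

```python
from pathlib import Path


def check_data_layout(data_root):
    """Return a list of layout problems under data_root (an empty list means OK)."""
    problems = []
    for entry in sorted(Path(data_root).iterdir()):
        if not entry.is_dir():
            # Anything that is not a subfolder does not belong at this level
            problems.append(f"unexpected file in data folder: {entry.name}")
            continue
        expected = entry / f"{entry.name}.csv"
        if not expected.is_file():
            problems.append(f"missing {expected.name} in folder {entry.name}")
    return problems


# Example call, assuming your data folder is named "csvs":
# print(check_data_layout("csvs"))
```

An empty list means every subfolder holds a .csv with the same name as the folder. Stray files directly inside the data folder, including hidden ones, are reported as problems.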


#### Be Aware of Hidden Files
Take special care with your data folder. Some operating systems add hidden files without telling you, and these can cause errors. Mac users in particular have reported errors caused by .DS_Store, a hidden file that saves the layout of a folder so it looks the same when you come back to it. These files usually reveal themselves during the first five code blocks below. If you get the error **NTriples parsing error (or unrecognized file format) in fileYouDidn'tKnowAbout**, you likely have a hidden file that the Ontosearcher program is attempting (and failing) to read. Read up on the file, verify it is safe to delete, and then delete it.

To see hidden files:
* On Windows, open the folder in File Explorer, go to View, and check the box for *Hidden items* (on Windows 11, this is under View > Show).
* On a Mac, press Command + Shift + Period. 
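
You can also look for hidden files from Python. The sketch below is a hypothetical helper, not part of Ontosearcher. It only catches files whose names start with a dot (such as .DS_Store), so files hidden through Windows file attributes are not detected.

```python
import os


def find_hidden_files(folder):
    """Return the paths of all dot-prefixed files under folder, searched recursively."""
    hidden = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            if name.startswith("."):
                hidden.append(os.path.join(dirpath, name))
    return sorted(hidden)


# "csvs" is an assumed folder name; replace it with your own data folder
for path in find_hidden_files("csvs"):
    print(path)
```

The script only reports what it finds. Deleting is left to you, after you have verified that the file is safe to remove.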

#### Python and Jupyter Notebook

The Ontosearcher code was written in Python. These training documents were made in Jupyter Notebook, a tool for making interactive Python documents. Although you were given .html copies of both documents for ease of access, you need Python installed to actually use the code. Jupyter Notebook is strongly recommended as well: it makes it easier to mirror the processes seen here and lets you see the results of each block of code individually. Depending on your organization's security policies, you may need to contact your IT staff to install one or both, along with any dependencies.

The process:
1. Install Python on your machine.
2. Install Jupyter Notebook on your machine.
3. Test-run the block of imports below. If you get a ModuleNotFoundError, open a terminal window and use the command *pip install moduleNameHere* to install the missing module. Repeat until no more errors occur. Warnings are typically fine.
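
Instead of discovering missing modules one error at a time, you can list them all at once. This is a minimal sketch using the standard library's `importlib.util.find_spec`. The helper name `find_missing_modules` is hypothetical, and the module list is an assumption based on the third-party imports used in this notebook.

```python
import importlib.util


def find_missing_modules(names):
    """Return the top-level module names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported by this notebook (json ships with Python)
required = ["rdflib", "pandas", "json"]
for name in find_missing_modules(required):
    print(f"missing: {name}")
```

The `ontosearch` package is left out of the list on purpose. It comes from the Ontosearcher code folder in your project rather than from pip, so a missing `ontosearch` usually means the notebook is not being run from the project folder.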

#### Working In Multiple Sessions
Jupyter notebooks continue to show the output of a block of code, even if you shut down your computer and relaunch the notebook later. **This does NOT mean your work is how you left it.** When the notebook shuts down, all of the variables in the code you've run so far are lost. You need to re-run every block of code each time you come back to the notebook. You could do this by hitting run on every block of code in order, but there are easier ways. Under the *Kernel* menu up top, select *Restart and Run All*. Or, click the last cell, then click the *Cell* menu at the top and select *Run All Above*.

## RDF Conversion, Part 1

We will convert each table in the NaKnowBase, one by one. For each one, the process is similar.
1. Read in the .csv file for the table, converting it to triples.
2. Use Ontosearcher to automatically replace most of the human-friendly terms in those triples with URIs from ontologies.
3. Manually curate replacements for any terms Ontosearcher couldn't confidently match.
4. Apply the curations. 

### Primary Tables

   - **materials**
   - **assay**
   - **publication**
   - **medium**

*import ontosearcher modules and supporting packages*

In [1]:
# import EPA OntoSearcher modules, and other packages
from ontosearch.onto import ontolister, ontocontext
from ontosearch.csv_importer import load_data
from ontosearch.find import matcher
from ontosearch.onto_api import bioportal_search, unpack_superclass
from ontosearch.onto_api import bioportal_sample, dict_samp, bio_summary
from ontosearch.rdf_print import table_from_file, term_editor, term_lookup
from ontosearch.rdf_print import basic_rdf, relational_rdf_loader
from ontosearch.rdf_print import primenode, node_one, node_two, multi_editor
# supporting libraries
from rdflib import Graph, URIRef, Literal, BNode
import pandas as pd
import json



*load in OWL files*

In [2]:
# OWL import
# EDAM, eNM, NCIT, NPO, OBI, SCTO
with open('full_onto.json', 'r', encoding='utf-8') as f:
    onto = json.load(f)

In [3]:
for x in onto:
    print(x)

EDAM_dev.owl
enanomapper.owl
ncit.owl
npo.owl
obi.owl
SCTO.owl


*create RDF graph*

The graph will, eventually, be the final product. The rest of this document will be about populating the graph.

In [4]:
# create an RDF Graph triplestore
nkbGraph = Graph()

## RDF Conversion
MATERIAL TABLE

In [5]:
# CSV import
# load in the NaKnowBase material csv
material_csv = load_data("fullCSV/material")

# CSV to dataframe
# load NaKnowBase material csv into pandas dataframe
#note that table_from_file requires a folder, not a file
material_dfs = table_from_file("fullCSV/material")

In [6]:
#purely for verification, prints out the data we've just loaded
display(material_dfs)

{'material.csv':      MaterialID                  publication_DOI   CoreComposition  \
 0             1               10.1002/cyto.22342            Silver   
 1             1             10.1002/cyto.a.20927  titanium dioxide   
 2             1             10.1002/cyto.a.22793            Silver   
 3             1                 10.1002/em.21848  titanium dioxide   
 4             1                 10.1002/etc.1858  titanium dioxide   
 ..          ...                              ...               ...   
 369          18  10.1016/j.scitotenv.2017.11.195            Silver   
 370          19  10.1016/j.scitotenv.2017.11.195            Silver   
 371          20  10.1016/j.scitotenv.2017.11.195            Silver   
 372          21  10.1016/j.scitotenv.2017.11.195            Silver   
 373          22  10.1016/j.scitotenv.2017.11.195            Silver   
 
     ShellComposition    CoatingComposition SynthesisMethod  SynthesisDate  \
 0                NaN  polyvinylpyrrolidone         

Now that we have our data loaded in, we use Ontosearcher to find matches in our ontology collection for each term in the data. This should look familiar since you've completed Ontosearch First Steps.

In [7]:
# run matcher
materials_match, materials_unmatch = matcher(onto, material_csv, context=True)


number unmatched terms in MaterialID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in publication_DOI: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in CoreComposition: 46

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 8


RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in InnerDiameterUnit: 3

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in InnerDiameterUncertainty: 3

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 1
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_bin

RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in SurfaceAreaApproxSymbo: 4

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in SurfaceAreaUnit: 7

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 

RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in SDLow2: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in SDHigh2: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold va

RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ChargeApproxSymbol: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ChargeUnit: 3

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches:

Now, we handle the structural parts of RDF. Remember, our RDF is going to be composed entirely of triples of the form (subject)-(predicate)-(object). There's a lot of work to be done converting everything from the .csv row format into these triples, and the functions below automate it for us. They also report every column heading from the .csv that requires manual curation.

In [8]:
# load basic term RDF
# this function makes RDF of every term and its IRI, 
# asserting the basic relationship between an IRI and its human-readable label
basic_rdf(materials_match[0], nkbGraph)

# load relational RDF relationships
# this function reads the dataframe structure, and organizes the data
# by providing each row a unique IRI. All other data is organized as
# properties of that row IRI.
relational_rdf_loader(material_dfs, materials_match[0], nkbGraph)

 IMPORTANT: MaterialID needs manual curation
 IMPORTANT: publication_DOI needs manual curation
 IMPORTANT: CoreComposition needs manual curation
 IMPORTANT: ShellComposition needs manual curation
 IMPORTANT: CoatingComposition needs manual curation
 IMPORTANT: SynthesisMethod needs manual curation
 IMPORTANT: SynthesisDate needs manual curation
 IMPORTANT: CASRN needs manual curation
 IMPORTANT: ProductNumber needs manual curation
 IMPORTANT: LotNumber needs manual curation
 IMPORTANT: OuterDiameterValue needs manual curation
 IMPORTANT: OuterDiameterApproxSymbol needs manual curation
 IMPORTANT: OuterDiameterUnit needs manual curation
 IMPORTANT: OuterDiameterUncertainty needs manual curation
 IMPORTANT: OuterDiameterLow needs manual curation
 IMPORTANT: OuterDiameterHigh needs manual curation
 IMPORTANT: OuterDiameterMethod needs manual curation
 IMPORTANT: InnerDiameterValue needs manual curation
 IMPORTANT: InnerDiameterApproxSymbol needs manual curation
 IMPORTANT: InnerDiameterUn
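
To make the row-to-triple idea concrete, here is a conceptual sketch in plain Python. It is not the implementation of `relational_rdf_loader`: the function `rows_to_triples`, the row identifier scheme, and the choice to skip empty cells are all assumptions made for illustration, and the two rows are toy data loosely modeled on the material table.

```python
import csv
import io

# Toy stand-in for material.csv (illustrative rows, not the real file)
sample = io.StringIO(
    "MaterialID,publication_DOI,CoreComposition,ShellComposition\n"
    "1,10.1002/cyto.22342,Silver,\n"
    "1,10.1002/cyto.a.20927,titanium dioxide,\n"
)


def rows_to_triples(handle, table_name):
    """Turn each CSV row into (row_id, column, value) triples, skipping empty cells."""
    triples = []
    for index, row in enumerate(csv.DictReader(handle)):
        row_id = f"{table_name}/row{index}"  # hypothetical unique ID for the row
        for column, value in row.items():
            if value:  # assumption: an empty cell produces no triple
                triples.append((row_id, column, value))
    return triples


triples = rows_to_triples(sample, "material")
for triple in triples:
    print(triple)
```

Each row becomes a subject, each column heading becomes a temporary predicate, and each cell becomes an object. The column headings are what manual curation later replaces with ontology IRIs.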

At this point, you have to do things by hand. For each item flagged above, you need to find a proper match. There are multiple approaches here, and you'll need all of them.
* You could try looking in the ontologies you already curated and seeing if there's a term that's close enough. For example, publication_DOI is probably a good match for the term http://purl.obolibrary.org/obo/OBI_0002110, DOI, from the OBI ontology we're already using. 
* You can look for the term in other ontologies, such as by using BioPortal again. "CASRN" was matched in this way: searching BioPortal for 'CAS registry number' found the http://semanticscience.org/resource/CHEMINF_000446 entry in the CHEMINF ontology.
* There may be no existing term! If one of the ontologies you've looked at is still being actively maintained, you could send a request to its maintainers to add your term. Or, you could publish your own ontology of the terms you need and make it publicly available.
* As a last resort, you could skip that bit of manual curation. The structural connections should already be handled by Ontosearcher.
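
If you prefer to query BioPortal from code rather than the website, the search can be sketched with the standard library alone. This sketch assumes the public BioPortal REST search endpoint and its `q`, `ontologies`, and `apikey` parameters, so check the BioPortal API documentation before relying on it. The function names are hypothetical, and this is not the `bioportal_search` helper imported earlier.

```python
import json
import urllib.parse
import urllib.request

BIOPORTAL_SEARCH = "https://data.bioontology.org/search"


def build_search_url(term, api_key, ontologies=None):
    """Build a BioPortal search URL, optionally limited to some ontology acronyms."""
    params = {"q": term, "apikey": api_key}
    if ontologies:
        params["ontologies"] = ",".join(ontologies)
    return f"{BIOPORTAL_SEARCH}?{urllib.parse.urlencode(params)}"


def search_labels(term, api_key, ontologies=None):
    """Fetch search results as (preferred label, IRI) pairs. Needs network access."""
    with urllib.request.urlopen(build_search_url(term, api_key, ontologies)) as response:
        payload = json.load(response)
    # Assumption: results are listed under "collection" with "prefLabel" and "@id" keys
    return [(hit.get("prefLabel"), hit.get("@id")) for hit in payload.get("collection", [])]
```

A call such as `search_labels('CAS registry number', your_key)` would return candidate labels and IRIs to review by hand. A human still has to make the final choice of match.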

### Mapping:

Before continuing, you should have manually curated matches for all of the unmatched terms. Now, we need to apply the matches.
For some columns, this can be very simple. The next three code blocks show how to map the unmatched terms to the curated matches, how to define "what" the rows are (here, each is a material), and how to replace the generic structural lines in our Graph with the mapped terms.

<a id='isA_relationship'></a>
NEW TO V2 - pay special attention to material_iri, which holds an IRI. We use this to declare each row of this input data as belonging to something that "is a" material. The second code block uses it to create triples declaring each row of material data to be a material. This pattern will be repeated for each table of input data and allows us to know which table of our original data each triple belongs to.

In [9]:
# start with all of the columns to be mapped 
# straight to data thru their IRIs

#NEW TO V2 - addition to add an "is a" relationship triple to this data
material_iri = 'http://purl.bioontology.org/ontology/npo#NPO_199'

material_edits = [
                  ('MaterialID', 'http://purl.org/dc/terms/identifier'),
                  ('publication_DOI', 'http://purl.obolibrary.org/obo/OBI_0002110'),
                  ('CoreComposition', 'http://purl.bioontology.org/ontology/npo#NPO_1808'),
                  ('ShellComposition', 'http://purl.bioontology.org/ontology/npo#NPO_1824'),
                  ('CoatingComposition', 'http://purl.bioontology.org/ontology/npo#NPO_1823'),
                  ('CASRN', 'http://semanticscience.org/resource/CHEMINF_000446'),
                  ('Supplier', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C43530'),
                  ('ProductNumber', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C99287'),
                  ('LotNumber', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C70848'),
                  ('Shape', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25677'),
                  # 'shape in medium' should probably be submitted to eNM as new term
                  ('ShapeInMedium', 'http://purl.enanomapper.org/onto/ENM_8000049'), # currently using eNM 'size in media'
                  ('Solubility', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C60821'),
                  ('UniqueName', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42614'),
                  ('UniqueDescription', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25365'),
                  ('DSSToxSubstanceId', 'http://semanticscience.org/resource/CHEMINF_000568'),
                 ]

In [10]:
#NEW TO V2 - addition to add an "is a" relationship triple to this data
for s, p, o in nkbGraph.triples((None, Literal('MaterialID'), None)):
    # add <row> a <material>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(material_iri)))
    # add <material_iri> <rdfs:label> "material"
    nkbGraph.add((URIRef(material_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('material')))

In [11]:
for column, iri in material_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        nkbGraph.add((s, URIRef(iri), o))
        nkbGraph.remove((s, p, o))

Other cases are more complex. Compare these two columns from the NaKnowBase:
* Supplier - who supplied a material (e.g. a chemical supply company or an in-house EPA lab)
* Outer Diameter Unit

Supplier is a simple column. It is a complete term on its own.
Outer Diameter Unit, however, carries context. It's not *just* the units of a generic measurement, it's the units of the *Outer Diameter*. It sits alongside 6 other columns that each hold a fact about the Outer Diameter (such as Outer Diameter Value, Outer Diameter Method, etc). These 7 columns all share an implicit fact: they describe the Outer Diameter.

You can check ontologies and easily find results for "Unit" or "Method", but you aren't likely to find one for "Outer Diameter Unit" or "Outer Diameter Method". It makes more sense, and it makes things easier, if we separate the implicit fact out of the columns. In these cases, we use **Nodebags**. First, we find a match for the shared theme of the columns. Outer Diameter easily matches to http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C124136 ("Outer Diameter"). Then, we match each of the simplified columns: instead of finding "Outer Diameter Unit" somewhere, we find "Unit" and make the match, and repeat for the other 6. Then, we feed the overarching term, the lesser terms, and the graph into the nodebag method defined below. It will perform the same replacements as the simple version above, but while maintaining this implied thematic grouping in the RDF. 

<a id='nodebag_change'></a>
NEW TO V2 - We have made a change to the logic within the nodebag method. Before, when nodebag was called, it would iterate through every relevant row, making all of the bags. Then, once all of the bags were made, it would attempt to populate them. This worked fine in many cases, but caused issues if a single row needed multiple bags. Now, nodebag makes a single bag and fills it before moving on to the next. How we use the nodebag method was not impacted by this change, but the new version in the block below should be used instead of the version present in V1. 

In [12]:
def nodebag(main_parameter_info, sub_parameter_info, rdfgraph):
    term, termiri = main_parameter_info
    starter_term = sub_parameter_info[0][0]
    # check if 'starter term' already in rdfgraph
    if (None, Literal(starter_term), None) in rdfgraph:
        # make a unique rdf "bag" for each row
        for s, p, o in rdfgraph.triples((None, Literal(starter_term), None)):

            # create unique bag for row
            rowbag = BNode()

            # <row> <parameter> <bag>
            rdfgraph.add((s, URIRef(termiri), rowbag))

            # bag description
            bagid = Literal(f"{term} data for entry: {s} ")
            rdfgraph.add((rowbag, URIRef('http://purl.org/dc/elements/1.1/description'), bagid))

            # <relation> <label> 'label'
            rdfgraph.add((URIRef(termiri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(term)))
            
            #NEW TO V2 - nested the entire below block within the bag-creation loop, instead of in line with the overarching if
            #before, we would make a bag for every row before filling any of them
            #with this change, we now make a bag, fill it, and then move on to the next bag
            
            # stuff the bag, looping thru parameter data
            for column, iri, label in sub_parameter_info:
                # add: <column_iri> rdfs:label 'label'
                rdfgraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
                # find the unique bag for each row
                #EDIT - below was original
                #for row, parameter, bag in rdfgraph.triples((None, URIRef(termiri), None)):
                #this variant should work when there's more than one bag per row (i.e. SD2)
                for row, parameter, bag in rdfgraph.triples((None, URIRef(termiri), rowbag)):
                    # loop thru all the pre-existing column relationships
                    for s, p, o in rdfgraph.triples((row, Literal(column), None)):
                        # add: <bag> <column_iri> <data>
                        rdfgraph.add((bag, URIRef(iri), o))
                        # remove: old <row> <column> <data>
                        rdfgraph.remove((s, p, o))


In [13]:
# Synthesis
synthesis_term = ('synthesis', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C61408')

synthesis_rdf = [
                  ('SynthesisMethod', 'http://purl.obolibrary.org/obo/CHMO_0001301', 'synthesis method'),
                  ('SynthesisDate', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25164', 'date'),
                 ]

In [14]:
nodebag(synthesis_term, synthesis_rdf, nkbGraph)

In [15]:
# Outer Diameter
outer_diameter_term = ('outer diameter', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C124136')

outer_diameter_rdf = [
                  ('OuterDiameterValue', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('OuterDiameterApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('OuterDiameterUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('OuterDiameterUncertainty', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('OuterDiameterLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low value'),
                  ('OuterDiameterHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high value'),
                  ('OuterDiameterMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
                     ]

In [16]:
nodebag(outer_diameter_term, outer_diameter_rdf, nkbGraph)

In [17]:
# Inner Diameter
inner_diameter_term = ('inner diameter', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C101685')

inner_diameter_rdf = [
                  ('InnerDiameterValue', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('InnerDiameterApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('InnerDiameterUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('InnerDiameterUncertainty', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('InnerDiameterLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
                  ('InnerDiameterHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                  ('InnerDiameterMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
                 ]

In [18]:
nodebag(inner_diameter_term, inner_diameter_rdf, nkbGraph)

In [19]:
# Length
length_term = ('length', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25334')

length_rdf = [
                  ('LengthValue', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('LengthApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('LengthUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('LengthUncertainty', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('LengthLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
                  ('LengthHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                  ('LengthMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
                 ]

In [20]:
nodebag(length_term, length_rdf, nkbGraph)

In [21]:
# Thickness
thickness_term = ('thickness', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C41145')

thickness_rdf = [
                  ('ThicknessValue', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('ThicknessApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('ThicknessUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('ThicknessUncertainty', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('ThicknessLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
                  ('ThicknessHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                  ('ThicknessMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
                 ]

In [22]:
nodebag(thickness_term, thickness_rdf, nkbGraph)

In [23]:
# Surface Area
surfacearea_term = ('surface area', 'http://semanticscience.org/resource/CHEMINF_000247')

surfacearea_rdf = [
                  ('SurfaceAreaValue', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('SurfaceAreaApproxSymbo', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('SurfaceAreaUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('SurfaceAreaUncertainty', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('SurfaceAreaLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
                  ('SurfaceAreaHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                  ('SurfaceAreaMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
                 ]

In [24]:
nodebag(surfacearea_term, surfacearea_rdf, nkbGraph)

<a id='metadata_bag'></a>
NEW TO V2 - Above, we have repeatedly dealt with a simple scenario: a series of columns follow a shared theme, so we put them in a bag labeled with that theme. Below, we deal with a more complicated bagging situation.

The NKB allows a row of data to report multiple sets of values on "Size Distribution". So, we have a column called "SDAvg" and, later, a second column called "SDAvg2". The other 5 columns in each set follow a similar pattern. When we convert these column names into ontology terms, they naturally resolve to the same thing: SDAvg and SDAvg2 both match to C37917, 'average'. If all 12 of these data points were put in the same bag, it would be impossible to tell which 6 went together. To keep things clear, we instead put each set in its own bag.

However, this solution introduces another problem. There are three columns that apply to **both** sets: SDType, SDModality, and SDMethod. As explained previously, we can't leave something like "Size Distribution Method" outside of a bag because it is a combination of the overarching "Size Distribution" and the actual information of "Method". But which bag do we put it in? It applies to the data in both bags, so an immediate response might be "both", but this has two issues. First, we would have to keep track of which bags this data needs to be inserted into until we reach the point of insertion. Second, it duplicates the data. Instead, we take the opposite approach: we put the data that applies to both bags into its own Size Distribution bag that only holds metadata.

In [25]:
# Size Distribution (1)
size_distribution_term = ('size distribution', 'http://purl.enanomapper.org/onto/ENM_8000292')

size_distribution_rdf = [
                  ('SDAvg', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C37917', 'average'),
                  ('SDApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('SDUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('SDUncertainty', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('SDLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
                  ('SDHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                 ]

In [26]:
nodebag(size_distribution_term, size_distribution_rdf, nkbGraph)

In [27]:
# Size Distribution (2)
size_distribution2_term = ('size distribution 2', 'http://purl.enanomapper.org/onto/ENM_8000292')

size_distribution2_rdf = [
                  ('SDAvg2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C37917', 'average'),
                  ('SDApproxSymbol2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('SDUnit2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('SDUncertainty2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('SDLow2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
                  ('SDHigh2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                 ]

In [28]:
nodebag(size_distribution2_term, size_distribution2_rdf, nkbGraph)

In [29]:
# Size Distribution Metadata
size_distribution_metadata_term = ('size distribution metadata', 'http://purl.enanomapper.org/onto/ENM_8000292')

size_distribution_metadata_rdf = [
    ('SDType', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'type'),
    ('SDModality', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C41147', 'modality'),
    ('SDMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
]

In [30]:
nodebag(size_distribution_metadata_term, size_distribution_metadata_rdf, nkbGraph)
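
The two numbered lists above differ only in the "2" appended to each column name, so they could be generated from a single template instead of being typed twice. The following is a minimal sketch of that idea; `numbered_set`, `size_distribution_template`, and the `NCIT` prefix constant are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): derive the
# (column, iri, label) list of a numbered set from one shared template.
NCIT = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#"

# Template for one Size Distribution set, using the IRIs from the cells above
size_distribution_template = [
    ("SDAvg", NCIT + "C37917", "average"),
    ("SDApproxSymbol", NCIT + "C54191", "symbol"),
    ("SDUnit", NCIT + "C68553", "unit"),
    ("SDUncertainty", NCIT + "C71478", "uncertainty"),
    ("SDLow", NCIT + "C42857", "low"),
    ("SDHigh", NCIT + "C42853", "high"),
]


def numbered_set(template, suffix):
    """Return a copy of the template with `suffix` appended to each column name."""
    return [(column + suffix, iri, label) for column, iri, label in template]


# An empty suffix reproduces the first set; "2" reproduces the second set
size_distribution_rdf = numbered_set(size_distribution_template, "")
size_distribution2_rdf = numbered_set(size_distribution_template, "2")
```

The resulting lists have the same shape as the hand-written ones, so they could be passed to nodebag unchanged. The metadata bag is still written by hand because its columns carry no numeric suffix.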

We also define a second version of nodebag: **iri_nodebag()**. In the Outer Diameter example, there is no chance for the term Outer Diameter to have been matched already because it does not exist alone as a column heading; it only exists in combination with other words. Because of this, it is safe to use Outer Diameter as the term for the bags. If the overarching term for a set of columns is ALSO a column itself and could have been matched automatically by the matcher() method, use iri_nodebag. For instance, the NKB has the columns:

* Purity
* Purity Approx Symbol
* Purity Unit
* Purity Method
* Purity Reference Chemical

Since Purity itself is a column, we use iri_nodebag.
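
The rule above can be written as a one-line check. This is only a sketch under the assumption that you have the table's column headings in a list; `needs_iri_nodebag` is a hypothetical helper, and the column lists are taken from the Purity and Hydrodynamic Diameter examples in this notebook.

```python
def needs_iri_nodebag(overarching_column, table_columns):
    """Return True when the overarching term is itself a column heading.

    In that case the matcher may already have assigned it an IRI, so the
    IRI-aware bagging function is the safer choice.
    """
    return overarching_column in table_columns


purity_columns = [
    "Purity", "PurityApproxSymbol", "PurityUnit",
    "PurityMethod", "PurityRefChemical",
]
hydrodynamic_columns = [
    "HydrodynamicDiameterValue", "HydrodynamicDiameterApproxSymbol",
    "HydrodynamicDiameterUnit", "HydrodynamicDiameterUncertainty",
    "HydrodynamicDiameterLow", "HydrodynamicDiameterHigh",
]

# "Purity" is a column on its own, so it calls for iri_nodebag
use_iri_for_purity = needs_iri_nodebag("Purity", purity_columns)
# "HydrodynamicDiameter" only appears inside longer headings, so nodebag suffices
use_iri_for_hydrodynamic = needs_iri_nodebag("HydrodynamicDiameter", hydrodynamic_columns)
```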

In [31]:
# 'purity' was already picked up by the matching algorithm
# so we need a different nodebag function to accurately map all the columns
# that may already have IRIs
def iri_nodebag(main_parameter_info, sub_parameter_info, rdfgraph, match_dict):
    term, termiri = main_parameter_info
    starter_term = sub_parameter_info[0][0]
    starter_url = sub_parameter_info[0][1]

    # check if 'starter term' already in rdfgraph
    if (None, Literal(starter_term), None) in rdfgraph:
        # make a unique rdf "bag" for each row
        for s, p, o in rdfgraph.triples((None, Literal(starter_term), None)):

            # create unique bag for row
            rowbag = BNode()

            # <row> <parameter> <bag>
            rdfgraph.add((s, URIRef(termiri), rowbag))

            # bag description
            bagid = Literal(f"{term} data for entry: {s} ")
            rdfgraph.add((rowbag, URIRef('http://purl.org/dc/elements/1.1/description'), bagid))

            # <relation> <label> 'label'
            rdfgraph.add((URIRef(termiri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(term)))
    else:
        # lookup column iri in matches
        if term_lookup(match_dict, starter_term):
            starter_iri = term_lookup(match_dict, starter_term)[1]
        else:
            starter_iri = starter_url
            
        # make a unique rdf "bag" for each row
        for s, p, o in rdfgraph.triples((None, URIRef(starter_iri), None)):
            # create unique bag for row
            rowbag = BNode()

            # <row> <parameter> <bag>
            rdfgraph.add((s, URIRef(termiri), rowbag))

            # bag description
            bagid = Literal(f"{term} data for entry: {s} ")
            rdfgraph.add((rowbag, URIRef('http://purl.org/dc/elements/1.1/description'), bagid))

            # <relation> <label> 'label'
            rdfgraph.add((URIRef(termiri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(term)))
            
    # stuff the bag, looping thru parameter data
    for column, iri, label in sub_parameter_info:
        # NOTE: this part of the function may pick up IRI duplicates.
        #       It is probably protected by the node system, but still:
        #       should we avoid running remove(s, p, o) here?
               
        # lookup column iri in matches
        if term_lookup(match_dict, column):
            column_iri = term_lookup(match_dict, column)[1]
        else:
            column_iri = iri
                    
        # add: <column_iri> rdfs:label 'label'
        rdfgraph.add((URIRef(column_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
        # find the unique bag for each row
        for row, parameter, bag in rdfgraph.triples((None, URIRef(termiri), None)):
            # check if the column name is still used as a literal predicate in rdfgraph
            if (None, Literal(column), None) in rdfgraph:
                # loop thru all the pre-existing column relationships
                for s, p, o in rdfgraph.triples((row, Literal(column), None)):
                    # add: <bag> <column_iri> <data>
                    rdfgraph.add((bag, URIRef(iri), o))
                    # remove: old <row> <column> <data>
                    rdfgraph.remove((s, p, o))
            else:
                # loop thru all the pre-existing column relationships
                for s, p, o in rdfgraph.triples((row, URIRef(column_iri), None)):
                    # add: <bag> <column_iri> <data>
                    rdfgraph.add((bag, URIRef(iri), o))
                    # remove: old <row> <column> <data>
                    rdfgraph.remove((s, p, o))

In [32]:
# Purity
purity_term = ('purity', 'http://purl.bioontology.org/ontology/npo#NPO_1345')

purity_rdf = [
                  ('Purity', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('PurityApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('PurityUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('PurityMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
                  # 'reference chemical' needs an iri
                  ('PurityRefChemical', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C48807', 'chemical'),
                 ]

In [33]:
iri_nodebag(purity_term, purity_rdf, nkbGraph, materials_match[0])

In [34]:
# Hydrodynamic Diameter
hydrodynamic_term = ('hydrodynamic diameter', 'http://purl.bioontology.org/ontology/npo#NPO_1915')

hydrodynamic_rdf = [
              ('HydrodynamicDiameterValue', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
              ('HydrodynamicDiameterApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
              ('HydrodynamicDiameterUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
              ('HydrodynamicDiameterUncertainty', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
              ('HydrodynamicDiameterLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
              ('HydrodynamicDiameterHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                 ]

In [35]:
nodebag(hydrodynamic_term, hydrodynamic_rdf, nkbGraph)

In [36]:
# Hydrodynamic Diameter (2)
hydrodynamic2_term = ('hydrodynamic diameter 2', 'http://purl.bioontology.org/ontology/npo#NPO_1915')

hydrodynamic2_rdf = [
              ('HydrodynamicDiameterValue2', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
              ('HydrodynamicDiameterApproxSymbol2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
              ('HydrodynamicDiameterUnit2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
              ('HydrodynamicDiameterUncertainty2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
              ('HydrodynamicDiameterLow2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
              ('HydrodynamicDiameterHigh2', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                 ]

In [37]:
nodebag(hydrodynamic2_term, hydrodynamic2_rdf, nkbGraph)

In [38]:
hydrodynamic_metadata_term = ('hydrodynamic diameter metadata', 'http://purl.bioontology.org/ontology/npo#NPO_1915')

hydrodynamic_metadata_rdf = [
              ('HydrodynamicDiameterMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
]

In [39]:
nodebag(hydrodynamic_metadata_term, hydrodynamic_metadata_rdf, nkbGraph)

In [40]:
# Surface Charge
surface_charge_term = ('surface charge', 'http://purl.bioontology.org/ontology/npo#NPO_1812')

surface_charge_rdf = [
                  ('SurfaceChargeType', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'type'),
                  ('ChargeAvg', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C37917', 'average'),
                  ('ChargeApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('ChargeMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
                  ('ChargeUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('ChargeUncertain', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('ChargeLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low'),
                  ('ChargeHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high'),
                 ]

In [41]:
nodebag(surface_charge_term, surface_charge_rdf, nkbGraph)

<a id='foreign_key_tip'></a>
NEW TO V2 - In the NKB, some of the keys that link rows of data between two tables are composite keys of the form (DOI, rowID). If you have foreign composite keys in your data, consider using bags for them. Note that they have that "overarching concept" we look for: in this case, the table they came from.  It is not necessary to bag a primary composite key because the parts of that key are, individually, complete facts about the row we're working on.

Below, we bag the ID and DOI of the Medium referenced by each row. We do this manually, for educational purposes, but the nodebag method would also work. **The nodebag method is strongly recommended** because it makes it easier to verify your work and reduces the risk of user error. 

In [42]:
# Medium ID
for s, p, o in nkbGraph.triples((None, Literal('medium_MediumID'), None)):
    # create unique bag for row
    medium_bag = BNode()
    # <row> <parameter> <bag>
    nkbGraph.add((s, URIRef('http://purl.bioontology.org/ontology/npo#NPO_1853'), medium_bag))
    # <medium_node> <id> "id"
    nkbGraph.add((medium_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
    #attempt to add the medium_DOI to the bag
    #first, find the relevant row
    #same first part as the ID to match the row, second part is the column, third we wildcard
    for x, y, z in nkbGraph.triples((s, Literal('medium_publication_DOI'), None)):
        # <bag> <DOI> "doi"
        nkbGraph.add((medium_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
        #then remove the old
        nkbGraph.remove((x, y, z))
    nkbGraph.remove((s, p, o))
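
For comparison, the inputs for the recommended nodebag route might look like the sketch below. It assumes nodebag accepts the same `(term, iri)` pair and `(column, iri, label)` list used in the earlier cells; the variable names `medium_key_term` and `medium_key_rdf` are hypothetical, and the IRIs are the ones used in the manual loop above.

```python
# Overarching concept for the bag: the table the foreign key points to
medium_key_term = ('medium', 'http://purl.bioontology.org/ontology/npo#NPO_1853')

# The two parts of the foreign composite key, one tuple per column
medium_key_rdf = [
    ('medium_MediumID', 'http://purl.org/dc/terms/identifier', 'id'),
    ('medium_publication_DOI', 'http://purl.obolibrary.org/obo/OBI_0002110', 'doi'),
]

# nodebag and nkbGraph are defined earlier in the notebook, so the call is
# left commented out to keep this cell runnable on its own:
# nodebag(medium_key_term, medium_key_rdf, nkbGraph)
```

Run either the manual loop or the nodebag call, not both: the first one to run removes the original `medium_MediumID` triples that the other would search for.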

<hr>

## RDF Conversion
ASSAY TABLE

In [43]:
# CSV import
assay_csv = load_data("fullCSV/assay")

# CSV to dataframe
assay_dfs = table_from_file("fullCSV/assay")

In [44]:
# run matcher 
assay_match, assay_unmatch = matcher(onto, assay_csv, context=True)


number unmatched terms in AssayID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in publication_DOI: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in AssayType: 11

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold

In [45]:
# load basic term RDF
basic_rdf(assay_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(assay_dfs, assay_match[0], nkbGraph)

 IMPORTANT: AssayID needs manual curation
 IMPORTANT: publication_DOI needs manual curation
 IMPORTANT: AssayType needs manual curation
 IMPORTANT: AssayName needs manual curation
 IMPORTANT: medium_MediumID needs manual curation
 IMPORTANT: medium_publication_DOI needs manual curation
 IMPORTANT: material_MaterialID needs manual curation
 IMPORTANT: material_publication_DOI needs manual curation


## Mapping:
####      ASSAY

<a id='table_check'></a>
NEW TO V2 - Since we started with the material table, it was the only data in the graph when we started remapping the terms. Now that we're moving on to another table, the graph is holding data from multiple sources, so we have to make sure that we are only remapping data from the table we are trying to work on.

Why? Imagine that some term goes unmatched in multiple tables. For instance, what if "volume" in publication (where it means "a collection of issues of a publication") and "volume" in parameters (where it means "how much a container can hold") both go unmatched? When only one table was in the graph, we could have used the following code to find every triple in which "volume" needed replacing once we were ready to manually curate the term.

column = "volume" <br>
for s, p, o in nkbGraph.triples((None, Literal(column), None)):

But, if the graph has data from multiple tables, this is going to return triples from **both** tables, and we have no way to know which is which. Whichever term we map to first will be applied to all of the triples with volume, making half of them wrong! We want to make sure we only edit triples from one table at a time, so we add an additional line. 

column = "volume" <br>
publication_iri = 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C48471' <br>
for s, p, o in nkbGraph.triples((None, Literal(column), None)):
<br> &emsp;       if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(publication_iri)) in nkbGraph:

So, when the "for" statement finds a triple where p is "volume", the "if" statement looks at the s from the front of that triple and makes sure that there is another triple in the graph that says s came from the publication table (using an [isA relationship](#isA_relationship) we assign like this when adding the rows in the first place).
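
The effect of the extra check can be seen on a toy graph. The sketch below models the graph as a plain Python set of `(s, p, o)` tuples rather than an rdflib Graph, so it is a simplified stand-in and not the notebook's actual data structure. The row names, the values, and the `PARAMETERS` IRI are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PUBLICATION = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C48471"
PARAMETERS = "http://example.com/parameters"  # hypothetical table IRI

# Toy graph: one row from each table, both with an unmatched "volume" column
graph = {
    ("pub_row_1", RDF_TYPE, PUBLICATION),
    ("pub_row_1", "volume", "12"),
    ("param_row_1", RDF_TYPE, PARAMETERS),
    ("param_row_1", "volume", "250 mL"),
}


def triples_for_table(graph, column, table_iri):
    """Return the triples for `column` whose subject is typed as `table_iri`."""
    return sorted(
        (s, p, o)
        for s, p, o in graph
        if p == column and (s, RDF_TYPE, table_iri) in graph
    )


# Without the type check, rows from both tables come back
unfiltered = sorted(t for t in graph if t[1] == "volume")
# With the type check, only the publication row is returned
filtered = triples_for_table(graph, "volume", PUBLICATION)
```

Here `unfiltered` holds two triples while `filtered` holds only the one from the publication row, which is the set we actually want to remap.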

In [46]:
assay_iri = 'http://purl.obolibrary.org/obo/OBI_0000070'

assay_edits = [
                  ('AssayID', 'http://purl.org/dc/terms/identifier', 'id'),
                  ('publication_DOI', 'http://purl.obolibrary.org/obo/OBI_0002110', 'doi'),
                  ('AssayType', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25372', 'category'),
                  ('AssayName', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42614', 'name'),
                 ]

In [47]:
for s, p, o in nkbGraph.triples((None, Literal('AssayID'), None)):
    # add <row> a <assay>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(assay_iri)))
    # add <assay_iri> <rdfs:label> "assay"
    nkbGraph.add((URIRef(assay_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('assay')))
    
for column, iri, label in assay_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        #NEW TO V2 - this "if" statement verifies that the triple is from our assay data
        #this pattern is ubiquitous, so further instances will not be noted
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(assay_iri)) in nkbGraph:
            # add <row> <column_iri> "data"
            nkbGraph.add((s, URIRef(iri), o))
            # add <column_iri> <rdfs:label> "label"
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
            # remove old triple
            nkbGraph.remove((s, p, o))

In [48]:
# Medium ID 
for s, p, o in nkbGraph.triples((None, Literal('medium_MediumID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(assay_iri)) in nkbGraph:
        # create unique bag for row
        medium_bag = BNode()
        # <row> <parameter> <bag>
        nkbGraph.add((s, URIRef('http://purl.bioontology.org/ontology/npo#NPO_1853'), medium_bag))
        # <medium_node> <id> "id"
        nkbGraph.add((medium_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        #attempt to add the medium_DOI to the bag
        #first, find the relevant row
        #same first part as the ID to match the row, second part is the column, third we wildcard
        for x, y, z in nkbGraph.triples((s, Literal('medium_publication_DOI'), None)):
            # <bag> <DOI> "doi"
            nkbGraph.add((medium_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
            #then remove the old
            nkbGraph.remove((x, y, z))
        nkbGraph.remove((s, p, o))

# Material ID
for s, p, o in nkbGraph.triples((None, Literal('material_MaterialID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(assay_iri)) in nkbGraph:
        # create unique bag for row
        material_bag = BNode()
        # <row> <parameter> <bag>
        nkbGraph.add((s, URIRef('http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C93400'), material_bag))
        # <material_node> <id> "id"
        nkbGraph.add((material_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        #attempt to add the material_DOI to the bag
        #first, find the relevant row
        #same first part as the ID to match the row, second part is the column, third we wildcard
        for x, y, z in nkbGraph.triples((s, Literal('material_publication_DOI'), None)):
            # <bag> <DOI> "doi"
            nkbGraph.add((material_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
            #then remove the old
            nkbGraph.remove((x, y, z))
        nkbGraph.remove((s, p, o))

## RDF Conversion
PUBLICATION TABLE

### Setup:
   ######     PUBLICATION

In [49]:
# CSV import
publication_csv = load_data("fullCSV/publication//")

# CSV to dataframe
publication_dfs = table_from_file("fullCSV/publication//")

In [50]:
display(publication_dfs)

{'publication.csv':                                DOI  \
 0               10.1002/cyto.22342   
 1             10.1002/cyto.a.20927   
 2             10.1002/cyto.a.22793   
 3                 10.1002/em.21848   
 4                 10.1002/etc.1858   
 ..                             ...   
 124        10.2134/jeq2016.12.0485   
 125   10.3109/17435390.2010.539711   
 126   10.3109/17435390.2013.858794   
 127  10.3109/17435390.2015.1107142   
 128                 No Publication   
 
                                               PubTitle  Year  \
 0    Detection of silver nanoparticles in cells by ...  2013   
 1    Detection of TiO2 nanoparticles in cells by fl...  2010   
 2    Characterization, detection, and counting of m...  2015   
 3    Cellular interactions and biological responses...  2014   
 4    phototoxicity of TiO2 nanoparticles under sola...  2012   
 ..                                                 ...   ...   
 124  Aging of Dissolved Copper and Copper-based Nan... 

In [51]:
# run matcher
publication_match, publication_unmatch = matcher(onto, publication_csv, context=True)


number unmatched terms in DOI: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 1
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in PubTitle: 88

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in Year: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 1
threshold value: 90
RUN: 

In [52]:
# load basic term RDF
basic_rdf(publication_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(publication_dfs, publication_match[0], nkbGraph)

 IMPORTANT: PubTitle needs manual curation
 IMPORTANT: PageStart needs manual curation
 IMPORTANT: PageEnd needs manual curation
 IMPORTANT: Keywords needs manual curation
 IMPORTANT: FirstAuthor needs manual curation


### Mapping:
#####      PUBLICATION

While many of the columns in publication were matched, the column names were also often words with multiple definitions (for instance, a Volume could refer to a trait of a container or an entry in a series of publications). We manually review the automated mappings and override them as needed.

In [53]:
publication_iri = 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C48471'

publication_edits = [
                  ('DOI', 'http://purl.obolibrary.org/obo/OBI_0002110', 'doi'),
                  ('PubTitle', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42774', 'title'),
                  ('Year', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C29848', 'year'),
                  ('Journal', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976', 'journal'),
                  # no IRI found for this context of volume, using 'publication'
                  ('Volume', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C43320', 'volume'),
                  ('Issue', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C43415', 'issue'),
                  ('PageStart', 'http://bioontology.org/projects/ontologies/birnlex#birnlex_2394', 'start page'),
                  ('PageEnd', 'http://bioontology.org/projects/ontologies/birnlex#birnlex_2395', 'end page'),
                  ('Keywords', 'http://purl.obolibrary.org/obo/IAO_0000630', 'keywords section'),
                  ('Abstract', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C60765', 'abstract'),
                  ('FirstAuthor', 'http://purl.obolibrary.org/obo/MS_1002034', 'first author'),
                  ('Correspondence', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C97019', 'correspondence'),
                  ('Affiliation', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25412', 'affiliation'),
                 ]

<a id='preexisting_check'></a>
NEW TO V2 - There is a risk, when mapping terms, that you may cause an accidental collision. When we change a mapping, we find the triple we want to change, add a version of it with the change made, and then delete the original. Graphs can't hold duplicate triples. If you try to add a triple that is already in the graph, nothing happens. If you continue and delete the original, you've lost data! Let's look at an example.

Imagine you have a graph with two triples in it. <br>
Graph: { (s, p, o), (s, p, r) }.<br>
Let's say that you decide the term r is wrong, and you'd rather use o.<br>
You run a search for any triples using r and it returns (s, p, r).<br>
To make the change, you add a version of that triple with the desired change made: (s, p, o)<br>
Graph: { (s, p, o), (s, p, r) }. Note that nothing was actually added since (s, p, o) was already in the graph.<br>
Now, to finish the "change", you remove the unmodified version: (s, p, r). <br>
Graph: { (s, p, o) }.<br>
Instead of modifying something, all you actually did was delete it!

Protecting against this is simple. We add an "if" statement. If the modified version of the triple is already in the graph, don't add or delete anything! Simply print out a warning to the curator that something seems suspicious. It is very unlikely you will have two terms that really do map to the same thing in such a way as to cause duplicate triples, so whenever it occurs, it's a good sign that something was mismapped. 
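
The difference between the unguarded and guarded versions can be reproduced on a toy graph. As a simplification, the sketch below uses a plain Python set of `(s, p, o)` tuples in place of an rdflib Graph; a set also silently ignores duplicate additions, which is the behavior that causes the problem. The function names, the row name, and the journal value are hypothetical.

```python
JOURNAL_IRI = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976"


def naive_remap(graph, old_predicate, new_predicate):
    """Add the remapped triple, then delete the original, with no safety check."""
    for s, p, o in list(graph):  # iterate over a snapshot because we mutate the set
        if p == old_predicate:
            graph.add((s, new_predicate, o))  # no-op if the triple already exists
            graph.remove((s, p, o))


def guarded_remap(graph, old_predicate, new_predicate):
    """Remap only when the target triple is absent; return the skipped triples."""
    skipped = []
    for s, p, o in list(graph):
        if p != old_predicate:
            continue
        if (s, new_predicate, o) in graph:
            # suspicious: leave both triples alone and report to the curator
            skipped.append((s, p, o))
            continue
        graph.add((s, new_predicate, o))
        graph.remove((s, p, o))
    return skipped


# The same value is already present under both the literal column and the IRI
start = {
    ("row1", "Journal", "Example Journal"),
    ("row1", JOURNAL_IRI, "Example Journal"),
}

naive_graph = set(start)
naive_remap(naive_graph, "Journal", JOURNAL_IRI)

guarded_graph = set(start)
skipped = guarded_remap(guarded_graph, "Journal", JOURNAL_IRI)
```

After the naive version runs, one of the two triples is gone without any notice. The guarded version leaves the graph untouched and returns the triple it refused to remap, so the curator can decide what went wrong.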

In [54]:
# first use an IRI that is DEFINITELY not matched already
# to type every <example.com/publication>

if (None, Literal('PubTitle'), None) in nkbGraph:
    # if it is, loop every triple with the column predicate, and
    for s, p, o in nkbGraph.triples((None, Literal('PubTitle'), None)):
        
        # add <row> a <publication>
        nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(publication_iri)))
        # add <publication_iri> <rdfs:label> "publication"
        nkbGraph.add((URIRef(publication_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('publication')))

# now loop the list of (column, iri, label) for the publication table
for column, iri, label in publication_edits:
    
    # check for triples with the column literal predicate 
    if (None, Literal(column), None) in nkbGraph:
        # if present, loop every triple with the column predicate, and
        for s, p, o in nkbGraph.triples((None, Literal(column), None)):
            # make sure subject rdf:type is <publication>
            if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(publication_iri)) in nkbGraph:
                #NEW TO V2
                #make sure we aren't adding something that already exists
                if (s, URIRef(iri), o) in nkbGraph:
                    print("We already have this triple, so it is unwise to add it and then delete it...")
                    print(column)
                    print(iri)
                else:
                    # add <row> <column_iri> "data"
                    nkbGraph.add((s, URIRef(iri), o))
                    # add <column_iri> <rdfs:label> "label"
                    nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
                    # remove old triple
                    nkbGraph.remove((s, p, o))
            
    else:
        # if it is not, it is already matched, or nonexistent
        if term_lookup(publication_match[0], column):
            # set the search predicate to the column iri in match dictionary
            col_iri = term_lookup(publication_match[0], column)[1]
        else:
            col_iri = publication_iri
        
        # loop every triple with the column IRI predicate, and 
        for s, p, o in nkbGraph.triples((None, URIRef(col_iri), None)):
            # make sure subject rdf:type is <publication>
            if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(publication_iri)) in nkbGraph:
                if (s, URIRef(iri), o) in nkbGraph:
                    print("We already have this triple, so it is unwise to add it and then delete it...")
                    print(column)
                    print(iri)
                else:
                    # add <row> <column_iri> "data"
                    nkbGraph.add((s, URIRef(iri), o))
                    # add <column_iri> <rdfs:label> "label"
                    nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
                    # remove old triple
                    nkbGraph.remove((s, p, o))

We already have this triple, so it is unwise to add it and then delete it...
Journal
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976
We already have this triple, so it is unwise to add it and then delete it...
Journal
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976
We already have this triple, so it is unwise to add it and then delete it...
Journal
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976
We already have this triple, so it is unwise to add it and then delete it...
Journal
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976
We already have this triple, so it is unwise to add it and then delete it...
Journal
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976
We already have this triple, so it is unwise to add it and then delete it...
Journal
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976
We already have this triple, so it is unwise to add it and then delete it...
Journal
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C40976

We already have this triple, so it is unwise to add it and then delete it...
Correspondence
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C97019
We already have this triple, so it is unwise to add it and then delete it...
Correspondence
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C97019
We already have this triple, so it is unwise to add it and then delete it...
Correspondence
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C97019
We already have this triple, so it is unwise to add it and then delete it...
Correspondence
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C97019
We already have this triple, so it is unwise to add it and then delete it...
Correspondence
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C97019
We already have this triple, so it is unwise to add it and then delete it...
Correspondence
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C97019
We already have this triple, so it is unwise to add it and then delete it...
Correspondence
http://n

## RDF Conversion
MEDIUM TABLE

### Setup:
   ######     MEDIUM

In [55]:
# CSV import
medium_csv = load_data("fullCSV/medium//")

# CSV to dataframe
medium_dfs = table_from_file("fullCSV/medium//")

In [56]:
# run matcher
medium_match, medium_unmatch = matcher(onto, medium_csv, context=True)


number unmatched terms in MediumID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in publication_DOI: 2

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in MediumDescription: 44

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 9


In [57]:
# load basic term RDF
basic_rdf(medium_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(medium_dfs, medium_match[0], nkbGraph)

 IMPORTANT: MediumID needs manual curation
 IMPORTANT: publication_DOI needs manual curation
 IMPORTANT: MediumDescription needs manual curation


### Mapping:
#####      MEDIUM

In [58]:
medium_iri = 'http://purl.bioontology.org/ontology/npo#NPO_1853'

medium_edits = [
                  ('MediumID', 'http://purl.org/dc/terms/identifier', 'id'),
                  ('MediumDescription', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25365', 'description'),
                  ('publication_DOI', 'http://purl.obolibrary.org/obo/OBI_0002110', 'doi'),
                 ]

In [59]:
for s, p, o in nkbGraph.triples((None, Literal('MediumID'), None)):
    # add <row> a <medium>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(medium_iri)))
    # add <medium_iri> <rdfs:label> "medium"
    nkbGraph.add((URIRef(medium_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('medium')))

for column, iri, label in medium_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(medium_iri)) in nkbGraph:
            # add <row> <column_iri> "data"
            nkbGraph.add((s, URIRef(iri), o))
            # add <column_iri> <rdfs:label> "label"
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
            # remove old triple
            nkbGraph.remove((s, p, o))

## RDF Print
### Primary Tables

<hr>

In [60]:
# if you want to print only the primary tables
#nkbGraph.serialize(format="turtle", destination="FirstFix_nkb_rdf_complete_9_19_partial.ttl")

<hr>

## SPARQL 
### Primary Tables
Here, we see some sample queries on the RDF product so far.

#####      Publication

In [61]:
nkbQry = nkbGraph.query("""
            SELECT DISTINCT ?doi ?keywords WHERE {

               ?publication a <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C48471> ;
                              <http://purl.obolibrary.org/obo/OBI_0002110> ?doi ;
                              <http://purl.obolibrary.org/obo/IAO_0000630> ?keywords .

            }""")



tsm = pd.DataFrame(nkbQry.bindings)
tsm.columns = tsm.columns.str.strip()

In [62]:
tsm

| | doi | keywords |
|---|---|---|
| 0 | 10.1002/cyto.22342 | flow cytometry; silver nanoparticles; silver; ... |
| 1 | 10.1002/cyto.a.20927 | nanoparticles; side scatter; titanium dioxide;... |
| 2 | 10.1002/cyto.a.22793 | flow cytometry; nanoparticles; submicron parti... |
| 3 | 10.1002/em.21848 | micronucleus; comet assay; flow cytometry |
| 4 | 10.1002/etc.1858 | nano-tio2; phototoxicity; simulated solar radi... |
| ... | ... | ... |
| 124 | 10.2134/jeq2016.12.0485 | |
| 125 | 10.3109/17435390.2010.539711 | nanomaterial; tio2; ceo2; immuno-spin trapping... |
| 126 | 10.3109/17435390.2013.858794 | bioaccumulation; bioavailability; polychlorina... |
| 127 | 10.3109/17435390.2015.1107142 | nanoparticles; nanotoxicology; particle toxico... |


#####      Medium

In [63]:
nkbQry = nkbGraph.query("""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

            SELECT DISTINCT ?doi ?type ?id ?type_label WHERE {

               ?medium a <http://purl.bioontology.org/ontology/npo#NPO_1853> ;
                         <http://purl.obolibrary.org/obo/OBI_0002110> ?doi ;
                         <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25365> ?type ;
                         <http://purl.org/dc/terms/identifier> ?id .

                OPTIONAL {
                    ?type rdfs:label ?type_label .
                }

            }""")



tsm = pd.DataFrame(nkbQry.bindings)
tsm.columns = tsm.columns.str.strip()

In [64]:
tsm

| | doi | id | type | type_label |
|---|---|---|---|---|
| 0 | 10.1002/cyto.22342 | 1 | dulbecco's modified eagle medium / ham's f12 m... | |
| 1 | 10.1002/cyto.a.20927 | 1 | dulbecco's modified eagle medium / ham's f12 m... | |
| 2 | 10.1002/cyto.a.22793 | 1 | http://purl.bioontology.org/ontology/npo#NPO_1848 | deionized water |
| 3 | 10.1002/em.21848 | 1 | http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus... | dulbecco's modified eagle medium |
| 4 | 10.1002/etc.1858 | 1 | moderately hard reconstituted water | |
| ... | ... | ... | ... | ... |
| 250 | 10.1016/j.envpol.2013.10.011 | 8 | http://purl.obolibrary.org/obo/ENVO_00002006 | liquid water |
| 251 | 10.1021/es400483k | 8 | wastewater | |
| 252 | 10.1021/es4015844 | 8 | http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus... | orange juice |
| 253 | 10.1021/es4015844 | 9 | http://purl.obolibrary.org/obo/ENVO_00002006 | liquid water |


#####      Material & Assay

In [65]:
'''
nkbQry = nkbGraph.query("""
            PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
            
            SELECT DISTINCT ?assay_doi ?name ?core WHERE {{
                
                # ASSAY info and material_id
                 ?assay <http://purl.obolibrary.org/obo/OBI_0002110> ?assay_doi ;
                        <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42614> ?name ;
                        <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C93400> ?mat_bag .
                      
                 ?mat_bag <http://purl.org/dc/terms/identifier> ?assay_material_id .
               
               # MATERIAL info
                 ?material <http://purl.obolibrary.org/obo/OBI_0002110> ?mat_doi ;
                           <http://purl.bioontology.org/ontology/npo#NPO_1808> ?core ;
                           <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C93400> ?material_id .
               
               FILTER(?assay_doi = ?mat_doi)
               FILTER(xsd:integer(?material_id) = xsd:integer(?assay_material_id))
            }}""")



nkbTbls = pd.DataFrame(nkbQry.bindings)
nkbTbls.columns = nkbTbls.columns.str.strip()
'''


In [66]:
#nkbTbls

## RDF Conversion, Part 2
### Secondary Tables

- **Results**
- **Parameters**
- **MaterialFG**
- **Additives**
- **Contam**
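
Each of these secondary tables is handled the same way: rows get a type IRI based on an ID column, and a foreign-key column links each row to a parent table. As a reading aid, the sketch below collects those values in one place. The `TableSpec` class and the `SECONDARY_TABLES` dictionary are hypothetical and are not used by the notebook; the column names and IRIs are the ones that appear in the mapping cells of this part.

```python
from dataclasses import dataclass

NCIT = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#"
NPO = "http://purl.bioontology.org/ontology/npo#"
ASSAY = "http://purl.obolibrary.org/obo/OBI_0000070"


@dataclass(frozen=True)
class TableSpec:
    id_column: str      # column whose presence marks a row of this table
    row_iri: str        # type IRI assigned to each row
    parent_column: str  # foreign-key column pointing at the parent table
    parent_iri: str     # predicate IRI used to link the row to its parent "bag"


SECONDARY_TABLES = {
    "result": TableSpec("ResultID", "http://www.bioassayontology.org/bao#BAO_0000179",
                        "assay_AssayID", ASSAY),
    "parameters": TableSpec("ParametersID", NPO + "NPO_1680",
                            "assay_AssayID", ASSAY),
    "materialfg": TableSpec("MaterialFGID", NPO + "NPO_174",
                            "material_MaterialID", NCIT + "C93400"),
    "additive": TableSpec("AdditiveID", NCIT + "C63495",
                          "medium_MediumID", NPO + "NPO_1853"),
    "contam": TableSpec("ContamID", NCIT + "C84280",
                        "material_MaterialID", NCIT + "C93400"),
}

# group table names by the parent column they reference
by_parent = {}
for name, spec in SECONDARY_TABLES.items():
    by_parent.setdefault(spec.parent_column, []).append(name)
print(by_parent)
```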

## RDF Conversion
RESULTS TABLE

### Setup:
   ######     RESULTS

In [67]:
# CSV import
result_csv = load_data("fullCSV/result//")

# CSV to dataframe
result_dfs = table_from_file("fullCSV/result//")

# run matcher
result_match, result_unmatch = matcher(onto, result_csv, context=True)

# load basic term RDF
basic_rdf(result_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(result_dfs, result_match[0], nkbGraph)


number unmatched terms in ResultID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ResultType: 469

RUN: EDAM_dev.owl | exact_binary 
matches: 4
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 26
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 27
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ResultDetails: 376

RUN: EDAM_dev.owl | exact_binary 
matches: 1
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 3
...

### Mapping:
#####      RESULT

In [68]:
result_iri = 'http://www.bioassayontology.org/bao#BAO_0000179'

assay_iri = 'http://purl.obolibrary.org/obo/OBI_0000070'

result_edits = [
                  ('ResultID', 'http://purl.org/dc/terms/identifier', 'id'),
                  ('ResultType', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42614', 'name'),
                  ('ResultDetails', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25480', 'details'),
                  ('ResultValue', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('ResultApproxSymbol', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C54191', 'symbol'),
                  ('ResultUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('ResultUncertainty', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71478', 'uncertainty'),
                  ('ResultLow', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42857', 'low value'),
                  ('ResultHigh', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42853', 'high value'), 
                 ]

In [69]:
for s, p, o in nkbGraph.triples((None, Literal('ResultID'), None)):
    # add <row> a <result>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(result_iri)))
    # add <result_iri> <rdfs:label> "result"
    nkbGraph.add((URIRef(result_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('result')))

# Assay ID
for s, p, o in nkbGraph.triples((None, Literal('assay_AssayID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(result_iri)) in nkbGraph:
        # create unique bag for row
        assay_bag = BNode()
        # <row> <assay_iri> <bag>
        nkbGraph.add((s, URIRef(assay_iri), assay_bag))
        # <bag> <id> "id"
        nkbGraph.add((assay_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        # move the assay DOI into the bag as well:
        # match the same row (s) and the DOI column, leaving the object as a wildcard
        for x, y, z in nkbGraph.triples((s, Literal('assay_publication_DOI'), None)):
            # <bag> <DOI> "doi"
            nkbGraph.add((assay_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
            #then remove the old
            nkbGraph.remove((x, y, z))
        nkbGraph.remove((s, p, o))
    
# result columns
for column, iri, label in result_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(result_iri)) in nkbGraph:
            # add <row> <column_iri> "data"
            nkbGraph.add((s, URIRef(iri), o))
            # add <column_iri> <rdfs:label> "label"
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
            # remove old triple
            nkbGraph.remove((s, p, o))

## RDF Conversion
PARAMETERS TABLE

### Setup:
   ######     PARAMETERS

In [70]:
# CSV import
parameters_csv = load_data("fullCSV/parameters//")

# CSV to dataframe
parameters_dfs = table_from_file("fullCSV/parameters//")

# run matcher
parameters_match, parameters_unmatch = matcher(onto, parameters_csv, context=True)

# load basic term RDF
basic_rdf(parameters_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(parameters_dfs, parameters_match[0], nkbGraph)


number unmatched terms in ParametersID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ParameterName: 269

RUN: EDAM_dev.owl | exact_binary 
matches: 3
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 22
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 16
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 2
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ParameterNumberValue: 9

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
...

### Mapping:
#####      PARAMETER

In [71]:
parameter_iri = 'http://purl.bioontology.org/ontology/npo#NPO_1680'

assay_iri = 'http://purl.obolibrary.org/obo/OBI_0000070'

parameter_edits = [
                  ('ParametersID', 'http://purl.org/dc/terms/identifier', 'id'),
                  ('ParameterName', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42614', 'name'),
                  ('ParameterNumberValue', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('ParameterUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  # NOTE: using IRI for "text" as this describes data well
                  ('ParameterNonNumberValue', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25704', 'text'),
                 ]

In [72]:
for s, p, o in nkbGraph.triples((None, Literal('ParametersID'), None)):
    # add <row> a <parameter>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(parameter_iri)))
    # add <parameter_iri> <rdfs:label> "parameter"
    nkbGraph.add((URIRef(parameter_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('parameter')))

# Assay ID
for s, p, o in nkbGraph.triples((None, Literal('assay_AssayID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(parameter_iri)) in nkbGraph:
        # create unique bag for row
        assay_bag = BNode()
        # <row> <assay_iri> <bag>
        nkbGraph.add((s, URIRef(assay_iri), assay_bag))
        # <bag> <id> "id"
        nkbGraph.add((assay_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        # move the assay DOI into the bag as well:
        # match the same row (s) and the DOI column, leaving the object as a wildcard
        for x, y, z in nkbGraph.triples((s, Literal('assay_publication_DOI'), None)):
            # <bag> <DOI> "doi"
            nkbGraph.add((assay_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
            #then remove the old
            nkbGraph.remove((x, y, z))
        nkbGraph.remove((s, p, o))
    
# parameter columns
for column, iri, label in parameter_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(parameter_iri)) in nkbGraph:
            # add <row> <column_iri> "data"
            nkbGraph.add((s, URIRef(iri), o))
            # add <column_iri> <rdfs:label> "label"
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
            # remove old triple
            nkbGraph.remove((s, p, o))

## RDF Conversion
MATERIALFG TABLE

### Setup:
   ######     MATERIALFG

In [73]:
# CSV import
materialfg_csv = load_data("fullCSV/materialfg//")

# CSV to dataframe
materialfg_dfs = table_from_file("fullCSV/materialfg//")

# run matcher
materialfg_match, materialfg_unmatch = matcher(onto, materialfg_csv, context=True)

# load basic term RDF
basic_rdf(materialfg_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(materialfg_dfs, materialfg_match[0], nkbGraph)


number unmatched terms in MaterialFGID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in functionalgroup_FunctionalGroup: 5

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 2
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 1
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in FunctionalizationProtocol: 4

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
...

### Mapping:
#####      MATERIALFG

In [74]:
materialfg_iri = 'http://purl.bioontology.org/ontology/npo#NPO_174'

material_iri = 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C93400'

materialfg_edits = [
                  ('MaterialFGID', 'http://purl.org/dc/terms/identifier', 'id'),
                  ('functionalgroup_FunctionalGroup', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42614', 'name'),
                  # NOTE: used 'functionalization of nanoparticle' from NPO/eNM
                  ('FunctionalizationProtocol', 'http://purl.bioontology.org/ontology/npo#NPO_1616', 'functionalization of nanoparticle'),
                 ]

In [75]:
for s, p, o in nkbGraph.triples((None, Literal('MaterialFGID'), None)):
    # add <row> a <materialfg>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(materialfg_iri)))
    # add <materialfg_iri> <rdfs:label> "materialfg"
    nkbGraph.add((URIRef(materialfg_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('materialfg')))

# Material ID
for s, p, o in nkbGraph.triples((None, Literal('material_MaterialID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(materialfg_iri)) in nkbGraph:
        # create unique bag for row
        material_bag = BNode()
        # <row> <material_iri> <bag>
        nkbGraph.add((s, URIRef(material_iri), material_bag))
        # <bag> <id> "id"
        nkbGraph.add((material_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        # move the material DOI into the bag as well:
        # match the same row (s) and the DOI column, leaving the object as a wildcard
        for x, y, z in nkbGraph.triples((s, Literal('material_publication_DOI'), None)):
            # <bag> <DOI> "doi"
            nkbGraph.add((material_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
            #then remove the old
            nkbGraph.remove((x, y, z))
        nkbGraph.remove((s, p, o))
    
# materialfg columns
for column, iri, label in materialfg_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(materialfg_iri)) in nkbGraph:
            # add <row> <column_iri> "data"
            nkbGraph.add((s, URIRef(iri), o))
            # add <column_iri> <rdfs:label> "label"
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
            # remove old triple
            nkbGraph.remove((s, p, o))

## RDF Conversion
ADDITIVES TABLE

### Setup:
   ######     ADDITIVES

In [76]:
# CSV import
additives_csv = load_data("fullCSV/additive/")

# CSV to dataframe
additives_dfs = table_from_file("fullCSV/additive/")

# run matcher
additives_match, additives_unmatch = matcher(onto, additives_csv, context=True)

# load basic term RDF
basic_rdf(additives_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(additives_dfs, additives_match[0], nkbGraph)


number unmatched terms in AdditiveID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in AdditiveName: 127

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 10
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 62
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 1
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in AdditiveAmount: 3

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
...

### Mapping:
#####      ADDITIVES

In [77]:
additives_iri = 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C63495'

# additives link to the medium table, so the loop below uses medium_iri
medium_iri = 'http://purl.bioontology.org/ontology/npo#NPO_1853'

additives_edits = [
                  ('AdditiveID', 'http://purl.org/dc/terms/identifier', 'id'),
                  ('AdditiveName', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42614', 'name'),
                  # NOTE: 'value' used instead of amount. units provide unambiguous context.
                  #       worth the slight change for RDF consistency
                  ('AdditiveAmount', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('AdditiveUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                 ]

In [78]:
for s, p, o in nkbGraph.triples((None, Literal('AdditiveID'), None)):
    # add <row> a <additive>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(additives_iri)))
    # add <additives_iri> <rdfs:label> "additive"
    nkbGraph.add((URIRef(additives_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('additive')))

# Medium ID
for s, p, o in nkbGraph.triples((None, Literal('medium_MediumID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(additives_iri)) in nkbGraph:
        # create unique bag for row
        medium_bag = BNode()
        # <row> <medium_iri> <bag>
        nkbGraph.add((s, URIRef(medium_iri), medium_bag))
        # <bag> <id> "id"
        nkbGraph.add((medium_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        for x, y, z in nkbGraph.triples((s, Literal('medium_publication_DOI'), None)):
            # <bag> <DOI> "doi"
            nkbGraph.add((medium_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
            # then remove the old
            nkbGraph.remove((x, y, z))
        nkbGraph.remove((s, p, o))
    
# additives columns
for column, iri, label in additives_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(additives_iri)) in nkbGraph:
            # add <row> <column_iri> "data"
            nkbGraph.add((s, URIRef(iri), o))
            # add <column_iri> <rdfs:label> "label"
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
            # remove old triple
            nkbGraph.remove((s, p, o))

## RDF Conversion
CONTAM TABLE

### Setup:
   ######     CONTAM

In [79]:
# CSV import
contam_csv = load_data("fullCSV/contam//")

# CSV to dataframe
contam_dfs = table_from_file("fullCSV/contam//")

# run matcher
contam_match, contam_unmatch = matcher(onto, contam_csv, context=True)

# load basic term RDF
basic_rdf(contam_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(contam_dfs, contam_match[0], nkbGraph)


number unmatched terms in ContamID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in Contaminant: 27

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 3
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 22
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 2
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in ContamAmount: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
...

### Mapping:
#####      CONTAM

In [80]:
contam_iri = 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C84280'

material_iri = 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C93400'

contam_edits = [
                  ('ContamID', 'http://purl.org/dc/terms/identifier', 'id'),
                  # NOTE: 'value' used instead of amount. units provide unambiguous context.
                  #       worth the slight change for RDF consistency
                  ('ContamAmount', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#value', 'value'),
                  ('ContamUnit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C68553', 'unit'),
                  ('ContamMethod', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71460', 'method'),
                 ]

In [81]:
for s, p, o in nkbGraph.triples((None, Literal('ContamID'), None)):
    # add <row> a <contam>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(contam_iri)))
    # add <contam_iri> <rdfs:label> "contam"
    nkbGraph.add((URIRef(contam_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('contam')))

# material ID
for s, p, o in nkbGraph.triples((None, Literal('material_MaterialID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(contam_iri)) in nkbGraph:
        # create unique bag for row
        material_bag = BNode()
        # <row> <material_iri> <bag>
        nkbGraph.add((s, URIRef(material_iri), material_bag))
        # <bag> <id> "id"
        nkbGraph.add((material_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        # move the material DOI into the bag as well:
        # match the same row (s) and the DOI column, leaving the object as a wildcard
        for x, y, z in nkbGraph.triples((s, Literal('material_publication_DOI'), None)):
            # <bag> <DOI> "doi"
            nkbGraph.add((material_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
            #then remove the old
            nkbGraph.remove((x, y, z))
        nkbGraph.remove((s, p, o))
    
# contam columns
for column, iri, label in contam_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(contam_iri)) in nkbGraph:
            # add <row> <column_iri> "data"
            nkbGraph.add((s, URIRef(iri), o))
            # add <column_iri> <rdfs:label> "label"
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
            # remove old triple
            nkbGraph.remove((s, p, o))

## RDF Conversion
MOLECULAR RESULT

### Setup:
   ######     MOLECULAR RESULT

In [82]:
# CSV import
molecular_csv = load_data("fullCSV/molecularresult/")

# CSV to dataframe
molecular_dfs = table_from_file("fullCSV/molecularresult/")

# run matcher
molecular_match, molecular_unmatch = matcher(onto, molecular_csv, context=True)

# load basic term RDF
basic_rdf(molecular_match[0], nkbGraph)

# load relational RDF relationships
relational_rdf_loader(molecular_dfs, molecular_match[0], nkbGraph)


number unmatched terms in MolecularResultID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in assay_AssayID: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
matches: 0
threshold value: 90
RUN: ncit.owl | exact_binary 
matches: 0
threshold value: 90
RUN: npo.owl | exact_binary 
matches: 0
threshold value: 90
RUN: obi.owl | exact_binary 
matches: 0
threshold value: 90
RUN: SCTO.owl | exact_binary 
matches: 0
threshold value: 90

number unmatched terms in assay_publication_DOI: 1

RUN: EDAM_dev.owl | exact_binary 
matches: 0
threshold value: 90
RUN: enanomapper.owl | exact_binary 
...

### Mapping:
#####      MOLECULAR RESULT

In [83]:
molecular_iri = 'http://purl.obolibrary.org/obo/ERO_0000833'
assay_iri = 'http://purl.obolibrary.org/obo/OBI_0000070'
species_iri = 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C45293'


molecular_edits = [
                  ('MolecularResultID', 'http://purl.org/dc/terms/identifier', 'id'),
                  ('GEOAccession', 'http://edamontology.org/data_1147', 'GEO accession number'),
                  ('OrganismName', 'http://edamontology.org/data_2909', 'organism name'),
                  ('SampleCount', 'http://purl.dataone.org/odo/ECSO_00001188', 'sample count'),
                 ]

molecular_iri_edits = [
                      ('Platform', 'http://purl.obolibrary.org/obo/OBI_0000052', 'microarray platform'),
                      ('Series', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25674', 'series'),
                      ('URL', 'http://edamontology.org/data_1052', 'doi'),
]

In [84]:
# Molecular Result
for s, p, o in nkbGraph.triples((None, Literal('MolecularResultID'), None)):
    # add <row> a <molecular>
    nkbGraph.add((s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(molecular_iri)))
    # add <molecular_iri> <rdfs:label> "molecularresult"
    nkbGraph.add((URIRef(molecular_iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal('molecularresult')))

# Assay ID
for s, p, o in nkbGraph.triples((None, Literal('assay_AssayID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(molecular_iri)) in nkbGraph:
        # create unique bag for row
        assay_bag = BNode()
        # <row> <assay_iri> <bag>
        nkbGraph.add((s, URIRef(assay_iri), assay_bag))
        # <bag> <id> "id"
        nkbGraph.add((assay_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        # move the assay DOI into the bag as well:
        # match the same row (s) and the DOI column, leaving the object as a wildcard
        for x, y, z in nkbGraph.triples((s, Literal('assay_publication_DOI'), None)):
            # <bag> <DOI> "doi"
            nkbGraph.add((assay_bag, URIRef('http://purl.obolibrary.org/obo/OBI_0002110'), z))
            #then remove the old
            nkbGraph.remove((x, y, z))
        nkbGraph.remove((s, p, o))
    
for s, p, o in nkbGraph.triples((None, Literal('AssayType'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(molecular_iri)) in nkbGraph:
        nkbGraph.add((s, URIRef('http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C42614'), o))
        nkbGraph.remove((s, p, o))
        
# Species ID
for s, p, o in nkbGraph.triples((None, Literal('SpeciesID'), None)):
    if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(molecular_iri)) in nkbGraph:
        # create unique bag for row
        species_bag = BNode()
        # <row> <species_iri> <bag>
        nkbGraph.add((s, URIRef(species_iri), species_bag))
        # <bag> <id> "id"
        nkbGraph.add((species_bag, URIRef('http://purl.org/dc/terms/identifier'), o))
        nkbGraph.remove((s, p, o))
    
# IRI-Matched Replacements
for term, iri, label in molecular_iri_edits:
    for s, p, o in nkbGraph.triples((None, None, Literal(term))):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(molecular_iri)) in nkbGraph:
            if ((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label))) in nkbGraph:
                print("we already have this triple, so unwise to add it and then delete it...")
                print(term)
            else:
                nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
                nkbGraph.remove((s, p, o))
    
    oldiri = term_lookup(molecular_match[0], term)[1]
    
    for s, p, o in nkbGraph.triples((None, URIRef(oldiri), None)):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(molecular_iri)) in nkbGraph:
            if (s, URIRef(iri), o) in nkbGraph:
                print("We already have this triple, so unwise to add it and then delete it...")
                print(iri)
                print("1")
            else:
                nkbGraph.add((s, URIRef(iri), o))
                nkbGraph.remove((s, p, o))

    for s, p, o in nkbGraph.triples((None, None, URIRef(oldiri))):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(molecular_iri)) in nkbGraph:
            if (s, p, URIRef(iri)) in nkbGraph:
                print("We already have this triple, so unwise to add it and then delete it...")
                print(iri)
                print("2")
            else:
                nkbGraph.add((s, p, URIRef(iri)))
                nkbGraph.remove((s, p, o))
    
# molecular columns
for column, iri, label in molecular_edits:
    for s, p, o in nkbGraph.triples((None, Literal(column), None)):
        if (s, URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'), URIRef(molecular_iri)) in nkbGraph:
            # add <row> <column_iri> "data"
            nkbGraph.add((s, URIRef(iri), o))
            # add <column_iri> <rdfs:label> "label"
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), Literal(label)))
            # remove old triple
            nkbGraph.remove((s, p, o))

We already have this triple, so unwise to add it and then delete it...
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25674
1
We already have this triple, so unwise to add it and then delete it...
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25674
1
We already have this triple, so unwise to add it and then delete it...
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25674
1
We already have this triple, so unwise to add it and then delete it...
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25674
1
We already have this triple, so unwise to add it and then delete it...
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25674
1
We already have this triple, so unwise to add it and then delete it...
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25674
1
We already have this triple, so unwise to add it and then delete it...
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25674
1
We already have this triple, so unwise to add it and then delete it...
http://ncicb

## MATCH CURATION

Throughout the sections above, we recommended making corrections on a table-by-table basis. That is ideal: your changes propagate as little as possible, which reduces the risk of introducing additional errors. Plus, it keeps everything neat and tidy.

Unfortunately, these are big datasets, and it is entirely possible for mistakes to get past your first review. The cells below show that mistakes happen and that you can patch them up at the end if necessary. Alternatively, there's nothing wrong with going back, making the changes in the individual table sections, and rerunning your whole notebook!

Below, we also include the curation notes from these last steps of the process, as they looked when the job was done, to show that it's normal to have some confusion about similar terms from different ontologies, or to have a change of heart about which term fits your data best.

In [85]:
#records output BEFORE the curation, for verification
#nkbGraph.serialize(format="turtle", destination="FirstFix_nkb_rdf_complete_9_26-preMC.ttl")

In [86]:
# MATCH CURATION
# list of tuples, all the correct 
# associations of erroneous matches
# ('term', 'correct_iri')

curated_matches = [
    # NCIT
    ("doi", "http://purl.obolibrary.org/obo/OBI_0002110"),
    ('abstract', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C60765'),
    # 'corresponding author indicator' IRI
    ('correspondence', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C164481'),
    ('hydrogen', 'http://purl.bioontology.org/ontology/npo#NPO_520'),
    ('sulfate', 'http://purl.obolibrary.org/obo/CHEBI_16189'),
    ('glycerol', 'http://purl.obolibrary.org/obo/CHEBI_17754'),
    ('lead', 'http://purl.bioontology.org/ontology/npo#NPO_424'),
    ('domain', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C62289'),
    ('dispersion', 'http://purl.bioontology.org/ontology/npo#NPO_1969'),
    ('phosphate', 'http://purl.obolibrary.org/obo/CHEBI_26020'),
    # 'water' weird form CHEBI, BAO, etc
    ('water', 'http://purl.obolibrary.org/obo/CHEBI_15377'),
    ('composition', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C53414'),
    ('parts', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C45313'),
    ('normality', 'http://purl.bioontology.org/ontology/npo#NPO_1192'),
    ('sulfur', 'http://purl.bioontology.org/ontology/npo#NPO_796'),
    ('vanadium','http://purl.bioontology.org/ontology/npo#NPO_486'),
    ('vanadium','http://purl.bioontology.org/ontology/npo#NPO_486'),
    ('crystalline', 'http://purl.bioontology.org/ontology/npo#NPO_1512'),
    ('volume', 'http://purl.obolibrary.org/obo/PATO_0000918'),
    # using 'sample deposition' from CHMO
    ('deposition', 'http://purl.obolibrary.org/obo/CHMO_0001310'),
    ('neutrophils', 'http://purl.jp/bio/4/id/200906021556987754'),
    # using NCIT for 'sequence', generic
    ('sequences', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25673'),
    # NCIT to specific, using PATO generic 'malformed'
    ('malformation', 'http://purl.obolibrary.org/obo/PATO_0000646'),
    ('melting temperature', 'http://semanticscience.org/resource/CHEMINF_000256'),
    ('column', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C43379'),
    ('sedimentation rate', 'http://sbmi.uth.tmc.edu/ontology/ochv#C0200704'),
    ('treatment', 'http://www.ebi.ac.uk/efo/EFO_0000727'),
    # switching from generic 'control' to NCIT 'study control'
    ('control', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C142703'),
]

# NOTES

# is 'vehicle' the eNM definition, or a literal vehicle?
#     ns6:ENM_8000219 rdfs:label "vehicle" ;

# add label for this:
# http://purl.bioontology.org/ontology/npo#NPO_1824

# two IRIs with application label, NCIT (general) and eNM (specific)
#    ns5:CHEBI_33232 rdfs:label "application" .

# WEIRDNESS WITH CHEBI< BAO< FIX EBI, like:
# this is a typo/error in the NanoParticleOntology?
# leaving because common in NPO and ported into eNM, maybe intentional....
#     <http://purl.org/obo/owl/CHEBI#CHEBI_15377> rdfs:label "water" .

#     <http://purl.org/obo/owl/CHEBI#CHEBI_33244> rdfs:label "organic functional class" .

#     <http://purl.org/obo/owl/FIX#FIX_0000402> rdfs:label "light scattering" .
#     <http://www.bioassayontology.org/bao#BAO_0010044> rdfs:label "targeted transcriptional assay" .

#     <http://www.ebi.ac.uk/efo/EFO_0004557> rdfs:label "population measurement" .

In [87]:
# using rdflib to directly change term-IRI
# because term_editor() functions need bug fixes

# CURATIONS
for term, iri in curated_matches:
    for (oldiri, p, label) in nkbGraph.triples((None, None, Literal(term))):
        if (URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), label) in nkbGraph:
            print("We already have this triple, so it is unwise to add it and then delete it...")
            print(term)
            print(iri)
        else:
            nkbGraph.add((URIRef(iri), URIRef('http://www.w3.org/2000/01/rdf-schema#label'), label))
            nkbGraph.remove((oldiri, p, label))

            for (s, oldiri, o) in nkbGraph.triples((None, URIRef(oldiri), None)):
                if (s, URIRef(iri), o) in nkbGraph:    
                    print("We already have this triple, so it is unwise to add it and then delete it...")
                    print('1')
                else:
                    nkbGraph.add((s, URIRef(iri), o))
                    nkbGraph.remove((s, oldiri, o))

            for (s, p, oldiri) in nkbGraph.triples((None, None, URIRef(oldiri))):
                if (s, p, URIRef(iri)) in nkbGraph:
                    print("We already have this triple, so it is unwise to add it and then delete it...")
                    print('2')
                else:
                    nkbGraph.add((s, p, URIRef(iri)))
                    nkbGraph.remove((s, p, oldiri))

We already have this triple, so it is unwise to add it and then delete it...
doi
http://purl.obolibrary.org/obo/OBI_0002110
We already have this triple, so it is unwise to add it and then delete it...
doi
http://purl.obolibrary.org/obo/OBI_0002110
We already have this triple, so it is unwise to add it and then delete it...
abstract
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C60765
We already have this triple, so it is unwise to add it and then delete it...
abstract
http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C60765
We already have this triple, so it is unwise to add it and then delete it...
water
http://purl.obolibrary.org/obo/CHEBI_15377
We already have this triple, so it is unwise to add it and then delete it...
vanadium
http://purl.bioontology.org/ontology/npo#NPO_486
We already have this triple, so it is unwise to add it and then delete it...
volume
http://purl.obolibrary.org/obo/PATO_0000918


In [88]:
# REMOVALS
#    note: never remove something that is a predicate anywhere in the data
#          always find a substitute
removals = [
    # association extremely unlikely, better as string
    ('p-25', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C102496'),
    # mathematical and procedural definitions for mode- better as string
    ('mode', 'http://purl.bioontology.org/ontology/npo#NPO_1802'),
    # 'hm' association- too vague to trust IRI association
    ('hm', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C102641'),
    # no way this is in NKB (some subclass out there too)
    ('indic language', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C161844')
]

for term, iri in removals:
    for (oldiri, p, label) in nkbGraph.triples((None, None, Literal(term))):
        nkbGraph.remove((oldiri, p, label))
        
        for (s, oldiri, o) in nkbGraph.triples((None, URIRef(oldiri), None)):
            nkbGraph.remove((s, oldiri, o))
            
        for (s, p, oldiri) in nkbGraph.triples((None, None, URIRef(oldiri))):
            nkbGraph.add((s, p, Literal(term)))
            nkbGraph.remove((s, p, oldiri))

In [89]:
# ADDS
# terms to add to matches, for future curation
# additions = [
#     ('model prediction'),
#     ('concentration copper'),
#     ('milligrams/kilogram'),
#     ('medium saturation'),
#     ('silica sand'),
#     ('milli-q water')
# ]

In [90]:
#serialize before applying two final techniques
nkbGraph.serialize(format="turtle", destination="NKB_RDF_V3_before_ID.ttl")

<Graph identifier=N67e6d73bee004a20b9f76d20269cb565 (<class 'rdflib.graph.Graph'>)>

<a id='id_mapping'></a>
## Handling ID data

NEW TO V3 - Ontosearcher attempts to replace things by searching ontologies for good matches. This approach works well when trying to find matches for a word or phrase, but sometimes your data isn't like that. For instance, the NaKnowBase includes IDs that may be useful in finding or working with other resources, such as CAS Registry Numbers for chemicals. Typically, you will have stored the ID in your database as a string or number. In some cases, these IDs are part of existing URIs, and it is easy to expand them into that form in your RDF.

Below, we perform the manual curations for the two examples of this in the NaKnowBase.

In [91]:
#CAS Registry Numbers work as URIs already
#here, we find all triples that say "<subject> has a CAS Registry Number of <object>"
#before we do anything, those triples will look like "<subject> has a CAS Registry Number of '138-59-0' "
#after, it will look like "<subject> has a CAS Registry Number of http://identifiers.org/cas:138-59-0"
#we replace the ID number with a URIRef linking directly to the resource
for s, p, o in (nkbGraph.triples((None, URIRef("http://semanticscience.org/resource/CHEMINF_000446"), None))):
    #we focus only on ones where a real CASRN was provided
    if o != Literal("nan"):
        #we append the existing literal object to the address where you can pull up chemicals by their CASRN
        #these can usually be found online for any resource that assigns IDs to things
        newVal = "http://identifiers.org/cas:" + o
        #we add a version of the triple, in which the object is replaced with the URI of that address+CASRN combo
        nkbGraph.add((s, p, URIRef(newVal)))
        #finally, we remove the original
        nkbGraph.remove((s, p, o))

In [92]:
#DTXSIDs are handled the same way
#find DTXSID triples
for s, p, o in (nkbGraph.triples((None, URIRef("http://semanticscience.org/resource/CHEMINF_000568"), None))):
    #ignore the NANs
    if o != Literal("nan"):
        #convert to a comptox address
        newVal = "http://identifiers.org/comptox/" + o.upper()
        #add modified triple with URI object
        nkbGraph.add((s, p, URIRef(newVal)))
        #remove original
        nkbGraph.remove((s, p, o))
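
Both cells above treat every value other than the string "nan" as a usable ID. If your data might contain other placeholders or typos, you could validate each value before turning it into a URI. The sketch below is not part of the original notebook: it checks the CAS Registry Number format and check digit in plain Python. The helper names `is_valid_casrn` and `cas_uri` are hypothetical, and the base address is simply the one used in the CASRN cell above.

```python
import re

# 2 to 7 digits, then 2 digits, then a single check digit
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(value):
    """Return True if `value` is formatted like a CAS RN and its check digit matches."""
    match = CAS_PATTERN.match(value.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # weight the digits 1, 2, 3, ... from right to left, then take the sum mod 10
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def cas_uri(value, base="http://identifiers.org/cas:"):
    """Build the URI string for a CAS RN, or return None if the ID is unusable."""
    value = value.strip()
    return base + value if is_valid_casrn(value) else None


for candidate in ["138-59-0", "7732-18-5", "nan", "7732-18-4"]:
    print(candidate, cas_uri(candidate))
```

Inside the loop from the CASRN cell, you would call `cas_uri(str(o))` and skip the triple when it returns `None`, leaving the original literal in place for later review.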

## Serializing
You've made it to the end. Now, we've added all of the data to our RDF. We've explicitly stated implied facts and bagged related data under those implied headings. We've let Ontosearcher handle the bulk of the work, but we've stepped in to fill in the gaps or override decisions we disagreed with. We've even taken a little extra time to add connections to other resources.
Let's go ahead and serialize it to a file, but there's one last thing we'll want to do afterwards.

In [93]:
nkbGraph.serialize(format="turtle", destination="NKB_RDF_V3_wip.ttl")

<Graph identifier=N67e6d73bee004a20b9f76d20269cb565 (<class 'rdflib.graph.Graph'>)>

<hr>

<a id='namespaces'></a>
## Formally Naming Namespaces

As a last step, we want to make life easier for anyone using the RDF in the future. One easy way to improve the user experience is to make sure there are no hidden changes over time.

If you've looked at other RDFs already (or if you use a text editor to open the serialized file we just made), you might have noticed that the file begins by declaring a list of *prefixes*. If you look at the CASRN or DTXSID examples above, you'll notice that the process of turning them into URIs was to simply add them to the end of an address. Most of the URIs in your RDF are set up similarly: an address pointing to the ontology's online presence and then to the specific term. 

Imagine you had the following RDF:

    example.org/myOntology/term1 a example.org/myOntology/person.
    example.org/myOntology/term1 example.org/myOntology/name "Alice".
    example.org/myOntology/term1 example.org/myOntology/age "20".
    
Anyone writing queries about your RDF would get really tired of writing example.org/myOntology/ over and over again.
When your RDF is written to the .ttl file, the serializer simplifies things by recognizing that "example.org/myOntology/" is a common **namespace** for your URIs. Then, it gives that namespace a **prefix** as a form of shorthand. So your RDF will become:

    @prefix ns1: <example.org/myOntology/>.
    
    ns1:term1 a ns1:person.
    ns1:term1 ns1:name "Alice".
    ns1:term1 ns1:age "20".
    
That first line is saying: "The prefix *ns1:* means *example.org/myOntology/*." 

This saves a lot of time for anyone using your RDF, both by making them type less and by giving them something simpler to remember. The problem? While a few common namespaces have specialized prefixes by default, rdflib will simply assign most of them *ns* with a number. That succeeds at being shorter, but it can make things a lot more confusing if you have more than a few namespaces. Even worse, if you come back and make a substantial update to your RDF in the future, the namespaces might be encountered in a different order and be given different numbers!

Fortunately, we can manually assign prefixes to each namespace, making things more intuitive and more consistent. 

In [94]:
#let's load our data back in, calling it loadedGraph
loadedGraph = Graph()
loadedGraph.parse("NKB_RDF_V3_wip.ttl")

<Graph identifier=N56294d88e57b416f80951771f1065c01 (<class 'rdflib.graph.Graph'>)>

In [95]:
#look at the current state of the namespaces
for namespace in loadedGraph.namespaces():
    print(namespace)

('owl', rdflib.term.URIRef('http://www.w3.org/2002/07/owl#'))
('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#'))
('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#'))
('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#'))
('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace'))
('ns1', rdflib.term.URIRef('http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#'))
('ns10', rdflib.term.URIRef('http://www.geneontology.org/formats/oboInOwl#'))
('ns11', rdflib.term.URIRef('http://edamontology.org/'))
('ns12', rdflib.term.URIRef('http://purl.dataone.org/odo/'))
('ns2', rdflib.term.URIRef('http://purl.obolibrary.org/obo/'))
('ns3', rdflib.term.URIRef('http://purl.org/dc/terms/'))
('ns4', rdflib.term.URIRef('http://purl.bioontology.org/ontology/npo#'))
('ns5', rdflib.term.URIRef('http://semanticscience.org/resource/'))
('ns6', rdflib.term.URIRef('http://purl.enanomapper.org/onto/'))
('ns7', rdflib.term.URIRef('http://purl.org/dc/elements

The first five entries are for common namespaces present in many RDFs. There are more namespaces with special prefixes like this, but only the ones present in your RDF will be included.

After that, we have 12 namespaces that were assigned generic "ns" prefixes. These are the ones we should work on.

There are a few options for how to pick a prefix. Some sources will suggest a prefix in their documentation. For others, you might find another RDF that references it and follow their lead. If you have to come up with one on your own, try to come up with one that is both short and distinct. Finally, if there are just a few you couldn't find a better prefix for, there's nothing wrong with using the default prefixes, as long as you explicitly bind the prefix to the namespace and include it in your documentation.

If you look closely, you'll also notice that ns8 is identical to the pre-named rdfs, except that it starts with "https" instead of "http." HTTP and HTTPS are both protocols for accessing resources on the Internet, with HTTPS being a newer and more secure version of HTTP. One of the sources Ontosearcher selected must have used https instead. While it is technically correct as is, we can make life easier for our users by consolidating these two namespaces together. We'll do that before getting to the bindings.

In [96]:
#consolidating ns8 and rdfs into just rdfs

#look at each triple in our graph
#list() takes a snapshot first, so we can safely add and remove triples inside the loop
for s, p, o in list(loadedGraph.triples((None, None, None))):
    #make copies of each part of the triple
    nS, nP, nO = s, p, o
    #if s comes from the https namespace
    if s.startswith("https://www.w3.org/2000/01/rdf-schema#"):
        #split s up to get the suffix, then append the suffix to the http namespace and replace nS with this new version
        _, newS = s.split("https://www.w3.org/2000/01/rdf-schema#")
        nS = URIRef("http://www.w3.org/2000/01/rdf-schema#" + newS)
    #do the same for p
    if p.startswith("https://www.w3.org/2000/01/rdf-schema#"):
        _, newP = p.split("https://www.w3.org/2000/01/rdf-schema#")
        nP = URIRef("http://www.w3.org/2000/01/rdf-schema#" + newP)
    #do the same for o
    if o.startswith("https://www.w3.org/2000/01/rdf-schema#"):
        _, newO = o.split("https://www.w3.org/2000/01/rdf-schema#")
        nO = URIRef("http://www.w3.org/2000/01/rdf-schema#" + newO)
    #if any part of the triple had to be changed
    if (nS, nP, nO) != (s, p, o):
        loadedGraph.add( (nS, nP, nO) )
        loadedGraph.remove( (s, p, o) )

In [97]:
#like we've done before, let's set up a structure holding all of the mappings
#in each tuple, we list the new name, followed by the URI we want to map it to
newPrefixes = [
    ('ncit', URIRef('http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#')),
    ('obo', URIRef('http://purl.obolibrary.org/obo/')),
    ('dcterms', URIRef('http://purl.org/dc/terms/')),
    ('npo', URIRef('http://purl.bioontology.org/ontology/npo#')),
    ('dc', URIRef('http://purl.org/dc/elements/1.1/')),
    ('enm', URIRef('http://purl.enanomapper.org/onto/')),
    ('sio', URIRef('http://semanticscience.org/resource/')),
    ('edam', URIRef('http://edamontology.org/')),
    ('birnlex', URIRef('http://bioontology.org/projects/ontologies/birnlex#')),
    ('odo', URIRef('http://purl.dataone.org/odo/')),
    ('oboInOwl', URIRef('http://www.geneontology.org/formats/oboInOwl#'))
]

In [98]:
#now, implement the bindings
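#each tuple holds (prefix, namespace); override=True rebinds a namespace that already has a prefix
for prefix, namespace in newPrefixes:
    loadedGraph.bind(prefix, namespace, override=True)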

In [99]:
#verifying the changes took place
for namespace in loadedGraph.namespaces():
    print(namespace)

('owl', rdflib.term.URIRef('http://www.w3.org/2002/07/owl#'))
('rdf', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#'))
('rdfs', rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#'))
('xsd', rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#'))
('xml', rdflib.term.URIRef('http://www.w3.org/XML/1998/namespace'))
('ns8', rdflib.term.URIRef('https://www.w3.org/2000/01/rdf-schema#'))
('ncit', rdflib.term.URIRef('http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#'))
('obo', rdflib.term.URIRef('http://purl.obolibrary.org/obo/'))
('dcterms', rdflib.term.URIRef('http://purl.org/dc/terms/'))
('npo', rdflib.term.URIRef('http://purl.bioontology.org/ontology/npo#'))
('dc', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/'))
('enm', rdflib.term.URIRef('http://purl.enanomapper.org/onto/'))
('sio', rdflib.term.URIRef('http://semanticscience.org/resource/'))
('edam', rdflib.term.URIRef('http://edamontology.org/'))
('birnlex', rdflib.term.URIRef('http://bioontology.org/p

Note that, despite being listed above, ns8 will not appear in the final .ttl. Since namespaces are finalized during serialization, that's when rdflib will notice that all ns8 values have been removed and the prefix can also be removed.

In [100]:
#it looks good, and the new, custom prefixes have been applied, so let's serialize again
#Congratulations! Your RDF is finished!
loadedGraph.serialize(format="turtle", destination="NKB_RDF_V3.ttl")

<Graph identifier=N56294d88e57b416f80951771f1065c01 (<class 'rdflib.graph.Graph'>)>

# The End
With your namespace prefixes finalized, your RDF is now complete and ready for use.

<hr>

<a id='appendix'></a>
## APPENDIX

If you previously worked with Version 1, you should have a general understanding of the process. Here, we draw special attention to any changes.

### Version 2 Additions

* Each row of data from the input has now been given a triple which defines the "is a" relationship. An example of how to do this can be found [here](#isA_relationship), but it is performed for every table.
* The nodebag method was rewritten. The way we use it has not changed, but the logic within the method has, so make sure you replace the old version with [this one](#nodebag_change) in your own code. 
* The process for dealing with situations where a row of data needs multiple nodebags of the same type has been [updated](#metadata_bag). The previous methodology led to the loss of data. This version preserves data and stores relevant metadata in a separate bag (previously, metadata was placed in the first bag and absent from all others, which made locating it unintuitive).
* A [recommendation](#foreign_key_tip) has been added to use nodebags when dealing with composite foreign keys in your original dataset.
* Additional "if" statements have been added throughout the process to verify that changes are only applied to data from a specific table. Check [here](#table_check) for an explanation and example.
* Additional "if" statements have been added throughout the process to verify that suspicious edits do not cause erroneous deletions. Check [here](#preexisting_check) for an explanation and example.

### Version 3 Additions

* Data involving IDs can present an [opportunity](#id_mapping) for quick and easy mapping to URIs that aren't readily found by Ontosearcher in ontologies. 
* Manually defining prefixes for the [namespaces](#namespaces) used in your RDF is a simple process that improves the user experience. You can also redefine them, if needed.