# Measurement Techniques Graph - Preparing the data

Measurement techniques are currently scattered across multiple ontologies. To improve the mapping of measTech terms to ontology terms, the measurement technique branches of these ontologies need to be combined so that synonymous terms are not treated as separate entities.

To do this, we will:

1. Convert measurement technique ontology branches into subject (child), predicate (subClassOf), object (parent) triples.
2. Use NCBO BioPortal mappings to identify synonymous terms between ontologies and de-duplicate nodes.
3. Use term similarity and shared nodes to identify potentially synonymous terms for de-duplication.
4. Iterate on step 3 until the nodes in the graph are largely unique.

This notebook covers the network generation using the triples and mapping data.
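
As a minimal illustration of the idea (not the notebook's actual pipeline), the sketch below applies a sameAs mapping to a handful of toy triples and drops the duplicates that result. The `ex:` identifiers and the `apply_mapping` helper are made up for this example.

```python
# Toy triples as (subject, predicate, object); the "ex:" IRIs are placeholders.
triples = [
    ("ex:A_child", "subClassOf", "ex:A_assay"),
    ("ex:B_child", "subClassOf", "ex:B_assay"),
    ("ex:B_assay", "subClassOf", "ex:root"),
    ("ex:A_assay", "subClassOf", "ex:root"),
]

# Hypothetical sameAs mapping: ex:B_assay is treated as a synonym of ex:A_assay.
same_as = {"ex:B_assay": "ex:A_assay"}


def apply_mapping(triples, mapping):
    """Replace mapped IRIs in subject and object position, then drop duplicate rows."""
    seen = set()
    merged = []
    for s, p, o in triples:
        row = (mapping.get(s, s), p, mapping.get(o, o))
        if row not in seen:
            seen.add(row)
            merged.append(row)
    return merged


merged = apply_mapping(triples, same_as)
for row in merged:
    print(row)
```

After the replacement, both children hang off the same node and the two edges to `ex:root` collapse into one.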

In [1]:
import os
import requests
import pandas as pd
import json

In [2]:
script_path = os.getcwd()
raw_path = os.path.join(script_path,'raw_files')
raw_file_list = os.listdir(raw_path)
result_path = os.path.join(script_path,'results')

In [3]:
high2low_priority = ["MMO", "CHMO","OBI","BAO","EFO","topic","NCIT"]
low2high_priority = ["NCIT","topic","EFO","BAO","OBI","CHMO","MMO"]

## Load the triples and de-dup between ontologies

In [4]:
all_measTech_triples = pd.read_csv(os.path.join(result_path,'measTechOnly_mapped_triples.tsv'),delimiter='\t',header=0,index_col=0)
print(all_measTech_triples.head(n=2))

                                         subject_id  \
10  http://www.bioassayontology.org/bao#BAO_0002540   
74  http://www.bioassayontology.org/bao#BAO_0010204   

                                       predicate_id  \
10  http://www.w3.org/2000/01/rdf-schema#subClassOf   
74  http://www.w3.org/2000/01/rdf-schema#subClassOf   

                                          object_id  \
10  http://www.bioassayontology.org/bao#BAO_0000248   
74  http://www.bioassayontology.org/bao#BAO_0003008   

                               subject             object   predicate  
10  ECL western blotting detection kit          assay kit  subClassOf  
74         transporter substrate assay  transporter assay  subClassOf  


In [5]:
sameAsSubset = all_measTech_triples.loc[all_measTech_triples['predicate']=='sameAs']
#sameAsSubset.to_csv(os.path.join(result_path,'measTechOnlyMappings.tsv'),sep='\t',header=True)
print(len(sameAsSubset))
#print(sameAsSubset.head(n=2))
#print(sameAsSubset.groupby(['subject','object','predicate']).size())

640


In [6]:
def generate_ordered_mapping(high2low_priority,low2high_priority,sameAsSubset):
    """Select one sameAs mapping per subject, preferring objects from higher-priority ontologies."""
    ordered_mapping = pd.DataFrame(columns=['subject_id','predicate_id','object_id','subject','predicate','object'])
    i = 0
    while i < len(low2high_priority):
        # ontologies of equal or higher priority than the current one are valid mapping targets
        inclusion_list = low2high_priority[i:len(low2high_priority)]
        eachonto = low2high_priority[i]
        # sameAs rows whose subject belongs to the current ontology
        tempSubset = sameAsSubset.loc[sameAsSubset['subject_id'].astype(str).str.contains(eachonto)]
        reordered = pd.DataFrame(columns=['subject_id','predicate_id','object_id','subject','predicate','object'])
        # stack candidate rows from the highest to the lowest priority target ontology
        for eachpriority in high2low_priority[0:len(high2low_priority)-i]:
            tempdf = tempSubset.loc[tempSubset['object_id'].astype(str).str.contains(eachpriority)]
            # IRIs containing "EMMO" would also match "MMO", so exclude them
            tmpdf = tempdf.loc[~tempdf['object_id'].astype(str).str.contains("EMMO")]
            reordered = pd.concat((reordered,tmpdf),ignore_index=True)
            # keep only the first (highest-priority) mapping for each subject
            unique_map = reordered.drop_duplicates(subset='subject_id',keep='first')
        for includeonto in inclusion_list:
            include_map = unique_map.loc[unique_map['object_id'].astype(str).str.contains(includeonto)]
            ordered_mapping = pd.concat((ordered_mapping,include_map),ignore_index=True)
        i=i+1
    return ordered_mapping

def export_ordered_mapping_dict(result_path,ordered_mapping):
    ordered_mapping_dict = dict(zip(ordered_mapping.subject_id, ordered_mapping.object_id))
    with open(os.path.join(result_path,'ordered_mapping_dict.json'),'w') as outwrite:
        outwrite.write(json.dumps(ordered_mapping_dict,indent=4))
    return ordered_mapping_dict

In [7]:
ordered_mapping = generate_ordered_mapping(high2low_priority,low2high_priority,sameAsSubset)
print(len(ordered_mapping))
ordered_mapping_dict = export_ordered_mapping_dict(result_path,ordered_mapping)

203
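
The priority logic in `generate_ordered_mapping` can be hard to follow in DataFrame form. The plain-Python sketch below is a simplified approximation of the same idea: for each subject, keep the target from the highest-priority ontology, and never map towards a lower-priority ontology. The `ontology_of` and `pick_targets` helpers and the `example.org` IRIs are assumptions made for illustration.

```python
# Priority order from the notebook, highest first.
PRIORITY = ["MMO", "CHMO", "OBI", "BAO", "EFO", "topic", "NCIT"]
RANK = {name: i for i, name in enumerate(PRIORITY)}


def ontology_of(iri):
    """Return the first priority name found in the IRI (substring match), or None."""
    if "EMMO" in iri:  # "EMMO" contains "MMO" but is not in the priority list
        return None
    for name in PRIORITY:
        if name in iri:
            return name
    return None


def pick_targets(same_as_pairs):
    """Keep one target per subject: the one whose ontology ranks highest."""
    best = {}
    for subject, target in same_as_pairs:
        s_onto, t_onto = ontology_of(subject), ontology_of(target)
        if s_onto is None or t_onto is None:
            continue
        if RANK[t_onto] > RANK[s_onto]:
            continue  # target is lower priority than the subject: skip
        current = best.get(subject)
        if current is None or RANK[t_onto] < RANK[ontology_of(current)]:
            best[subject] = target
    return best


# Made-up sameAs pairs for illustration only.
pairs = [
    ("http://example.org/NCIT_1", "http://example.org/EFO_1"),
    ("http://example.org/NCIT_1", "http://example.org/OBI_1"),
    ("http://example.org/MMO_1", "http://example.org/NCIT_1"),
    ("http://example.org/CHMO_1", "http://example.org/MMO_1"),
]
picked = pick_targets(pairs)
print(picked)
```

Here the NCIT term maps to the OBI term rather than the EFO term, and the MMO term is left unmapped because its only candidate target is in a lower-priority ontology.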


In [8]:
## clean up the triples
lineage_triples = all_measTech_triples.loc[all_measTech_triples['predicate']=='subClassOf'].copy()
print(len(lineage_triples))

8678


### Iteratively reduce duplication

In [16]:
### Load the manual mappings
with open(os.path.join(result_path,'results_from_cytoscape','mappings_found_via_network.json'),'r') as infile:
    manual_map = json.load(infile)
print(list(manual_map.keys())[0])

### check for overlap in ordered_mapping_dict
overlap = list(set(list(ordered_mapping_dict.keys())).intersection(set(list(manual_map.keys()))))
print(overlap)
print(ordered_mapping_dict[overlap[0]], manual_map[overlap[0]])
#### The overlapping mapped values are the same, it's fine to merge

### Merge the dictionaries
print(len(list(ordered_mapping_dict.keys())), len(list(manual_map.keys())))
ordered_mapping_dict.update(manual_map)
print(len(list(ordered_mapping_dict.keys())))

http://purl.obolibrary.org/obo/OBI_0002119
['http://purl.obolibrary.org/obo/CHMO_0000591']
http://purl.obolibrary.org/obo/MMO_0000709 http://purl.obolibrary.org/obo/MMO_0000709
203 29
231
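
The cell above checks the single overlapping key by eye before calling `update`. If the manual mappings grow, a guard such as the following sketch could make that check automatic; `merge_mappings` is a hypothetical helper, not part of the notebook.

```python
def merge_mappings(base, extra):
    """Return a merged copy of two IRI mappings.

    Raises ValueError if a key present in both maps to different targets.
    """
    conflicts = {
        key: (base[key], extra[key])
        for key in base.keys() & extra.keys()
        if base[key] != extra[key]
    }
    if conflicts:
        raise ValueError(f"conflicting mappings: {conflicts}")
    merged = dict(base)
    merged.update(extra)
    return merged


# Toy data: one shared key that agrees, one new key.
auto_map = {"ex:a": "ex:x", "ex:b": "ex:y"}
manual = {"ex:b": "ex:y", "ex:c": "ex:z"}
combined = merge_mappings(auto_map, manual)
print(len(auto_map), len(manual), len(combined))
```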


In [21]:
### Load the manual triples
manual_triples = pd.read_csv(os.path.join(result_path,'results_from_cytoscape','triples_found_via_network.tsv'),header=0,delimiter='\t')
#print(manual_triples.head(n=2))
#print(lineage_triples.head(n=2))
### Merge the manual triples
print(len(lineage_triples), len(manual_triples))
lineage_triples = pd.concat((lineage_triples,manual_triples),ignore_index=True)
print(len(lineage_triples))
lineage_triples.drop_duplicates(keep='first',inplace=True)
print(len(lineage_triples))

8685 7
8692
8685


In [33]:
## iteratively replace values in the triples and de-duplicate

i=0
lineage_triples['subject_id'] = lineage_triples['subject_id'].replace(to_replace=ordered_mapping_dict)
lineage_triples['object_id'] = lineage_triples['object_id'].replace(to_replace=ordered_mapping_dict)
clean_triples = lineage_triples.drop_duplicates(keep='first').copy()
while i < len(low2high_priority):
    clean_triples['subject_id'] = clean_triples['subject_id'].replace(to_replace=ordered_mapping_dict)
    clean_triples['object_id'] = clean_triples['object_id'].replace(to_replace=ordered_mapping_dict)
    clean_triples = clean_triples.drop_duplicates(keep='first').copy()
    i=i+1
    print(i, len(clean_triples))
    #print(i, len(clean_triples.loc[clean_triples['subject_id'].astype(str).str.contains('EFO')]))

#example_dict = {'http://purl.obolibrary.org/obo/NCIT_C114102': 'http://www.ebi.ac.uk/efo/EFO_0009638', 'http://purl.obolibrary.org/obo/NCIT_C124040': 'http://www.ebi.ac.uk/efo/EFO_0010719'}
#print(lineage_triples.loc[lineage_triples['subject_id']=='http://www.ebi.ac.uk/efo/EFO_0009638'])
#print(clean_triples.loc[clean_triples['subject_id']=='http://www.ebi.ac.uk/efo/EFO_0009638'])

1 8659
2 8659
3 8659
4 8659
5 8659
6 8659
7 8659
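
The replacement is repeated because the target of one mapping may itself be the key of another mapping (a chain such as NCIT to EFO to OBI). An alternative, sketched below under that assumption, is to flatten the chains in the dictionary once so that a single replacement pass suffices. The `resolve` and `flatten` helpers are illustrative and not part of the notebook.

```python
def resolve(iri, mapping):
    """Follow mapping links from iri until an IRI with no further mapping is reached."""
    seen = {iri}
    while iri in mapping:
        iri = mapping[iri]
        if iri in seen:  # cycle guard: stop instead of looping forever
            break
        seen.add(iri)
    return iri


def flatten(mapping):
    """Map every key directly to the end of its chain."""
    return {key: resolve(key, mapping) for key in mapping}


# Toy chain: ex:n -> ex:e -> ex:o
chain = {"ex:n": "ex:e", "ex:e": "ex:o"}
flat = flatten(chain)
print(flat)
```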


In [23]:
name_map = pd.read_csv(os.path.join(result_path,'name_iri_map.tsv'),delimiter='\t',header=0,index_col=0)
print(name_map.head(n=2))

                                                id        name
0            https://www.w3.org/2002/07/owl#sameAs      sameAs
1  http://www.w3.org/2000/01/rdf-schema#subClassOf  subClassOf


In [34]:
## clean up the labels for the triples
tmpdf = clean_triples[['subject_id','predicate_id','object_id']].copy()
def map_triples(clean_triples,name_map):
    # clean_triples is expected to hold only the *_id columns; labels are re-attached from name_map below
    subject_map = name_map.copy()
    subject_map.rename(columns={'name':'subject','id':'subject_id'},inplace=True)
    predicate_map = name_map.copy()
    predicate_map.rename(columns={'name':'predicate','id':'predicate_id'},inplace=True)
    object_map = name_map.copy()
    object_map.rename(columns={'name':'object','id':'object_id'},inplace=True)
    tmpdf = clean_triples.merge(subject_map,on='subject_id',how='left')
    tmp2df = tmpdf.merge(object_map,on='object_id',how='left')
    tmp3df = tmp2df.merge(predicate_map,on='predicate_id',how='left')
    mapped_triples = tmp3df.drop_duplicates(keep='first')
    return mapped_triples

clean_triples = map_triples(tmpdf,name_map)

In [35]:
print(clean_triples.head(n=2))

                                        subject_id  \
0  http://www.bioassayontology.org/bao#BAO_0002540   
1  http://www.bioassayontology.org/bao#BAO_0010204   

                                      predicate_id  \
0  http://www.w3.org/2000/01/rdf-schema#subClassOf   
1  http://www.w3.org/2000/01/rdf-schema#subClassOf   

                                         object_id  \
0       http://purl.obolibrary.org/obo/OBI_0003369   
1  http://www.bioassayontology.org/bao#BAO_0003008   

                              subject             object   predicate  
0  ECL western blotting detection kit          assay kit  subClassOf  
1         transporter substrate assay  transporter assay  subClassOf  


### Verify the cleaning process

In [28]:
test_list = list(ordered_mapping_dict.keys())
test_df = lineage_triples.loc[(lineage_triples['subject_id'].isin(test_list[0:10])|(lineage_triples['object_id']).isin(test_list[0:10]))].copy()
#print(test_list[0:30])
print(ordered_mapping_dict[test_list[0]])
print(len(test_df))
print(test_df)

['http://purl.obolibrary.org/obo/NCIT_C101294', 'http://purl.obolibrary.org/obo/NCIT_C16681', 'http://edamontology.org/topic_4028', 'http://edamontology.org/topic_3177', 'http://edamontology.org/topic_3676', 'http://edamontology.org/topic_3169', 'http://edamontology.org/topic_3170', 'http://edamontology.org/topic_0177', 'http://edamontology.org/topic_2271', 'http://edamontology.org/topic_0183', 'http://edamontology.org/topic_3474', 'http://edamontology.org/topic_0221', 'http://edamontology.org/topic_4014', 'http://edamontology.org/topic_3077', 'http://edamontology.org/topic_0182', 'http://edamontology.org/topic_3385', 'http://edamontology.org/topic_4017', 'http://edamontology.org/topic_0133', 'http://edamontology.org/topic_3448', 'http://edamontology.org/topic_2828', 'http://edamontology.org/topic_0611', 'http://edamontology.org/topic_0134', 'http://edamontology.org/topic_4016', 'http://edamontology.org/topic_3452', 'http://www.ebi.ac.uk/efo/EFO_0010935', 'http://www.ebi.ac.uk/efo/EFO_

In [27]:
test_df['subject_id'] = test_df['subject_id'].replace(to_replace=ordered_mapping_dict)
test_df['object_id'] = test_df['object_id'].replace(to_replace=ordered_mapping_dict)
test_clean = test_df.drop_duplicates(keep='first')  # separate name so clean_triples is not overwritten
print(test_clean)

Empty DataFrame
Columns: [subject_id, predicate_id, object_id, subject, object, predicate]
Index: []


## Focus on nodes with multiple parents

In [36]:
## Get the nodes with multiple parents
tmp = clean_triples.groupby(['subject_id']).size().reset_index(name="counts")
multi_parent = tmp.loc[tmp['counts']>1]
print(len(multi_parent))

1172


In [37]:
## get the immediate parents and children
multilist = multi_parent['subject_id'].unique().tolist()
p1f1 = clean_triples.loc[(clean_triples['subject_id'].isin(multilist))|(clean_triples['object_id'].isin(multilist))].copy()
print(len(p1f1))

4065


In [38]:
## expand to the grandparents and grandchildren
p1f1list = list(set(p1f1['object_id'].unique().tolist()).union(set(p1f1['subject_id'].unique().tolist())))
print(len(p1f1list))
p2f2 = clean_triples.loc[(clean_triples['subject_id'].isin(p1f1list))|(clean_triples['object_id'].isin(p1f1list))].copy()
print(len(p2f2))

3043
6061
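
The two cells above grow the selection outward from the multi-parent nodes one hop at a time. A plain-Python sketch of the same expansion on toy triples is shown below; the single-letter node names and the `expand` helper are made up for illustration.

```python
from collections import Counter

# Toy subClassOf triples; "c" has two parents ("d" and "x").
triples = [
    ("a", "subClassOf", "b"),
    ("b", "subClassOf", "c"),
    ("c", "subClassOf", "d"),
    ("c", "subClassOf", "x"),
    ("d", "subClassOf", "e"),
]

# Nodes that appear as subject more than once have multiple parents.
parent_counts = Counter(s for s, _, _ in triples)
multi_parent = {node for node, n in parent_counts.items() if n > 1}


def expand(triples, seeds):
    """Return the triples touching any seed node and the set of nodes on them."""
    hit = [(s, p, o) for s, p, o in triples if s in seeds or o in seeds]
    nodes = {s for s, _, _ in hit} | {o for _, _, o in hit}
    return hit, nodes


hop1, hop1_nodes = expand(triples, multi_parent)  # analogous to p1f1
hop2, hop2_nodes = expand(triples, hop1_nodes)    # analogous to p2f2
print(len(hop1), len(hop2))
```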


In [39]:
#### search for different nodes with same parent and same children

## if subject is subClassOf 2 objects AND the 2 objects(now subjects) are subClassOf same object, keep

## 1. Get the subjects that have 2 parents
print(len(multi_parent))

## 2. Get the parents of those subjects
p1 = clean_triples.loc[clean_triples['subject_id'].isin(multilist)].copy()
p1_list = p1['object_id'].unique().tolist()  # parents of the multi-parent subjects only
#print(p1.head(n=2))

## 3. Get the grandparents of those subjects
p2 = clean_triples.loc[clean_triples['subject_id'].isin(p1_list)].copy()
#print(p2)

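## 4. Get the subjects for which grandparents have multiple children
## NOTE: kidslist is built from tmp_kids (all parents), so the counts>1 filter in multi_kids is not applied here;
## the same holds for kidslist2 further down
tmp_kids = p1.groupby(['object_id']).size().reset_index(name='counts')
multi_kids = tmp_kids.loc[tmp_kids['counts']>1]
kidslist = tmp_kids['object_id'].unique().tolist()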
qualkids = p1.loc[p1['object_id'].isin(kidslist)]
print(len(qualkids))
print(qualkids.head(n=2))
cleankids = qualkids.drop_duplicates(subset=['subject_id','object_id'],keep='first')
print(len(cleankids))
potential_syn = cleankids.copy()

tmp_kids2 = p2.groupby(['object_id']).size().reset_index(name='counts')
multi_kids2 = tmp_kids2.loc[tmp_kids2['counts']>1]
kidslist2 = tmp_kids2['object_id'].unique().tolist()
qualkids2 = p2.loc[p2['object_id'].isin(kidslist2)]
cleankids2 = qualkids2.drop_duplicates(subset=['subject_id','object_id'],keep='first')
potential_syn2 = cleankids2.copy()

## 5. Assemble original subject (with multi-parents), with grandparents that have multiple kids
#potential_syn = pd.concat((multi_parent,qualkids),ignore_index=True)
#print(potential_syn.head(n=4))
#print(len(potential_syn))

1172
2558
                                         subject_id  \
23      http://purl.obolibrary.org/obo/CHMO_0000063   
30  http://www.bioassayontology.org/bao#BAO_0003003   

                                       predicate_id  \
23  http://www.w3.org/2000/01/rdf-schema#subClassOf   
30  http://www.w3.org/2000/01/rdf-schema#subClassOf   

                                          object_id  \
23  http://www.bioassayontology.org/bao#BAO_0000050   
30  http://www.bioassayontology.org/bao#BAO_0000042   

                                      subject                     object  \
23  bioluminescence resonance energy transfer            bioluminescence   
30                   cytokine secretion assay  Cue Signal Response assay   

     predicate  
23  subClassOf  
30  subClassOf  
2511
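
The goal stated at the top of this cell is to find different nodes that share the same parents and the same children, since those are candidates for being synonyms. One way to express that directly, as a rough sketch rather than the approach used above, is to group nodes by their (parents, children) signature. The node names and the `same_neighbourhood` helper are hypothetical.

```python
from collections import defaultdict


def same_neighbourhood(triples):
    """Group nodes that have identical parent sets and identical child sets.

    Nodes without parents or without children are skipped, otherwise every
    pair of sibling leaves would be reported.
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    for s, _, o in triples:
        parents[s].add(o)
        children[o].add(s)

    groups = defaultdict(list)
    for node in set(parents) | set(children):
        node_parents = frozenset(parents.get(node, ()))
        node_children = frozenset(children.get(node, ()))
        if node_parents and node_children:
            groups[(node_parents, node_children)].append(node)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Toy triples: t1 and t2 share the parent "root" and the child "kid".
toy = [
    ("kid", "subClassOf", "t1"),
    ("kid", "subClassOf", "t2"),
    ("t1", "subClassOf", "root"),
    ("t2", "subClassOf", "root"),
    ("t3", "subClassOf", "root"),
    ("leaf", "subClassOf", "t3"),
]
candidates = same_neighbourhood(toy)
print(candidates)
```

Groups found this way are only candidates; they still need manual review (for example in Cytoscape) before being added to the mapping dictionary.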


## Export for Cytoscape

In [41]:
dfs2export = {"allnodes":clean_triples,"p1f1":p1f1,"p2f2":p2f2,"potential_synonyms":potential_syn,"potential_synonyms2":potential_syn2}
iteration_round = "1"
def export_df(dfname,dfs2export):
    df = dfs2export[dfname]
    df.to_csv(os.path.join(result_path,'for_cytoscape',f'{dfname}_table_gen_{iteration_round}.tsv'),sep='\t',header=True)
    


In [78]:
## Additional tables for cytoscape (unnecessary)

def export_nodes(dfname,dfs2export):
    df = dfs2export[dfname]
    subject_table = df[['subject','subject_id']].copy()
    subject_table.rename(columns={'subject':'name','subject_id':'id'},inplace=True)
    object_table = df[['object','object_id']].copy()
    object_table.rename(columns={'object':'name','object_id':'id'},inplace=True)
    node_table = pd.concat((subject_table,object_table))
    node_table.drop_duplicates(keep='first',inplace=True)
    node_table.to_csv(os.path.join(result_path,'for_cytoscape',f'{dfname}_node_table.tsv'),sep='\t',header=True)

def export_edges(dfname,dfs2export):
    df = dfs2export[dfname]
    df['value'] = 1
    edge_table = df[['predicate','predicate_id','value']].copy()
    edge_table = edge_table.rename(columns={'predicate':'name','predicate_id':'id'})
    edge_table.drop_duplicates(keep='first',inplace=True)
    edge_table.to_csv(os.path.join(result_path,'for_cytoscape',f'{dfname}_edge_table.tsv'),sep='\t',header=True)
    network_name_table = df[['subject','predicate','object']].copy()
    network_name_table.to_csv(os.path.join(result_path,'for_cytoscape',f'{dfname}_network_name_table.tsv'),sep='\t',header=True)
    network_id_table = df[['subject_id','predicate_id','object_id']].copy()
    network_id_table.to_csv(os.path.join(result_path,'for_cytoscape',f'{dfname}_network_id_table.tsv'),sep='\t',header=True)
    network_val_table = df[['subject','value','object']].copy()
    network_val_table.to_csv(os.path.join(result_path,'for_cytoscape',f'{dfname}_network_val_table.tsv'),sep='\t',header=True)

In [42]:
for dfname in list(dfs2export.keys()):
    #export_nodes(dfname,dfs2export)
    #export_edges(dfname,dfs2export)
    export_df(dfname,dfs2export)

## Mappings found based on Cytoscape analysis


### From p1 network
Microscopy Assay: http://purl.obolibrary.org/obo/OBI_0002119
Microscopy: http://purl.obolibrary.org/obo/CHMO_0000067

Tandem Mass spec: http://purl.obolibrary.org/obo/OBI_0003540
LC MS: http://purl.obolibrary.org/obo/OBI_0003097 
LC Tandem MS: http://purl.obolibrary.org/obo/CHMO_0000701 (subtype of tandem Mass/LCMS spec)

### From networks going farther up
Already in mappings:
Array: http://www.ebi.ac.uk/efo/EFO_0002698
Array assay: http://purl.obolibrary.org/obo/OBI_0001865

DNA Array: http://www.ebi.ac.uk/efo/EFO_0002701
DNA Microarray: http://purl.obolibrary.org/obo/OBI_0400148


Relationships not in mappings:
DNA Array http://www.ebi.ac.uk/efo/EFO_0002701 subClassOf nucleic acid array http://www.bioassayontology.org/bao#BAO_0000504


### Questions raised from inspecting the knowledge graph
* How should terms that are semantically different but related be treated? For example, the microarray device and the microarray technique both exist in some ontologies and have different roots.
  * Decision: Since the purpose is to standardize measurementTechniques, device terms should map to the technique terms whenever possible to minimize duplication.

## Analyze network with NetworkX

https://stackoverflow.com/questions/73154911/how-to-draw-a-graph-with-networkx-from-pandas-dataframe-with-node-size-depending

How BioPortal mappings are done: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4159173/ (LOOM = Lexical mapping)
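
To give a feel for what a lexical mapping does, the sketch below groups terms whose labels become identical after simple normalization. It is a simplified stand-in and not BioPortal's LOOM implementation; the normalization rule, the `min_length` threshold, the labels and the `ex:` IRIs are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def normalize(label):
    """Lowercase the label and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def lexical_candidates(labels, min_length=3):
    """Group IRIs whose normalized labels are identical.

    labels: dict of {iri: label}. Very short normalized labels are ignored
    because they tend to collide by chance.
    """
    buckets = defaultdict(set)
    for iri, label in labels.items():
        key = normalize(label)
        if len(key) >= min_length:
            buckets[key].add(iri)
    return {key: sorted(iris) for key, iris in buckets.items() if len(iris) > 1}


# Made-up labels for illustration.
labels = {
    "ex:1": "Western Blot",
    "ex:2": "western-blot",
    "ex:3": "microscopy",
}
matches = lexical_candidates(labels)
print(matches)
```

Purely lexical matches miss synonyms with different wording (for example "DNA Array" and "DNA Microarray" above), which is why the network-based inspection is still needed.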

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx

p1f1['values'] = 1
g = nx.from_pandas_edgelist(p1f1, source="subject", target="object")

d = p1f1.groupby("subject")["values"].sum().to_dict()
for node in g.nodes:
    d.setdefault(node, 1)

nodes, values = zip(*d.items())
nx.draw(g, nodelist=list(nodes), node_size=[v * 100 for v in values], with_labels=False)
plt.show()