# Semantic Enrichment of GOLPH3 Gene Dataset

## Description

This project semantically enriches genomic and phenotypic data extracted from Ensembl BioMart. The target dataset focuses on the GOLPH3 gene and includes both genomic and phenotypic data points.

The GOLPH3 gene encodes an adapter essential for the recycling of Golgi glycosylation enzymes involved in the glycosphingolipid (GSL) production pathway. Overexpression of GOLPH3 leads to chemoresistance in tumor cells ([Rizzo et al.](https://pubmed.ncbi.nlm.nih.gov/33749896/)). This project aims to uncover additional disease pathways in which GOLPH3 may be involved, using a knowledge-driven research approach based on disease ontologies.


The phenotypic data within the dataset has been enriched using two disease ontologies: DOID ([Human Disease Ontology](https://bioportal.bioontology.org/ontologies/DOID)) and MONDO ([Mondo Disease Ontology](https://bioportal.bioontology.org/ontologies/MONDO)). An ontology matcher was developed to facilitate this process. This matcher searches for matches within all synonyms of the ontology classes and appends the IRI (Internationalized Resource Identifier) class ID to the original dataset.

Cross-linking the Ensembl phenotype descriptions with DOID and MONDO terms improves data interoperability and supports the FAIR principles: each matched record carries the IRI of an ontology class, which makes the dataset easier to integrate and query.
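
The matching idea described above can be sketched on a toy ontology. This is a minimal illustration, not the notebook's implementation: `toy_ontology` and `match_term` are hypothetical names, and the two classes are copied from outputs shown later in this notebook.

```python
# Toy ontology: each class has an IRI, a preferred label and pipe-separated synonyms.
toy_ontology = [
    {"Class ID": "http://purl.obolibrary.org/obo/DOID_3451",
     "Preferred Label": "skin carcinoma",
     "Synonyms": "carcinoma of skin"},
    {"Class ID": "http://purl.obolibrary.org/obo/MONDO_0016727",
     "Preferred Label": "extraventricular neurocytoma",
     "Synonyms": "extraventricular neurocytoma (WHO grade II)|EVN"},
]

def match_term(term, ontology):
    """Return the IRI of the first class whose label or synonym equals `term`
    (case-insensitive exact match), or None when nothing matches."""
    needle = term.lower()
    for cls in ontology:
        names = cls["Synonyms"].split("|") + [cls["Preferred Label"]]
        if needle in (name.lower() for name in names):
            return cls["Class ID"]
    return None

print(match_term("Skin Carcinoma", toy_ontology))
print(match_term("evn", toy_ontology))
```

The match is exact after lowercasing, so spelling variants that are not listed as synonyms are not linked.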


### Key Features
- **Semantic Enrichment**: Enriches phenotypic data using Human Disease Ontology (DOID) and Mondo Disease Ontology (MONDO).
- **Ontology Matcher**: Custom-built tool for matching dataset terms with ontology synonyms and appending Internationalized Resource Identifiers (IRIs) for enhanced data linkage.
- **Data Integration**: Facilitates efficient data integration and retrieval through semantic annotations.


[Here is a link to the project repository](https://github.com/your-repo)

### Contact
For questions or contributions, please contact [giovannimaria.defilippis@unina.it].

# Import requirements

In [1]:
import os
import csv
import gdown
import requests
import pandas as pd


if 'data' not in os.getcwd():
    if not os.path.exists('data'):
        os.makedirs('data')

    os.chdir('data/')


# Bio-Dataset: Ensembl Gene GOLPH3 features

In [None]:
%%time
# Define the Biomart server URL and dataset (Human)
url = 'http://www.ensembl.org/biomart/martservice'
dataset = 'hsapiens_gene_ensembl'

# Define the XML query for the gene GOLPH3
xml_query = """
<Query virtualSchemaName = "default" formatter = "TSV" header = "1" uniqueRows = "1" count = "" datasetConfigVersion = "0.6">
    <Dataset name = "{}" interface = "default">
        <Filter name="hgnc_symbol" value="GOLPH3"/>
        <Attribute name="ensembl_gene_id"/>
        <Attribute name="ensembl_gene_id_version"/>
        <Attribute name="ensembl_peptide_id"/>
        <Attribute name="ensembl_peptide_id_version"/>
        <Attribute name="description"/>
        <Attribute name="chromosome_name"/>
        <Attribute name="start_position"/>
        <Attribute name="end_position"/>
        <Attribute name="strand"/>
        <Attribute name="band"/>
        <Attribute name="external_gene_name"/>
        <Attribute name="external_gene_source"/>
        <Attribute name="percentage_gene_gc_content"/>
        <Attribute name="gene_biotype"/>
        <Attribute name="source"/>
        <Attribute name="version"/>
        <Attribute name="peptide_version"/>
        <Attribute name="external_synonym"/>
        <Attribute name="phenotype_description"/>
        <Attribute name="source_name"/>
        <Attribute name="study_external_id"/>
        <Attribute name="strain_name"/>
        <Attribute name="strain_gender"/>
        <Attribute name="p_value"/>
    </Dataset>
</Query>
""".format(dataset)

# Send the request to the BioMart server and fail early on HTTP errors
response = requests.post(url, data={'query': xml_query})
response.raise_for_status()
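
A successful HTTP status does not guarantee that the body is the expected table, so it can help to check the text before parsing it. The sketch below is a hypothetical helper (`check_tsv_response` is not part of the notebook) that verifies the header row has the expected number of tab-separated columns. The `Query ERROR` prefix is an assumption about how BioMart reports problems in the body and should be verified against the live service.

```python
def check_tsv_response(text, expected_columns):
    """Validate a TSV response body and return the number of data rows."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty response body")
    # Assumption: an error message in the body starts with "Query ERROR".
    if lines[0].startswith("Query ERROR"):
        raise ValueError(lines[0])
    n_columns = len(lines[0].split("\t"))
    if n_columns != expected_columns:
        raise ValueError(f"expected {expected_columns} columns, got {n_columns}")
    return len(lines) - 1  # header excluded

print(check_tsv_response("Gene stable ID\tPhenotype description\nENSG00000113384\tskin carcinoma\n", 2))
```

In this notebook the call would be `check_tsv_response(response.text, 24)`, since the query requests 24 attributes.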


In [10]:
from io import StringIO

# The response body is tab-separated text (the query requests formatter = "TSV")
data = response.text
data_io = StringIO(data)
ds = pd.read_table(data_io)
new_col_names = ['ensembl_gene_id', 'ensembl_gene_id_version', 'ensembl_peptide_id',
                 'ensembl_peptide_id_version', 'description', 'chromosome_name',
                 'start_position', 'end_position', 'strand', 'band',
                 'external_gene_name', 'external_gene_source',
                 'percentage_gene_gc_content', 'gene_biotype', 'source', 'version',
                 'peptide_version', 'external_synonym', 'phenotype_description',
                 'source_name', 'study_external_id', 'strain_name', 'strain_gender',
                 'p_value']
# Set the new column names to the DataFrame
ds.columns = new_col_names
ds.to_csv('emsembl_golph3_dataset.csv')
ds

Unnamed: 0,ensembl_gene_id,ensembl_gene_id_version,ensembl_peptide_id,ensembl_peptide_id_version,description,chromosome_name,start_position,end_position,strand,band,...,source,version,peptide_version,external_synonym,phenotype_description,source_name,study_external_id,strain_name,strain_gender,p_value
0,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Endometrial Endometrioid Adenocarcinoma,Cancer Gene Census,,,,
1,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Hepatobiliary Neoplasm,Cancer Gene Census,,,,
2,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,ND,Cancer Gene Census,,,,
3,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Ovarian Endometrioid Adenocarcinoma with Squam...,Cancer Gene Census,,,,
4,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Uterine Carcinosarcoma,Cancer Gene Census,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
451,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,skin carcinoma,Cancer Gene Census,,,,
452,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,Small cell lung carcinoma,Cancer Gene Census,,,,
453,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,soft tissue sarcoma,Cancer Gene Census,,,,
454,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,squamous cell lung carcinoma,Cancer Gene Census,,,,


## Import Dataset

In [2]:
# load dataset
ds_name = 'emsembl_golph3_dataset'#'ensemble_gene_features_entrez64083_golph3'
ds = pd.read_csv(ds_name+'.csv', index_col=0)
ds

Unnamed: 0,ensembl_gene_id,ensembl_gene_id_version,ensembl_peptide_id,ensembl_peptide_id_version,description,chromosome_name,start_position,end_position,strand,band,...,source,version,peptide_version,external_synonym,phenotype_description,source_name,study_external_id,strain_name,strain_gender,p_value
0,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Endometrial Endometrioid Adenocarcinoma,Cancer Gene Census,,,,
1,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Hepatobiliary Neoplasm,Cancer Gene Census,,,,
2,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,ND,Cancer Gene Census,,,,
3,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Ovarian Endometrioid Adenocarcinoma with Squam...,Cancer Gene Census,,,,
4,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Uterine Carcinosarcoma,Cancer Gene Census,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
451,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,skin carcinoma,Cancer Gene Census,,,,
452,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,Small cell lung carcinoma,Cancer Gene Census,,,,
453,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,soft tissue sarcoma,Cancer Gene Census,,,,
454,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,squamous cell lung carcinoma,Cancer Gene Census,,,,


# Import Bio Ontologies

## Download DOID
Source Dataset :: https://data.bioontology.org/ontologies/DOID/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv

In [11]:
import gdown
import gzip
import shutil

# The URL to download the CSV file
url = "https://data.bioontology.org/ontologies/DOID/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv"

# The path where the file will be saved
output = "DOID.gz"
extracted_file = "DOID.csv"

# Download the file
gdown.download(url, output, quiet=False)

# Extract the .gz file
with gzip.open(output, 'rb') as f_in:
    with open(extracted_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
        
# Delete the .gz file
os.remove(output)

Downloading...
From: https://data.bioontology.org/ontologies/DOID/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv
To: G:\Altri computer\Horizon\horizon_workspace\projects\work\semantics\ontology\onto_gene_disease_analysis\data\DOID.gz
100%|██████████| 1.82M/1.82M [00:05<00:00, 329kB/s]


## Import DOID

In [11]:
# load csv ontology
doid = pd.read_csv('DOID.csv', low_memory=False)
onto_label = 'DOID'
onto = doid
onto

Unnamed: 0,Class ID,Preferred Label,Synonyms,Definitions,Obsolete,CUI,Semantic Types,Parents,adjacent to,auto-generated-by,...,saved-by,sexually_transmitted_infectious_disease,spatially disjoint from,subset_property,term replaced by,tick-borne_infectious_disease,title,TopNodes_DOcancerslim,transmitted by,zoonotic_infectious_disease
0,http://purl.obolibrary.org/obo/DOID_3412,obsolete infectious canine hepatitis,,A viral infectious disease that involves necro...,True,,,,,,...,,,,,,,,,,
1,http://purl.obolibrary.org/obo/UBERON_0009623,spinal nerve root,,,False,,,http://purl.obolibrary.org/obo/UBERON_0002211,,,...,,,,,,,,,,
2,http://purl.obolibrary.org/obo/DOID_1204,obsolete arthropathy due to hypersensitivity r...,Arthropathy associated with hypersensitivity r...,,True,,,,,,...,,,,,,,,,,
3,http://purl.obolibrary.org/obo/DOID_12040,obsolete immune hydrops fetalis,Hydrops fetalis due to isoimmunization [dup] (...,,True,,,,,,...,,,,,,,,,,
4,http://purl.obolibrary.org/obo/DOID_12043,kernicterus due to isoimmunization,,,False,,,http://purl.obolibrary.org/obo/DOID_2383,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18653,http://purl.obolibrary.org/obo/DOID_8080,obsolete ovarian mucinous cystic tumor associa...,,,True,,,,,,...,,,,,,,,,,
18654,http://purl.obolibrary.org/obo/DOID_9878,obsolete Excessive vomiting starting after 22 ...,antepartum late vomiting of pregnancy,,True,,,,,,...,,,,,,,,,,
18655,http://purl.obolibrary.org/obo/MIM_226400,susceptibility to epidermodysplasia verrucifor...,,,False,,,http://purl.obolibrary.org/obo/MIM_000000,,,...,,,,,,,,,,
18656,http://purl.obolibrary.org/obo/DOID_3705,fallopian tube mucinous tumor,,A fallopian tube benign neoplasm that produces...,False,,,http://purl.obolibrary.org/obo/DOID_0060111,,,...,,,,,,,,,,


## Download MONDO

In [13]:
import gdown
import gzip
import shutil

# The URL to download the CSV file
url = "https://data.bioontology.org/ontologies/MONDO/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv"

# The path where the file will be saved
output = "MONDO.gz"
extracted_file = "MONDO.csv"

# Download the file
gdown.download(url, output, quiet=False)

# Extract the .gz file
with gzip.open(output, 'rb') as f_in:
    with open(extracted_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Delete the .gz file
os.remove(output)

Downloading...
From: https://data.bioontology.org/ontologies/MONDO/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv
To: G:\Altri computer\Horizon\horizon_workspace\projects\work\semantics\ontology\onto_gene_disease_analysis\data\MONDO.gz
100%|██████████| 5.62M/5.62M [00:11<00:00, 473kB/s]
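
The DOID and MONDO download cells repeat the same extract-and-delete steps. One possible way to factor them out is sketched below, using only the standard library; `extract_gzip` is a hypothetical helper name, and the download itself is left to `gdown.download` as in the cells above.

```python
import gzip
import os
import shutil

def extract_gzip(archive_path, target_path, remove_archive=True):
    """Decompress a .gz archive to target_path and optionally delete the archive."""
    with gzip.open(archive_path, 'rb') as f_in, open(target_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    if remove_archive:
        os.remove(archive_path)
    return target_path
```

After a download, the call would look like `extract_gzip('MONDO.gz', 'MONDO.csv')`.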


## Import MONDO

In [3]:
# load csv ontology
mondo = pd.read_csv('MONDO.csv', low_memory=False)
onto_label = 'MONDO'
onto = mondo
onto

Unnamed: 0,Class ID,Preferred Label,Synonyms,Definitions,Obsolete,CUI,Semantic Types,Parents,A synonym that is historic and discouraged,A synonym that is recorded for consistency with another source but is a misspelling,...,realized in response to stimulus,seeAlso,shorthand,should_conform_to,subset_property,Synonym to be removed from public release but maintained in edit version as record of external usage,synonym_type_property,term replaced by,transmitted by,UK spelling synonym
0,http://purl.obolibrary.org/obo/MONDO_0044647,kyphosis-lateral tongue atrophy-myofibrillar m...,,,False,,,http://purl.obolibrary.org/obo/MONDO_0018943,,,...,,,,,,,,,,
1,http://purl.obolibrary.org/obo/MONDO_1010032,"Jacobsen syndrome, non-human animal",,Jacobsen syndrome that occurs in non-human ani...,False,,,http://purl.obolibrary.org/obo/MONDO_1011360|h...,,,...,,,,,,,,,,
2,http://purl.obolibrary.org/obo/MONDO_0007146,"obsolete apnea, central sleep",,,True,,,,,,...,,,,,,,,MONDO:0008807,,
3,http://purl.obolibrary.org/obo/MONDO_0017964,"obsolete 46,XX disorder of sex development ind...","46,XX DSD induced by exogenous maternal-derive...",,True,,,,,,...,,,,,,,,,,
4,http://purl.obolibrary.org/obo/MONDO_0014635,"microphthalmia, isolated, with coloboma 10","MCOPCB10|microphthalmia, isolated, with colobo...","Any microphthalmia, isolated, with coloboma in...",False,,,http://purl.obolibrary.org/obo/MONDO_0000170,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34019,http://purl.obolibrary.org/obo/MONDO_0011369,"hypercholesterolemia, autosomal dominant, 3",Fh3|HCHOLA3|low density lipoprotein cholestero...,Any familial hypercholesterolemia in which the...,False,,,http://purl.obolibrary.org/obo/MONDO_0005439,,,...,,,,,,,,,,
34020,http://purl.obolibrary.org/obo/MONDO_0012591,osteogenesis imperfecta type 5,OI with calcification in interosseous membrane...,Osteogenesis imperfecta type V is a moderate t...,False,,,http://purl.obolibrary.org/obo/MONDO_0019019|h...,,,...,,,,,,,,,,
34021,http://purl.obolibrary.org/obo/MONDO_0016727,extraventricular neurocytoma,extraventricular neurocytoma (WHO grade II)|EVN,"Extraventricular neurocytoma (EVN), a variant ...",False,,,http://purl.obolibrary.org/obo/MONDO_0016729,,,...,,,,,,,,,,
34022,http://purl.obolibrary.org/obo/MONDO_0015308,laminopathy type Decaudain-Vigouroux,laminopathy with severe metabolic syndrome and...,"Laminopathy, type Decaudain-Vigouroux is chara...",False,,,http://purl.obolibrary.org/obo/MONDO_0021187,,,...,,,,,,,,,,


In [6]:
import tiktoken

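def load_file(file='', path=None):
    # Resolve the directory at call time; a default of os.getcwd() would be fixed at definition time
    path = os.getcwd() if path is None else path
    # The context manager closes the file, so no explicit close() is needed
    with open(os.path.join(path, file), 'r', encoding='utf-8') as f:
        return f.read()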

# Initialize the tokenizer for the specified encoder
tokenizer = tiktoken.encoding_for_model("gpt-4")

# Load the file and count the tokens
tokens = len(tokenizer.encode(str(load_file(file='DOID.csv'))))
tokens

3587462

In [18]:
tokens = len(tokenizer.encode(str(load_file(file='MONDO.csv'))))
tokens

11244106

## Extract Ontology Class Synonyms as a List

In [41]:
onto.Synonyms

0                                                      NaN
1                                                      NaN
2                                                      NaN
3        46,XX DSD induced by exogenous maternal-derive...
4        MCOPCB10|microphthalmia, isolated, with colobo...
                               ...                        
34019    Fh3|HCHOLA3|low density lipoprotein cholestero...
34020    OI with calcification in interosseous membrane...
34021      extraventricular neurocytoma (WHO grade II)|EVN
34022    laminopathy with severe metabolic syndrome and...
34023    OFC5|cleft lip with or without cleft palate, n...
Name: Synonyms, Length: 34024, dtype: object

In [4]:
# Split a pipe-separated synonym string into a list
def split_synonyms(synonyms):
    if synonyms is None:
        return []
    return synonyms.split('|')

# Create the 'synonims_list' column from the 'Synonyms' column
def df_splitter(df):
    # Note: astype(str) turns missing values into the string 'nan',
    # so classes without synonyms get the list ['nan'] here
    df['Synonyms'] = df['Synonyms'].astype(str)
    df['synonims_list'] = df['Synonyms'].apply(split_synonyms)
    return df


# Function to add value to lists
def add_value(row):
    row['synonims_list'].append(row['Preferred Label'])
    return row

# Composite Function
def extract_syn_list(onto):
    onto = df_splitter(onto)
    
    # Apply the function to each row
    onto = onto.apply(add_value, axis=1)
    return onto
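
Because the `Synonyms` column is cast to `str`, classes without synonyms end up with a literal `'nan'` entry in their list, as the outputs below show. A variant that skips missing values is sketched here with the standard library only; `split_synonyms_safe` and `names_for_class` are hypothetical names, and this is an alternative under that assumption rather than the code used to produce the results in this notebook.

```python
import math

def split_synonyms_safe(synonyms):
    """Split a pipe-separated synonym string; return [] for None or NaN."""
    if synonyms is None:
        return []
    if isinstance(synonyms, float) and math.isnan(synonyms):
        return []
    # Drop empty fragments that can appear with leading or trailing pipes
    return [s for s in str(synonyms).split('|') if s]

def names_for_class(synonyms, preferred_label):
    """All names of a class: its synonyms followed by the preferred label."""
    return split_synonyms_safe(synonyms) + [preferred_label]

print(names_for_class(float('nan'), 'spinal nerve root'))
print(names_for_class('extraventricular neurocytoma (WHO grade II)|EVN',
                      'extraventricular neurocytoma'))
```

With this variant a missing phenotype description can no longer match the placeholder string `'nan'`.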

In [5]:
onto = extract_syn_list(onto)

# Output the updated DataFrame
onto[['synonims_list','Preferred Label']]

Unnamed: 0,synonims_list,Preferred Label
0,"[nan, kyphosis-lateral tongue atrophy-myofibri...",kyphosis-lateral tongue atrophy-myofibrillar m...
1,"[nan, Jacobsen syndrome, non-human animal]","Jacobsen syndrome, non-human animal"
2,"[nan, obsolete apnea, central sleep]","obsolete apnea, central sleep"
3,"[46,XX DSD induced by exogenous maternal-deriv...","obsolete 46,XX disorder of sex development ind..."
4,"[MCOPCB10, microphthalmia, isolated, with colo...","microphthalmia, isolated, with coloboma 10"
...,...,...
34019,"[Fh3, HCHOLA3, low density lipoprotein cholest...","hypercholesterolemia, autosomal dominant, 3"
34020,[OI with calcification in interosseous membran...,osteogenesis imperfecta type 5
34021,"[extraventricular neurocytoma (WHO grade II), ...",extraventricular neurocytoma
34022,[laminopathy with severe metabolic syndrome an...,laminopathy type Decaudain-Vigouroux


In [12]:
%%time
doid = extract_syn_list(doid)
doid[['synonims_list','Preferred Label']].head()

CPU times: total: 1.05 s
Wall time: 1.05 s


Unnamed: 0,synonims_list,Preferred Label
0,"[nan, obsolete infectious canine hepatitis]",obsolete infectious canine hepatitis
1,"[nan, spinal nerve root]",spinal nerve root
2,[Arthropathy associated with hypersensitivity ...,obsolete arthropathy due to hypersensitivity r...
3,[Hydrops fetalis due to isoimmunization [dup] ...,obsolete immune hydrops fetalis
4,"[nan, kernicterus due to isoimmunization]",kernicterus due to isoimmunization


In [13]:
%%time
mondo = extract_syn_list(mondo)
mondo[['synonims_list','Preferred Label']].head()

CPU times: total: 2.25 s
Wall time: 2.32 s


Unnamed: 0,synonims_list,Preferred Label
0,"[nan, kyphosis-lateral tongue atrophy-myofibri...",kyphosis-lateral tongue atrophy-myofibrillar m...
1,"[nan, Jacobsen syndrome, non-human animal]","Jacobsen syndrome, non-human animal"
2,"[nan, obsolete apnea, central sleep]","obsolete apnea, central sleep"
3,"[46,XX DSD induced by exogenous maternal-deriv...","obsolete 46,XX disorder of sex development ind..."
4,"[MCOPCB10, microphthalmia, isolated, with colo...","microphthalmia, isolated, with coloboma 10"


# Ontology Matcher

In [21]:
# Dataset column to match
ds.phenotype_description

0                Endometrial Endometrioid Adenocarcinoma
1                                 Hepatobiliary Neoplasm
2                                                     ND
3      Ovarian Endometrioid Adenocarcinoma with Squam...
4                                 Uterine Carcinosarcoma
                             ...                        
451                                       skin carcinoma
452                            Small cell lung carcinoma
453                                  soft tissue sarcoma
454                         squamous cell lung carcinoma
455                                    thyroid carcinoma
Name: phenotype_description, Length: 456, dtype: object

In [14]:
# check data
for index, syn_list in enumerate(onto['synonims_list'][:10]):
    print(index, syn_list)

0 ['nan', 'obsolete infectious canine hepatitis']
1 ['nan', 'spinal nerve root']
2 ['Arthropathy associated with hypersensitivity reaction', 'Arthropathy associated with a hypersensitivity reaction (disorder)', 'obsolete arthropathy due to hypersensitivity reaction']
3 ['Hydrops fetalis due to isoimmunization [dup] (disorder)', 'Hydrops fetalis due to isoimmunization (disorder)', 'Hydrops fetalis due to isoimmunization', 'Hydrops fetalis - due to isoim', 'obsolete immune hydrops fetalis']
4 ['nan', 'kernicterus due to isoimmunization']
5 ['muscular dystrophy-dystroglycanopathy limb-girdle GMPPB-related', 'muscular dystrophy-dystroglycanopathy (limb-girdle) type C14', 'muscular dystrophy limb-girdle type 2T', 'MDDGC14', 'LGMD2T', 'autosomal recessive limb-girdle muscular dystrophy type 2T']
6 ['muscular dystrophy-dystroglycanopathy (limb-girdle) type C7', 'muscular dystrophy limb-girdle type 2U', 'autosomal recessive limb-girdle muscular dystrophy due to ISPD deficiency', 'MDDGC7', 'LGM

In [18]:
# check data by row position (avoids shadowing the built-in id)
row_pos = 314
doid['synonims_list'].iloc[row_pos], doid['Class ID'].iloc[row_pos]

(['nan', 'microcephaly, growth deficiency, seizures, and brain malformations'],
 'http://purl.obolibrary.org/obo/DOID_0081051')

In [20]:
# check syn list
for syn_list in doid['synonims_list']:
    print(syn_list)

['nan', 'obsolete infectious canine hepatitis']
['nan', 'spinal nerve root']
['Arthropathy associated with hypersensitivity reaction', 'Arthropathy associated with a hypersensitivity reaction (disorder)', 'obsolete arthropathy due to hypersensitivity reaction']
['Hydrops fetalis due to isoimmunization [dup] (disorder)', 'Hydrops fetalis due to isoimmunization (disorder)', 'Hydrops fetalis due to isoimmunization', 'Hydrops fetalis - due to isoim', 'obsolete immune hydrops fetalis']
['nan', 'kernicterus due to isoimmunization']
['muscular dystrophy-dystroglycanopathy limb-girdle GMPPB-related', 'muscular dystrophy-dystroglycanopathy (limb-girdle) type C14', 'muscular dystrophy limb-girdle type 2T', 'MDDGC14', 'LGMD2T', 'autosomal recessive limb-girdle muscular dystrophy type 2T']
['muscular dystrophy-dystroglycanopathy (limb-girdle) type C7', 'muscular dystrophy limb-girdle type 2U', 'autosomal recessive limb-girdle muscular dystrophy due to ISPD deficiency', 'MDDGC7', 'LGMD2U', 'autosom

## Ontology Matcher

In [2]:
# Load Dataset
ds = pd.read_csv('emsembl_golph3_dataset.csv', index_col=0)

In [5]:
# Choose Ontology 
onto = pd.read_csv('MONDO.csv', low_memory=False)
onto_label = 'MONDO'
onto = extract_syn_list(onto)

# Output the updated DataFrame
onto[['synonims_list','Preferred Label']]

Unnamed: 0,synonims_list,Preferred Label
0,"[nan, kyphosis-lateral tongue atrophy-myofibri...",kyphosis-lateral tongue atrophy-myofibrillar m...
1,"[nan, Jacobsen syndrome, non-human animal]","Jacobsen syndrome, non-human animal"
2,"[nan, obsolete apnea, central sleep]","obsolete apnea, central sleep"
3,"[46,XX DSD induced by exogenous maternal-deriv...","obsolete 46,XX disorder of sex development ind..."
4,"[MCOPCB10, microphthalmia, isolated, with colo...","microphthalmia, isolated, with coloboma 10"
...,...,...
34019,"[Fh3, HCHOLA3, low density lipoprotein cholest...","hypercholesterolemia, autosomal dominant, 3"
34020,[OI with calcification in interosseous membran...,osteogenesis imperfecta type 5
34021,"[extraventricular neurocytoma (WHO grade II), ...",extraventricular neurocytoma
34022,[laminopathy with severe metabolic syndrome an...,laminopathy type Decaudain-Vigouroux


In [None]:
%%time
# Ontology matcher on the dataset: start with every row marked as unmatched
ds[onto_label+'_match'] = False

# Iterate over each row in the 'ds' dataframe
for index, row in ds.iterrows():
    # Get the lowercase 'phenotype_description' value for the current row
    phenotype_desc = str(row['phenotype_description']).lower()

    # Iterate over each list in the 'synonims_list' column of the 'onto' dataframe
    for syn_list in onto['synonims_list']:
        # Check if the lowercase 'phenotype_description' value is present in the current list
        if phenotype_desc in [str(syn).lower() for syn in syn_list]:
            # If it is present, set the corresponding row in 'ds' to True
            ds.at[index, onto_label+'_match'] = True
            break  # Exit the loop once a match is found

# Note: the onto_label+'_match' column is initialized to False before the loop,
#       so rows without a match simply keep that value.
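
The nested loop above rebuilds a lowercased synonym list for every ontology class on every dataset row. An alternative is to build a lookup table once and then resolve each description with a dictionary access. The sketch below shows that idea in plain Python; `build_synonym_index` is a hypothetical helper, not part of the notebook.

```python
def build_synonym_index(class_ids, synonym_lists):
    """Map each lowercased name to the first class ID that lists it."""
    index = {}
    for class_id, names in zip(class_ids, synonym_lists):
        for name in names:
            # setdefault keeps the first class seen, like the `break` in the loop
            index.setdefault(str(name).lower(), class_id)
    return index

index = build_synonym_index(
    ["http://purl.obolibrary.org/obo/DOID_3451"],
    [["carcinoma of skin", "skin carcinoma"]],
)
print(index.get("Skin Carcinoma".lower()))
```

On the real data the index could be built from `onto['Class ID']` and `onto['synonims_list']`, and each `phenotype_description` looked up after lowercasing.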

In [9]:
ds[['phenotype_description',
    'source_name', onto_label+'_match']]#.head()

Unnamed: 0,phenotype_description,source_name,MONDO_match
0,Endometrial Endometrioid Adenocarcinoma,Cancer Gene Census,True
1,Hepatobiliary Neoplasm,Cancer Gene Census,True
2,ND,Cancer Gene Census,True
3,Ovarian Endometrioid Adenocarcinoma with Squam...,Cancer Gene Census,True
4,Uterine Carcinosarcoma,Cancer Gene Census,True
...,...,...,...
451,skin carcinoma,Cancer Gene Census,True
452,Small cell lung carcinoma,Cancer Gene Census,True
453,soft tissue sarcoma,Cancer Gene Census,True
454,squamous cell lung carcinoma,Cancer Gene Census,True


## Run Ontology Matching

In [ ]:
ds

In [None]:
onto.T.head()

In [42]:
def ontology_matcher(ds, ontology, label):
    """Annotate `ds` in place with the first ontology class whose synonym list
    contains the row's phenotype description (case-insensitive exact match)."""
    # Iterate over each row in the 'ds' dataframe
    for index, row in ds.iterrows():
        # Get the lowercase 'phenotype_description' value for the current row
        phenotype_desc = str(row['phenotype_description']).lower()

        # Iterate over each list in the 'synonims_list' column of the ontology dataframe
        for onto_index, syn_list in enumerate(ontology['synonims_list']):
            # Check if the lowercase description is present in the current list
            if phenotype_desc in [str(syn).lower() for syn in syn_list]:
                # Record the match, the class IRI and the synonyms that produced it
                ds.at[index, label+'_match'] = True
                ds.at[index, label+'_ClassID'] = ontology['Class ID'].iloc[onto_index]
                ds.at[index, label+'_synonims'] = str(syn_list)
                break  # Exit the loop once a match is found

## MONDO

In [49]:
%%time 
# Choose Ontology
mondo = pd.read_csv('MONDO.csv', low_memory=False)
mondo = extract_syn_list(mondo)

ds = pd.read_csv('emsembl_golph3_dataset.csv', index_col=0)
# Initialize the match flag; ontology_matcher sets it to True for matched rows
ds['MONDO_match'] = False

ontology_matcher(ds, mondo, 'MONDO')

CPU times: total: 15.9 s
Wall time: 18.3 s


In [51]:
label = 'MONDO'
cols = ['ensembl_gene_id', 'ensembl_gene_id_version', 'ensembl_peptide_id',
        'ensembl_peptide_id_version', 'description', 'chromosome_name',
        'start_position', 'end_position', 'strand', 'band',
        'external_gene_name', 'external_gene_source',
        'percentage_gene_gc_content', 'gene_biotype', 'source', 'version',
        'peptide_version', 'external_synonym', 'phenotype_description',
        'source_name', 'p_value', label+'_match', label+'_ClassID', label+'_synonims']
ds[cols]#.to_csv('golph3_mondo_match.csv')

Unnamed: 0,ensembl_gene_id,ensembl_gene_id_version,ensembl_peptide_id,ensembl_peptide_id_version,description,chromosome_name,start_position,end_position,strand,band,...,source,version,peptide_version,external_synonym,phenotype_description,source_name,p_value,MONDO_match,MONDO_ClassID,MONDO_synonims
0,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Endometrial Endometrioid Adenocarcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0005461,"['endometrial endometrioid adenocarcinoma', 'e..."
1,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Hepatobiliary Neoplasm,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0002514,"['hepatobiliary benign neoplasm', 'hepatobilia..."
2,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,ND,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0010691,"['NDP', 'nd', 'Norrie syndrome', 'pseudoglioma..."
3,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Ovarian Endometrioid Adenocarcinoma with Squam...,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0006336,"['ovarian adenoacanthoma', 'ovarian adenosquam..."
4,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,6.0,GOPP1,Uterine Carcinosarcoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0016259,"['carcinosarcoma of the uterus', 'malignant mi..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
451,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,skin carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0002656,"['skin cancer, non-melanoma', 'non-melanoma ca..."
452,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,Small cell lung carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0008433,"['SCLC1', 'small cell cancer of the lung', 'po..."
453,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,soft tissue sarcoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0018078,"['malignant soft tissue tumour', 'malignant so..."
454,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,p13.3,...,ensembl_havana,14,,VPS74,squamous cell lung carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0005097,"['squamous cell carcinoma of lung', 'squamous ..."
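
In the output above, the description `ND` is linked to `MONDO_0010691` through the synonym `nd`. The source does not say whether `ND` is a genuine phenotype or a placeholder value. If it is a placeholder (an assumption), very short or placeholder strings could be filtered out before matching. The helper below is a hypothetical sketch; `is_matchable`, the minimum length of 3 and the placeholder list are illustrative choices, not values taken from the notebook.

```python
def is_matchable(term, min_length=3, placeholders=("nd", "nan")):
    """Return True when `term` looks like a real description worth matching."""
    text = "" if term is None else str(term).strip().lower()
    return len(text) >= min_length and text not in placeholders

for term in ["ND", "skin carcinoma", float("nan")]:
    print(term, is_matchable(term))
```

A length threshold also rejects legitimate two-letter abbreviations, so the cut-off would need to be checked against the descriptions actually present in the dataset.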


In [106]:
label = 'MONDO'
# onto_label still holds 'DOID' from the DOID cell, so both ontologies are shown side by side
small_ds = ds[['phenotype_description',
               'source_name',
               label+'_match', label+'_ClassID', label+'_synonims',
               onto_label+'_match', onto_label+'_ClassID', onto_label+'_synonims']]

small_ds.to_excel(ds_name+'_'+onto_label+'_'+label+'_match_s.xlsx')
small_ds

Unnamed: 0,phenotype_description,source_name,MONDO_match,MONDO_ClassID,MONDO_synonims,DOID_match,DOID_ClassID,DOID_synonims
0,Endometrial Endometrioid Adenocarcinoma,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0005461,"[endometrial endometrioid adenocarcinoma, endo...",True,http://purl.obolibrary.org/obo/DOID_2870,"[adenocarcinoma of the Endometrium, endometria..."
1,Hepatobiliary Neoplasm,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0002514,"[hepatobiliary benign neoplasm, hepatobiliary ...",False,,
2,ND,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0010691,"[NDP, nd, Norrie syndrome, pseudoglioma, Norri...",False,,
3,Ovarian Endometrioid Adenocarcinoma with Squam...,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0006336,"[ovarian adenoacanthoma, ovarian adenosquamous...",False,,
4,Uterine Carcinosarcoma,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0016259,[malignant mixed müllerian tumour of corpus ut...,True,http://purl.obolibrary.org/obo/DOID_6171,"[mixed mullerian sarcoma of uterus, uterine ca..."
...,...,...,...,...,...,...,...,...
451,skin carcinoma,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0002656,"[skin cancer, non-melanoma, non-melanoma cance...",True,http://purl.obolibrary.org/obo/DOID_3451,"[carcinoma of skin, skin carcinoma]"
452,Small cell lung carcinoma,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0008433,"[SCLC1, small cell cancer of the lung, poorly ...",False,,
453,soft tissue sarcoma,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0018078,"[malignant soft tissue tumour, malignant soft ...",False,,
454,squamous cell lung carcinoma,Cancer Gene Census,True,http://purl.obolibrary.org/obo/MONDO_0005097,"[squamous cell carcinoma of lung, squamous cel...",False,,


### DOID

In [46]:
%%time 
# Choose Ontology 
doid = pd.read_csv('DOID.csv', low_memory=False)
onto_label = 'DOID'
doid = extract_syn_list(doid)

ds = pd.read_csv('emsembl_golph3_dataset.csv', index_col=0)
ds[onto_label+'_match'] = False

ontology_matcher(ds, doid, onto_label)
# Note: the '<onto_label>_match' column (here 'DOID_match') must exist in 'ds' before matching.
#       It is initialised to False for all rows above.

CPU times: total: 7.45 s
Wall time: 7.89 s


In [48]:
cols = ['ensembl_gene_id', 'ensembl_gene_id_version', 'ensembl_peptide_id',
        'ensembl_peptide_id_version', 'description', 'chromosome_name',
        'start_position', 'end_position', 'strand', 'band',
        'external_gene_name', 'external_gene_source',
        'percentage_gene_gc_content', 'gene_biotype', 'source', 'version',
        'peptide_version', 'external_synonym', 'phenotype_description',
        'source_name', 'p_value', onto_label+'_match', onto_label+'_ClassID', onto_label+'_synonims']

ds[cols]#.to_csv('golph3_doid_match.csv')

# Analyze results

In [2]:
golph3_doid = pd.read_csv('golph3_doid_match.csv')
golph3_mondo = pd.read_csv('golph3_mondo_match.csv')
golph3_doid

Unnamed: 0.1,Unnamed: 0,ensembl_gene_id,ensembl_gene_id_version,ensembl_peptide_id,ensembl_peptide_id_version,description,chromosome_name,start_position,end_position,strand,...,source,version,peptide_version,external_synonym,phenotype_description,source_name,p_value,DOID_match,DOID_ClassID,DOID_synonims
0,0,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Endometrial Endometrioid Adenocarcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_2870,"['endometrioid carcinoma of Endometrium', 'end..."
1,1,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Hepatobiliary Neoplasm,Cancer Gene Census,,False,,
2,2,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,ND,Cancer Gene Census,,False,,
3,3,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Ovarian Endometrioid Adenocarcinoma with Squam...,Cancer Gene Census,,False,,
4,4,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Uterine Carcinosarcoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_6171,"['mixed mullerian sarcoma of uterus', 'uterine..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
451,451,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,skin carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_3451,"['carcinoma of skin', 'skin carcinoma']"
452,452,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,Small cell lung carcinoma,Cancer Gene Census,,False,,
453,453,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,soft tissue sarcoma,Cancer Gene Census,,False,,
454,454,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,squamous cell lung carcinoma,Cancer Gene Census,,False,,


In [58]:
golph3_doid.query('DOID_match')

Unnamed: 0.1,Unnamed: 0,ensembl_gene_id,ensembl_gene_id_version,ensembl_peptide_id,ensembl_peptide_id_version,description,chromosome_name,start_position,end_position,strand,...,source,version,peptide_version,external_synonym,phenotype_description,source_name,p_value,DOID_match,DOID_ClassID,DOID_synonims
0,0,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Endometrial Endometrioid Adenocarcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_2870,"['endometrioid carcinoma of Endometrium', 'end..."
4,4,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Uterine Carcinosarcoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_6171,"['mixed mullerian sarcoma of uterus', 'uterine..."
5,5,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Acute myeloid leukemia,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_9119,"['acute myeloid leukaemia', 'acute myelogenous..."
6,6,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Anaplastic astrocytoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_3078,"['grade III astrocytoma', 'grade III Astrocyti..."
7,7,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,bile duct carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_4897,"['nan', 'bile duct carcinoma']"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
448,448,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,prostate adenocarcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_2526,"['nan', 'prostate adenocarcinoma']"
449,449,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,prostate carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_10286,"['carcinoma of prostate', 'cancer of prostate'..."
450,450,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,rectal adenocarcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_1996,"['Rectal adenocarcinoma', 'rectum adenocarcino..."
451,451,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,skin carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/DOID_3451,"['carcinoma of skin', 'skin carcinoma']"


In [None]:
golph3_mondo.query('MONDO_match')

In [10]:
golph3_mondo.query('MONDO_match == True').query('phenotype_description.str.contains("carcinoma")')

Unnamed: 0.1,Unnamed: 0,ensembl_gene_id,ensembl_gene_id_version,ensembl_peptide_id,ensembl_peptide_id_version,description,chromosome_name,start_position,end_position,strand,...,source,version,peptide_version,external_synonym,phenotype_description,source_name,p_value,MONDO_match,MONDO_ClassID,MONDO_synonims
0,0,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Endometrial Endometrioid Adenocarcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0005461,"['endometrial endometrioid adenocarcinoma', 'e..."
3,3,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Ovarian Endometrioid Adenocarcinoma with Squam...,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0006336,"['ovarian adenoacanthoma', 'ovarian adenosquam..."
7,7,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,bile duct carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0005496,"['bile duct cancer', 'bile duct cancer (includ..."
8,8,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,bladder transitional cell carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0005611,"['BLCA', 'urothelial carcinoma of the urinary ..."
10,10,ENSG00000113384,ENSG00000113384.14,ENSP00000265070,ENSP00000265070.6,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,6.0,GOPP1,Breast carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0004989,"['cancer of breast', 'cancer, breast', 'cancer..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450,450,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,rectal adenocarcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0002169,"['read', 'rectum adenocarcinoma', 'adenocarcin..."
451,451,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,skin carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0002656,"['skin cancer, non-melanoma', 'non-melanoma ca..."
452,452,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,Small cell lung carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0008433,"['SCLC1', 'small cell cancer of the lung', 'po..."
454,454,ENSG00000113384,ENSG00000113384.14,,,golgi phosphoprotein 3 [Source:HGNC Symbol;Acc...,5,32124716,32174319,-1,...,ensembl_havana,14,,VPS74,squamous cell lung carcinoma,Cancer Gene Census,,True,http://purl.obolibrary.org/obo/MONDO_0005097,"['squamous cell carcinoma of lung', 'squamous ..."


In [17]:
golph3_mondo.query('MONDO_match').groupby('phenotype_description').describe(include=[object])

Unnamed: 0_level_0,ensembl_gene_id,ensembl_gene_id,ensembl_gene_id,ensembl_gene_id,ensembl_gene_id_version,ensembl_gene_id_version,ensembl_gene_id_version,ensembl_gene_id_version,ensembl_peptide_id,ensembl_peptide_id,...,source_name,source_name,MONDO_ClassID,MONDO_ClassID,MONDO_ClassID,MONDO_ClassID,MONDO_synonims,MONDO_synonims,MONDO_synonims,MONDO_synonims
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq,count,unique,...,top,freq,count,unique,top,freq,count,unique,top,freq
phenotype_description,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Acute myeloid leukemia,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0018874,12,12,1,"['leukemia, acute myeloid', 'acute non lymphob...",12
Anaplastic astrocytoma,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0016684,12,12,1,"['grade III astrocytic tumor', 'anaplastic ast...",12
Breast carcinoma,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0004989,12,12,1,"['cancer of breast', 'cancer, breast', 'cancer...",12
Chronic lymphocytic leukemia,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0004948,12,12,1,"['small lymphocytic lymphoma', 'leukemia, chro...",12
Clear cell renal carcinoma,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0005005,12,12,1,"['hypernephroma', 'clear-cell metastatic renal...",12
Endometrial Endometrioid Adenocarcinoma,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0005461,12,12,1,"['endometrial endometrioid adenocarcinoma', 'e...",12
Hepatobiliary Neoplasm,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0002514,12,12,1,"['hepatobiliary benign neoplasm', 'hepatobilia...",12
Hepatocellular Carcinoma,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0007256,12,12,1,"['adult primary hepatocellular carcinoma', 'ad...",12
Lung adenocarcinoma,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0005061,12,12,1,"['non-small cell lung adenocarcinoma', 'adenoc...",12
Lung carcinoma,12,1,ENSG00000113384,12,12,1,ENSG00000113384.14,12,8,2,...,Cancer Gene Census,12,12,1,http://purl.obolibrary.org/obo/MONDO_0005138,12,12,1,"['lung cancer, NOS', 'lung cancer', 'cancer of...",12


# Results
The semantic enrichment process improved the interoperability and discoverability of the data:

### Matched Terms
The ontology matcher linked **68%** of the phenotypic terms in the dataset to DOID terms and **100%** to MONDO terms.
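
As a rough sketch of how such percentages can be derived from the `<ontology>_match` columns, the snippet below computes the share of matched rows and the share of matched distinct phenotype descriptions. The text does not state which of the two the figures above refer to, so both are shown. The rows are toy data, not the real dataset, and `match_rate` is a hypothetical helper name. Only the standard library is used.

```python
def match_rate(rows, match_col, term_col=None):
    """Return the percentage of matches in `rows` (a list of dicts).

    If `term_col` is given, each distinct term is counted once, so
    repeated rows for the same phenotype do not inflate the rate.
    """
    if term_col is not None:
        # Collapse to one flag per term: matched if any of its rows matched.
        by_term = {}
        for row in rows:
            term = row[term_col]
            by_term[term] = by_term.get(term, False) or row[match_col]
        flags = list(by_term.values())
    else:
        flags = [row[match_col] for row in rows]
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


# Toy rows that mimic the notebook's column names (illustrative values only).
toy_rows = [
    {"phenotype_description": "skin carcinoma", "DOID_match": True},
    {"phenotype_description": "skin carcinoma", "DOID_match": True},
    {"phenotype_description": "soft tissue sarcoma", "DOID_match": False},
    {"phenotype_description": "Uterine Carcinosarcoma", "DOID_match": True},
]

row_rate = match_rate(toy_rows, "DOID_match")
term_rate = match_rate(toy_rows, "DOID_match", term_col="phenotype_description")
print(f"rows matched:  {row_rate:.1f}%")
print(f"terms matched: {term_rate:.1f}%")
```

With pandas, the per-row figure is simply the mean of the boolean column, for example `golph3_doid['DOID_match'].mean() * 100`.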

### Ontology Cross-Referencing
The IRI class IDs link each phenotypic record to a disease ontology term. Because an IRI identifies the same class in every resource that uses it, the dataset can be joined with other datasets and ontologies, which supports downstream analysis and hypothesis generation.
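
One practical detail when cross-referencing: the dataset stores full IRIs (for example `http://purl.obolibrary.org/obo/MONDO_0005461`), whereas the OBO graph loaded in the Extra section is keyed by compact IDs such as `MONDO:0005737`. The sketch below converts between the two forms. The helper names `iri_to_curie` and `curie_to_iri` are hypothetical, and the code assumes the standard OBO PURL pattern seen in the `*_ClassID` columns.

```python
OBO_PREFIX = "http://purl.obolibrary.org/obo/"


def iri_to_curie(iri):
    """Convert an OBO PURL such as .../obo/MONDO_0005461 to 'MONDO:0005461'.

    Returns None for missing values (e.g. NaN) or IRIs outside the OBO namespace.
    """
    if not isinstance(iri, str) or not iri.startswith(OBO_PREFIX):
        return None
    local_id = iri[len(OBO_PREFIX):]
    # Split on the last underscore so prefixes that contain '_' stay intact.
    prefix, sep, number = local_id.rpartition("_")
    if not sep or not prefix or not number:
        return None
    return f"{prefix}:{number}"


def curie_to_iri(curie):
    """Convert a compact ID such as 'DOID:2870' back to its OBO PURL."""
    prefix, sep, number = curie.partition(":")
    if not sep or not prefix or not number:
        raise ValueError(f"not a compact ID: {curie!r}")
    return f"{OBO_PREFIX}{prefix}_{number}"


print(iri_to_curie("http://purl.obolibrary.org/obo/MONDO_0005461"))
print(curie_to_iri("DOID:2870"))
print(iri_to_curie(float("nan")))  # unmatched rows carry no class ID
```

Applied to a `MONDO_ClassID` value, the result can be used directly as a node key, as in `graph.nodes[iri_to_curie(class_id)]`.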

## Discussion
Annotating the GOLPH3 dataset with DOID and MONDO classes makes it more usable in bioinformatics research. Researchers can query the enriched dataset by ontology class and connect the GOLPH3 gene to specific diseases, which supports the search for new disease pathways and for GOLPH3's role in different pathological conditions.

Additionally, the IRI class IDs make the dataset suitable for automated data retrieval and interoperable with other ontological resources. This alignment with FAIR principles increases the dataset's utility for the broader scientific community.

## Conclusion
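The semantic enrichment of the GOLPH3 gene dataset through ontology matching has improved its FAIR compliance and research applicability. Future work will focus on expanding the range of ontologies used and refining the matching algorithm to improve the precision and recall of the ontology matcher. Short or ambiguous descriptions are one motivation: the description `ND`, for instance, matched the MONDO synonym `nd` of Norrie syndrome (MONDO_0010691), a match that may warrant manual review.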
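
To make the precision and recall goal concrete, the following sketch scores a matcher's output against a hand-curated reference. It is only an illustration: the function name `precision_recall` is hypothetical, and the gold annotations are invented for the example rather than taken from any curated resource.

```python
def precision_recall(predicted, gold):
    """Score predicted term -> class ID mappings against a gold standard.

    Both arguments are dicts; a value of None means "no class assigned".
    """
    # Correct assignments.
    tp = sum(1 for t, c in predicted.items() if c is not None and gold.get(t) == c)
    # Assignments that the gold standard does not support.
    fp = sum(1 for t, c in predicted.items() if c is not None and gold.get(t) != c)
    # Gold assignments that the matcher missed or got wrong.
    fn = sum(1 for t, c in gold.items() if c is not None and predicted.get(t) != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented gold annotations (assumption: a curator decided 'ND' has no class).
gold = {
    "skin carcinoma": "MONDO:0002656",
    "soft tissue sarcoma": "MONDO:0018078",
    "ND": None,
}
# Invented matcher output for the same three terms.
predicted = {
    "skin carcinoma": "MONDO:0002656",
    "soft tissue sarcoma": None,
    "ND": "MONDO:0010691",
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here precision is the fraction of assigned classes that are correct, and recall is the fraction of gold-standard classes that the matcher recovered.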

# Extra

## Load OWL

## Load OBO

In [3]:
%%time
# https://notebook.community/dhimmel/obo/examples/go-obonet
import sys, time
import obonet
import networkx as nx

# Download and load the Mondo ontology file:
mondo_url = 'http://purl.obolibrary.org/obo/mondo.obo'
graph = obonet.read_obo(mondo_url)
graph_nodes = graph.nodes()
print('Number of nodes:', len(graph))
time.sleep(0.5)
# sys.getsizeof reports only the shallow size of each object, not the memory held by its contents
print("Memory Usage:", sys.getsizeof(graph), "bytes")
print("Memory Usage:", sys.getsizeof(graph_nodes), "bytes")
list(graph_nodes)[:5]

Number of nodes: 24399
CPU times: total: 7.14 s
Wall time: 11.3 s


In [174]:
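# Note: the output shown for this cell lists DOID parents and DO subsets, so it appears
# to have been captured while the DOID graph was loaded (graph.nodes['DOID:9744']).
#graph.nodes['DOID:9744']
graph.nodes['MONDO:0000023']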

{'name': 'type 1 diabetes mellitus',
 'def': '"A diabetes mellitus that is characterized by destruction of pancreatic beta cells resulting in absent or extremely low insulin production." [url:http\\://en.wikipedia.org/wiki/Diabetes, url:http\\://en.wikipedia.org/wiki/Diabetes_mellitus_type_1]',
 'comment': 'Xref MGI.\\nOMIM mapping confirmed by DO. [SN].',
 'subset': ['DO_rare_slim', 'NCIthesaurus'],
 'synonym': ['"IDDM" EXACT []',
  '"insulin-dependent diabetes mellitus" EXACT []',
  '"type I diabetes mellitus" EXACT []'],
 'xref': ['GARD:10268',
  'ICD10CM:E10',
  'KEGG:04940',
  'MESH:D003922',
  'NCI:C2986',
  'OMIM:222100',
  'SNOMEDCT_US_2023_03_01:46635009',
  'UMLS_CUI:C0011854'],
 'is_a': ['DOID:0060005', 'DOID:9351']}

In [12]:
# The Mondo ontology is now loaded as a directed graph, so you can access nodes and
# edges, traverse the hierarchy, and query specific terms.

# Access information about a specific term using its ID:
term_id = 'MONDO:0005737'  # Replace with your desired term ID
term_data = graph.nodes[term_id]
print(term_data)

# Explore relationships between terms through the graph edges. obonet edges point
# from child to parent, so the children of a term are its predecessors.
term_children = list(graph.predecessors(term_id))
print('\n'+term_id+' children:', term_children)

{'name': 'Ebola hemorrhagic fever', 'def': '"A viral hemorrhagic fever that is caused by the Ebola virus, which is transmitted by contact with infected animals or humans; it is characterized by high fever, unexplained bleeding, and a high mortality rate." [NCIT:P378]', 'subset': ['gard_rare', 'nord_rare', 'ordo_disease', 'orphanet_rare', 'rare'], 'synonym': ['"Ebola" EXACT [NCIT:C36171]', '"Ebola fever" EXACT [Orphanet:319218]', '"Ebola virus disease" EXACT [DOID:4325, Orphanet:319218]', '"Ebolavirus caused disease or disorder" EXACT [MONDO:patterns/specific_infectious_disease_by_agent]', '"Ebolavirus disease or disorder" EXACT []', '"Ebolavirus infectious disease" EXACT []', '"EHF" EXACT ABBREVIATION [Orphanet:319218]'], 'xref': ['DOID:4325', 'EFO:0007243', 'GARD:2035', 'MedDRA:10014071', 'MESH:D019142', 'NCIT:C36171', 'Orphanet:319218', 'SCTID:37109004', 'UMLS:C0282687'], 'is_a': ['MONDO:0005762', 'MONDO:0018087'], 'relationship': ['disease_has_feature MONDO:0001517', 'disease_has_fe

In [137]:
# Check if the ontology is a DAG
nx.is_directed_acyclic_graph(graph)

False


A DAG (Directed Acyclic Graph) is a type of graph where the edges have a defined direction, and there are no cycles or loops. In other words, it is a directed graph that does not contain any directed cycles.

In a DAG, each vertex or node represents a concept, and the directed edges represent the relationships between those concepts. The direction of the edges indicates the direction of the relationship, and the absence of cycles ensures that there is no circular or contradictory relationship.

DAGs are commonly used to represent various types of hierarchical structures or relationships, such as family trees, dependencies between tasks in project management, or the hierarchical structure of categories in a classification system.

In the context of ontologies, a DAG is often used to represent the hierarchical relationships between terms or concepts. Each term in the ontology is represented as a node, and the edges represent the parent-child relationships between the terms. For example, in a disease ontology, a term representing a specific disease might have multiple sub-disease terms that represent more specific types or subtypes of that disease.
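
The sketch below illustrates the acyclicity check on toy data, using the standard-library `graphlib` module (Python 3.9+) instead of networkx so that it runs without extra dependencies. It also shows one way a graph can fail the check: obonet stores every relationship type as an edge, so a non-hierarchical relation can close a loop even when the `is_a` hierarchy alone is acyclic. Whether this is why the MONDO graph above returned `False` is an assumption that would need to be verified on the real graph. The IDs and the `related_to` relation are placeholders.

```python
from graphlib import TopologicalSorter, CycleError


def is_acyclic(edges):
    """Return True if the directed graph given by (source, target) pairs has no cycle."""
    sorter = TopologicalSorter()
    for source, target in edges:
        # The direction convention does not matter for cycle detection.
        sorter.add(source, target)
    try:
        sorter.prepare()  # raises CycleError when a cycle exists
    except CycleError:
        return False
    return True


# Toy (child, parent, relation) triples; IDs are placeholders, not real terms.
triples = [
    ("X:3", "X:2", "is_a"),
    ("X:2", "X:1", "is_a"),
    ("X:1", "X:3", "related_to"),  # hypothetical non-hierarchical link closing a loop
]

all_edges = [(child, parent) for child, parent, _ in triples]
is_a_edges = [(child, parent) for child, parent, rel in triples if rel == "is_a"]

print("all relations acyclic:", is_acyclic(all_edges))
print("is_a only acyclic:    ", is_acyclic(is_a_edges))
```

On the real graph, the same idea would be to keep only the edges whose key is `is_a` before calling `nx.is_directed_acyclic_graph`.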

### Lookup node properties

In [15]:
# Retrieve the properties of a term (MONDO:0006909, pituitary dwarfism)
term = graph.nodes['MONDO:0006909']
print('term:',term)

# Return the node data of the first is_a parent of a term
def get_parent(term):
    parent = term['is_a'][0]
    return graph.nodes[parent]

term2 = get_parent(term)
print('\nparent:',term2)
#get_parent(term2)

term: {'name': 'pituitary dwarfism', 'def': '"Proportionately decreased bodily growth due to failure of the pituitary gland to produce an adequate supply of growth hormone." [NCIT:P378]', 'xref': ['EFO:1001109', 'ICD9:253.3', 'MedDRA:10035083', 'MESH:D004393', 'SCTID:367460001', 'UMLS:C0013338'], 'is_a': ['MONDO:0005495'], 'property_value': ['closeMatch http://identifiers.org/meddra/10035083', 'exactMatch http://identifiers.org/mesh/D004393', 'exactMatch http://identifiers.org/snomedct/367460001', 'exactMatch http://linkedlifedata.com/resource/umls/id/C0013338']}

parent: {'name': 'adrenal gland disorder', 'def': '"A disease involving the adrenal gland." [https://orcid.org/0000-0002-6601-2165]', 'synonym': ['"adrenal gland disease" EXACT [MONDO:patterns/location, NCIT:C26690]', '"adrenal gland disease or disorder" EXACT [MONDO:design_pattern, MONDO:patterns/location]', '"adrenal gland diseases" EXACT [NCIT:C26690]', '"adrenal gland disorder" EXACT [NCIT:C26690]', '"adrenal gland disord
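
Repeating the `get_parent` step walks a term up to the top of the hierarchy. The following is a minimal sketch of that loop, assuming node data shaped like the obonet output above (a dict with an optional `is_a` list). `is_a_chain` is a hypothetical helper, and the toy nodes are placeholders rather than real MONDO terms.

```python
def is_a_chain(nodes, term_id, max_depth=100):
    """Follow the first `is_a` parent repeatedly and return the visited IDs.

    `nodes` is any mapping from term ID to a node-data dict.
    """
    chain = [term_id]
    seen = {term_id}
    while len(chain) <= max_depth:
        parents = nodes[chain[-1]].get("is_a", [])
        if not parents:
            break  # reached a term without parents (a root)
        parent = parents[0]  # like get_parent, only the first parent is followed
        if parent in seen:
            break  # guard against cycles
        chain.append(parent)
        seen.add(parent)
    return chain


# Placeholder nodes; note that the first term has two parents.
toy_nodes = {
    "X:0003": {"name": "specific disease", "is_a": ["X:0002", "X:0009"]},
    "X:0002": {"name": "disease group", "is_a": ["X:0001"]},
    "X:0009": {"name": "other group", "is_a": ["X:0001"]},
    "X:0001": {"name": "root"},
}

chain = is_a_chain(toy_nodes, "X:0003")
print(" -> ".join(toy_nodes[t]["name"] for t in chain))
```

Because `graph.nodes` supports lookup by ID, it can be passed as `nodes` directly. Terms with several parents have several routes to the root, and this sketch follows only one of them.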

### Create name mappings

In [ ]:
id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}
name_to_id = {data['name']: id_ for id_, data in graph.nodes(data=True) if 'name' in data}

# Get the name for MONDO:0005737
id_to_name['MONDO:0005737']

# Get the id for Ebola hemorrhagic fever
name_to_id['Ebola hemorrhagic fever']

### Find parent or child relationships

In [ ]:
# Find edges to parent terms
node = name_to_id['Ebola hemorrhagic fever']

for child, parent, key in graph.out_edges(node, keys=True):
    print(f'• {id_to_name[child]} ⟶ {key} ⟶ {id_to_name[parent]}')

### Find edges to children terms

In [ ]:
node = name_to_id['Ebola hemorrhagic fever']
for parent, child, key in graph.in_edges(node, keys=True):
    print(f'• {id_to_name[child]} ⟵ {key} ⟵ {id_to_name[parent]}')

### Find all superterms of Ebola hemorrhagic fever

In [ ]:
# obonet edges point from child to parent, so nx.descendants returns the superterms
sorted(id_to_name[superterm] for superterm in nx.descendants(graph, 'MONDO:0005737'))

### Find all subterms of Ebola hemorrhagic fever

In [ ]:
# obonet edges point from child to parent, so nx.ancestors returns the subterms
sorted(id_to_name[subterm] for subterm in nx.ancestors(graph, 'MONDO:0005737'))

### Find all paths to the root

In [ ]:
paths = nx.all_simple_paths(
    graph,
    source=name_to_id['Ebola hemorrhagic fever'],
    target='MONDO:0000001'  # assumed to be the root disease class of MONDO
)
for path in paths:
    print('•', ' ⟶ '.join(id_to_name[node] for node in path))

### See the ontology metadata

In [ ]:
graph.graph

## SPARQL on OWL with IRI