## Compare Xrefs

This notebook compares xrefs between two lists of ontology terms, where each term has one or more xrefs. When the source data are ontologies, the term lists can be extracted with ROBOT using the SPARQL queries in the "sparql" directory.

In [1]:
# Imports 
import pandas as pd

pd.set_option('display.max_colwidth', None)

In [2]:
# Read in file of Mondo xrefs

df_mondo = pd.read_csv('reports/mondo_xrefs.tsv', sep='\t')
# The Mondo file contains one line per term per xref. Only Mondo terms with xrefs are included in the file.

df_mondo.head()

Unnamed: 0,?mondo_curie,?mondo_term,?label,?xref
0,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,DOID:4
1,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,ICD9:799.9
2,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,MEDGEN:4347
3,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,MESH:D004194
4,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,NCIT:C2991


In [3]:
df_mondo.nunique()

?mondo_curie     27092
?mondo_term      27092
?label           27092
?xref           131281
dtype: int64

In [4]:
# Read in file from other ontology source
ont_filepath = 'reports/ird_xrefs.tsv'

df_ont = pd.read_csv(ont_filepath, sep='\t')
# The IRD file contains _all_ IRD terms and their references to CUI, ICD-11, OMIM, ORPHA, and SNOMED identifiers.
# In the IRD ontology these references are modeled as property values rather than database cross references,
# because the ontology does not use OBO Foundry formatting and properties.

df_ont.head()

Unnamed: 0,?class,?label,?xref
0,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD>,IRD_6_Congenital & Stationary Retinal Diseases,UMLS:C4073105
1,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD>,IRD_6_Congenital & Stationary Retinal Diseases,SCTID:232061009
2,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#AICA-ribosiduria>,IRD_5_1_AICA-ribosiduria,UMLS:C1837530
3,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#AICA-ribosiduria>,IRD_5_1_AICA-ribosiduria,ICD11:5C55.0Y
4,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#AICA-ribosiduria>,IRD_5_1_AICA-ribosiduria,Orphanet:250977


In [5]:
df_ont.nunique()

?class    163
?label    163
?xref     494
dtype: int64

---
## Map terms

Find terms in IRD that map to Mondo term(s) based on shared xrefs between the two terms.

In [6]:
# Rename columns
df_mondo_clean = df_mondo.rename(columns={
    '?mondo_curie': 'mondo_curie',
    '?label': 'mondo_label',
    '?xref': 'xref'
})

df_ont_clean = df_ont.rename(columns={
    df_ont.columns[0]: 'ird_term',
    df_ont.columns[1]: 'ird_label',
    df_ont.columns[2]: 'xref'
})

# Flatten Mondo dataframe - Group and collect all xrefs for each MONDO term so there is one Mondo term and a list of xrefs
mondo_xrefs = df_mondo_clean.groupby(['mondo_curie', 'mondo_label'])['xref'].apply(lambda x: sorted(set(x))).reset_index()
mondo_xrefs = mondo_xrefs.rename(columns={'xref': 'mondo_all_xrefs'})


# Flatten IRD dataframe - Group and collect all xrefs for each IRD term so there is one IRD term and a list of xrefs
ird_xrefs = df_ont_clean.groupby(['ird_term', 'ird_label'])['xref'].apply(lambda x: sorted(set(x))).reset_index()
ird_xrefs = ird_xrefs.rename(columns={'xref': 'ird_all_xrefs'})


# df_ont_clean already has one row per IRD term per xref, so no explode is needed; work on a copy for matching
ird_exploded = df_ont_clean.copy()


# Join IRD xrefs with MONDO on xref
matched = pd.merge(ird_exploded, df_mondo_clean, on='xref', how='left')


# Group matched xrefs
matched_grouped = matched.groupby(['ird_term', 'ird_label', 'mondo_curie', 'mondo_label'])['xref'].apply(
    lambda x: sorted(set(x.dropna()))
).reset_index()
matched_grouped = matched_grouped.rename(columns={'xref': 'matching_xrefs'})


# Merge with the full IRD set and bring in the xref lists
final_df = pd.merge(ird_xrefs, matched_grouped, on=['ird_term', 'ird_label'], how='left')
final_df = pd.merge(final_df, mondo_xrefs, on=['mondo_curie', 'mondo_label'], how='left')

# Sort rows by ird_label
final_df = final_df.sort_values(by='ird_label').reset_index(drop=True)

final_df.head()


Unnamed: 0,ird_term,ird_label,ird_all_xrefs,mondo_curie,mondo_label,matching_xrefs,mondo_all_xrefs
0,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#Retinitis_punctata_albescens>,IRD_1_1_1_Retinitis_punctata_albescens,"[ICD11:9B70, OMIM:136880, Orphanet:52427, SCTID:715562001, UMLS:C1405854]",MONDO:0007639,fundus albipunctatus,[OMIM:136880],"[DOID:11105, GARD:13809, ICD10CM:H35.5, ICD9:362.74, ICD9:362.76, MEDGEN:86317, MESH:C562733, OMIM:136880, Orphanet:227796, SCTID:68222009, UMLS:C0311338, icd11.foundation:1981512475]"
1,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#Retinitis_punctata_albescens>,IRD_1_1_1_Retinitis_punctata_albescens,"[ICD11:9B70, OMIM:136880, Orphanet:52427, SCTID:715562001, UMLS:C1405854]",MONDO:0018877,retinitis punctata albescens,"[Orphanet:52427, SCTID:715562001, UMLS:C1405854]","[GARD:16655, ICD10CM:H35.5, MEDGEN:278050, Orphanet:52427, SCTID:715562001, UMLS:C1405854, icd11.foundation:567796529]"
2,"<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#Retinopathy,_pericentral_pigmentary,_autosomal_recessive>","IRD_1_1_2_1_Retinopathy,_pericentral_pigmentary,_autosomal_recessive","[OMIM:268060, UMLS:C1849398]",MONDO:0009987,autosomal recessive pericentral pigmentary retinopathy,"[OMIM:268060, UMLS:C1849398]","[DOID:0110422, GARD:15231, ICD10CM:H35.5, MEDGEN:340314, MESH:C564838, OMIM:268060, UMLS:C1849398]"
3,"<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#Retinopathy,_pericentral_pigmentary,_dominant>","IRD_1_1_2_2_Retinopathy,_pericentral_pigmentary,_dominant","[OMIM:180210, UMLS:C1867261]",MONDO:0008381,dominant pericentral pigmentary retinopathy,"[OMIM:180210, UMLS:C1867261]","[DOID:0110420, GARD:15111, ICD10CM:H35.5, MEDGEN:357237, MESH:C566713, OMIM:180210, UMLS:C1867261]"
4,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#Pericentral_retinitis_pigmentosa>,IRD_1_1_2_Pericentral_retinitis_pigmentosa,"[OMIM:268060, SCTID:26216008+28835009, UMLS:C1849398]",MONDO:0009987,autosomal recessive pericentral pigmentary retinopathy,"[OMIM:268060, UMLS:C1849398]","[DOID:0110422, GARD:15231, ICD10CM:H35.5, MEDGEN:340314, MESH:C564838, OMIM:268060, UMLS:C1849398]"
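
As a quick check on the join step, the sketch below flags which IRD xrefs have any counterpart in Mondo, using the `indicator` option of `pandas.merge`. The helper name `xref_match_report` and the toy rows are illustrative assumptions (the toy values are copied from the output above); it assumes both frames have an `xref` column, as in the renamed dataframes.

```python
import pandas as pd

def xref_match_report(ird: pd.DataFrame, mondo: pd.DataFrame) -> pd.DataFrame:
    """Return the IRD rows with a boolean 'in_mondo' column (hypothetical helper)."""
    # De-duplicate Mondo xrefs so the left join does not multiply IRD rows
    mondo_keys = mondo[['xref']].drop_duplicates()
    merged = ird.merge(mondo_keys, on='xref', how='left', indicator=True)
    # '_merge' is 'both' when the xref was found in Mondo, 'left_only' otherwise
    merged['in_mondo'] = merged['_merge'] == 'both'
    return merged.drop(columns='_merge')

# Toy rows taken from the output above, for illustration only
toy_ird = pd.DataFrame({
    'ird_label': ['IRD_1_1_1_Retinitis_punctata_albescens'] * 3,
    'xref': ['ICD11:9B70', 'OMIM:136880', 'Orphanet:52427'],
})
toy_mondo = pd.DataFrame({
    'mondo_curie': ['MONDO:0007639', 'MONDO:0018877'],
    'xref': ['OMIM:136880', 'Orphanet:52427'],
})

report = xref_match_report(toy_ird, toy_mondo)
print(report)
```

On the notebook's data, the same helper could be called as `xref_match_report(df_ont_clean, df_mondo_clean)` to list the IRD xrefs that never appear in Mondo.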


## Further analysis

Determining the final mappings from this output requires curator review. Some things to check:

In some cases more than one IRD term matches the same Mondo term, e.g. "IRD_1_5_1_Cone_dystrophy,_X-linked,_with_tapetal-like_sheen" and "IRD_1_5_Progressive_cone_dystrophy" both map to MONDO:0000455 cone dystrophy via the shared xref Orphanet:1871. There may also be IRD terms that have no mapping to Mondo.
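
Both situations can be listed for the curator with a small helper. This is a minimal sketch, assuming a dataframe shaped like `final_df` (columns `ird_term`, `ird_label`, `mondo_curie`, `mondo_label`, with missing Mondo values for unmatched IRD terms). The function name `summarize_mappings`, the placeholder IRIs, and the unmapped toy row are hypothetical.

```python
import pandas as pd

def summarize_mappings(final_df: pd.DataFrame):
    """Return (Mondo terms matched by >1 IRD term, IRD terms with no Mondo match)."""
    mapped = final_df.dropna(subset=['mondo_curie'])
    # Count distinct IRD labels per Mondo term
    counts = (
        mapped.groupby(['mondo_curie', 'mondo_label'])['ird_label']
        .nunique()
        .reset_index(name='n_ird_terms')
    )
    many_to_one = counts[counts['n_ird_terms'] > 1].reset_index(drop=True)
    # IRD terms whose row carries no Mondo match at all
    unmapped = (
        final_df.loc[final_df['mondo_curie'].isna(), ['ird_term', 'ird_label']]
        .drop_duplicates()
        .reset_index(drop=True)
    )
    return many_to_one, unmapped

# Toy data: the first two labels come from the example above;
# the IRIs and the third row are placeholders for illustration only
toy = pd.DataFrame({
    'ird_term': ['<toy:a>', '<toy:b>', '<toy:c>'],
    'ird_label': [
        'IRD_1_5_1_Cone_dystrophy,_X-linked,_with_tapetal-like_sheen',
        'IRD_1_5_Progressive_cone_dystrophy',
        'IRD_toy_unmapped_term',
    ],
    'mondo_curie': ['MONDO:0000455', 'MONDO:0000455', None],
    'mondo_label': ['cone dystrophy', 'cone dystrophy', None],
})

many_to_one, unmapped = summarize_mappings(toy)
print(many_to_one)
print(unmapped)
```

In the notebook this would be called as `summarize_mappings(final_df)`.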

NOTE: The ICD11 identifiers in IRD come from the ICD11 linearization, which is not what Mondo uses. Mondo maps to the ICD11 Foundation, and the linearization is derived from the Foundation, so the ICD11 xrefs in IRD are not expected to match Mondo's icd11.foundation xrefs.
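
One way to see which identifier sources can produce matches at all is to compare CURIE prefixes between the two xref lists. The sketch below is an illustration under that assumption. The helper names `xref_prefixes` and `compare_prefixes` are hypothetical, and the toy xrefs are copied from the output shown earlier in this notebook.

```python
import pandas as pd

def xref_prefixes(xrefs: pd.Series) -> pd.Series:
    """Count xrefs by CURIE prefix (the text before the first ':')."""
    return xrefs.dropna().str.split(':', n=1).str[0].value_counts()

def compare_prefixes(ird_xrefs: pd.Series, mondo_xrefs: pd.Series) -> pd.DataFrame:
    """Tabulate prefix counts per source and flag prefixes present in both."""
    table = pd.concat(
        {'ird': xref_prefixes(ird_xrefs), 'mondo': xref_prefixes(mondo_xrefs)},
        axis=1,
    ).fillna(0).astype(int)
    # A prefix can only yield shared xrefs if both sources use it
    table['can_match'] = (table['ird'] > 0) & (table['mondo'] > 0)
    return table.sort_index()

# Toy xrefs copied from the output above, for illustration only
ird = pd.Series(['ICD11:9B70', 'OMIM:136880', 'Orphanet:52427',
                 'SCTID:715562001', 'UMLS:C1405854'])
mondo = pd.Series(['GARD:16655', 'ICD10CM:H35.5', 'MEDGEN:278050',
                   'Orphanet:52427', 'SCTID:715562001', 'UMLS:C1405854',
                   'icd11.foundation:567796529'])

table = compare_prefixes(ird, mondo)
print(table)
```

On the notebook's data, `compare_prefixes(df_ont_clean['xref'], df_mondo_clean['xref'])` would give the same kind of table for the full files.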

In [7]:
# Save output file

final_df.to_csv('reports/mondo-ird-mappings.tsv', sep='\t', index=False)