## Compare Xrefs

This notebook compares xrefs between two lists of ontology terms, where each term has one or more xrefs.

In [1]:
# Imports 
import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', None)

In [2]:
# Read in file of Mondo xrefs

df_mondo = pd.read_csv('reports/mondo_xrefs.tsv', sep='\t')

df_mondo.head()

Unnamed: 0,?mondo_curie,?mondo_term,?label,?xref
0,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,DOID:4
1,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,ICD9:799.9
2,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,MEDGEN:4347
3,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,MESH:D004194
4,MONDO:0000001,<http://purl.obolibrary.org/obo/MONDO_0000001>,disease,NCIT:C2991


In [3]:
# Read in file from other ontology source
ont_filepath = 'reports/ird_xrefs.tsv'

df_ont = pd.read_csv(ont_filepath, sep='\t')

df_ont.head()

Unnamed: 0,?class,?label,?xref
0,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD>,IRD_6_Congenital & Stationary Retinal Diseases,UMLS:C4073105
1,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD>,IRD_6_Congenital & Stationary Retinal Diseases,SCTID:232061009
2,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#AICA-ribosiduria>,IRD_5_1_AICA-ribosiduria,UMLS:C1837530
3,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#AICA-ribosiduria>,IRD_5_1_AICA-ribosiduria,ICD11:5C55.0Y
4,<http://www.semanticweb.org/msh/ontologies/2023/1/IRD#AICA-ribosiduria>,IRD_5_1_AICA-ribosiduria,Orphanet:250977


---
## Map terms

Find IRD terms that map to one or more Mondo terms, based on the xrefs the two terms share.

In [4]:
# Step 1: Rename and clean up df_mondo
df_mondo_clean = df_mondo.rename(columns={
    '?mondo_curie': 'mondo_curie',
    '?label': 'mondo_label',
    '?xref': 'xref'
})

# Step 2: Rename and clean up df_ont
df_ont_clean = df_ont.rename(columns={
    df_ont.columns[0]: 'ird_term',
    df_ont.columns[1]: 'ird_label',
    df_ont.columns[2]: 'xref'
})

# Step 3: Aggregate all xrefs for MONDO
mondo_xrefs = df_mondo_clean.groupby(['mondo_curie', 'mondo_label'])['xref'].apply(lambda x: sorted(set(x))).reset_index()
mondo_xrefs = mondo_xrefs.rename(columns={'xref': 'mondo_all_xrefs'})

# Step 4: Aggregate all xrefs for IRD
ird_xrefs = df_ont_clean.groupby(['ird_label'])['xref'].apply(lambda x: sorted(set(x))).reset_index()
ird_xrefs = ird_xrefs.rename(columns={'xref': 'ird_all_xrefs'})

# Step 5: Find matching xrefs between MONDO and IRD
matched = pd.merge(df_mondo_clean, df_ont_clean, on='xref', how='inner')

# Step 6: Group by MONDO term (CURIE + label) and IRD label, and collect shared xrefs
matched_grouped = matched.groupby(['mondo_curie', 'mondo_label', 'ird_label'])['xref'].apply(lambda x: sorted(set(x))).reset_index()
matched_grouped = matched_grouped.rename(columns={'xref': 'matching_xrefs'})

# Step 7: Merge in all xref lists
final_df = matched_grouped.merge(mondo_xrefs, on=['mondo_curie', 'mondo_label'], how='left')
final_df = final_df.merge(ird_xrefs, on='ird_label', how='left')

# Show the result
final_df.head()


Unnamed: 0,mondo_curie,mondo_label,ird_label,matching_xrefs,mondo_all_xrefs,ird_all_xrefs
0,MONDO:0000390,vitelliform macular dystrophy,IRD_2_2_Vitelliform_degenerations,"[SCTID:90036004, UMLS:C0339510]","[DOID:0050661, ICD10CM:H35.5, MEDGEN:137920, MESH:D057826, NANDO:1200932, NCIT:C118788, OMIMPS:153840, SCTID:90036004, UMLS:C0339510]","[ICD11:9B70, SCTID:90036004, UMLS:C0339510]"
1,MONDO:0000455,cone dystrophy,"IRD_1_5_1_Cone_dystrophy,_X-linked,_with_tapetal-like_sheen",[Orphanet:1871],"[DOID:0050795, GARD:11897, ICD9:362.75, MEDGEN:676499, MESH:D000077765, NANDO:1200936, Orphanet:1871, SCTID:312917007, UMLS:C0730290]","[OMIM:304030, Orphanet:1871, UMLS:C1844775]"
2,MONDO:0000455,cone dystrophy,IRD_1_5_Progressive_cone_dystrophy,[Orphanet:1871],"[DOID:0050795, GARD:11897, ICD9:362.75, MEDGEN:676499, MESH:D000077765, NANDO:1200936, Orphanet:1871, SCTID:312917007, UMLS:C0730290]","[ICD11:9B70, Orphanet:1871, SCTID:267613004, UMLS:C3665342]"
3,MONDO:0007176,helicoid peripapillary chorioretinal degeneration,IRD_3_1_1_Helicoid_peripapillary_chorioretinal_degeneration;_HPCD,"[OMIM:108985, Orphanet:86813, SCTID:724384008, UMLS:C1862382]","[DOID:0111228, GARD:16757, MEDGEN:354733, MESH:C566236, OMIM:108985, Orphanet:86813, SCTID:724384008, UMLS:C1862382, icd11.foundation:896652469]","[ICD11:9B70, OMIM:108985, Orphanet:86813, SCTID:724384008, UMLS:C1862382]"
4,MONDO:0007353,coloboma of macula-brachydactyly type B syndrome,IRD_5_13_Coloboma_of_macula_with_type_B_brachydactyly_(Sorsby_syndrome),"[OMIM:120400, Orphanet:1471, SCTID:717785002, UMLS:C1852752]","[GARD:1437, MEDGEN:343882, MESH:C535969, OMIM:120400, Orphanet:1471, SCTID:717785002, UMLS:C1852752]","[ICD11:LD2F.1Y, OMIM:120400, Orphanet:1471, SCTID:717785002, UMLS:C1852752]"


## Further analysis

Determining the final mappings from these results requires curator review. Here are some things to check.

In some cases more than one IRD term matches the same Mondo term, e.g. "IRD_1_5_1_Cone_dystrophy,_X-linked,_with_tapetal-like_sheen" and "IRD_1_5_Progressive_cone_dystrophy" both map to MONDO:0000455 cone dystrophy via the shared xref Orphanet:1871.
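One way to surface these cases for review is to count the distinct IRD labels per Mondo term and keep the rows where that count exceeds one. The sketch below is only an illustration: `find_multi_matches` is a hypothetical helper that is not part of this notebook, and `example_df` is a small stand-in for `final_df` built from rows shown in the output above.

```python
import pandas as pd

def find_multi_matches(df, key="mondo_curie", other="ird_label"):
    """Return the rows whose `key` value is paired with more than one distinct `other` value."""
    # Number of distinct IRD labels per Mondo term, broadcast back to every row
    counts = df.groupby(key)[other].transform("nunique")
    return df[counts > 1]

# Stand-in for final_df (hypothetical variable; rows copied from the output above)
example_df = pd.DataFrame({
    "mondo_curie": ["MONDO:0000390", "MONDO:0000455", "MONDO:0000455"],
    "mondo_label": ["vitelliform macular dystrophy", "cone dystrophy", "cone dystrophy"],
    "ird_label": [
        "IRD_2_2_Vitelliform_degenerations",
        "IRD_1_5_1_Cone_dystrophy,_X-linked,_with_tapetal-like_sheen",
        "IRD_1_5_Progressive_cone_dystrophy",
    ],
    "matching_xrefs": [
        ["SCTID:90036004", "UMLS:C0339510"],
        ["Orphanet:1871"],
        ["Orphanet:1871"],
    ],
})

multi = find_multi_matches(example_df)
print(multi[["mondo_curie", "ird_label", "matching_xrefs"]])
```

Swapping the `key` and `other` arguments would, under the same assumptions, list IRD terms that match more than one Mondo term.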

NOTE: The ICD11 identifiers in IRD come from the ICD11 linearization, which is not what Mondo uses. Mondo has mappings to the ICD11 Foundation; the linearization is a derivative of the Foundation.
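A quick way to see which identifier namespaces cannot contribute matches is to compare the CURIE prefixes used on each side. The following is a minimal sketch under stated assumptions: `xref_prefix` is a hypothetical helper, the two small dataframes stand in for `df_ont_clean` and `df_mondo_clean`, and the xrefs are copied from the helicoid peripapillary chorioretinal degeneration row in the output above. The comparison is case-sensitive.

```python
import pandas as pd

def xref_prefix(curie):
    """Return the part of a CURIE before the first colon, e.g. 'ICD11' for 'ICD11:9B70'."""
    return curie.split(":", 1)[0]

# Stand-ins for the cleaned IRD and Mondo xref tables (hypothetical, one term each)
ird = pd.DataFrame({"xref": [
    "ICD11:9B70", "OMIM:108985", "Orphanet:86813",
    "SCTID:724384008", "UMLS:C1862382",
]})
mondo = pd.DataFrame({"xref": [
    "DOID:0111228", "GARD:16757", "MEDGEN:354733", "MESH:C566236",
    "OMIM:108985", "Orphanet:86813", "SCTID:724384008",
    "UMLS:C1862382", "icd11.foundation:896652469",
]})

ird_prefixes = set(ird["xref"].map(xref_prefix))
mondo_prefixes = set(mondo["xref"].map(xref_prefix))

# Prefixes that appear only in IRD can never produce a shared xref
ird_only = sorted(ird_prefixes - mondo_prefixes)
print(ird_only)
```

In this small sample the only IRD-only prefix is `ICD11`, while Mondo carries `icd11.foundation` instead, which is consistent with the note above.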