# Compare Xrefs - DS-Determined

This notebook finds the Mondo terms that correspond to the ICD9 and ICD10 codes in the file `CDM_extracted_ICD_codes.csv` from https://www.synapse.org/Synapse:syn63923531.

The overall process is to first extract the xrefs from Mondo using a SPARQL query and then match them against the ICD codes. The SPARQL query is run with the [ROBOT `query`](https://robot.obolibrary.org/query) command. ROBOT can be installed by following these [instructions](https://robot.obolibrary.org/).

The Python data analysis library `pandas` is used to match the Mondo xrefs against the ICD codes in the data file. Pandas can be installed by following the instructions [here](https://pandas.pydata.org/docs/getting_started/install.html).
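
The matching step amounts to a left merge on the xref string. A minimal sketch on toy data, using two xref pairs that appear in the outputs later in this notebook (the `toy_*` names are illustrative only and not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for the real tables; only a handful of rows for illustration.
toy_mondo = pd.DataFrame({
    "mondo_curie": ["MONDO:0005352", "MONDO:0001521"],
    "mondo_xref": ["ICD9:312.9", "ICD10CM:F63.81"],
})
toy_codes = pd.DataFrame({"icd_code": ["ICD9:312.9", "ICD9:228"]})

# A left merge keeps every input code; codes without a Mondo xref get NaN.
toy_merged = toy_codes.merge(
    toy_mondo, left_on="icd_code", right_on="mondo_xref", how="left"
)
print(toy_merged)
```

Using `how="left"` means unmatched codes stay in the result, so they can be reviewed rather than silently dropped.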


## Prerequisites

A Python environment, pyenv or conda, that contains:
- pandas
- ROBOT

See installation instructions above.

## Imports and Data Preparation

In [1]:
# Imports 
import pandas as pd

pd.set_option('display.max_colwidth', None)

In [2]:
# Get release version of mondo.owl. See https://github.com/monarch-initiative/mondo/tags for all Mondo release tags

# Comment out to prevent re-running this step while actively developing the notebook.
#!wget https://github.com/monarch-initiative/mondo/releases/download/v2025-05-06/mondo.owl -O data/v2025-05-06_mondo.owl


In [3]:
# Run query to get Mondo Xrefs for sources of interest

# Comment out to prevent re-running this step while actively developing the notebook.
#!robot query --input data/v2025-05-06_mondo.owl --use-graphs true -f tsv --query sparql/extract_mondo_xrefs.sparql reports/mondo_xrefs.tsv
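
As an alternative to commenting the shell commands in and out, the step could be skipped automatically whenever its output file already exists. A minimal sketch, assuming ROBOT is available on the `PATH`; `run_if_missing` is a hypothetical helper and not part of the notebook, while the command arguments are copied from the cell above:

```python
import subprocess
from pathlib import Path


def run_if_missing(command, output_path):
    """Run `command` only when `output_path` does not exist yet.

    Returns True if the command was run, False if it was skipped.
    """
    output = Path(output_path)
    if output.exists():
        return False
    # Create the output directory if needed, then run and fail loudly on error.
    output.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, check=True)
    return True


# Same arguments as the `robot query` call in the cell above.
robot_command = [
    "robot", "query",
    "--input", "data/v2025-05-06_mondo.owl",
    "--use-graphs", "true",
    "-f", "tsv",
    "--query", "sparql/extract_mondo_xrefs.sparql", "reports/mondo_xrefs.tsv",
]

# Uncomment to run the query only when the report is not there yet:
# run_if_missing(robot_command, "reports/mondo_xrefs.tsv")
```

Deleting `reports/mondo_xrefs.tsv` would then force the query to run again, for example after downloading a newer Mondo release.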


In [4]:
# Read in file of Mondo xrefs

mondo_df = pd.read_csv('reports/mondo_xrefs.tsv', sep='\t')
# The `mondo_xrefs.tsv` file contains one row per Mondo term per xref. Only Mondo terms with xrefs
# are included in the file.

mondo_df.head()

Unnamed: 0,?mondo_curie,?label,?xref,?is_obsolete,?has_equivalentTo
0,MONDO:0000001,disease,ICD9:799.9,False,False
1,MONDO:0000001,disease,DOID:4,False,True
2,MONDO:0000001,disease,MEDGEN:4347,False,True
3,MONDO:0000001,disease,MESH:D004194,False,True
4,MONDO:0000001,disease,NCIT:C2991,False,True


In [5]:
mondo_df.nunique()

?mondo_curie          27108
?label                27108
?xref                131778
?is_obsolete              2
?has_equivalentTo         2
dtype: int64

In [6]:
# Read in file from other ontology source or data source

data_filepath = 'data/CDM_extracted_ICD_codes.csv' # DS-Determined file
data_df = pd.read_csv(data_filepath)

data_df.head()

Unnamed: 0,icd_code,icd_version
0,78,9
1,G93.2,10
2,I31.39,10
3,M08.90,10
4,F63.81,10


In [7]:
data_df.nunique()

icd_code       2130
icd_version       2
dtype: int64

## Prepare Data File

In [8]:
# Create two dataframes from `data_df`, one with only ICD9 codes and one with only ICD10 codes.
# Add the prefix used by the Mondo xrefs as well: "ICD9:" or "ICD10CM:".

# `.copy()` avoids modifying a view of `data_df` when the prefix is added.
icd9_df = data_df[data_df['icd_version'] == 9].copy()
icd9_df['icd_code'] = 'ICD9:' + icd9_df['icd_code'].astype(str)
display(icd9_df.head())

icd10_df = data_df[data_df['icd_version'] == 10].copy()
icd10_df['icd_code'] = 'ICD10CM:' + icd10_df['icd_code'].astype(str)
display(icd10_df.head())

Unnamed: 0,icd_code,icd_version
0,ICD9:78,9
6,ICD9:519.11,9
9,ICD9:228,9
19,ICD9:52.9,9
37,ICD9:312.9,9


Unnamed: 0,icd_code,icd_version
1,ICD10CM:G93.2,10
2,ICD10CM:I31.39,10
3,ICD10CM:M08.90,10
4,ICD10CM:F63.81,10
5,ICD10CM:F02.80,10
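
The two blocks in the cell above differ only in the version number and the prefix, so the step could be factored into one function. A minimal sketch, where `split_by_icd_version` and `example_df` are hypothetical names used for illustration:

```python
import pandas as pd


def split_by_icd_version(df, version, prefix):
    """Return the rows for one ICD version with `prefix` prepended to each code."""
    subset = df[df["icd_version"] == version].copy()
    subset["icd_code"] = prefix + subset["icd_code"].astype(str)
    return subset


# Small example frame shaped like `data_df`.
example_df = pd.DataFrame({
    "icd_code": ["78", "G93.2", "312.9"],
    "icd_version": [9, 10, 9],
})

icd9_example = split_by_icd_version(example_df, 9, "ICD9:")
icd10_example = split_by_icd_version(example_df, 10, "ICD10CM:")
print(icd9_example)
print(icd10_example)
```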


## Map terms

Find terms in the data source, `data_df`, that map to Mondo term(s) through a shared xref: an ICD code matches a Mondo term when the prefixed code is identical to one of that term's xrefs.

In [9]:
# Map the terms between each dataframe to get the ICD to Mondo translation

# Rename columns in mondo_df
mondo_clean_df = mondo_df.rename(columns={
    mondo_df.columns[0]: 'mondo_curie',
    mondo_df.columns[1]: 'mondo_label',
    mondo_df.columns[2]: 'mondo_xref',
    mondo_df.columns[3]: 'mondo_is_obsolete',
    mondo_df.columns[4]: 'mondo_has_equivalentTo'
})

# Make sure xref columns have type string
mondo_clean_df['mondo_xref'] = mondo_clean_df['mondo_xref'].astype(str)

In [10]:
# Find ICD9 matches

icd9_merged_df = icd9_df.merge(
    mondo_clean_df,
    left_on='icd_code',
    right_on='mondo_xref',
    how='left'
)

display(icd9_merged_df.head())

display(icd9_merged_df.nunique())

Unnamed: 0,icd_code,icd_version,mondo_curie,mondo_label,mondo_xref,mondo_is_obsolete,mondo_has_equivalentTo
0,ICD9:78,9,,,,,
1,ICD9:519.11,9,,,,,
2,ICD9:228,9,,,,,
3,ICD9:52.9,9,,,,,
4,ICD9:312.9,9,MONDO:0005352,conduct disorder,ICD9:312.9,False,True


icd_code                  429
icd_version                 1
mondo_curie               261
mondo_label               261
mondo_xref                130
mondo_is_obsolete           2
mondo_has_equivalentTo      2
dtype: int64
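
In the `nunique()` output above, the `mondo_xref` count is the number of distinct ICD codes with at least one Mondo match, because a matched row's `mondo_xref` is identical to its `icd_code`. The same coverage numbers can be computed explicitly. A minimal sketch, where `match_summary` is a hypothetical helper and the toy rows are illustrative values rather than real results:

```python
import pandas as pd


def match_summary(merged_df):
    """Count distinct input codes with and without a Mondo match."""
    has_match = merged_df["mondo_curie"].notna()
    total = merged_df["icd_code"].nunique()
    matched = merged_df.loc[has_match, "icd_code"].nunique()
    return {
        "total_codes": total,
        "matched_codes": matched,
        "unmatched_codes": total - matched,
    }


# Toy merged table: one unmatched code and one code matching two terms.
toy_merged = pd.DataFrame({
    "icd_code": ["ICD9:78", "ICD9:312.9", "ICD9:312.9"],
    "mondo_curie": [None, "MONDO:0005352", "MONDO:0000001"],
})
summary = match_summary(toy_merged)
print(summary)
```

Counting distinct codes rather than rows matters here because one ICD code can match several Mondo terms and then appears on several rows.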

In [11]:
# Find ICD10 matches

icd10_merged_df = icd10_df.merge(
    mondo_clean_df,
    left_on='icd_code',
    right_on='mondo_xref',
    how='left'
)

display(icd10_merged_df.head())

display(icd10_merged_df.nunique())

Unnamed: 0,icd_code,icd_version,mondo_curie,mondo_label,mondo_xref,mondo_is_obsolete,mondo_has_equivalentTo
0,ICD10CM:G93.2,10,,,,,
1,ICD10CM:I31.39,10,,,,,
2,ICD10CM:M08.90,10,,,,,
3,ICD10CM:F63.81,10,MONDO:0001521,intermittent explosive disorder,ICD10CM:F63.81,False,True
4,ICD10CM:F02.80,10,,,,,


icd_code                  1701
icd_version                  1
mondo_curie                114
mondo_label                114
mondo_xref                  64
mondo_is_obsolete            2
mondo_has_equivalentTo       2
dtype: int64
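
The merged tables keep every match, including matches to obsolete Mondo terms. A minimal sketch of narrowing the results, assuming `mondo_is_obsolete` flags obsolete terms and `mondo_has_equivalentTo` flags xrefs asserted as equivalent (the column semantics are inferred from their names; `keep_preferred_matches` is a hypothetical helper and the toy rows are illustrative):

```python
import pandas as pd


def keep_preferred_matches(merged_df):
    """Keep rows matched to a non-obsolete term through an equivalent xref."""
    # Unmatched rows hold missing values in both flags and compare as False.
    is_current = merged_df["mondo_is_obsolete"].eq(False)
    is_equivalent = merged_df["mondo_has_equivalentTo"].eq(True)
    return merged_df[is_current & is_equivalent]


# Toy merged table covering the four possible cases.
toy_merged = pd.DataFrame({
    "icd_code": ["CODE:A", "CODE:B", "CODE:C", "CODE:D"],
    "mondo_curie": ["MONDO:1", "MONDO:2", "MONDO:3", None],
    "mondo_is_obsolete": [False, True, False, None],
    "mondo_has_equivalentTo": [True, True, False, None],
})
preferred = keep_preferred_matches(toy_merged)
print(preferred)
```

Whether to apply such a filter depends on how the mappings will be used, so the notebook's unfiltered output remains the complete record.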

In [12]:
# Save the ICD9 mapping results to a file

icd9_merged_df.to_csv('data/ds-determined-icd9_icd-mondo_mappings.tsv', sep='\t', index=False)

In [13]:
# Save the ICD10 mapping results to a file

icd10_merged_df.to_csv('data/ds-determined-icd10_icd-mondo_mappings.tsv', sep='\t', index=False)