### Overview

Given the file "results.txt", generated by `run_query.py`, check that each unique Mondo ID that has an xref annotation of `MONDO:includedEntryInOMIM` also has an xref annotation of `MONDO:equivalentTo`.

NOTE: The script `run_query.py` contains a SPARQL query that takes all OMIM CURIEs that come from an OMIM INCLUDED entry (which, as I later learned from OMIM, can be a phenotype/disease entry or a gene entry) and retrieves every MONDO CURIE that uses one of them as an xref, together with the source of each xref. Each result is on a single line.

Also check that all unique Mondo IDs that have an xref annotation of `MONDO:equivalentTo` also have an xref annotation of `MONDO:includedEntryInOMIM`. Since the CURIEs in "results.txt" already represent the subset of OMIM CURIEs that are for "INCLUDED" OMIM entries, this should also be true for this file. However, it does not hold for Mondo overall: there are valid cases where Mondo has an xref to an OMIM entry that does not contain an INCLUDED entry, and in those cases only the `MONDO:equivalentTo` annotation will be present.
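Both checks in this notebook pair rows by OMIM CURIE (the `Xref` column) and ask whether one source is present without the other. The sketch below is one compact way to express that as a set difference. The helper name `xrefs_missing_source` is hypothetical and not part of the notebook; the sample rows are copied from outputs shown further down, with the `Label` column omitted.

```python
import pandas as pd

# A few rows copied from outputs shown in this notebook (Label column omitted).
sample = pd.DataFrame(
    [
        ("MONDO:0008030", "OMIM:158900", "GARD:0009941"),
        ("MONDO:0008030", "OMIM:158900", "MONDO:equivalentTo"),
        ("MONDO:0014448", "OMIM:615999", "MONDO:equivalentTo"),
        ("MONDO:0009427", "OMIM:241500", "MONDO:obsoleteEquivalent"),
        ("MONDO:0016605", "OMIM:241500", "MONDO:includedEntryInOMIM"),
        ("MONDO:0007127", "OMIM:106400", "MONDO:includedEntryInOMIM"),
    ],
    columns=["Entity", "Xref", "Source"],
)


def xrefs_missing_source(frame, has_source, needs_source):
    """Hypothetical helper: sorted Xref values that have a row with
    `has_source` but no row with `needs_source`."""
    have = set(frame.loc[frame["Source"] == has_source, "Xref"])
    need = set(frame.loc[frame["Source"] == needs_source, "Xref"])
    return sorted(have - need)


# First check: INCLUDED xrefs that have no MONDO:equivalentTo row.
included_without_equivalent = xrefs_missing_source(
    sample, "MONDO:includedEntryInOMIM", "MONDO:equivalentTo"
)
# Opposite check: MONDO:equivalentTo xrefs that have no INCLUDED row.
equivalent_without_included = xrefs_missing_source(
    sample, "MONDO:equivalentTo", "MONDO:includedEntryInOMIM"
)

print(included_without_equivalent)  # ['OMIM:106400', 'OMIM:241500']
print(equivalent_without_included)  # ['OMIM:158900', 'OMIM:615999']
```

Note that the comparison is made per OMIM CURIE across all Mondo terms, not per Mondo ID, which matches how the cells below filter on `Xref`.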

In [1]:
# Imports
import pandas as pd

In [2]:
df = pd.read_csv("results.txt", quotechar="'", skipinitialspace=True)

df.head()

Unnamed: 0,Entity,Label,Xref,Source
0,MONDO:0020742,"obsolete cataract, microcephaly, failure to th...",OMIM:212540,GARD:0001060
1,MONDO:0020742,"obsolete cataract, microcephaly, failure to th...",OMIM:212540,MONDO:obsoleteEquivalent
2,MONDO:0008030,facioscapulohumeral muscular dystrophy 1,OMIM:158900,GARD:0009941
3,MONDO:0008030,facioscapulohumeral muscular dystrophy 1,OMIM:158900,MONDO:equivalentTo
4,MONDO:0014448,"hyperthyroxinemia, familial dysalbuminemic",OMIM:615999,MONDO:equivalentTo


In [3]:
df.nunique()

Entity    822
Label     822
Xref      643
Source    909
dtype: int64

---

### Check that all Mondo IDs with an OMIM xref with source MONDO:includedEntryInOMIM _also_ have an xref source of MONDO:equivalentTo

In [4]:
# From the data in "results.txt" (set of OMIM INCLUDED entries), check that any MONDO CURIE with 
# an OMIM xref and source MONDO:includedEntryInOMIM _also_ has an xref source of MONDO:equivalentTo

# Step 1: Filter rows where Source is 'MONDO:includedEntryInOMIM' and get the unique Xref values (OMIM CURIEs)
included_entry_xrefs = df[df['Source'] == 'MONDO:includedEntryInOMIM']['Xref'].unique()

# Step 2: Check if there is a corresponding row where Source is 'MONDO:equivalentTo' for each Xref (OMIM CURIE)
missing_equivalent_to_xrefs = []
for xref in included_entry_xrefs:
    if df[(df['Xref'] == xref) & (df['Source'] == 'MONDO:equivalentTo')].empty:
        missing_equivalent_to_xrefs.append(xref)

# Step 3: Create a DataFrame of these rows (includes rows from all Xref Sources)
invalid_MondoIDs_df = df[df['Xref'].isin(missing_equivalent_to_xrefs)]
invalid_MondoIDs_df.head(6)

Unnamed: 0,Entity,Label,Xref,Source
450,MONDO:0009427,obsolete infantile hypophosphatasia,OMIM:241500,DOID:0110914
451,MONDO:0009427,obsolete infantile hypophosphatasia,OMIM:241500,MONDO:obsoleteEquivalent
452,MONDO:0009427,obsolete infantile hypophosphatasia,OMIM:241500,Orphanet:247651
453,MONDO:0009427,obsolete infantile hypophosphatasia,OMIM:241500,Orphanet:247651/e
454,MONDO:0016605,perinatal lethal hypophosphatasia,OMIM:241500,MONDO:includedEntryInOMIM
455,MONDO:0016605,perinatal lethal hypophosphatasia,OMIM:241500,Orphanet:247623


In [5]:
# Step 4: Filter to display only those rows with a "MONDO:" source
invalid_pairs_only_df = invalid_MondoIDs_df[invalid_MondoIDs_df['Source'].str.startswith('MONDO:')]
invalid_pairs_only_df.head(len(invalid_pairs_only_df))

Unnamed: 0,Entity,Label,Xref,Source
451,MONDO:0009427,obsolete infantile hypophosphatasia,OMIM:241500,MONDO:obsoleteEquivalent
454,MONDO:0016605,perinatal lethal hypophosphatasia,OMIM:241500,MONDO:includedEntryInOMIM
651,MONDO:0007127,diffuse idiopathic skeletal hyperostosis,OMIM:106400,MONDO:includedEntryInOMIM
1124,MONDO:0044259,"obsolete skin/hair/eye pigmentation, variation...",OMIM:266300,MONDO:obsoleteEquivalent
1125,MONDO:0800410,"UV-induced skin damage, susceptibility to",OMIM:266300,MONDO:includedEntryInOMIM
1127,MONDO:0007798,obsolete adult hypophosphatasia,OMIM:146300,MONDO:obsoleteEquivalent
1130,MONDO:0016607,odontohypophosphatasia,OMIM:146300,MONDO:includedEntryInOMIM
1385,MONDO:0013799,"obsolete efavirenz, poor metabolism of",OMIM:614546,MONDO:obsoleteEquivalent
1386,MONDO:0800431,"efavirenz central nervous system toxicity, sus...",OMIM:614546,MONDO:includedEntryInOMIM


In [6]:
# Group the results by Xref

# # Group by 'Xref' and aggregate the combined values
# grouped_df = invalid_pairs_only_df.groupby('Xref').apply(lambda x: x[['Entity', 'Label', 'Source']].values.tolist()).reset_index()
# # Rename the columns
# grouped_df.columns = ['Xref', 'CombinedValues']

# --- Alternative Display --- #
# Group by 'Xref' and aggregate the values from 'Entity', 'Label', and 'Source'
grouped_df = invalid_pairs_only_df.groupby('Xref').agg({
    'Entity': lambda x: list(x),
    'Label': lambda x: list(x),
    'Source': lambda x: list(x)
}).reset_index()

# Set the display option to show full column width
pd.set_option('display.max_colwidth', None)

grouped_df.head()

Unnamed: 0,Xref,Entity,Label,Source
0,OMIM:106400,[MONDO:0007127],[diffuse idiopathic skeletal hyperostosis],[MONDO:includedEntryInOMIM]
1,OMIM:146300,"[MONDO:0007798, MONDO:0016607]","[obsolete adult hypophosphatasia, odontohypophosphatasia]","[MONDO:obsoleteEquivalent, MONDO:includedEntryInOMIM]"
2,OMIM:241500,"[MONDO:0009427, MONDO:0016605]","[obsolete infantile hypophosphatasia, perinatal lethal hypophosphatasia]","[MONDO:obsoleteEquivalent, MONDO:includedEntryInOMIM]"
3,OMIM:266300,"[MONDO:0044259, MONDO:0800410]","[obsolete skin/hair/eye pigmentation, variation in, 2, UV-induced skin damage, susceptibility to]","[MONDO:obsoleteEquivalent, MONDO:includedEntryInOMIM]"
4,OMIM:614546,"[MONDO:0013799, MONDO:0800431]","[obsolete efavirenz, poor metabolism of, efavirenz central nervous system toxicity, susceptibility to]","[MONDO:obsoleteEquivalent, MONDO:includedEntryInOMIM]"


### Summary 

Every OMIM CURIE that contains an INCLUDED entry and is used as an xref with source `MONDO:includedEntryInOMIM` also has a "pair", that is, the same OMIM CURIE is used as an xref with source `MONDO:equivalentTo` or `MONDO:obsoleteEquivalent`, EXCEPT for **OMIM:106400**.
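The summary above treats either `MONDO:equivalentTo` or `MONDO:obsoleteEquivalent` as a valid pair. A minimal sketch of that relaxed check, run on the nine `MONDO:`-sourced rows shown in the output of cell 5 (Label column omitted), is given below. The names `PAIR_SOURCES` and `unpaired` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# The nine MONDO:-sourced rows from the cell 5 output (Label column omitted).
rows = [
    ("MONDO:0009427", "OMIM:241500", "MONDO:obsoleteEquivalent"),
    ("MONDO:0016605", "OMIM:241500", "MONDO:includedEntryInOMIM"),
    ("MONDO:0007127", "OMIM:106400", "MONDO:includedEntryInOMIM"),
    ("MONDO:0044259", "OMIM:266300", "MONDO:obsoleteEquivalent"),
    ("MONDO:0800410", "OMIM:266300", "MONDO:includedEntryInOMIM"),
    ("MONDO:0007798", "OMIM:146300", "MONDO:obsoleteEquivalent"),
    ("MONDO:0016607", "OMIM:146300", "MONDO:includedEntryInOMIM"),
    ("MONDO:0013799", "OMIM:614546", "MONDO:obsoleteEquivalent"),
    ("MONDO:0800431", "OMIM:614546", "MONDO:includedEntryInOMIM"),
]
sample = pd.DataFrame(rows, columns=["Entity", "Xref", "Source"])

# Sources accepted as the "pair" of an includedEntryInOMIM xref.
PAIR_SOURCES = {"MONDO:equivalentTo", "MONDO:obsoleteEquivalent"}

# Collect the set of sources seen for each OMIM CURIE.
sources_by_xref = sample.groupby("Xref")["Source"].apply(set)

# Keep the CURIEs that have an INCLUDED row but none of the accepted pair sources.
unpaired = sorted(
    xref
    for xref, sources in sources_by_xref.items()
    if "MONDO:includedEntryInOMIM" in sources and not (sources & PAIR_SOURCES)
)
print(unpaired)  # ['OMIM:106400']
```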

---

### Check the _opposite_ since the result file used in this Notebook only has OMIM CURIEs that are INCLUDED

In [7]:
# Check that each unique Mondo ID that has an xref annotation of MONDO:equivalentTo
# also has an xref annotation of MONDO:includedEntryInOMIM

# Step 1: Filter rows where Source is 'MONDO:equivalentTo' and get the unique Xref values (OMIM CURIEs)
equivalent_entry_xrefs = df[df['Source'] == 'MONDO:equivalentTo']['Xref'].unique()
#print('equivalent_entry_xrefs: ', len(equivalent_entry_xrefs))

# Step 2: Check if there is a corresponding row where Source is 'MONDO:includedEntryInOMIM' for each Xref
missing_included_to_xrefs = []
for xref in equivalent_entry_xrefs:
    if df[(df['Xref'] == xref) & (df['Source'] == 'MONDO:includedEntryInOMIM')].empty:
        missing_included_to_xrefs.append(xref)

# Create a DataFrame of these rows
missing_included_pairs = df[df['Xref'].isin(missing_included_to_xrefs)]
missing_included_pairs.head(len(missing_included_pairs))

Unnamed: 0,Entity,Label,Xref,Source
2,MONDO:0008030,facioscapulohumeral muscular dystrophy 1,OMIM:158900,GARD:0009941
3,MONDO:0008030,facioscapulohumeral muscular dystrophy 1,OMIM:158900,MONDO:equivalentTo
4,MONDO:0014448,"hyperthyroxinemia, familial dysalbuminemic",OMIM:615999,MONDO:equivalentTo
7,MONDO:0011527,Charcot-Marie-Tooth disease type 4E,OMIM:605253,DOID:0110195
8,MONDO:0011527,Charcot-Marie-Tooth disease type 4E,OMIM:605253,GARD:0006170
...,...,...,...,...
1730,MONDO:0008224,hyperkalemic periodic paralysis,OMIM:170500,DOID:14451
1731,MONDO:0008224,hyperkalemic periodic paralysis,OMIM:170500,GARD:0000195
1732,MONDO:0008224,hyperkalemic periodic paralysis,OMIM:170500,MONDO:equivalentTo
1733,MONDO:0008224,hyperkalemic periodic paralysis,OMIM:170500,Orphanet:682


In [8]:
# Filter out any rows where Source is MONDO:preferredExternal
missing_included_pairs_filtered = missing_included_pairs[~missing_included_pairs['Source'].isin(['MONDO:preferredExternal'])]

missing_included_pairs_only = missing_included_pairs_filtered[missing_included_pairs_filtered['Source'].str.startswith('MONDO:')]
missing_included_pairs_only.head(len(missing_included_pairs_only))

Unnamed: 0,Entity,Label,Xref,Source
3,MONDO:0008030,facioscapulohumeral muscular dystrophy 1,OMIM:158900,MONDO:equivalentTo
4,MONDO:0014448,"hyperthyroxinemia, familial dysalbuminemic",OMIM:615999,MONDO:equivalentTo
9,MONDO:0011527,Charcot-Marie-Tooth disease type 4E,OMIM:605253,MONDO:equivalentTo
13,MONDO:0007251,campomelic dysplasia,OMIM:114290,MONDO:equivalentTo
19,MONDO:0007240,"progressive familial heart block, type 1A",OMIM:113900,MONDO:equivalentTo
...,...,...,...,...
1719,MONDO:0011225,severe combined immunodeficiency due to DCLRE1C deficiency,OMIM:602450,MONDO:equivalentTo
1724,MONDO:0007116,hereditary neurocutaneous angioma,OMIM:106070,MONDO:equivalentTo
1728,MONDO:0008990,"cleft larynx, posterior",OMIM:215800,MONDO:equivalentTo
1729,MONDO:0054699,proteasome-associated autoinflammatory syndrome 3,OMIM:617591,MONDO:equivalentTo


In [9]:
# NOTE: This cell was used to find out why the DataFrame had more rows than unique Mondo IDs
## Find which Entity occurs more than once in missing_included_pairs_only

# entity_counts = missing_included_pairs_only['Entity'].value_counts()
# duplicate_entities = entity_counts[entity_counts > 1].index.tolist()

# # Display the duplicate entities
# print("Entities that occur more than once:", duplicate_entities)

# ## Entities that occur more than once: 
# ## ['MONDO:0010600', 'MONDO:0007418', 'MONDO:0009744', 'MONDO:0011835'] because also have MONDO:preferredExternal

In [10]:
uniq = missing_included_pairs_only['Entity'].nunique()
uniq

# Now the unique number of Mondo IDs is 459, the same count as rows in the dataframe 
# now with MONDO:preferredExternal rows filtered out

459

In [11]:
# Write to file
missing_included_pairs_only.to_csv('missing_included_pairs_only.csv', index=False)

### Summary 

There are 459 unique Mondo IDs with a `MONDO:equivalentTo` xref to an "INCLUDED OMIM" where the OMIM ID is not
also used as a `MONDO:includedEntryInOMIM` xref, for example on a second Mondo term. From a manual look at a few
entries, in some cases the "INCLUDED OMIM" title has been added to Mondo as either an EXACT or RELATED synonym.

## More Analysis

An earlier curation practice was to add the name of the INCLUDED OMIM entry (not the main entry) as 
a RELATED synonym to the same Mondo ID. This can be analyzed with the new omim.owl file where the INCLUDED entry
name is now in a special Mondo property, e.g. `http://purl.obolibrary.org/obo/mondo#omim_included`.

While some of this can be analyzed by coding, review by curators will probably be needed as well.
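One part that could plausibly be automated is checking whether an INCLUDED title already appears as a synonym on the same Mondo term. The sketch below assumes the titles and synonyms have already been extracted into two DataFrames; the extraction from omim.owl is not shown. All column names, IDs, and strings are hypothetical placeholders, not real Mondo or OMIM data.

```python
import pandas as pd

# Hypothetical placeholder rows; real values would come from omim.owl and Mondo.
included = pd.DataFrame(
    {
        "Entity": ["MONDO:EXAMPLE1", "MONDO:EXAMPLE2"],
        "IncludedTitle": ["Example Disease,  Type A", "Another Example Disease"],
    }
)
synonyms = pd.DataFrame(
    {
        "Entity": ["MONDO:EXAMPLE1", "MONDO:EXAMPLE2"],
        "Synonym": ["example disease, type A", "unrelated synonym"],
        "Scope": ["RELATED", "EXACT"],
    }
)


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


included["key"] = included["IncludedTitle"].map(normalize)
synonyms["key"] = synonyms["Synonym"].map(normalize)

# Left merge keeps every INCLUDED title and attaches the synonym scope when one matches.
matched = included.merge(
    synonyms[["Entity", "key", "Scope"]], on=["Entity", "key"], how="left"
)
matched["Scope"] = matched["Scope"].fillna("no synonym match")
print(matched[["Entity", "IncludedTitle", "Scope"]])
```

Rows left as "no synonym match" would be the candidates for curator review, since an exact string comparison cannot catch reworded or abbreviated titles.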