## Alignment Stats


### Description

This notebook reviews the counts used in the Mapping Progress Report in the mondo-ingest repo.

These calculations are based on the slides [here](https://docs.google.com/presentation/d/14Xmcfwe9xhoGG0ZnDbFbwzKqHWHQR4OaOF4W7II2BaM/edit?slide=id.p#slide=id.p).

- DISEASE is generally calculated as the mirror signature minus all excluded terms and all obsolete terms from the ingest source.
  - The mirror signature file is `reports/mirror_signature-<ONTOLOGY_NAME>.tsv`.
  - The set of all excluded terms is from `reports/<ONTOLOGY_NAME>_term_exclusions.txt`.
  - The set of all obsolete ingest source terms was created by querying the OAK ontology database for obsolete terms.
  - IRIs were converted to CURIEs as needed.
  - The final count only includes CURIEs from the base namespace of the ingest source.


- MAPPED is calculated by taking the `mondo.sssom.tsv` file, keeping only `skos:exactMatch` rows, and then removing any row that has an obsolete Mondo term or an obsolete source ingest term.

QUESTION: Should source ingest terms that appear in `mondo.sssom.tsv` but are also listed in the source's term exclusion file be removed as well?
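
The DISEASE and MAPPED definitions can be sketched with plain Python sets and a tiny in-memory TSV. This is only an illustration of the set logic, not the notebook's actual computation: every ID, the `base_prefix` value, and all variable names are invented for the example, and intersecting MAPPED with DISEASE is an assumption of the sketch.

```python
import csv
import io

# Toy inputs; all IDs below are made up for illustration.
mirror_signature = {"OMIM:1", "OMIM:2", "OMIM:3", "OMIM:4", "HP:9"}
excluded_terms = {"OMIM:2"}
source_obsoletes = {"OMIM:3"}
mondo_obsoletes = {"MONDO:0000099"}
base_prefix = "OMIM:"  # assumed base namespace of the ingest source

# DISEASE: mirror signature minus excluded and obsolete terms,
# restricted to the source's own namespace.
disease = {
    curie
    for curie in mirror_signature - (excluded_terms | source_obsoletes)
    if curie.startswith(base_prefix)
}

# Toy stand-in for mondo.sssom.tsv (three columns only).
sssom_tsv = (
    "subject_id\tpredicate_id\tobject_id\n"
    "MONDO:0000001\tskos:exactMatch\tOMIM:1\n"
    "MONDO:0000002\tskos:broadMatch\tOMIM:4\n"
    "MONDO:0000099\tskos:exactMatch\tOMIM:4\n"
    "MONDO:0000003\tskos:exactMatch\tOMIM:3\n"
)
rows = list(csv.DictReader(io.StringIO(sssom_tsv), delimiter="\t"))

# MAPPED: exact matches only, dropping rows with an obsolete Mondo
# subject or an obsolete source object, then limited to DISEASE terms.
mapped = {
    row["object_id"]
    for row in rows
    if row["predicate_id"] == "skos:exactMatch"
    and row["subject_id"] not in mondo_obsoletes
    and row["object_id"] not in source_obsoletes
} & disease

print(sorted(disease))  # ['OMIM:1', 'OMIM:4']
print(sorted(mapped))   # ['OMIM:1']
```

Under the QUESTION above, removing excluded source terms would add one more `not in excluded_terms` condition to the comprehension.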


NOTE: The files of obsolete terms were created using the pattern `runoak -i components/icd10cm.db obsoletes > icd10cm_obsoletes.txt`.
All files come from the recent data build, so they were all created in late April 2025.
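
Reading such a file back into a set of CURIEs might look like the sketch below. The line layout is an assumption (a CURIE, optionally followed by ` ! ` and a label), so check a real output file before relying on it. The helper `parse_obsoletes` and the sample lines are hypothetical.

```python
def parse_obsoletes(lines):
    """Collect the leading CURIE from each non-empty line."""
    curies = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: "<CURIE> ! <label>"; a bare CURIE also works.
        curies.add(line.split(" ! ", 1)[0].strip())
    return curies


# Made-up sample lines standing in for icd10cm_obsoletes.txt.
sample = [
    "ICD10CM:X00.0 ! made-up obsolete label",
    "ICD10CM:X00.1",
    "",
]
obsoletes = parse_obsoletes(sample)
print(sorted(obsoletes))  # ['ICD10CM:X00.0', 'ICD10CM:X00.1']
```

With a real file, the same function could be called on an open file handle, since iterating a text file yields its lines.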

In [136]:
#!./run.sh make reports/mondo_obsoletes_exactmatch.tsv tmp/mondo.sssom.tsv


In [137]:
# Imports
import pandas as pd

pd.set_option('display.width', 1000)


In [138]:

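# Convert an IRI (optionally wrapped in angle brackets) to a CURIE.
# IRIs with no known prefix are returned unchanged, minus the brackets.
def extract_curie(iri):
    if not isinstance(iri, str):
        return iri  # Pass missing values (e.g. NaN) through so dropna() can remove them
    clean_iri = iri.replace('>', '').replace('<', '')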
    curie_mapping = {
        'https://omim.org/entry/': 'OMIM:',
        'http://omim.org/entry/': 'OMIM:',
        'https://omim.org/phenotypicSeries/PS': 'OMIMPS:',
        'http://omim.org/phenotypicSeries/PS': 'OMIMPS:',
        'http://purl.obolibrary.org/obo/OMIMPS_': 'OMIMPS:',
        'http://purl.obolibrary.org/obo/MONDO_': 'MONDO:',
        'http://www.orpha.net/ORDO/Orphanet_': 'Orphanet:',
        'http://purl.obolibrary.org/obo/DOID_': 'DOID:',
        'http://purl.obolibrary.org/obo/NCIT_': 'NCIT:',
        'http://id.who.int/icd/entity/': 'icd11.foundation:',
        'http://purl.bioontology.org/ontology/ICD10CM/': 'ICD10CM:',
        'http://purl.obolibrary.org/obo/mondo/sources/icd11foundation/': 'icd11.foundation:',
    }
    for uri_prefix, curie_prefix in curie_mapping.items():
        if clean_iri.startswith(uri_prefix):
            # Swap only the leading prefix; str.replace would touch every occurrence
            return curie_prefix + clean_iri[len(uri_prefix):]
    return clean_iri

def compute_stats(source, mondo_sssom_df, df_mondo_obsoletes_exactmatch):
    """Print ALL, IN-SCOPE, EXCLUDED, DISEASE and MAPPED counts for one ingest source."""

    # These files have to exist and be up to date
    f_mirror_signature = f'reports/mirror_signature-{source}.tsv'
    f_initial_inclusions = f'tmp/{source}_relevant_signature.txt'
    f_exclusions = f'reports/{source}_term_exclusions.txt'
    f_deprecated = f'reports/mirror_deprecated-{source}.tsv'
    
    # Load the mirror signature file
    df_signature = pd.read_csv(f_mirror_signature, header=None, names=['iri'])
    df_signature['curie'] = df_signature['iri'].apply(extract_curie)
    df_signature_clean = df_signature.dropna(subset=['curie'])  # Drop rows with a missing value
    signature_all = set(df_signature_clean['curie'])
    #print(f"signature_all: {list(signature_all)[0:5]}")

    # Load the initial inclusion file
    # TODO: move this file to the "reports" folder
    df_initial_inclusions = pd.read_csv(f_initial_inclusions, header=None, names=['curie_include'])
    df_initial_inclusions['curie_include'] = df_initial_inclusions['curie_include'].apply(extract_curie)
    signature_initial_inclusions = set(df_initial_inclusions['curie_include'].dropna())
    #print(f"signature_initial_inclusions: {list(signature_initial_inclusions)[0:5]}")

    # Load the term exclusions file
    df_manual_exclusions = pd.read_csv(f_exclusions, header=None, names=['curie_exclude'])
    signature_manual_exclusions = set(df_manual_exclusions['curie_exclude'].dropna())
    #print(f"signature_manual_exclusions: {list(signature_manual_exclusions)[0:5]}")

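    # Keep obsolete Mondo classes whose exact match is an OMIM term.
    # NOTE: the 'OMIM' prefix is hard-coded, so this filter is the same for every source.
    df_mondo_obsoletes = df_mondo_obsoletes_exactmatch[df_mondo_obsoletes_exactmatch['exact_match'].str.startswith('OMIM', na=False)].copy()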
    signature_mondo_obsoletes = set(df_mondo_obsoletes['exact_match'].dropna().unique())
    #print(f"signature_obsoletes: {list(signature_mondo_obsoletes)[0:5]}")

    # Load the obsolete terms file for source
    df_source_obsoletes = pd.read_csv(f_deprecated, header=None, names=['curie'])
    df_source_obsoletes['curie'] = df_source_obsoletes['curie'].apply(extract_curie)
    signature_source_obsoletes = set(df_source_obsoletes['curie'].dropna().unique())
    #print(f"source_obsoletes: {list(signature_source_obsoletes)[0:5]}")

    # Compute the various sets
    non_inclusion_set = set(signature_all - signature_initial_inclusions)
    exclusion_set = non_inclusion_set | signature_manual_exclusions | signature_source_obsoletes | signature_mondo_obsoletes
    inclusion_set = signature_initial_inclusions - exclusion_set

    df_mapped = mondo_sssom_df[mondo_sssom_df['object_id'].isin(inclusion_set)].copy()
    mapped_set = set(df_mapped['object_id'].dropna().unique())
    non_mapped = inclusion_set - mapped_set
    print(f"First five IDs that have not been mapped yet: {list(non_mapped)[0:5]}")

    all_count = len(signature_all)
    excluded_count = len(exclusion_set)
    included_count = len(inclusion_set)
    initial_inclusion_count = len(signature_initial_inclusions)
    initial_inclusion_percent = (initial_inclusion_count / all_count) * 100 if all_count > 0 else 0
    included_percent = (included_count / initial_inclusion_count) * 100 if initial_inclusion_count > 0 else 0
    mapped_count  = len(mapped_set)
    mapped_percent = (mapped_count / included_count) * 100 if included_count > 0 else 0
    print(f"\nFinal Count of {source}:")
    print(f"  Mirror Signature [ALL]: {all_count}")
    print(f"  Initial Inclusions [IN-SCOPE]: {initial_inclusion_count} ({initial_inclusion_percent:.2f}%)")
    print(f"  Excluded Terms [EXCLUDED]: {excluded_count}")
    print(f"  Included Terms [DISEASE]: {included_count} ({included_percent:.2f}%)")
    print(f"  Mapped Terms [MAPPED]: {mapped_count} ({mapped_percent:.2f}%)")


In [139]:
## Load deprecated Mondo classes and any exact matches they might have

f_deprecated_exactmatch = 'reports/mondo_obsoletes_exactmatch.csv'
df_mondo_obsoletes_exactmatch = pd.read_csv(f_deprecated_exactmatch)
df_mondo_obsoletes_exactmatch['class'] = df_mondo_obsoletes_exactmatch['class'].apply(extract_curie)
df_mondo_obsoletes_exactmatch['exact_match'] = df_mondo_obsoletes_exactmatch['exact_match'].apply(extract_curie)
df_mondo_obsoletes_exactmatch.head()


Unnamed: 0,class,exact_match
0,MONDO:0013733,OMIM:614401
1,MONDO:0019748,Orphanet:93618
2,MONDO:0007139,OMIM:107290
3,MONDO:0044256,OMIM:227240
4,MONDO:0019711,icd11.foundation:395969787


In [140]:
## Load Mondo SSSOM file

mondo_sssom_df = pd.read_csv('tmp/mondo.sssom.tsv', sep='\t', comment='#')
mondo_sssom_df.head()

Unnamed: 0,subject_id,subject_label,predicate_id,object_id,object_label,mapping_justification
0,MONDO:0000001,disease,skos:exactMatch,DOID:4,disease,semapv:UnspecifiedMatching
1,MONDO:0000001,disease,skos:exactMatch,MEDGEN:4347,,semapv:UnspecifiedMatching
2,MONDO:0000001,disease,skos:exactMatch,NCIT:C2991,Disease or Disorder,semapv:UnspecifiedMatching
3,MONDO:0000001,disease,skos:exactMatch,Orphanet:377788,Disease,semapv:UnspecifiedMatching
4,MONDO:0000001,disease,skos:exactMatch,SCTID:64572001,,semapv:UnspecifiedMatching


In [141]:
# OMIM Alignment
source = 'omim'
compute_stats(source, mondo_sssom_df, df_mondo_obsoletes_exactmatch)

First five IDs that have not been mapped yet: ['OMIM:620975', 'OMIM:271250', 'term']

Final Count of omim:
  Mirror Signature [ALL]: 36277
  Initial Inclusions [IN-SCOPE]: 12143 (33.47%)
  Excluded Terms [EXCLUDED]: 27583
  Included Terms [DISEASE]: 8703 (71.67%)
  Mapped Terms [MAPPED]: 8700 (99.97%)
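
Each percentage in these reports is relative to the previous stage, not to the full mirror signature: IN-SCOPE is a share of ALL, DISEASE a share of IN-SCOPE, and MAPPED a share of DISEASE. The sketch below recomputes the OMIM percentages from the counts printed above; `percent` is a hypothetical helper that mirrors the zero-denominator guard in `compute_stats`.

```python
# Counts copied from the OMIM output above.
all_count = 36277
in_scope = 12143
disease = 8703
mapped = 8700


def percent(part, whole):
    """Share of `whole` covered by `part`, guarding against an empty denominator."""
    return (part / whole) * 100 if whole > 0 else 0


print(f"IN-SCOPE / ALL:     {percent(in_scope, all_count):.2f}%")  # 33.47%
print(f"DISEASE / IN-SCOPE: {percent(disease, in_scope):.2f}%")    # 71.67%
print(f"MAPPED / DISEASE:   {percent(mapped, disease):.2f}%")      # 99.97%
```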


In [142]:
# Orphanet Alignment

source = 'ordo'
compute_stats(source, mondo_sssom_df, df_mondo_obsoletes_exactmatch)

First five IDs that have not been mapped yet: ['Orphanet:689021', 'Orphanet:662392', 'Orphanet:684216', 'Orphanet:689397', 'Orphanet:659707']

Final Count of ordo:
  Mirror Signature [ALL]: 15580
  Initial Inclusions [IN-SCOPE]: 11097 (71.23%)
  Excluded Terms [EXCLUDED]: 6614
  Included Terms [DISEASE]: 9269 (83.53%)
  Mapped Terms [MAPPED]: 9195 (99.20%)


In [143]:
# DOID Alignment
source = 'doid'
compute_stats(source, mondo_sssom_df, df_mondo_obsoletes_exactmatch)

First five IDs that have not been mapped yet: ['DOID:0070610', 'DOID:0051001', 'DOID:0070623', 'DOID:0051033', 'DOID:0051027']

Final Count of doid:
  Mirror Signature [ALL]: 19115
  Initial Inclusions [IN-SCOPE]: 11766 (61.55%)
  Excluded Terms [EXCLUDED]: 7816
  Included Terms [DISEASE]: 11578 (98.40%)
  Mapped Terms [MAPPED]: 11442 (98.83%)


In [144]:
# NCIT Alignment
source = 'ncit'
compute_stats(source, mondo_sssom_df, df_mondo_obsoletes_exactmatch)

First five IDs that have not been mapped yet: ['NCIT:C173137', 'NCIT:C92180', 'NCIT:C188956', 'NCIT:C136770', 'NCIT:C96489']

Final Count of ncit:
  Mirror Signature [ALL]: 185903
  Initial Inclusions [IN-SCOPE]: 15751 (8.47%)
  Excluded Terms [EXCLUDED]: 175699
  Included Terms [DISEASE]: 15746 (99.97%)
  Mapped Terms [MAPPED]: 3688 (23.42%)


In [145]:
# ICD 11 Alignment
source = 'icd11foundation'
compute_stats(source, mondo_sssom_df, df_mondo_obsoletes_exactmatch)

First five IDs that have not been mapped yet: ['icd11.foundation:1233116564', 'icd11.foundation:1357248582', 'icd11.foundation:960436403', 'icd11.foundation:1462112221', 'icd11.foundation:2028524981']

Final Count of icd11foundation:
  Mirror Signature [ALL]: 101136
  Initial Inclusions [IN-SCOPE]: 57873 (57.22%)
  Excluded Terms [EXCLUDED]: 49318
  Included Terms [DISEASE]: 52098 (90.02%)
  Mapped Terms [MAPPED]: 4104 (7.88%)


In [146]:
# ICD 10 Alignment
source = 'icd10cm'
compute_stats(source, mondo_sssom_df, df_mondo_obsoletes_exactmatch)

First five IDs that have not been mapped yet: ['ICD10CM:S50.851', 'ICD10CM:S12.190K', 'ICD10CM:S50.322', 'ICD10CM:S59.299', 'ICD10CM:H15.049']

Final Count of icd10cm:
  Mirror Signature [ALL]: 95975
  Initial Inclusions [IN-SCOPE]: 95848 (99.87%)
  Excluded Terms [EXCLUDED]: 15857
  Included Terms [DISEASE]: 80396 (83.88%)
  Mapped Terms [MAPPED]: 1166 (1.45%)
