## Alignment Stats


### Description

This notebook reviews the counts for the Mapping Progress Report in the mondo-ingest repo.

These calculations are based on the slides [here](https://docs.google.com/presentation/d/14Xmcfwe9xhoGG0ZnDbFbwzKqHWHQR4OaOF4W7II2BaM/edit?slide=id.p#slide=id.p).

- DISEASE is generally calculated as the mirror signature minus all excluded terms and all obsolete terms from the ingest source.
  - The mirror signature file is `reports/mirror_signature-<ONTOLOGY_NAME>.tsv`.
  - The set of all excluded terms comes from `reports/<ONTOLOGY_NAME>_term_exclusions.txt`.
  - The set of all obsolete terms from the ingest source was created by querying the OAK ontology database for obsolete terms.
  - IRIs were converted to CURIEs as needed.
  - The final count only includes CURIEs from the base namespace of the ingest source.


- MAPPED is calculated by starting with the `mondo.sssom.tsv` file, filtering to `skos:exactMatch` rows only, and then removing any rows that have an obsolete Mondo term or an obsolete source ingest term.

QUESTION: Should source ingest terms that appear in `mondo.sssom.tsv` but are also listed in the source's term exclusion file be removed as well?


NOTE: The files of obsolete terms were created using the pattern `runoak -i components/icd10cm.db obsoletes > icd10cm_obsoletes.txt`.

All input files come from the recent data build, so they were all created in late April 2025.
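
The DISEASE definition above reduces to set arithmetic. The following is a minimal sketch with made-up CURIEs (the `EX:` and `OTHER:` identifiers are hypothetical and do not come from any ingest source); it illustrates the definition and is not the notebook's actual implementation.

```python
# Toy illustration of:
#   DISEASE = mirror signature - (term exclusions | source obsoletes),
# restricted to the source's base namespace. All identifiers are hypothetical.
mirror_signature = {"EX:1", "EX:2", "EX:3", "EX:4", "OTHER:9"}
term_exclusions = {"EX:2"}
source_obsoletes = {"EX:3"}
base_prefix = "EX:"

# Remove excluded and obsolete terms from the mirror signature.
candidates = mirror_signature - (term_exclusions | source_obsoletes)

# Keep only CURIEs from the source's base namespace.
disease = {curie for curie in candidates if curie.startswith(base_prefix)}

print(sorted(disease))  # ['EX:1', 'EX:4']
```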

In [1]:
# Imports
import pandas as pd

pd.set_option('display.width', 1000)

## Get obsolete Mondo terms

In [2]:
# Query for obsolete Mondo terms and any xrefs for the term that exist

!robot query -i tmp/mondo.owl -q ../sparql/mondo_obsoletes_and_xrefs.sparql tmp/mondo_obsoletes_and_xrefs.tsv

In [3]:
obs_mondo_df = pd.read_csv('tmp/mondo_obsoletes_and_xrefs.tsv', sep='\t')

obs_mondo_df.head()

Unnamed: 0,?mondo_curie,?xref,?xref_source,?mondoSources
0,MONDO:0000002,,,
1,MONDO:0000003,,,
2,MONDO:0000006,,,
3,MONDO:0000007,,,
4,MONDO:0000008,,,


## Get Mondo Mapping file 

In [4]:
# Read in mondo.sssom.tsv 

mondo_sssom_df = pd.read_csv('tmp/mondo.sssom.tsv', sep='\t', skiprows=51)

print(len(mondo_sssom_df))

106545
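
The hard-coded `skiprows=51` assumes the metadata header of `mondo.sssom.tsv` is exactly 51 lines long. Assuming that header consists of leading lines starting with `#`, a small helper could count them instead of relying on a fixed number. `count_header_lines` is a hypothetical name used only for this sketch, and the sample lines are made up.

```python
from itertools import takewhile


def count_header_lines(lines):
    """Count the leading lines that start with '#' (the metadata block)."""
    return sum(1 for _ in takewhile(lambda line: line.startswith("#"), lines))


# Tiny in-memory stand-in for a file; the content is made up.
sample = [
    "# curie_map:\n",
    "#   MONDO: http://purl.obolibrary.org/obo/MONDO_\n",
    "subject_id\tpredicate_id\tobject_id\n",
    "MONDO:0000001\tskos:exactMatch\tEX:1\n",
]

print(count_header_lines(sample))  # 2
```

With a real file, an open file handle can be passed in place of the list, and the result can then be given to `pd.read_csv` as `skiprows`.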


In [5]:
# Filter mondo_sssom_df to only rows where the predicate_id is skos:exactMatch

mondo_sssom_skosExactMatch_df = mondo_sssom_df[mondo_sssom_df['predicate_id'] == 'skos:exactMatch']

mondo_sssom_skosExactMatch_df.head()
print(len(mondo_sssom_skosExactMatch_df))

106468
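
The Description also calls for dropping rows whose Mondo term or source term is obsolete. A minimal sketch of that row filter on a toy SSSOM-like frame follows. The identifiers and both obsolete sets are made up, and the sketch assumes Mondo terms sit in `subject_id` and source terms in `object_id`.

```python
import pandas as pd

# Toy mapping table; all identifiers are hypothetical.
toy = pd.DataFrame(
    {
        "subject_id": ["MONDO:1", "MONDO:2", "MONDO:3", "MONDO:4"],
        "predicate_id": [
            "skos:exactMatch",
            "skos:exactMatch",
            "skos:broadMatch",
            "skos:exactMatch",
        ],
        "object_id": ["EX:1", "EX:2", "EX:3", "EX:4"],
    }
)
obsolete_mondo = {"MONDO:2"}   # assumed set of obsolete Mondo CURIEs
obsolete_source = {"EX:4"}     # assumed set of obsolete source CURIEs

# Keep exact matches whose subject and object are both non-obsolete.
keep = (
    (toy["predicate_id"] == "skos:exactMatch")
    & ~toy["subject_id"].isin(obsolete_mondo)
    & ~toy["object_id"].isin(obsolete_source)
)
mapped = toy[keep]

print(mapped["object_id"].tolist())  # ['EX:1']
```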


## Generalize for all sources

In [6]:
# Generalize for all Sources

# Function to extract CURIE from IRI
def extract_curie(iri):
    curie_mapping = {
        'https://omim.org/entry/': 'OMIM:',
        'http://omim.org/entry/': 'OMIM:',
        'https://omim.org/phenotypicSeries/PS': 'OMIMPS:',
        'http://omim.org/phenotypicSeries/PS': 'OMIMPS:',
        'http://purl.obolibrary.org/obo/OMIMPS_': 'OMIMPS:',
        'http://purl.obolibrary.org/obo/MONDO_': 'MONDO:',
        'http://www.orpha.net/ORDO/Orphanet_': 'Orphanet:',
        'http://purl.obolibrary.org/obo/DOID_': 'DOID:',
        'http://purl.obolibrary.org/obo/NCIT_': 'NCIT:',
        'http://id.who.int/icd/entity/': 'icd11.foundation:',
        'http://purl.bioontology.org/ontology/ICD10CM/': 'ICD10CM:',
        'http://purl.obolibrary.org/obo/mondo/sources/icd11foundation/': 'icd11.foundation:',
    }

    if pd.notna(iri):
        for prefix, curie_prefix in curie_mapping.items():
            if iri.startswith(prefix):
                return iri.replace(prefix, curie_prefix)
    return pd.NA


def compute_stats(source, mondo_sssom_df):

    # NOTE: All files must be from latest build for accurate numbers
    f_mirror_signature = f'reports/mirror_signature-{source}.tsv'
    f_initial_inclusions = f'tmp/{source}_relevant_signature.txt'
    f_exclusions = f'reports/{source}_term_exclusions.txt'
    f_deprecated = f'reports/mirror_deprecated-{source}.tsv'

    # Load the mirror signature file and get list of all mirror signature CURIEs
    df_signature = pd.read_csv(f_mirror_signature, skiprows=1, names=['iri'])

    df_signature['iri_clean'] = df_signature['iri'].str.strip('<>')
    df_signature['curie'] = df_signature['iri_clean'].apply(extract_curie)
    print(f"LEN - df_signature: {len(df_signature)}")
    
    df_signature_clean = df_signature.dropna(subset=['curie']) # Keep only rows where a CURIE was extracted
    print(f"LEN - df_signature_clean: {len(df_signature_clean)}")
    
    signature_all = set(df_signature_clean['curie'].dropna())
    # print("\n** signature_all")
    # print(list(signature_all)[:5])


    # Load the initial inclusion file
    df_initial_inclusions = pd.read_csv(f_initial_inclusions, skiprows=1, names=['curie_include'])
    # Strip any angle brackets before converting, so bracketed IRIs still match a known prefix
    df_initial_inclusions['curie_include'] = df_initial_inclusions['curie_include'].str.strip('<>').apply(extract_curie)
    signature_initial_inclusions = set(df_initial_inclusions['curie_include'].dropna())
    # print('\n** signature_initial_inclusions')
    # print(list(signature_initial_inclusions)[:5])


    # Load the term exclusions file (manual exclusions)
    df_manual_exclusions = pd.read_csv(f_exclusions, header=None, names=['curie_exclude'])
    signature_manual_exclusions = set(df_manual_exclusions['curie_exclude'].dropna())
    # print('\n** signature_manual_exclusions')
    # print(list(signature_manual_exclusions)[:5])



    # Query OAK to get obsoletes    
    db_path = f"components/{source}.db"
    output_file = f"{source}_obsoletes.txt"
    # print(f"Looking for database: {db_path}")
    !runoak -i $db_path obsoletes > $output_file
    
    # Load obsoletes file
    df_obsoletes = pd.read_csv(output_file, sep='\t', skiprows=1, names=['curie', 'reason'])
    # Filter to keep only obsolete terms with source prefix
    prefix_mapping = {
        'omim': 'omim',
        'ordo': 'orphanet',
        'doid': 'doid',
        'ncit': 'ncit',
        'icd10cm': 'icd10cm',
        'icd11foundation': 'icd11.foundation'  # key must match the `source` value passed to compute_stats
    }
    prefix = prefix_mapping.get(source, '') 
    df_obsoletes_filtered = df_obsoletes[
        df_obsoletes['curie'].str.lower().str.startswith(f'{prefix}')
    ]
    
    signature_source_obsoletes = set(df_obsoletes_filtered['curie'].dropna().unique())
    # print('\n** signature_source_obsoletes')
    # print(list(signature_source_obsoletes)[:5])
    
    # Process Mondo obsoletes
    signature_mondo_obsoletes = set(obs_mondo_df['?mondo_curie'].dropna().unique())
    # print('\n** signature_mondo_obsoletes')
    # print(list(signature_mondo_obsoletes)[:5])
    

    # Compute the various sets
    non_inclusion_set = set(signature_all - signature_initial_inclusions)
    exclusion_set = non_inclusion_set | signature_manual_exclusions | signature_source_obsoletes | signature_mondo_obsoletes
    inclusion_set = signature_initial_inclusions - exclusion_set
    
    # Restrict the mappings passed in (already filtered to skos:exactMatch) to the included source terms
    df_mapped = mondo_sssom_df[mondo_sssom_df['object_id'].isin(inclusion_set)].copy()
    mapped_set = set(df_mapped['object_id'].dropna().unique())
    non_mapped = inclusion_set - mapped_set
    # print(f"\n** First five IDs that have not been mapped yet: {list(non_mapped)[0:5]}")
    
    
    all_count = len(signature_all)
    excluded_count = len(exclusion_set)
    included_count = len(inclusion_set)
    initial_inclusion_count = len(signature_initial_inclusions)
    initial_inclusion_percent = (initial_inclusion_count / all_count) * 100 if all_count > 0 else 0
    included_percent = (included_count / initial_inclusion_count) * 100 if initial_inclusion_count > 0 else 0
    mapped_count  = len(mapped_set)
    mapped_percent = (mapped_count / included_count) * 100 if included_count > 0 else 0
    print(f"\nFinal Count of {source}:")
    print(f"  Mirror Signature [ALL]: {all_count}")
    print(f"  Initial Inclusions [IN-SCOPE]: {initial_inclusion_count} ({initial_inclusion_percent:.2f}%)")
    print(f"  Excluded Terms [EXCLUDED]: {excluded_count}")
    print(f"  Included Terms [DISEASE]: {included_count} ({included_percent:.2f}%)")
    print(f"  Mapped Terms [MAPPED]: {mapped_count} ({mapped_percent:.2f}%)")
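
As a quick illustration of what `extract_curie` does, the following standalone sketch applies the same prefix-matching idea to a two-entry mapping. `iri_to_curie` and `PREFIXES` are hypothetical names for this sketch. Unlike the cell above, it returns `None` for unmatched input, strips angle brackets itself, and swaps only the leading prefix by slicing.

```python
# Two prefixes copied from the mapping used in extract_curie.
PREFIXES = {
    "http://purl.obolibrary.org/obo/MONDO_": "MONDO:",
    "https://omim.org/entry/": "OMIM:",
}


def iri_to_curie(iri):
    """Return a CURIE for a known IRI prefix, or None if no prefix matches."""
    if not isinstance(iri, str):
        return None
    iri = iri.strip("<>")
    for iri_prefix, curie_prefix in PREFIXES.items():
        if iri.startswith(iri_prefix):
            # Replace only the leading prefix, leaving the local ID untouched.
            return curie_prefix + iri[len(iri_prefix):]
    return None


print(iri_to_curie("<http://purl.obolibrary.org/obo/MONDO_0000001>"))  # MONDO:0000001
print(iri_to_curie("https://example.org/unknown/1"))  # None
```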



In [7]:
# OMIM Alignment
source = 'omim'
compute_stats(source, mondo_sssom_skosExactMatch_df)

LEN - df_signature: 36276
LEN - df_signature_clean: 29721

Final Count of omim:
  Mirror Signature [ALL]: 29721
  Initial Inclusions [IN-SCOPE]: 12173 (40.96%)
  Excluded Terms [EXCLUDED]: 24785
  Included Terms [DISEASE]: 8855 (72.74%)
  Mapped Terms [MAPPED]: 8841 (99.84%)
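
Each percentage uses the previous stage as its denominator: IN-SCOPE is relative to ALL, DISEASE is relative to IN-SCOPE, and MAPPED is relative to DISEASE. The OMIM figures above can be reproduced with a few lines; the `pct` helper is a hypothetical name for this sketch and mirrors the zero-guard used in `compute_stats`.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, or 0 when `whole` is not positive."""
    return (part / whole) * 100 if whole > 0 else 0


# Counts taken from the OMIM output above.
all_count, in_scope, disease, mapped = 29721, 12173, 8855, 8841

print(f"{pct(in_scope, all_count):.2f}%")  # 40.96%
print(f"{pct(disease, in_scope):.2f}%")    # 72.74%
print(f"{pct(mapped, disease):.2f}%")      # 99.84%
```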


In [8]:
# Orphanet Alignment
source = 'ordo'
compute_stats(source, mondo_sssom_skosExactMatch_df)

LEN - df_signature: 15579
LEN - df_signature_clean: 15578

Final Count of ordo:
  Mirror Signature [ALL]: 15578
  Initial Inclusions [IN-SCOPE]: 11096 (71.23%)
  Excluded Terms [EXCLUDED]: 10247
  Included Terms [DISEASE]: 9268 (83.53%)
  Mapped Terms [MAPPED]: 9195 (99.21%)


In [9]:
# DOID Alignment
source = 'doid'
compute_stats(source, mondo_sssom_skosExactMatch_df)

LEN - df_signature: 19114
LEN - df_signature_clean: 14377

Final Count of doid:
  Mirror Signature [ALL]: 14377
  Initial Inclusions [IN-SCOPE]: 11858 (82.48%)
  Excluded Terms [EXCLUDED]: 6606
  Included Terms [DISEASE]: 11683 (98.52%)
  Mapped Terms [MAPPED]: 11442 (97.94%)


In [10]:
# NCIT Alignment
source = 'ncit'
compute_stats(source, mondo_sssom_skosExactMatch_df)

LEN - df_signature: 185902
LEN - df_signature_clean: 185902

Final Count of ncit:
  Mirror Signature [ALL]: 185902
  Initial Inclusions [IN-SCOPE]: 15750 (8.47%)
  Excluded Terms [EXCLUDED]: 179332
  Included Terms [DISEASE]: 15745 (99.97%)
  Mapped Terms [MAPPED]: 3687 (23.42%)


In [11]:
# ICD11 Alignment
source = 'icd11foundation'
compute_stats(source, mondo_sssom_skosExactMatch_df)

LEN - df_signature: 101135
LEN - df_signature_clean: 101134

Final Count of icd11foundation:
  Mirror Signature [ALL]: 101134
  Initial Inclusions [IN-SCOPE]: 57872 (57.22%)
  Excluded Terms [EXCLUDED]: 52950
  Included Terms [DISEASE]: 52097 (90.02%)
  Mapped Terms [MAPPED]: 4104 (7.88%)


In [12]:
# ICD10CM Alignment
source = 'icd10cm'
compute_stats(source, mondo_sssom_skosExactMatch_df)

LEN - df_signature: 95974
LEN - df_signature_clean: 95847

Final Count of icd10cm:
  Mirror Signature [ALL]: 95847
  Initial Inclusions [IN-SCOPE]: 95847 (100.00%)
  Excluded Terms [EXCLUDED]: 19364
  Included Terms [DISEASE]: 80395 (83.88%)
  Mapped Terms [MAPPED]: 1121 (1.39%)
