## Alignment Stats


### Description

This notebook reviews the counts used in the Mapping Progress Report in the mondo-ingest repo.

These calculations are based on the slides [here](https://docs.google.com/presentation/d/14Xmcfwe9xhoGG0ZnDbFbwzKqHWHQR4OaOF4W7II2BaM/edit?slide=id.p#slide=id.p).

- DISEASE is generally calculated as the mirror signature minus the union of all excluded terms and all obsolete terms from the ingest source.
  - The mirror signature file is `reports/mirror_signature-<ONTOLOGY_NAME>.tsv`.
  - The set of all excluded terms is from `reports/<ONTOLOGY_NAME>_term_exclusions.txt`.
  - The set of obsolete source terms was created by querying the OAK ontology database for obsolete terms.
  - IRIs were converted to CURIEs as needed.
  - The final count only includes CURIEs from the base namespace of the ingest source.
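
The DISEASE definition above can be sketched with plain Python sets. This is a minimal illustration on made-up CURIEs, not the notebook's actual pipeline; the function name `disease_set` and the `EXAMPLE:` and `OTHER:` prefixes are hypothetical.

```python
def disease_set(signature, exclusions, source_obsoletes, prefixes):
    """Return signature terms in the given namespaces that are neither excluded nor obsolete."""
    # Keep only CURIEs whose prefix belongs to the source's base namespace
    in_namespace = {c for c in signature if c.split(":", 1)[0] in prefixes}
    # Subtract the union of excluded and obsolete terms
    return in_namespace - (set(exclusions) | set(source_obsoletes))


# Toy data (hypothetical CURIEs, for illustration only)
signature = {"EXAMPLE:1", "EXAMPLE:2", "EXAMPLE:3", "EXAMPLE:4", "OTHER:9"}
exclusions = {"EXAMPLE:2"}
source_obsoletes = {"EXAMPLE:3", "EXAMPLE:99"}

disease = disease_set(signature, exclusions, source_obsoletes, prefixes={"EXAMPLE"})
print(sorted(disease))  # ['EXAMPLE:1', 'EXAMPLE:4']
```

Excluded or obsolete terms that are not in the signature (here `EXAMPLE:99`) have no effect on the result, because the subtraction starts from the signature.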


- MAPPED is calculated by starting with the `mondo.sssom.tsv` file, filtering to only `skos:exactMatch` rows, and then removing any rows that have an obsolete Mondo term or an obsolete source ingest term.
QUESTION: Should source ingest terms that appear in `mondo.sssom.tsv` but are also in the source's term exclusion file be removed as well?
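
The MAPPED filter described above could be expressed as one reusable pandas helper. The sketch below runs on a toy table; the function name `mapped_rows` and all IDs are hypothetical, and only the column names (`subject_id`, `predicate_id`, `object_id`) are taken from `mondo.sssom.tsv`.

```python
import pandas as pd


def mapped_rows(sssom_df, prefix, mondo_obsoletes, source_obsoletes):
    """Keep exact-match rows for one source, dropping rows with an obsolete term on either side."""
    is_exact = sssom_df["predicate_id"] == "skos:exactMatch"
    is_source = sssom_df["object_id"].str.startswith(prefix + ":", na=False)
    subject_obsolete = sssom_df["subject_id"].isin(mondo_obsoletes)
    object_obsolete = sssom_df["object_id"].isin(source_obsoletes)
    return sssom_df[is_exact & is_source & ~subject_obsolete & ~object_obsolete]


# Toy mapping table (hypothetical IDs, for illustration only)
toy_sssom = pd.DataFrame(
    {
        "subject_id": ["MONDO:1", "MONDO:2", "MONDO:3", "MONDO:4", "MONDO:5"],
        "predicate_id": [
            "skos:exactMatch",
            "skos:exactMatch",
            "skos:broadMatch",
            "skos:exactMatch",
            "skos:exactMatch",
        ],
        "object_id": ["EXAMPLE:1", "EXAMPLE:2", "EXAMPLE:3", "EXAMPLE:4", "OTHER:5"],
    }
)

mapped = mapped_rows(
    toy_sssom,
    prefix="EXAMPLE",
    mondo_obsoletes={"MONDO:2"},
    source_obsoletes={"EXAMPLE:4"},
)
print(mapped["object_id"].tolist())  # ['EXAMPLE:1']
```

The OMIM cell in this notebook matches on the bare prefix `'OMIM'` so that `OMIMPS` rows are included as well. This sketch matches `prefix + ':'`, so it would need a second call (or a list of prefixes) to cover that case.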


NOTE: The files of obsolete terms were created using the pattern `runoak -i components/icd10cm.db obsoletes > icd10cm_obsoletes.txt`.
All input files come from the most recent data build, so they were created in late April 2025.

In [102]:
# Imports
import pandas as pd

pd.set_option('display.width', 1000)

## Get obsolete Mondo terms

In [103]:
# Query for obsolete Mondo terms and any xrefs for the term that exist

!robot query -i tmp/mondo.owl -q ../sparql/mondo_obsoletes_and_xrefs.sparql tmp/mondo_obsoletes_and_xrefs.tsv

In [104]:
obs_mondo_df = pd.read_csv('tmp/mondo_obsoletes_and_xrefs.tsv', sep='\t')

obs_mondo_df.head()

Unnamed: 0,?mondo_curie,?xref,?xref_source,?mondoSources
0,MONDO:0000002,,,
1,MONDO:0000003,,,
2,MONDO:0000006,,,
3,MONDO:0000007,,,
4,MONDO:0000008,,,


## Get Mondo Mapping file 
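
The hard-coded `skiprows=51` used to read `mondo.sssom.tsv` in this notebook depends on the current length of the file's metadata header. A minimal sketch of a more tolerant reader follows. It assumes the metadata is a block of leading lines that each start with `#`, as in the SSSOM TSV convention; the function name `read_sssom_table` and the toy content are hypothetical.

```python
import io

import pandas as pd


def read_sssom_table(handle):
    """Read an SSSOM TSV, skipping however many leading '#' metadata lines it has."""
    lines = handle.read().splitlines(keepends=True)
    n_header = 0
    for line in lines:
        if not line.startswith("#"):
            break
        n_header += 1
    # Parse only the tabular part that follows the metadata block
    return pd.read_csv(io.StringIO("".join(lines[n_header:])), sep="\t")


# Toy file content (hypothetical, for illustration only)
toy_text = (
    "#curie_map:\n"
    "#  MONDO: http://purl.obolibrary.org/obo/MONDO_\n"
    "subject_id\tpredicate_id\tobject_id\n"
    "MONDO:1\tskos:exactMatch\tEXAMPLE:1\n"
)
toy_df = read_sssom_table(io.StringIO(toy_text))
print(list(toy_df.columns), len(toy_df))
```

With a real file this would be called as `read_sssom_table(open('tmp/mondo.sssom.tsv'))`. Counting the header lines is used here instead of `comment='#'` because that option would also cut off any data row that happens to contain a `#`.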

In [105]:
# Read in mondo.sssom.tsv 

mondo_sssom_df = pd.read_csv('tmp/mondo.sssom.tsv', sep='\t', skiprows=51)

print(len(mondo_sssom_df))

106545


In [106]:
# Filter mondo_sssom_df to only rows where the predicate_id is skos:exactMatch

mondo_sssom_skosExactMatch_df = mondo_sssom_df[mondo_sssom_df['predicate_id'] == 'skos:exactMatch']

mondo_sssom_skosExactMatch_df.head()
print(len(mondo_sssom_skosExactMatch_df))

106468


---
# Calculate DISEASE
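
Most of the per-source `extract_*_curie` functions in this section follow the same pattern: strip the angle brackets, then swap a known IRI prefix for a CURIE prefix. A minimal sketch of a single table-driven helper is shown here. The name `iri_to_curie` and the `IRI_PREFIXES` mapping are hypothetical, although the IRI prefixes themselves are the ones used in this notebook. OMIM and ICD10CM have extra rules that this sketch does not cover.

```python
# Hypothetical prefix map; the IRI prefixes are the ones that appear in this notebook
IRI_PREFIXES = {
    "http://www.orpha.net/ORDO/Orphanet_": "Orphanet:",
    "http://purl.obolibrary.org/obo/DOID_": "DOID:",
    "http://purl.obolibrary.org/obo/NCIT_": "NCIT:",
    "http://id.who.int/icd/entity/": "icd11.foundation:",
}


def iri_to_curie(iri, prefix_map=IRI_PREFIXES):
    """Return the CURIE for a known IRI, or None if no prefix matches."""
    if not isinstance(iri, str):
        return None
    # Remove surrounding whitespace and the enclosing '<' and '>' if present
    bare = iri.strip().strip("<>")
    for iri_prefix, curie_prefix in prefix_map.items():
        if bare.startswith(iri_prefix):
            return curie_prefix + bare[len(iri_prefix):]
    return None


print(iri_to_curie("<http://www.orpha.net/ORDO/Orphanet_100000>"))    # Orphanet:100000
print(iri_to_curie("http://purl.obolibrary.org/obo/NCIT_C1000"))      # NCIT:C1000
print(iri_to_curie("<http://purl.obolibrary.org/obo/CHEBI_102166>"))  # None
```

Returning `None` for unknown namespaces plays the same role as the `pd.NA` returned by the notebook's functions: those rows can then be dropped with `dropna`.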

### OMIM

In [107]:
# OMIM Alignment

# Function to extract CURIE from IRI
def extract_omim_curie(iri):
    if pd.notna(iri):
        match = pd.NA
        if 'https://omim.org/entry/' in iri:
            match = iri.replace('https://omim.org/entry/', 'OMIM:')
        elif 'https://omim.org/phenotypicSeries/PS' in iri:
            match = iri.replace('https://omim.org/phenotypicSeries/', 'OMIMPS:')
        elif 'http://purl.obolibrary.org/obo/OMIMPS_' in iri:
            match = iri.replace('http://purl.obolibrary.org/obo/OMIMPS_', 'OMIMPS:')
        return match
    return pd.NA


# Load the mirror signature file and get list of all mirror signature CURIEs
omim_df_signature = pd.read_csv('reports/mirror_signature-omim.tsv', skiprows=1, names=['iri'])
print(f"LEN - omim_df_signature: {len(omim_df_signature)}")

omim_df_signature['curie'] = omim_df_signature['iri'].apply(extract_omim_curie).str.replace('>', '').str.replace('<', '') # Strip the enclosing '<' and '>'
print(f"LEN - omim_df_signature: {len(omim_df_signature)}")

omim_df_signature_clean = omim_df_signature.dropna(subset=['curie']) # Keep only rows where a CURIE was extracted
print(f"LEN - omim_df_signature_clean: {len(omim_df_signature_clean)}")

signature_all = set(omim_df_signature_clean['curie'].dropna())
print("\n** signature_all")
print(list(signature_all)[:5])


# Load the initial inclusion file (the relevant signature)
omim_df_initial_inclusions = pd.read_csv('tmp/omim_relevant_signature.txt', skiprows=1, names=['curie_include'])
omim_df_initial_inclusions['curie_include'] = omim_df_initial_inclusions['curie_include'].apply(extract_omim_curie).str.replace('>', '').str.replace('<', '')
signature_initial_inclusions = set(omim_df_initial_inclusions['curie_include'].dropna())
print('\n** signature_initial_inclusions')
print(list(signature_initial_inclusions)[:5])


# Load the term exclusions file (manual exclusions)
omim_df_manual_exclusions = pd.read_csv('reports/omim_term_exclusions.txt', header=None, names=['curie_exclude'])
signature_manual_exclusions = set(omim_df_manual_exclusions['curie_exclude'].dropna())
print('\n** signature_manual_exclusions')
print(list(signature_manual_exclusions)[:5])


# Query OAK to get obsoletes
!runoak -i components/omim.db obsoletes > omim_obsoletes.txt
# Load obsoletes file
omim_df_obsoletes = pd.read_csv('omim_obsoletes.txt', sep='\t', skiprows=1, names=['curie', 'reason'])
# Filter to keep only OMIM | OMIMPS rows
omim_df_obsoletes_filtered = omim_df_obsoletes[
    omim_df_obsoletes['curie'].str.startswith('OMIM') | omim_df_obsoletes['curie'].str.startswith('OMIMPS')
]
signature_source_obsoletes = set(omim_df_obsoletes_filtered['curie'].dropna().unique())
print('\n** signature_source_obsoletes')
print(list(signature_source_obsoletes)[:5])


# Process Mondo obsoletes
# df_mondo_obsoletes = obs_mondo_df[obs_mondo_df['exact_match'].str.startswith('OMIM')].copy()
signature_mondo_obsoletes = set(obs_mondo_df['?mondo_curie'].dropna().unique())
print('\n** signature_mondo_obsoletes')
print(list(signature_mondo_obsoletes)[:5])


# Compute the various sets
non_inclusion_set = set(signature_all - signature_initial_inclusions)
exclusion_set = non_inclusion_set | signature_manual_exclusions | signature_source_obsoletes | signature_mondo_obsoletes
inclusion_set = signature_initial_inclusions - exclusion_set


# Load mondo.sssom.tsv
df_mapped = mondo_sssom_skosExactMatch_df[mondo_sssom_skosExactMatch_df['object_id'].isin(inclusion_set)].copy()
mapped_set = set(df_mapped['object_id'].dropna().unique())
non_mapped = inclusion_set - mapped_set
print(f"\n** First five IDs that have not been mapped yet: {list(non_mapped)[0:5]}")




all_count = len(signature_all)
excluded_count = len(exclusion_set)
included_count = len(inclusion_set)
initial_inclusion_count = len(signature_initial_inclusions)
initial_inclusion_percent = (initial_inclusion_count / all_count) * 100 if all_count > 0 else 0
included_percent = (included_count / initial_inclusion_count) * 100 if initial_inclusion_count > 0 else 0
mapped_count  = len(mapped_set)
mapped_percent = (mapped_count / included_count) * 100 if included_count > 0 else 0
print(f"\nFinal Count of OMIM:")
print(f"  Mirror Signature [ALL]: {all_count}")
print(f"  Initial Inclusions [IN-SCOPE]: {initial_inclusion_count} ({initial_inclusion_percent:.2f}%)")
print(f"  Excluded Terms [EXCLUDED]: {excluded_count}")
print(f"  Included Terms [DISEASE]: {included_count} ({included_percent:.2f}%)")
print(f"  Mapped Terms [MAPPED]: {mapped_count} ({mapped_percent:.2f}%)")


LEN - omim_df_signature: 36276
LEN - omim_df_signature: 36276
LEN - omim_df_signature_clean: 29721

** signature_all
['OMIM:118230', 'OMIM:619972', 'OMIM:620758', 'OMIM:604527', 'OMIM:618472']

** signature_initial_inclusions
['OMIM:609812', 'OMIM:609750', 'OMIM:600121', 'OMIM:118230', 'OMIM:619972']

** signature_manual_exclusions
['OMIM:607717', 'OMIM:606704', 'OMIM:118230', 'OMIM:604512', 'OMIM:608110']

** signature_source_obsoletes
['OMIM:213002', 'OMIM:605291', 'OMIM:233500', 'OMIM:600406', 'OMIM:126140']

** signature_mondo_obsoletes
['MONDO:0001619', 'MONDO:0004676', 'MONDO:0008351', 'MONDO:0018191', 'MONDO:0016361']

** First five IDs that have not been mapped yet: ['OMIMPS:PS600669', 'OMIMPS:PS245480', 'OMIMPS:PS600513', 'OMIMPS:PS137950', 'OMIMPS:PS618332']

Final Count of OMIM:
  Mirror Signature [ALL]: 29721
  Initial Inclusions [IN-SCOPE]: 12173 (40.96%)
  Excluded Terms [EXCLUDED]: 24785
  Included Terms [DISEASE]: 8856 (72.75%)
  Mapped Terms [MAPPED]: 8248 (93.13%)
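
In the output above, EXCLUDED plus DISEASE is larger than ALL. That follows from how `exclusion_set` is built: it is a union that can contain IDs outside the mirror signature, for example the `MONDO:` CURIEs in `signature_mondo_obsoletes`. A minimal sketch on toy sets (hypothetical CURIEs) shows the effect, and how intersecting the exclusions with the signature gives counts that partition ALL. This is one possible convention, not necessarily the one the report should use, and it assumes the initial inclusions are a subset of the signature.

```python
# Toy sets (hypothetical CURIEs, for illustration only)
signature_all = {"EXAMPLE:1", "EXAMPLE:2", "EXAMPLE:3", "EXAMPLE:4"}
initial_inclusions = {"EXAMPLE:1", "EXAMPLE:2", "EXAMPLE:3"}
manual_exclusions = {"EXAMPLE:2", "EXAMPLE:77"}  # EXAMPLE:77 is not in the signature
source_obsoletes = {"EXAMPLE:88"}                # not in the signature either
mondo_obsoletes = {"MONDO:0000002"}              # a different namespace entirely

# Same construction as the OMIM cell
non_inclusion = signature_all - initial_inclusions
exclusion_set = non_inclusion | manual_exclusions | source_obsoletes | mondo_obsoletes
inclusion_set = initial_inclusions - exclusion_set

# Restrict the exclusions to terms that are actually in the signature
excluded_in_signature = exclusion_set & signature_all

print(len(signature_all), len(exclusion_set), len(inclusion_set), len(excluded_in_signature))
# 4 5 2 2
```

Here the unrestricted exclusion set has 5 members for a signature of only 4, while the restricted count (2) and the included count (2) add up to the signature size.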


### Orphanet

In [21]:
# Load the mirror signature file
orphanet_df_signature = pd.read_csv('reports/mirror_signature-ordo.tsv', header=None, names=['iri'])
print("Mirror Signature (ORDO) Head:")
print(orphanet_df_signature.head(3))

# Load the term exclusions file
orphanet_df_exclusions = pd.read_csv('reports/ordo_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nORDO Term Exclusions Head:")
print(orphanet_df_exclusions.head(3))


# Query for obsoletes
!runoak -i components/ordo.db obsoletes > orphanet_obsoletes.txt

# Load the obsolete terms file (assuming tab-separated with a header)
orphanet_df_obsoletes = pd.read_csv('orphanet_obsoletes.txt', sep='\t')
print("\nOrphanet Obsoletes Head:")
print(orphanet_df_obsoletes.head(3))

# Function to extract Orphanet CURIE from IRI
def extract_orphanet_curie(iri):
    if pd.notna(iri):
        if 'http://www.orpha.net/ORDO/Orphanet_' in iri:
            return iri.replace('<http://www.orpha.net/ORDO/Orphanet_', 'Orphanet:')
    return pd.NA

# Apply the extraction function
orphanet_df_signature['curie'] = orphanet_df_signature['iri'].apply(extract_orphanet_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", orphanet_df_signature['curie'].head())

# Drop rows where CURIE extraction failed (non-Orphanet IRIs)
orphanet_df_signature_clean = orphanet_df_signature.dropna(subset=['curie'])
print("\nSignature DataFrame with Orphanet CURIEs (Head):")
print(orphanet_df_signature_clean.head(3))

# Prepare exclusion set
orphanet_exclude_set = set(orphanet_df_exclusions['curie_exclude'].dropna())
print("\nSize of Orphanet Exclusion Set:", len(orphanet_exclude_set))

# Prepare obsolete set (assuming the 'id' column contains Orphanet CURIEs)
orphanet_obsoletes_set = set(orphanet_df_obsoletes.loc[orphanet_df_obsoletes['id'].str.startswith('Orphanet:', na=False), 'id'])
print("Size of Orphanet Obsoletes Set:", len(orphanet_obsoletes_set))

# Filter the signature DataFrame
df_filtered_orphanet = orphanet_df_signature_clean[
    ~orphanet_df_signature_clean['curie'].isin(orphanet_exclude_set) &
    ~orphanet_df_signature_clean['curie'].isin(orphanet_obsoletes_set)
]

# Get the final count
final_count_ordo = len(df_filtered_orphanet)
print("\nFinal Count of Orphanet CURIEs after filtering:", final_count_ordo)

Mirror Signature (ORDO) Head:
                                           iri
0                                        ?term
1  <http://www.orpha.net/ORDO/Orphanet_100000>
2  <http://www.orpha.net/ORDO/Orphanet_100001>

ORDO Term Exclusions Head:
     curie_exclude
0  Orphanet:100039
1  Orphanet:100040
2  Orphanet:100041

Orphanet Obsoletes Head:
                id                                              label
0  Orphanet:100039                 Familial pseudohyperkalemia type 1
1  Orphanet:100040       OBSOLETE: Familial pseudohyperkalemia type 2
2  Orphanet:100041  OBSOLETE: Familial pseudohyperkalemia, Cardiff...

**TEST 0               <NA>
1    Orphanet:100000
2    Orphanet:100001
3    Orphanet:100002
4    Orphanet:100003
Name: curie, dtype: object

Signature DataFrame with Orphanet CURIEs (Head):
                                           iri            curie
1  <http://www.orpha.net/ORDO/Orphanet_100000>  Orphanet:100000
2  <http://www.orpha.net/ORDO/Orphanet_100001>  Orphan

### DOID

In [22]:
# Load the mirror signature file
doid_df_signature = pd.read_csv('reports/mirror_signature-doid.tsv', header=None, names=['iri'])
print("\nDOID Mirror Signature Head:")
print(doid_df_signature.head(3))

# Load the term exclusions file
doid_df_exclusions = pd.read_csv('reports/doid_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nDOID Term Exclusions Head:")
print(doid_df_exclusions.head(3))

# Query for obsoletes
!runoak -i components/doid.db obsoletes > doid_obsoletes.txt

# Load the obsolete terms file
doid_df_obsoletes = pd.read_csv('doid_obsoletes.txt', sep='\t')
print("\nDOID Obsoletes Head:")
print(doid_df_obsoletes.head(3))

# Function to extract DOID CURIE from IRI
def extract_doid_curie(iri):
    if pd.notna(iri):
        if 'http://purl.obolibrary.org/obo/DOID_' in iri:
            return iri.replace('<http://purl.obolibrary.org/obo/DOID_', 'DOID:')
    return pd.NA

# doid_df_signature['curie'] = doid_df_signature['iri'].apply(extract_doid_curie)
doid_df_signature['curie'] = doid_df_signature['iri'].apply(extract_doid_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", doid_df_signature['curie'].head())

doid_df_signature_clean = doid_df_signature.dropna(subset=['curie'])
print("\n**TEST", doid_df_signature_clean['curie'].head())

doid_exclude_set = set(doid_df_exclusions['curie_exclude'].dropna())
print("\nSize of DOID Exclusion Set:", len(doid_exclude_set))

doid_obsoletes_set = set(doid_df_obsoletes.loc[doid_df_obsoletes['id'].str.startswith('DOID:', na=False), 'id'])
print("Size of DOID Obsoletes Set:", len(doid_obsoletes_set))

doid_df_filtered = doid_df_signature_clean[
    ~doid_df_signature_clean['curie'].isin(doid_exclude_set) &
    ~doid_df_signature_clean['curie'].isin(doid_obsoletes_set)
]

final_count_doid = len(doid_df_filtered)
print("\nFinal Count of DOID CURIEs after filtering:", final_count_doid)


DOID Mirror Signature Head:
                                             iri
0                                          ?term
1  <http://purl.obolibrary.org/obo/CHEBI_102166>
2  <http://purl.obolibrary.org/obo/CHEBI_103210>

DOID Term Exclusions Head:
  curie_exclude
0  DOID:0040001
1  DOID:0040002
2  DOID:0040003

DOID Obsoletes Head:
             id                                              label
0  DOID:0050001   obsolete Actinomadura madurae infectious disease
1  DOID:0050002  obsolete Actinomadura pelletieri infectious di...
2  DOID:0050003  obsolete Streptomyces somaliensis infectious d...

**TEST 0    <NA>
1    <NA>
2    <NA>
3    <NA>
4    <NA>
Name: curie, dtype: object

**TEST 609    DOID:0001816
610    DOID:0002116
611    DOID:0014667
612    DOID:0040001
613    DOID:0040002
Name: curie, dtype: object

Size of DOID Exclusion Set: 2675
Size of DOID Obsoletes Set: 2502

Final Count of DOID CURIEs after filtering: 11683


### NCIT

In [23]:
# Load the mirror signature file
ncit_df_signature = pd.read_csv('reports/mirror_signature-ncit.tsv', header=None, names=['iri'])
print("\nNCIT Mirror Signature Head:")
print(ncit_df_signature.head(3))

# Load the term exclusions file
ncit_df_exclusions = pd.read_csv('reports/ncit_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nNCIT Term Exclusions Head:")
print(ncit_df_exclusions.head(3))


# Query for obsoletes
!runoak -i components/ncit.db obsoletes > ncit_obsoletes.txt

# Load the obsolete terms file
ncit_df_obsoletes = pd.read_csv('ncit_obsoletes.txt', sep='\t')
print("\nNCIT Obsoletes Head:")
print(ncit_df_obsoletes.head(3))

# Function to extract NCIT CURIE from IRI
def extract_ncit_curie(iri):
    if pd.notna(iri):
        if 'http://purl.obolibrary.org/obo/NCIT_' in iri:
            return iri.replace('http://purl.obolibrary.org/obo/NCIT_', 'NCIT:')
    return pd.NA

ncit_df_signature['curie'] = ncit_df_signature['iri'].apply(extract_ncit_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", ncit_df_signature['curie'].head())

ncit_df_signature_clean = ncit_df_signature.dropna(subset=['curie'])
print("\nSignature DataFrame with NCIT CURIEs (Head):")
print(ncit_df_signature_clean.head())

ncit_exclude_set = set(ncit_df_exclusions['curie_exclude'].dropna())
print("\nSize of NCIT Exclusion Set:", len(ncit_exclude_set))

ncit_obsoletes_set = set(ncit_df_obsoletes.loc[ncit_df_obsoletes['id'].str.startswith('NCIT:', na=False), 'id'])
print("Size of NCIT Obsoletes Set:", len(ncit_obsoletes_set))

ncit_df_filtered = ncit_df_signature_clean[
    ~ncit_df_signature_clean['curie'].isin(ncit_exclude_set) &
    ~ncit_df_signature_clean['curie'].isin(ncit_obsoletes_set)
]

final_count_ncit = len(ncit_df_filtered)
print("\nFinal Count of NCIT CURIEs after filtering:", final_count_ncit)


NCIT Mirror Signature Head:
                                          iri
0                                       ?term
1   http://purl.obolibrary.org/obo/NCIT_C1000
2  http://purl.obolibrary.org/obo/NCIT_C10000

NCIT Term Exclusions Head:
  curie_exclude
0   NCIT:124251
1   NCIT:134528
2   NCIT:134530

NCIT Obsoletes Head:
             id                                            label
0  NCIT:C100067                   Coronary Reperfusion Procedure
1  NCIT:C100421  Activated PTT to Standard PTT Ratio Measurement
2  NCIT:C100426                   Beta-Trace Protein Measurement

**TEST 0            <NA>
1      NCIT:C1000
2     NCIT:C10000
3    NCIT:C100000
4    NCIT:C100001
Name: curie, dtype: object

Signature DataFrame with Orphanet CURIEs (Head):
                                           iri         curie
1    http://purl.obolibrary.org/obo/NCIT_C1000    NCIT:C1000
2   http://purl.obolibrary.org/obo/NCIT_C10000   NCIT:C10000
3  http://purl.obolibrary.org/obo/NCIT_C100000  NCIT:C1

### ICD10CM

In [24]:
# Load the mirror signature file
icd10cm_df_signature = pd.read_csv('reports/mirror_signature-icd10cm.tsv', header=None, names=['iri'])
print("ICD10CM Mirror Signature Head:")
print(icd10cm_df_signature.head(3))

# Load the term exclusions file
icd10cm_df_exclusions = pd.read_csv('reports/icd10cm_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nICD10CM Term Exclusions Head:")
print(icd10cm_df_exclusions.head(3))

# Query for obsoletes
!runoak -i components/icd10cm.db obsoletes > icd10cm_obsoletes.txt

# Load the obsolete terms file, handling potential EmptyDataError
try:
    icd10cm_df_obsoletes = pd.read_csv('icd10cm_obsoletes.txt', sep='\t')
    print("\nICD10CM Obsoletes Head:")
    print(icd10cm_df_obsoletes.head(3))
except pd.errors.EmptyDataError:
    print("\nICD10CM Obsoletes file is empty.")
    icd10cm_df_obsoletes = pd.DataFrame() # Create an empty DataFrame

# Function to extract ICD10CM CURIE from IRI
def extract_icd10cm_curie(iri):
    if pd.notna(iri):
        if 'http://purl.bioontology.org/ontology/ICD10CM/' in iri:
            # Extract the code part after the last '/'
            code_part = iri.split('/')[-1]
            # Keep only the part before the first '-' (the whole code if there is none)
            return f"ICD10CM:{code_part.split('-')[0]}"
    return pd.NA

# Apply the extraction function
icd10cm_df_signature['curie'] = icd10cm_df_signature['iri'].apply(extract_icd10cm_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", icd10cm_df_signature['curie'].head())

# Drop rows where CURIE extraction failed (non-ICD10CM IRIs)
icd10cm_df_signature_clean = icd10cm_df_signature.dropna(subset=['curie'])
print("\nSignature DataFrame with ICD10CM CURIEs (Head):")
print(icd10cm_df_signature_clean.head(3))

# Prepare exclusion set
icd10cm_exclude_set = set(icd10cm_df_exclusions['curie_exclude'].dropna())
print("\nSize of ICD10CM Exclusion Set:", len(icd10cm_exclude_set))

# Prepare obsolete set, handling empty DataFrame
if not icd10cm_df_obsoletes.empty and 'id' in icd10cm_df_obsoletes.columns:
    icd10cm_obsoletes_set = set(icd10cm_df_obsoletes.loc[icd10cm_df_obsoletes['id'].str.startswith('ICD10CM:', na=False), 'id'])
else:
    icd10cm_obsoletes_set = set()
print("Size of ICD10CM Obsoletes Set:", len(icd10cm_obsoletes_set))

# Filter the signature DataFrame
icd10cm_df_filtered = icd10cm_df_signature_clean[
    ~icd10cm_df_signature_clean['curie'].isin(icd10cm_exclude_set) &
    ~icd10cm_df_signature_clean['curie'].isin(icd10cm_obsoletes_set)
]

# Get the final count
final_count_icd10cm = len(icd10cm_df_filtered)
print("\nFinal Count of ICD10CM CURIEs after filtering:", final_count_icd10cm)

ICD10CM Mirror Signature Head:
                                                 iri
0                                              ?term
1  <http://purl.bioontology.org/ontology/ICD10CM/...
2  <http://purl.bioontology.org/ontology/ICD10CM/...

ICD10CM Term Exclusions Head:
     curie_exclude
0      ICD10CM:B95
1  ICD10CM:B95-B97
2    ICD10CM:B95.0

ICD10CM Obsoletes file is empty.

**TEST 0             <NA>
1      ICD10CM:A00
2      ICD10CM:A00
3    ICD10CM:A00.0
4    ICD10CM:A00.1
Name: curie, dtype: object

Signature DataFrame with ICD10CM CURIEs (Head):
                                                 iri          curie
1  <http://purl.bioontology.org/ontology/ICD10CM/...    ICD10CM:A00
2  <http://purl.bioontology.org/ontology/ICD10CM/...    ICD10CM:A00
3  <http://purl.bioontology.org/ontology/ICD10CM/...  ICD10CM:A00.0

Size of ICD10CM Exclusion Set: 15452
Size of ICD10CM Obsoletes Set: 0

Final Count of ICD10CM CURIEs after filtering: 80388


### ICD11 Foundation

In [26]:
# Load the mirror signature file
icd11_df_signature = pd.read_csv('reports/mirror_signature-icd11foundation.tsv', header=None, names=['iri'])
print("ICD11 Foundation Mirror Signature Head:")
print(icd11_df_signature.head(3))
print("LEN icd11_df_signature:", len(icd11_df_signature))

# Load the term exclusions file
icd11_df_exclusions = pd.read_csv('reports/icd11foundation_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nICD11 Foundation Term Exclusions Head:")
print(icd11_df_exclusions.head(3))

# Query for obsoletes
!runoak -i components/icd11foundation.db obsoletes > icd11foundation_obsoletes.txt

# Load the obsolete terms file (assuming tab-separated with a header)
icd11_df_obsoletes = pd.read_csv('icd11foundation_obsoletes.txt', sep='\t')
print("\nICD11 Foundation Obsoletes Head:")
print(icd11_df_obsoletes.head(3))

# Function to extract ICD11 Foundation CURIE from IRI
def extract_icd11_curie(iri):
    if pd.notna(iri):
        if 'http://id.who.int/icd/entity/' in iri:
            return iri.replace('<http://id.who.int/icd/entity/', 'icd11.foundation:')
    return pd.NA

# Apply the extraction function
icd11_df_signature['curie'] = icd11_df_signature['iri'].apply(extract_icd11_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", icd11_df_signature['curie'].head())

# Drop rows where CURIE extraction failed (non-ICD11 IRIs)
icd11_df_signature_clean = icd11_df_signature.dropna(subset=['curie'])
print("\nSignature DataFrame with ICD11 Foundation CURIEs (Head):")
print(icd11_df_signature_clean.head(3))

# Prepare exclusion set
icd11_exclude_set = set(icd11_df_exclusions['curie_exclude'].dropna())
print("\nSize of ICD11 Foundation Exclusion Set:", len(icd11_exclude_set))

# Prepare obsolete set (assuming the 'id' column contains ICD11 Foundation CURIEs)
if not icd11_df_obsoletes.empty and 'id' in icd11_df_obsoletes.columns:
    icd11_obsoletes_set = set(icd11_df_obsoletes.loc[icd11_df_obsoletes['id'].str.startswith('icd11.foundation:', na=False), 'id'])
else:
    icd11_obsoletes_set = set()
print("Size of ICD11 Foundation Obsoletes Set:", len(icd11_obsoletes_set))

# Filter the signature DataFrame
icd11_df_filtered = icd11_df_signature_clean[
    ~icd11_df_signature_clean['curie'].isin(icd11_exclude_set) &
    ~icd11_df_signature_clean['curie'].isin(icd11_obsoletes_set)
]

# Get the final count
final_count_icd11 = len(icd11_df_filtered)
print("\nFinal Count of ICD11 Foundation CURIEs after filtering:", final_count_icd11)

ICD11 Foundation Mirror Signature Head:
                                         iri
0                                      ?term
1  <http://id.who.int/icd/entity/1000004774>
2  <http://id.who.int/icd/entity/1000010185>
LEN icd11_df_signature: 101136

ICD11 Foundation Term Exclusions Head:
                 curie_exclude
0  icd11.foundation:1000034337
1  icd11.foundation:1000093173
2  icd11.foundation:1000136681

ICD11 Foundation Obsoletes Head:
                            id                                              label
0  icd11.foundation:1000312374  Recurrent and persistent haematuria : diffuse ...
1  icd11.foundation:1001085090                           Tetraplegia, unspecified
2  icd11.foundation:1002125483  Hypertensive heart and renal disease with both...

**TEST 0                           <NA>
1    icd11.foundation:1000004774
2    icd11.foundation:1000010185
3    icd11.foundation:1000034337
4     icd11.foundation:100006598
Name: curie, dtype: object

Signature DataFrame wi

---
# Calculate MAPPED 

## Process mondo.sssom.tsv file

In [8]:
# Read in mondo.sssom.tsv file

mondo_sssom_df = pd.read_csv('tmp/mondo.sssom.tsv', sep='\t', skiprows=51)

mondo_sssom_df.head()

Unnamed: 0,subject_id,subject_label,predicate_id,object_id,object_label,mapping_justification
0,MONDO:0000001,disease,skos:exactMatch,DOID:4,disease,semapv:UnspecifiedMatching
1,MONDO:0000001,disease,skos:exactMatch,MEDGEN:4347,,semapv:UnspecifiedMatching
2,MONDO:0000001,disease,skos:exactMatch,NCIT:C2991,Disease or Disorder,semapv:UnspecifiedMatching
3,MONDO:0000001,disease,skos:exactMatch,Orphanet:377788,Disease,semapv:UnspecifiedMatching
4,MONDO:0000001,disease,skos:exactMatch,SCTID:64572001,,semapv:UnspecifiedMatching


In [9]:
# Filter mondo_sssom_df to only rows where the predicate_id is skos:exactMatch

mondo_sssom_skosExactMatch_df = mondo_sssom_df[mondo_sssom_df['predicate_id'] == 'skos:exactMatch']

mondo_sssom_skosExactMatch_df.head()

Unnamed: 0,subject_id,subject_label,predicate_id,object_id,object_label,mapping_justification
0,MONDO:0000001,disease,skos:exactMatch,DOID:4,disease,semapv:UnspecifiedMatching
1,MONDO:0000001,disease,skos:exactMatch,MEDGEN:4347,,semapv:UnspecifiedMatching
2,MONDO:0000001,disease,skos:exactMatch,NCIT:C2991,Disease or Disorder,semapv:UnspecifiedMatching
3,MONDO:0000001,disease,skos:exactMatch,Orphanet:377788,Disease,semapv:UnspecifiedMatching
4,MONDO:0000001,disease,skos:exactMatch,SCTID:64572001,,semapv:UnspecifiedMatching


In [10]:
mondo_sssom_skosExactMatch_df.nunique()

subject_id                24215
subject_label             24215
predicate_id                  1
object_id                106468
object_label              37567
mapping_justification         1
dtype: int64

In [11]:
# Read in file of obsolete Mondo terms and any xref annotation the term has
# NOTE: This file is created using the query src/sparql/reports/mondo_obsoletes_and_xrefs.sparql and the output file
# is created as: robot query -i mondo.owl -q ../sparql/reports/mondo_obsoletes_and_xrefs.sparql mondo_obsoletes_and_xrefs.tsv
# See PR https://github.com/monarch-initiative/mondo/pull/9006 for query

# obs_mondo_df = pd.read_csv('~/git/mondo/src/ontology/mondo_obsoletes_and_xrefs.tsv', sep='\t')

# obs_mondo_df.head()



## Get MAPPED count for OMIM

In [12]:
# (1) Use "mondo_sssom_skosExactMatch_df" and filter for all rows with object_id starting with 'OMIM' (remember we want OMIM and OMIMPS).
# (2) Remove all rows where the subject_id is also found in the list of mondo obsolete terms 
# (3) Remove all rows where the object_id is found in omim_df_obsoletes
# Optional (4): Using this count is pending discussion - Remove terms found in "omim_term_exclusions.txt"

omim_mondo_sssom_skosExactMatch_df = mondo_sssom_skosExactMatch_df[
    mondo_sssom_skosExactMatch_df['object_id'].str.startswith('OMIM', na=False)
]

print("(1) omim_mondo_sssom_skosExactMatch_df")
print(omim_mondo_sssom_skosExactMatch_df['subject_id'].nunique())

omim_filtered_obs1_df = omim_mondo_sssom_skosExactMatch_df[~omim_mondo_sssom_skosExactMatch_df['subject_id'].isin(obs_mondo_df['?mondo_curie'])]

print("\n(2) omim_filtered_obs1_df (remove obsolete Mondo terms)")
print(omim_filtered_obs1_df['subject_id'].nunique())

omim_filtered_obs1_obs2_df = omim_filtered_obs1_df[~omim_filtered_obs1_df['object_id'].isin(omim_df_obsoletes['curie'])]

print("\n(3) OMIM MAPPED - omim_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete OMIM terms)")
print(omim_filtered_obs1_obs2_df['subject_id'].nunique())


omim_filtered_obs1_obs2_excl_df = omim_filtered_obs1_obs2_df[~omim_filtered_obs1_obs2_df['object_id'].isin(omim_df_manual_exclusions['curie_exclude'])]

print("\n(4) OMIM MAPPED??? - omim_filtered_obs1_obs2_excl_df (remove obs Mondo & OMIM terms and excluded OMIM terms")
print(omim_filtered_obs1_obs2_excl_df['subject_id'].nunique())

(1) omim_mondo_sssom_skosExactMatch_df
10482

(2) omim_filtered_obs1_df (remove obsolete Mondo terms)
10206

(3) OMIM MAPPED - omim_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete OMIM terms)
10205

(4) OMIM MAPPED??? - omim_filtered_obs1_obs2_excl_df (remove obs Mondo & OMIM terms and excluded OMIM terms
8714


## Get MAPPED count for Orphanet

In [13]:
# (1) Use "mondo_sssom_skosExactMatch_df" and filter for all rows with object_id starting with 'Orphanet'
# (2) Remove all rows where the subject_id is also found in the list of mondo obsolete terms 
# (3) Remove all rows where the object_id is found in orphanet_df_obsoletes
# Optional (4): Using this count is pending discussion - Remove terms found in "ordo_term_exclusions.txt"

orphanet_mondo_sssom_skosExactMatch_df = mondo_sssom_skosExactMatch_df[
    mondo_sssom_skosExactMatch_df['object_id'].str.startswith('Orphanet', na=False)
]

print("(1) orphanet_mondo_sssom_skosExactMatch_df")
print(orphanet_mondo_sssom_skosExactMatch_df['subject_id'].nunique())

orphanet_filtered_obs1_df = orphanet_mondo_sssom_skosExactMatch_df[~orphanet_mondo_sssom_skosExactMatch_df['subject_id'].isin(obs_mondo_df['?mondo_curie'])]

print("\n(2) orphanet_filtered_obs1_df (remove obsolete Mondo terms)")
print(orphanet_filtered_obs1_df['subject_id'].nunique())

orphanet_filtered_obs1_obs2_df = orphanet_filtered_obs1_df[~orphanet_filtered_obs1_df['object_id'].isin(orphanet_df_obsoletes['id'])]

print("\n(3) Orphanet MAPPED - orphanet_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete Orphanet terms)")
print(orphanet_filtered_obs1_obs2_df['subject_id'].nunique())


orphanet_filtered_obs1_obs2_excl_df = orphanet_filtered_obs1_obs2_df[~orphanet_filtered_obs1_obs2_df['object_id'].isin(orphanet_df_exclusions['curie_exclude'])]

print("\n(4) Orphanet MAPPED??? - orphanet_filtered_obs1_obs2_excl_df (remove obs Mondo & Orphanet terms and excluded Orphanet terms")
print(orphanet_filtered_obs1_obs2_excl_df['subject_id'].nunique())

(1) orphanet_mondo_sssom_skosExactMatch_df
9563

(2) orphanet_filtered_obs1_df (remove obsolete Mondo terms)
8200

(3) Orphanet MAPPED - orphanet_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete Orphanet terms)
8200

(4) Orphanet MAPPED??? - orphanet_filtered_obs1_obs2_excl_df (remove obs Mondo & Orphanet terms and excluded Orphanet terms
8089


## Get MAPPED count for DOID

In [14]:
# (1) Use "mondo_sssom_skosExactMatch_df" and filter for all rows with object_id starting with 'DOID'
# (2) Remove all rows where the subject_id is also found in the list of mondo obsolete terms 
# (3) Remove all rows where the object_id is found in doid_df_obsoletes
# Optional (4): Using this count is pending discussion - Remove terms found in "doid_term_exclusions.txt"

doid_mondo_sssom_skosExactMatch_df = mondo_sssom_skosExactMatch_df[
    mondo_sssom_skosExactMatch_df['object_id'].str.startswith('DOID', na=False)
]

print("(1) doid_mondo_sssom_skosExactMatch_df")
print(doid_mondo_sssom_skosExactMatch_df['subject_id'].nunique())

doid_filtered_obs1_df = doid_mondo_sssom_skosExactMatch_df[~doid_mondo_sssom_skosExactMatch_df['subject_id'].isin(obs_mondo_df['?mondo_curie'])]

print("\n(2) doid_filtered_obs1_df (remove obsolete Mondo terms)")
print(doid_filtered_obs1_df['subject_id'].nunique())

doid_filtered_obs1_obs2_df = doid_filtered_obs1_df[~doid_filtered_obs1_df['object_id'].isin(doid_df_obsoletes['id'])]

print("\n(3) DOID MAPPED - doid_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete DOID terms)")
print(doid_filtered_obs1_obs2_df['subject_id'].nunique())


doid_filtered_obs1_obs2_excl_df = doid_filtered_obs1_obs2_df[~doid_filtered_obs1_obs2_df['object_id'].isin(doid_df_exclusions['curie_exclude'])]

print("\n(4) DOID MAPPED??? - doid_filtered_obs1_obs2_excl_df (remove obs Mondo & DOID terms and excluded DOID terms)")
print(doid_filtered_obs1_obs2_excl_df['subject_id'].nunique())

(1) doid_mondo_sssom_skosExactMatch_df
11406

(2) doid_filtered_obs1_df (remove obsolete Mondo terms)
11260

(3) DOID MAPPED - doid_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete DOID terms)
11258

(4) DOID MAPPED??? - doid_filtered_obs1_obs2_excl_df (remove obs Mondo & DOID terms and excluded DOID terms)
11236
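
The same four-step filter is repeated for every source in this notebook, so it could be wrapped in one helper. The block below is a minimal sketch of that idea on toy data; the function name `mapped_counts`, its parameter names, and the toy CURIEs are hypothetical and not part of the notebook.

```python
import pandas as pd


def mapped_counts(sssom_df, prefix, obsolete_mondo_ids, obsolete_source_ids, excluded_source_ids):
    """Return the unique-subject counts for steps (1) to (4) (hypothetical helper)."""
    # (1) exact-match rows whose object_id starts with the source prefix
    step1 = sssom_df[sssom_df['object_id'].str.startswith(prefix, na=False)]
    # (2) drop rows whose Mondo subject is obsolete
    step2 = step1[~step1['subject_id'].isin(obsolete_mondo_ids)]
    # (3) drop rows whose source object is obsolete
    step3 = step2[~step2['object_id'].isin(obsolete_source_ids)]
    # (4) drop rows whose source object is in the exclusion list
    step4 = step3[~step3['object_id'].isin(excluded_source_ids)]
    return {
        'exact_match': step1['subject_id'].nunique(),
        'no_obs_mondo': step2['subject_id'].nunique(),
        'no_obs_source': step3['subject_id'].nunique(),
        'no_excluded': step4['subject_id'].nunique(),
    }


# Toy stand-in for mondo_sssom_skosExactMatch_df (made-up CURIEs)
toy_sssom = pd.DataFrame({
    'subject_id': ['MONDO:1', 'MONDO:2', 'MONDO:3', 'MONDO:4', 'MONDO:5'],
    'object_id': ['DOID:1', 'DOID:2', 'DOID:3', 'DOID:4', 'NCIT:C1'],
})

counts = mapped_counts(
    toy_sssom,
    'DOID',
    obsolete_mondo_ids=['MONDO:1'],
    obsolete_source_ids=['DOID:2'],
    excluded_source_ids=['DOID:3'],
)
print(counts)
```

In the notebook, the three ID arguments would presumably be `obs_mondo_df['?mondo_curie']`, `doid_df_obsoletes['id']`, and `doid_df_exclusions['curie_exclude']`.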


## Get MAPPED count for NCIT

In [15]:
# (1) Use "mondo_sssom_skosExactMatch_df" and keep only rows whose object_id starts with 'NCIT'
# (2) Remove all rows where the subject_id is in the list of obsolete Mondo terms
# (3) Remove all rows where the object_id is in ncit_df_obsoletes
# (4) Optional (whether to report this count is pending discussion): remove all rows where the object_id is listed in "ncit_term_exclusions.txt" (ncit_df_exclusions)

ncit_mondo_sssom_skosExactMatch_df = mondo_sssom_skosExactMatch_df[
    mondo_sssom_skosExactMatch_df['object_id'].str.startswith('NCIT', na=False)
]

print("(1) ncit_mondo_sssom_skosExactMatch_df")
print(ncit_mondo_sssom_skosExactMatch_df['subject_id'].nunique())

ncit_filtered_obs1_df = ncit_mondo_sssom_skosExactMatch_df[~ncit_mondo_sssom_skosExactMatch_df['subject_id'].isin(obs_mondo_df['?mondo_curie'])]

print("\n(2) ncit_filtered_obs1_df (remove obsolete Mondo terms)")
print(ncit_filtered_obs1_df['subject_id'].nunique())

ncit_filtered_obs1_obs2_df = ncit_filtered_obs1_df[~ncit_filtered_obs1_df['object_id'].isin(ncit_df_obsoletes['id'])]

print("\n(3) NCIT MAPPED - ncit_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete NCIT terms)")
print(ncit_filtered_obs1_obs2_df['subject_id'].nunique())


ncit_filtered_obs1_obs2_excl_df = ncit_filtered_obs1_obs2_df[~ncit_filtered_obs1_obs2_df['object_id'].isin(ncit_df_exclusions['curie_exclude'])]

print("\n(4) NCIT MAPPED??? - ncit_filtered_obs1_obs2_excl_df (remove obs Mondo & NCIT terms and excluded NCIT terms)")
print(ncit_filtered_obs1_obs2_excl_df['subject_id'].nunique())

(1) ncit_mondo_sssom_skosExactMatch_df
7298

(2) ncit_filtered_obs1_df (remove obsolete Mondo terms)
7244

(3) NCIT MAPPED - ncit_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete NCIT terms)
7244

(4) NCIT MAPPED??? - ncit_filtered_obs1_obs2_excl_df (remove obs Mondo & NCIT terms and excluded NCIT terms)
3806
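
Step (4) removes far more terms for NCIT than for the other sources, so it may help to look at which mapped terms are also excluded before deciding whether to report that count. The block below is a minimal sketch on toy data; `mapped_df` and `exclusions_df` are hypothetical stand-ins for `ncit_filtered_obs1_obs2_df` and `ncit_df_exclusions`, and the CURIEs are made up.

```python
import pandas as pd

# Toy stand-ins (assumed shapes): mapped rows after steps (1) to (3), and the exclusion list
mapped_df = pd.DataFrame({
    'subject_id': ['MONDO:1', 'MONDO:2', 'MONDO:3'],
    'object_id': ['NCIT:C1', 'NCIT:C2', 'NCIT:C3'],
})
exclusions_df = pd.DataFrame({'curie_exclude': ['NCIT:C2', 'NCIT:C3', 'NCIT:C9']})

# Boolean mask: True where a mapped source term is also in the exclusion list
is_excluded = mapped_df['object_id'].isin(exclusions_df['curie_exclude'])

# Rows that are mapped in the SSSOM file but would be dropped by step (4)
mapped_but_excluded_df = mapped_df[is_excluded]

n_before = mapped_df['subject_id'].nunique()
n_after = mapped_df[~is_excluded]['subject_id'].nunique()

print(f"Unique Mondo subjects dropped by exclusions: {n_before - n_after}")
print(mapped_but_excluded_df)
```

The difference in unique subjects can be smaller than the number of excluded rows when one Mondo term has several exact matches in the same source.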


## Get MAPPED count for ICD10CM

In [16]:
# (1) Use "mondo_sssom_skosExactMatch_df" and keep only rows whose object_id starts with 'ICD10CM'
# (2) Remove all rows where the subject_id is in the list of obsolete Mondo terms
# (3) Skipped: there are 0 obsolete ICD10CM terms, so filtering on icd10cm_df_obsoletes would remove nothing
# (4) Optional (whether to report this count is pending discussion): remove all rows where the object_id is listed in "icd10cm_term_exclusions.txt" (icd10cm_df_exclusions)

icd10cm_mondo_sssom_skosExactMatch_df = mondo_sssom_skosExactMatch_df[
    mondo_sssom_skosExactMatch_df['object_id'].str.startswith('ICD10CM', na=False)
]

print("(1) icd10cm_mondo_sssom_skosExactMatch_df")
print(icd10cm_mondo_sssom_skosExactMatch_df['subject_id'].nunique())

icd10cm_filtered_obs1_df = icd10cm_mondo_sssom_skosExactMatch_df[~icd10cm_mondo_sssom_skosExactMatch_df['subject_id'].isin(obs_mondo_df['?mondo_curie'])]

print("\n(2) icd10cm_filtered_obs1_df (remove obsolete Mondo terms)")
print(icd10cm_filtered_obs1_df['subject_id'].nunique())

# Step (3) is skipped because there are 0 obsolete ICD10CM terms; the code is kept here for reference.
# icd10cm_filtered_obs1_obs2_df = icd10cm_filtered_obs1_df[~icd10cm_filtered_obs1_df['object_id'].isin(icd10cm_df_obsoletes['id'])]
# print("\n(3) ICD10CM MAPPED - icd10cm_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete ICD10CM terms)")
# print(icd10cm_filtered_obs1_obs2_df['subject_id'].nunique())


icd10cm_filtered_obs1_excl_df = icd10cm_filtered_obs1_df[~icd10cm_filtered_obs1_df['object_id'].isin(icd10cm_df_exclusions['curie_exclude'])]

print("\n(4) ICD10CM MAPPED??? - icd10cm_filtered_obs1_excl_df (remove obs Mondo & ICD10CM terms and excluded ICD10CM terms)")
print(icd10cm_filtered_obs1_excl_df['subject_id'].nunique())

(1) icd10cm_mondo_sssom_skosExactMatch_df
1135

(2) icd10cm_filtered_obs1_df (remove obsolete Mondo terms)
1120

(4) ICD10CM MAPPED??? - icd10cm_filtered_obs1_excl_df (remove obs Mondo & ICD10CM terms and excluded ICD10CM terms)
1105
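
The notebook does not say why step (3) was commented out rather than run. If the reason was that an empty obsoletes file could not be loaded, a tolerant loader would let the step run unchanged as a no-op. The block below is a sketch under that assumption; `read_id_list`, the single `id` header, and the toy CURIEs are hypothetical, and the real `runoak` output format may differ.

```python
import io

import pandas as pd


def read_id_list(source):
    """Read a TSV with an 'id' column; return an empty frame if the input is empty."""
    try:
        return pd.read_csv(source, sep='\t', dtype=str)
    except pd.errors.EmptyDataError:
        # pandas raises EmptyDataError for a completely empty file
        return pd.DataFrame({'id': pd.Series(dtype=str)})


# An empty in-memory "file" stands in for an empty obsoletes file
obsoletes_df = read_id_list(io.StringIO(''))

# Toy stand-in for icd10cm_filtered_obs1_df (made-up rows)
mapped_df = pd.DataFrame({
    'subject_id': ['MONDO:1', 'MONDO:2'],
    'object_id': ['ICD10CM:A00', 'ICD10CM:B01'],
})

# isin() against an empty column is False everywhere, so no rows are removed
filtered_df = mapped_df[~mapped_df['object_id'].isin(obsoletes_df['id'])]

print(len(obsoletes_df), filtered_df['subject_id'].nunique())
```

With a loader like this, the step (3) count would simply equal the step (2) count when a source has no obsolete terms.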


## Get MAPPED count for ICD11Foundation

In [17]:
# (1) Use "mondo_sssom_skosExactMatch_df" and keep only rows whose object_id starts with 'icd11.foundation'
# (2) Remove all rows where the subject_id is in the list of obsolete Mondo terms
# (3) Remove all rows where the object_id is in icd11_df_obsoletes
# (4) Optional (whether to report this count is pending discussion): remove all rows where the object_id is listed in "icd11_term_exclusions.txt" (icd11_df_exclusions)

icd11_mondo_sssom_skosExactMatch_df = mondo_sssom_skosExactMatch_df[
    mondo_sssom_skosExactMatch_df['object_id'].str.startswith('icd11.foundation', na=False)
]
print("(1) icd11_mondo_sssom_skosExactMatch_df")
print(icd11_mondo_sssom_skosExactMatch_df['subject_id'].nunique())

icd11_filtered_obs1_df = icd11_mondo_sssom_skosExactMatch_df[~icd11_mondo_sssom_skosExactMatch_df['subject_id'].isin(obs_mondo_df['?mondo_curie'])]
print("\n(2) icd11_filtered_obs1_df (remove obsolete Mondo terms)")
print(icd11_filtered_obs1_df['subject_id'].nunique())


icd11_filtered_obs1_obs2_df = icd11_filtered_obs1_df[~icd11_filtered_obs1_df['object_id'].isin(icd11_df_obsoletes['id'])]
print("\n(3) ICD11 MAPPED - icd11_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete ICD11 terms)")
print(icd11_filtered_obs1_obs2_df['subject_id'].nunique())


# Note: despite the "obs1_excl" name, this frame is built from icd11_filtered_obs1_obs2_df, so obsolete ICD11 terms are already removed.
icd11_filtered_obs1_excl_df = icd11_filtered_obs1_obs2_df[~icd11_filtered_obs1_obs2_df['object_id'].isin(icd11_df_exclusions['curie_exclude'])]
print("\n(4) ICD11 MAPPED??? - icd11_filtered_obs1_excl_df (remove obs Mondo & ICD11 terms and excluded ICD11 terms)")
print(icd11_filtered_obs1_excl_df['subject_id'].nunique())

(1) icd11_mondo_sssom_skosExactMatch_df
4111

(2) icd11_filtered_obs1_df (remove obsolete Mondo terms)
4044

(3) ICD11 MAPPED - icd11_filtered_obs1_obs2_df (remove obs Mondo terms and obsolete ICD11 terms)
4041

(4) ICD11 MAPPED??? - icd11_filtered_obs1_excl_df (remove obs Mondo & ICD11 terms and excluded ICD11 terms)
4021
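
To compare the sources side by side, the counts printed in the cells above can be collected into one table. The block below is a small sketch that copies those printed numbers by hand; the column names and the derived `dropped_by_exclusions` column are illustrative choices, not part of the notebook.

```python
import pandas as pd

# Counts copied from the printed outputs above.
# ICD10CM has no step (3) output; its 'mapped' value reuses the step (2) count
# because the notebook states there are 0 obsolete ICD10CM terms.
summary_df = pd.DataFrame(
    {
        'exact_match': [9563, 11406, 7298, 1135, 4111],
        'no_obs_mondo': [8200, 11260, 7244, 1120, 4044],
        'mapped': [8200, 11258, 7244, 1120, 4041],
        'mapped_minus_excluded': [8089, 11236, 3806, 1105, 4021],
    },
    index=['Orphanet', 'DOID', 'NCIT', 'ICD10CM', 'ICD11'],
)

# How many unique Mondo subjects step (4) removes for each source
summary_df['dropped_by_exclusions'] = (
    summary_df['mapped'] - summary_df['mapped_minus_excluded']
)

print(summary_df)
```

Laid out this way, NCIT stands out: applying the exclusion list removes 3438 mapped Mondo terms, compared with between 15 and 111 for the other sources, which is relevant to the open question about whether excluded terms should count as MAPPED.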
