## Alignment Stats


### Description

This notebook reviews the counts used in the Mapping Progress Report in the mondo-ingest repo.

These calculations are based on the slides [here](https://docs.google.com/presentation/d/14Xmcfwe9xhoGG0ZnDbFbwzKqHWHQR4OaOF4W7II2BaM/edit?slide=id.p#slide=id.p).

- DISEASE is generally calculated as the mirror signature minus all excluded terms and all obsolete terms from the ingest source.
  - The mirror signature comes from `reports/mirror_signature-<ONTOLOGY_NAME>.tsv`.
  - The set of excluded terms comes from `reports/<ONTOLOGY_NAME>_term_exclusions.txt`.
  - The set of obsolete terms for each ingest source was created by querying the OAK ontology database for obsolete terms.
  - IRIs were converted to CURIEs as needed.
  - The final count only includes CURIEs in the base namespace of the ingest source.


- MAPPED is calculated as ... TBA 
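
The DISEASE calculation described above can be sketched as a set difference. This is a minimal illustration on in-memory toy data, not the code this notebook runs: the function name `disease_count` and the example CURIEs are hypothetical.

```python
def disease_count(signature, excluded, obsolete, prefix):
    """Count signature CURIEs in the source namespace that are neither excluded nor obsolete."""
    # Keep only CURIEs from the ingest source's base namespace
    in_namespace = {curie for curie in signature if curie.startswith(prefix)}
    # Remove excluded terms and obsolete terms
    return len(in_namespace - set(excluded) - set(obsolete))


# Toy data (hypothetical CURIEs, for illustration only)
signature = ["OMIM:1", "OMIM:2", "OMIM:3", "HGNC:10004"]
excluded = ["OMIM:2"]
obsolete = ["OMIM:3"]
print(disease_count(signature, excluded, obsolete, "OMIM:"))  # 1
```

Because the sketch works on sets, it counts distinct CURIEs. A row-based count such as `len(df)` gives the same number only when the signature contains no duplicate CURIEs.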


NOTE: The obsolete-term files were created with commands of the form `runoak -i components/icd10cm.db obsoletes > icd10cm_obsoletes.txt`.
All input files come from the most recent data build, so they were all created in late April 2025.
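
To repeat the command pattern from the NOTE for several sources, the commands could be generated rather than typed by hand. The sketch below only builds the command strings and does not run them. Apart from `icd10cm`, the source names and database paths are assumptions and may not match the actual files in `components/`.

```python
import shlex

# Only "icd10cm" appears in the NOTE; "doid" is an assumed name for illustration.
SOURCES = ["icd10cm", "doid"]


def obsoletes_command(source):
    """Build the runoak command that writes <source>_obsoletes.txt, following the NOTE's pattern."""
    db_path = shlex.quote(f"components/{source}.db")
    out_path = shlex.quote(f"{source}_obsoletes.txt")
    return f"runoak -i {db_path} obsoletes > {out_path}"


for source in SOURCES:
    print(obsoletes_command(source))
```

The printed lines can be pasted into a shell or written to a script once the names have been checked against the repo.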

In [1]:
# Imports
import pandas as pd

pd.set_option('display.width', 1000)

### OMIM

In [8]:
# OMIM Alignment

# Load the mirror signature file
omim_df_signature = pd.read_csv('reports/mirror_signature-omim.tsv', header=None, names=['iri'])
print("** omim_df_signature")
print(omim_df_signature.head(3))

# Load the term exclusions file
omim_df_exclusions = pd.read_csv('reports/omim_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\n** omim_df_exclusions")
print(omim_df_exclusions.head(3))

# Load the obsolete terms file (assuming tab-separated with a header)
omim_df_obsoletes = pd.read_csv('omim_obsoletes.txt', sep='\t')
print("\n** df_obsoletes")
print(omim_df_obsoletes.head(3))


# Function to extract CURIE from IRI
def extract_omim_curie(iri):
    if pd.notna(iri):
        match = pd.NA
        if 'https://omim.org/entry/' in iri:
            match = iri.replace('<https://omim.org/entry/', 'OMIM:')
        elif 'https://omim.org/phenotypicSeries/PS' in iri:
            match = iri.replace('<https://omim.org/phenotypicSeries/', 'OMIMPS:')
        elif 'http://purl.obolibrary.org/obo/OMIMPS_' in iri:
            match = iri.replace('<http://purl.obolibrary.org/obo/OMIMPS_', 'OMIMPS:')
        return match
    return pd.NA

omim_df_signature['curie'] = omim_df_signature['iri'].apply(extract_omim_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", omim_df_signature['curie'].head())

omim_df_signature_clean = omim_df_signature.dropna(subset=['curie']) # Keep only rows where a CURIE was extracted
print("\nSignature DataFrame with OMIM Foundation CURIEs (Head):")
print(omim_df_signature_clean.head(3))

omim_exclude_set = set(omim_df_exclusions['curie_exclude'].dropna())
print("\nSize of OMIM Exclusion Set:", len(omim_exclude_set))


# Extract OMIM CURIEs from the obsolete file (filter after dropna so missing ids cannot break the mask)
omim_obsoletes_set = set(omim_df_obsoletes['id'].dropna().loc[lambda s: s.str.startswith('OMIM:')])
print("Size of OMIM Obsoletes Set:", len(omim_obsoletes_set))

omim_df_filtered = omim_df_signature_clean[
    ~omim_df_signature_clean['curie'].isin(omim_exclude_set) &
    ~omim_df_signature_clean['curie'].isin(omim_obsoletes_set)
]

final_count = len(omim_df_filtered)
print(f"\nFinal Count of OMIM CURIEs after filtering - OMIM DISEASE: {final_count}")

** omim_df_signature
                                   iri
0                                ?term
1  <http://identifiers.org/hgnc/10004>
2  <http://identifiers.org/hgnc/10006>

** omim_df_exclusions
  curie_exclude
0   OMIM:100050
1   OMIM:100200
2   OMIM:100640

** df_obsoletes
            id                  label
0  OMIM:102570  removed from database
1  OMIM:102920  removed from database
2  OMIM:102930  removed from database

**TEST 0    <NA>
1    <NA>
2    <NA>
3    <NA>
4    <NA>
Name: curie, dtype: object

Signature DataFrame with OMIM Foundation CURIEs (Head):
                                  iri        curie
5499  <https://omim.org/entry/100050>  OMIM:100050
5500  <https://omim.org/entry/100070>  OMIM:100070
5501  <https://omim.org/entry/100100>  OMIM:100100

Size of OMIM Exclusion Set: 19503
Size of OMIM Obsoletes Set: 1371

Final Count of OMIM CURIEs after filtering - OMIM DISEASE: 8849
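
The per-source `extract_*_curie` functions in this notebook all follow the same pattern, so the IRI-to-CURIE step could be driven by a single prefix map. The sketch below is one possible generalization, not the notebook's implementation: `PREFIX_MAP` and `iri_to_curie` are hypothetical names, and the map lists only a few of the prefixes used in this notebook.

```python
# IRI prefix -> CURIE prefix, copied from the extract functions in this notebook
PREFIX_MAP = {
    "https://omim.org/entry/": "OMIM:",
    "https://omim.org/phenotypicSeries/": "OMIMPS:",
    "http://purl.obolibrary.org/obo/OMIMPS_": "OMIMPS:",
    "http://purl.obolibrary.org/obo/DOID_": "DOID:",
}


def iri_to_curie(iri, prefix_map=PREFIX_MAP):
    """Return the CURIE for an IRI with a known prefix, or None if no prefix matches."""
    if not isinstance(iri, str):
        return None
    # Signature files may wrap IRIs in angle brackets; strip them first
    bare = iri.strip().strip("<>")
    for iri_prefix, curie_prefix in prefix_map.items():
        if bare.startswith(iri_prefix):
            return curie_prefix + bare[len(iri_prefix):]
    return None


print(iri_to_curie("<https://omim.org/entry/100050>"))      # OMIM:100050
print(iri_to_curie("<http://identifiers.org/hgnc/10004>"))  # None
```

Returning `None` for unmatched IRIs keeps the later `dropna(subset=['curie'])` step working, since pandas treats `None` in an object column as missing.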


### Orphanet

In [3]:
# Load the mirror signature file
orphanet_df_signature = pd.read_csv('reports/mirror_signature-ordo.tsv', header=None, names=['iri'])
print("Mirror Signature (ORDO) Head:")
print(orphanet_df_signature.head(3))

# Load the term exclusions file
orphanet_df_exclusions = pd.read_csv('reports/ordo_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nORDO Term Exclusions Head:")
print(orphanet_df_exclusions.head(3))

# Load the obsolete terms file (assuming tab-separated with a header)
orphanet_df_obsoletes = pd.read_csv('orphanet_obsoletes.txt', sep='\t')
print("\nOrphanet Obsoletes Head:")
print(orphanet_df_obsoletes.head(3))

# Function to extract Orphanet CURIE from IRI
def extract_orphanet_curie(iri):
    if pd.notna(iri):
        if 'http://www.orpha.net/ORDO/Orphanet_' in iri:
            return iri.replace('<http://www.orpha.net/ORDO/Orphanet_', 'Orphanet:')
    return pd.NA

# Apply the extraction function
orphanet_df_signature['curie'] = orphanet_df_signature['iri'].apply(extract_orphanet_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", orphanet_df_signature['curie'].head())

# Drop rows where CURIE extraction failed (non-Orphanet IRIs)
orphanet_df_signature_clean = orphanet_df_signature.dropna(subset=['curie'])
print("\nSignature DataFrame with Orphanet CURIEs (Head):")
print(orphanet_df_signature_clean.head(3))

# Prepare exclusion set
orphanet_exclude_set = set(orphanet_df_exclusions['curie_exclude'].dropna())
print("\nSize of Orphanet Exclusion Set:", len(orphanet_exclude_set))

# Prepare obsolete set (assuming the 'id' column contains Orphanet CURIEs)
orphanet_obsoletes_set = set(orphanet_df_obsoletes['id'].dropna().loc[lambda s: s.str.startswith('Orphanet:')])
print("Size of Orphanet Obsoletes Set:", len(orphanet_obsoletes_set))

# Filter the signature DataFrame
df_filtered_orphanet = orphanet_df_signature_clean[
    ~orphanet_df_signature_clean['curie'].isin(orphanet_exclude_set) &
    ~orphanet_df_signature_clean['curie'].isin(orphanet_obsoletes_set)
]

# Get the final count
final_count_ordo = len(df_filtered_orphanet)
print("\nFinal Count of Orphanet CURIEs after filtering:", final_count_ordo)

Mirror Signature (ORDO) Head:
                                           iri
0                                        ?term
1  <http://www.orpha.net/ORDO/Orphanet_100000>
2  <http://www.orpha.net/ORDO/Orphanet_100001>

ORDO Term Exclusions Head:
     curie_exclude
0  Orphanet:100039
1  Orphanet:100040
2  Orphanet:100041

Orphanet Obsoletes Head:
                id                                              label
0  Orphanet:100039                 Familial pseudohyperkalemia type 1
1  Orphanet:100040       OBSOLETE: Familial pseudohyperkalemia type 2
2  Orphanet:100041  OBSOLETE: Familial pseudohyperkalemia, Cardiff...

**TEST 0               <NA>
1    Orphanet:100000
2    Orphanet:100001
3    Orphanet:100002
4    Orphanet:100003
Name: curie, dtype: object

Signature DataFrame with Orphanet CURIEs (Head):
                                           iri            curie
1  <http://www.orpha.net/ORDO/Orphanet_100000>  Orphanet:100000
2  <http://www.orpha.net/ORDO/Orphanet_100001>  Orphan

### DOID

In [4]:
# Load the mirror signature file
doid_df_signature = pd.read_csv('reports/mirror_signature-doid.tsv', header=None, names=['iri'])
print("\nDOID Mirror Signature Head:")
print(doid_df_signature.head(3))

# Load the term exclusions file
doid_df_exclusions = pd.read_csv('reports/doid_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nDOID Term Exclusions Head:")
print(doid_df_exclusions.head(3))

# Load the obsolete terms file
doid_df_obsoletes = pd.read_csv('doid_obsoletes.txt', sep='\t')
print("\nDOID Obsoletes Head:")
print(doid_df_obsoletes.head(3))

# Function to extract DOID CURIE from IRI
def extract_doid_curie(iri):
    if pd.notna(iri):
        if 'http://purl.obolibrary.org/obo/DOID_' in iri:
            return iri.replace('<http://purl.obolibrary.org/obo/DOID_', 'DOID:')
    return pd.NA

doid_df_signature['curie'] = doid_df_signature['iri'].apply(extract_doid_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", doid_df_signature['curie'].head())

doid_df_signature_clean = doid_df_signature.dropna(subset=['curie'])
print("\n**TEST", doid_df_signature_clean['curie'].head())

doid_exclude_set = set(doid_df_exclusions['curie_exclude'].dropna())
print("\nSize of DOID Exclusion Set:", len(doid_exclude_set))

doid_obsoletes_set = set(doid_df_obsoletes['id'].dropna().loc[lambda s: s.str.startswith('DOID:')])
print("Size of DOID Obsoletes Set:", len(doid_obsoletes_set))

doid_df_filtered = doid_df_signature_clean[
    ~doid_df_signature_clean['curie'].isin(doid_exclude_set) &
    ~doid_df_signature_clean['curie'].isin(doid_obsoletes_set)
]

final_count_doid = len(doid_df_filtered)
print("\nFinal Count of DOID CURIEs after filtering:", final_count_doid)


DOID Mirror Signature Head:
                                             iri
0                                          ?term
1  <http://purl.obolibrary.org/obo/CHEBI_102166>
2  <http://purl.obolibrary.org/obo/CHEBI_103210>

DOID Term Exclusions Head:
  curie_exclude
0  DOID:0040001
1  DOID:0040002
2  DOID:0040003

DOID Obsoletes Head:
             id                                              label
0  DOID:0050001   obsolete Actinomadura madurae infectious disease
1  DOID:0050002  obsolete Actinomadura pelletieri infectious di...
2  DOID:0050003  obsolete Streptomyces somaliensis infectious d...

**TEST 0    <NA>
1    <NA>
2    <NA>
3    <NA>
4    <NA>
Name: curie, dtype: object

**TEST 609    DOID:0001816
610    DOID:0002116
611    DOID:0014667
612    DOID:0040001
613    DOID:0040002
Name: curie, dtype: object

Size of DOID Exclusion Set: 2675
Size of DOID Obsoletes Set: 2502

Final Count of DOID CURIEs after filtering: 11683


### NCIT

In [5]:
# Load the mirror signature file
ncit_df_signature = pd.read_csv('reports/mirror_signature-ncit.tsv', header=None, names=['iri'])
print("\nNCIT Mirror Signature Head:")
print(ncit_df_signature.head(3))

# Load the term exclusions file
ncit_df_exclusions = pd.read_csv('reports/ncit_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nNCIT Term Exclusions Head:")
print(ncit_df_exclusions.head(3))

# Load the obsolete terms file
ncit_df_obsoletes = pd.read_csv('ncit_obsoletes.txt', sep='\t')
print("\nNCIT Obsoletes Head:")
print(ncit_df_obsoletes.head(3))

# Function to extract NCIT CURIE from IRI
def extract_ncit_curie(iri):
    if pd.notna(iri):
        if 'http://purl.obolibrary.org/obo/NCIT_' in iri:
            return iri.replace('http://purl.obolibrary.org/obo/NCIT_', 'NCIT:')
    return pd.NA

ncit_df_signature['curie'] = ncit_df_signature['iri'].apply(extract_ncit_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", ncit_df_signature['curie'].head())

ncit_df_signature_clean = ncit_df_signature.dropna(subset=['curie'])
print("\nSignature DataFrame with Orphanet CURIEs (Head):")
print(ncit_df_signature_clean.head())

ncit_exclude_set = set(ncit_df_exclusions['curie_exclude'].dropna())
print("\nSize of NCIT Exclusion Set:", len(ncit_exclude_set))

ncit_obsoletes_set = set(ncit_df_obsoletes['id'].dropna().loc[lambda s: s.str.startswith('NCIT:')])
print("Size of NCIT Obsoletes Set:", len(ncit_obsoletes_set))

ncit_df_filtered = ncit_df_signature_clean[
    ~ncit_df_signature_clean['curie'].isin(ncit_exclude_set) &
    ~ncit_df_signature_clean['curie'].isin(ncit_obsoletes_set)
]

final_count_ncit = len(ncit_df_filtered)
print("\nFinal Count of NCIT CURIEs after filtering:", final_count_ncit)


NCIT Mirror Signature Head:
                                          iri
0                                       ?term
1   http://purl.obolibrary.org/obo/NCIT_C1000
2  http://purl.obolibrary.org/obo/NCIT_C10000

NCIT Term Exclusions Head:
  curie_exclude
0   NCIT:124251
1   NCIT:134528
2   NCIT:134530

NCIT Obsoletes Head:
             id                                            label
0  NCIT:C100067                   Coronary Reperfusion Procedure
1  NCIT:C100421  Activated PTT to Standard PTT Ratio Measurement
2  NCIT:C100426                   Beta-Trace Protein Measurement

**TEST 0            <NA>
1      NCIT:C1000
2     NCIT:C10000
3    NCIT:C100000
4    NCIT:C100001
Name: curie, dtype: object

Signature DataFrame with Orphanet CURIEs (Head):
                                           iri         curie
1    http://purl.obolibrary.org/obo/NCIT_C1000    NCIT:C1000
2   http://purl.obolibrary.org/obo/NCIT_C10000   NCIT:C10000
3  http://purl.obolibrary.org/obo/NCIT_C100000  NCIT:C1

### ICD10CM

In [6]:
# Load the mirror signature file
icd10cm_df_signature = pd.read_csv('reports/mirror_signature-icd10cm.tsv', header=None, names=['iri'])
print("ICD10CM Mirror Signature Head:")
print(icd10cm_df_signature.head(3))

# Load the term exclusions file
icd10cm_df_exclusions = pd.read_csv('reports/icd10cm_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nICD10CM Term Exclusions Head:")
print(icd10cm_df_exclusions.head(3))

# Load the obsolete terms file, handling potential EmptyDataError
try:
    icd10cm_df_obsoletes = pd.read_csv('icd10cm_obsoletes.txt', sep='\t')
    print("\nICD10CM Obsoletes Head:")
    print(icd10cm_df_obsoletes.head(3))
except pd.errors.EmptyDataError:
    print("\nICD10CM Obsoletes file is empty.")
    icd10cm_df_obsoletes = pd.DataFrame() # Create an empty DataFrame

# Function to extract ICD10CM CURIE from IRI
def extract_icd10cm_curie(iri):
    if pd.notna(iri):
        if 'http://purl.bioontology.org/ontology/ICD10CM/' in iri:
            # Extract the code part after the last '/'
            code_part = iri.split('/')[-1]
            # Take the part before the first '-' if it exists, otherwise take the whole code
            if '-' in code_part:
                base_code = code_part.split('-')[0]
                return f"ICD10CM:{base_code}"
            else:
                return f"ICD10CM:{code_part}"
    return pd.NA

# Apply the extraction function
icd10cm_df_signature['curie'] = icd10cm_df_signature['iri'].apply(extract_icd10cm_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", icd10cm_df_signature['curie'].head())

# Drop rows where CURIE extraction failed (non-ICD10CM IRIs)
icd10cm_df_signature_clean = icd10cm_df_signature.dropna(subset=['curie'])
print("\nSignature DataFrame with ICD10CM CURIEs (Head):")
print(icd10cm_df_signature_clean.head(3))

# Prepare exclusion set
icd10cm_exclude_set = set(icd10cm_df_exclusions['curie_exclude'].dropna())
print("\nSize of ICD10CM Exclusion Set:", len(icd10cm_exclude_set))

# Prepare obsolete set, handling empty DataFrame
if not icd10cm_df_obsoletes.empty and 'id' in icd10cm_df_obsoletes.columns:
    icd10cm_obsoletes_set = set(icd10cm_df_obsoletes['id'].dropna().loc[lambda s: s.str.startswith('ICD10CM:')])
else:
    icd10cm_obsoletes_set = set()
print("Size of ICD10CM Obsoletes Set:", len(icd10cm_obsoletes_set))

# Filter the signature DataFrame
icd10cm_df_filtered = icd10cm_df_signature_clean[
    ~icd10cm_df_signature_clean['curie'].isin(icd10cm_exclude_set) &
    ~icd10cm_df_signature_clean['curie'].isin(icd10cm_obsoletes_set)
]

# Get the final count
final_count_icd10cm = len(icd10cm_df_filtered)
print("\nFinal Count of ICD10CM CURIEs after filtering:", final_count_icd10cm)

ICD10CM Mirror Signature Head:
                                                 iri
0                                              ?term
1  <http://purl.bioontology.org/ontology/ICD10CM/...
2  <http://purl.bioontology.org/ontology/ICD10CM/...

ICD10CM Term Exclusions Head:
     curie_exclude
0      ICD10CM:B95
1  ICD10CM:B95-B97
2    ICD10CM:B95.0

ICD10CM Obsoletes file is empty.

**TEST 0             <NA>
1      ICD10CM:A00
2      ICD10CM:A00
3    ICD10CM:A00.0
4    ICD10CM:A00.1
Name: curie, dtype: object

Signature DataFrame with ICD10CM CURIEs (Head):
                                                 iri          curie
1  <http://purl.bioontology.org/ontology/ICD10CM/...    ICD10CM:A00
2  <http://purl.bioontology.org/ontology/ICD10CM/...    ICD10CM:A00
3  <http://purl.bioontology.org/ontology/ICD10CM/...  ICD10CM:A00.0

Size of ICD10CM Exclusion Set: 15452
Size of ICD10CM Obsoletes Set: 0

Final Count of ICD10CM CURIEs after filtering: 80388
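
The `**TEST` output above shows `ICD10CM:A00` on two consecutive rows, which suggests that some distinct IRIs collapse to the same CURIE once the part after the first `-` is dropped. If so, `len(icd10cm_df_filtered)` counts rows rather than distinct CURIEs. The sketch below shows how such collisions could be checked. The IRIs are hypothetical (the real ones are truncated in the output), and `extract_icd10cm_curie` here is a condensed version of the notebook's logic for IRIs already known to be in the ICD10CM namespace.

```python
from collections import Counter


def extract_icd10cm_curie(iri):
    """Condensed version of the notebook's logic: keep the code before the first '-'."""
    code_part = iri.strip("<>").split("/")[-1]
    return "ICD10CM:" + code_part.split("-")[0]


# Hypothetical IRIs: a range node and a single code that share the leading code
iris = [
    "<http://purl.bioontology.org/ontology/ICD10CM/A00-A09>",
    "<http://purl.bioontology.org/ontology/ICD10CM/A00>",
    "<http://purl.bioontology.org/ontology/ICD10CM/A00.0>",
]
curies = [extract_icd10cm_curie(iri) for iri in iris]
duplicates = {curie: n for curie, n in Counter(curies).items() if n > 1}
print(len(curies), len(set(curies)), duplicates)  # 3 2 {'ICD10CM:A00': 2}
```

On the real data, comparing `len(icd10cm_df_filtered)` with `icd10cm_df_filtered['curie'].nunique()` would show whether this affects the final count. The exclusions head also lists a range CURIE (`ICD10CM:B95-B97`), which a collapsed CURIE would not match, so the range handling may be worth confirming.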


### ICD11 Foundation

In [7]:
# Load the mirror signature file
icd11_df_signature = pd.read_csv('reports/mirror_signature-icd11foundation.tsv', header=None, names=['iri'])
print("ICD11 Foundation Mirror Signature Head:")
print(icd11_df_signature.head(3))
print("LEN icd11_df_signature:", len(icd11_df_signature))

# Load the term exclusions file
icd11_df_exclusions = pd.read_csv('reports/icd11foundation_term_exclusions.txt', header=None, names=['curie_exclude'])
print("\nICD11 Foundation Term Exclusions Head:")
print(icd11_df_exclusions.head(3))

# Load the obsolete terms file (assuming tab-separated with a header)
icd11_df_obsoletes = pd.read_csv('icd11foundation_obsoletes.txt', sep='\t')
print("\nICD11 Foundation Obsoletes Head:")
print(icd11_df_obsoletes.head(3))

# Function to extract ICD11 Foundation CURIE from IRI
def extract_icd11_curie(iri):
    if pd.notna(iri):
        if 'http://id.who.int/icd/entity/' in iri:
            return iri.replace('<http://id.who.int/icd/entity/', 'icd11.foundation:')
    return pd.NA

# Apply the extraction function
icd11_df_signature['curie'] = icd11_df_signature['iri'].apply(extract_icd11_curie).str.replace('>', '') # Remove potential trailing '>'
print("\n**TEST", icd11_df_signature['curie'].head())

# Drop rows where CURIE extraction failed (non-ICD11 IRIs)
icd11_df_signature_clean = icd11_df_signature.dropna(subset=['curie'])
print("\nSignature DataFrame with ICD11 Foundation CURIEs (Head):")
print(icd11_df_signature_clean.head(3))

# Prepare exclusion set
icd11_exclude_set = set(icd11_df_exclusions['curie_exclude'].dropna())
print("\nSize of ICD11 Foundation Exclusion Set:", len(icd11_exclude_set))

# Prepare obsolete set (assuming the 'id' column contains ICD11 Foundation CURIEs)
if not icd11_df_obsoletes.empty and 'id' in icd11_df_obsoletes.columns:
    icd11_obsoletes_set = set(icd11_df_obsoletes['id'].dropna().loc[lambda s: s.str.startswith('icd11.foundation:')])
else:
    icd11_obsoletes_set = set()
print("Size of ICD11 Foundation Obsoletes Set:", len(icd11_obsoletes_set))

# Filter the signature DataFrame
icd11_df_filtered = icd11_df_signature_clean[
    ~icd11_df_signature_clean['curie'].isin(icd11_exclude_set) &
    ~icd11_df_signature_clean['curie'].isin(icd11_obsoletes_set)
]

# Get the final count
final_count_icd11 = len(icd11_df_filtered)
print("\nFinal Count of ICD11 Foundation CURIEs after filtering:", final_count_icd11)

ICD11 Foundation Mirror Signature Head:
                                         iri
0                                      ?term
1  <http://id.who.int/icd/entity/1000004774>
2  <http://id.who.int/icd/entity/1000010185>
LEN icd11_df_signature: 101136

ICD11 Foundation Term Exclusions Head:
                 curie_exclude
0  icd11.foundation:1000034337
1  icd11.foundation:1000093173
2  icd11.foundation:1000136681

ICD11 Foundation Obsoletes Head:
                            id                                              label
0  icd11.foundation:1000312374  Recurrent and persistent haematuria : diffuse ...
1  icd11.foundation:1001085090                           Tetraplegia, unspecified
2  icd11.foundation:1002125483  Hypertensive heart and renal disease with both...

**TEST 0                           <NA>
1    icd11.foundation:1000004774
2    icd11.foundation:1000010185
3    icd11.foundation:1000034337
4     icd11.foundation:100006598
Name: curie, dtype: object

Signature DataFrame wi