# Analysis of ICD11 lexical alignment results

## Description
This notebook further analyzes the results from the lexical alignment pipeline in the `mondo-ingest` repo for the first round of the ICD11 results. The file being analyzed is the "unmapped_icd11foundation_lex.xlsx" file, which includes matches that are considered _not exact_ and need further curator review before making a decision on whether to add these mappings into Mondo.

The aim of this notebook is to look for additional patterns within these matches: patterns that identify matches that should be considered exact, and patterns that indicate matches that should not be. Bugs or feature requests may come out of this review, but the notebook is primarily intended for curators, to help decide whether these patterns can be trusted enough to reduce manual review and add the matches as mappings.

Git Location: https://github.com/monarch-initiative/mondo-curation-analysis


### Overview of unmapped_icd11foundation_lex.xlsx file

| subject_match_field | object_match_field | num_rows | 
| --- | --- | --- |
| Mondo Label | ICD11 Label | 4 |
| Mondo Label | ICD11 Synonym | 1338 |
| Mondo Synonym | ICD11 Label | 1820 |
| Mondo Synonym | ICD11 Synonym | 2923 |

_Total rows in dataframe:_ 6085

_Total Unique Mondo IDs:_ 3542


See the README file for notes on which items to analyze.

## Analysis Results
- Mondo Label and ICD11 Label matches
    - The 4 matches between the Mondo Label and ICD11 Label are correct. They were flagged only because the ICD11 term label contains extra whitespace between words, and they should be considered exact matches.
        - [Mondo Label to ICD11 Label Matches](#Mondo-Label-and-ICD11-Label-Match-Results)
- Mondo Label and ICD11 Synonym matches
    - Of the 1338 matches, 141 can be verified as exact matches after accounting for differences in word order and punctuation, and for ICD11 terms that end in a word enclosed in parentheses.
        - [Mondo Label to ICD11 Synonym matches](#Mondo-Label-to-ICD11-Synonym-Match-Results)

## Prepare Notebook and Data file

In [9]:
# Import libraries
import pandas as pd
import re

In [10]:
# Read in data file
df = pd.read_excel('../data/input/unmapped_icd11foundation_lex.xlsx')

# Drop the first row (index 0) that contains the ROBOT directives
df = df.drop(index=0)

# Reset the index if needed
df = df.reset_index(drop=True)

df.head()

Unnamed: 0,subject_id,subject_label,object_id,predicate_id,object_label,mapping_justification,mapping_tool,confidence,subject_match_field,object_match_field,match_string
0,MONDO:0000001,disease,icd11.foundation:1659232486,MONDO:equivalentTo,"Mitral valve disease, unspecified",semapv:LexicalMatching,oaklib,0.8,oio:hasExactSynonym,oio:hasExactSynonym,disease
1,MONDO:0000001,disease,icd11.foundation:1659232486,MONDO:equivalentTo,"Mitral valve disease, unspecified",semapv:LexicalMatching,oaklib,0.8,rdfs:label,oio:hasExactSynonym,disease
2,MONDO:0000050,isolated congenital growth hormone deficiency,icd11.foundation:936501166,MONDO:equivalentTo,Nonacquired isolated growth hormone deficiency,semapv:LexicalMatching,oaklib,0.8,oio:hasExactSynonym,oio:hasExactSynonym,congenital isolated growth hormone deficiency
3,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,MONDO:equivalentTo,Peripheral precocious puberty,semapv:LexicalMatching,oaklib,0.8,oio:hasExactSynonym,oio:hasExactSynonym,pubertas praecox
4,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,MONDO:equivalentTo,Peripheral precocious puberty,semapv:LexicalMatching,oaklib,0.8,oio:hasExactSynonym,oio:hasExactSynonym,sexual precocity


In [11]:
# Get overview of unique values in dataframe
df.nunique()

subject_id               3542
subject_label            3542
object_id                3772
predicate_id                1
object_label             3737
mapping_justification       1
mapping_tool                1
confidence                  2
subject_match_field         2
object_match_field          2
match_string             5350
dtype: int64

In [12]:
# Remove columns with redundant information
df = df.drop(columns=['predicate_id', 'mapping_justification', 'mapping_tool', 'confidence'])
df.head()

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
0,MONDO:0000001,disease,icd11.foundation:1659232486,"Mitral valve disease, unspecified",oio:hasExactSynonym,oio:hasExactSynonym,disease
1,MONDO:0000001,disease,icd11.foundation:1659232486,"Mitral valve disease, unspecified",rdfs:label,oio:hasExactSynonym,disease
2,MONDO:0000050,isolated congenital growth hormone deficiency,icd11.foundation:936501166,Nonacquired isolated growth hormone deficiency,oio:hasExactSynonym,oio:hasExactSynonym,congenital isolated growth hormone deficiency
3,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,oio:hasExactSynonym,oio:hasExactSynonym,pubertas praecox
4,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,oio:hasExactSynonym,oio:hasExactSynonym,sexual precocity
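
The columns dropped above were picked by hand from the `nunique()` overview. As a minimal sketch, the same selection can be made programmatically. The dataframe `toy_df` and the name `constant_cols` are made up for illustration and are not taken from the notebook's data. Note that this only catches columns with a single unique value; `confidence` has 2 unique values in the real file, so it would still need to be dropped explicitly.

```python
import pandas as pd

# Toy stand-in for the lexical match table (values are illustrative only)
toy_df = pd.DataFrame({
    'subject_id': ['MONDO:1', 'MONDO:2', 'MONDO:3'],
    'predicate_id': ['MONDO:equivalentTo'] * 3,
    'mapping_tool': ['oaklib'] * 3,
    'match_string': ['a', 'b', 'c'],
})

# Columns with a single unique value carry no information for the review
unique_counts = toy_df.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()

toy_trimmed = toy_df.drop(columns=constant_cols)
print(constant_cols)
print(toy_trimmed.columns.tolist())
```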


## Create dataframe containing only rows where the subject_match_field is the Mondo Label

In [13]:
# Create dataframe where the 'subject_match_field' is rdfs:label
df_label = df[df['subject_match_field'] == 'rdfs:label']

df_label.head()

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
1,MONDO:0000001,disease,icd11.foundation:1659232486,"Mitral valve disease, unspecified",rdfs:label,oio:hasExactSynonym,disease
5,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,rdfs:label,oio:hasExactSynonym,precocious puberty
11,MONDO:0000229,Indian tick typhus,icd11.foundation:1771381430,Spotted fever due to Rickettsia conorii,rdfs:label,oio:hasExactSynonym,indian tick typhus
14,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,rdfs:label,oio:hasExactSynonym,adiaspiromycosis
17,MONDO:0000241,Keshan disease,icd11.foundation:1307765114,Keshan disease due to selenium deficiency,rdfs:label,oio:hasExactSynonym,keshan disease


In [14]:
# How many rows are in the df_label dataframe, i.e. rows where the Mondo Label is the matched field?
print(len(df_label))

1342


In [15]:
# Overview of unique values in the "Mondo Label" dataframe
df_label.nunique()

subject_id             1309
subject_label          1309
object_id              1284
object_label           1279
subject_match_field       1
object_match_field        2
match_string           1309
dtype: int64

The overview of the df_label dataframe shows 2 unique values for "object_match_field": `rdfs:label` and `oio:hasExactSynonym`.

Among the rows where the subject_match_field is `rdfs:label` (referred to as "label" in later analysis cells), there are 4 rows where the object_match_field is also `rdfs:label`.

These label-to-label matches belong in the lex exact match file rather than in this file. This bug has been reported.
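
The breakdown of match-field combinations described above can be tabulated directly rather than read off `nunique()`. The following is a minimal sketch on toy rows; `toy_df` and its values are made up for illustration, so the counts are not the notebook's counts.

```python
import pandas as pd

LABEL = 'rdfs:label'
SYN = 'oio:hasExactSynonym'

# Toy rows covering each combination of match fields (illustrative only)
toy_df = pd.DataFrame({
    'subject_match_field': [LABEL, LABEL, LABEL, SYN, SYN],
    'object_match_field': [LABEL, SYN, SYN, LABEL, SYN],
})

# Count rows for each (subject_match_field, object_match_field) combination
combo_counts = (
    toy_df.groupby(['subject_match_field', 'object_match_field'])
    .size()
    .reset_index(name='num_rows')
)
print(combo_counts)

# Rows where both sides matched on their label
label_label_rows = toy_df[
    (toy_df['subject_match_field'] == LABEL)
    & (toy_df['object_match_field'] == LABEL)
]
print(len(label_label_rows))
```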

<a id="label-label-results"></a>
## Mondo Label and ICD11 Label Match Results
Filter the dataframe of "lex" matches where the subject (Mondo) matches on its label down to only those where the object (ICD11 Foundation) also matches on its label.
[Top](#Analysis-of-ICD11-lexical-alignment-results)

In [16]:
# Show the values where there are label to label matches
df_label_label_matches = df_label[(df_label['subject_match_field'] == 'rdfs:label') & (df_label['object_match_field'] == 'rdfs:label')]

df_label_label_matches.head()

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
1254,MONDO:0003595,sclerosing liposarcoma,icd11.foundation:1681730143,Sclerosing liposarcoma,rdfs:label,rdfs:label,sclerosing liposarcoma
4152,MONDO:0012089,ichthyosis prematurity syndrome,icd11.foundation:524111974,Ichthyosis prematurity syndrome,rdfs:label,rdfs:label,ichthyosis prematurity syndrome
5645,MONDO:0021495,benign neoplasm of sublingual gland,icd11.foundation:1267377894,Benign neoplasm of sublingual gland,rdfs:label,rdfs:label,benign neoplasm of sublingual gland
6073,MONDO:0800453,juvenile absence epilepsy,icd11.foundation:519416529,Juvenile absence epilepsy,rdfs:label,rdfs:label,juvenile absence epilepsy


---
## Review matches between Mondo Label and ICD11 Synonym

### What matches exist where the _subject_match_field_ and _object_match_field_ match except for differences in (1) punctuation between the Mondo Label and ICD11 Synonym?

In [9]:
# How often does the subject_match_field and object_match_field match except for punctuation?

# Function to remove punctuation and convert to lowercase
def preprocess(text):
    return re.sub(r'[^\w\s]', '', text).lower()

# Work on an explicit copy so the column assignments below do not trigger
# pandas' SettingWithCopyWarning (df_label was created by filtering df)
df_label = df_label.copy()

# Apply the function to both columns and create new processed columns
df_label['subject_processed'] = df_label['subject_label'].apply(preprocess)
df_label['object_processed'] = df_label['object_label'].apply(preprocess)

# Check if processed values are the same and assign the result to a new column
df_label['same_excluding_punctuation_and_case'] = df_label['subject_processed'] == df_label['object_processed']

# Drop intermediate columns if not needed
df_label = df_label.drop(columns=['subject_processed', 'object_processed'])

# Get all rows where the value for 'same_excluding_punctuation_and_case' is True
df_label_true = df_label[df_label['same_excluding_punctuation_and_case']]

df_label_true.head(len(df_label_true))

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string,same_excluding_punctuation_and_case
2847,MONDO:0007177,auriculoosteodysplasia,icd11.foundation:1153492135,Auriculo-osteodysplasia,rdfs:label,oio:hasExactSynonym,auriculoosteodysplasia,True
4477,MONDO:0015449,criss-cross heart,icd11.foundation:856695997,Crisscross heart,rdfs:label,oio:hasExactSynonym,criss-cross heart,True
4900,MONDO:0018082,aorto-ventricular tunnel,icd11.foundation:470594532,Aortoventricular tunnel,rdfs:label,oio:hasExactSynonym,aorto-ventricular tunnel,True


### What matches exist where the _subject_match_field_ and _object_match_field_ match except for (1) a difference in word order and/or (2) punctuation between the Mondo Label and ICD11 Synonym?

In [10]:
# How often does the subject_match_field and object_match_field match except for a difference in word order and punctuation?

# Function to preprocess text by removing punctuation, converting to lowercase, and sorting words
def preprocess_and_sort(text):
    # Remove punctuation and convert to lowercase
    text = re.sub(r'[^\w\s]', '', text).lower()
    # Split into words, sort, and join back into a string
    sorted_words = ' '.join(sorted(text.split()))
    return sorted_words

# # Create dataframe where the 'subject_match_field' is rdfs:label
# df_label = df[df['subject_match_field'] == 'rdfs:label']

# Apply the function to both columns and compare
df_label['subject_processed'] = df_label['subject_label'].apply(preprocess_and_sort)
df_label['object_processed'] = df_label['object_label'].apply(preprocess_and_sort)

# Check if processed values are the same
df_label['same_excluding_word_order'] = df_label['subject_processed'] == df_label['object_processed']

# Drop intermediate columns if not needed
df_label = df_label.drop(columns=['subject_processed', 'object_processed'])

# Get all rows where same_excluding_word_order is True
filtered_df = df_label[df_label['same_excluding_word_order']]

filtered_df.head(len(filtered_df))

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string,same_excluding_punctuation_and_case,same_excluding_word_order
1254,MONDO:0003595,sclerosing liposarcoma,icd11.foundation:1681730143,Sclerosing liposarcoma,rdfs:label,rdfs:label,sclerosing liposarcoma,False,True
2847,MONDO:0007177,auriculoosteodysplasia,icd11.foundation:1153492135,Auriculo-osteodysplasia,rdfs:label,oio:hasExactSynonym,auriculoosteodysplasia,True,True
4152,MONDO:0012089,ichthyosis prematurity syndrome,icd11.foundation:524111974,Ichthyosis prematurity syndrome,rdfs:label,rdfs:label,ichthyosis prematurity syndrome,False,True
4477,MONDO:0015449,criss-cross heart,icd11.foundation:856695997,Crisscross heart,rdfs:label,oio:hasExactSynonym,criss-cross heart,True,True
4900,MONDO:0018082,aorto-ventricular tunnel,icd11.foundation:470594532,Aortoventricular tunnel,rdfs:label,oio:hasExactSynonym,aorto-ventricular tunnel,True,True
5645,MONDO:0021495,benign neoplasm of sublingual gland,icd11.foundation:1267377894,Benign neoplasm of sublingual gland,rdfs:label,rdfs:label,benign neoplasm of sublingual gland,False,True
6073,MONDO:0800453,juvenile absence epilepsy,icd11.foundation:519416529,Juvenile absence epilepsy,rdfs:label,rdfs:label,juvenile absence epilepsy,False,True
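
The label-to-label rows are `False` for `same_excluding_punctuation_and_case` but `True` for `same_excluding_word_order`. One explanation consistent with the extra-whitespace finding in the Analysis Results is that `preprocess` leaves whitespace untouched, while `preprocess_and_sort` splits on whitespace and rejoins the words with single spaces. The sketch below illustrates this. The doubled space in the ICD11 label is an assumption made for illustration, since the rendered table does not show the raw whitespace.

```python
import re

def preprocess(text):
    # Remove punctuation and lowercase; whitespace is left untouched
    return re.sub(r'[^\w\s]', '', text).lower()

def preprocess_and_sort(text):
    # Same cleanup, then split on any whitespace run and sort the words
    text = re.sub(r'[^\w\s]', '', text).lower()
    return ' '.join(sorted(text.split()))

mondo_label = 'sclerosing liposarcoma'
icd11_label = 'Sclerosing  liposarcoma'  # assumed doubled space, for illustration

punct_only_equal = preprocess(mondo_label) == preprocess(icd11_label)
sorted_equal = preprocess_and_sort(mondo_label) == preprocess_and_sort(icd11_label)
print(punct_only_equal, sorted_equal)

# Hyphenation differences are already handled by the punctuation-only step
hyphen_equal = preprocess('criss-cross heart') == preprocess('Crisscross heart')
print(hyphen_equal)
```

If this is the cause, collapsing whitespace runs in `preprocess` would let the punctuation-only check catch these rows as well.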


### What matches exist where the _subject_match_field_ and _object_match_field_ match except that (1) the object_match_field contains an extra value at the end in parentheses, in addition to (2) differences in word order and (3) punctuation?

In [11]:
# How often does the subject_match_field and object_match_field match except that the 
# object_match_field contains an extra value at the end in parentheses?

# Function to normalize strings
def normalize_string(s):
    # Strip off the value in parentheses when at end of string
    s = re.sub(r'\s*\(.*?\)$', '', s)
    # Remove punctuation and convert to lowercase
    text = re.sub(r'[^\w\s]', '', s).lower()
    # Split into words, sort, and join back into a string
    sorted_words = ' '.join(sorted(text.split()))
    return sorted_words


# Apply normalization to columns and check for equality
df_label['subject_normalized'] = df_label['subject_label'].apply(normalize_string)
df_label['object_normalized'] = df_label['object_label'].apply(normalize_string)

df_label['values_match'] = df_label['subject_normalized'] == df_label['object_normalized']

# # Debug normalize_string function
# result = df_label.loc[df_label['subject_label'] == 'pharyngitis']
# result.head()

# Drop intermediate columns if not needed
df_label = df_label.drop(columns=['subject_normalized', 'object_normalized'])

# Get all rows where values_match is True
proposed_mondo_label_matches_df = df_label[df_label['values_match']]

<a id="label-synonym-results"></a>
### Mondo Label to ICD11 Synonym Match Results after excluding trailing token(s) in parentheses
[Top](#Analysis-of-ICD11-lexical-alignment-results)

In [12]:
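# Display matches
# NOTE: head(len(filtered_df)) shows only the first 7 rows, because filtered_df (the word-order matches above) has 7 rows;
# use head(len(proposed_mondo_label_matches_df)) to display every proposed match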
proposed_mondo_label_matches_df.head(len(filtered_df))

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string,same_excluding_punctuation_and_case,same_excluding_word_order,values_match
23,MONDO:0000261,adenoiditis,icd11.foundation:697120322,Adenoiditis (TM2),rdfs:label,oio:hasExactSynonym,adenoiditis,False,False,True
118,MONDO:0000728,ptosis,icd11.foundation:529558368,Ptosis (TM2),rdfs:label,oio:hasExactSynonym,ptosis,False,False,True
120,MONDO:0000739,uvulitis,icd11.foundation:491894702,Uvulitis (TM2),rdfs:label,oio:hasExactSynonym,uvulitis,False,False,True
144,MONDO:0000918,endometritis,icd11.foundation:1925127728,Endometritis (TM2),rdfs:label,oio:hasExactSynonym,endometritis,False,False,True
171,MONDO:0000986,pleurisy,icd11.foundation:1999568083,Pleurisy (TM2),rdfs:label,oio:hasExactSynonym,pleurisy,False,False,True
193,MONDO:0001039,tonsillitis,icd11.foundation:510786647,Tonsillitis (nearest) (TM2),rdfs:label,oio:hasExactSynonym,tonsillitis,False,False,True
259,MONDO:0001166,nephritis,icd11.foundation:840913416,Nephritis (nearest) (TM2),rdfs:label,oio:hasExactSynonym,nephritis,False,False,True


### Save results to file

In [13]:
# Save to file
# proposed_mondo_label_matches_df.to_excel('../data/output/new_lex_exact.xlsx', index=False)

---
## Review additional cases of Mondo Label and ICD11 Synonym matches

## Find all matches where the same Mondo term matches an ICD11 term in more than one match row

In [14]:
# (1) Find all matches where the same Mondo term matches to an ICD11 term based on more than one match row, 
# e.g. match label to synonym and also synonym to synonym or multiple synonym to synonym matches

# Use the original dataframe (df) that contains all of the "lex" matches to find these duplicated Mondo IDs first
# and then subtract out the Mondo IDs that are in "proposed_mondo_label_matches_df" (these were verified by other methods earlier)

# Create a new dataframe with only the duplicated subject_id rows in the original dataframe
duplicated_ids = df[df.duplicated(subset='subject_id', keep=False)]['subject_id'].unique()
print(len(duplicated_ids))

duplicated_mondo_ids_df = df[df['subject_id'].isin(duplicated_ids)]
duplicated_mondo_ids_df.head(10)

1509


Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
0,MONDO:0000001,disease,icd11.foundation:1659232486,"Mitral valve disease, unspecified",oio:hasExactSynonym,oio:hasExactSynonym,disease
1,MONDO:0000001,disease,icd11.foundation:1659232486,"Mitral valve disease, unspecified",rdfs:label,oio:hasExactSynonym,disease
3,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,oio:hasExactSynonym,oio:hasExactSynonym,pubertas praecox
4,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,oio:hasExactSynonym,oio:hasExactSynonym,sexual precocity
5,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,rdfs:label,oio:hasExactSynonym,precocious puberty
8,MONDO:0000193,cortisone reductase deficiency,icd11.foundation:1515798114,Hyperandrogenism due to cortisone reductase de...,oio:hasExactSynonym,oio:hasExactSynonym,11-beta-hydroxysteroid dehydrogenase deficienc...
9,MONDO:0000193,cortisone reductase deficiency,icd11.foundation:1515798114,Hyperandrogenism due to cortisone reductase de...,oio:hasExactSynonym,rdfs:label,hyperandrogenism due to cortisone reductase de...
12,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,oio:hasExactSynonym,oio:hasExactSynonym,adiaspiromycosis
13,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,oio:hasExactSynonym,rdfs:label,pulmonary adiaspiromycosis
14,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,rdfs:label,oio:hasExactSynonym,adiaspiromycosis
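
`duplicated(subset='subject_id', keep=False)` flags every row whose `subject_id` occurs more than once, including the first occurrence. The default `keep='first'` would leave the first row of each group unflagged. A minimal sketch on toy rows (the data is made up for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'subject_id': ['MONDO:1', 'MONDO:1', 'MONDO:2', 'MONDO:3', 'MONDO:3', 'MONDO:3'],
    'match_string': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# keep=False marks all members of a duplicated group
all_dupes = toy_df.duplicated(subset='subject_id', keep=False)
# The default (keep='first') marks only the second and later occurrences
later_dupes = toy_df.duplicated(subset='subject_id')
print(all_dupes.sum(), later_dupes.sum())

# Unique IDs that appear in more than one row
duplicated_ids = toy_df.loc[all_dupes, 'subject_id'].unique()
print(list(duplicated_ids))
```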


In [15]:
duplicated_mondo_ids_df.nunique()

subject_id             1509
subject_label          1509
object_id              1913
object_label           1878
subject_match_field       2
object_match_field        2
match_string           3327
dtype: int64

In [16]:
# For all Mondo IDs in "proposed_mondo_label_matches_df", subtract out any row from "duplicated_mondo_ids_df" 
# that has a Mondo ID in "proposed_mondo_label_matches_df" since the "proposed_mondo_label_matches_df" were 
# already verified by other analysis steps above

# Step 1: Get all unique subject_id values from "proposed_mondo_label_matches_df"
unique_subject_ids = proposed_mondo_label_matches_df['subject_id'].unique()

# Step 2: Remove rows from "duplicated_mondo_ids_df" that have the same subject_id value from "proposed_mondo_label_matches_df"
filtered_duplicated_mondo_ids_df = duplicated_mondo_ids_df[~duplicated_mondo_ids_df['subject_id'].isin(unique_subject_ids)]

filtered_duplicated_mondo_ids_df.head(10)


Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
0,MONDO:0000001,disease,icd11.foundation:1659232486,"Mitral valve disease, unspecified",oio:hasExactSynonym,oio:hasExactSynonym,disease
1,MONDO:0000001,disease,icd11.foundation:1659232486,"Mitral valve disease, unspecified",rdfs:label,oio:hasExactSynonym,disease
3,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,oio:hasExactSynonym,oio:hasExactSynonym,pubertas praecox
4,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,oio:hasExactSynonym,oio:hasExactSynonym,sexual precocity
5,MONDO:0000088,precocious puberty,icd11.foundation:1495024153,Peripheral precocious puberty,rdfs:label,oio:hasExactSynonym,precocious puberty
8,MONDO:0000193,cortisone reductase deficiency,icd11.foundation:1515798114,Hyperandrogenism due to cortisone reductase de...,oio:hasExactSynonym,oio:hasExactSynonym,11-beta-hydroxysteroid dehydrogenase deficienc...
9,MONDO:0000193,cortisone reductase deficiency,icd11.foundation:1515798114,Hyperandrogenism due to cortisone reductase de...,oio:hasExactSynonym,rdfs:label,hyperandrogenism due to cortisone reductase de...
12,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,oio:hasExactSynonym,oio:hasExactSynonym,adiaspiromycosis
13,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,oio:hasExactSynonym,rdfs:label,pulmonary adiaspiromycosis
14,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,rdfs:label,oio:hasExactSynonym,adiaspiromycosis


In [17]:
filtered_duplicated_mondo_ids_df.nunique()

subject_id             1429
subject_label          1429
object_id              1780
object_label           1750
subject_match_field       2
object_match_field        2
match_string           3197
dtype: int64

In [18]:
# Test that Mondo IDs in "proposed_mondo_label_matches_df" are removed

# Specific subject_id to find
# MONDO:0000261	adenoiditis	icd11.foundation:697120322	Adenoiditis (TM2)	rdfs:label	oio:hasExactSynonym	adenoiditis
specific_subject_id = 'MONDO:0000261' 

# Find rows with the specific subject_id
temp_result = filtered_duplicated_mondo_ids_df[filtered_duplicated_mondo_ids_df['subject_id'] == specific_subject_id]

temp_result.head()


Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
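
The spot check above inspects a single Mondo ID by eye. As a sketch, the whole exclusion can be verified at once with an assertion. The toy frames below are made up for illustration. In the notebook's own output, the unique `subject_id` count drops from 1509 to 1429 after this filter.

```python
import pandas as pd

# Toy stand-ins for duplicated_mondo_ids_df and proposed_mondo_label_matches_df
duplicated_toy = pd.DataFrame({
    'subject_id': ['MONDO:1', 'MONDO:1', 'MONDO:2', 'MONDO:2', 'MONDO:3', 'MONDO:3'],
})
proposed_toy = pd.DataFrame({'subject_id': ['MONDO:2']})

# Anti-join: keep rows whose subject_id is NOT among the already-proposed IDs
excluded_ids = proposed_toy['subject_id'].unique()
remaining = duplicated_toy[~duplicated_toy['subject_id'].isin(excluded_ids)]

# No excluded ID may survive the filter
overlap = set(remaining['subject_id']) & set(excluded_ids)
assert not overlap, f'IDs not removed: {overlap}'
print(remaining['subject_id'].nunique())
```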


---
### Find all rows where the Mondo ID is duplicated and exists in rows with both Mondo label to ICD11 synonym and Mondo synonym to ICD11 synonym matches

In [19]:
# Step 1: Identify duplicated subject_id values
duplicated_subject_ids = filtered_duplicated_mondo_ids_df['subject_id'][filtered_duplicated_mondo_ids_df['subject_id'].duplicated()].unique()
print('duplicated_subject_ids: ', len(duplicated_subject_ids))

# Step 2: Find subject_ids whose rows include both 'rdfs:label' and 'oio:hasExactSynonym' in object_match_field
subject_ids_with_both = filtered_duplicated_mondo_ids_df.groupby('subject_id').filter(
    lambda x: set(['rdfs:label', 'oio:hasExactSynonym']).issubset(x['object_match_field'].values))['subject_id'].unique()

# Step 3: Filter the DataFrame to include only the rows with these subject_id values
label_and_syn_matches_df = filtered_duplicated_mondo_ids_df[filtered_duplicated_mondo_ids_df['subject_id'].isin(subject_ids_with_both)]

label_and_syn_matches_df.nunique()

duplicated_subject_ids:  1429


subject_id              654
subject_label           654
object_id               921
object_label            908
subject_match_field       2
object_match_field        2
match_string           1720
dtype: int64

In [20]:
label_and_syn_matches_df.head(10)

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
8,MONDO:0000193,cortisone reductase deficiency,icd11.foundation:1515798114,Hyperandrogenism due to cortisone reductase de...,oio:hasExactSynonym,oio:hasExactSynonym,11-beta-hydroxysteroid dehydrogenase deficienc...
9,MONDO:0000193,cortisone reductase deficiency,icd11.foundation:1515798114,Hyperandrogenism due to cortisone reductase de...,oio:hasExactSynonym,rdfs:label,hyperandrogenism due to cortisone reductase de...
12,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,oio:hasExactSynonym,oio:hasExactSynonym,adiaspiromycosis
13,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,oio:hasExactSynonym,rdfs:label,pulmonary adiaspiromycosis
14,MONDO:0000239,adiaspiromycosis,icd11.foundation:1111587867,Pulmonary adiaspiromycosis,rdfs:label,oio:hasExactSynonym,adiaspiromycosis
15,MONDO:0000239,adiaspiromycosis,icd11.foundation:139402453,Adiaspirosis,oio:hasExactSynonym,oio:hasExactSynonym,haplosporangiosis
16,MONDO:0000239,adiaspiromycosis,icd11.foundation:139402453,Adiaspirosis,oio:hasExactSynonym,rdfs:label,adiaspirosis
18,MONDO:0000242,tinea barbae,icd11.foundation:1201486458,Dermatophytosis of beard,oio:hasExactSynonym,rdfs:label,dermatophytosis of beard
19,MONDO:0000242,tinea barbae,icd11.foundation:1201486458,Dermatophytosis of beard,rdfs:label,oio:hasExactSynonym,tinea barbae
35,MONDO:0000337,exanthema subitum,icd11.foundation:1883970802,Roseola infantum,oio:hasExactSynonym,oio:hasExactSynonym,roseola
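
The `groupby(...).filter(...)` call in Step 2 keeps every row of a `subject_id` group when that group's `object_match_field` values include both `rdfs:label` and `oio:hasExactSynonym`. The subject side is not checked, which is why a group such as MONDO:0000193, whose rows all match on a Mondo synonym, appears in the result. A minimal sketch of the same filter on toy rows (data made up for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'subject_id': ['MONDO:1', 'MONDO:1', 'MONDO:2', 'MONDO:2', 'MONDO:3', 'MONDO:3'],
    'object_match_field': [
        'rdfs:label', 'oio:hasExactSynonym',            # MONDO:1 has both
        'oio:hasExactSynonym', 'oio:hasExactSynonym',   # MONDO:2 synonym only
        'rdfs:label', 'rdfs:label',                     # MONDO:3 label only
    ],
})

required = {'rdfs:label', 'oio:hasExactSynonym'}

# filter() returns all rows of each group for which the function is True
kept = toy_df.groupby('subject_id').filter(
    lambda g: required.issubset(g['object_match_field'].values)
)
ids_with_both = kept['subject_id'].unique().tolist()
print(ids_with_both)
```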


### Find all rows where the Mondo ID is duplicated and exists _only_ in rows with Mondo synonym to ICD11 label matches

In [21]:
# From "filtered_duplicated_mondo_ids_df", keep only the Mondo IDs whose rows are all
# Mondo synonym to ICD11 label matches (as implemented by the filter in Step 2)

# Step 1: Identify duplicated subject_id values
duplicated_subject_ids = filtered_duplicated_mondo_ids_df['subject_id'][filtered_duplicated_mondo_ids_df['subject_id'].duplicated()].unique()

# Step 2: Find subject_ids where every row has 'oio:hasExactSynonym' in subject_match_field and 'rdfs:label' in object_match_field
subject_ids_with_both_conditions = filtered_duplicated_mondo_ids_df.groupby('subject_id').filter(
    lambda x: all((x['subject_match_field'] == 'oio:hasExactSynonym') & (x['object_match_field'] == 'rdfs:label'))
)['subject_id'].unique()


# Step 3: Filter the DataFrame to include only the rows with these subject_id values
exact_synonym_only_matches_df = filtered_duplicated_mondo_ids_df[filtered_duplicated_mondo_ids_df['subject_id'].isin(subject_ids_with_both_conditions)]


exact_synonym_only_matches_df.nunique()

# NOTE: Some of these could be found as exact matches by removing stop words and doing the match on an ordered set of tokens in the term labels

subject_id              58
subject_label           58
object_id              121
object_label           108
subject_match_field      1
object_match_field       1
match_string           108
dtype: int64
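
The Step 2 filter keeps a `subject_id` only when every one of its rows satisfies the condition, because `all(...)` is evaluated over the whole group. The sketch below computes the same per-group test without a Python lambda, by grouping a boolean row mask. The toy data and the name `syn_to_label_only_ids` are made up for illustration.

```python
import pandas as pd

SYN = 'oio:hasExactSynonym'
LABEL = 'rdfs:label'

toy_df = pd.DataFrame({
    'subject_id': ['MONDO:1', 'MONDO:1', 'MONDO:2', 'MONDO:2', 'MONDO:3', 'MONDO:3'],
    'subject_match_field': [SYN, SYN, SYN, SYN, SYN, LABEL],
    'object_match_field': [LABEL, LABEL, LABEL, SYN, LABEL, SYN],
})

# Row-level condition: Mondo synonym matched an ICD11 label
row_ok = (toy_df['subject_match_field'] == SYN) & (toy_df['object_match_field'] == LABEL)

# A subject_id qualifies only if all of its rows satisfy the condition
ids_all_ok = row_ok.groupby(toy_df['subject_id']).all()
syn_to_label_only_ids = ids_all_ok[ids_all_ok].index.tolist()
print(syn_to_label_only_ids)
```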

In [22]:
exact_synonym_only_matches_df.head(10)

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
42,MONDO:0000371,oral cavity carcinoma in situ,icd11.foundation:1389868484,Carcinoma in situ of oral cavity,oio:hasExactSynonym,rdfs:label,carcinoma in situ of oral cavity
43,MONDO:0000371,oral cavity carcinoma in situ,icd11.foundation:1859678392,Carcinoma in situ of mouth,oio:hasExactSynonym,rdfs:label,carcinoma in situ of mouth
55,MONDO:0000402,small cell carcinoma,icd11.foundation:1294602000,Small cell neuroendocrine carcinoma,oio:hasExactSynonym,rdfs:label,small cell neuroendocrine carcinoma
56,MONDO:0000402,small cell carcinoma,icd11.foundation:947483629,Oat cell carcinoma,oio:hasExactSynonym,rdfs:label,oat cell carcinoma
270,MONDO:0001175,immature cataract,icd11.foundation:1712242501,Incipient cataract,oio:hasExactSynonym,rdfs:label,incipient cataract
271,MONDO:0001175,immature cataract,icd11.foundation:848377336,Water clefts,oio:hasExactSynonym,rdfs:label,water clefts
304,MONDO:0001229,small intestine diverticulitis,icd11.foundation:1351188281,Diverticulitis of small intestine,oio:hasExactSynonym,rdfs:label,diverticulitis of small intestine
305,MONDO:0001229,small intestine diverticulitis,icd11.foundation:752440271,Diverticulosis of small intestine with haemorr...,oio:hasExactSynonym,rdfs:label,diverticulosis of small intestine with haemorr...
747,MONDO:0002056,breast fibroadenoma,icd11.foundation:143326763,Fibroadenoma of breast,oio:hasExactSynonym,rdfs:label,fibroadenoma of breast
748,MONDO:0002056,breast fibroadenoma,icd11.foundation:639666497,Juvenile fibroadenoma of breast,oio:hasExactSynonym,rdfs:label,juvenile fibroadenoma of breast
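
Several pairs above differ only by word order and a connecting word such as "of", for example "small intestine diverticulitis" and "Diverticulitis of small intestine". The sketch below tries the stop-word idea from the note in the previous cell. The `STOP_WORDS` set and the `token_key` function are assumptions made for illustration and are not part of the pipeline. For simplicity it compares the Mondo label to the ICD11 label, although these rows were matched on a Mondo synonym.

```python
import re

# Illustrative stop-word list; a real list would need curator input
STOP_WORDS = {'of', 'the'}

def token_key(text, stop_words=STOP_WORDS):
    # Remove punctuation, lowercase, drop stop words, and sort the tokens
    text = re.sub(r'[^\w\s]', '', text).lower()
    tokens = [t for t in text.split() if t not in stop_words]
    return tuple(sorted(tokens))

# (Mondo label, ICD11 label) pairs taken from the table above
pairs = [
    ('small intestine diverticulitis', 'Diverticulitis of small intestine'),
    ('breast fibroadenoma', 'Fibroadenoma of breast'),
    ('breast fibroadenoma', 'Juvenile fibroadenoma of breast'),
    ('immature cataract', 'Incipient cataract'),
]
results = [token_key(a) == token_key(b) for a, b in pairs]
print(results)
```

Pairs that still differ after this normalization, such as the "Juvenile" variant, are the ones that would stay in the curator review queue.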


---
## Find all rows with duplicated Mondo IDs and Mondo Synonym to ICD11 Label and Mondo Synonym to ICD11 Synonym matches

In [23]:
# Step 1: Identify duplicated object_id values
duplicated_object_ids = filtered_duplicated_mondo_ids_df['object_id'][filtered_duplicated_mondo_ids_df['object_id'].duplicated(keep=False)]

# Step 2: Filter the DataFrame to include only the rows with these duplicated object_id values
temp_df = filtered_duplicated_mondo_ids_df[filtered_duplicated_mondo_ids_df['object_id'].isin(duplicated_object_ids)]
temp_df.head(11)

# Step 3: Group by object_id and collect the subject_id values from groups whose subject_match_field contains both 'oio:hasExactSynonym' and 'rdfs:label'
subject_ids_with_both = temp_df.groupby('object_id').filter(
    lambda x: set(['rdfs:label', 'oio:hasExactSynonym']).issubset(x['subject_match_field'].values)
)['subject_id'].unique()

# Step 4: Filter the DataFrame to exclude rows with these subject_id values
icd11_label_and_syn_results_df = temp_df[~temp_df['subject_id'].isin(subject_ids_with_both)]

icd11_label_and_syn_results_df.nunique()

# NOTE: Some of these _may_ match as ordered set of tokens with stop words removed, BUT I predict these 'matches' will have more 
# variability and not be correct matches since the ICD11 synonyms vary greatly and this dataset is not grounded by any Mondo IDs that 
# had a match from a Mondo Label

subject_id              460
subject_label           460
object_id               464
object_label            462
subject_match_field       1
object_match_field        2
match_string           1121
dtype: int64

In [24]:
icd11_label_and_syn_results_df.head(len(icd11_label_and_syn_results_df))

Unnamed: 0,subject_id,subject_label,object_id,object_label,subject_match_field,object_match_field,match_string
8,MONDO:0000193,cortisone reductase deficiency,icd11.foundation:1515798114,Hyperandrogenism due to cortisone reductase de...,oio:hasExactSynonym,oio:hasExactSynonym,11-beta-hydroxysteroid dehydrogenase deficienc...
9,MONDO:0000193,cortisone reductase deficiency,icd11.foundation:1515798114,Hyperandrogenism due to cortisone reductase de...,oio:hasExactSynonym,rdfs:label,hyperandrogenism due to cortisone reductase de...
44,MONDO:0000372,pharynx carcinoma in situ,icd11.foundation:1272356737,Carcinoma in situ of pharynx,oio:hasExactSynonym,oio:hasExactSynonym,pharyngeal carcinoma in situ
45,MONDO:0000372,pharynx carcinoma in situ,icd11.foundation:1272356737,Carcinoma in situ of pharynx,oio:hasExactSynonym,rdfs:label,carcinoma in situ of pharynx
48,MONDO:0000384,bladder benign neoplasm,icd11.foundation:750827946,Benign neoplasm of bladder,oio:hasExactSynonym,oio:hasExactSynonym,benign bladder tumour
...,...,...,...,...,...,...,...
6060,MONDO:0800042,restrictive dermopathy 1,icd11.foundation:713433700,Lethal restrictive dermopathy,oio:hasExactSynonym,oio:hasExactSynonym,hyperkeratosis-contracture syndrome
6061,MONDO:0800042,restrictive dermopathy 1,icd11.foundation:713433700,Lethal restrictive dermopathy,oio:hasExactSynonym,oio:hasExactSynonym,"tight skin contracture syndrome, lethal"
6082,MONDO:8000015,"46,XY sex reversal 11",icd11.foundation:1581551380,Embryonic testicular regression syndrome,oio:hasExactSynonym,oio:hasExactSynonym,testicular regression syndrome
6083,MONDO:8000015,"46,XY sex reversal 11",icd11.foundation:1581551380,Embryonic testicular regression syndrome,oio:hasExactSynonym,oio:hasExactSynonym,vanishing testes syndrome
