### Analysis of NCIT Subclass Source provenance after running Subclass Sync pipeline

See repo `README.md` file for more information.

#### Input Data Files
- ncit_subclass_source_provenance.csv - contains all Mondo classes that have a SubclassOf source annotation with a value from NCIT _after_ the Subclass Sync is run. It was created by running the SPARQL query in `ncit_subclass_source.ru` against a version of `mondo-edit.obo` that contains changes from the Subclass Sync pipeline (it was copied from the files in [#8503](https://github.com/monarch-initiative/mondo/pull/8503)).
<br><br>
- ncit_neoplasm_child_terms.csv - contains all NCIT terms in the neoplasm branch. It was created by running the SPARQL query `ncit_neoplasm_children.ru` on the file `component-download-ncit.owl.owl` copied from the `main` branch in the mondo-ingest repo.
<br><br>
- mondo-edit_all_ncit_subclass_source_provenance.csv - contains all Mondo classes that have a SubclassOf source annotation with a value from NCIT _before_ the updated Subclass Sync was run. It was created by running the SPARQL query `ncit_and_all_subclass_source_provenance.ru` on a version of `mondo-edit.obo` copied from the `master` branch of the mondo repo on 6Jan2025, before the updated Subclass Sync pipeline was run. The `allSourceList` column holds every subclass source on the axiom, not only NCIT values, and the `xrefList` column holds every xref on the class, not only NCIT xrefs.

In [1]:
# imports
import pandas as pd

---

### Read file of Mondo classes that contain Subclass axioms with a source annotation from NCIT

In [2]:
# Read in file. The file `ncit_subclass_source_provenance.csv` contains all Mondo classes that have a 
# SubclassOf source annotation with a value from NCIT _after_ the Subclass Sync is run.

df = pd.read_csv('ncit_subclass_source_provenance.csv')
df.head()

Unnamed: 0,mondoIRI,mondoLabel,parentClass,source,xrefList
0,MONDO:0000147,polyposis,http://purl.obolibrary.org/obo/MONDO_0021075,NCIT:C4089,NCIT:C4089
1,MONDO:0000337,exanthema subitum,_:b0,NCIT:C128420,NCIT:C128420
2,MONDO:0000371,oral cavity carcinoma in situ,http://purl.obolibrary.org/obo/MONDO_0044925,NCIT:C4587,NCIT:C4587
3,MONDO:0000372,pharynx carcinoma in situ,http://purl.obolibrary.org/obo/MONDO_0004647,NCIT:C4942,NCIT:C4942
4,MONDO:0000372,pharynx carcinoma in situ,http://purl.obolibrary.org/obo/MONDO_0021345,NCIT:C4942,NCIT:C4942


### Confirm the NCIT subclassOf source annotation value also exists as an xref for the Mondo term

In [3]:
# Check that the value for `source` is found within the `xrefList` of values. This is a "sanity check" since 
# any row with a subclassOf source annotation from NCIT should also have an xref to this same NCIT value.

# Flag each row where 'source' exactly matches one of the ", "-separated values in 'xrefList'
df["source_in_xrefList"] = df.apply(
    lambda row: row["source"] in row["xrefList"].split(", "), axis=1
)

# Display the resulting DataFrame
df.head()

Unnamed: 0,mondoIRI,mondoLabel,parentClass,source,xrefList,source_in_xrefList
0,MONDO:0000147,polyposis,http://purl.obolibrary.org/obo/MONDO_0021075,NCIT:C4089,NCIT:C4089,True
1,MONDO:0000337,exanthema subitum,_:b0,NCIT:C128420,NCIT:C128420,True
2,MONDO:0000371,oral cavity carcinoma in situ,http://purl.obolibrary.org/obo/MONDO_0044925,NCIT:C4587,NCIT:C4587,True
3,MONDO:0000372,pharynx carcinoma in situ,http://purl.obolibrary.org/obo/MONDO_0004647,NCIT:C4942,NCIT:C4942,True
4,MONDO:0000372,pharynx carcinoma in situ,http://purl.obolibrary.org/obo/MONDO_0021345,NCIT:C4942,NCIT:C4942,True
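
The check above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's exact code: the helper name `source_in_xrefs` and the `EX:0001` value are hypothetical, and the guard for missing values is an added assumption (the cell above would raise if `xrefList` were empty for a row). Splitting the list before testing membership matters because a plain substring test would treat `NCIT:C4` as present in `NCIT:C4089`.

```python
import pandas as pd


def source_in_xrefs(source, xref_list, sep=", "):
    """Return True if `source` exactly equals one of the values in `xref_list`."""
    # Assumption: a missing source or xref list counts as "not found".
    if pd.isna(source) or pd.isna(xref_list):
        return False
    return source in [x.strip() for x in str(xref_list).split(sep)]


# Toy rows; the source/xref pairs mirror values shown in the outputs above,
# except EX:0001, which is a made-up placeholder.
toy = pd.DataFrame(
    {
        "source": ["NCIT:C4089", "NCIT:C34646-textdef", "NCIT:C88412", "NCIT:C4089"],
        "xrefList": ["NCIT:C4089", "NCIT:C34646", "EX:0001, NCIT:C75488", None],
    }
)
toy["source_in_xrefList"] = [
    source_in_xrefs(s, x) for s, x in zip(toy["source"], toy["xrefList"])
]
print(toy)
```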


In [4]:
df.nunique()

mondoIRI              2865
mondoLabel            2865
parentClass           1626
source                2869
xrefList              2863
source_in_xrefList       2
dtype: int64

In [5]:
print(len(df))

4108


In [6]:
# Find all rows where `source_in_xrefList` is False, i.e. the NCIT source used as the provenance for 
# the subclassOf annotation is not found as an xref.

# Keep only the rows where 'source' is not one of the values in 'xrefList'
mismatch_rows = df[~df["source_in_xrefList"]]

mismatch_rows.head(len(mismatch_rows))

Unnamed: 0,mondoIRI,mondoLabel,parentClass,source,xrefList,source_in_xrefList
124,MONDO:0001076,glucose intolerance,_:b2,NCIT:C34646-textdef,NCIT:C34646,False
827,MONDO:0002868,bile duct mucinous cystic neoplasm with an ass...,_:b22,NCIT:C4130-modified,NCIT:C4130,False
2414,MONDO:0005377,nephrotic syndrome,_:b55,NCIT:C34845-def,NCIT:C34845,False
2415,MONDO:0005377,nephrotic syndrome,_:b56,NCIT:C34845-def,NCIT:C34845,False
2416,MONDO:0005377,nephrotic syndrome,_:b57,NCIT:C34845-def,NCIT:C34845,False
2926,MONDO:0007566,multiple self-healing squamous epithelioma,_:b76,MONDO:NCIT,NCIT:C4461,False
2992,MONDO:0010726,Rett syndrome,_:b89,NCIT:C88412,NCIT:C75488,False
3441,MONDO:0021081,anti-NMDA receptor encephalitis,_:b118,MONDO:from-NCIT-text-def,NCIT:C94853,False


In [7]:
mismatch_rows.nunique()

mondoIRI              6
mondoLabel            6
parentClass           8
source                6
xrefList              6
source_in_xrefList    1
dtype: int64
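
The six distinct mismatching `source` values listed above seem to fall into a few groups: an NCIT CURIE with a suffix such as `-def`, a non-NCIT value such as `MONDO:NCIT`, and a plain NCIT CURIE that differs from the xref. A rough way to label them is sketched below; the function names, the regular expression, and the group labels are all assumptions made for illustration.

```python
import re

import pandas as pd

# Assumption: an NCIT CURIE looks like "NCIT:C" followed by digits.
NCIT_CURIE = re.compile(r"^(NCIT:C\d+)")


def base_ncit_curie(source):
    """Return the leading NCIT CURIE of `source`, or None if there is none."""
    match = NCIT_CURIE.match(str(source))
    return match.group(1) if match else None


def classify_mismatch(source, xref_list, sep=", "):
    """Give a coarse reason why `source` was not found in `xref_list`."""
    base = base_ncit_curie(source)
    if base is None:
        return "not an NCIT CURIE"
    if base != source and base in str(xref_list).split(sep):
        return "suffixed NCIT CURIE"
    return "NCIT CURIE not in xrefList"


# The distinct source/xrefList pairs from the mismatch table above.
mismatches = pd.DataFrame(
    {
        "source": [
            "NCIT:C34646-textdef",
            "NCIT:C4130-modified",
            "NCIT:C34845-def",
            "MONDO:NCIT",
            "NCIT:C88412",
            "MONDO:from-NCIT-text-def",
        ],
        "xrefList": [
            "NCIT:C34646",
            "NCIT:C4130",
            "NCIT:C34845",
            "NCIT:C4461",
            "NCIT:C75488",
            "NCIT:C94853",
        ],
    }
)
mismatches["reason"] = [
    classify_mismatch(s, x) for s, x in zip(mismatches["source"], mismatches["xrefList"])
]
print(mismatches)
```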

---
### Read file of NCIT neoplasm terms

In [8]:
# Read in a list of all NCIT curies of terms in the neoplasm branch

df_ncit = pd.read_csv("ncit_neoplasm_child_terms.csv")

df_ncit.head()

Unnamed: 0,subclassCURIE
0,NCIT:C3263
1,NCIT:C3010
2,NCIT:C7335
3,NCIT:C3575
4,NCIT:C4536


In [9]:
df_ncit.nunique()

subclassCURIE    13924
dtype: int64

### Confirm that all NCIT subclassOf source annotation values are found in the neoplasm branch

In [10]:
# Check that all values in `df` in the source column where source_in_xrefList is True are found in df_ncit

# Get all `source` values where `source_in_xrefList` is True
sources_to_check = df[df["source_in_xrefList"] == True]["source"]

# Check if all `source` values are in df_ncit['subclassCURIE']
all_present = sources_to_check.isin(df_ncit["subclassCURIE"]).all()

# Output result
print("All NCIT subclassOf source values present in df_ncit:", all_present)

# If you want to see which sources are not in `df_ncit`
missing_sources = sources_to_check[~sources_to_check.isin(df_ncit["subclassCURIE"])]
print("\nSources not in df_ncit:")


# I spot checked a few of these NCIT values and they do still exist in NCIT.
missing_sources.head(len(missing_sources))

All NCIT subclassOf source values present in df_ncit: False

Sources not in df_ncit:


1       NCIT:C128420
152      NCIT:C27597
153      NCIT:C27597
509      NCIT:C27166
759      NCIT:C27335
            ...     
4101    NCIT:C134557
4102    NCIT:C134558
4103    NCIT:C134945
4104    NCIT:C135004
4107     NCIT:C61283
Name: source, Length: 66, dtype: object

In [11]:
missing_sources.nunique()

57
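
The row count (66) and the distinct count (57) differ because one NCIT source can annotate several subclass axioms. A small sketch of the same comparison follows; which CURIEs belong to the neoplasm list here is invented purely for the example.

```python
import pandas as pd

# Toy data: the membership of the neoplasm list is an assumption for illustration.
sources = pd.Series(
    ["NCIT:C4089", "NCIT:C128420", "NCIT:C27597", "NCIT:C27597"], name="source"
)
neoplasm = pd.Series(["NCIT:C4089", "NCIT:C3263"], name="subclassCURIE")

# Row-level view: every row whose source is outside the neoplasm list.
rows_outside = sources[~sources.isin(neoplasm)]

# Term-level view: the distinct sources outside the neoplasm list.
terms_outside = set(sources) - set(neoplasm)

print(len(rows_outside), "rows,", len(terms_outside), "distinct terms")
print(sorted(terms_outside))
```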

---

### Check number of Mondo classes with NCIT subclass source provenance _before_ running Subclass Sync

In [12]:
df_all = pd.read_csv('mondo-edit_all_ncit_subclass_source_provenance.csv')

df_all.head()

Unnamed: 0,mondoIRI,mondoLabel,parentClass,ncitSource,allSourceList,xrefList
0,MONDO:0000022,nocturnal enuresis,http://purl.obolibrary.org/obo/MONDO_0024290,NCIT:C118172,"icd11.foundation:1048673005, NCIT:C118172",NCIT:C118172
1,MONDO:0000087,polymicrogyria,http://purl.obolibrary.org/obo/MONDO_0002320,NCIT:C116936,NCIT:C116936,NCIT:C116936
2,MONDO:0000128,giant axonal neuropathy,http://purl.obolibrary.org/obo/MONDO_0005244,NCIT:C84728,"NCIT:C84728, MONDO:Redundant, MONDO:0000128/in...",NCIT:C84728
3,MONDO:0000141,mosaic variegated aneuploidy syndrome,http://purl.obolibrary.org/obo/MONDO_0021058,NCIT:C128192,"NCIT:C128192, MONDO:Redundant",NCIT:C128192
4,MONDO:0000147,polyposis,http://purl.obolibrary.org/obo/MONDO_0021075,NCIT:C4089,NCIT:C4089,NCIT:C4089


In [13]:
df_all.nunique()

mondoIRI         5831
mondoLabel       5831
parentClass      2339
ncitSource       6793
allSourceList    8190
xrefList         5811
dtype: int64

In [14]:
print(len(df_all))

8954


In [15]:
# Find which Mondo classes are in df_all but not in df (mondo subclass with NCIT source 
# provenance _after_ running Subclass Sync). This is to find how many Subclass source annotations would be lost
# given the current way the Subclass Sync information is added.

# Find rows in df_all that are not in df
diff_df = df_all.merge(df, on=["mondoIRI", "mondoLabel", "parentClass"], how="left", indicator=True)
rows_in_df_all_not_in_df = diff_df[diff_df["_merge"] == "left_only"].drop(columns=["_merge"])

rows_in_df_all_not_in_df.head()

Unnamed: 0,mondoIRI,mondoLabel,parentClass,ncitSource,allSourceList,xrefList_x,source,xrefList_y,source_in_xrefList
0,MONDO:0000022,nocturnal enuresis,http://purl.obolibrary.org/obo/MONDO_0024290,NCIT:C118172,"icd11.foundation:1048673005, NCIT:C118172",NCIT:C118172,,,
1,MONDO:0000087,polymicrogyria,http://purl.obolibrary.org/obo/MONDO_0002320,NCIT:C116936,NCIT:C116936,NCIT:C116936,,,
2,MONDO:0000128,giant axonal neuropathy,http://purl.obolibrary.org/obo/MONDO_0005244,NCIT:C84728,"NCIT:C84728, MONDO:Redundant, MONDO:0000128/in...",NCIT:C84728,,,
3,MONDO:0000141,mosaic variegated aneuploidy syndrome,http://purl.obolibrary.org/obo/MONDO_0021058,NCIT:C128192,"NCIT:C128192, MONDO:Redundant",NCIT:C128192,,,
5,MONDO:0000190,ventricular fibrillation,http://purl.obolibrary.org/obo/MONDO_0007263,NCIT:C50799,"NCIT:C50799, EFO:0004287",NCIT:C50799,,,


In [16]:
rows_in_df_all_not_in_df.nunique()

mondoIRI              3985
mondoLabel            3985
parentClass           1303
ncitSource            4304
allSourceList         4598
xrefList_x            3968
source                   0
xrefList_y               0
source_in_xrefList       0
dtype: int64

In [17]:
print(len(rows_in_df_all_not_in_df))

4811
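
The left merge with `indicator=True` used above is an anti-join: rows tagged `left_only` exist in the "before" table but have no match in the "after" table. The sketch below shows the pattern on toy frames. It differs from the cell above in two ways that are assumptions of this example: it joins on `mondoIRI` and `parentClass` only, and it merges against de-duplicated key columns so the result keeps one row per "before" row and carries no empty right-hand columns.

```python
import pandas as pd

# Toy "before" and "after" tables; values mirror rows shown earlier.
before = pd.DataFrame(
    {
        "mondoIRI": ["MONDO:0000022", "MONDO:0000147", "MONDO:0000190"],
        "parentClass": ["MONDO:0024290", "MONDO:0021075", "MONDO:0007263"],
        "ncitSource": ["NCIT:C118172", "NCIT:C4089", "NCIT:C50799"],
    }
)
after = pd.DataFrame(
    {
        "mondoIRI": ["MONDO:0000147", "MONDO:0000147"],
        "parentClass": ["MONDO:0021075", "MONDO:0021075"],
        "source": ["NCIT:C4089", "NCIT:C4089"],
    }
)

keys = ["mondoIRI", "parentClass"]

# Merge against unique keys only, so matched rows are not multiplied.
after_keys = after[keys].drop_duplicates()
merged = before.merge(after_keys, on=keys, how="left", indicator=True)

# Anti-join: keep rows that have no counterpart in `after`.
lost = merged.loc[merged["_merge"] == "left_only", before.columns]
print(lost)
```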


In [18]:
rows_in_df_all_not_in_df.dtypes

mondoIRI              object
mondoLabel            object
parentClass           object
ncitSource            object
allSourceList         object
xrefList_x            object
source                object
xrefList_y            object
source_in_xrefList    object
dtype: object

In [19]:
# Find any rows in rows_in_df_all_not_in_df where allSourceList contains only 1 source value and that value
# is from NCIT. These are rows where all subclassOf source provenance will be lost with the current 
# Subclass Sync process.

# Convert `allSourceList` to string
rows_in_df_all_not_in_df["allSourceList"] = rows_in_df_all_not_in_df["allSourceList"].astype(str)

# Filter rows where there is only 1 value and it contains "NCIT:"
filtered_rows = rows_in_df_all_not_in_df[
    rows_in_df_all_not_in_df["allSourceList"].apply(
        lambda x: len(x.split(", ")) == 1 and "NCIT:" in x
    )
]

filtered_rows.head()

Unnamed: 0,mondoIRI,mondoLabel,parentClass,ncitSource,allSourceList,xrefList_x,source,xrefList_y,source_in_xrefList
1,MONDO:0000087,polymicrogyria,http://purl.obolibrary.org/obo/MONDO_0002320,NCIT:C116936,NCIT:C116936,NCIT:C116936,,,
8,MONDO:0000334,multinodular goiter,http://purl.obolibrary.org/obo/MONDO_0006869,NCIT:C131438,NCIT:C131438,NCIT:C131438,,,
35,MONDO:0000390,vitelliform macular dystrophy,http://purl.obolibrary.org/obo/MONDO_0020242,NCIT:C118788,NCIT:C118788,NCIT:C118788,,,
42,MONDO:0000410,funisitis,http://purl.obolibrary.org/obo/MONDO_0024575,NCIT:C97077,NCIT:C97077,NCIT:C97077,,,
63,MONDO:0000502,villous adenoma,http://purl.obolibrary.org/obo/MONDO_0024276,NCIT:C7399,NCIT:C7399,NCIT:C7399,,,


In [20]:
filtered_rows.nunique()

mondoIRI              789
mondoLabel            789
parentClass           382
ncitSource            788
allSourceList         788
xrefList_x            783
source                  0
xrefList_y              0
source_in_xrefList      0
dtype: int64

In [21]:
print(len(filtered_rows))

818
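
The filter above keeps rows whose `allSourceList` holds exactly one value and that value mentions NCIT. A toy version is sketched below. The helper name is hypothetical, and it uses `startswith("NCIT:")` rather than the substring test in the cell above, which is a slightly stricter assumption about how the values are written.

```python
import pandas as pd


def only_one_ncit_source(all_sources, sep=", "):
    """Return True if the list holds exactly one value and it is an NCIT CURIE."""
    # Assumption: a missing list means there is nothing to lose.
    if pd.isna(all_sources):
        return False
    values = [v for v in str(all_sources).split(sep) if v]
    return len(values) == 1 and values[0].startswith("NCIT:")


# Toy rows; the first three values appear in the outputs above.
toy_sources = pd.DataFrame(
    {
        "allSourceList": [
            "NCIT:C116936",
            "NCIT:C50799, EFO:0004287",
            "NCIT:C128192, MONDO:Redundant",
            None,
        ]
    }
)
toy_sources["ncit_only"] = toy_sources["allSourceList"].map(only_one_ncit_source)
print(toy_sources)
```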


In [22]:
# Prepare dataframe as ROBOT template

# Remove columns that are not needed
ncit_subclass_provenance_df = filtered_rows[["mondoIRI", "mondoLabel", "parentClass", "ncitSource"]].copy()

# Convert the IRIs in "parentClass" to CURIEs. The prefix is a literal string, so no regex is needed.
ncit_subclass_provenance_df['parentClass'] = ncit_subclass_provenance_df['parentClass'].str.replace(
    'http://purl.obolibrary.org/obo/MONDO_', 'MONDO:', regex=False
)

# # Define new headers
# new_headers = ["subject_mondo_id", "subject_mondo_label", "object_mondo_id", "object_mondo_id"]

# # Replace the column names of the data with new headers
# ncit_subclass_provenance_df.columns = new_headers

# # Add an additional header row
# extra_header_row = [["ID", "", "SC %", ">A oboInOwl:source"]]
# extra_header_df = pd.DataFrame(extra_header_row, columns=new_headers)

# # Concatenate the additional header row and the data
# ncit_subclass_provenance_robot_df = pd.concat([extra_header_df, ncit_subclass_provenance_df], ignore_index=True)


# Add an additional header row
extra_header_row = [["ID", "", "SC %", ">A oboInOwl:source"]]
extra_header_df = pd.DataFrame(extra_header_row, columns=ncit_subclass_provenance_df.columns)

# Concatenate the additional header row and the data
ncit_subclass_provenance_robot_df = pd.concat([extra_header_df, ncit_subclass_provenance_df], ignore_index=True)



ncit_subclass_provenance_robot_df.head()

Unnamed: 0,mondoIRI,mondoLabel,parentClass,ncitSource
0,ID,,SC %,>A oboInOwl:source
1,MONDO:0000087,polymicrogyria,MONDO:0002320,NCIT:C116936
2,MONDO:0000334,multinodular goiter,MONDO:0006869,NCIT:C131438
3,MONDO:0000390,vitelliform macular dystrophy,MONDO:0020242,NCIT:C118788
4,MONDO:0000410,funisitis,MONDO:0024575,NCIT:C97077


In [23]:
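# In hindsight, keep the existing NCIT values as a record: rename the current ncitSource column to
# ncitSource2, then add a new ncitSource column that uses the template string '>A oboInOwl:source'
# with the value MONDO:notVerified for all rows.
# Note: the header row set below still gives ncitSource2 the '>A oboInOwl:source' template string;
# blank that entry if the NCIT values should be kept for reference only.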


# Rename the column 'ncitSource' to 'ncitSource2'
ncit_subclass_provenance_robot_df.rename(columns={'ncitSource': 'ncitSource2'}, inplace=True)

# Insert the new column 'ncitSource' with the value 'MONDO:notVerified' after 'parentClass'
ncit_subclass_provenance_robot_df.insert(
    ncit_subclass_provenance_robot_df.columns.get_loc('parentClass') + 1,
    'ncitSource',
    'MONDO:notVerified'
)

# Update the second row header for the new column 'ncitSource'
second_header = ["ID", "", "SC %", ">A oboInOwl:source", ">A oboInOwl:source"]
ncit_subclass_provenance_robot_df.iloc[0] = second_header



ncit_subclass_provenance_robot_df.head()

Unnamed: 0,mondoIRI,mondoLabel,parentClass,ncitSource,ncitSource2
0,ID,,SC %,>A oboInOwl:source,>A oboInOwl:source
1,MONDO:0000087,polymicrogyria,MONDO:0002320,MONDO:notVerified,NCIT:C116936
2,MONDO:0000334,multinodular goiter,MONDO:0006869,MONDO:notVerified,NCIT:C131438
3,MONDO:0000390,vitelliform macular dystrophy,MONDO:0020242,MONDO:notVerified,NCIT:C118788
4,MONDO:0000410,funisitis,MONDO:0024575,MONDO:notVerified,NCIT:C97077


In [24]:
# Save the ROBOT template built from filtered_rows (the Mondo classes whose subclassOf axiom would have
# no source provenance left, since the only source annotation was a single NCIT value). This file is
# the input for a ROBOT template run that adds back subclassOf source annotations with a value of 
# MONDO:notVerified to these classes.

ncit_subclass_provenance_robot_df.to_csv('ncit_subclass_provenance_robot_df.tsv', sep='\t', index=False)
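
Before handing the file to ROBOT, it may be worth reading it back and checking its shape. The sketch below is one possible check, written against an in-memory buffer instead of the real file so that it runs on its own; the two-row toy table copies the header and first data row shown above, and the specific checks are assumptions about what a valid template should look like here.

```python
import io

import pandas as pd

# Toy template: column names, template-string row, and one data row as shown above.
template = pd.DataFrame(
    [
        ["ID", "", "SC %", ">A oboInOwl:source", ">A oboInOwl:source"],
        ["MONDO:0000087", "polymicrogyria", "MONDO:0002320", "MONDO:notVerified", "NCIT:C116936"],
    ],
    columns=["mondoIRI", "mondoLabel", "parentClass", "ncitSource", "ncitSource2"],
)

# Round-trip through TSV text (stands in for the file written above).
buffer = io.StringIO()
template.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

# Read everything as strings and keep empty cells as "" rather than NaN.
check = pd.read_csv(buffer, sep="\t", dtype=str, keep_default_na=False)

template_strings = check.iloc[0].tolist()
data_rows = check.iloc[1:]

# Every subject and parent should be a MONDO CURIE; a blank-node parent
# such as "_:b0" would be flagged here.
subjects_ok = data_rows["mondoIRI"].str.startswith("MONDO:").all()
parents_ok = data_rows["parentClass"].str.startswith("MONDO:").all()

print(template_strings)
print("subjects ok:", subjects_ok, "| parents ok:", parents_ok)
```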