## Analyze the MONDO Synonym Xrefs


#### Mondo Input File Creation
The `syns.csv` file was created by using [ROBOT](https://robot.obolibrary.org/) to convert `mondo-edit.obo` to OWL and then running the SPARQL query below (saved as `query_synonym_xrefs.sparql`) over the resulting OWL file.
- `robot convert -i mondo-edit.obo -o TEST-mondo-edit.owl`

- `robot query -i TEST-mondo-edit.owl -q query_synonym_xrefs.sparql syns.csv`
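
The two commands above could also be scripted so that the input file is regenerated in one step. The following is only a sketch: the file names come from the commands above, while `run_pipeline` is a hypothetical helper name, and it assumes a `robot` executable is on the `PATH`.

```python
import shutil
import subprocess

# File names are taken from the two ROBOT commands listed above.
CONVERT_CMD = ["robot", "convert", "-i", "mondo-edit.obo", "-o", "TEST-mondo-edit.owl"]
QUERY_CMD = [
    "robot", "query", "-i", "TEST-mondo-edit.owl",
    "-q", "query_synonym_xrefs.sparql", "syns.csv",
]


def run_pipeline(commands):
    """Run each command in order; return False if ROBOT is not installed.

    Hypothetical helper, not part of ROBOT itself.
    """
    if shutil.which("robot") is None:
        return False
    for cmd in commands:
        # check=True raises CalledProcessError if a step fails
        subprocess.run(cmd, check=True)
    return True
```

Calling `run_pipeline([CONVERT_CMD, QUERY_CMD])` from the directory that holds `mondo-edit.obo` and the query file would then run both steps in sequence.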


#### SPARQL Query 

```
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?identifier ?synonym ?synonym_xref ?synonym_type ?synonym_type_value
WHERE {
    # Match the main class and get its identifier
    ?entity rdfs:label ?label ;
            oboInOwl:id ?identifier .

    # Match the synonym properties and their values
    VALUES ?synonym_property {
        oboInOwl:hasRelatedSynonym
        oboInOwl:hasExactSynonym
        oboInOwl:hasNarrowSynonym
        oboInOwl:hasBroadSynonym
    }
    ?entity ?synonym_property ?synonym .

    # Match the annotations for the synonyms
    OPTIONAL {
        ?anno a owl:Axiom ;
              owl:annotatedSource ?entity ;
              owl:annotatedProperty ?synonym_property ;
              owl:annotatedTarget ?synonym ;
              oboInOwl:hasDbXref ?synonym_xref .
        OPTIONAL {
            ?anno oboInOwl:hasSynonymType ?synonym_type_value .
        }
    }

    # Restrict the results to MONDO classes
    FILTER(STRSTARTS(STR(?entity), "http://purl.obolibrary.org/obo/MONDO_"))

    # Bind the synonym type based on the property
    BIND(
        IF(?synonym_property = oboInOwl:hasExactSynonym, "exact",
            IF(?synonym_property = oboInOwl:hasRelatedSynonym, "related",
                IF(?synonym_property = oboInOwl:hasNarrowSynonym, "narrow",
                    IF(?synonym_property = oboInOwl:hasBroadSynonym, "broad", "unknown")
                )
            )
        ) AS ?synonym_type
    )
}
ORDER BY ?identifier
```
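
The nested `IF` in the `BIND` clause is effectively a lookup from synonym property to a short scope label. As a rough illustration only, the same mapping can be written as a Python dictionary. The names `SYNONYM_SCOPE` and `synonym_type` are made up for this sketch; the IRIs are built from the `oboInOwl` prefix declared in the query.

```python
# Namespace copied from the PREFIX declaration in the query.
OBOINOWL = "http://www.geneontology.org/formats/oboInOwl#"

# Property IRI -> value that the query binds to ?synonym_type
SYNONYM_SCOPE = {
    OBOINOWL + "hasExactSynonym": "exact",
    OBOINOWL + "hasRelatedSynonym": "related",
    OBOINOWL + "hasNarrowSynonym": "narrow",
    OBOINOWL + "hasBroadSynonym": "broad",
}


def synonym_type(property_iri):
    """Return the scope label, falling back to 'unknown' like the final IF branch."""
    return SYNONYM_SCOPE.get(property_iri, "unknown")
```

Because the `VALUES` block only admits these four properties, the `"unknown"` fallback should never be reached in practice.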

In [6]:
# imports 
import pandas as pd
import numpy as np

In [7]:
# read in file
df = pd.read_csv('syns.csv')
df.head()

| Unnamed: 0 | identifier | synonym | synonym_xref | synonym_type | synonym_type_value |
|---|---|---|---|---|---|
| 0 | MONDO:0000001 | condition | NCIT:C2991 | exact | |
| 1 | MONDO:0000001 | disease | NCIT:C2991 | exact | |
| 2 | MONDO:0000001 | disease or disorder | NCIT:C2991 | exact | |
| 3 | MONDO:0000001 | disease or disorder, non-neoplastic | NCIT:C2991 | exact | |
| 4 | MONDO:0000001 | diseases | NCIT:C2991 | exact | |


In [8]:
# Get the prefix of each synonym xref (the text before the first colon)
df['prefix'] = df['synonym_xref'].str.split(':').str[0]

# Get unique list of prefixes
unique_prefixes = df['prefix'].unique()
unique_prefixes

array(['NCIT', nan, 'DOID', 'GARD', 'OMIMPS', 'https', 'Orphanet', 'PMID',
       'MONDO', 'OMIM', 'Wikipedia', 'MESH', 'MONDORULE', 'NORD', 'UMLS',
       'ONCOTREE', 'ICD9CM', 'EFO', 'doi', 'http', 'ICD10CM', 'OMOP',
       'SCTID', 'MTH', 'ISBN-13', 'GTR', 'HP', 'OGMS', 'MedDRA',
       'DECIPHER', 'ClinGen', 'MEDGEN', 'ICD9', 'MedGen', 'ICD11', 'SCDO',
       'OMIA'], dtype=object)
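
The array above lists which prefixes occur, but not how often. A small sketch of counting them is shown here in plain Python so it runs without pandas. The sample xrefs are illustrative only (`NCIT:C2991` appears in the data above; the others are made up), and `xref_prefix` is a hypothetical helper that mirrors the split on the first colon.

```python
from collections import Counter


def xref_prefix(xref):
    """Return the text before the first colon, or None for a missing xref."""
    if not isinstance(xref, str):
        return None
    # Like the split on ':' above, a string without a colon is returned whole.
    return xref.partition(":")[0]


# Illustrative values only; None stands in for a synonym without an xref.
sample_xrefs = ["NCIT:C2991", "NCIT:C2991", "DOID:4", "https://example.org/page", None]

prefix_counts = Counter(xref_prefix(x) for x in sample_xrefs)
print(prefix_counts.most_common())
```

This also shows why `https`, `http`, and `doi` appear as "prefixes": URLs and DOIs contain a colon, so splitting on it treats the scheme as a prefix.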

In [10]:
# Get all prefix identifiers that are not "numeric", e.g. OMIM:genemap2, OMIA:001441-9615

# Extract values after the colon
df['value'] = df['synonym_xref'].str.split(':').str[1]

# Convert all values to strings (missing xrefs become the string 'nan' and count as non-numeric below)
df['value'] = df['value'].astype(str)

# Filter non-numeric values (including mixed values)
non_numeric_values = df['value'][~df['value'].apply(lambda x: x.replace('.', '', 1).isdigit())]

# Find unique non-numeric values
unique_non_numeric_values = non_numeric_values.unique()

# Display the number of unique non-numeric values
len(unique_non_numeric_values)

7259
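
To make the "non-numeric" test concrete, the sketch below replays the same split and digit check on a handful of xrefs in plain Python. The helper names `local_id` and `looks_numeric` are hypothetical. `OMIM:genemap2`, `OMIA:001441-9615`, and `NCIT:C2991` come from this notebook, while the remaining inputs are made up to show edge cases.

```python
def local_id(xref):
    """Mimic `.str.split(':').str[1]` followed by `.astype(str)`."""
    if not isinstance(xref, str):
        return "nan"  # a missing xref ends up as the string 'nan'
    parts = xref.split(":")
    # With no colon there is no element 1, which pandas reports as NaN
    return parts[1] if len(parts) > 1 else "nan"


def looks_numeric(value):
    """Same check as the lambda above: digits with at most one '.' removed."""
    return value.replace(".", "", 1).isdigit()


examples = [
    "OMIM:genemap2",          # letters and digits -> non-numeric
    "OMIA:001441-9615",       # contains a hyphen -> non-numeric
    "NCIT:C2991",             # leading letter -> non-numeric
    "https://example.org/a",  # made-up URL; the value becomes '//example.org/a'
    "OMIM:100100",            # made-up, all digits -> numeric
    "ICD9:123.4",             # made-up, one decimal point -> numeric
    None,                     # missing xref -> 'nan' -> non-numeric
]

for xref in examples:
    value = local_id(xref)
    print(xref, value, looks_numeric(value))
```

Two things stand out: rows without an xref contribute the literal value `nan` to the non-numeric list, and URL xrefs contribute fragments that start with `//`.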

In [11]:
# Save these values to a file
output_file = 'unique_non_numeric_values.txt'

# Write each unique non-numeric value to the file, one per line
with open(output_file, 'w') as file:
    for value in unique_non_numeric_values:
        file.write(f"{value}\n")

---

### Re-Scope Analysis Issue

Because there are too many values to evaluate by hand (7,259 unique non-numeric values), the scope of this analysis is limited to xrefs with known ontology prefixes.

In [14]:
# Only include xrefs whose prefix is one of the selected sources listed below

# List of prefixes to keep
prefixes = [
    'OMIM', 'OMIMPS', 'Orphanet', 'DOID', 'NCIT', 'MESH', 'UMLS', 
    'ICD9CM', 'EFO', 'ICD10CM', 'OMOP', 'SCTID', 'ClinGen', 'MEDGEN', 
    'ICD9', 'MedGen', 'ICD11', 'OMIA'
]

# Extract the prefix from 'synonym_xref' and store it in a new column 'prefix'
df['prefix'] = df['synonym_xref'].str.split(':').str[0]

# Keep only the rows whose 'prefix' is in the list of prefixes
filtered_df = df[df['prefix'].isin(prefixes)].copy()

# Drop the helper 'prefix' column, which is no longer needed
filtered_df.drop(columns='prefix', inplace=True)

# Display the filtered DataFrame
filtered_df.head()


| Unnamed: 0 | identifier | synonym | synonym_xref | synonym_type | synonym_type_value | value |
|---|---|---|---|---|---|---|
| 0 | MONDO:0000001 | condition | NCIT:C2991 | exact | | C2991 |
| 1 | MONDO:0000001 | disease | NCIT:C2991 | exact | | C2991 |
| 2 | MONDO:0000001 | disease or disorder | NCIT:C2991 | exact | | C2991 |
| 3 | MONDO:0000001 | disease or disorder, non-neoplastic | NCIT:C2991 | exact | | C2991 |
| 4 | MONDO:0000001 | diseases | NCIT:C2991 | exact | | C2991 |
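
The selected prefix list contains both `MEDGEN` and `MedGen`, which suggests that the same source is written with different casing in the data. A quick way to spot such variants, sketched here with a hypothetical helper `case_variants`, is to group the prefixes by their case-folded form.

```python
from collections import defaultdict

# Same list as used for the filter above.
prefixes = [
    'OMIM', 'OMIMPS', 'Orphanet', 'DOID', 'NCIT', 'MESH', 'UMLS',
    'ICD9CM', 'EFO', 'ICD10CM', 'OMOP', 'SCTID', 'ClinGen', 'MEDGEN',
    'ICD9', 'MedGen', 'ICD11', 'OMIA'
]


def case_variants(names):
    """Group names that differ only by letter case; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


variants = case_variants(prefixes)
print(variants)
```

Whether such variants should be merged is a curation decision; this sketch only reports them.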


In [19]:
# Now get a unique list of non-numeric values for synonym_xref from filtered_df

# Extract values after the colon
filtered_df['value'] = filtered_df['synonym_xref'].str.split(':').str[1]

# Convert all values to strings
filtered_df['value'] = filtered_df['value'].astype(str)

# Filter non-numeric values (including mixed values)
non_numeric_values = filtered_df['value'][~filtered_df['value'].apply(lambda x: x.replace('.', '', 1).isdigit())]

# Find unique non-numeric values
unique_non_numeric_values = non_numeric_values.unique()

# Print the number of unique non-numeric values
print(len(unique_non_numeric_values))

6459


In [20]:
# Filter out any value that starts with 'C' or 'D'. These are typically MESH identifiers,
# but note that NCIT codes (e.g. C2991) also start with 'C' and are removed as well.

filtered_values = [value for value in unique_non_numeric_values if not value.startswith(('C', 'D'))]
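
Filtering on the first letter of the value alone cannot tell which source a value came from. As an alternative sketch (not what this notebook does), the prefix can be kept next to the value so that the exclusion is made by source. The helper names are hypothetical; `NCIT:C2991`, `OMIM:genemap2`, and `OMIA:001441-9615` come from this notebook, and the MESH entry is a made-up placeholder.

```python
# (prefix, value) pairs; the MESH identifier is a placeholder for illustration.
sample_pairs = [
    ("MESH", "D000000"),
    ("NCIT", "C2991"),
    ("OMIM", "genemap2"),
    ("OMIA", "001441-9615"),
]


def drop_by_first_letter(pairs, letters=("C", "D")):
    """Value-only filter, equivalent to the list comprehension above."""
    return [pair for pair in pairs if not pair[1].startswith(letters)]


def drop_by_prefix(pairs, excluded=("MESH",)):
    """Prefix-aware filter: remove only values from the excluded sources."""
    return [pair for pair in pairs if pair[0] not in excluded]


print(drop_by_first_letter(sample_pairs))
print(drop_by_prefix(sample_pairs))
```

The value-only filter drops the NCIT code together with the MESH one, whereas the prefix-aware filter keeps it. Which behaviour is wanted depends on whether NCIT codes still need review.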


In [21]:
# Save these values to a file
output_file = 'unique_non_numeric_values-SELECT-PREFIXES-ONLY.txt'

# Write each unique non-numeric value to the file, one per line
with open(output_file, 'w') as file:
    for value in filtered_values:
        file.write(f"{value}\n")