# Generating the knowledge graph for medical diagnostics

## Preliminaries 
We import the required libraries, select the folder that contains the data, and define global variables.

In [1]:
import os
import pandas as pd
from rdflib import Graph, URIRef, Literal, Namespace, RDFS

os.chdir('/Volumes/GoogleDrive/My Drive/SNOMED-CT/Snapshot/Terminology')

MDX = Namespace('https://w3id.org/hacid/onto/mdx/')
MDXD = Namespace('https://w3id.org/hacid/mdx/data/')

## Loading dataframes from input tabular data
We read the file ```sct2_Relationship_Snapshot_INT_20230630.txt```, which contains all the relations among terms in SNOMED-CT.

This file will be used for generating the ontology modules and populating those modules.

In [2]:
relations = pd.read_csv('sct2_Relationship_Snapshot_INT_20230630.txt', delimiter='\t', na_filter=False, dtype=str)
relations = relations[relations['active']=='1']
#print(relations)
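
The effect of these loading options can be checked on a small in-memory table. The sketch below is illustrative only: the rows are made up and only a subset of the relationship file's columns is assumed. It shows that `dtype=str` keeps the _active_ flag as the string `'1'` and that `na_filter=False` leaves an empty field as an empty string instead of NaN.

```python
import io
import pandas as pd

# Made-up excerpt in a tab-separated layout (hypothetical rows, column subset).
sample = (
    "id\tactive\tsourceId\tdestinationId\ttypeId\n"
    "1001\t1\t22298006\t64572001\t116680003\n"
    "1002\t0\t22298006\t404684003\t116680003\n"
    "1003\t1\t64572001\t\t116680003\n"
)

df = pd.read_csv(io.StringIO(sample), delimiter='\t', na_filter=False, dtype=str)

# Because every column is read as a string, the filter compares against '1', not 1.
active = df[df['active'] == '1']

print(len(active))
# The empty destinationId of row 1003 stays an empty string.
print(repr(active.iloc[1]['destinationId']))
```

Comparing against the integer `1` instead of the string `'1'` would select no rows here, which is why the filters in this notebook use quoted values.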

Then, we read the file ```sct2_Description_Snapshot-en_INT_20230630.txt``` that contains all the descriptions (i.e. labels) of clinical terms in SNOMED-CT. 

From the list of descriptions we keep only those that are marked as active (i.e. the field _active = 1_).

In this file we are interested in the columns _id_, _conceptId_, _term_, and _typeId_, as:
 * _id_ uniquely identifies the association of the clinical term with the description;
 * _conceptId_ identifies the clinical term;
 * _term_ is the label;
 * _typeId_ is the code that distinguishes between fully specified names (i.e. 900000000000003001) and synonyms (i.e. 900000000000013009).

In [3]:
descriptions = pd.read_csv('sct2_Description_Snapshot-en_INT_20230630.txt', delimiter='\t', na_filter=False, dtype=str)
descriptions = descriptions[descriptions['active'] == '1']
descriptions = descriptions[['id', 'conceptId', 'term', 'typeId']]

#print(descriptions)
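
If some terms contain literal double quotes, the default quoting rules of `read_csv` may alter them. One possible safeguard, not used in the cell above and shown here only as an assumption-laden sketch on made-up rows, is to disable quote handling with `quoting=csv.QUOTE_NONE`.

```python
import csv
import io
import pandas as pd

# Made-up rows; the second term starts with a literal double quote.
sample = (
    "id\tactive\tconceptId\tterm\ttypeId\n"
    "d1\t1\tc1\tPlain term\t900000000000013009\n"
    "d2\t1\tc2\t\"Quoted\" term\t900000000000013009\n"
)

# QUOTE_NONE tells the parser to treat double quotes as ordinary characters.
df = pd.read_csv(io.StringIO(sample), delimiter='\t', na_filter=False,
                 dtype=str, quoting=csv.QUOTE_NONE)

print(df.loc[1, 'term'])
```

Whether this matters for a given release depends on the terms it contains, so it is worth checking the row count of the loaded dataframe against the line count of the file.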

Now we read the file ```der2_cRefset_LanguageSnapshot-en_INT_20230630.txt```, which contains the language reference sets. These make it possible to distinguish between preferred and alternative labels for a given language.

From the original fields available when reading the dataframe we keep only the following:
 * _refsetId_, which identifies the reference set. As we are using the international edition of SNOMED-CT, only two reference sets are available, that is US English (i.e. 900000000000509007) and Great Britain English (i.e. 900000000000508004);
 * _referencedComponentId_, which is the identifier (i.e. the _id_ column) of the association of a clinical term with a given description in the file ```sct2_Description_Snapshot-en_INT_20230630.txt```;
 * _acceptabilityId_, which distinguishes between preferred labels (i.e. 900000000000548007) and alternative ones (i.e. 900000000000549004).

In [4]:
refset = pd.read_csv('der2_cRefset_LanguageSnapshot-en_INT_20230630.txt', delimiter='\t', na_filter=False, dtype=str)
refset = refset[refset['active'] == '1'][['refsetId', 'referencedComponentId', 'acceptabilityId']]
#print(refset)

## Terms with language sets
We create a dataframe that joins descriptions and language sets.
This dataframe is then used for deriving fully specified names, preferred terms among synonyms, and alternative terms among synonyms.

In [5]:
terms = descriptions.merge(refset, how='left', left_on='id', right_on='referencedComponentId')

fsn = terms[(terms['typeId'] == '900000000000003001') & (terms['acceptabilityId'] == '900000000000548007')]
fsn = fsn[['conceptId', 'term', 'refsetId']]
#print(fsn)
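
The behaviour of the left join can be seen on a toy example. The sketch below uses made-up identifiers (`d1`, `d2`, `c1`) and terms. A description listed in both language reference sets yields one row per reference set, while a description missing from the reference sets gets NaN in _acceptabilityId_ and is dropped by the filter.

```python
import pandas as pd

FSN = '900000000000003001'        # typeId of fully specified names
PREFERRED = '900000000000548007'  # acceptabilityId of preferred labels
US, GB = '900000000000509007', '900000000000508004'

# Hypothetical descriptions: d1 is in both reference sets, d2 is in none.
descriptions = pd.DataFrame({
    'id': ['d1', 'd2'],
    'conceptId': ['c1', 'c1'],
    'term': ['Example finding (finding)', 'Unlisted name (finding)'],
    'typeId': [FSN, FSN],
})
refset = pd.DataFrame({
    'refsetId': [US, GB],
    'referencedComponentId': ['d1', 'd1'],
    'acceptabilityId': [PREFERRED, PREFERRED],
})

terms = descriptions.merge(refset, how='left', left_on='id',
                           right_on='referencedComponentId')
fsn = terms[(terms['typeId'] == FSN) & (terms['acceptabilityId'] == PREFERRED)]

print(len(terms))  # d1 twice (US and GB) plus d2 once with NaN
print(len(fsn))    # only the two d1 rows survive the filter
```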

Then we derive a dataframe for preferred synonyms, i.e. _typeId_ = 900000000000013009 and _acceptabilityId_ = 900000000000548007.

In [6]:
prefs = terms[(terms['typeId'] == '900000000000013009') & (terms['acceptabilityId'] == '900000000000548007')]
prefs = prefs[['conceptId', 'term', 'refsetId']]
#print(prefs)

Finally, we derive a dataframe for alternative synonyms, i.e. _typeId_ = 900000000000013009 and _acceptabilityId_ = 900000000000549004.

In [7]:
alts = terms[(terms['typeId'] == '900000000000013009') & (terms['acceptabilityId'] == '900000000000549004')]
alts = alts[['conceptId', 'term', 'refsetId']]
#print(alts)

## RDF Graphs production

We now generate the RDF graphs and save them with an N-Triples serialisation.

We start from the fully specified names, which are associated with terms through both the property MDX.fullySpecifiedName and RDFS.label.

In [8]:
def to_triple(row, property: URIRef) -> tuple[URIRef, URIRef, Literal]:
    # row holds (conceptId, term, refsetId)
    # US English reference set
    if row[2] == '900000000000509007':
        return (MDXD[row[0]], property, Literal(row[1], lang='en-us'))
    # any other reference set is treated as GB English
    else:
        return (MDXD[row[0]], property, Literal(row[1], lang='en-gb'))

g = Graph()

triples = [to_triple(row, MDX.fullySpecifiedName) for row in fsn.to_numpy()]
for triple in triples:
  g.add(triple)

triples = [to_triple(row, RDFS.label) for row in fsn.to_numpy()]
for triple in triples:
  g.add(triple)

g.serialize(destination='MDX Knowledge Graph/full-specified-names.nt', format='nt')



<Graph identifier=Ncadb7ff13a704b19825f340c741f43af (<class 'rdflib.graph.Graph'>)>

Next come the preferred synonyms, which are associated with terms through the property MDX.preferredSynonym.

In [9]:
g = Graph()
triples = [to_triple(row, MDX.preferredSynonym) for row in prefs.to_numpy()]
for triple in triples:
  g.add(triple)

g.serialize(destination='MDX Knowledge Graph/preferred-synonyms.nt', format='nt')

<Graph identifier=N265fdc5c2a694ceabac712a4a19e1fff (<class 'rdflib.graph.Graph'>)>

Finally, we do the same for the alternative synonyms, which are associated with terms through the property MDX.alternativeSynonym.

In [10]:
g = Graph()
triples = [to_triple(row, MDX.alternativeSynonym) for row in alts.to_numpy()]
for triple in triples:
  g.add(triple)

g.serialize(destination='MDX Knowledge Graph/alternative-synonyms.nt', format='nt')

<Graph identifier=Na6b620518ae94275b9e5e6352c68b527 (<class 'rdflib.graph.Graph'>)>