# Task 1 - Disease Named Entity Recognition (D-NER)

## Introduction and Contex

The main goal of this project is to provide a pipeline for the input preprocessing for sequence tagging and named entity recognition for a medical domain project with Knodle. In other words, the tagging of categories of entities, here diseases, within text. For this we make use of the MIMIC III dataset and create weakly annotated labels with a lexicon look-up with keyword lists from relevante datasets. Those shall then be denoised within Knodle.

### Introduction - Data and Preprocessing

As mentioned, this project will utilize the Mimic III dataset as training data. For the development and evaluation of the model, the NCIT dataset will be used, for which the existing tags are as follows:
DiseaseClass, SpecificDisease, Modfier and CompositeMention

For the Class of CompositeDisease the occurence of a term of DiseaseClass or SpecificDisease followed by another seems like a viable solution, with an instance of and, or or / \ between them. Given that the following preprocessing pipeline will only deal with unigrams and the class Mod is not used in the final model, the elements in use ne

In order to supplement the Mimic III dataset for this task, three other datasets where consulted to aid in the fitting of specific categories for the matching with the structure of the classes:

National Cancer Institute Thesaurus: NCI9d - Thesaurus.txt: NCI Thesaurus Version 22.09d - flattened from the owl format 
from their own webpage: https://evs.nci.nih.gov/evs-download/thesaurus-downloads

The vast ULMS dataset (Unified Medical Language System),
after authentification, to be accessed at: https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
Specifically the table NB.DB

MEDIC - CTD's MEDIC disease vocabulary is a modified subset of descriptors from the “Diseases” [C] branch of the U.S. National Library of Medicine's Medical Subject Headings (MeSH®),combined with genetic disorders from the Online Mendelian Inheritance in Man® (OMIM®) database. Comparative Toxicogenomics Database: https://ctdbase.org/downloads/#alldiseases

In order to access the MIMIC II dataset, a PhysioNet account and a completion of the necessary credentialing process are needed. This involves passing the "Data or Specimens Only Research" course from the "Human Research" curriculum offered by the Collaborative Institutional Training Initiative (CITI Program). This can entail some waiting time.

MIMIC-III ("Medical Information Mart for Intensive Care") is a large, single-center database consisting of 26 tables, containing detailed information on patients admitted to critical care units at a large tertiary care hospital. Events such as notes, laboratory tests, and fluid balance are stored in a series of ‘events’ tables. The NOTEEVENTS table contains all clinical notes related to an event for a given patient, for the purposes of the training data, this will serve as text to be labeled, while D_ICD_DIAGNOSES, which carries the disease mentions will serve as the basis of two of the keyword lists for labeling.

On the scope and its potential for scaleable expansion: 

Modifier, CompositeMention classes will not be implemented in the final input for Knodle, for computational and conceptual purposes. But the code is written with scaleability in mind. The inclusion of Modifiers can at any point be included and would only require mirroring of the steps in the second notebook, to also include the Mod dataset. The inclusion of CompositeMention would merely require the labeling function to include another for loop and clause to map the occurence of a term of DiseaseClass or SpecificDisease followed by another, with an instance of and, or or / \ between them. Given that the following preprocessing pipeline will only deal with unigrams, the ngram creation pipeline would need to be mirrored for a bigger set for this as well.

Further potential for performance increase lies in the final inclusion of all the processed datasets all the while adjusting the sampling parameter to maximize the quality of the input with regards to the computational possibilities.

Note 1: Given the large size of the original dataset, in order to provide a proof of concept, only a subset is used, yet the pipelines are built in a way that the greater whole could be passed through in batches/chunks.

Note 2: In a subsequent sections all of the prevously processed elements shall be exported and  later on imported in bulk.

## Preprocessing Steps

### Examination

In the following section each of the datasets in use will be examined, the relevant sections are stripped from them and further preprocessed into lists of strings.
For this we remove stopwords, punctuation and numbers as well as normalize the input format.

In [42]:
import pandas as pd
import os
from collections import Counter
import scipy
import re
import string
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
import copy
import string

punctuation = string.punctuation.replace('/', '')
punctuation = punctuation.replace('\\', '')


In [43]:
D_ICD_DIAGNOSES=pd.read_csv('../../../mimic-iii-clinical-database-1.4/D_ICD_DIAGNOSES.csv', sep=',', header=0)
D_ICD_DIAGNOSES.head(5)

Unnamed: 0,ROW_ID,ICD9_CODE,SHORT_TITLE,LONG_TITLE
0,174,1166,TB pneumonia-oth test,"Tuberculous pneumonia [any form], tubercle bac..."
1,175,1170,TB pneumothorax-unspec,"Tuberculous pneumothorax, unspecified"
2,176,1171,TB pneumothorax-no exam,"Tuberculous pneumothorax, bacteriological or h..."
3,177,1172,TB pneumothorx-exam unkn,"Tuberculous pneumothorax, bacteriological or h..."
4,178,1173,TB pneumothorax-micro dx,"Tuberculous pneumothorax, tubercle bacilli fou..."


In [44]:
DiseaseClass1=D_ICD_DIAGNOSES.loc[:,'SHORT_TITLE']
DC1=set(DiseaseClass1.values.tolist()) 
SpecificDisease1=D_ICD_DIAGNOSES.loc[:,'LONG_TITLE']
SD1=set(SpecificDisease1.values.tolist())

In [45]:
CTD_diseases=pd.read_csv('./data/CTD_diseases.csv', sep=',', header=0)
CTD_diseases.head(5)

Unnamed: 0,DiseaseName,DiseaseID,AltDiseaseIDs,Definition,ParentIDs,TreeNumbers,ParentTreeNumbers,Synonyms,SlimMappings
0,10p Deletion Syndrome (Partial),MESH:C538288,,,MESH:D002872|MESH:D025063,C16.131.260/C538288|C16.320.180/C538288|C23.55...,C16.131.260|C16.320.180|C23.550.210.050.500.500,"Chromosome 10, 10p- Partial|Chromosome 10, mon...",Congenital abnormality|Genetic disease (inborn...
1,13q deletion syndrome,MESH:C535484,,,MESH:D002872|MESH:D025063,C16.131.260/C535484|C16.320.180/C535484|C23.55...,C16.131.260|C16.320.180|C23.550.210.050.500.500,Chromosome 13q deletion|Chromosome 13q deletio...,Congenital abnormality|Genetic disease (inborn...
2,15q24 Microdeletion,MESH:C579849,DO:DOID:0060395,,MESH:D002872|MESH:D008607|MESH:D025063,C10.597.606.360/C579849|C16.131.260/C579849|C1...,C10.597.606.360|C16.131.260|C16.320.180|C23.55...,15q24 Deletion|15q24 Microdeletion Syndrome|In...,Congenital abnormality|Genetic disease (inborn...
3,16p11.2 Deletion Syndrome,MESH:C579850,,,MESH:D001321|MESH:D002872|MESH:D008607|MESH:D0...,C10.597.606.360/C579850|C16.131.260/C579850|C1...,C10.597.606.360|C16.131.260|C16.320.180|C23.55...,,Congenital abnormality|Genetic disease (inborn...
4,"17,20-Lyase Deficiency, Isolated",MESH:C567076,,,MESH:D000312,C12.050.351.875.253.090.500/C567076|C12.200.70...,C12.050.351.875.253.090.500|C12.200.706.316.09...,"17-Alpha-Hydroxylase-17,20-Lyase Deficiency, C...",Congenital abnormality|Endocrine system diseas...


In [46]:
DiseaseClass2=CTD_diseases.loc[:,'SlimMappings']
DC2_input= DiseaseClass2.values.tolist()
DC2=[]

In [47]:
def split_on_pipe(input_list, new_list):
    for former_row in input_list:
        Sublist=str(former_row).split('|')
        for item in Sublist:
            new_list.append(item) 
    new_list=set(new_list)
    return new_list

In [48]:
DC2 = split_on_pipe(DC2_input, DC2)

In [49]:
SpecificDisease2=CTD_diseases.loc[:,'DiseaseName']
SD2_input=set(SpecificDisease2.values.tolist())

In [50]:
SD2=[]
SD2 = split_on_pipe(SD2_input, SD2)

In [51]:
NCI9d=pd.read_csv('./data/Thesaurus.text', sep='\t')
NCI9d_disease_subset=NCI9d.loc[NCI9d['Therapeutic or Preventive Procedure'] == 'Disease or Syndrome']
NCI9d_disease_subset.head()

Unnamed: 0,C100000,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C100000>,C99521,"Percutaneous Coronary Intervention for ST Elevation Myocardial Infarction-Stable-Over 12 Hours From Symptom Onset|PERCUTANEOUS CORONARY INTERVENTION (PCI) FOR ST ELEVATION MYOCARDIAL INFARCTION (STEMI) (STABLE, >12 HRS FROM SYMPTOM ONSET)","A percutaneous coronary intervention is necessary for a myocardial infarction that presents with ST segment elevation and the subject does not have recurrent or persistent symptoms, symptoms of heart failure or ventricular arrhythmia. The presentation is past twelve hours since onset of symptoms. (ACC)",Unnamed: 5,Unnamed: 6,Therapeutic or Preventive Procedure
12,C100012,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C50797,Severe Cardiac Valve Regurgitation,Evidence of severe retrograde blood flow throu...,,,Disease or Syndrome
21,C100020,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35317,Three Vessel Coronary Disease|THREE VESSEL DIS...,There was greater than or equal to 50% stenosi...,,,Disease or Syndrome
24,C100023,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35317,Two Vessel Coronary Disease|TWO VESSEL DISEASE,There was greater than or equal to 50% stenosi...,,,Disease or Syndrome
67,C100062,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C99938,Chronic Total Coronary Artery Occlusion,Prolonged complete obstruction of the coronary...,,,Disease or Syndrome
76,C100070,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35279,Coronary Venous Dissection,A tear within the wall of a coronary vein. (ACC),,,Disease or Syndrome


In [52]:
SpecificDisease4=NCI9d_disease_subset.iloc[:,3]
SD3_input=SpecificDisease4.values.tolist()

SD3=[]
SD3=split_on_pipe(SD3_input, SD3)

In [53]:
ULMS_terminology=pd.read_csv("./data/NC.DB", sep = '|')
Modifiers=ULMS_terminology.iloc[:,0]
Mod_pre=Modifiers.values.tolist()

In [54]:
def skip_brackets(test_str):
    ret = ''
    skip1c = 0
    skip2c = 0
    for i in test_str:
        if i == '[':
            skip1c += 1
        elif i == '(':
            skip2c += 1
        elif i == ']' and skip1c > 0:
            skip1c -= 1
        elif i == ')'and skip2c > 0:
            skip2c -= 1
        elif skip1c == 0 and skip2c == 0:
            ret += i
    return ret

    

In [55]:
Mod=[]

for term in Mod_pre:
    a=skip_brackets(str(term))
    Mod.append(a)

for term in Mod_pre:
    x=term.replace('(','').replace(')','').replace(r'[0-9]+', '')
    Mod.append(x)
    
Mod=set(Mod)

In [56]:
def lower(list):
    [str(x).lower() for x in list]
    return list

SD1 = lower('SD1')
SD2 = lower('SD2')
SD3 = lower('SD3')
DC1 = lower('DC1')
DC2 = lower('DC2')
Mod = lower('Mod')

### Measuring and Pruning

In order to evaluate the requirements for a total n-gram upper limit, we take a look at our keyword element distributions.
At the same time we will remove stopwords, punctuation and digits.

In [57]:
%pprint
num_words_DC1 = [len(element.split()) for element in DC1]  
print(Counter(num_words_DC1))

DC1_tidy = []

def no_stop(list, tidy_list):
    temp_cleaned = []
    temp_cleaned=copy.deepcopy(list)
    for string in temp_cleaned:
        if string not in stopwords:
            tidy_list.append(string.translate(str.maketrans('', '', punctuation)))


no_stop(DC1, DC1_tidy)

Pretty printing has been turned ON
Counter({1: 3})


In [58]:
%pprint
num_words_DC2 = [len(element.split()) for element in DC2]  
print(Counter(num_words_DC2))

DC2_tidy = []

no_stop(DC2, DC2_tidy)

Pretty printing has been turned OFF
Counter({1: 3})


In [59]:
%pprint
num_words_SD1 = [len(element.split()) for element in SD1]  
print(Counter(num_words_SD1))

Pretty printing has been turned ON
Counter({1: 3})


In [60]:
%pprint
num_words_SD3 = [len(element.split()) for element in SD3]  
print(Counter(num_words_SD3))

Pretty printing has been turned OFF
Counter({1: 3})


In [61]:
%pprint
num_words_SD1 = [len(element.split()) for element in SD1]  
#print(Counter(num_words_SD1))
SD1_tidy = []
no_stop(SD1, SD1_tidy)

%pprint
num_words_SD1_tidy = [len(element.split()) for element in SD1_tidy]  
print(Counter(num_words_SD1_tidy))

Pretty printing has been turned ON
Pretty printing has been turned OFF
Counter({1: 3})


In [62]:
%pprint
num_words_SD2 = [len(element.split()) for element in SD2]  
print(Counter(num_words_SD2))

SD2_tidy = []
no_stop(SD2, SD2_tidy)

num_words_SD2_tidy = [len(element.split()) for element in SD2_tidy]  
print(Counter(num_words_SD2_tidy))

Pretty printing has been turned ON
Counter({1: 3})
Counter({1: 3})


In [63]:
%pprint
num_words_SD3 = [len(element.split()) for element in SD3]  
#print(Counter(num_words_SD3))

SD3_tidy = []
no_stop(SD3, SD3_tidy)

num_words_SD3_tidy = [len(element.split()) for element in SD3_tidy]  
print(Counter(num_words_SD3_tidy))

Pretty printing has been turned OFF
Counter({1: 3})


In [64]:
%pprint
num_words_Mod = [len(element.split()) for element in Mod]  
print(Counter(num_words_Mod))

Mod_tidy = []
no_stop(Mod, Mod_tidy)

Pretty printing has been turned ON
Counter({1: 3})


A sensible cutoff point where most of our keywords are retained for rules would be 7_grams. In the nextstep, yet for testing purposes we will proceed with unigrams, in the following the lists could be ridded of any elements with more terms within them. The proceding creation of ngrams would need it's paramters set accordingly.

In [65]:
def no_digit(input, output):
    for element in input:
        element=result = ''.join([i for i in element if not i.isdigit()])
        if len(element.split()) < 2:
                output.append(element)

In [66]:
DC1_trim = []
no_digit(DC1_tidy, DC1_trim)

DC2_trim = []
no_digit(DC2_tidy, DC2_trim)

SD1_trim = []
no_digit(SD1_tidy, SD1_trim)

SD2_trim = []
no_digit(SD2_tidy, SD2_trim)

SD3_trim = []
no_digit(SD3_tidy, SD3_trim)

Mod_trim = []
no_digit(Mod_tidy, Mod_trim)


In order to mittigate the problems of the excessive RAM usage, we will be sampling the dataset heavily and only use 1/100 of the original data.

In [67]:
# df_notes = pd.read_csv('../../../mimic-iii-clinical-database-1.4/NOTEEVENTS.csv', low_memory=False)
# df_notes = df_notes.sample(frac=0.01, random_state=19)

In [68]:
# notes_text = df_notes.TEXT
# notes_text = notes_text.str.replace(r'\[\*\*(.*?)\*\*\]', '') # extract tag placeholders
# notes_text = notes_text.str.replace(r'[0-9]+', '') # extract all digits
# punctuation = string.punctuation.replace('/', '')
# punctuation = punctuation.replace('\\', '') # excluding slashes from punctuation removal, since we need them to match with some disease

# notes_text = notes_text.str.translate(str.maketrans('', '', punctuation))

# stopwords = stopwords.words('english')
# stopwords.remove('and')
# stopwords.remove('or')
# notes_text = notes_text.apply(lambda x: ' '.join([word for word in x.lower().split() if word not in (stopwords)]))

In [69]:
#notes_text.to_csv('../data/notes_cleaned.csv', index=False)

## Export

Now that our lists as well as the traing data are sufficiently tidy, we store them for later use.

In [70]:
def write_to_file(filepath, variable):
    try:
        os.remove('filepath')
    except OSError:
        pass
    with open(filepath, 'w') as fp:
        for item in variable:
            #writes each entry on a new line
            fp.write("%s\n" % item)


write_to_file('./data/DC1.txt', DC1_trim)

write_to_file('./data/DC2.txt', DC2_trim)

write_to_file('./data/SD1.txt', SD1_trim)

write_to_file('./data/SD2.txt', SD2_trim)

write_to_file('./data/SD3.txt', SD3_trim)

write_to_file('./data/Mod.txt', Mod_trim)

## Now for the preprocessing of the test and dev data - NCBI

The same steps are used here, the data is cleaned, stripped of unwanted elements, brought into shape of the future training data and exported.

In [71]:
text_t = []
labels_t = []

file = open('./data/NCBItestset_corpus.txt', 'r')
Lines = file.readlines()
  
for line in Lines:
    if line.find("|") > 0:
        text_t.append(line)
    elif len(line) == 0:
        continue
    else:
        labels_t.append(line)

In [74]:
text_d = []
labels_d = []

file = open('./data/NCBIdevelopset_corpus.txt', 'r')
Lines = file.readlines()
  
for line in Lines:
    if line.find("|") > 0:
        text_d.append(line)
    elif len(line) == 0:
        continue
    else:
        labels_d.append(line)

In [75]:
test_text = pd.DataFrame([elem.split("|") for elem in text_t])
test_text.drop(test_text.columns[[1,3]],axis=1, inplace=True)
test_text.columns = ['ID','Text']
test_text[test_text.columns[0]] = test_text[test_text.columns[0]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int).dropna()
test_text_f = test_text[test_text['ID'] == 0].index
test_text.drop(test_text_f, inplace=True)
test_text = test_text.groupby(['ID']).agg({'Text': ' '.join})

In [76]:
dev_text = pd.DataFrame([elem.split("|") for elem in text_d])
dev_text.drop(dev_text.columns[[1,3]],axis=1, inplace=True)
dev_text.columns = ['ID','Text']
dev_text[dev_text.columns[0]] = dev_text[dev_text.columns[0]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int).dropna()
dev_text_f = dev_text[dev_text['ID'] == 0].index
dev_text.drop(dev_text_f, inplace=True)
dev_text = dev_text.groupby(['ID']).agg({'Text': ' '.join})

In [77]:
test_label = pd.DataFrame([elem.split('\t') for elem in labels_t])
test_label.drop(test_label.columns[[1,2,5]],axis=1, inplace=True)
test_label.columns =['ID', 'Label', 'Class']
test_label[test_label.columns[0]] = test_label[test_label.columns[0]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int).dropna()
test_label_f = test_label[test_label['ID'] == 0].index
test_label.drop(test_label_f, inplace=True)
test_label = test_label[test_label['Label'].str.count(' ') == 0]


In [78]:
dev_label = pd.DataFrame([elem.split('\t') for elem in labels_d])
dev_label.drop(dev_label.columns[[1,2,5]],axis=1, inplace=True)
dev_label.columns =['ID', 'Label', 'Class']
dev_label[dev_label.columns[0]] = dev_label[dev_label.columns[0]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int).dropna()
dev_label_f = dev_label[dev_label['ID'] == 0].index
dev_label.drop(dev_label_f, inplace=True)
dev_label = dev_label[dev_label['Label'].str.count(' ') == 0]

In [79]:
digits = ['0','1','2','3','4','5','6','7','8','9']

for char in punctuation:
    test_label['Label'] = test_label['Label'].str.replace(char, '')
       
for char in digits:
    test_label['Label'] = test_label['Label'].str.replace(char, '')
    
for char in punctuation:
    test_text['Text'] = test_text['Text'].str.replace(char, '')

for char in digits:
    test_text['Text'] = test_text['Text'].str.replace(char, '')

test_text["Text"] = test_text["Text"].str.lower()
test_label["Label"] = test_label["Label"].str.lower()
test_text["Text"] = test_text["Text"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))

  after removing the cwd from sys.path.
  # Remove the CWD from sys.path while we load stuff.


In [80]:
digits = ['0','1','2','3','4','5','6','7','8','9']

for char in punctuation:
    dev_label['Label'] = dev_label['Label'].str.replace(char, '')
       
for char in digits:
    dev_label['Label'] = dev_label['Label'].str.replace(char, '')
    
for char in punctuation:
    dev_text['Text'] = dev_text['Text'].str.replace(char, '')

for char in digits:
    dev_text['Text'] = dev_text['Text'].str.replace(char, '')

dev_text["Text"] = dev_text["Text"].str.lower()
dev_label["Label"] = dev_label["Label"].str.lower()
dev_text["Text"] = dev_text["Text"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))

  after removing the cwd from sys.path.
  # Remove the CWD from sys.path while we load stuff.


In [81]:
dev_label.head()


Unnamed: 0,ID,Label,Class
7,9050866,ataxiatelangiectasia,Modifier
8,9050866,ataxiatelangiectasia,Modifier
9,9050866,ataxiatelangiectasia,Modifier
18,8755645,fed,SpecificDisease
19,8755645,fed,Modifier


In [82]:
test_text.head()

Unnamed: 0_level_0,Text
ID,Unnamed: 1_level_1
932197,hereditary deficiency fifth component compleme...
941901,low levels beta hexosaminidase healthy individ...
993342,chromosomal order genes controlling major hist...
9288106,clustering missense mutations ataxiatelangiect...
9294109,myotonic dystrophy protein kinase involved mod...


In [83]:
test_text['Terms'] = test_text['Text'].str.split(' ')
test_word=test_text.explode('Terms')

In [84]:
dev_text['Terms'] = dev_text['Text'].str.split(' ')
dev_word=dev_text.explode('Terms')
dev_word.drop('Text', axis=1, inplace=True)
dev_word.head()

Unnamed: 0_level_0,Terms
ID,Unnamed: 1_level_1
8589722,brca
8589722,secreted
8589722,exhibits
8589722,properties
8589722,granin


In [85]:
test_word.drop('Text', axis=1, inplace=True)
test_word.head()

Unnamed: 0_level_0,Terms
ID,Unnamed: 1_level_1
932197,hereditary
932197,deficiency
932197,fifth
932197,component
932197,complement


In [86]:
Test_fin=test_word.merge(test_label, left_on=['ID','Terms'], right_on = ['ID','Label'], how='left')
Test_fin.drop('Label', axis=1, inplace=True)
Test_fin.fillna('OTHER', axis=1, inplace=True)
Test_fin_f = Test_fin[Test_fin['Class'] == 'Modifier'].index
Test_fin.drop(Test_fin_f, inplace=True)
Test_fin_f = Test_fin[Test_fin['Class'] == 'CompositeMention'].index
Test_fin.drop(Test_fin_f, inplace=True)
Test_fin=Test_fin[Test_fin.ID != 9472666]

In [87]:
Dev_fin=dev_word.merge(dev_label, left_on=['ID','Terms'], right_on = ['ID','Label'], how='left')
Dev_fin.drop('Label', axis=1, inplace=True)
Dev_fin.fillna('OTHER', axis=1, inplace=True)
Dev_fin_f = Dev_fin[Dev_fin['Class'] == 'Modifier'].index
Dev_fin.drop(Dev_fin_f, inplace=True)
Dev_fin_f = Dev_fin[Dev_fin['Class'] == 'CompositeMention'].index
Dev_fin.drop(Dev_fin_f, inplace=True)
Dev_fin.dropna(inplace=True)

In [88]:
Test_fin.to_csv('./data/test_fin.csv', index=False)

Dev_fin.to_csv('./data/dev_fin.csv', index=False)