# Task 1 - Disease Named Entity Recognition (D-NER)
## Preprocessing Steps

### Introduction

This project uses the MIMIC-III dataset as training data for a Knodle sequence tagger model.
For development and evaluation of the model, the NCIT dataset is used, whose existing tags are:
DiseaseClass, SpecificDisease, Modifier, CompositeMention
To supplement the MIMIC-III dataset for this task, three other datasets were consulted to help map terms onto these classes:

National Cancer Institute Thesaurus (NCI9d): Thesaurus.txt, NCI Thesaurus version 22.09d, flattened from the OWL format,
available from the NCI download page: https://evs.nci.nih.gov/evs-download/thesaurus-downloads

The extensive UMLS (Unified Medical Language System) dataset,
accessible after authentication at: https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
Specifically, the table NC.DB

MEDIC: CTD's MEDIC disease vocabulary is a modified subset of descriptors from the “Diseases” [C] branch of the U.S. National Library of Medicine's Medical Subject Headings (MeSH®), combined with genetic disorders from the Online Mendelian Inheritance in Man® (OMIM®) database. It is available from the Comparative Toxicogenomics Database: https://ctdbase.org/downloads/#alldiseases

Note 1: A later section exports all of the processed elements so they can be imported again; to reuse them, simply skip ahead to that section.

Note 2: All data too large for GitHub will be available for download through a link in the repository description.

### Examination

In this section each dataset is examined in turn: the relevant columns are extracted and processed into lists of strings for use in rule-based labeling functions.
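To make the goal of these keyword lists concrete, here is a minimal sketch of how such a list could drive a rule-based labeling function. The function name `find_keyword_spans`, the toy keyword set and the `max_n` parameter are illustrative assumptions, not part of this project's code or of any labeling framework's API.

```python
def find_keyword_spans(tokens, keywords, max_n=7):
    """Return (start, end) token spans whose lowercased n-gram is in `keywords`.

    Longer n-grams are tried first, so the longest match at each position
    wins; `end` is exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, down to a single token.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in keywords:
                spans.append((i, i + n))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


# Toy keyword set (assumption) standing in for one of the lists built below.
toy_keywords = {"tuberculous pneumothorax", "pneumonia"}
tokens = "Patient with Tuberculous pneumothorax and pneumonia".split()
print(find_keyword_spans(tokens, toy_keywords))  # [(2, 4), (5, 6)]
```

Storing the keywords in a `set` keeps each lookup cheap, which matters when the lists contain tens of thousands of entries.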

In [81]:
import pandas as pd
import os
from collections import Counter
D_ICD_DIAGNOSES=pd.read_csv('../../../mimic-iii-clinical-database-1.4/D_ICD_DIAGNOSES.csv', sep=',', header=0)
D_ICD_DIAGNOSES.head(5)


Unnamed: 0,ROW_ID,ICD9_CODE,SHORT_TITLE,LONG_TITLE
0,174,1166,TB pneumonia-oth test,"Tuberculous pneumonia [any form], tubercle bac..."
1,175,1170,TB pneumothorax-unspec,"Tuberculous pneumothorax, unspecified"
2,176,1171,TB pneumothorax-no exam,"Tuberculous pneumothorax, bacteriological or h..."
3,177,1172,TB pneumothorx-exam unkn,"Tuberculous pneumothorax, bacteriological or h..."
4,178,1173,TB pneumothorax-micro dx,"Tuberculous pneumothorax, tubercle bacilli fou..."


In [82]:
DiseaseClass1=D_ICD_DIAGNOSES.loc[:,'SHORT_TITLE']
DC1=set(DiseaseClass1.values.tolist()) 
SpecificDisease1=D_ICD_DIAGNOSES.loc[:,'LONG_TITLE']
SD1=set(SpecificDisease1.values.tolist())

In [83]:
CTD_diseases=pd.read_csv('./CTD_diseases.csv', sep=',', header=0)
CTD_diseases.head(5)

Unnamed: 0,DiseaseName,DiseaseID,AltDiseaseIDs,Definition,ParentIDs,TreeNumbers,ParentTreeNumbers,Synonyms,SlimMappings
0,10p Deletion Syndrome (Partial),MESH:C538288,,,MESH:D002872|MESH:D025063,C16.131.260/C538288|C16.320.180/C538288|C23.55...,C16.131.260|C16.320.180|C23.550.210.050.500.500,"Chromosome 10, 10p- Partial|Chromosome 10, mon...",Congenital abnormality|Genetic disease (inborn...
1,13q deletion syndrome,MESH:C535484,,,MESH:D002872|MESH:D025063,C16.131.260/C535484|C16.320.180/C535484|C23.55...,C16.131.260|C16.320.180|C23.550.210.050.500.500,Chromosome 13q deletion|Chromosome 13q deletio...,Congenital abnormality|Genetic disease (inborn...
2,15q24 Microdeletion,MESH:C579849,DO:DOID:0060395,,MESH:D002872|MESH:D008607|MESH:D025063,C10.597.606.360/C579849|C16.131.260/C579849|C1...,C10.597.606.360|C16.131.260|C16.320.180|C23.55...,15q24 Deletion|15q24 Microdeletion Syndrome|In...,Congenital abnormality|Genetic disease (inborn...
3,16p11.2 Deletion Syndrome,MESH:C579850,,,MESH:D001321|MESH:D002872|MESH:D008607|MESH:D0...,C10.597.606.360/C579850|C16.131.260/C579850|C1...,C10.597.606.360|C16.131.260|C16.320.180|C23.55...,,Congenital abnormality|Genetic disease (inborn...
4,"17,20-Lyase Deficiency, Isolated",MESH:C567076,,,MESH:D000312,C12.050.351.875.253.090.500/C567076|C12.200.70...,C12.050.351.875.253.090.500|C12.200.706.316.09...,"17-Alpha-Hydroxylase-17,20-Lyase Deficiency, C...",Congenital abnormality|Endocrine system diseas...


In [84]:
DiseaseClass2=CTD_diseases.loc[:,'SlimMappings']
DC2_input= DiseaseClass2.values.tolist()
DC2=[]
for former_row in DC2_input:
    Sublist=str(former_row).split('|')
    for item in Sublist:
        DC2.append(item) 
DC2=set(DC2)

In [85]:
SpecificDisease2=CTD_diseases.loc[:,'DiseaseName']
SD2=set(SpecificDisease2.values.tolist())
SpecificDisease3=CTD_diseases.loc[:,'Synonyms']
SD3_input=set(SpecificDisease3.values.tolist())

In [86]:
SD3=[]
for former_row in SD3_input:
    Sublist=str(former_row).split('|')
    for item in Sublist:
        SD3.append(item) 
SD3=set(SD3)
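
One detail of the splitting loops above: `str(former_row)` turns a missing cell into the literal string `nan`, which then ends up in the resulting set. If that is not wanted, a small helper along the following lines could be used instead. This is only a sketch; `split_pipe_cells` is a hypothetical name and the sample values are made up for illustration.

```python
import math


def split_pipe_cells(cells):
    """Split pipe-delimited cells into a set of individual terms."""
    terms = set()
    for cell in cells:
        # Skip missing values (None or float NaN) instead of stringifying them.
        if cell is None or (isinstance(cell, float) and math.isnan(cell)):
            continue
        for item in str(cell).split("|"):
            item = item.strip()
            if item:  # ignore empty fragments such as a trailing '|'
                terms.add(item)
    return terms


sample_cells = ["15q24 Deletion|15q24 Microdeletion Syndrome", float("nan"), "Chromosome 13q deletion"]
print(sorted(split_pipe_cells(sample_cells)))
```

With the notebook's variables, a call such as `split_pipe_cells(SpecificDisease3.values.tolist())` would be the intended usage.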

In [87]:
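# The flat file has no header row, so pandas uses the first record as column names;
# the last column (named after that record's semantic type) holds the semantic type.
NCI9d=pd.read_csv('./Thesaurus.text', sep='\t')
NCI9d_disease_subset=NCI9d.loc[NCI9d['Therapeutic or Preventive Procedure'] == 'Disease or Syndrome']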
NCI9d_disease_subset.head()

Unnamed: 0,C100000,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C100000>,C99521,"Percutaneous Coronary Intervention for ST Elevation Myocardial Infarction-Stable-Over 12 Hours From Symptom Onset|PERCUTANEOUS CORONARY INTERVENTION (PCI) FOR ST ELEVATION MYOCARDIAL INFARCTION (STEMI) (STABLE, >12 HRS FROM SYMPTOM ONSET)","A percutaneous coronary intervention is necessary for a myocardial infarction that presents with ST segment elevation and the subject does not have recurrent or persistent symptoms, symptoms of heart failure or ventricular arrhythmia. The presentation is past twelve hours since onset of symptoms. (ACC)",Unnamed: 5,Unnamed: 6,Therapeutic or Preventive Procedure
12,C100012,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C50797,Severe Cardiac Valve Regurgitation,Evidence of severe retrograde blood flow throu...,,,Disease or Syndrome
21,C100020,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35317,Three Vessel Coronary Disease|THREE VESSEL DIS...,There was greater than or equal to 50% stenosi...,,,Disease or Syndrome
24,C100023,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35317,Two Vessel Coronary Disease|TWO VESSEL DISEASE,There was greater than or equal to 50% stenosi...,,,Disease or Syndrome
67,C100062,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C99938,Chronic Total Coronary Artery Occlusion,Prolonged complete obstruction of the coronary...,,,Disease or Syndrome
76,C100070,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35279,Coronary Venous Dissection,A tear within the wall of a coronary vein. (ACC),,,Disease or Syndrome


In [88]:
SpecificDisease4=NCI9d_disease_subset.iloc[:,3]
SD4_input=SpecificDisease4.values.tolist()

SD4=[]
for former_row in SD4_input:
    Sublist=str(former_row).split('|')
    for item in Sublist:
        SD4.append(item) 
SD4=set(SD4)
#print(SD4)

In [89]:
ULMS_terminology=pd.read_csv("./NC.DB", sep = '|')
Modifiers=ULMS_terminology.iloc[:,0]
Mod_pre=Modifiers.values.tolist()

In [90]:
def skip_brackets(test_str):
    ret = ''
    skip1c = 0
    skip2c = 0
    for i in test_str:
        if i == '[':
            skip1c += 1
        elif i == '(':
            skip2c += 1
        elif i == ']' and skip1c > 0:
            skip1c -= 1
        elif i == ')'and skip2c > 0:
            skip2c -= 1
        elif skip1c == 0 and skip2c == 0:
            ret += i
    return ret
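
A quick check of what `skip_brackets` returns can be useful, because removing bracketed content leaves the surrounding whitespace behind. The sketch below repeats the function so that it runs on its own and adds a hypothetical wrapper, `strip_bracketed`, that collapses the leftover whitespace; the sample string is made up for illustration.

```python
def skip_brackets(test_str):
    # Same logic as the cell above, repeated so this block runs on its own.
    ret = ''
    skip1c = 0
    skip2c = 0
    for i in test_str:
        if i == '[':
            skip1c += 1
        elif i == '(':
            skip2c += 1
        elif i == ']' and skip1c > 0:
            skip1c -= 1
        elif i == ')' and skip2c > 0:
            skip2c -= 1
        elif skip1c == 0 and skip2c == 0:
            ret += i
    return ret


def strip_bracketed(text):
    """Drop bracketed content, then collapse the whitespace left behind."""
    return " ".join(skip_brackets(text).split())


sample = "Tuberculous pneumonia [any form], unspecified (acute)"
print(repr(skip_brackets(sample)))    # 'Tuberculous pneumonia , unspecified '
print(repr(strip_bracketed(sample)))  # 'Tuberculous pneumonia , unspecified'
```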

    

In [91]:
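import re

Mod=[]

# Variant 1: drop bracketed content entirely
for term in Mod_pre:
    a=skip_brackets(str(term))
    Mod.append(a)

# Variant 2: keep bracketed content, but remove the parentheses and any digits
# (str.replace treats its argument literally, so the digit pattern needs re.sub)
for term in Mod_pre:
    x=re.sub(r'[0-9]+', '', str(term).replace('(','').replace(')',''))
    Mod.append(x)
    
Mod=set(Mod)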

In [92]:
SD1 = [str(x).lower() for x in SD1]
SD2 = [str(x).lower() for x in SD2]
SD3 = [str(x).lower() for x in SD3]
SD4 = [str(x).lower() for x in SD4]
DC1 = [str(x).lower() for x in DC1]
DC2 = [str(x).lower() for x in DC2]
Mod = [str(x).lower() for x in Mod]

For the CompositeMention class, the rule is the occurrence of a DiseaseClass or SpecificDisease term followed by another such term, with `and`, `or`, `/` or `\` between them.
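
The composite rule described above can be sketched with a regular expression. This is an illustration under assumptions: `build_composite_pattern` is a hypothetical helper, the toy terms stand in for the DiseaseClass and SpecificDisease lists, and a single alternation over tens of thousands of terms may be too slow in practice.

```python
import re


def build_composite_pattern(terms):
    """Compile a regex matching `<term> <connector> <term>` for the given terms."""
    # Longest terms first, so the alternation prefers the longest match.
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    # Connectors: the words 'and' / 'or', or a forward or backward slash.
    connector = r"(?:\s+and\s+|\s+or\s+|\s*[/\\]\s*)"
    return re.compile(rf"\b(?:{alternation}){connector}(?:{alternation})\b", re.IGNORECASE)


# Toy terms (assumption) standing in for the real keyword lists.
toy_terms = {"breast cancer", "ovarian cancer", "colorectal cancer"}
pattern = build_composite_pattern(toy_terms)

text = "Family history of breast cancer and ovarian cancer was noted."
match = pattern.search(text)
print(match.group(0) if match else None)  # breast cancer and ovarian cancer
```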

### Measuring and Pruning

To choose an upper limit for n-gram length, we look at how many tokens the keywords in each list contain.
At the same time, we remove stopwords, punctuation and digits.
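
Note that the filter `if string not in stopwords` in the cells below only drops entries that are themselves a single stopword; stopwords inside multi-word terms stay in place. If token-level removal is wanted, it could look roughly like the sketch below. The small `TOY_STOPWORDS` set is an assumption that keeps the example self-contained (the notebook uses NLTK's English list), and `clean_term` is a hypothetical helper.

```python
import string

# Small stand-in stopword set (assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"of", "the", "and", "in", "with"}

# Keep '/' and '\' because the composite rule relies on them.
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("/", "").replace("\\", ""))


def clean_term(term, stop_words=TOY_STOPWORDS):
    """Lowercase, strip punctuation and digits, and drop stopword tokens."""
    term = term.lower().translate(PUNCT_TABLE)
    term = "".join(ch for ch in term if not ch.isdigit())
    return " ".join(tok for tok in term.split() if tok not in stop_words)


print(clean_term("Tuberculosis of the lung, unspecified (011.9)"))
# tuberculosis lung unspecified
```

Whether connectors such as "and" should be removed from the keyword entries is a design choice; the composite rule needs them in the text being labeled, not in the keyword lists.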

In [93]:
#import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
import copy
import string

punctuation = string.punctuation.replace('/', '')
punctuation = punctuation.replace('\\', '')


%pprint
num_words_DC1 = [len(element.split()) for element in DC1]  
print(Counter(num_words_DC1))

DC1_tidy = []
DC1_cleaned=copy.deepcopy(DC1)  # work on a copy of DC1
for string in DC1_cleaned:
    if string not in stopwords:
        DC1_tidy.append(string.translate(str.maketrans('', '', punctuation)))

Pretty printing has been turned OFF
Counter({3: 5052, 4: 3769, 2: 3077, 5: 1561, 1: 564, 6: 263, 7: 38, 8: 4})


In [94]:
%pprint
num_words_DC2 = [len(element.split()) for element in DC2]  
print(Counter(num_words_DC2))

DC2_tidy = []
DC2_cleaned=copy.deepcopy(DC2)  # work on a copy of DC2
for string in DC2_cleaned:
    if string not in stopwords:
        DC2_tidy.append(string.translate(str.maketrans('', '', punctuation)))

Pretty printing has been turned ON
Counter({2: 21, 3: 13, 1: 2, 4: 1})


In [95]:
%pprint
num_words_SD1 = [len(element.split()) for element in SD1]  
#print(Counter(num_words_SD1))

Pretty printing has been turned OFF


In [96]:
%pprint
num_words_SD3 = [len(element.split()) for element in SD3]  
print(Counter(num_words_SD3))

Pretty printing has been turned ON
Counter({2: 26708, 3: 23179, 1: 9959, 4: 6596, 5: 3883, 6: 2028, 7: 1169, 8: 603, 9: 378, 10: 244, 11: 178, 12: 80, 13: 43, 14: 43, 15: 19, 16: 8, 18: 7, 17: 4, 20: 3, 19: 3, 21: 2, 24: 2})


In [97]:
SD3_tidy = []
SD3_cleaned=copy.deepcopy(SD3)
for string in SD3_cleaned:
    if string not in stopwords:
        SD3_tidy.append(string.translate(str.maketrans('', '', punctuation)))
#SD3_tidy

In [98]:
%pprint
num_words_SD3_tidy = [len(element.split()) for element in SD3_tidy]  
print(Counter(num_words_SD3_tidy))

Pretty printing has been turned OFF
Counter({2: 26709, 3: 23184, 1: 9951, 4: 6596, 5: 3880, 6: 2030, 7: 1166, 8: 601, 9: 379, 10: 243, 11: 178, 12: 80, 13: 43, 14: 43, 15: 19, 16: 8, 18: 7, 17: 4, 19: 4, 20: 2, 21: 2, 24: 2})


In [99]:
%pprint
num_words_SD1 = [len(element.split()) for element in SD1]  
#print(Counter(num_words_SD1))
SD1_tidy = []
SD1_cleaned=copy.deepcopy(SD1)  # work on a copy of SD1
for string in SD1_cleaned:
    if string not in stopwords:
        SD1_tidy.append(string.translate(str.maketrans('', '', punctuation)))

%pprint
num_words_SD1_tidy = [len(element.split()) for element in SD1_tidy]  
print(Counter(num_words_SD1_tidy))

Pretty printing has been turned ON
Pretty printing has been turned OFF


In [100]:
%pprint
num_words_SD2 = [len(element.split()) for element in SD2]  
print(Counter(num_words_SD2))

SD2_tidy = []
SD2_cleaned=copy.deepcopy(SD2)  # work on a copy of SD2
for string in SD2_cleaned:
    if string not in stopwords:
        SD2_tidy.append(string.translate(str.maketrans('', '', punctuation)))

num_words_SD2_tidy = [len(element.split()) for element in SD2_tidy]  
print(Counter(num_words_SD2_tidy))

Pretty printing has been turned ON
Counter({2: 3629, 3: 2782, 4: 2296, 5: 1373, 1: 1373, 6: 733, 7: 378, 8: 253, 9: 162, 10: 98, 11: 58, 12: 30, 13: 11, 14: 7, 15: 5, 16: 1, 19: 1})


In [101]:
%pprint
num_words_SD4 = [len(element.split()) for element in SD4]  
#print(Counter(num_words_SD4))

SD4_tidy = []
SD4_cleaned=copy.deepcopy(SD4)  # work on a copy of SD4
for string in SD4_cleaned:
    if string not in stopwords:
        SD4_tidy.append(string.translate(str.maketrans('', '', punctuation)))

num_words_SD4_tidy = [len(element.split()) for element in SD4_tidy]  
print(Counter(num_words_SD4_tidy))

Pretty printing has been turned OFF


In [102]:
%pprint
num_words_Mod = [len(element.split()) for element in Mod]  
print(Counter(num_words_Mod))

Mod_tidy = []
Mod_cleaned=copy.deepcopy(Mod)  # work on a copy of Mod
for string in Mod_cleaned:
    if string not in stopwords:
        Mod_tidy.append(string.translate(str.maketrans('', '', punctuation)))


Pretty printing has been turned ON
Counter({1: 1720})


In [103]:
def strip_digits(terms):
    # Remove every digit character from each term
    return [''.join(ch for ch in term if not ch.isdigit()) for term in terms]

DC1_trim = strip_digits(DC1_tidy)
DC2_trim = strip_digits(DC2_tidy)
SD1_trim = strip_digits(SD1_tidy)
SD2_trim = strip_digits(SD2_tidy)
SD3_trim = strip_digits(SD3_tidy)
SD4_trim = strip_digits(SD4_tidy)
Mod_trim = strip_digits(Mod_tidy)


A sensible cutoff that retains most of our keywords for the rules is 7-grams. In the next step, all entries with more than seven tokens are removed from the lists.
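
To back the choice of cutoff with a number, the cumulative share of entries per token length can be computed from the counters printed above. The sketch below uses the SD3 counts as printed; `coverage_by_cutoff` is a hypothetical helper name.

```python
from collections import Counter


def coverage_by_cutoff(length_counts):
    """Return {n: fraction of entries with at most n tokens}."""
    total = sum(length_counts.values())
    running = 0
    coverage = {}
    for n in sorted(length_counts):
        running += length_counts[n]
        coverage[n] = running / total
    return coverage


# Token-length counts for SD3, copied from the output above.
sd3_counts = Counter({2: 26708, 3: 23179, 1: 9959, 4: 6596, 5: 3883, 6: 2028,
                      7: 1169, 8: 603, 9: 378, 10: 244, 11: 178, 12: 80, 13: 43,
                      14: 43, 15: 19, 16: 8, 18: 7, 17: 4, 20: 3, 19: 3, 21: 2, 24: 2})

sd3_coverage = coverage_by_cutoff(sd3_counts)
print(f"{sd3_coverage[7]:.3f}")
```

For SD3 this gives roughly 97.8% of entries with seven tokens or fewer, which is consistent with the cutoff chosen here.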

In [104]:
def keep_up_to(terms, max_tokens=7):
    # Keep only the terms with at most max_tokens whitespace-separated tokens
    return [term for term in terms if len(term.split()) <= max_tokens]

DC1_groomed = keep_up_to(DC1_trim)
DC2_groomed = keep_up_to(DC2_trim)
SD1_groomed = keep_up_to(SD1_trim)
SD2_groomed = keep_up_to(SD2_trim)
SD3_groomed = keep_up_to(SD3_trim)
SD4_groomed = keep_up_to(SD4_trim)
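
After punctuation and digits are stripped, some entries may become empty or collapse into duplicates of one another. If that matters for the rules, a final pass along these lines could be applied before export. This is a sketch; `dedupe_nonempty` is a hypothetical helper and the sample list is made up.

```python
def dedupe_nonempty(terms):
    """Normalise whitespace, drop empty entries, and remove duplicates in order."""
    seen = set()
    result = []
    for term in terms:
        term = " ".join(term.split())  # collapse runs of whitespace
        if term and term not in seen:
            seen.add(term)
            result.append(term)
    return result


print(dedupe_nonempty(["type  diabetes", "type diabetes", "", "  ", "asthma"]))
# ['type diabetes', 'asthma']
```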

## Export

Now that our lists are looking dapper and shiny, we will wrap them up with a bow and store them, both for later use and to ease collaboration.

In [105]:

# Opening a file in 'w' mode truncates it, so existing exports are overwritten
exports = {
    'DC1.txt': DC1_groomed,
    'DC2.txt': DC2_groomed,
    'SD1.txt': SD1_groomed,
    'SD2.txt': SD2_groomed,
    'SD3.txt': SD3_groomed,
    'SD4.txt': SD4_groomed,
    'Mod.txt': Mod_trim,
}

for filename, terms in exports.items():
    with open(filename, 'w') as fp:
        for item in terms:
            # writes each entry on a new line
            fp.write("%s\n" % item)
    print('Done')

Done
Done
Done
Done
Done
Done
Done
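
For the later import step, the exported files can be read back one term per line. The sketch below shows a possible loader and runs a round trip on a temporary file; `load_terms` and the demo file name are hypothetical, and the `utf-8` encoding is an assumption that should match whatever encoding was used when writing.

```python
import os
import tempfile


def load_terms(path):
    """Read one term per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fp:
        return [line.strip() for line in fp if line.strip()]


# Round trip on a temporary file, so the sketch does not touch the real exports.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_terms.txt")
    with open(demo_path, "w", encoding="utf-8") as fp:
        fp.write("asthma\n\ntuberculous pneumothorax\n")
    loaded = load_terms(demo_path)

print(loaded)  # ['asthma', 'tuberculous pneumothorax']
```

With the files written above, a call such as `load_terms('SD3.txt')` would be the intended usage.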
