# Task 1 - Disease Named Entity Recognition (D-NER)
## Preprocessing Steps

### Introduction

This project uses the MIMIC-III dataset as training data for a Knodle sequence tagger model.
For the development and evaluation of the model, the NCBI disease corpus is used, whose existing tags are:
DiseaseClass, SpecificDisease, Modifier, CompositeMention
To supplement the MIMIC-III dataset for this task, three other datasets were consulted to help fit terms to specific categories that match the structure of these classes:

National Cancer Institute Thesaurus: NCI9d - Thesaurus.txt: NCI Thesaurus version 22.09d, flattened from the OWL format, available from the NCI download page: https://evs.nci.nih.gov/evs-download/thesaurus-downloads

The vast UMLS dataset (Unified Medical Language System),
accessible after authentication at: https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
Specifically the table NB.DB

MEDIC - CTD's MEDIC disease vocabulary is a modified subset of descriptors from the “Diseases” [C] branch of the U.S. National Library of Medicine's Medical Subject Headings (MeSH®), combined with genetic disorders from the Online Mendelian Inheritance in Man® (OMIM®) database. Available from the Comparative Toxicogenomics Database: https://ctdbase.org/downloads/#alldiseases

Note 1: In a later section all previously processed elements are exported and imported again. To reuse them, it is convenient to skip directly to that section.

Note 2: All data that is too big for GitHub will be available for download through a link in the repository description.

### Examination

In the following section each of the datasets in use is examined. The relevant sections are stripped from them and further processed into lists of strings to be used in rule-based labeling functions.
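As a rough illustration of how such keyword lists might feed a rule-based labeling function, the sketch below tags each token that appears in a term list. The function name `label_tokens`, the label strings, and the example terms are assumptions made for illustration; the labeling functions actually used with the model may work on n-grams and differ in detail.

```python
def label_tokens(tokens, specific_terms, class_terms):
    """Assign a coarse label to each token by simple keyword lookup.

    Hypothetical helper for illustration only.
    """
    specific = set(specific_terms)
    classes = set(class_terms)
    labels = []
    for token in tokens:
        word = token.lower()
        if word in specific:
            labels.append('SpecificDisease')
        elif word in classes:
            labels.append('DiseaseClass')
        else:
            labels.append('OTHER')
    return labels


# Toy term lists standing in for the SpecificDisease and DiseaseClass keyword lists
example_labels = label_tokens(
    ['Patient', 'has', 'thymomas', 'and', 'tumors'],
    specific_terms=['thymomas'],
    class_terms=['tumors'],
)
print(example_labels)
```

Set lookups keep the per-token cost constant, which matters once the term lists grow to several thousand entries.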

In [1]:
import pandas as pd
import os
from collections import Counter
import scipy
import re
import string
import copy
from nltk.corpus import stopwords
from nltk.util import ngrams

# From here on, stopwords is the plain list of English stopwords
stopwords = stopwords.words('english')

punctuation = string.punctuation.replace('/', '')
punctuation = punctuation.replace('\\', '')


In [3]:
D_ICD_DIAGNOSES=pd.read_csv('../../../mimic-iii-clinical-database-1.4/D_ICD_DIAGNOSES.csv', sep=',', header=0)
D_ICD_DIAGNOSES.head(5)

Unnamed: 0,ROW_ID,ICD9_CODE,SHORT_TITLE,LONG_TITLE
0,174,1166,TB pneumonia-oth test,"Tuberculous pneumonia [any form], tubercle bac..."
1,175,1170,TB pneumothorax-unspec,"Tuberculous pneumothorax, unspecified"
2,176,1171,TB pneumothorax-no exam,"Tuberculous pneumothorax, bacteriological or h..."
3,177,1172,TB pneumothorx-exam unkn,"Tuberculous pneumothorax, bacteriological or h..."
4,178,1173,TB pneumothorax-micro dx,"Tuberculous pneumothorax, tubercle bacilli fou..."


In [4]:
DiseaseClass1=D_ICD_DIAGNOSES.loc[:,'SHORT_TITLE']
DC1=set(DiseaseClass1.values.tolist()) 
SpecificDisease1=D_ICD_DIAGNOSES.loc[:,'LONG_TITLE']
SD1=set(SpecificDisease1.values.tolist())

In [3]:
CTD_diseases=pd.read_csv('./data/CTD_diseases.csv', sep=',', header=0)
CTD_diseases.head(5)

Unnamed: 0,DiseaseName,DiseaseID,AltDiseaseIDs,Definition,ParentIDs,TreeNumbers,ParentTreeNumbers,Synonyms,SlimMappings
0,10p Deletion Syndrome (Partial),MESH:C538288,,,MESH:D002872|MESH:D025063,C16.131.260/C538288|C16.320.180/C538288|C23.55...,C16.131.260|C16.320.180|C23.550.210.050.500.500,"Chromosome 10, 10p- Partial|Chromosome 10, mon...",Congenital abnormality|Genetic disease (inborn...
1,13q deletion syndrome,MESH:C535484,,,MESH:D002872|MESH:D025063,C16.131.260/C535484|C16.320.180/C535484|C23.55...,C16.131.260|C16.320.180|C23.550.210.050.500.500,Chromosome 13q deletion|Chromosome 13q deletio...,Congenital abnormality|Genetic disease (inborn...
2,15q24 Microdeletion,MESH:C579849,DO:DOID:0060395,,MESH:D002872|MESH:D008607|MESH:D025063,C10.597.606.360/C579849|C16.131.260/C579849|C1...,C10.597.606.360|C16.131.260|C16.320.180|C23.55...,15q24 Deletion|15q24 Microdeletion Syndrome|In...,Congenital abnormality|Genetic disease (inborn...
3,16p11.2 Deletion Syndrome,MESH:C579850,,,MESH:D001321|MESH:D002872|MESH:D008607|MESH:D0...,C10.597.606.360/C579850|C16.131.260/C579850|C1...,C10.597.606.360|C16.131.260|C16.320.180|C23.55...,,Congenital abnormality|Genetic disease (inborn...
4,"17,20-Lyase Deficiency, Isolated",MESH:C567076,,,MESH:D000312,C12.050.351.875.253.090.500/C567076|C12.200.70...,C12.050.351.875.253.090.500|C12.200.706.316.09...,"17-Alpha-Hydroxylase-17,20-Lyase Deficiency, C...",Congenital abnormality|Endocrine system diseas...


In [6]:
DiseaseClass2=CTD_diseases.loc[:,'SlimMappings']
DC2_input= DiseaseClass2.values.tolist()
DC2=[]
for former_row in DC2_input:
    Sublist=str(former_row).split('|')
    for item in Sublist:
        DC2.append(item) 
DC2=set(DC2)

In [7]:
SpecificDisease2=CTD_diseases.loc[:,'DiseaseName']
SD2_input=set(SpecificDisease2.values.tolist())

In [8]:
SD2=[]
for former_row in SD2_input:
    Sublist=str(former_row).split('|')
    for item in Sublist:
        SD2.append(item) 
SD2=set(SD2)

In [9]:
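# The file has no header row, so pandas takes the first record as column names.
# The semantic type therefore ends up in the column named 'Therapeutic or Preventive Procedure'.
NCI9d=pd.read_csv('./data/Thesaurus.text', sep='\t')
NCI9d_disease_subset=NCI9d.loc[NCI9d['Therapeutic or Preventive Procedure'] == 'Disease or Syndrome']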
NCI9d_disease_subset.head()

Unnamed: 0,C100000,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C100000>,C99521,"Percutaneous Coronary Intervention for ST Elevation Myocardial Infarction-Stable-Over 12 Hours From Symptom Onset|PERCUTANEOUS CORONARY INTERVENTION (PCI) FOR ST ELEVATION MYOCARDIAL INFARCTION (STEMI) (STABLE, >12 HRS FROM SYMPTOM ONSET)","A percutaneous coronary intervention is necessary for a myocardial infarction that presents with ST segment elevation and the subject does not have recurrent or persistent symptoms, symptoms of heart failure or ventricular arrhythmia. The presentation is past twelve hours since onset of symptoms. (ACC)",Unnamed: 5,Unnamed: 6,Therapeutic or Preventive Procedure
12,C100012,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C50797,Severe Cardiac Valve Regurgitation,Evidence of severe retrograde blood flow throu...,,,Disease or Syndrome
21,C100020,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35317,Three Vessel Coronary Disease|THREE VESSEL DIS...,There was greater than or equal to 50% stenosi...,,,Disease or Syndrome
24,C100023,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35317,Two Vessel Coronary Disease|TWO VESSEL DISEASE,There was greater than or equal to 50% stenosi...,,,Disease or Syndrome
67,C100062,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C99938,Chronic Total Coronary Artery Occlusion,Prolonged complete obstruction of the coronary...,,,Disease or Syndrome
76,C100070,<http://ncicb.nci.nih.gov/xml/owl/EVS/Thesauru...,C35279,Coronary Venous Dissection,A tear within the wall of a coronary vein. (ACC),,,Disease or Syndrome


In [10]:
SpecificDisease4=NCI9d_disease_subset.iloc[:,3]
SD3_input=SpecificDisease4.values.tolist()

SD3=[]
for former_row in SD3_input:
    Sublist=str(former_row).split('|')
    for item in Sublist:
        SD3.append(item) 
SD3=set(SD3)
#print(SD3)

In [11]:
ULMS_terminology=pd.read_csv("./NC.DB", sep = '|')
Modifiers=ULMS_terminology.iloc[:,0]
Mod_pre=Modifiers.values.tolist()

In [12]:
def skip_brackets(test_str):
    ret = ''
    skip1c = 0
    skip2c = 0
    for i in test_str:
        if i == '[':
            skip1c += 1
        elif i == '(':
            skip2c += 1
        elif i == ']' and skip1c > 0:
            skip1c -= 1
        elif i == ')'and skip2c > 0:
            skip2c -= 1
        elif skip1c == 0 and skip2c == 0:
            ret += i
    return ret

    

In [13]:
Mod=[]

for term in Mod_pre:
    a=skip_brackets(str(term))
    Mod.append(a)

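for term in Mod_pre:
    # str.replace matches literal text only, so a regex pattern has no effect here;
    # digits are stripped in a later cell
    x=str(term).replace('(','').replace(')','')
    Mod.append(x)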
    
Mod=set(Mod)

In [14]:
SD1 = [str(x).lower() for x in SD1]
SD2 = [str(x).lower() for x in SD2]
SD3 = [str(x).lower() for x in SD3]
DC1 = [str(x).lower() for x in DC1]
DC2 = [str(x).lower() for x in DC2]
Mod = [str(x).lower() for x in Mod]

For the CompositeMention class, we look for a DiseaseClass or SpecificDisease term followed by another such term, with an instance of and, or, / or \ between them.
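A minimal sketch of such a rule, assuming the disease terms are available as a flat list of lowercase strings. The function name `find_composites` and the sample terms are hypothetical and not part of the notebook.

```python
import re


def find_composites(text, disease_terms):
    """Return spans where two known disease terms are joined by a connector.

    Connectors are the words 'and' / 'or' or a forward or backward slash.
    """
    # Longest terms first so that multi-word terms win over shorter ones
    alternatives = '|'.join(
        re.escape(term) for term in sorted(disease_terms, key=len, reverse=True)
    )
    pattern = re.compile(
        r'\b(?:{0})\s*(?:\band\b|\bor\b|/|\\)\s*(?:{0})\b'.format(alternatives)
    )
    return [match.group(0) for match in pattern.finditer(text.lower())]


matches = find_composites(
    'History of breast and ovarian cancer, no colorectal cancer.',
    ['breast', 'ovarian cancer', 'colorectal cancer'],
)
print(matches)
```

Compiling one alternation over all terms avoids a nested loop over term pairs, although with very large term lists the pattern itself can become slow to build and match.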

### Measuring and Pruning

To evaluate the requirements for an n-gram upper limit, we take a look at the word-count distributions of our keyword lists.
At the same time we drop entries that are themselves stopwords and strip punctuation; digits are removed in a later cell.
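The cells below compare whole entries against the stopword list. If stopwords inside multi-word terms should be removed as well, a per-token variant could look like the following sketch. The helper `clean_term` and the tiny stopword set are assumptions for illustration, not part of the notebook.

```python
import string


def clean_term(term, stop_words, keep='/\\'):
    """Lowercase a term, strip punctuation (except `keep`) and drop stopword tokens."""
    removable = ''.join(ch for ch in string.punctuation if ch not in keep)
    stripped = term.lower().translate(str.maketrans('', '', removable))
    return ' '.join(word for word in stripped.split() if word not in stop_words)


# Tiny stand-in for the NLTK English stopword list
demo_stop_words = {'of', 'the', 'in'}
cleaned = clean_term('Tuberculosis of the lung, unspecified', demo_stop_words)
print(cleaned)
```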

In [15]:
%pprint
num_words_DC1 = [len(element.split()) for element in DC1]  
print(Counter(num_words_DC1))

DC1_tidy = []
DC1_cleaned=copy.deepcopy(DC1)
for term in DC1_cleaned:  # avoid shadowing the string module
    if term not in stopwords:
        DC1_tidy.append(term.translate(str.maketrans('', '', punctuation)))

Pretty printing has been turned OFF
Counter({3: 5052, 4: 3769, 2: 3077, 5: 1561, 1: 564, 6: 263, 7: 38, 8: 4})


In [16]:
%pprint
num_words_DC2 = [len(element.split()) for element in DC2]  
print(Counter(num_words_DC2))

DC2_tidy = []
DC2_cleaned=copy.deepcopy(DC2)
for term in DC2_cleaned:  # avoid shadowing the string module
    if term not in stopwords:
        DC2_tidy.append(term.translate(str.maketrans('', '', punctuation)))

Pretty printing has been turned ON
Counter({2: 21, 3: 13, 1: 2, 4: 1})


In [17]:
%pprint
num_words_SD1 = [len(element.split()) for element in SD1]  
print(Counter(num_words_SD1))

Pretty printing has been turned OFF
Counter({5: 1784, 4: 1722, 3: 1574, 6: 1334, 7: 1234, 2: 1169, 8: 1022, 9: 786, 10: 717, 11: 592, 12: 468, 1: 330, 13: 325, 14: 299, 15: 237, 16: 189, 17: 155, 18: 145, 19: 100, 21: 75, 20: 66, 22: 64, 24: 34, 23: 32, 26: 29, 27: 23, 25: 20, 29: 19, 30: 8, 28: 5, 32: 4, 31: 1})


In [18]:
%pprint
num_words_SD3 = [len(element.split()) for element in SD3]  
print(Counter(num_words_SD3))

Pretty printing has been turned ON
Counter({2: 3858, 3: 2902, 1: 2107, 4: 1533, 5: 822, 6: 362, 7: 254, 8: 113, 9: 45, 10: 32, 11: 19, 12: 6, 15: 1, 17: 1, 16: 1, 13: 1})


In [19]:
%pprint
num_words_SD1 = [len(element.split()) for element in SD1]  
#print(Counter(num_words_SD1))
SD1_tidy = []
SD1_cleaned=copy.deepcopy(SD1)
for term in SD1_cleaned:  # avoid shadowing the string module
    if term not in stopwords:
        SD1_tidy.append(term.translate(str.maketrans('', '', punctuation)))

%pprint
num_words_SD1_tidy = [len(element.split()) for element in SD1_tidy]  
print(Counter(num_words_SD1_tidy))

Pretty printing has been turned OFF
Pretty printing has been turned ON
Counter({5: 1784, 4: 1722, 3: 1574, 6: 1334, 7: 1234, 2: 1169, 8: 1022, 9: 787, 10: 716, 11: 594, 12: 472, 1: 330, 13: 324, 14: 298, 15: 235, 16: 189, 17: 154, 18: 145, 19: 99, 21: 74, 20: 67, 22: 64, 24: 34, 23: 32, 26: 29, 27: 23, 25: 20, 29: 19, 30: 8, 28: 5, 32: 4, 31: 1})


In [20]:
%pprint
num_words_SD2 = [len(element.split()) for element in SD2]  
print(Counter(num_words_SD2))

SD2_tidy = []
SD2_cleaned=copy.deepcopy(SD2)
for term in SD2_cleaned:  # avoid shadowing the string module
    if term not in stopwords:
        SD2_tidy.append(term.translate(str.maketrans('', '', punctuation)))

num_words_SD2_tidy = [len(element.split()) for element in SD2_tidy]  
print(Counter(num_words_SD2_tidy))

Pretty printing has been turned OFF
Counter({2: 3629, 3: 2782, 4: 2296, 5: 1373, 1: 1373, 6: 733, 7: 378, 8: 253, 9: 162, 10: 98, 11: 58, 12: 30, 13: 11, 14: 7, 15: 5, 19: 1, 16: 1})
Counter({2: 3630, 3: 2781, 4: 2297, 5: 1375, 1: 1373, 6: 730, 7: 378, 8: 253, 9: 163, 10: 97, 11: 58, 12: 30, 13: 11, 14: 7, 15: 5, 19: 1, 16: 1})


In [21]:
%pprint
num_words_SD3 = [len(element.split()) for element in SD3]  
#print(Counter(num_words_SD3))

SD3_tidy = []
SD3_cleaned=copy.deepcopy(SD3)
for term in SD3_cleaned:  # avoid shadowing the string module
    if term not in stopwords:
        SD3_tidy.append(term.translate(str.maketrans('', '', punctuation)))

num_words_SD3_tidy = [len(element.split()) for element in SD3_tidy]  
print(Counter(num_words_SD3_tidy))

Pretty printing has been turned ON
Counter({2: 3859, 3: 2910, 1: 2104, 4: 1583, 5: 775, 6: 359, 7: 247, 8: 112, 9: 45, 10: 32, 11: 18, 12: 6, 15: 1, 17: 1, 16: 1, 13: 1})


In [22]:
%pprint
num_words_Mod = [len(element.split()) for element in Mod]  
print(Counter(num_words_Mod))

Mod_tidy = []
Mod_cleaned=copy.deepcopy(Mod)
for term in Mod_cleaned:  # avoid shadowing the string module
    if term not in stopwords:
        Mod_tidy.append(term.translate(str.maketrans('', '', punctuation)))


Pretty printing has been turned OFF
Counter({1: 1720})


In [23]:
def strip_digits(terms):
    # Return the terms with every digit character removed
    return [''.join(ch for ch in term if not ch.isdigit()) for term in terms]

DC1_trim = strip_digits(DC1_tidy)
DC2_trim = strip_digits(DC2_tidy)
SD1_trim = strip_digits(SD1_tidy)
SD2_trim = strip_digits(SD2_tidy)
SD3_trim = strip_digits(SD3_tidy)
Mod_trim = strip_digits(Mod_tidy)


A sensible cutoff that retains most of our keywords for the rules would be 7-grams. In the next step, the lists are rid of any elements with more terms than the limit. Note that the cell below applies a stricter limit and keeps only elements of one or two words.
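To make the choice of cutoff more concrete, the share of terms retained for each n-gram limit can be computed from the word counts. This is only a sketch: `coverage_by_cutoff` and the demo terms are hypothetical, and in the notebook the function would be applied to lists such as SD1_trim.

```python
from collections import Counter


def coverage_by_cutoff(terms, max_n=10):
    """Map each n-gram limit to the share of terms with at most n words."""
    lengths = Counter(len(term.split()) for term in terms)
    total = sum(lengths.values())
    if total == 0:
        return {}
    covered = 0
    coverage = {}
    for n in range(1, max_n + 1):
        covered += lengths.get(n, 0)
        coverage[n] = covered / total
    return coverage


demo_terms = ['asthma', 'lung cancer', 'type 2 diabetes', 'chronic kidney disease stage 3']
demo_coverage = coverage_by_cutoff(demo_terms, max_n=5)
print(demo_coverage)
```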

In [24]:
def keep_short(terms, max_words=2):
    # Keep only the terms with at most max_words whitespace-separated words
    return [term for term in terms if len(term.split()) <= max_words]

DC1_groomed = keep_short(DC1_trim)
DC2_groomed = keep_short(DC2_trim)
SD1_groomed = keep_short(SD1_trim)
SD2_groomed = keep_short(SD2_trim)
SD3_groomed = keep_short(SD3_trim)

To mitigate excessive RAM usage, we sample the dataset heavily.
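If even loading the full file before sampling is too expensive, the rows can be sampled while streaming. The sketch below uses row-wise random sampling with the standard library instead of the pandas sample call in the next cell, so it keeps approximately (not exactly) the requested fraction. The function name `sample_rows` and the in-memory demo data are assumptions for illustration.

```python
import csv
import io
import random


def sample_rows(handle, frac=0.01, seed=19):
    """Stream a CSV and keep each data row with probability `frac`."""
    rng = random.Random(seed)
    reader = csv.reader(handle)
    header = next(reader)
    kept = [row for row in reader if rng.random() < frac]
    return header, kept


# Small in-memory CSV standing in for NOTEEVENTS.csv
demo_text = 'ROW_ID,TEXT\n' + ''.join(f'{i},note {i}\n' for i in range(1000))
demo_header, demo_rows = sample_rows(io.StringIO(demo_text), frac=0.1)
print(demo_header, len(demo_rows))
```

When reading a real file, open it with newline='' so that the csv module handles quoted multi-line note texts correctly.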

In [25]:
# df_notes = pd.read_csv('../../../mimic-iii-clinical-database-1.4/NOTEEVENTS.csv', low_memory=False)
# df_notes = df_notes.sample(frac=0.01, random_state=19)

In [28]:
# notes_text = df_notes.TEXT
# notes_text = notes_text.str.replace(r'\[\*\*(.*?)\*\*\]', '', regex=True) # remove [** ... **] tag placeholders
# notes_text = notes_text.str.replace(r'[0-9]+', '', regex=True) # remove all digits
# # punctuation (defined in the first cell) excludes slashes, since we need them to match some disease terms
# notes_text = notes_text.str.translate(str.maketrans('', '', punctuation))

# # stopwords already holds the English stopword list; keep 'and' and 'or' in the text
# stopwords.remove('and')
# stopwords.remove('or')
# notes_text = notes_text.apply(lambda x: ' '.join([word for word in x.lower().split() if word not in (stopwords)]))

  


In [29]:
#notes_text.to_csv('../data/notes_cleaned.csv', index=False)

## Export

Now that our lists as well as the training data are looking dapper and shiny, we will wrap them up with a bow and store them for later use and to ease collaboration.
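The export and the later re-import can be sketched with two small helpers. The names `write_terms` and `read_terms` are hypothetical, and the round trip below uses a temporary directory rather than the ./data folder.

```python
import os
import tempfile


def write_terms(path, terms):
    """Write one term per line, overwriting any existing file."""
    with open(path, 'w', encoding='utf-8') as fp:
        for term in terms:
            fp.write(f'{term}\n')


def read_terms(path):
    """Read the terms back, dropping the trailing newlines."""
    with open(path, encoding='utf-8') as fp:
        return [line.rstrip('\n') for line in fp]


# Round trip in a temporary directory instead of ./data
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'DC1.txt')
    write_terms(demo_path, ['tb pneumonia', 'tumors'])
    restored = read_terms(demo_path)
print(restored)
```

Opening a file in 'w' mode already truncates it, so removing an old file beforehand is not required for a clean export.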

In [67]:
'''
try:
    os.remove('DC1.txt')
except OSError:
    pass

try:
    os.remove('DC2.txt')
except OSError:
    pass

try:
    os.remove('SD1.txt')
except OSError:
    pass

try:
    os.remove('SD2.txt')
except OSError:
    pass

try:
    os.remove('SD3.txt')
except OSError:
    pass

try:
    os.remove('Mod.txt')
except OSError:
    pass

with open(r'./data/DC1.txt', 'w') as fp:
    for item in DC1_groomed:
        #writes each entry on a new line
        fp.write("%s\n" % item)
    print('Done')

with open(r'./data/DC2.txt', 'w') as fp:
    for item in DC2_groomed:
        fp.write("%s\n" % item)
    print('Done')

with open(r'./data/SD1.txt', 'w') as fp:
    for item in SD1_groomed:
        fp.write("%s\n" % item)
    print('Done')


with open(r'./data/SD2.txt', 'w') as fp:
    for item in SD2_groomed:
        fp.write("%s\n" % item)
    print('Done')
    

with open(r'./data/SD3.txt', 'w') as fp:
    for item in SD3_groomed:
        fp.write("%s\n" % item)
    print('Done')


with open(r'./data/Mod.txt', 'w') as fp:
    for item in Mod_trim:
        fp.write("%s\n" % item)
    print('Done')

'''


## Now for the preprocessing of the test and dev data - NCBI

In [54]:
text_t = []
labels_t = []

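# Text lines look like "ID|t|title" or "ID|a|abstract"; annotation lines are tab-separated
with open('./data/NCBItestset_corpus.txt', 'r') as file:
    Lines = file.readlines()
  
for line in Lines:
    if line.find("|") > 0:
        text_t.append(line)
    elif not line.strip():
        # readlines() keeps the trailing newline, so blank lines are never empty strings
        continue
    else:
        labels_t.append(line)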

In [56]:
text_d = []
labels_d = []

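# Text lines look like "ID|t|title" or "ID|a|abstract"; annotation lines are tab-separated
with open('./data/NCBIdevelopset_corpus.txt', 'r') as file:
    Lines = file.readlines()
  
for line in Lines:
    if line.find("|") > 0:
        text_d.append(line)
    elif not line.strip():
        # readlines() keeps the trailing newline, so blank lines are never empty strings
        continue
    else:
        labels_d.append(line)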

In [57]:
test_text = pd.DataFrame([elem.split("|") for elem in text_t])
test_text.drop(test_text.columns[[1,3]],axis=1, inplace=True)
test_text.columns = ['ID','Text']
test_text[test_text.columns[0]] = test_text[test_text.columns[0]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int).dropna()
test_text_f = test_text[test_text['ID'] == 0].index
test_text.drop(test_text_f, inplace=True)
test_text = test_text.groupby(['ID']).agg({'Text': ' '.join})

In [58]:
dev_text = pd.DataFrame([elem.split("|") for elem in text_d])
dev_text.drop(dev_text.columns[[1,3]],axis=1, inplace=True)
dev_text.columns = ['ID','Text']
dev_text[dev_text.columns[0]] = dev_text[dev_text.columns[0]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int).dropna()
dev_text_f = dev_text[dev_text['ID'] == 0].index
dev_text.drop(dev_text_f, inplace=True)
dev_text = dev_text.groupby(['ID']).agg({'Text': ' '.join})

In [59]:
test_label = pd.DataFrame([elem.split('\t') for elem in labels_t])
test_label.drop(test_label.columns[[1,2,5]],axis=1, inplace=True)
test_label.columns =['ID', 'Label', 'Class']
test_label[test_label.columns[0]] = test_label[test_label.columns[0]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int).dropna()
test_label_f = test_label[test_label['ID'] == 0].index
test_label.drop(test_label_f, inplace=True)
test_label = test_label[test_label['Label'].str.count(' ') == 0]


In [60]:
dev_label = pd.DataFrame([elem.split('\t') for elem in labels_d])
dev_label.drop(dev_label.columns[[1,2,5]],axis=1, inplace=True)
dev_label.columns =['ID', 'Label', 'Class']
dev_label[dev_label.columns[0]] = dev_label[dev_label.columns[0]].apply(pd.to_numeric, errors='coerce').fillna(0).astype(int).dropna()
dev_label_f = dev_label[dev_label['ID'] == 0].index
dev_label.drop(dev_label_f, inplace=True)
dev_label = dev_label[dev_label['Label'].str.count(' ') == 0]

In [61]:
# Remove punctuation and digits in one pass; str.translate treats every character literally
strip_table = str.maketrans('', '', punctuation + '0123456789')

test_label['Label'] = test_label['Label'].str.translate(strip_table).str.lower()
test_text['Text'] = test_text['Text'].str.translate(strip_table).str.lower()
test_text["Text"] = test_text["Text"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))



In [62]:
# Remove punctuation and digits in one pass; str.translate treats every character literally
strip_table = str.maketrans('', '', punctuation + '0123456789')

dev_label['Label'] = dev_label['Label'].str.translate(strip_table).str.lower()
dev_text['Text'] = dev_text['Text'].str.translate(strip_table).str.lower()
dev_text["Text"] = dev_text["Text"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))



In [25]:
dev_label.head()


Unnamed: 0,ID,Label,Class
18,8755645,fed,SpecificDisease
20,8755645,fed,SpecificDisease
34,8674108,tumors,DiseaseClass
43,8917548,ataxiatelangiectasia,SpecificDisease
44,8917548,thymomas,SpecificDisease


In [99]:
test_text.head()

Unnamed: 0_level_0,Text
ID,Unnamed: 1_level_1
932197,hereditary deficiency fifth component compleme...
941901,low levels beta hexosaminidase healthy individ...
993342,chromosomal order genes controlling major hist...
9288106,clustering missense mutations ataxiatelangiect...
9294109,myotonic dystrophy protein kinase involved mod...


In [63]:
test_text['Terms'] = test_text['Text'].str.split(' ')
test_word=test_text.explode('Terms')

In [64]:
dev_text['Terms'] = dev_text['Text'].str.split(' ')
dev_word=dev_text.explode('Terms')
dev_word.drop('Text', axis=1, inplace=True)
dev_word.head()

Unnamed: 0_level_0,Terms
ID,Unnamed: 1_level_1
8589722,brca
8589722,secreted
8589722,exhibits
8589722,properties
8589722,granin


In [65]:
test_word.drop('Text', axis=1, inplace=True)
test_word.head()

Unnamed: 0_level_0,Terms
ID,Unnamed: 1_level_1
932197,hereditary
932197,deficiency
932197,fifth
932197,component
932197,complement


In [66]:
Test_fin=test_word.merge(test_label, left_on=['ID','Terms'], right_on = ['ID','Label'], how='left')
Test_fin.drop('Label', axis=1, inplace=True)
Test_fin.fillna('OTHER', axis=1, inplace=True)
Test_fin_f = Test_fin[Test_fin['Class'] == 'Modifier'].index
Test_fin.drop(Test_fin_f, inplace=True)
Test_fin_f = Test_fin[Test_fin['Class'] == 'CompositeMention'].index
Test_fin.drop(Test_fin_f, inplace=True)
Test_fin=Test_fin[Test_fin.ID != 9472666]

In [67]:
Dev_fin=dev_word.merge(dev_label, left_on=['ID','Terms'], right_on = ['ID','Label'], how='left')
Dev_fin.drop('Label', axis=1, inplace=True)
Dev_fin.fillna('OTHER', axis=1, inplace=True)
Dev_fin_f = Dev_fin[Dev_fin['Class'] == 'Modifier'].index
Dev_fin.drop(Dev_fin_f, inplace=True)
Dev_fin_f = Dev_fin[Dev_fin['Class'] == 'CompositeMention'].index
Dev_fin.drop(Dev_fin_f, inplace=True)
Dev_fin.dropna(inplace=True)

In [69]:
Test_fin.to_csv('./data/test_fin.csv', index=False)

Dev_fin.to_csv('./data/dev_fin.csv', index=False)