# Creating a dataset of the positive abstract set

Goals of the labeling redo:
1. Read in the dataset
2. Turn it into a list of sentences
3. Map a label to each word
    - Incorporate Disease Lookup from GARD database
    - Location is same
    - Epidemiologic Identifier {Prevalence, Incidence, Occurrence}
    - Do tokenization with just the main location spacy model so there is no mismatch
    - Do independent statistic labeling (ranges but not if 95%CI)
4. Randomly split the dataset into a test set (50 abstracts), then split the remainder into train/val sets with a train-test split, using a fixed seed
5. Save into the correct format
6. Manually fix the test set (Separate)
7. After training the transformer - do sentence level extraction if there is at least one relevant STAT tag
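
Goal 4 can be sketched with the standard library alone. This is a minimal sketch, not the split used later in the notebook: the function name `split_pmids`, the seed value, and the 20% validation fraction are assumptions for illustration, and `random.Random` stands in for whatever train-test split utility ends up being used.

```python
import random

def split_pmids(pmids, test_size=50, val_fraction=0.2, seed=42):
    """Shuffle once with a fixed seed, then carve out test, val, and train."""
    rng = random.Random(seed)       # private RNG, so the global state is untouched
    shuffled = list(pmids)
    rng.shuffle(shuffled)
    test = shuffled[:test_size]     # held-out abstracts to be fixed manually
    rest = shuffled[test_size:]
    n_val = int(len(rest) * val_fraction)
    val = rest[:n_val]
    train = rest[n_val:]
    return train, val, test

# 620 is the number of unique pmids reported further down
train, val, test = split_pmids(range(620))
print(len(train), len(val), len(test))
```

Because the seed is fixed, rerunning the cell reproduces the same three sets, which matters once the test set has been corrected by hand.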

## (1) Reading in the dataset

In [29]:
#Download all dependencies, only need to do this once
#import sys
#!{sys.executable} -m pip install spacy
#!{sys.executable} -m spacy download en_core_web_lg
#!{sys.executable} -m spacy download en_core_web_trf
#!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz
#!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz

import nltk
#nltk.download('stopwords')
#nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import tokenize
STOPWORDS = set(stopwords.words('english'))
import string
PUNCTUATION = set(char for char in string.punctuation)
import csv
import spacy
import re
import pandas as pd
from termcolor import colored
from spacy import displacy
from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                        'stanford-ner/stanford-ner.jar',
                        encoding='utf-8')
nlp = spacy.load('en_core_web_trf')
nlpSci = spacy.load("en_ner_bc5cdr_md")




In [2]:
df = pd.read_csv("positive_abstract_set.csv")
df.tail()

Unnamed: 0,disease,pmid,abstract,epi_prob,is_epi
750,CD4 deficiency,34128866,<h4>Abstract</h4>Previous studies have suggest...,0.519289,True
751,Muscle eye brain disease,31344241,<h4>Background and objective</h4>To understand...,0.514754,True
752,Enchondromatosis dwarfism deafness,32355641,Anticancer drug nephrotoxicity is an important...,0.512264,True
753,Severe congenital neutropenia autosomal dominant,33554218,Coronavirus disease 2019 (COVID-19) is emergin...,0.511204,True
754,Palindromic rheumatism,33780753,"To date, retinal implants are the only availab...",0.500894,True


In [3]:
#Total pmids, unique pmids
print(len(df["pmid"]),len(df["pmid"].unique()))

755 620


In [4]:
#Total abstracts, unique diseases. The sample was supposed to cover 500 diseases, but only 347 of them have epi studies
print(len(df["disease"]),len(df["disease"].unique()))

755 347


In [5]:
pmid = df["pmid"]
df[pmid.isin(pmid[pmid.duplicated()])]#.sort("pmid")
#This shows that 1) duplicate pmids are tagged with different diseases, and more importantly 2) the labeled disease
#comes from the search query and may have nothing to do with the actual abstract (see https://pubmed.ncbi.nlm.nih.gov/33602758/ in rows 743 & 744).
#3) This means that we need to do disease extraction too and cannot rely on the labeled search disease.
#4) It also means that duplicates can be dropped.

Unnamed: 0,disease,pmid,abstract,epi_prob,is_epi
0,"Deafness, autosomal dominant nonsyndromic sens...",33962637,<h4>Background</h4>The incidence of hydrocepha...,0.998725,True
1,Anencephaly and spina bifida X-linked,33962637,<h4>Background</h4>The incidence of hydrocepha...,0.998725,True
10,Distal renal tubular acidosis with hemolytic a...,34167483,<h4>Background</h4>The prevalence of Multiple ...,0.998105,True
11,Cardiomyopathy due to anthracyclines,34167483,<h4>Background</h4>The prevalence of Multiple ...,0.998105,True
15,Syndactyly Cenani Lenz type,28402072,<h4>Background</h4>Unilateral lung agenesis is...,0.998031,True
...,...,...,...,...,...
742,Cryoglobulinemic vasculitis,33526985,<b>Background:</b> Osteogenesis imperfecta (OI...,0.556312,True
743,Infantile onset spinocerebellar ataxia,33602758,Q fever can present as a fever of unknown aeti...,0.556216,True
744,Progressive non-fluent aphasia,33602758,Q fever can present as a fever of unknown aeti...,0.556216,True
749,"Nephropathy, deafness, and hyperparathyroidism",34128866,<h4>Abstract</h4>Previous studies have suggest...,0.519289,True
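
The same check can be sketched without pandas. The three rows below are copied from the tables above (rows 743, 744 and 754); the variable names are hypothetical and the snippet only illustrates the reasoning, it is not part of the pipeline.

```python
from collections import defaultdict

rows = [
    {"disease": "Infantile onset spinocerebellar ataxia", "pmid": 33602758},
    {"disease": "Progressive non-fluent aphasia", "pmid": 33602758},
    {"disease": "Palindromic rheumatism", "pmid": 33780753},
]

# Collect every search disease attached to each pmid
diseases_by_pmid = defaultdict(set)
for row in rows:
    diseases_by_pmid[row["pmid"]].add(row["disease"])

# pmids whose abstract was returned for more than one search disease
conflicting = {pmid: names for pmid, names in diseases_by_pmid.items() if len(names) > 1}

# Keep the first row per pmid, mirroring drop_duplicates(keep='first')
seen = set()
unique_rows = []
for row in rows:
    if row["pmid"] not in seen:
        seen.add(row["pmid"])
        unique_rows.append(row)

print(conflicting)
print(len(unique_rows))
```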


In [6]:
#Gather only necessary information
abstracts_df = df.drop_duplicates(subset='pmid', keep='first', ignore_index=True)
abstracts_df = abstracts_df[['pmid','abstract']]
abstracts_df.tail()

Unnamed: 0,pmid,abstract
615,34128866,<h4>Abstract</h4>Previous studies have suggest...
616,31344241,<h4>Background and objective</h4>To understand...
617,32355641,Anticancer drug nephrotoxicity is an important...
618,33554218,Coronavirus disease 2019 (COVID-19) is emergin...
619,33780753,"To date, retinal implants are the only availab..."


### HTML Removal

In [7]:
def remove_html(string):
    #string = re.sub(':', '', string)
    string = re.sub('<.{1,4}>', ' ', string)
    #string = re.sub('.{1,3}>', ' ', string)
    #string = re.sub('<.{1,3}', ' ', string)
    string = re.sub("  *", " " , string)
    string = re.sub("^ ", "" , string)
    string = re.sub(" $", "" , string)
    string = re.sub("  ", " " , string)
    #This also removes extra parentheses
    #string= string.replace("(", ' ')
    #string= string.replace(")", ' ')
    
    string=string.strip()
    return string
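
`remove_html` relies on tags being at most four characters long between the angle brackets. A possible alternative, shown here only as a sketch, matches tag-shaped tokens explicitly so that comparisons such as `p < 0.05` are left alone. The name `strip_simple_tags` and the pattern are assumptions, not what the notebook uses, and attributes inside tags are not handled.

```python
import re

# An opening or closing tag with a plain name and no attributes, e.g. <h4>, </sup>
TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*>")

def strip_simple_tags(text):
    text = TAG_RE.sub(" ", text)               # replace each tag with a space
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

print(strip_simple_tags("<h4>Background</h4>The incidence of X.<sup>1</sup>"))
print(strip_simple_tags("p < 0.05 and n > 3"))
```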

### Jennifer's functions

In [20]:
def printHighlighted(doc, indices):
    final = ''
    start = 0
    for i in indices:
        final += doc[start:i[0]].text+' '
        final += colored(doc[i[0]:i[1]].text, 'red', 'on_yellow', attrs=['bold']) + ' '
        start = i[1]
    final += doc[start:].text
    print(final)

In [9]:
def removeDuplicates(a):
    for i in range(len(a)-1,0,-1):
        if a[i] == a[i-1]:
            del a[i]
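
Note that `removeDuplicates` works in place, returns nothing, and only drops an item when it equals its immediate predecessor. The sketch below repeats the same loop under a hypothetical name to show the difference between unsorted and sorted input.

```python
def remove_adjacent_duplicates(a):
    # Walk backwards so deleting an item never shifts indices still to be visited
    for i in range(len(a) - 1, 0, -1):
        if a[i] == a[i - 1]:
            del a[i]

spans = [[0, 1], [0, 1], [3, 5], [0, 1]]
remove_adjacent_duplicates(spans)
print(spans)         # [[0, 1], [3, 5], [0, 1]]  (the trailing repeat survives)

sorted_spans = sorted([[0, 1], [0, 1], [3, 5], [0, 1]])
remove_adjacent_duplicates(sorted_spans)
print(sorted_spans)  # [[0, 1], [3, 5]]
```

This is why the final index list in `getStats` is sorted before duplicates are removed.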

#### Location

In [10]:
def getLocsNltk(text):
    #word_tokenize lives in the nltk.tokenize module imported above
    tokenized_text = tokenize.word_tokenize(text)
    classified_text = st.tag(tokenized_text)
    locs = set()

    for word in classified_text:
        if word[1] == 'LOCATION':
            if word[0] not in locs:
                locs.add(word[0])
    
    return locs

In [11]:
def getLocsSpacy(doc):
    locs = {}
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            tokens = {token.text for token in ent}
            if ent.text not in locs:
                locs[ent.text] = tokens
            else:
                for t in tokens:
                    if t not in locs[ent.text]:
                        locs[ent.text].add(t)     
    return locs

In [12]:
def getLocs(text):
    doc = nlp(text)
    
    spacyLocs = getLocsSpacy(doc)
    nltkLocs = getLocsNltk(text)
    locs = []
    
    for entity in spacyLocs:
        if len(spacyLocs[entity] & nltkLocs) != 0:
            locs.append(entity)
            
    return locs
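
`getLocs` keeps a spaCy entity only when the Stanford tagger also marked at least one of its tokens as a location. The agreement step can be sketched on its own with plain sets. The inputs below are made up for illustration (they are not real model output), and `agreed_locations` is a hypothetical name.

```python
def agreed_locations(spacy_locs, stanford_tokens):
    """Keep spaCy entities that share at least one token with the Stanford set."""
    return [entity for entity, tokens in spacy_locs.items() if tokens & stanford_tokens]

# Hypothetical output of getLocsSpacy: entity text -> set of its tokens
spacy_locs = {
    "Olmsted County": {"Olmsted", "County"},
    "Minnesota": {"Minnesota"},
    "Salla": {"Salla"},  # imagined false positive from one tagger only
}
# Hypothetical output of getLocsNltk: single tokens tagged LOCATION
stanford_tokens = {"Olmsted", "Minnesota"}

print(agreed_locations(spacy_locs, stanford_tokens))  # ['Olmsted County', 'Minnesota']
```

Requiring both taggers to agree trades some recall for precision: an entity that only one model recognizes is dropped.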

#### STATS

In [13]:
def getTokenChunkDict(doc):
    chunks = [chunk for chunk in doc.noun_chunks]
    tokenToChunk = {}
    for chunk in chunks:
        for i in range(chunk.start, chunk.end):
            tokenToChunk[i] = [chunk.start, chunk.end]
    return tokenToChunk

In [14]:
def isValidStat(token):
    ancestors = {a.text.lower() for a in token.ancestors}
    #Skip numbers governed by confidence intervals, p-values, "type N" names, or "N times" multipliers
    if ancestors & {'ci','confidence','interval','p','p-value','type','times'}:
        return False
    if token.text.lower() == 'one' and len(token.doc) > token.i + 1 and token.doc[token.i + 1].text == 'of':
        return False
    if token.ent_type_ == 'DATE':
        return False
    if token.ent_type_ in {'CARDINAL','QUANTITY'}:
        return True
    return False

In [15]:
def getStats(abst, display=False):
    doc = nlp(abst)
    indices = []
    tokenToChunk = getTokenChunkDict(doc)
    key_val_dz = []
    
    for sent in doc.sents:
        keywords = []
        values = []
        dzs = []
        
        keywords_text = []
        values_text = []
        dzs_text = []
        
        sciSent = nlpSci(sent.text)
        
        for token in sent:
            sciToken = nlpSci(token.text)[0]
            if token.text.lower() in {'prevalence','incidence','frequency','PR','prevalences','occurrence'}:
                if token.i in tokenToChunk:
                    keywords.append(tokenToChunk[token.i])
                else:
                    keywords.append([token.i, token.i+1])
            if isValidStat(token) or isValidStat(nlp(token.text)[0]):
                if token.i in tokenToChunk:
                    values.append(tokenToChunk[token.i])
                else:
                    values.append([token.i, token.i+1])
        if keywords != [] and values != []:
            for token in sciSent:
                if token.ent_type_ == 'DISEASE':
                    for token_reg in sent:
                        if token_reg.text == token.text:
                            if token_reg.i in tokenToChunk:
                                dzs.append(tokenToChunk[token_reg.i])
                            else:
                                dzs.append([token_reg.i, token_reg.i+1])
            
            removeDuplicates(keywords)
            removeDuplicates(values)
            removeDuplicates(dzs)
            for i in keywords:
                keywords_text.append(doc[i[0]:i[1]])
            for i in values:
                values_text.append(doc[i[0]:i[1]])
            for i in dzs:
                dzs_text.append(doc[i[0]:i[1]])
            key_val_dz.append((keywords_text, values_text, dzs_text))
            indices += keywords
            indices += values
            indices += dzs
    indices = sorted(indices)
    removeDuplicates(indices)
    if display:
        printHighlighted(doc, indices)
    return key_val_dz

### Testing cells

In [27]:
sent = "Incidence of Hansen's disease in Olmsted County, Minnesota, was 2.6/million/year."
#doc = nlp(sent)
getStats(sent, True)

Incidence of Hansen's disease in Olmsted County, Minnesota, was 2.6/million/year .


[([Incidence], [2.6/million/year], [Hansen's disease])]

In [28]:
test_dict = {'string0':"<h4>Background</h4>Most epidemiological data on vitiligo refer to selected environments or focus on the prevalence of comorbidity unrelated to the population.<h4>Objective</h4>Aim of the study was to gain robust representative prevalence data on vitiligo and on associated dermatologic comorbidity in the German adult population.<h4>Methods</h4>A dual population-based approach was applied with 1) primary data obtained between 2004 and 2014 from dermatological exams in the general working population; 2) claims data from a large German statutory health insurance, reference year 2010.<h4>Results</h4>In the working cohort (N = 121,783; 57% male; mean age 43 years), the prevalence of vitiligo was 0.77% (0.84% in men; 0.67% in women). In the claims data (N = 1,619,678; 38% male; mean age 46 years), prevalence was 0.17% (0.14% in men; 0.18% in women). In the working cohort, vitiligo was significantly more common in people with fair skin type, ephelides and port-wine stains and less common in people with acne and solar lentigines. In the claims data, vitiligo was associated with a variety of skin conditions, eg, atopic dermatitis, psoriasis and alopecia areata.<h4>Conclusion</h4>The resulting discrepancy of claims vs primary data between 0.17% and 0.77% indicates the most probable spectrum of vitiligo prevalence in Germany. It is more frequently observed in clinical exams than recorded in claims data, indicating a marked proportion of people seeking no medical help. Such nonattendance may result from the fact that many treatment options do not provide satisfying benefits to the patients.",
'string2':"<h4>Abstract</h4>Previous studies have suggested.  Aim: There are more than 50 inherited lysosomal storage diseases (LSDs), and this study examined the incidence of clinically diagnosed LSDs in Sweden. Methods: The number of patients diagnosed during 1980-2009 was compiled from the registries of the two Swedish diagnostic laboratories that cover the whole country. Results: We identified 433 patients during the 30-year period, with a total incidence of one in every 6100 births and identified fairly constant annual diagnoses during the last 20 years. Krabbe disease was the most common (one in 39 000) followed by Gaucher disease (one in 47 000), metachromatic leukodystrophy and Salla disease. Gaucher disease was more frequent in Sweden than other European countries, due to a founder effect of the mutation (p.L444P) in northern Sweden. Metachromatic leukodystrophy was one of the. Neuroendocrine tumors (NETs) are rare neoplasms, with an estimated annual incidence of 6.9/100 000. They arise from cells of the diffuse endocrine system",
'string1':"<h4>Abstract</h4>Previous studies have suggested. Neuroendocrine tumors (NETs) rare neoplasms, estimated annual incidence 6.9/100 000. They arise cells diffuse endocrine system, mainly dispersed throughout gastrointestinal (GI), pancreatic, respiratory tracts. The incidence GI-NETs recently begun show steady increase. According Surveillance, Epidemiology, End Results database, 53% patients NETs present localized disease, 20% locoregional disease, 27% distant metastases time diagnosis. Surgery mainstay treatment locoregional GI-NETs. Endoscopic resection option well-differentiated early GI-NETs, thought rarely metastasize lymph nodes. A lesion technically difficult resect via endoscopy indication local resection (partial resection without lymph node dissection). GI-NETs possible lymph node metastasis indication enterectomy lymph node dissection. For NETs metastatic lesions, cytoreduction surgery control hormonal hypersecretion alleviate symptoms; therefore, cytoreduction surgery recommended. The indications surgery vary based organ NET arose; therefore, understanding patient's clinical state individualized treatment based characteristics patient's GI-NET needed. This review summarizes surgical treatments GI-NETs organ.",
'string3':"Frontal fibrosing alopecia (FFA) is a variant of lichen planopilaris (LPP) with characteristic band-like frontotemporal hairline involvement and eyebrow loss. It most commonly occurs in post-menopausal White women.<sup>1</sup> In skin of color (SOC) individuals, FFA is often misdiagnosed as traction alopecia (TA),<sup>2</sup> and little data exists regarding the presentation of FFA in the SOC patient population.<sup>3</sup> As FFA incidence continues to increase,<sup>4</sup> we aim to understand differences in the presentation of FFA between White and Black women in order to aid in the accurate and timely diagnosis as well as help inform prognosis and management.",
'string4':"'Although benzothiazole and its derivatives (BTHs) are considered emerging contaminants in diverse environments and organisms, little information is available about their contamination profiles and health impact in ambient particles. In this study, an optimized method of ultrasound-assisted extraction coupled with the selected reaction monitoring (SRM) mode of GC-EI-MS/MS was applied to characterize and analyze PM<sub>2.5</sub>-bound BTHs from three cities of China (Guangzhou, Shanghai, and Taiyuan) during the winter of 2018. The total BTH concentration (ΣBTHs) in PM<sub>2.5</sub> samples from the three cities decreased in the order of Guangzhou > Shanghai > Taiyuan, independently of the PM<sub>2.5</sub> concentration. Despite the large variation in concentration of ΣBTHs in PM<sub>2.5</sub>, 2-hydroxybenzothiazole (OTH) was always the predominant compound among the PM<sub>2.5</sub>-bound BTHs and accounted for 50-80% of total BTHs in the three regions. Results from human exposure assessment and toxicity screening indicated that the outdoor exposure risk of PM<sub>2.5</sub>-bound BTHs in toddlers was much higher than in adults, especially for OTH. The developmental and reproduction toxicity of OTH was further explored in vivo and in vitro. Exposure of mouse embryonic stem cells (mESCs) to OTH for 48\xa0h significantly increased the intracellular reactive oxygen species (ROS) and induced DNA damage and apoptosis via the functionally activating p53 expression. In addition, the growth and development of zebrafish embryos were found to be severely affected after OTH treatment. An overall metabolomics study was conducted on the exposed zebrafish larvae. The results indicated that exposure to OTH inhibited the phenylalanine hydroxylation reaction, which further increased the accumulation of toxic phenylpyruvate and acetylphenylalanine in zebrafish. 
These findings provide important insights into the contamination profiles of PM<sub>2.5</sub>-bound BTHs and emphasize the health risk of OTH.'",
'string5':'Incidence of the disease in Olmsted County, Minnesota, was 2.6/million/year.'
            }
for i in range(6):
    print(test_dict['string'+str(i)])
    print('...............')
    print(getStats(remove_html(test_dict['string'+str(i)]),True))
    print('')


<h4>Background</h4>Most epidemiological data on vitiligo refer to selected environments or focus on the prevalence of comorbidity unrelated to the population.<h4>Objective</h4>Aim of the study was to gain robust representative prevalence data on vitiligo and on associated dermatologic comorbidity in the German adult population.<h4>Methods</h4>A dual population-based approach was applied with 1) primary data obtained between 2004 and 2014 from dermatological exams in the general working population; 2) claims data from a large German statutory health insurance, reference year 2010.<h4>Results</h4>In the working cohort (N = 121,783; 57% male; mean age 43 years), the prevalence of vitiligo was 0.77% (0.84% in men; 0.67% in women). In the claims data (N = 1,619,678; 38% male; mean age 46 years), prevalence was 0.17% (0.14% in men; 0.18% in women). In the working cohort, vitiligo was significantly more common in people with fair skin type, ephelides and port-wine stains and less common in pe

Frontal fibrosing alopecia (FFA) is a variant of lichen planopilaris (LPP) with characteristic band-like frontotemporal hairline involvement and eyebrow loss. It most commonly occurs in post-menopausal White women. 1 In skin of color (SOC) individuals, FFA is often misdiagnosed as traction alopecia (TA), 2 and little data exists regarding the presentation of FFA in the SOC patient population. 3 As FFA incidence continues to increase, 4 we aim to understand differences in the presentation of FFA between White and Black women in order to aid in the accurate and timely diagnosis as well as help inform prognosis and management.
[([FFA incidence], [3, 4], [])]

'Although benzothiazole and its derivatives (BTHs) are considered emerging contaminants in diverse environments and organisms, little information is available about their contamination profiles and health impact in ambient particles. In this study, an optimized method of ultrasoun

## (3) Map label onto each word (done rule-based at the sentence level)

### Manual GARD Disease Look-up

In [9]:
#Read in GARD diseases
GARD_df = pd.read_csv('GARD.csv')
GARD_df.tail()

Unnamed: 0,d.gard_id,d.name,d.synonyms
6056,GARD:0013731,T-cell prolymphocytic leukemia,[T Cell Prolymphocytic Leukemia]
6057,GARD:0013735,Spastic paraplegia 47,
6058,GARD:0013737,AP-4-Associated Hereditary Spastic Paraplegia,[Severe intellectual disability and progressiv...
6059,GARD:0013743,"Multicentric osteolysis, nodulosis and arthrop...","[Torg-Winchester Syndrome,Torg Syndrome,Nodulo..."
6060,GARD:0013818,Sphingosine phosphate lyase insufficiency synd...,"[SPL insufficiency syndrome,SPLIS,Familial ste..."


In [10]:
#GARD.csv d.synonyms has oddly saved string data that cannot be converted directly into a list, this converts that
def str2list(string):
    #Strip the surrounding brackets, then split on commas
    string = str(string).replace('[','').replace(']','').strip()
    #Missing synonyms are read in as NaN, which str() turns into 'nan'; drop those entries
    return [s.strip() for s in string.split(',') if s.strip()!='nan']

In [11]:
#Convert d.synonym strings into lists
#apply avoids chained assignment (GARD_df[col][i] = ...), which pandas may perform on a copy
GARD_df['d.synonyms'] = GARD_df['d.synonyms'].apply(str2list)

In [12]:
GARD_df.tail()

Unnamed: 0,d.gard_id,d.name,d.synonyms
6056,GARD:0013731,T-cell prolymphocytic leukemia,[T Cell Prolymphocytic Leukemia]
6057,GARD:0013735,Spastic paraplegia 47,[]
6058,GARD:0013737,AP-4-Associated Hereditary Spastic Paraplegia,[Severe intellectual disability and progressiv...
6059,GARD:0013743,"Multicentric osteolysis, nodulosis and arthrop...","[Torg-Winchester Syndrome, Torg Syndrome, Nodu..."
6060,GARD:0013818,Sphingosine phosphate lyase insufficiency synd...,"[SPL insufficiency syndrome, SPLIS, Familial s..."


In [13]:
#Set up a new & easier to use list of diseases
rowlist = []
i=0
for i in range(len(GARD_df)):
    columnlist=[]
    columnlist.append(GARD_df['d.name'][i])
    columnlist+=GARD_df['d.synonyms'][i]
    rowlist.append(columnlist)

#keys are going to be disease names, values are going to be the GARD ID, set up this way bc dictionaries are faster lookup than lists
GARD_dict = {}
GARD_firstwd_dict = {}

#Find out what the length of the longest disease name sequence is, of all names and synonyms
max_length = -1
for i in range(len(rowlist)):
    #Compare length of primary disease name
    #dz = str(rowlist[i][0]).strip()
    #l_dz = len(dz.split())
    #if l_dz>max_length:
    #    max_length = l_dz
    for j in range(len(rowlist[i])):
        if rowlist[i][j] not in GARD_dict.keys():
            s = str(rowlist[i][j]).lower().strip()
            if len(s.split())>0 and s not in STOPWORDS:
                GARD_dict[s] = GARD_df['d.gard_id'][i]
                #GARD_firstwd_dict[s.split()[0]] = GARD_df['d.gard_id'][i]
                #This will increase the false negative rate a little bit, but decrease the false positive rate tremendously
                if s.split()[0] not in STOPWORDS:
                    GARD_firstwd_dict[s.split()[0]] = GARD_df['d.gard_id'][i]
            #compare length
            l = len(s.split())
            if l>max_length:
                print(s)
                max_length = l
print(max_length)
print(len(GARD_dict))
print(len(GARD_firstwd_dict))

gracile syndrome
finnish lactic acidosis with hepatic hemosiderosis
hyperphalangy-clinodactyly of index finger with pierre robin syndrome
partial deletion of the short arm of chromosome 3
with partial agenesis of the corpus callosum and arachnoid cysts
macrothrombocytopenia and granulocyte inclusions with or without nephritis or sensorineural hearing loss
malignant tumors of the central nervous system associated with familial polyposis of the colon
lethal autosomal recessive arthrogryposis multiplex congenita with whistling face and calcifications of the nervous system
xy disorder of sex development due to luteinizing hormone resistance or luteinizing hormone beta subunit deficiency
severe or complete loss of motor function in the lower extremities and lower portions of the trunk
17
22714
10492
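
The idea behind the lookup can be sketched on two records taken from the `GARD_df.tail()` output above. This is a simplified illustration, not the cell itself: `build_lookup` is a hypothetical name, the stopword filter is omitted, and `setdefault` keeps the first ID seen for a repeated name, which may differ from how the cell above resolves repeats.

```python
records = [
    ("GARD:0013731", "T-cell prolymphocytic leukemia", ["T Cell Prolymphocytic Leukemia"]),
    ("GARD:0013735", "Spastic paraplegia 47", []),
]

def build_lookup(records):
    lookup = {}
    max_tokens = 0
    for gard_id, name, synonyms in records:
        for term in [name, *synonyms]:
            key = term.lower().strip()      # normalize before storing
            if not key:
                continue
            lookup.setdefault(key, gard_id)  # first ID wins for repeated keys
            max_tokens = max(max_tokens, len(key.split()))
    return lookup, max_tokens

lookup, max_tokens = build_lookup(records)
print(len(lookup), max_tokens)  # 3 4
```

Tracking the longest name in tokens is what later bounds the window size when scanning a sentence.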


In [14]:
#'corpus callosum' is an anatomical term rather than a disease, so GARD.csv contains inaccurate entries, but I would rather have higher recall than higher precision
GARD_dict['corpus callosum']
#GARD_firstwd_dict['is']

'GARD:0012486'

In [15]:
#This will not be run now, will be done after training the model since it is just dictionary lookup
def tag_diseases(tokens,labels):   
    i=0
    while i <len(tokens):
        if (len(tokens)-i) < max_length:
            compare_length=len(tokens)-i
        else:
            compare_length = max_length
        #Compares longest sequences first and goes down until there is a match
        #print('(start compare_length)',compare_length)
        exit = False
        while compare_length>0:
            s = ' '.join(tokens[i:i+compare_length])
            #Direct dictionary membership test instead of scanning every key
            if s.lower() in GARD_dict:
                labels[i] = 'B-DIS'
                #print(s)
                for j in range(i+1,i+compare_length):
                    labels[j] = 'I-DIS'
                #Need to skip over the next few indexes
                #print('(compare_length):',compare_length)
                i+=compare_length-1
                exit =True #this signals the outer while loop to stop shrinking the window
            #break out of loop in case there are multiple rare diseases in the same sentence
            if exit:
                break
            else:
                compare_length-=1
        i+=1  
    return tokens,labels
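
The longest-match-first strategy can be shown on a toy vocabulary. This is a sketch under assumptions: `tag_longest_match` is a hypothetical name, the vocabulary is a plain set of two names that appear in this notebook, and the function builds a fresh label list rather than overwriting existing labels as `tag_diseases` does.

```python
def tag_longest_match(tokens, names, max_tokens):
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the widest window first and shrink until a known name is found
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]).lower() in names:
                matched = n
                break
        if matched:
            labels[i] = "B-DIS"
            for j in range(i + 1, i + matched):
                labels[j] = "I-DIS"
            i += matched  # jump past the tagged span
        else:
            i += 1
    return labels

names = {"gracile syndrome", "corpus callosum"}
tokens = ["here", "is", "GRACILE", "Syndrome", "Right", "Here"]
print(tag_longest_match(tokens, names, max_tokens=2))
# ['O', 'O', 'B-DIS', 'I-DIS', 'O', 'O']
```

Trying the longest window first ensures that a multi-word name is tagged as one entity even when a shorter name is contained inside it.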

#### Testing cells

In [16]:
s  = ['with','partial','agenesis','of','the','corpus','callosum','and','arachnoid','cysts','severe','or','complete','loss','of','motor','function','in','the','lower','extremities','and','lower','portions','of','the','trunk','now','we','are','at','an','impasse','finnish','lactic','acidosis','with','hepatic','hemosiderosis']
t = ['B-LOC' if i%5==0 else 'O' for i in range(len(s))]

#for i in range(len(s)):
#    print(s[i],t[i])
#print('')
s,t = tag_diseases(s,t)
print('')
for i in range(len(s)):
    print(s[i],t[i])


with B-DIS
partial I-DIS
agenesis I-DIS
of I-DIS
the I-DIS
corpus I-DIS
callosum I-DIS
and I-DIS
arachnoid I-DIS
cysts I-DIS
severe B-DIS
or I-DIS
complete I-DIS
loss I-DIS
of I-DIS
motor I-DIS
function I-DIS
in I-DIS
the I-DIS
lower I-DIS
extremities I-DIS
and I-DIS
lower I-DIS
portions I-DIS
of I-DIS
the I-DIS
trunk I-DIS
now O
we O
are O
at B-LOC
an O
impasse O
finnish B-DIS
lactic I-DIS
acidosis I-DIS
with I-DIS
hepatic I-DIS
hemosiderosis I-DIS


In [17]:
ts = ['here','is','GRACILE','Syndrome','Right','Here']
ls = ['O','O','O','O','O','O']
for i in range(len(ts)):
    print(ts[i],ls[i])
print('')
ts,ls = tag_diseases(ts,ls)
print('')
for i in range(len(ts)):
    print(ts[i],ls[i])

here O
is O
GRACILE O
Syndrome O
Right O
Here O


here O
is O
GRACILE B-DIS
Syndrome I-DIS
Right O
Here O


### Tagging Functions

In [19]:
def combine_stats(tokens,labels):
    i=1
    while i<len(labels)-1:
        if 'STAT' in labels[i]:
            #Includes <, > number in the statistic
            if tokens[i-1]=='<' or tokens[i-1]=='>':
                labels[i-1]='B-STAT'
                labels[i]='I-STAT'
            #Includes greater than, less than, more than, etc. 
            if tokens[i-1]=='than':
                labels[i-2]='B-STAT'
                labels[i-1]='I-STAT'
                labels[i]='I-STAT'
                
        #Combines "This disease affects 1 in 7500 to 1 in 10,000 people" into a single statistic phrase instead of 2
        if 'STAT' in labels[i-1] and 'STAT' in labels[i+1] and 'STAT' not in labels[i]:
            if tokens[i] =='to':
                labels[i]='I-STAT'
                labels[i+1]='I-STAT'
            if tokens[i] =='-':
                labels[i]='I-STAT'
                labels[i+1]='I-STAT'
        
        #This gets statistics of the type "prevalence of 2 to 18 per 100,000"
        if labels[i+1]=='B-STAT':
            if tokens[i]=='to' or tokens[i]=='-' or tokens[i-1].isdigit():
                labels[i-1]='B-STAT'
                labels[i]='I-STAT'
                labels[i+1]='I-STAT'
        i+=1
    return tokens,labels
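
The range-merging rule in the middle of `combine_stats` can be isolated as follows. This is a simplified sketch covering only that one rule (two STAT spans joined by `to` or `-`); `merge_stat_ranges` is a hypothetical name and the example labels are written by hand rather than produced by the tagger.

```python
def merge_stat_ranges(tokens, labels):
    """Join two STAT spans separated by a single 'to' or '-' into one span."""
    merged = list(labels)  # work on a copy so the input is left untouched
    for i in range(1, len(tokens) - 1):
        is_joiner = tokens[i] in {"to", "-"}
        if is_joiner and merged[i] == "O" and "STAT" in merged[i - 1] and "STAT" in merged[i + 1]:
            merged[i] = "I-STAT"      # the joiner becomes part of the span
            merged[i + 1] = "I-STAT"  # the second span no longer starts a new entity
    return merged

tokens = ["affects", "1", "in", "7500", "to", "1", "in", "10,000", "people"]
labels = ["O", "B-STAT", "I-STAT", "I-STAT", "O", "B-STAT", "I-STAT", "I-STAT", "O"]
print(merge_stat_ranges(tokens, labels))
```

After merging, the sentence carries a single `B-STAT` followed by `I-STAT` tags through `10,000`, so the range is extracted as one statistic.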

In [20]:
# This function should take in a sentence and output each word in it with a tentative label
def tag_NERs(sentence):
    
    doc = nlp(sentence)
    tokens = [token.text for token in doc]
    labels = ['O' for token in doc]
    
    i = 0
    for token in doc:
        if len(str(token.text).strip())==0:
            tokens.pop(i)
            labels.pop(i)
            
        else:
            ## Epidemiologic identifier
            if token.text.lower() in {'incidence','prevalence','prevalences','prevalence ','incidences','occurrence','occurrences'}:
                labels[i] = 'B-EPI'
        
            ## Location
            if token.ent_type_ in {'GPE','LOC'}:
                labels[i] = str(token.ent_iob_+'-LOC')
            if token.text in {"worldwide"}:
                labels[i] = 'B-LOC'
        
            ## Epidemiologic Rates
            #This gets stuff of the form 3.5/100
            if token.text[0].isdigit() and '/' in token.text:
                labels[i] = 'B-STAT'
        
            #label all percents except those preceding "confidence interval (CI)"
            if token.ent_type_ in {'PERCENT'}:# and token.text not in {'95', 'CI'}:
                if i<len(doc)-2:
                    if doc[i+2].text in {'CI','confidence','interval','confidence interval','(CI)','(CI','CI)'}:
                        labels[i] = 'O'
                        labels[i+1] = 'O'
                        labels[i+2] = 'O'
                    elif doc[i+1].text in {'CI','confidence','interval','confidence interval','(CI)','(CI','CI)'}:
                        labels[i] = 'O'
                        labels[i+1] = 'O'
                    else:
                        labels[i] = str(token.ent_iob_+'-STAT')
                elif i<len(doc)-1:
                    if doc[i+1].text in {'CI','confidence','interval','confidence interval','(CI)','(CI','CI)'}:
                        labels[i] = 'O'
                        labels[i+1] = 'O'
                    else:
                        labels[i] = str(token.ent_iob_+'-STAT')        
                else:
                    labels[i] = str(token.ent_iob_+'-STAT')
        
            #These 3 get stuff of the form "one in 35000" or "one in every 23043"
            if (token.text.lower() in {'one','1'} and i<(len(doc)-3)): 
                if doc[i+3].is_digit:
                    labels[i] = 'B-STAT'
                    for j in range(i+1,i+4):
                        labels[j] = 'I-STAT'
            if (token.text.lower() in {'one','1'} and i<(len(doc)-2)): 
                if doc[i+2].is_digit:
                    labels[i] = 'B-STAT'
                    labels[i+1] = 'I-STAT'
                    labels[i+2] = 'I-STAT'
            if (token.text.lower() in {'one','1'} and i<(len(doc)-1)):
                if doc[i+1].is_digit:
                    labels[i] = 'B-STAT'
                    labels[i+1] = 'I-STAT'
        
            #These should catch rates of the form "14.1 deaths per 1,000 LBs".
            #The pattern is "a b per c d e": the span is tagged when any of a, b, c, or d is a number, and e can be any token.
            #If e does not exist, a-d still need to be tagged as STAT.
            #This is a big decision tree that branches on how many tokens exist on each side of 'per'; not sure how to write it in fewer lines of code.
            if token.text.lower() =='per':
                #print(i,len(doc))
                if i>1:
                    if i<len(doc)-3:
                        #Resulted in better testing when not validating that words after 'per' are numbers
                        if (doc[i-2].is_digit or doc[i-2].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i-1].is_digit or doc[i-1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+2].is_digit or doc[i+2].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+1].is_digit or doc[i+1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}):
                            if tokens[i-2] not in STOPWORDS and tokens[i-2] not in PUNCTUATION:
                                labels[i-2] = 'B-STAT'
                                #labeling also the token after the number
                                for j in range(i-1,i+3):
                                    labels[j]='I-STAT'
                            else:
                                labels[i-1] = 'B-STAT'
                                #labeling also the token after the number
                                for j in range(i,i+3):
                                    labels[j]='I-STAT'
                            
                    if i<len(doc)-2:
                        if (doc[i-2].is_digit or doc[i-2].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i-1].is_digit or doc[i-1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+2].is_digit or doc[i+2].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+1].is_digit or doc[i+1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}):
                            if tokens[i-2] not in STOPWORDS and tokens[i-2] not in PUNCTUATION:
                                labels[i-2] = 'B-STAT'
                                #labeling also the token after the number
                                for j in range(i-1,i+2):
                                    labels[j]='I-STAT'
                            else: 
                                labels[i-1] = 'B-STAT'
                                #labeling also the token after the number
                                for j in range(i,i+2):
                                    labels[j]='I-STAT'
                    #The difference between the above and below is in labeling the token immediately after the number
                    if i<len(doc)-1:
                        #doc[i+2] is not checked here: it does not exist when 'per' is the second-to-last token
                        if (doc[i-2].is_digit or doc[i-2].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i-1].is_digit or doc[i-1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+1].is_digit or doc[i+1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}):
                            if tokens[i-2] not in STOPWORDS and tokens[i-2] not in PUNCTUATION:
                                labels[i-2] = 'B-STAT'
                                #labeling also the token after if it is number
                                for j in range(i-1,i+1):
                                    labels[j]='I-STAT'
                            else: 
                                labels[i-1] = 'B-STAT'
                                #labeling also the token after the number
                                for j in range(i,i+1):
                                    labels[j]='I-STAT'
                elif i>0:
                    if i<len(doc)-3:
                        if (doc[i-1].is_digit or doc[i-1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+2].is_digit or doc[i+2].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+1].is_digit or doc[i+1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}):
                
                            labels[i-1] = 'B-STAT'
                            #labeling also the token after the number
                            for j in range(i,i+3):
                                labels[j]='I-STAT'
                            
                    if i<len(doc)-2:
                        if (doc[i-1].is_digit or doc[i-1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+2].is_digit or doc[i+2].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+1].is_digit or doc[i+1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}):
                        
                            labels[i-1] = 'B-STAT'
                            #labeling also the token after the number
                            for j in range(i,i+2):
                                labels[j]='I-STAT'
                            
                    if i<len(doc)-1:
                        if (doc[i-1].is_digit or doc[i-1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}) or (
                            doc[i+1].is_digit or doc[i+1].ent_type_ in {'CARDINAL','ORDINAL','QUANTITY','MONEY'}):
                    
                            labels[i-1] = 'B-STAT'
                            #labeling just the number if there is nothing after. 
                            for j in range(i,i+1):
                                labels[j]='I-STAT'
            i+=1

    if len(tokens) != len(labels):
        raise ValueError('Token/Label Length Mismatch')
        
    if len(tokens)>2 and len(labels)>2:
        tokens, labels = combine_stats(tokens,labels)

    return tokens, labels #tokens are spaCy tokens, not strings; convert them with str() when writing to file
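
The 'per' branch above repeats the same numeric check once for each possible distance to the sentence end. The same idea can be sketched more compactly by clamping a window around 'per'. This is only an illustration on plain string tokens, not the notebook's implementation: `looks_numeric`, `label_per_rates`, and the small `STOPWORDS` set are hypothetical stand-ins for the spaCy checks and the NLTK stop-word list, and clamping the window means labels near the end of a sentence can differ slightly from what the decision tree produces.

```python
import string

# Small stand-in for the NLTK stop-word list used in the notebook (assumption).
STOPWORDS = {"of", "the", "a", "an", "in", "is", "was"}
PUNCTUATION = set(string.punctuation)


def looks_numeric(text):
    """Hypothetical stand-in for spaCy's is_digit / CARDINAL checks."""
    stripped = text.replace(",", "").replace(".", "", 1)
    return stripped.isdigit()


def label_per_rates(tokens, labels, after=2):
    """Tag 'a b per c d' style spans as STAT, clamping at sentence edges."""
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() != "per" or i == 0:
            continue
        lo, hi = max(i - 2, 0), min(i + after, n - 1)
        window = [j for j in range(lo, hi + 1) if j != i]
        # Require at least one number somewhere around 'per'.
        if not any(looks_numeric(tokens[j]) for j in window):
            continue
        # Start two tokens back unless that token is a stop word or punctuation.
        start = i - 1
        if (i >= 2 and tokens[i - 2].lower() not in STOPWORDS
                and tokens[i - 2] not in PUNCTUATION):
            start = i - 2
        labels[start] = "B-STAT"
        for j in range(start + 1, hi + 1):
            labels[j] = "I-STAT"
    return labels


tokens = "rate of 14.1 deaths per 1,000 LBs here".split()
labels = label_per_rates(tokens, ["O"] * len(tokens))
for token, label in zip(tokens, labels):
    print(token, label)
```

Clamping with `max` and `min` replaces the nested `if i<len(doc)-k` branches, at the cost of treating every sentence boundary the same way.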

### tag_NERs Function Testing Cells

In [21]:
test_dictionary = {'s2':'5% of guards run to the watch tower in Budapest where one 6 14.1 deaths per 1,000 LBs here here the king (95% CI 148.3-254.2)',
    's3':'Organ damage in sickle cell disease (SCD) is a crucial determinant for disease severity and prognosis. In a previous study, we analyzed the prevalence of SCD-related organ damage and complications in adult sickle cell patients. We now describe a seven-year follow-up of this cohort.All patients from the primary analysis in 2006 (n = 104), were included for follow-up. Patients were screened for SCD-related organ damage and complications (microalbuminuria, renal failure, elevated tricuspid regurgitation flow velocity (TRV) (≥2.5 m/seconds), retinopathy, iron overload, cholelithiasis, avascular osteonecrosis, leg ulcers, acute chest syndrome (ACS), stroke, priapism and admissions for vaso-occlusive crises (VOC) biannually. Upon 7 years of follow-up, progression in the prevalence of avascular osteonecrosis (from 12.5% to 20.4%), renal failure (from 6.7% to 23.4%), retinopathy (from 39.7% to 53.8%) was observed in the whole group. In HbSS/HbSβ0 -thal patients also progression in microalbuminuria (from 34% to 45%) and elevated TRV (from 40% to 48%) was observed while hardly any progression in the prevalence of cholelithiasis, priapism, stroke or chronic ulcers was seen. The proportion of patients with at least one episode of ACS increased in the group of HbSS/HbSβ0 -thal patients from 32% to 49.1%. In conclusion, 62% of the sickle cell patients in this prospective cohort study developed a new SCD-related complication in a comprehensive care setting within 7 years of follow-up. Although the hospital admission rate for VOC remained stable, multiple forms of organ damage increased substantially. These observations underline the need for continued screening for organ damage in all adult patients with SCD.',
    's4':"Uveal melanoma (UM) represents the most prominent primary eye cancer in adults. With an incidence of approximately 5 cases per million individuals annually in the United States, UM could be considered a relatively rare cancer. The 90‑95% of UM cases arise from the choroid. Diagnosis is based mainly on a clinical examination and ancillary tests, with ocular ultrasonography being of greatest value. Differential diagnosis can prove challenging in the case of indeterminate choroidal lesions and, sometimes, monitoring for documented growth may be the proper approach. Fine needle aspiration biopsy tends to be performed with a prognostic purpose, often in combination with radiotherapy. Gene expression profiling has allowed for the grading of UMs into two classes, which feature different metastatic risks. Patients with UM require a specialized multidisciplinary management. Primary tumor treatment can be either enucleation or globe preserving. Usually, enucleation is reserved for larger tumors, while radiotherapy is preferred for small/medium melanomas. The prognosis is unfavorable due to the high mortality rate and high tendency to metastasize. Following the development of metastatic disease, the mortality rate increases to 80% within one year, due to both the absence of an effective treatment and the aggressiveness of the condition. Novel molecular studies have allowed for a better understanding of the genetic and epigenetic mechanisms involved in UM biological activity, which differs compared to skin melanomas. The most commonly mutated genes are GNAQ, GNA11 and BAP1. Research in this field could help to identify effective diagnostic and prognostic biomarkers, as well as novel therapeutic targets.",
    's5':"This disease affects <1/500,000 baby boys in Russia. It is linked to the   original hemopheliac prince and is a old disease."}

In [22]:
for k, v in test_dictionary.items():
    #Test the function
    a,b = tag_NERs(v)
    print('\n-----------------',k,'-------------------------\n')
    for i in range(len(a)):
        print(a[i], b[i])


----------------- s2 -------------------------

5 B-STAT
% I-STAT
of O
guards O
run O
to O
the O
watch O
tower O
in O
Budapest B-LOC
where O
one B-STAT
6 I-STAT
14.1 B-STAT
deaths I-STAT
per I-STAT
1,000 I-STAT
LBs I-STAT
here O
here O
the O
king O
( O
95 O
% O
CI O
148.3 O
- O
254.2 O
) O

----------------- s3 -------------------------

Organ O
damage O
in O
sickle O
cell O
disease O
( O
SCD O
) O
is O
a O
crucial O
determinant O
for O
disease O
severity O
and O
prognosis O
. O
In O
a O
previous O
study O
, O
we O
analyzed O
the O
prevalence B-EPI
of O
SCD O
- O
related O
organ O
damage O
and O
complications O
in O
adult O
sickle O
cell O
patients O
. O
We O
now O
describe O
a O
seven O
- O
year O
follow O
- O
up O
of O
this O
cohort O
. O
All O
patients O
from O
the O
primary O
analysis O
in O
2006 O
( O
n O
= O
104 O
) O
, O
were O
included O
for O
follow O
- O
up O
. O
Patients O
were O
screened O
for O
SCD O
- O
related O
organ O
damage O
and O
complications O
( O
microalbuminuri

In [23]:
a,b = tag_NERs(remove_html(test_dictionary['s5']))
for i in range(len(a)):
    print(a[i], b[i])

This O
disease O
affects O
< B-STAT
1/500,000 I-STAT
baby O
boys O
in O
Russia B-LOC
. O
It O
is O
linked O
to O
the O
original O
hemopheliac O
prince O
and O
is O
a O
old O
disease O
. O


In [24]:
sentences = tokenize.sent_tokenize(remove_html(test_dictionary['s5']))

for sent in sentences:
    a,b = tag_NERs(sent)
    for i in range(len(a)):
        print(a[i], b[i])
    print('')

This O
disease O
affects O
< B-STAT
1/500,000 I-STAT
baby O
boys O
in O
Russia B-LOC
. O

It O
is O
linked O
to O
the O
original O
hemopheliac O
prince O
and O
is O
a O
old O
disease O
. O



In [25]:
abstracts = abstracts_df['abstract'].to_list()
for i in range(len(abstracts)):
    abstracts[i] = tokenize.sent_tokenize(remove_html(abstracts[i]))

In [26]:
#Test this again and again with random sampling
import random
for sentence in random.choice(abstracts):
    a,b = tag_NERs(sentence)
    a,b = tag_diseases(a,b)
    for i in range(len(a)):
        print(a[i], b[i])
    print('')

Background O
Kindler O
poikiloderma B-DIS
is O
an O
inherited B-DIS
autosomal O
genodermatosis O
characterized O
by O
blistering O
of O
the O
epidermis O
and O
mucosae O
. O

Its O
prevalence B-EPI
is O
unknown O
. O

Case O
report O
We O
monitored O
two O
brothers O
suffering O
from O
this O
pathology O
. O

Oral O
manifestations O
mainly O
take O
the O
form O
of O
periodontal O
lesions O
. O

In O
our O
patients O
we O
noted O
gingivitis O
progressing O
to O
periodontitis O
as O
follow O
- O
up O
care O
was O
not O
effective O
. O

We O
also O
diagnosed O
enamel B-DIS
hypoplasia I-DIS
, O
described O
more O
rarely O
in O
this O
pathology O
. O

Conclusion O
Periodontitis O
in O
Kindler B-DIS
Syndrome I-DIS
responds O
to O
maintenance O
therapy O
, O
but O
the O
absence B-DIS
of I-DIS
surveillance O
is O
penalized O
by O
a O
deterioration O
in O
periodontal O
condition O
and O
complication O
of O
management O
. O

All O
restorative O
, O
endodontic O
, O
surgical O
, O
periodontal O
a

## (4) Split the data

In [27]:
abstract_sents = [tokenize.sent_tokenize(abstract) for abstract in [remove_html(abstract) for abstract in abstracts_df['abstract']]]

In [28]:
for abstract in abstract_sents:
    print(abstract)
    for sentence in abstract:
        print(sentence)
        break
    break


['Background The incidence of hydrocephalus in the spinal muscular atrophy (SMA) population relative to the general population is currently unknown.', 'Since the approval of nusinersen, an intrathecally administered drug for SMA, a small number of hydrocephalus cases among nusinersen users have been reported.', "Currently, the incidence of hydrocephalus in untreated SMA patients is not available, thereby making it difficult to determine if hydrocephalus is a side effect of nusinersen or part of SMA's natural history.", 'This retrospective, matched cohort study used electronic health records (EHRs) to estimate and compare the incidence of hydrocephalus in both SMA patients and matched non-SMA controls in the time period prior to the approval of nusinersen.', 'Methods The U.S. Optum® de-identified EHR database contains records for approximately 100 million persons.', 'The current study period spanned January 1, 2007-December 22, 2016.', 'Patients with SMA were identified by one or more I

In [29]:
from sklearn.model_selection import train_test_split
dev_set, test_set = train_test_split(abstract_sents, test_size=50,random_state=5)
print(len(test_set),len(dev_set))

#dev_set is only a temporary holder for the train/val split and will not be saved

train_set, val_set = train_test_split(dev_set, train_size=0.8,random_state=6)
print(len(train_set),len(val_set))
del dev_set

50 570
456 114
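
The two-stage split can also be illustrated without scikit-learn. The sketch below is a standard-library approximation, not the call used in the cell above: `two_stage_split` is a hypothetical helper, it shuffles with `random.Random` rather than `train_test_split`, and so it will not reproduce the same partition even with the same seed values. It only shows how a fixed-size test set and an 80/20 train/val split yield the sizes printed above for 620 abstracts.

```python
import random


def two_stage_split(items, test_count=50, train_fraction=(4, 5), seeds=(5, 6)):
    """Hold out a fixed-size test set, then split the rest into train/val."""
    pool = list(items)
    random.Random(seeds[0]).shuffle(pool)
    test, dev = pool[:test_count], pool[test_count:]
    random.Random(seeds[1]).shuffle(dev)
    # Integer arithmetic avoids floating-point rounding in the cut point.
    num, den = train_fraction
    cut = len(dev) * num // den
    return dev[:cut], dev[cut:], test


# 620 placeholder ids stand in for the entries of abstract_sents.
train, val, test = two_stage_split(range(620))
print(len(train), len(val), len(test))  # 456 114 50
```

Seeding each shuffle separately mirrors the use of two `random_state` values, so either stage can be re-drawn without disturbing the other.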


Map the tags onto the sentences of each abstract and count how many tokens and labels are produced (the two counts must match).

In [30]:
def tag_abstracts(abstract_sents):
    all_tokens, all_labels = [],[]
    num_tokens,num_labels =0,0
    i=0
    for abstract in abstract_sents:
        abstract_tokens, abstract_labels = [],[]
        for sentence in abstract:
            sentence_tokens, sentence_labels = tag_NERs(sentence)
            num_tokens+=len(sentence_tokens)
            num_labels+=len(sentence_labels)
            abstract_tokens.append(sentence_tokens)
            abstract_labels.append(sentence_labels)
            #Progress counter, since tagging can take a long time
            i+=1
            if i%250==0:
                print('Step:',i)
        all_tokens.append(abstract_tokens)
        all_labels.append(abstract_labels)
            
    print(len(sentence_tokens),len(sentence_labels))
    print('number of annotations',num_tokens,num_labels)
    return all_tokens, all_labels
tokens, labels = tag_abstracts(test_set)

Step: 250
Step: 500
22 22
number of annotations 13910 13910
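
The totals printed above count every token, including those labeled `O`. To see how many entity annotations a split actually contains, the nested label lists can be tallied per tag. The following is a minimal sketch under the assumption that the input has the same abstract, sentence, label nesting that `tag_abstracts` returns; `count_entity_labels` is a hypothetical helper and the toy input is invented for illustration.

```python
from collections import Counter


def count_entity_labels(all_labels):
    """Count tokens per label and entity spans (B- tags) per entity type."""
    token_counts = Counter()
    span_counts = Counter()
    for abstract in all_labels:
        for sentence in abstract:
            for label in sentence:
                label = str(label)
                token_counts[label] += 1
                # Each B- tag opens exactly one entity span.
                if label.startswith("B-"):
                    span_counts[label[2:]] += 1
    return token_counts, span_counts


# Toy input shaped like the nested lists returned by tag_abstracts.
toy_labels = [[["O", "B-EPI", "O", "B-STAT", "I-STAT"], ["B-LOC", "O"]]]
token_counts, span_counts = count_entity_labels(toy_labels)
print(dict(token_counts))
print(dict(span_counts))
```

Counting `B-` tags gives the number of spans, while the per-label token counts show how heavily the data is dominated by `O`.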


## (5) Save into the correct format

In [31]:
with open('epi_train_setV2.tsv', "w") as f:
    abstracts, label_list = tag_abstracts(train_set)
    for i in range(len(abstracts)): #For abstract in abstracts
        for j in range(len(abstracts[i])): #for sentence in abstract
            for k in range(len(abstracts[i][j])): #for token in sentence
                output = str(abstracts[i][j][k]) +'\t' +str(label_list[i][j][k])+'\n'
                f.write(output)
            f.write('\n')
        #f.write('\n')
        if i%50==0:
            print('abstract num',i,'done')

Step: 250
Step: 500
Step: 750
Step: 1000
Step: 1250
Step: 1500
Step: 1750
Step: 2000
Step: 2250
Step: 2500
Step: 2750
Step: 3000
Step: 3250
Step: 3500
Step: 3750
Step: 4000
Step: 4250
Step: 4500
12 12
number of annotations 117888 117888
abstract num 0 done
abstract num 50 done
abstract num 100 done
abstract num 150 done
abstract num 200 done
abstract num 250 done
abstract num 300 done
abstract num 350 done
abstract num 400 done
abstract num 450 done


In [32]:
with open('epi_val_setV2.tsv', "w") as f:
    abstracts, label_list = tag_abstracts(val_set)
    for i in range(len(abstracts)): #For abstract in abstracts
        for j in range(len(abstracts[i])): #for sentence in abstract
            for k in range(len(abstracts[i][j])): #for token in sentence
                output = str(abstracts[i][j][k]) +'\t' +str(label_list[i][j][k])+'\n'
                f.write(output)
            f.write('\n')
        #f.write('\n')
        if i%50==0:
            print('abstract num',i,'done')

Step: 250
Step: 500
Step: 750
Step: 1000
34 34
number of annotations 31262 31262
abstract num 0 done
abstract num 50 done
abstract num 100 done


In [34]:
with open('epi_test_setV2.tsv', "w") as f:
    abstracts, label_list = tag_abstracts(test_set)
    for i in range(len(abstracts)): #For abstract in abstracts
        for j in range(len(abstracts[i])): #for sentence in abstract
            for k in range(len(abstracts[i][j])): #for token in sentence
                output = str(abstracts[i][j][k]) +'\t' +str(label_list[i][j][k])+'\n'
                f.write(output)
            f.write('\n')
        #f.write('\n')
        if i%10==0:
            print('abstract num',i,'done')

Step: 250
Step: 500
22 22
number of annotations 13910 13910
abstract num 0 done
abstract num 10 done
abstract num 20 done
abstract num 30 done
abstract num 40 done


Old function (kept for reference only; the cell below is wrapped in a string and is not run)

In [None]:
'''
#This will not be run now, will be done after training the model since it is just dictionary lookup
def tag_diseases(tokens,labels):   
    i=0
    while i <len(tokens)-1:
    #for i in range(len(tokens)):
        print(i)
        if tokens[i].lower() in GARD_firstwd_dict.keys():
            #print(tokens[i])
            if (len(tokens)-i) < max_length:
                compare_length=len(tokens)-i
            else:
                compare_length = max_length
            #Compares longest sequences first and goes down until there is a match
            while compare_length>0:
                s = ' '.join(tokens[i:i+compare_length])
                for key in GARD_dict.keys():
                    if key==s.lower():
                        labels[i] = 'B-DIS'
                        print(s)
                        for j in range(i+1,i+compare_length):
                            labels[j] = 'I-DIS'
                        #Need to skip over the next few indexes
                        print(compare_length)
                        i+=compare_length
                        #only want to break out of innermost loop in case there are multiple rare diseases in the same sentence
                        break
                compare_length-=1
                
                #if s.lower() in GARD_dict.keys():
                #    labels[i] = 'B-DIS'
                #    print(s)
                #    for j in range(i+1,i+compare_length):
                #        labels[j] = 'I-DIS'
                #    #Need to skip over the next few indexes
                #    i+=compare_length
                #    #only want to break out of innermost loop in case there are multiple rare diseases in the same sentence
                #    break
                #else:
                #    compare_length-=1
                
        i+=1  
    return tokens,labels
'''