#### Import packages

In [1]:
import numpy as np
import pandas as pd
from collections import Counter
from tqdm import tqdm

#### 1. Parse in Practice_NLP.txt, a synthetic note from an Electronic Health Record, and print out the text

In [2]:
# 1.1 Read Practice_NLP.txt with a context manager so the file handle is closed
with open("Practice_NLP.txt", "r") as f:
    BWH_text = f.read()

# 1.2 Import spaCy and create object 'nlp'
import spacy
nlp = spacy.load("en_core_web_sm")

# 1.3 Call the nlp object on the text to create a spaCy Doc, BWH_doc
BWH_doc = nlp(BWH_text)

print(BWH_doc)

The BWH ED Nursing Record and other paper documents are still available through scanned documents in LMR. Select Results Menu, select Links, and then BWH Inpt/ED/Day Surg Scanned Record. If you need further assistance, please contact the Help Desk at (617) 732-5927.

The above is an example of a real header you might encounter within a note. This is some additional random text that is not relevant for our
task at hand. 


The patient came in at around 11:00AM today. They were seen around 11:30 AM. Their primary complaint was ongoing issues with their memory. 
They have been seeing their PCP at MGH for 10 years. Their primary care physician noted about a year ago that the patient's memory was declining. 
They ran a cognitive test but did not find evidence of dementia. They have no family history of Alzheimer's. They called after their visit at 
3  PM, requesting a refill for their Metformin. 

Lab values: 
HBA1c 9.0
TUC 204
ABCDKFKS 3 

It is recommended that the patient see a neurologi

#### 2. Print out the first 5 sentences (use spaCy's processing to determine what a sentence is; hint: use the .sents attribute)

In [3]:
def head_n_sentences(doc_file, n):
    """Function to display first n sentences in a doc file"""
    for sent_num, senttext in enumerate(list(doc_file.sents)[0:n], start=1):
        print(sent_num,"-->", senttext)


head_n_sentences(doc_file = BWH_doc,n = 5)

1 --> The BWH ED Nursing Record and other paper documents are still available through scanned documents in LMR.
2 --> Select Results Menu, select Links, and then BWH Inpt/ED/Day Surg Scanned Record.
3 --> If you need further assistance, please contact the Help Desk at (617) 732-5927.


4 --> The above is an example of a real header you might encounter within a note.
5 --> This is some additional random text that is not relevant for our
task at hand. 





#### 3. Write a rule to extract only paragraphs, defined here as 3 or more full English sentences with at least one line of whitespace separating them from the next element of text. Print out the 2 paragraphs in this note

In [4]:
# 3.1 This function 'find_duplicates_indice' returns the exact locations of an element within a list. 
#     In this case we wish to locate the position of whitespaces \n\n or \n\n\n
def find_duplicates_indice(input_list, item_to_locate):
    "Function to locate the indices of specific items in a list"
    return [indc for indc, x in enumerate(input_list) if x == item_to_locate]

# 3.2 Append and sort the location of whitespaces in variable space_loc
space_loc = (find_duplicates_indice(input_list=[i.orth_ for i in BWH_doc],
                               item_to_locate="\n\n")+
             find_duplicates_indice(input_list=[i.orth_ for i in BWH_doc],
                               item_to_locate="\n\n\n"))

space_loc.sort()

print("Whitespaces located at index: ",space_loc)


# 3.3 Segregate the tokens in the doc file using the index 'space_loc', stored in list 'required_list'
tokens = [t.orth_ for t in BWH_doc]
required_list=[]

for i in range(len(space_loc)):
    # The last segment runs to the end of the document
    end = space_loc[i+1] if i < len(space_loc)-1 else len(tokens)
    required_list.append(" ".join(tokens[space_loc[i]:end]).strip())


# 3.4 Keep only segments with at least 3 periods. A new list is built because
#     removing items from a list while iterating over it skips elements
required_list = [seg for seg in required_list if seg.count(".") >= 3]

print("\n\nNumber of paragraphs identified in the doc file :", len(required_list))


# 3.5 Print all paragraphs
for para_num, para in enumerate(required_list, start=1):
    print("\n\nPARAGRAPH",
          para_num,"-->",
          para,"\n")

Whitespaces located at index:  [56, 90, 188, 201]


Number of paragraphs identified in the doc file : 2


PARAGRAPH 1 --> The patient came in at around 11:00AM today . They were seen around 11:30 AM . Their primary complaint was ongoing issues with their memory . 
 They have been seeing their PCP at MGH for 10 years . Their primary care physician noted about a year ago that the patient 's memory was declining . 
 They ran a cognitive test but did not find evidence of dementia . They have no family history of Alzheimer 's . They called after their visit at 
 3   PM , requesting a refill for their Metformin . 



PARAGRAPH 2 --> It is recommended that the patient see a neurologist , as they have not done so yet . The subject denies MCI but it appears like they 
 may have early stage dementia . The patient will continue their metformin and check back in with me in six weeks . 
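
The same paragraph rule can also be written against the raw string instead of token indices. The sketch below is an alternative, not the approach used in the cell above: it splits on blank lines and counts sentence-ending punctuation with a regular expression, which is only a rough proxy for "full English sentences". The names `extract_paragraphs` and `min_sentences` and the sample text are hypothetical.

```python
import re

def extract_paragraphs(text, min_sentences=3):
    """Return blocks separated by blank lines that contain at least
    min_sentences sentence-like units (rough heuristic, not spaCy's segmenter)."""
    # One or more blank (whitespace-only) lines separate blocks
    blocks = re.split(r"\n\s*\n", text)
    # A sentence end is approximated as ., ! or ? followed by whitespace or end of block,
    # so decimals such as "9.0" are not counted
    sentence_end = re.compile(r"[.!?](?=\s|$)")
    paragraphs = []
    for block in blocks:
        block = block.strip()
        if len(sentence_end.findall(block)) >= min_sentences:
            paragraphs.append(block)
    return paragraphs

# Made-up sample text, not the clinical note
sample = (
    "Header line one. Header line two.\n\n"
    "First sentence. Second sentence. Third sentence.\n\n"
    "Lab values:\nHBA1c 9.0\n"
)
for number, paragraph in enumerate(extract_paragraphs(sample), start=1):
    print("PARAGRAPH", number, "-->", paragraph)
```

Applied to the note itself, this heuristic would also keep the opening header block, since it contains three sentences, so an extra filter would be needed to reproduce exactly the two paragraphs printed above.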



#### 4. Build a list of negative words (no, not, etc.) you would generally expect to see in everyday English (or that you see in the text). Turn this into a regular expression using re.compile. How many sentences have a negative word in them? Print them along with the index of the sentence containing the negative in the original document

In [5]:
# 4.1 Import built-in library 're' for regular expressions
import re


# 4.2 Build a list of all possible negative words
words_negative = ["not","no","nothing","none","never","neither"]


# 4.3 Convert the above list to a regular expression using re.compile().
#     \b on both sides of the group restricts matches to whole words
words_negative_regex_compiled = re.compile(r'\b(?:' + '|'.join(words_negative) + r')\b')


# 4.4 Print the sentences containing the negative keywords along with the index in the original document
def disp_sent_with_keyword_list(doc_file, text_file, regex_compiled_list):
    """Function to display sentences containing specific keywords.

    text_file is kept so existing calls still work; matching is done per sentence.
    """
    index=[]
    text=[]
    # enumerate gives the true sentence position, even if two sentences are identical
    for ind_no, sent in enumerate(doc_file.sents):
        sent_text = str(sent).strip()
        # search() records each sentence once, however many keywords it contains
        if regex_compiled_list.search(sent_text):
            index.append(ind_no)
            text.append(sent_text)

    return pd.DataFrame({"Sent_index":index,"Text":text})
                
                

disp_sent_with_keyword_list(doc_file  = BWH_doc,
                            text_file = BWH_text,
                            regex_compiled_list = words_negative_regex_compiled)

Unnamed: 0,Sent_index,Text
0,4,This is some additional random text that is no...
1,10,They ran a cognitive test but did not find evi...
2,11,They have no family history of Alzheimer's.
3,17,It is recommended that the patient see a neuro...


In total, 4 sentences contain at least one negative word.
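
Word boundaries matter when a keyword list is folded into one pattern. The following is a minimal sketch on a made-up sentence, with the hypothetical name `negation_pattern`, showing how `\b` on both sides of a non-capturing group keeps a short keyword such as "no" from matching inside longer words. The `re.IGNORECASE` flag is an assumption added here so that a sentence-initial "No" is also caught; the notebook's pattern is case-sensitive.

```python
import re

words_negative = ["not", "no", "nothing", "none", "never", "neither"]

# re.escape guards against keywords that contain regex metacharacters
alternatives = "|".join(re.escape(word) for word in words_negative)
negation_pattern = re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)

# Made-up sentence: "piano", "noted" and "notebook" contain keywords as substrings
sample_sentence = "No, the piano was not noted in the notebook."
matches = negation_pattern.findall(sample_sentence)
print(matches)
```

Only the standalone words "No" and "not" are returned; the substrings inside "piano", "noted" and "notebook" are skipped.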

#### 5. Get all negating dependencies using spaCy's dependency parser (token.dep_ == 'neg') and print them. Which negative words did this miss?

In [6]:
pd.DataFrame([(token,token.dep_) for token in BWH_doc if (token.dep_=="neg")],
             columns=["Token","Dependency"])

Unnamed: 0,Token,Dependency
0,not,neg
1,not,neg
2,not,neg


The `neg` filter returns only the three 'not' tokens. It misses the token 'no' (sentence 11), which also negates its sentence but does not carry the `neg` dependency label.
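
One way to answer "which negative words did this miss?" programmatically is to compare the keyword list from step 4 with the dependency labels. The sketch below assumes the tokens are supplied as `(text, dep)` pairs, which in this notebook could be built with `[(t.text, t.dep_) for t in BWH_doc]`. The function name `missed_negations` is hypothetical, and the example pairs are hand-written for illustration rather than copied from the parser's output.

```python
def missed_negations(token_dep_pairs, negative_words):
    """Return (text, dep) pairs whose text is a known negative word
    but whose dependency label is not 'neg'."""
    negative_words = {word.lower() for word in negative_words}
    return [(text, dep) for text, dep in token_dep_pairs
            if text.lower() in negative_words and dep != "neg"]

# Hand-written pairs standing in for [(t.text, t.dep_) for t in BWH_doc];
# the labels are illustrative assumptions, not real parser output
example_pairs = [("They", "nsubj"), ("did", "aux"), ("not", "neg"), ("find", "ROOT"),
                 ("They", "nsubj"), ("have", "ROOT"), ("no", "det"), ("history", "dobj")]

missed = missed_negations(example_pairs, ["not", "no", "never"])
print(missed)
```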

#### 6. Visualize the full note dependency using displaCy's visualizer. What is the dependency parse of a negative word (for example, "They have no family history of Alzheimer's") that was missed in the previous step?

In [7]:
from spacy import displacy

displacy.render(docs=list(BWH_doc.sents)[11],
                style="dep",
                jupyter=True,
                options={"font": "Source Sans Pro"})

In [8]:
spacy.explain("DET")

'determiner'

Here, the dependency parser labels the negative word 'no' as a determiner of the word 'history' rather than as `neg`, which is why the filter in step 5 missed it.

#### 7. Remove all sentences with negatives defined in your rule from step 4 and print out the final text object with these sentences removed

In [9]:
list(disp_sent_with_keyword_list(doc_file  = BWH_doc,
                            text_file = BWH_text,
                            regex_compiled_list = words_negative_regex_compiled)["Sent_index"])

[4, 10, 11, 17]

In [10]:
def clean_doc_file_using_index_list(doc_file, drop_index_list):
    "Function to eliminate sentences from doc file using list of user specified indices"
    drop_index_set = set(drop_index_list)
    # text_with_ws keeps each sentence's trailing whitespace so sentences do not run together
    cleaned_file = "".join(sent.text_with_ws
                           for ind_no, sent in enumerate(doc_file.sents)
                           if ind_no not in drop_index_set)
    return nlp(cleaned_file)



clean_doc_file_using_index_list(doc_file=BWH_doc, 
                                drop_index_list=[4,10,11,17])

The BWH ED Nursing Record and other paper documents are still available through scanned documents in LMR. Select Results Menu, select Links, and then BWH Inpt/ED/Day Surg Scanned Record. If you need further assistance, please contact the Help Desk at (617) 732-5927.

The above is an example of a real header you might encounter within a note. The patient came in at around 11:00AM today. They were seen around 11:30 AM. Their primary complaint was ongoing issues with their memory. 
They have been seeing their PCP at MGH for 10 years. Their primary care physician noted about a year ago that the patient's memory was declining. 
They called after their visit at 
3  PM, requesting a refill for their Metformin. 

Lab values: 
HBA1c 9.0
TUC 204
ABCDKFKS 3 

The subject denies MCI but it appears like they 
may have early stage dementia. The patient will continue their metformin and check back in with me in six weeks.
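
Step 7 can also be done in a single pass, without first collecting sentence indices. This is a minimal sketch on plain strings, using the hypothetical name `drop_matching_sentences` and made-up sentences. In this notebook the input could be `[s.text_with_ws for s in BWH_doc.sents]`, so that each sentence carries its own trailing whitespace.

```python
import re

def drop_matching_sentences(sentences, pattern):
    """Join the sentences that do not match pattern, keeping each sentence's own whitespace."""
    return "".join(s for s in sentences if not pattern.search(s))

# Reduced keyword list for the example; whole-word matches only
negation_pattern = re.compile(r"\b(?:not|no)\b")

# Made-up sentences, each ending with the whitespace that followed it
sample_sentences = ["They were seen today. ",
                    "They have no family history. ",
                    "Labs are pending."]
cleaned = drop_matching_sentences(sample_sentences, negation_pattern)
print(cleaned)
```

The result is a plain string, so it would still need to be passed through `nlp` if a Doc object is required downstream.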