
# Introduction to NLP with Python - pattern matching
## Negation detection

A brief introduction to negation detection for information extraction in Python, using a simplified version of the NegEx algorithm.


Written by Sumithra Velupillai, May 2020


## The NegEx algorithm

The NegEx algorithm is a widely used algorithm in clinical NLP. It is a simple pattern matching algorithm that relies on two main lexicons:

* a list of terms/concepts that are the main concepts of interest for the information extraction problem, e.g. diagnoses, symptoms. These are called target terms. 

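* a list of terms that indicate negation. In the original version of NegEx, the negation terms were classified as pre-negation terms (which precede the target), post-negation terms (which follow the target), and pseudo-negation terms (i.e. ambiguous phrases that contain a negation word but do not actually negate the target, such as "no increase").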
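
As a rough illustration of how the three categories of negation terms could be used, the sketch below looks for pre-negation terms before a target and post-negation terms after it, and ignores a negation word that is part of a pseudo-negation phrase. The term lists and the helper name `classify_target` are made up for this example; they are not the NegEx lexicon, and the word window is left out for brevity.

```python
# Illustrative lexicons only (assumed for this sketch, not the NegEx lists)
PRE_NEGATION = {"no", "not", "denies", "without"}   # expected before the target
POST_NEGATION = {"unlikely", "absent"}               # expected after the target
PSEUDO_NEGATION = {("no", "increase"), ("not", "only")}  # look like negation, but are not


def classify_target(words, target_index):
    """Return True if the word at target_index looks negated (hypothetical helper)."""
    lowered = [w.lower() for w in words]
    # positions of negation words that belong to a pseudo-negation phrase
    masked = set()
    for i in range(len(lowered) - 1):
        if (lowered[i], lowered[i + 1]) in PSEUDO_NEGATION:
            masked.add(i)
    before = [w for i, w in enumerate(lowered[:target_index]) if i not in masked]
    after = lowered[target_index + 1:]
    return any(w in PRE_NEGATION for w in before) or any(w in POST_NEGATION for w in after)


print(classify_target(["patient", "denies", "fever"], 2))     # True
print(classify_target(["no", "increase", "in", "fever"], 3))  # False
print(classify_target(["fever", "is", "unlikely"], 0))        # True
```

The rest of this notebook uses a single flat list of negation terms instead, which is simpler but cannot tell these cases apart.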

In simple terms, the algorithm works in the following way:

* For each sentence, look for target terms.
* If a target term is found, check if this term is negated. This is done by looking at the surrounding words in a window of +/- 5 words within the sentence.
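
The window check in the second step can be sketched on a plain list of words. The helper name `negated_in_window` and its parameters are assumptions made for this illustration, not part of NegEx itself.

```python
def negated_in_window(words, target_index, negation_terms, window=5):
    """Check up to `window` words on each side of the target for a negation term."""
    start = max(0, target_index - window)
    before = words[start:target_index]
    after = words[target_index + 1:target_index + 1 + window]
    return any(w.lower() in negation_terms for w in before + after)


words = ["She", "was", "not", "suicidal"]
print(negated_in_window(words, 3, {"no", "not"}))  # True

far = ["not", "a", "b", "c", "d", "e", "suicidal"]
print(negated_in_window(far, 6, {"no", "not"}))    # False: "not" is 6 words away
```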


The original article: 

Chapman et al. A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries, 
Journal of Biomedical Informatics Volume 34, Issue 5, October 2001, Pages 301-310

https://www.sciencedirect.com/science/article/pii/S1532046401910299

There are a few extended versions of this algorithm, where other modifiers are taken into account (e.g. uncertainty, experiencer), where several types of targets can be defined, where the scope of a modifier is dealt with differently, etc.  

Some relevant publications:


Harkema et al. ConText: An Algorithm for Determining Negation, Experiencer, and Temporal Status From Clinical Reports.
J Biomed Inform. 2009 Oct;42(5):839-51. doi: 10.1016/j.jbi.2009.05.002. Epub 2009 May 10.

https://pubmed.ncbi.nlm.nih.gov/19435614/


Chapman et al. Document-Level Classification of CT Pulmonary Angiography Reports based on an Extension of the ConText Algorithm.
J Biomed Inform. 2011 Oct; 44(5): 728–737. doi: 10.1016/j.jbi.2011.03.011
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3164892/


Chapman et al. Extending the NegEx Lexicon for Multiple Languages
Stud Health Technol Inform. 2013;192:677-81.
https://pubmed.ncbi.nlm.nih.gov/23920642/


Examples of using this in the mental health domain:

Downs et al. Detection of Suicidality in Adolescents with Autism Spectrum Disorders: Developing a Natural Language Processing Approach for Use in Electronic Health Records
AMIA Annu Symp Proc. 2017; 2017: 641–649.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5977628/


Velupillai et al. Identifying Suicidal Adolescents From Mental Health Records Using Natural Language Processing
Stud Health Technol Inform. 2019 Aug 21;264:413-417. doi: 10.3233/SHTI190254.

https://pubmed.ncbi.nlm.nih.gov/31437956/


We'll use pandas to store the outputs in a dataframe.

In [1]:
import pandas as pd

A key package for pattern matching with regular expressions is 're'; we need to import that too.

In [2]:
import re
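
To give a feel for how 're' behaves on single words, here is a small sketch. The word list is invented for the example; the `\b` word-boundary variant is one possible way to tighten a pattern, not something the rest of this notebook relies on.

```python
import re

pattern = re.compile("suicid", flags=re.I)  # re.I makes the match case-insensitive
for word in ["suicidal", "Suicidal", "suicide", "thoughts"]:
    print(word, bool(pattern.search(word)))

# A bare substring pattern can match more than intended ...
print(bool(re.search("die", "diet", flags=re.I)))       # True
# ... so word boundaries can be added when only the whole word should match
print(bool(re.search(r"\bdie\b", "diet", flags=re.I)))  # False
print(bool(re.search(r"\bdie\b", "die", flags=re.I)))   # True
```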

We will use spaCy again for tokenizing.

spaCy: https://spacy.io/


spaCy has a default language model for English that we will load into the variable 'nlp'.

In [5]:
try:
    import spacy
except ImportError:
    # spaCy is not installed: install it from within the notebook
    !pip install spacy
    import spacy
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # spacy.load raises OSError when the model is missing: download it, then retry
    !python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')
    

Let's define a function to extract the words from a sentence, excluding punctuation.

In [6]:
import string
def get_spacy_tokens(sentence):
    # return the text of each spaCy token, skipping tokens found in string.punctuation
    return [str(token) for token in sentence if str(token) not in string.punctuation]
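
One detail worth knowing: `str(token) not in string.punctuation` is a substring test against one long string of punctuation characters, so a multi-character token such as `...` is kept. The sketch below shows this on a hand-written token list (no spaCy needed) and compares it with a character-by-character check, which is an assumed alternative rather than what this notebook uses.

```python
import string

tokens = ["She", "did", "n't", "want", "to", "die", ".", "..."]

# substring test: only tokens that occur inside string.punctuation are dropped
substring_filter = [t for t in tokens if t not in string.punctuation]
print(substring_filter)  # ['She', 'did', "n't", 'want', 'to', 'die', '...']

# assumed alternative: drop a token only if every character is punctuation
char_filter = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
print(char_filter)       # ['She', 'did', "n't", 'want', 'to', 'die']
```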

Then we need a function that implements the NegEx algorithm. It returns a dataframe with one row per sentence: the sentence text, the list of target terms found (empty if none), and a boolean indicating whether any target term in the sentence is negated.

In [7]:
def simple_negex(doc, target_terms, negation_terms, window=5):
    # compile each target pattern once, case-insensitively
    target_patterns = [re.compile(t, flags=re.I) for t in target_terms]
    negations = {n.lower() for n in negation_terms}
    negated_sentences = []
    for sentence in doc.sents:
        words = get_spacy_tokens(sentence)
        t_word = []
        negated = False
        # enumerate gives the true position of each word, even if the word is repeated
        for i, w in enumerate(words):
            if not any(p.search(w) for p in target_patterns):
                continue
            # target term found, save
            t_word.append(w)
            # look for negation in a window of +/- `window` words around the target
            start = max(0, i - window)
            context = words[start:i] + words[i + 1:i + 1 + window]
            if any(c.lower() in negations for c in context):
                negated = True
        negated_sentences.append([str(sentence), t_word, negated])
    df = pd.DataFrame(negated_sentences, columns=['sentence', 'target terms', 'negated'])
    return df
    

Let's create a sample document, a list of target terms, and a list of negation terms.

In [8]:

text = "The patient denies having suicidal thoughts. This was not an intentional overdose. She has been suicidal in the past. Suicidal ideation was not intentional."

## we'll use spacy for tokenizing
doc = nlp(text)

## a simple list of target terms
target_terms = ['suicid']

## a simple list of negation terms
negation_terms = ['no', 'not']

negated_sentences = simple_negex(doc, target_terms, negation_terms)

What results did we get?

In [9]:
negated_sentences

| | sentence | target terms | negated |
|---|---|---|---|
| 0 | The patient denies having suicidal thoughts. | [suicidal] | False |
| 1 | This was not an intentional overdose. | [] | False |
| 2 | She has been suicidal in the past. | [suicidal] | False |
| 3 | Suicidal ideation was not intentional. | [Suicidal] | True |


What do you think about these results? Are there any terms missing as targets? As negations? 

*Try adding new terms, changing sentences!*

In [59]:

text = "The patient denies having suicidal thoughts. This was not an intentional overdose. She has been suicidal in the past. Suicidal ideation was not intentional. She was not suicidal. She didn't want to die. He wasn't suicidal at all. Paddington bear's owner was suicidal."

## we'll use spacy for tokenizing
doc = nlp(text)

## a simple list of target terms
target_terms = ['suicid', 'die', 'overdose']

## a simple list of negation terms
negation_terms = ['no', 'not', "n't", 'denies']

negated_sentences = simple_negex(doc, target_terms, negation_terms)

In [61]:
negated_sentences

| | sentence | target terms | negated |
|---|---|---|---|
| 0 | The patient denies having suicidal thoughts. | [suicidal] | True |
| 1 | This was not an intentional overdose. | [overdose] | True |
| 2 | She has been suicidal in the past. | [suicidal] | False |
| 3 | Suicidal ideation was not intentional. | [Suicidal] | True |
| 4 | She was not suicidal. | [suicidal] | True |
| 5 | She didn't want to die. | [die] | True |
| 6 | He wasn't suicidal at all. | [suicidal] | True |
| 7 | Paddington bear's owner was suicidal. | [suicidal] | False |


In [60]:
for sentence in doc.sents:
    print(get_spacy_tokens(sentence))


['The', 'patient', 'denies', 'having', 'suicidal', 'thoughts']
['This', 'was', 'not', 'an', 'intentional', 'overdose']
['She', 'has', 'been', 'suicidal', 'in', 'the', 'past']
['Suicidal', 'ideation', 'was', 'not', 'intentional']
['She', 'was', 'not', 'suicidal']
['She', 'did', "n't", 'want', 'to', 'die']
['He', 'was', "n't", 'suicidal', 'at', 'all']
['Paddington', 'bear', "'s", 'owner', 'was', 'suicidal']
