# Rule-Based NLP Pipeline

This notebook explains, in plain language, how to build a rule-based NLP system. By the end of this module you will know what such a system is and how it works. Let's start with a simple task and some basic intuition.

Suppose we have a sentence:

In [1]:
input='The patient came in with fever.'

and we want to know whether this sentence tells us that the patient has a fever. What is your first instinct for implementing a solution?

### A naïve solution

A very naïve solution is to search for the exact string "The patient came in with fever".

In [2]:
rule='The patient came in with fever'
print(rule in input)

True


But this solution has a big problem. **What is it**?

---

Think about other cases. The same finding could be written as:
- 'The pt got a fever.'
- 'The pt developed a fever.'
- 'Patient presented with fever.'
- 'Findings: fever.', etc.

What can you do to handle cases like these? 

---


### A better and neater solution

As you may have noticed, all the examples above share one characteristic: they all contain the word "fever". So we probably do not need to match every word in a sentence to determine whether the patient has a fever. Instead, we can simply check whether the sentence contains the keyword "fever".


In [3]:
rule='fever'
print(rule in input)

True
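
To see that the single keyword generalizes, the check can be run over the variants listed earlier. This is a minimal sketch: the names `examples` and `mentions_fever` are introduced here for illustration, and lowercasing the text first is an added assumption so that "Fever" at the start of a sentence would also match.

```python
# Example sentences taken from the list above
examples = [
    'The patient came in with fever.',
    'The pt got a fever.',
    'The pt developed a fever.',
    'Patient presented with fever.',
    'Findings: fever.',
]

def mentions_fever(sentence, keyword='fever'):
    """Return True if the keyword occurs anywhere in the sentence (case-insensitive)."""
    return keyword in sentence.lower()

for sentence in examples:
    print(mentions_fever(sentence), ' -- ', sentence)
```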


## Named Entity Recognition

In fact, we just implemented a very simple named entity recognition (NER) solution! Of course it's not perfect, but it works. Named entity recognition is a subtask of NLP that aims to identify named entities or concepts in free text.

In the example above, "fever" is our target concept, and we use simple string matching to identify it in a sentence. Besides string matching, another commonly used technique for named entity recognition is regular expressions.

Here is an exercise to try:



In [4]:
input='He got a high temperature yesterday, T 102.5.'

How can you identify the numeric temperature? Of course, for this sentence alone you can still use string matching. But what if another case has a different number? In such cases, a regular expression is more practical.

If you are not familiar with regular expressions, check out [this notebook](m5_2_Regular_Expression.ipynb) for a taste of RegEx.


In [None]:
import re
## define your regular expression rule to match "T 102.5"
rule=r'' #Add Regular expression to capture temperature
pattern=re.compile(rule)
res=pattern.search(input)

## test your result, see if it matches your expectation
res.group()=='T 102.5'


If the output above shows "*False*", please check the code above. The variable `rule` must be defined so that it matches the target pattern.
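
If you get stuck, here is one possible pattern among many. It is only a sketch: it assumes the temperature is always written as a capital `T`, a space, and a two- or three-digit number with an optional single decimal. The names `temperature_pattern` and `find_temperature` are introduced here for illustration.

```python
import re

# 'T', a space, 2-3 digits, then an optional decimal part such as '.5'
temperature_pattern = re.compile(r'T \d{2,3}(?:\.\d)?')

def find_temperature(text):
    """Return the first temperature mention in text, or None if there is none."""
    match = temperature_pattern.search(text)
    return match.group() if match else None

print(find_temperature('He got a high temperature yesterday, T 102.5.'))
print(find_temperature('Afebrile this morning, T 98.6.'))
```

Checking the match for `None` before calling `group()` avoids an `AttributeError` when a sentence contains no temperature at all.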

----------

More advanced named entity recognition solutions can handle syntactic similarity, matching similar phrases or words even in a slightly different order, or semantic similarity, matching synonyms. These techniques are beyond the scope of this notebook.

### Question:
Usually, identifying the named entities alone is not enough. **Why?**

--- 

What if it says 
- 'Discussed about possible fever.' or 
- 'No fever.'

How do we determine whether a patient has a fever based on these sentences?

---

## Context Detection

The examples above show how important context is, especially in clinical NLP, where many concepts are mentioned as "*not present in patient*". Now let's think about how to implement a context detector. Again, let's start with a very intuitive approach.


### A simple solution
Use regular expressions to match the context information and the target concepts together. For example,

In [None]:
cases =['The patient denies any fever.', 'The pt denies fever.', 'The patient denies any chills or fever.' , 'Patient presented with fever.']

In [None]:
rule=r'denies( \w+)* fever'
negation_pattern=re.compile(rule)
for case in cases:
    res=negation_pattern.search(case)
    print('negated' if res is not None else 'not negated' , ' -- ', case)

This solution works for these cases. But what if we have a large vocabulary of target concepts? We would have to combine every negation word with every synonym of every target concept. That is neither efficient nor convenient.

### A smarter solution

If we separate the context identification (negated vs not negated) from the target concept identification, we can then check whether a target concept is modified by any context information. This way we do not have to define all possible combinations, but simply loop through the list of context rules for each concept. And most importantly, once we build this set of context rules, we can reuse it in other similar tasks without major tweaks.

This is the basic outline of the ConText algorithm (Harkema H, Dowling JN, Thornblade T, Chapman WW. [ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports.](https://www.ncbi.nlm.nih.gov/pubmed/19435614) J Biomed Inform 2009;42(5):839-851). 

Though the ConText algorithm itself is much more complicated, using it is easy. For more details on how to work with the ConText algorithm, please see the [m5_3_Intro_pyConText.ipynb](m5_3_Intro_pyConText.ipynb) notebook.

Here we are going to implement a very simple version of the ConText algorithm:


In [None]:
# First we need to define the rule to identify the target concept---fever
target_rule='fever'

# Second, we need to define the rules to identify the negation clue -- denies
context_rule='denies'

for input in cases: 
    print(input,' -- ',target_rule,"is", 'negated' if target_rule in input and context_rule in input else 'not negated')
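
The same idea extends to several target concepts and several context clues by keeping them in two separate lists. The sketch below is illustrative only: the term lists and the function name `detect` are assumptions made for this example, and plain substring matching stands in for a real NER and ConText component.

```python
# Hypothetical vocabularies, kept separate so each list can grow on its own
target_terms = ['fever', 'chills']
context_terms = ['denies', 'without']

def detect(sentence):
    """Return a {target: status} dict for every target term found in the sentence."""
    text = sentence.lower()
    negated = any(term in text for term in context_terms)
    return {target: ('negated' if negated else 'not negated')
            for target in target_terms if target in text}

for case in ['The patient denies any chills or fever.',
             'Patient presented with fever.']:
    print(case, ' -- ', detect(case))
```

With this layout, adding a new negation clue means adding one entry to `context_terms` rather than one combined pattern per target concept.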


### Context detection within a document

Now we've learned how to detect context within a sentence. In practice, we need to do the same for multiple sentences within a document, which raises another issue. For example:

In [None]:
input='''
No vomiting, chest pain, shortness of breath, nausea, dizziness, or chills on arrival.
On operative day three, the patient fever was detected with temperature 101.5 F.
After 3 days no fever was detected.
Patient came back for a follow up, denies fever.
'''

As humans, we can easily tell that the word "No" does not apply to "fever", because they are in different sentences. But how can we make a computer reason in a similar way?

## Sentence Segmentation

This is why we need sentence segmentation. Sentence segmentation (also called sentence detection) is another subtask of NLP. It sets the boundaries within which the downstream NLP components work. Let's implement a simple sentence segmenter to demonstrate how it helps context detection.

In [None]:
# This segmenter splits the text into sentences at each newline character.
sentences=input.split('\n')

Now we can apply our previous NER and context detection to each sentence:

In [None]:
# Define the rule to identify the target concept---fever
target_rule=re.compile(r'fever')

# Define the rules to identify the negation clues---No, no, denies
# \b keeps "no" from matching inside longer words such as "not" or "normal"
context_rule= re.compile(r'\b(?:[Nn]o|denies)\b')

# Let's define a "check" function to mimic the context detection
def check(sentence):
    if target_rule.search(sentence):
        if context_rule.search(sentence):
            return "negated"
        else:
            return "not negated"
    else:
        return "NA"

for sentence in sentences:
    print(check(sentence), ' -- ' , sentence)

Again, there are many ways to segment sentences. See the [5_4_Sentence_Segmentation.ipynb](5_4_Sentence_Segmentation.ipynb) notebook to learn more.
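
When sentences are not already on separate lines, a regular expression can serve as a slightly more general splitter. This is a naive sketch, not the approach used in the linked notebook: it assumes a sentence ends with `.`, `!`, or `?` followed by whitespace, so abbreviations such as "Dr. Smith" would be split incorrectly. The name `split_sentences` is introduced here for illustration.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace; drop empty pieces."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    return [piece for piece in pieces if piece]

note = 'No chills on arrival. Temperature was 101.5 F. After 3 days no fever was detected.'
for sentence in split_sentences(note):
    print(sentence)
```

The decimal point in "101.5" is not treated as a boundary here because it is followed by a digit rather than whitespace.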

Now it seems we have all the pieces we need to build the NLP pipeline.

...Not exactly. Remember, our goal is a conclusion at the document level: does the document mention something we care about? So far we only have conclusions at the sentence level. One piece is still missing: deriving the document-level conclusion from the sentence-level conclusions.

## Document Classification

Rule-based document classification relies entirely on rules that define how sentence-level conclusions lead to a document-level conclusion.

In [None]:
sentences=input.split('\n')
print(sentences)

To keep it simple, let's just use basic pattern matching to find the target concepts.

In [None]:
# Define the target rules as one pattern (the dot is escaped so it matches a literal ".")
target_rule=re.compile(r'fever|101\.5 F')

# Define the rules to identify the negation clues---No, no, denies
context_rule= re.compile(r'\b(?:[Nn]o|denies)\b')

def check(sentence):
    if target_rule.search(sentence):
        if context_rule.search(sentence):
            return "negated"
        else:
            return "not negated"
    else:
        return "NA"

results=[]
for sentence in sentences:
    results.append(check(sentence))

Now we have a list of sentence-level conclusions.

In [None]:
results

We can define a simple function that takes this list as input and outputs a document-level conclusion.

In [None]:
# If at least one result is not negated, the document is labeled as not negated.
def classifier(results):
    for result in results:
        if result == "not negated":
            return "not negated"      
    return "NA"

In [None]:
# Test your classifier
print(classifier(results)=="not negated")
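
The classifier above only distinguishes "not negated" from "NA". If a separate "negated" document label is wanted, one possible three-way variant is sketched below. The priority order (any positive mention wins, then any negated mention, otherwise "NA") is an assumption of this sketch, and `classify_three_way` is a hypothetical name.

```python
def classify_three_way(results):
    """Combine sentence-level labels into one of three document-level labels."""
    if "not negated" in results:
        return "not negated"   # at least one positive mention
    if "negated" in results:
        return "negated"       # the concept is mentioned, but only in negated form
    return "NA"                # the concept is never mentioned

print(classify_three_way(["NA", "not negated", "negated", "negated"]))
print(classify_three_way(["NA", "negated"]))
print(classify_three_way(["NA", "NA"]))
```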

## NLP pipeline
By now you may have some idea of what a **pipeline** is. We need several different NLP components, organized in an appropriate order, to process the input.

To sum up the key points from above, a rule-based NLP pipeline includes the following components:

- sentence segmenter, 
- named entity recognizer, 
- context detector, and 
- a document classifier.
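
Put together, the four components run in the order listed. The sketch below chains toy versions of each step. Every function name and rule in it is a simplified stand-in chosen for this example, not a production component.

```python
import re

TARGET = re.compile(r'fever', re.IGNORECASE)
NEGATION = re.compile(r'\b(no|denies)\b', re.IGNORECASE)

def segment(document):
    """Sentence segmenter: one sentence per non-empty line."""
    return [line.strip() for line in document.split('\n') if line.strip()]

def label_sentence(sentence):
    """Named entity recognizer plus context detector for a single sentence."""
    if not TARGET.search(sentence):
        return 'NA'
    return 'negated' if NEGATION.search(sentence) else 'not negated'

def classify_document(labels):
    """Document classifier: any non-negated mention makes the document positive."""
    return 'not negated' if 'not negated' in labels else 'NA'

def run_pipeline(document):
    """Run segmentation, sentence labeling, and document classification in order."""
    labels = [label_sentence(sentence) for sentence in segment(document)]
    return classify_document(labels)

document = '''No vomiting or chills on arrival.
On day three, fever was detected.
Patient came back for a follow up, denies fever.'''
print(run_pipeline(document))
```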

Now, **forget what you have coded above. That code was only meant to help you understand** the functionality of each component and why we need it.


### Build a state-of-the-art rule-based NLP pipeline

To build a state-of-the-art rule-based NLP pipeline, we don't need to start from scratch. Instead, we can reuse existing components.

In [None]:
from PyRuSH.RuSH import RuSH
from pyConTextNLP import pyConTextGraph
from pyConTextNLP.utils import get_document_markups


from DocumentClassifier import FeatureInferencer
from DocumentClassifier import DocumentInferencer
from nlp_pneumonia_utils import markup_sentence
from itemData import get_item_data
from visual import convertMarkups2DF

In [None]:
# Because there are many sentence segmentation rules, let's read them from an external file
sentence_rules='KB/rush_rules.tsv'
# You can point target_rules to a file path instead, if there are many rules
# (raw string, so the backslashes in the Regex field reach the rule parser unchanged)
target_rules=r'''
Comments: ''
Direction: ''
Lex: fever
Regex: ''
Type: FEVER
---
Comments: ''
Direction: ''
Lex: high temperature
Regex: '1\d\d\.\d F'
Type: FEVER'''
# Context rules are often lengthy; you can point context_rules to an external rule file instead
context_rules='''Comments: ''
Direction: forward
Lex: 'no'
Regex: ''
Type: DEFINITE_NEGATED_EXISTENCE
---
Comments: ''
Direction: forward
Lex: 'denies'
Regex: ''
Type: DEFINITE_NEGATED_EXISTENCE
'''
# define the feature inference rule
feature_inference_rule='''
#Conclusion type, Evidence type, Modifier values associated with the evidence
NEGATED_CONCEPT,FEVER,DEFINITE_NEGATED_EXISTENCE
'''
# define the document inference rule
document_inference_rule='''
#Conclusion Type at document level, Evidence type at mention level
FEVER_DOC,FEVER

#Default document type
NO_FEVER
'''

sentence_segmenter = RuSH(sentence_rules)
feature_inferencer=FeatureInferencer(feature_inference_rule)
document_inferencer = DocumentInferencer(document_inference_rule)

targets=get_item_data(target_rules)
modifiers=get_item_data(context_rules)
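
The `Direction: forward` field in the context rules means, as understood here, that a modifier such as "no" or "denies" applies to targets that come after it in the sentence. The sketch below illustrates that idea with plain regular expressions. It is a simplified illustration and not how pyConText is implemented, and the name `is_negated_forward` is hypothetical.

```python
import re

modifier_pattern = re.compile(r'\b(no|denies)\b')
target_pattern = re.compile(r'\bfever\b')

def is_negated_forward(sentence):
    """Return True if a negation modifier appears before a target mention."""
    text = sentence.lower()
    modifier = modifier_pattern.search(text)
    if modifier is None:
        return False
    # Only look for the target in the text that follows the modifier
    return target_pattern.search(text, modifier.end()) is not None

print(is_negated_forward('Patient denies fever.'))
print(is_negated_forward('Fever persists, no chills.'))
```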

#### 1) Let's split sentences

In [None]:
sentences=sentence_segmenter.segToSentenceSpans(input)

In [None]:
# See what the document was split into
for sentence in sentences:
    print("Sentence({}-{}):\t{}".format(sentence.begin, sentence.end, input[sentence.begin:sentence.end]))
    print('\n'+'-'* 100+'\n')


#### 2) Apply the ConText algorithm to each sentence

pyConText has a built-in NER and ConText detector that can be applied in a single call, so we don't need to run them separately.

In [None]:
# Initialize a ConTextDocument to hold the pyConText output
context_doc = pyConTextGraph.ConTextDocument()

for sentence in sentences:
    sentence_text=input[sentence.begin:sentence.end].lower()
    # Process every sentence by adding markup
    m = markup_sentence(sentence_text, modifiers=modifiers, targets=targets)
    context_doc.addMarkup(m)
    context_doc.getSectionMarkups()
    print(m)


In [None]:
# See the context output in XML format
print(context_doc.getXML())

#### 3) Apply document classification

The graph output is not convenient to use or display, so let's first convert it into [pandas](https://pandas.pydata.org/) [dataframes](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).

In [None]:
# convert graph markups into dataframes
markups = get_document_markups(context_doc)
annotations, relations, doc_txt = convertMarkups2DF(markups) 


Now it's easier to read the results in tables:

In [None]:
display(annotations)
display(relations)

Next, we can use the annotations and relations as input to make the document inference.

In [None]:
# apply inferences for document classification
inferenced_types = feature_inferencer.process(annotations, relations)
print('After inference from modifier values, we got these types:\n '+str(inferenced_types))
doc_class = document_inferencer.process(inferenced_types)
print('\nDocument classification: '+ doc_class )
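
To make the two inference steps less opaque, here is a plain-Python reading of the rules defined earlier. It is only an interpretation of the rule comments (conclusion type, evidence type, modifier value), not the actual `FeatureInferencer` or `DocumentInferencer` code. The function names and the `(type, modifiers)` tuple format are assumptions made for this sketch.

```python
def infer_features(mentions):
    """Map each (type, modifiers) mention to a conclusion type.

    Mirrors the rule 'NEGATED_CONCEPT,FEVER,DEFINITE_NEGATED_EXISTENCE'.
    """
    conclusions = []
    for mention_type, mention_modifiers in mentions:
        if mention_type == 'FEVER' and 'DEFINITE_NEGATED_EXISTENCE' in mention_modifiers:
            conclusions.append('NEGATED_CONCEPT')
        else:
            conclusions.append(mention_type)
    return conclusions

def infer_document(conclusions):
    """Mirror the rule 'FEVER_DOC,FEVER' with 'NO_FEVER' as the default."""
    return 'FEVER_DOC' if 'FEVER' in conclusions else 'NO_FEVER'

# Hypothetical mentions: one plain fever mention and one negated fever mention
mentions = [('FEVER', []), ('FEVER', ['DEFINITE_NEGATED_EXISTENCE'])]
conclusions = infer_features(mentions)
print(conclusions)
print(infer_document(conclusions))
```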

In [None]:
from visual import view_pycontext_output
from visual import Vis

In [None]:
view_pycontext_output(context_doc, Vis(file_name="context_graph.html"))

The image above illustrates relationships built by the pyConText algorithm.

### Have questions? Please ask!