# Rule-Based NLP Pipeline

If this title doesn't make sense to you yet, don't worry. This notebook explains the idea in plain language, and by the end of this module you will know what a rule-based NLP pipeline is and how it works. Let's start with a simple task and some basic intuition.

Suppose we have a sentence:

In [1]:
input='The patient came in with fever.'

and we want to know whether this sentence tells us that the patient has a fever. What is your first instinct for implementing a solution?

### A naïve solution

A very naïve solution would be to look for the exact string "The patient came in with fever".

In [3]:
rule='The patient came in with fever'
print(rule in input)

True


But this solution has a big problem. **What is it**?

<div title="Think about other cases. There may be a case like 'The pt got a fever.' or 'The pt developed a fever.' etc. What can you do if you define rules like the one above?">Hint</div>

### A better and neat solution

As you may have noticed, all the examples above share a common characteristic: they all contain the word "fever". So we probably don't need to match every word in the sentence. Instead, we can simply check whether a sentence contains the keyword "fever".


In [4]:
rule='fever'
print(rule in input)

True
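To see why the keyword rule generalizes better than the full-sentence rule, here is a minimal sketch that applies it to a few phrasings. The `examples` list and the lowercasing step are illustrative assumptions, not part of the original exercise.

```python
# Example sentences: the first comes from the cell above, the second from the
# hint, and the third is a hypothetical capitalized variant.
examples = [
    'The patient came in with fever.',
    'The pt got a fever.',
    'The pt developed a Fever.',
]

keyword = 'fever'

# Lowercase each sentence so that "Fever" and "fever" both match.
matches = [keyword in sentence.lower() for sentence in examples]
print(matches)
```

All three sentences match, even though only one of them would have matched the full-sentence rule.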


## Named Entity Recognition

In fact, we just implemented a very simple named entity recognition solution! Of course it's not perfect, but it works. Named entity recognition is a subtask of NLP, which aims to identify the named entities or concepts in free text.

In the example above, "fever" is our target concept, and we used a simple string match to identify it in a sentence. Besides string matching, another commonly used technique for named entity recognition is regular expressions.

Here is an exercise to try:



In [5]:
input='He got a high temperature yesterday, T 102.5.'

How can you identify the numeric temperature? Of course, for this sentence alone, you can still use a string match. But what if another case has a different number? In such cases, regular expressions are more efficient.

If you don't know what a regular expression is, you can check out [this notebook](2.1_Regular_Expression.ipynb) to get a taste.


In [6]:
import re
## define your regular expression rule
rule=r''
pattern=re.compile(rule)
res=pattern.search(input)
## test your result, see if it matches your expectation
res.group()=='T 102.5'


False
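If you get stuck, here is one possible answer as a minimal sketch. The pattern below is only an assumption about how temperatures are written (a capital "T", a space, then a number with an optional decimal part); real notes will need more variants.

```python
import re

sentence = 'He got a high temperature yesterday, T 102.5.'

# "T", a space, one or more digits, then an optional decimal part.
temperature_pattern = re.compile(r'T \d+(\.\d+)?')

match = temperature_pattern.search(sentence)
print(match.group())
```

The same pattern also matches a sentence with a different reading, such as "T 99.1", which a fixed string rule would miss.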

More advanced named entity recognition solutions can handle syntactic similarities, matching similar phrases or words even in a slightly different order, and semantic similarities, matching synonyms. We are not going to cover those here.

### Question:
Usually, identifying the named entities alone is not enough. Why?

<div title="What if it says 'Discussed about possible fever.' or 'No fever.'">Hint</div>

## Context Detection

From the examples above, we can see how important context information is, especially in clinical NLP, where many concepts are mentioned only to say that they are absent. Now let's think about how to implement a context detector. Again, let's start with a very intuitive approach.


### A simple solution
Use regular expressions to match the context information and the target concepts together. For example,

In [7]:
cases =['The patient denies any fever.', 'The pt denies fever.',' The patient denies any chills or fever.']

In [8]:
rule=r'denies( \w+)* fever'
negation_pattern=re.compile(rule)
for case in cases:
    res=negation_pattern.search(case)
    print('negated' if res is not None else 'not negated')

negated
negated
negated


Of course this solution works. But what if we have a long vocabulary list for the target concepts? We would have to write out every negation word combined with every synonym, which is neither efficient nor convenient.

### A smarter solution

Let's separate context identification from target concept identification. Then we can use code to check whether a target concept is modified by any context information. This way we don't have to define the combinations by hand. Most importantly, once we build a set of context rules, we can reuse it in other similar tasks with little or no tweaking.

This is the basic intuition behind the context algorithm, although the full algorithm does more than this. For more detailed information about the context algorithm, please check [this notebook](2.4_Intro_pyConText.ipynb).

Here we are going to implement a very simple version of the context algorithm:



In [9]:
# First we need to define the rule to identify the target concept---fever
target_rule=''

# Second, we need to define the rules to identify the negation clue
context_rule=''

print('negated' if target_rule in input and context_rule in input else 'not negated')

negated


Again, this is a very simplified context algorithm. You are encouraged to study [this notebook](2.4_Intro_pyConText.ipynb).
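To make the benefit of separating the two rule sets concrete, here is a minimal sketch. The vocabularies `target_terms` and `negation_cues` and the function `detect` are hypothetical names chosen for illustration. The matching is plain substring search, which is far cruder than what pyConText does.

```python
# Hypothetical vocabularies; either list can grow without touching the other.
target_terms = ['fever', 'chills']
negation_cues = ['denies', 'without']

def detect(sentence):
    """Return 'NA', 'negated', or 'not negated' for a single sentence."""
    text = sentence.lower()
    # Step 1: is any target concept mentioned at all?
    if not any(term in text for term in target_terms):
        return 'NA'
    # Step 2: does any negation cue appear in the same sentence?
    if any(cue in text for cue in negation_cues):
        return 'negated'
    return 'not negated'

for case in ['The pt denies fever.',
             'The patient developed a fever.',
             'Follow up in two weeks.']:
    print(case, '->', detect(case))
```

With two targets and two cues, the combined-pattern approach would need four rules; here each new term or cue is a single list entry.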

### Context detection within a document

Now we've learned how to detect context within a sentence. In practice, we need to do the same thing for multiple sentences within a document, which raises another issue. For example:

In [10]:
input='''....
No vomiting, chest pain, shortness of breath,nausea, vomiting or chills.
On operative day three, the patient developed a fever.
....'''

As humans, we can easily understand that the word "No" doesn't apply to "fever", because they are in different sentences. But how can we make a computer reason in a similar way?

## Sentence Segmentation

This is why we need sentence segmentation. Sentence segmentation/detection is another subtask of NLP. It sets up the boundaries for the downstream NLP components to work on. Let's implement a simple sentence segmenter to demonstrate how it helps context detection.

In [11]:
# This segmenter splits the text into sentences at newline characters.
sentences=input.split('\n')

Now we can apply our previous NER and context detection to each sentence:

In [12]:
# Define the rule to identify the target concept---fever
target_rule=''

# Define the rules to identify the negation clue---no
context_rule=''

# Let's define a "check" function to mimic the context detection
def check(sentence):
    if target_rule in sentence:
        if context_rule in sentence:
            return "negated"
        else:
            return "not negated"
    else:
        return "NA"

for sentence in sentences:
    print(sentence+':\t\t'+check(sentence))

....:		negated
No vomiting, chest pain, shortness of breath,nausea, vomiting or chills.:		negated
On operative day three, the patient developed a fever.:		negated
....:		negated


And again, there are many ways to segment sentences. You can check [this notebook](2.3_Sentence_Segmentation.ipynb) to learn more about it.
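As one illustration of an alternative to splitting on newlines, the sketch below splits on sentence-ending punctuation with a regular expression. This is a deliberately naive rule written for this example, not the approach used by any particular library.

```python
import re

text = ('No nausea, vomiting or chills. '
        'On operative day three, the patient developed a fever.')

# Split at whitespace that follows '.', '!' or '?'.
# Naive on purpose: it would also split after abbreviations such as "Dr. Smith".
sentences = re.split(r'(?<=[.!?])\s+', text)

for sentence in sentences:
    print(sentence)
```

After this split, the "No" and the "fever" land in different sentences, so a sentence-level negation check no longer confuses them.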

Now it seems we already have all the pieces we need to build the NLP pipeline.

...Not exactly. Remember, our goal is to reach a document-level conclusion about whether a document mentions something we care about. So far we only have sentence-level conclusions. One piece is still missing: deriving the document-level conclusion from the sentence-level ones.

## Document Classification

Rule-based document classification relies entirely on rules that define inferences from the sentence-level conclusions to the document level.

In [13]:
input='''No nausea, vomiting, fever or chills.
On operative day three, the patient spiked a temperature to 101.5 F.
'''
sentences=input.split('\n')

To keep it simple, let's just use string matching to find the target concepts.

In [14]:
# Define the rules in a list
target_rules=['fever','101.5 F']

# Define the rules to identify the negation clue---no
context_rule='No'

def check(sentence):
    res='NA'
    for target_rule in target_rules:        
        if target_rule in sentence:        
            if context_rule in sentence:
                res="negated"
            else:
                res="not negated"
    return res

results=[]
for sentence in sentences:
    results.append(check(sentence))

Now we have a list of sentence-level conclusions.

In [15]:
results

['negated', 'not negated', 'NA']

We can define a simple function that takes this list as input and outputs one of three types of document-level conclusions.

In [16]:
# your document classifier
def classifier(results):
#   add your code
    
    
    return "NA"

In [17]:
# Test your classifier
print(classifier(results)=="not negated")

False
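If you want to compare your answer, here is one possible classifier as a minimal sketch. It assumes a precedence that the exercise does not state explicitly: a single affirmed mention outweighs any number of negated ones, and a negated mention outweighs no mention at all. The name `classify_document` is hypothetical.

```python
def classify_document(sentence_results):
    """Roll sentence-level conclusions up into one document-level conclusion."""
    # Assumed precedence: 'not negated' > 'negated' > 'NA'.
    if 'not negated' in sentence_results:
        return 'not negated'
    if 'negated' in sentence_results:
        return 'negated'
    return 'NA'

print(classify_document(['negated', 'not negated', 'NA']))
```

For the list produced above, this returns "not negated", which is what the test cell expects.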


## NLP pipeline
Now you may have some idea of what a **pipeline** is. We need several different NLP components, organized in an appropriate order, to process the input information.

To sum up the key points we learned above, a rule-based NLP pipeline includes the following components:

sentence segmenter, named entity recognizer, context detector, and a document classifier.
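The order in which these components run can be sketched as a single function. Everything here is a toy stand-in written for illustration: `run_pipeline` and the three `toy_*` functions are hypothetical, and the detector folds named entity recognition and context detection into one step.

```python
def run_pipeline(text, segmenter, detector, classifier):
    """Chain the components: segment, detect per sentence, then classify."""
    sentences = segmenter(text)
    sentence_results = [detector(sentence) for sentence in sentences]
    return classifier(sentence_results)

def toy_segmenter(text):
    # One sentence per non-empty line.
    return [line for line in text.split('\n') if line.strip()]

def toy_detector(sentence):
    # NER and context detection in one crude step.
    if 'fever' not in sentence.lower():
        return 'NA'
    return 'negated' if sentence.startswith('No ') else 'not negated'

def toy_classifier(results):
    # An affirmed mention wins over a negated one (an assumed precedence).
    if 'not negated' in results:
        return 'not negated'
    return 'negated' if 'negated' in results else 'NA'

doc = 'No fever or chills.\nOn day three, the patient developed a fever.'
print(run_pipeline(doc, toy_segmenter, toy_detector, toy_classifier))
```

Because each component is passed in as an argument, any one of them can be swapped for a stronger implementation without changing the others.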

Now, **forget what you have coded above. That code was only meant to help you understand** the functionality of each component and why we need it.

Next, we are going to

### Build a state-of-the-art rule-based NLP pipeline

To build a state-of-the-art rule-based NLP pipeline, we definitely don't need to start from scratch. Instead, we will borrow components from others.

In [18]:
from PyRuSH.RuSH import RuSH
from DocumentClassifier import FeatureInferencer
from DocumentClassifier import DocumentInferencer
from pyConTextNLP import pyConTextGraph
from nlp_pneumonia_utils import markup_sentence
from itemData import get_item_data
from pyConTextNLP.utils import get_document_markups
from visual import convertMarkups2DF

In [19]:
# because there are too many sentence segmentation rules, let's read them from an external file
sentence_rules='KB/rush_rules.tsv'
# you can point target_rules to a file path instead, if there are many rules
target_rules=r'''
Comments: ''
Direction: ''
Lex: fever
Regex: ''
Type: FEVER
---
Comments: ''
Direction: ''
Lex: high temperature
Regex: '1\d\d\.\d F'
Type: FEVER'''
# context rules are often lengthy; you can point context_rules to an external rule file instead
context_rules='''Comments: ''
Direction: forward
Lex: 'no'
Regex: ''
Type: DEFINITE_NEGATED_EXISTENCE
'''
# define the feature inference rule
feature_inference_rule='''
#Conclusion type, Evidence type, Modifier values associated with the evidence
NEGATED_CONCEPT,FEVER,DEFINITE_NEGATED_EXISTENCE
'''
# define the document inference rule
document_inference_rule='''
#Conclusion Type at document level, Evidence type at mention level
FEVER_DOC,FEVER

#Default document type
NO_FEVER
'''

sentence_segmenter = RuSH(sentence_rules)
feature_inferencer=FeatureInferencer(feature_inference_rule)
document_inferencer = DocumentInferencer(document_inference_rule)

targets=get_item_data(target_rules)
modifiers=get_item_data(context_rules)

#### 1) Let's split sentences

In [20]:
sentences=sentence_segmenter.segToSentenceSpans(input)

In [21]:
# See what the document was split into
for sentence in sentences:
    print("Sentence({}-{}):\t{}".format(sentence.begin, sentence.end, input[sentence.begin:sentence.end]))
    print('\n'+'-'* 100+'\n')


Sentence(0-37):	No nausea, vomiting, fever or chills.

----------------------------------------------------------------------------------------------------

Sentence(38-106):	On operative day three, the patient spiked a temperature to 101.5 F.

----------------------------------------------------------------------------------------------------



#### 2) Apply Context algorithm to each sentence

pyConText has a built-in NER and ConText detector, which can be applied in a single call, so we don't need to use them separately.

In [22]:
# initialize a ConTextDocument to hold the pyConText output
context_doc = pyConTextGraph.ConTextDocument()
for sentence in sentences:
    sentence_text=input[sentence.begin:sentence.end].lower()
    m = markup_sentence(sentence_text, modifiers=modifiers, targets=targets)
    context_doc.addMarkup(m)
    context_doc.getSectionMarkups()
    print(m)

__________________________________________
rawText: no nausea, vomiting, fever or chills.
cleanedText: no nausea, vomiting, fever or chills.
********************************
TARGET: <id> 82260465953830595506096534863875374696 </id> <phrase> fever </phrase> <category> ['fever'] </category> 
----MODIFIED BY: <id> 82260466033058758020360872457419325032 </id> <phrase> no </phrase> <category> ['definite_negated_existence'] </category> 
__________________________________________

__________________________________________
rawText: on operative day three, the patient spiked a temperature to 101.5 f.
cleanedText: on operative day three, the patient spiked a temperature to 101.5 f.
********************************
TARGET: <id> 82260466112286920534625210050963275368 </id> <phrase> 101.5 f </phrase> <category> ['fever'] </category> 
__________________________________________



In [23]:
# See the context output in XML format
print(context_doc.getXML())


<ConTextDocument>
no nausea, vomiting, fever or chills. on operative day three, the patient spiked a temperature to 101.5 f. <section>
<sectionLabel> document </sectionLabel>
<sentence>
<sentenceNumber> 0 </sentenceNumber>
<sentenceOffset> 0 </sentenceOffset></sentence>

<ConTextMarkup>
<rawText> no nausea, vomiting, fever or chills. </rawText>
<cleanText> no nausea, vomiting, fever or chills. </cleanText>
<nodes>

<node>
<category> modifier </category>

<tagObject>
<id> 82260466033058758020360872457419325032 </id>
<phrase> no </phrase>
<literal> no </literal>
<category> ['definite_negated_existence'] </category>
<spanStart> 0 </spanStart>
<spanStop> 2 </spanStop>
<scopeStart> 2 </scopeStart>
<scopeStop> 37 </scopeStop>
</tagObject>
<modifies>
<modifiedNode> 82260465953830595506096534863875374696 </modifiedNode>
</modifies>

</node>

<node>
<category> target </category>

<tagObject>
<id> 82260465953830595506096534863875374696 </id>
<phrase> fever </phrase>
<literal> fever </literal>
<

#### 3) Apply document classification

The graph output is not convenient to use or display, so let's first convert it into [pandas](https://pandas.pydata.org/) [dataframes](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).

In [24]:
# convert graph markups into dataframes
markups = get_document_markups(context_doc)
annotations, relations, doc_txt = convertMarkups2DF(markups) 


Now it's easier to read the results in tables:

In [25]:
display(annotations)
display(relations)

|  | markup_id | vis_category | start | end | txt | type |
|---|---|---|---|---|---|---|
| 0 | T0 | Target | 21 | 26 | fever | fever |
| 1 | T1 | Modifier | 0 | 2 | no | definite_negated_existence |
| 2 | T2 | Target | 98 | 105 | 101.5 f | fever |

|  | relation_id | type | arg1_cate | arg1_id | arg2_cate | arg2_id |
|---|---|---|---|---|---|---|
| 0 | R0 | definite_negated_existence | Modifier | T1 | Target | T0 |


Next, we can use the annotations and relations as input to make the document inference.

In [26]:
# apply inferences for document classification
inferenced_types = feature_inferencer.process(annotations, relations)
print('After inference from modifier values, we got these types:\n '+str(inferenced_types))
doc_conclusion = document_inferencer.process(inferenced_types)
print('\nDocument conclusion: '+doc_conclusion )

After inference from modifier values, we got these types:
 ['negated_concept', 'fever']

Document conclusion: fever_doc
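To make the two inference steps less of a black box, here is a plain-Python sketch of one reading of the rules defined earlier. It is an interpretation for illustration only, not the implementation inside `FeatureInferencer` or `DocumentInferencer`. The data structures and the functions `infer_feature` and `infer_document` are hypothetical.

```python
# Each mention: (concept type, set of modifier types attached to it).
mentions = [
    ('FEVER', {'DEFINITE_NEGATED_EXISTENCE'}),  # "no ... fever"
    ('FEVER', set()),                           # "101.5 f"
]

# (conclusion, evidence type, required modifier), mirroring feature_inference_rule.
feature_rules = [('NEGATED_CONCEPT', 'FEVER', 'DEFINITE_NEGATED_EXISTENCE')]
# (document conclusion, evidence type), mirroring document_inference_rule.
document_rules = [('FEVER_DOC', 'FEVER')]
default_conclusion = 'NO_FEVER'

def infer_feature(concept, modifiers):
    """Rewrite a mention's type when a feature rule matches its modifiers."""
    for conclusion, evidence, modifier in feature_rules:
        if concept == evidence and modifier in modifiers:
            return conclusion
    return concept  # no rule fired: keep the original type

def infer_document(types):
    """Pick the first document rule whose evidence type is present."""
    for conclusion, evidence in document_rules:
        if evidence in types:
            return conclusion
    return default_conclusion

types = [infer_feature(concept, mods) for concept, mods in mentions]
print(types)
print(infer_document(types))
```

Under this reading, the negated mention becomes `NEGATED_CONCEPT`, the unmodified temperature stays `FEVER`, and that one remaining `FEVER` is enough to conclude `FEVER_DOC`, which agrees with the output above apart from letter case.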


In [27]:
from visual import view_pycontext_output

In [28]:
view_pycontext_output(context_doc)