#  Identifying Entities in Healthcare Data
Syntactic Processing Assignment:

# Problem Statement

Now, let’s consider a hypothetical example of a health tech company called ‘BeHealthy’. Suppose ‘BeHealthy’ aims to connect the medical communities with millions of patients across the country. 

 

‘BeHealthy’ has a web platform that allows doctors to list their services and manage patient interactions and provides services for patients such as booking interactions with doctors and ordering medicines online. Here, doctors can easily organise appointments, track past medical records and provide e-prescriptions.

 

So, companies like ‘BeHealthy’ provide medical services, prescriptions and online consultations, and generate huge amounts of data every day.

 

Let’s take a look at the following snippet of medical data that may be generated when a doctor is writing notes to his/her patient or as a review of a therapy that he or she has done.

 

“The patient was a 62-year-old man with squamous cell lung cancer, which was first successfully treated by a combination of radiation therapy and chemotherapy.”

 

In this text, a reader without a medical background can probably recognise terms such as ‘cancer’ and ‘chemotherapy’, but may not follow the rest of the medical vocabulary. We have deliberately taken a simple sentence from a medical data set to illustrate the problem.

Suppose you have been given such a data set, containing a large amount of text from the medical domain. Many diseases are mentioned across the data set, and their related treatments are mentioned alongside them, though only implicitly. In the example above, the disease is cancer, and its treatment can be identified from the sentence as chemotherapy.

Note that the data set does not state explicitly which treatment belongs to which disease. However, you can build an algorithm that maps the diseases to their respective treatments.

 

Suppose you have been asked to determine the disease name and its probable treatment from the dataset and list it out in the form of a table or a dictionary like this.

<p>
<img src ="https://images.upgrad.com/0891d77b-b9ca-4e9d-8934-d9a9b078a51c-syntactic%20sol%20pic1.png" alt='Figure 1'>
<center> <b>Figure 1. Example table of diseases and their probable treatments</b> </center> 
 </br>  
</p>

# Business Objectives

BeHealthy requires a **predictive model** that can **identify diseases and treatments** in the text generated when patients interact with doctors or order medicines online.

From this requirement, it is clear that we have to process each sentence and identify entities such as Disease and Treatment. This kind of entity recognition can be approached with models such as:
  -  CRF (Conditional Random Field) classifier
  -  Random Forest Classifier
  -  HMM (Hidden Markov Model)

# IMPORT LIBRARIES AND DATASETS

In [1]:
!pip install pycrf
!pip install sklearn-crfsuite

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycrf
  Downloading pycrf-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pycrf
  Building wheel for pycrf (setup.py) ... done
  Created wheel for pycrf: filename=pycrf-0.0.1-py3-none-any.whl size=1897 sha256=e5358a1e357aad7bda33dca0a130fc87ba5ee063e1fbb51ee2fab45865b1bb66
  Stored in directory: /root/.cache/pip/wheels/da/5c/29/bf862cc934550145485b0e0502cb8deadffb387f6a096e4b5f
Successfully built pycrf
Installing collected packages: pycrf
Successfully installed pycrf-0.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading python_crfsuite-0.9.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)

In [2]:
# Import packages
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics

import re                                                          # Regular expressions
import spacy                                                       # NLP library, used here for POS tagging

pd.set_option('display.max_columns', None)
%matplotlib inline
sns.set()




In [3]:
import warnings
warnings.filterwarnings('ignore')

## Data Loading and Description

- **Data Set** - This activity uses four data files:
  - Train Sentence Dataset
  - Train Label Dataset
  - Test Sentence Dataset
  - Test Label Dataset

The sentence files contain the text of the interactions between patients and doctors, and the label files contain the entity tag for each word, arranged sentence by sentence. The data needs some preprocessing before it can be used, which we explore in the following steps.

We have the train and the test datasets; the train dataset is used to train the CRF model, and the test dataset is used to evaluate the built model.

Let’s take a look at the structure of these datasets using the image provided below.

<p>
<img src ="https://images.upgrad.com/af3536e2-c88f-42f1-8dda-fa563763ecff-Syntactic%20sol%20pic2.png" alt='Figure 1'>
<center> <b>Figure 2. Structure of the sentence dataset: one word per line, with blank lines between sentences</b> </center> 
 </br>  
</p>

Here, each word in the dataset is given on its own line, so we first need to join these words to form sentences. The blank lines in the dataset, highlighted in the image above, mark the sentence boundaries: a new sentence starts on the line after a blank line and runs until the next blank line.

In the image provided above, you need to make the sentences in the following way:

Sentence1: …using a Spearman-rank Correlation
Sentence2: This relationship should be taken into account when interpreting the AFI as a measure of fetal well-being.
Sentence3: The study population…
...and so on.
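
The grouping rule described above (consecutive lines form a sentence, a blank line ends it) can be sketched on a small in-memory sample. This is only an illustration: the sample lines and the helper name `lines_to_sentences` are assumptions, not part of the assignment files.

```python
from itertools import groupby

def lines_to_sentences(lines):
    """Join runs of non-blank lines into space-separated sentences."""
    sentences = []
    # groupby splits the stream into alternating runs of blank / non-blank lines
    for is_blank, run in groupby(lines, key=lambda line: line.strip() == ''):
        if not is_blank:
            sentences.append(' '.join(token.strip() for token in run))
    return sentences

# Hypothetical sample in the one-word-per-line layout
sample_lines = ['using', 'a', 'Spearman-rank', 'Correlation', '',
                'The', 'study', 'population', '']
print(lines_to_sentences(sample_lines))
```

Unlike a loop that appends on every blank line, this variant never produces empty sentences when two blank lines follow each other, which may or may not be what you want for a given file.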


We can also refer to the image given below to get a better idea on how to create sentences from words.

<p>
<img src ="https://images.upgrad.com/2a1ec8a4-e26c-4b5b-bfe0-1816d14a4a30-syntactic%20sol%20pic3.png" alt='Figure 1'>
<center> <b>Figure 3. Forming sentences from the individual words</b> </center> 
 </br>  
</p>

In this ‘train_sent’ dataset, there are a total of **2,599** sentences when you form the sentences from the words. Similarly, there are a total of **1,056** sentences in the ‘test_sent’ dataset when you form the sentences from the words.

Now, let’s take a look at the next datasets that are named ‘train_label’ and ‘test_label’.

<p>
<img src ="https://images.upgrad.com/bdd7f8f5-0fbb-4b46-9c2c-500a68c40d2e-syntactic%20sol%20pic4.png" alt='Figure 1'>
<center> <b>Figure 4. Structure of the label dataset: one label per line</b> </center> 
 </br>  
</p>


The above dataset contains the labels corresponding to the diseases and the treatments. Three labels are used in this dataset: O, D and T, which correspond to ‘Other’, ‘Disease’ and ‘Treatment’, respectively.

 

These labels correspond to each word that is available in the ‘train_sent’ and 'test_sent' datasets. So, there is one-to-one mapping of each label available in the 'train_label' and 'test_label' datasets with the words that are available in the 'train_sent' and 'test_sent' datasets, respectively. 

We need to again create the lines of labels corresponding to each sentence in the ‘train_sent’ and the ‘test_sent’ datasets as shown below.

<p>
<img src ="https://images.upgrad.com/cad61c6d-534f-4452-81e6-dc02d7c8fcde-syntactic%20sol%20pic5.png" alt='Figure 1'>
<center> <b>Figure 5. Forming one line of labels for each sentence</b> </center> 
 </br>  
</p>


So, in this ‘train_label’ dataset, there are a total of **2,599** lines of labels when you form the lines from the label dataset. Similarly, there are a total of **1,056** lines of labels in the ‘test_label’ dataset when you form the lines from the label dataset.
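
Because every label maps one-to-one to a word, a quick alignment check is a cheap way to catch preprocessing mistakes. The sketch below pairs words with labels and fails if the counts differ; the function name `align_tokens_and_labels` and the sample sentence are hypothetical and not taken from the dataset.

```python
def align_tokens_and_labels(sentence, label_line):
    """Pair each word with its label; fail loudly if the counts differ."""
    words = sentence.split()
    labels = label_line.split()
    if len(words) != len(labels):
        raise ValueError(f'{len(words)} words but {len(labels)} labels')
    return list(zip(words, labels))

# Hypothetical sentence and label line in the O / D / T scheme
pairs = align_tokens_and_labels('retinoblastoma treated with radiotherapy', 'D O O T')
print(pairs)
```

Running such a check over every sentence and label line before feature extraction would surface a misaligned file early, rather than as silently shifted labels later.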

 

In this assignment, we need to perform the following broad steps:

- Process and modify the data into sentence format. This has to be done for the 'train_sent' and ‘train_label’ datasets and for the test datasets as well.
- Define the features to build the CRF model.
- Apply these features to each sentence of the train and the test datasets to get the feature values.
- Define the target variable and then build the CRF model.
- Evaluate the model using the test dataset.
- Create a dictionary in which diseases are keys and treatments are values.

## Utils Functions

### Preprocessing Functions

In [4]:
# Build sentences from a file that holds one word per line
def content_extract(file_path='',sep='\t'):
    '''Join consecutive non-empty lines into space-separated sentences.

    A blank line marks the end of a sentence. The `sep` argument is kept so
    that existing calls still work, but it is not used.
    '''
    try:
        with open(file_path,'r',encoding='utf-8') as text:
            content = text.readlines()
    except FileNotFoundError:
        print('Check and provide proper file path')
        return None
    sentence = []
    final_sentence=''
    for c in content:
        content_word = c.strip('\n')
        if content_word == '':
            # A blank line closes the current sentence
            sentence.append(final_sentence.strip(' '))

            # Initialize for next sentence
            final_sentence=''
        else:
            # Keep concatenating words until the next blank line
            final_sentence+=content_word+' '
    if final_sentence:
        # Keep the last sentence if the file does not end with a blank line
        sentence.append(final_sentence.strip(' '))
    print('Total identified value: ',len(sentence),'\n')
    print('Sample display value:\n',sentence[:5])
    return sentence

### Post-processing functions

In [5]:
# A class to retrieve the sentence details from the dataframe
class sentencedetail(object):
    def __init__(self, data):
        self.data = data
        self.empty = False
        # Collect the (word, pos, label) triples of one sentence
        agg_func = lambda s: [(w, p, l) for w, p, l in zip(s["word"].values.tolist(), s["pos"].values.tolist(),s["label"].values.tolist())]
        # One list of triples per sentence id
        self.grouped = self.data.groupby("sentence").apply(agg_func)
        self.sentences = [s for s in self.grouped]

### Feature Extraction

In [6]:
# Feature set
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[0]': word[0],
        'word[-1]': word[-1],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag':postag,
        'postag_isnounpronoun': postag in ['NOUN','PROPN'],
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word[0]': word1[0],
            '-1:word[-1]': word1[-1],
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
            '-1:postag_isnounpronoun': postag1 in ['NOUN','PROPN']
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
            '+1:postag_isnounpronoun': postag1 in ['NOUN','PROPN']
        })
    else:
        features['EOS'] = True

    return features

In [7]:
# Define a function to extract features for a sentence.
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

In [8]:
# Define a function to get the labels for a sentence.
def sent2labels(sent):
    return [label for token, postag, label in sent]
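
To see what the feature dictionaries and label lists look like, the sketch below applies a cut-down version of the same idea to a toy sentence. `toy_word2features` and the toy triples are hypothetical; the function keeps only a few of the keys used by `word2features` above so that the example stays short.

```python
def toy_word2features(sent, i):
    """Reduced, hypothetical variant of word2features: current word plus neighbours."""
    word, postag = sent[i][0], sent[i][1]
    features = {'word.lower()': word.lower(), 'postag': postag}
    if i > 0:
        features['-1:word.lower()'] = sent[i - 1][0].lower()
    else:
        features['BOS'] = True   # beginning of sentence
    if i < len(sent) - 1:
        features['+1:word.lower()'] = sent[i + 1][0].lower()
    else:
        features['EOS'] = True   # end of sentence
    return features

# Hypothetical (word, pos, label) triples in the shape produced by sentencedetail
toy_sent = [('Retinoblastoma', 'NOUN', 'D'), ('needs', 'VERB', 'O'), ('radiotherapy', 'NOUN', 'T')]
toy_X = [toy_word2features(toy_sent, i) for i in range(len(toy_sent))]
toy_y = [label for _, _, label in toy_sent]
print(toy_X[0])
print(toy_y)
```

Each sentence becomes a list of dictionaries (one per token) and a parallel list of labels, which is the input shape that sklearn-crfsuite's `CRF.fit` expects.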

# Data preprocessing:



The dataset provided is in the form of one word per line. Let's understand the format of data below:

Suppose there are x words in a sentence, then there will be x continuous lines with one word in each line.
Further, the two sentences are separated by empty lines. The labels for the data follow the same format.
We need to pre-process the data to recover the complete sentences and their labels.

As described earlier, three labels are used in this dataset: O, D and T, which correspond to Other, **Disease** and **Treatment**, respectively.

Construct the proper sentences from the individual words and print the first five sentences.

<p>
<img src ="https://images.upgrad.com/2a1ec8a4-e26c-4b5b-bfe0-1816d14a4a30-syntactic%20sol%20pic3.png" alt='Figure 1'>
<center> <b>Figure 3. Forming sentences from the individual words</b> </center> 
 </br>  
</p>

<p>
<img src ="https://images.upgrad.com/cad61c6d-534f-4452-81e6-dc02d7c8fcde-syntactic%20sol%20pic5.png" alt='Figure 1'>
<center> <b>Figure 5. Forming one line of labels for each sentence</b> </center> 
 </br>  
</p>


In [9]:
# Train sentence extraction from dataset
train_sent = content_extract(file_path='train_sent',sep='\n')

Total identified value:  2599 

Sample display value:
 ['All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )', 'The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )', 'Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )', "The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 )", "Arrest of dilation was the most common indication in both `` corrected '' subgroups ( 23.4 and 24.6 % , respectively )"]


In [10]:
# Train label extraction from dataset
train_label = content_extract(file_path='train_label',sep='\n')

Total identified value:  2599 

Sample display value:
 ['O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O']


In [11]:
# Test sentence extraction from dataset
test_sent = content_extract(file_path='test_sent',sep='\n')

Total identified value:  1056 

Sample display value:
 ['Furthermore , when all deliveries were analyzed , regardless of risk status but limited to gestational age > or = 36 weeks , the rates did not change ( 12.6 % , 280 of 2214 ; primary 9.2 % , 183 of 1994 )', 'As the ambient temperature increases , there is an increase in insensible fluid loss and the potential for dehydration', 'The daily high temperature ranged from 71 to 104 degrees F and AFI values ranged from 1.7 to 24.7 cm during the study period', 'There was a significant correlation between the 2- , 3- , and 4-day mean temperature and AFI , with the 4-day mean being the most significant ( r = 0.31 , p & # 60 ; 0.001 )', 'Fluctuations in ambient temperature are inversely correlated to changes in AFI']


In [12]:
# Test label extraction from dataset
test_label = content_extract(file_path='test_label',sep='\n')

Total identified value:  1056 

Sample display value:
 ['O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O']


### Let's extract POS information using spaCy

In [13]:
# Load spaCy's small English model, used here for POS tags and lemmas
nlp= spacy.load("en_core_web_sm")

In [14]:
# Dataframes holding the word, lemma, POS tag and label of each token in the train and test sentences
train_df = pd.DataFrame(columns=['sentence','word','lemma','pos','label'])
test_df = pd.DataFrame(columns=['sentence','word','lemma','pos','label'])

Count the number of sentences, number of lines of labels in the processed train and test dataset

In [15]:
# Train dataframe

i=0 #Sentence count
j=0 #Iteration count

for sent,label in zip(train_sent,train_label):
    i+=1
    for s,l in zip(sent.split(),label.split()):
        doc = nlp(s)
        for tok in doc:
            train_df.loc[j,['sentence','word','lemma','pos','label']] = [i,tok.text,tok.lemma_,tok.pos_,l]
            j+=1

In [16]:
# Test dataframe

i=0 #Sentence count
j=0 #Iteration count

for sent,label in zip(test_sent,test_label):
    i+=1
    for s,l in zip(sent.split(),label.split()):
        doc = nlp(s)
        for tok in doc:
            test_df.loc[j,['sentence','word','lemma','pos','label']] = [i,tok.text,tok.lemma_,tok.pos_,l]
            j+=1
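
Because each whitespace-separated word is passed through the tokenizer on its own, a word that the tokenizer splits into several tokens contributes several rows, and all of them carry that word's label. The sketch below mimics this with a regular expression in place of spaCy. The regex tokenizer, the helper names and the sample data are assumptions for illustration, and the regex will not match spaCy's real tokenization in every case.

```python
import re

def split_token(word):
    """Stand-in tokenizer: separate runs of letters/digits from punctuation."""
    return re.findall(r'[A-Za-z0-9]+|[^A-Za-z0-9\s]', word)

def expand_rows(sentence, label_line, sentence_id):
    """Return one (sentence_id, token, label) row per sub-token."""
    rows = []
    for word, label in zip(sentence.split(), label_line.split()):
        for token in split_token(word):
            # Every sub-token inherits the label of the original word
            rows.append((sentence_id, token, label))
    return rows

# Hypothetical two-word sentence, both words labelled as disease
rows = expand_rows('non-Q-wave infarction', 'D D', 1)
print(rows)
```

Collecting rows in a list like this and building the dataframe once at the end (for example with `pd.DataFrame(rows, columns=[...])`) is typically faster than assigning with `.loc` one row at a time.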

**Extract those tokens which have NOUN or PROPN as their PoS tag and find their frequency**

In [17]:
# Combine the train and test tokens to count the frequency of words tagged NOUN or PROPN
freq_df = pd.concat((train_df,test_df),axis=0)

In [18]:
# Resetting index
freq_df.reset_index(inplace=True,drop=True)

**Print the top 25 most common tokens with NOUN or PROPN PoS tags**

In [19]:
# Top 25 most frequent words tagged NOUN or PROPN in the combined train and test data
freq_df[(freq_df['pos'] == 'NOUN') | ((freq_df['pos'] == 'PROPN'))]['word'].value_counts()[:25]

patients        492
treatment       281
cancer          200
therapy         175
disease         143
cell            140
lung            116
group            94
gene             88
chemotherapy     88
effects          85
results          79
women            77
patient          75
TO_SEE           75
surgery          71
risk             71
cases            71
analysis         70
human            67
rate             67
response         66
survival         65
children         64
effect           64
Name: word, dtype: int64
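
The same filter-then-count step can be sketched without pandas using `collections.Counter`. The `(word, pos)` pairs below are hypothetical stand-ins for the rows of `freq_df`.

```python
from collections import Counter

# Hypothetical (word, pos) pairs standing in for rows of freq_df
tagged = [('patients', 'NOUN'), ('were', 'AUX'), ('treated', 'VERB'),
          ('Vermont', 'PROPN'), ('patients', 'NOUN'), ('cancer', 'NOUN')]

# Keep only nouns and proper nouns, then count each surface form
noun_counts = Counter(word for word, pos in tagged if pos in ('NOUN', 'PROPN'))
print(noun_counts.most_common(2))
```

Counting surface forms keeps 'patients' and 'patient' apart, which is why the notebook also counts lemmas in the next cell.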

In [20]:
# Top 25 most frequent lemmas tagged NOUN or PROPN in the combined train and test data
freq_df[(freq_df['pos'] == 'NOUN') | ((freq_df['pos'] == 'PROPN'))]['lemma'].value_counts()[:25]

patient         587
treatment       316
cancer          226
cell            203
therapy         182
disease         172
effect          163
case            132
group           128
lung            120
result          118
gene            112
year            105
rate            102
trial            91
chemotherapy     91
woman            89
analysis         86
protein          82
response         81
risk             78
child            78
human            77
TO_SEE           75
mutation         75
Name: lemma, dtype: int64

**Dataframe (Sentence, word, POS) visualisation**

In [21]:
train_df.head(5)

   sentence    word  lemma    pos label
0         1     All    all   PRON     O
1         1    live   live   VERB     O
2         1  births  birth   NOUN     O
3         1       >      >  PUNCT     O
4         1      or     or  CCONJ     O


In [22]:
test_df.head(5)

   sentence         word        lemma    pos label
0         1  Furthermore  furthermore    ADV     O
1         1            ,            ,  PUNCT     O
2         1         when         when  SCONJ     O
3         1          all          all   PRON     O
4         1   deliveries     delivery   NOUN     O


**Sentence-wise detail dataframe preparation**

In [23]:
# Fetch detail view of sentence for train set
train_sent_obj = sentencedetail(train_df)
train_sent_detail = train_sent_obj.sentences

In [24]:
# Display one sentence detail view for train set
train_sent_detail[0]

[('All', 'PRON', 'O'),
 ('live', 'VERB', 'O'),
 ('births', 'NOUN', 'O'),
 ('>', 'PUNCT', 'O'),
 ('or', 'CCONJ', 'O'),
 ('=', 'VERB', 'O'),
 ('23', 'NUM', 'O'),
 ('weeks', 'NOUN', 'O'),
 ('at', 'ADP', 'O'),
 ('the', 'PRON', 'O'),
 ('University', 'NOUN', 'O'),
 ('of', 'ADP', 'O'),
 ('Vermont', 'PROPN', 'O'),
 ('in', 'ADP', 'O'),
 ('1995', 'NUM', 'O'),
 ('(', 'PUNCT', 'O'),
 ('n', 'CCONJ', 'O'),
 ('=', 'VERB', 'O'),
 ('2395', 'NUM', 'O'),
 (')', 'PUNCT', 'O'),
 ('were', 'AUX', 'O'),
 ('retrospectively', 'ADV', 'O'),
 ('analyzed', 'VERB', 'O'),
 ('for', 'ADP', 'O'),
 ('delivery', 'NOUN', 'O'),
 ('route', 'NOUN', 'O'),
 (',', 'PUNCT', 'O'),
 ('indication', 'NOUN', 'O'),
 ('for', 'ADP', 'O'),
 ('cesarean', 'VERB', 'O'),
 (',', 'PUNCT', 'O'),
 ('gestational', 'ADJ', 'O'),
 ('age', 'NOUN', 'O'),
 (',', 'PUNCT', 'O'),
 ('parity', 'NOUN', 'O'),
 (',', 'PUNCT', 'O'),
 ('and', 'CCONJ', 'O'),
 ('practice', 'VERB', 'O'),
 ('group', 'NOUN', 'O'),
 ('(', 'PUNCT', 'O'),
 ('to', 'PART', 'O'),
 ('reflect

In [25]:
# Fetch detail view of sentence for test set
test_sent_obj = sentencedetail(test_df)
test_sent_detail = test_sent_obj.sentences

In [26]:
# Display one sentence detail view for test set
test_sent_detail[0]

[('Furthermore', 'ADV', 'O'),
 (',', 'PUNCT', 'O'),
 ('when', 'SCONJ', 'O'),
 ('all', 'PRON', 'O'),
 ('deliveries', 'NOUN', 'O'),
 ('were', 'AUX', 'O'),
 ('analyzed', 'VERB', 'O'),
 (',', 'PUNCT', 'O'),
 ('regardless', 'ADV', 'O'),
 ('of', 'ADP', 'O'),
 ('risk', 'NOUN', 'O'),
 ('status', 'NOUN', 'O'),
 ('but', 'CCONJ', 'O'),
 ('limited', 'VERB', 'O'),
 ('to', 'PART', 'O'),
 ('gestational', 'ADJ', 'O'),
 ('age', 'NOUN', 'O'),
 ('>', 'PUNCT', 'O'),
 ('or', 'CCONJ', 'O'),
 ('=', 'VERB', 'O'),
 ('36', 'NUM', 'O'),
 ('weeks', 'NOUN', 'O'),
 (',', 'PUNCT', 'O'),
 ('the', 'PRON', 'O'),
 ('rates', 'NOUN', 'O'),
 ('did', 'VERB', 'O'),
 ('not', 'PART', 'O'),
 ('change', 'VERB', 'O'),
 ('(', 'PUNCT', 'O'),
 ('12.6', 'NUM', 'O'),
 ('%', 'INTJ', 'O'),
 (',', 'PUNCT', 'O'),
 ('280', 'NUM', 'O'),
 ('of', 'ADP', 'O'),
 ('2214', 'NUM', 'O'),
 (';', 'PUNCT', 'O'),
 ('primary', 'NOUN', 'O'),
 ('9.2', 'NUM', 'O'),
 ('%', 'INTJ', 'O'),
 (',', 'PUNCT', 'O'),
 ('183', 'NUM', 'O'),
 ('of', 'ADP', 'O'),
 ('199

# Define input and target variables

Compute the X and Y sequences for the training and test data, checking that both the sentences and the labels are processed.

Define the feature values for each sentence as the input variable for the CRF model, for both the train and the test datasets.

In [27]:
# Prepare X-train and X-test by extracting features from train and test dataset
X_train = [sent2features(s) for s in train_sent_detail]
X_test = [sent2features(s) for s in test_sent_detail]

Define the labels as the target variable for the train and the test datasets.

In [28]:
# Prepare y-train and y-test by extracting labels from train and test dataset
y_train = [sent2labels(l) for l in train_sent_detail]
y_test = [sent2labels(l) for l in test_sent_detail]

# Building the CRF Model using sklearn





In [30]:
# Import model and metrics
from sklearn_crfsuite import CRF, scorers, metrics

Predict the labels of each of the tokens in each sentence of the test dataset that has been pre processed earlier.

In [34]:
%%time

# Build the CRF model.
crf = CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

# Fit the model. The try/except is kept from the original notebook: with some
# version combinations of sklearn-crfsuite and scikit-learn, fit() is reported
# to raise an AttributeError after the model has been trained.
try:
    crf.fit(X_train, y_train)
except AttributeError:
    pass

# Predict the labels for the test sentences
y_pred = crf.predict(X_test)

CPU times: user 6.18 s, sys: 41.9 ms, total: 6.22 s
Wall time: 6.46 s


# Model Evaluation 

Calculate the f1 score using the actual labels and the predicted labels of the test dataset.

In [35]:
# Calculate the weighted F1 score on the test data

f1_score = metrics.flat_f1_score(y_test, y_pred, average='weighted')
print('Predicted F1-score for Medical Entity Dataset is: {0} % '.format(round(f1_score*100,2)))

Predicted F1-score for Medical Entity Dataset is: 92.5 % 
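
A weighted F1 score averages the per-label scores by how often each label occurs, so the very frequent `O` label dominates the result. Looking at `D` and `T` separately gives a clearer picture of how well the entities themselves are recognised. The sketch below computes token-level F1 per label in plain Python; the function name `per_label_f1` and the toy sequences are assumptions for illustration, not results from this notebook.

```python
from collections import Counter

def per_label_f1(y_true, y_pred):
    """Token-level F1 for every label over lists of label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1   # predicted p where it was not the true label
                fn[t] += 1   # missed the true label t
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Hypothetical gold and predicted label sequences
toy_true = [['O', 'D', 'O', 'T'], ['O', 'O', 'D', 'O']]
toy_pred = [['O', 'D', 'O', 'O'], ['O', 'O', 'O', 'O']]
toy_scores = per_label_f1(toy_true, toy_pred)
print(toy_scores)
```

In this toy case the score for `O` stays high while `T` drops to zero. `sklearn_crfsuite.metrics` also provides `flat_classification_report`, which gives a similar per-label breakdown for the real predictions.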


# Predict Disease and Treatment

In [36]:
# Flatten the per-sentence predictions into a single list of labels
pred_label=[]
for i in y_pred:
    pred_label.extend(i)

In [37]:
# Add the predicted labels to the test dataframe
test_df['label_predicted'] = pred_label

In [38]:
# Visualise the first 5 rows
test_df.head(5)

   sentence         word        lemma    pos label label_predicted
0         1  Furthermore  furthermore    ADV     O               O
1         1            ,            ,  PUNCT     O               O
2         1         when         when  SCONJ     O               O
3         1          all          all   PRON     O               O
4         1   deliveries     delivery   NOUN     O               O


## Identifying Diseases and Treatments using Custom NER
We now use the CRF model's prediction to prepare a record of diseases identified in the corpus and treatments used for the diseases.

**Create the logic to get all the predicted treatments (T) labels corresponding to each disease (D) label in the test dataset.**

<p>
<img src ="https://images.upgrad.com/0891d77b-b9ca-4e9d-8934-d9a9b078a51c-syntactic%20sol%20pic1.png" alt='Figure 1'>
<center> <b>Figure 1. Example table of diseases and their probable treatments</b> </center> 
 </br>  
</p>


In [39]:
# Build a dictionary with each disease as the key and its treatment as the value
new_df = test_df[test_df['label_predicted'] != 'O'].set_index('sentence')
med_dict = {}
for i in new_df.index.unique():
    try:
        # A sentence with a single tagged token returns a plain string here,
        # so .unique() raises AttributeError and the sentence is skipped
        val = new_df.loc[i,'label_predicted'].unique()
        if len(val) == 2:  # the sentence contains both D and T tokens
            disease_val = new_df[new_df['label_predicted'] == 'D'].loc[i,'word']
            treatment_val = new_df[new_df['label_predicted'] == 'T'].loc[i,'word']
            disease_single = disease_val if isinstance(disease_val, str) else " ".join(disease_val)
            treatment_single = treatment_val if isinstance(treatment_val, str) else " ".join(treatment_val)
            # A disease that appears in several sentences keeps the treatment from the last one
            med_dict[disease_single] = treatment_single
    except AttributeError:
        pass

In [40]:
print(med_dict)

{'macrosomic infants in gestational diabetes cases': 'good glycemic control', 'nonimmune hydrops fetalis': 'Trisomy', 'retinoblastoma': 'radiotherapy', 'epilepsy': 'Methylphenidate', 'unstable angina or non - Q - wave myocardial infarction': 'roxithromycin', 'coronary - artery disease': 'Antichlamydial antibiotics', 'primary pulmonary hypertension ( PPH )': 'fenfluramines', 'essential hypertension': 'moxonidine `', 'cellulitis': 'G - CSF therapy intravenous antibiotic treatment', 'foot infection in diabetic patients': 'G - CSF treatment', 'hemorrhagic stroke': 'double - bolus alteplase accelerated infusion of alteplase ( P=0.24', 'cardiac disease': 'fenfluramine - phentermine', 'rheumatoid arthritis': 'arthrodesis', "early Parkinson 's disease": 'Ropinirole monotherapy', 'sore throat': 'Antibiotics', "Crohn 's disease": 'steroids', 'stress urinary incontinence': 'surgical procedures', 'female stress urinary incontinence': 'surgical treatment', 'preeclampsia ( proteinuric hypertension )
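
A plain dictionary assignment keeps one treatment string per disease, so a disease that occurs in several sentences retains only one value. If every treatment should be kept, one option is to collect them in a list per disease. The sketch below shows that idea with `collections.defaultdict`; the helper name `collect_treatments` and the sample pairs (including the second retinoblastoma treatment) are hypothetical.

```python
from collections import defaultdict

def collect_treatments(pairs):
    """Map each disease to the list of distinct treatments seen with it."""
    grouped = defaultdict(list)
    for disease, treatment in pairs:
        if treatment not in grouped[disease]:
            grouped[disease].append(treatment)
    return dict(grouped)

# Hypothetical (disease, treatment) pairs, one per sentence
toy_pairs = [('retinoblastoma', 'radiotherapy'),
             ('epilepsy', 'Methylphenidate'),
             ('retinoblastoma', 'chemotherapy'),
             ('retinoblastoma', 'radiotherapy')]
toy_med = collect_treatments(toy_pairs)
print(toy_med)
```

Keeping a list rather than a '/'-joined string makes it easier to deduplicate treatments and to count how often each one is mentioned.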

**Predict the treatment for the disease name: 'hereditary retinoblastoma'**

In [46]:
# Predict the treatment with the help of the dictionary
d=[]
disease=[]
treatment=''

input_sent = 'hereditary retinoblastoma'
doc = nlp(input_sent)   # reuse the spaCy model loaded earlier
for i in doc:
    # The label slot is a placeholder; sent2features does not read it
    d.append((i.text,i.pos_,'D'))
input_features = [sent2features(d)]
for i,tag in enumerate(crf.predict(input_features)[0]):
    if tag == 'D':
        tr = d[i][0]   # token text at the same index as the predicted tag
        disease.append(tr)
        if tr in med_dict:
            treatment += med_dict.get(tr)
if len(treatment) == 0:
    treatment='None'
print('Identified Disease: ',' '.join(disease))
print('Identified Treatment: ', treatment)

Identified Disease:  retinoblastoma
Identified Treatment:  radiotherapy
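
The lookup above only succeeds when a single predicted disease token is exactly a key of the dictionary, while many keys are multi-word phrases. One possible relaxation is to match on shared words. The sketch below is an assumption-laden illustration: `find_treatments` is a hypothetical helper, and `toy_dict` only borrows two entries in the same shape as `med_dict`.

```python
def find_treatments(query, disease_to_treatment):
    """Return the entries whose disease key shares a word with the query."""
    query_words = {w.lower() for w in query.split()}
    matches = {}
    for disease, treatment in disease_to_treatment.items():
        # Case-insensitive overlap between query words and key words
        if query_words & {w.lower() for w in disease.split()}:
            matches[disease] = treatment
    return matches

# Hypothetical dictionary in the same shape as med_dict
toy_dict = {'retinoblastoma': 'radiotherapy',
            "early Parkinson 's disease": 'Ropinirole monotherapy'}
print(find_treatments('hereditary retinoblastoma', toy_dict))
```

Word overlap is deliberately loose: a generic word such as 'disease' in the query would match many keys, so in practice a stop list or a stricter phrase match would probably be needed.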
