
In [58]:
# Install and import required packages

!pip install pycrf
!pip install sklearn-crfsuite
!pip install scikit-learn==0.22.2 --user

import spacy
import sklearn_crfsuite
from sklearn_crfsuite import metrics

model = spacy.load("en_core_web_sm")




The data files store one word per line, with a blank line between sentences, so we first pre-process them to recover the complete sentences and their labels.

We will reconstruct the sentences from the individual words and print the first five.

In [59]:
import pandas as pd


In [60]:
# Create a function to process the file and return a sentence list
def prep_inputfile(input_file):
    # Read all lines; the context manager closes the file even if reading fails
    with open(input_file, 'r') as i_file:
        lines = i_file.readlines()

    output_list = []

    full_sentence = ""

    for each_word in lines:
        each_word = each_word.strip()
        if each_word == "":
            output_list.append(full_sentence) # To append the complete sentence to the output list
            full_sentence = "" # For new sentence start
        else:
            if full_sentence:
                full_sentence += " " + each_word
            else:
                full_sentence = each_word

    # Keep the last sentence if the file does not end with a blank line
    if full_sentence:
        output_list.append(full_sentence)

    return output_list
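
As a minimal sketch of the layout this function assumes (one token per line, a blank line between sentences), the same grouping can be tried on an in-memory sample. The helper name `group_tokens` and the sample text are hypothetical and are not taken from the dataset; unlike `prep_inputfile`, this variant also skips repeated blank lines.

```python
import io

def group_tokens(lines):
    """Join one-token-per-line input into sentences; blank lines end a sentence."""
    sentences, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:  # a blank line closes the sentence; repeated blanks are skipped
            sentences.append(" ".join(current))
            current = []
    if current:  # input did not end with a blank line
        sentences.append(" ".join(current))
    return sentences

# Hypothetical sample in the assumed format
sample = io.StringIO("Aspirin\nrelieves\nheadache\n\nRest\nhelps\n")
sample_sentences = group_tokens(sample)
print(sample_sentences)
```

The final `if current` branch is what keeps the last sentence when the input has no trailing blank line.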



In [61]:
train_sentence = prep_inputfile('/train_sent ')
train_label = prep_inputfile('/train_label ')
test_sentence = prep_inputfile('/test_sent ')
test_label = prep_inputfile('/test_label ')


We will print the first five sentences and their labels, then count the number of sentences in the processed train and test datasets

In [62]:
# Print first five sentences from the processed dataset
for each_item in range(5):
    print(f"Sentence {each_item+1} is: {train_sentence[each_item]}")
    print(f"Label {each_item+1} is: {train_label[each_item]}")
    print("*"*100)
    

Sentence 1 is: All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )
Label 1 is: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
****************************************************************************************************
Sentence 2 is: The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )
Label 2 is: O O O O O O O O O O O O O O O O O O O O O O O O O
****************************************************************************************************
Sentence 3 is: Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )
Label 3 is: O O O O O O O O O O O O O O O
****************************************************************************************************
Sentence 4 is: The `` corrected '' ce

In [63]:
print(f"No. of sentences in processed train dataset is: {len(train_sentence)}")
print(f"No. of sentences in processed test dataset is: {len(test_sentence)}")


print(f"No. of lines of labels in processed train dataset is: {len(train_label)}")
print(f"No. of lines of labels in processed test dataset is: {len(test_label)}")

No. of sentences in processed train dataset is: 2599
No. of sentences in processed test dataset is: 1056
No. of lines of labels in processed train dataset is: 2599
No. of lines of labels in processed test dataset is: 1056


Concept Identification

We will first explore which concepts are present in the dataset. For this, we will use PoS tagging.

We extract the tokens that have NOUN or PROPN as their PoS tag and compute their frequencies.



In [64]:
# Creating a list to hold all the tokens which are either NOUN or PROPER NOUN
noun_propn_tokens_list = []


# Each token which is a NOUN or PROPN will be appended to the list "noun_propn_tokens_list"
for sentences in (train_sentence, test_sentence):
    for sent in sentences:
        processed_sent = model(sent)
        for each_token in processed_sent:
            if each_token.pos_ == "NOUN" or each_token.pos_ == "PROPN":
                noun_propn_tokens_list.append(each_token.text)
                
                
# Creating a Series to hold the tokens which are either NOUN or PROPER NOUN
def_noun_propn = pd.Series(noun_propn_tokens_list)


We will print the top 25 most common tokens with NOUN or PROPN PoS tags

In [65]:
# Count each token and show the 25 most frequent (value_counts already sorts in descending order)
def_noun_propn.value_counts().head(25)

patients        492
treatment       281
%               247
cancer          200
therapy         175
study           152
disease         141
cell            140
lung            116
group            94
chemotherapy     88
gene             87
effects          85
results          78
women            77
use              74
risk             71
surgery          71
cases            71
analysis         70
rate             67
response         66
survival         65
children         64
effect           63
dtype: int64
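
The counting step itself does not depend on spaCy. As a rough sketch, assuming the tokens have already been tagged (the `(token, tag)` pairs below are made up for illustration), `collections.Counter` produces the same kind of frequency table:

```python
from collections import Counter

# Hypothetical pre-tagged tokens; in the notebook the tags come from spaCy's token.pos_
tagged_tokens = [
    ("patients", "NOUN"), ("received", "VERB"), ("chemotherapy", "NOUN"),
    ("patients", "NOUN"), ("Vermont", "PROPN"), ("improved", "VERB"),
]

# Keep only nouns and proper nouns, then count how often each token occurs
noun_counts = Counter(
    text for text, tag in tagged_tokens if tag in ("NOUN", "PROPN")
)
top_two = noun_counts.most_common(2)
print(top_two)
```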

Defining features for CRF

In [66]:
# Let's define the features to get the feature value for one word.
def getFeaturesForAWord(sentence, pos, pos_tags):
  word = sentence[pos]

  features = [
    'word.lower=' + word.lower(), # serves as word id
    'word[-3:]=' + word[-3:],     # last three characters
    'word[-2:]=' + word[-2:],     # last two characters
    'word.isupper=%s' % word.isupper(),  # is the word in all uppercase
    'word.isdigit=%s' % word.isdigit(),  # is the word a number
    'word.startsWithCapital=%s' % word[0].isupper(), # is the word starting with a capital letter
    'word.pos=' + pos_tags[pos]
  ]

  # Use the previous word as well when defining features
  if pos > 0:
    prev_word = sentence[pos-1]
    features.extend([
      'prev_word.lower=' + prev_word.lower(),
      'prev_word.isupper=%s' % prev_word.isupper(),
      'prev_word.isdigit=%s' % prev_word.isdigit(),
      'prev_word.startsWithCapital=%s' % prev_word[0].isupper(),
      'prev_word.pos=' + pos_tags[pos-1]
    ])
  else:
    features.append('BEG') # marks the first word of the sentence

  if pos == len(sentence)-1:
    features.append('END') # marks the last word of the sentence

  return features
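
To see what these feature strings look like, here is a cut-down, hypothetical variant (`word_features`, with only a few of the features above) applied to a made-up three-word sentence. The PoS tags are assigned by hand for the example and are not spaCy output.

```python
def word_features(words, idx, tags):
    """Return a reduced feature list for words[idx] (illustration only)."""
    word = words[idx]
    feats = [
        "word.lower=" + word.lower(),   # serves as word id
        "word[-3:]=" + word[-3:],       # last three characters
        "word.pos=" + tags[idx],        # PoS tag of the word
    ]
    if idx > 0:
        feats.append("prev_word.lower=" + words[idx - 1].lower())
    else:
        feats.append("BEG")             # first word of the sentence
    if idx == len(words) - 1:
        feats.append("END")             # last word of the sentence
    return feats

words = ["Radiotherapy", "treats", "retinoblastoma"]
tags = ["NOUN", "VERB", "NOUN"]  # hand-assigned for the example
first = word_features(words, 0, tags)
last = word_features(words, 2, tags)
print(first)
print(last)
```

The first word carries `BEG` instead of previous-word features, and the last word carries both previous-word features and `END`.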


Getting the Features

We will create a function to get the features for a sentence

In [67]:
# Function to get features for a sentence.
def getFeaturesForASentence(sentence):

    # Get the PoS tag of every token so it can be passed to getFeaturesForAWord.
    # Note: spaCy's tokenization can differ from str.split(), so a tag may not
    # line up exactly with the whitespace-separated word at the same position.
    processed_sent = model(sentence)
    postags = [each_token.pos_ for each_token in processed_sent]

    sentence_list = sentence.split()
    return [getFeaturesForAWord(sentence_list, pos, postags) for pos in range(len(sentence_list))]


We will create a function to get the labels of a sentence

In [68]:
# Function to get the labels for a sentence.
def getLabelsInListForASentence(labels):
  return labels.split()

We will define the input and target variables

The feature values of each sentence are the input variable for the CRF model, for both the train and test datasets




In [69]:
X_train = [getFeaturesForASentence(sentence) for sentence in train_sentence]
X_test = [getFeaturesForASentence(sentence) for sentence in test_sentence]

We will define the labels as the target variable for test and the train dataset

In [70]:
Y_train = [getLabelsInListForASentence(labels) for labels in train_label]
Y_test = [getLabelsInListForASentence(labels) for labels in test_label]
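
Because the CRF expects one label per token, it can be worth confirming that every sentence and its label line split into the same number of items. A minimal check, shown here on made-up data with a hypothetical helper `find_misaligned`:

```python
def find_misaligned(sentences, label_lines):
    """Return indices where the token count and the label count differ."""
    return [
        i
        for i, (sent, labels) in enumerate(zip(sentences, label_lines))
        if len(sent.split()) != len(labels.split())
    ]

# Made-up examples: the second pair is deliberately one label short
demo_sentences = ["aspirin relieves headache", "rest helps recovery"]
demo_labels = ["T O D", "O O"]
bad = find_misaligned(demo_sentences, demo_labels)
print(bad)
```

An empty result means every sentence has exactly one label per whitespace-separated token.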

Building the CRF Model

In [71]:
# Build the CRF model.

crf = sklearn_crfsuite.CRF(max_iterations=100)
crf.fit(X_train, Y_train)



CRF(algorithm=None, all_possible_states=None, all_possible_transitions=None,
    averaging=None, c=None, c1=None, c2=None, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

Model Evaluation

We will predict the label of each token in each sentence of the pre-processed test dataset

In [72]:
Y_pred = crf.predict(X_test)

We will calculate the f1 score using the actual labels and the predicted labels of the test dataset.

In [73]:
f1_score = metrics.flat_f1_score(Y_test, Y_pred, average='weighted')
print(f"F1 score : {round(f1_score,4)}")

F1 score : 0.9043
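
With `average='weighted'`, the per-label F1 scores are averaged using each label's share of the true tokens as its weight, so a frequent label such as `O` contributes most of the score. The pure-Python sketch below (hypothetical helper `weighted_f1`, toy labels) spells out that calculation. It is meant to illustrate the metric, not to replace `metrics.flat_f1_score`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-label F1 over flattened label sequences."""
    true_flat = [label for seq in y_true for label in seq]
    pred_flat = [label for seq in y_pred for label in seq]
    support = Counter(true_flat)  # number of true tokens per label
    total = 0.0
    for label, n_true in support.items():
        # True positives: positions where both true and predicted label match
        tp = sum(t == label and p == label for t, p in zip(true_flat, pred_flat))
        n_pred = pred_flat.count(label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true  # weight each label's F1 by its support
    return total / len(true_flat)

# Toy data: the single T token is mispredicted as O
toy_true = [["O", "O", "D"], ["T", "O"]]
toy_pred = [["O", "O", "D"], ["O", "O"]]
score = weighted_f1(toy_true, toy_pred)
print(round(score, 4))
```

In the toy data the missed `T` token gives that label an F1 of zero, yet the overall score stays fairly high because `O` carries three of the five tokens.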


Identifying Diseases and Treatments using Custom NER

We will now use the CRF model's prediction to prepare a record of diseases identified in the corpus and treatments used for the diseases.

For each sentence in the test dataset, we collect the tokens predicted as treatment (T) and pair them with the tokens predicted as disease (D) in the same sentence.

In [74]:
# Creating an empty dictionary to hold diseases and their corresponding treatments
Dis_Treat_dict = dict()

for i in range(len(Y_pred)):
    # Get the predicted labels of each test sentence into "val"
    val = Y_pred[i]
    
    # Empty strings to store the values of Diseases and Treatments
    Diseases = ""
    Treatments = ""
    
    # Each loop will iterate through the individual labels and focus on mapping D and T labels
    # with Diseases and Treatments within each sentence into a concatenated string
    for j in range(len(val)):
        if val[j] == 'D': # If label is D, it indicates a Disease 
            Diseases += test_sentence[i].split()[j] + " "
        elif val[j] == 'T': # If label is T, it indicates a Treatment
            Treatments += test_sentence[i].split()[j] + " "
            
    # Remove any extra whitespace at either end of the strings
    Diseases = Diseases.strip()
    Treatments = Treatments.strip()

    # If either Diseases or Treatments is blank, skip the sentence
    # If the disease is not yet in the dictionary, add it along with the corresponding treatment
    # If the disease is already in the dictionary, append the new treatment to the existing
    # treatments
    if Diseases != "" and Treatments != "":
        if Diseases in Dis_Treat_dict:
            existing = Dis_Treat_dict[Diseases]
            # Wrap a single stored string in a list; list(existing) would split it into characters
            treat_out = [existing] if isinstance(existing, str) else existing
            treat_out.append(Treatments)
            Dis_Treat_dict[Diseases] = treat_out
        else:
            Dis_Treat_dict[Diseases] = Treatments

Predicting the treatment for the disease name: 'hereditary retinoblastoma'

In [75]:
Dis_Treat_dict['hereditary retinoblastoma']

'radiotherapy'
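
One alternative way to organise the mapping, sketched here on made-up predictions, is to keep a list of treatments per disease from the start using `collections.defaultdict`. The helper name `collect_pairs` and the toy data are illustrative assumptions. With this layout every lookup returns a list, unlike the dictionary above, which holds a plain string when a disease has a single treatment.

```python
from collections import defaultdict

def collect_pairs(sentences, predicted_labels):
    """Map each predicted disease phrase to the treatment phrases seen with it."""
    pairs = defaultdict(list)
    for sentence, labels in zip(sentences, predicted_labels):
        tokens = sentence.split()
        # Join all D-labelled tokens and all T-labelled tokens of the sentence
        disease = " ".join(t for t, lab in zip(tokens, labels) if lab == "D")
        treatment = " ".join(t for t, lab in zip(tokens, labels) if lab == "T")
        if disease and treatment:  # skip sentences missing either one
            pairs[disease].append(treatment)
    return dict(pairs)

# Made-up sentences and predicted labels
toy_sentences = ["migraine responds to aspirin", "aspirin or rest for migraine"]
toy_labels = [["D", "O", "O", "T"], ["T", "O", "T", "O", "D"]]
result = collect_pairs(toy_sentences, toy_labels)
print(result)
```

As in the loop above, all treatment tokens in a sentence are joined into one string, so non-adjacent treatments in the same sentence end up merged.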