<a href="https://colab.research.google.com/github/maanassiraj/Healthcare_NER/blob/main/NER_Healthcare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying Entities in Healthcare Data

Importing the necessary libraries


In [4]:
!pip install pycrf
!pip install sklearn-crfsuite

import spacy
import sklearn_crfsuite
import pandas as pd
from sklearn_crfsuite import metrics
from collections import Counter
model = spacy.load("en_core_web_sm")



### DATA PREPROCESSING

The train and test sentence files contain one word per line, and the label files follow the same layout. Consecutive sentences (or label sequences) are separated by an empty line, so preprocessing is needed to recover the sentences and their label sequences.

Define a function that reads a file and splits its contents wherever a blank line occurs ("\n\n"). Replacing every newline character in each resulting element with a space then yields a list of sentences or label sequences.

In [5]:
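def data_preproc(path) :
  """Read a one-token-per-line file and return a list of space-joined sequences."""
  with open(path) as file_hd :
    # A blank line separates consecutive sentences / label sequences
    list_sent = file_hd.read().split("\n\n")
  # Join the tokens of each sequence with single spaces
  sentences = [sent.replace("\n", " ") for sent in list_sent]
  return sentences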
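
To make the splitting step concrete, here is a minimal sketch that applies the same logic to an in-memory string instead of a file. The helper name `split_blank_line_blocks` and the sample text are hypothetical and only serve the illustration.

```python
def split_blank_line_blocks(raw_text):
    """Split one-token-per-line text into space-joined sequences."""
    # A blank line (two consecutive newlines) marks a sequence boundary
    blocks = raw_text.split("\n\n")
    # Within a sequence, each newline separates two tokens
    return [block.replace("\n", " ") for block in blocks]

# Hypothetical miniature file contents: two sentences, one token per line
sample = "Abnormal\npresentation\n\nArrest\nof\ndilation"
sequences = split_blank_line_blocks(sample)
print(sequences)  # ['Abnormal presentation', 'Arrest of dilation']
```

Note that with this approach, text that ends in a blank line produces a trailing empty string, so it may be worth checking the last element of each list.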

In [6]:
train_sentences = data_preproc("/content/train_sent")
train_labels = data_preproc("/content/train_label")
test_sentences = data_preproc("/content/test_sent")
test_labels = data_preproc("/content/test_label")

Printing the first five train and test sentences/label sequences

In [7]:
print(train_sentences[:5])
print(train_labels[:5])
print(test_sentences[:5])
print(test_labels[:5])

['All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )', 'The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )', 'Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )', "The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 )", "Arrest of dilation was the most common indication in both `` corrected '' subgroups ( 23.4 and 24.6 % , respectively )"]
['O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O', 'O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O', 'O O O O 

Number of sentences in the processed train and test datasets

In [8]:
print(len(train_sentences))
print(len(test_sentences))

2600
1057


Number of label sequences in the processed train and test datasets

In [9]:
print(len(train_labels))
print(len(test_labels))

2600
1057


## Concept Identification
Exploring the various concepts present in the dataset through POS tagging



Extract those tokens which have NOUN or PROPN as their PoS tag and find their frequency

In [10]:
noun_propn = []
for sent in (train_sentences + test_sentences) :
  doc = model(sent.lower())
  for token in doc :
    if token.pos_ in ["NOUN", "PROPN"] :
      noun_propn.append(token.text)
freq_dist = Counter(noun_propn)

### Print the top 25 most common tokens with NOUN or PROPN PoS tags

In [11]:
print(freq_dist.most_common(25))

[('patients', 507), ('treatment', 304), ('%', 247), ('cancer', 211), ('therapy', 177), ('study', 174), ('disease', 149), ('cell', 142), ('lung', 118), ('results', 117), ('group', 111), ('effects', 99), ('gene', 91), ('chemotherapy', 91), ('use', 89), ('effect', 82), ('women', 81), ('analysis', 76), ('risk', 74), ('surgery', 73), ('cases', 72), ('p', 72), ('rate', 68), ('survival', 67), ('response', 66)]


## Defining features for CRF

The following features are computed for each word. The current word is converted to lower case first; the previous word is used as it appears in the sentence.

*   f1 : the word itself
*   f2 : POS_tag of the word
*   f3 : last three characters of the word
*   f4 : last two characters of the word
*   f5 : length of the word
*   f6 : the previous word
*   f7 : POS_tag of the previous word
*   f8 : length of the previous word
*   f9 : if the word is at the beginning of the sentence, add BEG
*   f10 : if the word is at the end of the sentence, add END

In [12]:
# Build the list of feature strings for the word at position pos.
# sentence : list of whitespace-separated words
# tokens   : spaCy Doc for the same sentence (source of the POS tags)
def getFeaturesForOneWord(sentence, pos, tokens) :

  word = sentence[pos].lower()
  word_pos_tag = tokens[pos].pos_
  features = ["word = " + word,
              "word_POS_tag = " + word_pos_tag,
              "word[-3:] = " + word[-3:],
              "word[-2:] = " + word[-2:],
              "word_length = %s" % len(word)
  ]

  if pos > 0 :
    # Context features from the previous word (not lower-cased)
    prev_word = sentence[pos-1]
    prev_word_pos_tag = tokens[pos-1].pos_
    features.append("prev_word = " + prev_word)
    features.append("prev_word_POS_tag = " + prev_word_pos_tag)
    features.append("prev_word_length = %s" % len(prev_word))

  else :
    # First word of the sentence
    features.append("BEG")

  if pos == len(sentence) - 1 :
    # Last word of the sentence
    features.append("END")

  return features
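
As an illustration of what these feature strings look like, the sketch below mirrors the same feature layout but takes POS tags as a plain list of strings, so it runs without spaCy. The function name `word_features`, the toy sentence, and the hand-assigned POS tags are assumptions made for this example only.

```python
def word_features(words, pos_tags, i):
    """Feature strings for words[i]; pos_tags is a parallel list of POS tag strings."""
    word = words[i].lower()
    feats = [
        "word = " + word,
        "word_POS_tag = " + pos_tags[i],
        "word[-3:] = " + word[-3:],
        "word[-2:] = " + word[-2:],
        "word_length = %s" % len(word),
    ]
    if i > 0:
        # Context from the previous word, used as it appears in the sentence
        prev = words[i - 1]
        feats.append("prev_word = " + prev)
        feats.append("prev_word_POS_tag = " + pos_tags[i - 1])
        feats.append("prev_word_length = %s" % len(prev))
    else:
        feats.append("BEG")
    if i == len(words) - 1:
        feats.append("END")
    return feats

words = ["Abnormal", "presentation", "was", "common"]
pos_tags = ["ADJ", "NOUN", "AUX", "ADJ"]  # hand-assigned, hypothetical tags

for feature in word_features(words, pos_tags, 1):
    print(feature)
```

For the second word this prints eight strings, starting with `word = presentation` and ending with `prev_word_length = 8`; the first word would carry `BEG` instead of the previous-word features.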

## Getting the features

Function that extracts the features of a sentence

In [13]:
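# Get the features for every word in a sentence.
def getFeaturesForOneSentence(sentence) :
  # POS tags come from spaCy; the words come from a whitespace split.
  # Note: tokens[pos] is assumed to line up with sentence[pos]. spaCy's tokenizer
  # can split a whitespace-delimited word further (e.g. at hyphens), in which
  # case the POS tags of later words are shifted.
  tokens = model(sentence)
  sentence = sentence.split()
  return [getFeaturesForOneWord(sentence, pos, tokens) for pos in range(len(sentence))]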

Function that extracts the labels of a sentence

In [14]:
# Get the list of labels for a sentence.
def getLabelsForOneSentence(labels) :
  return labels.split()

## Defining the input and target variables for the CRF model


In [15]:
X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]

In [16]:
Y_train = [getLabelsForOneSentence(labels) for labels in train_labels]
Y_test = [getLabelsForOneSentence(labels) for labels in test_labels]
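
Since the CRF expects one label per token, a quick sanity check on sequence lengths can catch alignment problems before training. The sketch below is one possible check; `find_length_mismatches` and the toy data are hypothetical and not part of the original notebook.

```python
def find_length_mismatches(feature_seqs, label_seqs):
    """Return the indices of sentences whose feature and label counts differ."""
    return [
        i
        for i, (features, labels) in enumerate(zip(feature_seqs, label_seqs))
        if len(features) != len(labels)
    ]

# Toy data: the second sentence has one token but two labels
toy_X = [[["word = a"], ["word = b"]], [["word = c"]]]
toy_Y = [["O", "D"], ["O", "O"]]
print(find_length_mismatches(toy_X, toy_Y))  # [1]
```

Applied to the notebook's variables, this would be called as `find_length_mismatches(X_train, Y_train)` and `find_length_mismatches(X_test, Y_test)`; an empty list means every sentence has as many feature lists as labels.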

## Building the CRF Model

In [17]:
# Build the CRF model.
crf_model = sklearn_crfsuite.CRF(max_iterations = 100)
crf_model.fit(X_train, Y_train)



CRF(algorithm=None, all_possible_states=None, all_possible_transitions=None,
    averaging=None, c=None, c1=None, c2=None, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

## Evaluation

Predicting the labels of each of the tokens in each sentence of the test dataset.

In [18]:
Y_pred = crf_model.predict(X_test)

### Calculate the f1 score using the actual labels and the predicted labels of the test dataset.

In [19]:
metrics.flat_f1_score(Y_test, Y_pred, average = "weighted")

0.9146202176451267
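
A weighted average gives each label a weight proportional to how often it occurs, so a very frequent label can dominate the overall score. To see how individual labels fare, a per-label F1 can be computed from the flattened sequences. The following is a minimal pure-Python sketch; `per_label_f1` and the toy label sequences are assumptions for illustration, not part of the original analysis.

```python
from collections import Counter

def per_label_f1(y_true, y_pred):
    """Compute the token-level F1 score of every label over all sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_label, pred_label in zip(true_seq, pred_seq):
            if true_label == pred_label:
                tp[true_label] += 1
            else:
                fp[pred_label] += 1  # predicted label was wrong
                fn[true_label] += 1  # true label was missed
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Toy sequences: one "T" token is mislabelled as "O"
toy_true = [["O", "O", "D", "O"], ["O", "T", "T", "O"]]
toy_pred = [["O", "O", "D", "O"], ["O", "T", "O", "O"]]
scores = per_label_f1(toy_true, toy_pred)
for label in sorted(scores):
    print(label, round(scores[label], 3))
```

With the notebook's variables the call would be `per_label_f1(Y_test, Y_pred)`.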

## Identifying Diseases and Treatments

We'll now use the CRF model's predictions on the test sentences to build a record of the diseases identified and the treatments mentioned alongside them.



In [20]:
def D_T_identification(pos) :
  label_seq = Y_pred[pos]
  disease_idx = []
  treatment_idx = []
  for idx, label in enumerate(label_seq) :
    if label == "D" :
      disease_idx.append(idx)

    if label == "T" :
      treatment_idx.append(idx)
  
  return disease_idx, treatment_idx

In [21]:
diseases = []
treatments = []
for sent_id, sent in enumerate(test_sentences) :
  words = sent.split()
  disease_idx, treatments_idx = D_T_identification(sent_id)
  # Keep only sentences in which both a disease and a treatment were predicted
  if len(disease_idx) > 0 and len(treatments_idx) > 0 :
    # Keep the first disease index and every index that directly follows the previous one
    diseases.append(" ".join([words[idx] for i, idx in enumerate(disease_idx) if i == 0 or idx == disease_idx[i-1] + 1]))
    treatments.append(" ".join([words[idx] for idx in treatments_idx]))
records = pd.DataFrame({"Disease" : diseases, "Treatment" : treatments})
# Turn the standalone word "and" into a comma separator and collapse doubled commas
records["Treatment"] = records["Treatment"].apply(lambda x : x.replace(" and ", " , ").replace(", ,", ","))
# Merge the treatments of sentences that mention the same disease
records = records.groupby("Disease")["Treatment"].apply(", ".join).reset_index()
records

|     | Disease | Treatment |
| --- | --- | --- |
| 0 | B16 melanoma | adenosine triphosphate buthionine sulfoximine |
| 1 | Barrett 's esophagus | Acid suppression therapy |
| 2 | Eisenmenger 's syndrome | laparoscopic cholecystectomy |
| 3 | Parkinson 's disease | Microelectrode-guided posteroventral pallidotomy |
| 4 | abdominal pain | thoracic paravertebral block ( tpvb ) |
| ... | ... | ... |
| 116 | tumors | Immunotherapy |
| 117 | unresectable stage iii nsclc | sequential chemotherapy |
| 118 | unstable angina | roxithromycin |
| 119 | untreated small cell lung cancer ( sclc ) sclc | chemotherapy |
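
When a sentence contains several separate disease or treatment mentions, joining label positions directly can blend them into one string. One alternative, sketched here under the assumption that each run of consecutive positions with the same label forms a single mention, is to group the indices into contiguous spans first. The helper `contiguous_spans`, the example sentence, and the index list are hypothetical.

```python
def contiguous_spans(indices):
    """Group sorted token indices into runs of consecutive positions."""
    spans = []
    for idx in indices:
        if spans and idx == spans[-1][-1] + 1:
            # Extends the current run
            spans[-1].append(idx)
        else:
            # Starts a new run
            spans.append([idx])
    return spans

words = "treatment of lung cancer and breast cancer with chemotherapy".split()
disease_idx = [2, 3, 5, 6]  # hypothetical positions labelled "D"

spans = contiguous_spans(disease_idx)
mentions = [" ".join(words[i] for i in span) for span in spans]
print(mentions)  # ['lung cancer', 'breast cancer']
```

Each span could then be stored as its own disease entry, which is a design choice that trades a simpler table for more faithful mention boundaries.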
