# Create a Prescription Parser using CRF
## This Notebook was run on Google Colab!!
This task tests your ability to build a doctor prescription parser with the help of a CRF (Conditional Random Fields) model.

Your job is to build a prescription parser that takes a prescription (a sentence) as input and labels each word in that sentence with one of the predefined labels.

### Problem: SEQUENCE PREDICTION - Label words in a sentence
#### Input : Doctor Prescription in the form of a sentence split into tokens
- Ex: Take 2 tablets once a day for 10 days

#### Output : FHIR Labels
- ('Take', 'Method')
- ('2', 'Qty') 
- ('tablets', 'Form')
- ('once', 'Frequency')
- ('a', 'Period') 
- ('day', 'PeriodUnit')
- ('for', 'FOR')
- ('10', 'Duration')
- ('days', 'DurationUnit') 

### Major Steps
- Install the necessary libraries
- Import the libraries
- Create training data with labels
    - Split the sentence into tokens
    - Compute POS tags
    - Create triples
- Extract features
- Split the data into training and testing set
- Create CRF model
- Save the CRF model
- Load the CRF model
- Predict on test data
- Accuracy


#### Install necessary libraries and versions

In [32]:
!pip install textacy
# !pip install textacy==0.9.1
# !python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [33]:
# Downgrade scikit-learn to avoid conflict with Conditional Random Fields Model (CRF)
# 'CRF' object has no attribute 'keep_tempfiles'
# !pip install scikit-learn==0.22.2 --user

In [34]:
pip install -U 'scikit-learn<0.24'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn<0.24
  Using cached scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.5 requires scikit-learn>=1.0.0, but you have scikit-learn 0.23.2 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.23.2 which is incompatible.
Successfully installed scikit-learn-0.23.2


In [238]:
# scikit-learn has to be running before installing sklearn-crfsuite
import sklearn

In [36]:
# Installing Conditional Random Fields Model (CRF)
!pip install sklearn-crfsuite

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [237]:
import spacy

In [37]:
from textacy import extract

In [38]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

In [39]:
from nltk.tag import pos_tag
from sklearn_crfsuite import CRF, metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer,confusion_matrix
from pprint import pprint
from sklearn.metrics import f1_score,classification_report
from sklearn.pipeline import Pipeline
import string

In [40]:
# Download 'punkt' and 'averaged_perceptron_tagger' to avoid errors in nltk functions
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

### Input data (GIVEN)
#### Creating the inputs to the ML model in the following form:
- sigs --> ['take 3 tabs for 10 days']       INPUT SIG
- input_sigs --> [['take', '3', 'tabs', 'for', '10', 'days']]      TOKENS
- output_labels --> [['Method','Qty', 'Form', 'FOR', 'Duration', 'DurationUnit']]       LABELS

In [41]:
sigs = ["for 5 to 6 days", "inject 2 units", "x 2 weeks", "x 3 days", "every day", "every 2 weeks", "every 3 days", "every 1 to 2 months", "every 2 to 6 weeks", "every 4 to 6 days", "take two to four tabs", "take 2 to 4 tabs", "take 3 tabs orally bid for 10 days at bedtime", "swallow three capsules tid orally", "take 2 capsules po every 6 hours", "take 2 tabs po for 10 days", "take 100 caps by mouth tid for 10 weeks", "take 2 tabs after an hour", "2 tabs every 4-6 hours", "every 4 to 6 hours", "q46h", "q4-6h", "2 hours before breakfast", "before 30 mins at bedtime", "30 mins before bed", "and 100 tabs twice a month", "100 tabs twice a month", "100 tabs once a month", "100 tabs thrice a month", "3 tabs daily for 3 days then 1 tab per day at bed", "30 tabs 10 days tid", "take 30 tabs for 10 days three times a day", "qid q6h", "bid", "qid", "30 tabs before dinner and bedtime", "30 tabs before dinner & bedtime", "take 3 tabs at bedtime", "30 tabs thrice daily for 10 days ", "30 tabs for 10 days three times a day", "Take 2 tablets a day", "qid for 10 days", "every day", "take 2 caps at bedtime", "apply 3 drops before bedtime", "take three capsules daily", "swallow 3 pills once a day", "swallow three pills thrice a day", "apply daily", "apply three drops before bedtime", "every 6 hours", "before food", "after food", "for 20 days", "for twenty days", "with meals"]
input_sigs = [['for', '5', 'to', '6', 'days'], ['inject', '2', 'units'], ['x', '2', 'weeks'], ['x', '3', 'days'], ['every', 'day'], ['every', '2', 'weeks'], ['every', '3', 'days'], ['every', '1', 'to', '2', 'months'], ['every', '2', 'to', '6', 'weeks'], ['every', '4', 'to', '6', 'days'], ['take', 'two', 'to', 'four', 'tabs'], ['take', '2', 'to', '4', 'tabs'], ['take', '3', 'tabs', 'orally', 'bid', 'for', '10', 'days', 'at', 'bedtime'], ['swallow', 'three', 'capsules', 'tid', 'orally'], ['take', '2', 'capsules', 'po', 'every', '6', 'hours'], ['take', '2', 'tabs', 'po', 'for', '10', 'days'], ['take', '100', 'caps', 'by', 'mouth', 'tid', 'for', '10', 'weeks'], ['take', '2', 'tabs', 'after', 'an', 'hour'], ['2', 'tabs', 'every', '4-6', 'hours'], ['every', '4', 'to', '6', 'hours'], ['q46h'], ['q4-6h'], ['2', 'hours', 'before', 'breakfast'], ['before', '30', 'mins', 'at', 'bedtime'], ['30', 'mins', 'before', 'bed'], ['and', '100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'once', 'a', 'month'], ['100', 'tabs', 'thrice', 'a', 'month'], ['3', 'tabs', 'daily', 'for', '3', 'days', 'then', '1', 'tab', 'per', 'day', 'at', 'bed'], ['30', 'tabs', '10', 'days', 'tid'], ['take', '30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['qid', 'q6h'], ['bid'], ['qid'], ['30', 'tabs', 'before', 'dinner', 'and', 'bedtime'], ['30', 'tabs', 'before', 'dinner', '&', 'bedtime'], ['take', '3', 'tabs', 'at', 'bedtime'], ['30', 'tabs', 'thrice', 'daily', 'for', '10', 'days'], ['30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['take', '2', 'tablets', 'a', 'day'], ['qid', 'for', '10', 'days'], ['every', 'day'], ['take', '2', 'caps', 'at', 'bedtime'], ['apply', '3', 'drops', 'before', 'bedtime'], ['take', 'three', 'capsules', 'daily'], ['swallow', '3', 'pills', 'once', 'a', 'day'], ['swallow', 'three', 'pills', 'thrice', 'a', 'day'], ['apply', 'daily'], ['apply', 'three', 'drops', 'before', 'bedtime'], ['every', '6', 
'hours'], ['before', 'food'], ['after', 'food'], ['for', '20', 'days'], ['for', 'twenty', 'days'], ['with', 'meals']]
output_labels = [['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit'], ['Method', 'Qty', 'Form'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['EVERY', 'Period'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'Form', 'PO', 'BID', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'TID', 'PO'], ['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'PO', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'BY', 'PO', 'TID', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'AFTER', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Q46H'], ['Q4-6H'], ['Qty', 'PeriodUnit', 'BEFORE', 'WHEN'], ['BEFORE', 'Qty', 'M', 'AT', 'WHEN'], ['Qty', 'M', 'BEFORE', 'WHEN'], ['AND', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'FOR', 'Duration', 'DurationUnit', 'THEN', 'Qty', 'Form', 'Frequency', 'PeriodUnit', 'AT', 'WHEN'], ['Qty', 'Form', 'Duration', 'DurationUnit', 'TID'], ['Method', 'Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Qty', 'TIMES', 'Period', 'PeriodUnit'], ['QID', 'Q6H'], ['BID'], ['QID'],['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Qty', 'Form', 'Frequency', 'DAILY', 'FOR', 'Duration', 'DurationUnit'], ['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Frequency', 'TIMES', 'Period', 
'PeriodUnit'], ['Method', 'Qty', 'Form', 'Period', 'PeriodUnit'], ['QID', 'FOR', 'Duration', 'DurationUnit'], ['EVERY', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['Method', 'Qty', 'Form', 'DAILY'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'DAILY'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['EVERY', 'Period', 'PeriodUnit'], ['BEFORE', 'FOOD'], ['AFTER', 'FOOD'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['WITH', 'FOOD']]

In [42]:
# Checking size of samples
len(sigs), len(input_sigs) , len(output_labels)

(56, 56, 56)
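
The count check above confirms that the three lists have the same number of sentences, but not that each sentence has one label per token. A minimal sketch of a per-sentence check follows; `check_alignment` and the small example lists are hypothetical names introduced only for illustration and are not part of the notebook.

```python
def check_alignment(token_lists, label_lists):
    """Return the indices of sentences whose token and label counts differ."""
    if len(token_lists) != len(label_lists):
        raise ValueError("different number of sentences and label sequences")
    return [
        i
        for i, (tokens, labels) in enumerate(zip(token_lists, label_lists))
        if len(tokens) != len(labels)
    ]


# Small illustrative inputs (not the full notebook data);
# the second sentence is deliberately missing one label.
example_tokens = [["take", "3", "tabs"], ["every", "day"], ["x", "2", "weeks"]]
example_labels = [["Method", "Qty", "Form"], ["EVERY"], ["FOR", "Duration", "DurationUnit"]]

print(check_alignment(example_tokens, example_labels))  # [1]
```

In the notebook the same call could be made as `check_alignment(input_sigs, output_labels)`; an empty list would mean every token has a label.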

### Creating a Tuples Maker method
Write a function **tuples_maker(input_sigs, output_labels)** that pairs each token with its label and returns **output** in the form shown below.

Input(s): 
- input_sigs
- output_labels

Output:

[[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')], [second sentence], ...]

In [43]:
def tuples_maker(inp, out):
    sample_data = zip(inp, out)
    list_of_tuples=[]
    for items in list(sample_data):
        mapped = zip(items[0], items[1])
        list_of_tuples.append(list(mapped))
    return list_of_tuples

In [44]:
tuples =tuples_maker(input_sigs, output_labels)
tuples

[[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')],
 [('inject', 'Method'), ('2', 'Qty'), ('units', 'Form')],
 [('x', 'FOR'), ('2', 'Duration'), ('weeks', 'DurationUnit')],
 [('x', 'FOR'), ('3', 'Duration'), ('days', 'DurationUnit')],
 [('every', 'EVERY'), ('day', 'Period')],
 [('every', 'EVERY'), ('2', 'Period'), ('weeks', 'PeriodUnit')],
 [('every', 'EVERY'), ('3', 'Period'), ('days', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('1', 'Period'),
  ('to', 'TO'),
  ('2', 'PeriodMax'),
  ('months', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('2', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('weeks', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('4', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('days', 'PeriodUnit')],
 [('take', 'Method'),
  ('two', 'Qty'),
  ('to', 'TO'),
  ('four', 'Qty'),
  ('tabs', 'Form')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('to', 'TO'),
  ('4', 'Qty'),
  ('tabs', 'Form')],
 [('take', 'Method'),
  ('3', 

### Creating the triples_maker( ) for feature extraction
- input: sigs, input_sigs, output_labels
- output: 
[[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')], [second sentence], ... ]

In [45]:
def triples_maker(sigs, input_sigs, output_labels):
    # Tokenize each sig and tag every token with its part of speech
    sig_tokenized = [word_tokenize(sentence) for sentence in sigs]
    sig_tokenized_tagged = [pos_tag(sent) for sent in sig_tokenized]
    final_sig = []
    for i, items in enumerate(sig_tokenized_tagged):
        aux = []
        for j, (token, tag) in enumerate(items):
            aux.append([token, tag, output_labels[i][j]])
        # Append each sentence once, after all of its tokens are collected
        final_sig.append(aux)
    return final_sig


In [46]:
sample_data = triples_maker(sigs, 
                            input_sigs, 
                            output_labels)
sample_data

[[['for', 'IN', 'FOR'],
  ['5', 'CD', 'Duration'],
  ['to', 'TO', 'TO'],
  ['6', 'CD', 'DurationMax'],
  ['days', 'NNS', 'DurationUnit']],
 [['inject', 'JJ', 'Method'], ['2', 'CD', 'Qty'], ['units', 'NNS', 'Form']],
 [['x', 'RB', 'FOR'],
  ['2', 'CD', 'Duration'],
  ['weeks', 'NNS', 'Durat

### Creating the features extractor method (GIVEN as a BASELINE)
#### The features used are:
- BOS, EOS, lowercase form, 2- and 3-character suffixes, uppercase, title, digit, postag, and the same word-level features for the previous and next tokens
#### Feel free to include more features

In [47]:
def token_to_features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')

    # Features for words that are not at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features
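
The baseline invites additional features. The sketch below shows a few candidates that might help with tokens seen in the sample sigs, such as spelled-out numbers ("two", "twenty"), ranges ("4-6") and codes ("q6h"). The helper `extra_token_features`, the `NUMBER_WORDS` set and the feature names are assumptions made for illustration; whether they improve the model has not been measured here.

```python
# Hypothetical set of spelled-out numbers; extend as needed.
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "twenty",
}


def extra_token_features(word):
    """Return extra string features for one token, in the baseline's style."""
    lower = word.lower()
    return [
        'word[:2]=' + lower[:2],                                 # short prefix
        'word.isnumberword=%s' % (lower in NUMBER_WORDS),        # "two", "twenty"
        'word.hashyphen=%s' % ('-' in word),                     # ranges like "4-6"
        'word.hasdigit=%s' % any(ch.isdigit() for ch in word),   # "q6h", "4-6"
    ]


print(extra_token_features("4-6"))
# ['word[:2]=4-', 'word.isnumberword=False', 'word.hashyphen=True', 'word.hasdigit=True']
```

One way to use this would be to call `features.extend(extra_token_features(word))` inside `token_to_features` before the function returns.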

### Running the feature extractor on the training data 
- Feature extraction
- Train-test-split

In [48]:
# result =[]
# for sentence in sample_data:
#   for i, word in enumerate (sentence):
#     y = token_to_features(sentence, i)
#     result.append(y)
# result

In [49]:
# Functions for extracting features in documents
def extract_features(doc):
    return [token_to_features(doc, i) for i in range(len(doc))]

def get_labels(doc):
    return [label for (token, postag, label) in doc]

In [50]:
tuples

[[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')],
 [('inject', 'Method'), ('2', 'Qty'), ('units', 'Form')],
 [('x', 'FOR'), ('2', 'Duration'), ('weeks', 'DurationUnit')],
 [('x', 'FOR'), ('3', 'Duration'), ('days', 'DurationUnit')],
 [('every', 'EVERY'), ('day', 'Period')],
 [('every', 'EVERY'), ('2', 'Period'), ('weeks', 'PeriodUnit')],
 [('every', 'EVERY'), ('3', 'Period'), ('days', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('1', 'Period'),
  ('to', 'TO'),
  ('2', 'PeriodMax'),
  ('months', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('2', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('weeks', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('4', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('days', 'PeriodUnit')],
 [('take', 'Method'),
  ('two', 'Qty'),
  ('to', 'TO'),
  ('four', 'Qty'),
  ('tabs', 'Form')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('to', 'TO'),
  ('4', 'Qty'),
  ('tabs', 'Form')],
 [('take', 'Method'),
  ('3', 

In [51]:
data = []

for i, doc in enumerate(tuples):
    tokens = [t for t, label in doc]    
    tagged = nltk.pos_tag(tokens)    
    data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])


In [52]:
data

[[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')],
 [('inject', 'JJ', 'Method'), ('2', 'CD', 'Qty'), ('units', 'NNS', 'Form')],
 [('x', 'RB', 'FOR'),
  ('2', 'CD', 'Duration'),
  ('weeks', 'NNS', 'DurationUnit')],
 [('x', 'RB', 'FOR'),
  ('3', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('every', 'DT', 'EVERY'), ('day', 'NN', 'Period')],
 [('every', 'DT', 'EVERY'),
  ('2', 'CD', 'Period'),
  ('weeks', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('3', 'CD', 'Period'),
  ('days', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('1', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('2', 'CD', 'PeriodMax'),
  ('months', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('2', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('weeks', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('4', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('

In [53]:
X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
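
The split above uses no fixed seed, so the train and test sets change on every run; with scikit-learn, passing `random_state` to `train_test_split` makes the split repeatable. The idea is sketched below with the standard library only, so this is a stand-in and not scikit-learn's implementation. `seeded_split` and the demo lists are hypothetical names.

```python
import math
import random


def seeded_split(X, y, test_size=0.2, seed=42):
    """Shuffle sentence indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Illustrative data: ten one-token "sentences" with matching labels
X_demo = [["feat%d" % i] for i in range(10)]
y_demo = [["label%d" % i] for i in range(10)]

first = seeded_split(X_demo, y_demo)
second = seeded_split(X_demo, y_demo)
print(first == second)                 # True: same seed, same split
print(len(first[0]), len(first[1]))    # 8 2
```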

In [54]:
X_test

[[['bias',
   'word.lower=take',
   'word[-3:]=ake',
   'word[-2:]=ke',
   'word.isupper=False',
   'word.istitle=False',
   'word.isdigit=False',
   'postag=VB',
   'BOS',
   '+1:word.lower=2',
   '+1:word.istitle=False',
   '+1:word.isupper=False',
   '+1:word.isdigit=True',
   '+1:postag=CD'],
  ['bias',
   'word.lower=2',
   'word[-3:]=2',
   'word[-2:]=2',
   'word.isupper=False',
   'word.istitle=False',
   'word.isdigit=True',
   'postag=CD',
   '-1:word.lower=take',
   '-1:word.istitle=False',
   '-1:word.isupper=False',
   '-1:word.isdigit=False',
   '-1:postag=VB',
   '+1:word.lower=tablets',
   '+1:word.istitle=False',
   '+1:word.isupper=False',
   '+1:word.isdigit=False',
   '+1:postag=NNS'],
  ['bias',
   'word.lower=tablets',
   'word[-3:]=ets',
   'word[-2:]=ts',
   'word.isupper=False',
   'word.istitle=False',
   'word.isdigit=False',
   'postag=NNS',
   '-1:word.lower=2',
   '-1:word.istitle=False',
   '-1:word.isupper=False',
   '-1:word.isdigit=True',
   '-1:postag

In [55]:
# Sanity check of size of training and testing set
print(len(X_train), len(X_test), len(y_train), len(y_test))

44 12 44 12


### Training the CRF model with the features extracted using the feature extractor method

In [187]:
# Define the model: L-BFGS training with L1 (c1) and L2 (c2) regularization
# No model file name is passed here, so the trained model is not written
# to a chosen path when training finishes
crf = CRF(algorithm='lbfgs', 
          c1=0.1, 
          c2=0.10, 
          max_iterations=150,
          verbose=True)

In [188]:
# Train the model with generated training data
crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████| 44/44 [00:00<00:00, 17803.34it/s]


Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 957
Seconds required: 0.003

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 150
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.00  loss=570.10   active=937   feature_norm=1.00
Iter 2   time=0.00  loss=342.31   active=923   feature_norm=5.78
Iter 3   time=0.00  loss=280.05   active=932   feature_norm=7.37
Iter 4   time=0.00  loss=243.98   active=934   feature_norm=7.51
Iter 5   time=0.00  loss=196.30   active=939   feature_norm=7.96
Iter 6   time=0.00  loss=154.80   active=887   feature_norm=9.07
Iter 7   time=0.00  loss=111.02   active=849   feature_norm=12.87
Iter 8   time=0.00  loss=88.46    active=911   feature_norm=14.16
Iter 9   time=0.00  loss=83.43    active=932   feature_norm=14.50
Iter 10  time




CRF(algorithm='lbfgs', c1=0.1, c2=0.1, keep_tempfiles=None, max_iterations=150,
    verbose=True)
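
The Major Steps list includes saving and loading the CRF model, but no cell in this run does so. A minimal sketch using `pickle` is shown below, assuming the fitted `crf` object pickles cleanly (the sklearn-crfsuite documentation describes saving models with joblib, which relies on the same mechanism). The helper names, the file name and the stand-in dictionary are hypothetical; the stand-in only lets the sketch run without a trained model.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a fitted model object to disk."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a previously saved model object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in for the fitted `crf`; in the notebook one would pass `crf` itself.
stand_in = {"algorithm": "lbfgs", "c1": 0.1, "c2": 0.1}
path = os.path.join(tempfile.mkdtemp(), "prescription_crf.pkl")

save_model(stand_in, path)
restored = load_model(path)
print(restored == stand_in)  # True
```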

### Predicting the test data with the built model

In [189]:
crf.predict(X_test)

[['Method', 'Qty', 'Form', 'Period', 'PeriodUnit'],
 ['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit'],
 ['Method', 'Qty', 'Form', 'AT', 'WHEN'],
 ['Method',
  'Qty',
  'Form',
  'FOR',
  'Duration',
  'DurationUnit',
  'Frequency',
  'TIMES',
  'Period',
  'PeriodUnit'],
 ['Method', 'Qty', 'Form', 'TID', 'DAILY'],
 ['EVERY', 'Period'],
 ['Method', 'DAILY'],
 ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'],
 ['EVERY', 'Period', 'TO', 'Qty', 'Form'],
 ['BEFORE', 'WHEN'],
 ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'],
 ['EVERY', 'Period', 'TO', 'DurationMax', 'DurationUnit']]
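
The Major Steps list ends with an accuracy step, which this run does not show. One simple measure is token-level accuracy: the share of tokens whose predicted label matches the gold label. The sketch below is plain Python; `token_accuracy` is a hypothetical helper and the two small lists are illustrative, not results from this notebook. The `metrics` module imported from `sklearn_crfsuite` is understood to provide flattened scorers for the same purpose.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label equals the gold label."""
    total = 0
    correct = 0
    for gold_seq, pred_seq in zip(y_true, y_pred):
        for gold, pred in zip(gold_seq, pred_seq):
            total += 1
            if gold == pred:
                correct += 1
    return correct / total if total else 0.0


# Illustrative sequences only
gold = [["EVERY", "Period", "TO", "PeriodMax", "PeriodUnit"], ["BEFORE", "FOOD"]]
pred = [["EVERY", "Period", "TO", "DurationMax", "DurationUnit"], ["BEFORE", "WHEN"]]

print(token_accuracy(gold, pred))  # 4 of the 7 tokens match
```

In the notebook the call would be `token_accuracy(y_test, crf.predict(X_test))`.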

### Putting all the prediction logic inside a predict method

In [190]:
def predict(s):
    """
    predict(s)
    Purpose: Labels each token of the given sig
    @param s    A sentence  # A medical prescription sig written by a doctor
    @return     A list      # A list with predicted labels (first level of labeling)

    >>> predict('2 tabs every 4 hours')
    [['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit']]
    
    >>> predict('2 tabs with food')
    [['Qty', 'Form', 'WITH', 'FOOD']]
    
    >>> predict('2 tabs qid x 30 days')
    [['Qty', 'Form', 'QID', 'FOR', 'Duration', 'DurationUnit']]
    """
    print (s)
    sig_tokenized=[]
    sig_tokenized.append(word_tokenize(s))
    sig_tokenized_tagged =[pos_tag(sent) for sent in sig_tokenized]
    # print(sig_tokenized_tagged)
    Xp = [extract_features(doc) for doc in sig_tokenized_tagged]
    # print(Xp)
    return crf.predict(Xp)

### Sample predictions

In [191]:
predict("take 2 tabs every 6 hours x 10 days")

take 2 tabs every 6 hours x 10 days


[['Method',
  'Qty',
  'Form',
  'EVERY',
  'Period',
  'PeriodUnit',
  'FOR',
  'Duration',
  'DurationUnit']]

In [192]:
predict("2 capsu for 10 day at bed")

2 capsu for 10 day at bed


[['Qty', 'Form', 'FOR', 'Duration', 'PeriodUnit', 'AT', 'WHEN']]

In [193]:
predict("2 capsu for 10 days at bed")

2 capsu for 10 days at bed


[['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN']]

In [194]:
predict("5 days 2 tabs at bed")

5 days 2 tabs at bed


[['Duration', 'DurationUnit', 'Qty', 'Form', 'AT', 'WHEN']]

In [195]:
predict("3 tabs qid x 10 weeks")

3 tabs qid x 10 weeks


[['Qty', 'Form', 'QID', 'FOR', 'Duration', 'DurationUnit']]

In [196]:
predict("x 30 days")

x 30 days


[['FOR', 'Duration', 'DurationUnit']]

In [197]:
predict("x 20 months")

x 20 months


[['FOR', 'Duration', 'DurationUnit']]

In [198]:
predict("take 2 tabs po tid for 10 days")

take 2 tabs po tid for 10 days


[['Method', 'Qty', 'Form', 'PO', 'TID', 'FOR', 'Duration', 'DurationUnit']]

In [199]:
predict("take 2 capsules po every 6 hours")

take 2 capsules po every 6 hours


[['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit']]

In [200]:
predict("inject 2 units pu tid")

inject 2 units pu tid


[['Method', 'Qty', 'Form', 'PO', 'TID']]

In [201]:
predict("swallow 3 caps tid by mouth")

swallow 3 caps tid by mouth


[['Method', 'Qty', 'Form', 'TID', 'BY', 'PO']]

In [202]:
predict("inject 3 units orally")

inject 3 units orally


[['Method', 'Qty', 'Form', 'PO']]

In [203]:
predict("orally take 3 tabs tid")

orally take 3 tabs tid


[['PO', 'Method', 'Qty', 'Form', 'TID']]

In [204]:
predict("by mouth take three caps")

by mouth take three caps


[['BY', 'PO', 'Method', 'Qty', 'Form']]

In [205]:
predict("take 3 tabs orally three times a day for 10 days at bedtime")

take 3 tabs orally three times a day for 10 days at bedtime


[['Method',
  'Qty',
  'Form',
  'PO',
  'Frequency',
  'TIMES',
  'Period',
  'PeriodUnit',
  'FOR',
  'Duration',
  'DurationUnit',
  'AT',
  'WHEN']]

In [206]:
predict("take 3 tabs orally bid for 10 days at bedtime")

take 3 tabs orally bid for 10 days at bedtime


[['Method',
  'Qty',
  'Form',
  'PO',
  'BID',
  'FOR',
  'Duration',
  'DurationUnit',
  'AT',
  'WHEN']]

In [207]:
predict("take 3 tabs bid orally at bed")

take 3 tabs bid orally at bed


[['Method', 'Qty', 'Form', 'BID', 'PeriodUnit', 'AT', 'WHEN']]

In [208]:
predict("take 10 capsules by mouth qid")

take 10 capsules by mouth qid


[['Method', 'Qty', 'Form', 'BY', 'PO', 'QID']]

In [209]:
predict("inject 10 units orally qid x 3 months")

inject 10 units orally qid x 3 months


[['Method', 'Qty', 'Form', 'PO', 'QID', 'FOR', 'Duration', 'DurationUnit']]

In [210]:
predict("please take 2 tablets per day for a month in the morning and evening each day")

please take 2 tablets per day for a month in the morning and evening each day


[['Method',
  'Method',
  'Qty',
  'Form',
  'Frequency',
  'PeriodUnit',
  'FOR',
  'Period',
  'PeriodUnit',
  'EVERY',
  'Period',
  'PeriodUnit',
  'AND',
  'EVERY',
  'Period',
  'PeriodUnit']]

In [211]:
predict("Amoxcicillin QID 30 tablets")

Amoxcicillin QID 30 tablets


[['Qty', 'Method', 'Qty', 'Form']]

In [212]:
predict("take 3 tabs TID for 90 days with food")

take 3 tabs TID for 90 days with food


[['Method',
  'Qty',
  'Form',
  'Frequency',
  'FOR',
  'Duration',
  'DurationUnit',
  'WITH',
  'FOOD']]

In [213]:
predict("with food take 3 tablets per day for 90 days")

with food take 3 tablets per day for 90 days


[['WITH',
  'FOOD',
  'Method',
  'Qty',
  'Form',
  'Frequency',
  'PeriodUnit',
  'FOR',
  'Duration',
  'DurationUnit']]

In [214]:
predict("with food take 3 tablets per week for 90 weeks")

with food take 3 tablets per week for 90 weeks


[['WITH',
  'FOOD',
  'Method',
  'Qty',
  'Form',
  'Frequency',
  'PeriodUnit',
  'FOR',
  'Duration',
  'DurationUnit']]

In [215]:
predict("take 2-4 tabs")

take 2-4 tabs


[['Method', 'Qty', 'Form']]

In [216]:
predict("take 2 to 4 tabs")

take 2 to 4 tabs


[['Method', 'Qty', 'TO', 'Qty', 'Form']]

In [217]:
predict("take two to four tabs")

take two to four tabs


[['Method', 'Qty', 'TO', 'Qty', 'Form']]

In [218]:
predict("take 2-4 tabs for 8 to 9 days")

take 2-4 tabs for 8 to 9 days


[['Method',
  'Qty',
  'Form',
  'FOR',
  'Duration',
  'TO',
  'Duration',
  'DurationUnit']]

In [219]:
predict("take 20 tabs every 6 to 8 days")

take 20 tabs every 6 to 8 days


[['Method',
  'Qty',
  'Form',
  'EVERY',
  'Period',
  'TO',
  'Duration',
  'DurationUnit']]

In [220]:
predict("take 2 tabs every 4 to 6 days")

take 2 tabs every 4 to 6 days


[['Method',
  'Qty',
  'Form',
  'EVERY',
  'Period',
  'TO',
  'DurationMax',
  'DurationUnit']]

In [221]:
predict("take 2 tabs every 2 to 10 weeks")

take 2 tabs every 2 to 10 weeks


[['Method',
  'Qty',
  'Form',
  'EVERY',
  'Period',
  'TO',
  'Duration',
  'DurationUnit']]

In [222]:
predict("take 2 tabs every 4 to 6 days")

take 2 tabs every 4 to 6 days


[['Method',
  'Qty',
  'Form',
  'EVERY',
  'Period',
  'TO',
  'DurationMax',
  'DurationUnit']]

In [223]:
predict("take 2 tabs every 2 to 10 months")

take 2 tabs every 2 to 10 months


[['Method',
  'Qty',
  'Form',
  'EVERY',
  'Period',
  'TO',
  'Duration',
  'DurationUnit']]

In [224]:
predict("every 60 mins")

every 60 mins


[['EVERY', 'Period', 'PeriodUnit']]

In [225]:
predict("every 10 mins")

every 10 mins


[['EVERY', 'Period', 'PeriodUnit']]

In [226]:
predict("every two to four months")

every two to four months


[['EVERY', 'Period', 'TO', 'Qty', 'Form']]

In [227]:
predict("take 2 tabs every 3 to 4 days")

take 2 tabs every 3 to 4 days


[['Method',
  'Qty',
  'Form',
  'EVERY',
  'Period',
  'TO',
  'Duration',
  'DurationUnit']]

In [228]:
predict("every 3 to 4 days take 20 tabs")

every 3 to 4 days take 20 tabs


[['EVERY',
  'Period',
  'TO',
  'Duration',
  'DurationUnit',
  'Method',
  'Qty',
  'Form']]

In [229]:
predict("once in every 3 days take 3 tabs")

once in every 3 days take 3 tabs


[['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit', 'Method', 'Qty', 'Form']]

In [230]:
predict("take 3 tabs once in every 3 days")

take 3 tabs once in every 3 days


[['Method',
  'Qty',
  'Form',
  'Frequency',
  'TIMES',
  'EVERY',
  'Period',
  'PeriodUnit']]

In [231]:
predict("orally take 20 tabs every 4-6 weeks")

orally take 20 tabs every 4-6 weeks


[['PO', 'Method', 'Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit']]

In [232]:
predict("10 tabs x 2 days")

10 tabs x 2 days


[['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit']]

In [233]:
predict("3 capsule x 15 days")

3 capsule x 15 days


[['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit']]

In [234]:
predict("10 tabs")

10 tabs


[['Qty', 'Form']]
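
The problem statement asks for output as (token, label) pairs, while `predict` returns only the labels. A small sketch of the pairing step follows. `pair_tokens_with_labels` is a hypothetical helper, and it assumes that splitting on whitespace yields the same tokens as the tokenizer used inside `predict`, which holds for simple sigs but may not hold for punctuation.

```python
def pair_tokens_with_labels(sentence, predicted):
    """Zip whitespace tokens with the first label sequence in `predicted`."""
    tokens = sentence.split()
    labels = predicted[0]
    if len(tokens) != len(labels):
        raise ValueError("token count and label count differ")
    return list(zip(tokens, labels))


# Labels copied from the sample prediction for "x 30 days" above
pairs = pair_tokens_with_labels("x 30 days", [["FOR", "Duration", "DurationUnit"]])
print(pairs)
# [('x', 'FOR'), ('30', 'Duration'), ('days', 'DurationUnit')]
```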