


# A Prescription Parser using CRF
This task tests your ability to build a doctor prescription parser using a Conditional Random Field (CRF) model.

Your job is to build a prescription parser that takes a prescription (a sentence) as input and labels each word in that sentence with one of the predefined labels.

### Problem: SEQUENCE PREDICTION - Label words in a sentence
#### Input : Doctor Prescription in the form of a sentence split into tokens
- Ex: Take 2 tablets once a day for 10 days

#### Output : FHIR Labels
- ('Take', 'Method')
- ('2', 'Qty')
- ('tablets', 'Form')
- ('once', 'Frequency')
- ('a', 'Period')
- ('day', 'PeriodUnit')
- ('for', 'FOR')
- ('10', 'Duration')
- ('days', 'DurationUnit')
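
The input/output pairing above can be sketched in a few lines. This is only an illustration of the target format: `label_tokens` is a hypothetical helper name that is not part of the notebook, and it pairs already-known labels with tokens rather than predicting anything.

```python
def label_tokens(tokens, labels):
    """Pair each token with its label; both sequences must have equal length."""
    if len(tokens) != len(labels):
        raise ValueError(f"{len(tokens)} tokens but {len(labels)} labels")
    return list(zip(tokens, labels))

# The example prescription from above, split on whitespace.
tokens = "Take 2 tablets once a day for 10 days".split()
labels = ["Method", "Qty", "Form", "Frequency", "Period",
          "PeriodUnit", "FOR", "Duration", "DurationUnit"]

pairs = label_tokens(tokens, labels)
print(pairs[:3])  # [('Take', 'Method'), ('2', 'Qty'), ('tablets', 'Form')]
```

The CRF model built later is expected to produce the second element of each pair for unseen sentences.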

### Major Steps
- Install the necessary libraries
- Import the libraries
- Create training data with labels
    - Split the sentence into tokens
    - Compute POS tags
    - Create triples
- Extract features
- Split the data into training and testing set
- Create CRF model
- Save the CRF model
- Load the CRF model
- Predict on test data
- Accuracy
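
For the final step, one simple measure is token-level accuracy: the fraction of tokens whose predicted label matches the true label. The sketch below uses only the standard library. `token_accuracy` is a hypothetical helper name, and the notebook may use a library metric instead.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label equals the true label."""
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        if len(true_seq) != len(pred_seq):
            raise ValueError("true and predicted sequences differ in length")
        for true_label, pred_label in zip(true_seq, pred_seq):
            correct += (true_label == pred_label)
            total += 1
    # Avoid dividing by zero when there is nothing to score.
    return correct / total if total else 0.0

y_true = [["FOR", "Duration", "DurationUnit"], ["EVERY", "Period"]]
y_pred = [["FOR", "Duration", "PeriodUnit"], ["EVERY", "Period"]]
print(token_accuracy(y_true, y_pred))  # 4 of 5 tokens correct: 0.8
```

With only 56 short sentences, a single wrong label shifts this number noticeably, so it is worth reading alongside the individual predictions.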

#### Install the necessary libraries

In [None]:
#!pip install textacy==0.11.0
#!python -m spacy download en_core_web_sm
#!pip install sklearn_crfsuite
#!pip install thinc

Collecting textacy==0.11.0
  Using cached textacy-0.11.0-py3-none-any.whl (200 kB)
Installing collected packages: textacy
  Attempting uninstall: textacy
    Found existing installation: textacy 0.13.0
    Uninstalling textacy-0.13.0:
      Successfully uninstalled textacy-0.13.0
Successfully installed textacy-0.11.0


In [None]:
pip show textacy

Name: textacy
Version: 0.11.0
Summary: NLP, before and after spaCy
Home-page: https://github.com/chartbeat-labs/textacy
Author: 
Author-email: 
License: Apache
Location: C:\Users\FNS LECTURE\anaconda3\Lib\site-packages
Requires: cachetools, cytoolz, jellyfish, joblib, networkx, numpy, pyphen, requests, scikit-learn, scipy, spacy, tqdm
Required-by: 
Note: you may need to restart the kernel to use updated packages.


#### Import the necessary libraries

In [None]:
import spacy
from textacy import *
#from textacy import ke

### Input data (GIVEN)
#### Creating the inputs to the ML model in the following form:
- sigs --> ['take 3 tabs for 10 days']       INPUT SIG
- input_sigs --> [['take', '3', 'tabs', 'for', '10', 'days']]      TOKENS
- output_labels --> [['Method','Qty', 'Form', 'FOR', 'Duration', 'DurationUnit']]       LABELS

In [None]:
sigs = ["for 5 to 6 days", "inject 2 units", "x 2 weeks", "x 3 days", "every day", "every 2 weeks", "every 3 days", "every 1 to 2 months", "every 2 to 6 weeks", "every 4 to 6 days", "take two to four tabs", "take 2 to 4 tabs", "take 3 tabs orally bid for 10 days at bedtime", "swallow three capsules tid orally", "take 2 capsules po every 6 hours", "take 2 tabs po for 10 days", "take 100 caps by mouth tid for 10 weeks", "take 2 tabs after an hour", "2 tabs every 4-6 hours", "every 4 to 6 hours", "q46h", "q4-6h", "2 hours before breakfast", "before 30 mins at bedtime", "30 mins before bed", "and 100 tabs twice a month", "100 tabs twice a month", "100 tabs once a month", "100 tabs thrice a month", "3 tabs daily for 3 days then 1 tab per day at bed", "30 tabs 10 days tid", "take 30 tabs for 10 days three times a day", "qid q6h", "bid", "qid", "30 tabs before dinner and bedtime", "30 tabs before dinner & bedtime", "take 3 tabs at bedtime", "30 tabs thrice daily for 10 days ", "30 tabs for 10 days three times a day", "Take 2 tablets a day", "qid for 10 days", "every day", "take 2 caps at bedtime", "apply 3 drops before bedtime", "take three capsules daily", "swallow 3 pills once a day", "swallow three pills thrice a day", "apply daily", "apply three drops before bedtime", "every 6 hours", "before food", "after food", "for 20 days", "for twenty days", "with meals"]
input_sigs = [['for', '5', 'to', '6', 'days'], ['inject', '2', 'units'], ['x', '2', 'weeks'], ['x', '3', 'days'], ['every', 'day'], ['every', '2', 'weeks'], ['every', '3', 'days'], ['every', '1', 'to', '2', 'months'], ['every', '2', 'to', '6', 'weeks'], ['every', '4', 'to', '6', 'days'], ['take', 'two', 'to', 'four', 'tabs'], ['take', '2', 'to', '4', 'tabs'], ['take', '3', 'tabs', 'orally', 'bid', 'for', '10', 'days', 'at', 'bedtime'], ['swallow', 'three', 'capsules', 'tid', 'orally'], ['take', '2', 'capsules', 'po', 'every', '6', 'hours'], ['take', '2', 'tabs', 'po', 'for', '10', 'days'], ['take', '100', 'caps', 'by', 'mouth', 'tid', 'for', '10', 'weeks'], ['take', '2', 'tabs', 'after', 'an', 'hour'], ['2', 'tabs', 'every', '4-6', 'hours'], ['every', '4', 'to', '6', 'hours'], ['q46h'], ['q4-6h'], ['2', 'hours', 'before', 'breakfast'], ['before', '30', 'mins', 'at', 'bedtime'], ['30', 'mins', 'before', 'bed'], ['and', '100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'twice', 'a', 'month'], ['100', 'tabs', 'once', 'a', 'month'], ['100', 'tabs', 'thrice', 'a', 'month'], ['3', 'tabs', 'daily', 'for', '3', 'days', 'then', '1', 'tab', 'per', 'day', 'at', 'bed'], ['30', 'tabs', '10', 'days', 'tid'], ['take', '30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['qid', 'q6h'], ['bid'], ['qid'], ['30', 'tabs', 'before', 'dinner', 'and', 'bedtime'], ['30', 'tabs', 'before', 'dinner', '&', 'bedtime'], ['take', '3', 'tabs', 'at', 'bedtime'], ['30', 'tabs', 'thrice', 'daily', 'for', '10', 'days'], ['30', 'tabs', 'for', '10', 'days', 'three', 'times', 'a', 'day'], ['take', '2', 'tablets', 'a', 'day'], ['qid', 'for', '10', 'days'], ['every', 'day'], ['take', '2', 'caps', 'at', 'bedtime'], ['apply', '3', 'drops', 'before', 'bedtime'], ['take', 'three', 'capsules', 'daily'], ['swallow', '3', 'pills', 'once', 'a', 'day'], ['swallow', 'three', 'pills', 'thrice', 'a', 'day'], ['apply', 'daily'], ['apply', 'three', 'drops', 'before', 'bedtime'], ['every', '6', 
'hours'], ['before', 'food'], ['after', 'food'], ['for', '20', 'days'], ['for', 'twenty', 'days'], ['with', 'meals']]
output_labels = [['FOR', 'Duration', 'TO', 'DurationMax', 'DurationUnit'], ['Method', 'Qty', 'Form'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['EVERY', 'Period'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'TO', 'Qty', 'Form'], ['Method', 'Qty', 'Form', 'PO', 'BID', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'TID', 'PO'], ['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'PO', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'BY', 'PO', 'TID', 'FOR', 'Duration', 'DurationUnit'], ['Method', 'Qty', 'Form', 'AFTER', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit'], ['EVERY', 'Period', 'TO', 'PeriodMax', 'PeriodUnit'], ['Q46H'], ['Q4-6H'], ['Qty', 'PeriodUnit', 'BEFORE', 'WHEN'], ['BEFORE', 'Qty', 'M', 'AT', 'WHEN'], ['Qty', 'M', 'BEFORE', 'WHEN'], ['AND', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Qty', 'Form', 'Frequency', 'FOR', 'Duration', 'DurationUnit', 'THEN', 'Qty', 'Form', 'Frequency', 'PeriodUnit', 'AT', 'WHEN'], ['Qty', 'Form', 'Duration', 'DurationUnit', 'TID'], ['Method', 'Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Qty', 'TIMES', 'Period', 'PeriodUnit'], ['QID', 'Q6H'], ['BID'], ['QID'],['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Qty', 'Form', 'BEFORE', 'WHEN', 'AND', 'WHEN'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Qty', 'Form', 'Frequency', 'DAILY', 'FOR', 'Duration', 'DurationUnit'], ['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'Frequency', 'TIMES', 'Period', 
'PeriodUnit'], ['Method', 'Qty', 'Form', 'Period', 'PeriodUnit'], ['QID', 'FOR', 'Duration', 'DurationUnit'], ['EVERY', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'AT', 'WHEN'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['Method', 'Qty', 'Form', 'DAILY'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'Qty', 'Form', 'Frequency', 'Period', 'PeriodUnit'], ['Method', 'DAILY'], ['Method', 'Qty', 'Form', 'BEFORE', 'WHEN'], ['EVERY', 'Period', 'PeriodUnit'], ['BEFORE', 'FOOD'], ['AFTER', 'FOOD'], ['FOR', 'Duration', 'DurationUnit'], ['FOR', 'Duration', 'DurationUnit'], ['WITH', 'FOOD']]

In [None]:
len(sigs), len(input_sigs) , len(output_labels)

(56, 56, 56)
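
The length check above only confirms that the three lists hold the same number of sentences. A per-sentence check that every token has exactly one label is a useful complement. The sketch below shows the idea on a small made-up sample; `find_misaligned` is a hypothetical helper name.

```python
def find_misaligned(token_lists, label_lists):
    """Return the indices of sentences whose token and label counts differ."""
    return [
        i
        for i, (tokens, labels) in enumerate(zip(token_lists, label_lists))
        if len(tokens) != len(labels)
    ]

# Small illustrative sample; the second sentence is deliberately missing a label.
demo_tokens = [["take", "3", "tabs"], ["every", "day"], ["for", "20", "days"]]
demo_labels = [["Method", "Qty", "Form"], ["EVERY"], ["FOR", "Duration", "DurationUnit"]]

print(find_misaligned(demo_tokens, demo_labels))  # [1]
```

In the notebook, calling `find_misaligned(input_sigs, output_labels)` should return an empty list if the data is aligned.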

### Creating a Tuples Maker method
Write a function **tuples_maker(input_sigs, output_labels)** that pairs each token with its label, sentence by sentence, and returns **output** in the form shown below.

Input(s):
- input_sigs
- output_labels

Output:

[[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')], [second sentence], ...]

In [None]:
#def tuples_maker(inpput_sigs, output_labels):
 #   inp = [input_sigs[0]+i for i in input_sigs[1:]][0]
 #   out = []

 #   for i in output_labels:
 #       out.extend(i)

 #   sample_data = [(i, j) for i, j in zip(inp, out)]

 #   return sample_data

#tuples_maker(input_sigs, output_labels)

[('for', 'FOR'),
 ('5', 'Duration'),
 ('to', 'TO'),
 ('6', 'DurationMax'),
 ('days', 'DurationUnit'),
 ('inject', 'Method'),
 ('2', 'Qty'),
 ('units', 'Form')]

In [None]:
def tuples_maker(input_sigs, output_labels):
    # Pair tokens with labels one sentence at a time so that sentence
    # boundaries are preserved (the CRF is trained on whole sequences).
    sample_data = []
    for tokens, labels in zip(input_sigs, output_labels):
        sample_data.append(list(zip(tokens, labels)))

    return sample_data

result = tuples_maker(input_sigs, output_labels)
print(result[:2])

[[('for', 'FOR'), ('5', 'Duration'), ('to', 'TO'), ('6', 'DurationMax'), ('days', 'DurationUnit')], [('inject', 'Method'), ('2', 'Qty'), ('units', 'Form')]]

In [None]:
#sample results as a guide

[[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')],
 [('inject', 'Method'), ('2', 'Qty'), ('units', 'Form')],
 [('x', 'FOR'), ('2', 'Duration'), ('weeks', 'DurationUnit')],
 [('x', 'FOR'), ('3', 'Duration'), ('days', 'DurationUnit')],
 [('every', 'EVERY'), ('day', 'Period')],
 [('every', 'EVERY'), ('2', 'Period'), ('weeks', 'PeriodUnit')],
 [('every', 'EVERY'), ('3', 'Period'), ('days', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('1', 'Period'),
  ('to', 'TO'),
  ('2', 'PeriodMax'),
  ('months', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('2', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('weeks', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('4', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('days', 'PeriodUnit')],
 [('take', 'Method'),
  ('two', 'Qty'),
  ('to', 'TO'),
  ('four', 'Qty'),
  ('tabs', 'Form')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('to', 'TO'),
  ('4', 'Qty'),
  ('tabs', 'Form')],
 [('take', 'Method'),
  ('3', 

### Creating the triples_maker( ) for feature extraction
- input: tuples_maker_output
- output:
[[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')], [second sentence], ... ]

In [None]:
#sample data as a guide

[[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')],
 [('inject', 'JJ', 'Method'), ('2', 'CD', 'Qty'), ('units', 'NNS', 'Form')],
 [('x', 'RB', 'FOR'),
  ('2', 'CD', 'Duration'),
  ('weeks', 'NNS', 'DurationUnit')],
 [('x', 'RB', 'FOR'),
  ('3', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('every', 'DT', 'EVERY'), ('day', 'NN', 'Period')],
 [('every', 'DT', 'EVERY'),
  ('2', 'CD', 'Period'),
  ('weeks', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('3', 'CD', 'Period'),
  ('days', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('1', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('2', 'CD', 'PeriodMax'),
  ('months', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('2', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('weeks', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('4', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('

In [None]:
from spacy.tokens import Doc

# Assumption: the POS tags come from spaCy's en_core_web_sm model (installed
# above). The sample tags shown in this notebook may come from a different
# tagger, so individual tags can differ from the guide.
nlp = spacy.load("en_core_web_sm")

def triples_maker(input_sigs, output_labels):
    sample_data = []
    for tokens, labels in zip(input_sigs, output_labels):
        # Build the Doc from the existing tokens so spaCy does not re-tokenize
        # strings such as '4-6' or 'q4-6h'.
        doc = Doc(nlp.vocab, words=list(tokens))
        for _, component in nlp.pipeline:
            doc = component(doc)
        # One (token, POS tag, label) triple per word, grouped by sentence.
        sample_data.append(
            [(token.text, token.tag_, label) for token, label in zip(doc, labels)]
        )

    return sample_data

# Example usage (reuses input_sigs and output_labels defined earlier):
triples = triples_maker(input_sigs, output_labels)
print(triples[:2])

### Creating the features extractor method (GIVEN as a BASELINE)
#### The features used are:
- BOS, EOS, lowercase, suffixes, uppercase, title, digit, postag, previous_tag, next_tag
#### Feel free to include more features

In [None]:
result_tuples = [[('for', 'FOR'),
  ('5', 'Duration'),
  ('to', 'TO'),
  ('6', 'DurationMax'),
  ('days', 'DurationUnit')],
 [('inject', 'Method'), ('2', 'Qty'), ('units', 'Form')],
 [('x', 'FOR'), ('2', 'Duration'), ('weeks', 'DurationUnit')],
 [('x', 'FOR'), ('3', 'Duration'), ('days', 'DurationUnit')],
 [('every', 'EVERY'), ('day', 'Period')],
 [('every', 'EVERY'), ('2', 'Period'), ('weeks', 'PeriodUnit')],
 [('every', 'EVERY'), ('3', 'Period'), ('days', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('1', 'Period'),
  ('to', 'TO'),
  ('2', 'PeriodMax'),
  ('months', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('2', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('weeks', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('4', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('days', 'PeriodUnit')],
 [('take', 'Method'),
  ('two', 'Qty'),
  ('to', 'TO'),
  ('four', 'Qty'),
  ('tabs', 'Form')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('to', 'TO'),
  ('4', 'Qty'),
  ('tabs', 'Form')],
 [('take', 'Method'),
  ('3', 'Qty'),
  ('tabs', 'Form'),
  ('orally', 'PO'),
  ('bid', 'BID'),
  ('for', 'FOR'),
  ('10', 'Duration'),
  ('days', 'DurationUnit'),
  ('at', 'AT'),
  ('bedtime', 'WHEN')],
 [('swallow', 'Method'),
  ('three', 'Qty'),
  ('capsules', 'Form'),
  ('tid', 'TID'),
  ('orally', 'PO')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('capsules', 'Form'),
  ('po', 'PO'),
  ('every', 'EVERY'),
  ('6', 'Period'),
  ('hours', 'PeriodUnit')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('tabs', 'Form'),
  ('po', 'PO'),
  ('for', 'FOR'),
  ('10', 'Duration'),
  ('days', 'DurationUnit')],
 [('take', 'Method'),
  ('100', 'Qty'),
  ('caps', 'Form'),
  ('by', 'BY'),
  ('mouth', 'PO'),
  ('tid', 'TID'),
  ('for', 'FOR'),
  ('10', 'Duration'),
  ('weeks', 'DurationUnit')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('tabs', 'Form'),
  ('after', 'AFTER'),
  ('an', 'Period'),
  ('hour', 'PeriodUnit')],
 [('2', 'Qty'),
  ('tabs', 'Form'),
  ('every', 'EVERY'),
  ('4-6', 'Period'),
  ('hours', 'PeriodUnit')],
 [('every', 'EVERY'),
  ('4', 'Period'),
  ('to', 'TO'),
  ('6', 'PeriodMax'),
  ('hours', 'PeriodUnit')],
 [('q46h', 'Q46H')],
 [('q4-6h', 'Q4-6H')],
 [('2', 'Qty'),
  ('hours', 'PeriodUnit'),
  ('before', 'BEFORE'),
  ('breakfast', 'WHEN')],
 [('before', 'BEFORE'),
  ('30', 'Qty'),
  ('mins', 'M'),
  ('at', 'AT'),
  ('bedtime', 'WHEN')],
 [('30', 'Qty'), ('mins', 'M'), ('before', 'BEFORE'), ('bed', 'WHEN')],
 [('and', 'AND'),
  ('100', 'Qty'),
  ('tabs', 'Form'),
  ('twice', 'Frequency'),
  ('a', 'Period'),
  ('month', 'PeriodUnit')],
 [('100', 'Qty'),
  ('tabs', 'Form'),
  ('twice', 'Frequency'),
  ('a', 'Period'),
  ('month', 'PeriodUnit')],
 [('100', 'Qty'),
  ('tabs', 'Form'),
  ('once', 'Frequency'),
  ('a', 'Period'),
  ('month', 'PeriodUnit')],
 [('100', 'Qty'),
  ('tabs', 'Form'),
  ('thrice', 'Frequency'),
  ('a', 'Period'),
  ('month', 'PeriodUnit')],
 [('3', 'Qty'),
  ('tabs', 'Form'),
  ('daily', 'Frequency'),
  ('for', 'FOR'),
  ('3', 'Duration'),
  ('days', 'DurationUnit'),
  ('then', 'THEN'),
  ('1', 'Qty'),
  ('tab', 'Form'),
  ('per', 'Frequency'),
  ('day', 'PeriodUnit'),
  ('at', 'AT'),
  ('bed', 'WHEN')],
 [('30', 'Qty'),
  ('tabs', 'Form'),
  ('10', 'Duration'),
  ('days', 'DurationUnit'),
  ('tid', 'TID')],
 [('take', 'Method'),
  ('30', 'Qty'),
  ('tabs', 'Form'),
  ('for', 'FOR'),
  ('10', 'Duration'),
  ('days', 'DurationUnit'),
  ('three', 'Qty'),
  ('times', 'TIMES'),
  ('a', 'Period'),
  ('day', 'PeriodUnit')],
 [('qid', 'QID'), ('q6h', 'Q6H')],
 [('bid', 'BID')],
 [('qid', 'QID')],
 [('30', 'Qty'),
  ('tabs', 'Form'),
  ('before', 'BEFORE'),
  ('dinner', 'WHEN'),
  ('and', 'AND'),
  ('bedtime', 'WHEN')],
 [('30', 'Qty'),
  ('tabs', 'Form'),
  ('before', 'BEFORE'),
  ('dinner', 'WHEN'),
  ('&', 'AND'),
  ('bedtime', 'WHEN')],
 [('take', 'Method'),
  ('3', 'Qty'),
  ('tabs', 'Form'),
  ('at', 'AT'),
  ('bedtime', 'WHEN')],
 [('30', 'Qty'),
  ('tabs', 'Form'),
  ('thrice', 'Frequency'),
  ('daily', 'DAILY'),
  ('for', 'FOR'),
  ('10', 'Duration'),
  ('days', 'DurationUnit')],
 [('30', 'Qty'),
  ('tabs', 'Form'),
  ('for', 'FOR'),
  ('10', 'Duration'),
  ('days', 'DurationUnit'),
  ('three', 'Frequency'),
  ('times', 'TIMES'),
  ('a', 'Period'),
  ('day', 'PeriodUnit')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('tablets', 'Form'),
  ('a', 'Period'),
  ('day', 'PeriodUnit')],
 [('qid', 'QID'),
  ('for', 'FOR'),
  ('10', 'Duration'),
  ('days', 'DurationUnit')],
 [('every', 'EVERY'), ('day', 'PeriodUnit')],
 [('take', 'Method'),
  ('2', 'Qty'),
  ('caps', 'Form'),
  ('at', 'AT'),
  ('bedtime', 'WHEN')],
 [('apply', 'Method'),
  ('3', 'Qty'),
  ('drops', 'Form'),
  ('before', 'BEFORE'),
  ('bedtime', 'WHEN')],
 [('take', 'Method'),
  ('three', 'Qty'),
  ('capsules', 'Form'),
  ('daily', 'DAILY')],
 [('swallow', 'Method'),
  ('3', 'Qty'),
  ('pills', 'Form'),
  ('once', 'Frequency'),
  ('a', 'Period'),
  ('day', 'PeriodUnit')],
 [('swallow', 'Method'),
  ('three', 'Qty'),
  ('pills', 'Form'),
  ('thrice', 'Frequency'),
  ('a', 'Period'),
  ('day', 'PeriodUnit')],
 [('apply', 'Method'), ('daily', 'DAILY')],
 [('apply', 'Method'),
  ('three', 'Qty'),
  ('drops', 'Form'),
  ('before', 'BEFORE'),
  ('bedtime', 'WHEN')],
 [('every', 'EVERY'), ('6', 'Period'), ('hours', 'PeriodUnit')],
 [('before', 'BEFORE'), ('food', 'FOOD')],
 [('after', 'AFTER'), ('food', 'FOOD')],
 [('for', 'FOR'), ('20', 'Duration'), ('days', 'DurationUnit')],
 [('for', 'FOR'), ('twenty', 'Duration'), ('days', 'DurationUnit')],
 [('with', 'WITH'), ('meals', 'FOOD')]]

In [None]:
result_triples = [[('for', 'IN', 'FOR'),
  ('5', 'CD', 'Duration'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'DurationMax'),
  ('days', 'NNS', 'DurationUnit')],
 [('inject', 'JJ', 'Method'), ('2', 'CD', 'Qty'), ('units', 'NNS', 'Form')],
 [('x', 'RB', 'FOR'),
  ('2', 'CD', 'Duration'),
  ('weeks', 'NNS', 'DurationUnit')],
 [('x', 'RB', 'FOR'),
  ('3', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('every', 'DT', 'EVERY'), ('day', 'NN', 'Period')],
 [('every', 'DT', 'EVERY'),
  ('2', 'CD', 'Period'),
  ('weeks', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('3', 'CD', 'Period'),
  ('days', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('1', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('2', 'CD', 'PeriodMax'),
  ('months', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('2', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('weeks', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('4', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('days', 'NNS', 'PeriodUnit')],
 [('take', 'VB', 'Method'),
  ('two', 'CD', 'Qty'),
  ('to', 'TO', 'TO'),
  ('four', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form')],
 [('take', 'VB', 'Method'),
  ('2', 'CD', 'Qty'),
  ('to', 'TO', 'TO'),
  ('4', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form')],
 [('take', 'VB', 'Method'),
  ('3', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('orally', 'RB', 'PO'),
  ('bid', 'VBP', 'BID'),
  ('for', 'IN', 'FOR'),
  ('10', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit'),
  ('at', 'IN', 'AT'),
  ('bedtime', 'NN', 'WHEN')],
 [('swallow', 'JJ', 'Method'),
  ('three', 'CD', 'Qty'),
  ('capsules', 'NNS', 'Form'),
  ('tid', 'VBP', 'TID'),
  ('orally', 'RB', 'PO')],
 [('take', 'VB', 'Method'),
  ('2', 'CD', 'Qty'),
  ('capsules', 'NNS', 'Form'),
  ('po', 'RB', 'PO'),
  ('every', 'DT', 'EVERY'),
  ('6', 'CD', 'Period'),
  ('hours', 'NNS', 'PeriodUnit')],
 [('take', 'VB', 'Method'),
  ('2', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('po', 'NN', 'PO'),
  ('for', 'IN', 'FOR'),
  ('10', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('take', 'VB', 'Method'),
  ('100', 'CD', 'Qty'),
  ('caps', 'NNS', 'Form'),
  ('by', 'IN', 'BY'),
  ('mouth', 'NN', 'PO'),
  ('tid', 'NN', 'TID'),
  ('for', 'IN', 'FOR'),
  ('10', 'CD', 'Duration'),
  ('weeks', 'NNS', 'DurationUnit')],
 [('take', 'VB', 'Method'),
  ('2', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('after', 'IN', 'AFTER'),
  ('an', 'DT', 'Period'),
  ('hour', 'NN', 'PeriodUnit')],
 [('2', 'CD', 'Qty'),
  ('tabs', 'JJ', 'Form'),
  ('every', 'DT', 'EVERY'),
  ('4-6', 'JJ', 'Period'),
  ('hours', 'NNS', 'PeriodUnit')],
 [('every', 'DT', 'EVERY'),
  ('4', 'CD', 'Period'),
  ('to', 'TO', 'TO'),
  ('6', 'CD', 'PeriodMax'),
  ('hours', 'NNS', 'PeriodUnit')],
 [('q46h', 'NN', 'Q46H')],
 [('q4-6h', 'NN', 'Q4-6H')],
 [('2', 'CD', 'Qty'),
  ('hours', 'NNS', 'PeriodUnit'),
  ('before', 'IN', 'BEFORE'),
  ('breakfast', 'NN', 'WHEN')],
 [('before', 'IN', 'BEFORE'),
  ('30', 'CD', 'Qty'),
  ('mins', 'NNS', 'M'),
  ('at', 'IN', 'AT'),
  ('bedtime', 'NN', 'WHEN')],
 [('30', 'CD', 'Qty'),
  ('mins', 'NNS', 'M'),
  ('before', 'IN', 'BEFORE'),
  ('bed', 'NN', 'WHEN')],
 [('and', 'CC', 'AND'),
  ('100', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('twice', 'RB', 'Frequency'),
  ('a', 'DT', 'Period'),
  ('month', 'NN', 'PeriodUnit')],
 [('100', 'CD', 'Qty'),
  ('tabs', 'JJ', 'Form'),
  ('twice', 'RB', 'Frequency'),
  ('a', 'DT', 'Period'),
  ('month', 'NN', 'PeriodUnit')],
 [('100', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('once', 'RB', 'Frequency'),
  ('a', 'DT', 'Period'),
  ('month', 'NN', 'PeriodUnit')],
 [('100', 'CD', 'Qty'),
  ('tabs', 'JJ', 'Form'),
  ('thrice', 'NN', 'Frequency'),
  ('a', 'DT', 'Period'),
  ('month', 'NN', 'PeriodUnit')],
 [('3', 'CD', 'Qty'),
  ('tabs', 'JJ', 'Form'),
  ('daily', 'RB', 'Frequency'),
  ('for', 'IN', 'FOR'),
  ('3', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit'),
  ('then', 'RB', 'THEN'),
  ('1', 'CD', 'Qty'),
  ('tab', 'NNS', 'Form'),
  ('per', 'IN', 'Frequency'),
  ('day', 'NN', 'PeriodUnit'),
  ('at', 'IN', 'AT'),
  ('bed', 'NN', 'WHEN')],
 [('30', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('10', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit'),
  ('tid', 'NN', 'TID')],
 [('take', 'VB', 'Method'),
  ('30', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('for', 'IN', 'FOR'),
  ('10', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit'),
  ('three', 'CD', 'Qty'),
  ('times', 'NNS', 'TIMES'),
  ('a', 'DT', 'Period'),
  ('day', 'NN', 'PeriodUnit')],
 [('qid', 'NN', 'QID'), ('q6h', 'NN', 'Q6H')],
 [('bid', 'NN', 'BID')],
 [('qid', 'NN', 'QID')],
 [('30', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('before', 'IN', 'BEFORE'),
  ('dinner', 'NN', 'WHEN'),
  ('and', 'CC', 'AND'),
  ('bedtime', 'NN', 'WHEN')],
 [('30', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('before', 'IN', 'BEFORE'),
  ('dinner', 'NN', 'WHEN'),
  ('&', 'CC', 'AND'),
  ('bedtime', 'NN', 'WHEN')],
 [('take', 'VB', 'Method'),
  ('3', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('at', 'IN', 'AT'),
  ('bedtime', 'NN', 'WHEN')],
 [('30', 'CD', 'Qty'),
  ('tabs', 'JJ', 'Form'),
  ('thrice', 'JJ', 'Frequency'),
  ('daily', 'RB', 'DAILY'),
  ('for', 'IN', 'FOR'),
  ('10', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('30', 'CD', 'Qty'),
  ('tabs', 'NNS', 'Form'),
  ('for', 'IN', 'FOR'),
  ('10', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit'),
  ('three', 'CD', 'Frequency'),
  ('times', 'NNS', 'TIMES'),
  ('a', 'DT', 'Period'),
  ('day', 'NN', 'PeriodUnit')],
 [('take', 'VB', 'Method'),
  ('2', 'CD', 'Qty'),
  ('tablets', 'NNS', 'Form'),
  ('a', 'DT', 'Period'),
  ('day', 'NN', 'PeriodUnit')],
 [('qid', 'NN', 'QID'),
  ('for', 'IN', 'FOR'),
  ('10', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('every', 'DT', 'EVERY'), ('day', 'NN', 'PeriodUnit')],
 [('take', 'VB', 'Method'),
  ('2', 'CD', 'Qty'),
  ('caps', 'NNS', 'Form'),
  ('at', 'IN', 'AT'),
  ('bedtime', 'NN', 'WHEN')],
 [('apply', 'RB', 'Method'),
  ('3', 'CD', 'Qty'),
  ('drops', 'NNS', 'Form'),
  ('before', 'IN', 'BEFORE'),
  ('bedtime', 'NN', 'WHEN')],
 [('take', 'VB', 'Method'),
  ('three', 'CD', 'Qty'),
  ('capsules', 'NNS', 'Form'),
  ('daily', 'RB', 'DAILY')],
 [('swallow', 'JJ', 'Method'),
  ('3', 'CD', 'Qty'),
  ('pills', 'NNS', 'Form'),
  ('once', 'RB', 'Frequency'),
  ('a', 'DT', 'Period'),
  ('day', 'NN', 'PeriodUnit')],
 [('swallow', 'JJ', 'Method'),
  ('three', 'CD', 'Qty'),
  ('pills', 'NNS', 'Form'),
  ('thrice', 'VBP', 'Frequency'),
  ('a', 'DT', 'Period'),
  ('day', 'NN', 'PeriodUnit')],
 [('apply', 'VB', 'Method'), ('daily', 'JJ', 'DAILY')],
 [('apply', 'RB', 'Method'),
  ('three', 'CD', 'Qty'),
  ('drops', 'NNS', 'Form'),
  ('before', 'IN', 'BEFORE'),
  ('bedtime', 'NN', 'WHEN')],
 [('every', 'DT', 'EVERY'),
  ('6', 'CD', 'Period'),
  ('hours', 'NNS', 'PeriodUnit')],
 [('before', 'IN', 'BEFORE'), ('food', 'NN', 'FOOD')],
 [('after', 'IN', 'AFTER'), ('food', 'NN', 'FOOD')],
 [('for', 'IN', 'FOR'),
  ('20', 'CD', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('for', 'IN', 'FOR'),
  ('twenty', 'JJ', 'Duration'),
  ('days', 'NNS', 'DurationUnit')],
 [('with', 'IN', 'WITH'), ('meals', 'NNS', 'FOOD')]]

In [None]:
def token_to_features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not
    # at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')

    # Features for words that are not
    # at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features
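
To see what the extractor produces, the sketch below applies a trimmed-down copy of the same windowing idea to a three-token sentence. `window_features` is a hypothetical helper written for illustration, not the notebook's `token_to_features`; it keeps only the lowercase word, the digit flag, the POS tag, and the BOS/EOS markers.

```python
def window_features(doc, i):
    """Hypothetical, reduced version of token_to_features for illustration."""
    word, postag = doc[i][0], doc[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
    ]
    # Left neighbour, or a marker when the token opens the sentence
    if i > 0:
        features.append('-1:word.lower=' + doc[i - 1][0].lower())
    else:
        features.append('BOS')
    # Right neighbour, or a marker when the token closes the sentence
    if i < len(doc) - 1:
        features.append('+1:word.lower=' + doc[i + 1][0].lower())
    else:
        features.append('EOS')
    return features

# Toy (token, POS tag, label) triples in the same shape as the training data
doc = [('take', 'VB', 'Method'), ('2', 'CD', 'Qty'), ('tablets', 'NNS', 'Form')]
for i in range(len(doc)):
    print(window_features(doc, i))
```

Each token ends up as a flat list of strings, which is the per-token format the trainer cell below receives.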

### Running the feature extractor on the training data
- Feature extraction
- Train-test-split

In [None]:
# function for extracting features in doc
def get_features(doc):
    return [token_to_features(doc,i) for i in range(len(doc))]

# function for generating list of labels for each doc
def get_labels(doc):
    return [label for (token, postag, label) in doc]

sample_data = result_triples

X = [get_features(doc) for doc in sample_data]
y = [get_labels(doc) for doc in sample_data]

# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
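
As a reminder of the ordering, scikit-learn's `train_test_split` returns the splits grouped by array: both parts of `X` first, then both parts of `y`. The stdlib-only sketch below mimics that return order with a hypothetical `holdout_split` helper so the shape of the result is easy to check; it is an illustration, not a replacement for the scikit-learn function.

```python
import random

def holdout_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle indices, hold out a fraction for testing.

    Returns X_train, X_test, y_train, y_test (same order as scikit-learn).
    """
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(X) * test_size)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data: ten "sentences", each with a matching label sequence
X_demo = [['feat%d' % i] for i in range(10)]
y_demo = [['label%d' % i] for i in range(10)]
X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo, test_size=0.2)
print(len(X_tr), len(X_te), len(y_tr), len(y_te))  # 8 2 8 2
```

Because features and labels are indexed with the same shuffled positions, each feature sequence stays aligned with its label sequence after the split.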

### Training the CRF model on the extracted features

In [None]:
# Submit training data to the trainer
# Set the parameters of the model
# Providing a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished


Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 1971
Seconds required: 0.010

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 1000
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 734.228583
Feature norm: 1.000000
Error norm: 148.137282
Active features: 1948
Line search trials: 1
Line search step: 0.005207
Seconds required for this iteration: 0.001

***** Iteration #2 *****
Loss: 457.872349
Feature norm: 5.727530
Error norm: 185.004580
Active features: 1861
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.003

***** Iteration #3 *****
Loss: 331.095715
Feature norm: 7.055854
Error norm: 87.348079
Active features: 1853
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration:

***** Iteration #150 *****
Loss: 47.261611
Feature norm: 23.027520
Error norm: 0.046637
Active features: 295
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #151 *****
Loss: 47.261590
Feature norm: 23.026657
Error norm: 0.053513
Active features: 295
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #152 *****
Loss: 47.261530
Feature norm: 23.026726
Error norm: 0.045319
Active features: 295
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #153 *****
Loss: 47.261508
Feature norm: 23.025995
Error norm: 0.048871
Active features: 295
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #154 *****
Loss: 47.261456
Feature norm: 23.026147
Error norm: 0.040011
Active features: 295
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteratio

In [None]:
import pycrfsuite
trainer = pycrfsuite.Trainer()
# Submit training data to the trainer
# (note: this passes the full dataset X, y, not only the training split)
for feat, targ in zip(X, y):
    trainer.append(feat, targ)

# Set the parameters of the model
trainer.set_params({
    'max_iterations':1000,
    'c1':0.1,
    'c2':0.01
})

# Providing a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('prescription_parser.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 1015
Seconds required: 0.004

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 1000
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 734.255445
Feature norm: 1.000000
Error norm: 148.040941
Active features: 992
Line search trials: 1
Line search step: 0.005208
Seconds required for this iteration: 0.003

***** Iteration #2 *****
Loss: 458.462225
Feature norm: 5.714149
Error norm: 184.275740
Active features: 993
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #3 *****
Loss: 335.607650
Feature norm: 7.050128
Error norm: 91.585562
Active features: 985
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.

### Predicting the test data with the built model

In [None]:
#loading the model
tagger = pycrfsuite.Tagger()
tagger.open('prescription_parser.model')

<contextlib.closing at 0x213b2f0e550>

In [None]:
import nltk
from nltk import pos_tag
#nltk.download('averaged_perceptron_tagger')

In [None]:
test_sentence =  '2 tabs every 4 hours'

# word tokenization and lowercasing
processed_sentence = [w.lower() for w in test_sentence.split()]
# get POS tags -> [('2','CD'),('tabs','NNS'),('every','DT'),('4','CD'),('hours','NNS')]
tagged_sentence = pos_tag(processed_sentence)

# put dummy placeholder for label
features = []
for tpls in tagged_sentence:
    tpls += (tpls[1],)
    features.append(tpls)
test_sentence = [features]

# labeled_sentence = triples_maker(list(tagged_sentence))
# test = [[('2','CD','CD'),('tabs','NNS','NNS'),('every','DT','DT'),('4','CD','CD'),('hours','NNS','NNS')]]
test_sentence

[[('2', 'CD', 'CD'),
  ('tabs', 'NNS', 'NNS'),
  ('every', 'DT', 'DT'),
  ('4', 'CD', 'CD'),
  ('hours', 'NNS', 'NNS')]]

In [None]:
x_feat = [get_features(doc) for doc in test_sentence]

In [None]:
predictions = [tagger.tag(data) for data in x_feat]

In [None]:
predictions

[['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit']]
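
With predictions in this nested-list form, the accuracy step can be sketched as the share of tokens whose predicted label matches the gold label. `token_accuracy` is a hypothetical helper, and the gold labels below are typed in by hand for the example sentence rather than taken from the notebook's test split.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label equals the gold label."""
    correct = total = 0
    for gold_seq, pred_seq in zip(y_true, y_pred):
        if len(gold_seq) != len(pred_seq):
            raise ValueError('gold and predicted sequences differ in length')
        for gold, pred in zip(gold_seq, pred_seq):
            correct += (gold == pred)
            total += 1
    return correct / total if total else 0.0

# Hand-written gold labels for '2 tabs every 4 hours' (illustrative only)
gold = [['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit']]
pred = [['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit']]
print(token_accuracy(gold, pred))

# Same sentence with one of the five labels wrong
pred_with_error = [['Qty', 'Form', 'EVERY', 'Duration', 'PeriodUnit']]
print(token_accuracy(gold, pred_with_error))
```

In the notebook this could be called as `token_accuracy(y_test, [tagger.tag(x) for x in X_test])`, assuming the split variables hold what their names suggest.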

### Putting all the prediction logic inside a predict method

In [None]:
def predict(sig):
    """
    predict(sig)
    Purpose: Labels the given sig into corresponding labels
    @param sig. A Sentence  # A medical prescription sig written by a doctor
    @return     A list      # A list with predicted labels (first level of labeling)
    >>> predict('2 tabs every 4 hours')
    [['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit']]
    >>> predict('2 tabs with food')
    [['Qty', 'Form', 'WITH', 'FOOD']]
    >>> predict('2 tabs qid x 30 days')
    [['Qty', 'Form', 'QID', 'FOR', 'Duration', 'DurationUnit']]
    """

    sentence = [w.lower() for w in sig.split()] # word tokenization
    pos = pos_tag(sentence) # generating pos tags

    # adding dummy placeholder for the label
    features = []
    for tpls in pos:
        tpls += (tpls[1],)
        features.append(tpls)
    test_sentence = [features]

    x_feat = [get_features(doc) for doc in test_sentence]
    predictions = [tagger.tag(data) for data in x_feat]
    print(sig)
    return predictions
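
The problem statement describes the output as (word, label) pairs, while `predict` returns only the labels. Below is a minimal sketch of the pairing step, assuming the same lowercase whitespace tokenization that `predict` uses. `pair_tokens_with_labels` is a hypothetical helper, and the label list is copied from the docstring example rather than produced by a live model.

```python
def pair_tokens_with_labels(sig, predictions):
    """Zip each token of the sig with the label predicted for it."""
    tokens = [w.lower() for w in sig.split()]
    labels = predictions[0]  # predict() wraps the single sentence in a list
    if len(tokens) != len(labels):
        raise ValueError('token and label counts differ')
    return list(zip(tokens, labels))

# Labels copied from the docstring example for this sig
labels = [['Qty', 'Form', 'EVERY', 'Period', 'PeriodUnit']]
pairs = pair_tokens_with_labels('2 tabs every 4 hours', labels)
print(pairs)
```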

### Sample predictions

In [None]:
predictions = predict("take 2 tabs every 6 hours x 10 days")
predictions

take 2 tabs every 6 hours x 10 days


[['Method',
  'Qty',
  'Form',
  'EVERY',
  'Period',
  'PeriodUnit',
  'FOR',
  'Duration',
  'DurationUnit']]

In [None]:
predictions = predict("2 capsu for 10 day at bed")
predictions

2 capsu for 10 day at bed


[['Qty', 'Form', 'FOR', 'Duration', 'PeriodUnit', 'AT', 'WHEN']]

In [None]:
predictions = predict("2 capsu for 10 days at bed")
predictions

2 capsu for 10 days at bed


[['Qty', 'Form', 'FOR', 'Duration', 'DurationUnit', 'AT', 'WHEN']]

In [None]:
predictions = predict("5 days 2 tabs at bed")
predictions

5 days 2 tabs at bed


[['Duration', 'DurationUnit', 'Qty', 'Form', 'AT', 'WHEN']]

In [None]:
predictions = predict("3 tabs qid x 10 weeks")
predictions

3 tabs qid x 10 weeks


[['Qty', 'Form', 'QID', 'FOR', 'Duration', 'DurationUnit']]

In [None]:
predictions = predict("x 30 days")
predictions

x 30 days


[['FOR', 'Duration', 'DurationUnit']]

In [None]:
predictions = predict("x 20 months")
predictions

x 20 months


[['FOR', 'Duration', 'DurationUnit']]

In [None]:
predictions = predict("take 2 tabs po tid for 10 days")
predictions

take 2 tabs po tid for 10 days


[['Method', 'Qty', 'Form', 'PO', 'TID', 'FOR', 'Duration', 'DurationUnit']]

In [None]:
predictions = predict("take 2 capsules po every 6 hours")
predictions

take 2 capsules po every 6 hours


[['Method', 'Qty', 'Form', 'PO', 'EVERY', 'Period', 'PeriodUnit']]

In [None]:
predictions = predict("inject 2 units pu tid")
predictions

inject 2 units pu tid


[['Method', 'Qty', 'Form', 'Frequency', 'TID']]

In [None]:
predictions = predict("swallow 3 caps tid by mouth")
predictions

swallow 3 caps tid by mouth


[['Method', 'Qty', 'Form', 'TID', 'BY', 'PO']]

In [None]:
predictions = predict("inject 3 units orally")
predictions

inject 3 units orally


[['Method', 'Qty', 'Form', 'PO']]

In [None]:
predictions = predict("orally take 3 tabs tid")
predictions

orally take 3 tabs tid


[['PO', 'Method', 'Qty', 'Form', 'TID']]

In [None]:
predictions = predict("by mouth take three caps")
predictions

by mouth take three caps


[['BY', 'PO', 'Method', 'Qty', 'Form']]

In [None]:
predictions = predict("take 3 tabs orally three times a day for 10 days at bedtime")
predictions

take 3 tabs orally three times a day for 10 days at bedtime


[['Method',
  'Qty',
  'Form',
  'PO',
  'Qty',
  'TIMES',
  'Period',
  'PeriodUnit',
  'FOR',
  'Duration',
  'DurationUnit',
  'AT',
  'WHEN']]

In [None]:
predictions = predict("take 3 tabs orally bid for 10 days at bedtime")
predictions

take 3 tabs orally bid for 10 days at bedtime


[['Method',
  'Qty',
  'Form',
  'PO',
  'BID',
  'FOR',
  'Duration',
  'DurationUnit',
  'AT',
  'WHEN']]

In [None]:
predictions = predict("take 3 tabs bid orally at bed")
predictions

take 3 tabs bid orally at bed


[['Method', 'Qty', 'Form', 'BID', 'PO', 'AT', 'WHEN']]

In [None]:
predictions = predict("take 10 capsules by mouth qid")
predictions

take 10 capsules by mouth qid


[['Method', 'Qty', 'Form', 'BY', 'PO', 'QID']]

In [None]:
predictions = predict("inject 10 units orally qid x 3 months")
predictions

inject 10 units orally qid x 3 months


[['Method', 'Qty', 'Form', 'PO', 'QID', 'FOR', 'Duration', 'DurationUnit']]

In [None]:
predictions = predict("please take 2 tablets per day for a month in the morning and evening each day")
predictions

please take 2 tablets per day for a month in the morning and evening each day

In [None]:
predictions = predict("Amoxcicillin QID 30 tablets")
predictions

Amoxcicillin QID 30 tablets

In [None]:
predictions = predict("take 3 tabs TID for 90 days with food")
predictions

take 3 tabs TID for 90 days with food

In [None]:
predictions = predict("with food take 3 tablets per day for 90 days")
predictions

with food take 3 tablets per day for 90 days

In [None]:
predictions = predict("with food take 3 tablets per week for 90 weeks")
predictions

with food take 3 tablets per week for 90 weeks

In [None]:
predictions = predict("take 2-4 tabs")
predictions

take 2-4 tabs
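
Range expressions are a useful stress test because of how the input is tokenized. The small sketch below uses only the `str.split` and `str.isdigit` calls that `predict` and the feature extractor already rely on; `tokens_with_digit_flag` is a hypothetical helper. It shows that a hyphenated range stays a single non-digit token, while the spelled-out form becomes three tokens.

```python
def tokens_with_digit_flag(sig):
    """Lowercase, split on whitespace, and flag purely numeric tokens."""
    return [(tok, tok.isdigit()) for tok in sig.lower().split()]

for sig in ['take 2-4 tabs', 'take 2 to 4 tabs']:
    print(tokens_with_digit_flag(sig))
```

For `'2-4'` the `word.isdigit` feature is therefore False, so the model has to lean on the POS tag and the neighbouring words; whether that is enough depends on the training data.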

In [None]:
predictions = predict("take 2 to 4 tabs")
predictions

take 2 to 4 tabs

In [None]:
predictions = predict("take two to four tabs")
predictions

take two to four tabs

In [None]:
predictions = predict("take 2-4 tabs for 8 to 9 days")
predictions

take 2-4 tabs for 8 to 9 days

In [None]:
predictions = predict("take 20 tabs every 6 to 8 days")
predictions

take 20 tabs every 6 to 8 days

In [None]:
predictions = predict("take 2 tabs every 4 to 6 days")
predictions

take 2 tabs every 4 to 6 days

In [None]:
predictions = predict("take 2 tabs every 2 to 10 weeks")
predictions

take 2 tabs every 2 to 10 weeks

In [None]:
predictions = predict("take 2 tabs every 4 to 6 days")
predictions

take 2 tabs every 4 to 6 days

In [None]:
predictions = predict("take 2 tabs every 2 to 10 months")
predictions

take 2 tabs every 2 to 10 months

In [None]:
predictions = predict("every 60 mins")
predictions

every 60 mins

In [None]:
predictions = predict("every 10 mins")
predictions

every 10 mins

In [None]:
predictions = predict("every two to four months")
predictions

every two to four months

In [None]:
predictions = predict("take 2 tabs every 3 to 4 days")
predictions

take 2 tabs every 3 to 4 days

In [None]:
predictions = predict("every 3 to 4 days take 20 tabs")
predictions

every 3 to 4 days take 20 tabs

In [None]:
predictions = predict("once in every 3 days take 3 tabs")
predictions

once in every 3 days take 3 tabs

In [None]:
predictions = predict("take 3 tabs once in every 3 days")
predictions

take 3 tabs once in every 3 days

In [None]:
predictions = predict("orally take 20 tabs every 4-6 weeks")
predictions

orally take 20 tabs every 4-6 weeks

In [None]:
predictions = predict("10 tabs x 2 days")
predictions

10 tabs x 2 days

In [None]:
predictions = predict("3 capsule x 15 days")
predictions

3 capsule x 15 days

In [None]:
predictions = predict("10 tabs")
predictions

10 tabs