# **Part Two of the Course Project**
In this part of the course project, you will build and train a named entity recognition (NER) model that recognizes movie-related named entities (NE).
<hr style="border-top: 2px solid #606366; background: transparent;">

# **Setup**
 
First, the conditional random fields (CRF) library and scikit-learn (SKL) need to be at compatible versions. We therefore pin SKL to version 0.23.2 to avoid errors from SKL (until the CRF library is upgraded by its authors). Recall that the CRF library accepts non-numeric features in dictionary format and predicts the chunks and their labels simultaneously.


In [1]:
!pip -q install -U --force-reinstall scikit-learn==0.23.2 > Log   # SKL version compatible with CRF
!pip freeze | grep learn

scikit-learn==0.23.2
sklearn==0.0
sklearn-crfsuite==0.3.6


Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries and corpora needed for this project. In this project you will use NLTK's part-of-speech (POS) tagger, so the `punkt` and `averaged_perceptron_tagger` resources are needed. The `warnings` library hides the `FutureWarning` messages arising from the older version of SKL.

In [2]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = 'all'
import pandas as pd, numpy as np, nltk, unittest, numpy.testing as npt, sklearn_crfsuite as CRF, warnings, re
from sklearn.model_selection  import train_test_split
from sklearn_crfsuite.metrics import flat_classification_report as rpt  # model's detailed metrics
from colorunittest import run_unittest
from collections import Counter
eq, aeq = npt.assert_equal, npt.assert_almost_equal

_ = nltk.download(['punkt', 'averaged_perceptron_tagger'], quiet=True) # silently load corpora from NLTK
np.set_printoptions(linewidth=100, precision=4, edgeitems=20, suppress=True) 
pd.set_option('max_colwidth', 200, 'max_columns', 50, 'display.max_rows', 6)
warnings.simplefilter(action='ignore', category=FutureWarning) # suppress FutureWarning warnings

Two text files, *engtrain.bio.txt* and *engtest.bio.txt* (on the local drive of the Jupyter Notebook (JN)), contain English training and test text queries in the movie domain. These [MIT Movie text files](https://groups.csail.mit.edu/sls/downloads/) are in BIO format, where each word is on its own line and sentences are separated by a blank line. Each word is preceded by an IOB (inside-outside-beginning) style NE tag and a tab `\t`, as shown in `sEW3` below. 
 
Your task will involve reading each BIO file into a list of lists of tuples in `(Word,NE_Tag)` format (i.e., a list of `LTsWE3`-like elements) and then adding POS tags. The resulting format will be a list of `LTsWPE3`-like elements shown below. Then a window of three words (and their attributes) will be featurized into a list of lists of dictionaries of features (key-value pairs), which will be fed to a CRF model for training. Both the train and test files must be featurized in the same way. Finally, the featurized test sentences will be used to evaluate the quality of your model.
 
Note that spaCy models cannot handle these NE tags out of the box. However, the CRF model allows you to train your own chunker/tagger.

In [3]:
sEW3 = 'B-ACTOR\tStallone\nO\tin\nB-TITLE\tRocky\n\n'; print(sEW3) # Bio format, string NE+Word, example with 3 tokens
LTsWE3 = [('Stallone','B-ACTOR'), ('in','O'), ('Rocky','B-TITLE')]
LTsWPE3 = [('Stallone','NN','B-ACTOR'), ('in','IN','O'), ('Rocky','NN','B-TITLE')]

B-ACTOR	Stallone
O	in
B-TITLE	Rocky
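To make the IOB scheme concrete, here is a minimal sketch that groups `(word, NE_tag)` pairs into entity phrases: a `B-` tag opens a phrase, matching `I-` tags extend it, and `O` closes it. The helper name `bio_to_spans` is hypothetical and is not part of the project's required functions.

```python
def bio_to_spans(tagged):
    """Group (word, tag) pairs in IOB format into (entity_type, phrase) spans."""
    spans, words, etype = [], [], None
    for word, tag in tagged:
        starts_new = tag.startswith('B-') or (tag.startswith('I-') and tag[2:] != etype)
        if starts_new:                      # B- tag, or an I- tag with no matching open span
            if words:
                spans.append((etype, ' '.join(words)))
            words, etype = [word], tag[2:]
        elif tag.startswith('I-'):          # continue the open span
            words.append(word)
        else:                               # 'O' closes any open span
            if words:
                spans.append((etype, ' '.join(words)))
            words, etype = [], None
    if words:                               # flush a span that ends the sentence
        spans.append((etype, ' '.join(words)))
    return spans

print(bio_to_spans([('Stallone', 'B-ACTOR'), ('in', 'O'), ('Rocky', 'B-TITLE')]))
print(bio_to_spans([('what', 'O'), ('movies', 'O'), ('star', 'O'),
                    ('bruce', 'B-ACTOR'), ('willis', 'I-ACTOR')]))
```

Treating a stray `I-` tag as the start of a new span is one possible design choice; a stricter parser could raise an error instead.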




Here are a few lines from the training file, printed so that the invisible symbols `\t` and `\n` are rendered as whitespace rather than shown as escape sequences. The `strip()` method removes any leading/trailing whitespace characters.

In [4]:
with open("engtrain.bio.txt") as file: _ = ([print(next(file).strip()) for x in range(8)])

O	what
O	movies
O	star
B-ACTOR	bruce
I-ACTOR	willis

O	show
O	me


Here are a few lines printed as a list of strings with the invisible characters. Each element is a line in the file.

In [5]:
with open("engtrain.bio.txt") as file: print([next(file) for x in range(8)])

['O\twhat\n', 'O\tmovies\n', 'O\tstar\n', 'B-ACTOR\tbruce\n', 'I-ACTOR\twillis\n', '\n', 'O\tshow\n', 'O\tme\n']


# Task 1. Load BIO File
 
Complete the `BioFile2WE()` function, which takes the name of a file in BIO (text) format and returns a list of lists of tuples. Each element of the outer list is in the format of the `LTsWE3` variable. Remember to strip each line of any leading/trailing whitespace.

For example, `BioFile2WE('engtrain.bio.txt')[0:2]` should return the following two tagged sentences: 

    [[('what', 'O'), ('movies', 'O'), ('star', 'O'), ('bruce', 'B-ACTOR'), ('willis', 'I-ACTOR')], [('show', 'O'), ('me', 'O'), ('films', 'O'), ('with', 'O'), ('drew', 'B-ACTOR'), ('barrymore', 'I-ACTOR'), ('from', 'O'), ('the', 'O'), ('1980s', 'B-YEAR')]]

In [6]:
def BioFile2WE(FileName='') -> list:  # list of lists of (word, NE_tag) tuples
    '''Takes the name of BIO file, FileName, which is stored locally in the script's directory.
    Returns: list of lists of tuples. Each tuple is in the format (word, NE_tag). 
    Note that BIO file has the reversed order on each line: NE_tag and word (separated by tab \t)    '''
    #LLTsWE = [[('W','NE')]]   # desired output format to be built
    LLTsWE = []
    current_list = []
    
    with open(FileName, 'r') as file:
        for line in file:
            line = line.strip()
            
            if line:
                ne_tag, word = line.split('\t')
                current_list.append((word, ne_tag))
            elif current_list:
                LLTsWE.append(current_list)
                current_list = []
        
        if current_list:
            LLTsWE.append(current_list)
    
    return LLTsWE

tWE = BioFile2WE('engtrain.bio.txt')
vWE = BioFile2WE('engtest.bio.txt')
print('> ', tWE[0:2])
print('> ', vWE[0:2])

>  [[('what', 'O'), ('movies', 'O'), ('star', 'O'), ('bruce', 'B-ACTOR'), ('willis', 'I-ACTOR')], [('show', 'O'), ('me', 'O'), ('films', 'O'), ('with', 'O'), ('drew', 'B-ACTOR'), ('barrymore', 'I-ACTOR'), ('from', 'O'), ('the', 'O'), ('1980s', 'B-YEAR')]]
>  [[('are', 'O'), ('there', 'O'), ('any', 'O'), ('good', 'O'), ('romantic', 'B-GENRE'), ('comedies', 'I-GENRE'), ('out', 'O'), ('right', 'B-YEAR'), ('now', 'I-YEAR')], [('show', 'O'), ('me', 'O'), ('a', 'O'), ('movie', 'O'), ('about', 'O'), ('cars', 'B-PLOT'), ('that', 'I-PLOT'), ('talk', 'I-PLOT')]]


In [7]:
tWE[0], [('what','O'),('movies','O'),('star','O'),('bruce','B-ACTOR'),('willis','I-ACTOR')]


([('what', 'O'),
  ('movies', 'O'),
  ('star', 'O'),
  ('bruce', 'B-ACTOR'),
  ('willis', 'I-ACTOR')],
 [('what', 'O'),
  ('movies', 'O'),
  ('star', 'O'),
  ('bruce', 'B-ACTOR'),
  ('willis', 'I-ACTOR')])


In [9]:
# RUN CELL TO TEST YOUR CODE
flat = lambda LLT: [T for LT in LLT for T in LT] # flatten container of containers, [[1,2],[3,4]] => [1,2,3,4]

@run_unittest
class Test_BioFile2WE(unittest.TestCase):
    def test00(self): eq(type(tWE), list)  # check if train output is a list
    def test01(self): eq(type(vWE), list)  # check if test output is a list
    def test02(self): eq(all(type(e)==list for e in tWE), True) # tWE must contain lists only
    def test03(self): eq(all(type(e)==list for e in vWE), True) # vWE must contain lists only
    def test04(self): eq(all(type(t)==tuple for t in flat(tWE)), True) # inner list must contain only tuples
    def test05(self): eq(all(type(t)==tuple for t in flat(vWE)), True) # inner list must contain only tuples
    def test06(self): eq(all(type(s)==str for s in flat(flat(tWE))), True) # tuples must contain strings
    def test07(self): eq(all(type(t)==str for t in flat(flat(vWE))), True) # tuples must contain strings
    def test08(self): eq(tWE[0], [('what','O'),('movies','O'),('star','O'),('bruce','B-ACTOR'),('willis','I-ACTOR')])
    def test09(self): eq(vWE[5], [('show','O'),('me','O'),('1980s','B-YEAR'),('action','B-GENRE'),('movies','O')])
    def test10(self): eq(len(tWE), 9775) # number of training sentences
    def test11(self): eq(len(vWE), 2443) # number of test sentences
    def test12(self): eq(len(flat(tWE)), 99491) # number of training words
    def test13(self): eq(len(flat(vWE)), 24686) # number of testing words
    def test14(self): eq(Counter(e for w, e in flat(vWE) if e[0]=='B'), {'B-ACTOR': 812,'B-CHARACTER': 90,
         'B-DIRECTOR': 456,'B-GENRE': 1117,'B-PLOT': 491,'B-RATING': 500,'B-RATINGS_AVERAGE': 451,'B-REVIEW': 56,
         'B-SONG': 54,'B-TITLE': 562,'B-TRAILER': 30,'B-YEAR': 720})


Ran 15 tests in 0.220s

OK
test00 (__main__.Test_BioFile2WE) ... ok
test01 (__main__.Test_BioFile2WE) ... ok
test02 (__main__.Test_BioFile2WE) ... ok
test03 (__main__.Test_BioFile2WE) ... ok
test04 (__main__.Test_BioFile2WE) ... ok
test05 (__main__.Test_BioFile2WE) ... ok
test06 (__main__.Test_BioFile2WE) ... ok
test07 (__main__.Test_BioFile2WE) ... ok
test08 (__main__.Test_BioFile2WE) ... ok
test09 (__main__.Test_BioFile2WE) ... ok
test10 (__main__.Test_BioFile2WE) ... ok
test11 (__main__.Test_BioFile2WE) ... ok
test12 (__main__.Test_BioFile2WE) ... ok
test13 (__main__.Test_BioFile2WE) ... ok
test14 (__main__.Test_BioFile2WE) ... ok

----------------------------------------------------------------------



In this project you will build a typical set of features to predict the following NE tags. Your goal might be to improve the model quality overall or to improve the quality of a specific NE tag. The performance of the model on some tag depends on the diversity of tagged words, number of examples, and the model's ability to fit and generalize to these examples. 
 
For example, one would expect genres to be from a small (say 30 to 40) set of words. You could create a feature containing the whole word, but this would explode the model in memory because it will memorize all words and will not generalize. Instead, it is more advantageous to find a small set of features (from the tagged and neighboring words) that keeps the model small, fast, and generalizable. 
 
Years and ratings are numbers, so some feature based on digit symbols might be helpful in predicting these two NE tags (this is just an intuition, i.e. a hypothesis yet to be tested, not a fact). 
 
The dataset has actor/director/character names in large numbers. While there are many examples, these names also vary widely and some name-recognizing features would be helpful. 
 
Plot, review and other textual phrases may be short sentences, with nouns, verbs, and adjectives, so POS tagging should help. This is what you need to build next.


In [10]:
pd.DataFrame.from_dict(Counter([e for sent in tWE for w,e in sent]), orient='index').T

Unnamed: 0,O,B-ACTOR,I-ACTOR,B-YEAR,B-TITLE,B-GENRE,I-GENRE,B-DIRECTOR,I-DIRECTOR,B-SONG,I-SONG,B-PLOT,I-PLOT,B-REVIEW,B-CHARACTER,I-CHARACTER,B-RATING,B-RATINGS_AVERAGE,I-RATINGS_AVERAGE,I-TITLE,I-RATING,B-TRAILER,I-TRAILER,I-REVIEW,I-YEAR
0,61008,3220,3474,2858,2376,4354,786,1720,1850,245,446,1927,1687,221,385,342,2007,1869,1673,3495,840,113,7,132,2456
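The tag counts cover only one of the factors named earlier, the number of examples. The diversity of tagged words can be probed with a rough sketch like the one below, which reports, per entity type, the number of tagged tokens and how many of them are distinct. The helper `tag_diversity` is hypothetical, and the three sentences are a toy sample copied from the outputs shown earlier (two training sentences and one test sentence), not the full corpus.

```python
from collections import defaultdict

def tag_diversity(sentences):
    """Return {entity_type: (n_tokens, n_distinct_tokens)}, merging B- and I- tags."""
    words_by_type = defaultdict(list)
    for sent in sentences:
        for word, tag in sent:
            if tag != 'O':
                words_by_type[tag[2:]].append(word)   # 'B-ACTOR' -> 'ACTOR'
    return {t: (len(ws), len(set(ws))) for t, ws in words_by_type.items()}

# Toy sample taken from sentences printed earlier in this notebook
sample = [
    [('what', 'O'), ('movies', 'O'), ('star', 'O'), ('bruce', 'B-ACTOR'), ('willis', 'I-ACTOR')],
    [('show', 'O'), ('me', 'O'), ('films', 'O'), ('with', 'O'), ('drew', 'B-ACTOR'),
     ('barrymore', 'I-ACTOR'), ('from', 'O'), ('the', 'O'), ('1980s', 'B-YEAR')],
    [('show', 'O'), ('me', 'O'), ('1980s', 'B-YEAR'), ('action', 'B-GENRE'), ('movies', 'O')],
]
print(tag_diversity(sample))
```

On the real data the same function could be called as `tag_diversity(tWE)`; a low ratio of distinct to total tokens would suggest a small, repetitive vocabulary for that entity type.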


# Task 2. Add POS Tag

Complete a UDF `InsertPOS()`, which takes a single sentence (any inner list from, say, `tWE`) and inserts a POS tag next to each word. It is then applied to each sentence to populate all POS tags, which are likely (not guaranteed) to be helpful in predicting NE tags.

For example, the UDF converts the first train sentence from 

    [('what','O'), ('movies','O'), ('star','O'), ('bruce','B-ACTOR'), ('willis','I-ACTOR')]
    
to

    [('what','WP','O'), ('movies','NNS','O'), ('star','VBP','O'), ('bruce','NN','B-ACTOR'), ('willis','NN','I-ACTOR')]

In [21]:
def InsertPOS(LTsWE=LTsWE3) -> list:  # list of (word, POS, NE) tuples
    '''Converts a sentence in WE format to the sentence in WPE format, 
        i.e. [(word, NE),...] -> [(word, POS, NE),...]
    Use nltk.pos_tag() on the words of the sentence to generate POS tags. 
    Then add these tags to their corresponding words, so as to convert (word, NE) into (word, POS, NE).
    This UDF deals with one sentence only.
    Hint: you can use for loops or several applications of zip(*...) function'''
    #LTsWPE = [('W','POS','NE')]   # desired output format to be built
    words = [word for word, ne in LTsWE]   # nltk.pos_tag() expects a list of words
    pos_tags = nltk.pos_tag(words)         # [(word, POS), ...] in the same order
    LTsWPE = []
    for (word, ne), (_, word_pos) in zip(LTsWE, pos_tags):
        LTsWPE.append((word, word_pos, ne))
    return LTsWPE

%time tWPE = [InsertPOS(Sent) for Sent in tWE]
%time vWPE = [InsertPOS(Sent) for Sent in vWE]
tWPE[0]
vWPE[5]

CPU times: user 8.02 s, sys: 358 ms, total: 8.38 s
Wall time: 8.43 s
CPU times: user 2.03 s, sys: 93.9 ms, total: 2.12 s
Wall time: 2.14 s


[('what', 'WP', 'O'),
 ('movies', 'NNS', 'O'),
 ('star', 'VBP', 'O'),
 ('bruce', 'NN', 'B-ACTOR'),
 ('willis', 'NN', 'I-ACTOR')]

[('show', 'VB', 'O'),
 ('me', 'PRP', 'O'),
 ('1980s', 'CD', 'B-YEAR'),
 ('action', 'NN', 'B-GENRE'),
 ('movies', 'NNS', 'O')]
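The docstring hint about `zip(*...)` can be illustrated with a loop-free variant. This is only a sketch: `insert_pos_zip` and `toy_tagger` are hypothetical names, and the toy tagger is a stand-in for `nltk.pos_tag()` so that the example runs without downloading NLTK resources.

```python
def insert_pos_zip(sent, tagger):
    """[(word, NE), ...] -> [(word, POS, NE), ...] using zip(*...) instead of a loop."""
    if not sent:                              # zip(*[]) cannot be unpacked into two names
        return []
    words, ne_tags = zip(*sent)               # "unzip" the pairs into two tuples
    _, pos_tags = zip(*tagger(list(words)))   # tagger returns [(word, POS), ...]
    return list(zip(words, pos_tags, ne_tags))

# Stand-in tagger (an assumption for this sketch): 'IN' for "in", 'NN' for everything else
toy_lookup = {'in': 'IN'}
toy_tagger = lambda words: [(w, toy_lookup.get(w, 'NN')) for w in words]

print(insert_pos_zip([('Stallone', 'B-ACTOR'), ('in', 'O'), ('Rocky', 'B-TITLE')], toy_tagger))
```

In the notebook, passing `nltk.pos_tag` as `tagger` should give the same result as the loop-based `InsertPOS()`.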

In [22]:
tWPE[0], [('what','WP','O'), ('movies','NNS','O'), ('star','VBP','O'), ('bruce','NN','B-ACTOR'), ('willis','NN','I-ACTOR')]

([('what', 'WP', 'O'),
  ('movies', 'NNS', 'O'),
  ('star', 'VBP', 'O'),
  ('bruce', 'NN', 'B-ACTOR'),
  ('willis', 'NN', 'I-ACTOR')],
 [('what', 'WP', 'O'),
  ('movies', 'NNS', 'O'),
  ('star', 'VBP', 'O'),
  ('bruce', 'NN', 'B-ACTOR'),
  ('willis', 'NN', 'I-ACTOR')])

In [23]:
# RUN CELL TO TEST YOUR CODE
@run_unittest
class Test_InsertPOS(unittest.TestCase):
    def test00(self): eq(type(tWPE[0]), list)
    def test01(self): eq(type(vWPE), list)
    def test02(self): eq(all(type(e)==list for e in vWPE), True)            # vWPE must contain only lists
    def test03(self): eq(all(type(t)==str for t in flat(tWPE[0])), True)    # tuples must contain strings only
    def test04(self): eq(all(type(t)==tuple for t in flat(vWPE)), True)     # inner list must contain tuples only
    def test05(self): eq(all(type(s)==str for s in flat(flat(tWPE))), True) # tuples must contain strings only
    def test06(self): eq(all(type(t)==str for t in flat(flat(vWPE))), True) # tuples must contain strings only
    def test07(self): eq(tWPE[0], [('what','WP','O'), ('movies','NNS','O'), ('star','VBP','O'), ('bruce','NN','B-ACTOR'), ('willis','NN','I-ACTOR')])
    def test08(self): eq(vWPE[5], [('show','VB','O'), ('me','PRP','O'), ('1980s','CD','B-YEAR'), ('action','NN','B-GENRE'), ('movies','NNS','O')])
    def test09(self): eq(len(tWPE), 9775) # number of training sentences
    def test10(self): eq(len(vWPE), 2443) # number of test sentences
    def test11(self): eq(len(flat(tWPE)), 99491) # number of training words
    def test12(self): eq(len(flat(vWPE)), 24686) # number of testing words
    def test13(self): eq(Counter(e for w, p, e in flat(vWPE) if e[0]=='B'), {'B-ACTOR': 812,'B-CHARACTER': 90,
         'B-DIRECTOR': 456,'B-GENRE': 1117,'B-PLOT': 491,'B-RATING': 500,'B-RATINGS_AVERAGE': 451,'B-REVIEW': 56,
         'B-SONG': 54,'B-TITLE': 562,'B-TRAILER': 30,'B-YEAR': 720})
    def test14(self): eq(Counter(p for w, p, e in flat(vWPE) if p[0]=='V'), 
                 {'VB':865,'VBD':1193,'VBG':377,'VBN':1003,'VBP':452,'VBZ': 811})


Ran 15 tests in 0.119s

OK
test00 (__main__.Test_InsertPOS) ... ok
test01 (__main__.Test_InsertPOS) ... ok
test02 (__main__.Test_InsertPOS) ... ok
test03 (__main__.Test_InsertPOS) ... ok
test04 (__main__.Test_InsertPOS) ... ok
test05 (__main__.Test_InsertPOS) ... ok
test06 (__main__.Test_InsertPOS) ... ok
test07 (__main__.Test_InsertPOS) ... ok
test08 (__main__.Test_InsertPOS) ... ok
test09 (__main__.Test_InsertPOS) ... ok
test10 (__main__.Test_InsertPOS) ... ok
test11 (__main__.Test_InsertPOS) ... ok
test12 (__main__.Test_InsertPOS) ... ok
test13 (__main__.Test_InsertPOS) ... ok
test14 (__main__.Test_InsertPOS) ... ok

----------------------------------------------------------------------



# Task 3

Complete `Case2Num()` to return a number that encodes the capitalization of the UDF's argument `w`.

In [13]:
w='123*@^$(*&)'
w.lower()==w.upper()

True

In [32]:
def Case2Num(w='Word') -> int:
    """ Takes a word (string) w and returns a number:
    If w is not a string OR w in lower case is the same as w in upper case, return 3.
        E.g. w='123*@^$(*&)'; w.lower()==w.upper() # returns True. Casing makes no sense for numbers
    0 if w is all lower case. May help with plot descriptions.
        E.g. you can lower-case w and compare it to the original w
    1 if w is all upper case.
        E.g. you can upper-case w and compare it to the original w
    2 if w is a title case word. May help with proper names in English language, incl. week days, month names
        E.g. you can title-case w and compare it to the original w
    3 otherwise (mixed case, e.g. 'oUt3')     """
    if not isinstance(w, str) or w.lower() == w.upper():
        return 3   # non-strings and caseless strings (digits, punctuation)
    elif w == w.title():
        return 2   # checked before upper case, so a single capital letter such as 'A' returns 2
    elif w == w.lower():
        return 0
    elif w == w.upper():
        return 1
    return 3       # mixed case

print(Case2Num('out0'), Case2Num('OUT1'), Case2Num('Out2'), Case2Num('oUt3'))

0 1 2 3
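Because the title-case check relies on `str.title()`, it is worth knowing how that method treats names with inner capitals or apostrophes. The sketch below uses a hypothetical helper, `is_title_case`, that mirrors only the title-case branch of the approach described above.

```python
def is_title_case(w):
    """True if w is a cased string that equals its own title-cased form."""
    return isinstance(w, str) and w.lower() != w.upper() and w == w.title()

# str.title() capitalizes the first letter of every run of letters and lower-cases the rest
for w in ['Boris', 'McDonald', "O'Neil", "It's", 'New York']:
    print(repr(w), '->', repr(w.title()), is_title_case(w))
```

Names such as `McDonald` are not equal to their title-cased form (`Mcdonald`), so a feature built this way would treat them as mixed case rather than as proper names.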


In [33]:
Case2Num('oUt3')

3

In [34]:
# RUN CELL TO TEST YOUR CODE
@run_unittest
class Test_Case2Num(unittest.TestCase):
    def test00(self): eq(type(Case2Num('out0')), int)  
    def test01(self): eq(Case2Num('out0'), 0)  
    def test02(self): eq(Case2Num('OUT0'), 1)  
    def test03(self): eq(Case2Num('Out2'), 2) 
    def test04(self): eq(Case2Num('oUt3'), 3) 
    def test05(self): eq(Case2Num(1), 3)      # return 3 for non strings 
    def test06(self): eq(Case2Num([]), 3)     # return 3 for non strings 
    def test07(self): eq(Case2Num(None), 3)   # return 3 for non strings 


Ran 8 tests in 0.001s

OK
test00 (__main__.Test_Case2Num) ... ok
test01 (__main__.Test_Case2Num) ... ok
test02 (__main__.Test_Case2Num) ... ok
test03 (__main__.Test_Case2Num) ... ok
test04 (__main__.Test_Case2Num) ... ok
test05 (__main__.Test_Case2Num) ... ok
test06 (__main__.Test_Case2Num) ... ok
test07 (__main__.Test_Case2Num) ... ok

----------------------------------------------------------------------



# Task 4

Complete `Featurize()`, which takes the word `w`, its POS tag, and its location `loc`. It then returns a dictionary of features with a feature name as a key and feature value as the value. Feature values need not be numeric for the CRF model.

Example. The following four calls:

    Featurize(w='7', POS='NNP', loc='')     # '7' is neither capital nor lower case. Key C is 3
    Featurize(w='AOL', POS='NNP', loc='')
    Featurize(w='have', POS='VBP', loc='b')
    Featurize(w='Boris', POS='NNP', loc='a')

Return the following corresponding dictionaries:

    {'D': 1, 'C': 3, 'W3': '7', 'W2': '7', 'W1': '7', 'POS': 'NNP', 'POS2': 'NN'}
    {'D': 0, 'C': 1, 'W3': 'AOL', 'W2': 'OL', 'W1': 'L', 'POS': 'NNP', 'POS2': 'NN'}
    {'bD': 0, 'bC': 0, 'bW3': 'ave', 'bW2': 've', 'bW1': 'e', 'bPOS': 'VBP', 'bPOS2': 'VB'}
    {'aD': 0, 'aC': 2, 'aW3': 'ris', 'aW2': 'is', 'aW1': 's', 'aPOS': 'NNP', 'aPOS2': 'NN'}

In [41]:
def Featurize(w='AOL', POS='NNP', loc='') -> dict:
    ''' Build a dictionary of features from the word, its POS and its location.
            Features:
                  D: 1 if w consists of digits only (str.isdigit()), 0 otherwise
                  C: case feature, a value from Case2Num()
                 W3: last 3 letters of w
                 W2: last 2 letters of w
                 W1: last letter of w
                POS: POS
               POS2: first 2 letters of POS
               All keys are prepended with loc.
        Inputs:
          w: word being featurized. If not a string, convert to string.
        loc: location of w yielding features (relative to the central (of concern) word in a window of 3 words)
            'b'=before, 'a'=after, ''=current  
        Returns: a dictionary of feature key-value pairs '''
    #DsFeatures = {'feature_name':'feature_value'}  # desired format of the output
    DsFeatures = {}

    w = str(w)  # non-string inputs (e.g. the integer 1) are featurized as text

    DsFeatures[f'{loc}D'] = 1 if w.isdigit() else 0
    DsFeatures[f'{loc}C'] = Case2Num(w)
    DsFeatures[f'{loc}W3'] = w[-3:]   # slicing is safe for words shorter than 3 characters
    DsFeatures[f'{loc}W2'] = w[-2:]
    DsFeatures[f'{loc}W1'] = w[-1:]
    DsFeatures[f'{loc}POS'] = POS
    DsFeatures[f'{loc}POS2'] = POS[:2]

    return DsFeatures

print(Featurize(w='7', POS='NNP', loc=''))     # function demo
print(Featurize(w='AOL', POS='NNP', loc=''))
print(Featurize(w='have', POS='VBP', loc='b'))
print(Featurize(w='Boris', POS='NNP', loc='a'))

{'D': 1, 'C': 3, 'W3': '7', 'W2': '7', 'W1': '7', 'POS': 'NNP', 'POS2': 'NN'}
{'D': 0, 'C': 1, 'W3': 'AOL', 'W2': 'OL', 'W1': 'L', 'POS': 'NNP', 'POS2': 'NN'}
{'bD': 0, 'bC': 0, 'bW3': 'ave', 'bW2': 've', 'bW1': 'e', 'bPOS': 'VBP', 'bPOS2': 'VB'}
{'aD': 0, 'aC': 2, 'aW3': 'ris', 'aW2': 'is', 'aW1': 's', 'aPOS': 'NNP', 'aPOS2': 'NN'}


In [42]:
Featurize(1) # {'C': 3, 'D': 1, 'POS': 'NNP', 'POS2': 'NN', 'W1': '1', 'W2': '1', 'W3': '1'}

{'C': 3, 'D': 1, 'POS': 'NNP', 'POS2': 'NN', 'W1': '1', 'W2': '1', 'W3': '1'}
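One limitation of the `D` feature is that `str.isdigit()` is true only when every character is a digit, so a token such as `1980s` gets `D = 0`. As an untested idea, a second digit feature could flag words that contain at least one digit. The helper `digit_features` and the key `HasD` below are hypothetical and are not checked by the unit tests.

```python
def digit_features(w, loc=''):
    """Two digit-based features: all-digits (D) and contains-a-digit (HasD)."""
    w = str(w)
    return {
        f'{loc}D': int(w.isdigit()),                      # same rule as the D feature
        f'{loc}HasD': int(any(ch.isdigit() for ch in w))  # hypothetical extra feature
    }

print(digit_features('7'))
print(digit_features('1980s'))
print(digit_features('willis', loc='b'))
```

Whether such a feature actually improves the `YEAR` or `RATING` metrics would have to be checked by retraining and comparing the classification reports.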

In [43]:
# RUN CELL TO TEST YOUR CODE
@run_unittest
class Test_Featurize(unittest.TestCase):
    def test00(self): eq(type(Featurize()), dict)
    def test01(self): eq(Featurize(w='7', POS='NNP', loc=''), {'C':3,'D': 1, 'POS': 'NNP', 'POS2': 'NN', 'W1': '7', 'W2': '7', 'W3': '7'})  
    def test02(self): eq(Featurize(w='AOL', POS='NNP', loc=''), {'D':0,'C': 1, 'W3': 'AOL', 'W2': 'OL', 'W1': 'L', 'POS': 'NNP', 'POS2': 'NN'})  
    def test03(self): eq(Featurize(w='have', POS='VBP', loc='b'), {'bD':0,'bC': 0, 'bW3': 'ave', 'bW2': 've', 'bW1': 'e', 'bPOS': 'VBP', 'bPOS2': 'VB'}) 
    def test04(self): eq(Featurize(w='Boris', POS='NNP', loc='a'), {'aD':0,'aC': 2, 'aW3': 'ris', 'aW2': 'is', 'aW1': 's', 'aPOS': 'NNP', 'aPOS2': 'NN'}) 
    def test05(self): eq(Featurize(1), {'C': 3, 'D': 1, 'POS': 'NNP', 'POS2': 'NN', 'W1': '1', 'W2': '1', 'W3': '1'})
    def test06(self): eq(Featurize(w='iMac', POS='NNP', loc='a'), {'aD': 0, 'aC': 3, 'aW3': 'Mac', 'aW2': 'ac', 'aW1': 'c', 'aPOS': 'NNP', 'aPOS2': 'NN'})

Ran 7 tests in 0.003s

OK
test00 (__main__.Test_Featurize) ... ok
test01 (__main__.Test_Featurize) ... ok
test02 (__main__.Test_Featurize) ... ok
test03 (__main__.Test_Featurize) ... ok
test04 (__main__.Test_Featurize) ... ok
test05 (__main__.Test_Featurize) ... ok
test06 (__main__.Test_Featurize) ... ok

----------------------------------------------------------------------



## Build Features on a Text Window
 
The UDF `WordWindow2Features()` applies the `Featurize()` function to each word on a window sliding along the text. Thus, each window generates a larger dictionary where features are labeled with the relation to the position of the word in a window.

In [44]:
def WordWindow2Features(WPE=LTsWPE3, i=0) -> dict:
    '''Extract features from the central word, word before 
        and word after (unless central word starts/ends a sentence)
    Inputs:
        WPE:      list of tuples of strings [(word, POS_Tag, NE_Tag), ...]. 
                    These tuples represent sequential words in a sentence.
        i:        integer index of the central word. 
    Returns:  dictionary of features for the central word.      '''
    Ft = Featurize(WPE[i][0], WPE[i][1]) # extract features from the center word and its POS tag
    Ft.update({'b':1.})  # add a bias parameter to raise model flexibility
    Ft.update(Featurize( WPE[i-1][0], WPE[i-1][1], loc='b') if i>0 else {'BOS':1}) # a word before center word
    Ft.update(Featurize( WPE[i+1][0], WPE[i+1][1], loc='a') if i<(len(WPE)-1) else {'EOS':1}) # a word after center word
    return Ft  # return dictionary of string keys with feature values (heterogeneous types)

print(WordWindow2Features(i=0))  # Featurize 1st word in "Stallone in Rocky"
print(WordWindow2Features(i=1))  # Featurize 2nd word in "Stallone in Rocky"
print(WordWindow2Features(i=2))  # Featurize last word in "Stallone in Rocky"

{'D': 0, 'C': 2, 'W3': 'one', 'W2': 'ne', 'W1': 'e', 'POS': 'NN', 'POS2': 'NN', 'b': 1.0, 'BOS': 1, 'aD': 0, 'aC': 0, 'aW3': 'in', 'aW2': 'in', 'aW1': 'n', 'aPOS': 'IN', 'aPOS2': 'IN'}
{'D': 0, 'C': 0, 'W3': 'in', 'W2': 'in', 'W1': 'n', 'POS': 'IN', 'POS2': 'IN', 'b': 1.0, 'bD': 0, 'bC': 2, 'bW3': 'one', 'bW2': 'ne', 'bW1': 'e', 'bPOS': 'NN', 'bPOS2': 'NN', 'aD': 0, 'aC': 2, 'aW3': 'cky', 'aW2': 'ky', 'aW1': 'y', 'aPOS': 'NN', 'aPOS2': 'NN'}
{'D': 0, 'C': 2, 'W3': 'cky', 'W2': 'ky', 'W1': 'y', 'POS': 'NN', 'POS2': 'NN', 'b': 1.0, 'bD': 0, 'bC': 0, 'bW3': 'in', 'bW2': 'in', 'bW1': 'n', 'bPOS': 'IN', 'bPOS2': 'IN', 'EOS': 1}
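The boundary handling is easier to see on a stripped-down version that keeps only the words themselves. The sketch below uses a hypothetical helper, `window_features`, with made-up keys `w`, `bw`, and `aw`; it follows the same BOS/EOS logic but is not the project's feature set.

```python
def window_features(tokens, i):
    """Features for tokens[i] from a 3-word window, with BOS/EOS markers at the edges."""
    ft = {'w': tokens[i]}
    # The i > 0 guard matters: tokens[-1] would silently wrap around to the last token
    ft.update({'bw': tokens[i - 1]} if i > 0 else {'BOS': 1})
    ft.update({'aw': tokens[i + 1]} if i < len(tokens) - 1 else {'EOS': 1})
    return ft

print(window_features(['Stallone', 'in', 'Rocky'], 0))
print(window_features(['Stallone', 'in', 'Rocky'], 1))
print(window_features(['Rocky'], 0))   # a one-word sentence gets both BOS and EOS
```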


`WordWindow2Features()` is rolled over the sequence of words to generate a set of features for each word based on itself and its neighboring words. This set of features is generated for each sentence separately using the `WPE2X()` function. The output labels (i.e., NE tags) are retrieved from the WPE triples using the `WPE2Y()` function.

In [45]:
WPE2X = lambda WPE: [WordWindow2Features(WPE, i) for i in range(len(WPE))]
WPE2Y = lambda WPE: [NE for Word, POS, NE in WPE]  # pull out labels for each word in a sentence

Finally, the features are generated by applying `WPE2X()` and `WPE2Y()` to each sentence in the training and test sets.

In [46]:
tX = [WPE2X(s) for s in tWPE] # training input features, list of list of dictionaries
tY = [WPE2Y(s) for s in tWPE] # training outputs
vX = [WPE2X(s) for s in vWPE] # test input features
vY = [WPE2Y(s) for s in vWPE] # test outputs
print('tY (train NE labels):', tY[0])
print('tX (train features): ', str(tX[0])[:300], '...')

tY (train NE labels): ['O', 'O', 'O', 'B-ACTOR', 'I-ACTOR']
tX (train features):  [{'D': 0, 'C': 0, 'W3': 'hat', 'W2': 'at', 'W1': 't', 'POS': 'WP', 'POS2': 'WP', 'b': 1.0, 'BOS': 1, 'aD': 0, 'aC': 0, 'aW3': 'ies', 'aW2': 'es', 'aW1': 's', 'aPOS': 'NNS', 'aPOS2': 'NN'}, {'D': 0, 'C': 0, 'W3': 'ies', 'W2': 'es', 'W1': 's', 'POS': 'NNS', 'POS2': 'NN', 'b': 1.0, 'bD': 0, 'bC': 0, 'b ...


## Train CRF Model

Next, the CRF model is trained on the training inputs and outputs. You can experiment with the hyperparameters to find settings that improve the metrics on all or some desired NE tags.

In [47]:
crf = CRF.CRF(max_iterations=10, algorithm='lbfgs', c1=0.1, c2=0.1, all_possible_transitions=True, verbose=0)
%time crf.fit(tX, tY)  # fit the model on training inputs and outputs

CPU times: user 9.66 s, sys: 94.7 ms, total: 9.75 s
Wall time: 9.97 s


CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=10, verbose=0)
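One simple way to experiment with hyperparameters such as `c1` and `c2` is an exhaustive grid search. The sketch below is generic: `grid_search` and `toy_score` are hypothetical names, and the toy scorer is only a stand-in so that the code runs without training a model. In the notebook, the scoring function would instead fit `CRF.CRF(**params)` on the training data and return a metric computed on held-out sentences.

```python
from itertools import product

def grid_search(param_grid, fit_and_score):
    """Evaluate every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)          # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in scorer (an assumption for this sketch): pretends c1=0.1, c2=0.01 is optimal
def toy_score(params):
    return -abs(params['c1'] - 0.1) - abs(params['c2'] - 0.01)

best, score = grid_search({'c1': [0.01, 0.1, 1.0], 'c2': [0.01, 0.1]}, toy_score)
print(best)
```

To avoid tuning on the test set, part of the training sentences could be held out for validation, for example with the `train_test_split` function imported at the top of the notebook.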

## Predict NE Tags
 
The trained `crf` model can now be used in predicting NE tags on test sentences featurized in the same way as the training sentences were. You can compare the actual and predicted NE tags. Notably, they are unlikely to match perfectly, but this leaves room for improvement of the model (via a better suite of hyperparameters or through a larger training set or through "better" features).

In [48]:
pY = crf.predict(vX)                # predicted NE labels
print('Actual NER tags:', vY[0])    # test (or validation) NE labels
print('Predicted  tags:', pY[0])

Actual NER tags: ['O', 'O', 'O', 'O', 'B-GENRE', 'I-GENRE', 'O', 'B-YEAR', 'I-YEAR']
Predicted  tags: ['O', 'O', 'O', 'O', 'B-GENRE', 'O', 'O', 'B-RATINGS_AVERAGE', 'I-RATINGS_AVERAGE']
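To see exactly where the two sequences disagree, the words can be lined up with the actual and predicted tags. The sketch below hard-codes the first test sentence and the two tag lists printed above, so it runs without the trained model; in the notebook the same comprehension could be applied to `vWE[0]`, `vY[0]`, and `pY[0]`.

```python
# First test sentence and its tags, copied from the outputs shown in this notebook
words     = ['are', 'there', 'any', 'good', 'romantic', 'comedies', 'out', 'right', 'now']
actual    = ['O', 'O', 'O', 'O', 'B-GENRE', 'I-GENRE', 'O', 'B-YEAR', 'I-YEAR']
predicted = ['O', 'O', 'O', 'O', 'B-GENRE', 'O', 'O', 'B-RATINGS_AVERAGE', 'I-RATINGS_AVERAGE']

# Keep (position, word, actual tag, predicted tag) wherever the tags differ
mismatches = [(i, w, a, p)
              for i, (w, a, p) in enumerate(zip(words, actual, predicted))
              if a != p]
for m in mismatches:
    print(m)
```

Here the model truncates the genre phrase after "romantic" and labels "right now" as a ratings phrase instead of a year.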


## Measuring Model Performance
 
The model performance can be measured and printed with the `flat_classification_report()` function, which reports precision, recall, and F1 score for each NE tag. The support is the number of actual (true) occurrences of each NE tag in the test set. From here on, the scientist would evaluate what is most important for the business in this model and try to tune features and/or the model to improve the prediction of those specific NE tags.

In [49]:
labels = list(crf.classes_)
labels.remove('O')    # suppress statistics for "O" tags
pd.DataFrame(rpt(vY, pY, labels=labels, output_dict=True, zero_division=0)).round(2)

Unnamed: 0,B-ACTOR,I-ACTOR,B-YEAR,B-TITLE,B-GENRE,I-GENRE,B-DIRECTOR,I-DIRECTOR,B-SONG,I-SONG,B-PLOT,I-PLOT,B-REVIEW,B-CHARACTER,I-CHARACTER,B-RATING,B-RATINGS_AVERAGE,I-RATINGS_AVERAGE,I-TITLE,I-RATING,B-TRAILER,I-TRAILER,I-REVIEW,I-YEAR,micro avg,macro avg,weighted avg
precision,0.68,0.7,0.89,0.31,0.72,0.72,0.79,0.84,0.0,0.0,0.48,0.38,0.0,0.0,0.0,0.93,0.77,0.67,0.28,0.96,0.0,0.0,0.0,0.9,0.63,0.46,0.64
recall,0.7,0.67,0.88,0.33,0.71,0.4,0.59,0.56,0.0,0.0,0.31,0.21,0.0,0.0,0.0,0.85,0.65,0.58,0.64,0.87,0.0,0.0,0.0,0.94,0.61,0.41,0.61
f1-score,0.69,0.68,0.89,0.32,0.72,0.51,0.67,0.67,0.0,0.0,0.38,0.27,0.0,0.0,0.0,0.89,0.71,0.62,0.39,0.91,0.0,0.0,0.0,0.92,0.62,0.43,0.61
support,812.0,862.0,720.0,562.0,1117.0,222.0,456.0,496.0,54.0,119.0,491.0,496.0,56.0,90.0,75.0,500.0,451.0,403.0,856.0,226.0,30.0,8.0,45.0,610.0,9757.0,9757.0,9757.0
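The per-tag numbers in the report follow the standard token-level definitions, which can be reproduced by hand. The sketch below is a minimal re-implementation for a single tag on flat (already concatenated) tag lists; `prf` is a hypothetical helper, undefined ratios are set to 0 in the spirit of `zero_division=0`, and the example reuses the first test sentence rather than the full test set.

```python
def prf(y_true, y_pred, tag):
    """Token-level (precision, recall, F1, support) for one tag; 0.0 when undefined."""
    tp     = sum(t == tag and p == tag for t, p in zip(y_true, y_pred))
    n_pred = sum(p == tag for p in y_pred)      # tokens predicted as tag
    n_true = sum(t == tag for t in y_true)      # support: tokens truly labeled tag
    precision = tp / n_pred if n_pred else 0.0
    recall    = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, n_true

actual    = ['O', 'O', 'O', 'O', 'B-GENRE', 'I-GENRE', 'O', 'B-YEAR', 'I-YEAR']
predicted = ['O', 'O', 'O', 'O', 'B-GENRE', 'O', 'O', 'B-RATINGS_AVERAGE', 'I-RATINGS_AVERAGE']
print('B-GENRE:', prf(actual, predicted, 'B-GENRE'))
print('I-GENRE:', prf(actual, predicted, 'I-GENRE'))

# F1 is the harmonic mean of precision and recall, e.g. for the rounded B-ACTOR values
print(round(2 * 0.68 * 0.70 / (0.68 + 0.70), 2))
```

For the whole test set, the lists of lists `vY` and `pY` would first need to be flattened, for example with the `flat` helper defined earlier in the notebook.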
