**In this assignment you will be guided to add more features in order to get better performance!**

In [1]:
import matplotlib.pyplot as plt

In [2]:
from nltk.corpus import stopwords

In [5]:
!pip install sklearn_crfsuite

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sklearn_crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading python_crfsuite-0.9.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Installing collected packages: python-crfsuite, sklearn_crfsuite
Successfully installed python-crfsuite-0.9.9 sklearn_crfsuite-0.3.6


In [6]:
from itertools import chain

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import scipy.stats
import sklearn
import sklearn_crfsuite
from sklearn import metrics as mt
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn_crfsuite import metrics, scorers
from sklearn_crfsuite.utils import flatten

A simple sentence NER example:

[**ORG** U.N. ] official [**PER** Ekeus ] heads for [**LOC** Baghdad ] 

We will concentrate on three types of named entities, plus one label for every token outside an entity:
 * persons (**PER**)
 * locations (**LOC**)
 * organizations (**ORG**)
 * others (**O**)
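The bracket notation above corresponds to per-token tags in the data, where `B-` marks the first token of an entity and `I-` a continuation. As a minimal sketch, assuming the example sentence is tagged this way (the tag sequence below is written by hand for illustration, and `bio_to_spans` is a hypothetical helper, not part of the assignment), the tags can be grouped back into entities like this:

```python
def bio_to_spans(tagged_tokens):
    """Group (token, tag) pairs into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in tagged_tokens:
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_entity:
            # Close any open entity, then open a new one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close whatever is open.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written tags for the example sentence (an assumption for illustration).
example = [
    ("U.N.", "B-ORG"), ("official", "O"), ("Ekeus", "B-PER"),
    ("heads", "O"), ("for", "O"), ("Baghdad", "B-LOC"),
]
example_spans = bio_to_spans(example)
print(example_spans)

# A multi-token entity: B- opens it, I- extends it.
multi_spans = bio_to_spans(
    [("United", "B-LOC"), ("Arab", "I-LOC"), ("Emirates", "I-LOC"), ("1996-12-06", "O")]
)
print(multi_spans)
```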

In [7]:
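def _generate_examples(filepath):
    """Yield one sentence at a time as a list of (token, pos_tag, ner_tag) tuples."""
    with open(filepath, encoding="utf-8") as f:
        sent = []
        for line in f:
            # Document markers and blank lines separate sentences.
            if line.startswith("-DOCSTART-") or line == "" or line == "\n":
                if sent:
                    yield sent
                    sent = []
            else:
                # Column 0 is the token, column 1 the POS tag, column 3 the NER tag.
                splits = line.split(" ")
                token = splits[0]
                pos_tag = splits[1]
                ner_tag = splits[3].rstrip()
                # Fold the MISC entity type into the 'O' label.
                if 'MISC' in ner_tag:
                    ner_tag = 'O'

                sent.append((token, pos_tag, ner_tag))
        # Emit the last sentence if the file does not end with a blank line.
        if sent:
            yield sent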
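To make the column indices used by the loader concrete, here is a small sketch that parses single lines. It assumes the files follow the usual CoNLL-2003 layout of four space-separated columns (token, POS tag, chunk tag, NER tag). The sample lines and their chunk tags are made up for illustration, and `parse_conll_line` is a hypothetical helper.

```python
def parse_conll_line(line):
    """Split one data line into (token, pos_tag, ner_tag), dropping the chunk tag."""
    token, pos_tag, _chunk_tag, ner_tag = line.rstrip("\n").split(" ")
    # Same rule as the loader: MISC entities are treated as 'O'.
    if "MISC" in ner_tag:
        ner_tag = "O"
    return token, pos_tag, ner_tag


# Illustrative lines; the third column (chunk tag) is a placeholder value.
loc_line = parse_conll_line("BRUSSELS NNP B-NP B-LOC\n")
misc_line = parse_conll_line("Open NNP I-NP I-MISC\n")
print(loc_line)
print(misc_line)
```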

In [10]:
%%time
train_sents = list(_generate_examples('/content/train.txt'))
test_sents = list(_generate_examples('/content/test.txt'))

CPU times: user 267 ms, sys: 40.9 ms, total: 308 ms
Wall time: 305 ms


In [11]:
train_sents[2]

[('BRUSSELS', 'NNP', 'B-LOC'), ('1996-08-22', 'CD', 'O')]

Here we have successfully loaded the training and test data.
_________________

Here is a list of English stop words, which we would like to include as a feature

In [12]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Here is a list of names, and we would like to include it as a feature

In [13]:
names = set()
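with open('/content/names.txt') as f:
    for line in f:
        # strip() removes the newline without cutting a character
        # off a final line that has no trailing newline
        names.add(line.strip().lower())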

In [14]:
names

{'hiram',
 'goerge',
 'regory',
 'zebulon',
 'caitland',
 'amberly',
 'samer',
 'rayshun',
 'zachariah',
 'donita',
 'chennell',
 'dedrick',
 'christiane',
 'mykia',
 'luanna',
 'joannah',
 'deanna',
 'kyle',
 'shalin',
 'shronda',
 'hillery',
 'latonia',
 'ngan',
 'eleana',
 'maryjo',
 'annah',
 'brandonn',
 'tinea',
 'kameka',
 'alandis',
 'brittainy',
 'lancer',
 'holli',
 'kelvina',
 'ikeisha',
 'dayanna',
 'jenilee',
 'bach',
 'ryane',
 'jeneen',
 'danille',
 'lucious',
 'kojo',
 'aamil',
 'liseth',
 'juvenal',
 'penelope',
 'kenyotta',
 'spiros',
 'bobbie',
 'amish',
 'alfredia',
 'shawntelle',
 'laterria',
 'yoselin',
 'donielle',
 'chalee',
 'georgie',
 'demetric',
 'zaynab',
 'nary',
 'cristal',
 'kellymarie',
 'jaeson',
 'clarrissa',
 'guido',
 'aleesa',
 'cindy',
 'claude',
 'ceddrick',
 'jacqulyne',
 'british',
 'genna',
 'naftali',
 'leanda',
 'nick',
 'savina',
 'leaha',
 'estanislao',
 'hayes',
 'talissa',
 'shakesha',
 'wilhemina',
 'laurissa',
 'mervin',
 'chanc',
 'ti

_____________________________________________________

You are asked to change the `word2features` function to add the following features:

**For the current word**:
1. Add a feature named `word.isupper()` that tells if the word is in upper case (you can use the built-in `isupper()` method in Python)
2. Add a feature named `word.isdigit()` that tells if the word is all digits (similarly, you can use the built-in `isdigit()` method)
3. Add a feature named `word.l1_is_capital` that tells if the word starts with a capital letter
4. Add a feature named `word.ends_in_dot` that tells if the word has length > 1 and ends with a dot (`.`)
5. Add a feature named `word.is_stop_word` that tells if the word belongs to the list of stop words defined previously, `stop_words` (don't forget to convert the word to lower case before testing, so that the check is case insensitive)
6. Add a feature named `word.constains_digits` that tells if the word contains a digit or not
7. Add a feature named `word.figures_in_names_list` that tells if the word belongs to the list of names we defined previously, `names`. Again, don't forget to convert the word to lower case first.

**For the previous word**: (BE CAREFUL, YOU SHOULD NOT USE `word`, USE `word1` instead):

Add the same features. Just prepend `-1:` to the feature names (it's important for the different features to have different names)

**For the next word**: (BE CAREFUL, YOU SHOULD NOT USE `word`, USE `word2` instead):

* Add the same features. Just prepend `+1:` to the feature names (it's important for the different features to have different names)

* **PS**: If the word is the last one in the sentence (no next word), just add a feature named `EOS` = True to tell that the word is in the last position, just as we did with `BOS`
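Since the same feature set is needed three times with different prefixes, one option is to build it once in a helper and pass the prefix in. The following is only a sketch of that idea, not the required solution: `token_features` and `word2features_sketch` are hypothetical names, the stop-word and name sets are passed as arguments instead of being read from globals, and the demo sets are tiny stand-ins. Here `word.l1_is_capital` uses a strict check (first character is an uppercase letter).

```python
def token_features(word, postag, prefix, stop_words, names):
    """Features for one token; `prefix` is '', '-1:' or '+1:'."""
    lower = word.lower()
    return {
        f"{prefix}word.lower()": lower,
        f"{prefix}postag": postag,
        f"{prefix}word.isupper()": word.isupper(),
        f"{prefix}word.isdigit()": word.isdigit(),
        # Strict check: digits and punctuation do not count as capitals.
        f"{prefix}word.l1_is_capital": word[:1].isupper(),
        f"{prefix}word.ends_in_dot": len(word) > 1 and word.endswith("."),
        f"{prefix}word.is_stop_word": lower in stop_words,
        f"{prefix}word.constains_digits": any(ch.isdigit() for ch in word),
        f"{prefix}word.figures_in_names_list": lower in names,
    }


def word2features_sketch(sent, i, stop_words, names):
    """Combine current, previous and next token features for position i."""
    features = {"bias": 1.0}
    features.update(token_features(sent[i][0], sent[i][1], "", stop_words, names))
    if i > 0:
        features.update(
            token_features(sent[i - 1][0], sent[i - 1][1], "-1:", stop_words, names)
        )
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        features.update(
            token_features(sent[i + 1][0], sent[i + 1][1], "+1:", stop_words, names)
        )
    else:
        features["EOS"] = True  # end of sentence
    return features


# Tiny stand-in sets, for illustration only.
demo_stop_words = {"the", "of"}
demo_names = {"kyle", "cindy"}
demo_sent = [
    ("United", "NNP", "B-LOC"), ("Arab", "NNP", "I-LOC"),
    ("Emirates", "NNPS", "I-LOC"), ("1996-12-06", "CD", "O"),
]
first = word2features_sketch(demo_sent, 0, demo_stop_words, demo_names)
last = word2features_sketch(demo_sent, 3, demo_stop_words, demo_names)
print(sorted(first))
print(sorted(last))
```

Keeping the prefix as an argument guarantees that the three copies of each feature get distinct names and stay in sync when a feature is added or changed.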



In [15]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
    'bias': 1.0,
    'word.lower()': word.lower(),
    'postag': postag,
    'word.isupper()': word.isupper(),
    'word.isdigit()': word.isdigit(),
    'word.l1_is_capital': word[0].upper() == word[0],
    'word.ends_in_dot': len(word) > 1 and word[-1] == '.',
    'word.is_stop_word': word.lower() in stop_words,
    'word.constains_digits': any(char.isdigit() for char in word),
    'word.figures_in_names_list': word.lower() in names
    }

    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:postag': postag1,
            '-1:word.isupper()': word1.isupper(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.l1_is_capital': word1[0].upper() == word1[0],
            '-1:word.ends_in_dot': len(word1) > 1 and word1[-1] == '.',
            '-1:word.is_stop_word': word1.lower() in stop_words,
            '-1:word.constains_digits': any(char.isdigit() for char in word1),
            '-1:word.figures_in_names_list': word1.lower() in names
        })
    else:
        features['BOS'] = True
        
    if i < len(sent) - 1:
        word2 = sent[i+1][0]
        postag2 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word2.lower(),
            '+1:postag': postag2,
            '+1:word.isupper()': word2.isupper(),
            '+1:word.isdigit()': word2.isdigit(),
            '+1:word.l1_is_capital': word2[0].upper() == word2[0],
            '+1:word.ends_in_dot': len(word2) > 1 and word2[-1] == '.',
            '+1:word.is_stop_word': word2.lower() in stop_words,
            '+1:word.constains_digits': any(char.isdigit() for char in word2),
            '+1:word.figures_in_names_list': word2.lower() in names
        })
    else:
        features['EOS'] = True
        
    return features


In [16]:
test_sents[2]

[('United', 'NNP', 'B-LOC'),
 ('Arab', 'NNP', 'I-LOC'),
 ('Emirates', 'NNPS', 'I-LOC'),
 ('1996-12-06', 'CD', 'O')]

In [18]:
word2features(test_sents[2],3)

{'bias': 1.0,
 'word.lower()': '1996-12-06',
 'postag': 'CD',
 'word.isupper()': False,
 'word.isdigit()': False,
 'word.l1_is_capital': True,
 'word.ends_in_dot': False,
 'word.is_stop_word': False,
 'word.constains_digits': True,
 'word.figures_in_names_list': False,
 '-1:word.lower()': 'emirates',
 '-1:postag': 'NNPS',
 '-1:word.isupper()': False,
 '-1:word.isdigit()': False,
 '-1:word.l1_is_capital': True,
 '-1:word.ends_in_dot': False,
 '-1:word.is_stop_word': False,
 '-1:word.constains_digits': False,
 '-1:word.figures_in_names_list': False,
 'EOS': True}
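Note that `word.l1_is_capital` is `True` for `'1996-12-06'` in the output above. The comparison `word[0].upper() == word[0]` holds for any first character that `upper()` leaves unchanged, which includes digits and punctuation. The sketch below contrasts it with a stricter check based on `str.isupper()`. Both function names are hypothetical, and switching to the strict version would change the feature values, so the weights and scores reported in this notebook would no longer apply as-is.

```python
def l1_is_capital_loose(word):
    """The comparison used in word2features: unchanged by upper()."""
    return word[0].upper() == word[0]


def l1_is_capital_strict(word):
    """True only if the first character is an uppercase letter."""
    return word[:1].isupper()


for w in ["Emirates", "emirates", "1996-12-06", "("]:
    print(w, l1_is_capital_loose(w), l1_is_capital_strict(w))
```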

In [19]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

Construct the features for the training and test sets
_________________________

In [20]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

CPU times: user 2.23 s, sys: 316 ms, total: 2.54 s
Wall time: 2.58 s


Train your CRF
______________________________

In [23]:
%%time 
# search for sklearn_crfsuite.CRF,
# use the lbfgs algorithm,
# the c1 and c2 parameters should be 0.1 and max iterations 100,
# all possible transitions true
try:
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True,)
    # fit the model
    crf.fit(X_train, y_train)
except AttributeError as e:
    print("Error", e)

CPU times: user 38.3 s, sys: 296 ms, total: 38.6 s
Wall time: 39.2 s


In [24]:
# save a list of all labels in your model, hint crfs have a classes attribute
labels = list(crf.classes_)
labels

['B-ORG', 'O', 'B-PER', 'I-PER', 'B-LOC', 'I-ORG', 'I-LOC']

In [25]:
# remove the label 'O' from your list
try:
    labels.remove("O")
except ValueError:
    pass


Compute F1 score for different labels. Remove the 'O' label before that
______________________

In [26]:
# perform a prediction on your test set
y_pred = crf.predict(X_test)

metrics.flat_f1_score(
    y_test,
    y_pred,
    average="weighted",
    labels=labels,
)

0.8367184071821857
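To see what the "flat" weighted F1 measures, here is a sketch in plain Python that follows the usual definition: flatten all sentences into one token list, compute precision, recall and F1 per label, then average the F1 values weighted by each label's support. The function name `weighted_f1` and the toy sequences are made up for illustration; this is not the library's implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred, labels):
    """Support-weighted average of per-label F1 over flattened sequences."""
    true_flat = [t for seq in y_true for t in seq]
    pred_flat = [p for seq in y_pred for p in seq]

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[t] += 1  # missed the true label t

    weighted_sum, total_support = 0.0, 0
    for label in labels:
        support = tp[label] + fn[label]  # number of true tokens with this label
        predicted = tp[label] + fp[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weighted_sum += f1 * support
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Toy data, for illustration only. 'O' is left out of `labels`,
# but predictions of an entity label on an 'O' token still count as errors.
toy_true = [["B-PER", "I-PER", "O"], ["B-LOC", "O"]]
toy_pred = [["B-PER", "O", "O"], ["B-LOC", "B-LOC"]]
toy_score = weighted_f1(toy_true, toy_pred, ["B-PER", "I-PER", "B-LOC"])
print(toy_score)
```

Leaving `O` out of `labels` matters because most tokens are `O`; including it would let the easy majority class dominate the average.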

In [27]:
# group B and I results, use the sorted function on the list labels with a lambda function as the key
sorted_labels = sorted(labels, key=lambda l1: (l1[1:], l1[0]))
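The sort key is the pair (entity type, prefix), so labels are ordered by entity type first and `B-` comes before `I-` within each type. A small standalone check, using a copy of the label list with `O` already removed (the `demo_` names are only for this illustration):

```python
demo_labels = ["B-ORG", "B-PER", "I-PER", "B-LOC", "I-ORG", "I-LOC"]

# label[1:] is e.g. "-LOC" (entity type), label[0] is "B" or "I".
demo_sorted = sorted(demo_labels, key=lambda label: (label[1:], label[0]))
print(demo_sorted)
```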

In [28]:
# Display classification report
print(
    mt.classification_report(
        y_true=flatten(y_test),
        y_pred=flatten(y_pred),
        labels=sorted_labels,
        digits=3,
    )
)

              precision    recall  f1-score   support

       B-LOC      0.878     0.855     0.866      1667
       I-LOC      0.841     0.743     0.789       257
       B-ORG      0.817     0.740     0.776      1660
       I-ORG      0.736     0.777     0.756       834
       B-PER      0.861     0.846     0.853      1615
       I-PER      0.912     0.940     0.926      1156

   micro avg      0.848     0.827     0.837      7189
   macro avg      0.841     0.817     0.828      7189
weighted avg      0.848     0.827     0.837      7189



In [29]:
# what is the number of transition features in our model, crfs have an attribute called transition_features_
len(crf.transition_features_)

49
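The count of 49 is consistent with `all_possible_transitions=True`: with 7 labels, there is one transition feature for every ordered pair of labels, including pairs that never occur in the training data. A quick sketch of that arithmetic (the label list is copied from the `crf.classes_` output above):

```python
from itertools import product

all_labels = ["B-ORG", "O", "B-PER", "I-PER", "B-LOC", "I-ORG", "I-LOC"]

# Every ordered (from, to) pair, including a label followed by itself.
label_pairs = list(product(all_labels, repeat=2))
print(len(label_pairs))
```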

In [30]:
from collections import Counter


def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s ->  %-7s %0.6f" % (label_from, label_to, weight))


print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

Top likely transitions:
B-LOC  ->  I-LOC   4.302825
I-LOC  ->  I-LOC   4.005200
B-PER  ->  I-PER   3.939634
B-ORG  ->  I-ORG   3.739085
I-ORG  ->  I-ORG   3.367001
O      ->  O       3.057047
I-PER  ->  I-PER   2.982216
O      ->  B-PER   1.529200
O      ->  B-ORG   0.914788
O      ->  B-LOC   0.609755
B-LOC  ->  O       0.111220
I-PER  ->  O       -0.068979
I-LOC  ->  O       -0.267734
B-ORG  ->  O       -0.547656
B-PER  ->  O       -0.828131
I-ORG  ->  O       -0.891113
I-LOC  ->  B-LOC   -1.076195
I-LOC  ->  B-ORG   -1.433567
B-LOC  ->  B-LOC   -1.590152
B-LOC  ->  B-ORG   -1.636857


Show the 20 top and 20 least likely transitions between labels 
_________________

In [31]:
# top 20 unlikely transitions
print("\nTop unlikely transitions:")
(
    pd.DataFrame(crf.transition_features_, index=["value"])
    .transpose()
    .reset_index()
    .rename(
        columns={
            "level_0": "from",
            "level_1": "to",
        },
    )
    .sort_values(by="value")
    .reset_index(drop=True)
    .head(20)
)


Top unlikely transitions:


    from     to     value
0      O  I-ORG -6.531975
1      O  I-LOC -5.495880
2  B-PER  B-PER -4.683342
3  B-LOC  I-ORG -4.634844
4      O  I-PER -4.566678
5  B-PER  B-ORG -4.021932
6  B-PER  I-ORG -3.915945
7  B-ORG  I-PER -3.836738
8  B-ORG  B-LOC -3.645662
9  B-LOC  I-PER -3.582885
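Many of the most unlikely transitions are ones the tagging scheme itself rules out. Assuming the data follows the BIO convention seen in the samples above (an `I-X` tag may only follow `B-X` or `I-X`), the sketch below checks the ten listed transitions against that rule. `is_valid_bio_transition` is a hypothetical helper written for this illustration.

```python
def is_valid_bio_transition(prev_label, next_label):
    """Under BIO, I-X may only follow B-X or I-X; everything else is allowed."""
    if not next_label.startswith("I-"):
        return True
    entity = next_label[2:]
    return prev_label in ("B-" + entity, "I-" + entity)


# The ten transitions listed in the table above.
unlikely = [
    ("O", "I-ORG"), ("O", "I-LOC"), ("B-PER", "B-PER"), ("B-LOC", "I-ORG"),
    ("O", "I-PER"), ("B-PER", "B-ORG"), ("B-PER", "I-ORG"), ("B-ORG", "I-PER"),
    ("B-ORG", "B-LOC"), ("B-LOC", "I-PER"),
]
invalid = [pair for pair in unlikely if not is_valid_bio_transition(*pair)]
print(len(invalid), "of", len(unlikely), "violate the BIO rule")
for pair in invalid:
    print(pair)
```

The remaining transitions, such as `B-PER -> B-PER`, are allowed by the scheme but appear to be rare in the training data, which would explain their negative weights.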


In [32]:
# number of state features in our model
len(crf.state_features_)

17617
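The state features are keyed by (attribute name, label). In the state-feature listings of this notebook, string-valued features appear as `key:value` (for example `word.lower():june`), while boolean features appear under their bare key (for example `word.constains_digits`). The sketch below is a simplified model of that naming, written only to explain the attribute names; `expand_attributes` is a hypothetical function and not the library's code.

```python
def expand_attributes(feature_dict):
    """Simplified model: turn a feature dict into {attribute_name: weight}."""
    attrs = {}
    for key, value in feature_dict.items():
        # bool must be tested before int/float, since bool is a subclass of int.
        if isinstance(value, bool):
            attrs[key] = 1.0 if value else 0.0
        elif isinstance(value, (int, float)):
            attrs[key] = float(value)
        elif isinstance(value, str):
            # Each distinct string value becomes its own attribute.
            attrs[f"{key}:{value}"] = 1.0
    return attrs


demo_attrs = expand_attributes({
    "bias": 1.0,
    "word.lower()": "june",
    "word.constains_digits": True,
    "word.isupper()": False,
})
print(demo_attrs)
```

This is why the number of state features is large: every distinct lowercased word in each of the three positions can pair with every label.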

In [33]:
# create dataframe to easily sort linked values
df_trans = (
    pd.DataFrame(crf.state_features_, index=["value"])
    .transpose()
    .reset_index()
    .rename(
        columns={
            "level_0": "attr_name",
            "level_1": "label",
        },
    )
)
df_trans = df_trans[["value", "label", "attr_name"]]

In [34]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

Show the 50 top and 50 least likely state features (compatibility between features and labels)
_______________

In [35]:
# top 50 positive
print("Top positive:")
print(
    df_trans.sort_values(
        by="value",
        ascending=False,
        ignore_index=True,
    ).head(50)
)

Top positive:
       value  label                    attr_name
0   7.054226      O            word.lower():june
1   6.970161      O       word.lower():september
2   6.868783      O            word.lower():july
3   6.787012      O         word.lower():tuesday
4   6.735703      O          word.lower():august
5   6.697165  B-LOC         word.lower():england
6   6.575011      O       word.lower():wednesday
7   6.483059  B-LOC        word.lower():pakistan
8   6.466018      O         +1:word.lower():open
9   6.440661      O             word.lower():may
10  5.937475  B-ORG            -1:word.lower():v
11  5.895658  B-LOC         word.lower():germany
12  5.893044      O        word.lower():thursday
13  5.878410      O             word.lower():aug
14  5.852966      O          word.lower():sunday
15  5.740281      O          word.lower():friday
16  5.714514  B-LOC              word.lower():m3
17  5.650848      O        word.constains_digits
18  5.641599      O          word.lower():monday
19  5.

In [36]:
# top 50 negative
print("\nTop negative:")
print(
    df_trans.sort_values(
        by="value",
        ignore_index=True,
    ).head(50)
)


Top negative:
       value  label                    attr_name
0  -4.438366      O           word.l1_is_capital
1  -4.294540  I-LOC            +1:word.isdigit()
2  -4.177717      O        -1:word.lower():moody
3  -3.962887  I-PER        word.constains_digits
4  -3.882223  I-PER               word.isupper()
5  -3.529955      O        +1:word.lower():arose
6  -3.453103      O     -1:word.lower():interior
7  -3.443033      O        -1:word.lower():lloyd
8  -3.174045      O         -1:word.lower():beat
9  -2.933214  B-ORG            word.is_stop_word
10 -2.913164      O          -1:word.lower():cdu
11 -2.795001  I-LOC                    postag:NN
12 -2.659360  B-PER            word.is_stop_word
13 -2.601639  I-PER                   postag:VBD
14 -2.601068      O        -1:word.lower():queen
15 -2.576247      O            word.lower():2000
16 -2.437059      O     -1:word.lower():official
17 -2.384381      O        +1:word.lower():libor
18 -2.352354      O     +1:word.lower():walkover
19 -2

See if you can spot some interesting features (both with positive and negative coefficients)