**In this assignment you will be guided to add more features in order to get better performance!**

In [23]:
import matplotlib.pyplot as plt

In [24]:
from nltk.corpus import stopwords

In [25]:
from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

A simple sentence NER example:

[**ORG** U.N. ] official [**PER** Ekeus ] heads for [**LOC** Baghdad ] 

We will concentrate on four labels:
 * persons (**PER**)
 * locations (**LOC**)
 * organizations (**ORG**)
 * everything else (**O**), which here also absorbs the **MISC** entities
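To make the labelling concrete, the example sentence can be written as (token, label) pairs. This is only a minimal sketch: it assumes one plain label per token (no `B-`/`I-` prefixes, which the real data files may use), and the helper name `extract_entities` is hypothetical.

```python
# Example sentence from above as (token, label) pairs.
# Assumption: plain labels without B-/I- prefixes, one label per token.
tagged = [
    ("U.N.", "ORG"), ("official", "O"), ("Ekeus", "PER"),
    ("heads", "O"), ("for", "O"), ("Baghdad", "LOC"),
]


def extract_entities(tagged_sent):
    """Group consecutive tokens sharing the same non-O label into entities."""
    entities = []
    current_tokens, current_label = [], None
    for token, label in tagged_sent:
        if label != "O" and label == current_label:
            # Same entity continues.
            current_tokens.append(token)
            continue
        if current_tokens:
            # The previous entity just ended.
            entities.append((" ".join(current_tokens), current_label))
        if label != "O":
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


print(extract_entities(tagged))
```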

In [4]:
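def _generate_examples(filepath):
    """Yield sentences as lists of (token, pos_tag, ner_tag) tuples."""
    with open(filepath, encoding="utf-8") as f:
        sent = []
        for line in f:
            # A document marker or a blank line ends the current sentence.
            if line.startswith("-DOCSTART-") or line == "" or line == "\n":
                if sent:
                    yield sent
                    sent = []
            else:
                splits = line.split(" ")
                token = splits[0]
                pos_tag = splits[1]
                ner_tag = splits[3].rstrip()
                # MISC entities are folded into the 'O' label.
                if 'MISC' in ner_tag:
                    ner_tag = 'O'

                sent.append((token, pos_tag, ner_tag))
        # Emit the last sentence when the file does not end with a blank line.
        if sent:
            yield sent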
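The loader reads column 0 as the token, column 1 as the POS tag and column 3 as the NER tag, which suggests a four-column, space-separated layout with a chunk tag in column 2. The sketch below parses one such line; the sample line is made up for illustration, and the actual tag set in `train.txt` may differ.

```python
# A made-up line in the four-column layout the loader appears to expect:
# token, POS tag, chunk tag, NER tag, separated by single spaces.
sample_line = "Baghdad NNP I-NP I-LOC\n"

splits = sample_line.split(" ")
token, pos_tag, chunk_tag = splits[0], splits[1], splits[2]
ner_tag = splits[3].rstrip()  # drop the trailing newline

# Same relabelling rule as in the loader: MISC entities become 'O'.
if "MISC" in ner_tag:
    ner_tag = "O"

# The chunk tag is parsed here only to show the layout; the loader ignores it.
print((token, pos_tag, ner_tag))
```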

In [5]:
%%time
train_sents = list(_generate_examples('/home/taki/dt_df/CRF/train.txt'))
test_sents = list(_generate_examples('/home/taki/dt_df/CRF/test.txt'))

CPU times: user 184 ms, sys: 21.4 ms, total: 206 ms
Wall time: 190 ms


Here we have successfully loaded the training and test data.
_________________

Here is a list of English stop words, which we would like to include as a feature

In [6]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/taki/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Here is a list of names, and we would like to include it as a feature

In [7]:
names = set()
with open('/home/taki/dt_df/CRF/names.txt') as f:
    for line in f:
        # strip() drops the newline without cutting a character off an unterminated last line
        names.add(line.strip().lower())

_____________________________________________________

You are asked to change the `word2features` function to add the following features:

**For the current word**:
1. Add a feature named `word.isupper()` that tells whether the word is in upper case (you can use Python's built-in `isupper()` method)
2. Add a feature named `word.isdigit()` that tells whether the word is all digits (similarly, you can use Python's built-in `isdigit()` method)
3. Add a feature named `word.l1_is_capital` that tells whether the word starts with a capital letter
4. Add a feature named `word.ends_in_dot` that tells whether the word has length > 1 and ends with a dot (`.`)
5. Add a feature named `word.is_stop_word` that tells whether the word belongs to the previously defined set of stop words, `stop_words` (don't forget to convert the word to lower case before testing, so that the check is case-insensitive)
6. Add a feature named `word.constains_digits` that tells whether the word contains a digit
7. Add a feature named `word.figures_in_names_list` that tells whether the word belongs to the previously defined set of names, `names`. Again, don't forget to convert the word to lower case first.

**For the previous word**: (BE CAREFUL, YOU SHOULD NOT USE `word`, USE `word1` instead):

Add the same features, prefixing each feature name with `-1:` (it is important that the different features have different names)

**Add information about the next word**: (BE CAREFUL, YOU SHOULD NOT USE `word`, USE `word1`, set to the next word `sent[i+1][0]`, instead):

* Add the same features, prefixing each feature name with `+1:` (it is important that the different features have different names)

* **PS**: If the word is the last one in the sentence (no next word), just add a feature named `EOS` = True to tell that the word is in the last position, just as we did with `BOS`
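Since the same seven features are needed for the current, previous and next word, one possible approach is a small helper that takes the word and a name prefix. This is a sketch, not the expected solution: the helper name `token_features` and the toy lexicons are hypothetical, while the feature names follow the list above.

```python
def token_features(word, prefix, stop_words, names):
    """Return the seven word-shape/lexicon features for one token.

    `prefix` is '' for the current word, '-1:' for the previous word
    and '+1:' for the next word, so every feature name stays unique.
    """
    lower = word.lower()
    return {
        prefix + 'word.isupper()': word.isupper(),
        prefix + 'word.isdigit()': word.isdigit(),
        prefix + 'word.l1_is_capital': word[:1].isupper(),
        prefix + 'word.ends_in_dot': len(word) > 1 and word.endswith('.'),
        prefix + 'word.is_stop_word': lower in stop_words,
        prefix + 'word.constains_digits': any(ch.isdigit() for ch in word),
        prefix + 'word.figures_in_names_list': lower in names,
    }


# Toy lexicons so the sketch runs on its own; the notebook would pass
# the real `stop_words` and `names` sets instead.
toy_stop_words = {'for', 'the'}
toy_names = {'ekeus'}

current = token_features('U.N.', '', toy_stop_words, toy_names)
previous = token_features('Ekeus', '-1:', toy_stop_words, toy_names)
print(current)
print(previous)
```

Inside `word2features`, the returned dictionaries could be merged with `features.update(...)`, once for `word` and once for each neighbouring `word1`.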



In [27]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'postag': postag,
        # Add features of the current word here
    }
    
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:postag': postag1,
            # Add features of previous word here
        })
    else:
        features['BOS'] = True
        
    # Similarly, add features of the next word here.

        
    return features

In [28]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

Construct the features for the training and test sets
_________________________
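A minimal sketch of this step on toy data. The `word2features` below is a deliberately tiny stand-in so that the block runs on its own; in the notebook, the full function and the loaded `train_sents` / `test_sents` would be used instead.

```python
# Tiny stand-in for the notebook's word2features (assumption: the real one
# returns a richer dict, but the shape of the result is the same).
def word2features(sent, i):
    return {'bias': 1.0, 'word.lower()': sent[i][0].lower()}


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for token, postag, label in sent]


# Toy data in the (token, postag, label) format produced by the loader.
train_sents = [[('Ekeus', 'NNP', 'PER'), ('heads', 'VBZ', 'O')]]
test_sents = [[('Baghdad', 'NNP', 'LOC')]]

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

# Sanity check: one feature dict per token, aligned with one label per token.
assert all(len(x) == len(y) for x, y in zip(X_train, y_train))
assert all(len(x) == len(y) for x, y in zip(X_test, y_test))
```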

Train your CRF
______________________________

Compute the F1 score for each label. Remove the 'O' label from the list of labels first
______________________
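One way to read this step: flatten the per-sentence label lists, drop `'O'` from the labels to report, and compute a token-level F1 per remaining label. The sketch below does this in plain Python on made-up predictions, and the helper name `per_label_f1` is hypothetical. In the notebook, the `metrics` module imported from `sklearn_crfsuite` (for example `metrics.flat_classification_report` with a `labels=` argument) is probably the more convenient tool.

```python
from itertools import chain


def per_label_f1(y_true, y_pred, labels):
    """Token-level F1 for each label in `labels`, over flattened sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(true_flat, pred_flat))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy gold labels and predictions (made up for illustration).
y_true = [['ORG', 'O', 'PER', 'O', 'O', 'LOC'], ['PER', 'O']]
y_pred = [['ORG', 'O', 'PER', 'O', 'O', 'ORG'], ['O', 'O']]

# Remove 'O' so the dominant non-entity class does not inflate the scores.
labels = sorted(set(chain.from_iterable(y_true)) - {'O'})
scores = per_label_f1(y_true, y_pred, labels)
print(scores)
```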

Show the 20 top and 20 least likely transitions between labels 
_________________
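A sketch of the ranking logic, run here on made-up weights. It assumes that a fitted `sklearn_crfsuite.CRF` model (called `crf` below, a hypothetical name) exposes its learned transition weights as a dictionary `crf.transition_features_` keyed by `(label_from, label_to)`.

```python
from collections import Counter


def print_transitions(trans_features):
    """Print (label_from, label_to) pairs with their weights, one per line."""
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))


# Toy weights keyed by (label_from, label_to), made up for illustration.
# Assumption: on a trained model, Counter(crf.transition_features_) would be
# used in place of this dictionary.
toy_transitions = {
    ('PER', 'PER'): 1.8,
    ('O', 'O'): 0.9,
    ('ORG', 'LOC'): -1.2,
    ('O', 'PER'): 0.4,
    ('LOC', 'PER'): -2.1,
}

ranked = Counter(toy_transitions).most_common()  # sorted by weight, descending
top = ranked[:2]      # use [:20] on the real model
bottom = ranked[-2:]  # use [-20:] on the real model

print("Top likely transitions:")
print_transitions(top)
print("\nTop unlikely transitions:")
print_transitions(bottom)
```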

Show the 50 top and 50 least likely state features (compatibility between features and labels)
_______________
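State features can be ranked by sorting on the weight. The sketch below uses made-up weights and a hypothetical helper `rank_state_features`; it assumes that a fitted `sklearn_crfsuite.CRF` model exposes the real weights as a dictionary `state_features_` keyed by `(attribute, label)`.

```python
def rank_state_features(state_features, n):
    """Return the n highest- and n lowest-weighted (attribute, label) pairs."""
    ranked = sorted(state_features.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n], ranked[-n:]


# Toy weights keyed by (attribute, label), made up for illustration.
# Assumption: on a trained model, crf.state_features_ would be passed instead.
toy_state_features = {
    ('word.figures_in_names_list', 'PER'): 1.5,
    ('word.l1_is_capital', 'O'): -1.1,
    ('word.is_stop_word', 'O'): 2.0,
    ('word.is_stop_word', 'PER'): -1.7,
    ('word.ends_in_dot', 'ORG'): 0.6,
}

top, bottom = rank_state_features(toy_state_features, 2)  # n=50 on the real model

print("Top positive:")
for (attr, label), weight in top:
    print("%0.6f %-6s %s" % (weight, label, attr))

print("\nTop negative:")
for (attr, label), weight in bottom:
    print("%0.6f %-6s %s" % (weight, label, attr))
```

A large positive weight means the attribute is strong evidence for the label, while a large negative weight means it is strong evidence against it.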

See if you can spot some interesting features (both with positive and negative coefficients)