In this notebook, we will use Conditional Random Fields (CRFs) to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing systems: it extracts entities such as persons, organizations, and locations from text. We will build a NER model that recognizes named entities in tweets.

For example, suppose we want to extract the names of persons and organizations from a text. Then for the input text:

    Ian Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

Here the *B-* and *I-* prefixes mark the beginning and the inside of an entity, while *O* stands for outside, i.e. no tag. Markup with this prefix scheme is called *BIO markup*. It is introduced to distinguish consecutive entities of the same type.
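To make the markup concrete, here is a minimal sketch that converts entity spans to BIO tags and back. The helper names `spans_to_bio` and `bio_to_spans` are hypothetical and are not part of sklearn-crfsuite; spans are assumed to be `(start, end, label)` triples with an exclusive `end`.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def bio_to_spans(tags):
    """Recover (start, end, label) spans from a BIO tag sequence.

    An I- tag with no open span of the same type is ignored in this sketch.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = ["Ian", "Goodfellow", "works", "for", "Google", "Brain"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (4, 6, "ORG")])
print(tags)

# Two adjacent persons stay separate because each one starts with B-.
print(bio_to_spans(["B-PER", "B-PER", "O"]))
```

Without the *B-* prefix, the two adjacent `PER` tokens in the last call would be indistinguishable from a single two-token entity.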

We will use [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/) to build the CRF model.

In [1]:
import os 
import time 
import random
import warnings

import numpy as np 
from collections import defaultdict

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics


warnings.simplefilter('ignore')
data_path = "./data/twitter"

# I/- Load the Twitter Named Entity Recognition corpus

We will work with a corpus that contains tweets with NE tags. Every line of a file contains a token (a word or punctuation symbol) and its tag, separated by whitespace. Different tweets are separated by an empty line.

## 1) Read data
The function *read_data* reads a corpus from the *file_path* and returns two lists: one with tokens and one with the corresponding tags. It also replaces every user nickname (a token starting with `@`) with the `<USR>` token and every URL with the `<URL>` token.

In [2]:
def read_data(file_path):
    tokens = []
    tags = []
    
    tweet_tokens = []
    tweet_tags = []
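    # Use a context manager so the file is closed after reading
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # An empty line ends the current tweet
                if tweet_tokens:
                    tokens.append(tweet_tokens)
                    tags.append(tweet_tags)
                tweet_tokens = []
                tweet_tags = []
            else:
                token, tag = line.split()

                # Replace URLs and user nicknames with placeholder tokens
                if token.startswith(('http://', 'https://')):
                    token = "<URL>"
                elif token.startswith("@"):
                    token = "<USR>"

                tweet_tokens.append(token)
                tweet_tags.append(tag)

    # Keep the last tweet if the file does not end with an empty line
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)

    return tokens, tags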
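The same parsing logic can be tried without the corpus files. The sketch below works on an in-memory string and mirrors the reading logic above; `normalize_token` and `parse_lines` are hypothetical helper names, and the sample lines are made up for illustration rather than taken from the dataset. The sketch also keeps the last tweet when the input does not end with an empty line.

```python
def normalize_token(token):
    """Map URLs and user nicknames to placeholder tokens."""
    if token.startswith(("http://", "https://")):
        return "<URL>"
    if token.startswith("@"):
        return "<USR>"
    return token


def parse_lines(lines):
    """Parse 'token tag' lines; an empty line separates tweets."""
    tokens, tags = [], []
    tweet_tokens, tweet_tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if tweet_tokens:
                tokens.append(tweet_tokens)
                tags.append(tweet_tags)
            tweet_tokens, tweet_tags = [], []
            continue
        token, tag = line.split()
        tweet_tokens.append(normalize_token(token))
        tweet_tags.append(tag)
    # Flush the last tweet if there was no trailing empty line.
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)
    return tokens, tags


# Made-up sample in the corpus format (not real corpus data).
sample = """@alice O
visiting O
Paris B-geo-loc
http://t.co/abc O

Apple B-company
rocks O
"""

sample_tokens, sample_tags = parse_lines(sample.splitlines())
print(sample_tokens)
print(sample_tags)
```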

And now we can load three separate parts of the dataset:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameters tuning;
 - *test* data for final evaluation of the model.

In [3]:
train_tokens, train_tags = read_data(os.path.join(data_path, 'train.txt'))
validation_tokens, validation_tags = read_data(os.path.join(data_path, 'validation.txt'))
test_tokens, test_tags = read_data(os.path.join(data_path, 'test.txt'))

## 2) Feature engineering

Unlike neural networks, a CRF requires hand-built features before training. We use our knowledge of natural language and of the task at hand to build features that represent the words of the corpus. For example, if we want to build a model that detects proper nouns, then "starts with a capital letter" is a good choice of feature.
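As a small illustration of such features, the sketch below applies Python's built-in string predicates to a few sample words. The helper `shape_features` and the word list are hypothetical and only meant to show how capitalization and suffix features behave.

```python
def shape_features(word):
    """Surface features of a single word, using built-in str methods."""
    return {
        "istitle": word.istitle(),  # e.g. "Google"
        "isupper": word.isupper(),  # e.g. "NASA"
        "isdigit": word.isdigit(),  # e.g. "2015"
        "suffix3": word[-3:],       # last three characters
    }


for w in ["Google", "NASA", "2015", "iPhone", "<USR>"]:
    print(w, shape_features(w))
```

Note that `str.isupper()` only looks at cased characters, so a placeholder token such as `<USR>` counts as upper-case, while a mixed-case word such as `iPhone` is neither title-case nor upper-case.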

In [4]:
def word2features(sent, i):
    word = sent[i]

    features = {
        'bias': 1.0,
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        # If the word is not the first in the sentence
        # Use the previous word to build some features
        word1 = sent[i-1]
        features.update({
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        # If the word is not the last in the sentence
        # Use the next word to build some features
        word1 = sent[i+1]
        features.update({
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

In [5]:
X_train = [sent2features(s) for s in train_tokens]
y_train = train_tags

X_val = [sent2features(s) for s in validation_tokens]
y_val = validation_tags

X_test = [sent2features(s) for s in test_tokens]
y_test = test_tags

# II/- Training

In [6]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
)
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=None,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

# III/- Evaluation

In [7]:
labels = list(crf.classes_)
labels.remove('O')
labels

['B-musicartist',
 'I-musicartist',
 'B-product',
 'I-product',
 'B-company',
 'B-person',
 'B-other',
 'I-other',
 'B-facility',
 'I-facility',
 'B-sportsteam',
 'B-geo-loc',
 'I-geo-loc',
 'I-company',
 'I-person',
 'B-movie',
 'I-movie',
 'B-tvshow',
 'I-tvshow',
 'I-sportsteam']

In [8]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred,
                      average='weighted', labels=labels)

0.3698941654120787
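To see what this score measures, here is a minimal pure-Python sketch of a flattened, support-weighted F1 restricted to a given label list. The function name `weighted_flat_f1` and the toy sequences are hypothetical; the sketch is meant to illustrate the idea and is not a replacement for `metrics.flat_f1_score`.

```python
from collections import Counter


def weighted_flat_f1(y_true, y_pred, labels):
    """Per-label F1 on flattened sequences, averaged with support weights."""
    true_flat = [t for seq in y_true for t in seq]
    pred_flat = [p for seq in y_pred for p in seq]

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed

    weighted_sum, total_support = 0.0, 0
    for label in labels:
        support = tp[label] + fn[label]  # number of true occurrences
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Toy data: one I-person token is wrongly predicted as O.
toy_true = [["B-person", "I-person", "O"], ["B-company", "O"]]
toy_pred = [["B-person", "O", "O"], ["B-company", "O"]]
toy_labels = ["B-person", "I-person", "B-company"]  # 'O' is excluded

toy_score = weighted_flat_f1(toy_true, toy_pred, toy_labels)
print(toy_score)
```

Leaving `O` out of the label list matters: most tokens are `O`, so including it would let the easy majority class dominate the weighted average.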

In [9]:
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

               precision    recall  f1-score   support

    B-company      0.600     0.214     0.316        84
    I-company      0.600     0.225     0.327        40
   B-facility      0.625     0.319     0.423        47
   I-facility      0.667     0.426     0.520        61
    B-geo-loc      0.802     0.491     0.609       165
    I-geo-loc      0.647     0.423     0.512        52
      B-movie      0.500     0.125     0.200         8
      I-movie      0.333     0.200     0.250        10
B-musicartist      0.000     0.000     0.000        27
I-musicartist      0.000     0.000     0.000        24
      B-other      0.518     0.282     0.365       103
      I-other      0.254     0.194     0.220        93
     B-person      0.548     0.385     0.452       104
     I-person      0.517     0.470     0.492        66
    B-product      0.286     0.071     0.114        28
    I-product      0.290     0.150     0.198        60
 B-sportsteam      0.500     0.065     0.114        31
 I-sports