This notebook describes named entity recognition (NER) using conditional random fields (CRFs), which use a sequence labeling approach to identify entities in text. The notebook has been adapted from https://github.com/steveneale/ner_crf/blob/master/ner_crf.ipynb.
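
To make the sequence labeling framing concrete: every token receives exactly one label, and entity spans are recovered from runs of labels. The following is a minimal sketch, assuming the B-/I-/O tagging scheme used by the dataset later in this notebook. The helper name `bio_to_spans`, the example tokens, and the hand-assigned labels are illustrative only and are not part of the original notebook.

```python
def bio_to_spans(tokens, labels):
    """Group (token, label) pairs into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        # A span continues only on an I- tag of the same type as the open span
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)
            continue
        # Any other label closes the open span (if there is one)
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        if prefix in ("B", "I"):
            # Start a new span (a stray I- tag is treated like B-)
            current_tokens, current_type = [token], entity_type
        else:
            current_tokens, current_type = [], None
    # Close a span that runs to the end of the sentence
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hand-labeled toy example (labels are assumptions made for illustration)
tokens = ["Did", "everyone", "in", "Dallas", "become", "Italian", "grandmas", "?"]
labels = ["O", "O", "O", "B-LOC", "O", "B-MISC", "O", "O"]
print(bio_to_spans(tokens, labels))
```

A sequence labeler such as a CRF predicts the `labels` list; a decoding step like this one turns those per-token predictions back into entities.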

### Named entity recognition using spaCy
We will first see how NER can be done using an existing model, such as one that ships with spaCy.

In [None]:
from pprint import pprint
import en_core_web_sm
nlp = en_core_web_sm.load()

tweet = "all the pasta, pasta sauce and pizza were sold out at the grocery store. did everyone in dallas become italian grandmas? #dallas #coronapocolypse #covid2019"
doc = nlp(tweet)
print(tweet)
pprint([(X.text, X.label_) for X in doc.ents])

The results from spaCy for named entities are not great! Only hashtag-related items were recognized as named entities, and their labels (PERSON) don't make a lot of sense.

Now let's improve the capitalization of this tweet and remove the hashtags, making it more like sentences you might see in a news article.

In [None]:
improved_tweet="All the pasta, pasta sauce and pizza were sold out at the grocery store. Did everyone in Dallas become Italian grandmas?"
doc = nlp(improved_tweet)
print(improved_tweet)
print([(X.text, X.label_) for X in doc.ents])

The results are better now, most likely because the standard spaCy NER model was trained mainly on edited text such as news articles, where capitalization is reliable and hashtags are absent.

As a first step for your NER needs, you might want to test spaCy or similar tools (e.g., Stanza) to see whether they help with your task. Note that these tools often come with multiple models for NER trained on different corpora, so it is important to select the most appropriate one for your use case.

### Training our own NER model using CRFs
You may have specific NER needs that are unmet by existing models, in which case you can train a CRF NER model (among other possibilities), assuming you have labeled data. 

In [None]:
!pip install sklearn-crfsuite

In [None]:
import math
import warnings
import os
import io
import sys

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.exceptions import UndefinedMetricWarning

# from model_plots import plot_learning_curve

#### Load the NER dataset into a Pandas DataFrame

First, the data is loaded into a Pandas DataFrame. This can be done easily using the `read_csv` function, specifying that the separator is a space. It's also useful to keep the blank lines, which are helpful later for determining the sentence breaks.

Once the data is loaded into a DataFrame, the easy access we have to columns allows a couple of useful things to be done - group the data by the "ne" column to see the distributions of each tag, and extract the classes (disregarding 'O' and blank lines with NaN values) as a list.

In [None]:
# # Upload data files if you are working on colab
# from google.colab import files
# uploaded = files.upload()

In [None]:
# # Use this if you are using colab
# train_file = io.BytesIO(uploaded['train.txt'])
# test_file = io.BytesIO(uploaded['test.txt'])
# valid_file = io.BytesIO(uploaded['valid.txt'])

# Use this if you are working locally
train_file = "train.txt"
test_file = "test.txt"
valid_file = "valid.txt"

In [None]:
# Read the NER data using spaces as separators, keeping blank lines and adding columns
ner_data = pd.read_csv(train_file, sep=" ", header=None, skip_blank_lines=False, encoding="utf-8")
ner_data.columns = ["token", "pos", "chunk", "ne"]

# Explore the distribution of NE tags in the dataset
tag_distribution = ner_data.groupby("ne").size().reset_index(name='counts')
print(tag_distribution)

In [None]:
# Extract the useful classes (B and I tags, not 'O' or NaN values) as a list
classes = [tag for tag in ner_data["ne"].dropna().unique() if tag != "O"]

print(classes)

#### Extract sentences from the dataset

Next, sentences need to be extracted from the data - it's useful to have the sentences as a list of lists, with each sublist containing the token, POS tag, syntactic chunk, and NE label for every word token in the sentence.
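
The target structure can be illustrated without pandas. The sketch below is an alternative to the DataFrame-based loop, not the notebook's own method: it assumes the same four space-separated columns (token, POS, chunk, NE), blank lines between sentences, and `-DOCSTART-` marker rows. The function name `read_conll_sentences` and the sample rows are hand-written for illustration.

```python
def read_conll_sentences(lines):
    """Split CoNLL-style lines into sentences of [token, pos, chunk, ne] rows."""
    sentences, sentence = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            # A blank line marks a sentence boundary
            if sentence:
                sentences.append(sentence)
                sentence = []
        elif fields[0] != "-DOCSTART-" and len(fields) == 4:
            # Keep only complete rows that are not document-start markers
            sentence.append(fields)
    # Keep the final sentence if the input does not end with a blank line
    if sentence:
        sentences.append(sentence)
    return sentences


# Hand-written sample in the assumed four-column layout
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

for sent in read_conll_sentences(sample):
    print(sent)
```

Each printed sublist is one sentence, and each inner list holds the token, POS tag, syntactic chunk, and NE label for one word.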

In [None]:
# Collect sentences as a list of lists; `sentence` holds the rows of the sentence being built
sentences, sentence = [], []

# For each row in the NER data...
for index, row in ner_data.iterrows():
    # A non-string token (NaN) means a blank line, which marks a sentence break
    if not isinstance(row["token"], str):
        # If the current sentence is not empty, append it to the sentences and start a new one
        if len(sentence) > 0:
            sentences.append(sentence)
            sentence = []
    # Otherwise, keep the row if its POS and NE columns are present (not NaN)...
    elif pd.notna(row["pos"]) and pd.notna(row["ne"]):
        # ...and it does not indicate the start of a document
        if not row["token"].startswith("-DOCSTART-"):
            sentence.append([row["token"], row["pos"], row["chunk"], row["ne"]])

# Append the final sentence if the file does not end with a blank line
if len(sentence) > 0:
    sentences.append(sentence)

In [None]:
# Inspect the first few sentences rather than printing the whole list
sentences[:3]

#### Extract sentence features

The 'sklearn-crfsuite' documentation includes a tutorial on its CRF model, with sample code for extracting word features as a dictionary in the format the model expects. The function below is based on that sample code, making use of:

* Current words
* Previous words
* Next words
* Current POS tags
* Previous and next POS tags

These features are all used in the Stanford NLP group's work on using CRFs for NER (Finkel et al., 2005). That work also uses a 'current word shape' feature, which encodes the upper-cased letters, lower-cased letters, and digits that make up a word (for example, 'CoNLL-2003' => 'XxXXX-dddd'). In the 'sklearn-crfsuite' implementation below, the 'word.isupper()', 'word.istitle()', and 'word.isdigit()' features are used in place of this.
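
For comparison, a word shape feature of the kind described above could be computed as in the sketch below. This is an illustration only: the function name `word_shape` is hypothetical, it is not used by the feature function in this notebook, and it may differ in detail from the feature used by Finkel et al.

```python
def word_shape(word):
    """Map upper-case letters to 'X', lower-case letters to 'x', and digits to 'd'."""
    shape = []
    for char in word:
        if char.isupper():
            shape.append("X")
        elif char.islower():
            shape.append("x")
        elif char.isdigit():
            shape.append("d")
        else:
            # Keep punctuation and other symbols unchanged
            shape.append(char)
    return "".join(shape)


print(word_shape("CoNLL-2003"))  # XxXXX-dddd
```

If you wanted to experiment with it, one option would be to add an entry such as `"word.shape": word_shape(word)` to the feature dictionary (the key name is an assumption).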

The function below has also had a flag added to it to include chunk tags from the training data as features, for the current, previous, and next words.

In [None]:
def word_features(sentence, i, use_chunks=False):
    # Get the current word and POS
    word = sentence[i][0]
    pos = sentence[i][1]
    # Create a feature dictionary, based on characteristics of the current word and POS
    features = { "bias": 1.0,
                 "word.lower()": word.lower(),
                 "word[-3:]": word[-3:],
                 "word[-2:]": word[-2:],
                 "word.isupper()": word.isupper(),
                 "word.istitle()": word.istitle(),
                 "word.isdigit()": word.isdigit(),
                 "pos": pos,
                 "pos[:2]": pos[:2], # POS category, like verb, noun, rather than the fine-grained POS tag
               }
    # If chunks are being used, add the current chunk to the feature dictionary
    if use_chunks:
        chunk = sentence[i][2]
        features.update({ "chunk": chunk })
    # If this is not the first word in the sentence...
    if i > 0:
        # Get the sentence's previous word and POS
        prev_word = sentence[i-1][0]
        prev_pos = sentence[i-1][1]
        # Add characteristics of the sentence's previous word and POS to the feature dictionary
        features.update({ "-1:word.lower()": prev_word.lower(),
                          "-1:word.istitle()": prev_word.istitle(),
                          "-1:word.isupper()": prev_word.isupper(),
                          "-1:pos": prev_pos,
                          "-1:pos[:2]": prev_pos[:2],
                        })
        # If chunks are being used, add the previous chunk to the feature dictionary
        if use_chunks:
            prev_chunk = sentence[i-1][2]
            features.update({ "-1:chunk": prev_chunk })
    # Otherwise, add 'BOS' (beginning of sentence) to the feature dictionary
    else:
        features["BOS"] = True
    # If this is not the last word in the sentence...
    if i < len(sentence)-1:
        # Get the sentence's next word and POS
        next_word = sentence[i+1][0]
        next_pos = sentence[i+1][1]
        # Add characteristics of the sentence's next word and POS to the feature dictionary
        features.update({ "+1:word.lower()": next_word.lower(),
                          "+1:word.istitle()": next_word.istitle(),
                          "+1:word.isupper()": next_word.isupper(),
                          "+1:pos": next_pos,
                          "+1:pos[:2]": next_pos[:2],
                        })
        # If chunks are being used, add the next chunk to the feature dictionary
        if use_chunks:
            next_chunk = sentence[i+1][2]
            features.update({ "+1:chunk": next_chunk })
    # Otherwise, add 'EOS' (end of sentence) to the feature dictionary
    else:
        features["EOS"] = True
    # Return the feature dictionary
    return features

Using the word_features function, a list of feature dictionaries for each word token in a sentence can be extracted, corresponding to a list of NE labels for each word token in a sentence.

In [None]:
# Return a feature dictionary for each word in a given sentence
def sentence_features(sentence, use_chunks=False):
    return [word_features(sentence, i, use_chunks) for i in range(len(sentence))]

# Return the label (NER tag) for each word in a given sentence
def sentence_labels(sentence):
    return [label for token, pos, chunk, label in sentence]

#### Split the sentences into training and test sets

Using the functions defined above, X and y can be extracted as lists of feature dictionaries for each word token in each sentence, and as lists of NE labels for each word token in each sentence, respectively. scikit-learn's 'train_test_split' function can then be used to split X and y into training and test sets, with 80% of the sentences used for training and 20% for testing.

In [None]:
# For each sentence, extract the sentence features as X, and the labels as y
X = [sentence_features(sentence) for sentence in sentences]
y = [sentence_labels(sentence) for sentence in sentences]

# Split X and y into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("First token features:\n{}\n{}".format("-"*21, X_train[0][0]))
print("\nFirst token label:\n{}\n{}".format("-"*18, y_train[0][0]))

#### Train a CRF model

The training data can now be used to train a CRF model to map the feature dictionaries to output NE labels. CRFs have been a popular choice for training named entity recognition models following the success of the Stanford NLP group's work on NER (Finkel et al., 2005). The model is trained with the gradient-based L-BFGS optimisation algorithm (a quasi-Newton method) and uses elastic net regularisation, which combines an L1 penalty weighted by `c1` with an L2 penalty weighted by `c2`.
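
For reference, an elastic net training objective of this kind can be written as below. This is a sketch in standard notation rather than a transcription of the library's source: $\theta$ denotes the feature weights, $(x^{(i)}, y^{(i)})$ are the $N$ training sentences with their label sequences, $c_1$ and $c_2$ correspond to the `c1` and `c2` arguments of `CRF`, and the exact scaling of the penalty terms inside the library is an assumption.

$$
\min_{\theta} \; -\sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right) + c_1 \lVert \theta \rVert_1 + c_2 \lVert \theta \rVert_2^2
$$

Setting $c_1 = 0$ leaves a purely L2-regularised model, while a larger $c_1$ pushes more weights to exactly zero and so yields a sparser model.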

In [None]:
CRF._get_param_names()

In [None]:
!pip install scikit-learn==0.23.1 --user

In [None]:
# Create a new CRF model
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics
crf = CRF(algorithm="lbfgs",
          c1=0.1,
          c2=0.1,
          max_iterations=100,
          all_possible_transitions=True)


# Train the CRF model on the supplied training data
crf.fit(X_train, y_train)

#### Evaluate the CRF model

The trained model can now be used to make predictions based on the test data, which can in turn be compared to the expected labels from the test data to produce a classification report (precision, recall and F1 scores).

The model is performing pretty well, with a 91% F1-score.
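
The "flat" in `flat_classification_report` refers to flattening the per-sentence label lists into one long token-level list before scoring. The sketch below shows that idea in plain Python for a single label. The helper `flat_prf` is hypothetical (it is not part of sklearn-crfsuite), and the toy label sequences are made up for illustration.

```python
from itertools import chain


def flat_prf(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one label."""
    # Flatten the lists of per-sentence labels into single token-level lists
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    pairs = list(zip(true_flat, pred_flat))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy gold and predicted labels for two sentences (made up for illustration)
y_true = [["B-LOC", "O", "O"], ["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-LOC", "O", "O"], ["B-PER", "O", "O", "O"]]
print(flat_prf(y_true, y_pred, "B-LOC"))
```

Because scoring happens per token, a multi-token entity that is only partly tagged still earns partial credit; entity-level scores computed over whole spans can therefore be lower than the token-level figures.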

In [None]:
# Use the CRF model to make predictions on the test data
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels=classes))

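#### Check the state features

The learned state feature weights show which (feature, label) pairs the model relies on most: a large positive weight favours a label when the feature is present, and a large negative weight counts against it.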

In [None]:
from collections import Counter

def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))
print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))
print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])

### References

Finkel, J.R., Grenager, T. & Manning, C. (2005). 'Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling'. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05). pp. 363–370.

sklearn-crfsuite (n.d.). 'Tutorial - sklearn-crfsuite 0.3 documentation' [https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#features]. Accessed 2018-11-30.