# Named Entity Recognition (NER) on Twitter

In this notebook, I apply three approaches to a custom Named Entity Recognition (NER) problem on Twitter data. NER is the task of locating named entities mentioned in unstructured text and classifying them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

In this dataset, each token is labeled with one of 21 different tags.

tags = ['O', 'B-musicartist', 'I-musicartist', 'B-product', 'I-product', 'B-company', 'B-person', 'B-other', 'I-other', 'B-facility',
    'I-facility', 'B-sportsteam', 'B-geo-loc', 'I-geo-loc', 'I-company', 'I-person', 'B-movie', 'I-movie', 'B-tvshow', 'I-tvshow',
    'I-sportsteam'],

where the 'B-' and 'I-' prefixes mark the beginning and the inside of an entity, and 'O' marks a token that is outside any entity.
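To make the B-/I-/O scheme concrete, here is a minimal sketch that groups a tagged token sequence into entity spans. The helper name `bio_to_spans` and the sample tweet are illustrative assumptions; they are not part of the dataset or of the pipeline in this notebook.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration. An 'I-' tag that does not continue
    an open span of the same type is treated as the start of a new span.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != current_type)
        if starts_new:
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-'):
            current_tokens.append(token)
        else:  # 'O' closes any open span
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


tokens = ['Watching', 'Breaking', 'Bad', 'in', 'New', 'York']
tags = ['O', 'B-tvshow', 'I-tvshow', 'O', 'B-geo-loc', 'I-geo-loc']
spans = bio_to_spans(tokens, tags)
print(spans)  # [('tvshow', 'Breaking Bad'), ('geo-loc', 'New York')]
```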
    
    




### Models

Across this notebook and the following two, we examine the dataset with five models:

- <mark>Naive Bayes multinomial model</mark>
- <mark>Conditional Random Fields (CRFs)</mark>
- <mark>Custom SpaCy</mark>
- BERT in Spark NLP
- Simple Transformer 

In this notebook we consider only the first three models. The following two notebooks discuss the last two.

### Preprocess Data

In [None]:
def read_data(file_path):
    """Read a token-per-line file into parallel lists of tokens and tags, one list per tweet."""
    tokens = []
    tags = []

    tweet_tokens = []
    tweet_tags = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # A blank line marks the end of a tweet.
                if tweet_tokens:
                    tokens.append(tweet_tokens)
                    tags.append(tweet_tags)
                tweet_tokens = []
                tweet_tags = []
            else:
                token, tag = line.split()
                # Replace user mentions with <USR> and links with <URL>.
                if token.startswith('@'):
                    token = '<USR>'
                elif token.startswith(('http://', 'https://')):
                    token = '<URL>'

                tweet_tokens.append(token)
                tweet_tags.append(tag)

    # Keep the last tweet when the file does not end with a blank line.
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)

    return tokens, tags

In [None]:
train_tokens, train_tags = read_data('data/train.txt')
test_tokens, test_tags = read_data('data/test.txt')
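For reference, `read_data` expects one whitespace-separated `token tag` pair per line, with a blank line between tweets. The sketch below applies the same parsing and replacement rules to an in-memory string instead of a file. The sample text and the helper name `normalize_token` are made up for illustration and do not come from `data/train.txt`.

```python
def normalize_token(token):
    # Same replacement rule as read_data: mentions and links become placeholders.
    if token.startswith('@'):
        return '<USR>'
    if token.startswith(('http://', 'https://')):
        return '<URL>'
    return token


# Hypothetical sample in the assumed "token tag" format, blank line between tweets.
sample = """@some_user O
visiting O
London B-geo-loc
http://example.com/x O

Nice O
"""

tweets, current = [], []
for line in sample.splitlines():
    line = line.strip()
    if not line:
        if current:
            tweets.append(current)
        current = []
    else:
        token, tag = line.split()
        current.append((normalize_token(token), tag))
if current:  # last tweet, since the sample does not end with a blank line
    tweets.append(current)

print(tweets)
```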

### 1. Multinomial Naive Bayes classifier

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

#### Transform list data to pandas dataframe

In [None]:
df_train_tokens = pd.DataFrame({'words':train_tokens})
df_train_tokens = df_train_tokens.explode('words')
df_train_tokens["sentence #"] = df_train_tokens.index
df_train_tokens = df_train_tokens.reset_index(drop=True)

df_train_tags = pd.DataFrame({'tags':train_tags})
df_train_tags = df_train_tags.explode('tags').reset_index(drop=True)

In [None]:
df_test_tokens = pd.DataFrame({'words':test_tokens})
df_test_tokens = df_test_tokens.explode('words')
df_test_tokens["sentence #"] = df_test_tokens.index
df_test_tokens = df_test_tokens.reset_index(drop=True)

df_test_tags = pd.DataFrame({'tags':test_tags})
df_test_tags = df_test_tags.explode('tags').reset_index(drop=True)

#### Counts of tags (labels)
- Class "O" dominates: it occurs about 1,670 times as often as "B-tvshow", so the data is highly imbalanced.

In [None]:
df_value_counts_train = df_train_tags.tags.value_counts()
df_value_counts_train

In [None]:
df_value_counts_test = df_test_tags.tags.value_counts()
df_value_counts_test
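The imbalance ratio mentioned above can be read directly off the tag counts. Here is a minimal standard-library sketch of that calculation; `toy_tags` is a made-up stand-in for `train_tags`, so the numbers it prints say nothing about the real dataset.

```python
from collections import Counter

# Toy stand-in for train_tags (one tag list per tweet); the values are made up.
toy_tags = [['O', 'O', 'B-person', 'O'],
            ['O', 'B-tvshow', 'I-tvshow', 'O', 'O']]

counts = Counter(tag for tweet in toy_tags for tag in tweet)
majority_tag, majority_count = counts.most_common(1)[0]

# How many times more frequent the majority tag is than each tag.
ratios = {tag: majority_count / n for tag, n in counts.items()}
print(majority_tag, ratios['B-tvshow'])  # O 6.0
```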

In [None]:
df_value_percentage_train = df_value_counts_train / float(df_train_tags.shape[0])
df_value_percentage_test = df_value_counts_test / float(df_test_tags.shape[0])

In [None]:
df_value_percentage_train

In [None]:
df_value_percentage_test

In [None]:
df_distr = pd.DataFrame(df_value_percentage_train)
df_distr.columns = ['Train']
df_distr["Test"] = df_value_percentage_test
df_distr = df_distr.drop(df_distr.index[[0]])  # drop the first (most frequent) row, "O", so the other tags stay visible

In [None]:
fig_distr = df_distr.plot.bar(figsize=(10,5))
fig_distr.figure.savefig('./images/distribution.png', bbox_inches='tight')

In [None]:
ax = df_value_counts_train.plot.bar(figsize=(10,5))
ax.figure.savefig('./images/counts.png', bbox_inches='tight')

#### Transform the train data to vector using DictVectorizer

In [None]:
v = DictVectorizer(sparse=False)
X_train_nb = v.fit_transform(df_train_tokens.to_dict('records'))
y_train_nb = df_train_tags.tags.values
classes = np.unique(y_train_nb)
classes = classes.tolist()

X_train_nb.shape, y_train_nb.shape

In [None]:
X_test_nb = v.transform(df_test_tokens.to_dict('records'))
y_test_nb = df_test_tags.tags.values
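It helps to know what the vectorizer does with these records. As far as I understand `DictVectorizer`, each string-valued feature becomes one indicator column per `name=value` pair, numeric features are kept as a single column, and features not seen during fitting are ignored at transform time. The pure-Python sketch below imitates that idea on toy records. It is not scikit-learn's implementation, and `fit_vocabulary` and `transform` are hypothetical names.

```python
def fit_vocabulary(records):
    """Map each feature to a column index (string values become 'name=value')."""
    names = set()
    for record in records:
        for name, value in record.items():
            names.add(f'{name}={value}' if isinstance(value, str) else name)
    return {name: idx for idx, name in enumerate(sorted(names))}


def transform(records, vocabulary):
    rows = []
    for record in records:
        row = [0.0] * len(vocabulary)
        for name, value in record.items():
            is_text = isinstance(value, str)
            key = f'{name}={value}' if is_text else name
            if key in vocabulary:  # features unseen during fitting are dropped
                row[vocabulary[key]] = 1.0 if is_text else float(value)
        rows.append(row)
    return rows


train_records = [{'words': 'hello', 'sentence #': 0},
                 {'words': 'world', 'sentence #': 1}]
vocab = fit_vocabulary(train_records)
X = transform(train_records + [{'words': 'unseen', 'sentence #': 2}], vocab)
print(vocab)
print(X)
```

Two things stand out in this toy run. The numeric `sentence #` column passes straight through as a feature, and a dense matrix gets one column per distinct word, which can become large for a full vocabulary.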

The tag "O" is heavily over-represented, so we exclude it when computing precision, recall, and F1-score.

In [None]:
# Keep every label except "O" (filter by name rather than relying on sort order).
new_classes = [c for c in classes if c != 'O']
new_classes

Use GridSearchCV with 5-fold cross-validation to select the smoothing parameter `alpha`.

In [None]:
%%time
# define fixed parameters and parameters to search
nb = MultinomialNB()
params_space = { 'alpha': [0.01, 0.1, 1.0]}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.f1_score,
                        average='macro', labels=new_classes)

# search
gs = GridSearchCV(nb, params_space,
                        cv=5,
                        verbose=1,
                        n_jobs=3,
                        scoring=f1_scorer)
gs.fit(X_train_nb, y_train_nb)
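The scorer above averages F1 over the entity labels only. To see why leaving out "O" matters, here is a small hand-rolled macro-F1 on toy predictions. The data and the helper names `f1_per_label` and `macro_f1` are made up for illustration; the notebook itself uses scikit-learn's `f1_score`.

```python
def f1_per_label(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct prediction for this label
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    # Unweighted mean of the per-label F1 scores.
    return sum(f1_per_label(y_true, y_pred, l) for l in labels) / len(labels)


y_true = ['O', 'O', 'O', 'O', 'B-person', 'B-company']
y_pred = ['O', 'O', 'O', 'O', 'B-person', 'O']  # misses the company

all_labels = ['B-company', 'B-person', 'O']
entity_labels = [l for l in all_labels if l != 'O']

with_o = macro_f1(y_true, y_pred, all_labels)
without_o = macro_f1(y_true, y_pred, entity_labels)
print(round(with_o, 2), without_o)  # 0.63 0.5
```

On this toy data the easy "O" class lifts the macro average from 0.5 to about 0.63, which is the inflation the `labels=new_classes` argument avoids.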

#### Print best parameters and best score

In [None]:
print('best params:', gs.best_params_)
print('best CV score:', gs.best_score_)

#### Select best estimator

In [None]:
nb = gs.best_estimator_

#### Evaluation

In [None]:
from sklearn.metrics import f1_score
print('-' * 20 + ' Train set quality: ' + '-' * 20)
print(f1_score(y_pred=nb.predict(X_train_nb), y_true=y_train_nb, labels=classes, average='micro'))
print('-' * 20 + ' Test set quality: ' + '-' * 20)
print(f1_score(y_pred=nb.predict(X_test_nb), y_true=y_test_nb, labels=classes, average='micro'))

In [None]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
print(classification_report(y_pred=nb.predict(X_train_nb), y_true=y_train_nb, labels=new_classes))
print('-' * 20 + ' Test set quality: ' + '-' * 20)
print(classification_report(y_pred=nb.predict(X_test_nb), y_true=y_test_nb, labels=new_classes))

### 2. Conditional Random Fields (CRFs)

In [None]:
#!pip install sklearn-crfsuite

In [None]:
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from collections import Counter
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

#### Preprocess data in CRFs

In [None]:
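# Copy so that adding the tag column does not modify the frames used for Naive Bayes.
df_train = df_train_tokens.copy()
df_train['tags'] = df_train_tags["tags"]

df_test = df_test_tokens.copy()
df_test['tags'] = df_test_tags["tags"]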

In [None]:
class SentenceGetter(object):
    """Group the token-level frame into sentences of (word, tag) pairs."""

    def __init__(self, data):
        self.n_sent = 0  # "sentence #" values start at 0
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s['words'].values.tolist(),
                                                           s['tags'].values.tolist())]
        self.grouped = self.data.groupby('sentence #').apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        # Return the next sentence, or None once every sentence has been returned.
        try:
            s = self.grouped[self.n_sent]
            self.n_sent += 1
            return s
        except KeyError:
            return None
        
getter_train = SentenceGetter(df_train)
sentences_train = getter_train.sentences


getter_test = SentenceGetter(df_test)
sentences_test = getter_test.sentences

In [None]:
def word2features(sent, i):
    word = sent[i][0]
    
    features = {
        'bias': 1.0, 
        'word.lower()': word.lower(), 
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        # Previous token; sent[i-1][1] holds its gold tag and is deliberately not used as a feature.
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True
    return features
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]
def sent2labels(sent):
    return [label for token,  label in sent]
def sent2tokens(sent):
    return [token for token, label in sent]

In [None]:
X_train_crf = [sent2features(s) for s in sentences_train]
y_train_crf = [sent2labels(s) for s in sentences_train]


X_test_crf = [sent2features(s) for s in sentences_test]
y_test_crf = [sent2labels(s) for s in sentences_test]

#### Train and find the best parameters

In [None]:
%%time
# define fixed parameters and parameters to search
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric as in the evaluation below: macro F1 over the entity labels (without "O")
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='macro', labels=new_classes)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-2,
                        n_iter=30,
                        scoring=f1_scorer)
rs.fit(X_train_crf, y_train_crf)
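The search space draws `c1` and `c2` (the L1 and L2 regularization coefficients in sklearn-crfsuite) from exponential distributions, so most candidates are small and a few are larger. The sketch below mimics that sampling with the standard library's `random.expovariate` as a stand-in for `scipy.stats.expon(scale=...)`. The seed, sample count, and helper name `sample_expon` are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_expon(scale, n):
    # expovariate takes the rate lambda = 1 / scale, so the mean equals `scale`.
    return [rng.expovariate(1.0 / scale) for _ in range(n)]


c1_samples = sample_expon(0.5, 10000)   # same scale as 'c1' above
c2_samples = sample_expon(0.05, 10000)  # same scale as 'c2' above

mean_c1 = sum(c1_samples) / len(c1_samples)
mean_c2 = sum(c2_samples) / len(c2_samples)
share_small_c1 = sum(x < 0.5 for x in c1_samples) / len(c1_samples)

print('mean c1:', round(mean_c1, 3))
print('mean c2:', round(mean_c2, 4))
print('share of c1 draws below the scale:', round(share_small_c1, 2))
```

With `n_iter=30`, the search above sees only 30 such draws, so the best parameters it reports can vary noticeably from run to run.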

In [None]:
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

In [None]:
crf = rs.best_estimator_

#### Evaluation

In [None]:
from sklearn.metrics import f1_score
print('-' * 20 + ' Train set quality: ' + '-' * 20)
print(metrics.flat_f1_score(y_pred=crf.predict(X_train_crf), y_true=y_train_crf, labels=classes, average='micro'))
print('-' * 20 + ' Test set quality: ' + '-' * 20)
print(metrics.flat_f1_score(y_pred=crf.predict(X_test_crf), y_true=y_test_crf, labels=classes, average='micro'))

In [None]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
print(metrics.flat_classification_report(y_pred=crf.predict(X_train_crf), y_true=y_train_crf, labels=new_classes))
print('-' * 20 + ' Test set quality: ' + '-' * 20)
print(metrics.flat_classification_report(y_pred=crf.predict(X_test_crf), y_true=y_test_crf, labels=new_classes))

#### CRFs Transitions

In [None]:
def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))
print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))
print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])
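When reading the transition table, one useful check is whether the model has learned the BIO constraints: an `I-x` tag should only follow `B-x` or `I-x`, so transitions such as `O -> I-person` would be expected among the unlikely ones. The sketch below flags such transitions. The weights are made up and only copy the `{(from, to): weight}` layout of `crf.transition_features_`, and `is_valid_bio_transition` is a hypothetical helper.

```python
def is_valid_bio_transition(label_from, label_to):
    """An 'I-x' tag may only follow 'B-x' or 'I-x'; every other move is allowed."""
    if label_to.startswith('I-'):
        entity = label_to[2:]
        return label_from in ('B-' + entity, 'I-' + entity)
    return True


# Made-up weights, same layout as crf.transition_features_.
toy_transitions = {
    ('B-person', 'I-person'): 5.1,
    ('O', 'I-person'): -3.2,
    ('B-company', 'I-person'): -1.7,
    ('O', 'B-company'): 0.4,
}

invalid = sorted(pair for pair in toy_transitions
                 if not is_valid_bio_transition(*pair))
for pair in invalid:
    print('%-10s -> %-10s %0.2f' % (pair[0], pair[1], toy_transitions[pair]))
```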

#### Feature weights

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))
print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))
print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])

#### Inspect weights with eli5

In [None]:
import eli5
eli5.show_weights(crf, top=10)

In [None]:
eli5.show_weights(crf, top=10, targets=['O', 'B-company', 'I-person'])

In [None]:
eli5.show_weights(crf, top=10, feature_re=r'^word\.is',
                  horizontal_layout=False, show=['targets'])

### 3. SpaCy

- Online learning of a pre-trained spaCy NER model.

Architecture of spaCy NER:

- The spaCy NER system combines a word embedding strategy based on <mark>subword features</mark> and <mark>"Bloom" embeddings</mark> with a deep <mark>convolutional</mark> neural network with <mark>residual</mark> connections (<mark>residual CNNs</mark>).
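To give a feel for the "Bloom" embedding idea as I understand it, here is a toy sketch. Instead of storing one vector per word, each key is hashed several times into a small table and the selected rows are summed, so even unseen words get a vector. This is not spaCy's implementation: the table sizes, the MD5-based hashing, and the names `row_index` and `bloom_embed` are all assumptions made for illustration.

```python
import hashlib

N_ROWS, DIM, N_HASHES = 16, 4, 4  # toy sizes; real tables are far larger


def row_index(key, seed):
    # Derive a table row from the key and a seed (one seed per hash function).
    digest = hashlib.md5(f'{seed}:{key}'.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'little') % N_ROWS


# A tiny deterministic "embedding table"; a trained model would learn these values.
table = [[(r * DIM + c) / 100.0 for c in range(DIM)] for r in range(N_ROWS)]


def bloom_embed(key):
    rows = [table[row_index(key, seed)] for seed in range(N_HASHES)]
    return [sum(column) for column in zip(*rows)]  # element-wise sum of the rows


vec = bloom_embed('TheJewelryGirlsPlace')
print(len(vec), vec)
```

Two different words may share one of their rows, but they are unlikely to share all of them, which is what lets a small table serve a large vocabulary.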

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

#### Preprocess data in SpaCy

In [None]:
train_data = []
for i in range(len(train_tokens)):
    text = " ".join(train_tokens[i])
    entities = []
    token_start_point = 0
    for j in range(len(train_tags[i])):
        entities.append((token_start_point, token_start_point + len(train_tokens[i][j]) ,train_tags[i][j].upper()))
        token_start_point += len(train_tokens[i][j]) + 1
    train_data.append((text, {"entities" : entities}))  
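The loop above turns every token into a `(start, end, label)` character span over the space-joined text. A minimal sketch on a made-up tweet shows that the offsets line up with the tokens; `token_spans` is a hypothetical helper that repeats the same arithmetic.

```python
def token_spans(tokens, tags):
    spans, start = [], 0
    for token, tag in zip(tokens, tags):
        end = start + len(token)
        spans.append((start, end, tag.upper()))
        start = end + 1  # skip the single joining space
    return ' '.join(tokens), spans


toy_tokens = ['<USR>', 'loves', 'New', 'York']
toy_tags = ['O', 'O', 'B-geo-loc', 'I-geo-loc']
text, spans = token_spans(toy_tokens, toy_tags)

print(text)
for start, end, label in spans:
    print(start, end, label, repr(text[start:end]))
```

Note that in this format every token is its own span, including "O" tokens, and the B- and I- parts of one entity are kept as separate labels.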

#### Save train_data

In [None]:
# with open("train_data", 'wb') as fp:
#     pickle.dump(train_data, fp)

In [None]:
# Set up the pipeline and entity recognizer (this notebook uses the spaCy 2.x API).
model = None
if model is not None:
    nlp = spacy.load(model)  # load existing spacy model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank('en')  # create blank Language class
    print("Created blank 'en' model")
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
else:
    ner = nlp.get_pipe('ner')

In [None]:
# Add new entity labels to entity recognizer

LABEL = [item.upper() for item in classes]
for i in LABEL:
    ner.add_label(i)
# Initialize the optimizer
if model is None:
    optimizer = nlp.begin_training()
else:
    optimizer = nlp.entity.create_optimizer()

In [None]:
# Disable the other pipes during training so that only the NER weights are updated

import random
from spacy.util import minibatch, compounding
import time

start = time.time()
n_iter = 30
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        batches = minibatch(train_data, 
                            size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch) 
            # Updating the weights
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print('Losses', losses)
print(time.time()-start)        

In [None]:
# # Save model 
# from pathlib import Path
# new_model_name = "spacy_ner"
# output_dir = '/Users/wenjuanyang/natural-language-processing/week2'
# if output_dir is not None:
#     output_dir = Path(output_dir)
#     if not output_dir.exists():
#         output_dir.mkdir()
#     nlp.meta['name'] = new_model_name  # rename model
#     nlp.to_disk(output_dir)
#     print("Saved model to", output_dir)

#### Evaluation

- spaCy tokenizes the text differently from the dataset, so len(y_train) != len(y_pred). We therefore cannot use classification_report or flat_classification_report to evaluate the spaCy model.

In [None]:
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot["entities"])
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores


In [None]:
# load model from dir
#nlp2 = spacy.load(output_dir)
#train_results = evaluate(nlp2, train_data)
#print(train_results)




train_results = evaluate(nlp, train_data)
print(train_results)

In [None]:
test_data = []
for i in range(len(test_tokens)):
    text = " ".join(test_tokens[i])
    entities = []
    token_start_point = 0
    for j in range(len(test_tags[i])):
        entities.append((token_start_point, token_start_point + len(test_tokens[i][j]) ,test_tags[i][j].upper()))
        token_start_point += len(test_tokens[i][j]) + 1
    test_data.append((text, {"entities" : entities}))  

In [None]:
test_results = evaluate(nlp, test_data)
test_results

In [None]:
def macro_f1_score(results):
    # Average the per-type F-scores, ignoring the "O" label.
    scores = [v['f'] for k, v in results['ents_per_type'].items() if k != 'O']
    return sum(scores) / len(scores) if scores else 0.0

In [None]:
test_macro_f1 = macro_f1_score(test_results)
test_macro_f1

In [None]:
train_macro_f1 = macro_f1_score(train_results)
train_macro_f1

#### Wrong Splits
- Positive example: in "I just called into work Tuesday night...it 's", spaCy splits "night...it" into three separate parts.

- Negative example: in "Twist Ring Twist Ring by* TheJewelryGirlsPlace", spaCy treats "by* TheJewelryGirlsPlace" as subwords.
  In some examples, words are missing.
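The length mismatch comes from tokenization: the dataset is split on whitespace, while a tokenizer that also breaks on punctuation yields more tokens for the same text. The sketch below uses a simple regular expression as a stand-in for spaCy's tokenizer, which it does not reproduce, to show the effect on the first example above.

```python
import re

text = "I just called into work Tuesday night...it 's"

whitespace_tokens = text.split()
# Stand-in rule: a run of two or more dots is its own token; otherwise take
# runs of characters that are neither whitespace nor dots.
regex_tokens = re.findall(r"\.{2,}|[^\s.]+", text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

Here the gold annotation has 8 tokens but the stand-in tokenizer produces 10, so a token-by-token comparison of tags is no longer aligned.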


In [None]:
def wrong_split(ner_model,test_data):
    wrong_count = 0
    for i in range(len(test_data)):
        doc = ner_model(test_data[i][0])
        pred = [(ent.text, ent.label_) for ent in doc.ents]
        if len(pred) != len(test_data[i][1]["entities"]):
            print(i, "\n")
            print(test_data[i][0], "\n")
            print(pred, "\n")
            print(test_data[i][1]["entities"], "\n\n")
            wrong_count += 1
        if wrong_count == 10:
            break

In [None]:
wrong_split(nlp, test_data)