# Sketch for Named Entity Recognition for Number of Employees

This notebook is a first attempt at a Named Entity Recognition (NER) tagger for `n_employees`. It is not part of the binary classifier implemented for the number of employees. Its performance and limitations are shown below.

In [1]:
import random
import re
import sys

import numpy as np
import pandas as pd
import spacy  # used below to build the NER pipeline
from sklearn.model_selection import train_test_split

# Add the ../scripts folder to the module search path
sys.path.append('../scripts')
from cleaning import *
from preprocessing import *

We import a set of labelled data for `n_employees` and use the `train_test_split` function to divide it into a training set and a testing set.

In [4]:
data = pd.read_csv('../data/processed/training_data_n_employees.csv')
y = data['class']
X = data[['filename', 'text', 'match_string', 'page']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3)

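As the data is quite unbalanced, we build an artificially balanced sample by keeping all positive rows and drawing 51 negative rows at random. Note that the cells below train on `X_train`, not on `train_data`.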

In [5]:
# Positives: rows whose text actually mentions the number of employees.
# Negatives: matcher results that do not contain the number of employees.
positives = data[data['class'] == 1]
negatives = data[data['class'] == 0]

# Keep all positives and add a random sample of 51 negatives
sampled = random.sample(list(negatives.index), 51)
train_data = pd.concat([positives, negatives.loc[sampled]],
                       ignore_index=True)
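
The balancing step can be sketched on plain Python lists as follows. This is only an illustration under assumptions: the `rows` records and the `balance` helper are hypothetical, the sample size is set to the size of the smaller class rather than to 51, and a seeded `random.Random` is used so that the sample is reproducible, which the cell above does not do.

```python
import random

def balance(rows, seed=0):
    """Downsample the larger class so that both classes have equal size.

    `rows` is a list of dicts with a 'class' key (a hypothetical
    stand-in for the labelled dataframe).
    """
    positives = [r for r in rows if r["class"] == 1]
    negatives = [r for r in rows if r["class"] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed: same sample on every run
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced

# Toy data: 3 positives and 7 negatives
rows = [{"id": i, "class": 1 if i < 3 else 0} for i in range(10)]
balanced = balance(rows)
print(len(balanced), sum(r["class"] for r in balanced))  # prints: 6 3
```

Applying such a step to the training split only, after `train_test_split`, would keep test rows out of the balanced training sample.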

In [6]:
def identify_n_employees(row, label):
    """
    Define the start and end position of the entity n_employees
    for a labelled row of the dataframe.

    Args:
        row (Series): row of the dataframe
        label (int): 1 if the text in the row contains the number of
            employees, 0 if it doesn't

    Returns:
        ID (tuple): (start, end, name) where
            start: index of the first character of the entity in the string
            end: index one past the last character of the entity in the string
            name: name given to the entity
    """
    # If the row contains text with the number of employees,
    # return the ID with name="n_employees"
    if label == 1:
        text = row['text']
        match_strings = row['match_string'].split(' ')
        # Limitation: both tokens are used as regex patterns and searched
        # from the beginning of the text, so `end` can be smaller than
        # `start` when the last token also occurs earlier in the text
        start = re.search(match_strings[0], text).start()
        end = re.search(match_strings[-1], text).end()
        ID = (start, end, "n_employees")

    # If the row does not contain text with the number of employees,
    # return an empty ID
    else:
        ID = (None, None, None)
    return ID
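
A more defensive way to locate the span could look like the sketch below. It is not the function used in this notebook: `find_span` is a hypothetical helper that escapes the tokens so they are matched literally, searches for the last token only after the first one, and returns `None` when nothing is found.

```python
import re

def find_span(text, match_string, label="n_employees"):
    """Return (start, end, label) for match_string in text, or None."""
    tokens = match_string.split()
    if not tokens:
        return None
    # Escape the token so regex metacharacters are matched literally
    first = re.search(re.escape(tokens[0]), text)
    if first is None:
        return None
    if len(tokens) == 1:
        return (first.start(), first.end(), label)
    # Search for the last token only after the end of the first token,
    # so the end offset can never precede the start offset
    last = re.compile(re.escape(tokens[-1])).search(text, first.end())
    if last is None:
        return None
    return (first.start(), last.end(), label)

# The last token "1,250" also occurs before the entity in this text
text = "1,250 staff in total. The average number of employees was 1,250."
span = find_span(text, "average number of employees was 1,250")
print(span, repr(text[span[0]:span[1]]))
```

In this example the slice `text[start:end]` recovers the full match string even though its last token also appears at the very beginning of the text.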

We create a blank spaCy pipeline and add a named entity recognition (NER) pipe to it.

In [7]:
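# spaCy v2-style API: create the NER component, then add it to the pipeline
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)
# Register the only entity label the tagger should learn
ner.add_label("n_employees")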

In [8]:
# Define where the entities are in the training set:
# for each row of the training data and its corresponding label,
# store the ID tuple in a one-element list, or an empty list if there is no entity

X_train['entities'] = [[identify_n_employees(X_train.loc[i], y_train[i])]
                          if identify_n_employees(X_train.loc[i], y_train[i])[2] is not None
                          else [] for i in X_train.index]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['entities'] = [[identify_n_employees(X_train.loc[i], y_train[i])]



In [9]:
X_train['entities']

127     [(29, 55, n_employees)]
38     [(159, 39, n_employees)]
225     [(12, 39, n_employees)]
269    [(91, 117, n_employees)]
125                          []
                 ...           
203      [(4, 26, n_employees)]
255                          []
72      [(20, 47, n_employees)]
235                          []
37     [(115, 39, n_employees)]
Name: entities, Length: 224, dtype: object
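
Some spans in the output above have an end offset smaller than their start offset (rows 38 and 37), so they do not address any text. A check along the following lines could flag such rows before training. The `is_valid_span` helper and the text length of 200 are assumptions made for illustration; only the two spans are copied from the output above.

```python
def is_valid_span(span, text_len):
    """True if (start, end, label) addresses a non-empty slice of the text."""
    start, end, _label = span
    return 0 <= start < end <= text_len

# Spans copied from rows 127 and 38 of the output above;
# the text length is a hypothetical value.
spans = {127: (29, 55, "n_employees"), 38: (159, 39, "n_employees")}
invalid = [row for row, span in spans.items() if not is_valid_span(span, 200)]
print(invalid)  # prints: [38]
```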

We train the model for 20 iterations.

In [10]:
import datetime as dt

optimizer = nlp.begin_training()
for i in range(20):
    # Shuffle the training rows at every iteration
    index = list(X_train.index)
    random.shuffle(index)
    losses = {}
    # Update the model one example at a time
    for j in index:
        nlp.update([X_train.loc[j, 'text']],
                   [{'entities': X_train.loc[j, 'entities']}],
                   sgd=optimizer,
                   losses=losses)
    print(f"Losses at iteration {i} - {dt.datetime.now()}", losses)

Losses at iteration 0 - 2021-09-02 23:31:04.851451 {'ner': 807.7058540482242}
Losses at iteration 1 - 2021-09-02 23:31:17.320495 {'ner': 242.8957403522908}
Losses at iteration 2 - 2021-09-02 23:31:27.888339 {'ner': 145.48093393548902}
Losses at iteration 3 - 2021-09-02 23:31:44.963613 {'ner': 175.7385099519872}
Losses at iteration 4 - 2021-09-02 23:31:56.314223 {'ner': 180.7361507564931}
Losses at iteration 5 - 2021-09-02 23:32:11.974292 {'ner': 257.0131924496304}
Losses at iteration 6 - 2021-09-02 23:32:28.282448 {'ner': 97.66807456508076}
Losses at iteration 7 - 2021-09-02 23:32:42.590461 {'ner': 483.3261671778362}
Losses at iteration 8 - 2021-09-02 23:32:57.245884 {'ner': 125.59524510648001}
Losses at iteration 9 - 2021-09-02 23:33:10.476808 {'ner': 153.976823864141}
Losses at iteration 10 - 2021-09-02 23:33:23.228166 {'ner': 192.11500536327532}
Losses at iteration 11 - 2021-09-02 23:33:37.345677 {'ner': 169.92822112347025}
Losses at iteration 12 - 2021-09-02 23:33:57.362813 {'ner':

We first check how the named entity recognition model performs as a binary classifier on the training data itself: a text is predicted as positive if the model finds at least one entity in it.

In [11]:
spacy_pred = []
for text in X_train['text']:
    doc = nlp(text)
    if len(doc.ents)>0:
        spacy_pred.append(1)
    else:
        spacy_pred.append(0)
X_train['spacy_pred']=spacy_pred

In [12]:
from sklearn.metrics import accuracy_score,f1_score, precision_score, recall_score

In [13]:
y_true = y_train
y_pred = X_train['spacy_pred']

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true,y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true,y_pred)

print(f""" Accuracy score: {acc}
F1 score: {f1}
Precision: {precision}
Recall: {recall}
""")

 Accuracy score: 0.9642857142857143
F1 score: 0.9595959595959594
Precision: 0.979381443298969
Recall: 0.9405940594059405



More importantly, let us see how it performs as a binary classifier on the test data.

In [14]:
spacy_pred = []
for text in X_test['text']:
    doc = nlp(text)
    if len(doc.ents)>0:
        spacy_pred.append(1)
    else:
        spacy_pred.append(0)
X_test['spacy_pred']=spacy_pred

We can have a look at the scores.

In [15]:
y_true = y_test
y_pred = X_test['spacy_pred']

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true,y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true,y_pred)

print(f""" Accuracy score: {acc}
F1 score: {f1}
Precision: {precision}
Recall: {recall}
""")

 Accuracy score: 0.8333333333333334
F1 score: 0.8222222222222222
Precision: 0.8809523809523809
Recall: 0.7708333333333334
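
To make the scores easier to interpret, the sketch below recomputes them from confusion-matrix counts. The counts are not printed anywhere in this notebook: they are a reconstruction, inferred from the reported scores together with the 224 training rows and the 96 test rows implied by `test_size = 0.3`. The `scores` helper is hypothetical.

```python
def scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts inferred from the reported scores (an assumption, see above)
train = scores(tp=95, fp=2, fn=6, tn=121)  # 224 training rows
test = scores(tp=37, fp=5, fn=11, tn=43)   # 96 test rows

for name, result in [("train", train), ("test", test)]:
    acc, precision, recall, f1 = result
    print(f"{name}: acc={acc:.4f} precision={precision:.4f} "
          f"recall={recall:.4f} f1={f1:.4f}")
```

Under this reconstruction, the model misses 11 of the 48 positive test texts, which is what drives the lower recall on the test data.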



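As a binary classifier, the trained spaCy NER model does not outperform the sklearn model used in the implementation. On the test data, accuracy drops to 0.83 and recall to 0.77, compared with 0.96 and 0.94 on the training data. We now visualize some of the NER output using displaCy.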

In [19]:
from spacy import displacy

In [20]:
for text in X_test['text'][:20]:
    doc = nlp(text)
    displacy.render(doc, style="ent", jupyter=True)




## Conclusion

While many expressions in the training data indicate that the metric is nearby, they are not suitable for extracting the metric itself (e.g. "average headcount"). The training data used for the binary classifier therefore cannot be reused as is for the NER.

To refine this, we suggest training a spaCy model to look for cardinal entities in the text already classified as containing the number of employees.
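
As a rough illustration of that second step, the sketch below lists candidate numbers in a text that is assumed to be already classified as positive. It uses a regular expression as a stand-in for a cardinal entity recognizer, so it is not the suggested spaCy model; the `candidate_numbers` helper and the example sentence are hypothetical.

```python
import re

# Integers with optional thousands separators, e.g. "1,250" or "847"
NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3})+\b|\b\d+\b")

def candidate_numbers(text):
    """Return every number-like token found in the text, in order."""
    return [match.group() for match in NUMBER.finditer(text)]

text = "The average number of employees was 1,250 (2019: 1,100)."
print(candidate_numbers(text))  # prints: ['1,250', '2019', '1,100']
```

The year and the prior-year figure are returned alongside the target value, so a further step would still be needed to decide which candidate is the number of employees.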