# Machine learning entity recognition with spaCy

This notebook presents the main components of spaCy and shows how it can be used for NLP (*Natural Language Processing*) and NER (*Named Entity Recognition*).


In [23]:
from __future__ import unicode_literals, print_function

import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
from spacy import displacy
import copy
import shutil

from traindata import TRAIN_DATA
from testdata import test_text

# What is NER?

A named entity is a "real-world object" that is assigned a name: for example, a person, a country, a product or a book title. NER is the task of automatically recognizing these entities in text.

In our case, the model is trained to recognise Machine Learning related entities, using the labels shown below.

In [3]:
"""
USED ENTITIES
SML = "SUPERVISED MACHINE LEARNING ALGORITHMS" 
USML = "UNSUPERVISED MACHINE LEARNING ALGORITHMS"
MLS = "MACHINE LEARNING SOFTWARE"
NN = "NEURAL NETWORKS"
EVM = "EVALUATION METHODS"
OPM = "OPTIMIZATION METHODS" 
MLP = "MACHINE LEARNING PREPROCESSING" 
MLA = "MACHINE LEARNING APPLICATIONS" 
"""

LABELS = ["EVM","MLA","MLP","MLS","NN","OPM","SML","USML"]
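
As a small sketch, the same abbreviations could also be kept in a dictionary, so that the description of a label can be printed next to a prediction. The names `LABEL_DESCRIPTIONS` and `describe` are illustrative and not part of the original notebook.

```python
# Hypothetical helper: the label abbreviations above as a dictionary.
LABEL_DESCRIPTIONS = {
    "SML": "SUPERVISED MACHINE LEARNING ALGORITHMS",
    "USML": "UNSUPERVISED MACHINE LEARNING ALGORITHMS",
    "MLS": "MACHINE LEARNING SOFTWARE",
    "NN": "NEURAL NETWORKS",
    "EVM": "EVALUATION METHODS",
    "OPM": "OPTIMIZATION METHODS",
    "MLP": "MACHINE LEARNING PREPROCESSING",
    "MLA": "MACHINE LEARNING APPLICATIONS",
}

# Sorting the keys gives the same list as LABELS above.
LABELS = sorted(LABEL_DESCRIPTIONS)


def describe(label):
    """Return a label together with its long description."""
    return "%s (%s)" % (label, LABEL_DESCRIPTIONS.get(label, "unknown label"))


print(describe("NN"))
```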


# How does spaCy use the data?

Here some training data is defined. The data is presented to the model in the following format:


```
[
    (
        "We introduced a multilayer perceptron neural network (MLPNN) based classification model as a diagnostic decision support mechanism in the epilepsy treatment.",
        {"entities": [(27, 37, "NN")]}
    )
    ,
    .
    .
    .
    ,
    (
        "The main issue in gradient descent is: how should we set the step size?",
        {"entities": [(18, 34, "OPM")]}
    )
]
   
```

As shown above, each data sample has 3 main components:


*   The text to be used.
*   The "entities" key, which lists the entities inside the text.
*   For each entity in the text, a tuple giving the offset of its first character, the offset just past its last character, and the label assigned to that sequence of characters.
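
A quick way to check an annotation is to slice the text with its offsets. The sketch below uses one of the samples shown above; the helper name `entity_spans` is only illustrative.

```python
def entity_spans(sample):
    """Return (span text, label) pairs for one (text, annotations) sample."""
    text, annotations = sample
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]


sample = (
    "The main issue in gradient descent is: how should we set the step size?",
    {"entities": [(18, 34, "OPM")]},
)

spans = entity_spans(sample)
print(spans)
```

If a slice does not give the intended keyword, the offsets of that sample are wrong and should be fixed before training.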




The whole training set is a list of 245 such samples: we used 7 text samples to train each entity keyword, and 35 keywords are defined (35 × 7 = 245).

Although each keyword in our set has its own 7 training samples, it can also appear in the samples of other keywords. So a keyword does not necessarily appear in only 7 samples.
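
To see how often each keyword is actually annotated, the samples can be counted per annotated span. This is a minimal sketch on made-up toy data (`toy_data` is not the real `TRAIN_DATA`), and `keyword_sample_counts` is a hypothetical helper.

```python
from collections import Counter


def keyword_sample_counts(train_data):
    """Count, for each annotated keyword, the number of samples it appears in."""
    counts = Counter()
    for text, annotations in train_data:
        # A set, so a keyword annotated twice in one sample counts once.
        keywords = {text[start:end].lower()
                    for start, end, _label in annotations["entities"]}
        counts.update(keywords)
    return counts


# Toy data for illustration only.
toy_data = [
    ("Gradient descent is used to train a perceptron.",
     {"entities": [(0, 16, "OPM"), (36, 46, "NN")]}),
    ("The perceptron is a simple model.",
     {"entities": [(4, 14, "NN")]}),
]

counts = keyword_sample_counts(toy_data)
print(counts)
```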

After defining the test data and the parameters to be passed to the model, the model is set up.

The first step is to define the pipeline for the model. Once the pipeline is in place, the labels we have chosen are added to the entity recognizer.

The next step is to train the model on the training data.

Once training is done, the test data is passed to the model and the predictions are made.

In [28]:
def ner(model=None, new_model_name="machine_learning", output_dir=None,
         n_iter=30, no_train=False, train=TRAIN_DATA, test=test_text,
         display=False, verbose=True):
    
    """Set up the pipeline and entity recognizer, and train the new entity."""
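    random.seed(0)
    # Shuffle a copy during training so the caller's list (e.g. TRAIN_DATA) keeps its order
    train = list(train)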
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
        
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    for label in LABELS:
        ner.add_label(label)  # add new entity label to entity recognizer

    if no_train is False:
        if model is None:
            optimizer = nlp.begin_training()
        else:
            optimizer = nlp.resume_training()
        move_names = list(ner.move_names)
        # get names of other pipes to disable them during training
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
        with nlp.disable_pipes(*other_pipes):  # only train NER
            sizes = compounding(1.0, 4.0, 1.001)
            # batch up the examples using spaCy's minibatch
            for itn in range(n_iter):
                random.shuffle(train)
                batches = minibatch(train, size=sizes)
                losses = {}
                for batch in batches:
                    texts, annotations = zip(*batch)
                    nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
                if verbose:
                    print("Losses", losses)

    # test the trained model
    doc = nlp(test)
    if verbose:
        print("Entities in '%s'" % test)
    for ent in doc.ents:
        print(ent.label_, ent.text)
    
    if display:
        displacy.render(doc, style="ent")

        
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently (compare against the
        # in-memory pipeline so this also works when training was skipped)
        assert nlp2.get_pipe("ner").move_names == list(ner.move_names)
        doc2 = nlp2(test)
        for ent in doc2.ents:
            print(ent.label_, ent.text)
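
The training loop above draws its batch sizes from `compounding(1.0, 4.0, 1.001)`. The sketch below re-creates in plain Python what these helpers are assumed to do (a size that grows by a constant factor up to a cap, truncated to an integer for each batch). It is not spaCy's own code, and `compounding_sizes` and `batch_lengths` are hypothetical names.

```python
def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    size = start
    while True:
        yield min(size, stop)
        size *= compound


def batch_lengths(n_items, sizes):
    """Lengths of consecutive batches when each batch takes int(next size) items."""
    lengths = []
    remaining = n_items
    while remaining > 0:
        take = min(int(next(sizes)), remaining)
        lengths.append(take)
        remaining -= take
    return lengths


sizes = compounding_sizes(1.0, 4.0, 1.001)
first_epoch = batch_lengths(245, sizes)
print(len(first_epoch), set(first_epoch))
```

Under these assumptions the factor 1.001 grows so slowly that every batch in the first pass over 245 samples holds a single sample. Because the generator is created once, outside the iteration loop, the sizes keep growing across later passes.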


Now, with the *ner* function we can search for entities in a given piece of text.

To show how the algorithm performs, we test it on 3 test texts.

## Test Data 1 

bibliography: https://arxiv.org/abs/1603.04467

Traffic signs are characterized by a wide variability in their visual appearance in real-world environments. For example, changes of illumination, varying weather conditions and partial occlusions impact the perception of road signs. In practice, a large number of different sign classes needs to be recognized with very high accuracy. Traffic signs have been designed to be easily readable for humans, who perform very well at this task. For computer systems, however, classifying traffic signs still seems to pose a challenging pattern recognition problem. Both image processing and machine learning algorithms are continuously refined to improve on this task. But little systematic comparison of such systems exist. What is the status quo? Do today’s algorithms reach human performance? For assessing the performance of state-of-the-art machine learning algorithms, we present a publicly available traffic sign dataset with more than 50,000 images of German road signs in 43 classes. The data was considered in the second stage of the German Traffic Sign Recognition Benchmark held at IJCNN 2011. The results of this competition are 
reported and the best-performing algorithms are briefly described. Convolutional neural networks (CNNs) showed particularly high classification accuracies in the competition. We measured the performance of human subjects on the same data—and the CNNs outperformed the human test persons.

True Results:


EVM accuracy

MLA pattern recognition

NN Convolutional neural networks

NN CNN

NN CNN
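
One possible way to compare a prediction with a list of expected results like the one above is to compute precision and recall over (label, text) pairs. The sketch below is not part of the original notebook: `score_entities` is a hypothetical helper and the `predicted` list is invented for illustration, not an actual model output.

```python
from collections import Counter


def score_entities(predicted, expected):
    """Precision and recall over (label, text) pairs, counting duplicates."""
    pred = Counter(predicted)
    gold = Counter(expected)
    true_pos = sum((pred & gold).values())
    # Convention used here: an empty prediction has no false positives, and
    # an empty expected list has nothing to miss, so both cases score 1.0.
    precision = true_pos / sum(pred.values()) if pred else 1.0
    recall = true_pos / sum(gold.values()) if gold else 1.0
    return precision, recall


expected = [
    ("EVM", "accuracy"),
    ("MLA", "pattern recognition"),
    ("NN", "Convolutional neural networks"),
    ("NN", "CNN"),
    ("NN", "CNN"),
]
# Invented prediction, for illustration only.
predicted = [("EVM", "accuracy"), ("NN", "CNN")]

precision, recall = score_entities(predicted, expected)
print(precision, recall)
```

With a real run, `predicted` would be built from `[(ent.label_, ent.text) for ent in doc.ents]`.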

In [5]:
test_data1 = "Traffic signs are characterized by a wide variability in their visual appearance in real-world environments. For example, changes of illumination, varying weather conditions and partial occlusions impact the perception of road signs. In practice, a large number of different sign classes needs to be recognized with very high accuracy. Traffic signs have been designed to be easily readable for humans, who perform very well at this task. For computer systems, however, classifying traffic signs still seems to pose a challenging pattern recognition problem. Both image processing and machine learning algorithms are continuously refined to improve on this task. But little systematic comparison of such systems exist. What is the status quo? Do today’s algorithms reach human performance? For assessing the performance of state-of-the-art machine learning algorithms, we present a publicly available traffic sign dataset with more than 50,000 images of German road signs in 43 classes. The data was considered in the second stage of the German Traffic Sign Recognition Benchmark held at IJCNN 2011. The results of this competition are reported and the best-performing algorithms are briefly described. Convolutional neural networks (CNNs) showed particularly high classification accuracies in the competition. We measured the performance of human subjects on the same data—and the CNNs outperformed the human test persons."

In [11]:
ner(test=test_data1, display=True)

Created blank 'en' model
Losses {'ner': 683.4059019869624}
Losses {'ner': 502.16813123820083}
Losses {'ner': 516.5445525329308}
Losses {'ner': 383.4062892875651}
Losses {'ner': 406.7460062252648}
Losses {'ner': 395.65957051832754}
Losses {'ner': 238.2972186868646}
Losses {'ner': 241.82102537283455}
Losses {'ner': 154.90342826756088}
Losses {'ner': 157.53743602455177}
Losses {'ner': 126.72646123833393}
Losses {'ner': 127.5487532796723}
Losses {'ner': 103.67188086076496}
Losses {'ner': 85.86203695173698}
Losses {'ner': 69.00787480475861}
Losses {'ner': 61.76679128620278}
Losses {'ner': 70.36747670356141}
Losses {'ner': 55.17540647737548}
Losses {'ner': 42.04819482826178}
Losses {'ner': 51.78060115235919}
Losses {'ner': 41.47540000290169}
Losses {'ner': 30.02583450758375}
Losses {'ner': 30.67997645859194}
Losses {'ner': 57.700090934355096}
Losses {'ner': 43.05153185557546}
Losses {'ner': 32.067996120803095}
Losses {'ner': 18.837938587331397}
Losses {'ner': 21.052852233171475}
Losses {'ner

## Test Data 2

bibliography: https://www.sciencedirect.com/science/article/abs/pii/S0893608012000524

We describe the approach that won the final phase of the German traffic sign recognition benchmark. Our method is the only one that achieved a better-than-human recognition rate of 99.46%. We use a fast, fully parameterizable GPU implementation of a Deep Neural Network (DNN) that does not require careful design of pre-wired feature extractors, which are rather learned in a supervised way. Combining various DNNs trained on differently preprocessed data into a Multi-Column DNN (MCDNN) further boosts recognition performance, making the system insensitive also to variations in contrast and illumination.

True Results: it should not detect any entity.


In [12]:
test_data2 = "We describe the approach that won the final phase of the German traffic sign recognition benchmark. Our method is the only one that achieved a better-than-human recognition rate of 99.46%. We use a fast, fully parameterizable GPU implementation of a Deep Neural Network (DNN) that does not require careful design of pre-wired feature extractors, which are rather learned in a supervised way. Combining various DNNs trained on differently preprocessed data into a Multi-Column DNN (MCDNN) further boosts recognition performance, making the system insensitive also to variations in contrast and illumination."

In [13]:
ner(test=test_data2, display=True)

Created blank 'en' model
Losses {'ner': 773.8040341823641}
Losses {'ner': 518.2561895478623}
Losses {'ner': 510.74411564770134}
Losses {'ner': 473.45823618790524}
Losses {'ner': 413.0625895038406}
Losses {'ner': 326.6295818049622}
Losses {'ner': 309.35498028165375}
Losses {'ner': 239.39783058905294}
Losses {'ner': 218.64945087878874}
Losses {'ner': 154.09937041041846}
Losses {'ner': 132.809166232902}
Losses {'ner': 121.1173487412944}
Losses {'ner': 103.95665659366922}
Losses {'ner': 62.24152791426415}
Losses {'ner': 66.59271001888239}
Losses {'ner': 59.69104201657272}
Losses {'ner': 68.42556428107255}
Losses {'ner': 48.03789954268608}
Losses {'ner': 63.92999775133345}
Losses {'ner': 45.52803203126494}
Losses {'ner': 26.4302666521596}
Losses {'ner': 20.03063847110248}
Losses {'ner': 41.1804953381024}
Losses {'ner': 23.836054179640925}
Losses {'ner': 39.86883502827105}
Losses {'ner': 40.25180307324177}
Losses {'ner': 18.98684155493404}
Losses {'ner': 24.402821979894163}
Losses {'ner': 23


## Test Data 3

bibliography: https://www.sciencedirect.com/science/article/pii/S0377221719307581

## Sample 1

Deep learning is essential in the context of big data and, for this purpose, its performance across different scenarios is compared and contrasted in this overview article. The empirical results of the different cases studies suggest that deep learning is a feasible and effective method, which can considerably and consistently outperform its traditional counterparts in both prediction and operational performance from the family of data-analytic models. As such, DNNs are able identify previously unknown, potentially useful, non-trivial, and interesting patterns more accurately than other popular predictive models such as random forest, artificial neural networks, and support vector machines. One of the reasons they yield superior results originates from the strong mathematical assumptions in traditional machine learning, whereas these are relaxed by DNNs as a result of their larger parameter space.

True Results:

SML random forest

SML support vector machines

In [14]:
test_data3 = "Deep learning is essential in the context of big data and, for this purpose, its performance across different scenarios is compared and contrasted in this overview article. The empirical results of the different cases studies suggest that deep learning is a feasible and effective method, which can considerably and consistently outperform its traditional counterparts in both prediction and operational performance from the family of data-analytic models. As such, DNNs are able identify previously unknown, potentially useful, non-trivial, and interesting patterns more accurately than other popular predictive models such as random forest, artificial neural networks, and support vector machines. One of the reasons they yield superior results originates from the strong mathematical assumptions in traditional machine learning, whereas these are relaxed by DNNs as a result of their larger parameter space."

In [18]:
ner(test=test_data3, display=True)

Created blank 'en' model
Losses {'ner': 705.553140829905}
Losses {'ner': 554.2483269347199}
Losses {'ner': 439.4455183279283}
Losses {'ner': 392.1363177653914}
Losses {'ner': 377.02664412544016}
Losses {'ner': 319.94063104402966}
Losses {'ner': 292.340112187175}
Losses {'ner': 329.0608547936795}
Losses {'ner': 212.31664854824}
Losses {'ner': 248.6126177221075}
Losses {'ner': 202.64447807742937}
Losses {'ner': 120.82670855404132}
Losses {'ner': 121.56290537528159}
Losses {'ner': 150.0824573068004}
Losses {'ner': 133.65968443412342}
Losses {'ner': 102.89023267600622}
Losses {'ner': 70.57187362424028}
Losses {'ner': 88.76385886270172}
Losses {'ner': 57.55155543511073}
Losses {'ner': 62.1374046718277}
Losses {'ner': 43.009942288648745}
Losses {'ner': 49.801021237717215}
Losses {'ner': 51.02923547224943}
Losses {'ner': 30.691979923617637}
Losses {'ner': 33.66686487736281}
Losses {'ner': 23.268300897113658}
Losses {'ner': 52.83792033528988}
Losses {'ner': 34.88742009944289}
Losses {'ner': 14

# Iterations

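We can also specify how many passes over the training data the *ner* function performs, using the *n_iter* parameter. The default is 30.

This controls how closely the model fits the training data: fewer iterations leave it more general, while more iterations fit the training samples more tightly.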

In [16]:
ner(test=test_data3, display=True, n_iter=15)

Created blank 'en' model
Losses {'ner': 769.1856485565568}
Losses {'ner': 517.0872431410693}
Losses {'ner': 503.3544283852964}
Losses {'ner': 429.2526813369054}
Losses {'ner': 314.8621959970043}
Losses {'ner': 267.1537978904657}
Losses {'ner': 185.5939556734168}
Losses {'ner': 193.72961720384188}
Losses {'ner': 233.95218292936556}
Losses {'ner': 147.46257670226055}
Losses {'ner': 120.3743944709245}
Losses {'ner': 139.67006263467715}
Losses {'ner': 98.70510378411329}
Losses {'ner': 125.82624417980153}
Losses {'ner': 82.90958383556752}
Entities in 'Deep learning is essential in the context of big data and, for this purpose, its performance across different scenarios is compared and contrasted in this overview article. The empirical results of the different cases studies suggest that deep learning is a feasible and effective method, which can considerably and consistently outperform its traditional counterparts in both prediction and operational performance from the family of data-analyti

# Other options

The next 4 parameters are disabled by default:

*   An existing model can be used as a starting point with the *model* parameter.
*   The model produced by training (starting, if given, from the model passed as a parameter) can be saved for later use with the *output_dir* parameter.
*   Training on the training data can be skipped with the *no_train* parameter.
*   We used this option previously: *display*. When it is set, the displaCy module is used to show the NER results in a more visual way.

All the following examples have the displaCy module active.

This will save the model to a folder called 'model_saved' in the current directory:

In [19]:
ner(test=test_data3, display=True, output_dir="model_saved", n_iter=15)

Created blank 'en' model
Losses {'ner': 757.7506059012806}
Losses {'ner': 479.8928215445575}
Losses {'ner': 463.38400331127855}
Losses {'ner': 519.7771797129587}
Losses {'ner': 369.00485377589735}
Losses {'ner': 264.27750366838416}
Losses {'ner': 243.41312621867684}
Losses {'ner': 167.39341862923348}
Losses {'ner': 190.41872648994325}
Losses {'ner': 171.8223864435501}
Losses {'ner': 132.00731378448305}
Losses {'ner': 118.1058399069023}
Losses {'ner': 132.1097551817286}
Losses {'ner': 78.1121502385319}
Losses {'ner': 89.21295682021456}
Entities in 'Deep learning is essential in the context of big data and, for this purpose, its performance across different scenarios is compared and contrasted in this overview article. The empirical results of the different cases studies suggest that deep learning is a feasible and effective method, which can considerably and consistently outperform its traditional counterparts in both prediction and operational performance from the family of data-analyt

Saved model to model_saved
Loading from model_saved
SML random forest


This will use 'model_saved' as the input model. Note that the model is still trained again on the training data.

In [21]:
ner(test=test_data3, display=True, model="model_saved", n_iter=15)

Loaded model 'model_saved'
Losses {'ner': 121.9760862417908}
Losses {'ner': 113.96285160290937}
Losses {'ner': 166.37370022802315}
Losses {'ner': 148.82273698699805}
Losses {'ner': 127.59486012951297}
Losses {'ner': 117.22499197867774}
Losses {'ner': 144.65114638474927}
Losses {'ner': 87.04298682047589}
Losses {'ner': 66.3504866443029}
Losses {'ner': 83.7146834552211}
Losses {'ner': 64.1496275232343}
Losses {'ner': 46.56448164254833}
Losses {'ner': 52.06255387813236}
Losses {'ner': 18.869948108420093}
Losses {'ner': 28.853743164465758}
Entities in 'Deep learning is essential in the context of big data and, for this purpose, its performance across different scenarios is compared and contrasted in this overview article. The empirical results of the different cases studies suggest that deep learning is a feasible and effective method, which can considerably and consistently outperform its traditional counterparts in both prediction and operational performance from the family of data-analy

Here 'model_saved' is used again, but the model is not trained on the training data, so the execution should be very fast.

In [20]:
ner(test=test_data3, display=True, no_train=True, model="model_saved", n_iter=15)

Loaded model 'model_saved'
Entities in 'Deep learning is essential in the context of big data and, for this purpose, its performance across different scenarios is compared and contrasted in this overview article. The empirical results of the different cases studies suggest that deep learning is a feasible and effective method, which can considerably and consistently outperform its traditional counterparts in both prediction and operational performance from the family of data-analytic models. As such, DNNs are able identify previously unknown, potentially useful, non-trivial, and interesting patterns more accurately than other popular predictive models such as random forest, artificial neural networks, and support vector machines. One of the reasons they yield superior results originates from the strong mathematical assumptions in traditional machine learning, whereas these are relaxed by DNNs as a result of their larger parameter space.'
SML random forest


In [None]:
ner(test=test_data3, display=False, output_dir="model_saved2", verbose=False)
for i in range(53):
    print("\n\nITERATION: " + str(i))
    ner(test=test_data3, display=False, model="model_saved2", output_dir="model_saved1", verbose=False)
    shutil.rmtree("model_saved2")
    ner(test=test_data3, display=False, model="model_saved1", output_dir="model_saved2", verbose=False)
    shutil.rmtree("model_saved1")
    
ner(test=test_data1, display=True, model="model_saved2")

Created blank 'en' model
SML random forest
Saved model to model_saved2
Loading from model_saved2
SML random forest


ITERATION: 0
Loaded model 'model_saved2'
SML random forest
Saved model to model_saved1
Loading from model_saved1
SML random forest
Loaded model 'model_saved1'
SML random forest
Saved model to model_saved2
Loading from model_saved2
SML random forest


ITERATION: 1
Loaded model 'model_saved2'
SML random forest
Saved model to model_saved1
Loading from model_saved1
SML random forest
Loaded model 'model_saved1'
SML random forest
Saved model to model_saved2
Loading from model_saved2
SML random forest


ITERATION: 2
Loaded model 'model_saved2'
SML random forest
Saved model to model_saved1
Loading from model_saved1
SML random forest
Loaded model 'model_saved1'
SML random forest
Saved model to model_saved2
Loading from model_saved2
SML random forest


ITERATION: 3
Loaded model 'model_saved2'
SML random forest
Saved model to model_saved1
Loading from model_saved1
SML random forest

In [24]:
shutil.rmtree("model_saved")

# Cross Validation

In order to test the overall performance of our algorithm, we have used cross validation.

As mentioned before, we trained the model with 245 samples, split into 35 groups of 7 samples, each group for a specific keyword. That means that cross validation cannot be applied in an arbitrary way, because a keyword can only be found in a given text if the model has learned about that keyword before.

For example, if we used leave-one-out cross validation (LOOCV), the keyword in the sample held out for validation in each iteration would only be found thanks to the other 6 samples used for learning it. The rest of the training set carries no information about the keyword being validated in that iteration. So, for each keyword, only the samples related to it are relevant (generally 7).

That said, what we use is a leave-p-out style split with ```p = 35```: each iteration leaves out one text sample per keyword, which gives 7 folds in total.
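
The split can be illustrated on toy data. The sketch below assumes, as the notebook's code does, that samples are stored in consecutive groups of 7 per keyword, so that index `i % 7` picks one sample from every group. The function `split_fold` and the toy keywords are hypothetical.

```python
def split_fold(samples, fold, group_size=7):
    """Hold out the fold-th sample of every consecutive group of group_size."""
    held_out = [s for i, s in enumerate(samples) if i % group_size == fold]
    training = [s for i, s in enumerate(samples) if i % group_size != fold]
    return held_out, training


# Toy data: 3 keywords with 7 samples each, stored group by group.
keywords = ["kw_a", "kw_b", "kw_c"]
samples = [(kw, n) for kw in keywords for n in range(7)]

held_out, training = split_fold(samples, fold=2)
print(held_out)
print(len(training))
```

Each fold holds out exactly one sample per keyword, so every keyword still has 6 samples left for training. This only holds while the data keeps its grouped order, which is why the list must not be shuffled in place before splitting.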

In [22]:
def getFold(traindata, fold):
    return [traindata[i][0] for i in range(len(traindata)) if i%7 == fold]

def getNotFold(traindata, fold):
    return [traindata[i] for i in range(len(traindata)) if i%7 != fold]

def concatByNewLine(strings):
    # Join the strings with newlines; an empty list gives an empty string
    return "\n".join(strings)

def crossvalidation(train_data, iterations=30):
    print("\n\nITERATIONS = " + str(iterations))
    # Work on a copy so the caller's data is never modified
    traindata = copy.deepcopy(train_data)

    for i in range(7):
        print("\nIteration: " + str(i))
        test = concatByNewLine(getFold(traindata, i))
        train = getNotFold(traindata, i)
        ner(n_iter=iterations, train=train, test=test)
        print("---------------------------------------------------")

To run cross validation, we just run the next cell with the whole training set of 245 samples:

In [0]:
crossvalidation(TRAIN_DATA)

The number of training iterations can be set with the *iterations* parameter. The default is 30.

In [0]:
crossvalidation(TRAIN_DATA, iterations=15)