## Install [SpaCy](https://nlpforhackers.io/complete-guide-to-spacy/) (en)

In [1]:
!pip install spacy==2.2.4



In [2]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


# Import Spacy and View versions

In [6]:
import spacy
from spacy import displacy #for visualization

In [7]:
spacy.info()

spaCy version    2.2.4                         
Location         /usr/local/lib/python3.6/dist-packages/spacy
Platform         Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version   3.6.9                         
Models           en                            



{'Location': '/usr/local/lib/python3.6/dist-packages/spacy',
 'Models': 'en',
 'Platform': 'Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic',
 'Python version': '3.6.9',
 'spaCy version': '2.2.4'}

# Example of Spacy NER usage with pretrained model

The pretrained spaCy model has its own set of entity types. They are shown here so we can compare them with the types we get after training on Conll2003.

In [8]:
nlp_pretrained = spacy.load('en')
doc = nlp_pretrained(u'Ronald just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ. Us is a country in north America')
for ent in doc.ents:
    print(ent.text, ent.label_)
displacy.render(doc, style='ent', jupyter=True)

Ronald PERSON
2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG
Us GPE
north America LOC


The most convenient way to train a spaCy model is from the CLI; see the docs: https://spacy.io/api/cli#train

# Upload Conll2003 data

Upload the train.txt, valid.txt and test.txt files (the training, validation and test splits of Conll2003).

In [9]:
from google.colab import files
#files.upload() # uncomment this line if you prefer to upload the dataset from your local file system

In [10]:
# download the train, valid and test files from github:
!wget https://raw.githubusercontent.com/natalyasegal/spacy_ner_conll2003_en/master/ner_proj/data/train.txt
!wget https://raw.githubusercontent.com/natalyasegal/spacy_ner_conll2003_en/master/ner_proj/data/valid.txt
!wget https://raw.githubusercontent.com/natalyasegal/spacy_ner_conll2003_en/master/ner_proj/data/test.txt

--2020-09-28 12:46:44--  https://raw.githubusercontent.com/natalyasegal/spacy_ner_conll2003_en/master/ner_proj/data/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283418 (3.1M) [text/plain]
Saving to: ‘train.txt.1’


2020-09-28 12:46:45 (10.6 MB/s) - ‘train.txt.1’ saved [3283418/3283418]

--2020-09-28 12:46:45--  https://raw.githubusercontent.com/natalyasegal/spacy_ner_conll2003_en/master/ner_proj/data/valid.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 827441 (808K) [text/plain]
Saving to: ‘valid.txt.1’


2020-09-28 12:46

In [11]:
ls -la

total 9528
drwxr-xr-x 1 root root    4096 Sep 28 12:46 ./
drwxr-xr-x 1 root root    4096 Sep 28 09:47 ../
drwxr-xr-x 1 root root    4096 Sep 18 16:15 .config/
drwxr-xr-x 2 root root    4096 Sep 28 12:44 conll2003_json/
drwxr-xr-x 2 root root    4096 Sep 28 12:44 ns_models2/
drwxr-xr-x 1 root root    4096 Sep 16 16:29 sample_data/
-rw-r--r-- 1 root root  748093 Sep 28 12:44 test.txt
-rw-r--r-- 1 root root  748093 Sep 28 12:46 test.txt.1
-rw-r--r-- 1 root root 3283418 Sep 28 12:44 train.txt
-rw-r--r-- 1 root root 3283418 Sep 28 12:46 train.txt.1
-rw-r--r-- 1 root root  827441 Sep 28 12:44 valid.txt
-rw-r--r-- 1 root root  827441 Sep 28 12:46 valid.txt.1


Alternatively, the dataset files can be downloaded from https://github.com/davidsbatista/NER-datasets/tree/master/CONLL2003/
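
Before converting, it can help to look at the token-per-line layout these files use. The sketch below assumes the usual CoNLL-2003 column order (token, POS tag, chunk tag, NER tag), blank lines between sentences and `-DOCSTART-` marker lines. The sample text is hand-written for illustration, and `read_conll` is a hypothetical helper, not part of spaCy.

```python
from typing import List, Tuple

# Hand-written sample in the assumed CoNLL-2003 layout:
# token, POS tag, chunk tag, NER tag; one token per line,
# sentences separated by a blank line.
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(text: str) -> List[Sentence]:
    """Split token-per-line text into sentences of (token, ner_tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Document boundary marker, carries no token.
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
print(len(sentences))
print(sentences[0])
```

To inspect the real data, the same helper could be called on the contents of train.txt, e.g. `read_conll(open("train.txt").read())`.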

# Convert training and validation data

In [12]:
!mkdir conll2003_json

mkdir: cannot create directory ‘conll2003_json’: File exists


In [13]:
!python -m spacy convert -c ner -b en -n 10 train.txt conll2003_json
!python -m spacy convert -c ner -b en -n 10 -l en valid.txt conll2003_json

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1499 documents): conll2003_json/train.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (347 documents): conll2003_json/valid.json


In [14]:
ls conll2003_json/

train.json  valid.json
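
The `-n 10` option used in the conversion groups every 10 sentences into one document, as the log says. Conceptually this is just chunking a list. The sketch below illustrates the idea on toy data; `group_sentences` is a made-up helper, not spaCy's implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group_sentences(sentences: Sequence[T], n_sents: int = 10) -> List[List[T]]:
    """Chunk a sequence of sentences into documents of at most n_sents each."""
    if n_sents < 1:
        raise ValueError("n_sents must be at least 1")
    return [
        list(sentences[i:i + n_sents])
        for i in range(0, len(sentences), n_sents)
    ]


# 25 toy sentences become three documents; the last one is shorter.
toy_sentences = [f"sentence {i}" for i in range(25)]
docs = group_sentences(toy_sentences, n_sents=10)
print([len(doc) for doc in docs])  # [10, 10, 5]
```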


# Debug resulting json training and validation data

In [15]:
!python -m spacy debug-data en conll2003_json/train.json conll2003_json/valid.json -b en -V

✔ Corpus is loadable

Training pipeline: tagger, parser, ner
Starting with base model 'en'
1499 training docs
347 evaluation docs
✔ No overlap between training and evaluation data

ℹ 204567 total words in the data (23624 unique)
10 most common words: '.' (7374), ',' (7290), 'the' (7243), 'of' (3751), 'in'
(3398), 'to' (3382), 'a' (2994), '(' (2861), ')' (2861), 'and' (2838)
ℹ No word vectors present in the model

ℹ 2 new labels, 2 existing labels
0 missing values (tokens with '-' label)
New: 'LOC' (7140), 'PER' (6600), 'ORG' (6321), 'MISC' (3438)
Existing: 'ORG', 'LOC'
⚠ 15 entity span(s) with punctuation
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
Entity spans consisting of or starting/ending with punctuation can not be
trained with a 

# Train Spacy NER model on Conll2003

Train only the NER component of the spaCy pipeline on the train and valid data converted above.

In [16]:
!python -m spacy train en ns_models2 conll2003_json/train.json conll2003_json/valid.json -p ner

Training pipeline: ['ner']
Starting with blank model 'en'
Counting training words (limit=0)

Itn  NER Loss   NER P   NER R   NER F   Token %  CPU WPS
---  ---------  ------  ------  ------  -------  -------
  1  21346.434  79.993  79.939  79.966  100.000    25079
  2  10178.730  84.821  84.450  84.635  100.000    25726
  3   7131.814  86.371  86.065  86.218  100.000    21971
  4   5446.672  87.411  86.940  87.175  100.000    26066
  5   4135.961  87.857  87.428  87.642  100.000    25618
  6   3551.136  88.005  87.546  87.775  100.000    26047
  7   2995.955  88.131  87.849  87.990  100.000    25537
  8   2943.858  88.130  87.715  87.922  100.000    25387
  9   2551.111  87.996  87.715  87.855  100.000    25524
 10   2335.410  88.028  87.731  87.879  100.000    25827
 11   2034.013  88.150  87.883  88.016  100.000    25749
 12   1967.503  87.943  87.765  87.854  100.000    25154
 13   1929.714  88.165  87.883  88.024  100.000    25482
 14   1784.500  88.123  87.782  87.952  100.000    2
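
The evaluation step later uses the `model-best` directory from this run. As a rough illustration of what "best" could mean here, the sketch below picks the iteration with the highest NER F among the rows visible in the log above. It assumes NER F is the selection criterion, and since the log is cut off, later iterations are not considered. `best_iteration` is a hypothetical helper.

```python
from typing import Tuple

# Itn, NER Loss, NER P, NER R, NER F copied from the visible log rows.
ROWS = """\
  1  21346.434  79.993  79.939  79.966
  2  10178.730  84.821  84.450  84.635
  3   7131.814  86.371  86.065  86.218
  4   5446.672  87.411  86.940  87.175
  5   4135.961  87.857  87.428  87.642
  6   3551.136  88.005  87.546  87.775
  7   2995.955  88.131  87.849  87.990
  8   2943.858  88.130  87.715  87.922
  9   2551.111  87.996  87.715  87.855
 10   2335.410  88.028  87.731  87.879
 11   2034.013  88.150  87.883  88.016
 12   1967.503  87.943  87.765  87.854
 13   1929.714  88.165  87.883  88.024
 14   1784.500  88.123  87.782  87.952
"""


def best_iteration(table: str) -> Tuple[int, float]:
    """Return (iteration, NER F) for the row with the highest NER F."""
    best = None
    for line in table.strip().splitlines():
        itn, _loss, _p, _r, f_score = line.split()[:5]
        row = (int(itn), float(f_score))
        if best is None or row[1] > best[1]:
            best = row
    if best is None:
        raise ValueError("no rows in table")
    return best


print(best_iteration(ROWS))  # (13, 88.024)
```

Note that the loss keeps falling after the F-score has flattened out, which is why the last iteration is not automatically the best one.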

# Evaluation

At this stage we need our test data for model evaluation, so we convert it to json.

In [26]:
!python -m spacy convert -c ner -n 10 test.txt conll2003_json

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (369 documents): conll2003_json/test.json


Make sure train and test data examples do not overlap

In [27]:
!python -m spacy debug-data en conll2003_json/train.json conll2003_json/test.json -b en -V

✔ Corpus is loadable

Training pipeline: tagger, parser, ner
Starting with base model 'en'
1499 training docs
369 evaluation docs
✔ No overlap between training and evaluation data

ℹ 204567 total words in the data (23624 unique)
10 most common words: '.' (7374), ',' (7290), 'the' (7243), 'of' (3751), 'in'
(3398), 'to' (3382), 'a' (2994), '(' (2861), ')' (2861), 'and' (2838)
ℹ No word vectors present in the model

ℹ 2 new labels, 2 existing labels
0 missing values (tokens with '-' label)
New: 'LOC' (7140), 'PER' (6600), 'ORG' (6321), 'MISC' (3438)
Existing: 'LOC', 'ORG'
⚠ 15 entity span(s) with punctuation
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
Entity spans consisting of or starting/ending with punctuation can not be
trained with a 
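
The "No overlap" line in the output above is the one that matters for this step. A rough stdlib-only approximation of such a check, assuming examples are compared by their exact text, could look like the sketch below. The helper name `overlapping_texts` and the toy sentences are made up for illustration, and this is not claimed to be how `debug-data` does it internally.

```python
from typing import Iterable, Set


def overlapping_texts(train_texts: Iterable[str], eval_texts: Iterable[str]) -> Set[str]:
    """Return the texts that occur in both the training and evaluation data."""
    return set(train_texts) & set(eval_texts)


# Toy data: one sentence leaks from the training set into the test set.
train = ["EU rejects German call .", "Peter Blackburn", "BRUSSELS 1996-08-22"]
test = ["SOCCER - JAPAN GET LUCKY WIN", "Peter Blackburn"]

leaked = overlapping_texts(train, test)
print(len(leaked), sorted(leaked))  # 1 ['Peter Blackburn']
```

An empty result means no evaluation example was seen verbatim during training, so the test scores are not inflated by memorized examples.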

Evaluate precision (NER P), recall (NER R) and F-score (NER F) of the best model on our test set.

In [28]:
!python -m spacy evaluate ns_models2/model-best conll2003_json/test.json

Time      1.82 s
Words     46666 
Words/s   25675 
TOK       100.00
POS       0.00  
UAS       0.00  
LAS       0.00  
NER P     81.79 
NER R     82.08 
NER F     81.94 
Textcat   0.00  
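
For reference, here is a minimal sketch of how entity-level precision, recall and F-score relate to each other. It assumes exact-match scoring, where a predicted entity counts as correct only if its boundaries and label both match the gold annotation. The `prf` helper and the toy spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token, label)


def prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: 2 of 4 predictions are right, 2 of 3 gold entities are found.
gold = {(0, 1, "PER"), (15, 16, "ORG"), (17, 18, "LOC")}
predicted = {(0, 1, "PER"), (15, 16, "ORG"), (17, 18, "ORG"), (22, 24, "LOC")}
p, r, f = prf(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571

# The same harmonic mean applied to the rounded NER P and NER R reported above
# lands close to the reported NER F; small differences come from rounding.
f_from_report = 2 * 81.79 * 82.08 / (81.79 + 82.08)
print(f_from_report)
```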



Results may differ between training runs; the results of my first run are listed below for comparison.

Results:

---


NER P     81.85 %
NER R     82.42 %
NER F     82.13 %

These look slightly better than the results in https://github.com/Djia09/Named-Entity-Recognition-spaCy, possibly because of the spaCy version used.

# Generating a model package

The following step generates a model package, more info: https://spacy.io/api/cli#package

Please note that we run shell commands by prefixing them with an exclamation mark, more info: https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.05-IPython-And-Shell-Commands.ipynb#scrollTo=01vD1Inmgwin
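
The `!` prefix only works inside IPython/Colab. If the same steps have to run from a plain Python script, one option is the standard `subprocess` module. The sketch below uses a hypothetical `run_module` helper and demonstrates it with the stdlib `platform` module, so it runs without spaCy installed.

```python
import subprocess
import sys


def run_module(module: str, *args: str) -> subprocess.CompletedProcess:
    """Run `python -m <module> <args...>` with the current interpreter."""
    return subprocess.run(
        [sys.executable, "-m", module, *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )


# Harmless demonstration: `python -m platform` prints the platform string.
result = run_module("platform")
print(result.returncode)
print(result.stdout.strip())

# In this notebook's setting, the packaging cell below would correspond to:
# run_module("spacy", "package", "-f", "ns_models2/model-best", "ns_model_out")
```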

In [29]:
!mkdir ns_model_out

In [30]:
!python -m spacy package -f ns_models2/model-best ns_model_out

✔ Loaded meta.json from file
ns_models2/model-best/meta.json
✔ Successfully created package 'en_model-0.0.0'
ns_model_out/en_model-0.0.0
To build the package, run `python setup.py sdist` in this directory.


In [31]:
!ls ns_model_out/en_model-0.0.0

en_model  MANIFEST.in  meta.json  setup.py


In [32]:
%cd ns_model_out/en_model-0.0.0
!pwd

/content/ns_model_out/en_model-0.0.0
/content/ns_model_out/en_model-0.0.0


In [33]:
!python setup.py sdist

running sdist
running egg_info
creating en_model.egg-info
writing en_model.egg-info/PKG-INFO
writing dependency_links to en_model.egg-info/dependency_links.txt
writing requirements to en_model.egg-info/requires.txt
writing top-level names to en_model.egg-info/top_level.txt
writing manifest file 'en_model.egg-info/SOURCES.txt'
reading manifest file 'en_model.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'en_model.egg-info/SOURCES.txt'

running check


creating en_model-0.0.0
creating en_model-0.0.0/en_model
creating en_model-0.0.0/en_model.egg-info
creating en_model-0.0.0/en_model/en_model-0.0.0
creating en_model-0.0.0/en_model/en_model-0.0.0/ner
creating en_model-0.0.0/en_model/en_model-0.0.0/vocab
copying files to en_model-0.0.0...
copying MANIFEST.in -> en_model-0.0.0
copying meta.json -> en_model-0.0.0
copying setup.py -> en_model-0.0.0
copying en_model/__init__.py -> en_model-0.0.0/en_model
copying en_model/meta.json -> en_model-0.0.0/en_model


In [34]:
!pwd
!ls dist/en_model-0.0.0.tar.gz

/content/ns_model_out/en_model-0.0.0
dist/en_model-0.0.0.tar.gz


In [35]:
!pip install dist/en_model-0.0.0.tar.gz

Processing ./dist/en_model-0.0.0.tar.gz
Building wheels for collected packages: en-model
  Building wheel for en-model (setup.py) ... done
  Created wheel for en-model: filename=en_model-0.0.0-cp36-none-any.whl size=4500551 sha256=fb71327111ca0e02c041b0843f02e58bb78868681a27d0185f4044687bfcf799
  Stored in directory: /root/.cache/pip/wheels/8a/2e/11/027f59dfa87c6ce585df7d3756f6e161ab1c4cba9806cd4ff8
Successfully built en-model
Installing collected packages: en-model
Successfully installed en-model-0.0.0


# Example with new model

Let's try our new NER model on the same sentence we used with the pretrained model.

In [36]:
nlp = spacy.load("en_model/en_model-0.0.0/")

In [37]:
doc = nlp(u'Ronald just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ. Us is a country in north America')
for ent in doc.ents:
    print(ent.text, ent.label_)
displacy.render(doc, style='ent', jupyter=True)  

Ronald PER
WSJ ORG
Us LOC
America LOC
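
To compare this output with the pretrained model's output from earlier, the two entity lists can be lined up in plain Python. The label mapping below (e.g. PERSON to PER, GPE to LOC) is a hand-picked assumption for illustration only; the two label schemes do not correspond one-to-one.

```python
# Entities printed by the pretrained model earlier in the notebook.
pretrained = [
    ("Ronald", "PERSON"), ("2", "CARDINAL"), ("9 a.m.", "TIME"),
    ("30%", "PERCENT"), ("just 2 days", "DATE"), ("WSJ", "ORG"),
    ("Us", "GPE"), ("north America", "LOC"),
]
# Entities printed by the model trained on Conll2003.
retrained = [("Ronald", "PER"), ("WSJ", "ORG"), ("Us", "LOC"), ("America", "LOC")]

# Assumed, hand-picked mapping from pretrained labels to Conll2003 labels.
LABEL_MAP = {"PERSON": "PER", "GPE": "LOC", "ORG": "ORG", "LOC": "LOC"}

mapped = {
    (text, LABEL_MAP[label]) for text, label in pretrained if label in LABEL_MAP
}
agree = sorted(mapped & set(retrained))
only_pretrained = sorted(mapped - set(retrained))
only_retrained = sorted(set(retrained) - mapped)
unmapped_labels = sorted(label for _, label in pretrained if label not in LABEL_MAP)

print("agree:", agree)
print("only pretrained:", only_pretrained)
print("only retrained:", only_retrained)
print("labels without a Conll2003 counterpart:", unmapped_labels)
```

The numeric and temporal types (CARDINAL, TIME, PERCENT, DATE) have no counterpart in the four Conll2003 labels, which is why those entities disappear from the new model's output.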


# Download results

Download the resulting model from Colab

In [40]:
ls dist

en_model-0.0.0.tar.gz


In [41]:
from google.colab import files
files.download("dist/en_model-0.0.0.tar.gz")
