## Install [SpaCy](https://nlpforhackers.io/complete-guide-to-spacy/) (en)

In [None]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


# Import Spacy and View versions

In [None]:
import spacy
from spacy import displacy #for visualization

In [None]:
spacy.info()

spaCy version    2.2.4                         
Location         /usr/local/lib/python3.6/dist-packages/spacy
Platform         Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version   3.6.9                         
Models           en                            



{'Location': '/usr/local/lib/python3.6/dist-packages/spacy',
 'Models': 'en',
 'Platform': 'Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic',
 'Python version': '3.6.9',
 'spaCy version': '2.2.4'}
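
The CLI commands used in the rest of this notebook follow the spaCy 2.x syntax (in spaCy 3.x, `spacy train` takes a config file instead). As a minimal sketch, the version reported above can be checked before running them. The helper names `major_version` and `is_spacy_v2` are hypothetical, and the dictionary is reduced to the one key needed here.

```python
# The dictionary returned by spacy.info() above, reduced to the key we need.
info = {"spaCy version": "2.2.4"}


def major_version(version):
    """Return the major part of a version string such as '2.2.4'."""
    return int(version.split(".")[0])


def is_spacy_v2(info_dict):
    """Hypothetical helper: True when the reported spaCy version is 2.x."""
    return major_version(info_dict["spaCy version"]) == 2


print(is_spacy_v2(info))
```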

# Example of Spacy NER usage with pretrained model

The entity types recognized by the pretrained spaCy model are shown here so that they can be compared with the types we get after training on Conll2003.

In [None]:
nlp_pretrained = spacy.load('en')
doc = nlp_pretrained(u'Ronald just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ. Us is a country in north America')
for ent in doc.ents:
    print(ent.text, ent.label_)
displacy.render(doc, style='ent', jupyter=True)

Ronald PERSON
2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG
Us GPE
north America LOC


The most convenient way to train a spaCy model is from the command line (CLI); see the docs: https://spacy.io/api/cli#train

# Upload Conll2003 data

Upload the train.txt, valid.txt and test.txt files (the training, validation and test parts of Conll2003)

In [None]:
from google.colab import files
files.upload()

Saving test.txt to test.txt
Saving train.txt to train.txt
Saving valid.txt to valid.txt


 'valid.txt': b'-DOCSTART- -X- -X- O\n\nCRICKET NNP B-NP O\n- : O O\nLEICESTERSHIRE NNP B-NP B-ORG\nTAKE NNP I-NP O\nOVER IN B-PP O\nAT NNP B-NP O\nTOP NNP I-NP O\nAFTER NNP I-NP O\nINNINGS NNP I-NP O\nVICTORY NN I-NP O\n. . O O\n\nLONDON NNP B-NP B-LOC\n1996-08-30 CD I-NP O\n\nWest NNP B-NP B-MISC\nIndian NNP I-NP I-MISC\nall-rounder NN I-NP O\nPhil NNP I-NP B-PER\nSimmons NNP I-NP I-PER\ntook VBD B-VP O\nfour CD B-NP O\nfor IN B-PP O\n38 CD B-NP O\non IN B-PP O\nFriday NNP B-NP O\nas IN B-PP O\nLeicestershire NNP B-NP B-ORG\nbeat VBD B-VP O\nSomerset NNP B-NP B-ORG\nby IN B-PP O\nan DT B-NP O\ninnings NN I-NP O\nand CC O O\n39 CD B-NP O\nruns NNS I-NP O\nin IN B-PP O\ntwo CD B-NP O\ndays NNS I-NP O\nto TO B-VP O\ntake VB I-VP O\nover IN B-PP O\nat IN B-PP O\nthe DT B-NP O\nhead NN I-NP O\nof IN B-PP O\nthe DT B-NP O\ncounty NN I-NP O\nchampionship NN I-NP O\n. . O O\n\nTheir PRP$ B-NP O\nstay NN I-NP O\non IN B-PP O\ntop NN B-NP O\n, , O O\nthough RB B-ADVP O\n, , O O\nmay MD B-VP O\

Alternatively, the dataset files can be downloaded from https://github.com/davidsbatista/NER-datasets/tree/master/CONLL2003/
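
The uploaded files use the token-per-line format visible in the output above: four space-separated columns (token, POS tag, chunk tag, NER tag), a blank line between sentences, and `-DOCSTART-` lines between documents. The following is a minimal sketch of reading that format in plain Python, just to inspect the data. `read_conll` is a hypothetical helper, not part of spaCy, and the sample string is copied from the valid.txt excerpt.

```python
def read_conll(text):
    """Parse CoNLL-2003 text into sentences of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark a new document.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "-DOCSTART- -X- -X- O\n\n"
    "LONDON NNP B-NP B-LOC\n"
    "1996-08-30 CD I-NP O\n\n"
    "West NNP B-NP B-MISC\n"
    "Indian NNP I-NP I-MISC\n"
    "all-rounder NN I-NP O\n"
)
sentences = read_conll(sample)
print(len(sentences))
print(sentences[0])
```

In Colab the same function could be applied to the decoded contents of an uploaded file, for example `read_conll(open("valid.txt").read())`.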

# Train Spacy NER model on Conll2003

Here we can see which models are currently installed and whether they are compatible with the installed spaCy version

In [None]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.6/dist-packages/spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.2.5   ✔
link      en               en_core_web_sm   2.2.5   ✔



Convert the training and validation files into the JSON input format that spaCy uses for training

In [None]:
!mkdir conll2003_json
!python -m spacy convert -c ner train.txt conll2003_json
!python -m spacy convert -c ner valid.txt conll2003_json

mkdir: cannot create directory ‘conll2003_json’: File exists
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents):
conll2003_json/train.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): conll2003_json/valid.json

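
The converter output above warns that every sentence became its own document and suggests grouping sentences with `-n 10`. The sketch below only illustrates the idea of that grouping in plain Python. `group_sentences` is a hypothetical helper and is not how spaCy implements the option.

```python
def group_sentences(sentences, n_sents=10):
    """Split a list of sentences into consecutive groups of n_sents."""
    return [
        sentences[start:start + n_sents]
        for start in range(0, len(sentences), n_sents)
    ]


# 25 placeholder sentences stand in for the converted training data.
sentences = ["sentence {}".format(i) for i in range(25)]
documents = group_sentences(sentences, n_sents=10)
print([len(doc) for doc in documents])
```

The last group is simply shorter when the number of sentences is not a multiple of `n_sents`.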

Train the NER component of the spaCy pipeline on the training and validation data converted above

In [None]:
!python -m spacy train en ns_models2 conll2003_json/train.json conll2003_json/valid.json -p ner

✔ Created output directory: ns_models2
Training pipeline: ['ner']
Starting with blank model 'en'
Counting training words (limit=0)

Itn  NER Loss   NER P   NER R   NER F   Token %  CPU WPS
---  ---------  ------  ------  ------  -------  -------
  1  16908.468  83.271  82.683  82.976  100.000    19145
  2   7845.200  86.856  86.520  86.687  100.000    20124
  3   5253.116  88.049  87.782  87.915  100.000    20157
  4   3900.627  88.506  88.118  88.312  100.000    20000
  5   3116.596  89.178  88.758  88.968  100.000    20215
  6   2624.024  89.161  88.741  88.951  100.000    20031
  7   2186.150  89.066  88.556  88.810  100.000    19918
  8   2012.472  89.310  88.859  89.084  100.000    20170
  9   1848.069  89.296  89.010  89.153  100.000    20339
 10   1780.480  89.248  88.842  89.044  100.000    20462
 11   1534.208  89.095  88.825  88.960  100.000    20671
 12   1449.838  89.058  88.758  88.908  100.000    20531
 13   1491.330  88.801  88.472  88.636  100.000    20507
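
In the table above the loss mostly keeps falling, while the F-score on the validation set levels off after a few iterations. As a small sketch, the best-scoring iteration can be picked out of the log like this. The rows are copied from the table (iteration, NER loss, NER F), and the assumption that `model-best` corresponds to this iteration is mine, not something the log states.

```python
# (iteration, NER loss, NER F) copied from the training log above.
rows = [
    (1, 16908.468, 82.976),
    (2, 7845.200, 86.687),
    (3, 5253.116, 87.915),
    (4, 3900.627, 88.312),
    (5, 3116.596, 88.968),
    (6, 2624.024, 88.951),
    (7, 2186.150, 88.810),
    (8, 2012.472, 89.084),
    (9, 1848.069, 89.153),
    (10, 1780.480, 89.044),
    (11, 1534.208, 88.960),
    (12, 1449.838, 88.908),
    (13, 1491.330, 88.636),
]

# Pick the row with the highest validation F-score.
best_itn, best_loss, best_f = max(rows, key=lambda row: row[2])
print(best_itn, best_f)
```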


# Evaluation

At this stage we'll need our test data for model evaluation, so we convert it to JSON

In [None]:
!python -m spacy convert -c ner test.txt conll2003_json

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents): conll2003_json/test.json


Evaluate the precision (NER P), recall (NER R) and F-score (NER F) of the best model on our test set

In [None]:
!python -m spacy evaluate ns_models2/model-best conll2003_json/test.json

Time      2.49 s
Words     46666 
Words/s   18717 
TOK       100.00
POS       0.00  
UAS       0.00  
LAS       0.00  
NER P     82.16 
NER R     82.79 
NER F     82.48 
Textcat   0.00  
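
For reference, the F-score reported above is the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes it from the rounded NER P and NER R values in the output, so a small difference in the last digit from the reported NER F is expected. `f_score` is a hypothetical helper name.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the evaluation output above.
print(round(f_score(82.16, 82.79), 2))
```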



Results may differ between training runs, so I'll put the results of my first training run below for comparison.

Results:

---


NER P     81.85 %
NER R     82.42 %
NER F     82.13 %

These look slightly better than the results in https://github.com/Djia09/Named-Entity-Recognition-spaCy, possibly because of the spaCy version used.

# Generating a model package

The following step generates a model package; more info: https://spacy.io/api/cli#package

Please note that shell commands are run from the notebook by prefixing them with an exclamation mark; more info: https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.05-IPython-And-Shell-Commands.ipynb#scrollTo=01vD1Inmgwin
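
The `!` prefix only works inside IPython and Colab. Outside a notebook, one possible equivalent is Python's `subprocess` module. The sketch below runs a trivial command as a stand-in; the command itself is just an illustration and is not one of the steps of this notebook.

```python
import subprocess
import sys

# Run a child Python process and capture what it prints.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    stdout=subprocess.PIPE,
    universal_newlines=True,  # decode output as text (works on Python 3.6)
    check=True,  # raise an error if the command fails
)
print(result.stdout.strip())
```

Replacing the argument list with, for example, `[sys.executable, "-m", "spacy", "validate"]` should run the same command as the corresponding `!python -m spacy validate` cell.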

In [None]:
!mkdir ns_model_out

In [None]:
!python -m spacy package -f ns_models2/model-best ns_model_out

✔ Loaded meta.json from file
ns_models2/model-best/meta.json
✔ Successfully created package 'en_model-0.0.0'
ns_model_out/en_model-0.0.0
To build the package, run `python setup.py sdist` in this directory.


In [None]:
!ls ns_model_out/en_model-0.0.0

en_model  MANIFEST.in  meta.json  setup.py


In [134]:
%cd ns_model_out/en_model-0.0.0
!pwd

/content/ns_models2/ns_model_out/en_model-0.0.0
/content/ns_models2/ns_model_out/en_model-0.0.0


In [135]:
!python setup.py sdist

running sdist
running egg_info
creating en_model.egg-info
writing en_model.egg-info/PKG-INFO
writing dependency_links to en_model.egg-info/dependency_links.txt
writing requirements to en_model.egg-info/requires.txt
writing top-level names to en_model.egg-info/top_level.txt
writing manifest file 'en_model.egg-info/SOURCES.txt'
reading manifest file 'en_model.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'en_model.egg-info/SOURCES.txt'

running check


creating en_model-0.0.0
creating en_model-0.0.0/en_model
creating en_model-0.0.0/en_model.egg-info
creating en_model-0.0.0/en_model/en_model-0.0.0
creating en_model-0.0.0/en_model/en_model-0.0.0/ner
creating en_model-0.0.0/en_model/en_model-0.0.0/vocab
copying files to en_model-0.0.0...
copying MANIFEST.in -> en_model-0.0.0
copying meta.json -> en_model-0.0.0
copying setup.py -> en_model-0.0.0
copying en_model/__init__.py -> en_model-0.0.0/en_model
copying en_model/meta.json -> en_model-0.0.0/en_model


In [144]:
!pwd
!ls dist/en_model-0.0.0.tar.gz

/content/ns_models2/ns_model_out/en_model-0.0.0
dist/en_model-0.0.0.tar.gz


In [136]:
!pip install dist/en_model-0.0.0.tar.gz

Processing ./dist/en_model-0.0.0.tar.gz
Building wheels for collected packages: en-model
  Building wheel for en-model (setup.py) ... done
  Created wheel for en-model: filename=en_model-0.0.0-cp36-none-any.whl size=4157297 sha256=13e60ce20b5bbddf9ea6c9d49aa6062e5e6a47e283997dff416ec317018227ca
  Stored in directory: /root/.cache/pip/wheels/5d/22/c0/f7bb5fb98694a2632a61f98e15ad32157f81cfe5c4389d84bb
Successfully built en-model
Installing collected packages: en-model
Successfully installed en-model-0.0.0


# Example with new model

Let's try our new NER model on the same sentence as before

In [141]:
nlp = spacy.load("/content/ns_models2/ns_model_out/en_model-0.0.0/en_model/en_model-0.0.0/")

In [142]:
doc = nlp(u'Ronald just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ. Us is a country in north America')
for ent in doc.ents:
    print(ent.text, ent.label_)
displacy.render(doc, style='ent', jupyter=True)  

Ronald PER
WSJ ORG
Us LOC
America LOC
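
Conll2003 annotates only four entity types (PER, ORG, LOC, MISC), so the retrained model no longer finds the numeric and temporal entities that the pretrained model returned. A minimal sketch of that comparison, using the two printed outputs from this notebook as plain data:

```python
# Entities printed by the pretrained model earlier in the notebook.
pretrained = [
    ("Ronald", "PERSON"), ("2", "CARDINAL"), ("9 a.m.", "TIME"),
    ("30%", "PERCENT"), ("just 2 days", "DATE"), ("WSJ", "ORG"),
    ("Us", "GPE"), ("north America", "LOC"),
]
# Entities printed by the model trained on Conll2003.
retrained = [("Ronald", "PER"), ("WSJ", "ORG"), ("Us", "LOC"), ("America", "LOC")]

pretrained_texts = {text for text, _ in pretrained}
retrained_texts = {text for text, _ in retrained}

# Spans found only by one of the two models.
only_pretrained = pretrained_texts - retrained_texts
only_retrained = retrained_texts - pretrained_texts

print(sorted({label for _, label in retrained}))
print(sorted(only_pretrained))
print(sorted(only_retrained))
```

The comparison is by exact span text, which is why "north America" and "America" are reported as different entities.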


# Download results

Download the resulting model from Colab

In [146]:
ls dist

en_model-0.0.0.tar.gz


In [147]:
from google.colab import files
files.download("dist/en_model-0.0.0.tar.gz")
