# Neural NER with spaCy
In this hands-on session we use spaCy (https://spacy.io), a framework that covers all basic NLP processing steps and supports several languages out of the box:

 - Tokenization/Word segmentation
 - Sentence splitting
 - Part-of-Speech tagging
 - Lemmatization
 - NER
 - Dependency Parsing
 
 

## Hands-on
Download a small English model trained on modern web data.

In [1]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-rwcqwuma/wheels/39/ea/3b/507f7df78be8631a7a3d7090962194cf55bc1158572c0be77f
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Setting up an NLP pipeline for English:

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

Read a single sentence and run the linguistic processing on it. More information on the meaning of the results can be found at https://spacy.io/usage/linguistic-features.
Note that the human-readable string attributes of tokens always end with an underscore (e.g. `token.lemma_`); their counterparts without the underscore (e.g. `token.lemma`) return integer IDs.

In [2]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
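
The `shape_` column in the output above is an orthographic abstraction of each token. As a rough illustration, the following plain-Python sketch reproduces the shapes shown. `word_shape` is a hypothetical helper written for this example, not a call into spaCy, and the cap of four repeated shape characters is an assumption inferred from the output (`looking` becomes `xxxx`).

```python
def word_shape(text, max_run=4):
    """Approximate a token's shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Count how long the current run of identical shape characters is.
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Assumption: runs longer than max_run are truncated.
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, word_shape(word))
```

Features like this let a model generalize over capitalization and digit patterns without memorizing individual words.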


spaCy's default NLP pipeline for English includes NER. See https://spacy.io/usage/linguistic-features#named-entities for more information.

Let's look at the text, character offsets, and label of every named entity that was found:

In [None]:
doc

In [4]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
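
The two numbers in each output row are character offsets into the original string, with the end offset exclusive, so `text[start:end]` recovers the entity text. The sketch below uses only the values printed above; `mark_entities` is a hypothetical helper for this illustration and not part of spaCy.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]


def mark_entities(text, entities):
    """Wrap each entity span as [text](LABEL).

    Spans are processed right to left so earlier offsets stay valid.
    """
    marked = text
    for start, end, label in sorted(entities, reverse=True):
        marked = (marked[:start] + "[" + marked[start:end] + "](" + label + ")"
                  + marked[end:])
    return marked


# Slicing with the offsets gives back the entity strings.
for start, end, label in entities:
    print(repr(text[start:end]), label)

print(mark_entities(text, entities))
```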


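Accessing IOB information at the token level (`B` marks the first token of an entity, `I` a token inside an entity, `O` a token outside any entity):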

In [5]:
print([doc[0].text, doc[0].ent_iob_, doc[0].ent_type_])

['Apple', 'B', 'ORG']


In [6]:
print([doc[9].text, doc[9].ent_iob_, doc[9].ent_type_])

['1', 'I', 'MONEY']
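
To make the relation between entity spans and per-token IOB tags explicit, here is a minimal plain-Python sketch. The token list and the three spans are taken from the outputs above; `spans_to_iob` is a hypothetical helper, and the empty string for tokens outside an entity is an assumption about how such tokens are represented.

```python
tokens = ["Apple", "is", "looking", "at", "buying", "U.K.",
          "startup", "for", "$", "1", "billion"]

# Entity spans as (first_token, last_token_exclusive, label), matching the
# three entities found above.
spans = [(0, 1, "ORG"), (5, 6, "GPE"), (8, 11, "MONEY")]


def spans_to_iob(n_tokens, spans):
    """Return one (iob, label) pair per token; tokens outside any span get ('O', '')."""
    tags = [("O", "")] * n_tokens
    for start, end, label in spans:
        tags[start] = ("B", label)      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = ("I", label)      # remaining tokens of the entity
    return tags


for token, (iob, label) in zip(tokens, spans_to_iob(len(tokens), spans)):
    print(token, iob, label)
```

Token 9 (`1`) is tagged `I` because the `MONEY` entity starts one token earlier, at `$`.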


More on accessing NER information: https://spacy.io/usage/linguistic-features#accessing

### Working with a different language
If you want to work on French data, use the following commands to set up an NLP pipeline.

In [10]:
!python -m spacy download fr_core_news_sm

Collecting fr_core_news_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.1.0/fr_core_news_sm-2.1.0.tar.gz#egg=fr_core_news_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.1.0/fr_core_news_sm-2.1.0.tar.gz (13.1MB)
Building wheels for collected packages: fr-core-news-sm
  Building wheel for fr-core-news-sm (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-n9i49rco/wheels/ab/82/2a/61dd0ff02e22f10eef65a5aa35453a0eb745c84b4c874b612f
Successfully built fr-core-news-sm
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_sm')


In [31]:
# Without this line, spaCy is not able to find the downloaded model in this environment.
# (`spacy link` exists in spaCy 2.x only; the command was removed in v3.)

! python -m spacy link --force fr_core_news_sm fr_core_news_sm

✔ Linking successful
/srv/conda/envs/notebook/lib/python3.7/site-packages/fr_core_news_sm -->
/srv/conda/envs/notebook/lib/python3.7/site-packages/spacy/data/fr_core_news_sm
You can now load the model via spacy.load('fr_core_news_sm')


In [32]:
nlp_fr = spacy.load("fr_core_news_sm")

See https://spacy.io/usage/models for more languages.

### Testing the robustness
 - Try sentences from domains other than contemporary news.
 - Add noise (OCR errors, typos) to the text.
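
One possible way to generate noisy input for this experiment is sketched below. The confusion table is a small, hand-picked assumption and not a model of any real OCR engine, and `add_noise` is a hypothetical helper. It uses one-to-one character substitutions only, so the string length and all character offsets are preserved.

```python
import random

# Assumption: a small, hand-picked table of visually similar characters.
OCR_CONFUSIONS = {"l": "1", "I": "l", "O": "0", "o": "c", "e": "c",
                  "S": "5", "B": "8", "u": "n", "h": "b"}


def add_noise(text, rate=0.1, seed=0):
    """Replace roughly `rate` of the confusable characters with a look-alike.

    Only one-to-one substitutions are made, so len(result) == len(text).
    """
    rng = random.Random(seed)  # seeded so experiments are reproducible
    noisy = []
    for char in text:
        if char in OCR_CONFUSIONS and rng.random() < rate:
            noisy.append(OCR_CONFUSIONS[char])
        else:
            noisy.append(char)
    return "".join(noisy)


sentence = "Apple is looking at buying U.K. startup for $1 billion"
for rate in (0.0, 0.2, 0.5):
    print(rate, add_noise(sentence, rate=rate))
```

Each noisy string can then be passed to `nlp(...)` in place of the clean sentence and the resulting `doc.ents` compared with the clean run.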

How much do the results suffer?
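
To put a number on this, the entities found on the clean text can be compared with those found on the noisy text. The sketch below computes exact-match precision, recall, and F1 over `(start_char, end_char, label)` triples. It assumes the noisy text has the same length as the clean one (for example, character substitutions only), so that offsets remain comparable. It also treats the clean-text output as the reference, which is itself model output and not gold annotation. The `noisy` list is a made-up result for illustration.

```python
def entity_scores(reference, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) triples."""
    reference, predicted = set(reference), set(predicted)
    true_pos = len(reference & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Entities on the clean sentence, copied from the output earlier in this notebook.
clean = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Hypothetical result on a noisy copy: one entity lost, one relabelled.
noisy = [(0, 5, "PERSON"), (44, 54, "MONEY")]

precision, recall, f1 = entity_scores(clean, noisy)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```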


## Next step: Combining rule-based and statistical NER
A tutorial on how to use rule-based pattern matchers in spaCy can be found here: https://spacy.io/usage/rule-based-matching#entityruler
A nice example is the rule-based addition of titles to named entities that statistical NER has recognized without their titles: https://spacy.io/usage/rule-based-matching#models-rules-ner
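
The idea behind the title example can be sketched in plain Python, independent of spaCy's API: a rule runs after the statistical NER step and widens a `PERSON` span when the token just before it is a title. The token list, the entity spans, and the `TITLES` set below are hypothetical illustration data. In spaCy the same logic would live in a custom pipeline component, as described in the linked documentation.

```python
# Assumption: a small, illustrative set of titles.
TITLES = {"Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.", "Mrs", "Mrs.", "Prof", "Prof."}


def expand_person_entities(tokens, entities):
    """Extend PERSON spans one token to the left when that token is a title.

    `entities` holds (start_token, end_token_exclusive, label) triples.
    """
    expanded = []
    for start, end, label in entities:
        if label == "PERSON" and start > 0 and tokens[start - 1] in TITLES:
            start -= 1  # pull the title into the entity span
        expanded.append((start, end, label))
    return expanded


# Hypothetical tokens and statistical NER output, for illustration only.
tokens = ["Dr.", "Alex", "Smith", "chaired", "the", "meeting", "at", "Acme", "Corp"]
entities = [(1, 3, "PERSON"), (7, 9, "ORG")]

for start, end, label in expand_person_entities(tokens, entities):
    print(" ".join(tokens[start:end]), label)
```

This sketch does not check whether the title token already belongs to another entity; a production rule would need to handle that case.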

## Next step: Online training of existing models
All spaCy NER models can be updated by training them further on new labeled examples.
The relevant spaCy documentation and sample code can be found here: https://spacy.io/usage/training#ner
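
Labeled examples for NER consist of a text plus character-offset annotations. The sketch below builds such examples, assuming the spaCy 2.x-style format in which each example is a `(text, {"entities": [(start_char, end_char, label), ...]})` tuple; check the linked documentation for the format your installed version expects. `make_example` is a hypothetical helper that derives the offsets from substrings so they do not have to be counted by hand.

```python
def make_example(text, mentions):
    """Build a (text, {"entities": [...]}) pair from (substring, label) mentions.

    Each substring is located with str.find, so only its first occurrence is
    annotated; a ValueError is raised if it does not occur in the text.
    """
    entities = []
    for substring, label in mentions:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in {text!r}")
        entities.append((start, start + len(substring), label))
    return text, {"entities": sorted(entities)}


# Illustrative training examples in the assumed format.
TRAIN_DATA = [
    make_example("Who is Shaka Khan?", [("Shaka Khan", "PERSON")]),
    make_example("I like London and Berlin.", [("London", "LOC"), ("Berlin", "LOC")]),
]

for text, annotations in TRAIN_DATA:
    print(text, annotations)
```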

There is a step-by-step tutorial that shows how an existing model can be adapted to your own data:
https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718