# Named Entity Recognition in spaCy

In [1]:
from ipynb.fs.defs.utilities import print_pipeline_info

## Using the standard model

This only requires loading a language model and running it on some text.

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp("Sue lives in London.")
for ent in doc.ents:
    print(f"{ent.start_char:2} {ent.end_char:2}  {ent.text}")

 0  3  Sue
13 19  London


However, we want to improve on this and perhaps add a few extras. spaCy gives us a few options here:

1. Adding a rule-based component
2. Adding a custom component that enriches the document
3. Updating the model included in "en_core_web_sm"

We look into the first two here.

## Rule-based entity recognition

For this we add a component to the pipeline. Let's first see what is in the nlp pipeline we have used so far. It's an instance of `spacy.lang.en.English` which has a tokenizer and a set of further components. Note how the tokenizer has a special status and how it is not part of the components list.

In [4]:
print(nlp)
print(nlp.tokenizer, '\n')
for name, component in nlp.components:
    print(f"{name:16} {component}")

<spacy.lang.en.English object at 0x111016100>
<spacy.tokenizer.Tokenizer object at 0x121271670> 

tok2vec          <spacy.pipeline.tok2vec.Tok2Vec object at 0x121424c70>
tagger           <spacy.pipeline.tagger.Tagger object at 0x12144b540>
parser           <spacy.pipeline.dep_parser.DependencyParser object at 0x1212819a0>
senter           <spacy.pipeline.senter.SentenceRecognizer object at 0x121469f90>
ner              <spacy.pipeline.ner.EntityRecognizer object at 0x1212814c0>
attribute_ruler  <spacy.pipeline.attributeruler.AttributeRuler object at 0x12140ecc0>
lemmatizer       <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x1214dc240>


In [5]:
print('METADATA       %s' % ' '.join(list(nlp.meta.keys())[:8]), '...')
print('NAME           %s' % nlp.meta['name'])
print('DESCRIPTION    %s' % nlp.meta['description'][:80], '...')
print('PIPELINE       %s' % ' '.join(list(nlp.meta['pipeline'])))
print('COMPONENTS     %s' % ' '.join(list(nlp.meta['components'])))
print('LABELS         %s' % ' '.join(list(nlp.meta['labels'].keys())))
print('LABELS["ner"]  %s' % ' '.join(list(nlp.meta['labels']['ner'][:10])), '...')

METADATA       lang name version spacy_version description author email url ...
NAME           core_web_sm
DESCRIPTION    English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ...
PIPELINE       tok2vec tagger parser ner attribute_ruler lemmatizer
COMPONENTS     tok2vec tagger parser senter ner attribute_ruler lemmatizer
LABELS         tok2vec tagger parser senter ner attribute_ruler lemmatizer
LABELS["ner"]  CARDINAL DATE EVENT FAC GPE LANGUAGE LAW LOC MONEY NORP ...


| Name          | Description         
| :-            | :-------------
| NLP           | <spacy.lang.en.English object at 0x10e8e5f10>
| NAME          | core_web_sm
| DESCRIPTION   | English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
| METADATA      | lang name version spacy_version description author email url license spacy_git_version vectors labels pipeline components disabled performance sources requirements
| COMPONENTS    | tok2vec tagger parser senter ner attribute_ruler lemmatizer
| LABELS        | tok2vec tagger parser senter ner attribute_ruler lemmatizer
| LABELS["ner"] | CARDINAL DATE EVENT FAC GPE LANGUAGE LAW LOC MONEY NORP ORDINAL ORG PERCENT PERSON PRODUCT QUANTITY TIME WORK_OF_ART
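
The PIPELINE and COMPONENTS rows differ because a component can be loaded without being active; in the output above this is the case for `senter`. The sketch below reproduces that situation on a blank pipeline so that it runs without a trained model. The variable name `nlp_sketch` and the choice of `sentencizer` and `entity_ruler` as stand-in components are assumptions made for illustration only.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components needed.
nlp_sketch = spacy.blank("en")
nlp_sketch.add_pipe("sentencizer")
nlp_sketch.add_pipe("entity_ruler")

# A disabled component stays loaded but is skipped when text is processed.
nlp_sketch.disable_pipe("sentencizer")

print("active:  ", nlp_sketch.pipe_names)       # components that will run
print("all:     ", nlp_sketch.component_names)  # includes disabled ones
print("disabled:", nlp_sketch.disabled)
```

Here `pipe_names` plays the role of the PIPELINE row and `component_names` the role of the COMPONENTS row.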

### A very simple pipeline

You can start a pipeline from scratch using `spacy.lang.en.English` or `spacy.blank("en")`.

In [6]:
from spacy.lang.en import English

This pipeline has no components, no description and a default name, but it does have a tokenizer, which is the minimal requirement to turn a text into a document. It will have the same type as the pipeline loaded from `en_core_web_sm`.

In [7]:
nlp_e = English()

print_pipeline_info(nlp_e)

print("\ntype(nlp) == type(nlp_e)  ==> ", type(nlp) == type(nlp_e))

<spacy.lang.en.English object at 0x1216dbe80>
<spacy.tokenizer.Tokenizer object at 0x122371ca0>

NAME           pipeline
DESCRIPTION    
PIPELINE       []
COMPONENTS     []
LABELS         []
LABELS["ner"]  []

type(nlp) == type(nlp_e)  ==>  True
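
As a minimal sketch of what "tokenizer only" means in practice, the snippet below runs a blank pipeline on a sentence and inspects the result. The names `nlp_blank` and `doc_blank` are chosen here for illustration and are not part of the notebook.

```python
import spacy

nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Sue lives in London.")

# The tokenizer alone is enough to produce a Doc with tokens.
tokens = [token.text for token in doc_blank]
print(tokens)

# No components ran, so there are no part-of-speech tags and no entities.
print(nlp_blank.pipe_names)
print([token.pos_ for token in doc_blank])
print(doc_blank.ents)
```

The tokens are available, while the part-of-speech strings are empty and `doc.ents` is an empty tuple, since nothing in the pipeline sets them.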


In [8]:
# this does the same

nlp_e2 = spacy.blank("en")
print_pipeline_info(nlp_e2)

<spacy.lang.en.English object at 0x1216dbe80>
<spacy.tokenizer.Tokenizer object at 0x122371ca0>

NAME           pipeline
DESCRIPTION    
PIPELINE       []
COMPONENTS     []
LABELS         []
LABELS["ner"]  []


In [9]:
# We have no entities yet...

nlp_e = English()
doc_e = nlp_e("Apple is opening its first big office in San Francisco.")
print(doc_e.ents)

()


Now let's put in a rule-based entity extractor.

In [10]:
nlp_e = English()

# the first argument is the name of a registered factory, so it has to be "entity_ruler";
# to call the component itself "special_entities", pass name="special_entities" as well
ruler = nlp_e.add_pipe("entity_ruler")

# a string pattern matches the exact text; a list of dicts is a token-based pattern, just as before
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]

ruler.add_patterns(patterns)

print_pipeline_info(nlp_e)

<spacy.lang.en.English object at 0x122664fd0>
<spacy.tokenizer.Tokenizer object at 0x1214cd040>

entity_ruler     <spacy.pipeline.entityruler.EntityRuler object at 0x12265b980>

NAME           pipeline
DESCRIPTION    
PIPELINE       ['entity_ruler']
COMPONENTS     ['entity_ruler']
LABELS         ['entity_ruler']
LABELS["ner"]  []


In [11]:
nlp_e.meta

{'lang': 'en',
 'name': 'pipeline',
 'version': '0.0.0',
 'spacy_version': '>=3.0.3,<3.1.0',
 'description': '',
 'author': '',
 'email': '',
 'url': '',
 'license': '',
 'spacy_git_version': 'f4f46b617',
 'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None},
 'labels': {'entity_ruler': ['GPE', 'ORG']},
 'pipeline': ['entity_ruler'],
 'components': ['entity_ruler'],
 'disabled': []}

In [12]:
doc_e = nlp_e("Apple is opening its first big office in San Francisco.")
print([(ent.__class__.__name__, ent.text, ent.label_) for ent in doc_e.ents])

[('Span', 'Apple', 'ORG'), ('Span', 'San Francisco', 'GPE')]
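
The following sketch, assuming spaCy 3.x, shows two details of the entity ruler: the component can get its own name through the `name` argument of `add_pipe`, and a string pattern is matched on the exact text while the token pattern on `LOWER` ignores case. The names `nlp_named`, `ruler_named` and `special_entities` are made up for this example.

```python
import spacy

nlp_named = spacy.blank("en")

# "entity_ruler" is the factory; name= sets how this instance is called.
ruler_named = nlp_named.add_pipe("entity_ruler", name="special_entities")
ruler_named.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc_named = nlp_named("Apple is opening its first big office in San Francisco.")
found = [(ent.text, ent.label_) for ent in doc_named.ents]

# Lowercased input: only the LOWER-based token pattern still matches.
doc_lower = nlp_named("apple is opening an office in san francisco.")
found_lower = [(ent.text, ent.label_) for ent in doc_lower.ents]

print(nlp_named.pipe_names)
print(found)
print(found_lower)
```

If case-insensitive matching is wanted for names like "Apple" as well, writing them as token patterns on `LOWER` is one option.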


### Inserting the rule-based NER in a full pipeline

You can insert a specialized matcher before the "ner" component in the pipeline. Say we want to recognize gadgets as a special kind of product.

In [13]:
nlp_e2 = spacy.load("en_core_web_sm")

ruler = nlp_e2.add_pipe("entity_ruler", before="ner")
patterns = [{"label": "GADGET", "pattern": "Apple iPhone"}]
ruler.add_patterns(patterns)

print_pipeline_info(nlp_e2)

<spacy.lang.en.English object at 0x122664790>
<spacy.tokenizer.Tokenizer object at 0x1214cd670>

tok2vec          <spacy.pipeline.tok2vec.Tok2Vec object at 0x121424ef0>
tagger           <spacy.pipeline.tagger.Tagger object at 0x1226d2400>
parser           <spacy.pipeline.dep_parser.DependencyParser object at 0x1224c9c40>
senter           <spacy.pipeline.senter.SentenceRecognizer object at 0x1226d2630>
entity_ruler     <spacy.pipeline.entityruler.EntityRuler object at 0x121932440>
ner              <spacy.pipeline.ner.EntityRecognizer object at 0x12265aee0>
attribute_ruler  <spacy.pipeline.attributeruler.AttributeRuler object at 0x122720200>
lemmatizer       <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x12272be40>

NAME           core_web_sm
DESCRIPTION    English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
PIPELINE       ['tok2vec', 'tagger', 'parser', 'entity_ruler', 'ner', 'attribute_ruler', 'lemmatizer']
COMPONE

With the updated pipeline we can now get GADGETS.

In [14]:
#nlp_e2.enable_pipe('entity_ruler')
doc = nlp_e2("I got my Apple iPhone today.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Apple iPhone', 'GADGET'), ('today', 'DATE')]


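Compare the GADGET result with the output of the same pipeline when the entity ruler is disabled. Note that in the GADGET run the "ner" component did not label "Apple" as ORG, because the entity ruler ran first and had already claimed the span "Apple iPhone".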

In [15]:
nlp_e2.disable_pipe('entity_ruler')

print('PIPELINE:   ', [name for name, component in nlp_e2.pipeline])
print('COMPONENTS: ', nlp_e2.component_names, '\n')

doc = nlp_e2("I got my Apple iPhone today.")
print([(ent.text, ent.label_) for ent in doc.ents])

PIPELINE:    ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
COMPONENTS:  ['tok2vec', 'tagger', 'parser', 'senter', 'entity_ruler', 'ner', 'attribute_ruler', 'lemmatizer'] 

[('Apple', 'ORG'), ('today', 'DATE')]
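
The comparison suggests that whichever component sets an entity first wins by default. The sketch below illustrates this with two entity rulers on a blank pipeline, where the first ruler merely stands in for an earlier entity-setting component (it is not the statistical "ner"). It assumes spaCy 3.x, where the entity ruler accepts an `overwrite_ents` setting; the helper `build_pipeline` and the component names are hypothetical.

```python
import spacy


def build_pipeline(overwrite):
    """Two rulers in a row; the second one may or may not overwrite."""
    nlp = spacy.blank("en")
    first = nlp.add_pipe("entity_ruler", name="first_ruler")
    first.add_patterns([{"label": "ORG", "pattern": "Apple"}])
    second = nlp.add_pipe(
        "entity_ruler",
        name="second_ruler",
        config={"overwrite_ents": overwrite},
    )
    second.add_patterns([{"label": "GADGET", "pattern": "Apple iPhone"}])
    return nlp


text = "I got my Apple iPhone today."

# Default behaviour: the later ruler leaves existing entities alone.
kept = [(e.text, e.label_) for e in build_pipeline(False)(text).ents]

# With overwrite_ents the later ruler replaces overlapping entities.
replaced = [(e.text, e.label_) for e in build_pipeline(True)(text).ents]

print(kept)
print(replaced)
```

So a ruler placed after another entity-setting component needs `overwrite_ents` to take precedence, while a ruler placed before it does not.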


### Adding a post NER component

You can define and register your own components by using the `Language.component` decorator on some function. The function should take an instance of Doc as input and return the instance.

In [16]:
from spacy.language import Language

nlp_e3 = spacy.load("en_core_web_sm")

# First custom component
@Language.component('count-ner-before')
def count_entities(doc):
    print(f"I found {len(doc.ents)} named entities")
    return doc

# A second component with the same body; in real code you would define the function once and register it under both names
@Language.component('count-ner-after')
def count_entities(doc):
    print(f"I found {len(doc.ents)} named entities")
    return doc

nlp_e3.add_pipe('count-ner-before', before="ner")
nlp_e3.add_pipe('count-ner-after', after="ner")

<function __main__.count_entities(doc)>

In [17]:
print_pipeline_info(nlp_e3)

<spacy.lang.en.English object at 0x1224d5490>
<spacy.tokenizer.Tokenizer object at 0x122708310>

tok2vec          <spacy.pipeline.tok2vec.Tok2Vec object at 0x123730310>
tagger           <spacy.pipeline.tagger.Tagger object at 0x123730810>
parser           <spacy.pipeline.dep_parser.DependencyParser object at 0x1229878e0>
senter           <spacy.pipeline.senter.SentenceRecognizer object at 0x12237ad60>
count-ner-before <function count_entities at 0x122371820>
ner              <spacy.pipeline.ner.EntityRecognizer object at 0x1224d44c0>
count-ner-after  <function count_entities at 0x122708ee0>
attribute_ruler  <spacy.pipeline.attributeruler.AttributeRuler object at 0x12378d500>
lemmatizer       <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x123790580>

NAME           core_web_sm
DESCRIPTION    English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
PIPELINE       ['tok2vec', 'tagger', 'parser', 'count-ner-before', 'ner', 

In [18]:
doc_e3 = nlp_e3("Sue lives in London.")

I found 0 named entities
I found 2 named entities
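
One way to avoid defining the same function twice is to pass it to `Language.component` through the `func` argument instead of using the decorator. The sketch below does this on a blank pipeline with an entity ruler in place of "ner", so that it runs without a trained model. The component names, the `entity_counts` list and the patterns are assumptions for illustration; counts are collected in a list rather than printed.

```python
import spacy
from spacy.language import Language

entity_counts = []  # hypothetical log of counts, filled by the component


def record_entity_count(doc):
    """Store the number of entities seen at this point in the pipeline."""
    entity_counts.append(len(doc.ents))
    return doc


# Register the same function under two component names.
Language.component("record-ents-before", func=record_entity_count)
Language.component("record-ents-after", func=record_entity_count)

nlp_counts = spacy.blank("en")
ruler_counts = nlp_counts.add_pipe("entity_ruler")
ruler_counts.add_patterns([
    {"label": "PERSON", "pattern": "Sue"},
    {"label": "GPE", "pattern": "London"},
])
nlp_counts.add_pipe("record-ents-before", before="entity_ruler")
nlp_counts.add_pipe("record-ents-after", after="entity_ruler")

nlp_counts("Sue lives in London.")
print(entity_counts)  # one count from before the ruler, one from after
```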


### Changing the Doc

You can change the doc, for example by assigning a new value to `doc.ents`.

But you can also set custom variables on Doc, Span and Token instances. These attributes are accessible via `Doc._.`, `Span._.` and `Token._.` and allow you to add information.

```python
>>> from spacy.tokens import Token
>>> Token.set_extension("is_color", default=False)
>>> doc = nlp("The sky is blue.")
>>> doc[3]._.is_color = True
```

You typically do this in a custom component. Below is a somewhat trivial example, but the technique itself is very powerful.

In [19]:
from spacy.tokens import Doc
from spacy.language import Language

In [20]:
nlp_e4 = spacy.load("en_core_web_sm")

# setting the extension, first removing it if it is already there
if Doc.has_extension("ner_count"):
    Doc.remove_extension("ner_count")
Doc.set_extension("ner_count", default=0)

# define the custom component as a function
@Language.component('count-ner')
def count_entities(doc):
    doc._.ner_count = len(doc.ents)
    return doc

# adding it to the pipeline just after ner
nlp_e4.add_pipe('count-ner', after="ner")

<function __main__.count_entities(doc)>

In [21]:
# running the pipeline and accessing the result
doc_e4 = nlp_e4("Sue lives in London.")
doc_e4._.ner_count

2
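
As an alternative sketch, a custom attribute can also be defined with a `getter`, in which case the value is computed each time it is accessed and no pipeline component is needed. The extension name `ent_count_lazy` and the blank pipeline with an entity ruler are assumptions made so that the example runs without a trained model.

```python
import spacy
from spacy.tokens import Doc

# force=True replaces the extension if it was registered before,
# which is convenient when re-running a notebook cell.
Doc.set_extension("ent_count_lazy", getter=lambda doc: len(doc.ents), force=True)

nlp_lazy = spacy.blank("en")
ruler_lazy = nlp_lazy.add_pipe("entity_ruler")
ruler_lazy.add_patterns([
    {"label": "PERSON", "pattern": "Sue"},
    {"label": "GPE", "pattern": "London"},
])

doc_lazy = nlp_lazy("Sue lives in London.")
count_before = doc_lazy._.ent_count_lazy

# The getter is evaluated on access, so it follows later changes to doc.ents.
doc_lazy.ents = []
count_after = doc_lazy._.ent_count_lazy

print(count_before, count_after)
```

A value stored by a component, as in the cell above, is a snapshot taken at that point in the pipeline, whereas a getter always reflects the current state of the doc. Which of the two is preferable depends on whether the count at a specific pipeline stage matters.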