<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center> 

# Introduction to Named Entity Recognition with spaCy


Named entity recognition (NER) is often the first step in information extraction. It locates named entities in text and classifies them into predefined categories such as persons, organizations, locations, time expressions, quantities, monetary values, and percentages. NER is used in many areas of Natural Language Processing (NLP).

In this notebook we will see how spaCy handles this task.

First, we must install spaCy.


In [1]:
!pip install -q spacy
# download a model
!python -m spacy download en_core_web_sm



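Before loading a pipeline it can help to confirm that the download step actually succeeded. The sketch below is one possible way to do that: `load_pipeline` is a hypothetical helper (not part of spaCy), built on `spacy.util.is_package`, which reports whether a package of that name is installed in the current environment.

```python
import spacy
from spacy.util import is_package


def load_pipeline(name):
    """Load an installed spaCy pipeline package, or raise with an install hint.

    Hypothetical helper for this notebook; not part of the spaCy API.
    """
    if not is_package(name):
        raise OSError(
            f"Pipeline '{name}' is not installed; "
            f"try: python -m spacy download {name}"
        )
    return spacy.load(name)
```

With the model from the cell above installed, `nlp = load_pipeline('en_core_web_sm')` behaves like a plain `spacy.load` call.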

We can process a text and show its entities: 

In [2]:
article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

import spacy

nlp = spacy.load('en_core_web_sm')
document = nlp(article)

print('Original Sentence: {}'.format(article))
print()

for entity in document.ents:
    print('Type: {}, Value: {}, start: {}, end: {}'.format(entity.label_, entity.text, entity.start_char, entity.end_char))


Original Sentence: 
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.

Type: NORP, Value: Asian, start: 1, end: 6
Type: DATE, Value: Tuesday, start: 25, end: 32
Type: LOC, Value: Europe, start: 148, end: 154
Type: DATE, Value: 16-month, start: 176, end: 184
Type: PERSON, Value: MSCI, start: 228, end: 232
Type: LOC, Value: As
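Labels such as `NORP` are not self-explanatory. As a small sketch, `spacy.explain` can be used to look up a short description of each label that appears in the output above; it needs no trained model, only the spaCy package. Reading the descriptions also makes it easier to spot mistakes: in the output, `MSCI` is tagged `PERSON`, which suggests the small model is not always right.

```python
import spacy

# Labels taken from the entity output above
labels = ["NORP", "DATE", "LOC", "PERSON"]

# spacy.explain returns a short human-readable description for known labels
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:>7}: {description}")
```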

spaCy also ships with displaCy, a built-in visualizer that highlights the entity mentions in a text:

In [3]:
from spacy import displacy

# Reuse the already-processed document instead of running the pipeline again
displacy.render(document, style='ent', jupyter=True)
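Outside a notebook, `displacy.render` can return the markup as a string when called with `jupyter=False`. The sketch below uses displaCy's manual mode (`manual=True`) with hand-written entity offsets, so it runs without a trained model. The short text, the `span` helper, and the choice of labels are assumptions made for illustration; the labels mirror the ones in the output above.

```python
from spacy import displacy

text = "Asian shares skidded on Tuesday"


def span(text, phrase, label):
    """Hypothetical helper: build a displaCy entity dict for the first match of phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Manual mode expects a dict with the raw text and entities sorted by start offset
manual_doc = {
    "text": text,
    "ents": [span(text, "Asian", "NORP"), span(text, "Tuesday", "DATE")],
    "title": None,
}

# With jupyter=False the HTML is returned as a string instead of being displayed
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
```

The resulting `html` string can then be written to a file or embedded in a web page.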


## spaCy for Spanish NER
spaCy can also recognize named entities in Spanish; we only need to download and load a Spanish pipeline.

In [6]:
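import locale
# Report UTF-8 as the preferred encoding. This is presumably a workaround for
# hosted environments (such as Colab) whose locale is not UTF-8, where shell
# commands like the download below can otherwise fail.
locale.getpreferredencoding = lambda *args, **kwargs: "UTF-8"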

In [8]:
!python -m spacy download es_core_news_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting es-core-news-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.5.0/es_core_news_sm-3.5.0-py3-none-any.whl (12.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [9]:
article = '''
Junts per Catalunya opta ahora por no poner palos en las ruedas para que Esquerra facilite la investidura de Pedro Sánchez. 
La formación que lidera Carles Puigdemont —a la espera de que la justicia belga decida sobre su extradición— anunció este 
martes que retira una moción sobre la autodeterminación, que tenía que ser votada hoy miércoles en el Parlament y que ponía a ERC
 en una situación comprometida. La decisión, que generó mucho debate interno, se gestó en la reunión que tuvieron 
 el expresident y varios cargos electos de Junts, el pasado lunes en Bélgica..'''

import spacy

nlp = spacy.load('es_core_news_sm')
document = nlp(article)

print('Original Sentence: %s' % (article))

for entity in document.ents:
    print('Type: {}, Value: {}, start: {}, end: {}'.format(entity.label_, entity.text, entity.start_char, entity.end_char))


Original Sentence: 
Junts per Catalunya opta ahora por no poner palos en las ruedas para que Esquerra facilite la investidura de Pedro Sánchez. 
La formación que lidera Carles Puigdemont —a la espera de que la justicia belga decida sobre su extradición— anunció este 
martes que retira una moción sobre la autodeterminación, que tenía que ser votada hoy miércoles en el Parlament y que ponía a ERC
 en una situación comprometida. La decisión, que generó mucho debate interno, se gestó en la reunión que tuvieron 
 el expresident y varios cargos electos de Junts, el pasado lunes en Bélgica..
Type: ORG, Value: Junts per Catalunya, start: 1, end: 20
Type: ORG, Value: Esquerra, start: 74, end: 82
Type: PER, Value: Pedro Sánchez, start: 110, end: 123
Type: MISC, Value: La formación que lidera, start: 126, end: 149
Type: PER, Value: Carles Puigdemont, start: 150, end: 167
Type: MISC, Value: Parlament, start: 351, end: 360
Type: ORG, Value: ERC, start: 375, end: 378
Type: MISC, Value: La decisión, start: 4
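The Spanish output uses a smaller label set (`PER`, `ORG`, `MISC`) than the English one, and some `MISC` spans, such as "La formación que lidera", are not really named entities. One simple way to cope is to group the entities by label and keep only the labels we trust. The sketch below works on tuples copied from the complete lines of the output above; the `trusted` set is an assumption to tune per task.

```python
from collections import defaultdict

# (label, text, start_char, end_char) tuples copied from the output above
entities = [
    ("ORG", "Junts per Catalunya", 1, 20),
    ("ORG", "Esquerra", 74, 82),
    ("PER", "Pedro Sánchez", 110, 123),
    ("MISC", "La formación que lidera", 126, 149),
    ("PER", "Carles Puigdemont", 150, 167),
    ("MISC", "Parlament", 351, 360),
    ("ORG", "ERC", 375, 378),
]

# Group the entity texts by their label
by_label = defaultdict(list)
for label, text, start, end in entities:
    by_label[label].append(text)

# Keep only the labels we trust for this task (an assumption, not a spaCy default)
trusted = {"PER", "ORG", "LOC"}
filtered = [entity for entity in entities if entity[0] in trusted]

for label, texts in by_label.items():
    print(label, texts)
```

Filtering out `MISC` removes the spurious span, but it also drops "Parlament", which is a genuine entity, so this is a precision/recall trade-off rather than a fix.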

In [10]:
# Reuse the already-processed Spanish document
displacy.render(document, style='ent', jupyter=True)


## Limitations of spaCy

Unfortunately, spaCy's general-purpose models cannot detect every kind of named entity. For example, they do not recognize chemical named entities.

In [14]:
import spacy

# Two sentences about chemical compounds (note the space that separates them)
text = 'Benz(a)anthracene is a polycyclic aromatic hydrocarbon. '
text += 'The phosphoinositide, phosphatidylinositol-3,4,5-trisphosphate '
text += '(PI(3,4,5)P3), is a key signaling lipid.'

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
# Print whatever entities the general-purpose English model finds
for entity in doc.ents:
    print('Type: {}, Value: {}, start: {}, end: {}'.format(entity.label_, entity.text, entity.start_char, entity.end_char))

# displacy.render(nlp(str(text)), jupyter=True, style='ent')
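When the entities of interest are known in advance, one possible workaround is a rule-based component. The sketch below adds spaCy's `entity_ruler` to a blank English pipeline (tokenizer only, no trained model) and registers two exact-match phrase patterns. The `CHEMICAL` label and the pattern list are assumptions made for illustration; a dictionary like this only finds the names it is given and is not a substitute for a model trained on chemical text.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical NER component
nlp_rules = spacy.blank("en")

# Rule-based entity matcher; "CHEMICAL" is a label invented for this example
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "CHEMICAL", "pattern": "Benz(a)anthracene"},
    {"label": "CHEMICAL", "pattern": "phosphatidylinositol-3,4,5-trisphosphate"},
])

text = (
    "Benz(a)anthracene is a polycyclic aromatic hydrocarbon. "
    "The phosphoinositide, phosphatidylinositol-3,4,5-trisphosphate "
    "(PI(3,4,5)P3), is a key signaling lipid."
)

doc = nlp_rules(text)
found = [(ent.label_, ent.text) for ent in doc.ents]
for label, value in found:
    print('Type: {}, Value: {}'.format(label, value))
```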
