# Named Entity Recognition (NER) With spaCy

We will perform NER on threads from the **Investing** subreddit, but first let's test spaCy's named entity recognition on an example post from */r/investing*.

In [1]:
import spacy

In [2]:
!python -m spacy download en_core_web_sm

In [3]:
nlp = spacy.load('en_core_web_sm')

#### Format for models: [lang]\_[type]\_[genre]\_[size]

* lang: the language of the model, e.g. `en` for English
* type: the model's capabilities; `core` is general purpose, and more specialized types also exist
* genre: the type of text the model has been trained on, e.g. web, news
* size: `sm` (small), `md` (medium), `lg` (large), `trf` (transformer)
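
The naming convention above can be illustrated with a small helper that splits a model name into its four parts. This is only a sketch: `parse_model_name` is a hypothetical function written for this note, not part of spaCy, and it assumes the name has exactly four underscore-separated components.

```python
def parse_model_name(name):
    """Split a model name of the form lang_type_genre_size into a dict."""
    parts = name.split("_")
    if len(parts) != 4:
        # Names that do not follow the four-part convention are rejected.
        raise ValueError(f"expected lang_type_genre_size, got {name!r}")
    return dict(zip(("lang", "type", "genre", "size"), parts))


print(parse_model_name("en_core_web_sm"))
# {'lang': 'en', 'type': 'core', 'genre': 'web', 'size': 'sm'}
```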

In [4]:
txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

In [5]:
doc = nlp(txt)

In [6]:
from spacy import displacy

displacy.render(doc, style='ent')
# displacy.serve(doc, style='ent') if not running in a notebook

#### Note:

Right away we get NER that is not perfect, but pretty good. We are using the [`en_core_web_sm`](https://spacy.io/models/en) model, where `en` refers to English and `sm` to small.

The model correctly identifies ARK as an organization. It also classifies ETF (exchange-traded fund) as an organization, which is incorrect (an ETF is a grouping of securities on the markets), although it is easy to see why the model makes this mistake. The other tag we can see is `WORK_OF_ART`. It isn't obvious what this label means, so we can get more information using `spacy.explain`:

In [15]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

In [16]:
spacy.explain('WORK_OF_ART')

'Titles of books, songs, etc.'

This description fits the tagged item reasonably well, since the item refers to an article (although not quite a book).

We have a visual output for our tagged text, but it isn't particularly useful programmatically. What we need is a way to extract the relevant tags (the organizations) from our text. To do that we can use `doc.ents`, which returns a tuple of all identified entities.

Each entity in this tuple has two attributes that we are interested in, `label_` and `text`:

In [22]:
doc.ents[1].label_

'WORK_OF_ART'

In [21]:
doc.ents[1].text

'The Bear Cave](https://thebearcave.substack.com/p/special-edition'

In [17]:
for entity in doc.ents:
    print(f"{entity.label_}: {entity.text}")

ORG: ARK
WORK_OF_ART: The Bear Cave](https://thebearcave.substack.com/p/special-edition
ORG: ARK
ORG: ARK
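
The `WORK_OF_ART` span above includes part of the Markdown link syntax (`](https://...`). One possible remedy is to strip Markdown links before passing the text to `nlp`. The sketch below uses only the standard library. `strip_markdown_links` and the `MD_LINK` pattern are hypothetical names introduced here, and the pattern only handles simple `[label](url)` links with no spaces or parentheses in the URL. How the model tags the cleaned text is not shown here.

```python
import re

# Matches simple Markdown links of the form [label](url).
# Assumption: the URL contains no whitespace or parentheses.
MD_LINK = re.compile(r"\[([^\]]+)\]\([^()\s]+\)")


def strip_markdown_links(text):
    """Replace each [label](url) with just its label."""
    return MD_LINK.sub(r"\1", text)


sample = (
    "written up very nicely by "
    "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). "
    "The risks"
)
print(strip_markdown_links(sample))
# written up very nicely by The Bear Cave. The risks
```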


We're almost there. Now, we need to filter out any entities that are not `ORG` entities, and append those remaining `ORG`s to an organization list:

In [23]:
# initialize our list
org_list = []

for entity in doc.ents:
    # if label_ is ORG, we append text, otherwise ignore
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

org_list

['ARK', 'ARK', 'ARK']

In [24]:
# we don't need to see 'ARK' three times, so we use set() to remove duplicates, and then convert back to list
org_list = list(set(org_list))

org_list

['ARK']
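
One caveat with `set()` is that it does not preserve the order in which entities appeared. If order of first mention matters, a common alternative is `dict.fromkeys`, sketched below. `unique_in_order` is a hypothetical helper name, and the `mentions` list is illustrative input rather than output from the model.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Illustrative input only: a list of organization mentions with repeats.
mentions = ["ARK", "Apple", "ARK", "ARK"]
print(unique_in_order(mentions))
# ['ARK', 'Apple']
```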

In [26]:
txt2 = "Apple reached an all-time high stock price for 143 dollars in January"

In [27]:
doc2 = nlp(txt2)

In [28]:
type(doc2)

spacy.tokens.doc.Doc

In [29]:
displacy.render(doc2, style='ent')

In [31]:
# initialize our list
org_list = []

for entity in doc2.ents:
    # if label_ is ORG, we append text, otherwise ignore
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

org_list

['Apple']
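
Since the same filtering step is applied to each text, it can be wrapped in one function that also tallies how often each organization is mentioned. The sketch below is a rough illustration and not part of spaCy: `count_labelled` is a hypothetical helper that accepts any objects with `label_` and `text` attributes. So that it runs without a model, it uses `Entity` stand-ins that mirror only the entities printed in this notebook.

```python
from collections import Counter, namedtuple

# Stand-in for spaCy entity spans: only the two attributes used above.
Entity = namedtuple("Entity", ["label_", "text"])


def count_labelled(entity_groups, label="ORG"):
    """Count entity texts with the given label across several groups of entities."""
    counts = Counter()
    for ents in entity_groups:
        counts.update(ent.text for ent in ents if ent.label_ == label)
    return counts


# Stand-ins mirroring the entities printed earlier for the two example texts.
ark_ents = [
    Entity("ORG", "ARK"),
    Entity("WORK_OF_ART", "The Bear Cave](https://thebearcave.substack.com/p/special-edition"),
    Entity("ORG", "ARK"),
    Entity("ORG", "ARK"),
]
apple_ents = [Entity("ORG", "Apple")]

counts = count_labelled([ark_ents, apple_ents])
print(counts.most_common())
# [('ARK', 3), ('Apple', 1)]
```

With real documents, the same function should accept `doc.ents` directly, for example `count_labelled(d.ents for d in nlp.pipe(texts))`, where `texts` is assumed to be a list of thread strings.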