<a href="https://colab.research.google.com/github/marcusborela/Aprendizado-Profundo-Unicamp/blob/main/udemy_course_nlp_with_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition (NER) With spaCy

We will be performing NER on threads from the **Investing** subreddit, but first let's test spaCy's NER on a single example post from */r/investing*.

In [1]:
import spacy
from spacy import displacy

In [2]:
!python -m spacy download en_core_web_sm

2022-08-16 19:54:13.016030: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 14.4 MB/s 
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

In [5]:
doc = nlp(txt)

In [6]:
displacy.render(doc, style='ent')
# displacy.serve(doc, style='ent') if not running in a notebook

[displaCy output: the text rendered with highlighted entities. `ARK ETFs` and `The Bear Cave](https://thebearcave.substack.com/p` are marked ORG; the rest of the captured HTML is truncated.]

Out of the box, the NER is not perfect but pretty good. We are using the [`en_core_web_sm`](https://spacy.io/models/en) model, where `en` refers to English and `sm` to small.

The model accurately identifies ARK as an organization. It also includes ETFs (exchange-traded funds) in the `ORG` span, which is not correct (an ETF is a grouping of securities on the markets, not an organization), but it's easy to see why the model classifies it this way. Another tag that can appear for the article link is `WORK_OF_ART` (in the output captured here the link is tagged `ORG` instead). It isn't immediately clear what this tag means, so we can get more information using `spacy.explain`:

In [7]:
spacy.explain('WORK_OF_ART')

'Titles of books, songs, etc.'

This description fits the tagged item reasonably well: the item refers to an article, which is a titled work although not quite a book.

We have a visual output from our tagged text, but this won't be particularly useful programmatically. What we need is a way to extract the relevant tags (the organizations) from our text. To do that we can use `doc.ents`, which returns a tuple of all identified entities.

Each item in this entity list contains two attributes that we are interested in, `label_` and `text`:

In [8]:
for entity in doc.ents:
    print(f"{entity.label_}: {entity.text}")

ORG: ARK ETFs
ORG: The Bear Cave](https://thebearcave.substack.com/p
ORG: ARK
ORG: ARK
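
The second entity above is cut off in the middle of a Markdown link, because the raw post contains `[label](url)` syntax. One possible preprocessing step, sketched here as an assumption rather than as part of the original workflow, is to replace each Markdown link with its label before passing the text to `nlp`. The helper name `strip_markdown_links` and the pattern `MD_LINK` are hypothetical.

```python
import re

# Matches inline Markdown links of the form [label](url) and captures the label.
# Assumption: URLs contain no spaces or closing parentheses.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def strip_markdown_links(text):
    """Replace every [label](url) in the text with just its label."""
    return MD_LINK.sub(r"\1", text)


# Fragment copied from the example post above.
sample = ("written up very nicely by "
          "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow).")
cleaned = strip_markdown_links(sample)
print(cleaned)
```

The cleaned string could then be passed to `nlp` in place of the raw text. How the model tags `The Bear Cave` after cleaning is not shown here and would need to be checked.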


We're almost there. Now, we need to filter out any entities that are not `ORG` entities, and append those remaining `ORG`s to an organization list:

In [9]:
# initialize our list
org_list = []

for entity in doc.ents:
    # if label_ is ORG, we append text, otherwise ignore
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

org_list

['ARK ETFs', 'The Bear Cave](https://thebearcave.substack.com/p', 'ARK', 'ARK']

In [10]:
# 'ARK' appears more than once, so we use set() to remove duplicates, and then convert back to a list
org_list = list(set(org_list))

org_list

['ARK ETFs', 'ARK', 'The Bear Cave](https://thebearcave.substack.com/p']
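
Note that the order of the list changed after the round trip through `set()`, because sets are unordered. If the original order matters, one alternative (a minimal sketch, not what the cell above does) is `dict.fromkeys`, which keeps the first occurrence of each item in insertion order on Python 3.7+. The name `unique_orgs` is introduced here for illustration.

```python
# Entities copied from the output of the filtering cell above.
org_list = ['ARK ETFs', 'The Bear Cave](https://thebearcave.substack.com/p', 'ARK', 'ARK']

# dict keys are unique and remember insertion order, so this removes
# duplicates while keeping the first occurrence of each entity in place.
unique_orgs = list(dict.fromkeys(org_list))
print(unique_orgs)
```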

In [11]:
txt = "Apple reached an all-time high stock price of 143 dollars this January."

In [12]:
doc = nlp(txt)

In [13]:
displacy.render(doc, style='ent')
# displacy.serve(doc, style='ent') if not running in a notebook

[displaCy output: `Apple` is marked ORG and `143 dollars` is marked MONEY; `January` is also highlighted, but the captured HTML is truncated before its label.]

In [15]:
txt = "Apple reached an all-time high stock price of 143 dollars this January."
doc = nlp(txt)
# initialize our list
org_list = []

for entity in doc.ents:
    # if label_ is ORG, we append text, otherwise ignore
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

org_list

['Apple']

In [16]:
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'Apple reached an all-time high stock price of 143 dollars this January.'
doc = nlp(text)
org_list = [ent.text for ent in doc.ents if ent.label_ == 'ORG']

In [17]:
org_list

['Apple']
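
Since the goal is to run NER over many threads, it may be more useful to count how often each organization is mentioned than to only deduplicate. The sketch below assumes the entities have already been reduced to `(label, text)` pairs, which with spaCy could be built as `[(ent.label_, ent.text) for ent in doc.ents]`. The function name `count_orgs` is hypothetical, and the sample pairs are copied from the outputs shown earlier.

```python
from collections import Counter


def count_orgs(entity_pairs):
    """Count how often each ORG text occurs in an iterable of (label, text) pairs."""
    return Counter(text for label, text in entity_pairs if label == "ORG")


# Pairs taken from the two examples above (the truncated link entity is left out).
pairs = [
    ("ORG", "ARK ETFs"),
    ("ORG", "ARK"),
    ("ORG", "ARK"),
    ("ORG", "Apple"),
    ("MONEY", "143 dollars"),
]
org_counts = count_orgs(pairs)

# most_common() lists the organizations from most to least frequent.
print(org_counts.most_common())
```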

Another exercise: Q&A

In [24]:
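import os

# build the path portably instead of hard-coding a backslash separator
squad_dir = os.path.join('data', 'squad')

# create the directory (and its parent 'data') if it does not exist yet
os.makedirs(squad_dir, exist_ok=True)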



In [25]:
import requests

url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'
file =  'dev-v2.0.json'

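# stream the response so the body is written in chunks instead of being loaded at once
res = requests.get(url + file, stream=True)
res.raise_for_status()  # stop here if the download failed
with open(os.path.join(squad_dir, file), 'wb') as f:
    for chunk in res.iter_content(chunk_size=8192):
        f.write(chunk)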

In [26]:
import json

with open(os.path.join(squad_dir, file), 'rb') as f:
    squad = json.load(f)

In [29]:
# initialize list where we will place all of our data
new_squad = []

# we need to loop through groups -> paragraphs -> qa_pairs
for group in squad['data']:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # collect the unique answer texts from 'answers' and 'plausible_answers'
            answer_set = set()
            for key in ('answers', 'plausible_answers'):
                for answ in qa_pair.get(key, []):
                    answer_set.add(answ['text'])
            # append dictionary sample to parsed squad
            new_squad.append({
                'question': question,
                'answer': list(answer_set),
                'context': context
            })


In [30]:
new_squad[0]

{'question': 'In what country is Normandy located?',
 'answer': ['France'],
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'}
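
To make the nesting handled by the loop above concrete (data, then paragraphs, then question-answer pairs), here is a sketch that wraps the same traversal in a function and runs it on a hand-made miniature input. The name `flatten_squad` and the sample record are hypothetical. Only the fields the loop reads are included, and `sorted` is used instead of `list` so that the answer order is reproducible.

```python
def flatten_squad(squad):
    """Flatten SQuAD-style nesting into a list of question/answer/context dicts."""
    flat = []
    for group in squad["data"]:
        for paragraph in group["paragraphs"]:
            for qa_pair in paragraph["qas"]:
                # gather unique answer texts from both possible keys
                texts = set()
                for key in ("answers", "plausible_answers"):
                    texts.update(a["text"] for a in qa_pair.get(key, []))
                flat.append({
                    "question": qa_pair["question"],
                    "answer": sorted(texts),
                    "context": paragraph["context"],
                })
    return flat


# Hypothetical miniature input shaped like dev-v2.0.json.
mini = {"data": [{"paragraphs": [{
    "context": "Normandy is a region in France.",
    "qas": [
        # answerable question with a repeated reference answer
        {"question": "In what country is Normandy located?",
         "answers": [{"text": "France"}, {"text": "France"}]},
        # unanswerable question: empty 'answers', filled 'plausible_answers'
        {"question": "What is France a region of?",
         "answers": [],
         "plausible_answers": [{"text": "Normandy"}]},
    ],
}]}]}

flat = flatten_squad(mini)
for sample in flat:
    print(sample["question"], sample["answer"])
```

The repeated `France` collapses to a single entry, and the second question picks up its text from `plausible_answers` because its `answers` list is empty.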

In [31]:
with open(os.path.join(squad_dir, 'dev.json'), 'w') as f:
    json.dump(new_squad, f)

In task 2 (after lecture 59), I think it is important to group all the answers to a question in a list, as done above. Lecture 61 should then take this grouping into account when calculating the exact match (EM) metric: a prediction counts as a match if it is equal to any one of the answers in the list.
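
The rule described above, where a prediction counts if it equals any answer in the list, can be sketched as follows. This is a minimal illustration, not the course's implementation. The function names are hypothetical, and the normalization step (lowercasing, removing punctuation and the articles a/an/the, collapsing whitespace) is an assumption modelled on common SQuAD evaluation practice. The sample predictions are made up.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and the articles a/an/the, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """Return 1 if the prediction equals any reference answer after normalization."""
    pred = normalize(prediction)
    return int(any(pred == normalize(ans) for ans in answers))


def em_score(predictions, answer_lists):
    """Mean exact match over paired predictions and lists of reference answers."""
    if not predictions:
        return 0.0
    matches = [exact_match(p, a) for p, a in zip(predictions, answer_lists)]
    return sum(matches) / len(matches)


# Hypothetical predictions, each paired with a list of acceptable answers.
predictions = ["france", "Paris"]
answer_lists = [
    ["France"],
    ["10th and 11th centuries", "in the 10th and 11th centuries"],
]
score = em_score(predictions, answer_lists)
print(score)
```

With a single reference answer per question, the second list would have to pick one phrasing and could penalize a prediction that matches the other, which is the reason for keeping the answers grouped.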

