# Initialization & Introduction

In [19]:
!pip install transformers



In [20]:
from random import choice
import spacy
import en_core_web_sm
from spacy import displacy

import pandas as pd

from spacy.matcher import Matcher

# NER

We will analyze a sample text, copied from CNN:

> Business leaders and industry groups took to Twitter and released statements Saturday congratulating President-elect Joe Biden on his victory, while calling for the country to come together after a hard-fought and sometimes bitter campaign.
> "Now is a time for unity. We must respect the results of the U.S. presidential election and, as we have with every election, honor the decision of the voters and support a peaceful transition of power," said Jamie Dimon, CEO of JPMorgan Chase (JPM).
> "We are a stronger country when we treat each other with dignity, share a commitment to a common purpose and are united to address our greater challenges. No matter our political views, let's come together to strengthen our exceptional country."
> Facebook (FB) COO Sheryl Sandberg said that America has taken "a big step toward creating a government that reflects the diverse country we are."
"Congratulations to Kamala Harris on this remarkable achievement -- shattering glass ceilings and norms around what leadership looks like -- and to President-Elect Biden on this historic milestone," Sandberg wrote in a Facebook post.
> Corporate America had been supportive of Biden in the run-up to the election. A survey of CEOs conducted by the Yale School of Management in late September found that 77% of participants would vote for Biden. More than 60% predicted he would win.
> Leaders of industry groups also are sending word of their support to the incoming administration.
The American Bankers Association President and CEO Rob Nichols said the association and its members "stand ready to work with the Biden administration and lawmakers from both parties to bolster the economy, increase opportunity and create a brighter future for all Americans."
While the nation's banks have worked to assist their business and consumer customers, he added, "we know more must be done to fuel the recovery."
US Chamber of Commerce CEO Thomas J. Donohue said the industry group looks forward to working "with the Biden administration and leaders on both sides of the aisle to restore public health, revitalize our economy, and help rebuild American lives and communities."
> He added, "We stand ready to help break through the gridlock and help get things done through collaboration and good governance," and said the Chamber stands ready "to help break through the gridlock and help get things done through collaboration and good governance."
In a statement, National Association of Manufacturers President and CEO Jay Timmons said that "the American people are not interested in extreme policies from either party; they are looking for smart, stable and solutions-oriented governance."
His group's agenda advocates for a competitive tax and regulatory system, infrastructure investment, comprehensive immigration reform, expanded trade and a strengthened workforce.

---


# spaCy for Information Extraction

Here we will use [spaCy](https://spacy.io) to extract information from our text.

spaCy ships with several pre-trained language models.
We're loading the English (`en`) one. There are usually several models per language (small, medium, large, domain-specific ones such as medical, etc.), which differ from one another in the number and type of words each contains in its vocabulary.

Explore the rest of the models here: https://spacy.io/usage/models

In [21]:
nlp = en_core_web_sm.load()

In [22]:
copied_text = """ Business leaders and industry groups took to Twitter and released statements Saturday congratulating President-elect Joe Biden on his victory, while calling for the country to come together after a hard-fought and sometimes bitter campaign.
"Now is a time for unity. We must respect the results of the U.S. presidential election and, as we have with every election, honor the decision of the voters and support a peaceful transition of power," said Jamie Dimon, CEO of JPMorgan Chase (JPM).
"We are a stronger country when we treat each other with dignity, share a commitment to a common purpose and are united to address our greater challenges. No matter our political views, let's come together to strengthen our exceptional country."
Facebook (FB) COO Sheryl Sandberg said that America has taken "a big step toward creating a government that reflects the diverse country we are."
"Congratulations to Kamala Harris on this remarkable achievement -- shattering glass ceilings and norms around what leadership looks like -- and to President-Elect Biden on this historic milestone," Sandberg wrote in a Facebook post.
Corporate America had been supportive of Biden in the run-up to the election. A survey of CEOs conducted by the Yale School of Management in late September found that 77% of participants would vote for Biden. More than 60% predicted he would win.
Leaders of industry groups also are sending word of their support to the incoming administration.
The American Bankers Association President and CEO Rob Nichols said the association and its members "stand ready to work with the Biden administration and lawmakers from both parties to bolster the economy, increase opportunity and create a brighter future for all Americans."
While the nation's banks have worked to assist their business and consumer customers, he added, "we know more must be done to fuel the recovery."
US Chamber of Commerce CEO Thomas J. Donohue said the industry group looks forward to working "with the Biden administration and leaders on both sides of the aisle to restore public health, revitalize our economy, and help rebuild American lives and communities."
He added, "We stand ready to help break through the gridlock and help get things done through collaboration and good governance," and said the Chamber stands ready "to help break through the gridlock and help get things done through collaboration and good governance."
In a statement, National Association of Manufacturers President and CEO Jay Timmons said that "the American people are not interested in extreme policies from either party; they are looking for smart, stable and solutions-oriented governance."
His group's agenda advocates for a competitive tax and regulatory system, infrastructure investment, comprehensive immigration reform, expanded trade and a strengthened workforce."""

In [23]:
doc = nlp(copied_text)

spaCy includes built-in parsing utilities, which run as soon as we pass the text through the `nlp` object we've created.

For example, it already breaks down the text into sentences:

In [24]:
for s in list(doc.sents)[:3]:
  print(s)

 Business leaders and industry groups took to Twitter and released statements Saturday congratulating President-elect Joe Biden on his victory, while calling for the country to come together after a hard-fought and sometimes bitter campaign.

"Now is a time for unity.
We must respect the results of the U.S. presidential election and, as we have with every election, honor the decision of the voters and support a peaceful transition of power," said Jamie Dimon, CEO of JPMorgan Chase (JPM).



Let's explore one of the sentences.

For every token in the sentence, we print:

*   its part-of-speech (POS) tag,
*   its dependency label (e.g. `nsubj` - nominal subject, `pobj` - object of a preposition, etc.),
*   and the word it attaches to in the dependency parse tree (its syntactic head).

In [25]:
for token in list(doc.sents)[11]:
  print(token.text, token.pos_, token.dep_, token.head.text)

Leaders NOUN nsubj sending
of ADP prep Leaders
industry NOUN compound groups
groups NOUN pobj of
also ADV advmod sending
are AUX aux sending
sending VERB ROOT sending
word NOUN dobj sending
of ADP prep word
their DET poss support
support NOUN pobj of
to ADP prep sending
the DET det administration
incoming ADJ amod administration
administration NOUN pobj to
. PUNCT punct sending

 SPACE  .


Let's see what it looks like, graphically:

In [26]:
displacy.render(list(doc.sents)[11], style="dep", jupyter=True, options={'distance': 90})

spaCy also includes a built-in NER component, which has already run over the tokenized text.

Let's see what it captured:

In [27]:
for ent in doc.ents:
  print(ent.text, ent.label_)

Saturday DATE
Joe Biden PERSON
U.S. GPE
Jamie Dimon PERSON
JPMorgan Chase ORG
JPM ORG
Sheryl Sandberg PERSON
America GPE
Kamala Harris PERSON
Sandberg PERSON
Corporate America ORG
Biden FAC
the Yale School of Management ORG
late September DATE
77% PERCENT
Biden LOC
More than 60% PERCENT
The American Bankers Association ORG
Rob Nichols PERSON
Biden PERSON
Americans NORP
US GPE
Chamber of Commerce ORG
Thomas J. Donohue PERSON
Biden PERSON
American NORP
Chamber PERSON
National Association of Manufacturers ORG
Jay Timmons PERSON
American NORP


Not everything is accurate: Chamber should have been tagged as ORG, not PERSON, and two of the Biden mentions were tagged as FAC and LOC.

However, these mistakes and confusions are very common, and we have to learn how to work our way around them.
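One simple way to spot such confusions is to check whether the same surface form received more than one label. The following is a minimal sketch in plain Python, using a few `(text, label)` pairs copied from the output above. With a live `doc`, the same list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `inconsistent_labels` is hypothetical.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the NER output above.
pairs = [
    ("Joe Biden", "PERSON"),
    ("Biden", "FAC"),
    ("Biden", "LOC"),
    ("Biden", "PERSON"),
    ("Biden", "PERSON"),
    ("Chamber of Commerce", "ORG"),
    ("Chamber", "PERSON"),
    ("American", "NORP"),
    ("American", "NORP"),
]


def inconsistent_labels(pairs):
    """Return {text: sorted labels} for texts tagged with more than one label."""
    labels_by_text = defaultdict(set)
    for text, label in pairs:
        labels_by_text[text].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


print(inconsistent_labels(pairs))
```

This only catches inconsistencies. A mention that is wrong every time (such as `Chamber` tagged as PERSON) still needs a manual check or a rule.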

In [28]:
displacy.render(doc.sents, style="ent", jupyter=True, options={'distance': 90})


## spaCy Matcher

Recall the regular expressions (RegEx) from day 1.

spaCy's Matcher works like a regular expression over tokens, but it can also capture other linguistic features, such as POS tags or NER labels:
https://spacy.io/usage/rule-based-matching

Let's instantiate it over the vocabulary:

In [29]:
# Instantiate the matcher over the trained vocabulary:
matcher = Matcher(nlp.vocab)

Let's create a simple pattern and run it on the text:

In [30]:
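# One dict per token: a noun, then an adposition, then a noun
# (for example "transition of power").
pattern = [
    {"POS": "NOUN"},
    {"POS": "ADP"},
    {"POS": "NOUN"},
]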

In [31]:
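# Register the pattern under the (arbitrary) name "POS".
# The patterns are passed as a list: this form is required by spaCy v3 and,
# as far as we know, is also accepted from v2.2.2 on. The older call
# matcher.add("POS", None, pattern) only works on spaCy v2.
matcher.add("POS", [pattern])

matches = matcher(doc)

for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)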

Match found: time for unity
Match found: transition of power
Match found: survey of CEOs
Match found: % of participants
Match found: Leaders of industry


# Data Exploration with spaCy

## Your Turn:

The text contains many mentions of people and their roles. See if you can capture phrases of the form:

**A, role of C**

Examples:
* Facebook (FB) COO Sheryl Sandberg
* US Chamber of Commerce CEO Thomas J. Donohue said  ...

> Hint: you can include the NER outcome in your pattern. Check out the documentation for more details: https://spacy.io/usage/rule-based-matching


In [35]:
### your code here

# NER Training with spaCy

spaCy can also be used to quickly train your own NER models.

Training data might look like this:

In [None]:
# Each entity is (start_char, end_char, label); end_char is exclusive,
# so text[start_char:end_char] must be exactly the entity text.
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
    ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 29, 'GPE')]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
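Character offsets are easy to miscount by hand, and a span that does not line up with the entity text can end up ignored during training. A small sketch that derives the offsets with `re.finditer` instead of counting characters manually. The helper `annotate`, the `EMOJI` label and the second sample sentence are illustrative assumptions, not part of spaCy.

```python
import re


def annotate(text, phrase, label):
    """Return (text, [(start, end, label), ...]) for every literal occurrence of phrase."""
    spans = [
        (m.start(), m.end(), label)
        for m in re.finditer(re.escape(phrase), text)
    ]
    return (text, spans)


examples = [
    annotate("Uber blew through $1 million a week", "Uber", "ORG"),
    annotate("I <3 this city", "<3", "EMOJI"),  # hypothetical label
]

for text, spans in examples:
    for start, end, label in spans:
        # The slice should reproduce exactly the annotated phrase.
        print(label, repr(text[start:end]), (start, end))
```

Because the spans come from the match objects, `text[start:end]` always equals the phrase, which is a cheap sanity check worth running on any hand-written training set as well.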


1. Create your own training data to detect emojis :-) !
Compose several (at least 6) sentences in which you use the textual form of emojis, such as `<3`, `:)`, `:-(`.
Write the training code and add the trained component to the spaCy pipeline. You can find code samples [here](https://spacy.io/usage/training).

2. If you wish to take your coding skills to the next level, try this challenge:
Use this [annotated data from Twitter](https://github.com/aritter/twitter_nlp/tree/master/data/annotated), and train spaCy to recognize the following labels:

* facility
* company
* other
* tvshow
* sportsteam
* geo-loc
* movie
* product
* musicartist

**Attention**: you will need to do some data wrangling and change the data so that it matches spaCy's expected format.

The annotated data looks like this:
```
Cant	O
wait	O
for	O
the	O
ravens	B-sportsteam
game	O
tomorrow	O
....	O
go	O
ray	B-person
rice	I-person
!!!!!!!	O
```
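
A sketch of the wrangling step, assuming one token and tag pair per line, tokens re-joined with single spaces, and output in the `(text, [(start, end, label)])` shape shown earlier. `bio_to_offsets` is a hypothetical helper, and the real files may need extra handling (for example, splitting on the blank lines between tweets).

```python
def bio_to_offsets(rows):
    """Convert [(token, BIO tag), ...] into (text, [(start, end, label), ...])."""
    tokens, entities = [], []
    pos = 0          # character position where the next token starts
    current = None   # [start, end, label] of the entity being built
    for token, tag in rows:
        start, end = pos, pos + len(token)
        if tag.startswith("B-"):
            # A new entity begins; close the previous one if it is still open.
            if current:
                entities.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end  # extend the open entity
        else:
            # "O" tag (or an I- tag with no matching open entity): close any open entity.
            if current:
                entities.append(tuple(current))
            current = None
        tokens.append(token)
        pos = end + 1  # +1 for the joining space
    if current:
        entities.append(tuple(current))
    return " ".join(tokens), entities


rows = [("go", "O"), ("ray", "B-person"), ("rice", "I-person"), ("!!!!!!!", "O")]
text, entities = bio_to_offsets(rows)
print(text)
print(entities)
```

Joining tokens with single spaces does not reproduce the original tweet exactly, but it keeps the offsets consistent with the text that is actually passed to training.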


3. Lastly, if you want to train your model on different languages, here you will find [several annotated languages](https://github.com/EuropeanaNewspapers/ner-corpora) to train your model on.

# BERT NER

Let's also give BERT a try, using the Hugging Face transformers package that we played with yesterday.

In [10]:
from pprint import pprint

In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline




Hugging Face provides NER models out of the box.

Yet, if you need a domain-specific NER, you will need to train your own.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

In [5]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

In [11]:
example = "Paris Whitney Hilton born February 17, 1981 is an American television personality and businesswoman . She is the great-granddaughter of Conrad Hilton , the founder of Hilton Hotels . Born in New York City and raised in both California and New York, Hilton began a modeling career when she signed with Donald Trump’s modeling agency"

ner_results = nlp(example)
pprint(ner_results)

[{'entity': 'B-PER', 'index': 1, 'score': 0.9973036646842957, 'word': 'Paris'},
 {'entity': 'I-PER',
  'index': 2,
  'score': 0.9825794100761414,
  'word': 'Whitney'},
 {'entity': 'I-PER', 'index': 3, 'score': 0.9956507086753845, 'word': 'Hilton'},
 {'entity': 'B-MISC',
  'index': 11,
  'score': 0.9995628595352173,
  'word': 'American'},
 {'entity': 'B-PER',
  'index': 25,
  'score': 0.9946050643920898,
  'word': 'Conrad'},
 {'entity': 'I-PER',
  'index': 26,
  'score': 0.9543987512588501,
  'word': 'Hilton'},
 {'entity': 'B-ORG',
  'index': 31,
  'score': 0.9979767203330994,
  'word': 'Hilton'},
 {'entity': 'I-ORG',
  'index': 32,
  'score': 0.9959855079650879,
  'word': 'Hotels'},
 {'entity': 'B-LOC', 'index': 36, 'score': 0.999575674533844, 'word': 'New'},
 {'entity': 'I-LOC', 'index': 37, 'score': 0.9993460774421692, 'word': 'York'},
 {'entity': 'I-LOC', 'index': 38, 'score': 0.999595046043396, 'word': 'City'},
 {'entity': 'B-LOC',
  'index': 43,
  'score': 0.9996414184570312,
  'w
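
The pipeline returns one dictionary per tagged token, so multi-word entities such as `New York City` arrive in pieces. A sketch that merges consecutive `B-`/`I-` tokens into spans, using the `entity`, `index` and `word` keys seen in the output above (scores omitted). `group_entities` is a hypothetical helper; recent transformers releases can do comparable grouping inside the pipeline itself (an `aggregation_strategy` argument), so treat this as an illustration of the tagging scheme rather than a required step.

```python
def group_entities(tokens):
    """Merge consecutive B-/I- tokens of the same type into (label, text) spans."""
    spans = []
    prev_index = None
    for tok in tokens:
        # Assumes every tag looks like "B-XXX" or "I-XXX" (no bare "O" tags).
        prefix, label = tok["entity"].split("-", 1)
        word = tok["word"]
        continues = (
            spans
            and prefix == "I"
            and spans[-1][0] == label
            and tok["index"] == prev_index + 1
        )
        if continues:
            if word.startswith("##"):
                # BERT word-piece continuation: glue it to the previous piece.
                spans[-1][1] += word[2:]
            else:
                spans[-1][1] += " " + word
        else:
            spans.append([label, word])
        prev_index = tok["index"]
    return [(label, text) for label, text in spans]


# Token dictionaries copied from the output above, without the scores.
tokens = [
    {"entity": "B-PER", "index": 1, "word": "Paris"},
    {"entity": "I-PER", "index": 2, "word": "Whitney"},
    {"entity": "I-PER", "index": 3, "word": "Hilton"},
    {"entity": "B-MISC", "index": 11, "word": "American"},
    {"entity": "B-ORG", "index": 31, "word": "Hilton"},
    {"entity": "I-ORG", "index": 32, "word": "Hotels"},
    {"entity": "B-LOC", "index": 36, "word": "New"},
    {"entity": "I-LOC", "index": 37, "word": "York"},
    {"entity": "I-LOC", "index": 38, "word": "City"},
]

for label, text in group_entities(tokens):
    print(label, text)
```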

Fine-tuning BERT is not a quick task, and it may require considerable time and GPU resources.

If you wish to try the previous assignment with BERT, you will find [this tutorial useful](https://huggingface.co/transformers/master/custom_datasets.html).