<a href="https://colab.research.google.com/github/liadmagen/NLP-Course/blob/master/exercises_notebooks/10_LM_IE_using_spacy_regex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Initialization & Introduction

In [1]:
from random import choice
import spacy
import en_core_web_sm

import pandas as pd

from spacy.matcher import Matcher

We will analyze a sample text, copied from CNN:

> Business leaders and industry groups took to Twitter and released statements Saturday congratulating President-elect Joe Biden on his victory, while calling for the country to come together after a hard-fought and sometimes bitter campaign.
> "Now is a time for unity. We must respect the results of the U.S. presidential election and, as we have with every election, honor the decision of the voters and support a peaceful transition of power," said Jamie Dimon, CEO of JPMorgan Chase (JPM).
> "We are a stronger country when we treat each other with dignity, share a commitment to a common purpose and are united to address our greater challenges. No matter our political views, let's come together to strengthen our exceptional country."
> Facebook (FB) COO Sheryl Sandberg said that America has taken "a big step toward creating a government that reflects the diverse country we are."
"Congratulations to Kamala Harris on this remarkable achievement -- shattering glass ceilings and norms around what leadership looks like -- and to President-Elect Biden on this historic milestone," Sandberg wrote in a Facebook post.
> Corporate America had been supportive of Biden in the run-up to the election. A survey of CEOs conducted by the Yale School of Management in late September found that 77% of participants would vote for Biden. More than 60% predicted he would win.
> Leaders of industry groups also are sending word of their support to the incoming administration.
The American Bankers Association President and CEO Rob Nichols said the association and its members "stand ready to work with the Biden administration and lawmakers from both parties to bolster the economy, increase opportunity and create a brighter future for all Americans."
While the nation's banks have worked to assist their business and consumer customers, he added, "we know more must be done to fuel the recovery."
US Chamber of Commerce CEO Thomas J. Donohue said the industry group looks forward to working "with the Biden administration and leaders on both sides of the aisle to restore public health, revitalize our economy, and help rebuild American lives and communities."
> He added, "We stand ready to help break through the gridlock and help get things done through collaboration and good governance," and said the Chamber stands ready "to help break through the gridlock and help get things done through collaboration and good governance."
In a statement, National Association of Manufacturers President and CEO Jay Timmons said that "the American people are not interested in extreme policies from either party; they are looking for smart, stable and solutions-oriented governance."
His group's agenda advocates for a competitive tax and regulatory system, infrastructure investment, comprehensive immigration reform, expanded trade and a strengthened workforce.

---


# spaCy for Information Extraction

Here we will use [spaCy](https://spacy.io) to extract information from our text.

spaCy ships with several pre-trained language models.
We load the small English one (`en_core_web_sm`). Most languages come with several models (small, medium, large, etc.), which differ from one another in the number and type of words their vocabulary contains.

Explore the rest of the models here: https://spacy.io/usage/models

In [2]:
nlp = en_core_web_sm.load()
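
Importing the model package as a module, as above, only works when that package is installed. A common alternative is to load a model by name with `spacy.load`. The sketch below assumes a fallback to a blank English pipeline (tokenizer only) when the model is missing; the helper name `load_pipeline` and the variable `nlp_demo` are made up for illustration and are not part of the notebook.

```python
import spacy

def load_pipeline(name="en_core_web_sm"):
    """Load a pipeline by name; fall back to a blank English pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the model package is not installed.
        # A blank pipeline only tokenizes: no tagger, parser, or NER.
        return spacy.blank("en")

nlp_demo = load_pipeline()

# Names of the processing components that run after tokenization.
print(nlp_demo.pipe_names)
```

With the full model, the list shows the components that produce the POS tags, dependency parse, and entities used in the rest of this notebook. With the blank fallback, the list is empty.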

In [3]:
copied_text = """ Business leaders and industry groups took to Twitter and released statements Saturday congratulating President-elect Joe Biden on his victory, while calling for the country to come together after a hard-fought and sometimes bitter campaign.
"Now is a time for unity. We must respect the results of the U.S. presidential election and, as we have with every election, honor the decision of the voters and support a peaceful transition of power," said Jamie Dimon, CEO of JPMorgan Chase (JPM).
"We are a stronger country when we treat each other with dignity, share a commitment to a common purpose and are united to address our greater challenges. No matter our political views, let's come together to strengthen our exceptional country."
Facebook (FB) COO Sheryl Sandberg said that America has taken "a big step toward creating a government that reflects the diverse country we are."
"Congratulations to Kamala Harris on this remarkable achievement -- shattering glass ceilings and norms around what leadership looks like -- and to President-Elect Biden on this historic milestone," Sandberg wrote in a Facebook post.
Corporate America had been supportive of Biden in the run-up to the election. A survey of CEOs conducted by the Yale School of Management in late September found that 77% of participants would vote for Biden. More than 60% predicted he would win.
Leaders of industry groups also are sending word of their support to the incoming administration.
The American Bankers Association President and CEO Rob Nichols said the association and its members "stand ready to work with the Biden administration and lawmakers from both parties to bolster the economy, increase opportunity and create a brighter future for all Americans."
While the nation's banks have worked to assist their business and consumer customers, he added, "we know more must be done to fuel the recovery."
US Chamber of Commerce CEO Thomas J. Donohue said the industry group looks forward to working "with the Biden administration and leaders on both sides of the aisle to restore public health, revitalize our economy, and help rebuild American lives and communities."
He added, "We stand ready to help break through the gridlock and help get things done through collaboration and good governance," and said the Chamber stands ready "to help break through the gridlock and help get things done through collaboration and good governance."
In a statement, National Association of Manufacturers President and CEO Jay Timmons said that "the American people are not interested in extreme policies from either party; they are looking for smart, stable and solutions-oriented governance."
His group's agenda advocates for a competitive tax and regulatory system, infrastructure investment, comprehensive immigration reform, expanded trade and a strengthened workforce."""

In [4]:
doc = nlp(copied_text)

spaCy includes built-in parsing utilities, which run as soon as we pass the text through the `nlp` object we created.

For example, it already breaks down the text into sentences:

In [5]:
for s in list(doc.sents)[:3]:
  print(s)

 Business leaders and industry groups took to Twitter and released statements Saturday congratulating President-elect Joe Biden on his victory, while calling for the country to come together after a hard-fought and sometimes bitter campaign.

"Now is a time for unity.
We must respect the results of the U.S. presidential election and, as we have with every election, honor the decision of the voters and support a peaceful transition of power," said Jamie Dimon, CEO of JPMorgan Chase (JPM).
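
The sentence boundaries above come from the statistical model. As a point of comparison, here is a minimal sketch of purely rule-based splitting with spaCy's `sentencizer` component, which splits on sentence-final punctuation. It assumes the spaCy v3 style of adding a component by name. The variable names (`nlp_rules`, `demo`, `sentences`) are illustrative only.

```python
import spacy

# A blank pipeline has only a tokenizer, so no model download is needed.
nlp_rules = spacy.blank("en")

# Rule-based sentence splitter (spaCy v3 syntax: add the component by name).
nlp_rules.add_pipe("sentencizer")

demo = nlp_rules("Now is a time for unity. We must respect the results of the election.")
sentences = [sent.text for sent in demo.sents]

for sent in sentences:
    print(sent)
```

Rule-based splitting is fast, but it cannot use the grammatical context that the parser-based splitter relies on.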



Let's explore one of these sentences.

For every token in the sentence, we print:

*   its part-of-speech (POS) tag,
*   its dependency label (e.g. nsubj for nominal subject, dobj for direct object),
*   and the token it attaches to in the dependency parse tree (its head).

In [6]:
for token in list(doc.sents)[11]:
  print(token.text, token.pos_, token.dep_, token.head.text)

Leaders NOUN nsubj sending
of ADP prep Leaders
industry NOUN compound groups
groups NOUN pobj of
also ADV advmod sending
are AUX aux sending
sending VERB ROOT sending
word NOUN dobj sending
of ADP prep word
their DET poss support
support NOUN pobj of
to ADP prep sending
the DET det administration
incoming ADJ amod administration
administration NOUN pobj to
. PUNCT punct sending

 SPACE  .
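
The abbreviations in the POS and dependency columns above can be looked up with `spacy.explain`, which returns a short description for a known label. This small sketch needs no loaded model. The list of labels is simply taken from the output above, and the variable names are illustrative.

```python
import spacy

# A few labels that appear in the token table above.
labels = ["NOUN", "ADP", "nsubj", "prep", "pobj", "dobj", "amod", "compound"]

# spacy.explain returns a description string, or None for unknown labels.
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label:10} {meaning}")
```

The same function also works for the entity labels printed by the NER step, such as `GPE` or `NORP`.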


spaCy also includes a built-in named-entity recognizer (NER), which has already run over the tokenized text.

Let's see what it captured:

In [7]:
for ent in doc.ents:
  print(ent.text, ent.label_)

Saturday DATE
Joe Biden PERSON
U.S. GPE
Jamie Dimon PERSON
JPMorgan Chase ORG
JPM ORG
Sheryl Sandberg PERSON
America GPE
Kamala Harris PERSON
Sandberg PERSON
Corporate America ORG
Biden FAC
the Yale School of Management ORG
late September DATE
77% PERCENT
Biden LOC
More than 60% PERCENT
The American Bankers Association ORG
Rob Nichols PERSON
Biden PERSON
Americans NORP
US GPE
Chamber of Commerce ORG
Thomas J. Donohue PERSON
Biden PERSON
American NORP
Chamber PERSON
National Association of Manufacturers ORG
Jay Timmons PERSON
American NORP


Not everything is accurate: for example, Chamber should have been tagged as ORG, not PERSON, and Biden was tagged once as FAC and once as LOC.

However, these mistakes and confusions are very common, and we have to learn how to work our way around them.
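
One simple way to spot such inconsistencies is to group the predictions by surface form and compare the labels. The sketch below uses only the standard library and a subset of the `(text, label)` pairs copied from the output above. The majority vote is just one possible workaround, not necessarily the one intended by the course.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the NER output above (a subset).
predicted = [
    ("Joe Biden", "PERSON"),
    ("Biden", "FAC"),
    ("Biden", "LOC"),
    ("Biden", "PERSON"),
    ("Biden", "PERSON"),
    ("Chamber of Commerce", "ORG"),
    ("Chamber", "PERSON"),
]

# Count how often each surface form received each label.
labels_by_text = defaultdict(Counter)
for text, label in predicted:
    labels_by_text[text][label] += 1

# Surface forms that were tagged with more than one label.
inconsistent = {t: dict(c) for t, c in labels_by_text.items() if len(c) > 1}

# Majority label per surface form.
majority = {t: c.most_common(1)[0][0] for t, c in labels_by_text.items()}

print("Inconsistent:", inconsistent)
print("Majority label for 'Biden':", majority["Biden"])
```

The vote repairs the `Biden` mentions, but it cannot fix `Chamber`, which occurs only once with the wrong label. Errors like that need a different strategy, such as rules or a larger model.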

## spaCy Matcher

Recall the regular expressions (RegEx) from day 1.

spaCy's Matcher works like a regular expression over tokens instead of characters, and it can also match linguistic attributes such as POS tags or NER labels:
https://spacy.io/usage/rule-based-matching

Let's instantiate it with the model's vocabulary:

In [8]:
# Instantiate the matcher over the trained vocabulary:
matcher = Matcher(nlp.vocab)

Let's create a simple pattern (a noun, an adposition, and another noun) and run it on the text:

In [9]:
pattern = [
    {"POS": "NOUN"},
    {"POS": "ADP"},
    {"POS": "NOUN"},
]

In [10]:
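# Register the pattern under the key "POS". This is the spaCy v3 signature
# (also accepted from v2.2.2); older v2 code used matcher.add("POS", None, pattern).
matcher.add("POS", [pattern])

# Each match is a (match_id, start, end) triple of token indices.
matches = matcher(doc)

for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)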

Match found: time for unity
Match found: transition of power
Match found: survey of CEOs
Match found: % of participants
Match found: Leaders of industry
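
The pattern above relies on POS tags, which require a trained model. Patterns can also use purely lexical token attributes. As a minimal sketch, the following matches stock tickers in parentheses, such as `(JPM)`, using only the tokenizer of a blank English pipeline. The names `nlp_blank`, `ticker_matcher`, and `ticker_pattern` are illustrative, and the sample string is stitched together from fragments of the article.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for ORTH and IS_UPPER attributes.
nlp_blank = spacy.blank("en")
ticker_matcher = Matcher(nlp_blank.vocab)

# "(" followed by one all-uppercase token, followed by ")".
ticker_pattern = [{"ORTH": "("}, {"IS_UPPER": True}, {"ORTH": ")"}]
ticker_matcher.add("TICKER", [ticker_pattern])

sample = nlp_blank(
    "said Jamie Dimon, CEO of JPMorgan Chase (JPM). "
    "Facebook (FB) COO Sheryl Sandberg said"
)

tickers = [sample[start:end].text for _, start, end in ticker_matcher(sample)]
print(tickers)
```

Each dictionary in a pattern describes one token, so lexical attributes and linguistic attributes (such as `POS`) can be combined within the same pattern.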


# Data Exploration with spaCy

Your turn:

The text contains many mentions of people and their roles. See if you can capture phrases of the form:

**A, role of C**

Examples:
* Facebook (FB) COO Sheryl Sandberg
* US Chamber of Commerce CEO Thomas J. Donohue said ...

> Hint: you can also use the NER output. Check the documentation for more details: https://spacy.io/usage/rule-based-matching
