
# Using SpaCy to Extract Information from Cast Biographies

<sub>Content of this notebook was prepared by Basel Shbita (shbita@usc.edu) as part of the class <u>INF 558: Building Knowledge Graphs</u></sub>

spaCy is an open-source software library for advanced natural language processing (NLP). It covers the tasks commonly needed in an NLP project, including tokenization, lemmatization, part-of-speech (POS) tagging, entity recognition, dependency parsing, sentence segmentation, and word-to-vector transformations, along with other methods for cleaning and normalizing text data.

This notebook introduces some applied examples of NLP tasks to extract information from unstructured data using spaCy. The extracted structured data we produce can be used for downstream applications, such as creating Knowledge Graphs!

In [None]:
!pip install spacy

Collecting spacy
  Downloading spacy-3.1.2-cp39-cp39-macosx_10_9_x86_64.whl (6.2 MB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp39-cp39-macosx_10_9_x86_64.whl (32 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.5-cp39-cp39-macosx_10_9_x86_64.whl (18 kB)
Collecting wasabi<1.1.0,>=0.8.1
  Downloading wasabi-0.8.2-py3-none-any.whl (23 kB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp39-cp39-macosx_10_9_x86_64.whl (2.7 MB)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.4-cp39-cp39-macosx_10_9_x86_64.whl (5.8 MB)
Collecting pathy>=0.3.5
  Download

## Language Model

spaCy offers several types of models. We will use a pretrained statistical model for English (`en_core_web_sm`). Let's download it and then load it.

In [None]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import spacy
import en_core_web_sm
import csv

We will store the model in an `nlp` object, which is a language model instance.

In [None]:
nlp = en_core_web_sm.load()

## Sentence Segmentation

Sentence segmentation is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. You'll use these units when you process your text to perform tasks such as part-of-speech tagging and entity extraction.
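To see why segmentation takes more than splitting on periods, here is a deliberately naive splitter built on Python's `re` module. It is not how spaCy works; the function name `naive_split` and the short sample text are assumptions made for illustration.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule, not spaCy's method)."""
    return [part for part in re.split(r'(?<=[.!?])\s+', text.strip()) if part]

text = "She writes under the pen name J.K. Rowling. She calls herself Jo."
for piece in naive_split(text):
    print(repr(piece))
```

The sample contains two sentences, but the naive rule also cuts after the abbreviation `J.K.` and so returns three pieces. Handling cases like this is what a trained pipeline is for.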

First, let's load a biography. The sample record below is defined inline and pairs a source URL with its biography text:

In [None]:
data = [
    ["https://www.goodreads.com/author/show/1077326.J_K_Rowling", """See also: Robert GalbraithAlthough she writes under the pen name J.K. Rowling, pronounced like rolling, her name when her first Harry Potter book was published was simply Joanne Rowling. Anticipating that the target audience of young boys might not want to read a book written by a woman, her publishers demanded that she use two initials, rather than her full name. As she had no middle name, she chose K as the second initial of her pen name, from her paternal grandmother Kathleen Ada Bulgen Rowling. She calls herself Jo and has said, \"No one ever called me 'Joanne' when I was young, unless they were angry.\" Following her marriage, she has sometimes used the name Joanne Murray when conducting personal business. During the Leveson Inquiry she gave evidence under the name of Joanne Kathleen Rowling. In a 2012 interview, Rowling noted that she no longer cared that people pronounced her name incorrectly. Rowling was born to Peter James Rowling, a Rolls-Royce aircraft engineer, and Anne Rowling (n\u00e9e Volant), on 31 July 1965 in Yate, Gloucestershire, England, 10 miles (16 km) northeast of Bristol. Her mother Anne was half-French and half-Scottish. Her parents first met on a train departing from King's Cross Station bound for Arbroath in 1964. They married on 14 March 1965. Her mother's maternal grandfather, Dugald Campbell, was born in Lamlash on the Isle of Arran. Her mother's paternal grandfather, Louis Volant, was awarded the Croix de Guerre for exceptional bravery in defending the village of Courcelles-le-Comte during the First World War.Rowling's sister Dianne was born at their home when Rowling was 23 months old. The family moved to the nearby village Winterbourne when Rowling was four. She attended St Michael's Primary School, a school founded by abolitionist William Wilberforce and education reformer Hannah More. 
Her headmaster at St Michael's, Alfred Dunn, has been suggested as the inspiration for the Harry Potter headmaster Albus Dumbledore.As a child, Rowling often wrote fantasy stories, which she would usually then read to her sister. She recalls that: \"I can still remember me telling her a story in which she fell down a rabbit hole and was fed strawberries by the rabbit family inside it. Certainly the first story I ever wrote down (when I was five or six) was about a rabbit called Rabbit. He got the measles and was visited by his friends, including a giant bee called Miss Bee.\" At the age of nine, Rowling moved to Church Cottage in the Gloucestershire village of Tutshill, close to Chepstow, Wales. When she was a young teenager, her great aunt, who Rowling said \"taught classics and approved of a thirst for knowledge, even of a questionable kind,\" gave her a very old copy of Jessica Mitford's autobiography, Hons and Rebels. Mitford became Rowling's heroine, and Rowling subsequently read all of her books.Rowling has said of her teenage years, in an interview with The New Yorker, \"I wasn\u2019t particularly happy. I think it\u2019s a dreadful time of life.\" She had a difficult homelife; her mother was ill and she had a difficult relationship with her father (she is no longer on speaking terms with him). She attended secondary school at Wyedean School and College, where her mother had worked as a technician in the science department. Rowling said of her adolescence, \"Hermione [a bookish, know-it-all Harry Potter character] is loosely based on me. She's a caricature of me when I was eleven, which I'm not particularly proud of.\" Steve Eddy, who taught Rowling English when she first arrived, remembers her as \"not exceptional\" but \"one of a group of girls who were bright, and quite good at English.\" Sean Harris, her best friend in the Upper Sixth owned a turquoise Ford Anglia, which she says inspired the one in her books."""]
]

# Print each record's source URL and keep the last biography text for the following cells
for idx, (act, bio) in enumerate(data):
    print(f'[{idx:2d}] >', act)
    biog = bio

[ 0] > https://www.goodreads.com/author/show/1077326.J_K_Rowling


Here's the full biography text:

In [None]:
biog

'See also: Robert GalbraithAlthough she writes under the pen name J.K. Rowling, pronounced like rolling, her name when her first Harry Potter book was published was simply Joanne Rowling. Anticipating that the target audience of young boys might not want to read a book written by a woman, her publishers demanded that she use two initials, rather than her full name. As she had no middle name, she chose K as the second initial of her pen name, from her paternal grandmother Kathleen Ada Bulgen Rowling. She calls herself Jo and has said, "No one ever called me \'Joanne\' when I was young, unless they were angry." Following her marriage, she has sometimes used the name Joanne Murray when conducting personal business. During the Leveson Inquiry she gave evidence under the name of Joanne Kathleen Rowling. In a 2012 interview, Rowling noted that she no longer cared that people pronounced her name incorrectly. Rowling was born to Peter James Rowling, a Rolls-Royce aircraft engineer, and Anne Ro

Let's process the text with spaCy and store the result in a `doc` object, which is a container for accessing linguistic annotations.

In [None]:
doc = nlp(biog)

In spaCy, the `sents` property is used to extract sentences. Here's how you would extract the sentences for a given input text (we also keep sentence 7 in `mysent` for later use):

In [None]:
for idx, sent in enumerate(doc.sents):
    print(f'[{idx:2d}] >', sent)
    if idx == 7:
        mysent = sent

[ 0] > See also: Robert GalbraithAlthough she writes under the pen name J.K. Rowling, pronounced like rolling, her name when her first Harry Potter book was published was simply Joanne Rowling.
[ 1] > Anticipating that the target audience of young boys might not want to read a book written by a woman, her publishers demanded that she use two initials, rather than her full name.
[ 2] > As she had no middle name, she chose K as the second initial of her pen name, from her paternal grandmother Kathleen Ada Bulgen Rowling.
[ 3] > She calls herself Jo and has said, "No one ever called me 'Joanne' when I was young, unless they were angry.
[ 4] > " Following her marriage, she has sometimes used the name Joanne Murray when conducting personal business.
[ 5] > During the Leveson Inquiry she gave evidence under the name of Joanne Kathleen Rowling.
[ 6] > In a 2012 interview, Rowling noted that she no longer cared that people pronounced her name incorrectly.
[ 7] > Rowling was born to Peter James

Here's the sentence we will work on moving forward:

In [None]:
mysent

Rowling was born to Peter James Rowling, a Rolls-Royce aircraft engineer, and Anne Rowling (née Volant), on 31 July 1965 in Yate, Gloucestershire, England, 10 miles (16 km) northeast of Bristol.

## Tokenization & POS tagging

Tokenization is the next step after sentence detection. It allows you to identify the basic units in your text. These basic units are called tokens. Tokenization is useful because it breaks a text into meaningful units. These units are used for further analysis, like part of speech tagging.
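As a minimal sketch of tokenization on its own, the snippet below uses a blank English pipeline, which contains only spaCy's rule-based tokenizer and needs no model download. The variable name `blank_nlp` and the shortened sample text are assumptions for illustration.

```python
import spacy

# A blank English pipeline has a tokenizer but no tagger, parser, or NER,
# so this runs without downloading a pretrained model.
blank_nlp = spacy.blank("en")
tokens = [tok.text for tok in blank_nlp("Peter James Rowling, a Rolls-Royce aircraft engineer")]
print(tokens)
```

Punctuation such as the comma becomes a token of its own, and a hyphenated name like Rolls-Royce is split around the hyphen.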

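A part of speech (POS) is a grammatical role that explains how a particular word is used in a sentence. Traditional English grammar lists eight parts of speech: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and interjection. spaCy's coarse-grained `pos_` labels follow the finer Universal POS tag set (for example `PROPN`, `AUX`, `ADP`), while `tag_` holds the more detailed fine-grained tag.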
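A quick way to read the tag names that appear in spaCy output is `spacy.explain`, which looks a label up in spaCy's built-in glossary. The sketch below mixes coarse-grained labels (as returned by `pos_`) and fine-grained ones (as returned by `tag_`); the particular labels are just examples.

```python
import spacy

# spacy.explain reads from spaCy's glossary, so no pretrained model is needed.
for label in ["PROPN", "AUX", "ADP", "NNP", "VBD"]:
    print(f"{label:6s} {spacy.explain(label)}")
```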

You can print the tokens and their POS tags by iterating over the sentence (`mysent`):

In [None]:
for w in mysent:
    print(f'{w.text:15s} [{w.tag_:5s} | {w.pos_:6s} | {spacy.explain(w.tag_)}]')

Rowling         [NNP   | PROPN  | noun, proper singular]
was             [VBD   | AUX    | verb, past tense]
born            [VBN   | VERB   | verb, past participle]
to              [IN    | ADP    | conjunction, subordinating or preposition]
Peter           [NNP   | PROPN  | noun, proper singular]
James           [NNP   | PROPN  | noun, proper singular]
Rowling         [NNP   | PROPN  | noun, proper singular]
,               [,     | PUNCT  | punctuation mark, comma]
a               [DT    | DET    | determiner]
Rolls           [NNP   | PROPN  | noun, proper singular]
-               [HYPH  | PUNCT  | punctuation mark, hyphen]
Royce           [NNP   | PROPN  | noun, proper singular]
aircraft        [NN    | NOUN   | noun, singular or mass]
engineer        [NN    | NOUN   | noun, singular or mass]
,               [,     | PUNCT  | punctuation mark, comma]
and             [CC    | CCONJ  | conjunction, coordinating]
Anne            [NNP   | PROPN  | noun, proper singular]
Rowling       

## Relation Extraction & Dependency Parsing

POS tags alone are not sufficient in many cases; relation extraction also needs to know how the words relate to each other. Dependency parsing represents the grammatical structure of a sentence as a set of labeled relations between each word and its head. Now, let's print the dependency label of each token in our sentence:

In [None]:
for w in mysent: 
    print(f'{w.text:15s} [{w.dep_}]')

Rowling         [nsubjpass]
was             [auxpass]
born            [ROOT]
to              [prep]
Peter           [compound]
James           [compound]
Rowling         [pobj]
,               [punct]
a               [det]
Rolls           [compound]
-               [punct]
Royce           [compound]
aircraft        [compound]
engineer        [appos]
,               [punct]
and             [cc]
Anne            [compound]
Rowling         [conj]
(               [punct]
née             [compound]
Volant          [appos]
)               [punct]
,               [punct]
on              [prep]
31              [nummod]
July            [pobj]
1965            [nummod]
in              [prep]
Yate            [pobj]
,               [punct]
Gloucestershire [conj]
,               [punct]
England         [conj]
,               [punct]
10              [nummod]
miles           [appos]
(               [punct]
16              [nummod]
km              [appos]
)               [punct]
northeast       [advmod]
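To show how these labels can feed relation extraction, here is a small sketch that works on (token, label) pairs copied from the start of the output above. The helper `extract_triple` is hypothetical, and it relies on word order; code working on a real parse would follow each token's `head` link instead.

```python
# (token, dependency label) pairs copied from the first lines of the output above.
parsed = [
    ("Rowling", "nsubjpass"), ("was", "auxpass"), ("born", "ROOT"),
    ("to", "prep"), ("Peter", "compound"), ("James", "compound"),
    ("Rowling", "pobj"),
]

def extract_triple(pairs):
    """Return (subject, predicate, object) using dependency labels only."""
    subject = next(text for text, dep in pairs if dep == "nsubjpass")
    predicate = next(text for text, dep in pairs if dep == "ROOT")
    obj_idx = next(i for i, (_, dep) in enumerate(pairs) if dep == "pobj")
    start = obj_idx
    # Walk left over the compound modifiers that precede the object.
    while start > 0 and pairs[start - 1][1] == "compound":
        start -= 1
    obj = " ".join(text for text, _ in pairs[start:obj_idx + 1])
    return subject, predicate, obj

print(extract_triple(parsed))
```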

## Visualization: Using displaCy

spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

In [None]:
from spacy import displacy
options = {"distance": 120}
displacy.render(mysent, style="dep", options=options)
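displaCy can also highlight named entities. As a sketch that does not depend on a model's predictions, the snippet below uses displaCy's manual mode, where the entity span is supplied by hand; the sample text and the single `ORG` span are assumptions for illustration.

```python
from spacy import displacy

text = "Peter James Rowling, a Rolls-Royce aircraft engineer"
start = text.find("Rolls-Royce")

# Manual mode: we provide character offsets and a label ourselves
# instead of passing a processed Doc.
example = {
    "text": text,
    "ents": [{"start": start, "end": start + len("Rolls-Royce"), "label": "ORG"}],
    "title": None,
}

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:80])
```

Inside a notebook you would normally pass a processed `doc` or sentence with `style="ent"` and let displaCy display the result inline.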

## Entity recognition

Entity recognition is the process of classifying named entities found in a text into pre-defined categories, such as persons, places, organizations, and dates. spaCy uses a statistical model to classify a broad range of entities, including persons, events, works of art, and nationalities or religious groups.

Let's access the identified entities through the `.ents` property, which is available on both the `doc` and the sentence object. Each entity in `mysent.ents` is a span with its own attributes, specifically `.label_`, which holds the entity type:

In [None]:
for ent in mysent.ents:
    print(f'{ent.text:15s} [{ent.label_}]')

Peter James Rowling [PERSON]
Rolls-Royce     [ORG]
Anne Rowling    [PERSON]
Volant          [PERSON]
31 July 1965    [DATE]
Yate            [PERSON]
England         [GPE]
10 miles        [QUANTITY]
16 km           [QUANTITY]
Bristol         [ORG]
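Grouping the entities by label makes the result easier to inspect. The sketch below uses only the standard library and the (text, label) pairs copied from the output above; the variable names are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entities = [
    ("Peter James Rowling", "PERSON"), ("Rolls-Royce", "ORG"),
    ("Anne Rowling", "PERSON"), ("Volant", "PERSON"),
    ("31 July 1965", "DATE"), ("Yate", "PERSON"), ("England", "GPE"),
    ("10 miles", "QUANTITY"), ("16 km", "QUANTITY"), ("Bristol", "ORG"),
]

by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label:9s} {texts}")
```

Grouped this way, it is easy to spot that the model labeled the town of Yate as `PERSON` and the city of Bristol as `ORG`. Statistical predictions like these are worth checking before they go into a knowledge graph.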


## Rule-Based Matching

Rule-based matching is one of the steps in extracting information from unstructured text. It’s used to identify and extract tokens and phrases according to patterns (such as lowercase) and grammatical features (such as part of speech).

Rule-based matching can use regular expressions to extract entities or relations from unstructured text. It differs from extraction with regular expressions alone, because plain regular expressions don't consider the lexical and grammatical attributes of the text.
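For comparison, here is what a regular-expression-only rule might look like, using just Python's `re` module. The pattern and the shortened sample sentence are assumptions for illustration.

```python
import re

sentence = "Rowling was born to Peter James Rowling, a Rolls-Royce aircraft engineer."

# Surface-level rule: capitalized words following "was born to".
# The regex sees characters only; it cannot ask whether a word is a proper noun.
pattern = re.compile(r"was born to ((?:[A-Z][a-z]+ ?)+)")
match = pattern.search(sentence)
parents = match.group(1).strip() if match else None
print(parents)
```

This works here because the name happens to be a run of capitalized words. Any capitalized word in that position would be captured as well, which is the gap that token attributes such as POS tags or entity types are meant to close.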

The spaCy library comes with a `Matcher` tool that can be used to specify custom rules for phrase matching. Using the `Matcher` is fairly straightforward. First, let's look at the tokens of our sentence and their POS tags, which the patterns will refer to:

In [None]:
for tok in mysent:
    print(f'{tok} ({tok.pos_}) ', end='')

Rowling (PROPN) was (AUX) born (VERB) to (ADP) Peter (PROPN) James (PROPN) Rowling (PROPN) , (PUNCT) a (DET) Rolls (PROPN) - (PUNCT) Royce (PROPN) aircraft (NOUN) engineer (NOUN) , (PUNCT) and (CCONJ) Anne (PROPN) Rowling (PROPN) ( (PUNCT) née (PROPN) Volant (PROPN) ) (PUNCT) , (PUNCT) on (ADP) 31 (NUM) July (PROPN) 1965 (NUM) in (ADP) Yate (PROPN) , (PUNCT) Gloucestershire (PROPN) , (PUNCT) England (PROPN) , (PUNCT) 10 (NUM) miles (NOUN) ( (PUNCT) 16 (NUM) km (NOUN) ) (PUNCT) northeast (ADV) of (ADP) Bristol (PROPN) . (PUNCT) 

In [None]:
from spacy.matcher import Matcher

# Define the patterns
patterns = [
    # a proper noun, the word "married", then a token inside a PERSON entity
    [{'POS': 'PROPN'}, {'LOWER': 'married'}, {'ENT_TYPE': 'PERSON'}],
    # a proper noun, "was born to", then one or more proper nouns
    [{'POS': 'PROPN'}, {'LOWER': 'was'}, {'LOWER': 'born'}, {'LOWER': 'to'}, {'POS': 'PROPN', 'OP': '+'}],
]

# Create the Matcher and register both patterns under one rule name
matcher = Matcher(nlp.vocab)
matcher.add("matching_2", patterns)

# Each match is a (match_id, start, end) tuple; 'OP': '+' can yield several overlapping matches
matches = matcher(doc)
last_match = matches[-1]
span = doc[last_match[1]:last_match[2]]
print(span.text)

Rowling was born to Peter James Rowling
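Because the `+` operator produces overlapping candidates, picking one by position in the match list can be fragile. As a sketch of an alternative, the snippet below asks the `Matcher` to keep only the longest match via `greedy="LONGEST"`. It runs on a blank pipeline, so the pattern uses lexical attributes (`LOWER`, `IS_TITLE`) rather than POS tags; the rule name `BORN_TO` and the shortened sample text are assumptions.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, so POS tags are unavailable here and
# the pattern relies on lexical attributes instead.
blank_nlp = spacy.blank("en")
sample = blank_nlp("Rowling was born to Peter James Rowling, a Rolls-Royce aircraft engineer.")

pattern = [
    {"IS_TITLE": True},
    {"LOWER": "was"}, {"LOWER": "born"}, {"LOWER": "to"},
    {"IS_TITLE": True, "OP": "+"},
]

matcher = Matcher(blank_nlp.vocab)
# greedy="LONGEST" keeps only the longest of the overlapping candidates.
matcher.add("BORN_TO", [pattern], greedy="LONGEST")

spans = [sample[start:end].text for _, start, end in matcher(sample)]
print(spans)
```

With the full `en_core_web_sm` pipeline, the same `greedy` argument can be combined with the POS-based patterns shown earlier.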


**Notes**:
- You can find additional examples and use-cases in [spaCy's documentation](https://spacy.io/usage/rule-based-matching).
- You can use the online [Rule-based Matcher Explorer](https://explosion.ai/demos/matcher) to test spaCy's rule-based `Matcher` by creating token patterns interactively and executing them.
- Here's a nice [article](https://stackabuse.com/python-for-nlp-vocabulary-and-phrase-matching-with-spacy/) you can review. It explores vocabulary and phrase matching with the spaCy library by defining patterns and detecting the phrases that match them.

Now you know how to perform some basic NLP tasks: sentence segmentation, tokenization, POS tagging, dependency parsing, entity recognition, and, most importantly, rule-based matching. You now know enough to identify entities and the relations between them, and to extract structured data that can be used for downstream applications, such as building a Knowledge Graph! Congratulations!
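As a final sketch of what "structured data" can look like, the snippet below turns the matched span from the rule-based matching section into a subject, predicate, object triple and writes it as tab-separated values with the standard `csv` module. The predicate name `born_to` and the column names are hypothetical choices, not part of the notebook.

```python
import csv
import io

# Text of the span found by the Matcher earlier in the notebook.
span_text = "Rowling was born to Peter James Rowling"
subject, obj = span_text.split(" was born to ")
triples = [(subject, "born_to", obj)]  # "born_to" is a hypothetical predicate name

buffer = io.StringIO()  # stands in for a real file opened with newline=''
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["subject", "predicate", "object"])
writer.writerows(triples)
print(buffer.getvalue())
```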

You can start applying this knowledge to the tasks you are required to do for Homework 02 of the class :)