# Introduction


This notebook is a practice space for spaCy, an NLP library. It experiments with several of the tool's features:

* exploring available data structures
* learning about recognizing patterns
* investigating similarities between sentences and spans
* rule-based matching and writing custom rules
* learning about the role of entities.


# Importing libraries

In [154]:
import spacy

#### Ensuring that the small English model loads correctly

In [4]:
nlp = spacy.load("en_core_web_sm")

# 1. Exploring basic features of spaCy

Create a blank pipeline of a given language class (English). A blank pipeline contains only the tokenizer, with no tagger, parser or entity recognizer.

In [5]:
nlp = spacy.blank("en")

Define a Doc and iterate over tokens (words, symbols):

In [80]:
doc = nlp("I had a wonderful day today! I went on a walk, read a book and listened to music.")

for token in doc:
    print(token.text)

I
had
a
wonderful
day
today
!
I
went
on
a
walk
,
read
a
book
and
listened
to
music
.


In [81]:
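# doc.sents needs sentence boundaries, so this cell requires a pipeline with a
# parser or sentencizer (e.g. en_core_web_sm); the blank pipeline raises an error here
for sentence in doc.sents:
    print(sentence)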

I had a wonderful day today!
I went on a walk, read a book and listened to music.


In [82]:
for sentence in doc.sents:
    for word in sentence:
        print(word)

I
had
a
wonderful
day
today
!
I
went
on
a
walk
,
read
a
book
and
listened
to
music
.


A slice of a Doc is a Span:

In [7]:
span = doc[1:4]
print(span)

had a wonderful


In [8]:
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4, 5, 6]
Text:     ['I', 'had', 'a', 'wonderful', 'day', 'today', '!']
is_alpha: [True, True, True, True, True, True, False]
is_punct: [False, False, False, False, False, False, True]
like_num: [False, False, False, False, False, False, False]


Three token attributes as examples:
* `is_alpha` - whether the token consists of alphabetic characters
* `is_punct` - whether it's punctuation
* `like_num` - whether it resembles a number

Processing a simple example searching for a percentage symbol (%):

In [9]:
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are.")

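for token in doc:
    # Guard against a number in final position, where doc[token.i + 1] would fail
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            print("Percentage found:", token.text)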

Percentage found: 60
Percentage found: 4


### Trained pipelines

In [10]:
doc = nlp("Anna is a lawyer and she lives in Sydney")

#### Part of speech (`pos_`)

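Attributes that return strings usually end with an underscore; attributes without the underscore return an integer ID value.

Note that the cells in this section were run while `nlp` was still the blank pipeline from `spacy.blank("en")`. It has no tagger, parser or entity recognizer, so `pos_` and `dep_` print as empty strings, each token's head is the token itself, and `doc.ents` is empty. Reloading `en_core_web_sm` first would populate these attributes.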

In [11]:
for token in doc:
    print(token.text, token.pos_) 

Anna 
is 
a 
lawyer 
and 
she 
lives 
in 
Sydney 
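
A minimal sketch of the underscore convention described above. It uses `orth` / `orth_` only because they work on a blank pipeline (that choice is an assumption made for illustration); the same pattern applies to `pos` / `pos_` once a trained pipeline is loaded.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; enough for this check
doc = nlp("Anna lives in Sydney")
token = doc[0]

# Without the underscore: an integer hash ID
print(type(token.orth).__name__, token.orth)
# With the underscore: the readable string
print(type(token.orth_).__name__, token.orth_)
# The vocab's string store maps the ID back to the string
print(nlp.vocab.strings[token.orth] == token.orth_)
```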


#### Dependency (`dep_`) - how the words are related  

* nominal subject
* direct object
* determiner (article)
* ...

In [12]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text) # parent token

Anna   Anna
is   is
a   a
lawyer   lawyer
and   and
she   she
lives   lives
in   in
Sydney   Sydney


#### Entities

In [13]:
doc = nlp("Anna is a lawyer and she lives in Sydney, she earns $1 million per year")

In [14]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [15]:
spacy.explain("pobj")

'object of preposition'

### Finding patterns - official spaCy tutorial, own example

In [24]:
from spacy.matcher import Matcher

In [39]:
matcher = Matcher(nlp.vocab)

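# Punctuation and spacing matter: the three string literals are joined without a
# space between sentences, so "calories.1" stays a single token and
# "298 calories" does not appear among the matches below.
doc = nlp(
    "A banana has only 105 calories, whereas one slice of pizza has 298 calories."
    "1 cup of rice has 205 calories, which might be a good foundation of your dinner."
    "Nuts are a good source of energy, 1 ounce has 168 calories"
)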

pattern = [{"IS_DIGIT": True}, {"TEXT": "calories"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("CALORIES_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: 105 calories
Match found: 205 calories
Match found: 168 calories


## Data Structures

In [41]:
nlp.vocab.strings.add("coffee")
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

In [44]:
string = nlp.vocab.strings[3197928453018144401]

Lexemes are context-independent entries in the vocabulary. You can get a lexeme by looking up a string or a hash ID in the vocab. Lexemes expose attributes, just like tokens.

Hashes can't be reversed on their own: a vocab can only turn a hash back into a string it has already seen. To avoid a failed lookup, add the word to the new vocab by processing a text or looking up the string, or use the same vocab to resolve the hash back to a string.
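
The two points above (lexemes as vocab entries, and hashes that only resolve in a vocab that has seen the string) can be sketched as follows. The empty `Vocab()` stands in for "a new vocab" and is an assumption made for the demonstration.

```python
import spacy
from spacy.vocab import Vocab

nlp = spacy.blank("en")
doc = nlp("I love coffee")  # processing text adds its words to the vocab

coffee_hash = nlp.vocab.strings["coffee"]
print(nlp.vocab.strings[coffee_hash])  # same vocab: the hash resolves to the string

# A lexeme is the context-independent vocab entry for a word
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth == coffee_hash, lexeme.is_alpha)

# A fresh, empty vocab has never seen the string, so the lookup fails
empty_vocab = Vocab()
try:
    empty_vocab.strings[coffee_hash]
    resolved = True
except KeyError:
    resolved = False
print("Resolved in empty vocab:", resolved)
```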

### Spans

In [47]:
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

In [48]:
span_with_label

Hello world

In [49]:
doc.ents

(Hello world,)

#### Creating a Doc

In [53]:
from spacy.tokens import Doc
nlp = spacy.blank("en")

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!


#### Creating a Span, adding the span to the doc's entities

In [54]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


In [55]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Look at every token except the last one, so that doc[token.i + 1] always exists
for token in doc[:-1]:
    # Check if the current token is a proper noun followed by a verb
    if token.pos_ == "PROPN" and doc[token.i + 1].pos_ == "VERB":
        print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


## Similarity between sentences and words

In [57]:
# Load a larger pipeline with vectors
nlp = spacy.load("en_core_web_md")

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8698332283318978
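
By default, `similarity` is the cosine similarity between vectors, and a `Doc` vector is the average of its token vectors. Below is a small sketch of that computation in plain Python. The 3-dimensional vectors and the helper names `cosine_similarity` and `average` are made up for illustration; the real `en_core_web_md` vectors have 300 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average(vectors):
    """Element-wise mean, mirroring how a Doc vector averages token vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy "word vectors" (invented values, for illustration only)
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "burger": [0.8, 0.2, 0.1],
    "piano": [0.0, 0.2, 0.9],
}

food_doc = average([toy_vectors["pizza"], toy_vectors["burger"]])
sim_food = cosine_similarity(toy_vectors["pizza"], toy_vectors["burger"])
sim_other = cosine_similarity(toy_vectors["pizza"], toy_vectors["piano"])
print(sim_food)
print(sim_other)
print(cosine_similarity(food_doc, toy_vectors["pizza"]))
```

Vectors that point in a similar direction score close to 1, unrelated ones close to 0, which is why "pizza" and "burger" come out far more similar than "pizza" and "piano" in this toy setup.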


#### Word vectors

In [None]:
doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector)

## Rule-based Matching

In [58]:
# Initialize with the shared vocab
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"LOWER": "cats"}]
matcher.add("LOVE_CATS", [pattern])

# Operators can specify how often a token should be matched
pattern = [{"TEXT": "very", "OP": "+"}, {"TEXT": "happy"}]
matcher.add("VERY_HAPPY", [pattern])

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

In [59]:
matches

[(9137535031263442622, 1, 3),
 (2447047934687575526, 7, 9),
 (2447047934687575526, 6, 9)]
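
The first number in each tuple is the hash of the rule name, and the `+` operator produces overlapping matches ("very happy" and "very very happy"). A short sketch of how to read such results, run on a blank pipeline since `TEXT` patterns need only the tokenizer. The second matcher uses the `greedy="LONGEST"` option of `Matcher.add` to keep only the longest match; using it here is an illustrative choice, not something the cells above do.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
pattern = [{"TEXT": "very", "OP": "+"}, {"TEXT": "happy"}]

matcher = Matcher(nlp.vocab)
matcher.add("VERY_HAPPY", [pattern])

doc = nlp("I'm very very happy")
results = []
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # resolve the hash to the rule name
    results.append((rule_name, doc[start:end].text))
    print(rule_name, start, end, doc[start:end].text)

# Same pattern, but overlapping matches are filtered to the longest one
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("VERY_HAPPY", [pattern], greedy="LONGEST")
longest = [doc[start:end].text for _, start, end in longest_matcher(doc)]
print(longest)
```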

### Adding statistical prediction

In [None]:
matcher = Matcher(nlp.vocab)
matcher.add("DOG", [[{"LOWER": "golden"}, {"LOWER": "retriever"}]])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span:", span.text)
    # Get the span's root token and root head token
    print("Root token:", span.root.text)
    print("Root head token:", span.root.head.text)
    # Get the previous token and its POS tag
    print("Previous token:", doc[start - 1].text, doc[start - 1].pos_)

## Phrase matcher

The phrase matcher is another helpful tool to find sequences of words in your data. It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context. 
It takes Doc objects as patterns. It's also really fast. This makes it very useful for matching large dictionaries and word lists on large volumes of text.

In [62]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", [pattern])
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print("Matched span:", span.text)

Matched span: Golden Retriever
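
For larger word lists, a common variation is to match case-insensitively and to build the patterns with the tokenizer only. The sketch below assumes a blank pipeline and a made-up term list; `attr="LOWER"` makes the matcher compare lowercase forms, and `nlp.make_doc` skips the rest of the pipeline when creating patterns.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" compares lowercase forms instead of the exact text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["Golden Retriever", "cat"]  # hypothetical word list
# make_doc only runs the tokenizer, which is all the patterns need here
matcher.add("ANIMAL", [nlp.make_doc(term) for term in terms])

doc = nlp("I have a golden retriever and a CAT")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```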


### Exercise

In [61]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}] # amazon needs to be lowercase
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}] # separating "ad" and "free" in "ad-free"

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN2", [pattern2])

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


### Summary of pipeline components

* tagger (Part-of-speech tagger) --> `Token.tag`, `Token.pos`
* parser (Dependency parser) --> `Token.dep`, `Token.head`, `Doc.sents`, `Doc.noun_chunks`
* ner (Named entity recognizer) --> `Doc.ents`, `Token.ent_iob`, `Token.ent_type`
* textcat (Text classifier) --> `Doc.cats`

List of pipeline component names: `nlp.pipe_names` 
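
A quick way to see how components change what a `Doc` offers is to start from a blank pipeline and add one. This sketch uses the built-in rule-based `sentencizer` (chosen here only because it needs no trained model), after which `Doc.sents` becomes available.

```python
import spacy

nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)
print("Before:", names_before)

# "sentencizer" is a built-in rule-based component that sets sentence boundaries
nlp.add_pipe("sentencizer")
print("After:", nlp.pipe_names)

doc = nlp("I had a wonderful day today! I went on a walk.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```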

## Custom components


A pipeline component is a function or callable that takes a doc, modifies it and returns it, so it can be processed by the next component in the pipeline.

To tell spaCy where to find your custom component and how it should be called, you can decorate it using the `@Language.component` decorator. Just add it to the line right above the function definition.

Once a component is registered, it can be added to the pipeline using the `nlp.add_pipe` method. The method takes at least one argument: the string name of the component.

In [69]:
from spacy import Language

@Language.component("custom_component")
def custom_component_function(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe("custom_component")

# Print the pipeline component names
print("Pipeline:", nlp.pipe_names)

ValueError: [E007] 'custom_component' already exists in pipeline. Existing names: ['tok2vec', 'tagger', 'parser', 'senter', 'attribute_ruler', 'lemmatizer', 'ner', 'custom_component']
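
The `E007` error above is what `add_pipe` raises when a component with that name is already in the pipeline, which typically happens when a notebook cell is run a second time. One possible guard is to check `nlp.pipe_names` first. The component name `token_counter_sketch` and the use of `doc.user_data` are made up for this sketch.

```python
import spacy
from spacy.language import Language

@Language.component("token_counter_sketch")
def token_counter_sketch(doc):
    # Store the token count on the doc instead of printing it
    doc.user_data["n_tokens"] = len(doc)
    return doc

nlp = spacy.blank("en")

# Simulate running the same cell twice: the guard makes the second run a no-op
for _ in range(2):
    if "token_counter_sketch" not in nlp.pipe_names:
        nlp.add_pipe("token_counter_sketch")

doc = nlp("This is a sentence.")
print("Pipeline:", nlp.pipe_names)
print("Tokens:", doc.user_data["n_tokens"])
```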

In [68]:
# Examples
# nlp.add_pipe("component", last=True)
# nlp.add_pipe("component", first=True)
# nlp.add_pipe("component", before="ner")
# nlp.add_pipe("component", after="tagger")

In [70]:
import spacy
from spacy.language import Language

# Define the custom component
@Language.component("length_component")
def length_component_function(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe("length_component", first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")

['length_component', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
This document is 5 tokens long.


In [79]:
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

# Define the custom component
@Language.component("animal_component")
def animal_component_function(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

# Add the component to the pipeline after the "ner" component
nlp.add_pipe("animal_component", after="ner")
print(nlp.pipe_names)

# Process the text and print the text of each entity in doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([ent.text for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
['cat', 'Golden Retriever']
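
Because the component overwrites `doc.ents`, any entities set by the earlier `ner` step are discarded. If they should be kept, one option is to merge both lists and let `spacy.util.filter_spans` remove overlaps. The sketch below runs on a blank pipeline, so the `PERSON` entity is set by hand as a stand-in for a real NER prediction.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("David Bowie has a cat")

# Stand-in for an entity that an earlier "ner" component would have set
doc.ents = [Span(doc, 0, 2, label="PERSON")]

# New span, as a (hypothetical) animal matcher would produce it
animal_spans = [Span(doc, 4, 5, label="ANIMAL")]

# Keep existing entities; filter_spans drops overlaps, preferring longer spans
doc.ents = filter_spans(list(doc.ents) + animal_spans)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```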


## Sample operations on a TXT file

Based on the YouTube course: Natural Language Processing with spaCy & Python - Course for Beginners (author: freeCodeCamp.org)

Source of data: [Kaggle - Harry Potter Books](https://www.kaggle.com/datasets/balabaskar/harry-potter-books-corpora-part-1-7/versions/2?resource=download) (smaller sample of Book one)



In [138]:
with open('data/Book1_s.txt', "r", encoding="utf8") as f:
    text = f.read()

In [139]:
text

'HE BOY WHO LIVED\n\n(short version of Book 1 file)\n\nMr. and Mrs. Dursley, of number four, Privet Drive, \nwere proud to say that they were perfectly normal, \nthank you very much. They were the last people you’d \nexpect to be involved in anything strange or \nmysterious, because they just didn’t hold with such \nnonsense. \n\nMr. Dursley was the director of a firm called \nGrunnings, which made drills. He was a big, beefy \nman with hardly any neck, although he did have a \nvery large mustache. Mrs. Dursley was thin and \nblonde and had nearly twice the usual amount of \nneck, which came in very useful as she spent so \nmuch of her time craning over garden fences, spying \non the neighbors. The Dursley s had a small son \ncalled Dudley and in their opinion there was no finer \nboy anywhere. \n\nThe Dursleys had everything they wanted, but they \nalso had a secret, and their greatest fear was that \nsomebody would discover it. They didn’t think they \ncould bear it if anyone found o

In [140]:
doc = nlp(text)

In [141]:
print(len(text))
print(len(doc))

19142
4485


Why are these numbers different?

In [142]:
for token in text[:10]:
    print(token)

# Iterating over the raw string yields single characters, not tokens.

H
E
 
B
O
Y
 
W
H
O


In [144]:
for token in doc[:10]:
    print(token)
    
# Tokens in the Doc object are words and punctuation marks!

HE
BOY
WHO
LIVED



(
short
version
of
Book


Each word and punctuation mark is a separate token. Importantly, spaCy recognizes periods that belong to a word, as in "Mr." or "U.S.A.", and keeps them inside that token instead of splitting them off.

In [145]:
for token in text.split()[:10]:
    print(token)

HE
BOY
WHO
LIVED
(short
version
of
Book
1
file)


With a plain `str.split()`, parentheses stay attached to the neighbouring words.
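
The difference between whitespace splitting and spaCy's tokenizer can be seen side by side on a short string. The sample sentence is stitched together from fragments of the text above and is only an illustration.

```python
import spacy

nlp = spacy.blank("en")  # the tokenizer is all that is needed here
sample = "(short version of Book 1 file) Mr. Dursley made drills."

split_tokens = sample.split()
spacy_tokens = [token.text for token in nlp(sample)]

print(split_tokens)
print(spacy_tokens)
```

`split()` leaves `(short`, `file)` and `drills.` glued together, while the tokenizer separates the brackets and the final period but keeps `Mr.` as one token.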

### Sentences:

In [146]:
for sent in doc.sents:
    print(sent)

HE BOY WHO LIVED

(short version of Book 1 file)


Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much.
They were the last people you’d 
expect to be involved in anything strange or 
mysterious, because they just didn’t hold with such 
nonsense. 


Mr. Dursley was the director of a firm called 
Grunnings, which made drills.
He was a big, beefy 
man with hardly any neck, although he did have a 
very large mustache.
Mrs. Dursley was thin and 
blonde and had nearly twice the usual amount of 
neck, which came in very useful as she spent so 
much of her time craning over garden fences, spying 
on the neighbors.
The Dursley s had a small son 
called Dudley and in their opinion there was no finer 
boy anywhere. 


The Dursleys had everything they wanted, but they 
also had a secret, and their greatest fear was that 
somebody would discover it.
They didn’t think they 
could bear it if anyone found out about the Potters. 


The text has been tokenized, and we printed the sentences of this document.

Let's extract a sentence. `doc.sents` is a generator, so we cannot simply write `doc.sents[3]`; we first need to convert it to a list.

In [155]:
sentence1 = list(doc.sents)[3]

In [156]:
sentence1

Mr. Dursley was the director of a firm called 
Grunnings, which made drills.
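
A small sketch of why indexing fails and two ways around it. It uses a blank pipeline with the rule-based `sentencizer` and an invented three-sentence text, so no trained model is required. The `islice` variant is an optional alternative that avoids building the full list.

```python
import spacy
from itertools import islice

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries
doc = nlp("First sentence. Second sentence. Third sentence.")

try:
    doc.sents[1]
    indexable = True
except TypeError:
    indexable = False  # generators do not support indexing
print("Indexable:", indexable)

# Option 1: materialize every sentence, then index
second = list(doc.sents)[1]
# Option 2: skip ahead lazily without building the whole list
second_lazy = next(islice(doc.sents, 1, None))
print(second.text, "|", second_lazy.text)
```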

### Setting custom attributes

In [159]:
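# Custom attributes live on the `._` property, but they must be registered with
# set_extension first (see the next cell). The error below arises because `token`
# still holds a str from the earlier `text.split()` loop, not a spaCy Token.
doc._.title = "My document"
token._.is_color = True
span._.has_color = False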

AttributeError: 'str' object has no attribute '_'

In [160]:
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension("title", default=None)
Token.set_extension("is_color", default=False)
Span.set_extension("has_color", default=False)

ValueError: [E090] Extension 'title' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.
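
The `E090` error shows up whenever a registration cell is run again in the same session. Besides `force=True`, one possible guard is to check `has_extension` before registering. A minimal sketch, assuming a blank pipeline:

```python
import spacy
from spacy.tokens import Doc

# Register only if missing, so re-running the cell does not raise E090
if not Doc.has_extension("title"):
    Doc.set_extension("title", default=None)

# Running the same guarded registration again is harmless
if not Doc.has_extension("title"):
    Doc.set_extension("title", default=None)

nlp = spacy.blank("en")
doc = nlp("The sky is blue.")
doc._.title = "My document"  # works because the extension is registered
print(doc._.title)
```

The guard keeps an existing extension untouched, whereas `force=True` replaces it, which matters if the default or getter has changed between runs.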

### Set a default value that can be overwritten

In [162]:
from spacy.tokens import Token

# Set extension on the Token with default value
Token.set_extension("is_color", default=False, force=True)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True

In [164]:
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ["red", "yellow", "blue"]
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension("is_color", getter=get_is_color, force=True)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, "-", doc[3].text)

True - blue
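
The same getter idea extends to the `has_color` span attribute mentioned earlier: a span getter can reuse the token-level attribute. The sketch below is one possible way to wire this up on a blank pipeline; lowercasing via `token.lower_` and the helper name `get_has_color` are choices made here, not part of the cells above.

```python
import spacy
from spacy.tokens import Span, Token

COLORS = {"red", "yellow", "blue"}

def get_is_color(token):
    # Compare the lowercase form so "Blue" would also count
    return token.lower_ in COLORS

def get_has_color(span):
    # A span has a color if any of its tokens is a color
    return any(token._.is_color for token in span)

# force=True keeps the cell re-runnable within the same session
Token.set_extension("is_color", getter=get_is_color, force=True)
Span.set_extension("has_color", getter=get_has_color, force=True)

nlp = spacy.blank("en")
doc = nlp("The sky is blue.")
with_color = doc[1:4]._.has_color     # span "sky is blue"
without_color = doc[0:2]._.has_color  # span "The sky"
print(with_color, "-", doc[1:4].text)
print(without_color, "-", doc[0:2].text)
```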


In [None]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension("has_token", method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token("blue"), "- blue")
print(doc._.has_token("cloud"), "- cloud")