### Readme:

This notebook documents what I learnt from https://course.spacy.io/en/. It contains notes, sample code, and sample problems from the course.

Special thanks to the content creators and the presenter, Ines.

This notebook is intended for self-study, not for redistributing the course content.
If you want to learn more about spaCy, please visit https://spacy.io/ or https://course.spacy.io/en/

Thank you!

### Chapter 2: Large-scale data analysis with spaCy

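Vocab: stores data shared across multiple documents

Strings are encoded to hash values

Each string is stored only once, in the StringStore, available via
nlp.vocab.strings

String store: lookup table that works in both directions (string to hash, and hash to string)

spaCy internally works with the hash IDs

Hashes can't be reversed, so a hash can only be resolved to a string that the string store has already seen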
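
To make the "lookup table in both directions" idea concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the class name `ToyStringStore` is made up for this note, and SHA-256 is only a stand-in for whatever hash function spaCy uses internally.

```python
import hashlib


class ToyStringStore:
    """Toy two-way lookup table (hypothetical; NOT spaCy's implementation)."""

    def __init__(self):
        self._hash_to_string = {}

    @staticmethod
    def hash_string(text):
        # Derive a stable 64-bit integer from the text.
        # SHA-256 is only a stand-in for spaCy's own hash function.
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        key = self.hash_string(text)
        self._hash_to_string[key] = text  # each string is stored only once
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # string -> hash can always be computed
            return self.hash_string(key)
        # hash -> string only works for strings that were added before
        return self._hash_to_string[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
coffee_string = store[coffee_hash]
print(coffee_hash, coffee_string)
```

Looking up a hash that was never added raises a `KeyError`, which is the toy version of "hashes can't be reversed".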

In [5]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [7]:
coffee_hash = nlp.vocab.strings["coffee"]

In [8]:
coffee_hash # hash value

3197928453018144401

In [9]:
coffee_string = nlp.vocab.strings[coffee_hash]

In [10]:
coffee_string # string value

'coffee'

In [11]:
string = nlp.vocab.strings[319786]  # raises an error if the string store has never seen this hash

KeyError: "[E018] Can't retrieve string for hash '319786'. This usually refers to an issue with the `Vocab` or `StringStore`."

In [12]:
# A lexeme doesn't have context-dependent POS tags, dependencies, or entity labels

In [13]:
from spacy.lang.en import English

nlp = English()
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


In [16]:
from spacy.lang.en import English
nlp = English()
doc = nlp("David Bowie is a PERSON")
# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)
# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


In [17]:
# #from spacy.lang.en import English
# from spacy.lang.de import German

# # Create an English and German nlp object
# nlp = English()
# nlp_de = German()

# # Get the ID for the string 'Bowie'
# bowie_id = nlp.vocab.strings["Bowie"]
# print(bowie_id)

# # Look up the ID for "Bowie" in the vocab
# print(nlp_de.vocab.strings[bowie_id])

# Error: the string "Bowie" isn’t in the German vocab (it has never seen it), so the hash can’t be resolved in its string store.
# Hashes can’t be reversed. To prevent this problem, add the word to the new vocab by processing a text or looking up the string, or use the same vocab to resolve the hash back to a string.
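
One possible way to avoid the error, sketched under the assumption that adding the string explicitly with `StringStore.add` is acceptable for the use case (the variable names `added_id` and `resolved` are mine, not from the course):

```python
from spacy.lang.de import German
from spacy.lang.en import English

# Create an English and a German nlp object
nlp = English()
nlp_de = German()

# Get the hash for the string "Bowie" from the English vocab
bowie_id = nlp.vocab.strings["Bowie"]

# Register the string in the German string store first
added_id = nlp_de.vocab.strings.add("Bowie")

# Now the German string store can resolve the hash back to the string
resolved = nlp_de.vocab.strings[bowie_id]
print(resolved)
```

The same string hashes to the same value in both vocabs, so only the reverse lookup depends on which string store has seen the string.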

Doc and Span are very powerful: they hold references to words and sentences and the relationships between them
   
   * Convert results to strings as late as possible
   * Use token attributes if available, such as token.i for the token index
   
Don't forget to pass in the shared vocab when creating a Doc manually

In [3]:
# manually create Doc object
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ["spaCy", "is", "cool", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [4]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [5]:
from spacy.lang.en import English

nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!


In [6]:
from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


In [7]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

Found proper noun before a verb: Berlin


This is not good code: it only uses lists of strings instead of native token attributes. This is often less efficient and can't express complex relationships.

In [9]:
# A better way: doc[i] gives you a token, so you can just use doc[token.i + 1].pos_
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

for token in doc:
    # Only look ahead if the proper noun is not the last token
    if token.pos_ == "PROPN" and token.i + 1 < len(doc):
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun and is not the last token
    if token.pos_ == "PROPN" and token.i + 1 < len(doc):
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin


Word vectors and semantic similarity:

* spaCy can compare two objects and predict how similar they are
* Doc.similarity, Span.similarity, and Token.similarity
* Returns a similarity score, generally between 0 and 1
* Needs a model that has word vectors included
    * en_core_web_md (medium)
    * en_core_web_lg (large)
    * NOT en_core_web_sm (small model)
    * https://stackoverflow.com/questions/50487495/what-is-difference-between-en-core-web-sm-en-core-web-mdand-en-core-web-lg-mod
* Similarity is determined using word vectors
* Word vectors are generated using algorithms like Word2Vec and lots of text
* Default: cosine similarity
* Doc and Span vectors default to the average of their token vectors
* Short phrases work better than long documents with many irrelevant words
* Similarity depends on the application context
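
The two ideas above (averaging token vectors, then comparing with cosine similarity) can be sketched in plain Python. The 3-dimensional vectors below are made up purely for illustration; real models use vectors with hundreds of dimensions, and spaCy's internals may differ from this sketch. In general, cosine similarity ranges from -1 to 1.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    # Component-wise mean, like a doc vector built from token vectors
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


# Made-up toy "word vectors" (assumption, not taken from any real model)
toy_vectors = {
    "cats": [0.9, 0.1, 0.0],
    "like": [0.1, 0.9, 0.2],
    "hate": [0.1, 0.8, 0.3],
}

doc1_vector = average_vector([toy_vectors["like"], toy_vectors["cats"]])
doc2_vector = average_vector([toy_vectors["hate"], toy_vectors["cats"]])
score = cosine_similarity(doc1_vector, doc2_vector)
print(score)
```

Because "like" and "hate" appear in similar contexts, their vectors are close, so the two averaged vectors also end up close. This is why opposite sentiments can still score as very similar.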



In [24]:
import spacy
#!python3 -m spacy download en_core_web_md

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")
print(doc1.similarity(doc2))

# Both sentences express a sentiment about cats, so they score as similar,
# even though the sentiments are opposite

0.9501447503553421


In [25]:
import spacy

# Load the en_core_web_md model
nlp = spacy.load("en_core_web_md")

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

In [26]:
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


In [27]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.22325331


In [29]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)
print(span1)
print(span2)

0.75173926
great restaurant
really nice bar


Combining models and rules
   * Statistical models
        *  Use case: applications that need to generalize from examples
        *  Entity recognizer, dependency parser, and part-of-speech tagger
        *  Examples: product/person names, subject/object relationships
   *  Rule-based systems
        * https://spacy.io/usage/rule-based-matching
        *  Use case: dictionaries with a finite number of examples
        *  Examples: countries/cities, drug names, dog breeds
        *  Matcher/PhraseMatcher/Tokenizer
            * Phrase matching is great for matching large word lists
        *  LOWER (token.lower_): the lowercase form of the token text, e.g. "Silicon" -> "silicon", for case-insensitive comparison
        *  The tokenizer already takes care of splitting off whitespace, and each dictionary in the pattern describes one token, so no extra {"TEXT": " "} is needed
        *  IS_TITLE (token.is_title): whether the token text is in title case, e.g. "Cat" but not "cat"
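
A small sketch of the LOWER and IS_TITLE attributes in a Matcher pattern. It assumes a blank English pipeline is enough here, since both attributes are lexical and need no trained model. The pattern name `SILICON_TITLE` and the example sentence are made up for this note.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes; no statistical model is required
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token: "silicon" in any casing,
# followed by any token written in title case
pattern = [{"LOWER": "silicon"}, {"IS_TITLE": True}]
matcher.add("SILICON_TITLE", [pattern])

doc = nlp("SILICON Valley and silicon valley")
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```

Only the first phrase should match: the second "valley" is lowercase, so it fails the IS_TITLE check. Note that there is no dictionary for the space between the two tokens.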

In [32]:
# example
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

pattern = [{"LEMMA":"love","POS":"VERB"},
           {"LOWER":"cats"}]
matcher.add("LOVE_CATS", [pattern])

In [33]:
pattern = [{"TEXT":"very","OP":"+"},
           {"TEXT":"happy"}]
matcher.add("VERY_HAPPY", [pattern])

doc = nlp("I love cats and i'm very happy")
matches = matcher(doc)

In [37]:
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text, span.root.head.text)
    print(doc[start-1].text, doc[start-1].pos_)

love cats love
I PRON
very happy 'm
'm VERB


In [39]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns; the dictionaries must follow the order of the tokens
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN2", [pattern2])

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. 

This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.
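
Since the course's countries.json file isn't available outside the exercises, here is a minimal sketch with a short hand-written list standing in for COUNTRIES (the list `countries_sample` is my assumption, not the real data). It uses a blank English pipeline, since exact phrase matching only needs the tokenizer.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Stand-in for the COUNTRIES list loaded from the course's JSON file
countries_sample = ["Czech Republic", "Slovakia", "Namibia"]

# Create pattern Doc objects and add them to the matcher
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(countries_sample))
matcher.add("COUNTRY", patterns)

# Call the matcher on a test document and collect the matched spans
doc = nlp("Czech Republic may help Slovakia protect its airspace")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```

The multi-token name "Czech Republic" is matched as one span without writing a token-level pattern for it.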

In [40]:
# import json
# from spacy.lang.en import English
# with open("exercises/en/countries.json", encoding="utf8") as f:
#     COUNTRIES = json.loads(f.read())
# nlp = English()
# doc = nlp("Czech Republic may help Slovakia protect its airspace")
# # Import the PhraseMatcher and initialize it
# from spacy.matcher import PhraseMatcher
# matcher = PhraseMatcher(nlp.vocab)
# # Create pattern Doc objects and add them to the matcher
# # This is the faster version of: [nlp(country) for country in COUNTRIES]
# patterns = list(nlp.pipe(COUNTRIES))
# matcher.add("COUNTRY", patterns)
# # Call the matcher on the test document and print the result
# matches = matcher(doc)
# print([doc[start:end] for match_id, start, end in matches])

# prints [Czech Republic, Slovakia]

In [None]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())
with open("exercises/en/country_text.txt", encoding="utf8") as f:
    TEXT = f.read()

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", patterns)

# Create a doc and reset existing entities
doc = nlp(TEXT)
doc.ents = []

# Iterate over the matches
for match_id, start, end in matcher(doc):  # label each matched country as "GPE" and add it to doc.ents
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])


in --> Namibia
in --> South Africa
Africa --> Cambodia
of --> Kuwait
as --> Somalia
Somalia --> Haiti
Haiti --> Mozambique
in --> Somalia
for --> Rwanda
Britain --> Singapore
War --> Sierra Leone
of --> Afghanistan
invaded --> Iraq
in --> Sudan
of --> Congo
earthquake --> Haiti
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]