# What is [spaCy](https://spacy.io/)?

**spaCy** is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy provides a variety of linguistic annotations to give you **insights into a text's grammatical structure.** This includes word types, such as parts of speech, and how the words are related to each other.

![spacy](figures/spacy_logo.png "spaCy")

Similar to NLTK, spaCy provides a range of features and capabilities related to linguistic concepts and even more general machine learning functionalities.

Here is a list of things spaCy allows you to do. You can also visualize some of them using spaCy's built-in visualizer, **displaCy**.


| NAME | DESCRIPTION |
| :------------- | :----------: |
| **Tokenization** | Segmenting text into words, punctuation marks etc. |
| **Part-of-speech (POS) Tagging** | Assigning word types to tokens, like verb or noun. |
| **Dependency Parsing** | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| **Lemmatization** | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”. |
| **Sentence Boundary Detection (SBD)** | Finding and segmenting individual sentences. |
| **Named Entity Recognition (NER)** | Labelling named “real-world” objects, like persons, companies or locations. |
| **Entity Linking (EL)** | Disambiguating textual entities to unique identifiers in a knowledge base. |
| **Similarity** | Comparing words, text spans and documents and how similar they are to each other. |
| **Text Classification** | Assigning categories or labels to a whole document, or parts of a document. |
| **Rule-based Matching** | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| **Training** | Updating and improving a statistical model’s predictions. |
| **Serialization** | Saving objects to files or byte strings. |

In [1]:
# If you do not have spaCy and its English package installed,
# uncomment the following two lines to install them.

# !pip install spacy
# !python -m spacy download en_core_web_sm

# Chapter 1. Word Tokenization
This section uses a trained pipeline (i.e., `en_core_web_sm`) to create a `Language` object containing all the components and data needed to process text.

The `Language` object will enable you to lemmatize, apply POS tags, dependency parse and even figure out the shape of the tokens present in the given text.

In [2]:
import spacy

# Load the `Language` object
nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print("{:<10}{:<7}{}".format(token.text, token.pos_, token.dep_))

Apple     PROPN  nsubj
is        AUX    aux
looking   VERB   ROOT
at        ADP    prep
buying    VERB   pcomp
U.K.      PROPN  dobj
startup   NOUN   advcl
for       ADP    prep
$         SYM    quantmod
1         NUM    compound
billion   NUM    pobj


In [4]:
doc = nlp("The first Tesla Model S Plaid models are about to be delivered to customers on June 10th, \
            but CEO Elon Musk has just announced that the range-topping 1,100hp Plaid+ variant will no longer be offered.")
for token in doc:
    print("{:<10}{:<7}{}".format(token.text, token.pos_, token.dep_))

The       DET    det
first     ADJ    amod
Tesla     PROPN  compound
Model     PROPN  compound
S         PROPN  compound
Plaid     PROPN  compound
models    NOUN   nsubj
are       AUX    ccomp
about     ADJ    acomp
to        PART   aux
be        AUX    auxpass
delivered VERB   xcomp
to        ADP    prep
customers NOUN   pobj
on        ADP    prep
June      PROPN  compound
10th      NOUN   pobj
,         PUNCT  punct
            SPACE  nsubj
but       CCONJ  cc
CEO       NOUN   compound
Elon      PROPN  compound
Musk      PROPN  conj
has       AUX    aux
just      ADV    advmod
announced VERB   ROOT
that      SCONJ  mark
the       DET    det
range     NOUN   npadvmod
-         PUNCT  punct
topping   VERB   amod
1,100hp   NOUN   compound
Plaid+    PROPN  compound
variant   NOUN   nsubjpass
will      AUX    aux
no        ADV    neg
longer    ADV    advmod
be        AUX    auxpass
offered   VERB   ccomp
.         PUNCT  punct


Calling the `nlp` object on a string of text returns a processed `Doc` object. You can think of a `Doc` (stored in the variable `doc` here) as a container for accessing linguistic annotations.

Take a look at these `Doc` attributes!

In [5]:
print(dir(doc))

['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '_bulk_merge', '_get_array_attrs', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'copy', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_dict', 'from_disk', 'from_docs', 'get_extension', 'get_lca_matrix', 'has_annotation', 'has_extension', 'has_unknown_spaces', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'noun_chunks', 'noun_chunks_iterator', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_ents', 'set_extension', 'similarity', 'spans', 'tensor', 'text', 'text_with_ws'
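
As a rough illustration of the container idea, the sketch below indexes a `Doc` to get a `Token` and slices it to get a `Span`. It assumes spaCy v3 is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline that needs no model download. That choice, and the variable name `nlp_blank`, are assumptions made for this example rather than what the cells above use.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to show
# indexing and slicing; it predicts no POS tags, dependencies, or entities.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Apple is looking at buying U.K. startup for $1 billion")

first_token = doc[0]  # indexing a Doc returns a Token
span = doc[5:7]       # slicing a Doc returns a Span (token offsets, not characters)

print(len(doc))  # number of tokens in the Doc
print(type(first_token).__name__, first_token.text)
print(type(span).__name__, span.text)
```

Slices use token positions, so `doc[5:7]` covers the sixth and seventh tokens regardless of how many characters they contain.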

## Tokenization

During the above process, spaCy tokenizes the text as follows.

In [6]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, end="  /  ")

Apple  /  is  /  looking  /  at  /  buying  /  U.K.  /  startup  /  for  /  $  /  1  /  billion  /  

First, the raw text is split on whitespace characters, similar to `text.split(' ')`. Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. Does the substring match a tokenizer exception rule? For example, **“don’t”** does not contain whitespace, but should be split into two tokens, **“do”** and **“n’t”**, while **“U.K.”** should always remain one token.

2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes can be split off.

In [7]:
doc = nlp("Don't do that, John. You're intimidating him. Let's go.")
for token in doc:
    print(token.text, end="  /  ")

Do  /  n't  /  do  /  that  /  ,  /  John  /  .  /  You  /  're  /  intimidating  /  him  /  .  /  Let  /  's  /  go  /  .  /  
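
The two checks described above can be sketched in plain Python. This is a simplified toy, not spaCy's actual tokenizer: the `EXCEPTIONS` table, the `PREFIXES` and `SUFFIXES` sets, and the `toy_tokenize` function are hypothetical names made up for illustration, and infix handling is left out.

```python
# Toy illustration: whitespace splitting, exception rules, then affix splitting.
EXCEPTIONS = {
    "don't": ["do", "n't"],
    "Don't": ["Do", "n't"],
    "U.K.": ["U.K."],
}
PREFIXES = {'"', "(", "$"}
SUFFIXES = {'"', ")", ",", ".", "!", "?"}


def toy_tokenize(text):
    tokens = []
    for substring in text.split(" "):
        if not substring:
            continue
        trailing = []  # suffixes peeled off so far, kept in reading order
        while substring:
            # Check 1: does the remaining substring match an exception rule?
            if substring in EXCEPTIONS:
                tokens.extend(EXCEPTIONS[substring])
                substring = ""
            # Check 2: can a prefix or a suffix be split off?
            elif len(substring) > 1 and substring[0] in PREFIXES:
                tokens.append(substring[0])
                substring = substring[1:]
            elif len(substring) > 1 and substring[-1] in SUFFIXES:
                trailing.insert(0, substring[-1])
                substring = substring[:-1]
            else:
                # Nothing left to split: keep the rest as a single token.
                tokens.append(substring)
                substring = ""
        tokens.extend(trailing)
    return tokens


print(toy_tokenize("Don't buy the U.K. startup, John."))
```

Checking the exception table before peeling off punctuation is what keeps "U.K." intact while the final period of "John." still becomes its own token.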

## Part-of-speech (POS) tags and dependencies
The trained pipelines and statistical models within spaCy enable it to **make predictions** of which tag or label most likely applies in a given context.

In [8]:
from spacy import displacy

In [9]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print("{:<20}{:<15}{:<10}{:<10}{:<10}{:<10}{:<10}{}".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))

# Call `displacy.serve` for dependency visualization
displacy.serve(doc, style="dep")

Apple               Apple          PROPN     NNP       nsubj     Xxxxx     1         False
is                  be             AUX       VBZ       aux       xx        1         True
looking             look           VERB      VBG       ROOT      xxxx      1         False
at                  at             ADP       IN        prep      xx        1         True
buying              buy            VERB      VBG       pcomp     xxxx      1         False
U.K.                U.K.           PROPN     NNP       dobj      X.X.      0         False
startup             startup        NOUN      NN        advcl     xxxx      1         False
for                 for            ADP       IN        prep      xxx       1         True
$                   $              SYM       $         quantmod  $         0         False
1                   1              NUM       CD        compound  d         0         False
billion             billion        NUM       CD        pobj      xxxx      1         False





Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
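
The `shape_` column in the table above can look cryptic. Judging from that output, uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, other characters stay as they are, and runs of the same symbol are capped at four. The function below is a hedged re-creation of that behavior in plain Python. The name `approx_shape` is made up here; it is not spaCy's own implementation and may differ on edge cases.

```python
def approx_shape(text):
    """Approximate a token's orthographic shape, e.g. 'Apple' -> 'Xxxxx'."""
    shape = []
    last = ""
    run = 0  # length of the current run of identical shape symbols
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:  # keep at most four repeats of the same symbol
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1", "n't"]:
    print("{:<10}{}".format(word, approx_shape(word)))
```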


In [10]:
doc = nlp("Don't do that, John. You're intimidating him. Let's go.")
for token in doc:
    print("{:<20}{:<15}{:<10}{:<10}{:<10}{:<10}{:<10}{}".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))

Do                  do             AUX       VBP       aux       Xx        1         True
n't                 n't            PART      RB        neg       x'x       0         True
do                  do             VERB      VB        ROOT      xx        1         True
that                that           DET       DT        dobj      xxxx      1         True
,                   ,              PUNCT     ,         punct     ,         0         False
John                John           PROPN     NNP       npadvmod  Xxxx      1         False
.                   .              PUNCT     .         punct     .         0         False
You                 you            PRON      PRP       nsubj     Xxx       1         True
're                 be             AUX       VBP       aux       'xx       0         True
intimidating        intimidate     VERB      VBG       ROOT      xxxx      1         False
him                 he             PRON      PRP       dobj      xxx       1         True
.     

## Named Entity Recognition (NER)

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can **recognize various types of named entities in a document, by asking the model for a prediction.**

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
displacy.serve(doc, style="ent")

In [None]:
# Adjacent string literals are joined, so no run of spaces ends up inside the text
doc = nlp(
    "The first Tesla Model S Plaid models are about to be delivered to customers on June 10th, "
    "but CEO Elon Musk has just announced that the range-topping 1,100hp Plaid+ variant will no longer be offered."
)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
displacy.serve(doc, style="ent")

## (Quick Tip) BIO Tagging (a.k.a. IOB scheme)

How does an NER model work?

An NER model typically uses the BIO scheme, which tags the first token of a named entity as **"B"** (beginning), any following tokens of the same entity as **"I"** (inside), and tokens that belong to no entity as **"O"** (outside).

![bio](figures/bio_tagging.png)
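
As a minimal sketch of the scheme, the plain-Python function below turns entity spans into one tag per token, with tokens outside any entity tagged `O`. The function name `bio_tags` and the hand-written entity span are assumptions for illustration, not model output. The combined `B-GPE` style is a common convention; spaCy exposes the two parts separately as `ent_iob_` and `ent_type_`.

```python
def bio_tags(tokens, entities):
    """Build BIO tags from (start_token, end_token_exclusive, label) triples."""
    tags = ["O"] * len(tokens)  # default: outside any entity
    for start, end, label in entities:
        tags[start] = "B-" + label  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label  # remaining tokens of the entity
    return tags


tokens = ["San", "Francisco", "considers", "banning", "sidewalk", "delivery", "robots"]
# Assumption: "San Francisco" (tokens 0 and 1) is labelled as a geopolitical entity.
entities = [(0, 2, "GPE")]

for token, tag in zip(tokens, bio_tags(tokens, entities)):
    print("{:<10} {}".format(token, tag))
```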

In [None]:
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
for word in doc:
    print("{:<10} / {:<2} / {} ".format(word.text, word.ent_iob_, word.ent_type_))

In [None]:
# Adjacent string literals are joined, so no run of spaces ends up inside the text
doc = nlp(
    "In the U.S., 328 million doses have been given so far. In the last week, an average of 1.12 million doses per day "
    "were administered."
)

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
for word in doc:
    print("{:<10} / {:<2} / {} ".format(word.text, word.ent_iob_, word.ent_type_))

## Word vectors and similarity
The **similarity** of two words is determined by comparing their word vectors, also called **"word embeddings"**.

> A **word embedding** is a multi-dimensional representation of a word's meaning.

> Word vectors can be generated using an algorithm like **word2vec**.
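
A common way to compare two word vectors is cosine similarity, which is also spaCy's default similarity estimate. The sketch below computes it in plain Python on made-up three-dimensional vectors. Real embeddings have hundreds of dimensions, and the numbers in `toy_vectors` are invented for illustration, not taken from any model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))  # L2 norm of u
    norm_v = math.sqrt(sum(b * b for b in v))  # L2 norm of v
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)


# Invented toy vectors: "dog" and "cat" point in a similar direction, "banana" does not.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["dog"], toy_vectors["cat"]))
print(cosine_similarity(toy_vectors["dog"], toy_vectors["banana"]))
```

Because each vector is divided by its L2 norm, only the direction matters, not the length.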

Here, we'll use `en_core_web_lg`, since the `en_core_web_sm` pipeline we've been using so far does not contain token-level word embeddings.

In [None]:
# !python -m spacy download en_core_web_lg

You can check if a token has a vector assigned, and get the L2-norm, which can be used to normalize vectors.

In [None]:
nlp = spacy.load("en_core_web_lg")

tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print("{:<10}{!s:<7}{:<10.3f}{!s}".format(token.text, token.has_vector, token.vector_norm, token.is_oov))

## Similarity between Documents, Spans and Words

Pipeline packages that come with built-in word vectors make them available as the **`Token.vector`** attribute.

**`Doc.vector`** and **`Span.vector`** will default to an average of their token vectors. 

Using these vectors, you can compare two objects and make a prediction of **how similar they are.** This is very useful in building recommendation systems or flagging duplicates.
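
To make the averaging concrete, the sketch below builds a document vector as the mean of its token vectors and compares two toy "documents". All vectors are invented two-dimensional values, and `average_vector` and `cosine` are hypothetical helpers, so only the mechanics carry over to spaCy. The toy values deliberately make "love" and "hate" opposites, which real embeddings need not do.

```python
import math


def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy token vectors
toy = {
    "I": [1.0, 0.0],
    "love": [0.0, 1.0],
    "hate": [0.0, -1.0],
    "trees": [1.0, 0.0],
}

doc_a = average_vector([toy[w] for w in ["I", "love", "trees"]])
doc_b = average_vector([toy[w] for w in ["I", "hate", "trees"]])

print(cosine(toy["love"], toy["hate"]))  # word-level similarity
print(cosine(doc_a, doc_b))              # document-level similarity
```

One consequence of averaging is that documents sharing most of their tokens end up with similar vectors even when a single word changes the meaning.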

In [None]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print("{} <-> {} :\n[ Similarity : {}]\n".format(doc1, doc2, doc1.similarity(doc2)))

# Similarity of tokens and spans
salty_fries = doc1[2:4]  # Span: "salty fries"
burgers = doc1[5]        # Token: "hamburgers"
print("{} <-> {} :\n[ Similarity : {}]\n".format(salty_fries, burgers, salty_fries.similarity(burgers)))

In [None]:
doc1 = nlp("Hyundai is a company that employs cutting-edge technology to build cars.")
doc2 = nlp("Tesla is a car manufacturer that builds electric cars.")

print("{} <-> {} :\n[ Similarity : {}]\n".format(doc1, doc2, doc1.similarity(doc2)))

In [None]:
doc1 = nlp("I love trees.")
doc2 = nlp("I hate trees.")

print("{} <-> {} :\n[ Similarity : {}]\n".format(doc1, doc2, doc1.similarity(doc2)))

word1 = nlp("love")
word2 = nlp("hate")
print("{} <-> {} :\n[ Similarity : {}]\n".format(word1, word2, word1.similarity(word2)))

word3 = nlp("tree")
word4 = nlp("tree")
print("{} <-> {} :\n[ Similarity : {}]\n".format(word3, word4, word3.similarity(word4)))

# Exercise
1. **spaCy**
    - (1.1) Apply POS tagging and extract all the proper nouns (PROPN) and numbers (NUM) from the given text.
    - (1.2) Apply NER and extract all the numbers from the given text, along with their type (e.g., MONEY, CARDINAL, etc.)
    - (1.3) For the given set of documents, compare: (i) Document-wise similarity, (ii) Span-wise similarity, and (iii) Word-level similarity
    

### (1.1) Apply POS tagging and extract all the proper nouns (PROPN) and numbers (NUM) from the given text.

1. Advanced Micro had unveiled the deal in October last year as part of its battle to overtake chipmaking rival Intel Corp.

2. Just like most of big tech, Nvidia topped in early September, hitting roughly \$588 a share. 

3. Most analysts assumed a net increase of 7 to 10% from the starting price of 712 dollars per share. Nevertheless, the stock went through the roof when it shattered the 800 dollar cap.

### (1.2) Apply NER and extract all the numbers from the given text, along with their type (e.g., MONEY, CARDINAL, etc.)

1. Of 263,000 Amazon factory workers, just over 88,000 (33.6%) had received their first shot and about 43,000 (16.3%) had received both doses.

2. To support vaccination access for the sector, the federal government announced 13 clinics in multiple locations for aged-care staff and that they will be providing supplementary vaccines until the end of December.

3. The decision reflects a greater understanding of the real, but extremely low, risk of the clotting disorder called thrombosis with thrombocytopenia (TTS) for people aged 50-59, who are now recommended to have the Pfizer vaccine.

### (1.3) For the given set of documents, compare:
- (i) Document-wise similarity
- (ii) Span-wise similarity
- (iii) Word-level similarity

**Make sure to extract the "spans" and "words" straight from the document. Do NOT just write the words (e.g., `doc = nlp("pop rock")`) and compute the similarity. They must be sliced from the processed document (e.g., `span = doc[14:20]`, where `doc = nlp(document)`).**

1.

**[ Document 1 ]**

AC/DC are an Australian rock band formed in Sydney in 1973 by Scottish-born brothers Malcolm and Angus Young. Their music has been variously described as hard rock, blues rock, and heavy metal, but the band themselves call it simply "rock and roll".

**[ Document 2 ]**

Queen are a British rock band formed in London in 1970. Their classic line-up was Freddie Mercury (lead vocals, piano), Brian May (guitar, vocals), Roger Taylor (drums, vocals) and John Deacon (bass). Their earliest works were influenced by progressive rock, hard rock and heavy metal, but the band gradually ventured into more conventional and radio-friendly works by incorporating further styles, such as arena rock and pop rock.

**[ Spans ]**

"Their music has been variously described as hard rock" **<->** "Queen are a British rock band formed in London in 1970"

"Their music has been variously described as hard rock, blues rock, and heavy metal" **<->** "Their earliest works were influenced by progressive rock, hard rock and heavy metal"

**[ Words ]**

"pop rock" **<->** "heavy metal"

"pop rock" **<->** "hard rock"

"Australian" **<->** "British"

2.

**[ Document 1 ]**

Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California. Tesla's current products include electric cars, battery energy storage from home to grid-scale, solar panels and solar roof tiles, as well as other related products and services. In 2020, Tesla had the highest sales in the plug-in and battery electric passenger car segments, capturing 16% of the plug-in market (which includes plug-in hybrids) and 23% of the battery-electric (purely electric) market. Through its subsidiary Tesla Energy, the company develops and is a major installer of solar photovoltaic energy generation systems in the United States. Tesla Energy is also one of the largest global suppliers of battery energy storage systems, with 3 GWh of battery storage supplied in 2020.

**[ Document 2 ]**

Apple Inc. is an American multinational technology company that specializes in consumer electronics, computer software, and online services. Apple is the world's largest technology company by revenue (totalling $274.5 billion in 2020) and, since January 2021, the world's most valuable company. As of 2021, Apple is the world's fourth-largest PC vendor by unit sales, and fourth-largest smartphone manufacturer. It is one of the Big Five American information technology companies, along with Amazon, Google, Microsoft, and Facebook.

**[ Spans ]**

"American electric vehicle and clean energy company based in Palo Alto, California" **<->** "American multinational technology company"

"Tesla's current products include electric cars, battery energy storage from home to grid-scale" **<->** "Apple is the world's largest technology company by revenue"

**[ Words ]**

"solar panels" **<->** "smartphone"

"highest sales" **<->** "multinational"

"United States" **<->** "Amazon"