# spaCy Intro

In [2]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

###### containers

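    - doc: The Doc object holds the tokenized text together with the annotations (for example sentences, entities, and part-of-speech tags) that the spaCy pipeline produces for the text we pass in.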
    
    

In [4]:
with open("wiki_us.txt", "r") as file:
    text = file.read()
text

"The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.\n\nPaleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies 

In [5]:
doc = nlp(text)

In [6]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [7]:
print(len(text))
print(len(doc))

3521
654


###### Why the difference in length between text and doc?

    - len(text) counts the characters in the string, while len(doc) counts tokens: words, punctuation marks, and similar units.
    - Iterating over a string yields characters; iterating over the doc object yields individual tokens.

In [10]:
for i in text[:10]:
    print(i)

T
h
e
 
U
n
i
t
e
d


In [11]:
for i in doc[:10]:
    print(i)

The
United
States
of
America
(
U.S.A.
or
USA
)


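###### why not use the .split() method rather than the doc object provided by spaCy?

    - the .split() method only splits on whitespace, so punctuation stays attached to words, as in "(U.S.A." and "USA),".
    - the doc object separates punctuation into its own tokens while keeping abbreviations such as "U.S.A." intact.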

In [13]:
for i in text.split()[:10]:
    print(i)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known
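The difference can be reproduced without spaCy. The sketch below uses a hypothetical regular expression (`TOKEN_PATTERN`, an assumption for illustration, far simpler than spaCy's actual tokenizer rules) to show how a punctuation-aware tokenizer differs from a plain whitespace split.

```python
import re

# Hypothetical pattern, not spaCy's actual tokenizer rules:
# 1) dotted abbreviations such as "U.S.A.", 2) runs of word characters,
# 3) any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "The United States of America (U.S.A. or USA), commonly known"

whitespace_tokens = sample.split()       # punctuation stays glued to words
regex_tokens = simple_tokenize(sample)   # punctuation becomes its own token

print(whitespace_tokens)
print(regex_tokens)
```

A hand-rolled pattern like this breaks quickly on real text (contractions, URLs, hyphenated words), which is the practical argument for relying on the library's tokenizer.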


### Sentence Boundary Detection

    - SBD is the identification of sentence boundaries in a text.
    - since the doc.sents property returns a generator, it is not subscriptable.
        - You have to convert it to a subscriptable type, such as a list, before indexing.

In [16]:
for sentence in doc.sents:
    print(sentence)
    print('---')

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
---
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
---
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
---
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
---
With a population of more than 331 million people, it is the third most populous country in the world.
---
The national capital is Washington, D.C., and the most populous city is New York.


---
Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
---
The United States emerged from the 

In [17]:
sent1 = list(doc.sents)[0]
sent1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
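The generator behaviour is plain Python rather than anything specific to spaCy. As a minimal sketch, the hypothetical `iter_sentences` function below stands in for `doc.sents` (it uses a naive punctuation rule, not spaCy's sentence segmentation) and shows why indexing fails and three ways around it.

```python
import re
from itertools import islice


def iter_sentences(text):
    """Yield sentences lazily (hypothetical stand-in for doc.sents)."""
    # Naive rule: split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield sentence


text = "It consists of 50 states. It is a large country. The capital is Washington."

try:
    iter_sentences(text)[0]  # generators cannot be indexed
except TypeError as err:
    print(f"TypeError: {err}")

first = next(iter_sentences(text))                 # first item only, no list built
first_two = list(islice(iter_sentences(text), 2))  # first two items
all_sentences = list(iter_sentences(text))         # full list, now subscriptable
print(all_sentences[0])
```

Converting to a list is the simplest option for a short text; `next` or `islice` avoid building the whole list when only the first few sentences are needed.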

# Token Attributes

    - The Token object exposes many attributes that are central to NLP tasks in spaCy.
    - Common token attributes:
        .text
        .head
        .left_edge
        .right_edge
        .ent_type_
        .ent_iob_
        .lemma_
        .morph
        .pos_
        .dep_
        .lang_

In [18]:
sentence1 = list(doc.sents)[0]
sentence1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

###### .text
    - returns the token's verbatim text as a Python string (str).

In [25]:
token2 = sentence1[2]
print(token2)
token2.text

States


'States'

In [27]:
print(type(token2))
print(type(token2.text))

<class 'spacy.tokens.token.Token'>
<class 'str'>


   - The code above shows that although token2 and token2.text print the same output, token2 is a spaCy Token object while token2.text is a plain Python string.

In [28]:
for i in token2:
    print(i)

TypeError: 'spacy.tokens.token.Token' object is not iterable

In [29]:
for i in token2.text:
    print(i)

S
t
a
t
e
s


###### .head
    - Returns the syntactic parent of the token, that is, the word that governs it in the dependency parse.

In [30]:
token2.head

is

In [32]:
sentence1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

###### .left_edge
    - Returns the leftmost token of this token's syntactic subtree (the token together with all of its descendants in the dependency parse).
    - When the token heads a multi-word phrase, this tells us where that phrase begins.

In [33]:
token2.left_edge

The

###### .right_edge
    - Returns the rightmost token of this token's syntactic subtree.
    - When the token heads a multi-word phrase, this tells us where that phrase ends.

In [34]:
token2.right_edge

America
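To make the relationship between .head, .left_edge, and .right_edge concrete, here is a small sketch in plain Python. The `heads` list is hand-written for illustration (an assumption chosen to agree with the outputs above, not actual parser output), and `subtree_indices`, `left_edge`, and `right_edge` are hypothetical helpers rather than spaCy functions.

```python
# Hand-written fragment: the head index of each token (illustrative arcs,
# not parser output). As in spaCy, the root token is its own head.
tokens = ["The", "United", "States", "of", "America", "is"]
heads = [2, 2, 5, 2, 3, 5]


def subtree_indices(index, heads):
    """Return the sorted indices of a token and all of its descendants."""
    found = {index}
    changed = True
    while changed:  # keep adding tokens whose head is already in the set
        changed = False
        for child, head in enumerate(heads):
            if head in found and child not in found:
                found.add(child)
                changed = True
    return sorted(found)


def left_edge(index):
    """Leftmost token of the subtree rooted at `index`."""
    return tokens[subtree_indices(index, heads)[0]]


def right_edge(index):
    """Rightmost token of the subtree rooted at `index`."""
    return tokens[subtree_indices(index, heads)[-1]]


# Head, left edge, and right edge of "States" (index 2)
print(tokens[heads[2]], left_edge(2), right_edge(2))
```

A token with no dependents, such as "America" here, is its own left and right edge.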

###### .ent_type_
    - Returns the named entity type of the token.
    - This return value will be of type string.
        - if .ent_type (without the trailing "_") is used, it returns the integer ID that spaCy uses internally for that entity type.
    

In [35]:
print(token2.ent_type)
print(token2.ent_type_)

384
GPE


###### .ent_iob_
    - The IOB code of the named entity tag: "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

In [36]:
token2.ent_iob_

'I'

- "I" here means that the token "States" is inside an entity, in this case the larger entity "The United States of America".
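IOB codes are enough to rebuild whole entities from individual tokens. The sketch below assumes a hand-written list of (text, IOB code, entity type) triples for the start of the sentence (illustrative values, not copied from model output), and `group_entities` is a hypothetical helper, not part of spaCy.

```python
# Hand-written (text, IOB code, entity type) triples; illustrative values,
# not copied from model output.
tagged = [
    ("The", "B", "GPE"), ("United", "I", "GPE"), ("States", "I", "GPE"),
    ("of", "I", "GPE"), ("America", "I", "GPE"), ("(", "O", ""),
    ("U.S.A.", "B", "GPE"), ("or", "O", ""), ("USA", "B", "GPE"),
]


def group_entities(tagged_tokens):
    """Merge each "B" token and the "I" tokens after it into one entity."""
    entities, current, label = [], [], ""
    for text, iob, ent_type in tagged_tokens:
        if iob == "I" and current:
            current.append(text)  # continue the open entity
            continue
        if current:  # any other code closes the open entity
            entities.append((" ".join(current), label))
            current, label = [], ""
        if iob == "B":  # "B" opens a new entity
            current, label = [text], ent_type
    if current:  # close an entity that runs to the end of the input
        entities.append((" ".join(current), label))
    return entities


print(group_entities(tagged))
```

In practice spaCy already exposes the merged spans through doc.ents, so a helper like this is only useful for understanding what the per-token codes encode.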

###### .lemma_
    - Returns the base form of the token, with no inflectional suffixes.

In [43]:
sentence1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

In [39]:
token2.lemma_

'States'

In [63]:
print(sentence1[12])
print(sentence1[12].lemma_)

known
know


###### .morph
    - returns the morphological analysis of the token, such as number, tense, or verb form.

In [64]:
print(f"{token2}: {token2.morph}")
print(f"{sentence1[4]}: {sentence1[4].morph}")
print(f"{sentence1[12]}: {sentence1[12].morph}")

States: Number=Sing
America: Number=Sing
known: Aspect=Perf|Tense=Past|VerbForm=Part
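The printed analysis follows a `Feature=Value|Feature=Value` layout, which is easy to take apart. The following sketch parses that string form with a hypothetical `parse_morph` helper; spaCy's morph object has its own accessors (for example a `to_dict()` method in spaCy v3), so this is only meant to show what the string encodes.

```python
def parse_morph(morph_string):
    """Turn 'Feature=Value|Feature=Value' into a dict of features."""
    if not morph_string:
        return {}  # tokens without morphological features print as ""
    return dict(pair.split("=", 1) for pair in morph_string.split("|"))


known = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")
states = parse_morph("Number=Sing")

print(known["Tense"])
print(states)
```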


###### .pos_
    - Returns the coarse-grained part-of-speech tag from the Universal POS tag set.

In [65]:
print(f"{token2}: {token2.pos_}")
print(f"{sentence1[4]}: {sentence1[4].pos_}")
print(f"{sentence1[12]}: {sentence1[12].pos_}")

States: PROPN
America: PROPN
known: VERB


###### .dep_
    - Returns the syntactic dependency relation between the token and its head.

In [66]:
print(f"{token2}: {token2.dep_}")
print(f"{sentence1[4]}: {sentence1[4].dep_}")
print(f"{sentence1[12]}: {sentence1[12].dep_}")

States: nsubj
America: pobj
known: acl


###### .lang_
    - returns the language of the parent document's vocabulary.

In [58]:
print(f"{token2}: {token2.lang_}")

States: en


# Part of Speech Tagging and Dependency Parser

In [72]:
text = "Mike enjoys playing football."
doc2 = nlp(text)

In [73]:
for token in doc2:
    print(f"{token}: {token.pos_}, {token.dep_}")

Mike: PROPN, nsubj
enjoys: VERB, ROOT
playing: VERB, xcomp
football: NOUN, dobj
.: PUNCT, punct


In [75]:
from spacy import displacy
displacy.render(doc2, style='dep')

# Named Entity Recognition Teaser

The cell below shows how we can loop through each entity of a text and extract labels for each entity.

In [79]:
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

The United States of America: GPE
U.S.A.: GPE
USA: GPE
the United States: GPE
U.S.: GPE
US: GPE
America: GPE
North America: LOC
50: CARDINAL
five: CARDINAL
326: CARDINAL
Indian: NORP
3.8 million square miles: QUANTITY
9.8 million square kilometers: QUANTITY
fourth: ORDINAL
The United States: GPE
Canada: GPE
Mexico: GPE
Bahamas: GPE
Cuba: GPE
more than 331 million: CARDINAL
third: ORDINAL
Washington: GPE
D.C.: GPE
New York: GPE
Paleo-Indians: NORP
Siberia: LOC
North American: NORP
at least 12,000 years ago: DATE
European: NORP
the 16th century: DATE
The United States: GPE
thirteen: CARDINAL
British: NORP
the East Coast: LOC
Great Britain: GPE
the American Revolutionary War: ORG
the late 18th century: DATE
U.S.: GPE
North America: LOC
Native Americans: NORP
1848: DATE
the United States: GPE
United States: GPE
the second half of the 19th century: DATE
the American Civil War: ORG
Spanish: NORP
World War: EVENT
U.S.: GPE
World War II: EVENT
the Cold War: EVENT
the United States: GPE
the Kor
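Once entities are available as (text, label) pairs, ordinary Python tools can summarize them. The sketch below hard-codes the first twelve pairs from the printed output so that it runs without a model; with a real doc, the same pairs could be built from `ent.text` and `ent.label_`.

```python
from collections import Counter

# First twelve (text, label) pairs copied from the printed entity output.
entities = [
    ("The United States of America", "GPE"),
    ("U.S.A.", "GPE"),
    ("USA", "GPE"),
    ("the United States", "GPE"),
    ("U.S.", "GPE"),
    ("US", "GPE"),
    ("America", "GPE"),
    ("North America", "LOC"),
    ("50", "CARDINAL"),
    ("five", "CARDINAL"),
    ("326", "CARDINAL"),
    ("Indian", "NORP"),
]

# How often each label occurs
label_counts = Counter(label for _, label in entities)
print(label_counts.most_common())

# Keep only the entities with a given label
gpe_names = [text for text, label in entities if label == "GPE"]
print(gpe_names)
```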

In [80]:
displacy.render(doc, style='ent')

# Word Vectors

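    - word vectors ("word embeddings") are numerical representations of words: each word is mapped to a vector in a multidimensional space, and the vectors for a whole vocabulary are stored together as a matrix.
    - They turn words into numbers that a model can compute with, so that words used in similar contexts end up with similar vectors.
    - The small model (en_core_web_sm) does not ship with static word vectors, which is why the medium model (en_core_web_md) is downloaded below.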

In [81]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.0/en_core_web_md-3.7.0-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 20.2 MB/s eta 0:00:00
DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [3]:
nlp = spacy.load("en_core_web_md")

In [4]:
with open("wiki_us.txt", "r") as file:
    text = file.read()

In [5]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]
sentence1


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

In [25]:
sample_sentence = nlp("USA is a country that is located in the northern part of the americas.")
sample_sentence2 = nlp("Russia is the largest country in Europe and has a complicated history.")
sample_sentence3 = nlp("The egg mcmuffin is the tastiest sandwich I have ever had.")

In [26]:
print(sample_sentence.similarity(sentence1))
print(sample_sentence2.similarity(sentence1))
print(sample_sentence3.similarity(sentence1))

0.7670948457261745
0.7540404066488818
0.3309637077623314
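By default, spaCy estimates document similarity as the cosine similarity between averaged word vectors. The sketch below illustrates that idea in plain Python with made-up 3-dimensional vectors (`toy_vectors` is purely hypothetical; real model vectors have hundreds of dimensions and different values).

```python
import math


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up word vectors, for illustration only
toy_vectors = {
    "country": [0.9, 0.1, 0.0],
    "nation": [0.8, 0.2, 0.1],
    "large": [0.5, 0.5, 0.1],
    "tasty": [0.1, 0.4, 0.8],
    "sandwich": [0.0, 0.1, 0.9],
}


def doc_vector(words):
    """Represent a 'document' as the average of its word vectors."""
    return average([toy_vectors[word] for word in words])


sim_related = cosine_similarity(
    doc_vector(["large", "country"]), doc_vector(["large", "nation"])
)
sim_unrelated = cosine_similarity(
    doc_vector(["large", "country"]), doc_vector(["tasty", "sandwich"])
)
print(sim_related, sim_unrelated)
```

With these toy values the two "country" phrases score much higher than the country/sandwich pair, which mirrors the pattern in the notebook output: the two geography sentences score around 0.75 to 0.77 against the Wikipedia sentence, and the sandwich sentence scores around 0.33.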


# Pipelines

###### create an initial blank pipeline

In [None]:
nlp = spacy.blank("en")

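###### add a component to your pipeline

    - In this case, we add the built-in sentencizer component to the pipeline. It sets sentence boundaries with punctuation rules and does not need the dependency parser.
    - Because the spacy.blank("en") cell above was not executed (In [None]), the outputs below reflect nlp still being the en_core_web_md pipeline, with the sentencizer appended to it.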

In [27]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7fe8925a8640>
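Conceptually, a pipeline is an ordered list of named components, each of which receives the document, annotates it, and passes it on. The toy sketch below mimics that flow in plain Python; `ToyPipeline` and `toy_sentencizer` are hypothetical names invented for illustration and are much simpler than spaCy's `Language` class and `Sentencizer` component.

```python
class ToyPipeline:
    """Ordered list of named components applied to a doc (a plain dict here)."""

    def __init__(self):
        self.components = []

    def add_pipe(self, name, component):
        """Append a component to the end of the pipeline."""
        self.components.append((name, component))
        return component

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def __call__(self, text):
        # Crude whitespace tokenization, only to have something to annotate
        doc = {"tokens": text.split(), "sent_starts": []}
        for _, component in self.components:
            doc = component(doc)  # each component receives and returns the doc
        return doc


def toy_sentencizer(doc):
    """Mark a token as a sentence start if the previous token ends in . ! or ?"""
    starts, previous_ended = [], True
    for token in doc["tokens"]:
        starts.append(previous_ended)
        previous_ended = token.endswith((".", "!", "?"))
    doc["sent_starts"] = starts
    return doc


toy_nlp = ToyPipeline()
toy_nlp.add_pipe("sentencizer", toy_sentencizer)
result = toy_nlp("Mike enjoys football. He plays daily.")

print(toy_nlp.pipe_names)
print(list(zip(result["tokens"], result["sent_starts"])))
```

The order of components matters in a design like this, because later components can rely on annotations written by earlier ones.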

In [28]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sen