#### Resources:
- [Reference video](https://www.youtube.com/watch?v=dIUTsFT2MeQ&list=PLUorQzJgGyjFMESZTXl_6ZPYLnwG2EDvm&index=4)
- [Reference book](https://spacy.pythonhumanities.com/intro.html)

#### NLP vs. NLU
##### NLP
- NLP processes input text according to the formal rules of a language (its grammar and syntax).
- In NLP, we break the input into its linguistic units and arrange them into a structure that more capable language models can then build on.
- It performs an array of functions which includes:
    - NER (Named Entity Recognition)
    - syntax parsing
    - POS (part-of-speech) tagging
    - Text categorization

##### NLU
- NLU is an attempt to understand language from a more human point of view, focusing on what the text means rather than only on how it is structured.
- It performs functions like:
    - semantic parsing (understanding the context)
    - question & answering (important in bots)
    - sentiment analysis
    - paraphrasing
    - summarization

#### spaCy Intro
- What is spaCy?
    - It is a Python framework used to implement NLP.
- Why spaCy?
    - It comes with off-the-shelf models that are quick and precise; as developers, we don't need to tweak much to get good performance out of them.
    - It can use the latest transformer models.
    - Custom training is easier than in other NLP frameworks.
    - It is scalable, i.e. it works well with real-world data that is large and ever-growing.


#### spaCy installation and setup
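- site: https://spacy.io/usage
- commands: `pip install -U spacy`, then `python -m spacy download en_core_web_sm` for the small English model loaded below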




In [3]:
import spacy

In [4]:
#check whether spaCy and its small English model load correctly
nlp = spacy.load("en_core_web_sm")

#### Containers
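- A container is a spaCy object that holds a piece of text together with the linguistic data spaCy has derived from it.
- The containers covered here are the doc, its sentences and entities (both are spans of a doc), and the token.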

##### Doc

In [5]:
#let's take a text containing multiple paras from the file usa.txt
with open("usa.txt", "r") as f:
    text = f.read()

print(text)

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America. It is composed of 50 states, each with its own unique culture, geography, and attractions. From the bustling streets of NYC (New York City) to the serene landscapes of the Grand Canyon, the USA offers a wide range of experiences for both residents and visitors. The country is known for its rich history, which includes significant events such as the Am. Revolution, the Civil War, and the Civil Rights Movement.

The USA is a melting pot of cultures, with people from all over the world contributing to its vibrant society. This cultural diversity is reflected in the nation's cuisine, music, art, and festivals. Cities like L.A. (Los Angeles), Miami, and Chicago are renowned for their cultural scenes, offering everything from world-class museums to lively music festivals. The Am. Dream, the idea that anyone can achieve success through hard work and determination, continues to

In [6]:
#let's use spacy to create a doc out of this text
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print(doc)

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America. It is composed of 50 states, each with its own unique culture, geography, and attractions. From the bustling streets of NYC (New York City) to the serene landscapes of the Grand Canyon, the USA offers a wide range of experiences for both residents and visitors. The country is known for its rich history, which includes significant events such as the Am. Revolution, the Civil War, and the Civil Rights Movement.

The USA is a melting pot of cultures, with people from all over the world contributing to its vibrant society. This cultural diversity is reflected in the nation's cuisine, music, art, and festivals. Cities like L.A. (Los Angeles), Miami, and Chicago are renowned for their cultural scenes, offering everything from world-class museums to lively music festivals. The Am. Dream, the idea that anyone can achieve success through hard work and determination, continues to

In [7]:
#the output looks the same, doesn't it? now let's compare the lengths of text and doc
print(len(text))
print(len(doc))

#print first 10 elements of text and doc
for char in text[:10]:
    print(char)
print("-------------------------------")
for element in doc[:10]:
    print(element)

2607
489
T
h
e
 
U
.
S
.
A
.
-------------------------------
The
U.S.A.
(
United
States
of
America
)
,
commonly


In [8]:
#The doc does not split the text into tokens on whitespace alone.
#We can confirm that by printing the first 10 elements of the whitespace-split text
for word in text.split()[:10]:
    print(word)


The
U.S.A.
(United
States
of
America),
commonly
known
as
the


- See how the outputs differ: the doc treats each parenthesis and comma as a separate token. It separates the meaningful units of the text and lists each of them as a token. Also note that `U.S.A.` is not split, even though it contains full stops, which are normally tokens of their own; spaCy recognizes that here they are part of an abbreviation.
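
To make the contrast with `str.split()` concrete, here is a rough sketch of a rule-based tokenizer built on the standard `re` module. It is not spaCy's tokenizer, which relies on language-specific rules and exceptions; the pattern and the `toy_tokenize` name are assumptions made purely for illustration.

```python
import re

# Hypothetical toy pattern, alternatives tried left to right:
#   1. dotted abbreviations such as "U.S.A." or "L.A."
#   2. runs of word characters (optionally joined by "-" or "'")
#   3. any other single non-space character (punctuation, brackets)
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+(?:[-']\w+)*|\S")

def toy_tokenize(text):
    """Split text into word, abbreviation and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "The U.S.A. (United States of America), commonly known as the USA"

# Whitespace splitting leaves brackets and commas glued to the words.
print(sample.split()[:6])

# The toy tokenizer separates them but keeps "U.S.A." in one piece.
print(toy_tokenize(sample)[:9])
```

Even this small pattern needs a special case for abbreviations, which hints at how many rules a production tokenizer has to carry.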

##### Sents
- `doc.sents` gives access to the sentences of the text in a doc.
- Even for a seemingly simple task like splitting sentences, there are a lot of things to keep in mind, and they vary with the language.
- spaCy does sentence boundary detection out of the box through the doc object.
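
A quick way to see why sentence splitting is harder than it looks is to split on sentence-ending punctuation naively. The sketch below uses only the standard library; `naive_sentences` is a hypothetical helper written for this note and has nothing to do with how spaCy detects boundaries.

```python
import re

def naive_sentences(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sample = "The USA is located in N. America. It is composed of 50 states."

for sentence in naive_sentences(sample):
    print(sentence)
```

The abbreviation `N.` is treated as the end of a sentence, so the first sentence is cut in two. Handling cases like this is what a real sentence splitter has to get right.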

In [9]:
#Let us try to extract the sentences out of our doc
sents = doc.sents

for sent in sents:
    print (sent)

#notice that the blank line between paragraphs shows up in the output, and that the abbreviation "Am." is wrongly treated as a sentence end

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America.
It is composed of 50 states, each with its own unique culture, geography, and attractions.
From the bustling streets of NYC (New York City) to the serene landscapes of the Grand Canyon, the USA offers a wide range of experiences for both residents and visitors.
The country is known for its rich history, which includes significant events such as the Am.
Revolution, the Civil War, and the Civil Rights Movement.


The USA is a melting pot of cultures, with people from all over the world contributing to its vibrant society.
This cultural diversity is reflected in the nation's cuisine, music, art, and festivals.
Cities like L.A. (Los Angeles), Miami, and Chicago are renowned for their cultural scenes, offering everything from world-class museums to lively music festivals.
The Am.
Dream, the idea that anyone can achieve success through hard work and determination, continues t

In [10]:
#let us try to print the first sentence
sent1 = sents[0]
print(sent1)

TypeError: 'generator' object is not subscriptable

- It does not work because `sents` is a generator over the doc object, hence it is not subscriptable (i.e. we can loop over it, but we cannot index into it).

In [12]:
#so let's extract the sents into a list and access the first sentence
sent1 = list(doc.sents)[0]
print(sent1)

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America.
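
The behaviour seen with `doc.sents` is ordinary Python generator behaviour and can be reproduced without spaCy. In the sketch below, `fake_sents` is a hypothetical stand-in for `doc.sents`; it also shows that `next()` can fetch the first item without building a full list.

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

sents = fake_sents()

# Indexing a generator raises TypeError, just like sents[0] did earlier.
try:
    sents[0]
    indexing_error = None
except TypeError as err:
    indexing_error = str(err)
print(indexing_error)

# next() pulls only the first item; no list is needed.
first = next(sents)
print(first)

# A generator is consumed as it is read, so only the rest is left.
remaining = list(sents)
print(remaining)
```

With a real doc, the equivalent of the `next()` call would presumably be `next(iter(doc.sents))`, which avoids materializing every sentence when only the first one is needed.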


##### Token
- It is the smallest unit of a doc (typically a word or a punctuation mark in the sentence).
- It has various attributes that represent the linguistic features of the word. These features are derived from the nature of the word, its position and its usage in the sentence.
- Let's check some of the attributes of a token obtained from a sent.

In [14]:
token = sent1[10]
print(token)
print("------------")
print(token.text)
print(token.lemma_) #base (dictionary) form of the word
print(token.ent_type_) #entity type associated with this token (empty here: a verb is not an entity)
print(token.ent_iob_) #I, O or B: inside, outside or beginning of an entity
print(token.left_edge) #leftmost token of this token's syntactic subtree
print(token.right_edge) #rightmost token of this token's syntactic subtree
print(token.pos_) #coarse-grained part-of-speech tag


#we have covered many of the concepts like POS, lemma, entity in the 2nd document - 2-understanding-NLP-using-nltk.ipynb

known
------------
known
know

O
commonly
USA
VERB
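
The I/O/B codes printed by `token.ent_iob_` can be illustrated with a small stand-alone sketch. The `iob_tags` helper and the hand-written entity span below are assumptions for illustration only; in spaCy the spans come from the named entity recognizer, not from a function like this.

```python
def iob_tags(tokens, entity_spans):
    """Return one IOB tag per token.

    entity_spans is a list of (start, end) token index pairs, end exclusive.
    """
    tags = ["O"] * len(tokens)           # default: outside any entity
    for start, end in entity_spans:
        tags[start] = "B"                # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"                # remaining tokens inside the entity
    return tags

tokens = ["landscapes", "of", "the", "Grand", "Canyon", ","]
spans = [(2, 5)]                         # hand-written span for "the Grand Canyon"

for token, tag in zip(tokens, iob_tags(tokens, spans)):
    print(token, tag)
```

A token such as `known` sits outside every entity, which is why its code is `O`.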


- Visualizing Dependency Parser and Parts of speech

##### Ent
- `doc.ents` lists the named entities found in a doc.


In [20]:
ents = doc.ents

# List of entities along with their types
for ent in doc.ents[:10]:
    print (ent, ent.label_)



U.S.A. GPE
United States of America GPE
USA GPE
N. America LOC
50 CARDINAL
NYC ORG
New York City GPE
the Grand Canyon LOC
USA GPE
the Civil War EVENT


- Some of the entity label abbreviations

| **Abbreviation** | **Description**                  |
|------------------|----------------------------------|
| PERSON           | People, including fictional      |
| ORG              | Companies, agencies, institutions, etc. |
| GPE              | Countries, cities, states        |
| DATE             | Absolute or relative dates or periods |
| TIME             | Times smaller than a day         |
| MONEY            | Monetary values, including unit  |
| NORP             | Nationalities or religious/political groups |


- visualize ents

In [21]:
#Visualizing ents in the doc using displacy module
from spacy import displacy
displacy.render(doc, style="ent")

#### Word Vectors and spacy
- Install a larger model that ships with word vectors
    - Before getting into word vectors, let's install a larger English model, `en_core_web_md`, which includes word vectors (the small `en_core_web_sm` model used so far does not).




In [1]:
!py -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)

- What are word vectors?
    - Word vectors are numerical representations of words. They are typically one-dimensional arrays of floating-point numbers, e.g. `word_vector = [0.134, -0.872, 0.562, 0.781, -0.234]`.
    - Each value captures some aspect of how the word is used alongside other words in the language.
- How are word vectors obtained?
    - Word vectors are obtained by training models on a large amount of natural-language text.
    - As a model trains, it keeps updating the vector of each word.
    - So what goes on in this training?
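
As a rough intuition for how a word's usage alongside other words can be turned into numbers, the sketch below builds count-based vectors from a tiny made-up corpus. This is only an illustration under stated assumptions: the corpus and the `cooccurrence_vector` helper are invented here, and this is not the training procedure behind spaCy's models.

```python
from collections import Counter

# Tiny made-up corpus (assumption, for illustration only).
corpus = [
    "the nation elected a president",
    "the country elected a leader",
    "the pizza tasted great",
]

# Fixed, sorted vocabulary so every vector has the same layout.
vocab = sorted({word for sentence in corpus for word in sentence.split()})

def cooccurrence_vector(target):
    """Count how often each vocabulary word shares a sentence with target."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target)
    return [counts[w] for w in vocab]

print(vocab)
print(cooccurrence_vector("nation"))
print(cooccurrence_vector("country"))
```

Words that appear in similar sentences, such as `nation` and `country`, end up with overlapping vectors, while `pizza` shares little with either.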


- Word vector demonstration in spaCy

In [24]:
#once again let's process the text, this time using the larger model
nlp = spacy.load("en_core_web_md")
with open ("usa.txt", "r") as f:
    text = f.read()
doc = nlp(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America.


In [31]:
#Let us view the vector of a word
word1 = sentence1[20]
print(word1.text)
print(word1.vector)
# you can see the word vector associated with the word "nation"
# here a word vector is a long array of floats

nation
[-7.5829e-01  3.9237e-02  2.3143e-01 -4.1592e-01 -7.6574e-03 -3.2032e-01
  4.1367e-01  6.9440e-01 -3.0146e-01  2.0164e+00 -8.1299e-01  2.4818e-01
 -2.5705e-01 -3.8748e-02 -9.2144e-03  4.6765e-01 -3.9313e-01 -3.2113e-01
  9.7444e-01 -3.5395e-01 -5.4682e-01 -1.2090e-01  4.6114e-01  2.0801e-01
  7.1732e-02 -4.0251e-01  2.6083e-01  1.2042e-01 -3.3666e-01 -1.0894e-01
  2.1845e-01 -5.6275e-01  3.2672e-01 -6.6966e-01 -1.0907e-02  6.6302e-02
 -8.1467e-01 -3.0444e-01  1.4507e-01  5.2402e-01  8.2946e-02  1.7739e-01
 -5.0846e-02  1.2669e-01  5.5579e-02 -6.3532e-01  1.6651e-01  6.5790e-01
 -9.4693e-03 -3.9117e-01  3.9326e-02  1.9498e-01  5.1374e-02 -1.8961e-01
  5.6267e-01 -3.9534e-01  5.0680e-01  6.4809e-02  5.7252e-01  3.0074e-02
 -4.2439e-01 -2.5851e-01 -6.3585e-01  1.0755e-01  2.0361e-01  3.9265e-02
  1.4775e-01 -3.8724e-02  3.0193e-01  4.3575e-02 -4.5714e-01  4.4257e-01
 -4.6342e-01 -1.2837e-01 -4.0075e-01  9.5775e-02 -9.2030e-02  9.7823e-02
  9.5401e-01  1.2454e-01 -8.0596e-02  8.8876

- Similar words
    - The advantage of having words in the form of vectors is that we can measure a word's association with other words in many ways. One such measure is similarity. As a demonstration, let's look at the words most similar to 'nation'.

In [36]:
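#source: https://stackoverflow.com/questions/54717449/mapping-word-vector-to-the-most-similar-closest-word-using-spacy
import numpy as np

your_word = "nation"

#look up the vector of the word, then ask the vector table for its 10 nearest neighbours
word_vector = nlp.vocab.vectors[nlp.vocab.strings[your_word]]
ms = nlp.vocab.vectors.most_similar(np.asarray([word_vector]), n=10)
#most_similar returns (keys, best_rows, scores), with one row per query vector
words = [nlp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2]
print(words)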

['suburb', 'Crime-Ridden', 'south-east', 'anti-poverty', 'inner-city', 'SLUMS', 'suburbs', 'TWENTIES', 'TENEMENTS', 'Mothers']


- The above result may seem confusing. What the similarity check is actually doing is using the word vectors to find words that were used alongside, or in the same contexts as, the word 'nation'. This similarity is, of course, influenced by the data used to train these models.
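
The idea behind a "most similar" lookup can be sketched with cosine similarity over a handful of made-up vectors. The vectors, the `cosine` helper and the `most_similar` function below are assumptions for illustration; they only mirror the general technique, not the internals of `nlp.vocab.vectors.most_similar`.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors; real models use far more dimensions.
toy_vectors = {
    "nation":  [0.9, 0.1, 0.0],
    "country": [0.8, 0.2, 0.1],
    "state":   [0.7, 0.3, 0.0],
    "pizza":   [0.0, 0.1, 0.9],
}

def most_similar(word, n=2):
    """Rank every other word by cosine similarity to the given word."""
    query = toy_vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in toy_vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

for other, score in most_similar("nation"):
    print(other, round(score, 3))
```

Because the ranking depends entirely on the vectors, a table trained on different text would return different neighbours for the same word.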

- Obtaining a similarity score
    - between docs


In [39]:
doc1 = nlp("I like pizza and burgers")
doc2 = nlp("I love fast food")
doc3 = nlp("I exercise every day")
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))

0.8655217885971069
0.7408841848373413
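
By default, spaCy estimates doc similarity as the cosine similarity between the averaged word vectors of the two docs. The sketch below imitates that idea with made-up two-dimensional vectors. The vector table and helper names are assumptions, and words missing from the toy table are simply skipped, which is a simplification.

```python
import math

# Made-up 2-dimensional word vectors (assumption, for illustration only).
toy_vectors = {
    "pizza":    [0.9, 0.1],
    "burgers":  [0.8, 0.2],
    "food":     [0.85, 0.15],
    "exercise": [0.1, 0.9],
}

def doc_vector(words):
    """Average the vectors of the words that appear in the toy table."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    return [sum(dims) / len(known) for dims in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

doc_a = doc_vector("i like pizza and burgers".split())
doc_b = doc_vector("i love fast food".split())
doc_c = doc_vector("i exercise every day".split())

# The two food-related sentences should score higher than food vs. exercise.
print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))
```

Averaging also helps explain why unrelated sentences can still score fairly high in practice: common words shared by both sentences pull the averaged vectors towards each other.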


- similarity between words

In [44]:
doc1 = nlp("Orange")
doc2 = nlp("Fruit")
doc3 = nlp("Burger")

print(doc1[0].similarity(doc2[0]))
print(doc1[0].similarity(doc3[0]))

0.5979633927345276
0.45480555295944214
