#### Resources:
- [Reference video](https://www.youtube.com/watch?v=dIUTsFT2MeQ&list=PLUorQzJgGyjFMESZTXl_6ZPYLnwG2EDvm&index=4)
- [Reference book](https://spacy.pythonhumanities.com/intro.html)

#### NLP vs. NLU
##### NLP
- NLP (Natural Language Processing) analyzes input text according to the formal rules of a language.
- In NLP, we break the input into its units and arrange them in a structure that more capable language models can then build on.
- It covers a range of tasks, including:
    - NER (Named Entity Recognition)
    - syntax parsing
    - POS (part-of-speech) tagging
    - Text categorization

##### NLU
- NLU (Natural Language Understanding) tries to interpret language from a more human point of view, going beyond structure to meaning.
- It performs functions like:
    - semantic parsing (understanding the context)
    - question answering (important in bots)
    - sentiment analysis
    - paraphrasing
    - summarization

#### spaCy Intro
- What is spaCy?
    - It is a Python library used to implement NLP.
- Why spaCy?
    - It ships with off-the-shelf models that are fast and accurate, so developers rarely need much tuning to get good performance.
    - It can use the latest transformer models.
    - Custom training is easier than in many other NLP frameworks.
    - It is scalable, i.e. it works well with real-world data that is large and ever growing.


#### spaCy installation and setup
- site
- commands
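
As a rough sketch of how the setup could be verified from Python: the helper below (a hypothetical name, `is_installed`) only checks whether a package can be found by the import system. It assumes the usual setup, where spaCy is installed with pip and the model is downloaded as its own Python package; the commands are shown as comments because they are run in a terminal.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# Typical setup commands (run in a terminal, not in Python):
#   pip install -U spacy
#   python -m spacy download en_core_web_sm
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", running the matching command above should fix it.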




In [1]:
import spacy

In [2]:
#check that spaCy and its English model load correctly
nlp = spacy.load("en_core_web_sm")

#### Containers
- What is a container in spaCy?
- Different types of containers in spaCy

##### Doc

In [3]:
#let's take a text containing multiple paras from the file usa.txt
with open("usa.txt", "r") as f:
    text = f.read()

print(text)

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America. It is composed of 50 states, each with its own unique culture, geography, and attractions. From the bustling streets of NYC (New York City) to the serene landscapes of the Grand Canyon, the USA offers a wide range of experiences for both residents and visitors. The country is known for its rich history, which includes significant events such as the Am. Revolution, the Civil War, and the Civil Rights Movement.

The USA is a melting pot of cultures, with people from all over the world contributing to its vibrant society. This cultural diversity is reflected in the nation's cuisine, music, art, and festivals. Cities like L.A. (Los Angeles), Miami, and Chicago are renowned for their cultural scenes, offering everything from world-class museums to lively music festivals. The Am. Dream, the idea that anyone can achieve success through hard work and determination, continues to

In [4]:
#let's use spacy to create a doc out of this text
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print(doc)

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America. It is composed of 50 states, each with its own unique culture, geography, and attractions. From the bustling streets of NYC (New York City) to the serene landscapes of the Grand Canyon, the USA offers a wide range of experiences for both residents and visitors. The country is known for its rich history, which includes significant events such as the Am. Revolution, the Civil War, and the Civil Rights Movement.

The USA is a melting pot of cultures, with people from all over the world contributing to its vibrant society. This cultural diversity is reflected in the nation's cuisine, music, art, and festivals. Cities like L.A. (Los Angeles), Miami, and Chicago are renowned for their cultural scenes, offering everything from world-class museums to lively music festivals. The Am. Dream, the idea that anyone can achieve success through hard work and determination, continues to

In [10]:
#the output looks similar, doesn't it? Now let's check the length of text and doc
print(len(text))
print(len(doc))

#print first 10 elements of text and doc
for char in text[:10]:
    print(char)
print("-------------------------------")
for element in doc[:10]:
    print(element)

2607
489
T
h
e
 
U
.
S
.
A
.
-------------------------------
The
U.S.A.
(
United
States
of
America
)
,
commonly


In [11]:
#The doc does not simply split the text into tokens on whitespace.
#We can confirm that by printing the first 10 elements of the whitespace-split text
for word in text.split()[:10]:
    print(word)


The
U.S.A.
(United
States
of
America),
commonly
known
as
the


- See how the outputs differ: the doc treats each parenthesis as a separate token, because it breaks the text into meaningful units and lists each one as a token. Also notice that U.S.A. is not split even though it contains full stops. A full stop is normally a token of its own, but here spaCy recognizes it as part of an abbreviation.
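
The same comparison can be sketched without downloading a model. The snippet below assumes a blank English pipeline (`spacy.blank("en")`), which contains only the rule-based tokenizer; the sample string is a shortened version of the first sentence of usa.txt.

```python
import spacy

# A blank English pipeline has only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")

sample = "The U.S.A. (United States of America), commonly known as the USA"

# Plain whitespace splitting keeps punctuation glued to the words.
split_tokens = sample.split()

# The spaCy tokenizer separates punctuation but keeps the abbreviation whole.
spacy_tokens = [token.text for token in nlp(sample)]

print(split_tokens)
print(spacy_tokens)
```

The whitespace split keeps "(United" and "America)," as single items, while the tokenizer emits the parentheses and the comma as their own tokens.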

##### Sents
- It gives access to the sentences of the text in a doc.
- Even a seemingly simple task like splitting sentences involves many rules, and these vary by language.
- spaCy performs sentence boundary detection out of the box through the doc object.

In [16]:
#Let us try to extract the sentences out of our doc
sents = doc.sents

for sent in sents:
    print(sent)

#notice how the blank line between paragraphs is carried into the output
#also notice that "Am." is wrongly treated as the end of a sentence

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America.
It is composed of 50 states, each with its own unique culture, geography, and attractions.
From the bustling streets of NYC (New York City) to the serene landscapes of the Grand Canyon, the USA offers a wide range of experiences for both residents and visitors.
The country is known for its rich history, which includes significant events such as the Am.
Revolution, the Civil War, and the Civil Rights Movement.


The USA is a melting pot of cultures, with people from all over the world contributing to its vibrant society.
This cultural diversity is reflected in the nation's cuisine, music, art, and festivals.
Cities like L.A. (Los Angeles), Miami, and Chicago are renowned for their cultural scenes, offering everything from world-class museums to lively music festivals.
The Am.
Dream, the idea that anyone can achieve success through hard work and determination, continues t

In [17]:
#let us try to print the first sentence by indexing (this raises a TypeError)
sent1 = sents[0]
print(sent1)

TypeError: 'generator' object is not subscriptable

- It does not work because sents is a generator over the doc object. A generator can be looped over, but it is not subscriptable (i.e. we cannot index into it with [0]).

In [22]:
#so let's extract the sents into a list and access the first sentence
sent1 = list(doc.sents)[0]
print(sent1)

The U.S.A. (United States of America), commonly known as the USA, is a vast and diverse nation located in N. America.
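
The generator behaviour seen above is plain Python and can be reproduced without spaCy. In this minimal sketch, `sentences` is a hypothetical stand-in for a sentence generator; it shows that indexing fails, that `next` and `list` work, and that a generator is empty once it has been consumed.

```python
def sentences():
    """Hypothetical stand-in generator that yields sentences one at a time."""
    yield "First sentence."
    yield "Second sentence."


gen = sentences()

try:
    gen[0]  # indexing a generator is not supported
except TypeError as err:
    error_message = str(err)
    print(error_message)

first = next(gen)   # take only the next item
rest = list(gen)    # collect whatever is left into a list
again = list(gen)   # the generator is now exhausted, so this is empty

print(first)
print(rest)
print(again)
```

When only the first sentence is needed, `next(iter(doc.sents))` avoids building the whole list.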


##### Token
- It is the smallest unit of a doc (typically a word in the sentence).
- It has various attributes that represent the linguistic features of the word. These features are derived from the word itself, its position and its usage in the sentence.
- Let's check some of the attributes of a token obtained from a sent.

In [32]:
token = sent1[10]
print(token)
print("------------")
print(token.text)
print(token.lemma_) #lemma (base form of the word)
print(token.ent_id_) #entity ID; the entity type is in token.ent_type_ (both are empty for this verb)
print(token.ent_iob_) #I, O or B: inside, outside or beginning of an entity
print(token.left_edge) #leftmost token of this token's syntactic subtree
print(token.right_edge) #rightmost token of this token's syntactic subtree
print(token.pos_) #coarse-grained part-of-speech tag


#we have covered many of the concepts like POS, lemma, entity in the 2nd document - 2-understanding-NLP-using-nltk.ipynb

known
------------
known
know

O
commonly
USA
VERB
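
Some token attributes are purely lexical and do not need a trained model. The sketch below assumes a blank English pipeline (tokenizer only) and a sentence taken from usa.txt; attributes such as `pos_` or `lemma_` would stay empty here because no tagger or lemmatizer is loaded.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("It is composed of 50 states.")

# (index, text, is alphabetic?, is punctuation?, looks like a number?)
rows = [(t.i, t.text, t.is_alpha, t.is_punct, t.like_num) for t in doc]

for row in rows:
    print(row)
```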


- Visualizing Dependency Parser and Parts of speech

##### Ent
- list ents in a doc
- some ent abbreviations
- visualize ents

#### Word Vectors and spaCy
- install a more complex model that uses word vectors
- what are word vectors?
- how are word vectors obtained?
- word vector simulation in spacy
- similar words
- Obtaining similarity rating
    - between docs
    - between words
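
As a rough illustration of what a similarity rating measures: spaCy's default similarity estimate is the cosine similarity between vectors. The sketch below computes it by hand on made-up 3-dimensional vectors; the numbers in `toy_vectors` are invented for illustration and are not taken from any model, whose vectors have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only (not real model output).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

Vectors that point in a similar direction score close to 1, which is why "cat" and "dog" rate higher here than "cat" and "car".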

In [1]:
!py -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     ---------- ----------------------------- 8.