<a href="https://colab.research.google.com/github/jrhumberto/cd/blob/main/SpacyLife.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b>spaCy Journey</b>

1. Intro to NLP (Natural Language Processing)
2. What is spaCy
3. POS Tagging
4. Stemming and Lemmatization
5. Named entity recognition <a href = '#scrollTo=UO0TV-anfEgf'> [link] </a>
6. Stop words
7. Dependency Parsing
8. Noun chunks
9. Finding Similarity
10. Glossary <a href = '#scrollTo=sICfJaxVl6qv'> [link] </a>

## Glossary

| Item                           | Description                                                                                                        |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------|
| Tokenization                   | Segmenting text into words, punctuation etc.                                                                       |
| Lemmatization                  | Assigning the base forms of words, for example: "was" → "be" or "rats" → "rat".                                    |
| Sentence Boundary Detection    | Finding and segmenting individual sentences.                                                                       |
| Part-of-speech (POS) Tagging   | Assigning word types to tokens like verb or noun.                                                                  |
| Dependency Parsing             | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| Named Entity Recognition (NER) | Labeling named "real-world" objects, like persons, companies or locations.                                         |
| Text Classification            | Assigning categories or labels to a whole document, or parts of a document.                                        |
| Statistical model              | Process for making predictions based on examples.                                                                  |
| Training                       | Updating a statistical model with new examples.                                                                    |
| Similarity                     | Comparing words, text spans and documents and how similar they are to each other.                                  |

# Short Introduction to spaCy

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun', 'an interview', 'Recode']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']
Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun ORG
Recode PRODUCT
earlier this week DATE



## Named Entity

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY



|    TEXT    | START | END | LABEL |                     DESCRIPTION                      |
|------------|-------|-----|-------|------------------------------------------------------|
| Apple      |     0 |   5 | ORG   | Companies, agencies, institutions.                   |
| U.K.       |    27 |  31 | GPE   | Geopolitical entity, i.e. countries, cities, states. |
| $1 billion |    44 |  54 | MONEY | Monetary values, including unit.                     |
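
The `START` and `END` columns are character offsets into the original string, with `END` exclusive, so slicing the text with them recovers each entity. A minimal sketch in plain Python, assuming the offsets printed above (the `entities` list is copied by hand from that output rather than produced by spaCy here):

```python
# Sentence from the cell above.
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied by hand from the output above.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing with the offsets recovers the entity text; end_char is exclusive.
spans = [(text[start:end], label) for start, end, label in entities]

for span_text, label in spans:
    print(span_text, label)
```

This is handy when entities need to be highlighted or replaced in the raw text rather than in the token sequence.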



## POS Tagging

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)

Apple Apple PROPN NNP
is be AUX VBZ
looking look VERB VBG
at at ADP IN
buying buy VERB VBG
U.K. U.K. PROPN NNP
startup startup NOUN NN
for for ADP IN
$ $ SYM $
1 1 NUM CD
billion billion NUM CD


| TEXT    | LEMMA   | POS   | TAG |
|---------|---------|-------|-----|
| Apple   | Apple   | PROPN | NNP |
| is      | be      | AUX   | VBZ |
| looking | look    | VERB  | VBG |
| at      | at      | ADP   | IN  |
| buying  | buy     | VERB  | VBG |
| U.K.    | U.K.    | PROPN | NNP |
| startup | startup | NOUN  | NN  |
| for     | for     | ADP   | IN  |
| $       | $       | SYM   | $   |
| 1       | 1       | NUM   | CD  |
| billion | billion | NUM   | CD  |

## Stop Word Removal Using spaCy

*Let's Stop the STOP WORDS*

In [None]:
from spacy.lang.en import English

# Create a blank English pipeline: it provides the tokenizer and the
# language data (including stop words), but no tagger, parser, NER or vectors
nlp = English()

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# The "nlp" object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# Create list of word tokens
token_list = [token.text for token in my_doc]

# Create list of word tokens after removing stop words.
# Each word is looked up in the vocabulary; the resulting lexeme
# carries the is_stop flag.
filtered_sentence = [word for word in token_list if not nlp.vocab[word].is_stop]

print(token_list)
print(filtered_sentence)


['He', 'determined', 'to', 'drop', 'his', 'litigation', 'with', 'the', 'monastry', ',', 'and', 'relinguish', 'his', 'claims', 'to', 'the', 'wood', '-', 'cuting', 'and', '\n', 'fishery', 'rihgts', 'at', 'once', '.', 'He', 'was', 'the', 'more', 'ready', 'to', 'do', 'this', 'becuase', 'the', 'rights', 'had', 'become', 'much', 'less', 'valuable', ',', 'and', 'he', 'had', '\n', 'indeed', 'the', 'vaguest', 'idea', 'where', 'the', 'wood', 'and', 'river', 'in', 'question', 'were', '.']
['determined', 'drop', 'litigation', 'monastry', ',', 'relinguish', 'claims', 'wood', '-', 'cuting', '\n', 'fishery', 'rihgts', '.', 'ready', 'becuase', 'rights', 'valuable', ',', '\n', 'vaguest', 'idea', 'wood', 'river', 'question', '.']
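
The output above drops both `'He'` and `'he'`, which suggests the stop-word check ignores case. A minimal sketch of that idea in plain Python, assuming a lowercase membership test; `STOP_WORDS_DEMO` is a hypothetical hand-picked list, far smaller than spaCy's real stop-word set:

```python
# Hypothetical mini stop list for illustration; spaCy's own set is much larger.
STOP_WORDS_DEMO = {"he", "to", "his", "with", "the", "and", "at", "once"}

# First few tokens from the output above.
tokens = ["He", "determined", "to", "drop", "his",
          "litigation", "with", "the", "monastry"]

# Keep a token only if its lowercase form is not in the stop list.
kept = [tok for tok in tokens if tok.lower() not in STOP_WORDS_DEMO]
print(kept)
```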


## Stemming VS Lemmatization

Both are text normalization techniques that map different forms of a word to a common form, which shrinks the vocabulary and reduces machine learning modelling time.

A <b>stemming</b> algorithm works by cutting a suffix or prefix off the word. Lemmatization is a more powerful operation because it takes the morphological analysis of the word into account.

<b>Lemmatization</b> returns the lemma, which is the base (dictionary) form shared by all of a word's inflected forms.

In short, stemming is a quick and dirty method of chopping words down to a root form, whereas lemmatization is an intelligent operation that uses dictionaries built from in-depth linguistic knowledge. Hence, lemmatization helps in forming better features.
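
The contrast can be sketched in plain Python. This is a toy illustration under stated assumptions, not a real stemmer or lemmatizer: `SUFFIXES`, `LEMMAS`, `naive_stem` and `naive_lemma` are hypothetical names invented for this example, and the suffix list and dictionary are hand-picked.

```python
# Toy suffix-stripping stemmer: an illustration only, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, hand-picked suffix list


def naive_stem(word):
    """Chop off the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny hand-written dictionary standing in for real linguistic resources.
LEMMAS = {"was": "be", "rats": "rat", "determined": "determine", "claims": "claim"}


def naive_lemma(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


# Print each word with its stem and its lemma side by side.
for word in ["was", "rats", "determined", "claims"]:
    print(word, naive_stem(word), naive_lemma(word))
```

The stemmer turns "determined" into the non-word "determin" and leaves "was" untouched, while the dictionary lookup returns "determine" and "be". The price is that a lemmatizer only knows the forms its dictionary or rules cover.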

In [None]:

# Make sure to download the English model first with
# "python -m spacy download en_core_web_sm"

import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp(u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were.""")

# Collect the lemma of every token in the document
lemma_word1 = [token.lemma_ for token in doc]
print(lemma_word1)

['-PRON-', 'determine', 'to', 'drop', '-PRON-', 'litigation', 'with', 'the', 'monastry', ',', 'and', 'relinguish', '-PRON-', 'claim', 'to', 'the', 'wood', '-', 'cuting', 'and', '\n', 'fishery', 'rihgts', 'at', 'once', '.', '-PRON-', 'be', 'the', 'more', 'ready', 'to', 'do', 'this', 'becuase', 'the', 'right', 'have', 'become', 'much', 'less', 'valuable', ',', 'and', '-PRON-', 'have', '\n', 'indeed', 'the', 'vague', 'idea', 'where', 'the', 'wood', 'and', 'river', 'in', 'question', 'be', '.']


`NOTE: Stemming is not available in spaCy`

## Noun Chunks

In [None]:
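# Generate noun phrases and print, for each one, its text,
# the dependency label of its root token and the head of that root
doc = nlp(u'I love data science on analytics vidhya')
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.dep_, chunk.root.head.text)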

I nsubj love
data science dobj love
analytics pobj on


## Dependency Parsing

See this Analytics Vidhya tutorial for a walkthrough: [link](https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/)

## Finding Similarity

In [None]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.0/en_core_web_md-2.2.0.tar.gz --no-deps


In [None]:
spacy.load("en_core_web_sm")

<spacy.lang.en.English at 0x7f5bab15ea90>

In [None]:
import spacy
import en_core_web_md
#nlp = spacy.load("en_core_web_md")
nlp = en_core_web_md.load()
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
banana True 6.700014 False
afskfsd False 0.0 True


In [None]:
nlp = en_core_web_md.load() # make sure to use larger model!
tokens = nlp("dog cat banana")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.80168545
dog banana 0.24327643
cat dog 0.80168545
cat cat 1.0
cat banana 0.28154364
banana dog 0.24327643
banana cat 0.28154364
banana banana 1.0
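
By default, `similarity` compares word vectors using cosine similarity, which is why a token compared with itself scores 1.0 and the scores are symmetric. A minimal sketch of that computation in plain Python; the `cosine_similarity` helper is written for this example and the three-dimensional vectors are made-up values, not the real `en_core_web_md` vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))  # L2 norm of a
    norm_b = math.sqrt(sum(y * y for y in b))  # L2 norm of b
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid dividing by zero for an all-zero vector
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen so that "dog" and "cat" point the same way.
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

# Compare every word with every other word, as in the cell above.
for word1, vec1 in vectors.items():
    for word2, vec2 in vectors.items():
        print(word1, word2, round(cosine_similarity(vec1, vec2), 3))
```

The L2 norm computed inside the helper is the same quantity that `token.vector_norm` reports, so a token without a vector (norm 0.0) cannot be compared meaningfully.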
