# 2. Natural Language Processing for Chatbots

**spaCy**: "an open-source software library for advanced NLP, written in Python and Cython, built by Matthew Honnibal. It provides intuitive APIs to access its methods trained by deep learning models".

More at the [spaCy website](https://spacy.io/).


In [1]:
import spacy

In [2]:
spacy.__version__

'2.0.11'

In [3]:
#!python3 -m spacy download en

## Parts-of-speech (POS) tagging

"a process where you read some text and assign parts of speech to each word or token, such as noun, verb, adjective, etc".

In [4]:
# Loads spacy en model
nlp = spacy.load('en')

# Creates doc object
doc = nlp(u'I am learning how to build chatbots')

for token in doc:
    print(f'Text: {token.text} \t POS: {token.pos_}')


Text: I 	 POS: PRON
Text: am 	 POS: VERB
Text: learning 	 POS: VERB
Text: how 	 POS: ADV
Text: to 	 POS: PART
Text: build 	 POS: VERB
Text: chatbots 	 POS: NOUN


In [5]:
doc = nlp(u'I am going to London next week for a meeting.')

for token in doc:
    print(f'Text: {token.text} \t POS: {token.pos_}')


Text: I 	 POS: PRON
Text: am 	 POS: VERB
Text: going 	 POS: VERB
Text: to 	 POS: ADP
Text: London 	 POS: PROPN
Text: next 	 POS: ADJ
Text: week 	 POS: NOUN
Text: for 	 POS: ADP
Text: a 	 POS: DET
Text: meeting 	 POS: NOUN
Text: . 	 POS: PUNCT
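A chatbot could use these tags to pick out candidate place names or topics from a user message. The sketch below is only an illustration: it hard-codes the (text, POS) pairs printed above instead of calling spaCy, and `words_with_pos` is a hypothetical helper name.

```python
# (text, POS) pairs copied from the output above; no spaCy call is made here.
tagged = [
    ('I', 'PRON'), ('am', 'VERB'), ('going', 'VERB'), ('to', 'ADP'),
    ('London', 'PROPN'), ('next', 'ADJ'), ('week', 'NOUN'), ('for', 'ADP'),
    ('a', 'DET'), ('meeting', 'NOUN'), ('.', 'PUNCT'),
]


def words_with_pos(pairs, wanted):
    """Return the token texts whose POS tag is in `wanted` (hypothetical helper)."""
    return [text for text, pos in pairs if pos in wanted]


proper_nouns = words_with_pos(tagged, {'PROPN'})
nouns = words_with_pos(tagged, {'NOUN'})
print(proper_nouns)
print(nouns)
```

With a live pipeline, the same filter could be written over `doc` using `token.text` and `token.pos_`.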


In [6]:
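def print_token(token):
    print(f'Text: {token.text}\tLemma: {token.lemma_}\tPOS: {token.pos_}')
    print(f'Tag: {token.tag_}\tDependency: {token.dep_}\tShape: {token.shape_}')
    # Note: is_alpha is True only for purely alphabetic tokens (digits make it
    # False), so the "alpha-numeric" label below is looser than the check itself
    print(f'Is alpha-numeric? {token.is_alpha}\tIs stopword? {token.is_stop}')
    print()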


In [7]:
doc = nlp(u'Google release "Move Mirror" AI experiment that matches your pose from 80,000 images')

for token in doc:
    print_token(token)


Text: Google	Lemma: google	POS: PROPN
Tag: NNP	Dependency: compound	Shape: Xxxxx
Is alpha-numeric? True	Is stopword? False

Text: release	Lemma: release	POS: NOUN
Tag: NN	Dependency: nmod	Shape: xxxx
Is alpha-numeric? True	Is stopword? False

Text: "	Lemma: "	POS: PUNCT
Tag: ``	Dependency: punct	Shape: "
Is alpha-numeric? False	Is stopword? False

Text: Move	Lemma: move	POS: PROPN
Tag: NNP	Dependency: nmod	Shape: Xxxx
Is alpha-numeric? True	Is stopword? False

Text: Mirror	Lemma: mirror	POS: PROPN
Tag: NNP	Dependency: nmod	Shape: Xxxxx
Is alpha-numeric? True	Is stopword? False

Text: "	Lemma: "	POS: PUNCT
Tag: ''	Dependency: punct	Shape: "
Is alpha-numeric? False	Is stopword? False

Text: AI	Lemma: ai	POS: PROPN
Tag: NNP	Dependency: compound	Shape: XX
Is alpha-numeric? True	Is stopword? False

Text: experiment	Lemma: experiment	POS: NOUN
Tag: NN	Dependency: ROOT	Shape: xxxx
Is alpha-numeric? True	Is stopword? False

Text: that	Lemma: that	POS: ADJ
Tag: WDT	Dependency: nsubj	Shape: xxxx

### Token attributes:

<img src='./IMG/token-attrs.png'>
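The `Shape` values in the output above (for example `Xxxxx` for `Google`) encode capitalization and digits. A rough sketch of how such a string could be derived follows. `word_shape` here is a standalone illustrative function, and the rule of dropping symbols once a run reaches four identical ones is inferred from the outputs above, not taken from spaCy's source.

```python
def word_shape(text):
    """Map letters to X/x and digits to d, capping runs of one symbol at four."""
    shape = []
    last = ''
    run = 0
    for char in text:
        if char.isalpha():
            symbol = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            symbol = 'd'
        else:
            symbol = char  # punctuation is kept as-is
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        if run < 4:
            shape.append(symbol)
    return ''.join(shape)


for word in ('Google', 'release', 'AI', '80,000'):
    print(word, word_shape(word))
```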

### POS attributes:

<img src='./IMG/pos-attrs.png'>

## Stemming and Lemmatization

Stemming: "reducing inflected words to their word stem, base form".
- Ex.: saying -> say.

Lemmatization: "algorithmic process of determining the *lemma* of a word based on its intended meaning".
- Ex.: walk, walks, walked, walking -> walk.
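To make the difference concrete, here is a toy sketch. The suffix list and the lemma table are made-up assumptions for illustration, not the rules or data of spaCy or any other library: the stemmer only chops suffixes, while the lemmatizer looks words up and can therefore handle irregular forms.

```python
# Toy suffix list and lemma table; both are illustrative assumptions.
SUFFIXES = ('ing', 'ed', 's')
LEMMA_TABLE = {
    'walks': 'walk', 'walked': 'walk', 'walking': 'walk',
    'saying': 'say', 'better': 'good',
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ('saying', 'walked', 'better'):
    print(w, toy_stem(w), toy_lemma(w))
```

The stemmer leaves `better` unchanged because no suffix matches, whereas the lookup maps it to `good`.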


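**Change in API!**

The imports below work with the spaCy version used here (2.0.11). From spaCy 2.2 onwards the lemmatization tables were moved out of the core library, so `LEMMA_INDEX`, `LEMMA_EXC` and `LEMMA_RULES` can no longer be imported from `spacy.lang.en`. See [this Stack Overflow question](https://stackoverflow.com/questions/58779371/importerror-cannot-import-name-lemma-index-from-spacy-lang-en) and the [spaCy v2.2 migration notes](https://spacy.io/usage/v2-2#migrating).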

In [8]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

In [9]:
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

In [10]:
lemmatizer('chuckling', 'VERB')

['chuckle']

In [11]:
lemmatizer('blazing', 'VERB')

['blaze']

In [12]:
lemmatizer('fastest', 'ADJ')

['fast']

## Named-Entity Recognition

Named-Entity Recognition (NER): "process of finding and classifying [named entities](https://en.wikipedia.org/wiki/Named_entity) existing in the given text into pre-defined categories".
- "hugely dependent on the knowledge base used to train the NE extraction algorithm".


In [13]:
my_string = 'Google has its headquarters in Mountain View, \
California having revenue amounted to 109.65 billion US dollars'

doc = nlp(my_string)

for ent in doc.ents:
    print(f'Text: {ent.text}\tLabel: {ent.label_}')


Text: Google	Label: ORG
Text: Mountain View	Label: GPE
Text: California	Label: GPE
Text: 109.65 billion US dollars	Label: MONEY


In [14]:
my_string = 'Mark Zuckerberg born May 14, 1984 in New York \
is an American technology entrepreneur and philanthropist \
best known for co-founding and leading Facebook as its chairman and CEO.'

doc = nlp(my_string)

for ent in doc.ents:
    print(f'Text: {ent.text}\tLabel: {ent.label_}')


Text: Mark Zuckerberg	Label: PERSON
Text: May 14, 1984	Label: DATE
Text: New York	Label: GPE
Text: American	Label: NORP
Text: Facebook	Label: ORG


In [15]:
my_string = 'I usually wake up at 9:00 AM. 90% of my daytime goes in learning new things.'

doc = nlp(my_string)

for ent in doc.ents:
    print(f'Text: {ent.text}\tLabel: {ent.label_}')


Text: 9:00 AM	Label: TIME
Text: 90%	Label: PERCENT


### Entity types:

<img src='./IMG/entity-types.png'>

"Whenever we intend to build a conversational agent or chatbot in simple terms, we always have a domain in mind."
- "By finding out the entity in the question, one can get a fair idea of the context in which the question was asked."
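As a rough illustration of how entities can hint at context, the sketch below groups recognized entities by label so that a bot could check which kinds of information a message contains. The (text, label) pairs are copied from the Mark Zuckerberg output above, and `group_by_label` is a hypothetical helper, not a spaCy function.

```python
# (text, label) pairs copied from the output above; no spaCy call is made here.
entities = [
    ('Mark Zuckerberg', 'PERSON'), ('May 14, 1984', 'DATE'),
    ('New York', 'GPE'), ('American', 'NORP'), ('Facebook', 'ORG'),
]


def group_by_label(pairs):
    """Collect entity texts under their label, preserving input order."""
    grouped = {}
    for text, label in pairs:
        grouped.setdefault(label, []).append(text)
    return grouped


slots = group_by_label(entities)
print(slots.get('PERSON'))
print(slots.get('GPE'))
print('MONEY' in slots)
```

With a live pipeline, the pairs could be built from `doc.ents` using `ent.text` and `ent.label_`.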


In [16]:
my_string1 = 'Imagine Dragons are the best band.'
my_string2 = 'Imagine dragons come and take over the city.'

doc1 = nlp(my_string1)
doc2 = nlp(my_string2)

for ent in doc1.ents:
    print(ent.text, ent.label_)

for ent in doc2.ents:
    print(ent.text, ent.label_)


Imagine Dragons ORG


## Stopwords

<img src='./IMG/stopwords.png'>

In [17]:
from spacy.lang.en.stop_words import STOP_WORDS

In [18]:
print(STOP_WORDS)

{'might', 'the', 'full', 'therefore', 'that', 'if', 'here', 'such', 'those', 'back', 'done', 'whence', 'to', 'ten', 'there', 'amount', 'whether', 'really', 'hereafter', 'yours', 'can', 'fifteen', 'already', 'no', 'wherein', 'next', 'or', 'in', 'he', 'own', 'nobody', 'may', 'noone', 'now', 'three', 'via', 'during', 'through', 'anywhere', 'show', 'eight', 'various', 'with', 'as', 'any', 'anyway', 'doing', 'must', 'none', 're', 'somehow', 'first', 'across', 'namely', 'serious', 'she', 'us', 'your', 'please', 'further', 'never', 'someone', 'sometimes', 'thereupon', 'under', 'five', 'whereupon', 'one', 'until', 'could', 'yourself', 'meanwhile', 'before', 'so', 'everything', 'not', 'neither', 'out', 'yet', 'themselves', 'anyone', 'this', 'herein', 'how', 'its', 'my', 'onto', 'thereafter', 'you', 'but', 'using', 'sixty', 'thus', 'just', 'rather', 'whenever', 'among', 'every', 'fifty', 'give', 'even', 'latterly', 'mine', 'ours', 'had', 'elsewhere', 'twelve', 'whatever', 'have', 'mostly', 'only

In [19]:
nlp.vocab['is'].is_stop

True

In [20]:
nlp.vocab['hello'].is_stop

False
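A common use of stop words is to drop them before looking for the words that carry the request. The sketch below assumes a tiny hand-picked subset of the set printed above and a naive whitespace split; `remove_stop_words` and `SAMPLE_STOP_WORDS` are hypothetical names.

```python
# A hand-picked subset of the stop words printed above (assumption: the real
# STOP_WORDS set is much larger).
SAMPLE_STOP_WORDS = {'please', 'show', 'us', 'the', 'next', 'to'}


def remove_stop_words(text, stop_words):
    """Split on whitespace and drop tokens found in `stop_words` (case-insensitive)."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]


kept = remove_stop_words('Please show us the next flight to Goa', SAMPLE_STOP_WORDS)
print(kept)
```

With spaCy tokens, the same filter could test `token.is_stop` instead of a set lookup.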

## Dependency parsing

"gives you a parsed tree that explains the parent-child relationship between the words or phrases and is independent of the order in which words occur."

**Ancestors**: the sequence of a token's syntactic ancestors (its head, the head's head, and so on up to the root).

**Children**: "immediate syntactic dependents of the token."

Ancestors:

In [21]:
doc = nlp(u'Book me a flight from Bangalore to Goa')
blr, goa = doc[5], doc[7]

list(blr.ancestors)

[from, flight, Book]

In [22]:
list(goa.ancestors)

[to, flight, Book]

In [23]:
print(doc[4], list(doc[4].ancestors))

from [flight, Book]


In [24]:
doc[3].is_ancestor(doc[5])

True

In [25]:
doc[2].is_ancestor(doc[5])

False

In [26]:
doc = nlp('Book a table at the restaurant and the taxi to the hotel')
tasks = doc[2], doc[8]  # (table, taxi)
task_targets = doc[5], doc[11]  # (restaurant, hotel)

# For each target, walk up its ancestors until one of the booked items is found
for target in task_targets:
    for tok in target.ancestors:
        if tok in tasks:
            print("Booking of {} belongs to {}".format(tok, target))
            break

Booking of table belongs to restaurant
Booking of taxi belongs to hotel


Children:

In [27]:
doc = nlp(u'Book me a flight from Bangalore to Goa')

list(doc[3].children)

[a, from, to]
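The ancestor and child lookups above can be reproduced on a hand-written head table, which may help show what the parser output encodes. This is a minimal sketch: the `heads` list was written by hand to agree with the outputs above (the head of `me` is an assumption), and `ancestors` and `children` are plain functions, not spaCy's API.

```python
words = ['Book', 'me', 'a', 'flight', 'from', 'Bangalore', 'to', 'Goa']
# heads[i] is the index of the parent of words[i]; the root points to itself.
# Hand-written to agree with the outputs above; the head of 'me' is an assumption.
heads = [0, 0, 3, 0, 3, 4, 3, 6]


def ancestors(i):
    """Walk up the head links from token i to the root, nearest ancestor first."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(words[i])
    return chain


def children(i):
    """Return the immediate dependents of token i in sentence order."""
    return [words[j] for j, h in enumerate(heads) if h == i and j != i]


print(ancestors(5))
print(ancestors(7))
print(children(3))
```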

Interactive visualization:

In [28]:
from spacy import displacy

In [31]:
doc = nlp('Book a table at the restaurant and the taxi to the hotel')
displacy.serve(doc, style='dep')

ValueError: buffer source array is read-only