# 2. Natural Language Processing for Chatbots

**spaCy**: "an open-source software library for advanced NLP, written in Python and Cython, built by Matthew Honnibal. It provides intuitive APIs to access its methods trained by deep learning models".

More at the [spaCy website](https://spacy.io/).


In [1]:
import spacy

In [2]:
spacy.__version__

'2.0.11'

In [3]:
#!python3 -m spacy download en

## Parts-of-speech (POS) tagging

"a process where you read some text and assign parts of speech to each word or token, such as noun, verb, adjective, etc".

In [4]:
# Loads spacy en model
nlp = spacy.load('en')

# Creates doc object
doc = nlp(u'I am learning how to build chatbots')

for token in doc:
    print(f'Text: {token.text} \t POS: {token.pos_}')


Text: I 	 POS: PRON
Text: am 	 POS: VERB
Text: learning 	 POS: VERB
Text: how 	 POS: ADV
Text: to 	 POS: PART
Text: build 	 POS: VERB
Text: chatbots 	 POS: NOUN


In [5]:
doc = nlp(u'I am going to London next week for a meeting.')

for token in doc:
    print(f'Text: {token.text} \t POS: {token.pos_}')


Text: I 	 POS: PRON
Text: am 	 POS: VERB
Text: going 	 POS: VERB
Text: to 	 POS: ADP
Text: London 	 POS: PROPN
Text: next 	 POS: ADJ
Text: week 	 POS: NOUN
Text: for 	 POS: ADP
Text: a 	 POS: DET
Text: meeting 	 POS: NOUN
Text: . 	 POS: PUNCT
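
One way a chatbot might use these tags is to keep only the words that tend to carry meaning. The sketch below is a hypothetical helper (`content_words` is not part of spaCy) that works on plain `(text, pos)` pairs copied from the output above, so it runs without a model. With a real `doc`, the pairs would come from `token.text` and `token.pos_`. The choice of tags to keep is an assumption made for this example.

```python
# (text, POS) pairs copied from the tagger output above.
tagged = [
    ('I', 'PRON'), ('am', 'VERB'), ('going', 'VERB'), ('to', 'ADP'),
    ('London', 'PROPN'), ('next', 'ADJ'), ('week', 'NOUN'), ('for', 'ADP'),
    ('a', 'DET'), ('meeting', 'NOUN'), ('.', 'PUNCT'),
]

# Assumption: nouns and proper nouns are the interesting tags for this toy case.
KEEP = {'NOUN', 'PROPN'}


def content_words(pairs, keep=KEEP):
    """Return the texts whose POS tag is in `keep`, in sentence order."""
    return [text for text, pos in pairs if pos in keep]


print(content_words(tagged))
```

For the sentence above this keeps `London`, `week`, and `meeting`, which is roughly what a travel bot would need to notice.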


In [6]:
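def print_token(token):
    # Print the main linguistic attributes of a single spaCy token.
    # Note: is_alpha is True only for purely alphabetic tokens (no digits),
    # so the 'alpha-numeric' label below is looser than what is checked.
    print(f'Text: {token.text}\tLemma: {token.lemma_}\tPOS: {token.pos_}')
    print(f'Tag: {token.tag_}\tDependency: {token.dep_}\tShape: {token.shape_}')
    print(f'Is alpha-numeric? {token.is_alpha}\tIs stopword? {token.is_stop}')
    print()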


In [7]:
doc = nlp(u'Google release "Move Mirror" AI experiment that matches your pose from 80,000 images')

for token in doc:
    print_token(token)


Text: Google	Lemma: google	POS: PROPN
Tag: NNP	Dependency: compound	Shape: Xxxxx
Is alpha-numeric? True	Is stopword? False

Text: release	Lemma: release	POS: NOUN
Tag: NN	Dependency: nmod	Shape: xxxx
Is alpha-numeric? True	Is stopword? False

Text: "	Lemma: "	POS: PUNCT
Tag: ``	Dependency: punct	Shape: "
Is alpha-numeric? False	Is stopword? False

Text: Move	Lemma: move	POS: PROPN
Tag: NNP	Dependency: nmod	Shape: Xxxx
Is alpha-numeric? True	Is stopword? False

Text: Mirror	Lemma: mirror	POS: PROPN
Tag: NNP	Dependency: nmod	Shape: Xxxxx
Is alpha-numeric? True	Is stopword? False

Text: "	Lemma: "	POS: PUNCT
Tag: ''	Dependency: punct	Shape: "
Is alpha-numeric? False	Is stopword? False

Text: AI	Lemma: ai	POS: PROPN
Tag: NNP	Dependency: compound	Shape: XX
Is alpha-numeric? True	Is stopword? False

Text: experiment	Lemma: experiment	POS: NOUN
Tag: NN	Dependency: ROOT	Shape: xxxx
Is alpha-numeric? True	Is stopword? False

Text: that	Lemma: that	POS: ADJ
Tag: WDT	Dependency: nsubj	Shape: xxxx

### Token attributes:

<img src='./IMG/token-attrs.png'>

### POS attributes:

<img src='./IMG/pos-attrs.png'>

## Stemming and Lemmatization

Stemming: "reducing inflected words to their word stem, base form".
- Ex.: saying -> say.

Lemmatization: "algorithmic process of determining the *lemma* of a word based on its intended meaning".
- Ex.: walk, walks, walked, walking -> walk.
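
The difference between the two can be sketched with a toy example. This is not how spaCy or any real stemmer works internally: the suffix list and the exception table are hypothetical and deliberately tiny. The stemmer only strips suffixes, while the lemmatizer first consults a lookup table keyed by word and part of speech.

```python
# Toy illustration only: real stemmers (e.g. Porter) and spaCy's lemmatizer
# use far richer rules and lookup tables.
SUFFIXES = ('ing', 'ed', 's')  # hypothetical, deliberately tiny rule set


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical exception table keyed by (word, part of speech).
LEMMA_EXCEPTIONS = {
    ('was', 'VERB'): 'be',
    ('went', 'VERB'): 'go',
    ('better', 'ADJ'): 'good',
}


def toy_lemmatize(word, pos):
    """Look the word up first; fall back to suffix stripping for verbs."""
    if (word, pos) in LEMMA_EXCEPTIONS:
        return LEMMA_EXCEPTIONS[(word, pos)]
    return toy_stem(word) if pos == 'VERB' else word


print([toy_stem(w) for w in ('walk', 'walks', 'walked', 'walking', 'saying')])
print(toy_stem('went'), toy_lemmatize('went', 'VERB'))
```

Suffix stripping handles the regular forms, but an irregular form such as `went` only maps to `go` through the lookup step, which is the part that needs the word's part of speech.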


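**Change in API!** The imports below work only on spaCy releases before v2.2 (this notebook uses 2.0.11). From v2.2 on, the lemmatization tables live in the separate `spacy-lookups-data` package, and `LEMMA_INDEX`, `LEMMA_EXC`, and `LEMMA_RULES` can no longer be imported from `spacy.lang.en`.

See [this Stack Overflow question](https://stackoverflow.com/questions/58779371/importerror-cannot-import-name-lemma-index-from-spacy-lang-en) and the [spaCy v2.2 migration notes](https://spacy.io/usage/v2-2#migrating).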

In [8]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

In [9]:
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

In [10]:
lemmatizer('chuckling', 'VERB')

['chuckle']

In [11]:
lemmatizer('blazing', 'VERB')

['blaze']

In [12]:
lemmatizer('fastest', 'ADJ')

['fast']

## Named-Entity Recognition

Named-Entity Recognition (NER): "process of finding and classifying [named entities](https://en.wikipedia.org/wiki/Named_entity) existing in the given text into pre-defined categories".
- "hugely dependent on the knowledge base used to train the NE extraction algorithm".


In [13]:
my_string = 'Google has its headquarters in Mountain View, \
California having revenue amounted to 109.65 billion US dollars'

doc = nlp(my_string)

for ent in doc.ents:
    print(f'Text: {ent.text}\tLabel: {ent.label_}')


Text: Google	Label: ORG
Text: Mountain View	Label: GPE
Text: California	Label: GPE
Text: 109.65 billion US dollars	Label: MONEY


In [14]:
my_string= 'Mark Zuckerberg born May 14, 1984 in New York \
is an American technology entrepreneur and philanthropist \
best known for co-founding and leading Facebook as its chairman and CEO.'

doc = nlp(my_string)

for ent in doc.ents:
    print(f'Text: {ent.text}\tLabel: {ent.label_}')


Text: Mark Zuckerberg	Label: PERSON
Text: May 14, 1984	Label: DATE
Text: New York	Label: GPE
Text: American	Label: NORP
Text: Facebook	Label: ORG


In [15]:
my_string = 'I usually wake up at 9:00 AM. 90% of my daytime goes in learning new things.'

doc = nlp(my_string)

for ent in doc.ents:
    print(f'Text: {ent.text}\tLabel: {ent.label_}')


Text: 9:00 AM	Label: TIME
Text: 90%	Label: PERCENT


### Entity types:

<img src='./IMG/entity-types.png'>

"Whenever we intend to build a conversational agent or chatbot in simple terms, we always have a domain in mind."
- "By finding out the entity in the question, one can get a fair idea of the context in which the question was asked."
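
As a rough illustration of how recognized entities could feed a bot's domain logic, the sketch below groups entities into slots. The `(text, label)` pairs are copied from the Google example above so the code runs without a model. The slot names and the `entities_to_slots` helper are hypothetical and not part of spaCy.

```python
from collections import defaultdict

# (text, label) pairs copied from the NER output for the Google sentence above.
entities = [
    ('Google', 'ORG'),
    ('Mountain View', 'GPE'),
    ('California', 'GPE'),
    ('109.65 billion US dollars', 'MONEY'),
]

# Hypothetical mapping from spaCy entity labels to a bot's own slot names.
LABEL_TO_SLOT = {'ORG': 'company', 'GPE': 'location', 'MONEY': 'amount'}


def entities_to_slots(pairs, mapping=LABEL_TO_SLOT):
    """Group entity texts under slot names; labels not in the mapping are skipped."""
    slots = defaultdict(list)
    for text, label in pairs:
        if label in mapping:
            slots[mapping[label]].append(text)
    return dict(slots)


print(entities_to_slots(entities))
```

With a real `doc`, the pairs would be built from `ent.text` and `ent.label_` for each `ent` in `doc.ents`.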


In [16]:
my_string1 = 'Imagine Dragons are the best band.'
my_string2 = 'Imagine dragons come and take over the city.'

doc1 = nlp(my_string1)
doc2 = nlp(my_string2)

for ent in doc1.ents:
    print(ent.text, ent.label_)

for ent in doc2.ents:
    print(ent.text, ent.label_)


Imagine Dragons ORG


## Stopwords

<img src='./IMG/stopwords.png'>

In [17]:
from spacy.lang.en.stop_words import STOP_WORDS

In [18]:
print(STOP_WORDS)

{'from', 'thereafter', 'anywhere', 'around', 'nothing', 'hence', 'take', 'upon', 'how', 'go', 'too', 'further', 'here', 'is', 'becoming', 'cannot', 'regarding', 'toward', 'out', 'seemed', 'although', 'several', 'nine', 'even', 'does', 'thereupon', 'seems', 'been', 'his', 'behind', 'into', 'neither', 'anyway', 'thus', 'throughout', 'well', 'anyhow', 'still', 'our', 'above', 'become', 'she', 'thru', 'amount', 'will', 'whether', 'becomes', 'without', 'many', 'such', 'only', 'with', 'say', 'whereas', 'did', 'meanwhile', 'except', 'four', 'hereupon', 'ever', 'off', 'over', 'him', 'another', 'himself', 'they', 'after', 'among', 'own', 'before', 'some', 'a', 'always', 'not', 'should', 'none', 'herself', 'else', 'along', 'please', 'first', 'yourself', 'has', 'something', 'quite', 'through', 'hereafter', 'whose', 'if', 'more', 'put', 'do', 'anyone', 'enough', 'much', 'then', 'whatever', 'get', 'about', 'me', 'once', 'show', 'latterly', 'for', 'towards', 'keep', 'yours', 'against', 'the', 'an', 

In [19]:
nlp.vocab['is'].is_stop

True

In [20]:
nlp.vocab['hello'].is_stop

False
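
A typical use of a stop word list is to filter a user message down to the words that matter. The sketch below uses a small hand-picked subset of the set printed above and a plain whitespace split, so it runs without spaCy. `remove_stop_words` is a hypothetical helper name.

```python
# A small subset of the STOP_WORDS set printed above (not the full list).
SOME_STOP_WORDS = {'a', 'an', 'the', 'is', 'for', 'from', 'me', 'please', 'show'}


def remove_stop_words(text, stop_words=SOME_STOP_WORDS):
    """Lowercase, split on whitespace, and drop words found in `stop_words`."""
    return [word for word in text.lower().split() if word not in stop_words]


print(remove_stop_words('Please show me the weather for London'))
```

With a parsed `doc`, the same filter can be written as `[t.text for t in doc if not t.is_stop]`.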

## Dependency parsing

"gives you a parsed tree that explains the parent-child relationship between the words or phrases and is independent of the order in which words occur."

**Ancestors**: the token's syntactic ancestors, that is, its head, the head's head, and so on up to the root of the sentence.

**Children**: "immediate syntactic dependents of the token."

Ancestors:

In [21]:
doc = nlp(u'Book me a flight from Bangalore to Goa')
blr, goa = doc[5], doc[7]

list(blr.ancestors)

[from, flight, Book]

In [22]:
list(goa.ancestors)

[to, flight, Book]

In [23]:
print(doc[4], list(doc[4].ancestors))

from [flight, Book]


In [24]:
doc[3].is_ancestor(doc[5])

True

In [25]:
doc[2].is_ancestor(doc[5])

False

In [26]:
doc = nlp('Book a table at the restaurant and the taxi to the hotel')
tasks = doc[2], doc[8] #(table, taxi)
tasks_target = doc[5], doc[11] #(restaurant, hotel)

for task in tasks_target:
    for tok in task.ancestors:
        if tok in tasks:
            print("Booking of {} belongs to {}".format(tok, task))
            break

Booking of table belongs to restaurant
Booking of taxi belongs to hotel


Children:

In [27]:
doc = nlp(u'Book me a flight from Bangalore to Goa')

list(doc[3].children)

[a, from, to]
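
The ancestor and child relations shown above can be reproduced with nothing more than a list of head pointers. In the sketch below the head indices are written by hand to agree with the outputs above (an assumption, not parser output), and the `ancestors` and `children` functions are plain-Python stand-ins for the spaCy attributes of the same name.

```python
# Toy dependency tree for 'Book me a flight from Bangalore to Goa'.
# HEADS[i] is the index of token i's head; the root points at itself
# (the same convention spaCy uses for token.head on the root).
# Assumption: these head indices are written by hand to agree with the
# outputs above; they were not produced by running the parser.
WORDS = ['Book', 'me', 'a', 'flight', 'from', 'Bangalore', 'to', 'Goa']
HEADS = [0, 0, 3, 0, 3, 4, 3, 6]


def ancestors(i):
    """Walk head pointers from token i up to the root."""
    chain = []
    while HEADS[i] != i:
        i = HEADS[i]
        chain.append(WORDS[i])
    return chain


def children(i):
    """Tokens whose head is token i, in sentence order."""
    return [WORDS[j] for j, head in enumerate(HEADS) if head == i and j != i]


print(ancestors(5))  # Bangalore
print(ancestors(7))  # Goa
print(children(3))   # flight
```

Because the tree is defined by who points at whom rather than by position, the same lookups keep working when the word order changes, which is the property the definition above describes.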

Interactive visualization:

In [28]:
from spacy import displacy

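**Check later!** Note that `displacy.serve` starts a local web server and blocks until it is interrupted; `displacy.render` is the notebook-friendly alternative.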

In [29]:
# doc = nlp('Book a table at the restaurant and the taxi to the hotel')
# displacy.serve(doc, style='dep')

In [31]:
doc = nlp('What are some places to visit in Berlin and stay in Lubeck?')

places = [doc[7], doc[11]] #[Berlin, Lubeck]
actions = [doc[5], doc[9]] #[visit, stay]

for place in places:
    for tok in place.ancestors:
        if tok in actions:
            print("User is referring {} to {}".format(place, tok))
            break


User is referring Berlin to visit
User is referring Lubeck to stay


## Noun chunks

Noun chunks: "flat phrases that have a noun as their head".

In [35]:
doc = nlp('Boston Dynamics is gearing up to produce thousands of robot dogs')
list(doc.noun_chunks)

[Boston Dynamics, thousands, robot dogs]

In [39]:
doc = nlp('Deep learning cracks the code of messenger RNAs and protein-coding potential')

for chunk in doc.noun_chunks:
    print('Text: ', chunk.text)
    print('Root text: ', chunk.root.text)
    print('Root dependency: ', chunk.root.dep_)
    print('Root head text: ', chunk.root.head.text)
    print()


Text:  Deep learning
Root text:  learning
Root dependency:  nsubj
Root head text:  cracks

Text:  the code
Root text:  code
Root dependency:  dobj
Root head text:  cracks

Text:  messenger RNAs
Root text:  RNAs
Root dependency:  pobj
Root head text:  of

Text:  protein-coding potential
Root text:  potential
Root dependency:  conj
Root head text:  RNAs



### Noun chunks attributes:

<img src='./IMG/noun-chunks-attrs.png'>


## Finding Similarity

"spaCy uses high-quality word vectors to find similarity between two words using [GloVe](https://nlp.stanford.edu/projects/glove/) algorithm".
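
Whatever algorithm produced the vectors, the comparison itself is usually cosine similarity, which is also what spaCy's `similarity` methods use by default. The sketch below computes it by hand. The three-dimensional vectors are made up for illustration and are not real spaCy or GloVe vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "word vectors", made up for illustration.
car = [0.9, 0.8, 0.1]
truck = [0.8, 0.9, 0.2]
google = [0.1, 0.2, 0.9]

print(round(cosine_similarity(car, truck), 2))
print(round(cosine_similarity(car, google), 2))
```

Vectors that point in nearly the same direction score close to 1, and unrelated directions score close to 0, so `car` and `truck` come out far more similar than `car` and `google`.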


In [40]:
doc = nlp('How are you doing today?')

for token in doc:
    print(token.text, token.vector[:5])


How [-0.29742622  0.73939687 -0.04001444  0.44034058  2.8967497 ]
are [-0.23435101 -1.6145048   1.0197463   0.99281645  0.2822714 ]
you [ 0.10252154 -3.5647113   2.482279    4.2824993   3.5902457 ]
doing [-0.6240917  -2.0210214  -0.91014886  2.7051926   4.1892524 ]
today [ 3.5409102  -0.6218591   2.6274276   2.0504882   0.20191938]
? [ 2.8914993  -0.25079137  3.3764176   1.6942683   1.9849057 ]


In [47]:
hello_doc = nlp('hello')
hi_doc = nlp('hi')
hella_doc = nlp('hella')

print(hello_doc.similarity(hi_doc))
print(hello_doc.similarity(hella_doc))

0.7879069991402569
0.4193426329044165


In [48]:
GoT_str1 = nlp('When will next season of Game of Thrones be releasing?')
GoT_str2 = nlp('Game of Thrones next season release date?')

GoT_str1.similarity(GoT_str2)

0.785019065199066

In [50]:
example_doc = nlp(u"car truck google")

for t1 in example_doc:
    for t2 in example_doc:
        similarity_perc = int(t1.similarity(t2) * 100)
        print('Word {} is {}% similar to word {}'.format(t1.text, similarity_perc, t2.text))


Word car is 100% similar to word car
Word car is 71% similar to word truck
Word car is 24% similar to word google
Word truck is 71% similar to word car
Word truck is 100% similar to word truck
Word truck is 36% similar to word google
Word google is 24% similar to word car
Word google is 36% similar to word truck
Word google is 100% similar to word google


## Tokenization

Tokenization: "split[ting] a text into meaningful segments".


In [51]:
doc = nlp('Brexit is the impending withdrawal of the U.K. from the European Union.')

for token in doc:
    print(token.text)


Brexit
is
the
impending
withdrawal
of
the
U.K.
from
the
European
Union
.
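
To see why this is harder than splitting on spaces, here is a rough regex approximation that produces the same tokens for the sentence above. It is not spaCy's tokenizer, which relies on language-specific rules and exception lists. The pattern and the `simple_tokenize` name are assumptions made for this sketch.

```python
import re

# Rough approximation only: spaCy's tokenizer uses language-specific rules and
# exception lists, not this single pattern.
TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}     # dotted abbreviations such as U.K.
  | \w+(?:[-']\w+)*        # words, allowing inner hyphens and apostrophes
  | [^\w\s]                # any other single non-space character
""", re.VERBOSE)


def simple_tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


sentence = 'Brexit is the impending withdrawal of the U.K. from the European Union.'
print(simple_tokenize(sentence))
```

The first alternative is what keeps `U.K.` in one piece while the final period of the sentence still becomes its own token. Other inputs will differ from spaCy: for example, this pattern keeps a contraction like `don't` as a single token.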


## Regular expressions


In [52]:
import re

In [54]:
sentence1 = "Book me a metro from Airport Station to Hong Kong Station."
sentence2 = "Book me a cab to Hong Kong Airport from AsiaWorld-Expo."

from_to = re.compile('.* from (.*) to (.*)')
to_from = re.compile('.* to (.*) from (.*)')

In [55]:
from_to_match = from_to.match(sentence2)
to_from_match = to_from.match(sentence2)

if from_to_match and from_to_match.groups():
    _from = from_to_match.groups()[0]
    _to = from_to_match.groups()[1]
    print("from_to pattern matched correctly. Printing values\n")
    print("From: {}, To: {}".format(_from, _to))

elif to_from_match and to_from_match.groups():
    _to = to_from_match.groups()[0]
    _from = to_from_match.groups()[1]
    print("to_from pattern matched correctly. Printing values\n")
    print("From: {}, To: {}".format(_from, _to))


to_from pattern matched correctly. Printing values

From: AsiaWorld-Expo., To: Hong Kong Airport


In [56]:
from_to_match = from_to.match(sentence1)
to_from_match = to_from.match(sentence1)

if from_to_match and from_to_match.groups():
    _from = from_to_match.groups()[0]
    _to = from_to_match.groups()[1]
    print("from_to pattern matched correctly. Printing values\n")
    print("From: {}, To: {}".format(_from, _to))

elif to_from_match and to_from_match.groups():
    _to = to_from_match.groups()[0]
    _from = to_from_match.groups()[1]
    print("to_from pattern matched correctly. Printing values\n")
    print("From: {}, To: {}".format(_from, _to))


from_to pattern matched correctly. Printing values

From: Airport Station, To: Hong Kong Station.
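
The two cells above repeat the same matching logic, and both outputs keep the sentence-final period inside the last captured group. A possible refactoring, offered as a sketch rather than the only way to do it, wraps the logic in one function and strips trailing punctuation. `extract_route` and `_clean` are hypothetical helper names; the patterns are the same as above, written as raw strings.

```python
import re

# Same two patterns as above, written as raw strings.
FROM_TO = re.compile(r'.* from (.*) to (.*)')
TO_FROM = re.compile(r'.* to (.*) from (.*)')


def _clean(text):
    """Drop surrounding spaces and sentence-final punctuation."""
    return text.strip().rstrip('.?!')


def extract_route(sentence):
    """Return {'from': ..., 'to': ...}, or None when neither pattern matches."""
    match = FROM_TO.match(sentence)
    if match:
        origin, destination = match.groups()
    else:
        match = TO_FROM.match(sentence)
        if not match:
            return None
        destination, origin = match.groups()
    return {'from': _clean(origin), 'to': _clean(destination)}


sentence1 = "Book me a metro from Airport Station to Hong Kong Station."
sentence2 = "Book me a cab to Hong Kong Airport from AsiaWorld-Expo."

print(extract_route(sentence1))
print(extract_route(sentence2))
```

Returning `None` for unmatched input lets the caller decide whether to ask the user a follow-up question instead of silently printing nothing.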
