# Building intelligent bots. Retrieval-based chatbots

In this section we build a retrieval-based chatbot with Rasa. Before that, we go through a few NLP methods and word vectorization.


## NLP methods for NLU

Let's take one of President Trump's speeches and split it into sentences.

In [None]:
import spacy

# Read the whole speech; the with-block closes the file afterwards.
with open("trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

nlp = spacy.load("en")
doc = nlp(trump)

for span in doc.sents:
    print("> ", span)

Using spaCy, we can split the text into sentences and get the part of speech of each word.

In [None]:
for span in doc.sents:
    for token in span:
        # token.i is the token's index within the whole document
        print(token.i, token.text, token.pos_)

A smaller example:

In [None]:
sample = "Broadcasting today, live from Kraków, on chatbots."

doc = nlp(sample)
for token in doc:
    print(token.text, token.pos_)

### Noun chunks

This NLP method extracts the noun phrases (a noun together with the words that describe it) from a sentence. They are important for understanding what the sentence is about.

In [None]:
doc = nlp(sample)
for nc in doc.noun_chunks:
    print(nc)

### Named Entity Recognition

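NER is an NLP method that returns neither the nouns nor the parts of speech, but the meaning of words: it finds names of real-world objects such as people, places, or organizations and labels each with its type.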

In [None]:
doc = nlp(sample)
for entity in doc.ents:
    print(entity.label_, entity.text)

## Word vectorization

Word vectorization is the process of building a vector that represents each word. Gensim has an implementation of Word2Vec. We use vectors of dimension 100 (`size`) and set the maximum distance between the current word and a context word in a sentence to 5 (`window`).

In [None]:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
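
To make the `window` parameter more concrete, the sketch below lists the (target, context) pairs that a window of a given size produces for one tokenized sentence. It is a simplified illustration written for this section, not Gensim's training code, and `context_pairs` is a hypothetical helper name.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(tokens):
        first = max(0, i - window)
        last = min(len(tokens), i + window + 1)
        for j in range(first, last):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["human", "interface", "computer"]
for pair in context_pairs(sentence, window=1):
    print(pair)
```

With a larger window, every word in a short sentence becomes context for every other word, which is why the window size mostly matters for longer texts.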

We can get the vocabulary as follows:

In [None]:
vocab = list(model.wv.vocab)
X = model.wv[vocab]  # look up the vectors through the model's word vectors
print(vocab[0])
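
The vocabulary consists of the words that occur at least `min_count` times in the corpus. As a rough sketch of that filtering step in plain Python (not Gensim's code; `build_vocab` and the toy corpus are made up for illustration):

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Keep words occurring at least `min_count` times, most frequent first."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [word for word, count in counts.most_common() if count >= min_count]


toy_corpus = [
    ["human", "computer", "interface"],
    ["computer", "system", "survey"],
    ["system", "computer"],
]
print(build_vocab(toy_corpus, min_count=1))
print(build_vocab(toy_corpus, min_count=2))
```

Raising `min_count` drops rare words, which usually have too few contexts to learn a useful vector.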

No further training is needed here; to plot the vectors we use t-SNE to reduce their dimensionality to two:

In [None]:
from sklearn.manifold import TSNE
import pandas as pd

tsne = TSNE(n_components=2, perplexity=5)  # perplexity must be smaller than the number of words
X_tsne = tsne.fit_transform(X)

df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
df

We can draw the words in a two-dimensional space:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos)
plt.show()    

Let's take a more complex example and use a longer text.

In [None]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip

We get the sentences from the text as follows:

In [None]:
from gensim.models import word2vec

sentences = word2vec.Text8Corpus('text8',max_sentence_length=100)
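
The text8 file is one long stream of words without sentence boundaries, so the corpus reader cuts it into pseudo-sentences; `max_sentence_length=100` means at most 100 words per chunk. A rough sketch of the idea, assuming an already tokenized stream (`chunk_words` is a hypothetical helper, not part of Gensim, and the word stream is only a short sample):

```python
def chunk_words(words, max_sentence_length):
    """Yield consecutive lists of at most `max_sentence_length` words."""
    for start in range(0, len(words), max_sentence_length):
        yield words[start:start + max_sentence_length]


stream = "anarchism originated as a term of abuse first used against".split()
chunks = list(chunk_words(stream, max_sentence_length=4))
for chunk in chunks:
    print(chunk)
```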

Now comes the time-consuming part: we train the model on the corpus and save it:

In [None]:
model = Word2Vec(sentences, size=100, window=5)
model.save("word2vec.model")

We choose only 100 random words, because a plot of the whole vocabulary would be unreadable.

In [None]:
import random

vocab = list(model.wv.vocab)
vocab = random.sample(vocab,100)
X = model.wv[vocab]

Let's reduce the dimensionality of the new dataset.

In [None]:
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)

Let's take a look at the chosen words.

In [None]:
df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
df

Draw the new dataset.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos)
plt.show()  

### Similarity measure through vectors

spaCy already provides vectors for words, so we can simply check the similarity between two sentences.

In [None]:
import spacy

nlp = spacy.load('en')

doc1 = nlp(u"Warsaw is the largest city in Poland.")
doc2 = nlp(u"Croissant is baked in France.")
doc3 = nlp(u"An emu is a large bird.")

for doc in [doc1, doc2, doc3]:
    for other_doc in [doc1, doc2, doc3]:
        print(doc.similarity(other_doc))
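
By default, spaCy's `similarity` is the cosine of the angle between the two vectors, and a document vector is the average of its token vectors. A minimal pure-Python sketch of both steps, using made-up three-dimensional vectors in place of real embeddings (`mean_vector` and `cosine_similarity` are helper names chosen for this example):

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "word vectors"; real models use hundreds of dimensions.
city_words = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
bird_words = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]

doc_city = mean_vector(city_words)
doc_bird = mean_vector(bird_words)
print(cosine_similarity(doc_city, doc_city))  # a vector compared with itself
print(cosine_similarity(doc_city, doc_bird))  # vectors pointing in different directions
```

Because only the angle matters, a long and a short text about the same topic can still score as similar.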

A nice example of word vectorization done by some researchers at Warsaw University: [Word2Vec](https://lamyiowce.github.io/word2viz/).

## Retrieval-based chatbot

In this section we use Rasa to build a very simple HR assistant bot. We can run Rasa as a server or use it directly from Python. To start the Rasa server, execute the following command:

    python3 -m rasa_nlu.server &

It starts a server on the default port 5000. You can test it using the `requests` package. We should get the intent of the phrase `hi`.

In [None]:
import requests

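def get_intent(sentence):
    """Ask the Rasa NLU server to parse `sentence` and return the intent name."""
    url = "http://localhost:5000/parse"
    payload = {"q": sentence}
    response = requests.get(url, params=payload)
    result = response.json()  # decode the JSON body once
    print(result)
    intent = result['intent']
    if intent['confidence'] > 0.5:
        return intent['name']
    # Low confidence: return the full response so it can be inspected
    return result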

get_intent("hi")
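
The threshold logic can be tried without a running server. The sketch below applies it to hand-written dictionaries; the payloads are assumed examples of the response shape that `get_intent` relies on (an `intent` object with `name` and `confidence`), not output captured from Rasa, and `pick_intent` is a hypothetical helper.

```python
def pick_intent(parsed, threshold=0.5):
    """Return the intent name if its confidence exceeds `threshold`, else None."""
    intent = parsed.get("intent") or {}
    if intent.get("confidence", 0.0) > threshold:
        return intent.get("name")
    return None


# Assumed response shapes, written by hand for illustration.
confident = {"text": "hi", "intent": {"name": "greet", "confidence": 0.93}, "entities": []}
uncertain = {"text": "hmm", "intent": {"name": "affirm", "confidence": 0.31}, "entities": []}

print(pick_intent(confident))
print(pick_intent(uncertain))
```

Returning `None` for the uncertain case gives the caller a single value to test before deciding how to reply.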

To use Rasa directly from Python you need to prepare a config file that contains the pipeline and the name of the file with the training examples.

In [None]:
config = """
{
  "pipeline": "spacy_sklearn",
  "path" : ".",
  "data" : "./anna.json"
}
"""

with open("config.json", "w") as config_file:
    config_file.write(config)

The data file contains examples that are used for training.

In [None]:
anna_common_examples = """
{
  "rasa_nlu_data": {
    "entity_synonyms": [
      {
        "value": "candidate",
        "synonyms": ["developer", "data scientist"]
      }
    ],
    "common_examples": [
      {
        "text": "hey", 
        "intent": "greet", 
        "entities": []
      }, 
      {
        "text": "howdy", 
        "intent": "greet", 
        "entities": []
      }, 
      {
        "text": "hey there",
        "intent": "greet", 
        "entities": []
      }, 
      {
        "text": "hello", 
        "intent": "greet", 
        "entities": []
      }, 
      {
        "text": "hi", 
        "intent": "greet", 
        "entities": []
      },
      {
        "text": "good morning",
        "intent": "greet",
        "entities": []
      },
      {
        "text": "good evening",
        "intent": "greet",
        "entities": []
      },
      {
        "text": "dear sir",
        "intent": "greet",
        "entities": []
      },
      {
        "text": "yes", 
        "intent": "affirm", 
        "entities": []
      }, 
      {
        "text": "yep", 
        "intent": "affirm", 
        "entities": []
      }, 
      {
        "text": "yeah", 
        "intent": "affirm", 
        "entities": []
      },
      {
        "text": "indeed",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "that's right",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "ok",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "great",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "right, thank you",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "add candidate",
        "intent": "candidate_add",
        "entities": []
      }, 
      {
        "text": "add candidate",
        "intent": "candidate_add",
        "entities": [
            {
              "start": 4,
              "end": 13,
              "value": "candidate",
              "entity": "candidate"
            }
        ]
      },         
      {
        "text": "adding candidate",
        "intent": "candidate_add",
        "entities": [
            {
              "start": 7,
              "end": 16,
              "value": "candidate",
              "entity": "candidate"
            }
        ]
      },
      {
        "text": "please add candidate",
        "intent": "candidate_add",
        "entities": []
      },              
      {
        "text": "please add new candidate",
        "intent": "candidate_add",
        "entities": []
      },           
      {
        "text": "we have new prescreening upcoming",
        "intent": "candidate_add",
        "entities": []
      }, 
      {
        "text": "we have a new candidate for prescreening",
        "intent": "candidate_add",
        "entities": []
      },         
      {
        "text": "correct",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "great choice",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "sounds really good",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "bye", 
        "intent": "goodbye", 
        "entities": []
      }, 
      {
        "text": "goodbye", 
        "intent": "goodbye", 
        "entities": []
      }, 
      {
        "text": "good bye", 
        "intent": "goodbye", 
        "entities": []
      }, 
      {
        "text": "stop", 
        "intent": "goodbye", 
        "entities": []
      }, 
      {
        "text": "end", 
        "intent": "goodbye", 
        "entities": []
      },
      {
        "text": "farewell",
        "intent": "goodbye",
        "entities": []
      },
      {
        "text": "Bye bye",
        "intent": "goodbye",
        "entities": []
      },
      {
        "text": "have a good one",
        "intent": "goodbye",
        "entities": []
      }
    ]
  }
}
"""

with open("anna.json", "w") as training_data:
    training_data.write(anna_common_examples)
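
Entity annotations are character offsets into `text`: `start` is the index of the first character and `end` is one past the last, so `text[start:end]` is the annotated span. Offsets are easy to get wrong by one, so a small check before training can help. The helper below is only a sketch (`check_entities` is not part of Rasa, and the two examples are hand-written); it verifies that each span sits on word boundaries.

```python
import json


def check_entities(examples):
    """Return (text, span) pairs for entities that do not sit on word boundaries."""
    problems = []
    for example in examples:
        text = example["text"]
        for entity in example.get("entities", []):
            start, end = entity["start"], entity["end"]
            span = text[start:end]
            # Slices avoid an IndexError when an offset lies outside the text.
            starts_word = start == 0 or text[start - 1:start] == " "
            ends_word = end == len(text) or text[end:end + 1] == " "
            if not span or not (starts_word and ends_word):
                problems.append((text, span))
    return problems


sample = json.loads("""
[
  {"text": "add candidate", "intent": "candidate_add",
   "entities": [{"start": 4, "end": 13, "value": "candidate", "entity": "candidate"}]},
  {"text": "add developer", "intent": "candidate_add",
   "entities": [{"start": 5, "end": 13, "value": "candidate", "entity": "candidate"}]}
]
""")
print(check_entities(sample))
```

The second example is flagged because its span starts one character too late and cuts into the word.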

Training is straightforward.

In [None]:
from rasa_nlu.converters import load_data
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer

training_data = load_data('anna.json')
trainer = Trainer(RasaNLUConfig("config.json"))
trainer.train(training_data)
model_directory = trainer.persist('.')

To get the intent we use the parse method.

In [None]:
from rasa_nlu.model import Metadata, Interpreter

interpreter = Interpreter.load(model_directory, RasaNLUConfig("config.json"))

interpreter.parse(u"a new developer")
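
Once the intent is known, a retrieval-based bot picks its reply from a fixed set of prepared answers instead of generating text. A minimal sketch of that lookup, assuming the parse result carries an `intent` with `name` and `confidence`; the replies, the fallback text, and the `respond` helper are all invented for illustration:

```python
RESPONSES = {
    "greet": "Hello! How can I help you with recruitment today?",
    "affirm": "Great, noted.",
    "candidate_add": "Sure, what is the candidate's name?",
    "goodbye": "Goodbye!",
}
FALLBACK = "Sorry, I did not understand that."


def respond(parsed, threshold=0.5):
    """Pick a prepared reply for the parsed intent, or the fallback if unsure."""
    intent = parsed.get("intent") or {}
    if intent.get("confidence", 0.0) <= threshold:
        return FALLBACK
    return RESPONSES.get(intent.get("name"), FALLBACK)


# Hand-written stand-ins for parse results.
print(respond({"intent": {"name": "candidate_add", "confidence": 0.8}}))
print(respond({"intent": {"name": "candidate_add", "confidence": 0.2}}))
```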

## EXERCISE 2

Extend the training examples and add an intent `change_status` with entities: `passed` and `failed`.
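
Counting character offsets by hand is error-prone, so one option for the exercise is to compute them. The helper below is a sketch (`make_example` is a hypothetical name, and the sentence is only a sample, not a full solution): it finds each entity's text inside the sentence and fills in `start` and `end`.

```python
import json


def make_example(text, intent, entities=()):
    """Build one training example; `entities` holds (entity_name, surface_text) pairs."""
    example = {"text": text, "intent": intent, "entities": []}
    for name, surface in entities:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in {text!r}")
        example["entities"].append({
            "start": start,
            "end": start + len(surface),  # end is exclusive
            "value": surface,
            "entity": name,
        })
    return example


example = make_example("the candidate passed", "change_status",
                       entities=[("passed", "passed")])
print(json.dumps(example, indent=2))
```

The resulting dictionaries can be appended to the `common_examples` list before the data is written to `anna.json`.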

In [None]:
anna_common_examples = """
{
  "rasa_nlu_data": {
    "entity_synonyms": [
      {
        "value": "candidate",
        "synonyms": ["developer", "data scientist"]
      }
    ],
    "common_examples": [
      {
        "text": "hey", 
        "intent": "greet", 
        "entities": []
      }, 
      {
        "text": "howdy", 
        "intent": "greet", 
        "entities": []
      }, 
      {
        "text": "hey there",
        "intent": "greet", 
        "entities": []
      }, 
      {
        "text": "hello", 
        "intent": "greet", 
        "entities": []
      }, 
      {
        "text": "hi", 
        "intent": "greet", 
        "entities": []
      },
      {
        "text": "good morning",
        "intent": "greet",
        "entities": []
      },
      {
        "text": "good evening",
        "intent": "greet",
        "entities": []
      },
      {
        "text": "dear sir",
        "intent": "greet",
        "entities": []
      },
      {
        "text": "yes", 
        "intent": "affirm", 
        "entities": []
      }, 
      {
        "text": "yep", 
        "intent": "affirm", 
        "entities": []
      }, 
      {
        "text": "yeah", 
        "intent": "affirm", 
        "entities": []
      },
      {
        "text": "indeed",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "that's right",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "ok",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "great",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "right, thank you",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "add candidate",
        "intent": "candidate_add",
        "entities": []
      }, 
      {
        "text": "add candidate",
        "intent": "candidate_add",
        "entities": [
            {
              "start": 4,
              "end": 13,
              "value": "candidate",
              "entity": "candidate"
            }
        ]
      },         
      {
        "text": "adding candidate",
        "intent": "candidate_add",
        "entities": [
            {
              "start": 7,
              "end": 16,
              "value": "candidate",
              "entity": "candidate"
            }
        ]
      },
      {
        "text": "please add candidate",
        "intent": "candidate_add",
        "entities": []
      },              
      {
        "text": "please add new candidate",
        "intent": "candidate_add",
        "entities": []
      },           
      {
        "text": "we have new prescreening upcoming",
        "intent": "candidate_add",
        "entities": []
      }, 
      {
        "text": "we have a new candidate for prescreening",
        "intent": "candidate_add",
        "entities": []
      },         
      {
        "text": "correct",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "great choice",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "sounds really good",
        "intent": "affirm",
        "entities": []
      },
      {
        "text": "bye", 
        "intent": "goodbye", 
        "entities": []
      }, 
      {
        "text": "goodbye", 
        "intent": "goodbye", 
        "entities": []
      }, 
      {
        "text": "good bye", 
        "intent": "goodbye", 
        "entities": []
      }, 
      {
        "text": "stop", 
        "intent": "goodbye", 
        "entities": []
      }, 
      {
        "text": "end", 
        "intent": "goodbye", 
        "entities": []
      },
      {
        "text": "farewell",
        "intent": "goodbye",
        "entities": []
      },
      {
        "text": "Bye bye",
        "intent": "goodbye",
        "entities": []
      },
      {
        "text": "have a good one",
        "intent": "goodbye",
        "entities": []
      }
    ]
  }
}
"""

with open("anna.json", "w") as training_data:
    training_data.write(anna_common_examples)

Train it:

In [None]:
from rasa_nlu.converters import load_data
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer

training_data = load_data('anna.json')
trainer = Trainer(RasaNLUConfig("config.json"))
trainer.train(training_data)
model_directory = trainer.persist('.')

Test it:

In [None]:
from rasa_nlu.model import Metadata, Interpreter

interpreter = Interpreter.load(model_directory, RasaNLUConfig("config.json"))

interpreter.parse(u"the developer didn't pass")