# Chit-Chat improvement - DistilBERT for Q&A similarity

In this honors project I propose to improve the dialogue model with an approach based on suggestion 4 of the Coursera assignment: *Selective model: embeddings-based ranking*.

**Summary**

The system ranks a list of answers from the Cornell dataset. Each answer is bound to a question that was encoded beforehand with DistilBERT. At chat time, the system computes the embedding of the user's question by calling an endpoint deployed on a separate EC2 machine. This endpoint is exposed by a Flask app running inside a Docker container that serves DistilBERT (Docker Hub image: https://hub.docker.com/repository/docker/javiermcebrian/sentence_transformers_service).

The services are split because loading DistilBERT into RAM exceeds the resources provided by the AWS Free Tier. Once the service returns the embedding of the user's question, we compute its distance to the pre-computed, statically loaded question embeddings inside a Docker service deployed on the main EC2 machine (Docker Hub image: https://hub.docker.com/repository/docker/javiermcebrian/coursera_nlp_honors_serve). With this approach we aim to improve chat quality, since DistilBERT embeddings are better than the ones used during the course.


**Implementation**

I wanted to try state-of-the-art models and to build custom applications using Docker. Apart from this notebook, the project required a lot of source code, Docker containers, Docker Compose files, bash scripts, etc. that I developed. Everything is available at https://github.com/javiermcebrian/natural-language-processing/tree/master/honor:
* bert_as_a_service (not used in the end): Dockerfile to launch a service based on bert-as-service (https://github.com/hanxiao/bert-as-service). I wanted to try different ways of computing embeddings, but in the end I preferred other implementations.
* datasets: as in the original Coursera repository
* environment (for experimentation):
    * environment-manager.sh: the entry point for running experiments. It deploys the experiment environment for both the PyCharm interpreter and the remote Notebook service, and prints a help guide when run without arguments. It essentially manages a Docker Compose YAML file that offers these 2 services.
    * docker-compose.yml: defines the services for experimentation.
    * Dockerfile.serve: base Dockerfile used both for the cheap EC2 environment that runs the bot and as the base image for the experiments one.
    * Dockerfile.experiments: adds many useful libraries for experimentation, plus Jupyter capabilities to serve notebook endpoints to connect to. It is more expensive (in RAM and image size) than the serve one.
    * requirements-XXXXXXXXXXX.txt: requirement files.
* sentence_transformers_service (to serve DistilBERT in AWS EC2 instance):
    * Dockerfile: defines the service that uses the app.py Flask app
    * app.py: service that manages DistilBERT. It assumes the model is serialized at a predefined path (run the container with volumes). It provides singleton model management and a warmup feature (via PING) so that the model is loaded before the user needs it.
    * build_and_push.sh: builds the Docker image, tags it, and pushes it to Docker Hub
    * start-service.sh: run the service in background
* SRC: the rest of the code is the one provided by Coursera, updated with this functionality. It calls the DistilBERT service and performs the warmup via PING. Finally, it selects the answer by embedding similarity.
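
The singleton-plus-warmup behaviour described for app.py can be sketched roughly as follows. This is not the code from the repository: `ModelHolder`, `load_fn`, and `fake_loader` are hypothetical names, and a stand-in loader replaces the real DistilBERT load so that the example runs without the model or Flask.

```python
import threading


class ModelHolder:
    """Load an expensive model once and share it across requests."""

    def __init__(self, load_fn):
        self._load_fn = load_fn      # callable that builds the model
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: only the first caller pays the load cost.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._load_fn()
        return self._model

    def ping(self):
        # Warmup: force the load so later requests find the model ready.
        self.get()
        return "pong"


load_calls = []


def fake_loader():
    # Hypothetical stand-in for loading the serialized DistilBERT model.
    load_calls.append(1)
    return object()


holder = ModelHolder(fake_loader)
holder.ping()                 # warmup triggers the single load
first = holder.get()
second = holder.get()         # reuses the already loaded instance
print(len(load_calls), first is second)
```

In a real service the loader would presumably read the model from the mounted volume, and the ping would be wired to an HTTP route; both details are left out here.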

### Try the model
First, we test DistilBERT and serialize the model for later use in the serving app. The model produces embeddings of size 768.

In [1]:
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import numpy as np
import pickle
import random

In [2]:
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
model.save('/root/coursera/artifacts/distilbert-base-nli-stsb-mean-tokens')

In [3]:
sentences = ['Hi how are you']
embeddings_dim = len(model.encode(sentences)[0])
embeddings_dim

768

### Dataset
We load the Cornell dataset with the max_len parameter set to 100 to get sufficiently long sentences. Then we randomly sample the data, since the Free Tier AWS machines have RAM constraints, and finally clean the data with the text_prepare() function. This yields 34,109 possible responses for the dialogue model.

In [4]:
from datasets import datasets
from utils import text_prepare

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
dataset_path = '/root/coursera/datasets/data/cornell'
data = datasets.readCornellData(dataset_path, max_len=100)

100%|██████████| 83097/83097 [00:04<00:00, 18387.97it/s]


In [6]:
# define parameters to sample 20% of the QA data
rate = 0.2
indices = list(range(len(data)))
nb = int(len(data) * rate)

# sample the data and apply text_prepare() only to questions
indices_selected = random.sample(indices, nb)
data_selected = [(text_prepare(data[i][0]), data[i][1]) for i in indices_selected]

# show the number of answers the system will be able to offer for chit-chat
nb

34109

In [7]:
data_selected[:10]

[('think likes think', 'finally came to your senses huh'),
 ('blankets notice warm fifty percent wool',
  'they also smell of moth balls when were they issued this morning'),
 ('youre dark horse ripley engaged', 'your parents met her'),
 ('scattered smothered covered',
  'exactly well i guess a couple more photos wont kill me'),
 ('youre anya rosson arent ive heard back new york',
  'sorry i cant return the compliment'),
 ('jimmy', 'im not sure'),
 ('think', 'well thank goodness thats settled'),
 ('dont call doll larry hate call doll',
  'you used to love it when i called you doll'),
 ('come cuervo delivered didnt im asking promised', 'well see'),
 ('oh love song isnt great doesnt make want dance cmon',
  'uh well thats okay i dont dance heh heh')]

### DistilBERT Embeddings

Here we pre-compute DistilBERT sentence embeddings, as was done for the StackOverflow answers.

#### Resources' constraints

Each StackOverflow pickle has, on average, a size of 100 MB for 280,000 samples. For the Cornell dataset we have 34,109 samples, saved into a pickle with larger embeddings than the StackOverflow ones. In terms of size (RAM constraints) and computation time (finding the minimum distance between the question embedding and the stored ones), the two effects roughly compensate: fewer samples but larger vectors.
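
As a rough check of that trade-off, the size of the Cornell embedding matrix can be estimated directly from its shape. The sketch below assumes 4 bytes per value (float32, the dtype used for the embedding matrix in this notebook) and ignores the answer strings stored alongside it in the pickle.

```python
n_samples = 34109        # sampled Cornell questions
dim = 768                # DistilBERT sentence embedding size
bytes_per_value = 4      # float32

total_bytes = n_samples * dim * bytes_per_value
total_mib = total_bytes / 2**20
print(f"{total_bytes} bytes = {total_mib:.1f} MiB")  # 104782848 bytes = 99.9 MiB
```

So the matrix alone is on the order of 100 MB, which is comparable to the average StackOverflow pickle despite holding far fewer samples.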

In [8]:
questions, answers = zip(*data_selected)

In [9]:
batchsize = 200

q_embeddings = np.zeros((nb, embeddings_dim), dtype=np.float32)
for i in tqdm(range(0, nb, batchsize)):
    end = min(nb, i+batchsize)
    q_embeddings[i:end, :] = model.encode(questions[i:end])

100%|██████████| 171/171 [07:09<00:00,  2.51s/it]


In [10]:
with open('/root/coursera/artifacts/qa_embeddings.pkl', 'wb') as f:
    pickle.dump((answers, q_embeddings), f)

### Fitting resources into RAM

Here we show that, in theory, the proposed system fits into the constrained resources of the AWS Free Tier using only 1 EC2 instance. In practice this turned out not to be the case, so we built the aforementioned DistilBERT service on a different machine.

Nevertheless, we describe the hypothesis: first we restart the kernel inside the experiments Docker container, then we execute the following cells sequentially to check the RAM increments. The following command gives us this information:

```sh
docker ps -q | xargs  docker stats --no-stream
```

RAM usage, measured incrementally:
- Docker experiments: 82.34 MB
- Docker experiments + sentence_transformers: 155 MB
- Docker experiments + sentence_transformers + distilbert: 582.5 MB
- Docker experiments + sentence_transformers + distilbert + embeddings: 684.7 MB

Non-chit-chat RAM usage (the rest of the resources, checked for the project 1 chatbot on AWS) is ~300 MB in the worst case, i.e., when the user requests a StackOverflow answer and the embeddings are loaded into memory.

In any case, the Docker container for serving will be much lighter: 4.75 MB.

With this we can estimate RAM usage on AWS: 684.7 - 82.34 + 4.75 + 300 = **907.11 MB in the worst case**, which is the unlikely time slot in which the service is releasing the chit-chat model memory while reading StackOverflow embeddings into memory. This estimate is for a single user. Supporting multiple users is not a requirement, as the AWS Free Tier has few resources.
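
The same estimate, spelled out step by step. The variable names are only illustrative; the figures are the measurements reported above.

```python
# All figures in MB, taken from the measurements above.
experiments_full = 684.7    # experiments container + library + model + embeddings
experiments_base = 82.34    # experiments container alone
serving_base = 4.75         # lighter serving container
non_chitchat = 300.0        # worst case for the rest of the bot

# RAM attributable to the chit-chat model and its embeddings
chitchat_increment = experiments_full - experiments_base
worst_case = chitchat_increment + serving_base + non_chitchat
print(round(chitchat_increment, 2), round(worst_case, 2))  # 602.36 907.11
```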

As already mentioned, this does not work in practice, which led us to a client-server solution for DistilBERT. Still, the estimate is useful for roughly understanding the RAM requirements when designing the model. With this information we can decide whether to use DistilBERT or a different model, as well as how many question embeddings to store for similarity matching.

In [14]:
from sentence_transformers import SentenceTransformer
import pickle

In [15]:
model = SentenceTransformer('/root/coursera/artifacts/distilbert-base-nli-stsb-mean-tokens')

In [16]:
with open('/root/coursera/artifacts/qa_embeddings.pkl', 'rb') as f:
    qa_embeddings = pickle.load(f)

### Experiments for predictions

Here we run some experiments with the model and the pre-computed embeddings to show what the prediction function looks like and how it behaves on some samples. The source code with the prediction function is in my GitHub repository (https://github.com/javiermcebrian/natural-language-processing/blob/master/honor/dialogue_manager.py and https://github.com/javiermcebrian/natural-language-processing/blob/master/honor/sentence_transformers_service/app.py), as already mentioned; in addition, I provide the expected experiments here.

In [26]:
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances_argmin

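def test_system(qa_embeddings, user_questions):
    """Return, for each user question, the answer paired with the closest stored question."""
    answers, q_embeddings = qa_embeddings
    res = []
    for uq in user_questions:
        # encode() returns one vector per sentence; reshape it to a (1, dim) matrix
        question_vec = np.expand_dims(model.encode([uq])[0], axis=0)
        # index of the stored question embedding nearest to the user question (Euclidean by default)
        best_answer_id = pairwise_distances_argmin(X=question_vec, Y=q_embeddings, axis=1)[0]
        res.append(answers[best_answer_id])
    return res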

In [33]:
user_questions = [
    'Would you like to go to the park or to the office?',
    'I would like to see your improvements',
    'Why do you think they are beautiful?',
    'Thank you for helping me'
]

test_system(qa_embeddings, user_questions)

['the park',
 'its a huge moment in their life',
 'it reminded me of you so i bought it it cost me more than all the others',
 'thats what were paid for']
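
To make the selection step concrete, here is a dependency-free sketch of the nearest-embedding lookup on toy data. It mirrors what `pairwise_distances_argmin` does for a single query with its default Euclidean metric, but `nearest_answer` and the 3-dimensional toy vectors are illustrative assumptions, not part of the project code.

```python
import math


def nearest_answer(question_vec, q_embeddings, answers):
    """Return the answer whose stored question embedding is closest."""
    distances = [math.dist(question_vec, q) for q in q_embeddings]
    best_id = min(range(len(distances)), key=distances.__getitem__)
    return answers[best_id]


# Toy 3-dimensional "embeddings"; the real ones have 768 dimensions.
toy_q_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_answers = ["answer A", "answer B", "answer C"]

picked = nearest_answer([0.1, 0.9, 0.2], toy_q_embeddings, toy_answers)
print(picked)  # answer B
```

The query vector lies closest to the second stored embedding, so the answer bound to that question is returned.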

### Conclusion

The results are quite good, as the answers seem related to the fictitious conversation. We ask the system about preferences (park or office), self attributes (improvements), an object's attributes (beautiful) and acknowledgements. However, like any dialogue model, it will be prone to errors that look significant from a human perspective, since this is a complicated task. In conclusion, I think the proposed system scales on AWS and provides interesting chit-chat features.