In [1]:
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import numpy as np
import pickle
import random

In [2]:
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
model.save('/root/coursera/artifacts/distilbert-base-nli-stsb-mean-tokens')

In [3]:
sentences = ['Hi how are you']
embeddings_dim = len(model.encode(sentences)[0])
embeddings_dim

768

# Dataset

In [4]:
from datasets import datasets
from utils import text_prepare

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
dataset_path = '/root/coursera/datasets/data/cornell'
data = datasets.readCornellData(dataset_path, max_len=100)

100%|██████████| 83097/83097 [00:04<00:00, 18387.97it/s]


In [6]:
# define parameters to sample 20% of the QA data
rate = 0.2
indices = list(range(len(data)))
nb = int(len(data) * rate)

# sample the data and apply text_prepare() only to questions
indices_selected = random.sample(indices, nb)
data_selected = [(text_prepare(data[i][0]), data[i][1]) for i in indices_selected]

# show the number of answers the system will be able to offer for chit-chat
nb

34109

In [7]:
data_selected[:10]

[('think likes think', 'finally came to your senses huh'),
 ('blankets notice warm fifty percent wool',
  'they also smell of moth balls when were they issued this morning'),
 ('youre dark horse ripley engaged', 'your parents met her'),
 ('scattered smothered covered',
  'exactly well i guess a couple more photos wont kill me'),
 ('youre anya rosson arent ive heard back new york',
  'sorry i cant return the compliment'),
 ('jimmy', 'im not sure'),
 ('think', 'well thank goodness thats settled'),
 ('dont call doll larry hate call doll',
  'you used to love it when i called you doll'),
 ('come cuervo delivered didnt im asking promised', 'well see'),
 ('oh love song isnt great doesnt make want dance cmon',
  'uh well thats okay i dont dance heh heh')]

# Distil-BERT Embeddings

Here we pre-compute Distil-BERT sentence embeddings for the questions, just as embeddings were pre-computed for the StackOverflow answers.

### Resource constraints

Each StackOverflow pickle averages 100 MB and 280,000 samples. The Cornell dataset here has 34,109 samples, which are saved into a pickle with larger embeddings than the StackOverflow ones. In both size (RAM constraints) and computation time (finding the minimum distance between a question embedding and the stored ones), the two effects should roughly offset each other: fewer samples but larger vectors.
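
As a rough check on the size side of this argument, the sketch below estimates the raw footprint of the question-embedding matrix from the sample count and embedding dimension shown in this notebook. It assumes `float32` storage (4 bytes per value, as used when the matrix is allocated later) and ignores the answer strings and pickle overhead, so it is a lower bound rather than an exact file size.

```python
# Rough size of the Cornell question-embedding matrix.
n_samples = 34109        # number of sampled QA pairs (value of nb)
embeddings_dim = 768     # sentence-embedding size reported by the model
bytes_per_value = 4      # assumption: values are stored as float32

matrix_bytes = n_samples * embeddings_dim * bytes_per_value
matrix_mib = matrix_bytes / 2**20

print(f"{matrix_bytes} bytes = {matrix_mib:.1f} MiB")
```

Under these assumptions the matrix alone comes to just under 100 MiB, the same order of magnitude as one StackOverflow pickle.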

In [8]:
questions, answers = zip(*data_selected)

In [9]:
batchsize = 200

q_embeddings = np.zeros((nb, embeddings_dim), dtype=np.float32)
for i in tqdm(range(0, nb, batchsize)):
    end = min(nb, i+batchsize)
    q_embeddings[i:end, :] = model.encode(questions[i:end])

100%|██████████| 171/171 [07:09<00:00,  2.51s/it]


In [10]:
with open('/root/coursera/artifacts/qa_embeddings.pkl', 'wb') as f:
    pickle.dump((answers, q_embeddings), f)
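
The stored pair `(answers, q_embeddings)` is meant to be searched at serving time: the user's question is embedded and the answer attached to the closest stored question is returned. The notebook does not state which distance is used, so the following is only a minimal sketch that assumes cosine similarity; the helper name `most_similar_index` and the toy data are hypothetical stand-ins for the real `(nb, 768)` matrix.

```python
import numpy as np

def most_similar_index(query_vec, q_embeddings):
    """Return the row of q_embeddings with the highest cosine similarity to query_vec."""
    q_norms = np.linalg.norm(q_embeddings, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # The small epsilon guards against division by zero for all-zero vectors.
    sims = (q_embeddings @ query_vec) / (q_norms * query_norm + 1e-12)
    return int(np.argmax(sims))

# Toy data standing in for the real embeddings and answers (hypothetical values).
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float32)
toy_answers = ("answer a", "answer b", "answer c")

query = np.array([0.1, 0.9], dtype=np.float32)
best = most_similar_index(query, toy_embeddings)
print(toy_answers[best])
```

With the real data, `query` would be `model.encode([text_prepare(user_question)])[0]`, so the search costs one matrix-vector product over all stored questions.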

# Fitting resources into RAM

Here we show that the proposed system fits within the constrained resources of the AWS Free Tier. First restart the kernel inside the experiments Docker container, then execute the following cells sequentially and check the RAM increase after each one. The following command reports this information:

```sh
docker ps -q | xargs  docker stats --no-stream
```

RAM usage, measured incrementally:
- Docker experiments: 82.34 MB
- Docker experiments + sentence_transformers: 155 MB
- Docker experiments + sentence_transformers + distilbert: 582.5 MB
- Docker experiments + sentence_transformers + distilbert + embeddings: 684.7 MB
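
Since the readings above are cumulative, the cost of each step is the difference between consecutive readings. The short sketch below computes those differences from the reported numbers; the shortened step labels are illustrative only.

```python
# Cumulative readings reported above, in the order they were measured.
readings = [
    ("Docker experiments", 82.34),
    ("sentence_transformers", 155.0),
    ("distilbert", 582.5),
    ("embeddings", 684.7),
]

# Memory added by each step = current reading minus the previous one.
increments = {
    name: round(current - previous, 2)
    for (_, previous), (name, current) in zip(readings, readings[1:])
}

for name, delta in increments.items():
    print(f"{name}: +{delta}")
```

The model weights dominate the total, while the embeddings add roughly 102, which is in line with a 34,109 x 768 matrix of 4-byte floats (about 105 million bytes).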

Non-chit-chat RAM usage (the rest of the resources, measured for the project 1 chatbot on AWS) is ~300 MB in the worst case, i.e., when the user requests a StackOverflow answer and the embeddings are loaded into memory.

In any case, the Docker container for serving will be much lighter: 4.75 MB.
The Docker image for serving is available in my DockerHub repository: https://hub.docker.com/repository/docker/javiermcebrian/coursera_nlp_honors_serve/general

With this we can estimate RAM usage on AWS: 684.7 - 82.34 + 4.75 + 300 = **907.11 MB in the worst case**, which is the unlikely moment when the service is releasing the chit-chat model from memory while reading the StackOverflow embeddings into memory. This estimate assumes a single user; multi-user support is not a requirement, since the AWS Free Tier offers few resources.
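
The estimate can be reproduced step by step as below. All measured values come from this section; the 1024 limit is an assumption about the target instance (it is not stated in this notebook) and is included only to show the remaining headroom.

```python
chit_chat_total = 684.7    # experiments container with model and embeddings loaded
experiments_base = 82.34   # baseline of the experiments container
serving_base = 4.75        # baseline of the serving container
non_chit_chat = 300.0      # approximate worst case for the rest of the chatbot

# Swap the experiments baseline for the serving baseline, then add the rest.
worst_case = chit_chat_total - experiments_base + serving_base + non_chit_chat

ram_limit = 1024.0         # ASSUMPTION: available RAM on the instance, not from the notebook
headroom = ram_limit - worst_case

print(f"worst case: {worst_case:.2f}, headroom: {headroom:.2f}")
```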

In [11]:
from sentence_transformers import SentenceTransformer
import pickle

In [12]:
model = SentenceTransformer('/root/coursera/artifacts/distilbert-base-nli-stsb-mean-tokens')

In [13]:
with open('/root/coursera/artifacts/qa_embeddings.pkl', 'rb') as f:
    # the pickle holds the (answers, q_embeddings) tuple saved in cell In [10]
    qa_embeddings = pickle.load(f)