In [1]:
# add utils folder into sys path to import our modules

import sys
sys.path.append('./')

# Day 3: Inference with seq2seq model

## -Outline-

## 1. Intro (happening now)

## 2. Today's application task: open-domain dialogue bot (chatbot)

## 3. Dataset: Persona-Chat (convai2 official training data)

### Sections 4 and 5 are presented in the second notebook

## 4. Seq2seq recap: training our model

## 5. Inference: how to extract a response from our model

***
## 2. Today's application task: open-domain dialogue bot (chatbot)

In [2]:
# sanity check for all our data files

! ls -l ../data/convai2*

-rw-r--r-- 1 kulikov ammi 16202916 Mar 10 10:46 ../data/convai2_simple_train_wpersona_with_starts.txt
-rw-r--r-- 1 kulikov ammi   990188 Mar 10 10:46 ../data/convai2_simple_valid_wpersona_with_starts.txt


In [3]:
# let's see what our samples look like

! head -n 20 ../data/convai2_simple_train_wpersona_with_starts.txt

0	 i like to remodel homes. i like to go hunting. i like to shoot a bow. my favorite holiday is halloween.hi , how are you doing ? i'm getting ready to do some cheetah chasing to stay in shape .	you must be very fast . hunting is one of my favorite hobbies .
0	i am ! for my hobby i like to do canning or some whittling .	i also remodel homes when i am not out bow hunting .
0	that's neat . when i was in high school i placed 6th in 100m dash !	that's awesome . do you have a favorite season or time of year ?
0	i do not . but i do have a favorite meat since that is all i eat exclusively .	what is your favorite meat to eat ?
0	i would have to say its prime rib . do you have any favorite foods ?	i like chicken or macaroni and cheese .
0	do you have anything planned for today ? i think i am going to do some canning .	i am going to watch football . what are you canning ?
0	i think i will can some jam . do you also play footfall for fun ?	if i have time outside of hunting and remodeling homes . 

## *discuss* --> any similar task we have discussed so far? Any particular differences?

Today we are going to build an open-domain chatbot and become more familiar with the typical issues that arise in dialogue modeling.
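The sample lines printed above look tab-separated, with three fields per line: a leading flag (always `0` in the lines shown; its meaning is not stated here), the input text, and the target response. A minimal sketch of reading one such line, assuming exactly that layout (the helper `parse_line` is hypothetical and is not part of `chat_dataset`):

```python
def parse_line(line):
    """Split one raw line into (flag, context, response).

    Assumes exactly three tab-separated fields, as in the sample output above.
    """
    flag, context, response = line.rstrip("\n").split("\t")
    return flag, context.strip(), response.strip()


# one line copied from the `head` output above
sample = (
    "0\ti am ! for my hobby i like to do canning or some whittling ."
    "\ti also remodel homes when i am not out bow hunting ."
)
flag, context, response = parse_line(sample)
print("input   :", context)
print("response:", response)
```

Each line is therefore one (input, response) training pair, which is the shape a seq2seq model expects.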

***
## 3. Dataset: Persona-Chat (convai2 official training data)

The dataset comes from the ConvAI2 competition ([link](http://convai.io/)). Many other open-domain dialogue datasets exist; the main feature of ConvAI2 is that it provides additional grounding context (a persona profile) for every dialogue, which makes the task much more interesting.

![Example dialogue](nbimgs/personachat.png)

This dataset is fully supported by the ParlAI framework. To avoid large dependencies, we write our own dataset handler for Persona-Chat.

In [5]:
from chat_dataset import TextDataset

In [6]:
TextDataset?

Init signature:
TextDataset(
    text_file_path,
    max_voc_size=None,
    device='cpu',
    dictionary=None,
)
Docstring:
Simple text dataset. Loads everything in RAM. Preprocess all data in tensors
in advance.
Init docstring:
:param text_file_path: filename of the dataset file
:param max_voc_size: max number of words in the dictionary
File:           ~/ammi-2019-nlp/utils/chat_dataset.py
Type:           type
Subclasses:


In [7]:
train_dataset = TextDataset('../data/convai2_simple_train_wpersona_with_starts.txt', device='cpu')





In [9]:
train_dataset??

Type:           TextDataset
String form:    <chat_dataset.TextDataset object at 0x7fa33b4f99e8>
Length:         131428
File:           ~/ammi-2019-nlp/utils/chat_dataset.py
Source:
class TextDataset(Dataset):
    """
    Simple text dataset. Loads everything in RAM. Preprocess all data in tensors
    in advance.

    """
    def __init__(self, text_file_path, max_voc_size=None, device='cpu', dictionary=None):
        """
        :param text_file_path: filename of the dataset file
        :param max_voc_size: max number 

## Let's check the dataset content

In [14]:
print('Vocabulary size: {}'.format(train_dataset.get_vocab_size()))

# some words from the dataset
print('\nFirst words: {}'.format([train_dataset.ind2word[i] for i in range(20)]))

# counts of those words
print('\nFirst words counts: {}'.format([train_dataset.counts[train_dataset.ind2word[i]] for i in range(20)]))

Vocabulary size: 18760

First words: ['<pad>', '<unk>', '<sos>', '<eos>', '<sep>', 'i', 'like', 'to', 'remodel', 'homes', '.', 'go', 'hunting', 'shoot', 'a', 'bow', 'my', 'favorite', 'holiday', 'is']

First words counts: [10000000000.0, 10000000000.0, 10000000000.0, 10000000000.0, 10000000000.0, 270778, 41228, 79239, 41, 130, 276855, 9210, 454, 111, 89135, 87, 73999, 14225, 183, 53579]
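The output above suggests two things: the five special tokens occupy indices 0 to 4 with an artificially large count (`1e10`), and ordinary words are indexed in order of first appearance rather than by frequency (`remodel`, with only 41 occurrences, sits at index 8). A minimal sketch of a vocabulary builder with that behaviour follows. It is not the actual `TextDataset` code; `build_vocab` and its truncation rule for `max_voc_size` are assumptions made for illustration.

```python
from collections import Counter

SPECIAL_TOKENS = ['<pad>', '<unk>', '<sos>', '<eos>', '<sep>']


def build_vocab(sentences, max_voc_size=None):
    """Build (ind2word, word2ind, counts) from whitespace-tokenized sentences."""
    counts = Counter()
    # Assumption: a huge count keeps special tokens inside any truncated vocabulary.
    for token in SPECIAL_TOKENS:
        counts[token] = 1e10
    ind2word = list(SPECIAL_TOKENS)

    for sentence in sentences:
        for word in sentence.split():
            if word not in counts:
                ind2word.append(word)  # index = order of first appearance
            counts[word] += 1

    if max_voc_size is not None:
        # Assumption: keep only the most frequent words, preserving index order.
        keep = {word for word, _ in counts.most_common(max_voc_size)}
        ind2word = [word for word in ind2word if word in keep]

    word2ind = {word: index for index, word in enumerate(ind2word)}
    return ind2word, word2ind, counts


ind2word, word2ind, counts = build_vocab(
    ["i like to remodel homes .", "i like to go hunting ."]
)
print(ind2word[:10])
print(counts['i'], counts['remodel'])
```

Words dropped by the truncation would later be mapped to `<unk>` when the text is converted to indices.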


In [15]:
# get a sample from our dataset

train_dataset[42]

(tensor([  5, 319,  14, 320,  10,   5, 117,   7, 321,  10,   5,   6, 322, 323,
          10,  20,  19,  16,  17,  18,  10, 324, 111,  22,  23,  19, 207,  95,
           4,   5,  46, 227, 258,  25,  27,   5,  68,  14, 325, 255,  10, 326,
           4,   5,  28,  29, 196, 327, 137,  97,  69,   4,   5, 328, 329,  37,
          14, 330, 331, 320, 258, 117, 332,  69,   4,  25, 175,  48,  14, 333,
         334,  27,   4, 335, 207,  19, 165,  22,  20,  19,  16, 336,  10,   4,
         199,  22,   5,  99, 337,  10,   5,  28, 222,  40, 338,  14, 304,  54,
          97,  19, 339,   4, 218,   5, 248,  28, 249, 175, 170,  14, 333, 334,
           4, 340,   5, 205,  22,  57,  28,  58,  79,  20, 341,   4,  78,  32,
          25,   6,   7, 342,  27,   5,   6, 323, 322, 343,  10,   4,   5,   6,
          91,   7,  32, 121, 344,   4, 345, 218, 207, 346,  28, 249, 207,  19,
          16, 347, 199,   4, 199,  22, 345,   5, 205,  22, 348,  25, 319, 207,
           4, 154,  53,  78,  19,  79,  17,  27,   4

In [16]:
# turn the input back to the text form

train_dataset.pred2text(train_dataset[42][0])

"i own a hearse . i love to crochet . i like alternative rock . halloween is my favorite holiday . hello friend , how is it going <sep> i am well an you ? i have a creepy ride . guess <sep> i ' m great enjoying the football season <sep> i drive around in a long black hearse an love this season <sep> you work for a funeral home ? <sep> yes it is nice , halloween is my fav . <sep> lol , i can imagine . i ' ll be reading a lot when football is over <sep> no i don ' t work at a funeral home <sep> ok i see , that ' s your halloween costume <sep> what do you like to read ? i like rock alternative music . <sep> i like anything to do with mystery <sep> oh no it isn ' t it is my car lol <sep> lol , oh i see , taught you own it <sep> me also what is your favorite ? <sep> well i like sherlock holmes and others"

## *discuss* --> What can you say about such input?
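One way to read the decoded sample above: the persona sentences come first, followed by the dialogue turns joined with the `<sep>` token, all flattened into a single sequence of word indices. The sketch below reproduces that layout on a toy example. The helpers `build_context`, `encode`, and `decode` are hypothetical and only mimic what `TextDataset` and `pred2text` appear to do.

```python
SEP = "<sep>"


def build_context(persona, history):
    """Flatten persona sentences and previous turns into one input string."""
    persona_part = " ".join(persona)
    history_part = f" {SEP} ".join(history)
    return f"{persona_part} {history_part}"


def encode(text, word2ind, unk_index=1):
    """Map whitespace tokens to indices; unknown words map to <unk>."""
    return [word2ind.get(token, unk_index) for token in text.split()]


def decode(indices, ind2word):
    """Map indices back to a space-joined string."""
    return " ".join(ind2word[i] for i in indices)


persona = ["i own a hearse .", "i love to crochet ."]
history = ["hello friend , how is it going", "i am well an you ?"]
context = build_context(persona, history)

# toy vocabulary: special tokens first, then words in order of first appearance
ind2word = ["<pad>", "<unk>", "<sos>", "<eos>", "<sep>"]
for token in context.split():
    if token not in ind2word:
        ind2word.append(token)
word2ind = {word: index for index, word in enumerate(ind2word)}

ids = encode(context, word2ind)
print(ids)
print(decode(ids, ind2word))
```

Because everything is concatenated, the model sees no explicit boundary between the persona and the first utterance; only turn boundaries are marked.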

## Some math formulation of our task

Each example in a training set $D$ consists of auxiliary information or context $U$ (such as a persona profile or external knowledge context) and a sequence of utterances, each of which is marked with a speaker tag, i.e.,

## $$C = (U, (Y_1^a, Y_1^b, \ldots, Y_L^a, Y_L^b)) \in D$$

where $Y_l^s$ is the utterance from the $l$-th turn by speaker $s$. The conditional log-probability that a neural sequence model assigns to this example is then written as

## $$\log p(C) = \sum_{s \in \left\{ a, b \right\}} \sum_{l=1}^L \log p(Y_l^s|Y_{<l}^s, Y_{\leq l}^{\bar{s}}, U), $$