# RETRIEVAL CHATBOT

# 1. INTRODUCTION

In this exercise we will walk step by step through the process of building a retrieval chatbot: preparing the dataset we want to work with, creating a model, and training it to obtain a dialog system able to answer users' questions.

As you know from the slides, a retrieval chatbot doesn't generate an answer from scratch. It receives a question (the user input), uses some heuristic to retrieve a set of candidate answers to that question, and finally selects the best one as the final answer. Our goal in this exercise is to build a chatbot able to perform this task in a closed domain: Ubuntu customer support.

## 1.1. Dataset
In this case we are going to work with the Ubuntu Corpus (https://arxiv.org/pdf/1506.08909) to create a retrieval chatbot capable of answering technical support questions about the well-known OS Ubuntu. The dataset can be downloaded from **https://drive.google.com/file/d/0B_bZck-ksdkpVEtVc1R6Y01HMWM/view**. It consists of dialogs extracted from the forums, so each conversation has two participants. **Please create a folder called 'data' in the same path where this notebook is being executed and extract the dataset there.**

In the training dataset, the dialogs have been processed into a series of **context** - **utterance** pairs. Each sentence of the dialog appears as the utterance of one pair, while the context of that specific pair is formed by the sentences preceding the utterance.
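The pairing described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `build_pairs`, the toy dialog, and joining the context sentences with a single space are assumptions, not the preprocessing actually used to build the corpus files.

```python
def build_pairs(dialog):
    """Turn a list of sentences into (context, utterance) pairs.

    The first sentence has no previous sentences, so in this sketch
    it is only used as context (an assumption of this example).
    """
    pairs = []
    for i in range(1, len(dialog)):
        context = " ".join(dialog[:i])   # every sentence before the utterance
        utterance = dialog[i]
        pairs.append((context, utterance))
    return pairs


# Toy dialog, invented for illustration
dialog = [
    "how do I remove a file ?",
    "use rm followed by the file name",
    "thanks , that worked",
]

for context, utterance in build_pairs(dialog):
    print("{} -> {}".format(context, utterance))
```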

The testing dataset is different: each sentence of the dialog is used as **context**, and the following sentence of the dialog (from the other user) as **utterance**. In addition, each pair also has 9 **distractors**, false utterances selected randomly from the dataset. Given a context, the model receives the correct utterance and the distractors as candidate answers, and it should give the correct one a better score than the others.
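As a rough sketch of how such a test record could be assembled, the snippet below samples 9 distractors from a pool of utterances. The helper `build_test_record`, the dictionary keys, and the toy pool are hypothetical; the corpus files already contain the distractors, so you do not need to run this.

```python
import random


def build_test_record(context, correct, pool, num_distractors=9, seed=0):
    """Attach randomly sampled wrong utterances to a (context, utterance) pair."""
    rng = random.Random(seed)
    # Never pick the correct utterance as a distractor
    candidates = [u for u in pool if u != correct]
    distractors = rng.sample(candidates, num_distractors)
    return {"context": context, "utterance": correct, "distractors": distractors}


# Toy pool of utterances, invented for illustration
pool = ["reply number {}".format(i) for i in range(50)] + ["use ls"]
record = build_test_record("how do I list files ?", "use ls", pool)
print(len(record["distractors"]))
```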


## 1.2. Model
The architecture of the neural network is called the Dual Encoder LSTM. It is also described in the paper mentioned above, and it is formed by two encoders: one encodes the question we want to answer and the other encodes the candidate answer. The output of the architecture is a score between 0 and 1. The closer the score is to 1, the better the answer is for that question.

<img src="dualencoder.png">


# 2. Requirements

First of all, we need to install the libraries required to complete this project. The most important are:

* Python = 2.7
* Tensorflow = 0.12.1 

Once installed, import them into the project and we are ready to start.


In [None]:
import os
import time
import sys
import itertools
import tensorflow as tf
import numpy as np
import functools
from collections import namedtuple
from tensorflow.contrib.learn.python.learn.metric_spec import MetricSpec

# 3. Setting everything up

For the Tensorflow graph, we have to define a series of variables that will be needed once we create the model. 

In [None]:
# Data read/write Parameters
tf.flags.DEFINE_string("input_dir", "./data", "Directory containing input data files 'train.tfrecords' and 'validation.tfrecords'")
tf.flags.DEFINE_integer("loglevel", 20, "Tensorflow log level")

# Training Parameters
tf.flags.DEFINE_integer("num_epochs", None, "Number of training Epochs. Defaults to indefinite.")
tf.flags.DEFINE_integer("eval_every", 4000, "Evaluate after this many train steps")
tf.flags.DEFINE_integer("steps", 20000, "Number of steps")
tf.flags.DEFINE_float("learning_rate", 0.001, "Learning rate")
tf.flags.DEFINE_integer("batch_size", 64, "Batch size during training")
tf.flags.DEFINE_string("optimizer", "Adam", "Optimizer Name (Adam, Adagrad, etc)")

# Evaluation/Testing Parameters
tf.flags.DEFINE_integer("eval_steps", 100, "Number of steps")
tf.flags.DEFINE_integer("eval_batch_size", 8, "Batch size during evaluation")
tf.flags.DEFINE_integer("test_batch_size", 8, "Batch size for testing")

# Vocabulary Parameters
tf.flags.DEFINE_integer("vocab_size", 91620, "The size of the vocabulary. Only change this if you changed the preprocessing")

# Pre-trained embeddings
tf.flags.DEFINE_string("glove_path", None, "Path to pre-trained Glove vectors")
tf.flags.DEFINE_string("vocab_path", None, "Path to vocabulary.txt file")

# Prediction parameters
tf.flags.DEFINE_string("vocab_processor_file", "./data/vocab_processor_p2.bin", "Saved vocabulary processor file")

# Model Parameters
tf.flags.DEFINE_integer("embedding_dim", 100, "Dimensionality of the embeddings")
tf.flags.DEFINE_integer("rnn_dim", 256, "Dimensionality of the RNN cell")
tf.flags.DEFINE_integer("max_context_len", 160, "Truncate contexts to this length")
tf.flags.DEFINE_integer("max_utterance_len", 80, "Truncate utterance to this length")

FLAGS = tf.flags.FLAGS

Although in this Notebook we can store global variables, it's convenient to save the parameters in the <b>Tensorflow Flags</b>. After defining them, we can initialize an object that contains them.

In [None]:
HParams = namedtuple(
  "HParams",
  [
    "batch_size",
    "embedding_dim",
    "eval_batch_size",
    "learning_rate",
    "max_context_len",
    "max_utterance_len",
    "optimizer",
    "rnn_dim",
    "vocab_size",
    "glove_path",
    "vocab_path"
  ])

def create_hparams():
    return HParams(
        batch_size=FLAGS.batch_size,
        eval_batch_size=FLAGS.eval_batch_size,
        vocab_size=FLAGS.vocab_size,
        optimizer=FLAGS.optimizer,
        learning_rate=FLAGS.learning_rate,
        embedding_dim=FLAGS.embedding_dim,
        max_context_len=FLAGS.max_context_len,
        max_utterance_len=FLAGS.max_utterance_len,
        glove_path=FLAGS.glove_path,
        vocab_path=FLAGS.vocab_path,
        rnn_dim=FLAGS.rnn_dim)

TIMESTAMP = int(time.time())

# The directory where your model is going to be stored. Change the timestamp for a more recognizable name if you want

MODEL_DIR = os.path.abspath(os.path.join("./runs", str(TIMESTAMP)))
MODEL_DIR_PRETRAINED = os.path.abspath(os.path.join("./runs", "1486584016/"))

print(MODEL_DIR)

TRAIN_FILE = os.path.abspath(os.path.join(FLAGS.input_dir, "train.tfrecords"))
VALIDATION_FILE = os.path.abspath(os.path.join(FLAGS.input_dir, "validation.tfrecords"))
TEST_FILE = os.path.abspath(os.path.join(FLAGS.input_dir, "test.tfrecords"))

# Level of logging
tf.logging.set_verbosity(tf.logging.INFO)

# We initialize the object that will be used in the model to access the variables.
hparams = create_hparams()

# 4. Description of the model

Once we have our parameters initialized, it's time to start defining the <b>Graph</b> of the model. First, we define a function to initialize the <b>embedding matrix</b>. Its shape is <i>number of words in the vocabulary</i> x <i>number of embedding dimensions</i>, both defined in the flags above (<i>vocab_size</i> and <i>embedding_dim</i>).

In [None]:
def get_embeddings(hparams):
    tf.logging.info("Starting with random embeddings.")
    initializer = tf.random_uniform_initializer(-0.25, 0.25)
    return tf.get_variable(
        "word_embeddings",
        shape=[hparams.vocab_size, hparams.embedding_dim],
        initializer=initializer)

Now it's time to define our <b>model</b>. Using the previous function, we get the embedding matrix with the desired dimensions. With that matrix, we look up the word IDs of both the context and the utterance to get the embedded representation of the sentences.
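To make the lookup concrete, here is a small NumPy sketch of what an embedding lookup does: each word ID selects one row of the matrix. The tiny vocabulary size, the IDs, and the variable names are made up for this example, and plain NumPy indexing stands in for `tf.nn.embedding_lookup`.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4          # toy sizes, not the real flags
rng = np.random.RandomState(0)

# Same initialization range as get_embeddings: uniform in [-0.25, 0.25]
embeddings = rng.uniform(-0.25, 0.25, size=(vocab_size, embedding_dim))

# One padded sentence of 5 word IDs (batch of size 1)
context_ids = np.array([[3, 7, 1, 0, 0]])

# Indexing with the IDs picks one embedding row per word
context_embedded = embeddings[context_ids]
print(context_embedded.shape)              # [batch, sentence length, embedding_dim]
```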

Once we have them, we define what type of **cell** we want to use for the Recurrent Neural Network. In this model we are going to use **LSTM**, although GRU could also work. Then, we use that cell to define the complete RNN, which will be used as encoder. We concatenate both embedded sentences along the batch dimension and pass them through the same RNN, so the context and the utterance are encoded with shared weights. At the end, we split the final state back into the encoded context and the encoded utterance.
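The concatenate-then-split trick can be illustrated without TensorFlow. In this sketch, `toy_encoder` (a simple average over time) is a hypothetical stand-in for the LSTM; only the shapes and the stacking along the batch axis mirror the model.

```python
import numpy as np


def toy_encoder(batch):
    """Stand-in for the RNN: collapse the time axis into one vector per sentence."""
    return batch.mean(axis=1)


# Two contexts and two utterances, padded to the same length (5) with 3 features
context_embedded = np.ones((2, 5, 3))
utterance_embedded = np.zeros((2, 5, 3))

# Stack along the batch axis so one encoder processes both inputs
stacked = np.concatenate([context_embedded, utterance_embedded], axis=0)   # [4, 5, 3]
encoded = toy_encoder(stacked)                                             # [4, 3]

# The first half of the rows are contexts, the second half utterances
encoding_context, encoding_utterance = np.split(encoded, 2, axis=0)
print(encoding_context.shape, encoding_utterance.shape)
```

Note that this only works because both inputs are padded to the same length.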

Finally, the **prediction** for a given context is obtained by multiplying the encoded context by the *prediction matrix* M, which is trained with the rest of the model. However, this prediction is not the final output. We take its dot product with the actual encoded utterance and apply the *sigmoid* function to get the probability of the context-utterance pair being correct.
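The scoring step can be written as sigmoid((c · M) · r) for an encoded context c and an encoded utterance r. The NumPy sketch below computes it for a small batch; the random vectors, the toy dimension, and the function names are assumptions used only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pair_probability(encoding_context, encoding_utterance, M):
    """Probability that each (context, utterance) row is a correct pair."""
    generated_response = encoding_context.dot(M)              # c * M, [batch, rnn_dim]
    # Row-wise dot product between the generated and the actual response
    logits = np.sum(generated_response * encoding_utterance, axis=1, keepdims=True)
    return sigmoid(logits)                                     # [batch, 1]


rng = np.random.RandomState(0)
rnn_dim = 4                                                    # toy size
M = rng.randn(rnn_dim, rnn_dim)
encoding_context = rng.randn(3, rnn_dim)
encoding_utterance = rng.randn(3, rnn_dim)

probs = pair_probability(encoding_context, encoding_utterance, M)
print(probs.shape)
```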

For the testing we will return this score as a result. However, for the training we must minimize the **mean loss** for each batch. The chosen loss function is the **cross entropy**, as in the training dataset we have labelled whether each utterance belongs to the context. 

Thanks to that, if the label is 1 (the pair is correct), the loss will be very close to 0 only if the score is high; a low score is penalized with a large loss. The same works for the other case: if the label is 0 (the pair is wrong) and the score is high, the mistake is penalized because the loss increases.
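A quick numeric check of this behaviour, using the standard binary cross-entropy formula on a probability. This is a sketch: the model itself computes the loss from the logits with `tf.nn.sigmoid_cross_entropy_with_logits`, and the scores below are made up.

```python
import math


def binary_cross_entropy(prob, label):
    """Loss for one pair, given the predicted probability and the 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


# Made-up scores covering the four cases discussed above
for label, prob in [(1, 0.9), (1, 0.1), (0, 0.9), (0, 0.1)]:
    loss = binary_cross_entropy(prob, label)
    print("label={} score={:.1f} loss={:.3f}".format(label, prob, loss))
```

The two confident and correct cases give a loss of about 0.105, while the two confident and wrong cases give about 2.303.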

In [None]:
def dual_encoder_model(
    hparams,
    mode,
    context,
    context_len,
    utterance,
    utterance_len,
    targets):

    # Initialize the embeddings (randomly, as defined in get_embeddings)
    embeddings_W = get_embeddings(hparams)

    # Embed the context and the utterance
    context_embedded = tf.nn.embedding_lookup(
        embeddings_W, context, name="embed_context")
    utterance_embedded = tf.nn.embedding_lookup(
        embeddings_W, utterance, name="embed_utterance")

    # Build the RNN
    with tf.variable_scope("rnn") as vs:
        # We use an LSTM Cell
        cell = tf.nn.rnn_cell.LSTMCell(
            hparams.rnn_dim,
            forget_bias=2.0,
            use_peepholes=True,
            state_is_tuple=True)

        # Run the utterance and context through the RNN
        rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
            cell,
            tf.concat(0, [context_embedded, utterance_embedded]),
            sequence_length=tf.concat(0, [context_len, utterance_len]),
            dtype=tf.float32)
        encoding_context, encoding_utterance = tf.split(0, 2, rnn_states.h)

    with tf.variable_scope("prediction") as vs:
        M = tf.get_variable("M",
          shape=[hparams.rnn_dim, hparams.rnn_dim],
          initializer=tf.truncated_normal_initializer())

        # "Predict" a  response: c * M
        generated_response = tf.matmul(encoding_context, M)
        generated_response = tf.expand_dims(generated_response, 2)
        encoding_utterance = tf.expand_dims(encoding_utterance, 2)

        # Dot product between generated response and actual response
        # (c * M) * r
        logits = tf.batch_matmul(generated_response, encoding_utterance, True)
        logits = tf.squeeze(logits, [2])

        # Apply sigmoid to convert logits to probabilities
        probs = tf.sigmoid(logits)

        if mode == tf.contrib.learn.ModeKeys.INFER:
            return probs, None

        # Calculate the binary cross-entropy loss
        losses = tf.nn.sigmoid_cross_entropy_with_logits(logits, tf.to_float(targets))

    # Mean loss across the batch of examples
    mean_loss = tf.reduce_mean(losses, name="mean_loss")
    return probs, mean_loss

# 5. Training

We have defined our parameters and prepared the model for training and testing. Now, we need to actually **train** the model. During this process we are going to train different variables: vocabulary embeddings, prediction matrix and RNN parameters. We are going to create a function that returns the function to be used, depending on the mode (training/testing/evaluation).

First, some helper functions. With *get_id_feature()* we can retrieve the list of IDs of the words of a sentence and its length.

In [None]:
def get_id_feature(features, key, len_key, max_len):
    ids = features[key]
    ids_len = tf.squeeze(features[len_key], [1])
    ids_len = tf.minimum(ids_len, tf.constant(max_len, dtype=tf.int64))
    return ids, ids_len
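The length handling in *get_id_feature()* can be mirrored in NumPy: lengths stored with shape [batch, 1] are squeezed to [batch] and capped at the maximum length. The function name `clip_lengths` and the sample lengths are invented for this sketch.

```python
import numpy as np


def clip_lengths(lengths, max_len):
    """Squeeze [batch, 1] lengths to [batch] and cap them at max_len."""
    return np.minimum(np.squeeze(lengths, axis=1), max_len)


# Three sentences: shorter than, equal to, and longer than the limit
lengths = np.array([[12], [160], [431]], dtype=np.int64)
print(clip_lengths(lengths, 160))
```

Capping matters because the ID sequences are truncated to the maximum length, so a stored length above that limit would point past the end of the sequence.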

Now, we define the function that is going to perform the training:

In [None]:
def create_train_op(loss, hparams):
    train_op = tf.contrib.layers.optimize_loss(
        loss=loss,
        global_step=tf.contrib.framework.get_global_step(),
        learning_rate=hparams.learning_rate,
        clip_gradients=10.0,
        optimizer=hparams.optimizer)
    return train_op
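The `clip_gradients=10.0` argument limits how large a single update can be. The sketch below assumes the clipping is done by global norm (all gradients rescaled together when their joint norm exceeds the limit); the helper name and the toy gradients are hypothetical.

```python
import numpy as np


def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together if their joint norm exceeds clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The scale is 1.0 when the norm is already within the limit
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


# Toy gradients with a joint norm of 50
grads = [np.array([30.0, 40.0]), np.array([0.0])]
clipped, norm = clip_by_global_norm(grads, 10.0)
print(norm, clipped[0])
```

The direction of the update is preserved; only its magnitude is reduced.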

Now, to build the whole training process, we create a function that returns the model function required by the estimator. The reason is that we might want to experiment with new models: thanks to this, we can create or modify a model implementation and pass it in without changing the next function:

In [None]:
def create_model_fn(hparams, model_impl):
    def model_fn(features, targets, mode):
        context, context_len = get_id_feature(
            features, "context", "context_len", hparams.max_context_len)

        utterance, utterance_len = get_id_feature(
            features, "utterance", "utterance_len", hparams.max_utterance_len)
        
        if mode == tf.contrib.learn.ModeKeys.TRAIN:
            probs, loss = model_impl(
                  hparams,
                  mode,
                  context,
                  context_len,
                  utterance,
                  utterance_len,
                  targets)
            train_op = create_train_op(loss, hparams)
            return probs, loss, train_op

        if mode == tf.contrib.learn.ModeKeys.INFER:
            probs, loss = model_impl(
                hparams,
                mode,
                context,
                context_len,
                utterance,
                utterance_len,
                None)
            return probs, 0.0, None

        if mode == tf.contrib.learn.ModeKeys.EVAL:
            batch_size = targets.get_shape().as_list()[0]
            # We have 10 examples per record, so we accumulate them
            all_contexts = [context]
            all_context_lens = [context_len]
            all_utterances = [utterance]
            all_utterance_lens = [utterance_len]
            all_targets = [tf.ones([batch_size, 1], dtype=tf.int64)]

            for i in range(9):
                distractor, distractor_len = get_id_feature(features,
                    "distractor_{}".format(i),
                    "distractor_{}_len".format(i),
                    hparams.max_utterance_len)
                all_contexts.append(context)
                all_context_lens.append(context_len)
                all_utterances.append(distractor)
                all_utterance_lens.append(distractor_len)
                all_targets.append(
                  tf.zeros([batch_size, 1], dtype=tf.int64)
                )

            probs, loss = model_impl(
                hparams,
                mode,
                tf.concat(0, all_contexts),
                tf.concat(0, all_context_lens),
                tf.concat(0, all_utterances),
                tf.concat(0, all_utterance_lens),
                tf.concat(0, all_targets))

            split_probs = tf.split(0, 10, probs)
            shaped_probs = tf.concat(1, split_probs)

            # Add summaries
            tf.histogram_summary("eval_correct_probs_hist", split_probs[0])
            tf.scalar_summary("eval_correct_probs_average", tf.reduce_mean(split_probs[0]))
            tf.histogram_summary("eval_incorrect_probs_hist", split_probs[1])
            tf.scalar_summary("eval_incorrect_probs_average", tf.reduce_mean(split_probs[1]))

            return shaped_probs, loss, None

    return model_fn
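The reshaping at the end of the EVAL branch is easy to get wrong, so here is a NumPy sketch of it. The stacked model output has shape [10 * batch, 1], where block k holds candidate k for every context; splitting and concatenating turns it into [batch, 10], with the correct utterance in column 0. The fake probabilities are made up so that each value can be traced.

```python
import numpy as np

batch_size, num_candidates = 2, 10

# Fake model output for the stacked input: row index / 100, shape [20, 1]
probs = np.arange(batch_size * num_candidates, dtype=float).reshape(-1, 1) / 100

# 10 blocks of shape [batch, 1]: block 0 is the correct utterance,
# blocks 1..9 are the distractors
split_probs = np.split(probs, num_candidates, axis=0)

# Put the blocks side by side: one row per context, one column per candidate
shaped_probs = np.concatenate(split_probs, axis=1)
print(shaped_probs.shape)
```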

After defining the model function, we want to define the function that prepares the data to be used in batches for training/evaluating/testing. We are going to follow the same procedure as before, defining a function that returns another customized function depending on the mode and the other parameters passed to it.

In [None]:
# Helper function to store the information depending on the MODE used
TEXT_FEATURE_SIZE = 160

def get_feature_columns(mode):
    feature_columns = []

    feature_columns.append(tf.contrib.layers.real_valued_column(column_name="context", dimension=TEXT_FEATURE_SIZE, dtype=tf.int64))
    feature_columns.append(tf.contrib.layers.real_valued_column(column_name="context_len", dimension=1, dtype=tf.int64))
    feature_columns.append(tf.contrib.layers.real_valued_column(column_name="utterance", dimension=TEXT_FEATURE_SIZE, dtype=tf.int64))
    feature_columns.append(tf.contrib.layers.real_valued_column(column_name="utterance_len", dimension=1, dtype=tf.int64))

    if mode == tf.contrib.learn.ModeKeys.TRAIN:
        # During training we have a label feature to know if the pair context-utterance is correct
        feature_columns.append(tf.contrib.layers.real_valued_column(column_name="label", dimension=1, dtype=tf.int64))

    if mode == tf.contrib.learn.ModeKeys.EVAL:
        # During evaluation we have 9 distractors
        for i in range(9):
            feature_columns.append(tf.contrib.layers.real_valued_column(column_name="distractor_{}".format(i), dimension=TEXT_FEATURE_SIZE, dtype=tf.int64))
            feature_columns.append(tf.contrib.layers.real_valued_column(column_name="distractor_{}_len".format(i), dimension=1, dtype=tf.int64))

    return set(feature_columns)

In [None]:
# Prepare the data to be used in batches
def create_input_fn(mode, input_files, batch_size, num_epochs):
    def input_fn():
        features = tf.contrib.layers.create_feature_spec_for_parsing(get_feature_columns(mode))

        feature_map = tf.contrib.learn.io.read_batch_features(
            file_pattern=input_files,
            batch_size=batch_size,
            features=features,
            reader=tf.TFRecordReader,
            randomize_input=True,
            num_epochs=num_epochs,
            queue_capacity=200000 + batch_size * 10,
            name="read_batch_features_{}".format(mode))

        # This is an ugly hack because of a current bug in tf.learn
        # During evaluation TF tries to restore the epoch variable which isn't defined during training
        # So we define the variable manually here
        if mode == tf.contrib.learn.ModeKeys.TRAIN:
            tf.get_variable("read_batch_features_eval/file_name_queue/limit_epochs/epochs",
                initializer=tf.constant(0, dtype=tf.int64))

        if mode == tf.contrib.learn.ModeKeys.TRAIN:
            target = feature_map.pop("label")
        else:
            # In evaluation we have 10 classes (utterances).
            # The first one (index 0) is always the correct one and the other 9 are the distractors
            target = tf.zeros([batch_size, 1], dtype=tf.int64)
        return feature_map, target
    return input_fn

Now, for the evaluation metric we are going to define recall@k, which is explained in detail in section 6:

In [None]:
def create_evaluation_metrics():
    eval_metrics = {}
    for k in [1, 2, 5, 10]:
        eval_metrics["recall_at_%d" % k] = MetricSpec(metric_fn=functools.partial(
            tf.contrib.metrics.streaming_sparse_recall_at_k,
            k=k))
    return eval_metrics


That's it, now we just have to specify the model we want to train with to obtain a valid function. After all the functions have been created, we can initialize our model, estimator, training input, metrics and monitor to see how the loss evolves.

In [None]:
# Initialize the model, specifying that we want the Dual Encoder (the only one we have)
model_fn = create_model_fn(
    hparams,
    model_impl=dual_encoder_model)

# Initialize the estimator
estimator = tf.contrib.learn.Estimator(
    model_fn=model_fn,
    model_dir=MODEL_DIR,
    config=tf.contrib.learn.RunConfig())

# Initialize the input batch structure for the training
input_fn_train = create_input_fn(
    mode=tf.contrib.learn.ModeKeys.TRAIN,
    input_files=[TRAIN_FILE],
    batch_size=hparams.batch_size,
    num_epochs=FLAGS.num_epochs)

# Same with the evaluation input, to monitor the accuracy on testing during the training
input_fn_eval = create_input_fn(
    mode=tf.contrib.learn.ModeKeys.EVAL,
    input_files=[VALIDATION_FILE],
    batch_size=hparams.eval_batch_size,
    num_epochs=1)

# Initialize the metric we are going to use to measure the accuracy of the model (recall@k)
eval_metrics = create_evaluation_metrics()

# We put the evaluation data and the metric together to monitor the accuracy
eval_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    input_fn=input_fn_eval,
    every_n_steps=FLAGS.eval_every,
    metrics=eval_metrics)

**Everything is ready for the training!** Note that in the original experiment they trained for 20000 steps, which can take a long time (about 30-40 hours) without a GPU. Feel free to change the number of steps, although the results can be notably worse.

In [None]:
estimator.fit(input_fn=input_fn_train, steps=FLAGS.steps, monitors=[eval_monitor])

# 6. EVALUATING

Now that we have a fitted model, we can test it with the evaluation set included in the Ubuntu Corpus. Remember the structure:
* Context
* Correct utterance
* Nine distractors (wrong utterances)

The evaluation is based on the **recall@k** metric, where k is the size of the subset selected. In other words, for each context the model evaluates all 10 possible utterances and assigns a score to each of them. For recall@1 a prediction is correct only if the best score belongs to the correct utterance, for recall@5 it is considered correct if the correct utterance is among the 5 best scores, etc.
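A minimal NumPy version of recall@k makes the definition concrete. It assumes, as in this notebook, that the correct utterance always sits at index 0 of the 10 candidates; the function name and the scores are invented for the example, and the notebook itself relies on `tf.contrib.metrics.streaming_sparse_recall_at_k`.

```python
import numpy as np


def recall_at_k(scores, k, correct_index=0):
    """Fraction of contexts whose correct utterance is among the k best scores."""
    # Candidate indices sorted from highest to lowest score, keeping the first k
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = np.any(top_k == correct_index, axis=1)
    return hits.mean()


# Made-up scores for two contexts with 10 candidates each
scores = np.array([
    [0.9, 0.1, 0.2, 0.3, 0.1, 0.0, 0.2, 0.1, 0.3, 0.2],   # correct one ranked 1st
    [0.4, 0.8, 0.6, 0.1, 0.1, 0.0, 0.2, 0.1, 0.3, 0.2],   # correct one ranked 3rd
])

for k in [1, 2, 5, 10]:
    print("recall@{} = {}".format(k, recall_at_k(scores, k)))
```

With 10 candidates, recall@10 is always 1, so it only serves as a sanity check.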

**IMPORTANT NOTE**: From now on, if you want to evaluate/test with the provided checkpoint, please use "estimator_pretrained" as a variable instead of "estimator". If you have trained your own estimator, then just keep using the default one.

In [None]:
estimator_pretrained = tf.contrib.learn.Estimator(
    model_fn=model_fn,
    model_dir=MODEL_DIR_PRETRAINED,
    config=tf.contrib.learn.RunConfig())

In [None]:
input_fn_test = create_input_fn(
    mode=tf.contrib.learn.ModeKeys.EVAL,
    input_files=[TEST_FILE],
    batch_size=FLAGS.test_batch_size,
    num_epochs=1)

# use estimator_pretrained if using predefined checkpoint
estimator.evaluate(input_fn=input_fn_test, steps=FLAGS.steps, metrics=eval_metrics)

# 7. MAKING PREDICTIONS

We have to remember that the main goal of this course is to build a chatbot that is able to interact with human beings. That means that it should be able to **give answers to questions outside the dataset**. For that, every time a question is asked we can retrieve a set of possible answers and pass them through the model to obtain their scores. Once the whole process is done, we select the one with the best score as the answer that will be returned to the user!

In [None]:
def tokenizer_fn(iterator):
  return (x.split(" ") for x in iterator)

# Load vocabulary
vp = tf.contrib.learn.preprocessing.VocabularyProcessor.restore(FLAGS.vocab_processor_file)

# Load your own data here
INPUT_CONTEXT = "How can I remove a file"
POTENTIAL_RESPONSES = ["what do you mean?", "rm -r", "top", "ifconfig"]

def get_features(context, utterance):
  context_matrix = np.array(list(vp.transform([context])))
  utterance_matrix = np.array(list(vp.transform([utterance])))
  context_len = len(context.split(" "))
  utterance_len = len(utterance.split(" "))
  features = {
    "context": tf.convert_to_tensor(context_matrix, dtype=tf.int64),
    "context_len": tf.constant(context_len, shape=[1,1], dtype=tf.int64),
    "utterance": tf.convert_to_tensor(utterance_matrix, dtype=tf.int64),
    "utterance_len": tf.constant(utterance_len, shape=[1,1], dtype=tf.int64),
  }
  return features, None

In [None]:
# Ugly hack, seems to be a bug in Tensorflow
# estimator.predict doesn't work without this line
# use estimator_pretrained if using predefined checkpoint
estimator._targets_info = tf.contrib.learn.estimators.tensor_signature.TensorSignature(tf.constant(0, shape=[1,1]))


# use estimator_pretrained if using predefined checkpoint
print("Context: {}".format(INPUT_CONTEXT))
for r in POTENTIAL_RESPONSES:
    prob = estimator.predict(input_fn=lambda: get_features(INPUT_CONTEXT, r))
    results = next(prob)
    print(r, results)
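The loop above only prints the scores. Picking the final answer is then a matter of taking the candidate with the highest score, as in this sketch. The scores below are made up for illustration and are not output of the model, and `best_response` is a hypothetical helper.

```python
def best_response(candidates, scores):
    """Return the candidate with the highest score, together with that score."""
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


candidates = ["what do you mean?", "rm -r", "top", "ifconfig"]
scores = [0.31, 0.78, 0.12, 0.09]   # invented values, one per candidate

answer, score = best_response(candidates, scores)
print("{} ({})".format(answer, score))
```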

However, in the last step we have cheated. We have manually added the candidates to be evaluated, but this is not going to be possible in a real-world scenario. To solve that, we use <b>Solr</b>. Solr gives you the opportunity (among many others that we don't need here) of indexing the whole dataset and performing similarity queries on it.

The best way to perform the indexing is by creating an appropriate structure for the data. We need to query the user input (the question) against the database, select a group of the most similar existing questions and get the answers given by the other user in the Ubuntu forum, which are the candidates to be evaluated. Each sentence in the dataset can be stored with the following information:

- **author**: name of the user that wrote the sentence
- **recipient**: name of the other user present in the dialog
- **content**: the sentence (can be considered the <i>answer</i>)
- **responseTo**: the last sentence from the other user that came before this one (can be considered the <i>question</i>)

With this structure in Solr, we can query the user's question to the chatbot against the <i>responseTo</i> field of all the stored sentences. The ones with the highest Solr similarity score are the sentences most likely to answer the same question the user is asking, so we can take their <i>content</i> field and add them to the set of possible answers to return to the user.
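As an illustration of the structure described above, the following sketch turns a two-person dialog into the documents that would be indexed. The helper `build_documents`, the user names, and the choice of leaving out <i>responseTo</i> when there is no previous sentence from the other user are assumptions of this example; sending the documents to Solr is not shown.

```python
def build_documents(dialog):
    """dialog: list of (author, sentence) tuples from a two-person conversation."""
    authors = sorted({author for author, _ in dialog})
    last_by_author = {}   # most recent sentence written by each user
    docs = []
    for author, sentence in dialog:
        others = [a for a in authors if a != author]
        recipient = others[0] if others else None
        doc = {"author": author, "recipient": recipient, "content": sentence}
        # The "question" is the last sentence the other user wrote before this one
        if recipient in last_by_author:
            doc["responseTo"] = last_by_author[recipient]
        docs.append(doc)
        last_by_author[author] = sentence
    return docs


# Toy dialog with invented user names
dialog = [
    ("alice", "how do I remove a file ?"),
    ("bob", "use rm followed by the file name"),
    ("alice", "thanks , that worked"),
]

for doc in build_documents(dialog):
    print(doc)
```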

In [None]:
import requests

solr_server = "http://localhost:8983/solr/"
col_name = "ubuntu_corpus/"

def predict(raw_query, number_results):
    # Let requests URL-encode the parameters, so characters such as "&" or "#"
    # in the user question cannot break the query string
    params = {
        "defType": "edismax",
        "indent": "on",
        "bq": "responseTo:[* TO *]^5",  # boost documents that have a responseTo value
        "q.alt": raw_query,
        "qf": "responseTo",
        "rows": number_results,
        "wt": "json",
    }
    r = requests.get(solr_server + col_name + "select", params=params).json()

    # Each returned document stores a candidate answer in its 'content' field
    candidates_objects = r['response']['docs']
    candidates = [c['content'] for c in candidates_objects]
    print(candidates)

In [None]:
predict("what is the command to remove a file?", 20)