# Question-Answering Chatbot

The goal of this homework is to build a question-answering chatbot using two very different deep network architectures (a retrieval architecture and a generative architecture), as well as a hybrid of the two.

You will be using the **Stanford Question Answering Dataset** or [**SQuAD**](https://rajpurkar.github.io/SQuAD-explorer/) for training and evaluation.

In this dataset, utterances are questions and responses are short answers. A typical question-answer pair has the form:
* QUESTION: Where do water droplets collide with ice crystals to form precipitation?
* ANSWER: within a cloud

For the purposes of this homework, we will ignore the context passage that accompanies each Q-A pair.
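As a minimal sketch of what dropping the context looks like in practice, the function below walks the nested SQuAD JSON layout (`data` → `paragraphs` → `qas` → `answers`) and keeps only question-answer pairs. The function name `extract_qa_pairs` and the tiny `sample` dictionary are illustrative assumptions; the starter notebook may read the data differently.

```python
def extract_qa_pairs(squad):
    """Return (question, answer) pairs from a SQuAD-style dict, skipping the context."""
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            # paragraph["context"] is deliberately ignored in this homework
            for qa in paragraph["qas"]:
                if qa["answers"]:  # skip questions that have no reference answer
                    pairs.append((qa["question"], qa["answers"][0]["text"]))
    return pairs


# Hand-made sample in the same layout, reusing the example pair above
sample = {
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "(context passage omitted)",
            "qas": [{
                "id": "0",
                "question": "Where do water droplets collide with ice crystals to form precipitation?",
                "answers": [{"text": "within a cloud", "answer_start": 0}],
            }],
        }],
    }]
}

pairs = extract_qa_pairs(sample)
print(pairs)
```

With the real dataset, `squad` would be the result of `json.load` on the downloaded SQuAD file.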

**Evaluation and discussion of this homework assignment**

Please complete section 5 of this assignment. If this section is not completed, you can expect a **ZERO grade** for this assignment.

# A Retrieval-based model for QA-Chatbotting

The goal of this task is to build a **QA-chatbot** using a **retrieval** model. A suggested way of accomplishing this is to take the provided baseline implementation based on DSSM (described below) and extend it in the ways detailed below.

## DSSM baseline implementation
To solve this problem, we recommend an architecture similar to the **Deep Structured Semantic Model** or [**DSSM**](https://www.microsoft.com/en-us/research/project/dssm/), trained on the **Stanford Question Answering Dataset** or [**SQuAD**](https://rajpurkar.github.io/SQuAD-explorer/). We provide a starter notebook on the class GitHub in the Labs subfolder `Unit16_DSSM_IR`. It contains a working baseline for building a DSSM retrieval model with the same dataset, but the performance of the model presented there can be greatly improved. The following sections sketch some ideas for improving this baseline DSSM model.

## Improvement 1: better encoder
In the baseline architecture `Unit16_DSSM_IR` we use Global Average Pooling (GAP) as the only encoder layer in both towers. The learning capacity of GAP is far too low (it does not even have learnable parameters!). One possible improvement is to use more elaborate, learnable layers in the encoder phase of the towers. For example, even a one-layer LSTM should work much better. As an alternative, one can also consider [Convolutional](https://arxiv.org/pdf/1705.03122.pdf) or [Attentive](https://arxiv.org/pdf/1706.03762.pdf) encoders, which are among the state-of-the-art encoders. Other possible improvements include stacking multiple layers and adding skip-connections between them. Please **don't** limit your exploration to these suggestions.
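To make the limitation of GAP concrete, here is a small pure-Python sketch (not the notebook's Keras layer) showing that averaging token embeddings has nothing to learn and also discards word order. The 2-d embeddings in `emb` are made up for illustration only.

```python
def global_average_pool(token_vectors):
    """Average the token embeddings dimension-wise into one fixed-size vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Hypothetical 2-d word embeddings, for illustration only
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

a = global_average_pool([emb[w] for w in ["dog", "bites", "man"]])
b = global_average_pool([emb[w] for w in ["man", "bites", "dog"]])

# Both word orders collapse to the same encoding
print(a, b, a == b)
```

An order-aware encoder such as an LSTM would be able to tell these two inputs apart, which is one reason to expect it to outperform GAP.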

## Improvement 2: better sampling

As the title of the recently published paper ["Sampling Matters in Deep Embedding Learning"](https://arxiv.org/pdf/1706.07567.pdf) highlights, the choice of sampling strategy matters. In the baseline `Unit16_DSSM_IR` notebook we discussed a few different sampling strategies, such as `Easy Negatives`, `Hard Negatives`, and `Semi-hard Negatives`. However, we only used the `easy negatives` strategy, which is known to be not very informative. A possible improvement is to implement one of the more advanced sampling techniques and see how the quality of the model changes (in our experience, semi-hard negatives work well). Expect the performance to change dramatically.
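The three strategies can be sketched as a single selection function over similarity scores. The definitions used here are assumptions and may differ in detail from the notebook's: an easy negative is a random wrong answer, a hard negative is the wrong answer most similar to the question, and a semi-hard negative is the most similar wrong answer that still scores below the true answer, within a `margin`. The name `pick_negative` and the example scores are hypothetical.

```python
import random


def pick_negative(pos_sim, neg_sims, strategy, margin=0.2, rng=random):
    """Return the index of the negative candidate chosen by `strategy`.

    pos_sim  : similarity between the question and its true answer
    neg_sims : similarities between the question and each wrong answer
    """
    indices = list(range(len(neg_sims)))
    if strategy == "easy":
        return rng.choice(indices)
    if strategy == "hard":
        return max(indices, key=lambda i: neg_sims[i])
    if strategy == "semi-hard":
        # Negatives that score below the positive, but by less than the margin
        band = [i for i in indices if pos_sim - margin < neg_sims[i] < pos_sim]
        if band:
            return max(band, key=lambda i: neg_sims[i])
        return rng.choice(indices)  # fall back when the band is empty
    raise ValueError("unknown strategy: " + strategy)


pos_sim = 0.7
neg_sims = [0.1, 0.9, 0.6, 0.3]
hard_idx = pick_negative(pos_sim, neg_sims, "hard")
semi_idx = pick_negative(pos_sim, neg_sims, "semi-hard")
print(hard_idx, semi_idx)
```

In a real training loop the similarities would come from the current DSSM towers, so the selected negatives change as the model improves.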

### Hybrid sampling strategy

Another possible improvement is to explore combining different sampling strategies. For example, one possible hybrid strategy is the following:

* start training with `easy negatives` for 10 epochs so the network can learn something;
* then switch to `hard negatives`, which could lead to a more aggressive and informative stage of training; and
* finally switch to `semi-hard negatives`, which can be viewed as a kind of fine-tuning.
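The schedule above could be expressed as a small helper that maps an epoch number to a strategy name. Only the 10 easy epochs come from the description; the length of the hard-negative stage (`hard_epochs`) and the function name are assumptions you would tune yourself.

```python
def sampling_strategy_for_epoch(epoch, easy_epochs=10, hard_epochs=10):
    """Map a 0-based epoch index to the sampling strategy for that epoch."""
    if epoch < easy_epochs:
        return "easy"
    if epoch < easy_epochs + hard_epochs:
        return "hard"
    return "semi-hard"


schedule = [sampling_strategy_for_epoch(e) for e in (0, 9, 10, 19, 20)]
print(schedule)
```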

# A generative approach for QA-Chatbotting 

The goal of this task is to build a **chatbot** using a **generative** model.

For that purpose you are to use the **Encoder-Decoder Sequence-2-Sequence** model on the same QA-chatbot task (**Stanford Question Answering Dataset** or [**SQuAD**](https://rajpurkar.github.io/SQuAD-explorer/)).

One can adapt the same data reading and preprocessing steps that were used in the `Unit16_DSSM_IR` notebook.

![seq2seq.png](seq2seq.png)

The basic idea behind Encoder-Decoder generative models is to use two different interconnected parts of the network:
* **Encoder** is responsible for understanding and encoding the input question
* **Decoder** uses the encoded question to generate the answer on-the-fly

The main difference from retrieval-based models is that the answer is actually generated instead of being picked from the *canned* training responses. [Here](http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/) one can read more on the differences between generative and retrieval models.

Your task in this homework is to build a generative Encoder-Decoder model. The [following](https://github.com/tensorflow/nmt) tutorial gives a very detailed description of Encoder-Decoder models and how they can be implemented in TensorFlow (alas). Recently it has [become](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html) possible to build Encoder-Decoder models in pure Keras (yay!). An alternative to pure Keras is to use a third-party library such as [seq2seq](https://github.com/farizrahman4u/seq2seq). Seq2Seq is built on top of Keras and is easy to use for Encoder-Decoder architectures, but a pure Keras implementation gives more fine-grained control over the model. We recommend you use a pure Keras implementation wherever possible.

Many different architectures are possible with an Encoder-Decoder solution. For example, a baseline architecture might be the following:

* Embedding layer
* One LSTM Encoder layer
* One LSTM Decoder layer

We can improve this baseline architecture in many ways, some of which are sketched out here (but you are not limited to these; be brave and try others!):

* A more complicated Encoder and Decoder. For example, one might consider using stacked bidirectional LSTMs in the Encoder and stacked LSTMs in the Decoder (note that one can NOT use bidirectional layers in the Decoder, because future tokens are not available at generation time).
* An Attention mechanism from Decoder to Encoder. More theory and intuition behind the attention layer can be found [here](https://distill.pub/2016/augmented-rnns/).
* Beam search during the Decoder generation stage, instead of the simple greedy "most probable word" approach, can lead to an improvement of a few percentage points.
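To illustrate the last bullet, here is a minimal beam search sketch in pure Python. The callable `next_token_probs` stands in for one decoder step (in your model it would wrap the decoder's softmax output), and the toy probability table is invented so that the greedy choice is visibly suboptimal; none of these names come from the Keras or TensorFlow tutorials.

```python
import math


def beam_search(next_token_probs, start, end, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found with beam search.

    next_token_probs(prefix) -> {token: probability} plays the role of one decoder step.
    """
    beams = [([start], 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Hypotheses that emitted the end token leave the active beam
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


# Toy decoder: the next-token distribution depends only on the last token
table = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "</s>": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}
step = lambda prefix: table[prefix[-1]]

greedy = beam_search(step, "<s>", "</s>", beam_width=1)  # beam width 1 is greedy decoding
beam = beam_search(step, "<s>", "</s>", beam_width=2)
print(greedy, beam)
```

Greedy decoding commits to `a` (probability 0.6) and ends with a sequence of probability 0.3, while a beam of width 2 keeps `b` alive and finds the sequence with probability 0.36.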

# A hybrid retrieval and generative approach to QA-Chatbotting 

Both the generative and retrieval approaches have their own merits and limitations when it comes to chatbots. Here is a short list for each approach:
* Generative:
    * **+** diverse responses which are generated on-the-fly
    * **-** responses might sometimes be grammatically incorrect and logically inconsistent
* Retrieval:
    * **+** grammatically correct and consistent responses, as they are chosen from human-written phrases
    * **-** lack of diversity; it's impossible to have a ready-to-use phrase for each possible question, even with a very large training set

[Here](http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/) one can find a more detailed comparison of these approaches. The bottom line is that these models are good for different tasks. However, combining the retrieval and generative approaches might give the best of both worlds. Let's explore that!

Your task here is to come up with a good way to combine the generative and retrieval approaches into one or more networks and implement this solution. 

## Recommended hybrid architectures

[This](https://arxiv.org/abs/1610.07149) paper might be good background and a starting point for this task.

Here are some suggested candidate architectures (please do NOT limit your exploration to these):
* Build and train retrieval and generative models. Get the top-N (say 5 or 10) answers from each model. Build a model to re-rank the top 2N answers to find the most suitable one. There are at least two architecture variations here:
    * Use the DSSM model from the retrieval part for re-ranking
    * Build and train a new DSSM model for re-ranking
    
* Get the top-10 answers with the retrieval model and then build a conditional generative model that modifies the retrieved answers and thereby generates a more diverse set of responses. One way to do this is:
    * Get the top-10 answers from the retrieval model
    * Treat these 10 answers and the question as 10 training pairs for the generative model. Train the generative model on these pairs.
    * Build a re-ranking model for the answers of the generative model.
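The re-ranking idea from the first candidate architecture can be sketched as follows: pool the top-N answers from both models, drop duplicates, and sort by a scoring function. The `score` callable is a placeholder for whatever re-ranker you train (for example, DSSM question-answer similarity); the candidate answers and `toy_scores` below are made up purely to show the data flow.

```python
def rerank(question, retrieval_answers, generative_answers, score):
    """Pool candidates from both models and return them best-first."""
    # dict.fromkeys removes duplicates while preserving first-seen order
    pooled = list(dict.fromkeys(retrieval_answers + generative_answers))
    return sorted(pooled, key=lambda ans: score(question, ans), reverse=True)


question = "Where do water droplets collide with ice crystals to form precipitation?"
retrieval_top = ["within a cloud", "in the ocean"]
generative_top = ["inside a cloud", "within a cloud"]

# Invented scores standing in for a trained re-ranking model
toy_scores = {"within a cloud": 0.9, "inside a cloud": 0.7, "in the ocean": 0.1}

ranked = rerank(question, retrieval_top, generative_top, lambda q, a: toy_scores[a])
print(ranked[0])
```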
    


# QA ChatBot Discussion

Please use this section to consolidate the results of all experiments and discuss your findings and recommendations.

Please report the following table of performance statistics for each task on the train and validation sets. In addition, provide a block diagram detailing the best architecture used for each of the three tasks in this assignment.

In the results table, please report the following for each of the tasks:

* Hardware used
* Experiment ID
* ExactMatch (EM) and F1 scores of the best models on both the train and validation sets
* The duration of training in wall-clock time (in seconds)
* The duration of validation in wall-clock time (in seconds)
* Experiment description
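One possible way to collect the wall-clock durations is a small context manager around the training and validation calls. The helper name `wallclock` and the `timings` dictionary are illustrative assumptions, and the `sum(...)` call is only a stand-in for your actual training step.

```python
import time
from contextlib import contextmanager


@contextmanager
def wallclock(timings, label):
    """Store the elapsed wall-clock seconds of the enclosed block in timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}
with wallclock(timings, "train"):
    sum(range(100_000))  # stand-in for the real training call

print(round(timings["train"], 4))
```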

**Note:** [This](https://worksheets.codalab.org/rest/bundles/0xbcd57bee090b421c982906709c8c27e1/contents/blob/) Python script can be used to produce ExactMatch (EM) and F1 scores.
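For intuition about what these two metrics measure, here is a compact sketch modeled on the usual SQuAD-style definitions: answers are normalized (lowercased, punctuation and articles removed), EM checks for identical normalized strings, and F1 is computed over shared tokens. It is a simplified single-reference version; please use the linked script for the numbers you report.

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1(prediction, truth):
    pred = normalize_answer(prediction).split()
    gold = normalize_answer(truth).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


em_score = exact_match("Within a cloud.", "within a cloud")
f1_score = f1("inside a cloud", "within a cloud")
print(em_score, f1_score)
```

Here the first prediction matches exactly after normalization, while the second shares one of its two remaining tokens with the reference.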

Please use a logging table structure like this to keep track of your experiments and make sure to embed them in your notebook (along with a brief description and discussion that is provided outside the table!):

In [None]:
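import numpy as np
import pandas as pd

# One row per experiment; each appended row must list its values in this column order
results = pd.DataFrame(columns=["Hardware", "ExpID", "EMTrain", "F1Train", "EMValid", "F1Valid", "TrainTime(s)", "ValidTime(s)", "Experiment description"])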

results.loc[len(results)] = ["Blah", 
                             np.round(emTrain * 100, 1), 
                             .........
                             expDuration,
                             "Blah",  
                             "W2V Full Wiki"]
results