# Conditioned Language Models
In this homework, you will implement a language model that generates text based on ML articles from Medium. The language model will be conditioned on the title of the article.

## Model Overview
A language model predicts the word at time *t*, *w<sub>t</sub>*, given the previously generated words *w<sub>t-K</sub>*, *w<sub>t-K+1</sub>*, ..., *w<sub>t-1</sub>*, where *K* is a given window size. We will implement this with an RNN: the previous *K* words are fed in sequentially as input, and the output is a classification over all possible words in the vocabulary.

We will also condition this RNN on a vector representation of the title of the given article. There are many ways to do this; for this homework we will run a bidirectional RNN over the title and combine its outputs into a single vector, which is then used to initialize the language model RNN.

**Note:** This model is not expected to get stellar results, and is not based on any state-of-the-art work. The purpose of building this architecture is to familiarize yourself with various ways to implement and combine different layers.

We'll start off with some imports that we'll be using for text preprocessing, training, and inference. Additionally, we'll be defining some constants for different parts of the model. Feel free to modify the constants as needed or desired.

In [None]:
import csv
import itertools
import math
import nltk
import os
import random

from gensim.models import FastText

import numpy as np
import tensorflow as tf

# Constants
BATCH_SIZE = 256
BEAM_SEARCH_SIZE = 10
CONTEXT_LSTM_UNITS = 128
DECODE_LENGTH = 50
LEARNING_RATE = 1e-3
LOG_DIR = 'logs'
LOG_RATE = 10
NUM_EPOCHS = 3
TITLE_LSTM_UNITS = 128
WINDOW_LENGTH = 15
WORDVEC_DIM = 50

Next, we'll be importing our data and running preprocessing to extract tokenized titles and the raw text of the article. We will also be using gensim's FastText implementation to get word embeddings for each of the words.

In [None]:
with open('articles.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader, None)  # Skip the header row
    raw_text = list(reader)

# List of article titles, where each element is a list of tokenized words in each title
titles = [nltk.word_tokenize(row[4]) for row in raw_text]

# Intermediary list for tokenized text extraction
sent_texts = [[s.replace('\n', ' ') for s in nltk.sent_tokenize(row[5])] for row in raw_text]

# List of tokenized article content, where each element is a list of sentences containing tokenized words
tokenized = [[nltk.word_tokenize(sent.lower()) for sent in text] for text in sent_texts]

# Object holding FastText word embeddings for our corpus of words
ft = FastText(list(itertools.chain(*tokenized)) + [[w.lower() for w in t] for t in titles],
              size=WORDVEC_DIM, min_count=1)

With our data now read, we can define a couple more values that will be needed in our model.

In [None]:
title_length = max([len(t) for t in titles])
vocab_size = len(ft.wv.vocab)

Once we have our data processed, we can generate our data samples to be fed in during training. Each sample will be a tuple of `(article title, K-window, target word)`.

In [None]:
data = []
flat_docs = [list(itertools.chain(*text)) for text in tokenized]
for title, doc in zip(titles, flat_docs):
    for j, target in enumerate(doc):
        # Window of up to WINDOW_LENGTH words preceding the target word
        data.append((title, doc[max(j - WINDOW_LENGTH, 0):j], target))
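To make the sample format concrete, here is a small sketch that applies the same slicing logic to a toy document. The helper `make_samples`, the toy title, and the toy tokens are illustrative assumptions and are not part of the homework data.

```python
def make_samples(title, tokens, window_length):
    """Return (title, window, target) tuples for one flattened document."""
    samples = []
    for j, target in enumerate(tokens):
        # Up to `window_length` words that precede the target word
        window = tokens[max(j - window_length, 0):j]
        samples.append((title, window, target))
    return samples

# Toy inputs, with a small window so the truncation is visible
toy_title = ['a', 'toy', 'title']
toy_doc = ['the', 'model', 'predicts', 'the', 'next', 'word']
toy_samples = make_samples(toy_title, toy_doc, window_length=3)

for _, window, target in toy_samples:
    print(window, '->', target)
```

Note that the first sample of every document has an empty window, and the first few windows are shorter than the window size. This is why padding and sequence lengths are needed later on.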

## Problem 1: Batch Generation
Even with the preprocessing done and all samples generated, there is still more to do before we can feed our data directly into the model for training. In particular, we need to:
1. Shuffle the data at each epoch
2. Convert the words in the data into an array of word vectors
3. Convert the target word into an index
4. Pad the title and window sequences with null values so all arrays are of equal size 
5. Extract sequence length for RNN masking
6. Separate the data into batches

For this part, you need to implement a generator over a single epoch of data.  
The generator should yield a 5-tuple `(title, window, title length, window length, target index)`, each of which should be a numpy array. The shape of `title` should be `[BATCH_SIZE, title_length, WORDVEC_DIM]`, `window` should be `[BATCH_SIZE, WINDOW_LENGTH, WORDVEC_DIM]`, and `title length`, `window length`, and `target index` should all be `[BATCH_SIZE]`.
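As a hint for the padding and sequence-length steps, here is a minimal sketch of padding one sequence of word vectors to a fixed length. The helper name `pad_vectors` is hypothetical, and using zero vectors as the "null values" is an assumption; it only covers a single sequence, not the full generator.

```python
import numpy as np

def pad_vectors(vectors, max_len, dim):
    """Pad (or truncate) a list of word vectors to shape [max_len, dim].

    Returns the padded array and the number of real (unpadded) rows.
    """
    padded = np.zeros((max_len, dim), dtype=np.float32)
    length = min(len(vectors), max_len)
    if length > 0:
        padded[:length] = np.asarray(vectors[:length], dtype=np.float32)
    return padded, length

# Toy example: two 4-dimensional "word vectors" padded to length 5
toy_vectors = [np.ones(4), 2 * np.ones(4)]
padded, length = pad_vectors(toy_vectors, max_len=5, dim=4)
print(padded.shape, length)
```

Stacking the padded arrays for every sample in a batch (for example with `np.stack`) gives arrays of the shapes listed above, and the returned lengths form the corresponding length array.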

In [None]:
def generate_batch(data, wordvecs):
    """ YOUR CODE HERE """

Now it's time to start building the model! We will begin by defining the placeholders corresponding to the data we will be inputting.

In [None]:
title_input = tf.placeholder(tf.float32, [None, title_length, WORDVEC_DIM])
title_seqlen_input = tf.placeholder(tf.int32, [None])
context_input = tf.placeholder(tf.float32, [None , WINDOW_LENGTH, WORDVEC_DIM])
context_seqlen_input = tf.placeholder(tf.int32, [None])
target_idx_input = tf.placeholder(tf.int32, [None])

## Problem 2: Bidirectional RNN for Conditioning on Titles
We will now construct the bidirectional RNN over the title, which is used to condition the language model. There are 4 steps in doing so:
1. Create the forward and backward LSTM cells using `tf.nn.rnn_cell.LSTMCell`
2. Pass the title into `tf.nn.bidirectional_dynamic_rnn` using the LSTM cells that you have created
3. Take the mean over timesteps of all outputs of the Bidirectional RNN
4. Transform the vector with a linear layer to fit the dimensions of the language model RNN

Assign the output of the final layer to the variable `title_hidden`. The constants `TITLE_LSTM_UNITS` and `CONTEXT_LSTM_UNITS` will be useful here.
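One detail worth thinking about in step 3 is padding: when a sequence length is passed to a dynamic RNN, the outputs past the end of each sequence are zeros, so a plain mean over the time axis also averages in those padded positions. The NumPy sketch below shows a length-aware mean as one possible alternative. It is an illustration under that assumption, not a required part of the solution, and the function name `masked_mean` is hypothetical.

```python
import numpy as np

def masked_mean(outputs, seq_lengths):
    """Average RNN outputs over the real (unpadded) timesteps only.

    outputs:     array of shape [batch, time, units]
    seq_lengths: array of shape [batch]
    """
    outputs = np.asarray(outputs, dtype=np.float32)
    seq_lengths = np.asarray(seq_lengths)
    time_steps = outputs.shape[1]
    # mask[b, t] is 1.0 where timestep t is a real token of sequence b
    mask = (np.arange(time_steps)[None, :] < seq_lengths[:, None]).astype(np.float32)
    summed = (outputs * mask[:, :, None]).sum(axis=1)
    # Guard against division by zero for empty sequences
    denom = np.maximum(seq_lengths, 1).astype(np.float32)[:, None]
    return summed / denom

toy_outputs = np.array([[[1., 1.], [3., 3.], [100., 100.]],
                        [[2., 4.], [0., 0.], [0., 0.]]])
toy_lengths = np.array([2, 1])
toy_mean = masked_mean(toy_outputs, toy_lengths)
print(toy_mean)
```

The same idea carries over to TensorFlow by building the mask from `title_seqlen_input`.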

In [None]:
""" YOUR CODE HERE """
title_hidden = None

## Problem 3: RNN Language Model
With the bidirectional RNN constructed, we can now proceed with the language model RNN. For this part, not only will you need to construct the RNN, but you'll also need to create the loss and optimizer TensorFlow operations. You can do so following the steps below:
1. As with the bidirectional RNN, create the LSTM cell (only a single one is needed this time)
2. Initialize the cell state to be `title_hidden` (see documentation relating to `tf.nn.rnn_cell.LSTMStateTuple`)
3. Pass the `context_input` through the RNN and retrieve the last output
4. Feed said output through a linear layer to compute logits for every word in the vocab
5. Use a cross entropy loss between the predicted words and target labels
6. Create an optimizer that minimizes the loss

You can use `vocab_size` to determine the number of words to predict over, and `LEARNING_RATE` for the optimizer learning rate.  
Assign your predicted logits to `pred`, loss to `loss`, and optimization operation to `train_step`.
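To clarify what the loss in step 5 computes, here is a NumPy sketch of softmax cross entropy between a batch of logits and integer target indices. It mirrors what a function such as `tf.nn.sparse_softmax_cross_entropy_with_logits` calculates per example, but it is only an illustration; the function name and the toy values are assumptions.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, targets):
    """Per-example cross entropy for logits [batch, vocab] and integer targets [batch]."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Subtract the row maximum for numerical stability; softmax is unchanged
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability assigned to each target word
    return -log_probs[np.arange(len(targets)), targets]

# With uniform logits over 4 words, every target has probability 1/4
toy_logits = np.zeros((2, 4))
toy_targets = np.array([0, 3])
toy_losses = sparse_softmax_cross_entropy(toy_logits, toy_targets)
print(toy_losses, toy_losses.mean())
```

Averaging the per-example values over the batch gives a single scalar that an optimizer can minimize.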

In [None]:
""" YOUR CODE HERE """
pred = None
loss = None
train_step = None

It's useful to look at TensorBoard logging graphs to keep track of training, so I've added code for that below. Feel free to add other summarization or related ops as desired.

In [None]:
if not os.path.isdir(LOG_DIR):
    os.mkdir(LOG_DIR)
run_num = 0
while os.path.isdir(os.path.join(LOG_DIR, 'run_%d' % run_num)):
    run_num += 1

tf.summary.scalar('model_loss', loss)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter(os.path.join(LOG_DIR, 'run_%d' % run_num))

Prior to training, we begin a new session with `tf.Session()` and initialize all variables with `tf.global_variables_initializer()`.

In [None]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())

With our data ready to be batched and model constructed, we can finally begin training. Run the code block below, and feel free to start a TensorBoard server to keep track of progress.  
**Note:** Training the model may take a considerable amount of time (on my laptop it takes nearly half an hour per epoch).

In [None]:
print('Beginning training...')
step = 0
for epoch in range(NUM_EPOCHS):
    # Iterate directly over the generator; the loop ends when the epoch is exhausted
    for batch in generate_batch(data, ft):
        batch_title, batch_context, batch_titlelen, batch_contextlen, batch_target = batch
        feed_dict = {title_input: batch_title, title_seqlen_input: batch_titlelen,
                     context_input: batch_context, context_seqlen_input: batch_contextlen,
                     target_idx_input: batch_target}
        if step % LOG_RATE == 0:
            _, step_loss, summary = sess.run([train_step, loss, merged], feed_dict)
            writer.add_summary(summary, global_step=step)
        else:
            _, step_loss = sess.run([train_step, loss], feed_dict)
        step += 1
    print('Epoch %02d complete, current loss is %.5f' % (epoch, step_loss))
print('Training complete.')

Once training is done, we can generate some text of our own to see what kind of output the model produces. The cell below sets up the ops needed for decoding.

In [None]:
decode_k = tf.placeholder(tf.int32, [])
top_vals, top_idx = tf.nn.top_k(pred, decode_k)

# Since there is no easy way of getting a word from an index with the word vector object, we'll manually create
# our own dictionary for the mapping
idx2word = {ft.wv.vocab.get(k).index: k for k in ft.wv.vocab.keys()}

## Problem 4: Beam Search Decoder
While simple greedy decoding (taking only the most probable word at each step) is sufficient to generate a sequence, beam search keeps several candidate sequences alive at once, so it can find higher-scoring sequences that a greedy approach would discard early. Beam search decoding works as follows:
1. Take the top N most probable existing sequences (or if the initial sequence, the single empty sequence)
2. Compute the logits for all possible following words for each sequence
3. Add each logit to the running logit sum of the sequence it extends
4. Rank all the new sequences by the updated logit sums, and keep the top N
5. Repeat until the sequence is fully generated
6. Keep the sequence with the highest logit sum (used here as a proxy for the generation probability)

Implement the `beam_search_decode` function to generate text of `DECODE_LENGTH` words.
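The steps above can be sketched independently of the model. In the example below, `score_fn` is a hypothetical stand-in for running the network: it maps a partial sequence to a score for each candidate next token. The toy scores are made-up log-probabilities, chosen so that greedy decoding and beam search disagree.

```python
import math

def beam_search(score_fn, beam_size, length):
    """Generic beam search returning the best (sequence, score) pair.

    score_fn(seq) must return a dict mapping each candidate next token
    to its score (here, a log-probability).
    """
    beams = [([], 0.0)]  # Start from the single empty sequence
    for _ in range(length):
        candidates = []
        for seq, total in beams:
            for token, score in score_fn(seq).items():
                candidates.append((seq + [token], total + score))
        # Rank all extended sequences and keep the top `beam_size`
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

def toy_score_fn(seq):
    """Made-up next-token distribution for illustration only."""
    if not seq:
        probs = {'a': 0.6, 'b': 0.4}
    elif seq[-1] == 'b':
        probs = {'x': 0.9, 'y': 0.1}
    else:
        probs = {'x': 0.5, 'y': 0.5}
    return {tok: math.log(p) for tok, p in probs.items()}

greedy_seq, greedy_score = beam_search(toy_score_fn, beam_size=1, length=2)
beam_seq, beam_score = beam_search(toy_score_fn, beam_size=2, length=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

With a beam of size 1 the search commits to `'a'` because it looks best at the first step, while a beam of size 2 keeps `'b'` around long enough to discover the more probable sequence overall. In `beam_search_decode`, the scores would instead come from `top_vals` and `top_idx`, and each sequence would be converted to word vectors before being fed to the model.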

In [None]:
def beam_search_decode(title, wordvecs):
    sequences = [([], 0)]
    if isinstance(title, str):
        title = nltk.word_tokenize(title.lower())
    """ YOUR CODE HERE """

As you may have noticed, even after training, our model doesn't perform all that well. This isn't surprising, since you're likely running on a laptop and using a single-layer RNN on a very limited dataset. However, I hope that through this you have gained some experience working with RNNs and textual data in an actual coding setting.