# Deep Learning with MXNet Gluon - Assignment 3


## Assignment Description


Welcome to the Deep Learning with MXNet/Gluon Week 3 assignment. This assignment focuses on natural language processing with gluon-nlp. In the first question, you will answer questions about RNNs and LSTMs. Then you will do some NLP-specific processing tasks and get the opportunity to try out some pretrained word embeddings in gluonnlp. Finally you will combine word embeddings and fine-tune an image classification model on a new dataset and train an object detection model on a dataset.

### Supplemental Reading
* [Deep Learning, NLP, Representations (Blog)](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)
* [Recurrent Neural Networks (Dive into deep learning)](https://d2l.ai/chapter_recurrent-neural-networks/rnn.html)
* [LSTM (Dive into deep learning)](https://d2l.ai/chapter_recurrent-neural-networks/lstm.html)
* [Natural Language Processing (Dive into deep learning)](https://d2l.ai/chapter_natural-language-processing/index.html)

In [1]:
!pip install gluonnlp



## Recurrent Neural Networks

We saw that feed forward neural networks are not very good for modelling data that is sequential. In order to explicitly model patterns in sequential data we introduced Recurrent Neural Networks (RNNs). The hidden state in RNNs allows us to capture historical information of the sequence up to the current time step. Now, you will walk through an exercise of how this works, using a handcrafted example.

## Question 1

Write code that performs a single RNN update given an input and the current state. 

The input `X`, the state `H`, and the weights have been initialized for you below. Recall that the RNN cell updates the hidden state by computing

$$ H = \sigma(X \cdot W_{xh} + H \cdot W_{hh})$$

and produces the output from the updated hidden state by computing

$$ O = \sigma(H \cdot W_{hq})$$

where $\sigma$ is the activation function. Try using both `nd.sigmoid` and `nd.relu` as the activation function. What differences do you observe? Then run the RNN update for 10 time steps, feeding the output of one time step as the input to the next. What do you observe?

In [21]:
from mxnet import nd

# Data X and hidden state H
X = nd.random.normal(shape=(3, 1))
H = nd.random.normal(shape=(3, 2))

# Weights
W_xh = nd.random.normal(shape=(1, 2))
W_hh = nd.random.normal(shape=(2, 2))
W_hq = nd.random.normal(shape=(2, 1))

# Your code here



[[0.       ]
 [2.331464 ]
 [0.4035033]]
<NDArray 3x1 @cpu(0)>
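
The update above can be sketched as follows. This is a minimal illustration rather than a reference solution: it uses NumPy in place of `mxnet.nd` so that it runs anywhere, the helper name `rnn_step` is hypothetical, and it assumes the output is read from the updated hidden state, which is what the `(2, 1)` shape of `W_hq` requires.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def rnn_step(X, H, W_xh, W_hh, W_hq, activation):
    """One recurrent update: new hidden state, then the output read from it."""
    H_new = activation(X @ W_xh + H @ W_hh)  # (n, 1)@(1, 2) + (n, 2)@(2, 2) -> (n, 2)
    O = activation(H_new @ W_hq)             # (n, 2)@(2, 1) -> (n, 1)
    return O, H_new

# Same shapes as the assignment cell; the values come from NumPy, not MXNet.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 1))
H = rng.normal(size=(3, 2))
W_xh = rng.normal(size=(1, 2))
W_hh = rng.normal(size=(2, 2))
W_hq = rng.normal(size=(2, 1))

# Unroll for 10 steps, feeding each output back in as the next input.
for name, act in [("sigmoid", sigmoid), ("relu", relu)]:
    x_t, h_t = X, H
    for _ in range(10):
        x_t, h_t = rnn_step(x_t, h_t, W_xh, W_hh, W_hq, act)
    print(name, x_t.ravel())
```

With the sigmoid every entry stays strictly between 0 and 1, whereas ReLU outputs are non-negative and unbounded, so repeated application can drive them to zero or let them grow. Which of the two happens depends on the random weights.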


## Training an LSTM Language Model

Now we will train an LSTM language model, but with a model we design by hand. First we load and prepare the training dataset. As in the lecture example, we will use the 'wikitext-2' dataset. We will create a training dataloader from a batched version of the dataset using `nlp.data.batchify.CorpusBPTTBatchify`, as in lecture, so that our model can train in batches.

In [4]:
import warnings
warnings.filterwarnings('ignore')

import glob
import time
import math

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.utils import download

import gluonnlp as nlp

dataset_name = 'wikitext-2'
train_dataset, val_dataset, test_dataset = (nlp.data.WikiText2(segment=segment,
                                                               bos=None, 
                                                               eos='<eos>', 
                                                               skip_empty=False)
                                            for segment in ['train', 'val', 'test'])

num_gpus = 1
context = mx.gpu(0)
log_interval = 200

batch_size = 20
bptt = 35

vocab = nlp.Vocab(nlp.data.Counter(train_dataset), padding_token=None, bos_token=None)
bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(vocab, bptt, batch_size, last_batch='discard')
train_data, val_data, test_data = (bptt_batchify(x) for x in [train_dataset, val_dataset, test_dataset])
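
To make the batching step concrete, here is a rough sketch of the idea behind BPTT batching: the token stream is split into `batch_size` contiguous columns, then cut into segments of `bptt` rows, and each target is the input shifted by one token. The function `bptt_batches` is a hypothetical illustration written with NumPy. It is not the gluon-nlp implementation, and the library's handling of the corpus tail may differ.

```python
import numpy as np

def bptt_batches(token_ids, bptt, batch_size):
    """Yield (data, target) pairs of shape (bptt, batch_size).

    target is data shifted forward by one token in the original stream.
    """
    ids = np.asarray(token_ids)
    n_rows = (len(ids) - 1) // batch_size  # keep one extra token for the last target
    # Column j holds one contiguous slice of the corpus.
    data = ids[:n_rows * batch_size].reshape(batch_size, n_rows).T
    target = ids[1:n_rows * batch_size + 1].reshape(batch_size, n_rows).T
    # Step through time in chunks of bptt rows, dropping an incomplete final chunk.
    for start in range(0, n_rows - bptt + 1, bptt):
        yield data[start:start + bptt], target[start:start + bptt]

# Toy corpus of 50 token ids, 3 parallel streams, 4 time steps per batch.
batches = list(bptt_batches(range(50), bptt=4, batch_size=3))
first_data, first_target = batches[0]
print(len(batches), first_data.shape)
print(first_data[0], first_target[0])
```

Each column continues where it left off in the next batch, which is why the hidden state can be carried across batches (after detaching it) during training.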

## Question 2

In the lecture example, we used a standard LSTM model from GluonNLP. For this assignment, we will write the LSTM model ourselves by extending `gluon.HybridBlock`. As in assignment one, the `__init__` method has been written for you, and you simply need to write the `hybrid_forward` method with the signature provided.

The forward method should consist of the following steps, in order:
* Encoder on the input, followed by dropout.
* LSTM layer (`self.rnn` in the code), with dropout applied to its output.
* Decoder on the output of the LSTM layer.
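
The three steps can be traced through a toy NumPy model to see how the shapes change. This is only a sketch under several assumptions: it uses a single hand-written LSTM layer instead of the two-layer `rnn.LSTM`, omits the decoder bias, uses small hypothetical sizes, and flattens the time and batch axes before decoding so that the logits line up with a target reshaped to `(-1, 1)`. It is not the Gluon solution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, Hd = 12, 5, 4   # toy vocabulary, embedding and hidden sizes (hypothetical)
T, N = 3, 2           # sequence length and batch size (hypothetical)

embed = rng.uniform(-0.1, 0.1, size=(V, E))
W_x = rng.normal(scale=0.1, size=(E, 4 * Hd))   # input weights for the 4 gates
W_h = rng.normal(scale=0.1, size=(Hd, 4 * Hd))  # recurrent weights for the 4 gates
b = np.zeros(4 * Hd)
W_out = rng.normal(scale=0.1, size=(Hd, V))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout(x, p, train):
    """Inverted dropout: zero a fraction p and rescale the rest; identity at eval time."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def forward(inputs, state, p=0.5, train=False):
    h, c = state
    emb = dropout(embed[inputs], p, train)      # 1. encoder + dropout: (T, N) -> (T, N, E)
    outputs = []
    for x_t in emb:                             # 2. recurrent layer, one time step at a time
        gates = x_t @ W_x + h @ W_h + b
        i, f, g, o = np.split(gates, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    out = dropout(np.stack(outputs), p, train)  #    dropout on the recurrent output: (T, N, Hd)
    logits = out.reshape(-1, Hd) @ W_out        # 3. decoder on flattened steps: (T*N, V)
    return logits, (h, c)

tokens = rng.integers(0, V, size=(T, N))
state = (np.zeros((N, Hd)), np.zeros((N, Hd)))
logits, state = forward(tokens, state)
print(logits.shape)
```

The function returns the new state alongside the logits, mirroring how the model in this assignment is called as `output, hidden = model(data, hidden)`.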

In [5]:
from mxnet.gluon import nn, rnn

class RNNModel(gluon.HybridBlock):
    """A model with an encoder, recurrent layer, and a decoder."""

    def __init__(self, vocab_size, num_embed, num_hidden,
                 num_layers, dropout=0.5, tie_weights=False, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        with self.name_scope():
            self.drop = nn.Dropout(dropout)
            self.encoder = nn.Embedding(vocab_size, num_embed,
                                        weight_initializer=mx.init.Uniform(0.1))
            self.rnn = rnn.LSTM(num_hidden, num_layers, dropout=dropout,
                                input_size=num_embed)
            
            if tie_weights:
                self.decoder = nn.Dense(vocab_size, in_units=num_hidden,
                                        params=self.encoder.params)
            else:
                self.decoder = nn.Dense(vocab_size, in_units=num_hidden)

            self.num_hidden = num_hidden
            
    def begin_state(self, *args, **kwargs):
        return self.rnn.begin_state(*args, **kwargs)

    def hybrid_forward(self, F, inputs, hidden):
        # Your code here
        
    


lr = 20
model = RNNModel(len(vocab), 650, 650, 2, 0.5)
print(model)

# Your code here
model.initialize(mx.init.Xavier(), ctx=context)
trainer = gluon.Trainer(model.collect_params(), 'sgd',
                        {'learning_rate': lr,
                         'momentum': 0,
                         'wd': 0})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

RNNModel(
  (drop): Dropout(p = 0.5, axes=())
  (encoder): Embedding(33278 -> 650, float32)
  (rnn): LSTM(650 -> 650, TNC, num_layers=2, dropout=0.5)
  (decoder): Dense(650 -> 33278, linear)
)


## Question 3

Using the example from lecture as inspiration, write the training function to train the custom language model that we've built.

In [None]:
def detach(hidden):
    if isinstance(hidden, (tuple, list)):
        hidden = [i.detach() for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden

def eval(data_source):
    total_L = 0.0
    ntotal = 0
    hidden = model.begin_state(func=mx.nd.zeros, batch_size=batch_size, ctx=context)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(context)
        target = target.as_in_context(context).reshape((-1, 1))
        output, hidden = model(data, hidden)
        L = loss(output, target)
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal

grad_clip = 0.25
epochs = 3

# Your code here: write training 

def train(model, train_data, val_data, test_data, epochs, lr):
   
    
train(model, train_data, val_data, test_data, epochs, lr)
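
Two ingredients that language-model training loops commonly rely on are gradient clipping by global norm (the role a value like `grad_clip` usually plays) and reporting perplexity as the exponential of the average loss. The sketch below illustrates both with NumPy. The function here is a plain re-implementation for illustration, not MXNet's `gluon.utils.clip_global_norm`, and the loss value is a made-up number, not a result from this model.

```python
import math
import numpy as np

def clip_global_norm(grads, max_norm):
    """Rescale all gradients in place so their joint L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = max_norm / (total_norm + 1e-8)
    if scale < 1.0:
        for g in grads:
            g *= scale
    return total_norm

def global_norm(grads):
    return math.sqrt(sum(float((g ** 2).sum()) for g in grads))

# Two toy "parameter gradients" whose joint norm is sqrt(3^2 + 4^2).
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
norm_before = clip_global_norm(grads, max_norm=0.25)
norm_after = global_norm(grads)
print(norm_before, norm_after)

# Perplexity is the exponential of the average per-token cross-entropy.
avg_loss = 5.0  # hypothetical value for illustration only
print(math.exp(avg_loss))
```

Clipping the joint norm, rather than each array separately, preserves the direction of the overall update while bounding its size, which helps with the exploding gradients that recurrent models are prone to.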

## Analogies via Embeddings
In class you saw an application of embeddings to word similarity. Now we will extend this by applying word embeddings to complete word analogies. Because the vector space that the word embeddings live in captures distributional semantics of the words, we can use well-trained word embeddings to perform word analogy tasks. For example, the calls at the end of this section ask for completions of analogies such as `man : woman :: son : ?`.

## Question 4 

We will create a gluon-nlp embedding from the GloVe dataset using the `'glove.6B.50d'` source, and also create a gluon-nlp `Vocab` that uses that embedding. The `Vocab` is passed to the `get_top_k_by_analogy` function that you will implement below.

Recall that to find the word that completes an analogy like `a:b::c:?`, you need to find the word whose embedding is closest in cosine similarity to the vector given by `vocab.embedding[b] - vocab.embedding[a] + vocab.embedding[c]`.

Check out the following methods in `gluonnlp.Vocab` and `gluonnlp.embedding` that may be helpful in your implementation: `gluonnlp.Vocab.set_embedding`, `gluonnlp.Vocab.to_tokens`, and `gluonnlp.embedding.idx_to_vec`. 

Try different values for `k` so you can see what other words could potentially complete the analogy according to the embedding. What do you observe? Try other word analogies you can think of and report what you observe.
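
The analogy search can be illustrated on a tiny hand-made vocabulary. This sketch uses NumPy and made-up 3-dimensional vectors rather than GloVe, follows the common convention of querying with `vec(b) - vec(a) + vec(c)` for `a:b::c:?`, and excludes the three query words from the candidates, which is a design choice rather than a requirement of the assignment. The function name `top_k_by_analogy` and its signature are hypothetical.

```python
import numpy as np

def top_k_by_analogy(tokens, vectors, a, b, c, k=1):
    """Return the k tokens whose vectors are closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    index = {t: i for i, t in enumerate(tokens)}
    query = vectors[index[b]] - vectors[index[a]] + vectors[index[c]]
    # Cosine similarity between the query and every row; epsilon avoids division by zero.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-10
    scores = vectors @ query / norms
    # Exclude the query words themselves so they cannot be returned.
    for w in (a, b, c):
        scores[index[w]] = -np.inf
    best = np.argsort(-scores)[:k]
    return [tokens[i] for i in best]

# Toy embeddings: axis 1 loosely encodes "female", axis 2 loosely encodes "child".
tokens = ["man", "woman", "son", "daughter", "car"]
vectors = np.array([
    [1.0, 0.0, 0.0],   # man
    [1.0, 1.0, 0.0],   # woman
    [1.0, 0.0, 1.0],   # son
    [1.0, 1.0, 1.0],   # daughter
    [0.0, 0.0, -1.0],  # car
])
print(top_k_by_analogy(tokens, vectors, "man", "woman", "son"))
```

With real embeddings the same computation would run over the full embedding matrix, and larger values of `k` show which other words lie near the query vector.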

In [None]:
glove_6b50d = nlp.embedding.create('glove', source='glove.6B.50d') # Your code here
vocab = nlp.Vocab(nlp.data.Counter(glove_6b50d.idx_to_token)) # Your code here

vocab.set_embedding(glove_6b50d)


def get_top_k_by_analogy(vocab, word1, word2, word3, k=1):
    # Your code here

print(get_top_k_by_analogy(vocab, 'man', 'woman', 'son'))
print(get_top_k_by_analogy(vocab, 'london', 'england', 'berlin'))
print(get_top_k_by_analogy(vocab, 'france', 'crepes', 'argentina'))
print(get_top_k_by_analogy(vocab, 'argentina', 'football', 'india'))
print(get_top_k_by_analogy(vocab, 'bad', 'worst', 'big'))
print(get_top_k_by_analogy(vocab, 'do', 'did', 'go'))
print(get_top_k_by_analogy(vocab, 'argentina', 'messi', 'france', k=3))