In [12]:
import string
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from utils import char_tensor, n_characters, all_characters
from torch.autograd import Variable

# 0. Assignment

## 0.1 Readings

This week's reading covers [using PyTorch to train a Part-of-Speech tagger](http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#example-an-lstm-for-part-of-speech-tagging). The same reading should also help you with the assignment if you find you're having trouble.

## 0.2 Character LSTM

The assignment this week is to train a character-level RNN to generate new Sherlock Holmes text. If you've downloaded the complete stories of Sherlock Holmes (as per the instructions in the README), then you're good to go!

# 1. Character LSTM

## 1.1 Training the Model

To start with, I'll give you the same character LSTM architecture we looked at earlier:

In [2]:
class CharLSTM(nn.Module):
    def __init__(self, char_dim, hidden_dim):
        super(CharLSTM, self).__init__()
        
        self.char_embeddings = nn.Embedding(char_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, char_dim)
        
        self.hidden_dim = hidden_dim
        self.hidden = self.init_hidden()
    
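    def init_hidden(self):
        # Fresh (h_0, c_0) pair, each of shape (num_layers=1, batch=1, hidden_dim).
        # Reassign self.hidden with this before each new sequence so that
        # state (and gradients) do not carry over from the previous one.
        return (Variable(torch.zeros(1, 1, self.hidden_dim)),
                Variable(torch.zeros(1, 1, self.hidden_dim)))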
    
    def forward(self, x):
        x = self.char_embeddings(x)
        x, self.hidden = self.lstm(x.view(len(x), 1, -1), self.hidden)
        x = self.decoder(x.view(len(x), -1))
        x = F.log_softmax(x, dim=1)
        
        return x

You could theoretically use the same approach that was used in the assignment to train the network, but that would be quite slow. To get training time down to a reasonable level, you will need to:
- split the data into fixed-length sequences (say, 50 or 100 characters each) and use those as the training examples
- batch (and shuffle) the data
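
The two steps above can be sketched as follows. This is only one possible approach, not the expected solution: the names `split_into_sequences`, `make_batches`, and `char_to_index` are hypothetical, and the toy string stands in for the real corpus. Note that `CharLSTM` as given hard-codes a batch size of 1 (in `init_hidden` and in the `view` calls), so it would need adapting before it could consume batches like these.

```python
import torch

def split_into_sequences(text, seq_len):
    """Cut text into chunks of seq_len + 1 characters.

    The extra character lets each chunk supply both the inputs
    (chunk[:-1]) and the next-character targets (chunk[1:]).
    Any trailing remainder shorter than a full chunk is dropped.
    """
    return [text[i:i + seq_len + 1]
            for i in range(0, len(text) - seq_len, seq_len)]

def make_batches(chunks, char_to_index, batch_size, shuffle=True):
    """Yield (inputs, targets) LongTensors of shape (batch, seq_len)."""
    encoded = torch.tensor(
        [[char_to_index[c] for c in chunk] for chunk in chunks],
        dtype=torch.long,
    )
    if shuffle:
        # Reorder the chunks randomly; call again for a new order each epoch.
        encoded = encoded[torch.randperm(len(encoded))]
    for start in range(0, len(encoded), batch_size):
        batch = encoded[start:start + batch_size]
        # Targets are the inputs shifted one character to the left.
        yield batch[:, :-1], batch[:, 1:]

# Toy usage on a short string (a stand-in for the real text).
text = "the quick brown fox jumps over the lazy dog"
char_to_index = {c: i for i, c in enumerate(sorted(set(text)))}
chunks = split_into_sequences(text, seq_len=10)
for inputs, targets in make_batches(chunks, char_to_index, batch_size=2):
    print(inputs.shape, targets.shape)
```

Using chunks of `seq_len + 1` characters keeps inputs and targets aligned without a second pass over the text; the last batch may be smaller than `batch_size` if the number of chunks is not a multiple of it.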

In [16]:
with open('../data/cano.txt') as fp:
    # You're welcome to play around with how the data is processed!
    # I stuck with the newline join because it preserved the
    # structure of the text. You could use the raw data if you want,
    # or, join with a space, etc.
    data = '\n'.join(line.strip() for line in fp if line.strip())

print(data[:105])

THE COMPLETE SHERLOCK HOLMES
Arthur Conan Doyle
Table of contents
A Study In Scarlet
The Sign of the Four


In [9]:
def split_data(sequence):
    # Your code goes here.
    pass

def batch_data():
    # Your code goes here.
    return []

NUM_EPOCHS = 1

for i in range(NUM_EPOCHS):
    print(f'EPOCH: {i+1}')
    for sequence in batch_data():
        # Your code goes here
        pass

EPOCH: 1


## 1.2 Sampling from the Model

In the slides, we simply took the most likely character at every step. However, this is prone to lead us into loops of common words (`'the the the the'`), so we want to introduce the ability to explore a bit. There are numerous ways to do this; Karpathy introduces the notion of `temperature` when sampling which character to select next.

To use the notion of `temperature`, we must actually treat this like a *sample*, and not just a greedy selection. Karpathy applies temperature by dividing the log-probability values by the temperature, exponentiating them (to bring them back to positive real numbers), and then renormalizing so that they form a proper multinomial distribution to sample from.

This looks like:

$$
\mathrm{mult}(o, temp) = \frac{\exp\left(\frac{o}{temp}\right)}{\sum_{o'} \exp\left(\frac{o'}{temp}\right)}
$$

This gives us a new multinomial distribution from which we can sample. Lower temperatures perform less smoothing (and thus leave us more likely to end up in a loop), whereas higher temperatures can oversmooth (and reduce our output to nonsense).
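
The temperature rescaling described above can be sketched on a toy distribution. This is a minimal illustration rather than a full `sample` implementation: the helper names `temperature_distribution` and `sample_index` are hypothetical, and the three-way distribution is made up for the example. It relies on the fact that `torch.softmax` exponentiates and renormalizes in a single step.

```python
import torch

def temperature_distribution(log_probs, temperature=1.0):
    """Rescale a 1-D tensor of log-probabilities by a temperature.

    Dividing by the temperature and applying softmax is the same as
    exponentiating the scaled values and renormalizing them to sum to 1.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return torch.softmax(log_probs / temperature, dim=0)

def sample_index(log_probs, temperature=1.0):
    """Draw one index from the temperature-adjusted distribution."""
    probs = temperature_distribution(log_probs, temperature)
    return torch.multinomial(probs, num_samples=1).item()

# Toy log-probabilities over three "characters" (illustrative values only).
log_probs = torch.log(torch.tensor([0.7, 0.2, 0.1]))

for temp in (0.5, 1.0, 2.0):
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    print(temp, temperature_distribution(log_probs, temp))

print(sample_index(log_probs, temperature=0.8))
```

At a temperature of 1.0 the distribution is unchanged, which makes a convenient sanity check when wiring this into a sampling loop.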

In [4]:
def sample(start_char='S', length=100, temperature=1.0):
    # Your code goes here.
    pass

sample()

## Extra Credit

Extra credit for this assignment will be to **use the same network architecture to train a full language model** over the Sherlock Holmes dataset. You can start with pretrained embeddings to speed up training, which you can use to initialize the `nn.Embedding` layer. (Recall that `self.embeddings.weight.data = E` will set the data inside the weight variable of the embeddings to the matrix E.)
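
As a rough sketch of that initialization step, the snippet below loads a matrix into an `nn.Embedding` layer in two ways. The sizes are toy values and `E` is a random stand-in for a real pretrained matrix, so both are assumptions for illustration only. Because the architecture above feeds the embeddings straight into `nn.LSTM(hidden_dim, hidden_dim)`, the embedding width of a real pretrained matrix would have to match the LSTM's input size.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 5, 3            # toy sizes, purely illustrative
E = torch.randn(vocab_size, embedding_dim)  # random stand-in for pretrained vectors

# Option 1: create the layer, then copy the matrix into its weights.
# The weights stay trainable, so they are fine-tuned during training.
embeddings = nn.Embedding(vocab_size, embedding_dim)
with torch.no_grad():
    embeddings.weight.copy_(E)

# Option 2: build the layer directly from the matrix.
# freeze=True keeps the pretrained vectors fixed during training.
frozen_embeddings = nn.Embedding.from_pretrained(E, freeze=True)

# Looking up index 2 returns row 2 of E in both cases.
index = torch.tensor([2])
print(embeddings(index))
print(frozen_embeddings(index))
```

Whether to freeze the embeddings is a design choice: frozen vectors train faster and cannot drift, while trainable ones can adapt to the Sherlock Holmes vocabulary.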

The extra credit this week shouldn't require much more than data-wrangling and some amount of patience.

In [5]:
# Your code goes here.