# Text Generation with RNNs


In [17]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

### 1 Dataset
Define the path of the file you want to read and train the model on.


In [18]:
'''TODO: set the path of the file'''
path_to_file = 'Shakespeare_karpathy.txt'
text = open(path_to_file, encoding='utf-8').read()


#### Inspect the dataset
Take a look at the first 250 characters in text

In [19]:
print(text[:250])

That, poor contempt, or claim'd thou slept so faithful,
I may contrive our father; and, in their defeated queen,
Her flesh broke me and puttance of expedition house,
And in that same that ever I lament this stomach,
And he, nor Butly and my fury, kno


In [20]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

62 unique characters


### 2 Process the dataset for the learning task
The task that we want our model to achieve is: given a character, or a sequence of characters, what is the most probable next character?

To achieve this, we will input a sequence of characters to the model, and train the model to predict the output, that is, the following character at each time step. RNNs maintain an internal state that depends on previously seen elements, so information about all characters seen up until a given moment will be taken into account in generating the prediction.
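
To make the training target concrete, here is a minimal sketch of how each input character is paired with the character that follows it. The string `snippet` is a made-up example and not part of the dataset.

```python
# Illustrative only: `snippet` is a hypothetical example string.
snippet = "Hello"

# The input is every character except the last;
# the target is the same sequence shifted left by one position.
input_chars = snippet[:-1]
target_chars = snippet[1:]

for inp, tgt in zip(input_chars, target_chars):
    print(repr(inp), '->', repr(tgt))
```

At every time step the model sees the input character (plus its internal state) and is trained to predict the paired target character.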

#### Vectorize the text
Before we begin training our RNN model, we'll need to create a numerical representation of our text-based dataset. To do this, we'll generate two lookup tables: one that maps characters to numbers, and a second that maps numbers back to characters. Recall that we just identified the unique characters present in the text.

In [21]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
text_as_int = np.array([char2idx[c] for c in text])

# Create a mapping from indices to characters
idx2char = np.array(vocab)


This gives us an integer representation for each character. Observe that the unique characters (i.e., our vocabulary) in the text are mapped to indices from 0 to len(vocab) - 1. Let's take a peek at this numerical representation of our dataset:

In [22]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  "'" :   3,
  ',' :   4,
  '-' :   5,
  '.' :   6,
  ':' :   7,
  ';' :   8,
  '?' :   9,
  'A' :  10,
  'B' :  11,
  'C' :  12,
  'D' :  13,
  'E' :  14,
  'F' :  15,
  'G' :  16,
  'H' :  17,
  'I' :  18,
  'J' :  19,
  ...
}




We can also look at how the first part of the text is mapped to an integer representation:

In [23]:
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'That, poor co' ---- characters mapped to int ---- > [29 43 36 55  4  1 51 50 50 53  1 38 50]
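
As a sanity check on the two lookup tables, the sketch below encodes a string and decodes it again. It uses a tiny stand-in corpus and hypothetical names (`sample_text`, `sample_char2idx`, `sample_idx2char`) so that it runs without the Shakespeare file; the same round trip should hold for `char2idx` and `idx2char` above.

```python
# Illustrative only: a tiny stand-in corpus, not the Shakespeare file.
sample_text = "to be or not to be"

# Build the two lookup tables the same way as in the cell above.
sample_vocab = sorted(set(sample_text))
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = sample_vocab  # list position -> character

# Encode to integers, then decode back to text.
encoded = [sample_char2idx[ch] for ch in sample_text]
decoded = ''.join(sample_idx2char[i] for i in encoded)

print(encoded[:5])
print(decoded == sample_text)
```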


#### Defining a method to encode one hot labels

In [24]:
def one_hot_encode(arr, n_labels):
    # Initialize the encoded array (one row per element of arr, for any number of dimensions)
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)

    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    return one_hot
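
For intuition about what the encoded array looks like, here is a small sketch with a hypothetical toy input. It uses identity-matrix indexing, a common NumPy idiom that should give the same array as `one_hot_encode(arr, n_labels)` for a 2-D integer array.

```python
import numpy as np

# Hypothetical toy input: 2 sequences of 3 character indices, vocabulary of 4.
arr = np.array([[0, 2, 1],
                [3, 3, 0]])
n_labels = 4

# Indexing an identity matrix with integer labels yields one-hot rows.
one_hot = np.eye(n_labels, dtype=np.float32)[arr]

print(one_hot.shape)   # (2, 3, 4): batch, sequence, vocabulary
print(one_hot[0, 1])   # the one-hot row for label 2
```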


#### Defining a method to make mini-batches for training

In [25]:
def get_batches(arr, batch_size, seq_length):
    '''Create a generator that returns batches of size
       batch_size x seq_length from arr.

       Arguments
       ---------
       arr: Array you want to make batches from
       batch_size: Batch size, the number of sequences per batch
       seq_length: Number of encoded chars in a sequence
    '''
    batch_size_total = batch_size * seq_length
    # total number of batches we can make
    n_batches = len(arr) // batch_size_total
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size_total]
    # Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))
    # iterate through the array, one sequence at a time
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n + seq_length]
        # The targets, shifted by one
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n + seq_length]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y
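
The reshape and shift inside `get_batches` can be illustrated on a toy array. This is a simplified sketch of the first window only, with made-up sizes; it does not reproduce the wrap-around handling of the final window.

```python
import numpy as np

# Hypothetical toy data: 24 "characters" encoded as 0..23.
arr = np.arange(24)
batch_size, seq_length = 2, 3

# Reshape into batch_size rows; each row is one contiguous stream of text.
rows = arr.reshape((batch_size, -1))   # shape (2, 12)

# First window of features and its targets, shifted by one position.
x = rows[:, 0:seq_length]
y = rows[:, 1:seq_length + 1]

print(x)
print(y)
```

Each row of `y` holds the characters that immediately follow the corresponding row of `x`, which is exactly the next-character target the model is trained on.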


## 3 The Recurrent Neural Network (RNN) model


###### Check if GPU is available

In [26]:
train_on_gpu = torch.cuda.is_available()
print ('Training on GPU' if train_on_gpu else 'Training on CPU')

Training on GPU



### Declaring the model

In [27]:
class VanillaCharRNN(nn.Module):
    def __init__(self, vocab, n_hidden=256, n_layers=2,
                 drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        self.vocab = vocab
        
        # batch_first=True -> input and output tensors are provided as (batch, seq, feature)
        self.rnn = nn.RNN(len(self.vocab), self.n_hidden, self.n_layers,
                          batch_first=True, dropout=self.drop_prob)
        self.fc = nn.Linear(self.n_hidden, len(self.vocab))

    def forward(self, x, hidden):

        # Passing in the input and hidden state into the model and obtaining outputs
        # x of shape (batch, seq_len, input_size), because batch_first=True
        # hidden of shape (num_layers * num_directions, batch, hidden_size)
        out, hidden_t = self.rnn(x, hidden)
        # Reshaping the outputs such that it can be fit into the fully connected layer
        out = out.contiguous().view(-1, self.n_hidden)
        out = self.fc(out)
        # return the final output and the hidden state
        return out, hidden_t

    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        hidden = torch.zeros(self.n_layers, batch_size, self.n_hidden)

        return hidden
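
To show what a recurrent layer computes at each time step, here is a minimal NumPy sketch of a single-layer tanh RNN update. The sizes and random weights are hypothetical stand-ins for learned parameters, and the sketch omits batching, stacking of layers, and dropout, all of which `nn.RNN` handles internally.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 5, 3   # hypothetical toy sizes

# Randomly initialised weights stand in for learned parameters.
W_ih = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
b_ih = np.zeros(hidden_size)
b_hh = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One RNN step: h_t = tanh(W_ih x_t + b_ih + W_hh h_prev + b_hh)."""
    return np.tanh(W_ih @ x_t + b_ih + W_hh @ h_prev + b_hh)

h = np.zeros(hidden_size)            # zero initial hidden state, as in init_hidden
for idx in [0, 3, 1]:                # a short sequence of character indices
    x_t = np.eye(input_size)[idx]    # one-hot input vector
    h = rnn_step(x_t, h)             # the new state depends on all earlier inputs

print(h.shape)
```

Because `h` is fed back into every step, the final state carries information about the whole sequence seen so far.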

In [28]:
class LSTMCharRNN(nn.Module):
    def __init__(self, vocab, n_hidden=256, n_layers=2,
                 drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        self.vocab = vocab
        
        self.lstm = nn.LSTM(len(self.vocab), self.n_hidden, self.n_layers, dropout=self.drop_prob, batch_first=True)
        self.fc = nn.Linear(self.n_hidden, len(self.vocab))


    def forward(self, x, hidden):
        out, hidden = self.lstm(x, hidden)
        out = out.contiguous().view(-1, self.n_hidden)
        out = self.fc(out)
        # return the final output and the hidden state
        return out, hidden

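    def init_hidden(self, batch_size):
        ''' Initializes hidden and cell state on the same device as the model '''
        # Avoid an unconditional .cuda() call so the model also runs on CPU-only machines
        device = next(self.parameters()).device
        hidden_state = torch.zeros(self.n_layers, batch_size, self.n_hidden, device=device)
        cell_state = torch.zeros(self.n_layers, batch_size, self.n_hidden, device=device)
        return (hidden_state, cell_state)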



#### Declaring the train method



In [29]:
def train(model, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    ''' Training a network

        Arguments
        ---------

        model: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        batch_size: Number of mini-sequences per mini-batch, aka batch size
        seq_length: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        print_every: Number of steps for printing training and validation loss

    '''
    model.train()

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # create training and validation data
    val_idx = int(len(data) * (1 - val_frac))
    data, val_data = data[:val_idx], data[val_idx:]

    if (train_on_gpu):
        model.cuda()

    counter = 0
    n_vocab = len(model.vocab)
    for e in range(epochs):
        # initialize hidden state
        h = model.init_hidden(batch_size)
        
        '''TODO: use the get_batches function to generate sequences of the desired size'''
        dataset = get_batches(data, batch_size, seq_length)

        for x, y in dataset:
            counter += 1
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_vocab)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)

            if (train_on_gpu):
                inputs, targets = inputs.cuda(), targets.cuda()

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            if type(h) is tuple:
                if train_on_gpu:
                    h = tuple([each.data.cuda() for each in h])
                else:
                    h = tuple([each.data for each in h])
            else:
                if train_on_gpu:
                    h = h.data.cuda()
                else:
                    h = h.data
            # zero accumulated gradients
            model.zero_grad()
            output, h = model(inputs, h) 
            loss = criterion(output, targets.view(-1).long())

            # perform backprop
            loss.backward()
            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(model.parameters(), clip)
            opt.step()

            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = model.init_hidden(batch_size)
                val_losses = []
                model.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_vocab)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)

                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    if type(val_h) is tuple:
                        if train_on_gpu:
                            val_h = tuple([each.data.cuda() for each in val_h])
                        else:
                            val_h = tuple([each.data for each in val_h])
                    else:
                        if train_on_gpu:
                            val_h = val_h.data.cuda()
                        else:
                            val_h = val_h.data

                    inputs, targets = x, y
                    if (train_on_gpu):
                        inputs, targets = inputs.cuda(), targets.cuda()
                    output, val_h = model(inputs, val_h)

                    val_loss = criterion(output, targets.view(-1).long())
                    val_losses.append(val_loss.item())

                print("Epoch: {}/{}...".format(e + 1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))
                
                input_eval = 'Dear'
                print(sample(model, 1000, prime=input_eval, top_k=10))
                
                model.train()  # reset to train mode after iterating through validation data
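
The gradient clipping step used in `train` rescales all gradients together when their combined norm exceeds `clip`. The following NumPy sketch illustrates that idea with a hypothetical helper `clip_by_global_norm` and made-up gradient values; it is not PyTorch's implementation.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Only ever shrink the gradients; leave them unchanged if already small enough.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# Hypothetical gradients for two parameters; combined norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

print(norm_before, norm_after)
```

The direction of the update is preserved; only its magnitude is capped, which keeps a single large gradient from destabilizing training.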


##### Defining a method to generate the next character

In [30]:
def predict(model, char, h=None, top_k=None):
    ''' 
    Given a character, predict the next character.
    Returns the predicted character and the hidden state.
    '''

    # tensor inputs
    x = np.array([[char2idx[char]]])
    x = one_hot_encode(x, len(model.vocab))
    inputs = torch.from_numpy(x)
    
    if (train_on_gpu):
        inputs = inputs.cuda()

    # detach hidden state from history
    if type(h) is tuple:
        if train_on_gpu:
            h = tuple([each.data.cuda() for each in h])
        else:
            h = tuple([each.data for each in h])
    else:
        if train_on_gpu:
            h = h.data.cuda()
        else:
            h = h.data
    output, h = model(inputs, h)

    # get the character probabilities
    p = F.softmax(output, dim=1).data
    if (train_on_gpu):
        p = p.cpu()  # move to cpu

    # get top characters
    if top_k is None:
        top_ch = np.arange(len(model.vocab))
    else:
        p, top_ch = p.topk(top_k)
        top_ch = top_ch.numpy().squeeze()

    # select the likely next character with some element of randomness
    p = p.numpy().squeeze()
    char = np.random.choice(top_ch, p=p / p.sum())

    # return the predicted character (decoded via idx2char) and the hidden state
    return idx2char[char], h
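
The top-k sampling in `predict` can be sketched in isolation. The example below uses a made-up probability vector and NumPy's `argsort` in place of the tensor `topk` call, so it is an illustration of the technique rather than the exact code path above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-character distribution over a 6-symbol vocabulary.
p = np.array([0.05, 0.40, 0.10, 0.25, 0.15, 0.05])
top_k = 3

# Indices of the k largest probabilities.
top_idx = np.argsort(p)[-top_k:]
# Renormalise so the kept probabilities sum to one, then sample among them.
top_p = p[top_idx] / p[top_idx].sum()
next_idx = rng.choice(top_idx, p=top_p)

print(sorted(top_idx.tolist()))  # [1, 3, 4]
print(next_idx in top_idx)
```

Restricting the choice to the k most likely characters removes very unlikely candidates while still leaving some randomness in the generated text.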



#### Declaring a method to generate new text

In [31]:
def sample(model, size, prime='The', top_k=None):
    if (train_on_gpu):
        model.cuda()
    else:
        model.cpu()

    model.eval()  # eval mode

    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = model.init_hidden(1)
    for ch in prime:
        char, h = predict(model, ch, h, top_k=top_k)

    chars.append(char)

    for ii in range(size):
        char, h = predict(model, char, h, top_k=top_k)
        chars.append(char)

    model.train()
    return ''.join(chars)


#### Generate new Text using the RNN model

###### Define and print the net

In [32]:
'''TODO: Try changing the number of units in the network to see how it affects performance'''
n_hidden = 256
n_layers = 2

vanilla_model = VanillaCharRNN(vocab, n_hidden, n_layers)
print(vanilla_model)
lstm_model = LSTMCharRNN(vocab, n_hidden, n_layers)
print(lstm_model)

VanillaCharRNN(
  (rnn): RNN(62, 256, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=256, out_features=62, bias=True)
)
LSTMCharRNN(
  (lstm): LSTM(62, 256, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=256, out_features=62, bias=True)
)



###### Declaring the hyperparameters

In [33]:
'''TODO: Try changing the hyperparameters in the network to see how it affects performance'''
batch_size = 50
seq_length = 20
n_epochs = 20  # start smaller if you are just testing initial behavior


##### Train the model and have fun with the generated texts

In [34]:
train(vanilla_model, text_as_int, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=50)

Epoch: 1/20... Step: 50... Loss: 3.1997... Val Loss: 3.1841
Dear ent eir e iess
 i on  aer    he e ohiahth  e sa   inse  r o  n ate  ee aat  hii nei notsier si  thri hheoi eseiienos tao tein saa si eo re tetr  rrooor 
e 
ttin raee a  heee eti ensi  oi  shi oi o nheoir aen
oi ane
s oiinaashs hn oniet
h  hteeeron i

sss tn aaoti ot a n  s ns a rti  riataeiae o
 se
n  oiso t a hh o no iheste tatrt re
eenoh n ooe taior ernsotses ats tet niin ne  he  sasii  seiaeta ho
e th ss  htnar te oo srnrter a io tnern ent ne to snne se aatr osrestteona e  oeo teoo sos heioeonise
a eh  eaeo itaa ae thtsooias 
  ertnneto inie rhn sh niiatee

anta noiaeee
sn t e ionr  ao  no n es estrt  iha  ar soai i tt re eo t tt taee i re  noasis  h   n snasaaotre hoaaaeorent iseoaerteaaooaiees tier ihenoetahon s eao i n earaeshstner ns enatton h hie iaentheoo n
eeoosoaie 
ioi s r a noraotretteneaee
 ionssose osr  oasto oire  e oe nisoieese s shsriee eer   ereeie  se seneosheat tt  ristsooeine s  ee hothoo een t
 
 ti

In [35]:
train(lstm_model, text_as_int, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=50)

Epoch: 1/20... Step: 50... Loss: 3.2699... Val Loss: 3.2796
Dear ion ees  t s hheeaehs erst  ot inonaiteh s h siasan  neoon   a o nao n ntthtehsei hs ihrrn noino taosrh haee enhh too e  neh  eta oh  eoerretest eiei iso es  ot  sna o reaioart aoarsisohrat rihni  r irh  raoat eeeeoenteetao r niesn hh rn tetn eh i re ton e eeisotoa rhar o o e saiioe riait arnahr   thn no ss e   it roe i r ait erta at aro at areoa it oar  hiaiont e re nrsis si t hsnn ote hnn ehohri totaes rtir nhtsretatan ea s  t renn eostna   in sr etn seos  ote t e h aa eaeratseeste n rerreoo ssert  srha  rasseentt  n sea t tr it tannhni eete rtntsan oeen e  aa  i oi  e ne or ooet  otsaoehattt shoiietnrtneneeearreiarrrer aneisa  tnsnt r oeiast ia  ho ianteatho ee t  haeea  tsnrs eo  aii t eaee  n s nooee h t  o eestasose h rooa se  aist iran etio  tss ointsni aa aaiertsshhe  a thrsh rrntr ot iootn estnseeoteorh   saaeo ohh reae ese hehhrseeiteroerh so i  aaan ato ies r to n  ae r sh at hh et ta a eatishesa siie  s oree s

##### Generate text

In [36]:
print(sample(vanilla_model, 500, prime='The', top_k=10))

Thes high thy
Take in sime we with dack is doth thee! these and menting affore, wouth thought of your ground me with honoured my fortung.
That have not ho mis grasted with sue, my least,
Where,
Bus deather;
Bound thy manted men this.

MArGOL:
Nere, myself you have a kinfte
Sir, my fouse or evee
as tlought: they are ang to-the forrone whow the feart. Shat hessefe.

KING HENRY MANG CEMILIL:
By here on the puccan
Sto depths han bryages, hasts from the days, lire. That me to my strante wat deserf and be


In [37]:
print(sample(lstm_model, 500, prime='The', top_k=10))

The proces to your may',
As is deations me think of the share.

APHALLUS:
Hert it, as abare. 
PRAUC:
I makes me not,
And yours poot nave toodeng, and
beford me
The fouse the farth of aglod to be her sear.

BUUTHENTE:
Hest feither.

CeENRIO:
Mastare. 
MENINEL:
Sweld I whot was ampice.
Hows to was in exthith nots face,
And mentertiour onferfung to and to as take are my dost
But all sace: in my ways, thou greap it bithers, fasce ot thas
I whonglr so triegaor our deading,
And I lead as on, and my hast t


## Results:
Changing the number of layers: 
  - More than 5 layers produced text that was nonsense in terms of both readability and structure. Adjusting the remaining training hyperparameters (epochs, latent dimensionality, or batch size) in either direction did not contradict our observation that RNN/LSTM networks with fewer layers performed better.
  - 2 layers were sufficient to produce almost readable text after 20 epochs.


Changing the sequence length: 
  - Short sequence lengths (less than 5) produced well-structured but nonsensical text. For short sequence lengths, the LSTM needed significantly more epochs than the VanillaRNN to produce paragraph-structured text.
  - With longer sequence lengths, both networks produced better text in terms of structure and readability. From length 10 upward, the LSTM seemed to produce dialogue text (i.e. paragraph-structured text with capitalized name headings) just as quickly as the VanillaRNN.
  - For sequence lengths above 100, training was very fast and the output was well structured, but it contained many punctuation errors.


Changing the batch size: no large differences
  - 2 layers, batch size 20, and sequence length 20 gave the best result among the empirically chosen values.


Overall, we did not observe a large difference between the VanillaRNN and the LSTM in our experiments. In some runs the RNN actually seemed a little faster at learning the structure of the training text (i.e. paragraphs of dialogue), while the LSTM on average needed a few more epochs to move from newline-structured to paragraph-structured text. In contrast to the LSTM, the RNN also still seemed to struggle with the rules for upper- and lower-case characters within words in the final training epochs. Readability was sometimes hard to assess, since Shakespearean English can at times appear unreadable even to native speakers. Judging by the patterns that Elizabethan vocabulary follows, both networks reproduce it well with only a few layers and short training sequence lengths. However, neither network learned to produce good sentence structure or contextually coherent text.
