<a href="https://colab.research.google.com/github/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/Char_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with an RNN


This notebook demonstrates how to generate text using a **character-level LSTM with PyTorch**, trained on the text of the book **Anna Karenina**. Given a sequence of characters from this book, the model predicts the next character; calling the model repeatedly generates longer sequences of text.

While some of the sentences are grammatical, most do not make sense. The model has not learned the meaning of words, but consider:

* The model is character-based. When training started, the model did not know how to spell an English word, or that words were even a unit of text.

* The structure of the output resembles the novel: the text is broken into paragraphs, and dialogue is enclosed in quotation marks, similar to the dataset.

* As demonstrated below, the model is trained on small batches of text (100 characters each), and is still able to generate a longer sequence of text with coherent structure. Below is the **general architecture of the character-wise RNN.**<br>
<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/images/lstm_rnn_architecture.png?raw=1"></img> 




# Set Up
### Import PyTorch and other libraries


In [0]:
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np


# Download the Anna Karenina data
 

### Read the data

In [0]:
with open('sample_data/anna.txt', 'r') as f:
  text = f.read()

### First look at the text

In [17]:
print(text[:100])

Chapter 1


Happy families are all alike; every unhappy family is unhappy in its own
way.

Everythin


### GPU Usage
Enable GPU acceleration to execute this notebook faster. In Colab: *Runtime > Change runtime type > Hardware accelerator > GPU*. If running locally, make sure your PyTorch installation has CUDA support.

# Process the text
### Vectorize the text (Tokenization)
Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another for numbers to characters.

In [18]:
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch:ii for ii, ch in int2char.items()}

# encode text
encoded = np.array([char2int[ch] for ch in text])
encoded

array([ 1, 46, 32, ..., 70, 10, 46])
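
As a quick sanity check, the two lookup tables should invert each other. The sketch below rebuilds them on a short toy string (`toy_text` and the other `toy_` names are illustrative, not part of the notebook) and checks that decoding the encoded array gives back the original text. Because `set` ordering can differ between runs, the integer assigned to each character is not fixed from one session to the next.

```python
import numpy as np

# toy stand-in for the book text (illustrative only)
toy_text = "happy families"

# build the lookup tables the same way as above
toy_chars = tuple(set(toy_text))
toy_int2char = dict(enumerate(toy_chars))
toy_char2int = {ch: ii for ii, ch in toy_int2char.items()}

# encode to integers, then decode back to characters
toy_encoded = np.array([toy_char2int[ch] for ch in toy_text])
toy_decoded = ''.join(toy_int2char[int(ii)] for ii in toy_encoded)

print(toy_decoded == toy_text)
```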

# Pre-processing the data
Our LSTM expects an input that is **one-hot encoded**. Each character is first converted into an integer (via the dictionary we created) and then into a vector in which only the position at its integer index has the value 1; the rest of the vector is filled with 0s.

In [0]:
def one_hot_encode(arr, n_labels):
  # one row of zeros per element of arr, one column per label
  one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
  # set the column given by each (flattened) label to 1
  one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1
  # restore the original shape, with the one-hot axis appended
  one_hot = one_hot.reshape((*arr.shape, n_labels))
  return one_hot

In [20]:
test_seq = np.array([[3, 4, 5]])
one_hot_encode(test_seq, 8)

array([[[0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.]]], dtype=float32)

# Making training mini-batches

To train on this data, we create mini-batches. We want each batch to contain multiple sequences of some desired number of sequence steps, as shown below:<br>
<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/images/mini_batch_1.png?raw=1"></img><br><br>
In this example, we'll take the encoded characters (passed in as the `arr` parameter) and split them into multiple sequences; the number of sequences is given by `batch_size`. Each of our sequences will be `seq_length` long.

# Creating Batches

### 1. Discard text to accommodate completely full mini-batches

* batch_size = `N (2)`
* seq_length = `M (3)`
* no. of characters in one batch = `N * M (2 * 3 = 6)`
* Total batches `(K)` that can be made out of the given array:

`len(arr) // (no. of characters per batch) = 12 // 6 = 2`

* Total characters in the array to be kept in order to accommodate completely full mini-batches:

`arr[:N * M * K] = arr[:12]` (all 12 elements are kept here; with 13 elements, the last one would be discarded)




### 2. Split the array into N batches
You can do this by using :<br>`arr.reshape((batch_size, -1))`.<br>
After this the size of array should be -<br>
`N * (M * K)`
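
Steps 1 and 2 can be traced on a toy array. This is a minimal sketch that assumes a 13-element array (one more than the 12 used in the arithmetic above) so that the trimming step actually discards something; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(1, 14)          # toy array holding 1..13
batch_size, seq_length = 2, 3   # N and M from the list above

chars_per_batch = batch_size * seq_length   # N * M = 6
n_batches = len(arr) // chars_per_batch     # K = 13 // 6 = 2

# step 1: keep only enough characters for completely full mini-batches
arr = arr[:n_batches * chars_per_batch]     # keeps 1..12, drops 13

# step 2: one row per sequence, N rows of M * K columns
arr = arr.reshape((batch_size, -1))
print(arr.shape)
print(arr)
```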




### 3. Iterate through mini-batches
The idea is that each batch is an `N * M` window on the `N * (M * K)` array. This window slides over by `seq_length` at each step. We also want both input and target arrays.
<br>
Target arrays are the input arrays shifted over by one character.
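
To make the window and the one-character shift concrete, here is a small sketch on a `2 x 6` toy array (values chosen for illustration). It only shows the first window; the last window needs special handling for its final target, since there is no next column to read from.

```python
import numpy as np

# toy array already reshaped to N = 2 rows and M * K = 6 columns
arr = np.arange(1, 13).reshape((2, 6))
seq_length = 3

# first window: columns 0..2 are the inputs
n = 0
x = arr[:, n:n + seq_length]

# targets are the same window shifted over by one column
y = arr[:, n + 1:n + seq_length + 1]

print('x\n', x)
print('y\n', y)
```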

In [0]:
def get_batches(arr, batch_size, seq_length):
  total_batch_size = batch_size*seq_length
  n_batches = len(arr)//total_batch_size
  arr = arr[:n_batches*total_batch_size]
  arr = arr.reshape((batch_size, -1))
  # iterate through the array, one seq_length window at a time
  for n in range(0, arr.shape[1], seq_length):
    x = arr[:, n:n+seq_length]
    # targets are the inputs shifted over by one character
    y = np.zeros_like(x)
    try:
      y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
    except IndexError:
      # last window: wrap around to the first column for the final target
      y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
    yield x, y

In [53]:
batches = get_batches(encoded, 8, 50)
x, y = next(batches)
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[ 1 46 32 26 10 37 47 70 67 59]
 [ 5 33 58 19 70 14 32 41 37 19]
 [26  5 58 58 13 18 14 37 65 70]
 [70 58  5  0 37 70 58  5 47 10]
 [10 37 58 58 70 36 47  5  0 70]
 [10 70 18 37 19 70 26 47 32  8]
 [58 58 13 58 10 32 33 10 19 70]
 [37 44 19 70 26 47 13 33 41 37]]

y
 [[46 32 26 10 37 47 70 67 59 59]
 [33 58 19 70 14 32 41 37 19 70]
 [ 5 58 58 13 18 14 37 65 70 74]
 [58  5  0 37 70 58  5 47 10 70]
 [37 58 58 70 36 47  5  0 70 10]
 [70 18 37 19 70 26 47 32  8 17]
 [58 13 58 10 32 33 10 19 70 10]
 [44 19 70 26 47 13 33 41 37 17]]


# Defining the network with PyTorch
Below is a sample structure of our LSTM model: <br>
<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/data/15.%20rnn_classifier.png?raw=1"></img><br>
We will use PyTorch to define the model's architecture and define the forward pass method as well.



### Model Structure
In `__init__`, the following structure can be defined:<br>
* Storing the necessary dictionaries (`int2char`, `char2int`)
* Defining an LSTM layer that takes the following parameters - 
  * `input size`
  * hidden layer size (`n_hidden`)
  * number of layers (`n_layers`)
  * dropout probability (`drop_prob`)
  * Boolean batch first (`batch_first`)


### LSTM Inputs/Outputs
A basic LSTM layer can be created as follows - 
```python
self.lstm = nn.LSTM(input_size, n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
 ```
An initial hidden state of all zeros needs to be created as well -<br>
```python
self.init_hidden(batch_size)
``` 
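
The tensor shapes involved are easy to get wrong, so here is a minimal shape check. The sizes below are small illustrative values, not the ones used for training. With `batch_first=True`, the input and output are laid out as `(batch, seq, feature)`, while the hidden and cell states are always `(n_layers, batch, n_hidden)`.

```python
import torch
from torch import nn

# illustrative sizes only
input_size, n_hidden, n_layers = 8, 16, 2
batch_size, seq_length = 4, 5

lstm = nn.LSTM(input_size, n_hidden, n_layers,
               dropout=0.5, batch_first=True)

# input: (batch, seq, feature) because batch_first=True
x = torch.zeros(batch_size, seq_length, input_size)

# initial hidden and cell states: (n_layers, batch, n_hidden), all zeros
h0 = torch.zeros(n_layers, batch_size, n_hidden)
c0 = torch.zeros(n_layers, batch_size, n_hidden)

out, (h, c) = lstm(x, (h0, c0))
print(out.shape)         # one n_hidden vector per time step
print(h.shape, c.shape)  # final state of every layer
```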


In [54]:
# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU; consider making n_epochs very small.')

Training on GPU!


In [0]:
class CharRNN(nn.Module):
  def __init__(self, tokens, n_hidden=256, n_layers=2, drop_prob=0.5, lr=0.001):
    super().__init__()
    self.drop_prob = drop_prob
    self.n_layers = n_layers
    self.n_hidden = n_hidden
    self.lr = lr

    # creating character dictionaries
    self.chars = tokens
    self.int2char = dict(enumerate(self.chars))
    self.char2int = {ch:ii for ii, ch in self.int2char.items()}

    # defining LSTM model
    self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, batch_first=True, dropout=drop_prob)

    # defining dropout layer
    self.dropout = nn.Dropout(drop_prob)

    # defining final fully-connected layer
    self.fc = nn.Linear(n_hidden, len(self.chars))

  def forward(self, x, hidden):
    # lstm will generate new output and new hidden state
    r_output, hidden = self.lstm(x, hidden)

    # passing LSTM output through dropout layer
    out = self.dropout(r_output)

    # stacking LSTM outputs using view
    # contiguous() is required before reshaping with view
    out = out.contiguous().view(-1, self.n_hidden)

    # put out through fully connected layer
    out = self.fc(out)

    # returning final output and hidden state
    return out, hidden
  
  def init_hidden(self, batch_size):
    # creating 2 new tensors
    # size = n_layers * batch_size * n_hidden
    # initialize to 0 for hidden and cell state of LSTM
    weight = next(self.parameters()).data
    if(train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
    return hidden



# Time to Train

Below we will use the **Adam optimizer and cross-entropy loss**. We calculate the loss and perform backpropagation as usual. A few points to note:<br>
* Within the batch loop, we detach the hidden state from its history; this time setting it equal to a new tuple variable because an LSTM has a hidden state that is a tuple of the hidden and cell states.
* We use clip_grad_norm_ to help prevent exploding gradients.
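
Both points can be checked in isolation on a tiny LSTM. This is a sketch with illustrative sizes, and it uses `.detach()` where the training code in this notebook uses `.data`; either way the hidden state is cut off from the history of earlier batches. The loss is scaled up artificially so that the gradients are large enough for clipping to take effect.

```python
import torch
from torch import nn

torch.manual_seed(0)

# tiny LSTM with illustrative sizes
lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)
h = (torch.zeros(1, 2, 4), torch.zeros(1, 2, 4))

# after a forward pass the hidden state is attached to the graph
out, h = lstm(torch.randn(2, 5, 3), h)
attached = h[0].grad_fn is not None

# detach both tensors of the tuple (hidden state and cell state)
h = tuple(each.detach() for each in h)
detached = h[0].grad_fn is None

# artificially large loss, then clip the gradient norm to `clip`
loss = (out * 100).pow(2).sum()
loss.backward()
clip = 5
norm_before = nn.utils.clip_grad_norm_(lstm.parameters(), clip)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in lstm.parameters()))

print(attached, detached)
print(float(norm_before), float(norm_after))
```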

In [0]:
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
  net.train()
  opt = torch.optim.Adam(net.parameters(), lr=lr)
  criterion = nn.CrossEntropyLoss()

  # creating training and validation data
  val_idx = int(len(data)*(1-val_frac))
  data, val_data = data[:val_idx], data[val_idx:]

  if(train_on_gpu):
    net.cuda()
  counter = 0
  n_chars = len(net.chars)
  for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    for x, y in get_batches(data, batch_size, seq_length):
      counter += 1

      # one hot encode our data and make them torch tensors
      x = one_hot_encode(x,n_chars)
      inputs, targets = torch.from_numpy(x), torch.from_numpy(y)

      if(train_on_gpu):
        inputs, targets = inputs.cuda(), targets.cuda()
        
      # create new hidden state variable to 
      # avoid traversing entire history 
      h = tuple([each.data for each in h])

      net.zero_grad()
      output, h = net(inputs, h)

      loss = criterion(output, targets.view(batch_size*seq_length).long())
      loss.backward()

      nn.utils.clip_grad_norm_(net.parameters(), clip)
      opt.step()

      # loss statistics
      if counter%print_every == 0:
        val_h = net.init_hidden(batch_size)
        val_losses = []
        net.eval()
        for x, y in get_batches(val_data, batch_size, seq_length):
          x = one_hot_encode(x, n_chars)
          x, y = torch.from_numpy(x), torch.from_numpy(y)

          val_h = tuple([each.data for each in val_h])

          inputs, targets = x, y
          if(train_on_gpu):
            inputs, targets = inputs.cuda(), targets.cuda()
            
          output, val_h = net(inputs, val_h)
          val_loss = criterion(output, targets.view(batch_size*seq_length).long()) 

          val_losses.append(val_loss.item())  

        net.train()
        print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

### Instantiating the model
Before training the model, we will first create the network with some given hyper-parameters. Then define mini-batches and start training.


In [57]:
n_hidden = 512
n_layers = 2

net = CharRNN(chars, n_hidden, n_layers)
print(net)

CharRNN(
  (lstm): LSTM(75, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=75, bias=True)
)


In [58]:
batch_size = 128
seq_length = 100
n_epochs = 20

# training model
train(net, encoded, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=10)



Epoch: 1/10... Step: 10... Loss: 3.2211... Val Loss: 3.1686
Epoch: 1/10... Step: 20... Loss: 3.1917... Val Loss: 3.1080
Epoch: 1/10... Step: 30... Loss: 3.1456... Val Loss: 3.0928
Epoch: 1/10... Step: 40... Loss: 3.1375... Val Loss: 3.0888
Epoch: 2/10... Step: 50... Loss: 3.1353... Val Loss: 3.0865
Epoch: 2/10... Step: 60... Loss: 3.1278... Val Loss: 3.0846
Epoch: 2/10... Step: 70... Loss: 3.1352... Val Loss: 3.0832
Epoch: 2/10... Step: 80... Loss: 3.1139... Val Loss: 3.0743
Epoch: 2/10... Step: 90... Loss: 3.0964... Val Loss: 3.0623
Epoch: 3/10... Step: 100... Loss: 3.0625... Val Loss: 3.0478
Epoch: 3/10... Step: 110... Loss: 3.0535... Val Loss: 2.9801
Epoch: 3/10... Step: 120... Loss: 2.9026... Val Loss: 2.8468
Epoch: 3/10... Step: 130... Loss: 2.7777... Val Loss: 2.6890
Epoch: 4/10... Step: 140... Loss: 2.6579... Val Loss: 2.5888
Epoch: 4/10... Step: 150... Loss: 2.5511... Val Loss: 2.4988
Epoch: 4/10... Step: 160... Loss: 2.5353... Val Loss: 2.4445
Epoch: 4/10... Step: 170... Loss:

### Getting the best model

* If `training_loss << validation_loss` - overfitting
  * Solution - Regularization/Dropout
* If `training_loss ~ validation_loss` - underfitting
  * Solution - Increase network size
* Most important parameters - 
  * `n_layers` - typically 2 or 3
  * `n_hidden` - adjustable according to your data size
* Best model is the one with the **least validation loss.**


# Checkpoint
After training, we'll save the model so we can load it again later if we need to.

In [0]:
model_name = 'rnn_20_epoch.net'
checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars    
              }
with open(model_name, 'wb') as f:
  torch.save(checkpoint, f)
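
Loading works in the reverse order: read the dictionary, rebuild the network from the saved hyper-parameters and tokens, then restore the weights. The sketch below is self-contained, so it uses a hypothetical `TinyNet` class as a stand-in for `CharRNN` and an in-memory buffer instead of a file on disk; with the real class, the same three steps would apply to the saved file.

```python
import io
import torch
from torch import nn

class TinyNet(nn.Module):
  # hypothetical stand-in for CharRNN, kept small for illustration
  def __init__(self, tokens, n_hidden=8, n_layers=1):
    super().__init__()
    self.chars = tokens
    self.n_hidden = n_hidden
    self.n_layers = n_layers
    self.lstm = nn.LSTM(len(tokens), n_hidden, n_layers, batch_first=True)
    self.fc = nn.Linear(n_hidden, len(tokens))

net = TinyNet(tokens=('a', 'b', 'c'))

# same checkpoint layout as above
checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

# save to an in-memory buffer (a file opened with 'wb' works the same way)
buffer = io.BytesIO()
torch.save(checkpoint, buffer)
buffer.seek(0)

# load: rebuild the network first, then restore its weights
loaded = torch.load(buffer)
restored = TinyNet(loaded['tokens'],
                   n_hidden=loaded['n_hidden'],
                   n_layers=loaded['n_layers'])
restored.load_state_dict(loaded['state_dict'])
print(torch.equal(restored.fc.weight, net.fc.weight))
```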



# Making Predictions
The output of our RNN is from a fully-connected layer and it outputs a distribution of next-character scores. <br>
To actually get the next character, we apply a softmax function, which gives us a probability distribution that we can then sample to predict the next character.
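
The sampling step on its own can be sketched as follows. The scores are made-up values for a five-character vocabulary, and `top_k = 2` restricts sampling to the two most likely characters, which here are indices 3 and 0.

```python
import numpy as np
import torch
import torch.nn.functional as F

# made-up scores for a vocabulary of 5 characters (batch of 1)
scores = torch.tensor([[2.0, 0.5, 1.0, 3.0, -1.0]])

# softmax turns scores into probabilities that sum to 1
p = F.softmax(scores, dim=1)

# keep only the top_k most likely characters
top_k = 2
p_top, top_idx = p.topk(top_k)
p_top = p_top.numpy().squeeze().astype(np.float64)
top_idx = top_idx.numpy().squeeze()

# renormalize the kept probabilities and sample one index
next_idx = np.random.choice(top_idx, p=p_top / p_top.sum())
print(top_idx, next_idx)
```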

In [0]:
def predict(net, char, h=None, top_k=None):
  '''
    Given a character, predict the next character.
    Returns the predicted character and the hidden state.
  '''

  # tensor inputs
  x = np.array([[net.char2int[char]]])
  x = one_hot_encode(x, len(net.chars))
  inputs = torch.from_numpy(x)

  if(train_on_gpu):
    inputs = inputs.cuda()
  
  # start from a zero hidden state if none was passed in
  if h is None:
    h = net.init_hidden(1)
  h = tuple([each.data for each in h])
  out, h = net(inputs, h)

  p = F.softmax(out, dim=1).data
  if(train_on_gpu):
    p = p.cpu()

  if top_k is None:
    top_ch = np.arange(len(net.chars))
  else:
    p, top_ch = p.topk(top_k)
    top_ch = top_ch.numpy().squeeze()

  p = p.numpy().squeeze()
  char = np.random.choice(top_ch, p=p/p.sum())

  return net.int2char[char], h


# Priming and Generating Text

Typically you'll want to prime the network so you can build up a hidden state. Otherwise the network will start out generating characters at random. In general the first bunch of characters will be a little rough since it hasn't built up a long history of characters to predict from.


In [0]:
def sample(net, size, prime='Purva', top_k=None):
  if(train_on_gpu):
    net.cuda()
  else:
    net.cpu()
  
  net.eval()

  chars = [ch for ch in prime]
  h = net.init_hidden(1)
  for ch in prime:
    char, h = predict(net, ch, h, top_k=top_k)
  
  chars.append(char)

  for ii in range(size):
    char, h = predict(net, chars[-1], h, top_k=top_k)
    chars.append(char)
  
  return ''.join(chars)

In [71]:
print(sample(net, 1000, prime='Purva likes recurrent neural networks', top_k=5))

Purva likes recurrent neural networks world, and hid
and at her. Se can the
cansting, and and sereshing." 
"Why mean and whing the stond, seid to to here so ment a constare at it she was she chark and he sad hin., sho lading has saing that whet he came of the peised to shan sead.. 
"It tearing her asting to he how and a mentice with she seil ald a colled hor talk, and to
shing her tome of the some with her then wand to me what werl of hor, and she had, and to don it, she was aly
that all her.

"Wis to by the mare is it here ald your the don't, seid tele, and that his sathed as in he had
say her sowe thild and ale and
a mothed, and than she had strettion as it were, and he had beter the foll of him then
stomt of his beconder a same a certing the paring of she was ale hear and
the fack the was ofter, as hes how sad than sho her fert ond on that he saed.

There a proulless on the
mare and and whaten the sompted, and at they sond his ford wand at has shing that he she sadd on the sache wit