# Overview

In this notebook I will train a character-level LSTM. The model learns from a text one character at a time so that it can afterwards generate completely new text, also one character at a time. This example is based on the text of Anna Karenina. The goal is a network that can generate a chunk of text in the same style as Anna Karenina.<br>

The structure of the network.<br>
<img src="images/LSTM4.jpeg" width="500"><br>
Credits: Udacity Computer vision Nanodegree


# Loading and preparing the data for training

In [3]:
# load resources
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

In [7]:
# open the text file and read its contents as a string
with open("data/anna.txt", "r") as stream:
    text = stream.read()

# show sample
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

Now we need to map the text to integers. The following steps were taken:
1. Create a tuple of all distinct characters present in the text
2. From that tuple create a dictionary `int : char`, where the integers are the keys
3. Invert that mapping from `int : char` to `char : int`
4. Using the dictionary from step 3, map the whole text to an array of the corresponding integers

In [9]:
# step 1: use the set constructor to get the unique chars in the whole text, then make the result immutable with tuple
chars = tuple(set(text))
# step 2: create the dictionary `int : char`
int2char = dict(enumerate(chars))
# step 3: invert the mapping
char2int = {char: integer for integer, char in int2char.items()}
# step 4: map every char in text to its value in the char2int dict and save the result as a numpy array
encoded = np.array([char2int[char] for char in text])
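The four steps can be checked on a small string by encoding it and decoding it back. The snippet below is a minimal sketch, not part of the original notebook: the sample string and the `*_demo` names are made up for illustration, and `sorted()` is swapped in as an assumption to make the character order deterministic, because the iteration order of a set of strings can change between Python sessions.

```python
# Toy stand-in for the novel's text (an assumption for illustration)
sample = "happy families"

# sorted() fixes the character order; tuple(set(text)) alone may order
# the characters differently from one Python session to the next
chars_sorted = tuple(sorted(set(sample)))
int2char_demo = dict(enumerate(chars_sorted))
char2int_demo = {ch: i for i, ch in int2char_demo.items()}

# encode to integers, then decode back to characters
encoded_demo = [char2int_demo[ch] for ch in sample]
decoded_demo = "".join(int2char_demo[i] for i in encoded_demo)

print(encoded_demo)
print(decoded_demo == sample)                    # the mapping is lossless
print(len(chars_sorted), max(encoded_demo) + 1)  # vocabulary size, counted two ways
```

Because the integer codes start at 0, the vocabulary size equals the largest code plus one.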

Let's look again at the first 100 characters of the text and their encoded version

In [11]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

In [12]:
encoded[:100]

array([41, 47, 46,  3, 82, 49,  1,  8,  6, 79, 79, 79, 10, 46,  3,  3,  0,
        8, 51, 46, 65, 53, 26, 53, 49, 24,  8, 46,  1, 49,  8, 46, 26, 26,
        8, 46, 26, 53, 55, 49, 72,  8, 49, 78, 49,  1,  0,  8, 21, 12, 47,
       46,  3,  3,  0,  8, 51, 46, 65, 53, 26,  0,  8, 53, 24,  8, 21, 12,
       47, 46,  3,  3,  0,  8, 53, 12,  8, 53, 82, 24,  8, 40, 13, 12, 79,
       13, 46,  0, 67, 79, 79, 16, 78, 49,  1,  0, 82, 47, 53, 12])

In [16]:
# largest integer code in the text; codes start at 0, so the vocabulary size is max(encoded) + 1
max(encoded)

82

Everything is working perfectly :)

### One-hot encoding
As can be seen in the image above, the LSTM expects one-hot encoded characters. So for every character we have a vector of length `len(chars) = max(encoded) + 1 = 83`, in which only the element at that character's index is 1 and all other elements are 0.

In [17]:
def one_hot_encode(arr, n_labels):
    
    # Initialize the encoded array: one row per element of arr (arr.size works for any number of dimensions)
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    
    # Finally reshape it to the original array's shape plus a trailing one-hot axis
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    
    return one_hot
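
To see what the function returns, it helps to run it on a tiny array. The following is a small sketch under stated assumptions: the toy batch and the vocabulary size of 4 are invented for illustration, and the function is repeated in condensed form only so that the snippet runs on its own.

```python
import numpy as np

def one_hot_encode(arr, n_labels):
    # Condensed copy of the function above so this snippet is self-contained
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    return one_hot.reshape((*arr.shape, n_labels))

# Hypothetical toy batch: 2 sequences of 3 encoded characters, vocabulary of 4
toy = np.array([[0, 3, 1],
                [2, 2, 0]])
toy_one_hot = one_hot_encode(toy, 4)

print(toy_one_hot.shape)   # (2, 3, 4)
print(toy_one_hot[0, 1])   # [0. 0. 0. 1.]
```

The output gains one trailing axis of length `n_labels`, and taking `argmax` over that axis recovers the original integer codes.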

### Making training mini-batches
When making mini-batches from sequence data it is important to understand whether we are talking about **batch size** or **sequence length**. The image below illustrates the difference.<br>

<img src="images/LSTM5.png" width="500"><br>
Credits: Udacity Computer vision Nanodegree<br>

### Creating batches: step-by-step guide
Legend:<br>
`N - batch size`<br>
`M - Sequence length`<br>
`K - Total number of complete batches, each holding N * M characters`<br>
`arr - sequence of encoded characters (encoded by a dictionary, not one-hot encoded)`<br>
`n - number of all characters in arr, simply len(arr)`

1. Discard data that does not fit in complete batches<br>
To do that we need to compute `K`: the number of all chars in `arr`, `n`, divided by the number of chars in a single batch, `N * M`, rounded down (integer division). Multiplying `K` by `N * M` then gives the number of chars from `arr` we want to keep.

2. Having trimmed `arr`, split it into `N` sequences<br>
`arr` has to be reshaped into a matrix of size `(N, M * K)`.

3. Lastly, iterate through that matrix to get the batches<br>
Iterating through the matrix can be seen as moving a window of size `(N, M)` along the columns with a step of `M`.
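
The three steps can be traced with small numbers. This is only an illustrative sketch: the input of 25 consecutive integers and the choice `N = 2`, `M = 3` are assumptions made so that the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical toy input: 25 "encoded characters", N = 2 sequences, M = 3 steps
arr_demo = np.arange(25)
N, M = 2, 3

# Step 1: K complete batches fit; drop the leftover characters
K = len(arr_demo) // (N * M)      # 25 // 6 = 4
trimmed = arr_demo[:K * N * M]    # keeps 24 characters, drops 1

# Step 2: reshape into N rows of M * K characters each
matrix = trimmed.reshape(N, -1)   # shape (2, 12)

# Step 3: slide an (N, M) window across the columns with step M
windows = [matrix[:, start:start + M] for start in range(0, matrix.shape[1], M)]

print(K, matrix.shape, len(windows))
print(windows[0])
```

Each row of the matrix is one long contiguous slice of the text, so consecutive windows continue each of the `N` sequences where the previous window stopped.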

In [72]:
def get_batches(arr, n_seqs, n_steps):
    """
    
    Generator that returns batches of size (n_seqs, n_steps) from arr
    
    Parameters:
    -----------
    arr: numpy array from which the batches are created
    n_seqs: batch size (N)
    n_steps: sequence length (M)
    
    """
    # compute the number of characters in a single batch (N * M)
    num_char_batch = n_seqs * n_steps
    
    # get the number of complete batches that fit in arr (K); // is integer division
    n_batches = len(arr)//num_char_batch
    
    # keep only enough characters to make full batches
    arr = arr[:n_batches*num_char_batch]
    
    # reshape arr in order to get shape (N, M*K)
    arr = arr.reshape(n_seqs, -1)
    
    # get batches from prepared array
    for n in range(0, arr.shape[1], n_steps):
        
        # batch with features
        x = arr[ : , n : n + n_steps]
        
        # batch with targets
        y = np.zeros_like(x)
        
        # targets are the features shifted one step to the left
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n + n_steps]
        except IndexError:
            # at the end of arr there is no next column, so wrap around to the first column of arr
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y
        
# Let's test it
batches = get_batches(encoded, 10, 10)
x, y = next(batches)
print("Feature batch\n",x)
print("Target batch\n",y)

Feature batch
 [[41 47 46  3 82 49  1  8  6 79]
 [ 0 52 63  8 46 12 24 13 49  1]
 [46 61 12 53 51 53 37 49 12 82]
 [ 8 46  8  3  1 49 78 53 40 21]
 [49  8 82 47 46 12 55 49 19  8]
 [ 8 24 46 13  8 47 49  1  8 82]
 [37 82 49 19  8 13 47 46 82 79]
 [46 24  8 24 21 51 51 49  1 53]
 [49 19  8 53 82 24  8 82 53 65]
 [37 49  8 40 51  8 65 53 12 19]]
Target batch
 [[47 46  3 82 49  1  8  6 79 79]
 [52 63  8 46 12 24 13 49  1 49]
 [61 12 53 51 53 37 49 12 82 72]
 [46  8  3  1 49 78 53 40 21 24]
 [ 8 82 47 46 12 55 49 19  8 74]
 [24 46 13  8 47 49  1  8 82 49]
 [82 49 19  8 13 47 46 82 79 47]
 [24  8 24 21 51 51 49  1 53 12]
 [19  8 53 82 24  8 82 53 65 49]
 [49  8 40 51  8 65 53 12 19  8]]


The function seems to work as expected. We can see that the second column of the `Feature batch` is the first column of the `Target batch`, and that the last column of the `Feature batch` is the second-to-last column of the `Target batch`.
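
The same shift property can be cross-checked for every batch, including the wrap-around in the last one. The sketch below does not reuse the generator; it swaps in `np.roll` as an independent way of producing the targets, and the toy matrix is an assumption made for illustration.

```python
import numpy as np

# Hypothetical toy matrix, already trimmed and reshaped to (N, M * K) = (2, 6)
matrix = np.arange(12).reshape(2, 6)
M = 3  # sequence length

# Shift every row one step to the left; the last column wraps to the row's first element
shifted = np.roll(matrix, -1, axis=1)

for start in range(0, matrix.shape[1], M):
    x = matrix[:, start:start + M]
    y = shifted[:, start:start + M]
    # Inside a window, each target column equals the next feature column
    assert np.array_equal(y[:, :-1], x[:, 1:])
    print("Feature batch\n", x)
    print("Target batch\n", y)
```

In the last window the final target column equals the first column of the whole matrix, which mirrors the `except IndexError` branch of the generator.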