In [2]:
from data_prep import *
from model import *

### data preparation 

First things first: let's start with data preparation. It's quite minimal here, and the main question is how to construct our batches.

We read the text as is, without preprocessing or normalizing, and extract all distinct chars from it. We organize these chars in a dictionary that maps each char to an index, and we encode the text as an array of numbers where each number is the index of a char in that dictionary. Since a `set` has no guaranteed order, the exact indices may differ from run to run.

In [3]:
with open('data/anna.txt', 'r') as f:
    text = f.read()

In [4]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

In [5]:
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}
print(list(char2int.items())[:5])
print(list(char2int.items())[-5:])

[('9', 0), ('*', 1), ('z', 2), ('r', 3), ("'", 4)]
[('c', 78), ('m', 79), ('!', 80), ('0', 81), ('&', 82)]

In [6]:
encoded = np.array([char2int[ch] for ch in text])

In [7]:
encoded[:10]

array([70, 10, 60, 21, 24, 57,  3, 52, 76, 64])

In [8]:
[int2char[i] for i in encoded[:10]]

['C', 'h', 'a', 'p', 't', 'e', 'r', ' ', '1', '\n']

Next step: one-hot encoding. The mechanics of the function are a bit involved, but the result is quite clear.

In [9]:
arr = encoded[:10].reshape(2, 5)
arr, arr.shape

(array([[70, 10, 60, 21, 24],
        [57,  3, 52, 76, 64]]), (2, 5))

In [10]:
one_hot = one_hot_encode(arr=arr, n_labels=len(chars))

In [11]:
one_hot.shape

(2, 5, 83)

In [12]:
one_hot[0, 0, :]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)

In [13]:
one_hot[0, 0, 22]

0.0
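
The body of `one_hot_encode` lives in `data_prep` and is not shown here. A minimal NumPy sketch that would produce the same shapes might look like the following; the name `one_hot_encode_sketch` and its internals are assumptions for illustration, not the actual implementation.

```python
import numpy as np

def one_hot_encode_sketch(arr, n_labels):
    """Hypothetical stand-in for data_prep.one_hot_encode (an assumption)."""
    # one row per element of arr, one column per possible label
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    # set a single 1 per row, in the column given by the integer code
    one_hot[np.arange(arr.size), arr.flatten()] = 1.0
    # restore the original shape, with the label axis appended
    return one_hot.reshape((*arr.shape, n_labels))

demo = np.array([[70, 10, 60, 21, 24],
                 [57,  3, 52, 76, 64]])
encoded_demo = one_hot_encode_sketch(demo, n_labels=83)
print(encoded_demo.shape)           # (2, 5, 83)
print(encoded_demo[0, 0].argmax())  # 70
```

Each position holds exactly one `1`, located at the index of the encoded char, so `argmax` along the last axis recovers the original integer codes.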

Now comes the most interesting part: batch construction. We have 2 new parameters: `batch_size=128` and `seq_length=100`. So we're going to reshape `encoded` into `128` rows and then use a sliding window of size `100` to iterate over it.

First question: why don't we construct a batch from sequential values of `encoded`? Probably because it doesn't really matter how we construct it. This is a char RNN after all: we need to predict the next char based on some previous ones, and `seq_length` is high enough for the rows of a batch to be treated as completely independent. But this is just a hypothesis. In fact there are stateful and stateless `RNNs` (see *Hands-On Machine Learning, ch. 16*), and in stateful `RNNs` the way a batch is built does matter.

Let's use much smaller parameter values for illustration purposes.

In [14]:
batch_size, seq_length = 3, 5

In [15]:
arr = encoded
n_batches = arr.size // (batch_size * seq_length)
arr = arr[:(n_batches * batch_size * seq_length)]
arr = arr.reshape((batch_size, -1))

In [16]:
arr.shape

(3, 661740)

In [17]:
arr[:, :10]

array([[70, 10, 60, 21, 24, 57,  3, 52, 76, 64],
       [12, 31, 73, 57, 78, 24, 35, 64, 64,  8],
       [72, 57, 52, 62, 34, 57,  3, 64, 60, 49]])

What should our batches look like? It's important that the sliding window advances in steps of `seq_length`.

In [18]:
arr[:, :5]

array([[70, 10, 60, 21, 24],
       [12, 31, 73, 57, 78],
       [72, 57, 52, 62, 34]])

In [19]:
arr[:, 5:10]

array([[57,  3, 52, 76, 64],
       [24, 35, 64, 64,  8],
       [57,  3, 64, 60, 49]])

In [20]:
batch_generator = get_batches(arr=encoded, batch_size=3, seq_length=5)

In [21]:
x, y = next(batch_generator)
x

array([[70, 10, 60, 21, 24],
       [12, 31, 73, 57, 78],
       [72, 57, 52, 62, 34]])

In [22]:
x, _ = next(batch_generator)
x

array([[57,  3, 52, 76, 64],
       [24, 35, 64, 64,  8],
       [57,  3, 64, 60, 49]])

Last question: how do we compute `y`? It should be `x` shifted by one position, since we predict the next character in a sequence after all. So if the first row of `arr` is `[70, 10, 60, 21, 24, 57,  3, 52, 76, 64, ...]`, then the first row of `x` is `[70, 10, 60, 21, 24]` and the first row of `y` is `[10, 60, 21, 24, 57]`.

In [23]:
arr[:, 1:6]

array([[10, 60, 21, 24, 57],
       [31, 73, 57, 78, 24],
       [57, 52, 62, 34, 57]])

In [24]:
y

array([[10, 60, 21, 24, 57],
       [31, 73, 57, 78, 24],
       [57, 52, 62, 34, 57]])
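
The body of `get_batches` is not shown in this notebook. The steps above (trim, reshape into `batch_size` rows, slide a window of `seq_length`, shift the targets by one) can be sketched roughly as follows. The name `get_batches_sketch` is hypothetical, and wrapping the very last target of each row around to the start of that row is an assumption of this sketch, not something confirmed by the outputs above.

```python
import numpy as np

def get_batches_sketch(arr, batch_size, seq_length):
    """Hypothetical stand-in for data_prep.get_batches (an assumption)."""
    chars_per_batch = batch_size * seq_length
    n_batches = arr.size // chars_per_batch
    # drop the tail that does not fill a complete batch, then make batch_size rows
    arr = arr[:n_batches * chars_per_batch].reshape((batch_size, -1))
    # targets are the inputs shifted left by one position;
    # the last column wraps around to the start of its row (assumption)
    targets = np.roll(arr, -1, axis=1)
    for start in range(0, arr.shape[1], seq_length):
        stop = start + seq_length
        yield arr[:, start:stop], targets[:, start:stop]

toy = np.arange(32)  # toy stand-in for `encoded`
batches = list(get_batches_sketch(toy, batch_size=3, seq_length=5))
x0, y0 = batches[0]
x1, y1 = batches[1]
print(x0)
print(y0)
```

Note that row `i` of the second batch continues row `i` of the first batch. That continuity is what a stateful RNN would rely on when carrying the hidden state from one batch to the next.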

### lstm model

Next, let's decompose our model bit by bit and simulate a forward pass. There are a lot of questions here: where do we use `seq_length` in `LSTM`, what is the output of `LSTM`, do we use `softmax` or not, can we handle sequences of variable length, and so on.

First question: what exactly do we need to construct an `LSTM`? Let's start with `LSTMCell`. All the gates in `LSTMCell` are basically linear layers followed by an activation. So we have weight matrices whose shapes depend on 2 parameters: `input_size` and `hidden_size`. And these are exactly the parameters we need to instantiate `LSTMCell`.
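
For reference, the gate equations as given in the PyTorch documentation for `LSTMCell` are written out below. Here $x_t$ is the input at step $t$, $h_{t-1}$ and $c_{t-1}$ are the previous hidden and cell states, $\sigma$ is the sigmoid and $\odot$ is the element-wise product. Each $W_{i\cdot}$ has shape `(hidden_size, input_size)` and each $W_{h\cdot}$ has shape `(hidden_size, hidden_size)`; with `bias=False` all $b$ terms are dropped.

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$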

Now to instantiate `LSTM` we need only one additional parameter: `n_layers` (PyTorch calls it `num_layers`). In our case `n_layers=2`, which means we have 2 LSTM layers stacked together. So here's the question: how does `LSTM` handle unrolling? How does it know `seq_length`?

In [25]:
batch_size, seq_length, n_chars = 3, 5, 83
batch_gen = get_tensor_batches(encoded, batch_size, seq_length, n_chars)
x, y = next(batch_gen)
x.shape, y.shape

(torch.Size([3, 5, 83]), torch.Size([3, 5]))

In [26]:
y

tensor([[10, 60, 21, 24, 57],
        [31, 73, 57, 78, 24],
        [57, 52, 62, 34, 57]])

Let's create a simple `LSTM` with just `1` layer. Let's use `bias=False` to simplify computations.

What does `batch_first=True` mean? From the documentation: *If `True`, then the input and output tensors are provided as `(batch, seq, feature)`.* And this is indeed the case: `x.shape[0]` is in fact the `batch_size`.

In [27]:
torch.manual_seed(0)
lstm_cell = nn.LSTMCell(input_size=len(chars),
                       hidden_size=2,
                       bias=False)
torch.manual_seed(0)
lstm = nn.LSTM(input_size=len(chars), 
               hidden_size=2, 
               num_layers=1,
               dropout=0, 
               batch_first=True,
               bias=False)

In [28]:
# shapes should be (4 * hidden_size, input_size)
# and (4 * hidden_size, hidden_size)
print(lstm_cell.weight_ih.shape, lstm_cell.weight_hh.shape)
print(lstm.weight_ih_l0.shape, lstm.weight_hh_l0.shape)

torch.Size([8, 83]) torch.Size([8, 2])
torch.Size([8, 83]) torch.Size([8, 2])


Next question: how should we initialize our hidden state? It turns out that the most popular solution is to just set it to `0` (again, see an alternative in *Hands-On Machine Learning, ch. 16*). This option is now the default; from the documentation: *If `(h_0, c_0)` is not provided, both `h_0` and `c_0` default to zero.* This was not always the case: see [here](https://github.com/pytorch/pytorch/issues/434).

That's easy to check directly: below we run the computations both with the default state and with `(h0, c0)` set to `0`, and the result is the same.
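
A minimal side-by-side check of the default state against explicit zeros might look like this. It is a sketch using a random stand-in input rather than the notebook's `x`, and the `*_demo` names are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm_demo = nn.LSTM(input_size=83, hidden_size=2, num_layers=1,
                    batch_first=True, bias=False)
# random stand-in input of shape (batch, seq, feature)
x_demo = torch.randn(3, 5, 83)

# explicit zero states have shape (num_layers, batch, hidden_size),
# even when batch_first=True
h0_demo = torch.zeros(1, 3, 2)
c0_demo = torch.zeros(1, 3, 2)

with torch.no_grad():
    out_default, (h_default, c_default) = lstm_demo(x_demo)
    out_zeros, (h_zeros, c_zeros) = lstm_demo(x_demo, (h0_demo, c0_demo))

print(torch.allclose(out_default, out_zeros))
print(torch.allclose(h_default, h_zeros), torch.allclose(c_default, c_zeros))
```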

#### computations of `lstm_cell`

So let's try to reproduce computations of the `LSTMCell`. 

In [29]:
x.shape

torch.Size([3, 5, 83])

In [30]:
x0 = x[:, 0, :]
x0.shape

torch.Size([3, 83])

In [31]:
# let's check that we don't have bias
lstm_cell.bias_ih, lstm_cell.bias_hh

(None, None)

In [32]:
Wii, Wif, Wig, Wio = lstm_cell.weight_ih[0:2, :], lstm_cell.weight_ih[2:4, :], \
                     lstm_cell.weight_ih[4:6, :], lstm_cell.weight_ih[6:8, :]
Wii.shape, Wif.shape, Wig.shape, Wio.shape

(torch.Size([2, 83]),
 torch.Size([2, 83]),
 torch.Size([2, 83]),
 torch.Size([2, 83]))

In [33]:
with torch.no_grad():
    # h0 and c0 are zeros, so the weight_hh terms vanish
    # and the forget gate (Wif) drops out of c = f * c0 + i * g
    i = torch.sigmoid(torch.mm(x0, Wii.t()))
    g = torch.tanh(torch.mm(x0, Wig.t()))
    o = torch.sigmoid(torch.mm(x0, Wio.t()))
    print(i.shape, g.shape, o.shape)

    c_man = i * g
    h_man = o * torch.tanh(c_man)
    print(h_man.shape, c_man.shape)

torch.Size([3, 2]) torch.Size([3, 2]) torch.Size([3, 2])
torch.Size([3, 2]) torch.Size([3, 2])


In [34]:
h_man

tensor([[-0.0747,  0.1230],
        [-0.0605, -0.1216],
        [ 0.1025,  0.0474]])

In [35]:
c_man

tensor([[-0.1182,  0.2956],
        [-0.1354, -0.1866],
        [ 0.1601,  0.1251]])

In [36]:
h0, c0 = torch.zeros(3, 2), torch.zeros(3, 2)
h0, c0

(tensor([[0., 0.],
         [0., 0.],
         [0., 0.]]), tensor([[0., 0.],
         [0., 0.],
         [0., 0.]]))

In [37]:
with torch.no_grad():
    h_comp, c_comp = lstm_cell(x0, (h0, c0))

In [38]:
h_comp

tensor([[-0.0747,  0.1230],
        [-0.0605, -0.1216],
        [ 0.1025,  0.0474]])

In [39]:
c_comp

tensor([[-0.1182,  0.2956],
        [-0.1354, -0.1866],
        [ 0.1601,  0.1251]])
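
The manual computation above works only because `h0` and `c0` are zeros. For a non-zero state the forget gate and the `weight_hh` terms come into play. A sketch of the general step, checked against `nn.LSTMCell` on random stand-in inputs (all names below are hypothetical), might look like this:

```python
import torch
from torch import nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=83, hidden_size=2, bias=False)
# random stand-ins for one time step of input and a non-zero previous state
x_t = torch.randn(3, 83)
h_prev, c_prev = torch.randn(3, 2), torch.randn(3, 2)

# gate weights are stacked in the order: input, forget, cell, output
W_ii, W_if, W_ig, W_io = cell.weight_ih.chunk(4, dim=0)
W_hi, W_hf, W_hg, W_ho = cell.weight_hh.chunk(4, dim=0)

with torch.no_grad():
    i = torch.sigmoid(x_t @ W_ii.t() + h_prev @ W_hi.t())
    f = torch.sigmoid(x_t @ W_if.t() + h_prev @ W_hf.t())
    g = torch.tanh(x_t @ W_ig.t() + h_prev @ W_hg.t())
    o = torch.sigmoid(x_t @ W_io.t() + h_prev @ W_ho.t())
    c_manual = f * c_prev + i * g
    h_manual = o * torch.tanh(c_manual)
    # reference values from the module itself
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-6))
print(torch.allclose(c_manual, c_ref, atol=1e-6))
```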

#### computations of `lstm`

And now comes the **hard** part: the computations of `LSTM`. Rather than doing all computations manually, we can use `LSTMCell` and a loop. For convenience let's define all inputs and models one more time.

First we create `LSTM` and `LSTMCell`. To reproduce the behavior of `LSTM` with `LSTMCell`, we replace the weights of `LSTMCell` with the weights of `LSTM`. Then we're ready to unroll the loop: the number of iterations is the length of the sequence. The loop is the same as the one in the `pytorch` [docs](https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTMCell). Initially `(h, c)` are 0s, and then at each step we just write:
`h, c = lstm_cell(x, (h, c))`.

In [40]:
x, y = next(batch_gen)
x.shape, y.shape

(torch.Size([3, 5, 83]), torch.Size([3, 5]))

In [43]:
torch.manual_seed(0)
lstm = nn.LSTM(input_size=len(chars), 
               hidden_size=2, 
               num_layers=1,
               dropout=0, 
               batch_first=True,
               bias=False)
torch.manual_seed(0)
lstm_cell = nn.LSTMCell(input_size=len(chars),
                       hidden_size=2,
                       bias=False)

In [44]:
lstm_cell.weight_ih = lstm.weight_ih_l0
lstm_cell.weight_hh = lstm.weight_hh_l0

In [51]:
print(lstm_cell.weight_ih[0, :5])
print(lstm.weight_ih_l0[0, :5])

tensor([-0.0053,  0.3793, -0.5820, -0.5204, -0.2723], grad_fn=<SliceBackward>)
tensor([-0.0053,  0.3793, -0.5820, -0.5204, -0.2723], grad_fn=<SliceBackward>)


In [56]:
h_cell, c_cell = torch.zeros(3, 2), torch.zeros(3, 2)
for i in range(x.shape[1]):
    xi = x[:, i, :]
    h_cell, c_cell = lstm_cell(xi, (h_cell, c_cell))
print(h_cell)
print(c_cell)

tensor([[ 0.1121,  0.1053],
        [ 0.0386,  0.2719],
        [-0.0643, -0.0012]], grad_fn=<MulBackward0>)
tensor([[ 0.2603,  0.1606],
        [ 0.0702,  0.4512],
        [-0.1267, -0.0020]], grad_fn=<AddBackward0>)


In [57]:
output, (h, c) = lstm(x)
print(h)
print(c)

tensor([[[ 0.1121,  0.1053],
         [ 0.0386,  0.2719],
         [-0.0643, -0.0012]]], grad_fn=<StackBackward>)
tensor([[[ 0.2603,  0.1606],
         [ 0.0702,  0.4512],
         [-0.1267, -0.0020]]], grad_fn=<StackBackward>)


Let's now check `output`. From the documentation: *tensor containing the output features (h_t) from the last layer of the LSTM, for each t*. First of all, we see that `output[:, 4, :]` is indeed equal to `h` from the computations above.

In [58]:
output.shape

torch.Size([3, 5, 2])

In [59]:
output[:, 4, :]

tensor([[ 0.1121,  0.1053],
        [ 0.0386,  0.2719],
        [-0.0643, -0.0012]], grad_fn=<SliceBackward>)

In [63]:
with torch.no_grad():
    h_cell, c_cell = torch.zeros(3, 2), torch.zeros(3, 2)
    for i in range(x.shape[1]):
        xi = x[:, i, :]
        h_cell, c_cell = lstm_cell(xi, (h_cell, c_cell))
        print(h_cell[0, :], output[:, i, :][0, :])

tensor([0.0578, 0.0923]) tensor([0.0578, 0.0923], requires_grad=True)
tensor([0.1041, 0.1083]) tensor([0.1041, 0.1083], requires_grad=True)
tensor([ 0.0826, -0.0348]) tensor([ 0.0826, -0.0348], requires_grad=True)
tensor([-0.0163, -0.0010]) tensor([-0.0163, -0.0010], requires_grad=True)
tensor([0.1121, 0.1053]) tensor([0.1121, 0.1053], requires_grad=True)
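
This also answers the earlier question of how `LSTM` knows `seq_length`: it is not a constructor parameter at all, and the number of unrolled steps is simply read from the sequence dimension of the input. A small sketch with random stand-in inputs (the names `lstm_any` and `shapes` are hypothetical) shows the same module handling batches of different lengths:

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm_any = nn.LSTM(input_size=83, hidden_size=2, num_layers=1,
                   batch_first=True, bias=False)

shapes = {}
with torch.no_grad():
    for seq_len in (5, 7, 100):
        # random stand-in input of shape (batch, seq_len, feature)
        x_seq = torch.randn(3, seq_len, 83)
        out_seq, (h_last, c_last) = lstm_any(x_seq)
        # output holds one h_t per time step; h_last is the final one
        shapes[seq_len] = tuple(out_seq.shape)
        print(seq_len, tuple(out_seq.shape), tuple(h_last.shape))
```

Within a single batch all sequences still share one length here; handling sequences of different lengths inside the same batch would require padding or packing, which is outside this sketch.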


And that's it for this char RNN model.