**RECURRENT NEURAL NETWORK ARCHITECTURES FOR CHARACTER GENERATION**

*Patrick Donnelly, Groupware Technology*

Feedforward neural networks have come to dominate computer vision and other artifical intelligence applications. These networks use models consisting of linear operations such as convolutions, matrix multiplications, and additions to transform batches of inputs into outputs. These models are then trained through backpropagation, whereby parameters are updated via optimization: taking the partial derivative of the loss function with respect to each weight and bias.

When we train a feedforward neural network, we assume temporal independence between input data. Feedforward neural networks aren't necessarily deterministic. They frequently produce different outputs depending on the sequence in which input data is fed. However, the network architecture does not explicitly incorporate time. The network simply takes a batch of observations as input. It does not have the concept of *memory*, or a _state_ that updates as additional data is presented to the network. As a result, feedforward neural networks work fine for modeling "static" data such as images or user data (e.g. for recommendation systems), though we can certainly apply sequential models to such data (e.g. for image captioning or sequentially-aware recommendation systems).

Enter the **recurrent neural network (RNN)**. RNNs incorporate __feedback loops__ into the network architecture. These loops enable the network to take as input a **hidden state** in addition to the current input value. If this doesn't quite make sense yet, let's use an extremely simple example to compare an RNN with a feedforward network. Let's take a word and represent it as a string of letters $S$. Since we don't want to rip off Andrej Karpathy's excellent blog post (http://karpathy.github.io/2015/05/21/rnn-effectiveness/) *completely*, we'll _slightly_ change the text:

In [1]:
S = ['H','E','L','L','A']
S

Now let's use **one-hot encodings** to represent each letter as a vector. A one-hot encoding is a vector with a one corresponding to the index of the letter (or class, in the case of categorical data). The rest of the vector is all zeroes. We're only working with four possible letters, so we can encode our letters as one-hot vectors of length four. Since we're working with categorical (but not ordinal) data, it doesn't matter which index corresponds to which letter. If you've taken the Groupware image classification tutorial (shameless plug) or otherwise worked with labeled data, you may recognize this as the way we assigned class labels to images. Let's just start with our **H** encoded as a one followed by zeroes:

In [2]:
H = [1,0,0,0]
H

We can do the same for our E, L, and A. We'll be reusing our L encoding since we have two Ls (no need for two separate encodings):

In [3]:
E = [0,1,0,0]
L = [0,0,1,0]
A = [0,0,0,1]
print(E)
print(L)
print(A)

Now let's think about what we're trying to predict. For given input character(s) (letter), we'd like to generate the next (output) character. So if we give the network an **H** (as input), we'd want it to return an __E__ (as output). We can also use strings of multiple input characters to generate the next output. For instance, if we feed the network \['H','E','L','L'\], we'd like it to generate an 'A' (of course not an 'O'! Do it for the Bay. Also, I'm watching the A's play the O's in another window right now but don't tell my manager).

Let's just start with the simple example of using one letter to generate another. For our string of length $L=5$, we thus have $L-1=5-1=4$ input/output pairs:

{Input: H = \[1,0,0,0\], Output: E = \[0,1,0,0\]}
{Input: E = \[0,1,0,0\], Output: L = \[0,0,1,0\]}
{Input: L = \[0,0,1,0\], Output: L = \[0,0,1,0\]}
{Input: L = \[0,0,1,0\], Output: A = \[0,0,0,1\]}

Now let's say we're using a standard feedforward network to generate our output letters. Of course we'd never do this in practice. You can already see an inconsistency: there's an 'L' after the first 'L', but there's an 'A' after the second 'L.' However, our feedforward network can't tell the difference...

What sort of function should we use to map our inputs to outputs? How about a simple linear transformation! Let's start with a matrix multiplication (without adding a "bias" term). We can multiply an $nxm$ matrix by our $4x1$ $mxp$ input vector to get our $4x1$ $nxp$ output vector (see https://en.wikipedia.org/wiki/Matrix_multiplication - I use this almost as much as I google 'untar file linux'). You might be like, "well, wait a second, these are $1x4$ input and output vectors," and you would be correct, but then we get a very uninteresting 1x1 matrix multiplication. So we instead multiply a **weight matrix** by the transpose of our row vector (or just feed it a column vector) to get our output (also a column vector or transpose of our row vector).

NumPy makes matrix operations much easier in Python. Let's import it:

In [4]:
import numpy as np

Our data are currently lists:

In [5]:
print(type(H))
print(type(E))
print(type(L))
print(type(A))

<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>


Let's convert our data to numpy arrays:

In [6]:
H = np.array(H)
E = np.array(E)
L = np.array(L)
A = np.array(A)
print(type(H))
print(type(E))
print(type(L))
print(type(A))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


If we convert a list $L = [x_{1}, ..., x_{n}]$ to a *np.ndarray* using L = np.array(L), the default shape is (len(L),):

In [7]:
print(H.shape)
print(E.shape)
print(L.shape)
print(A.shape)

(4,)
(4,)
(4,)
(4,)


However, to do matrix math, we want to convert our (len(L),) array to a (len(L),1) "column vector" (although one with two "dimensions"). We can do this for each our letters (each of which is a one-hot "list") using **L.reshape(len(L),1)**:

In [8]:
H = H.reshape(len(H),1)
E = E.reshape(len(E),1)
L = L.reshape(len(L),1)
A = A.reshape(len(A),1)
print(H.shape)
print(E.shape)
print(L.shape)
print(A.shape)

(4, 1)
(4, 1)
(4, 1)
(4, 1)


We now have the column vectors we need to do matrix arithmetic:

In [9]:
print(H)
print(E)
print(L)
print(A)

[[1]
 [0]
 [0]
 [0]]
[[0]
 [1]
 [0]
 [0]]
[[0]
 [0]
 [1]
 [0]]
[[0]
 [0]
 [0]
 [1]]


Now we can randomly initialize a 4x4 weight matrix to map inputs to outputs:

In [10]:
W = np.random.randn(4,4)
print(W)

[[-0.26772254 -0.81721217 -0.43338741 -0.51728738]
 [-0.37003086  0.44429691 -2.13113222  0.6184106 ]
 [-0.4784692  -1.19890252 -0.41701251 -0.23615541]
 [ 0.99985447 -1.12891024 -1.4123793  -0.51689583]]


We can think of our "network" as a classifier. We multiply the weight matrix by each input, find the index of the maximum value of the output vector and compare it to the label. Let's call this output vector Y:

In [11]:
Y = np.matmul(W,H)
Y

array([[-0.26772254],
       [-0.37003086],
       [-0.4784692 ],
       [ 0.99985447]])

We can also write a simple function to see if we correctly classified our input/output pair. Recall that our "ground truth" output is **E**:

In [12]:
def did_we_get_this_right(Y,E):
    if np.argmax(Y) == np.argmax(E):
        return 'Yes we did get this right'
    return 'No we did not get this right. Please backprop and try again'

did_we_get_this_right(Y,E)

'No we did not get this right. Please backprop and try again'

We can also add bias to our linear transformation by adding a $4x1$ vector $b$ to $W$. Let's randomly initialize it:

In [13]:
b = np.random.randn(4,1)
print(b)

[[-0.80053434]
 [-1.64311801]
 [ 1.38740459]
 [ 2.10735781]]


Now let's try our linear transformation with bias:

In [14]:
Y = np.matmul(W,H)
Y = np.add(Y,b)
print(Y)
did_we_get_this_right(Y,E)

[[-1.06825688]
 [-2.01314888]
 [ 0.90893539]
 [ 3.10721228]]


'No we did not get this right. Please backprop and try again'

Of course we haven't actually done any weight or bias updates, so whether or not we have correctly classified our output is completely random. The purpose of this exercise is merely to show how we would map inputs to outputs using a simple linear transformation.

One more thing, we'd also want to transform our output Y into a probability vector by exponentiating each output and dividing by the sum of exponentiated outputs. We call this the **softmax function**:

In [15]:
def softmax(Y):
    return np.exp(Y)/np.sum(np.exp(Y))

softmax(Y)

array([[0.01357192],
       [0.0052757 ],
       [0.09802235],
       [0.88313002]])

Cool. But you see why this algorithm is kinda dumb? Consider our output for input **L**:

In [16]:
Y = np.matmul(W,L)
Y = np.add(Y,b)
print(did_we_get_this_right(Y,L))
print(did_we_get_this_right(Y,A))
softmax(Y)

Yes we did get this right
No we did not get this right. Please backprop and try again


array([[0.05873781],
       [0.00463092],
       [0.5324013 ],
       [0.40422997]])

There are actually two possible outputs for input **L**: 'L' (if it's the first 'L') and 'A' (if it's the second 'L'). We have no idea which output we're trying to predict, since the algorithm has no knowledge of sequence. How could we provide this knowledge as input?

The first thing we'll do is introduce a **hidden layer** to our network. This is simply an additional linear transformation. We'll compute two matrix multiplications and additions rather than one.

While we're not accounting for sequencing yet, maybe you can see how the introduction of a hidden layer will make modeling a sequence of characters easier? If not, no worries! I have no idea whether I would have grasped this intuition myself either, since I have not been presented the same sequence of data as the reader.

Let's randomly initialize a second weight matrix and bias vector. We're gonna change our nomenclature here. Following Karpathy's (amazing) blog post, we'll call the weights and biases for our first layer **W_xh** and __b_xh__. The *x* refers to our input and the *h* refers to our **hidden vector**, situated between our input and output vectors. If we choose an identical length for our input, hidden, and output vectors, we can reuse the __np.random.randn()__ function we used to initialize our prior weight matrix and bias vector. Let's do that:

In [17]:
W_xh = np.random.randn(4,4)
b_xh = np.random.randn(4,1)
print(W_xh)
print(b_xh)

[[-1.98142587  0.40522554 -0.17918443  0.0747749 ]
 [-0.43499536 -0.54000442 -0.72627699 -0.05630791]
 [-0.58736728  0.7137192  -0.90068108 -1.06274096]
 [ 0.8556658  -0.06196241  1.47379536 -0.10516402]]
[[-0.30636208]
 [-0.56215209]
 [ 0.10591097]
 [-1.10188667]]


We'll also need a second weight matrix and bias vector to connect our hidden vector with our output vector. Since we're keeping the length of our input, hidden, and output vectors identical, we'll pass the same arguments to **np.random.randn()** as we did with __W_xh__ and **b_xh**. We'll call our weight matrix __W_hy__ and our bias vector **b_hy** since they connect our hidden vector *h* with our output vector _y_:

In [18]:
W_hy = np.random.randn(4,4)
b_hy = np.random.randn(4,1)
print(W_hy)
print(b_hy)

[[ 1.19617907e-01 -1.87853287e-01  9.49974131e-01  1.47702080e+00]
 [-1.14526258e+00 -5.28390386e-01 -9.85818626e-04 -4.29507084e-01]
 [ 1.12734483e+00  9.08576176e-01 -1.66398994e+00 -1.10708956e+00]
 [ 3.82694730e-02 -1.35913733e-01  1.29185449e+00  3.45197729e-01]]
[[ 1.68147147]
 [ 0.33559643]
 [ 0.33195594]
 [-0.29763739]]


If you've worked with multilayer neural networks, you'll know that it's necessary to add an **activation function** before passing the data from one linear layer to another. If you haven't worked with multilayer neural networks (or have worked with multilayer neural networks and didn't know this), now you know. For more details, check out our "Introduction to Neural Networks tutorial!"

An activation function needs to do a few things for us (and for others who are using the network). It's gotta be nonlinear: we want our neural network to be a **universal approximator** (https://en.wikipedia.org/wiki/Universal_approximation_theorem), not just a linear function with many parameters. Even if we don't care about whether or not our function is nonlinear, there's no point in adding addition linear layers without an activation function; the function will just "collapse" to a single layer (see this elegant explanation on StackExchange: https://stats.stackexchange.com/questions/267024/what-if-do-not-use-any-activation-function-in-the-neural-network)

We also need an activation function that will allow our weights and biases to update effectively. Parameter updates are a form of optimization: we take the derivative of our loss function with respect to each weight and bias and set it to zero. We compute this derivative using the chain rule: sequentially multiplying each partial derivative as we "backpropagate" through the network. (I guess we don't need to do this sequentially since scalar multiplication is commutative.) Thus our network will only learn if our activations have nonzero derivatives. As the derivative of the activation function approaches zero, our **gradients** (a pretentious way of saying "derivatives") die / are killed / (euphemistically) vanish / (less violently) approach zero, and our weights and biases update less (in proportion to our *learning rate*, but that's another story).

We typically apply a __hyperbolic tangent (tanh)__ activation function to **W_xh** and __b_xh__. This will squash the positive values of our output between 0 and 1 and our negative values between -1 and 0. We do have the issue of **vanishing gradients** as $x$ increases or decreases, but in practice we should be fine with a single hidden layer. (It's when we stack lots of these activations that we really run into trouble.)

NumPy has a function for **tanh**. Let's apply it to the output of our first matrix multiplication and addition. We'll call this output __h__ (for "hidden vector," not to be confused with our input datum **H**). 

We'll use this to construct a hidden vector from our input __H__:

In [19]:
h = np.matmul(W_xh,H)
h = np.add(h, b_xh)
h = np.tanh(h)
h

array([[-0.9796093 ],
       [-0.76039356],
       [-0.44740916],
       [-0.24136295]])

In [20]:
h.shape

(4, 1)

Now we can pass **h** through our second layer, apply softmax, and compare to our label / ground truth (the 'E' that we're supposed to generate if our model is doing its job):

In [21]:
Y = np.matmul(W_hy,h)
Y = np.add(Y,b)
print(did_we_get_this_right(Y,E))
softmax(Y)

No we did not get this right. Please backprop and try again


array([[0.02789516],
       [0.13026018],
       [0.24194625],
       [0.59989841]])

Let's return to the issue of modeling sequencing. Suppose we're presenting our second character 'E' to the network. We want to make the network aware of the sequence, in this case the 'H' that has already passed through. We can do this by modeling our hidden layer as a **hidden state**! This means that we'll update this hidden state as we pass each character through the network. How do we do this? 

Instead of defining our hidden vector **h** as the linear transformation (matrix multiplication and vector addition) of our input vector (call this __x__: this can be any character in our string of letters), let's start by initializing our hidden state as a vector of zeroes. 

We can do this using **np.zeros** and passing the shape of the array of zeroes within square brackets inside parentheses. We'll keep the dimensions identical to our prior hidden vector:

In [22]:
h = np.zeros([4, 1])
print(h.shape)
h

(4, 1)


array([[0.],
       [0.],
       [0.],
       [0.]])

Now let's pass our first observation ('H') through the network. We first perform a linear transformation by multiplying **W_xh** by __H__ and adding **b_xh**. 

So far nothing's changed, except we won't directly output this to the hidden state. We'll call our transformed input __x__:

In [23]:
x = np.matmul(W_xh, H)
x = np.add(x, b)
x

array([[-2.78196021],
       [-2.07811338],
       [ 0.80003731],
       [ 2.96302361]])

Now we're gonna do something new. Let's update our hidden state. To do this, we need a weight matrix **W_hh**. Let's initialize this weight matrix (as with our other weight matrices and bias vectors) with samples drawn from a Gaussian (normal) distribution of mean 0 and variance 1. We're not going to add a bias vector when we update our hidden state, though we'll continue to do this for our output. As we did before, we'll pass the dimension of our weight matrix as arguments:

In [24]:
W_hh = np.random.randn(4,4)
print(W_hh)

[[ 0.64386655 -0.85625832  1.26784127 -0.66501504]
 [-0.05879225  0.27373941 -0.29388187 -1.31669811]
 [-0.44262796 -0.78381947  1.74877914 -1.25037294]
 [-1.03120265 -1.12159143  0.5567503   0.86568674]]


To update our hidden state, we'll multiply our weight matrix **W_hh** by the hidden state __h__. Since we're initializing our hidden state as a vector of zeroes, this won't actually change **h** the first time we perform this operation:

In [25]:
h = np.matmul(W_hh, h)
h

array([[0.],
       [0.],
       [0.],
       [0.]])

We now add our updated hidden state __h__ to the linear transformation of our input **x**. 

Recall that we updated our input by performing the matrix multiplication __x = np.matmul(W_xh, H)__ and vector addition **x = np.add(x, b)**. Thus we can just update our hidden state by adding our existing hidden state __h__ to our transformed input **x**:

In [26]:
h = h + x
h

array([[-2.78196021],
       [-2.07811338],
       [ 0.80003731],
       [ 2.96302361]])

Since **h** is a vector of zeroes, __h + x = x__ for the first step of our RNN.

Again, we'll apply a hyperbolic tangent activation function our output:

In [27]:
h = np.tanh(h)
h

array([[-0.99236185],
       [-0.96915019],
       [ 0.66405763],
       [ 0.99467619]])

Finally, we use this hidden state to compute our output. Again, we randomly initialize a weight matrix **W_hy** and bias vector __b_hy__:

In [28]:
W_hy = np.random.randn(4,4)
b_hy = np.random.randn(4,1)
print(W_hy)
print(b_hy)

[[ 2.12318582 -0.47892379 -0.98261904 -1.4032746 ]
 [ 1.28493762  0.80957265 -0.37800311 -0.13171113]
 [-0.13697523  0.40805538  1.74597203  0.5780113 ]
 [-1.05078095  0.42694039  1.38442699  0.28662435]]
[[0.05336671]
 [0.84768325]
 [0.92051778]
 [1.73485876]]


Then we multiply **W_hy** by our hidden state __h__ and add **b_hy**. 

Again, we'll call this output __Y__. **softmax(Y)** gives our output as a probability vector:

In [29]:
Y = np.matmul(W_hy,h)
Y = np.add(Y,b)
print(did_we_get_this_right(Y,E))
softmax(Y)

No we did not get this right. Please backprop and try again


array([[1.62376117e-04],
       [2.43891723e-04],
       [2.53676590e-01],
       [7.45917142e-01]])

We can continue by passing our next character through the network. We first linearly transform  (?) our **x**, now defined as the letter 'E':

In [30]:
x = np.matmul(W_xh, E)
x = np.add(x, b)
x

array([[-0.3953088 ],
       [-2.18312243],
       [ 2.1011238 ],
       [ 2.0453954 ]])

Now we update our hidden state by adding our existing **h** to __x__ and taking the hyperbolic tangent of the output. 

This becomes the new value of **h**:

In [31]:
print(h)
h = np.tanh(h + x)
h

[[-0.99236185]
 [-0.96915019]
 [ 0.66405763]
 [ 0.99467619]]


array([[-0.88265735],
       [-0.99635072],
       [ 0.99210221],
       [ 0.99543475]])

Now let's compute our output for 'E':

In [32]:
Y = np.matmul(W_hy,h)
Y = np.add(Y,b)
print(did_we_get_this_right(Y,E))
softmax(Y)

No we did not get this right. Please backprop and try again


array([[1.01968221e-04],
       [1.64641171e-04],
       [2.97475646e-01],
       [7.02257745e-01]])

You get the idea? As we pass each character, the hidden state incorporates the information of the character **x** along with its prior state (which is in turn a function of the value of prior characters).

What if we want to stack layers of linear transformations and activations with hidden states? Computing all this by hand (while certainly feasbile) might get verbose and unwieldy. Plus we haven't even gotten to the hard part: updating weights and biases using some form of gradient descent (e.g. with momentum)! While we're going to keep this tutorial focused on architectures rather than optimization functions, it's helpful to understand why we might benefit from a higher-level framework.

In this tutorial, we'll use **PyTorch** to define our networks. The cool thing about PyTorch is that it borrows heavily from NumPy. It uses an analogue of the NumPy array called the __Tensor__. By convention, we import PyTorch as follows:

In [33]:
import torch

Now let's define our RNN. We'll borrow heavily from the official PyTorch tutorial: https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

We start by importing the **torch.nn** module. This contains a lot of the stuff (including the __RNN__ class) that we need to define network architectures (including our simple RNN):

In [34]:
import torch.nn as nn

Now let's define our **RNN** class, which inherits from __nn.Module__, the "base class for all neural network modules" (see https://pytorch.org/docs/stable/\_modules/torch/nn/modules/module.html). This will include our constructor (**init**), which defines our architecture; a __forward__ method, which defines how our data sequentially *propagates* through the network; and an **initHidden** method, which (as you may have guessed) initializes our hidden layer as a vector of zeroes (just as we did!)

We will use an instance of this class to build a *character-level RNN*. Let's sketch the basic structure of the class:

In [35]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        pass
    def forward(self, input, hidden):
        pass
    def initHidden(self):
        pass

What's going on here? As we noted, our **RNN** class inherits from __nn.Module__. We also use **super()** to call the __init__ of **nn.Module** (see https://realpython.com/python-super/ for an explanation of __super()__ - it's kinda confusing)

Our **init** is going to take three arguments (in addition to __self__). For our simple RNN, we use identical input, hidden, and output sizes (each are vectors of length $4$)

For **forward**, we need to pass an __input__ Tensor (we're going to have to redefine our characters as Tensors!) and our hidden state prior to passing the data through the network. Our hidden state is zero-initialized and updated based on our inputs and prior hidden state as we propagate our data.

**initHidden** is simple. We don't need to pass it any arguments. It'll just take our __hidden_size__ from **init** and define our vector of zeroes for initializing our hidden state.

Let's start filling out our class methods! For **init**, we need to define the following:

1) The connection between our input and hidden state. We use **nn.Linear** to define a linear transformation. By default, __nn.Linear__ will apply a bias vector as well as a weight matrix to each input vector. In the tutorial, the input and hidden vectors are concatenated, and thus the input to **nn.Linear** is __input_size + hidden_size__. We won't do this; we'll just pass **input_size** as our input and __hidden_size__ as our output. 

2) A line of code (not in the PyTorch tutorial) to define the "loop" component of updating our hidden state. Recall that we need to apply **tanh** to the sum of the linear transformations of our prior hidden vector and input vector. Here we transform our hidden vector without bias.

3) The connection between our hidden and output state. Again, the tutorial concatenates input and hidden vectors. Instead, we'll pass our **hidden_size** as input and __output_size__ as output.

4) A softmax function to apply to our output. Following the PyTorch tutorial, we actually use logged softmax (**nn.LogSoftmax**) and pass the argument __dim=1__ to apply softmax to the column dimension (for rows, use **dim=0**)

Let's see how our network looks after building out our **init** method:

In [36]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.xh = nn.Linear(input_size, hidden_size)
        self.hh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.hy = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    def forward(self, input, hidden):
        pass
    def initHidden(self):
        pass

Now let's define our **forward** method. Again, if you're following along with the tutorial, we'll keep our network simple and skip the concatenation of input and hidden layers. We do three things here:

1) Update our hidden state by applying **tanh** to the summed linear transformations of our input and prior hidden state

2) Apply **self.yh** to our updated hidden state

3) Apply softmax to the output of 3)

We return both our output and our new hidden state:

In [37]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.xh = nn.Linear(input_size, hidden_size)
        self.hh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.hy = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    def forward(self, input, hidden):
        hidden = self.hh(hidden) + self.xh(input)
        output = self.hy(hidden)
        output = softmax(output)
        return output, hidden
    def initHidden(self):
        pass

Now all we need to do is define our **initHidden**. Again, we just need this to return a vector of zeroes with length equal to the size of our hidden state. We use __torch.zeros__ instead of **np.zeros**:

In [38]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.xh = nn.Linear(input_size, hidden_size)
        self.hh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.hy = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    def forward(self, input, hidden):
        hidden = self.hh(hidden) + self.xh(input)
        output = self.hy(hidden)
        output = self.softmax(output)
        return output, hidden
    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

How about we call an instance of our **RNN**? We'll set the size of our input, hidden, and output vectors to 4:

In [39]:
rnn = RNN(4,4,4)
rnn

RNN(
  (xh): Linear(in_features=4, out_features=4, bias=True)
  (hh): Linear(in_features=4, out_features=4, bias=False)
  (hy): Linear(in_features=4, out_features=4, bias=True)
  (softmax): LogSoftmax()
)

Now let's pass our input character 'H' through the network. We need to convert it to a Tensor first. We use **torch.tensor** and specify __requires_grad=False__. This tells PyTorch not to compute *gradients* or derivatives for the Tensor values. We also convert this input to a **float**, the default data type for Tensor arithmetic in PyTorch. Finally, we want to resize this input as a row vector (dimensions \[1, 4\]). The **view** method does this for us; we simply pass our new dimensions as arguments:

In [40]:
input = torch.tensor(H, requires_grad=False).float()
print(input.shape)
input = input.view(1,4)
print(input.shape)
input

torch.Size([4, 1])
torch.Size([1, 4])


tensor([[1., 0., 0., 0.]])

We can also initialize our hidden length 4 vector of zeroes:

In [41]:
hidden = torch.zeros(1,4)
print(hidden.shape)
hidden

torch.Size([1, 4])


tensor([[0., 0., 0., 0.]])

Now we can use the **rnn** instance of our __RNN__ to generate the output of passing 'H' through our network, along with our next hidden state:

In [42]:
output, next_hidden = rnn(input, hidden)
print(output)
print(next_hidden)

tensor([[-1.2170, -1.5799, -1.5085, -1.2850]], grad_fn=<LogSoftmaxBackward>)
tensor([[-0.1291, -0.8088,  0.6089,  0.3288]], grad_fn=<AddBackward0>)


Sick! Now we need a loss function. Following the PyTorch tutorial, we'll use **nn.NLLLoss**. This is the *negative log likelihood loss*, which is the same as exponentiating each of our logged softmax outputs (so we've essentially canceled out the log operation) and applying _cross entropy loss_: taking the negative natural logarithm of the predicted probability for the *ground truth* class. See https://discuss.pytorch.org/t/understanding-nllloss-function/23702. We go into greater detail about cross-entropy loss in our Introduction to Neural Networks tutorial:

In [43]:
criterion = nn.NLLLoss()

We also need an *optimizer*. This will tell the backpropagation machine inside of PyTorch how to update our weights and biases. We'll use the _Adam_ optimizer (**torch.optim.Adam**), which uses momentum when performing parameter updates. Our *learning rate* (__lr__) controls the magnitude of our parameter updates. We multiply the learning rate by our gradients and update parameters accordingly. Let's set our learning rate to 0.01:

In [44]:
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.01)

Now let's compute our loss. We do this by applying our **criterion** to our model __output__ and our ground truth output 'E' (remember, 'E' comes next after 'H'). We represent our **target** 'E' as a *class index* ranging from 0 to the total number of classes minus one. In other words, we have four classes (zero-indexed), and thus we use the index 1 to represent 'E' as the second class in the class vector ('H','E','L','A'). We define this index as a **long** Tensor:

In [45]:
target = torch.tensor([1]).long()

Now we can apply our loss to our output and target and backpropagate:

In [46]:
loss = criterion(output, target)
loss.backward()

Very cool. What if we want to do more complicated text generation? What if we want to work with sequences of lengths longer than five characters? It turns out that "vanilla" RNNs aren't so great at learning from long sequences. Think about the process by which we update our hidden state. As we *backpropagate through time*, we compute the partial derivative of our loss function with respect to our weight at each point in the sequence and then multiply these derivatives together. This often leads to _exploding gradients_(!) in which we perform suboptimally large weight updates, or *vanishing gradients* in which we perform suboptimally small weight updates (and thus our network doesn't learn).

How do we solve this issue? We introduce *gates* into our hidden state. This particular type of "gated" network is called a _Long short-term memory network_, or *LSTM*. There are two great blogs on this, and I'll be drawing from both for reference: 

Chris Olah: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Adam Paszke: https://apaszke.github.io/lstm-explained.html

Our first gate is called a *forget gate*. This decides how much information we retain (or forget) from the input and prior hidden state. This is simply the application of a _sigmoid_ function (https://en.wikipedia.org/wiki/Sigmoid_function) to the summed linear transformations of our input and prior hidden state, with a bias term added at the end. In other words, we perform the linear transformations of input and prior hidden state without bias, add a bias term to these summed linear transformations, and pass this output through a sigmoid function. Let's call our forget gate **F**.

First, we initialize our hidden state as a vector of values sampled from a standard normal distribution (using __torch.randn__). From this point on, we'll be using random rather than zero initialization for our hidden states. Let's stick with **hidden_size** to define the length of our hidden state vector. We'll keep __hidden_size__, **input_size**, and __output_size__ equal to 4:

In [47]:
hidden_size = 4
input_size = 4
output_size = 4

Now we can initialize our hidden state:

In [48]:
h = torch.randn(1, hidden_size)
h

tensor([[ 0.2458, -1.2905, -0.0355, -1.5339]])

Now we can define our linear transformations: input-to-hidden and hidden-to-hidden with input and output of 4. For simplicity, we'll keep the length of all vectors in this exercise equal to 4. We'll set **bias=False**: we'll just multiply our input and hidden layers by weight matrices:

In [49]:
xh_forget = nn.Linear(4,4, bias=False)
hh_forget = nn.Linear(4,4, bias=False)
print(xh_forget)
print(hh_forget)

Linear(in_features=4, out_features=4, bias=False)
Linear(in_features=4, out_features=4, bias=False)


We also need a bias vector. Let's call it **b_h**. We can randomly initialize its values using __torch.randn__ and convert it to a **float**:

In [50]:
b_forget = torch.randn(1,4).float()
print(b_forget.type())
print(b_forget)

torch.FloatTensor
tensor([[ 0.2700,  1.3868, -2.2986, -1.3354]])


We'll add the output of our linear transformations and bias vector to update our hidden state. Let's use **input** again for our input:

In [51]:
input

tensor([[1., 0., 0., 0.]])

As we can see, this is the 'H' tensor. Let's use it, along with the existing/prior (depending on your temporal perspective) value of our hidden state **h** to define the raw output of our forget gate (prior to applying our activation):

In [52]:
forget_gate = xh_forget(input) + hh_forget(hidden) + b_forget
forget_gate

tensor([[-0.1588,  1.2220, -1.9492, -0.8783]], grad_fn=<AddBackward0>)

Sick! So far we're just doing what we did with the vanilla RNN. However, instead of passing the output **h** of our summed matrix-vector multiplications through a *tanh* activation, we'll use a _sigmoid_ function. We'll call the "activated" output of our forget gate __F__:

In [53]:
F = torch.sigmoid(forget_gate)
F

tensor([[0.4604, 0.7724, 0.1246, 0.2935]], grad_fn=<SigmoidBackward>)

So we've defined which information we're going to *remember* (or forget). Now we need to update our hidden state. We'll do this the same way we did with the vanilla RNN! The two linear transformations will also be identical to the ones we just applied to the input and hidden state in the forget gate. We'll also apply a bias afterward, identical to **b_h**, and pass the summed output through a sigmoid function. We call this our *input gate*.

First let's define a new pair of linear transformations (without bias) along with a new bias vector:

In [54]:
xh_input = nn.Linear(4,4, bias=False)
hh_input = nn.Linear(4,4, bias=False)
b_input = torch.randn(1,4).float()
print(xh_input)
print(hh_input)
print(b_input)

Linear(in_features=4, out_features=4, bias=False)
Linear(in_features=4, out_features=4, bias=False)
tensor([[-0.5973,  0.1780,  1.1972,  0.4832]])


Now let's sum our linear transformations and bias:

In [55]:
input_gate = xh_input(input) + hh_input(hidden) + b_input
input_gate

tensor([[-0.5818,  0.1083,  0.9470, -0.0015]], grad_fn=<AddBackward0>)

Finally, we pass our summed linear transformations and bias through the **sigmoid** function. We'll call this __I__:

In [56]:
I = torch.sigmoid(input_gate)
I

tensor([[0.3585, 0.5270, 0.7205, 0.4996]], grad_fn=<SigmoidBackward>)

Next we "create a vector of candidate values" (borrowing language from https://colah.github.io/posts/2015-08-Understanding-LSTMs/). Again, we take our network input and prior/existing hidden state as inputs, apply linear transformations without bias, and add a bias. Again, we'll apply a sigmoid function to the output:

In [57]:
xh_candidate = nn.Linear(4,4, bias=False)
hh_candidate = nn.Linear(4,4, bias=False)
b_candidate = torch.randn(1,4).float()
print(xh_candidate)
print(hh_candidate)
print(b_candidate)

Linear(in_features=4, out_features=4, bias=False)
Linear(in_features=4, out_features=4, bias=False)
tensor([[-0.2224, -1.0292,  1.8480,  0.2707]])


Now we can define our **candidate_vector** (prior to activation) as the sum of our transformed input and hidden vectors, plus our bias term:

In [58]:
candidate_vector = xh_candidate(input) + hh_candidate(hidden) + b_candidate
candidate_vector

tensor([[-0.4017, -0.5760,  1.5157,  0.4986]], grad_fn=<AddBackward0>)

Let's also pass this output through a **sigmoid** activation:

In [59]:
C = torch.sigmoid(candidate_vector)
C

tensor([[0.4009, 0.3598, 0.8199, 0.6221]], grad_fn=<SigmoidBackward>)

We use this "activated candidate vector" **C** (I think I'm making up this terminology) and add it to *another* set of linearly transformed (without bias) input and hidden vectors:

In [60]:
xh_transform = nn.Linear(4,4, bias=False)
hh_transform = nn.Linear(4,4, bias=False)
b_transform = torch.randn(1,4).float()
print(xh_transform)
print(hh_candidate)
print(b_candidate)

Linear(in_features=4, out_features=4, bias=False)
Linear(in_features=4, out_features=4, bias=False)
tensor([[-0.2224, -1.0292,  1.8480,  0.2707]])


Now we can transform our input (input + hidden state) again...

In [61]:
transformed_input = xh_candidate(input) + hh_candidate(hidden) + b_candidate
transformed_input

tensor([[-0.4017, -0.5760,  1.5157,  0.4986]], grad_fn=<AddBackward0>)

... but this time we'll apply a **tanh** to our transformed input:

In [62]:
T = torch.tanh(transformed_input)
T

tensor([[-0.3814, -0.5198,  0.9079,  0.4610]], grad_fn=<TanhBackward>)

Next step! We'll define an *output gate*. This is the linear transformation of our input vector and prior hidden vectors (without bias), plus a bias. Then we apply a sigmoid activation to this output. Sound familiar?

In [63]:
xh_output = nn.Linear(4,4, bias=False)
hh_output = nn.Linear(4,4, bias=False)
b_output = torch.randn(1,4).float()
print(xh_output)
print(hh_output)
print(b_output)

Linear(in_features=4, out_features=4, bias=False)
Linear(in_features=4, out_features=4, bias=False)
tensor([[ 0.1209,  0.3103, -0.9206,  0.3118]])


Once again, we pass our input and hidden state through the gate...

In [64]:
output_gate = xh_candidate(input) + hh_candidate(hidden) + b_candidate
output_gate

tensor([[-0.4017, -0.5760,  1.5157,  0.4986]], grad_fn=<AddBackward0>)

... and apply a sigmoid activation to the output:

In [65]:
O = torch.sigmoid(output_gate)
O

tensor([[0.4009, 0.3598, 0.8199, 0.6221]], grad_fn=<SigmoidBackward>)

In addition to our input and hidden state, we also need to initialize an LSTM cell state. As with our hidden state, we can initialize by sampling from a standard normal distribution. We'll keep our length equal to 4:

In [66]:
cell_size = 4
c = torch.randn(1, cell_size)
c

tensor([[-0.6966,  0.5522, -0.5823, -1.0482]])

Now we can update our LSTM cell state! Here's how we do it:

1) Elementwise-multiply the output of forget gate __F__ by our prior cell state **c**

2) Elementwise-multiply the output of input gate **I** by our transformed input __T__

3) Add the two

We'll update the value of our LSTM cell state **c** as we go along:

In [67]:
c_forget = F * c
print(c_forget)
c_transformed = I * T
print(c_transformed)
c = c_forget + c_transformed
print(c)

tensor([[-0.3207,  0.4265, -0.0726, -0.3077]], grad_fn=<MulBackward0>)
tensor([[-0.1368, -0.2739,  0.6542,  0.2303]], grad_fn=<MulBackward0>)
tensor([[-0.4574,  0.1526,  0.5816, -0.0773]], grad_fn=<AddBackward0>)


Since we've initialized our LSTM cell state as a vector of zeroes, **c_forget** will also be a vector of zeroes, and __c__ = **c_transformed**.

Updating our hidden state is as simple as applying **tanh** to our updated LSTM state and elementwise-multiplying this output by the output of our output gate __O__ (lol):

In [68]:
h = O * torch.tanh(c)
h

tensor([[-0.1716,  0.0545,  0.4295, -0.0480]], grad_fn=<MulBackward0>)

Very cool. PyTorch can make this much simpler! Let's take a look at how to build an LSTM in PyTorch. Check out the documentation for reference: https://pytorch.org/docs/master/nn.html#torch.nn.LSTM

Let's define an **nn.LSTM** with input and hidden state vectors of length 4. Since our LSTM takes one layer by default, it's not necessary to define __num_layers = 1__, but we'll do it anyway for sake of example:

In [69]:
lstm = nn.LSTM(input_size=4, hidden_size=4, num_layers=1)
lstm

LSTM(4, 4)

There are three dimensions to our input. Let's say we defined a *random* input. We could sample from a normal distribution with these arguments:

1) **seq_len**: the number of time steps for our input. If we're just using our 'H' to predict 'E', we're only doing one time step. Let's stick with that.

2) **batch**: the size of our batch of inputs. We only have one example here, so __batch = 1__

3) **input_size = 4**, as with the argument we passed to __nn.LSTM__

This is a solid explanation of what's going on here: https://stackoverflow.com/questions/45022734/understanding-a-simple-lstm-pytorch

Here's what we'd get with random sampling:

In [70]:
input = torch.randn(1, 1, 4)
print(input.size())
input

torch.Size([1, 1, 4])


tensor([[[-0.0367,  0.0207,  0.9240,  1.0845]]])

However, we already have an input: our 'H' vector. Let's represent our 'H' as a tensor of size \[1, 1, 4\]. We will also have to cast this to a float (it is *long* by default):

In [71]:
input = torch.tensor([[[1,0,0,0]]]).float()
print(input.size())
input

torch.Size([1, 1, 4])


tensor([[[1., 0., 0., 0.]]])

Now let's initialize our hidden weights. Instead of using a zero vector, we'll again sample from a standard normal distribution. Our hidden vector is initialized randomly in three dimensions:

1) **num_layers**: we'll stick with our simple LSTM with one hidden layer

2) **batch**: again, we're only going to use one example

3) **hidden_size = 4**, as with the argument passed to __nn.LSTM__

In [72]:
h = torch.randn(1, 1, 4)
h

tensor([[[ 0.1794,  0.3520,  0.3231, -0.9658]]])

Up next is our LSTM state. This takes the same dimensions as our hidden layer. Again we'll stick with one layer, batch size of one, and **output_size** (instead of hidden_size) of 4:

In [73]:
c = torch.randn(1, 1, 4)
c

tensor([[[ 0.1698,  0.7908, -0.5226, -1.8723]]])

We can now pass the outputs of these functions as arguments to our **lstm** function:

In [74]:
output, (h, c) = lstm(input, (h, c))

Let's examine our output and updated hidden and LSTM cell states!

In [75]:
print(output)
print(h)
print(c)

tensor([[[ 0.0395, -0.0341,  0.0680, -0.3077]]], grad_fn=<StackBackward>)
tensor([[[ 0.0395, -0.0341,  0.0680, -0.3077]]], grad_fn=<StackBackward>)
tensor([[[ 0.1224, -0.0679,  0.1146, -0.8863]]], grad_fn=<StackBackward>)


Very cool. Note that our updated hidden state is equivalent to our output!

We're going to take a look now at one more common type of RNN: the **gated recurrent unit (GRU)**. The GRU computes four vectors (see https://pytorch.org/docs/master/nn.html#torch.nn.LSTM):

1) A *reset gate* vector **r**

2) An *update gate* vector **z**

3) A *"new gate"* vector **n**

4) A *hidden state* vector **h**

Let's go through each! PyTorch will automatically compute these gates, but it's good to know what's going on inside the GRU.

Let's start with our reset gate. This takes an input vector (we'll reuse the **input** tensor from our LSTM example) and our hidden state as inputs. This means we'll have to do some more random initialization. Let's stick with the dimensions \[num_layers=1, batch_size=1, hidden_size=4\]

In [76]:
h = torch.randn(1, 1, 4)
h

tensor([[[-1.7185,  0.3538,  0.4405,  1.6638]]])

Now we need to apply a linear transformation *with* bias. It's not necessary to specify __bias=True__ but we'll do it anyway to be explicit. We could do this by defining a weight matrix and multiplying the weight matrix by our initial hidden vector, or we can take a shortcut and use **nn.Linear** with four inputs and four outputs. We'll call this operation __xr__, since it takes our input (by convention, $x$) and generates an output necessary to compute our reset gate vector **r**:

In [77]:
xr = nn.Linear(4, 4, bias=True)
xr

Linear(in_features=4, out_features=4, bias=True)

We also need to transform linearly (not splitting the infinitive this time) our hidden state. Let's call this linear transformation **hr**:

In [78]:
hr = nn.Linear(4, 4, bias=True)
hr

Linear(in_features=4, out_features=4, bias=True)

To get our reset gate **r**, we just need to sum our transformed input and hidden states and pass this sum through a sigmoid function:

In [79]:
r = torch.sigmoid(xr(input) + hr(h))
r

tensor([[[0.4386, 0.4983, 0.1808, 0.7963]]], grad_fn=<SigmoidBackward>)

Good stuff. We use identical operations to compute our update gate. We'll call our linear transformations **xz** and __hz__ since we're taking our inputs (**x, h**) and generating output **z**:

In [80]:
xz = nn.Linear(4, 4, bias=True)
print(xz)
hz = nn.Linear(4, 4, bias=True)
print(hz)
z = torch.sigmoid(xz(input) + hz(h))
z

Linear(in_features=4, out_features=4, bias=True)
Linear(in_features=4, out_features=4, bias=True)


tensor([[[0.3397, 0.2081, 0.6019, 0.4117]]], grad_fn=<SigmoidBackward>)

Next we'll compute our "new gate." Again, we'll linearly transform (with bias) our input vector and hidden state. Let's call these linear transformations **xn** and __hn__:

In [81]:
xn = nn.Linear(4, 4, bias=True)
print(xn)
hn = nn.Linear(4, 4, bias=True)
print(hn)

Linear(in_features=4, out_features=4, bias=True)
Linear(in_features=4, out_features=4, bias=True)


Now we do something funky. We elementwise multiply our computed reset gate vector **r** by the output of our transformed hidden state:

In [82]:
n = r * hn(h)
n

tensor([[[-0.1192,  0.1106, -0.1692,  0.7538]]], grad_fn=<MulBackward0>)

We then add this to our transformed input:

In [83]:
n = n + xn(input)
n

tensor([[[-0.4798,  0.3850,  0.0886,  0.7907]]], grad_fn=<AddBackward0>)

Finally we use **n** and __z__, along with our current hidden state to generate a new hidden state. 

First we elementwise multiply **z** by our current hidden state and update this state:

In [84]:
print(z)
h = z * h
h

tensor([[[0.3397, 0.2081, 0.6019, 0.4117]]], grad_fn=<SigmoidBackward>)


tensor([[[-0.5837,  0.0736,  0.2652,  0.6849]]], grad_fn=<MulBackward0>)

Then we multiply (1 - z) by our transformed input **n** and add it to our updated h:

In [85]:
print(1 - z)
h = (1 - z) * n + h
h

tensor([[[0.6603, 0.7919, 0.3981, 0.5883]]], grad_fn=<RsubBackward1>)


tensor([[[-0.9005,  0.3785,  0.3005,  1.1502]]], grad_fn=<AddBackward0>)

Excellent. Now let's do this in a few lines of PyTorch code. First we'll define a GRU with arguments **seq_length**, __batch__, and **input_size**. We'll keep these identical to our LSTM (and what we've done in the above example):

In [86]:
gru = nn.GRU(input_size=4, hidden_size=4, num_layers=1)
gru

GRU(4, 4)

Now we can simply pass our **input** and hidden state __h__ through the GRU. Let's initialize **h** again since we just updated it:

In [87]:
h = torch.randn(1, 1, 4)
h

tensor([[[-0.3053,  0.2382,  1.5382, -1.3813]]])

NOW let's pass our input and hidden state through the GRU:

In [88]:
output, h = gru(input, h)
print(output)
print(h)

tensor([[[-0.2184,  0.2073,  0.5252, -0.0426]]], grad_fn=<StackBackward>)
tensor([[[-0.2184,  0.2073,  0.5252, -0.0426]]], grad_fn=<StackBackward>)


Again, our output and updated hidden state should be identical.

Hopefully you'll now have a good idea how to construct RNNs in PyTorch, including LSTMs and GRUs. There's still a lot to cover, including practical applications and theoretical justifications for the design of particular gates.

If you have any questions, comments, or suggestions, please contact Pat at pdonnelly@groupwaretech.com!