# Simple RNN TF1

**This is the implementation for TensorFlow 1.0 and Python 3.4. It uses the TF-RNN library for training.** For a more manual implementation, see simple_rnn.

In this notebook we consider a simple example of an RNN and use a rather artificial data-generating process (if you have a better idea or story, please contact me).

The example has been motivated by:
http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html. 

Other Resources for RNNs:

* http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
* http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html
* http://karpathy.github.io/2015/05/21/rnn-effectiveness/
* http://colah.github.io/posts/2015-08-Understanding-LSTMs/
* http://www.deeplearningbook.org/contents/rnn.html

In [1]:
from six.moves.cPickle import loads
import numpy as np
import sys
np.random.seed(42)
import tensorflow as tf

%matplotlib inline
import matplotlib.pyplot as plt
tf.__version__, sys.version_info

('1.0.0',
 sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0))

In [2]:
# Global config variables (see below)
num_steps = 40     # number of truncated backprop steps
batch_size = 200  # number of sequences per minibatch (index b)
num_classes_in = 3   # number of classes in the input
num_classes_out = 2   # number of classes in the output
state_size = 4    # dimension of the hidden state
learning_rate = 0.1

# Helper functions
def one_hot(Y, num_classes):
    # One row per entry of Y, with a single 1 in the column given by the class id
    d = np.zeros((len(Y), num_classes), dtype='int32')
    for row, col in enumerate(Y):
        d[row, col] = 1
    return d

### Definition of the task

We consider a network that predicts, at each point in time $t$, a variable $\hat{y}_t$ based on the covariates $x_{t'}$ at the current and earlier time points $t' \le t$.

### Example data  (I screama, you screama, we all screama for I screama)

We need some data to play around with RNNs. They are capable of doing quite complicated things such as language modelling. For this example, we want to generate the data ourselves. We have to come up with a process which creates $x_t$, which itself can be influenced by events $x_{t'}$ that happened before $t$. Further, we have to come up with $y_t$, which depends on $x_{t'}$ for time points $t' \le t$.

To keep it simple, we analyse the following rather artificial process, in which the weather $x_{t'}$ for $t' \le t$ influences our stock of ice cream $y_t$. We then see whether the RNN is capable of reconstructing that process.

#### Definition of the simple process
The weather $x_t$ at a certain point in time $t$ has three states (sunny, cloudy, rainy), which we model as $x_t = (1,0,0)$, $x_t = (0,1,0)$, and $x_t = (0,0,1)$ respectively. We assume that the weather is completely random (of course we could model more complex scenarios).

We have an ice cream store that can hold 2 units of ice cream, and we start with a full store. When it is sunny, we sell one unit of ice cream (if we have any). We have the strange policy of ordering one unit of ice cream whenever it is cloudy. Delivery takes 3 days, and we accept the delivery only if the store is not full.

This lets us model $y_t$, the state of the store, as $(1,0)$ for out of stock and $(0,1)$ for in stock. We create the one-hot-encoded data in the graph later. For now we use integers, but keep in mind that the data is categorical.

**The important part is that we have values $y_t$ which can be predicted from earlier** $x_t$s.

In [3]:
def gen_data(size=1000000):
    Xs = np.random.choice(3, size=(size,))  # random weather: 0 = sunny, 1 = cloudy, 2 = rainy
    Y = []
    ice = 2  # our stock of ice cream at the start
    for t, x in enumerate(Xs):
        # (t - 3) >= 0: the first ice cream can be delivered on day 3
        # Xs[t - 3] == 1: cloudy three days ago => we ordered ice cream
        # ice < 2: the store is not full
        if (t - 3) >= 0 and Xs[t - 3] == 1 and ice < 2:
            ice += 1
        if x == 0:  # it is sunny, so we sell one unit if we have any
            if ice > 0:  # we have ice cream
                ice -= 1
        if ice > 0:  # we are not out of stock
            Y.append(1)
        else:
            Y.append(0)
    return Xs, np.array(Y)

In [4]:
X_train, Y_train = gen_data(50000) #Global variables holding the input and output
(X_train[0:50], Y_train[0:50])

(array([2, 0, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 0, 2, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        1, 0, 0, 0, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2,
        1, 0, 1, 1]),
 array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1]))
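
The first values of this sample can be checked by hand. The sketch below re-implements the stock rule in a small helper (the name `simulate_stock` and its parameters are made up for this illustration) and applies it to the first 13 weather values printed above. It assumes the coding used in `gen_data`: 0 triggers a sale and 1 triggers an order.

```python
def simulate_stock(weather, capacity=2, delay=3):
    """Return 1 (in stock) or 0 (out of stock) for every day."""
    ice = capacity  # start with a full store
    stock = []
    for t, x in enumerate(weather):
        # An order placed `delay` days ago arrives today if there is room.
        if t >= delay and weather[t - delay] == 1 and ice < capacity:
            ice += 1
        # Weather code 0: sell one unit if any is left.
        if x == 0 and ice > 0:
            ice -= 1
        stock.append(1 if ice > 0 else 0)
    return stock

weather = [2, 0, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 0]
stock = simulate_stock(weather)
print(stock)
```

The result agrees with the first 13 entries of `Y_train`: the order placed on day 7 arrives on day 10, which is where the stock indicator switches back to 1.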

### Forward pass in numpy
To illustrate the method, we first do a forward pass of the RNN in NumPy. We load the weights, which were computed previously by the training cells further down.

In [5]:
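# The file holds a list of arrays of different shapes, which NumPy stores as a pickled object array
W_, b_, V_, bv_ = np.load('rnn_weights_tf1.npy', allow_pickle=True)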

### Architecture of the network 
We now define the network; the output nodes are not considered yet.
A single RNN cell is shown in the figure below in the middle:

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png)
Image taken from: [Colah's RNN Blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

The two lines coming from the previous state $h_{t-1}$ and the current input $x_t$ are joined by concatenation into a vector $[x_{t}, h_{t-1}]$ of size `num_classes_in + state_size` (TensorFlow puts the input first). Instead of concatenating, one could also use two matrices $W_x$ and $W_h$ and keep input and state separate. This is mathematically identical, since $W_x$ and $W_h$ are just the upper and lower rows of $W$. The new state $h_t$ is then calculated as:

$$
    h_{t} = \tanh([x_{t}, h_{t-1}] \cdot W + b) = \tanh(x_{t} \cdot W_x + h_{t-1} \cdot W_h + b)
$$
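
The equivalence of the concatenated form and the two-matrix form can be checked numerically. This is a minimal sketch with random weights; the names `W_x` and `W_h` for the two blocks of `W` and the ordering input first, state second (as in the NumPy cells of this notebook) are assumptions of the example.

```python
import numpy as np

rng = np.random.RandomState(0)
num_in, num_state = 3, 4  # same sizes as num_classes_in and state_size

W = rng.normal(size=(num_in + num_state, num_state))
b = rng.normal(size=num_state)
x_t = np.array([0.0, 1.0, 0.0])      # one-hot input
h_prev = rng.normal(size=num_state)  # some previous state

# Form 1: concatenate input and state and use the single matrix W.
h_concat = np.tanh(np.matmul(np.concatenate([x_t, h_prev]), W) + b)

# Form 2: split W into the rows acting on the input and on the state.
W_x, W_h = W[:num_in], W[num_in:]
h_split = np.tanh(np.matmul(x_t, W_x) + np.matmul(h_prev, W_h) + b)

print(np.allclose(h_concat, h_split))
```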

The dynamics of the hidden state $h_{t}$ are determined by $W$ (and $b$):

In [6]:
W_, b_

(array([[-0.19342743,  1.20125043,  0.03041052, -0.7982803 ],
        [-1.17504787, -0.35316652,  0.68581891,  1.42157817],
        [-0.18561231, -0.16853848,  0.71172661, -0.50630069],
        [ 1.03420258,  0.59627861, -0.9009819 ,  0.35502589],
        [-0.15730815,  1.02999747, -0.86277425, -0.35499594],
        [ 0.16915447, -1.24952602,  0.11196493,  0.41059464],
        [-1.52237487,  0.30463067, -1.88313866, -1.14694726]], dtype=float32),
 array([ 0.12549509,  0.29770762,  0.21737964, -0.32890254], dtype=float32))

In [7]:
# The first state
h0 = np.zeros(state_size) #We start with 0 initial state
x1 = one_hot(X_train, num_classes_in)[0] #Make a vector

#<---- your code here (calculate the hidden state h1) ---->
h1 = np.tanh(np.matmul(np.concatenate([x1, h0]), W_) + b_)
#<---- end your code here ---->

print(h0, "--->", h1)

[ 0.  0.  0.  0.] ---> [-0.0600449   0.12845552  0.73017693 -0.68326001]
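
As a cross-check, the printed value of $h_1$ can be reproduced by hand. A minimal sketch, assuming the displayed (rounded) entries of `W_` and `b_` and the first weather value `X_train[0] = 2`: because $h_0 = 0$ and the input is one-hot, the matrix product reduces to picking row 2 of $W$.

```python
import numpy as np

# Values copied from the printed W_ and b_ (rounded as displayed).
W_row_2 = np.array([-0.18561231, -0.16853848, 0.71172661, -0.50630069])
b = np.array([0.12549509, 0.29770762, 0.21737964, -0.32890254])

# With h0 = 0 and x1 = (0, 0, 1), the product of [x1, h0] and W is row 2 of W.
h1 = np.tanh(W_row_2 + b)
print(h1)
```

Up to rounding, this agrees with the state printed by the cell above.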


We could repeat those transitions of the hidden states to get a sequence of hidden states:

$h_0 \rightarrow h_1 \rightarrow h_2 \rightarrow h_3 \rightarrow h_4 \ldots $

In [8]:
def rnn_forward(state, X_train):
    hs = []
    for t in range(len(X_train)):
        # Note that TF concatenates [Input, State]
        state = np.tanh(np.matmul(np.concatenate([X_train[t,:],state]), W_) + b_)
        hs.append(state)
    return hs

In [9]:
rnn_forward(h0, one_hot(X_train[0:5],num_classes_in))

[array([-0.0600449 ,  0.12845552,  0.73017693, -0.68326001]),
 array([ 0.76718624,  0.44218723,  0.91533763, -0.11018243]),
 array([ 0.75578944, -0.13440726,  0.16483438, -0.21423011]),
 array([ 0.79930566,  0.16854152,  0.65613562, -0.20291764]),
 array([ 0.81844978,  0.85311612, -0.16078928, -0.38088856])]

We now add the output. For each time step, the output is produced by multiplying the hidden state with $V$ and adding the bias $b_{\tt{v}}$:

$o_t = h_t \cdot V + b_{\tt{v}}$

This is a logit; the final probability of each output class is the softmax of the logits.
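
For logits of moderate size the softmax can be computed directly from its definition. The sketch below is a slightly more defensive variant that subtracts the maximum logit first, which leaves the result unchanged but avoids overflow for large logits. The helper name `softmax` and the logit values are made up for this illustration.

```python
import numpy as np

def softmax(logits):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([0.5, -1.0])  # made-up logits for the two output classes
probs = softmax(logits)
naive = np.exp(logits) / np.sum(np.exp(logits))
print(probs, np.allclose(probs, naive))
```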

In [10]:
#<---- your code here (calculate the output state o1 for timestep 1 from h1, V_ and the bias bv_) ---->
o1 = np.matmul(h1, V_) + bv_
#<---- your code here (calculate probability from the state o1) ---->
prob_1 = np.exp(o1)/np.sum(np.exp(o1))
#<---- end your code here  ---->
o1, prob_1

(array([-0.1591476,  0.545742 ]), array([ 0.33072903,  0.66927097]))

In [11]:
h = rnn_forward(h0, one_hot(X_train,3))
pt = []
for t in range(len(h)):
    ot = np.matmul(h[t], V_) + bv_
    pt.append(np.exp(ot)/np.sum(np.exp(ot)))

In [12]:
pt[0:10], np.argmax(pt[0:30],axis=1), Y_train[0:30]

([array([ 0.33072903,  0.66927097]),
  array([ 0.75566483,  0.24433517]),
  array([ 0.15560883,  0.84439117]),
  array([ 0.42810145,  0.57189855]),
  array([ 0.97741564,  0.02258436]),
  array([ 0.98956735,  0.01043265]),
  array([ 0.9835644,  0.0164356]),
  array([ 0.97277593,  0.02722407]),
  array([ 0.98552958,  0.01447042]),
  array([ 0.96432936,  0.03567064])],
 array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1]))

In [13]:
np.average(np.argmax(pt, axis=1) == Y_train)

0.98758000000000001

In [14]:
tot_loss = 0
for i in range(len(Y_train)):
    tot_loss += -np.log(pt[i][Y_train[i]])

In [15]:
tot_loss / len(Y_train)

0.060004378195400249
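
The accuracy and the average cross-entropy loss can also be computed without an explicit Python loop. A minimal sketch on made-up probabilities and labels (not the ice cream data), using integer-array indexing to pick the probability of the true class at every time step:

```python
import numpy as np

# Toy predictions for 4 time steps and 2 classes (made-up numbers).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
labels = np.array([0, 1, 1, 1])

# Probability assigned to the true class at every step.
p_true = probs[np.arange(len(labels)), labels]

accuracy = np.mean(np.argmax(probs, axis=1) == labels)
mean_loss = np.mean(-np.log(p_true))
print(accuracy, mean_loss)
```

Applied to `pt` (converted with `np.array(pt)`) and `Y_train`, the same two lines should give the numbers computed by the loops above.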

## Training in TensorFlow
### Preparation of the Minibatch

In this example we have, in principle, one long stream of data $x$ and $y$. For efficiency reasons we split the stream into minibatches of a certain length. For this task we could also imagine having several realizations of the ice cream process, so splitting the process into minibatches would also be natural.

For simplicity, we create the minibatch by randomly cutting out `batch_size` sequences of fixed length `num_steps`. Other, more advanced ways of doing so are possible; see e.g. https://danijar.com/variable-sequence-lengths-in-tensorflow/. For the time being, we thus consider the input tensor $X_{btc}$ for the minibatch to be of the following form:

* $b$ having `batch_size` entries
* $t$ loops over the unrolled timestamps (`num_steps`)
* $c$ has the dimension of the one-hot-coded classes (the one-hot-encoding will be done in the graph)
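
To make the three indices concrete, the following sketch builds such a tensor in NumPy from a tiny integer-coded batch. The batch values are made up; in the notebook itself the one-hot encoding is done inside the TensorFlow graph.

```python
import numpy as np

num_classes = 3
# Integer-coded minibatch: 2 sequences (b) with 4 time steps (t) each.
X_int = np.array([[0, 1, 2, 2],
                  [1, 0, 0, 1]])

# Indexing the identity matrix with the class ids appends the class axis c.
X_btc = np.eye(num_classes, dtype=np.int32)[X_int]
print(X_btc.shape)
```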

In [16]:
def get_batch(Xs, Ys, batch_size = 32, num_steps = 50):
    data_x = np.zeros([batch_size, num_steps], dtype=np.int32)
    data_y = np.zeros([batch_size, num_steps], dtype=np.int32)
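    # NOTE: the loop starts at 1, so row 0 is never filled and stays all zeros
    # (visible in the printed batch); range(batch_size) would fill every row.
    for i in range(1,batch_size):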
        s = int(np.random.uniform(0, len(Xs) - num_steps))
        data_x[i] = Xs[s : s + num_steps]
        data_y[i] = Ys[s : s + num_steps]  
    return data_x, data_y

In [17]:
X, Y = get_batch(X_train, Y_train, batch_size=3, num_steps=10)
print (X)
print (Y)

[[0 0 0 0 0 0 0 0 0 0]
 [0 1 2 2 2 2 1 1 1 1]
 [1 0 0 1 2 1 0 1 0 1]]
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1 1 1 1]
 [1 0 0 1 1 1 1 1 1 1]]
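
In the printed batch the first row is all zeros, because the loop in `get_batch` starts at index 1 and never fills row 0. The following is a sketch of an alternative sampler that fills every row and avoids the Python loop. The name `sample_batch` and its `rng` parameter are hypothetical, and it is not the function used for the training run recorded in this notebook.

```python
import numpy as np

def sample_batch(Xs, Ys, batch_size=32, num_steps=50, rng=np.random):
    """Cut `batch_size` random windows of length `num_steps` from the stream."""
    # One random start index per row, including row 0.
    starts = rng.randint(0, len(Xs) - num_steps + 1, size=batch_size)
    # Broadcasting builds a [batch_size, num_steps] matrix of positions.
    idx = starts[:, None] + np.arange(num_steps)[None, :]
    return Xs[idx].astype(np.int32), Ys[idx].astype(np.int32)

# Tiny demonstration on a synthetic stream (not the ice cream data).
Xs_demo = np.arange(100)
Ys_demo = Xs_demo * 10
xb, yb = sample_batch(Xs_demo, Ys_demo, batch_size=3, num_steps=5)
print(xb)
print(yb)
```

Each row of `xb` is a run of consecutive stream positions, and `yb` is cut from the same positions, so inputs and labels stay aligned.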


## Using the TensorFlow API

Alternatively, one can use the TensorFlow API for creating RNNs. In principle there are two TensorFlow methods. The first, more or less deprecated one builds a graph from the unrolled network. This API has performance issues: creating the graph takes quite some time, and it is also slower at runtime. Therefore, the newer dynamic API should be preferred. If you want to use sequences of variable length, see: https://danijar.com/variable-sequence-lengths-in-tensorflow/

In [18]:
tf.reset_default_graph()
tf.set_random_seed(42)
# Placeholders
x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')
init_state = tf.zeros([batch_size, state_size])

# RNN Inputs
# One hot encoding.
x_one_hot = tf.one_hot(x, num_classes_in)
# We want the following dimensions: [batch_size, max_length, num_classes_in].
# x_one_hot already has this layout, so perm=(0,1,2) leaves the axes unchanged.
rnn_inputs = tf.transpose(x_one_hot, perm=(0,1,2))
rnn_inputs

<tf.Tensor 'transpose:0' shape=(200, 40, 3) dtype=float32>

### Definition of the basic cell

We have to define the elementary cell, which has a state of a given size. As above we use the basic RNN-Cell which is described in: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py

In [19]:
#cell = tf.nn.rnn_cell.BasicRNNCell(state_size) 
#cell = tf.nn.rnn_cell.BasicLSTMCell(state_size)
# For TF 1.0 the cells have been temporarily moved to contrib, see: https://github.com/tensorflow/models/issues/919
cell = tf.contrib.rnn.BasicRNNCell(state_size)
init_state = cell.zero_state(batch_size, tf.float32)
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state=init_state)

In [20]:
rnn_outputs

<tf.Tensor 'rnn/transpose:0' shape=(200, 40, 4) dtype=float32>

The tensor `rnn_outputs` (indices $b$, $t$, $k$) holds, for each sequence in the minibatch and each time point, a 4-dimensional vector indexed by $k$ (4 = `state_size`; for the basic RNN cell the output equals the hidden state). For each sequence and time point, we later want to compare the prediction derived from it with the corresponding y-value, which has the shape $y_{bt}$. In a first step we flatten the $b$ and $t$ dimensions into a single dimension of size $200 \cdot 40 = 8000$.
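
The flattening can be tried out in NumPy, which uses the same row-major ordering as `tf.reshape`. This sketch uses counting arrays as stand-ins for the RNN output and the labels, and checks that row `b * num_steps + t` of the flattened arrays belongs to sequence `b` and time step `t`.

```python
import numpy as np

batch_size, num_steps, state_size = 200, 40, 4
# Stand-ins for the RNN output and the labels: every entry just counts up.
outputs = np.arange(batch_size * num_steps * state_size).reshape(
    batch_size, num_steps, state_size)
labels = np.arange(batch_size * num_steps).reshape(batch_size, num_steps)

flat_outputs = outputs.reshape(-1, state_size)  # shape (8000, 4)
flat_labels = labels.reshape(-1)                # shape (8000,)

# Row b * num_steps + t of the flat arrays belongs to sequence b, time step t.
b, t = 7, 13
row = b * num_steps + t
print(flat_outputs.shape, flat_labels.shape)
print(np.array_equal(flat_outputs[row], outputs[b, t]),
      flat_labels[row] == labels[b, t])
```

Because outputs and labels are flattened in the same way, row `i` of the logits still lines up with entry `i` of the labels.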

In [21]:
#reshape rnn_outputs and y so we can get the logits in a single matmul
rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
y_reshaped = tf.reshape(y, [-1])
y_reshaped, rnn_outputs

(<tf.Tensor 'Reshape_1:0' shape=(8000,) dtype=int32>,
 <tf.Tensor 'Reshape:0' shape=(8000, 4) dtype=float32>)

In [22]:
with tf.variable_scope('softmax'):
    V = tf.get_variable('V', [state_size, num_classes_out])
    bv = tf.get_variable('bv', [num_classes_out], initializer=tf.constant_initializer(0.0))

logits = tf.matmul(rnn_outputs, V) + bv
logits, y_reshaped

(<tf.Tensor 'add:0' shape=(8000, 2) dtype=float32>,
 <tf.Tensor 'Reshape_1:0' shape=(8000,) dtype=int32>)

In [23]:
#writer = tf.summary.FileWriter("tb_simple_rnn_tf1/dd", tf.get_default_graph()) 
#writer.close()
#!tensorboard --logdir=tb_simple_rnn_tf1/

In [24]:
total_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_reshaped, logits=logits))
train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

In [25]:
Y = None
X = None
count = 0
sum_tr_losses = 0
sess = tf.Session()
#with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for i in range(1000):
    X, Y = get_batch(X_train, Y_train, batch_size, num_steps)
    tr_losses, _ = sess.run([total_loss, train_step], feed_dict={x:X, y:Y})
    count += 1
    sum_tr_losses += tr_losses
    if (i < 10) or (i % 200 == 0):
        print ("{} {}".format(i, sum_tr_losses / count))
        count = 0
        sum_tr_losses = 0

0 0.6831499338150024
1 0.6532474160194397
2 0.640494704246521
3 0.6144058704376221
4 0.5690740942955017
5 0.6485346555709839
6 0.5548587441444397
7 0.5476928949356079
8 0.5383784174919128
9 0.5115993022918701
200 0.2237984576306418
400 0.18446169406175614
600 0.17015162907540798
800 0.16602394267916679


### Test on the training set

In [26]:
X, Y = get_batch(X_train, Y_train, batch_size, num_steps)
loss_train = sess.run(total_loss, feed_dict={x:X, y:Y})
loss_train

0.15341146

### Getting the relevant weights

In [27]:
# Finding the relevant weights
#ops = tf.get_default_graph().get_operations()
#for i in ops:
#   print(i.name)

In [28]:
graph = tf.get_default_graph()
W = graph.get_tensor_by_name('rnn/basic_rnn_cell/weights:0')
b = graph.get_tensor_by_name('rnn/basic_rnn_cell/biases:0')
W_,b_,V_,bv_ = sess.run([W,b,V,bv])
print (W_,b_,V_,bv_)
np.save('rnn_weights_tf1', [W_,b_,V_,bv_])

[[-0.19342743  1.20125043  0.03041052 -0.7982803 ]
 [-1.17504787 -0.35316652  0.68581891  1.42157817]
 [-0.18561231 -0.16853848  0.71172661 -0.50630069]
 [ 1.03420258  0.59627861 -0.9009819   0.35502589]
 [-0.15730815  1.02999747 -0.86277425 -0.35499594]
 [ 0.16915447 -1.24952602  0.11196493  0.41059464]
 [-1.52237487  0.30463067 -1.88313866 -1.14694726]] [ 0.12549509  0.29770762  0.21737964 -0.32890254] [[-0.08837788  0.04871782]
 [ 2.23107862 -3.23202324]
 [-0.11032624  0.4064675 ]
 [ 0.07012325 -0.50414103]] [-0.32257852  0.32258588]
