In [1]:
import numpy as np
import tensorflow as tf
%matplotlib inline
import matplotlib.pyplot as plt
tf.__version__

'0.12.0-rc1'

# Simple RNN

In this notebook we consider a simple example of an RNN and used a quite artifical data generating process. The example has been adopted from:
http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html. 

Other Resources for RNNs:

* http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
* http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html
* http://karpathy.github.io/2015/05/21/rnn-effectiveness/
* http://colah.github.io/posts/2015-08-Understanding-LSTMs/

### Definition of the task

We consider a network which predicts at each point in time a variable $\hat{y}_t$. (It thus corresponds to the "many to many" example at the very right side of the figure from the famous Karpathy blog post http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

### Example data  (I screama, you screama, we all screama for I screama)

We need some data to play around with RNNs. They are capable of doing quite complicated things such as language models and so on. For this example, we want to generate the data ourself. We have to come up with a process which creates $x_t$ which itself can be influcenced by events $x_{t'}$ which happend before $t$. Further, we have to come up with $y_t$ which depends on $x_t'$ for timepoints $t' \le t$. 

To keep it simple, we analyse the following quite artifical process in which the weather $x_{t'}$ for $t' \le t$ influences our stock on icecream $y_t$. We then see if the RNN is capable of reconstructing that process.

#### Definition of the simple process
The weather $x_t$ at a certain point in time $t$ has three states (sunny, rainy, cloudy), which we model as $x_t = (1,0,0)$, $x_t = (0,1,0)$, and $x_t = (0,0,1)$ repectively. We assume that the weather is completly random (of course we could model more complex scenarios). 

We have an icecream store capable of holding 2 units of icecream and we start with a full store. When it is sunny we sell one unit of icecream. We have the strange policy that we order  on unit of icecream when it's claudy. It takes 3 days to deliever the ice cream, we accept the ice cream if we do not have a full stock.

This enables us to model $y_t$ the state of the store $(1,0)$ for out of stock and $(0,1)$ for in stock. We create the one-hot-encoded data in the graph later. For now we use integers but keep in mind that the data is categorical.

In [25]:
def gen_data(size=1000000):
    Xs = np.array(np.random.choice(3, size=(size,))) #Random Weather
    Y = []
    ice = 2 #Our stock of icecream at start
    for t,x in enumerate(Xs):
        # (t-3) >= 0 the first ice cream could be delivered on day 3
        # Xs[t - 3] claudy three days before today => we ordered ice cream
        # ice < 2 not full
        if (t - 3) >= 0 and Xs[t - 3] == 1 and ice < 2: 
            ice += 1
        if x == 0: # It is sunny we therefore sell ice, if we have
            if ice > 0: # We have ice cream
                ice -= 1
        if ice > 0: #We are not out of stock
            Y.append(1)
        else:
            Y.append(0)
    return Xs, np.array(Y)

In [26]:
X_train, Y_train = gen_data(50000) #Global variables holding the input and output
(X_train[0:50], Y_train[0:50])

(array([2, 0, 2, 0, 2, 1, 1, 2, 2, 1, 0, 0, 2, 2, 2, 0, 1, 0, 0, 2, 0, 2, 2,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1, 1, 2, 2, 1, 2, 0, 2, 2, 2, 2, 0, 0,
        1, 2, 1, 2]),
 array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
        0, 0, 0, 1]))

### Preparation of the Minibatch

In this example, we have in principle a large stream of data $x$ and $y$. For efficiency reason we split the stream in minibatches of a certain length. For this task we could also imagin to have several realizations of that icecream process, so that it would also be natural to split the process into mini batches. 

For simplicyty we create the minibatch by we randomly cutting out `batch_size` entries of fixed length `num_steps`. Other, more advanced ways of doing so are possible. See e.g. https://danijar.com/variable-sequence-lengths-in-tensorflow/. For the time being, we thus consider the input tensor $X_{btc}$ for the minibatch to be of the following form:

* $b$ having `batch_size` entries
* $t$ loops over the unrolled timestamps (`num_steps`)
* $c$ has the dimension of the one-hot-coded classes (the one-hot-encoding will be done in the graph)

In [4]:
def get_batch(Xs, Ys, batch_size = 32, num_steps = 50):
    data_x = np.zeros([batch_size, num_steps], dtype=np.int32)
    data_y = np.zeros([batch_size, num_steps], dtype=np.int32)
    for i in range(1,batch_size):
        s = int(np.random.uniform(0, len(Xs) - num_steps))
        data_x[i] = Xs[s : s + num_steps]
        data_y[i] = Ys[s : s + num_steps]  
    return data_x, data_y

In [5]:
X, Y = get_batch(X_train, Y_train,3, 10)
print X
print Y

[[0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 1 1 1 1 0]
 [1 1 0 2 2 2 1 0 0 2]]
[[0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 1 1]
 [1 1 1 1 1 1 1 1 0 1]]


In [6]:
# Global config variables
num_steps = 40     # number of truncated backprop steps
batch_size = 200  # number of minibatches b
num_classes_in = 3   # number of classes in the input
num_classes_out = 2   # number of classes in the output
state_size = 5    # number of classes in the state
learning_rate = 0.1

#### Definition of the in- and outputs

We define the input and output for the graph

In [7]:
tf.reset_default_graph()

# Placeholders
x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')
init_state = tf.zeros([batch_size, state_size])

# RNN Inputs
# One hot encoding.
x_one_hot = tf.one_hot(x, num_classes_in)
# We want the following dimensions [batch_size, Max_Length, num_classes_in]
rnn_inputs = tf.transpose(x_one_hot, perm=(0,1,2))
rnn_inputs

<tf.Tensor 'transpose:0' shape=(200, 40, 3) dtype=float32>

The input tensor $r_{btc}$  is indexed by 

* $b$ having `batch_size` entries
* $t$ loops over the unrolled timestamps
* $c$ has the dimension of the one-hot-coded classes

#### Definition the cell

We now define the network, we do not consider the output nodes yet.
A single RNN cell is shown in the figure below in the middle:

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png)
Image taken from: [Colah's RNN Blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)


The joining of the two lines coming from the previous state $h_{t-1}$ and the current x-values $x_t$ reflects a concantination to a vector  $[h_{t-1}, x_{t}]$ of size `state_size + num_classes_in`. Alternatively, instead of concatinating, one could also use two matrices $W_x$ and $W_h$ and keep the states seperate. This is mathematically completely identical. The new state $h_t$ is 
then calculated as:

$$
    h_{t} = \tanh([h_{t-1}, x_{t}] \cdot W + b) = \tanh(h_{t-1} \cdot W_h + x_{t} \cdot W_x + b)
$$

Note that we share the variables W and b during different time-points. To do this in TensorFlow we define them in a first step for later reuse. 

In [8]:
# Definition of the Variables needed in a single cell
with tf.variable_scope('rnn_cell', reuse = False):
    W = tf.get_variable('W', [num_classes_in + state_size, state_size])
    b = tf.get_variable('b', [state_size], initializer=tf.constant_initializer(0.0))
    
# Definition of a single cell
def rnn_cell(rnn_input, state):
    with tf.variable_scope('rnn_cell', reuse=True):
        W = tf.get_variable('W', [num_classes_in + state_size, state_size])
        b = tf.get_variable('b', [state_size], initializer=tf.constant_initializer(0.0))
    return tf.tanh(tf.matmul(tf.concat(1, [state, rnn_input]), W) + b)

#### Unrolling the timesteps

We build the network using the identical weights in the `num_steps` unrolled timesteps. Techniqualy the `rnn_cell` at time t, gets the input (weather) at time t and the pervious state.  

In [9]:
state = init_state
rnn_outputs_l = []

for t in range(num_steps):
    state = rnn_cell(rnn_inputs[:,t,:], state) #Pervious state
    rnn_outputs_l.append(state) #We put the states h_0, h_1, ... in a list. 

rnn_outputs = tf.pack(rnn_outputs_l, axis=1)
rnn_outputs

<tf.Tensor 'pack:0' shape=(200, 40, 5) dtype=float32>

### Adding output to the network

We now add at each timepoint an output layer to the network. This output use the same weiht for each timepoint.

The output of the rnn is a tensor index by $o_{btk}$ (batch, time, class). This tensor produces for each minibatch and timepoint a number of output states dimensional vector indexed by $k$. This output needs to be compared with the y-value with has the shape $y_{bt}$. It's easier later on to flatten this vector for each batch.    

In [10]:
#reshape rnn_outputs and y so we can get the logits in a single matmul
rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
y_reshaped = tf.reshape(y, [-1])
y_reshaped, rnn_outputs

(<tf.Tensor 'Reshape_1:0' shape=(8000,) dtype=int32>,
 <tf.Tensor 'Reshape:0' shape=(8000, 5) dtype=float32>)

In [11]:
with tf.variable_scope('softmax'):
    V = tf.get_variable('V', [state_size, num_classes_out])
    b = tf.get_variable('b', [num_classes_out], initializer=tf.constant_initializer(0.0))

logits = tf.matmul(rnn_outputs, V) + b
total_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped))
train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

In [12]:
# Looks quite ugly
# writer = tf.train.SummaryWriter("/tmp/dumm/RNN2", tf.get_default_graph(), 'graph.pbtxt') 

### Training

In [13]:
sess = tf.Session()
sess.run(tf.initialize_all_variables())
count = 0
sum_tr_losses = 0
for i in range(1000):
    X, Y = get_batch(X_train, Y_train, batch_size, num_steps)
    tr_losses, _ = \
    sess.run([total_loss, train_step], feed_dict={x:X, y:Y})
    sum_tr_losses += tr_losses
    count += 1
    if (i < 10) or (i % 200 == 0):
        print "{} {}".format(i, sum_tr_losses / count)
        count = 0
        sum_tr_losses = 0

Instructions for updating:
Use `tf.global_variables_initializer` instead.
0 0.691968500614
1 0.640196859837
2 0.624058187008
3 0.585162341595
4 0.568682193756
5 0.502234458923
6 0.460208594799
7 0.432763636112
8 0.34563729167
9 0.382720649242
200 0.170340090288
400 0.147368130423
600 0.147375527583
800 0.148459675126


### Testing 

In [14]:
preds = tf.nn.softmax(logits)
X, Y = get_batch(X_train, Y_train, batch_size, num_steps)
p_y_pred = sess.run(preds, feed_dict={x:X}) #A list for each time point
loss_train = sess.run(total_loss, feed_dict={x:X, y:Y})

In [15]:
loss_train

0.1576525

In [16]:
np.sum(np.argmax(p_y_pred, axis=1) == np.reshape(Y, -1)) / (1.0 * batch_size * num_steps)

0.95825000000000005

In [17]:
sess.close()

### Task
Change the state size to 2. What do you obsereve? Give an explanation for your observation. 

## Using the TensorFlow API

Alternatively one can use the TensorFlow-API for creating RNNs. In principle there are two TensorFlow methods. The first, kind of deprecated one, builds a graph from the unrolled network. This API has issues in performance, first of all the creation of the graph takes quite some time. Further, and this is a bit it is also slower during runtime. Therefore, the novel dynamic API should be prefered. If you want to use sequences of variable length see:  https://danijar.com/variable-sequence-lengths-in-tensorflow/

In [18]:
tf.reset_default_graph()

# Placeholders
x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')
init_state = tf.zeros([batch_size, state_size])

# RNN Inputs
# One hot encoding.
x_one_hot = tf.one_hot(x, num_classes_in)
# We want the following dimensions [batch_size, Max_Length, num_classes_in]
rnn_inputs = tf.transpose(x_one_hot, perm=(0,1,2))
rnn_inputs

<tf.Tensor 'transpose:0' shape=(200, 40, 3) dtype=float32>

In [19]:
#cell = tf.nn.rnn_cell.BasicRNNCell(state_size)
cell = tf.nn.rnn_cell.BasicLSTMCell(state_size)
init_state = cell.zero_state(batch_size, tf.float32)
rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state=init_state)

In [20]:
rnn_outputs

<tf.Tensor 'RNN/transpose:0' shape=(200, 40, 5) dtype=float32>

The output $o_{btk}$ tensor produces for each minibatch and timepoint a 4 (number of output states) dimensional vector indexed by $k$. This can be compared with the y-value with has the shape $y_{bt}$.  

In [21]:
#reshape rnn_outputs and y so we can get the logits in a single matmul
rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
y_reshaped = tf.reshape(y, [-1])
y_reshaped, rnn_outputs

(<tf.Tensor 'Reshape_1:0' shape=(8000,) dtype=int32>,
 <tf.Tensor 'Reshape:0' shape=(8000, 5) dtype=float32>)

In [22]:
with tf.variable_scope('softmax'):
    W = tf.get_variable('W', [state_size, num_classes_out])
    b = tf.get_variable('b', [num_classes_out], initializer=tf.constant_initializer(0.0))

logits = tf.matmul(rnn_outputs, W) + b

In [23]:
total_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped))
train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

In [24]:
Y = None
X = None
count = 0
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    training_losses = []
    for i in range(1000):
        X, Y = get_batch(X_train, Y_train, batch_size, num_steps)
        tr_losses, _ = sess.run([total_loss, train_step], feed_dict={x:X, y:Y})
        count += 1
        sum_tr_losses += tr_losses
        if (i < 10) or (i % 200 == 0):
            print "{} {}".format(i, sum_tr_losses / count)
            count = 0
            sum_tr_losses = 0

Instructions for updating:
Use `tf.global_variables_initializer` instead.
0 35.462366432
1 0.691848635674
2 0.682622790337
3 0.670633971691
4 0.662271261215
5 0.658096909523
6 0.653116941452
7 0.635891377926
8 0.618283689022
9 0.592325210571
200 0.178685860601
400 0.0938078673556
600 0.0905056070909
800 0.0891324103251
