# Counting parameters for an RNN  
  
In order to get the RNN to consume text data, we need two things:  
  
1. A list of all the texts, e.g. `texts = ['Today is a nice day', 'Yesterday was gorgeous', ...]`  
2. A list of the corresponding labels, e.g. `labels = [1, 0, 1, 1, ...]`  
  
We tokenize and do other preprocessing to get tensors for the data and the labels to feed into our NN.  
  
We use pretrained GloVe embeddings to build the Embedding layer.  
  
  

At this point the `embedding_matrix` has one row per word in the vocabulary. Each row holds the vector for that word, taken from GloVe. Because it is an `np.array`, it has no row or column names. The order of the words in the rows is the same as the order of the words in the dict `word_index`.  
  
We will feed this embedding matrix as weights to the embedding layer.  
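
As a minimal sketch of how such a matrix might be assembled: the tiny `toy_word_index` and `toy_glove` dictionaries below are hypothetical stand-ins for the tokenizer's `word_index` and the parsed GloVe file, and the 4-dimensional vectors are made up for illustration.

```python
import numpy as np

# Hypothetical stand-ins: a tokenizer-style word index (1-based; index 0 is
# left unused) and a tiny "GloVe" lookup with made-up 4-dimensional vectors.
toy_word_index = {"today": 1, "is": 2, "nice": 3, "gorgeous": 4}
toy_glove = {
    "today": np.array([0.1, 0.2, 0.3, 0.4]),
    "is": np.array([0.5, 0.6, 0.7, 0.8]),
    "nice": np.array([0.9, 1.0, 1.1, 1.2]),
    # "gorgeous" is deliberately missing to show the out-of-vocabulary case
}

toy_dim = 4
toy_vocab_size = len(toy_word_index) + 1  # +1 for the unused index 0

# Row i holds the vector for the word whose index is i. Row 0 and rows for
# words without a pretrained vector stay all zeros.
toy_embedding_matrix = np.zeros((toy_vocab_size, toy_dim))
for word, i in toy_word_index.items():
    vector = toy_glove.get(word)
    if vector is not None:
        toy_embedding_matrix[i] = vector

print(toy_embedding_matrix.shape)  # (5, 4)
```

The row order follows the integer indices in the word index, which is why the matrix needs no row names.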

In [1]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, LSTM, SimpleRNN


Using TensorFlow backend.


## How the Embedding Layer works  
The Embedding layer takes three main arguments.  
   
**Arguments to specify**
1. `input_dim`: The size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0 and 10000, the size of the vocabulary is 10001 (the first index, 0, is not assigned to a word, and its row is all zeros).
2. `output_dim`: The size of the vector space in which words will be embedded. It defines the length of the output vector this layer produces for each word. In the current example it is 100.
3. `input_length` (optional, default `None`): The length of the input sequences, as you would define for any input layer of a Keras model. In the current case we are using maxlen=100.
  
**Param count**  
The total count of trainable parameters will be vocab_size * embedding_depth.  
  
The Embedding layer is just a weight matrix with dimensions (vocab_size, embedding_depth). In our case vocab_size=10000 and embedding_dim=100, so there are 10000\*100 weights, which means the parameter count is 1,000,000.  
  
**Output shape**  
The output shape will be (maxlen, embedding_depth).  
  
The Embedding layer takes every observation, which has a length of 100 (maxlen), and replaces each integer word index with the corresponding row of the embedding matrix. The output for each observation therefore has shape (maxlen, embedding_depth), in this case (100, 100), or 10,000 values. With the batch dimension, Keras reports this as (None, 100, 100), and that is the shape that feeds into the next layer.
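
The lookup can be sketched in plain NumPy. This is an illustration of the idea rather than Keras's actual implementation. The random `weights` matrix and the random `sequence` are assumptions standing in for the layer's weights and one integer-encoded observation, using the sizes set for the models in this notebook.

```python
import numpy as np

vocab_size, embedding_dim, maxlen = 10000, 100, 100

# Stand-in for the layer's weight matrix: one row per word index.
weights = np.random.rand(vocab_size, embedding_dim)

# One hypothetical integer-encoded observation, already padded/truncated to maxlen.
sequence = np.random.randint(0, vocab_size, size=maxlen)

# The "embedding" is a row lookup: each index is replaced by its row.
embedded = weights[sequence]

print(weights.size)    # 1000000 weights, the layer's parameter count
print(embedded.shape)  # (100, 100), i.e. (maxlen, embedding_dim)
```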


In [2]:
# Setup parameters for illustration
vocab_size, embedding_dim, maxlen= 10000, 100, 100

In [3]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen)) # vocab_size=10000, embedding_dim=100 (a 100-dimensional dense vector for each word, as in GloVe), maxlen=100 (using only the first 100 words of each review)
model.add(Flatten()) # Get flat layer equal to the output size of the prior layer
model.add(Dense(32, activation = 'relu')) # Dense layer with 32 nodes. So param # = (Prior layer output size * 32) + 32 biases
model.add(Dense(1, activation='sigmoid')) # Dense layer with 1 node. So param # = (Prior layer output size * 1) + 1 bias
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_1 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
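
The numbers in this summary can be reproduced by hand. The helper `dense_param_count` below is a hypothetical name introduced only for this check; it applies the "weights plus biases" rule noted in the code comments above.

```python
def dense_param_count(input_size, units):
    """Weights (input_size * units) plus one bias per unit."""
    return input_size * units + units

embedding_params = 10000 * 100   # vocab_size * embedding_dim
flatten_size = 100 * 100         # maxlen * embedding_dim; Flatten adds no parameters
dense_1 = dense_param_count(flatten_size, 32)
dense_2 = dense_param_count(32, 1)

print(dense_1, dense_2)                      # 320032 33
print(embedding_params + dense_1 + dense_2)  # 1320065
```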


## SimpleRNN
  
The parameter count for a SimpleRNN layer is given by:  

  **((InputSize + Number of nodes) * Number of nodes)  +  Number of nodes**  

In a simple RNN layer (which, unlike a GRU or LSTM, has no memory cell $\tilde{c}$), two things are input: the recurrent activation $a^{<t-1>}$ from the prior cell, and the input $x^{<t>}$.  
  
Two things are output: the activation of the SimpleRNN cell, $a^{<t>}$, and the prediction $\hat{y}^{<t>}$.  
  
$a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$
and  
$\hat{y}^{<t>} = g'(W_{ya}a^{<t>} + b_y)$ 
  
(where $g$ and $g'$ are activation functions that can be different from one another.)  
  
Now $[a^{<t-1>}, x^{<t>}]$ is the side-by-side stacking of $a^{<t-1>}$ and $x^{<t>}$.  
  
All the $a^{<t>}$ vectors are the same size, which is simply the number of output nodes of the RNN, in this case 32. Each $x^{<t>}$ is a word vector whose size equals the embedding length, in this case 100. So the side-by-side stacking of $a^{<t-1>}$ and $x^{<t>}$ is 132 in length, and $W_a$ has shape (32, 132): one row of 132 weights for each of the 32 nodes.  

So an incoming vector of dimension (132,) is multiplied by the weights of 32 nodes inside the SimpleRNN cell, giving 132 * 32 = 4224 parameters. Adding 32 biases gives 4224 + 32 = 4256, which matches the parameter count Keras reports for the SimpleRNN layer in the summary below.
  
The second equation is not part of that count. The Keras `SimpleRNN` layer outputs the activation $a^{<t>}$ directly (by default only for the last time step) and has no $W_{ya}$ or $b_y$ of its own. The output projection has to be added as a separate layer. In the model below, the final `Dense(1, activation='sigmoid')` layer plays that role, with 32 weights plus 1 bias, or 33 parameters.  
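
The SimpleRNN formula can be written as a small function. `simple_rnn_param_count` is a hypothetical helper name; the split into an input block, a recurrent block, and a bias vector simply partitions $W_a$ by which part of $[a^{<t-1>}, x^{<t>}]$ each column multiplies.

```python
def simple_rnn_param_count(input_size, units):
    """Parameters of W_a, shape (units, units + input_size), plus b_a (units)."""
    return (input_size + units) * units + units

input_size, units = 100, 32

input_weights = input_size * units  # columns of W_a that multiply x^{<t>}
recurrent_weights = units * units   # columns of W_a that multiply a^{<t-1>}
biases = units                      # b_a

print(input_weights, recurrent_weights, biases)   # 3200 1024 32
print(simple_rnn_param_count(input_size, units))  # 4256
```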


In [4]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim)) # vocab_size=10000, embedding_dim=100 (a 100-dimensional dense vector for each word, as in GloVe); input_length is not set here, so the sequence length shows as None in the summary
model.add(SimpleRNN(32))
#model.add(Dense(32, activation = 'relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 100)         1000000   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                4256      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 1,004,289
Trainable params: 1,004,289
Non-trainable params: 0
_________________________________________________________________


## LSTM parameters
In an LSTM layer the parameter count works the same way, except that there are 4 equations with their own weights and biases, so everything we did for the SimpleRNN is multiplied by 4.  

**[((InputSize + Number of nodes) * Number of nodes) + Number of nodes] * 4**
  
An LSTM has the following inputs, outputs, and equations.

Inputs:  
1. Prior activation $a^{<t-1>}$, whose size equals the number of nodes of the LSTM  
2. Input $x^{<t>}$, i.e. the word, represented as a vector  
3. Prior memory cell $c^{<t-1>}$

Outputs:
1. Memory cell $c^{<t>}$
2. Activation $a^{<t>}$
  
$\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$ : **Candidate memory cell**    
$\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$ : **UPDATE GATE**  
$\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$ : **FORGET GATE**  
$\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$ : **OUTPUT GATE**  
$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$  : **MEMORY CELL OUTPUT**  
$a^{<t>} = \Gamma_o * \tanh(c^{<t>})$  : **ACTIVATION OUTPUT**  

$\sigma$ is the sigmoid function  
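
A single LSTM time step can be sketched in NumPy to make the shapes concrete. This follows the notation of this section rather than Keras's internal weight layout. The `lstm_step` function and the `params` dictionary are hypothetical names, the weights are random, each gate is given its own bias (`b_u`, `b_f`, `b_o`), and `c_prev` denotes the previous memory cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    """One LSTM time step: returns the new activation and memory cell."""
    concat = np.concatenate([a_prev, x_t])                     # [a^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate memory cell
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])  # update gate
    gamma_f = sigmoid(params["W_f"] @ concat + params["b_f"])  # forget gate
    gamma_o = sigmoid(params["W_o"] @ concat + params["b_o"])  # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev                 # memory cell output
    a_t = gamma_o * np.tanh(c_t)                               # activation output
    return a_t, c_t

units, input_size = 32, 100
rng = np.random.default_rng(0)

# Four weight matrices of shape (units, units + input_size) and four bias vectors.
params = {}
for gate in "cufo":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(units, units + input_size))
    params[f"b_{gate}"] = np.zeros(units)

x_t = rng.normal(size=input_size)  # one hypothetical word vector
a_t, c_t = lstm_step(x_t, np.zeros(units), np.zeros(units), params)

print(params["W_c"].shape)     # (32, 132)
print(a_t.shape, c_t.shape)    # (32,) (32,)
```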

Now each of $W_c$, $W_u$, $W_f$, and $W_o$ has dimensions determined by the concatenation of $a^{<t-1>}$ and $x^{<t>}$: (node size + input size) multiplied by the number of nodes, since each behaves like a fully connected layer. Add to that the biases (one per node), and because there are 4 of these sets, multiply the total by 4.  
    
For this example, the input size is 100 (the embedding size) and the number of nodes is 32, so the parameter count is (((100 + 32) * 32) + 32) * 4 = 17024.  
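
The same arithmetic as a function, for trying other layer sizes. `lstm_param_count` is a hypothetical helper name introduced for illustration.

```python
def lstm_param_count(input_size, units):
    """Four sets of weights and biases: candidate, update, forget, output."""
    per_gate = (input_size + units) * units + units
    return 4 * per_gate

print(lstm_param_count(100, 32))  # 17024
```

Doubling the number of nodes more than doubles the count, because the recurrent part of each weight matrix grows with the square of the node count.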



In [6]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim)) # vocab_size=10000, embedding_dim=100 (a 100-dimensional dense vector for each word, as in GloVe); input_length is not set here, so the sequence length shows as None in the summary
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# model.layers[0].set_weights([embedding_matrix])
# model.layers[0].trainable = False

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 100)         1000000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 32)                17024     
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 33        
Total params: 1,017,057
Trainable params: 1,017,057
Non-trainable params: 0
_________________________________________________________________
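
The totals in this summary can be checked by hand, along with what would be expected to change if the two commented-out lines in the cell above were enabled. Setting `trainable = False` on the embedding layer should move its weights from the trainable to the non-trainable count; the figures for that case are computed here, not taken from an actual run.

```python
embedding_params = 10000 * 100            # vocab_size * embedding_dim
lstm_params = 4 * ((100 + 32) * 32 + 32)  # four sets of weights and biases
dense_params = 32 * 1 + 1                 # 32 weights + 1 bias

# With every layer trainable, as in the summary above.
total = embedding_params + lstm_params + dense_params

# If the embedding layer were frozen, only the LSTM and Dense layers would train.
trainable_if_frozen = lstm_params + dense_params
non_trainable_if_frozen = embedding_params

print(total)                    # 1017057
print(trainable_if_frozen)      # 17057
print(non_trainable_if_frozen)  # 1000000
```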


**Run the model**
```python
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)
```