Forward Propagation in RNNs

- $U$ is the input-to-hidden weight matrix
- $W$ is the hidden-to-hidden weight matrix
- $V$ is the hidden-to-output weight matrix

The hidden state $h$ at a time step $t$ can be computed as:<br> $h_t = \tanh(Ux_t+Wh_{t-1})$

The output at a time step $t$ can be computed as:<br> $\hat{y}_t = softmax(Vh_t)$

In [1]:
import numpy as np

Initialize all the weights, $U$, $W$, and $V$, by randomly drawing from the
uniform distribution:

In [2]:
input_dim = 10
hidden_dim = 10
output_dim = 10

In [3]:
U = np.random.uniform(-np.sqrt(1.0 / input_dim), np.sqrt(1.0 / input_dim), (hidden_dim, input_dim))
W = np.random.uniform(-np.sqrt(1.0 / hidden_dim), np.sqrt(1.0 / hidden_dim), (hidden_dim, hidden_dim))
V = np.random.uniform(-np.sqrt(1.0 / hidden_dim), np.sqrt(1.0 / hidden_dim), (output_dim, hidden_dim))

Define the number of time steps, which will be the length of our input
sequence, $x$:

In [4]:
x = [10, 12, 14, 16, 18]

In [5]:
num_time_steps = len(x)

In [6]:
hidden_state = np.zeros((num_time_steps + 1, hidden_dim))

In [7]:
hidden_state[-1] = np.zeros(hidden_dim)

In [8]:
YHat = np.zeros((num_time_steps, output_dim))

In [9]:
def softmax(x):
    # subtract the maximum before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum(axis=0)

In [10]:
for t in np.arange(num_time_steps):
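    # h_t = tanh(U x_t + W h_{t-1})
    # Note: the update below is commented out, so hidden_state stays all zeros.
    # Enabling it also requires every x[t] to be a valid column index of U (< input_dim).
    # hidden_state[t] = np.tanh(U[:, x[t]] + W.dot(hidden_state[t - 1]))
    # yhat_t = softmax(V h_t)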
    YHat[t] = softmax(V.dot(hidden_state[t]))
YHat

array([[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
       [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
       [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
       [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
       [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]])
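
The uniform rows of 0.1 above follow from the hidden-state update being commented out: every hidden state stays at zero, and the softmax of a zero vector of length 10 is 0.1 in each position. As a minimal sketch of the forward pass with the update enabled, the example below assumes the inputs are integer indices smaller than `input_dim`. The sequence `x_demo`, the helper `uniform_init`, and the seed are illustrative choices, not part of the original run.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for repeatability only

input_dim = hidden_dim = output_dim = 10

def uniform_init(rows, cols, fan_in):
    # hypothetical helper: uniform draw in [-sqrt(1/fan_in), sqrt(1/fan_in)]
    bound = np.sqrt(1.0 / fan_in)
    return rng.uniform(-bound, bound, (rows, cols))

U = uniform_init(hidden_dim, input_dim, input_dim)    # input-to-hidden
W = uniform_init(hidden_dim, hidden_dim, hidden_dim)  # hidden-to-hidden
V = uniform_init(output_dim, hidden_dim, hidden_dim)  # hidden-to-output

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x_demo = [0, 2, 4, 6, 8]  # hypothetical indices, each < input_dim
num_time_steps = len(x_demo)

# The last row holds the initial state h_{-1}; at t = 0, hidden_state[t - 1] reads it.
hidden_state = np.zeros((num_time_steps + 1, hidden_dim))
y_hat = np.zeros((num_time_steps, output_dim))

for t in range(num_time_steps):
    # U[:, x_demo[t]] is the same as U @ one_hot(x_demo[t])
    hidden_state[t] = np.tanh(U[:, x_demo[t]] + W @ hidden_state[t - 1])
    y_hat[t] = softmax(V @ hidden_state[t])

print(y_hat.sum(axis=1))  # each row is a probability distribution
```

With the update active, the rows of `y_hat` are no longer uniform, because each hidden state now depends on the current input and the previous state.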

Backpropagation through time

The loss $L$ at a time step $t$ can be given as:<br> $L_t = -y_t \log(\hat{y}_t)$

Here, $y_t$ is the actual output, and $\hat{y}_t$ is the predicted output at a time step $t$.

For $T$ time steps, the final loss is given as:<br> $L = \sum\limits_{j=0}^{T-1}L_j$
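
As a small worked example of the loss above, the sketch below computes the per-step cross-entropy and its sum over the sequence. The predictions and targets are made-up values for a 4-symbol vocabulary and $T = 3$, chosen only for illustration.

```python
import numpy as np

# Hypothetical predictions for T = 3 time steps over a 4-symbol vocabulary.
y_hat = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
# One-hot targets: the correct symbols are 0, 1 and 3.
y = np.eye(4)[[0, 1, 3]]

# L_t = -sum_i y_{t,i} * log(y_hat_{t,i}); the one-hot target keeps a single term.
step_losses = -(y * np.log(y_hat)).sum(axis=1)

# L = sum over all time steps
total_loss = step_losses.sum()

print(step_losses)
print(total_loss)
```

Because the targets are one-hot, each step's loss reduces to the negative log of the probability assigned to the correct symbol.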

The gradient of the loss function with respect to $V$ is:<br> $\frac{\partial L}{\partial V} = \sum\limits_{j=0}^{T-1}(\hat{y}_j - y_j) \otimes h_j$
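
The gradient with respect to $V$ can be sketched in NumPy as a sum of outer products, one per time step. The hidden states, weights, and targets below are random stand-ins (assumptions for illustration), and a finite-difference check on a single entry is included as a sanity test of the formula.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
T, hidden_dim, output_dim = 5, 4, 3

h = rng.standard_normal((T, hidden_dim))           # hypothetical hidden states h_j
V = rng.standard_normal((output_dim, hidden_dim))  # hidden-to-output weights
targets = rng.integers(0, output_dim, size=T)      # hypothetical target indices
y = np.eye(output_dim)[targets]                    # one-hot targets

def predict(V):
    logits = h @ V.T                               # row j is V h_j
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def total_loss(V):
    return -(y * np.log(predict(V))).sum()

# Analytic gradient: sum over time steps of (y_hat_j - y_j) outer h_j
y_hat = predict(V)
dV = np.zeros_like(V)
for j in range(T):
    dV += np.outer(y_hat[j] - y[j], h[j])

# Central finite difference on one entry of V
eps = 1e-6
V_plus, V_minus = V.copy(), V.copy()
V_plus[1, 2] += eps
V_minus[1, 2] -= eps
numeric = (total_loss(V_plus) - total_loss(V_minus)) / (2 * eps)

print(dV[1, 2], numeric)
```

The loop can also be written in one line as `(y_hat - y).T @ h`, which gives the same matrix.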

The gradient of the loss function with respect to $W$ is:<br> $\frac{\partial L}{\partial W} = \sum\limits_{j=0}^{T-1} \sum\limits_{k=0}^{j}(\hat{y}_j - y_j) \prod\limits_{m=k+1}^{j} W^T \operatorname{diag}(1 - \tanh^2(Wh_{m-1} + Ux_m)) \otimes h_{k-1}$

Gradient Clipping

Gradient clipping mitigates the exploding gradient problem. First we compute the L2 norm of the gradient, $||\hat{g}||$. If this norm exceeds a predefined threshold, we rescale the gradient so that its norm equals the threshold.

If the gradient norm exceeds the defined threshold, then we update the gradient as:<br>$\hat{g} = \frac{\text{threshold}}{||\hat{g}||} \cdot \hat{g}$
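
The rescaling rule above can be sketched in a few lines of NumPy. The function name `clip_by_norm` and the example gradient are assumptions made for illustration.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold (hypothetical helper)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])            # L2 norm is 5
clipped = clip_by_norm(g, 2.5)      # norm exceeds 2.5, so g is scaled by 0.5
unchanged = clip_by_norm(g, 10.0)   # norm is below 10, so g is returned as is

print(clipped, np.linalg.norm(clipped))
print(unchanged)
```

Scaling by a single factor keeps the direction of the gradient and only shortens it.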

Generating song lyrics using RNNs

In [11]:
import warnings
warnings.filterwarnings('ignore')
import random
import numpy as np
import tensorflow as tf
import pandas as pd

Read the downloaded input dataset:

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
df = pd.read_csv('/content/drive/MyDrive/songdata.csv')

Dataset exploration

In [14]:
df.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [15]:
df.shape[0]

57650

In [16]:
len(df['artist'].unique())

643

In [17]:
df['artist'].value_counts()[:10]

artist,count
Donna Summer,191
Gordon Lightfoot,189
George Strait,188
Bob Dylan,188
Cher,187
Loretta Lynn,187
Alabama,187
Reba Mcentire,187
Chaka Khan,186
Dean Martin,186


In [18]:
df['artist'].value_counts().values.mean()

89.65785381026438

In [19]:
data = ', '.join(df['text'])

In [20]:
data[:369]

"Look at her face, it's a wonderful face  \nAnd it means something special to me  \nLook at the way that she smiles when she sees me  \nHow lucky can one fellow be?  \n  \nShe's just my kind of girl, she makes me feel fine  \nWho could ever believe that she could be mine?  \nShe's just my kind of girl, without her I'm blue  \nAnd if she ever leaves me what could I do, what co"

Since we are building a char-level RNN, we will store all the unique
characters in our dataset into a variable called chars; this is basically our
vocabulary:


In [21]:
chars = sorted(list(set(data)))

In [22]:
vocab_size = len(chars)

We map each character in the vocabulary to a unique integer index, and
also build the reverse mapping from index back to character.

In [23]:
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

In [24]:
print(char_to_ix['s'])

68


In [25]:
print(ix_to_char[68])

s


In [26]:
vocabSize = 7
char_index = 4
print(np.eye(vocabSize)[char_index])

[0. 0. 0. 0. 1. 0. 0.]


As we can see above, indexing the identity matrix with a character index gives us the <strong>one-hot encoded</strong> vector for that character.<br>We will encapsulate this one-hot encoder into a function:

In [27]:
def one_hot_encoder(index):
    return np.eye(vocab_size)[index]
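
Indexing `np.eye` with a list of indices returns one row per index, which is how a whole sequence can be encoded in one call. The sketch below uses a made-up vocabulary size and index list (assumptions for illustration) to show the resulting shape.

```python
import numpy as np

toy_vocab_size = 5        # hypothetical vocabulary size
toy_indices = [3, 0, 3]   # hypothetical character indices for a 3-character sequence

# Fancy indexing picks row 3, row 0 and row 3 of the identity matrix.
one_hot_rows = np.eye(toy_vocab_size)[toy_indices]
print(one_hot_rows.shape)  # (3, 5): one row per character

# argmax along each row recovers the original indices
recovered = one_hot_rows.argmax(axis=1)
print(recovered)
```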

Defining the network parameters

1. Define the number of units in the hidden layer:

In [28]:
hidden_size = 100

2. Define the length of the input and output sequence:

In [29]:
seq_length = 25

3. Define the learning rate for gradient descent:

In [30]:
learning_rate = 1e-1

4. Set the seed value:

In [31]:
import tensorflow as tf
import random

In [32]:
seed_value = 42
tf.compat.v1.set_random_seed(seed_value)
random.seed(seed_value)

Defining placeholders

1. The placeholders for the input and output are defined as:

In [33]:
tf.compat.v1.disable_v2_behavior()

Instructions for updating:
non-resource variables are not supported in the long term


In [34]:
inputs = tf.compat.v1.placeholder(shape=[None, vocab_size],dtype=tf.float32, name="inputs")
targets = tf.compat.v1.placeholder(shape=[None, vocab_size], dtype=tf.float32, name="targets")

2. Define the placeholder for the initial hidden state:

In [35]:
init_state = tf.compat.v1.placeholder(shape=[1, hidden_size], dtype=tf.float32, name="state")

3. Define an initializer for initializing the weights of the RNN:

In [36]:
initializer = tf.random_normal_initializer(stddev=0.1)

Defining forward propagation

$ h_t = \tanh(Ux_t + Wh_{t-1} + b_h) $


$ \hat{y}_t = softmax(Vh_t + b_y) $

In [37]:
with tf.compat.v1.variable_scope("RNN") as scope:
    h_t = init_state
    y_hat = []

    for t, x_t in enumerate(tf.split(inputs, seq_length, axis=0)):
        if t > 0:
            scope.reuse_variables()

        #input to hidden layer weights
        U = tf.compat.v1.get_variable("U", [vocab_size, hidden_size], initializer=initializer)
        #hidden to hidden layer weights
        W = tf.compat.v1.get_variable("W", [hidden_size, hidden_size], initializer=initializer)
        #hidden to output layer weights
        V = tf.compat.v1.get_variable("V", [hidden_size, vocab_size], initializer=initializer)
        #bias for hidden layer
        bh = tf.compat.v1.get_variable("bh", [hidden_size], initializer=initializer)
        #bias for output layer
        by = tf.compat.v1.get_variable("by", [vocab_size], initializer=initializer)
        h_t = tf.tanh(tf.matmul(x_t, U) + tf.matmul(h_t, W) + bh)
        y_hat_t = tf.matmul(h_t, V) + by
        y_hat.append(y_hat_t)

Apply $softmax$ on the output and get the probabilities

In [38]:
output_softmax = tf.nn.softmax(y_hat[-1])
outputs = tf.concat(y_hat, axis=0)

Compute the cross-entropy loss

In [39]:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=outputs))

Store the final hidden state of the RNN in $hprev$. We use this final hidden state for making predictions

In [40]:
hprev = h_t

Defining BPTT

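1. Initialize the Adam optimizer (it is created without arguments, so it uses its default learning rate rather than the `learning_rate` defined earlier)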

In [41]:
minimizer = tf.compat.v1.train.AdamOptimizer()

2. Compute the gradients of the loss with the Adam optimizer

In [42]:
gradients = minimizer.compute_gradients(loss)

3. Set the threshold for gradient clipping

In [43]:
threshold = tf.constant(5.0, name="grad_clipping")

4. Clip each gradient element to the range [-threshold, threshold] (element-wise clipping by value, rather than the norm-based rescaling described earlier)

In [44]:
clipped_gradients = []
for grad, var in gradients:
    clipped_grad = tf.clip_by_value(grad, -threshold, threshold)
    clipped_gradients.append((clipped_grad, var))

5. Update the weights by applying the clipped gradients

In [45]:
updated_gradients = minimizer.apply_gradients(clipped_gradients)
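
The clipping used in this section clamps each gradient element independently, which differs from rescaling by the L2 norm. The sketch below contrasts the two on a made-up gradient vector, using `np.clip` as a stand-in for the element-wise clamp (an assumption for illustration, outside the TensorFlow graph).

```python
import numpy as np

threshold = 5.0
g = np.array([12.0, -3.0, 0.5])  # hypothetical gradient

# Element-wise: only entries outside [-threshold, threshold] are changed.
clipped_by_value = np.clip(g, -threshold, threshold)

# Norm-based: the whole vector is scaled by one factor when its norm is too large.
clipped_by_norm = g * min(1.0, threshold / np.linalg.norm(g))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(clipped_by_value, cosine(clipped_by_value, g))
print(clipped_by_norm, cosine(clipped_by_norm, g))
```

Clipping by value can change the direction of the gradient (cosine below 1), while clipping by norm preserves it.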

Generating Songs

In [46]:
sess = tf.compat.v1.Session()
init = tf.compat.v1.global_variables_initializer()
sess.run(init)

In [47]:
pointer = 0
iteration = 0
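
The training loop slides a window of `seq_length` characters over the text and uses the same window shifted by one character as the target. A minimal sketch of that windowing on a toy string follows. The string and the window length of 4 are assumptions for illustration, and the sketch stops at the end of the text where the training loop would instead reset the pointer to 0.

```python
toy_data = "hello world"  # hypothetical stand-in for the lyrics text
toy_seq_length = 4        # hypothetical; the notebook uses seq_length = 25

pointer = 0
windows = []
while pointer + toy_seq_length + 1 < len(toy_data):
    # input is the current window, target is the window shifted right by one
    input_sentence = toy_data[pointer:pointer + toy_seq_length]
    output_sentence = toy_data[pointer + 1:pointer + toy_seq_length + 1]
    windows.append((input_sentence, output_sentence))
    pointer += toy_seq_length

for inp, out in windows:
    print(repr(inp), '->', repr(out))
```

Each target character is the character that follows the corresponding input character, so the network learns next-character prediction at every position.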

In [48]:
while iteration < 50001:

    if pointer + seq_length+1 >= len(data) or iteration == 0:
        hprev_val = np.zeros([1, hidden_size])
        pointer = 0

    #select input sentence
    input_sentence = data[pointer:pointer + seq_length]

    #select output sentence
    output_sentence = data[pointer + 1:pointer + seq_length + 1]

    #get the indices of input and output sentence
    input_indices = [char_to_ix[ch] for ch in input_sentence]
    target_indices = [char_to_ix[ch] for ch in output_sentence]

    #convert the input and output sentences to one-hot encoded vectors with the help of their indices
    input_vector = one_hot_encoder(input_indices)
    target_vector = one_hot_encoder(target_indices)


    #train the network and get the final hidden state
    hprev_val, loss_val, _ = sess.run([hprev, loss, updated_gradients],
                                      feed_dict={inputs: input_vector,targets: target_vector,init_state: hprev_val})


    #make predictions on every 500th iteration
    if iteration % 500 == 0:

        #length of characters we want to predict
        sample_length = 500

        #randomly select index
        random_index = random.randint(0, len(data) - seq_length)

        #sample the input sentence with the randomly selected index
        sample_input_sent = data[random_index:random_index + seq_length]

        #get the indices of the sampled input sentence
        sample_input_indices = [char_to_ix[ch] for ch in sample_input_sent]

        #store the final hidden state in sample_prev_state_val
        sample_prev_state_val = np.copy(hprev_val)

        #for storing the indices of predicted characters
        predicted_indices = []


        for t in range(sample_length):

            #convert the sampled input sentence into one-hot encoded vector using their indices
            sample_input_vector = one_hot_encoder(sample_input_indices)

            #compute the probability of each character in the vocabulary being the next character
            probs_dist, sample_prev_state_val = sess.run([output_softmax, hprev],
                                                      feed_dict={inputs: sample_input_vector,init_state: sample_prev_state_val})

            #we randomly select the index according to the probability distribution generated by the model
            ix = np.random.choice(range(vocab_size), p=probs_dist.ravel())

            sample_input_indices = sample_input_indices[1:] + [ix]


            #store the predicted index in predicted_indices list
            predicted_indices.append(ix)

        #convert the predicted indices to their character
        predicted_chars = [ix_to_char[ix] for ix in predicted_indices]

        #combine the predicted characters
        text = ''.join(predicted_chars)

        #print the predicted text on every 5000th iteration
        if iteration % 5000 == 0:
            print ('\n')
            print (' After %d iterations' %(iteration))
            print('\n %s \n' % (text,))
            print('-'*115)


    #increment the pointer and iteration
    pointer += seq_length
    iteration += 1



 After 0 iterations

 y(Qgq[h?s:wKc54Z2uVi
"
0
lus)m(C8,'LCPsTSCJ7U(9o8lvj3jHnKSRb"jYl7e[Y[[.
0'3(LaqxlJrUn3UJohYFxS1vYv73o1E?Aq?NU'MY2dVdMWeuksCtpqo]J 8pGT"e:e',:[NiP]MY:Obk61s(z!sGP7Lt])u]PjFPo(CDCB1Z:KCeyG ,Jk-TiR(7TZQ,prI3W)pwL:yiJ1-2jLSiYp7'(O6qyMO,LLddLb6,gyYgzSkn23L)zRevK"qWTAbBB.UZPb!?787-8iUbD[io:v5yNYylUkpDE"-HDnqi2-ZKgiO-rZiug4
Kz-uBQ( 4CQ7,lsP,z8etiL:BN9QZ1.(y[xRGIAdEftjZT2Z?dEIPCIUC6 Oh5w5RNpQvg,E.O0Ff lX:(LXa0g:t9r:j5PEC
RUn"zlS]?:d3ShIf7TboAPFYTpo?)[Uo76J
C:8N(s8?R:f78f3H:W6)9JO
sdMPG!lMWOf(NvkhApM3v
o 

-------------------------------------------------------------------------------------------------------------------


 After 5000 iterations

 g thand inde sto lak  
Now, wave I sull this me bof lise caf ald me he bleg found-us bngely w
I wust ofon- a bthegy uro s the wie pparesheve cpsice llakner I d, songst wet mo f plowgit fours pyee  
I ksinet thon doplif spi tongand pedse aseat fio bake  
A dald brofereon sighime oo  
I sp-hfhere the purpeayt wrou cipneep of the u
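
The generation step in the loop above draws the next character from the predicted distribution instead of always taking the most probable one. The sketch below illustrates the difference on a made-up three-character vocabulary. The probabilities, the seed, and the use of `numpy.random.default_rng` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)     # assumed seed
toy_chars = ['a', 'b', 'c']        # hypothetical vocabulary
probs = np.array([0.7, 0.2, 0.1])  # hypothetical softmax output

# Greedy decoding always returns the most probable character.
greedy = toy_chars[int(np.argmax(probs))]

# Sampling returns each character in proportion to its probability.
samples = rng.choice(len(toy_chars), size=1000, p=probs)
freqs = np.bincount(samples, minlength=len(toy_chars)) / samples.size

print(greedy)
print(freqs)
```

Sampling keeps the generated text varied, whereas greedy decoding tends to repeat the same high-probability characters.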

Different types of RNN architectures

1. One-to-one architecture:<br>
In a one-to-one architecture, a single input is mapped to a single output,
and the output from the time step t is fed as an input to the next time step.

![image.png](https://media.geeksforgeeks.org/wp-content/uploads/20211225155841/Screenshot32.png)


2. One-to-many architecture:<br>
In a one-to-many architecture, a single input is mapped to multiple hidden
states and multiple output values, which means RNN takes a single input and
maps it to an output sequence.

![image.png](https://media.geeksforgeeks.org/wp-content/uploads/20211225155842/Screenshot33.png)

3. Many-to-one architecture:<br>
A many-to-one architecture, as the name suggests, takes a sequence of input
and maps it to a single output value.

![image.png](https://media.geeksforgeeks.org/wp-content/uploads/20211225155845/Screenshot36.png)

4. Many-to-many architecture:<br>
In many-to-many architectures, we map a sequence of input of arbitrary
length to a sequence of output of arbitrary length.

![image.png](https://media.geeksforgeeks.org/wp-content/uploads/20211225155843/Screenshot35.png)
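
With the same recurrent cell, the many-to-one and many-to-many cases differ mainly in which outputs are kept. The NumPy sketch below illustrates this under simplifying assumptions: random weights, made-up dimensions, a hypothetical `run_rnn` helper, and an output sequence of the same length as the input (the arbitrary-length case is not covered here).

```python
import numpy as np

rng = np.random.default_rng(0)                     # assumed seed
input_dim, hidden_dim, output_dim, T = 3, 4, 2, 5  # hypothetical sizes

U = rng.standard_normal((hidden_dim, input_dim)) * 0.1   # input-to-hidden
W = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden
V = rng.standard_normal((output_dim, hidden_dim)) * 0.1  # hidden-to-output

def run_rnn(xs):
    """Return the output at every time step for a sequence of input vectors."""
    h = np.zeros(hidden_dim)
    outputs = []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h)
        outputs.append(V @ h)
    return np.stack(outputs)

xs = rng.standard_normal((T, input_dim))  # hypothetical input sequence

many_to_many = run_rnn(xs)     # keep every output: shape (T, output_dim)
many_to_one = run_rnn(xs)[-1]  # keep only the last output: shape (output_dim,)

print(many_to_many.shape, many_to_one.shape)
```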