<a href="https://colab.research.google.com/github/montimaj/Deep-Learning-SE-6213/blob/master/HW8/hw8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SYS ENG 6213 - Deep Learning and Advanced Neural Networks 

### Homework#5: RNN and LSTM networks


In this homework, you will implement RNN and LSTM networks on the Penn Treebank dataset. These architectures are useful for data with sequential dependencies, such as sequences and lists. One example is language modeling, where the next words in a text are predicted from the history of previous words.

A few more applications:
1. Speech recognition 
2. Language modeling 
3. Translation 
4. Image captioning

Detailed information on RNNs and LSTMs can be found at http://colah.github.io/posts/2015-08-Understanding-LSTMs/

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd 'drive/My Drive/SysEng 6213 Fall 2020 Sayantan Majumdar /HW8'
!ls

/content/drive/.shortcut-targets-by-id/1xpWpUfn0NqGtbgAP_303jr6QW5RKzJPR/SysEng 6213 Fall 2020 Sayantan Majumdar /HW8
data	   layers.py	 reader.py	rnn_pic.png  utilities.py
hw8.ipynb  lstm_pic.png  rnn_layers.py	rnn.py


In [3]:
!pip install tensorflow
import os
import pickle
import reader
import tensorflow as tf
import numpy as np
from rnn_layers import *
from utilities import _get_batch,_divide_into_batches
from rnn import *



__Dataset:__

For this homework, we will use the [Penn Tree Bank](https://catalog.ldc.upenn.edu/ldc99t42) (PTB) dataset, which is a popular benchmark for measuring the quality of these models, whilst being small and relatively fast to train. Below is a small sample of data.

In [4]:
with tf.io.gfile.GFile(os.path.join(os.getcwd(),'data', "ptb.train.txt"), "r") as f:
    data = f.read()
data[0:2000]

" aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter \n pierre <unk> N years old will join the board as a nonexecutive director nov. N \n mr. <unk> is chairman of <unk> n.v. the dutch publishing group \n rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate \n a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported \n the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said \n <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N \n although preliminary find

__Pre Processing__

The dataset is a collection of words, as shown above. To train a neural network, we need a numerical representation of the data, so we encode each word in the dataset as a unique integer ID. The following code converts the words in the dataset into these IDs.

In [5]:
data_path = os.path.join(os.getcwd(),'data')
raw_data = reader.ptb_raw_data(data_path)
train_data, valid_data, test_data, vocabulary, word_ids = raw_data
data = {'train_data':train_data[0:10000],'valid_data':valid_data,'test_data':test_data,'vocabulary':vocabulary,'word_ids':word_ids}
ids_words = {i: w for w, i in word_ids.items()}
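The internals of `reader.ptb_raw_data` are not shown in this notebook, so the following is only a minimal sketch of how a word-to-ID encoding could be built. It assumes IDs are assigned by descending word frequency, which is consistent with `'the': 0` in the `word_ids` output, and that newlines are mapped to `<eos>`. The helper names `build_vocab` and `encode` are hypothetical and are not part of `reader.py`.

```python
from collections import Counter

def build_vocab(tokens):
    """Map each distinct token to an integer ID, most frequent token first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ordered)}

def encode(tokens, word_to_id):
    """Replace every known token with its ID; tokens not in the vocabulary are skipped."""
    return [word_to_id[w] for w in tokens if w in word_to_id]

# Toy corpus standing in for ptb.train.txt (assumption: newline marks end of sentence).
text = "the cat sat on the mat \n the dog sat"
tokens = text.replace("\n", "<eos>").split()

toy_word_ids = build_vocab(tokens)
toy_ids = encode(tokens, toy_word_ids)
toy_ids_words = {i: w for w, i in toy_word_ids.items()}  # inverse map, as in the cell above

print(toy_word_ids)
print(toy_ids)
```

Decoding with the inverse map, `[toy_ids_words[i] for i in toy_ids]`, recovers the original token list, which is the same round trip the prediction cell relies on when it prints words from IDs.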

In [6]:
word_ids

{'the': 0,
 '<unk>': 1,
 '<eos>': 2,
 'N': 3,
 'of': 4,
 'to': 5,
 'a': 6,
 'in': 7,
 'and': 8,
 "'s": 9,
 'that': 10,
 'for': 11,
 '$': 12,
 'is': 13,
 'it': 14,
 'said': 15,
 'on': 16,
 'by': 17,
 'at': 18,
 'as': 19,
 'from': 20,
 'million': 21,
 'with': 22,
 'mr.': 23,
 'was': 24,
 'be': 25,
 'are': 26,
 'its': 27,
 'he': 28,
 'but': 29,
 'has': 30,
 'an': 31,
 "n't": 32,
 'will': 33,
 'have': 34,
 'new': 35,
 'or': 36,
 'company': 37,
 'they': 38,
 'this': 39,
 'year': 40,
 'which': 41,
 'would': 42,
 'about': 43,
 'says': 44,
 'more': 45,
 'were': 46,
 'market': 47,
 'billion': 48,
 'his': 49,
 'had': 50,
 'their': 51,
 'up': 52,
 'u.s.': 53,
 'one': 54,
 'than': 55,
 'who': 56,
 'some': 57,
 'been': 58,
 'also': 59,
 'stock': 60,
 'other': 61,
 'share': 62,
 'not': 63,
 'we': 64,
 'corp.': 65,
 'if': 66,
 'when': 67,
 'i': 68,
 'last': 69,
 'president': 70,
 'shares': 71,
 'years': 72,
 'all': 73,
 'first': 74,
 'two': 75,
 'because': 76,
 'trading': 77,
 'after': 78,
 'could': 

### Demo for Recurrent Neural Network

<img src="rnn_pic.png" alt="Drawing" style="width: 600px;"/>

From the figure above, you can see that an RNN cell (left) is just a single neural network that is connected to itself. The unfolded representation is on the right. This means that at each time step the architecture receives not only the current input but also its output from the previous time step. Therefore, the prediction at any time step depends on the current input and all previous inputs. This gives the RNN the ability to learn a sequence.
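The recurrence described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in `rnn_layers.py`: it assumes a tanh activation and the weight names `Wx`, `Wh`, and `b`, and the function names `rnn_step` and `rnn_forward` are hypothetical.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

def rnn_forward(x, h0, Wx, Wh, b):
    """Unroll the same cell over T time steps.

    x has shape (N, T, D); the returned hidden states have shape (N, T, H).
    """
    N, T, _ = x.shape
    H = h0.shape[1]
    h = np.zeros((N, T, H))
    h_prev = h0
    for t in range(T):
        h_prev = rnn_step(x[:, t, :], h_prev, Wx, Wh, b)
        h[:, t, :] = h_prev
    return h

rng = np.random.default_rng(0)
N, T, D, H = 2, 3, 4, 5                      # batch, sequence length, input dim, hidden dim
x = rng.normal(size=(N, T, D))
h0 = np.zeros((N, H))
Wx = rng.normal(scale=0.1, size=(D, H))
Wh = rng.normal(scale=0.1, size=(H, H))
b = np.zeros(H)

h = rnn_forward(x, h0, Wx, Wh, b)
print(h.shape)
```

Because `h_prev` is threaded through the loop, changing the input at the first time step changes the hidden state at every later step, while changing the last input leaves earlier hidden states untouched.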

In [None]:
model = language_model(data,update_rule='SGD_with_momentum',cell_type = 'rnn',batch_size = 128,seq_len= 2,epochs=70)
params = model.train()

Epoch 1/70
    Iteration 1/78, Loss: 18.42156245411719
    Iteration 51/78, Loss: 18.083903846302604
    tr_acc: 0.11227964743589744, val_acc: 0.1099582248263889
Epoch 2/70
    Iteration 1/78, Loss: 17.63596647057056
    Iteration 51/78, Loss: 16.16621535051767
    tr_acc: 0.10426682692307693, val_acc: 0.103759765625
Epoch 3/70
    Iteration 1/78, Loss: 15.323147337155284
    Iteration 51/78, Loss: 15.070407677346093
    tr_acc: 0.12469951923076923, val_acc: 0.12406412760416667
Epoch 4/70
    Iteration 1/78, Loss: 14.574902949914055
    Iteration 51/78, Loss: 14.681055789789589
    tr_acc: 0.12690304487179488, val_acc: 0.1250949435763889
Epoch 5/70
    Iteration 1/78, Loss: 14.228969680960729
    Iteration 51/78, Loss: 14.385314279447261
    tr_acc: 0.15224358974358973, val_acc: 0.15059407552083334
Epoch 6/70
    Iteration 1/78, Loss: 13.971150764251302
    Iteration 51/78, Loss: 14.123401840180534
    tr_acc: 0.16005608974358973, val_acc: 0.1593967013888889
Epoch 7/70
    Iteration 1/

    Iteration 51/78, Loss: 11.077169569587177
    tr_acc: 0.3659855769230769, val_acc: 0.22557237413194445
Epoch 52/70
    Iteration 1/78, Loss: 11.064897197143768
    Iteration 51/78, Loss: 11.036158159237992
    tr_acc: 0.3698918269230769, val_acc: 0.226318359375
Epoch 53/70
    Iteration 1/78, Loss: 11.023844557345104
    Iteration 51/78, Loss: 10.995179142091828
    tr_acc: 0.3733974358974359, val_acc: 0.22682020399305555
Epoch 54/70
    Iteration 1/78, Loss: 10.983200412695187
    Iteration 51/78, Loss: 10.954211818056425
    tr_acc: 0.37740384615384615, val_acc: 0.2265625
Epoch 55/70
    Iteration 1/78, Loss: 10.942940637600321
    Iteration 51/78, Loss: 10.913234338827184
    tr_acc: 0.38231169871794873, val_acc: 0.2265082465277778
Epoch 56/70
    Iteration 1/78, Loss: 10.903042030898016
    Iteration 51/78, Loss: 10.87222532445513
    tr_acc: 0.3847155448717949, val_acc: 0.2260470920138889
Epoch 57/70
    Iteration 1/78, Loss: 10.863484734973103
    Iteration 51/78, Loss: 10.83

### Do some predictions

In [None]:
# predict n next words
n = 3
model = language_model(data,cell_type = 'rnn',batch_size = 1,seq_len= n,use_pre_trained=True,params=params)
k = np.random.choice(5000, 1)[0] #random point
some_data = _divide_into_batches(train_data[k:k+100],1)
pred_str = []
for i in range(10):
    x,_=_get_batch(some_data,i,n)
    scores,_ = model.forward_pass(x)
    predictions = np.argmax(scores, axis = 2)
    #print(predictions)
    print('-'*80)
    print('actual sequnce    : '+str([ids_words[i] for i in x[0]]))
    print('predicted sequence: '+str([ids_words[i] for i in predictions[0]]))
    
    

--------------------------------------------------------------------------------
actual sequnce    : ['the', '<unk>', 'was']
predicted sequence: ['<unk>', '<unk>', 'to']
--------------------------------------------------------------------------------
actual sequnce    : ['<unk>', 'was', 'used']
predicted sequence: ['<unk>', 'to', 'to']
--------------------------------------------------------------------------------
actual sequnce    : ['was', 'used', '<eos>']
predicted sequence: ['to', 'to', '<unk>']
--------------------------------------------------------------------------------
actual sequnce    : ['used', '<eos>', 'workers']
predicted sequence: ['to', 'the', 'and']
--------------------------------------------------------------------------------
actual sequnce    : ['<eos>', 'workers', 'dumped']
predicted sequence: ['the', "'s", 'said']
--------------------------------------------------------------------------------
actual sequnce    : ['workers', 'dumped', 'large']
predicted sequenc

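The predictions do not make much sense because we trained the model on only a small fraction of the data (the first 10,000 tokens of the training set). In the training log above, validation accuracy stays around 0.23 while training accuracy keeps rising.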

### Implementation of Long Short Term Memory

The problem with a plain RNN is that it is not well suited to learning long sequences. This is due to either vanishing or exploding gradients. The figure below shows examples of both.

In the case of a vanishing gradient, when the sequence is long, the gradient shrinks at each step and eventually approaches 0 (no more learning!).

In the case of an exploding gradient, when the sequence is long, the gradient grows at each step and eventually makes training unstable.
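Both effects can be illustrated numerically. The sketch below makes a simplifying assumption: it uses a purely linear recurrence, so each backward step just multiplies the gradient by the transpose of the recurrent weight matrix. A tanh RNN adds a derivative factor of at most 1 at each step, which pushes further toward vanishing. The function name `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(Wh, steps):
    """Norm of a gradient vector after each backward step through a linear recurrence."""
    size = Wh.shape[0]
    grad = np.ones(size) / np.sqrt(size)   # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        grad = Wh.T @ grad                 # one step of backpropagation through time
        norms.append(np.linalg.norm(grad))
    return norms

rng = np.random.default_rng(0)
# An orthogonal matrix preserves vector norms, so scaling it sets the growth rate exactly.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

small = backprop_norms(0.5 * Q, 50)   # norm is halved at every step
large = backprop_norms(1.5 * Q, 50)   # norm grows by a factor of 1.5 at every step

print("after 50 steps, scale 0.5:", small[-1])
print("after 50 steps, scale 1.5:", large[-1])
```

With a scale below 1 the gradient norm decays geometrically toward zero, and with a scale above 1 it grows geometrically, which matches the two failure modes described above.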

To mitigate these problems, LSTM models were designed. LSTMs use gates to decide which of the previous steps should contribute to the prediction at the current step.

The LSTM cell is shown below:

<img src="lstm_pic.png" alt="Drawing" style="width: 600px;"/>
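As a rough guide to how the gates interact, here is a minimal NumPy sketch of one LSTM time step. It is not the required solution for `rnn_layers.py`: the function name `lstm_step_sketch`, the stacked weight layout of shape `(D, 4H)` and `(H, 4H)`, and the gate order (input, forget, output, candidate) are all assumptions made for illustration, and the homework functions may use a different signature.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_sketch(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step with the four gate pre-activations stacked along the last axis."""
    H = h_prev.shape[1]
    a = x_t @ Wx + h_prev @ Wh + b          # shape (N, 4H)
    i = sigmoid(a[:, 0:H])                  # input gate: how much new content to write
    f = sigmoid(a[:, H:2 * H])              # forget gate: how much old cell state to keep
    o = sigmoid(a[:, 2 * H:3 * H])          # output gate: how much cell state to expose
    g = np.tanh(a[:, 3 * H:4 * H])          # candidate cell content
    c_next = f * c_prev + i * g             # additive cell-state update
    h_next = o * np.tanh(c_next)
    return h_next, c_next

rng = np.random.default_rng(0)
N, D, H = 2, 4, 3
x_t = rng.normal(size=(N, D))
h_prev = rng.normal(size=(N, H))
c_prev = rng.normal(size=(N, H))
Wx = rng.normal(scale=0.1, size=(D, 4 * H))
Wh = rng.normal(scale=0.1, size=(H, 4 * H))
b = np.zeros(4 * H)

h_next, c_next = lstm_step_sketch(x_t, h_prev, c_prev, Wx, Wh, b)
print(h_next.shape, c_next.shape)
```

The additive update `c_next = f * c_prev + i * g` is the key design choice: when the forget gate is close to 1 and the input gate is close to 0, the cell state is carried forward almost unchanged, which gives gradients a path that does not shrink at every step.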

Complete the following functions in rnn_layers.py to implement the LSTM architecture:

1. lstm_forward
2. lstm_step_forward
3. lstm_backward
4. lstm_step_backward



Run the code below after implementing the functions.

In [None]:
model = language_model(data,update_rule='SGD_with_momentum',cell_type = 'lstm',batch_size = 128)
params = model.train()

Epoch 1/70
    Iteration 1/78, Loss: 36.84074221531089
    Iteration 51/78, Loss: 34.84889413533025
    tr_acc: 0.21834935897435898, val_acc: 0.2225884331597222
Epoch 2/70
    Iteration 1/78, Loss: 32.2199586028576
    Iteration 51/78, Loss: 30.409138597655648
    tr_acc: 0.2244591346153846, val_acc: 0.2256401909722222
Epoch 3/70
    Iteration 1/78, Loss: 29.496922483596283
    Iteration 51/78, Loss: 29.21081591057568
    tr_acc: 0.20412660256410256, val_acc: 0.20518663194444445
Epoch 4/70
    Iteration 1/78, Loss: 28.418808081939623
