### Recurrent Neural Network


ANNs and CNNs fail on sequential data because they are stateless: they cannot remember past inputs.
Ex: a stateless model cannot reliably tell "This is a good movie" from "This is not a good movie", because it has no memory that "not" came before "good".

An RNN is a sequential model. It takes one or more input vectors and generates one or more output vectors. The outputs are determined not only by the weights applied to the inputs, as in a regular NN, but also by a "hidden" state vector that summarizes the inputs seen so far. Therefore, the same input may generate a different output depending on the previous inputs in the series.
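The role of the hidden state can be sketched with a single recurrent step in NumPy. This is a minimal illustration, not a trained model: the weight names (`W_xh`, `W_hh`, `b_h`), the sizes, and the random values are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# Illustrative random parameters (assumed for this sketch, not trained values).
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

x = np.array([1.0, 0.0, -1.0])
empty_history = np.zeros(hidden_size)
other_history = rnn_step(np.array([0.5, 2.0, 0.0]), empty_history)

# The same input x gives different states because the histories differ.
h_after_zero_history = rnn_step(x, empty_history)
h_after_other_history = rnn_step(x, other_history)

print(h_after_zero_history)
print(h_after_other_history)
```

The two printed vectors differ even though `x` is identical in both calls, which is the statefulness described above.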

This looping back to better understand the sentence is called a "feedback loop" or "temporal loop". Unlike a feed-forward NN, an RNN feeds its hidden state back into the network at every time step. This is what links inputs together and what enables RNNs to process sequential and temporal data. The loop is trained with Backpropagation Through Time (BPTT), which unrolls the network over the time steps and back-propagates the error through them.

Applications: sentiment analysis, text summarization, machine translation, image captioning

The "Vanishing gradient" problem:
In a NN, information passes from the input layer to the output layer, and the error is back-propagated to update the weights. In an RNN a similar thing happens, but information also travels through time, and we calculate the error at each time step. To minimize the error, every neuron that participated in calculating the output should have its weight updated, including the neurons from earlier time steps. Hence, many neurons' weights need to be updated.

We start by initializing the weights randomly with values close to zero. The recurrent weights, which connect the hidden layer to itself in the unrolled temporal loop, are multiplied into the gradient once per time step. When these values are very small, the gradient shrinks with every step back in time, which makes it harder for the network to update the weights, so training takes longer to reach the final result.

The training at time step t depends on inputs coming from earlier time steps that are themselves poorly trained. So, because of the vanishing gradient, the whole network is not trained properly. The further back you go through the network, the smaller the gradient becomes and the harder it is to train the weights, which has a domino effect on all the weights that depend on them.

Solution:
Initialize the weights properly.
Use an echo state network.
Use an LSTM.

Exploding gradient solution:
Stop back-propagation early (truncated back-propagation), which leads to a less optimal result.
Penalize or artificially reduce the gradient.
Put a maximum limit on the gradient (gradient clipping).
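Putting a maximum limit on the gradient is commonly done by rescaling it when its norm is too large. The sketch below is a plain-Python illustration of that idea; the function name `clip_by_norm` and the example values are assumptions for this example. In Keras, optimizers accept `clipnorm` and `clipvalue` arguments for the same purpose.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)          # small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]

large_grad = clip_by_norm([3.0, 4.0], max_norm=1.0)  # norm 5, gets rescaled
small_grad = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, kept as-is

print(large_grad)
print(small_grad)
```

Clipping keeps the direction of the gradient and only shrinks its length, so one oversized update cannot blow up the weights.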


### Types of RNN architectures

1. One to One (non-sequential): basically a multi-layer perceptron, because it takes a single input and generates a single output.

2. One to Many: Accepts a single input and generates an output at each time step. Use case: image captioning, where the model takes an image as input and generates one word of the caption describing that image at each time step.

3. Many to One: Takes many inputs and generates one output. Use case: sentiment analysis, where we provide a sequence of text (a different input at each time step) and the model generates a single output (the review rating).

4. Many to Many: Accepts many inputs and generates many outputs. The hidden state from time t-1 is needed to generate the output at time t. Use cases: POS (part of speech) tagging, where we give the model an input sequence of text (multiple inputs) and the model predicts the POS tag for each word; speech recognition.

5. Encoder - Decoder: A special type of many to many RNN. Here the RNN is separated into two parts, an encoder and a decoder: Encoder -> context vector -> decoder -> output. Use case: machine translation, where we feed text in one language to the model, and the model encodes and decodes it to generate text in another language. We separate the encoder from the decoder because a word in the source language does not necessarily have a corresponding word in the target language.
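The difference between "many to one" and "many to many" comes down to which hidden states are kept. The sketch below assumes a simple tanh RNN with made-up random weights; the `return_sequences` flag is named after the Keras `SimpleRNN` argument with the same meaning, but the function `run_rnn` itself is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
timesteps, input_size, hidden_size = 5, 3, 4

# Illustrative random weights (assumed for this sketch).
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))

def run_rnn(inputs, return_sequences):
    """Run the RNN over all time steps; return every state or only the last one."""
    h = np.zeros(hidden_size)
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return np.stack(states) if return_sequences else h

sequence_input = rng.normal(size=(timesteps, input_size))

many_to_one = run_rnn(sequence_input, return_sequences=False)   # e.g. sentiment
many_to_many = run_rnn(sequence_input, return_sequences=True)   # e.g. POS tagging

print(many_to_one.shape)   # (4,)
print(many_to_many.shape)  # (5, 4)
```

A many to one model puts its output layer on the single final state, while a many to many model applies an output layer to the state at every time step.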

### Long Short-Term Memory Networks (LSTM)

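If the recurrent weight W_rec is < 1, we get the vanishing gradient problem, and if W_rec > 1, we get the exploding gradient problem. The LSTM design addresses this by effectively keeping W_rec = 1 along the cell state path, so the gradient is neither shrunk nor amplified as it flows back through time.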
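The effect of W_rec can be illustrated with a toy calculation. This sketch assumes a scalar RNN with a linear activation, where a gradient flowing back through `steps` time steps is simply multiplied by `w_rec` each step; real networks also involve activation derivatives, so this is only an approximation of the idea.

```python
def gradient_factor(w_rec, steps):
    """Factor picked up by a gradient flowing back `steps` time steps
    in a toy scalar RNN with a linear activation (an assumption of this sketch)."""
    return w_rec ** steps

for w_rec in (0.9, 1.0, 1.1):
    print(w_rec, gradient_factor(w_rec, steps=50))
```

After 50 steps the factor is far below 1 for `w_rec = 0.9`, exactly 1 for `w_rec = 1.0`, and above 100 for `w_rec = 1.1`, which matches the vanishing, stable, and exploding cases.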

Instead of a single NN layer in each cell, an LSTM cell has 4 layers that interact in a special way.

1. Sigmoid layer ("forget gate layer"): decides what information to throw away from the cell state. It looks at H_t-1 (previous cell output) and X_t (current input) and outputs a number between 0 and 1 for each number in the cell state C_t-1. 1 means keep completely and 0 means delete completely.

2. Sigmoid and tanh layers (second and third layers): decide what new information to store in the cell state. First, a sigmoid layer called the "input gate layer" decides which values to update. Next, a tanh layer creates a vector of new candidate values that could be added to the state. These two are then combined into an update to the state: the old state C_t-1 is multiplied by the forget gate output and the gated candidate values are added, giving the new cell state C_t that is passed on to the next time step.

3. Sigmoid layer (fourth layer): decides what to output for the current step, H_t. First a sigmoid layer decides which parts of the cell state to output. Then we put the cell state through tanh (to push the values between -1 and 1) and multiply it by the output of the sigmoid gate, so we only output the parts that we decided on.
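The four layers described above can be sketched as one LSTM time step in NumPy. This is a minimal illustration under assumptions: the four layers share one stacked weight matrix `W` applied to the concatenation of H_t-1 and X_t, the gate ordering inside `W` is arbitrary, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the four stacked layers."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * n:1 * n])     # forget gate: what to drop from c_prev
    i_t = sigmoid(z[1 * n:2 * n])     # input gate: which values to update
    c_hat = np.tanh(z[2 * n:3 * n])   # candidate values for the state
    o_t = sigmoid(z[3 * n:4 * n])     # output gate: what to expose as h_t
    c_t = f_t * c_prev + i_t * c_hat  # new cell state
    h_t = o_t * np.tanh(c_t)          # new output / hidden state
    return h_t, c_t

rng = np.random.default_rng(2)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

Note that the cell state update is additive (`f_t * c_prev + i_t * c_hat`), which is what lets information and gradients pass through many time steps largely unchanged when the forget gate is close to 1.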


In [9]:
#Load the dataset

from tensorflow.keras.datasets import imdb

#keep only the 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

#train_data and test_data are sequences of integers (word indices)



In [10]:
print(len(train_data))
print(len(test_data))


25000
25000


In [11]:
print(train_data[0])
print(test_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
[1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4,

In [12]:
#decode one of the reviews back to english words
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i-3, '?') for i in train_data[0]])
print(decoded_review)


? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you thi
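The `i-3` in the decoding step exists because `imdb.load_data` shifts every word index by 3 by default and reserves ids 0, 1, and 2 for padding, start of review, and unknown word. The sketch below reproduces that convention on a tiny made-up vocabulary; `toy_word_index` and the helper names are hypothetical and stand in for the real `word_index`.

```python
toy_word_index = {"the": 1, "film": 2, "great": 3}  # hypothetical word ranks
INDEX_FROM = 3   # real words start after the reserved ids 0, 1, 2
START_ID, UNKNOWN_ID = 1, 2

def encode(words):
    """Encode words the way the IMDB reviews are stored."""
    ids = [START_ID]
    for w in words:
        ids.append(toy_word_index[w] + INDEX_FROM if w in toy_word_index else UNKNOWN_ID)
    return ids

def decode(ids):
    """Invert the encoding; reserved ids come back as '?'."""
    reverse = {rank: word for word, rank in toy_word_index.items()}
    return " ".join(reverse.get(i - INDEX_FROM, "?") for i in ids)

encoded = encode(["the", "film", "was", "great"])
print(encoded)          # [1, 4, 5, 2, 6]
print(decode(encoded))  # ? the film ? great
```

This is why the decoded review above begins with `?` (the start marker) and shows `?` wherever a word fell outside the vocabulary.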

In [13]:
#pad or truncate every review to the same length (500) so that all input vectors have the same shape

from tensorflow.keras.preprocessing import sequence

X_train = sequence.pad_sequences(train_data,maxlen=500)
X_test = sequence.pad_sequences(test_data,maxlen=500)



In [14]:
print(X_train.shape)
print(X_test.shape)

(25000, 500)
(25000, 500)
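What `pad_sequences` does to a single review can be sketched in plain Python. With its default settings, Keras pads with zeros at the front and, for reviews that are too long, drops items from the front as well. The helper `pad_pre` below is a hypothetical stand-in written for illustration, not the library function.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([5, 6, 7], maxlen=5))           # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

Padding at the front means the real words sit at the end of the sequence, closest to the final RNN state that the classifier reads.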


In [17]:
#Create model architecture

from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.models import Sequential

model = Sequential()
#arguments: vocabulary size, embedding vector size
model.add(Embedding(10000,64)) #output size = (batch_size,maxlen=500,k=64)

#the RNN cell is unrolled maxlen (500) times, one step at a time, and produces a final vector of length 32
model.add(SimpleRNN(32)) 
#fully connected layer
model.add(Dense(1,activation='sigmoid'))
print(model.summary())

#Here, we have a single embedding matrix which gets trained and passes its output to the RNN cells.



None
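The number of trainable parameters in this architecture can be worked out by hand. The sketch below assumes the standard layouts: one embedding vector per vocabulary entry, a `SimpleRNN` with an input kernel, a recurrent kernel and a bias, and a `Dense` layer with one weight per input plus a bias.

```python
vocab_size, embed_dim, rnn_units = 10000, 64, 32

# Embedding: one 64-dimensional vector per word id.
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + bias, for each of the 32 units.
rnn_params = rnn_units * (embed_dim + rnn_units + 1)

# Dense(1): one weight per RNN output plus one bias.
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 640000 3104 33 643137
```

Almost all of the parameters sit in the embedding matrix. The RNN itself is small because the same weights are reused at all 500 time steps.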


In [18]:
#compile the model
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])


In [19]:
#Callbacks: ModelCheckpoint, EarlyStopping

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping

checkpoint = ModelCheckpoint("best_model.h5",monitor = "val_loss",verbose=0,
                                save_best_only=True,save_weights_only=False)


earlystop = EarlyStopping(monitor='val_acc',patience=1)
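The stopping rule behind `patience=1` can be sketched in plain Python: training stops at the first epoch where the monitored value fails to improve on the best value seen so far. The function below is a simplified illustration, not the Keras implementation, and the validation accuracies are made-up numbers.

```python
def epochs_run(val_acc_per_epoch, patience=1):
    """Return how many epochs run before a patience-based stop."""
    best = float("-inf")
    wait = 0
    for epoch, val_acc in enumerate(val_acc_per_epoch, start=1):
        if val_acc > best:
            best, wait = val_acc, 0   # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_acc_per_epoch)

made_up_val_acc = [0.61, 0.70, 0.69, 0.75]  # hypothetical values
print(epochs_run(made_up_val_acc))  # 3
```

With `patience=1` a single noisy epoch ends training, even though the fourth epoch in this made-up run would have been better. Note also that the checkpoint above monitors `val_loss` while early stopping monitors `val_acc`, so the saved model and the stopping epoch are chosen by different criteria.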

In [None]:
#Model Training

hist = model.fit(X_train,train_labels,validation_split=0.2,epochs=10,
                    batch_size=128,callbacks=[checkpoint,earlystop])

Epoch 1/10
2026-01-31 13:08:31.152806: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
157/157 ━━━━━━━━━━━━━━━━━━━━ 18742s 119s/step - acc: 0.5823 - loss: 0.6615 - val_acc: 0.6078 - val_loss: 0.6506
Epoch 2/10
 77/157 ━━━━━━━━━━━━━━━━━━━━ 24:13:57 1090s/step - acc: 0.7436 - loss: 0.5157

In [None]:
#plot the accuracy
import matplotlib.pyplot as plt

acc = hist.history['acc']
val_acc = hist.history['val_acc']
epochs = range(1,len(acc)+1)

plt.title("Accuracy vs Epochs")
plt.plot(epochs,acc,label="Training Acc")
plt.plot(epochs,val_acc,label="Val Acc")
plt.legend()
plt.show()

In [None]:
#plot the loss

loss = hist.history['loss']
val_loss = hist.history['val_loss']
epochs = range(1,len(loss)+1)

plt.title("Loss vs Epochs")
plt.plot(epochs,loss,label="Training Loss")
plt.plot(epochs,val_loss,label="Val Loss")
plt.legend()
plt.show()

In [None]:
model.evaluate(X_test,test_labels)

In [None]:
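#predict sentiment of new sentence

sent = "This movie is really bad . I do not like this movie because the direction was horrible ."

# Convert each word to the integer id used by imdb.load_data(num_words=10000):
# ids are shifted by 3 (0 = padding, 1 = start of review, 2 = unknown word)
inp = [1]
for word in sent.lower().split():
  idx = word_index.get(word)
  if idx is not None and idx + 3 < 10000:
    inp.append(idx + 3)
  else:
    inp.append(2)

print(inp) 

# Perform the padding
final_input = sequence.pad_sequences([inp],maxlen=500)

# Finally predict the sentiment (sigmoid output: close to 1 = positive, close to 0 = negative)
model.predict(final_input)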
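The value returned by `model.predict` is a sigmoid output between 0 and 1, and in the IMDB labels 1 means a positive review and 0 a negative one. A small helper can turn that number into a readable label; the function name, the 0.5 threshold, and the example probability below are assumptions for illustration.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a sentiment label."""
    return "positive" if probability >= threshold else "negative"

made_up_probability = 0.12  # hypothetical value, not a real model output
print(to_label(made_up_probability))  # negative
```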