<a href="https://colab.research.google.com/github/kumar4372/sentiment_analysis_hands_on/blob/master/(participant)_Using_RNN_for_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis Using Recurrent Neural Network**

---



## Set the hardware accelerator to "GPU" for faster computation. Go to "Runtime" -> "Change runtime type" to change it.


In this tutorial, we will use an RNN/LSTM for sentiment analysis on a movie review dataset.

**What is sentiment analysis?**

Sentiment analysis is the task of determining the sentiment a text expresses. Here, that means deciding whether a movie review is positive or negative.

**Example Code to refer**: https://keras.io/examples/nlp/bidirectional_lstm_imdb/

**Notes**
- RNNs are tricky. Choice of batch size is important,
choice of loss and optimizer is critical, etc.
Some configurations won't converge.
- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.

**Importing Libraries**

We start by importing the required dependencies to preprocess our data and build our model.

In [None]:
# Import the dependencies
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, SimpleRNN,LSTM, GRU
from tensorflow.keras.preprocessing import sequence
print("Imported dependencies.")

Imported dependencies.


**Loading Data**

We will use the IMDB sentiment classification dataset, which consists of 50,000 movie reviews from IMDB users, each labeled as either positive (1) or negative (0).

Next, download the IMDB dataset, which is conveniently built into Keras.

In [None]:
# enter your code

**Exploring the data**

You can see in the output above that the dataset is labeled with two categories, 0 or 1, which represent the sentiment of the review. The whole dataset contains 9,998 unique words, and the average review length is 234 words, with a standard deviation of 173 words.
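As a rough illustration of how such statistics can be computed, here is a minimal sketch on a toy list of encoded reviews. The `reviews` data below is made up for the example; in the notebook you would apply the same idea to the loaded IMDB arrays.

```python
from statistics import mean, pstdev

# Toy stand-in for the encoded reviews: each review is a list of word indices.
reviews = [
    [1, 14, 22, 16],
    [1, 194, 14, 22, 9, 6],
    [1, 14, 47, 8, 30, 31, 7, 4],
]

# Length of every review, and the set of distinct word indices across all reviews.
lengths = [len(review) for review in reviews]
unique_words = {index for review in reviews for index in review}

print("Number of unique words:", len(unique_words))
print("Average review length:", mean(lengths))
print("Standard deviation:", round(pstdev(lengths), 2))
```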

In [None]:
# enter your code

We should always check how balanced our training and test data are. This helps in choosing evaluation metrics and in interpreting training progress. You will observe that both the training and test data are perfectly balanced.
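One simple way to check class balance is to count how often each label occurs. The sketch below uses a small made-up `labels` list as a stand-in for the real label arrays.

```python
from collections import Counter

# Toy labels standing in for the real label array (1 = positive, 0 = negative).
labels = [1, 0, 0, 1, 1, 0]

# Count the occurrences of each label and report their share of the data.
counts = Counter(labels)
for label, count in sorted(counts.items()):
    print(f"label {label}: {count} reviews ({count / len(labels):.0%})")
```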

In [None]:
# enter your code

You can see the first review of the dataset, which is labeled as positive (1). 

In [None]:
# enter your code

Now we map from word index back to word so that we can read the reviews.
The get_word_index() function gives us the word-to-index mapping, which we invert. Every unknown word is replaced with a “#”.

In [None]:
index = imdb.get_word_index() # from word to index mapping
reverse_index = dict([(value, key) for (key, value) in index.items()]) # from index to word mapping

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
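To see how the reverse mapping can be used, here is a minimal sketch with a tiny made-up word index. The names `toy_index` and `decode_review` are hypothetical. The offset of 3 assumes the default settings of `imdb.load_data()`, where ids 0, 1 and 2 are reserved for padding, start-of-sequence and unknown words.

```python
# Hypothetical miniature word index; the real one comes from imdb.get_word_index().
toy_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
toy_reverse_index = {value: key for key, value in toy_index.items()}

def decode_review(encoded, reverse_index, offset=3):
    """Turn a list of integer word ids back into text, using '#' for unknowns."""
    # Assumption: ids are shifted by `offset` relative to the word index,
    # because the lowest ids are reserved for special tokens.
    return " ".join(reverse_index.get(i - offset, "#") for i in encoded)

# Start marker, four known words, then one out-of-vocabulary marker.
encoded_review = [1, 4, 5, 6, 7, 2]
print(decode_review(encoded_review, toy_reverse_index))  # -> # the movie was great #
```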


In [None]:
# enter your code

In [None]:
# enter your code

In [None]:
# enter your code

**Data Preparation**

Now it's time to prepare our data. 

As we know, each review consists of a different number of words. Some reviews could even be of length 1, e.g. "nice".

We need to fix the maximum length of our input sequences.

In [None]:
# enter your code

Check how many reviews have a length less than x, where x is any integer.
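A minimal sketch of this check on toy data follows. The helper `count_shorter_than` is a hypothetical name, and `reviews` is a made-up stand-in for the encoded training reviews.

```python
# Toy stand-in for the encoded reviews (lists of word indices).
reviews = [[1, 5], [1, 8, 9, 3], [1, 2, 7, 7, 4, 6], [1]]

def count_shorter_than(sequences, x):
    """Return how many sequences have fewer than x tokens."""
    return sum(1 for seq in sequences if len(seq) < x)

# Try a few thresholds to see how much of the data each one would cover.
for x in (2, 5, 10):
    n = count_shorter_than(reviews, x)
    print(f"length < {x}: {n} reviews ({n / len(reviews):.0%})")
```

Running this kind of check for a few values of x helps in picking a maximum length that covers most reviews without excessive padding.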

In [None]:
# enter your code

Here we set the maximum length of our input sequences to 200.

**Please feel free to choose your own maximum length**

In [None]:
# enter your code

To visualize how padding works, let's print a few example sequences before and after padding.
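The following pure-Python sketch imitates what padding does to a sequence. It is not the Keras implementation; the helper `pad_or_truncate` is hypothetical and assumes the default behaviour of Keras' `pad_sequences`, which pads with 0 at the front and, for sequences that are too long, keeps only the last `maxlen` items.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    if len(seq) >= maxlen:
        # Too long (or exactly right): drop items from the front.
        return seq[len(seq) - maxlen:]
    # Too short: fill the front with the padding value.
    return [value] * (maxlen - len(seq)) + seq

short_review = [1, 2, 3]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, maxlen=5))  # -> [0, 0, 1, 2, 3]
print(pad_or_truncate(long_review, maxlen=5))   # -> [3, 4, 5, 6, 7]
```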


In [None]:
# enter your code

In [None]:
# enter your code

**BUILDING AND TRAINING THE MODEL**

Now our data is ready for some modelling!

Deep learning models have layers.

The top layer takes in the data we've just prepared, the middle layers do some math on this data and the final layer produces an output we can hopefully make use of.

In our case, our model has three layers:

1. Embedding layer
2. LSTM layer
3. Dense layer.

Our model begins with the line model = Sequential(). Think of this as simply stating "our model will flow from input to output layer in a sequential manner" or "our model goes one step at a time".

**Embedding layer**

The Embedding layer learns a lookup table that maps each word index to a dense vector, so that the relationships between words can be captured numerically.

model.add(Embedding(max_words, embedding_vector_length, input_length=max_review_length)) is saying: add an Embedding layer to our model and use it to turn each of our words into an embedding_vector_length-dimensional vector. These vectors have some mathematical relationship to each other.

So each of our words will become a vector of dimension embedding_vector_length.

For example, vector of "the" = [0.556433, 0.223122, 0.789654....].

Don't worry for now about how this is computed; Keras does it for us.
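Conceptually, an embedding is a table with one row per word index, and embedding a review means looking up one row per word. The sketch below shows only this lookup, in plain Python with random toy values. It is not how Keras implements the layer, and `vocab_size` and `embedding_dim` are small stand-ins for max_words and embedding_vector_length.

```python
import random

random.seed(0)  # make the toy values reproducible

vocab_size = 10     # stand-in for max_words
embedding_dim = 4   # stand-in for embedding_vector_length

# One row of numbers per word index. In a real model these values start out
# random and are adjusted during training.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

# A padded toy review of word indices; embedding it is one table lookup per word.
review = [0, 0, 3, 7, 3]
embedded = [embedding_table[word_id] for word_id in review]

print(len(embedded), "tokens x", len(embedded[0]), "dimensions")
```

Note that the same word index (3 in this toy review) always maps to the same vector.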

**LSTM layer**

model.add(LSTM(128)) is saying: add an LSTM layer after our embedding layer in our model and give it 128 units.

**Dense layer**

model.add(Dense(1, activation='sigmoid')) is saying: add a Dense layer to the end of our model and use a sigmoid activation function to produce a meaningful output.

A dense layer is also known as a fully-connected layer. This layer connects the 128 LSTM units in the previous layer to 1 unit. This last unit then takes all this information and runs it through a sigmoid function.
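To make the last step concrete, here is a small sketch of what a single dense unit with a sigmoid activation computes: a weighted sum of its inputs plus a bias, squashed into the range (0, 1). The input values, weights and bias below are made up; in the real model there would be 128 inputs, and the weights would be learned.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(inputs, weights, bias):
    """One dense unit: weighted sum of the inputs plus bias, then sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Toy values; the real layer would receive 128 LSTM outputs.
lstm_output = [0.5, -1.0, 2.0]
weights = [0.2, 0.4, 0.1]
bias = 0.1

probability = dense_sigmoid(lstm_output, weights, bias)
print(f"Predicted probability of a positive review: {probability:.3f}")
```

An output close to 1 is read as "positive" and an output close to 0 as "negative".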

**Please feel free to change the model architecture.**


In [None]:
# enter your code

**Compiling the model**

Now we compile our model, which simply means configuring it for training. We use the “adam” optimizer, the algorithm that updates the weights and biases during training. We also choose binary cross-entropy as the loss (because this is a binary classification problem) and accuracy as our evaluation metric.
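To build some intuition for the loss, the sketch below computes binary cross-entropy by hand on made-up labels and predictions. It is a plain-Python illustration of the formula, not the Keras implementation; the `eps` clipping is an assumption added here to avoid taking log(0).

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(t*log(p) + (1-t)*log(1-p)) over all examples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]  # confident and correct
mostly_wrong = [0.1, 0.9, 0.2, 0.8]  # confident and incorrect

loss_right = binary_crossentropy(labels, mostly_right)
loss_wrong = binary_crossentropy(labels, mostly_wrong)
print(f"loss when mostly right: {loss_right:.4f}")
print(f"loss when mostly wrong: {loss_wrong:.4f}")
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong.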

In [None]:
# enter your code

**Summarize the model**

Making a summary of the model will give us an idea of what's happening at each layer.

In the embedding layer, each of our words is being turned into a vector of dimension 128. Because there are 10000 words (max_words), there are 1,280,000 parameters (128 x 10000).

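Parameters are the trainable weights of the model, i.e. the numbers that are adjusted during training. The parameter count of a layer depends on its input and output sizes, not on how much data flows through it.

The LSTM layer has 131,584 parameters: 4 × [128 × (128 + 128) + 128]. The factor 4 comes from the four gate computations inside an LSTM cell; each one has weights for the 128-dimensional input and the 128-dimensional hidden state, plus 128 biases.

The final dense layer connects the outputs of the 128 LSTM units to one cell, which gives 129 parameters (128 weights + 1 bias).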
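The arithmetic above can be double-checked with a few lines of Python. The helper functions are hypothetical names introduced for this sketch, and the formulas assume the standard layer definitions (an embedding table, an LSTM with four gates and biases, a dense layer with a bias).

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per word in the vocabulary."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and biases."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input for every unit, plus one bias per unit."""
    return input_dim * units + units

max_words = 10000
embedding_vector_length = 128
lstm_units = 128

counts = {
    "embedding": embedding_params(max_words, embedding_vector_length),
    "lstm": lstm_params(embedding_vector_length, lstm_units),
    "dense": dense_params(lstm_units, 1),
}
for name, count in counts.items():
    print(f"{name}: {count:,}")
print(f"total: {sum(counts.values()):,}")  # -> total: 1,411,713
```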

In [None]:
# enter your code

**Fitting the model to the training data**

Now that our model is compiled, it's ready to be set loose on our training data.

We'll be training for 3 epochs with a batch_size of 64.

Because the optimizer works to minimize the loss, the model's training accuracy should generally improve after each cycle (epoch).

model.fit(X_train, y_train, epochs=3, batch_size=64) is saying: fit the model we've built on the training dataset for 3 cycles and go over 64 reviews at a time.

Here the test data is used as validation data. If you would rather hold out part of the training data for validation, use the validation_split parameter of model.fit.
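To get a feel for what these settings mean in practice, the sketch below works out how many weight updates training involves. It assumes the training set holds 25,000 reviews (half of the 50,000 in the dataset), and the 0.2 validation fraction is just an example value.

```python
import math

num_train = 25000   # assumption: half of the 50,000 reviews are used for training
batch_size = 64
epochs = 3

# Each epoch walks through the training data once, one batch at a time.
steps_per_epoch = math.ceil(num_train / batch_size)
total_updates = steps_per_epoch * epochs
print(f"{steps_per_epoch} batches per epoch, {total_updates} weight updates in total")

# Holding out a fraction of the training data for validation leaves fewer
# reviews to train on, and therefore fewer batches per epoch.
validation_split = 0.2  # example value, not taken from the notebook
num_val = int(num_train * validation_split)
num_fit = num_train - num_val
print(f"{num_fit} reviews to fit, {num_val} for validation, "
      f"{math.ceil(num_fit / batch_size)} batches per epoch")
```

A smaller batch_size means more (but noisier) updates per epoch, while more epochs mean more passes over the same data.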

**Please Feel free to change the number of epochs or batch_size**


In [None]:
# enter your code

It is time to evaluate our model:
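Accuracy is simply the fraction of reviews whose predicted label matches the true label. The sketch below computes it by hand on made-up probabilities, assuming the usual 0.5 threshold for turning a sigmoid output into a 0/1 label. The names `y_test_toy` and `probs_toy` are hypothetical.

```python
def accuracy(y_true, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for t, p in zip(y_true, predictions) if t == p)
    return correct / len(y_true)

# Toy labels and toy model outputs (probabilities of the positive class).
y_test_toy = [1, 0, 1, 1, 0]
probs_toy = [0.92, 0.35, 0.41, 0.77, 0.08]

print(f"accuracy: {accuracy(y_test_toy, probs_toy):.0%}")
```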

In [None]:
# enter your code

Let's analyze the results by looking at some examples.
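One way to pick examples for inspection is to collect the positions where the prediction and the true label disagree (or agree). The sketch below does this on made-up values; with a real model, the probabilities would come from its predictions on the test data.

```python
# Toy labels and toy predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.92, 0.35, 0.41, 0.77, 0.63, 0.08]

# Turn probabilities into 0/1 labels using a 0.5 threshold.
predicted = [int(p > 0.5) for p in probs]

# Indices of misclassified and correctly classified reviews.
wrong = [i for i, (t, p) in enumerate(zip(y_true, predicted)) if t != p]
correct = [i for i, (t, p) in enumerate(zip(y_true, predicted)) if t == p]

for i in wrong:
    print(f"review {i}: true={y_true[i]}, predicted={predicted[i]}, p={probs[i]:.2f}")
print("correctly classified:", correct)
```

Looking at the probability alongside the label shows whether a wrong prediction was a near miss or a confident mistake.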

In [None]:
# enter your code

In [None]:
# enter your code

In [None]:
# enter your code

In [None]:
# enter your code

In [None]:
# enter your code

In [None]:
# enter your code

In [None]:
# enter your code

In [None]:
# enter your code

## Some samples of wrong predictions

In [None]:
# enter your code

## Some samples of correct predictions

In [None]:
# enter your code