# Sentiment analysis on IMDB reviews

## Loading data

We first load the [publicly available dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) with the following code:

In [None]:
from keras.datasets import imdb

top_words = 10000

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=top_words)

Let us now inspect the training data. We work on the assumption that the test data is not vastly different in nature from the training data.

In [None]:
(x_train, y_train)

The first thing we notice is that there is no text, only numbers. Neural networks work with numbers rather than text, so the dataset ships with every review already encoded as a sequence of integers. The picture below describes how this encoding was done:

<img src="https://docs.microsoft.com/en-us/learn/student-evangelism/analyze-review-sentiment-with-keras/media/2-keras-docs.png" alt="Keras dataset documentation on IMDB reviews" />
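
To make the encoding concrete, here is a minimal sketch on a made-up three-review corpus. It assumes the scheme described in the Keras documentation: words are ranked by frequency, indices are shifted by 3, 1 marks the start of a review, and 2 stands in for unknown or out-of-range words. The names `build_word_index` and `encode`, and the toy reviews, are hypothetical and not part of Keras.

```python
from collections import Counter

def build_word_index(reviews):
    """Rank words by frequency: rank 1 is the most frequent word."""
    counts = Counter(word for review in reviews for word in review.split())
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, word_index, num_words=None,
           start_char=1, oov_char=2, index_from=3):
    """Encode a review as integers, mimicking the assumed IMDB scheme."""
    ids = [start_char]
    for word in review.split():
        rank = word_index.get(word)
        idx = rank + index_from if rank is not None else oov_char
        # Words outside the top `num_words` are replaced by the unknown marker.
        if num_words is not None and idx >= num_words:
            idx = oov_char
        ids.append(idx)
    return ids

toy_reviews = ["the film was great", "the film was dull", "the plot was great"]
toy_index = build_word_index(toy_reviews)
encoded = encode("the film was superb", toy_index)
print(encoded)
```

The word "superb" never occurs in the toy corpus, so it is encoded as 2, just as rare words are in the real dataset when `num_words` is set.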

Now that we know the text is encoded, let us take a look at the dictionary that maps each word to its integer index.

In [None]:
imdb.get_word_index()

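Let us now get the word corresponding to any code, i.e. decode each number back to its word. The indices are shifted by 3 because 0, 1 and 2 are reserved for padding, the start of a review and unknown words.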

In [None]:
def get_reverse_encode() :
    word_dict = imdb.get_word_index()
    word_dict = {key:(value+3) for key, value in word_dict.items()}
    word_dict[''] = 0 #Padding at the start
    word_dict['>'] = 1 #Starting of the review
    word_dict['?'] = 2 #Unknown word
    reverse_word_dict = {value:key for key, value in word_dict.items()}
    return reverse_word_dict

reverse_word_dict = get_reverse_encode()  # build the lookup once, not once per word
print(' '.join(reverse_word_dict[word_id] for word_id in x_train[0]))

In [None]:
from keras.preprocessing import sequence
max_review_length = 500
x_train = sequence.pad_sequences(x_train, maxlen=max_review_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)

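The code above ensures that every review has a constant length of 500 words. Reviews that are shorter are padded with 0, a value reserved for this purpose. By default `pad_sequences` adds the padding at the start of the sequence and truncates longer reviews, also from the start.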
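
The padding behaviour can be illustrated with a small pure-Python sketch. It assumes the default settings of `pad_sequences` (padding and truncation both applied at the start of the sequence); `pad_pre` is a hypothetical helper written for this illustration, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Pad or truncate one sequence to exactly `maxlen` items (maxlen > 0).

    Short sequences get `value` prepended; long sequences keep only
    their last `maxlen` items.
    """
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

short_review = pad_pre([5, 6, 7], maxlen=5)
long_review = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_review)
print(long_review)
```

With these defaults a long review loses its opening words rather than its closing ones.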

In [None]:
from keras.models import Sequential
# Importing from keras.layers directly; the keras.layers.embeddings path is not available in newer Keras releases
from keras.layers import Dense, Embedding, Flatten


embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()  # summary() prints the table itself and returns None

The code above constructs, i.e. configures, the neural network: it defines the layers and their settings. Here we have configured 5 layers. The first is Embedding, which is critical for any neural network that processes a word corpus. Functionally, it maps each integer word index (one of `top_words` = 10000 possible values) to a dense floating-point vector of length 32, which subsequent layers can process easily. Embedding is followed by a Flatten layer and a few Dense layers. Flatten is the bridge that reshapes the 500 x 32 output of Embedding into a single vector for the layers that follow.
The first two Dense layers are fully connected layers of 16 neurons each, the most basic kind of neural network layer. The last Dense layer has a single sigmoid neuron that outputs the probability of a positive review. The number 16 is arbitrary, and we would have to tune it by inspecting the results of training.
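
As a rough cross-check of the model summary, the number of trainable parameters can be worked out by hand. This sketch assumes the standard counts: an embedding table has one row per word, and a Dense layer has one weight per input-unit pair plus one bias per unit. `dense_params` is a hypothetical helper.

```python
top_words = 10000
embedding_vector_length = 32
max_review_length = 500

def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding = top_words * embedding_vector_length            # lookup table
flatten_out = max_review_length * embedding_vector_length  # Flatten has no parameters
dense_1 = dense_params(flatten_out, 16)
dense_2 = dense_params(16, 16)
output = dense_params(16, 1)

total_params = embedding + dense_1 + dense_2 + output
print(embedding, dense_1, dense_2, output, total_params)
```

Almost all parameters sit in the embedding table and the first Dense layer, because Flatten hands that layer 500 x 32 = 16000 inputs.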

In [None]:
hist = model.fit(x_train, y_train, validation_data = (x_test, y_test), epochs=2, batch_size=128)

By calling the `fit` function we train the network on the training data. Training is repeated 2 times (`epochs=2`); each full pass over the training data is known as an epoch. During training the model passes data forward and errors backward through the neurons to tune the parameters. For each epoch, an accuracy score is reported for both the training data and the validation data. Validation accuracy may rise from one epoch to the next, but when training accuracy keeps climbing while validation accuracy stalls or drops, the model is likely overfitting. To visualize this risk, let us plot these values across epochs.
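
To relate `epochs` and `batch_size` to the amount of work done, the sketch below counts the weight updates per epoch. It assumes the standard Keras IMDB split of 25,000 training reviews, and that the last, smaller batch is still processed.

```python
import math

n_train = 25000   # assumed size of the IMDB training split
batch_size = 128
epochs = 2

# Each batch triggers one update; the final batch may be smaller than batch_size.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```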

In [None]:
%matplotlib inline
import matplotlib.pyplot as mpl
import seaborn as sns

sns.set()

acc = hist.history['accuracy']
val_acc = hist.history['val_accuracy']
epochs = range(1, len(acc)+1)

mpl.plot(epochs, acc, '-', label='Training accuracy')
mpl.plot(epochs, val_acc, ':', label='Validation accuracy')
mpl.title('Training vs Validation accuracy')
mpl.xlabel('Epoch')
mpl.ylabel('Accuracy')
mpl.legend(loc='upper left')
mpl.show()

Let us note what this graph shows, then move on to making predictions and judging their accuracy. Remember that prediction accuracy is just one metric of model performance. It should not be used in isolation to select a model.
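
One way to read the curves numerically is to look at the gap between training and validation accuracy at each epoch. The sketch below uses a hypothetical helper, `summarize_history`, and placeholder numbers that are not results from this notebook.

```python
def summarize_history(acc, val_acc):
    """Return the 1-based epoch with the best validation accuracy and
    the per-epoch gap (training accuracy minus validation accuracy)."""
    best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
    gaps = [round(a - v, 4) for a, v in zip(acc, val_acc)]
    return best_epoch, gaps

# Placeholder values for illustration only, not measured results.
example_acc = [0.80, 0.90, 0.95, 0.98]
example_val_acc = [0.84, 0.87, 0.86, 0.85]

best_epoch, gaps = summarize_history(example_acc, example_val_acc)
print(best_epoch, gaps)
```

In the notebook itself the same helper could be called with `hist.history['accuracy']` and `hist.history['val_accuracy']`. A gap that widens after the best epoch is the usual sign of overfitting.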

In [None]:
scores = model.evaluate(x_test, y_test)
print('Model accuracy - %.2f%%' % (scores[1]*100))
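
Since accuracy should not be read in isolation, it helps to see how it relates to other metrics. The sketch below computes accuracy, precision and recall by hand from predicted probabilities, assuming a decision threshold of 0.5. `binary_metrics` and the toy labels and probabilities are hypothetical and are not output of the model above.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision and recall for binary labels."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy values for illustration only.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1]

metrics = binary_metrics(toy_labels, toy_probs)
print(metrics)
```

With the trained model, the probabilities could come from `model.predict(x_test).ravel()` and the labels from `y_test`.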