### Data
[movie reviews](https://storage.googleapis.com/kaggle-datasets/134715/320111/imdb-dataset-of-50k-movie-reviews.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1571493433&Signature=nFHTnAGZQMsR1jV5hxDRg7oe%2B4IgODD0zbdBtb0A0Vwx1YmFaqGQcxrVMUtjagdyyr4OaLIL41XLwo57dP6AiZArsNflWWZnDewCZ8eNzJ4Ha%2FdZqqo99TLI3TrywWZEOIP4GbIc%2Fl5rTNGPYTVqAc0B0CTEfAP4Nb%2BwfTcgsr1zBTj4ARU7PD5yKhBTlntGfAw0CgwJZ3d009MnFrqY6B3XEAhgO0Y9IBtlUSlkL0HaU%2FDzkfqHAef45h1kluAGtKfZ2xzRb%2FEv8TVKpbUsB%2BlkNJ3Q%2Bts7pxVXxyrF%2BuhKsky7Rh50sCzmoDiHZe%2B67BC3PJtye5lHbhuPvcCUEQ%3D%3D)

In [None]:
# common libraries

import pandas as pd
import numpy as np
import re
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten
from keras.layers import GlobalMaxPooling1D
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
import nltk
from nltk.corpus import stopwords

In [None]:
#load data
imdb=pd.read_csv('IMDB Dataset.csv')

In [None]:
#inspect data
print(imdb.isnull().sum())
print(imdb.shape)
print(imdb['review'][4])

In [None]:
# visualize

import seaborn as sns

sns.countplot(x='sentiment', data=imdb)

In [None]:
# data processing functions to remove HTML tags and other unwanted characters (punctuation, digits, single letters)

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

In the `preprocess_text()` function, the first step is to remove the HTML tags using the `remove_tags()` function, which simply replaces anything between an opening `<` and a closing `>` with an empty string.

Next, `preprocess_text()` replaces everything except upper- and lower-case English letters with a space, which leaves behind single characters that make no sense. For instance, when the apostrophe in the word "Mark's" is replaced by a space, we are left with the single character "s".

Next, we replace each isolated single character, together with its surrounding whitespace, with one space. The earlier substitutions can leave runs of several spaces in the text, so finally we collapse every run of whitespace into a single space.
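
As a quick check of the steps described above, the following sketch applies the same four substitutions to a short made-up string and prints each intermediate result. The sample text is an assumption chosen to exercise every step; it is not a review from the dataset.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

# hypothetical sample string, not taken from the IMDB data
sample = "Mark's film <br /> was 10/10 great!"

step1 = TAG_RE.sub('', sample)                   # drop HTML tags
step2 = re.sub('[^a-zA-Z]', ' ', step1)          # non-letters become spaces
step3 = re.sub(r"\s+[a-zA-Z]\s+", ' ', step2)    # drop isolated single letters
step4 = re.sub(r'\s+', ' ', step3)               # collapse runs of whitespace

# the bars make leading and trailing spaces visible
for name, value in [("tags", step1), ("letters", step2),
                    ("singles", step3), ("spaces", step4)]:
    print(f"{name:8s}|{value}|")
```

Note that the space left behind by the final exclamation mark is not stripped; calling `.strip()` on the result would remove it if that matters for your use.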

In [None]:
#preprocess our reviews and store them in a new list

reviews = []
sentences = list(imdb['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))
    
print(reviews[4])

In [None]:
#convert our labels into digits, 1 for positive and 0 for negative

sentiments = imdb['sentiment']

sentiments = np.array(list(map(lambda x: 1 if x=="positive" else 0, sentiments)))

print(sentiments[1:5])

In [None]:
#divide dataset into train and test sets. The train set will be used to train our deep learning models while the test set will be used to evaluate how well our model performs.

X_train, X_test, y_train, y_test = train_test_split(reviews, sentiments, test_size=0.20, random_state=42)

#This code divides our data into 80% for the training set and 20% for the testing set.

Let's now prepare the input for our embedding layer. The embedding layer maps integer word indices to dense numeric vectors and is used as the first layer of the deep learning models in Keras.

We will use the `Tokenizer` class from the `keras.preprocessing.text` module to create a word-to-index dictionary. In this dictionary, each word in the corpus is a key, and a corresponding unique index is the value for that key.
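
To make the idea of a word-to-index dictionary concrete, here is a minimal plain-Python sketch. It is a simplified stand-in, not the Keras implementation: the function names `build_word_index` and `to_sequences` are hypothetical, and the sketch only lowercases and splits on whitespace. It does mirror two properties of the Keras `Tokenizer`: indices start at 1 with the most frequent word first, and a `num_words` limit keeps only indices below that limit.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their indices, skipping unknown or too-rare words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

# toy corpus, made up for illustration
toy_corpus = ["the movie was good", "the movie was bad", "the plot"]
toy_word_index = build_word_index(toy_corpus)
print(toy_word_index)
print(to_sequences(["the plot was good"], toy_word_index, num_words=4))
```

In the second call, "plot" and "good" are dropped because their indices are not below the `num_words` limit of 4.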

In [None]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

If you view the `X_train` variable in the variable explorer, you will see that it contains 40,000 lists of integers. Each list corresponds to one review in the training set. You will also notice that the lists have different sizes, because the reviews have different lengths.

We set the maximum size of each list to 100; you can try a different size. Lists longer than 100 are truncated to 100 (by default, `pad_sequences` drops tokens from the start of the sequence). For lists shorter than 100, we add 0s at the end of the list until it reaches the maximum length. This process is called padding.
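
The padding rule can be sketched in a few lines of plain Python. This is an illustration, not the Keras code: `pad_post` is a hypothetical helper that pads at the end, as `padding='post'` does, and assumes the default `truncating='pre'` behaviour of `pad_sequences`, so over-long lists keep their last `maxlen` items.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate from the front and pad at the end so every list has maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                          # keep the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the rest with zeros
    return padded

# toy sequences with a small maxlen so the effect is easy to see
toy_padded = pad_post([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)
```

The short list gains one trailing zero, while the long list loses its first two items.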

In [None]:
print(len(X_train[0]),len(X_train[1]))

In [None]:
#find the vocabulary size and then perform padding on both train and test set.

# Adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

Now if you view `X_train` or `X_test`, you will see that all the lists have the same length, i.e. 100.

Also, the `vocab_size` variable now contains the value 92547, which means that our training corpus has 92546 unique words, plus the reserved index 0.

We will use GloVe embeddings to create our feature matrix. 

In the following script we load the GloVe word embeddings and create a dictionary that will contain words as keys and their corresponding embedding list as values.

In [None]:
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

# each line of the GloVe file holds a word followed by its vector components
with open('glove.6B.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

Finally, we will create an embedding matrix where each row number corresponds to the index of a word in the corpus. The matrix will have 100 columns, so each row holds the 100-dimensional GloVe embedding of one word in our corpus.
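
The layout of the matrix is easier to see on a toy example. The sketch below uses made-up 3-dimensional vectors and a made-up word index (all names prefixed with `toy_` are assumptions for illustration). Row 0 stays zero because index 0 is reserved for padding, and a word with no pre-trained vector also keeps a row of zeros.

```python
import numpy as np

# hypothetical 3-dimensional stand-ins for GloVe vectors
toy_embeddings = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3], dtype="float32"),
}
# hypothetical word index; "zzyzx" has no pre-trained vector
toy_word_index = {"good": 1, "bad": 2, "zzyzx": 3}

toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, index in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[index] = vector   # row number equals the word's index

print(toy_matrix)
```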



In [None]:
embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
        
# embedding_matrix will contain 92547 rows (one per word index, including the reserved index 0). Now we are ready to create our deep learning models.

In [None]:
# Simple neural network

model = Sequential()
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=maxlen , trainable=False)
model.add(embedding_layer)

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.summary()

In the script above, we create a `Sequential()` model. Next, we create our embedding layer. The embedding layer has an input length of 100, and its output vector dimension is also 100. The vocabulary size is 92547. Since we are using the GloVe embeddings rather than training our own, we set `trainable` to `False` and pass our embedding matrix in the `weights` argument.

The embedding layer is then added to our model. Next, since we are directly connecting our embedding layer to a densely connected layer, we flatten the output of the embedding layer. Finally, we add a dense layer with a `sigmoid` activation function.

To compile our model, we will use the `adam` optimizer, `binary_crossentropy` as our loss function and `accuracy` as metrics and then we will print the summary of our model:

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

model.summary()

Since the embedding layer has `92547` rows and each word is represented as a 100-dimensional vector, the layer holds `92547x100` parameters. Because we set `trainable=False`, these parameters are non-trainable. The flattening layer has no parameters of its own; it simply turns the 100x100 output (rows times columns) into a vector of `10000` values. Finally, the dense layer has `10000` weights (one per value from the flattening layer) and 1 bias parameter, for a total of `10001`.
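
The counts can be reproduced with a little arithmetic. The sketch below uses the vocabulary size of 92547 reported in the text; because the embedding layer is created with `trainable=False`, its weights count as non-trainable, and only the dense layer's parameters are updated during training.

```python
vocab_size = 92547     # value reported in the text
embedding_dim = 100    # size of each GloVe vector
maxlen = 100           # padded review length

embedding_params = vocab_size * embedding_dim   # frozen GloVe weights
flatten_size = maxlen * embedding_dim           # a reshape only, no parameters
dense_params = flatten_size * 1 + 1             # one weight per input plus a bias

print("non-trainable:", embedding_params)
print("flattened size:", flatten_size)
print("trainable:", dense_params)
```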

We use the `fit` method to train our neural network. Notice that we are training on our train set only. The `validation_split` of 0.2 means that 20% of the training data is held out and used to compute the validation accuracy of the model.
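
As a rough picture of what a validation split of 0.2 does, the sketch below holds out the last 20% of a toy list. The helper `split_validation` is hypothetical; it follows the behaviour described in the Keras documentation, where the validation samples are taken from the end of the data before any shuffling.

```python
def split_validation(samples, validation_split):
    """Return (training part, validation part), holding out the tail."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

# ten toy samples stand in for the training reviews
train_part, val_part = split_validation(list(range(10)), 0.2)
print(train_part)
print(val_part)
```

With the 40,000 training reviews used here, the same rule leaves 32,000 reviews for fitting and 8,000 for validation.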

In [None]:
history = model.fit(X_train, y_train, batch_size=128, epochs=6, verbose=1, validation_split=0.2)

In [None]:
# To evaluate the performance of the model, we can simply pass the test set to the evaluate method of our model.

score, accuracy = model.evaluate(X_test, y_test, verbose=1)
print("Test Score:", score)
print("Test Accuracy:", accuracy)

We get a test accuracy of `74.68%`. Our training accuracy was `85.52%`. This means that our model is overfitting on the training set. Overfitting occurs when your model performs noticeably better on the training set than on the test set. Ideally, the performance difference between the training and test sets should be minimal.

In [None]:
# define function that will be used to plot loss difference and accuracy of training session

import matplotlib.pyplot as plt

plt.style.use('ggplot')

def visual_diff(history):
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train accuracy','validation accuracy'], loc='upper left')
    plt.show()

    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])

    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train loss','validation loss'], loc='upper left')
    plt.show()

In [None]:
visual_diff(history)

## Improve results with CNN (Convolutional Neural Network)

A convolutional neural network is a type of network that is primarily used for 2D data classification, such as images. A convolutional network tries to find specific features in an image in the first layer. In the next layers, the initially detected features are joined together to form bigger features. In this way, the whole image is recognized.

Convolutional neural networks have been found to work well with text data as well. Though text data is one-dimensional, we can use 1D convolutional neural networks to extract features from our data. To learn more about convolutional neural networks, please refer to [this article](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/).

Let's create a simple convolutional neural network with 1 convolutional layer and 1 pooling layer. Remember, the code up to the creation of the embedding layer remains the same. Execute the following piece of code after you create the embedding layer:

In [None]:
from keras.layers import Conv1D

model = Sequential()

embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=maxlen , trainable=False)
model.add(embedding_layer)

model.add(Conv1D(128, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In the above script we create a sequential model, followed by an embedding layer. This step is similar to what we did earlier. Next, we create a one-dimensional convolutional layer with 128 features, or kernels. The kernel size is 5 and the activation function used is ReLU. Next, we add a global max pooling layer to reduce the feature size. Finally, we add a dense layer with sigmoid activation. The compilation process is the same as it was in the previous section.
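
To see how the shapes change through these layers, here is a small NumPy sketch of a 1D convolution followed by global max pooling. It is an illustration under stated assumptions, not the Keras implementation: `conv1d_valid` is a hypothetical helper, the weights are random, and it assumes no padding at the sequence edges and a ReLU activation, matching `activation='relu'` in the code.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """x: (steps, channels); kernels: (kernel_size, channels, filters)."""
    k = kernels.shape[0]
    steps_out = x.shape[0] - k + 1
    out = np.stack([
        # weighted sum over a window of k consecutive word vectors
        np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1])) + bias
        for t in range(steps_out)
    ])
    return np.maximum(out, 0.0)   # ReLU

rng = np.random.default_rng(0)
review = rng.normal(size=(100, 100))     # 100 words, 100-dimensional vectors
kernels = rng.normal(size=(5, 100, 128)) # kernel size 5, 128 filters
bias = np.zeros(128)

features = conv1d_valid(review, kernels, bias)
pooled = features.max(axis=0)            # global max pooling over the steps
n_conv_params = kernels.size + bias.size

print(features.shape, pooled.shape, n_conv_params)
```

Each filter slides over 96 positions (100 - 5 + 1), and pooling keeps only the strongest response per filter, so the dense layer receives 128 values instead of a flattened 100x100 matrix.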

In [None]:
model.summary()

**From the model summary, it's clear that in the above case we don't need to flatten our embedding layer. You can also notice that the feature size is reduced by the pooling layer.**



Let's now train our model and evaluate it on the test set. The process to train and test our model remains the same. To do so, we can use the `fit` and `evaluate` methods, respectively.

In [None]:
history = model.fit(X_train, y_train, batch_size=128, epochs=6, verbose=1, validation_split=0.2)

score,accuracy = model.evaluate(X_test, y_test, verbose=1)

print()
print("Test Score:", score)
print("Test Accuracy:", accuracy)

visual_diff(history)

**The training accuracy for the CNN is around 92%, which is greater than the training accuracy of the simple neural network. The test accuracy is around 82% for the CNN, which is also greater than the test accuracy of the simple neural network, which was around 74%.**

However, our CNN model is still overfitting, as there is a large difference between the training and test accuracy.

## Text Classification with Recurrent Neural Network (LSTM)

A recurrent neural network is a type of neural network that has been shown to work well with sequence data. Since text is a sequence of words, a recurrent neural network is a natural choice for text-related problems. In this section, we will use an LSTM (Long Short-Term Memory network), which is a variant of the RNN, to solve the sentiment classification problem.

**Our goal here is to get rid of the overfitting.**

In [None]:
from keras.layers import LSTM

model = Sequential()
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=maxlen , trainable=False)
model.add(embedding_layer)
model.add(LSTM(128))

model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

model.summary()
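
As a sanity check on the summary, the size of the LSTM layer can be worked out by hand. The sketch below assumes the standard LSTM formulation with four gates, each having input weights, recurrent weights and a bias; the variable names are chosen for illustration.

```python
input_dim = 100   # size of each GloVe vector fed to the LSTM
units = 128       # LSTM(128)

# four gates, each with input weights, recurrent weights and one bias per unit
lstm_params = 4 * ((input_dim + units) * units + units)
dense_params = units * 1 + 1   # one weight per LSTM unit plus a bias

print(lstm_params, dense_params)
```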

In [None]:
history = model.fit(X_train, y_train, batch_size=128, epochs=6, verbose=1, validation_split=0.2)

score = model.evaluate(X_test, y_test, verbose=1)

print()
print("Test Score:", score[0])
print("Test Accuracy:", score[1])

visual_diff(history)

>The difference between the accuracy values for the training and test sets is much smaller compared to the simple neural network and the CNN. Similarly, the difference between the loss values is also negligible, which shows that our model is not overfitting. We can conclude that, for our problem, the RNN is the best of the three models we tried.

Note that we chose the number of layers, neurons, hyperparameters, etc. arbitrarily. Try changing the number of layers, the number of neurons and the activation functions for all three neural networks discussed, and see which neural network works best for you.

### Making Predictions on Single Instance

This is the final section of the notebook. Here we will see how to make a prediction on a single instance, i.e. a single review. Let's retrieve a review from our corpus and then try to predict its sentiment.

In [None]:
# randomly select any review from our corpus

instance = reviews[57]
print(instance)

You can clearly see that this is a negative review. To predict the sentiment of this review, we have to convert it into numeric form. We can do so using the `tokenizer` that we created in the word embedding section. The `texts_to_sequences` method will convert the review into its numeric counterpart.

Next, we need to pad our input sequence as we did for our corpus. Finally, we can use the predict method of our model and pass it our processed input sequence.

In [None]:
# texts_to_sequences expects a list of texts, so wrap the single review in a list
# (passing the bare string would tokenize it character by character)
instance = tokenizer.texts_to_sequences([instance])

instance = pad_sequences(instance, padding='post', maxlen=maxlen)

model.predict(instance)

> Remember, we mapped the positive outputs to 1 and the negative outputs to 0. However, the sigmoid function outputs a floating-point value between 0 and 1. If the value is less than 0.5, the sentiment is considered negative, whereas if the value is greater than 0.5, the sentiment is considered positive. The sentiment value for our single instance is 0.33, which means that the sentiment is predicted as negative, which is actually the case.
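
The thresholding rule can be written as a small helper. This is a sketch: `sentiment_label` is a hypothetical function, and treating a value of exactly 0.5 as negative is an assumption, since the rule above does not cover that case.

```python
def sentiment_label(probability, threshold=0.5):
    """Turn a sigmoid output into a label; exactly 0.5 counts as negative here."""
    return "positive" if probability > threshold else "negative"

print(sentiment_label(0.33))
print(sentiment_label(0.91))
```

With the model above, the probability for a single padded review would typically be read from the prediction array as `model.predict(instance)[0][0]`.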