# IMDB Movie reviews sentiment classifier

**CONTEXT**: The objective of this project is to build a text classification model that predicts the sentiment of reviewers from their movie reviews in the IMDB database. The model is a deep learning network: an embedding layer feeds a recurrent (LSTM) layer, followed by a classification layer that outputs the sentiment.

**DATA DESCRIPTION**: The dataset contains 50,000 movie reviews from IMDB, labelled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, so the word with index 1 in the raw word index is the most frequent word. This notebook keeps a vocabulary of the 10,000 most frequent words and pads or truncates each review to 500 tokens. With the default `imdb.load_data` settings used here, index 0 is left free for padding, 1 marks the start of a review, and 2 encodes any word outside the vocabulary.
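To make the frequency-based encoding concrete, here is a minimal sketch on a toy corpus. The names `reviews`, `word_to_rank`, and `encode` are hypothetical, and the reserved ids (0 for padding, 1 for start, 2 for unknown, real words shifted by 3) are assumed to mirror the Keras defaults. It is a simplification for illustration, not the library's own implementation.

```python
from collections import Counter

PAD, START, UNK = 0, 1, 2   # reserved ids, assumed to mirror the Keras defaults
INDEX_FROM = 3              # real words are stored as rank + 3

# Toy corpus standing in for the real reviews (hypothetical data).
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was thin",
]

# Count how often each word appears across the whole corpus.
counts = Counter(word for review in reviews for word in review.split())

# Rank 1 is the most frequent word; ties are broken alphabetically
# so that the result is deterministic.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_to_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, vocab_size):
    """Encode a review, mapping words outside the top `vocab_size` to UNK."""
    ids = [START]
    for word in review.split():
        rank = word_to_rank.get(word)
        if rank is None or rank > vocab_size:
            ids.append(UNK)
        else:
            ids.append(rank + INDEX_FROM)
    return ids

encoded = encode("the plot was great", vocab_size=3)
print(word_to_rank)
print(encoded)
```

With a vocabulary of three words, only "the", "was", and "movie" keep their own ids; every rarer word collapses to the unknown id.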

**PROJECT OBJECTIVE**: Build a sequential NLP classifier that takes the encoded review text as input and predicts the reviewer's sentiment.

## Import the Libraries

We import the classes and functions required for this model.

In [4]:
import numpy
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence
import warnings
warnings.filterwarnings('ignore')

## Load Dataset

We load the IMDB dataset, keeping only the 10,000 most frequent words. `imdb.load_data` returns a predefined split of 25,000 training reviews and 25,000 test reviews (50% each).

In [5]:
top_words = 10000 ## Taking 10000 most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

### Shape of the data

In [6]:
print(X_train.shape)  # number of training reviews
print(X_test.shape)   # number of test reviews

(25000,)
(25000,)


In [46]:
print(y_train[11])

0


### Decode the feature value to get original sentence

In [37]:
def get_sentences(feature):
    # Word indexes are shifted by 3 to make room for the reserved ids below.
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    word_to_id["<UNUSED>"] = 3

    # Invert the mapping; `idx` avoids shadowing the built-in `id`.
    id_to_word = {value:key for key,value in word_to_id.items()}
    S = ' '.join(id_to_word[idx] for idx in feature)
    print(S)

get_sentences(X_train[11])

<START> when i rented this movie i had very low expectations but when i saw it i realized that the movie was less a lot less than what i expected the actors were bad the doctor's wife was one of the worst the story was so stupid it could work for a disney movie except for the murders but this one is not a comedy it is a laughable masterpiece of stupidity the title is well chosen except for one thing they could add stupid movie after dead husbands i give it 0 and a half out of 5


## Truncate and pad input sequences

We truncate and pad the input sequences so that they all have the same length for modeling. Keras requires same-length vectors to perform the computation, even though the reviews differ in how much actual content they contain. Padding is done with zeros, and the model learns that these zero values carry no information.

In [38]:
max_review_length = 500
X_train_em = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test_em = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [39]:
X_train_em[11].shape

(500,)
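The padding and truncation step can be sketched in plain Python. The helper `pad_pre` below is hypothetical; it assumes the `pad_sequences` defaults of padding and truncating at the front of the sequence (`padding='pre'`, `truncating='pre'`) and is only meant to illustrate the behaviour, not to replace the library call.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items.

    Assumes maxlen > 0.
    """
    seq = list(seq)[-maxlen:]                    # drop items from the front if too long
    return [value] * (maxlen - len(seq)) + seq   # fill the front if too short

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded_short = pad_pre(short_review, maxlen=5)
padded_long = pad_pre(long_review, maxlen=5)

print(padded_short)  # [0, 0, 1, 14, 22]
print(padded_long)   # [22, 16, 43, 530, 973]
```

Under these defaults a long review keeps its final tokens rather than its opening ones, which is worth keeping in mind when choosing `max_review_length`.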

## Define and train the model

The first layer is the **Embedding layer**, which uses vectors of length 32 to represent each word. The next layer is the **LSTM layer** with 100 memory units. Finally, because this is a classification problem, we use a **Dense output layer** with a single neuron and a **sigmoid activation function**, which outputs a probability between 0 and 1 for the two classes (negative and positive).

Because it is a binary classification problem, **binary_crossentropy** is used as the loss function, together with the efficient **Adam optimization algorithm**. The model is fit for only 5 epochs because it quickly overfits the problem. A large batch size of 64 reviews is used to space out weight updates.

In [41]:
embedding_vector_length = 32

model = Sequential()

model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))

model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))

model.add(Dense(1, activation='sigmoid'))

In [42]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 32)           320000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
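The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are only illustrative.

```python
vocab_size = 10000     # top_words
embedding_dim = 32     # length of each word vector
lstm_units = 100       # LSTM memory units

# Embedding: one vector of length 32 per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params, total_params)
# 320000 53200 101 373301
```

Most of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size here.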


In [43]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
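For intuition about the loss chosen above, here is a small pure-Python sketch of binary cross-entropy averaged over a batch. The function name and the clipping constant `eps` are assumptions made for this illustration; Keras uses its own implementation internally.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0]
confident = binary_crossentropy(labels, [0.9, 0.1])  # confident and correct
unsure = binary_crossentropy(labels, [0.5, 0.5])     # no information
wrong = binary_crossentropy(labels, [0.1, 0.9])      # confident and wrong

print(round(confident, 4), round(unsure, 4), round(wrong, 4))
```

The loss grows as the predicted probability moves away from the true label, so confident mistakes are penalised far more than hesitant ones.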

In [45]:
model.fit(X_train_em, y_train, epochs=5, batch_size=64) ## Increase the epochs as per your hardware

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1da5a17a148>

## Model Evaluation and scoring

In [48]:
scores = model.evaluate(X_test_em, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 86.88%
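The accuracy figure is the share of reviews whose thresholded prediction matches the label. A minimal sketch, assuming the usual 0.5 threshold on the sigmoid output and using made-up probabilities (the helper names and values are hypothetical):

```python
def to_labels(probs, threshold=0.5):
    """Turn sigmoid probabilities into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probs]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

probs = [0.92, 0.08, 0.61, 0.45]   # hypothetical model outputs
labels = [1, 0, 0, 0]              # hypothetical true sentiments

preds = to_labels(probs)
acc = accuracy(labels, preds)

print(preds)  # [1, 0, 1, 0]
print(acc)    # 0.75
```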


## Prediction on any one sample

In [54]:
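# `predict_classes` was removed in TensorFlow 2.6; threshold the sigmoid output instead.
y_pred = (model.predict(X_test_em, batch_size=128) > 0.5).astype("int32").ravel()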

In [75]:
def sentiment(val):
    if val==1:
        return 'Positive Review'
    return 'Negative Review'

def get_predictions(n):
    get_sentences(X_test[n])  # prints the decoded review
    print('-----------------------------------')
    print('Predicted sentiment -', sentiment(y_pred[n]))
    print('-----------------------------------')
    print('Actual sentiment    -', sentiment(y_test[n]))

In [78]:
get_predictions(144)

<START> here's a couple of <UNK> out of an <UNK> i wrote for university about br br the book of revelation is an erotic thriller about sex power and a talented <UNK> struggle to regain his sense of self after being unfortunately raped by three <UNK> women the three women that <UNK> him all have distinctive marks on the bodies one has a giant birth mark on her <UNK> another has a <UNK> <UNK> on her lower stomach and the ring leader has a small circle on her breast so he lives his new life in search of these <UNK> and to find them on these intimate places he does what any sane man does when he needs to see as many naked women as possible to solve a mystery he has sex with them an hour and ten minutes into the film and you feel like he has almost had a piece of every woman in melbourne br br the film is a giant <UNK> of pretentious celluloid it is like <UNK> from every frame at only one point towards the films final climax does give a scene the same energy and strength as her debut featur

In [77]:
get_predictions(155)

<START> i actually found out about rising via the imdb website i have a particular interest in <UNK> brazilian culture and films rising is one of those gems that gives a new meaning to human transformation beautifully documented and filmed by jeff <UNK> and matt its the story anderson <UNK> a former <UNK> de <UNK> drug <UNK> who after the deaths of family members and friends becomes a christ like malcolm x and <UNK> all rolled into one <UNK> formed a <UNK> cultural movement that uses <UNK> brazilian <UNK> <UNK> brazilian martial arts <UNK> and other to <UNK> the hopeless and most times angry youth into vibrant <UNK> caring community loving individuals br br a few years ago i remember going to a screening of city of god de <UNK> and walked out of the theatre completely <UNK> the images were grim yet stunning and you couldn't take your eyes off the screen i remember how hopeless some situations were in the <UNK> and how <UNK> the society was due to the <UNK> neglect how drug <UNK> was a 

## Conclusion

Our model performs quite well on the predictions. The accuracy achieved on the test data is 86.88%, and the two sample predictions we tried gave encouraging results.