#**DOMAIN:** Digital content and entertainment industry
#**CONTEXT:** 
The objective of this project is to build a text classification model that analyses customer sentiment from reviews in the IMDB database. The model is a deep learning network in which an embedding layer feeds a classifier that predicts the sentiment of each review.
#**DATA DESCRIPTION:** 
A dataset of 50,000 movie reviews from IMDB, labelled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, with a maximum vocabulary size of 10,000. As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
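
The frequency-based encoding described above can be sketched on a toy corpus. This is an illustration only, not the code that produced the IMDB files: `build_frequency_index`, `encode`, and the three toy reviews are hypothetical names and data introduced here, and ties in frequency are broken alphabetically purely to keep the result deterministic.

```python
from collections import Counter


def build_frequency_index(texts):
    """Map each word to its frequency rank (1 = most frequent word)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def encode(text, index, max_vocab):
    """Encode text as ranks; words outside the top max_vocab ranks become 0."""
    encoded = []
    for word in text.lower().split():
        rank = index.get(word)
        encoded.append(rank if rank is not None and rank <= max_vocab else 0)
    return encoded


toy_reviews = ["the movie was great", "the movie was dull", "the plot was thin"]
toy_index = build_frequency_index(toy_reviews)
encoded = encode("the movie was surprisingly great", toy_index, max_vocab=3)

print(toy_index)
print(encoded)
```

With a vocabulary cap of 3, only the three most frequent words keep their own index, and every other word collapses to 0, which mirrors the role of the 10,000-word limit in this project.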
#**PROJECT OBJECTIVE:** 
Build a sequential NLP classifier which can use input text parameters to determine the customer sentiments.

#**STEPS AND TASKS:** [ Total Score: 30 points]
1. Import and analyse the data set.
  * Hint: - Use `imdb.load_data()` method
    * Get train and test set
    * Take 10000 most frequent words
2. Perform relevant sequence padding on the data
3. Perform following data analysis:
  * Print shape of features and labels
  * Print value of any one feature and its label
4. Decode the feature value to get original sentence
5. Design, train, tune and test a sequential model.
  * Hint: The aim here is to import the text and process it in such a way that it can be taken as an input to the ML/NN classifiers. Be analytical and experimental here in trying new approaches to design the best model.
6. Use the designed model to print the prediction on any one sample.

# 1. Import and analyse the data set
  * ### Hint: - Use `imdb.load_data()` method
    * ### Get train and test set
    * ### Take 10000 most frequent words

In [1]:
from keras.datasets import imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = 10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




# 2. Perform relevant sequence padding on the data

In [2]:
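# limiting each review to 20 tokens, as mentioned above
# note: pad_sequences truncates from the start by default (truncating='pre'),
# so this keeps the last 20 tokens; pass truncating='post' to keep the first 20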
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen = 20)
X_test =  pad_sequences(X_test, maxlen = 20)
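
The effect of the padding and truncation settings can be seen on a short sequence. The sketch below is a pure-Python stand-in for what `pad_sequences` does to a single row, assuming its documented defaults (`padding='pre'`, `truncating='pre'`, `value=0`); `pad_or_truncate` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical stand-in for how pad_sequences treats one sequence."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding before the tokens, 'post' puts it after them.
    return pad + seq if padding == "pre" else seq + pad


review = [11, 12, 13, 14, 15, 16]

kept_last = pad_or_truncate(review, maxlen=4)
kept_first = pad_or_truncate(review, maxlen=4, truncating="post")
left_padded = pad_or_truncate([7, 8], maxlen=4)

print(kept_last)    # [13, 14, 15, 16]
print(kept_first)   # [11, 12, 13, 14]
print(left_padded)  # [0, 0, 7, 8]
```

With the defaults used in the cell above, a long review is therefore reduced to its final 20 tokens, and a review shorter than 20 tokens is left-padded with zeros.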

# 3. Perform following data analysis:
  * ### Print shape of features and labels
  * ### Print value of any one feature and its label

In [3]:
print('Number of reviews :', X_train.shape, X_test.shape)
print('Number of labels :', y_train.shape, y_test.shape)

Number of reviews : (25000, 20) (25000, 20)
Number of labels : (25000,) (25000,)


In [4]:
# inspecting one encoded review and its label

print('Feature :', X_train[1])
print('Label :', y_train[1])

Feature : [  23    4 1690   15   16    4 1355    5   28    6   52  154  462   33
   89   78  285   16  145   95]
Label : 0


# 4. Decode the feature value to get original sentence

In [5]:
word_index = imdb.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [6]:
# mapping integers to their respective words

reverse_word_index = {value: key for key, value in word_index.items()}

In [7]:
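# decoding the review; indices are shifted by 3 because load_data reserves
# 0, 1 and 2 by default (padding, start of sequence, unknown word)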

decoded_review = ' '.join([reverse_word_index.get(i-3, '?') for i in X_train[0]])
decoded_review

"story was so lovely because it was true and was someone's life after all that was shared with us all"
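
The `i-3` offset in the decoding step can be checked on a toy vocabulary. This sketch assumes the default `index_from=3` of `imdb.load_data()`, under which ids 0, 1 and 2 are reserved and every real word id is its frequency rank plus 3; `decode_review` and `toy_word_index` are hypothetical names used only for illustration.

```python
def decode_review(encoded, word_index, index_from=3):
    """Turn a sequence of integer ids back into text.

    Ids below index_from + 1 have no word attached, so they decode to '?'.
    """
    reverse_index = {rank: word for word, rank in word_index.items()}
    return " ".join(reverse_index.get(i - index_from, "?") for i in encoded)


# Toy rank table in the same shape as imdb.get_word_index(): word -> rank.
toy_word_index = {"the": 1, "was": 2, "story": 3, "lovely": 4}

# 1 and 2 are reserved ids; 4, 6, 5 and 7 are ranks 1, 3, 2 and 4 shifted by 3.
toy_encoded = [1, 4, 6, 5, 2, 7]
decoded = decode_review(toy_encoded, toy_word_index)

print(decoded)  # ? the story was ? lovely
```

Any `?` in a decoded review therefore marks a reserved id, such as padding or a word outside the 10,000-word vocabulary, rather than a decoding failure.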

# 5. Design, train, tune and test a sequential model

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Flatten, Dense, Dropout

model = Sequential()

# embedding layer
model.add(Embedding(input_dim = 10000, output_dim = 100, input_length = 20))

# recurrent layer
model.add(LSTM(64, activation = 'relu', return_sequences=True))

model.add(Flatten())
model.add(Dense(512, activation = 'relu'))

# dropout for regularization
model.add(Dropout(0.5))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(1, activation='sigmoid'))



In [9]:
# compiling
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [10]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 20, 100)           1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 20, 64)            42240     
_________________________________________________________________
flatten (Flatten)            (None, 1280)              0         
_________________________________________________________________
dense (Dense)                (None, 512)               655872    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257
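
The parameter counts in the summary can be reproduced by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM and dense layers with biases; the helper function names are hypothetical and exist only for this check. Note that a `Dense(1)` layer on 256 inputs has 256 weights plus 1 bias, i.e. 257 parameters.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(10000, 100),
    "lstm": lstm_params(100, 64),
    "dense": dense_params(20 * 64, 512),  # Flatten yields 20 * 64 = 1280 inputs
    "dense_1": dense_params(512, 256),
    "dense_2": dense_params(256, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:10s} {n:>9,d}")
print(f"{'total':10s} {total:>9,d}")
```

More than half of the parameters sit in the embedding table, and most of the rest in the first dense layer after `Flatten`, which is worth keeping in mind when tuning the model on only 20 tokens per review.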

In [11]:
# fitting the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30, batch_size=128)

Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7facb1a65a50>

In [13]:
score = model.evaluate(X_test, y_test)
print('Loss : ', score[0], 'Accuracy: ', score[1])

Loss :  3.442291498184204 Accuracy:  0.7172399759292603
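
A loss above 3 alongside roughly 72% accuracy may look contradictory. One possible reading, offered here as an interpretation rather than something the notebook measures, is that the model makes some of its wrong predictions with very high confidence, which binary cross-entropy penalises heavily. The sketch below implements the usual loss and accuracy formulas in plain Python on invented probabilities; all values are illustrative.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, clipping probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


labels = [1, 1, 0, 0]
cautious = [0.7, 0.6, 0.3, 0.6]                # last sample wrong, mildly
overconfident = [0.999, 0.999, 0.001, 0.9999]  # same mistake, near certainty

for name, probs in [("cautious", cautious), ("overconfident", overconfident)]:
    print(name, accuracy(labels, probs), round(binary_crossentropy(labels, probs), 3))
```

Both sets of predictions have the same accuracy, but the overconfident set has a much larger loss, so accuracy and loss are best read together when comparing training runs.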


# 6. Use the designed model to print the prediction on any one sample

In [18]:
result = model.predict(X_test[0].reshape((1, 20)))

In [25]:
result

array([[0.2746454]], dtype=float32)
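
The sigmoid output is a probability-like score rather than a class. A minimal sketch of turning it into a sentiment label follows, assuming the usual convention for this dataset that label 1 means positive and 0 means negative, and a cut-off of 0.5; `to_label` and the threshold are hypothetical choices made for illustration.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a sentiment label using a fixed threshold."""
    return "positive" if probability >= threshold else "negative"


sample_probability = 0.2746454  # the value printed for X_test[0] above
predicted = to_label(sample_probability)

print(predicted)  # negative
```

Under these assumptions the sample is classified as a negative review, and comparing the result with `y_test[0]` would show whether that matches the true label.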