![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
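- As a convention, "0" does not stand for a specific word; it is reserved for padding. With the default arguments of `imdb.load_data()`, "1" marks the start of a review and "2" encodes any unknown (out-of-vocabulary) word.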
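
The frequency-based filtering described above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: `filter_by_rank` is a hypothetical helper, and using 2 as the out-of-vocabulary marker is an assumption that mirrors the default `oov_char` of `imdb.load_data`, which also accepts `num_words` and `skip_top` arguments for this kind of filtering.

```python
OOV = 2  # assumption: out-of-vocabulary marker, mirroring the default oov_char

def filter_by_rank(seq, num_words=10000, skip_top=20, oov=OOV):
    """Keep indices with skip_top <= index < num_words; replace the rest with oov."""
    return [w if skip_top <= w < num_words else oov for w in seq]

# Made-up encoded review: small numbers are very frequent words
review = [1, 14, 22, 16, 43, 530, 973, 1622, 12000]
filtered = filter_by_rank(review)
print(filtered)  # [2, 2, 22, 2, 43, 530, 973, 1622, 2]
```

Because words are indexed by frequency, both cutoffs reduce to simple integer comparisons; no lookup table is needed.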

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (4 Marks)
- Use the `imdb.load_data()` method
- Get the train and test sets
- Keep only the 10,000 most frequent words

In [1]:
from tensorflow.keras.datasets import imdb
import numpy as np

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = 10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




### Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [2]:
import tensorflow as tf

# Left-pad short reviews with 0 and keep only the last 300 tokens of long ones
X_train_p = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen = 300, dtype = 'int32', truncating = 'pre', padding = 'pre', value = 0)
X_test_p = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen = 300, dtype = 'int32', truncating = 'pre', padding = 'pre', value = 0)
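
To make the effect of `padding = 'pre'` and `truncating = 'pre'` concrete, here is a minimal pure-Python sketch for a single sequence. `pad_pre` is a hypothetical helper written for illustration (it assumes `maxlen` is at least 1); it is not how `pad_sequences` is implemented.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: left-truncate, then left-pad, a single sequence."""
    kept = list(seq)[-maxlen:]                    # truncating='pre' drops tokens from the start
    return [value] * (maxlen - len(kept)) + kept  # padding='pre' adds the fill value at the start

short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 6, 7]
print(long)   # [3, 4, 5, 6, 7]
```

With both options set to `'pre'`, the end of each review is always preserved and sits at the right edge of the padded row.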

### Print shape of features & labels (4 Marks)

Number of reviews and number of words in each review

In [3]:
print('Train reviews shape', X_train_p.shape)
print('Unique Train words', len(np.unique(np.hstack(X_train_p))))
length = [len(i) for i in X_train]
print('Average length of original reviews', np.mean(length))
length = [len(i) for i in X_train_p]
print('Average length of padded reviews', np.mean(length))


Train reviews shape (25000, 300)
Unique Train words 9999
Average length of original reviews 238.71364
Average length of padded reviews 300.0


In [4]:
print('Test reviews shape', X_test_p.shape)
print('Unique Test words', len(np.unique(np.hstack(X_test_p))))
length = [len(i) for i in X_test]
print('Average length of original reviews', np.mean(length))
length = [len(i) for i in X_test_p]
print('Average length of padded reviews', np.mean(length))

Test reviews shape (25000, 300)
Unique Test words 9943
Average length of original reviews 230.8042
Average length of padded reviews 300.0


Number of labels

In [5]:
print('Train label counts', y_train.shape)
print('Unique Train Sentiments', np.unique(y_train))

Train label counts (25000,)
Unique Train Sentiments [0 1]


In [6]:
print('Test label counts', y_test.shape)
print('Unique Test Sentiments', np.unique(y_test))


Test label counts (25000,)
Unique Test Sentiments [0 1]


### Print value of any one feature and its label (4 Marks)

Feature value

In [7]:
print(X_train[10])

[1, 785, 189, 438, 47, 110, 142, 7, 6, 7475, 120, 4, 236, 378, 7, 153, 19, 87, 108, 141, 17, 1004, 5, 2, 883, 2, 23, 8, 4, 136, 2, 2, 4, 7475, 43, 1076, 21, 1407, 419, 5, 5202, 120, 91, 682, 189, 2818, 5, 9, 1348, 31, 7, 4, 118, 785, 189, 108, 126, 93, 2, 16, 540, 324, 23, 6, 364, 352, 21, 14, 9, 93, 56, 18, 11, 230, 53, 771, 74, 31, 34, 4, 2834, 7, 4, 22, 5, 14, 11, 471, 9, 2, 34, 4, 321, 487, 5, 116, 15, 6584, 4, 22, 9, 6, 2286, 4, 114, 2679, 23, 107, 293, 1008, 1172, 5, 328, 1236, 4, 1375, 109, 9, 6, 132, 773, 2, 1412, 8, 1172, 18, 7865, 29, 9, 276, 11, 6, 2768, 19, 289, 409, 4, 5341, 2140, 2, 648, 1430, 2, 8914, 5, 27, 3000, 1432, 7130, 103, 6, 346, 137, 11, 4, 2768, 295, 36, 7740, 725, 6, 3208, 273, 11, 4, 1513, 15, 1367, 35, 154, 2, 103, 2, 173, 7, 12, 36, 515, 3547, 94, 2547, 1722, 5, 3547, 36, 203, 30, 502, 8, 361, 12, 8, 989, 143, 4, 1172, 3404, 10, 10, 328, 1236, 9, 6, 55, 221, 2989, 5, 146, 165, 179, 770, 15, 50, 713, 53, 108, 448, 23, 12, 17, 225, 38, 76, 4397, 18, 183, 8, 

Label value

In [8]:
print(y_train[10])

1


### Decode the feature value to get original sentence (4 Marks)

First, retrieve the dictionary that maps each word to its index in the IMDB dataset

In [9]:
index = imdb.get_word_index()  # word -> frequency rank
reverse_index = {value: key for key, value in index.items()}  # frequency rank -> word

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
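
Encoded reviews do not store the raw ranks from this dictionary. Assuming the defaults of `imdb.load_data` (`start_char=1`, `oov_char=2`, `index_from=3`), every rank is shifted by 3 to make room for the reserved ids, so decoding has to subtract 3 again. A toy sketch with a made-up four-word vocabulary (`toy_index`, `encode`, and `decode` are illustrative names, not part of Keras):

```python
# Toy word-to-rank mapping; the real one comes from imdb.get_word_index()
toy_index = {"the": 1, "film": 2, "is": 3, "great": 4}
toy_reverse = {rank: word for word, rank in toy_index.items()}

INDEX_FROM = 3  # assumption: ranks are shifted by 3, the default index_from

def encode(words):
    # 1 marks the start of a review; unknown words become 2
    return [1] + [toy_index[w] + INDEX_FROM if w in toy_index else 2 for w in words]

def decode(ids):
    # Shift back by 3; reserved ids (0, 1, 2) have no word and decode to '#'
    return " ".join(toy_reverse.get(i - INDEX_FROM, "#") for i in ids)

ids = encode("the film is truly great".split())
print(ids)          # [1, 4, 5, 6, 2, 7]
print(decode(ids))  # # the film is # great
```

This is why the decoded review starts with `#` and shows `#` wherever a word fell outside the retained vocabulary.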


Now use the dictionary to get the original words from the encodings, for a particular sentence

In [10]:
decoded = ' '.join([reverse_index.get(i - 3, '#') for i in X_train[10]])
print(decoded)

# french horror cinema has seen something of a revival over the last couple of years with great films such as inside and # romance # on to the scene # # the revival just slightly but stands head and shoulders over most modern horror titles and is surely one of the best french horror films ever made # was obviously shot on a low budget but this is made up for in far more ways than one by the originality of the film and this in turn is # by the excellent writing and acting that ensure the film is a winner the plot focuses on two main ideas prison and black magic the central character is a man named # sent to prison for fraud he is put in a cell with three others the quietly insane # body building # marcus and his retarded boyfriend daisy after a short while in the cell together they stumble upon a hiding place in the wall that contains an old # after # part of it they soon realise its magical powers and realise they may be able to use it to break through the prison walls br br black magi

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [11]:
print('Positive' if y_train[10] == 1 else 'Negative')

Positive


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer does not require one-hot encoded words; each word only needs a unique integer ID. The IMDB dataset loaded above is already encoded this way; otherwise scikit-learn's `LabelEncoder` could be used to assign the IDs.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Set `return_sequences` to True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
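
As a rough illustration of why the Embedding layer needs only integer IDs, the sketch below treats an embedding as a lookup table and checks that the lookup matches an explicit one-hot multiplication. The sizes and the names `table`, `embed`, and `embed_one_hot` are made up for this example; the real layer uses a vocabulary of 10000 and 100 dimensions.

```python
import random

random.seed(1)
vocab_size, embed_dim = 5, 3   # toy sizes for illustration only

# One row of weights per word id, drawn uniformly from [-0.01, 0.01]
table = [[random.uniform(-0.01, 0.01) for _ in range(embed_dim)] for _ in range(vocab_size)]

def embed(ids):
    """Look up one row per integer id: (seq_len,) -> (seq_len, embed_dim)."""
    return [table[i] for i in ids]

def embed_one_hot(ids):
    """Same result via an explicit one-hot vector times the table."""
    out = []
    for i in ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        out.append([sum(one_hot[j] * table[j][d] for j in range(vocab_size))
                    for d in range(embed_dim)])
    return out

sequence = [0, 3, 3, 1]
vectors = embed(sequence)
print(len(vectors), len(vectors[0]))  # 4 3
```

The lookup avoids building a 10000-wide one-hot vector for every token, which is the practical reason integer IDs are preferred.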

In [12]:
from tensorflow import keras as k

In [13]:
# Seeded initializers make the starting weights reproducible
e_init = k.initializers.RandomUniform(-0.01, 0.01, seed = 1)
init = k.initializers.glorot_uniform(seed = 1)
adam = k.optimizers.Adam()
embed_len = 100

model = k.models.Sequential()

In [14]:
# Embedding -> LSTM (one output per time step) -> per-step Dense -> Flatten -> sigmoid output
model.add(k.layers.Embedding(input_dim = 10000, input_length=300, output_dim=embed_len, embeddings_initializer=e_init))
model.add(k.layers.LSTM(units = 100, kernel_initializer=init, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(k.layers.TimeDistributed(k.layers.Dense(100)))
model.add(k.layers.Flatten())
model.add(k.layers.Dense(units=1, kernel_initializer=init, activation='sigmoid'))



### Compile the model (4 Marks)
- Use Adam as the optimizer
- Use binary cross-entropy as the loss
- Use accuracy as the metric

In [23]:
model.compile(optimizer=adam, loss = 'binary_crossentropy', metrics=['accuracy'])
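
For reference, binary cross-entropy averages `-[y*log(p) + (1-y)*log(1-p)]` over a batch, where `p` is the predicted probability of the positive class. The pure-Python sketch below is illustrative only; the function name, the made-up inputs, and the clipping constant `eps` are assumptions for this example rather than the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over a batch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.5, 0.5]   # made-up predicted probabilities
loss = binary_crossentropy(labels, probs)
print(round(loss, 4))  # 0.3993
```

Confident correct predictions (0.9 for a positive, 0.1 for a negative) contribute little to the loss, while the two uncertain predictions at 0.5 each contribute log(2).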

### Print model summary (4 Marks)

In [24]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 300, 100)          80400     
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          10100     
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 30001     
Total params: 1,120,501
Trainable params: 1,120,501
Non-trainable params: 0
_________________________________________________________________
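
The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch using the layer sizes from this notebook; the variable names are chosen for this example only.

```python
vocab, embed_dim, units, steps = 10000, 100, 100, 300

embedding = vocab * embed_dim                     # one 100-d vector per word id
lstm = 4 * (units * (embed_dim + units) + units)  # 4 gates: input weights, recurrent weights, bias
time_dense = units * 100 + 100                    # Dense(100) shared across all 300 time steps
output = steps * 100 * 1 + 1                      # Flatten feeds 300*100 values into one sigmoid unit

total = embedding + lstm + time_dense + output
print(embedding, lstm, time_dense, output, total)
# 1000000 80400 10100 30001 1120501
```

Note that `TimeDistributed` reuses the same Dense weights at every time step, which is why its count does not grow with the sequence length, whereas the final Dense layer after `Flatten` does.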


### Fit the model (4 Marks)

In [28]:
model.fit(X_train_p, y_train, epochs = 5, batch_size = 128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f97153e2400>

### Evaluate model (4 Marks)

In [29]:
# evaluate() returns [loss, accuracy] because the model was compiled with one metric
test_loss, test_acc = model.evaluate(X_test_p, y_test, verbose = 1)



### Predict on one sample (4 Marks)

In [60]:
decoded = ' '.join([reverse_index.get(i - 3, '#') for i in X_test[1786]])
print(decoded)

# hi i'm # an african yet white jungle # princess who possesses the incredible ability to # into the # monster in the world think 60s star trek aliens by rolling # in mud when i first found myself in this horrible position i took the only logical action i made myself a torn apart jungle bikini in which to perform my badly acted antics i enjoy romance novels and # apart the occasional unimpressive african # and i would be # if i did not mention my white of course sidekick mr cutter an american ex military man who seems to have # the u s after his divorce can you say # # anyway he provides the occasional distraction from my difficult life i mean how many idiot # do you know who are also an # species of flesh # monster despite my many # acting is so hard # i haven't given up and after much soul searching i have finally discovered my role in life to # late night television viewers who are so unfortunate as to not have cable or satellite


In [61]:
prob = model.predict(X_test_p[1786].reshape(1, 300))[0, 0]  # sigmoid output: probability of the positive class
print('Predicted Positive' if prob > 0.5 else 'Predicted Negative')
print('Actual Positive' if y_test[1786] == 1 else 'Actual Negative')

Predicted Negative
Actual Negative
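
Because the last layer is a sigmoid, `model.predict` returns a probability between 0 and 1 rather than a hard label, so a class is obtained by comparing against a threshold. The sketch below uses made-up probabilities and a conventional cutoff of 0.5, which is an assumption here; `to_labels` and `accuracy` are hypothetical helper names.

```python
def to_labels(probs, threshold=0.5):
    """Map sigmoid probabilities to 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probs]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

probs = [0.03, 0.97, 0.51, 0.49]   # made-up model outputs
truth = [0, 1, 0, 0]
preds = to_labels(probs)
print(preds)                   # [0, 1, 1, 0]
print(accuracy(truth, preds))  # 0.75
```

Testing a probability for equality with 1 would almost never succeed, since a sigmoid output is rarely exactly 1.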
