![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
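- As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data()`, "0" is reserved for padding, "1" marks the start of a review, and "2" encodes any unknown (out-of-vocabulary) word.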
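
The frequency-rank filtering described above can be sketched in plain Python. The helper name `filter_by_rank` and the sample review are made up for illustration. In Keras, a similar effect is obtained through the `num_words` and `skip_top` arguments of `imdb.load_data()`, which replace filtered-out words with an out-of-vocabulary marker rather than dropping them.

```python
def filter_by_rank(review, num_words=10000, skip_top=20, oov=2):
    """Keep indexes with skip_top <= index < num_words; replace the rest with `oov`."""
    return [w if skip_top <= w < num_words else oov for w in review]

# Hypothetical encoded review: small integers are very frequent words.
review = [14, 22, 530, 12000, 4468, 5]
filtered = filter_by_rank(review)
print(filtered)  # [2, 22, 530, 2, 4468, 2]
```

The review keeps its length, so word positions are preserved for sequence models.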

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (2 Marks)
- Use the `imdb.load_data()` method
- Get the train and test sets
- Take the 10,000 most frequent words

In [1]:
import pandas as pd
import numpy as np
from keras.datasets import imdb
from matplotlib import pyplot

In [2]:
import tensorflow as tf

In [3]:
from tensorflow.keras.datasets import imdb
imdb.load_data

<function tensorflow.python.keras.datasets.imdb.load_data>

In [7]:
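top_words = 10000  # keep only the 10,000 most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# Combine the train and test sets for dataset-wide statistics.
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)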

In [8]:
print("train_data ", X_train.shape)
print("test_data ", X_test.shape)
print("train_labels ", y_train.shape)
print("test_labels ", y_test.shape)

train_data  (25000,)
test_data  (25000,)
train_labels  (25000,)
test_labels  (25000,)


### Pad each sentence to be of the same length (2 Marks)
- Take the maximum sequence length as 300

In [11]:
top_words = 10000
# Pad (or truncate) every review to exactly 300 tokens.
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=300)
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=300)
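
By default, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, drops tokens from the front (`padding='pre'`, `truncating='pre'`). The plain-Python sketch below mimics that default behaviour on toy data. `pad_pre` is an illustrative helper, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded

toy = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
print(pad_pre(toy, maxlen=4))  # [[0, 1, 14, 22], [6, 7, 8, 9]]
```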

In [12]:
print("Full dataset (train + test): ")
print(X.shape)
print(y.shape)

Full dataset (train + test): 
(50000,)
(50000,)


### Print shape of features & labels (2 Marks)

Number of reviews, number of words in each review

In [34]:
print("Categories:", np.unique(y))
print("Number of unique words:", len(np.unique(np.hstack(X_train))))
length = [len(i) for i in X]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))

Categories: [0 1]
Number of unique words: 9999
Average Review length: 234.75892
Standard Deviation: 173.0


In [35]:
print("Number of words: ")
print(len(np.unique(np.hstack(X))))

Number of words: 
9998


Number of labels

In [36]:
print("Number of labels:", len(y))

Number of labels: 50000


### Print value of any one feature and its label (2 Marks)

Feature value

In [37]:
print(X[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


Label value

In [38]:
print("Label:", y[0])

Label: 1


### Decode the feature value to get original sentence (2 Marks)

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

In [14]:
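# Word -> frequency-rank mapping for the IMDB vocabulary.
index = imdb.get_word_index()
# Invert it so that ranks map back to words.
reverse_index = {value: key for (key, value) in index.items()}
# Encoded indexes are shifted by 3 (0, 1 and 2 are reserved); unmatched values print as "#".
decoded = " ".join(reverse_index.get(i - 3, "#") for i in X_test[0])
print(decoded)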

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # please give this one a miss br br # # and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite # so all you madison fans give this a miss
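
The long run of `#` at the start of the output comes from the zeros added by padding. The `i - 3` accounts for the three reserved indexes (0 for padding, 1 for the start of a review, 2 for unknown words) that `imdb.load_data()` places ahead of the real words by default. A toy version, with a made-up three-word vocabulary standing in for `imdb.get_word_index()`, might look like this:

```python
# Made-up word index (word -> frequency rank), standing in for imdb.get_word_index().
toy_index = {"the": 1, "film": 2, "great": 3}
toy_reverse = {rank: word for word, rank in toy_index.items()}

def decode(encoded, reverse_index, offset=3, placeholder="#"):
    """Map encoded integers back to words; reserved or unknown indexes become `placeholder`."""
    return " ".join(reverse_index.get(i - offset, placeholder) for i in encoded)

# 0 = padding, 1 = start marker, 2 = unknown word; real words start at rank + 3.
encoded = [0, 0, 1, 4, 5, 2, 6]
print(decode(encoded, toy_reverse))  # # # # the film # great
```

Decoding the unpadded review instead of the padded one would avoid the leading placeholders.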


Now use the dictionary to get the original words from the encodings, for a particular sentence

In [15]:
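def vectorize(sequences, dimension=10000):
  # Multi-hot encode each review: results[i, j] is 1 if word index j occurs in review i.
  results = np.zeros((len(sequences), dimension))
  for i, sequence in enumerate(sequences):
    results[i, sequence] = 1
  # Return after the loop so that every review is encoded.
  return results

data = vectorize(X)
targets = np.array(y).astype("float32")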

In [16]:
test_x = data[:10000]
test_y = targets[:10000]
train_x = data[10000:]
train_y = targets[10000:]

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [None]:
#### Add your code here ####
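
One possible way to fill in this cell is to look up the label stored at the same position as the decoded review (`y_test[0]` for `X_test[0]`) and map it to a name. The sketch below uses a small stand-in list instead of the real `y_test` so that it runs on its own. `label_names` and `sentiment_of` are illustrative names.

```python
label_names = {0: "negative", 1: "positive"}

def sentiment_of(labels, position):
    """Return the sentiment name for the label stored at `position`."""
    return label_names[int(labels[position])]

# Stand-in for y_test; in the notebook one would pass y_test itself.
toy_labels = [0, 1, 1, 0]
print(sentiment_of(toy_labels, 0))  # negative
```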

### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` embedding layer doesn't require us to one-hot encode our words. Instead, each word must have a unique integer id. For the IMDB dataset we've loaded this has already been done; if it hadn't, we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
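
Conceptually, an embedding layer is a trainable lookup table with one row per word index. The sketch below imitates the lookup in plain Python with a tiny vocabulary and random, untrained vectors. The sizes and the `embed` helper are chosen only for illustration and are not the Keras implementation.

```python
import random

random.seed(0)
vocab_size, embed_dim = 10, 4  # the task above uses 10000 and 100

# One row of `embed_dim` floats per word index.
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed(sequence):
    """Replace each integer word index with its row in the lookup table."""
    return [table[i] for i in sequence]

vectors = embed([0, 3, 3, 7])
print(len(vectors), len(vectors[0]))  # 4 4
```

Repeated word indexes map to the same vector, which is what lets the model share what it learns about a word across all its occurrences.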

In [17]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM, TimeDistributed

model = Sequential()
# Map each of the top_words word indexes to a 32-dimensional vector; inputs are 300 tokens long.
# (The task description above suggests 100 dimensions; 32 is used here.)
model.add(Embedding(top_words, 32, input_length=300))
# Single-unit LSTM that returns its output at every time step.
model.add(LSTM(1, return_sequences=True))
# Apply the same Dense(1) layer independently at each of the 300 time steps.
model.add(TimeDistributed(Dense(1)))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
# Sigmoid output: estimated probability that the review is positive.
model.add(Dense(1, activation='sigmoid'))

### Compile the model (2 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [18]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


### Print model summary (2 Marks)

In [19]:
model.summary()  # prints the summary itself; wrapping it in print() would also print None

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 32)           320000    
_________________________________________________________________
lstm (LSTM)                  (None, 300, 1)            136       
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 1)            2         
_________________________________________________________________
flatten (Flatten)            (None, 300)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 350,339
Trainable params: 350,339
Non-trainable params: 0
__________________________________________________
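
The parameter counts in the summary can be checked by hand. The sketch below uses the usual formulas for embedding, LSTM, and dense layers. The helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, units):
    return inputs * units + units

counts = {
    "embedding": embedding_params(10000, 32),
    "lstm": lstm_params(32, 1),
    "time_distributed": dense_params(1, 1),  # weights are shared across the 300 time steps
    "dense_1": dense_params(300, 100),       # Flatten yields 300 features
    "dense_2": dense_params(100, 1),
}
print(counts)
print(sum(counts.values()))  # 350339
```

Almost all of the parameters sit in the embedding table, and the single-unit LSTM contributes only 136.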

### Fit the model (2 Marks)

In [20]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=500, verbose=2)

Epoch 1/2
50/50 - 2s - loss: 0.4993 - accuracy: 0.7416 - val_loss: 0.3557 - val_accuracy: 0.8407
Epoch 2/2
50/50 - 2s - loss: 0.2414 - accuracy: 0.9035 - val_loss: 0.3042 - val_accuracy: 0.8723
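
The `50/50` in the log is the number of batches per epoch: 25,000 training reviews split into batches of 500. A quick check:

```python
import math

n_train, batch_size = 25000, 500
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)  # 50
```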


<tensorflow.python.keras.callbacks.History at 0x7f5795e254e0>

### Evaluate model (2 Marks)

In [21]:
scores = model.evaluate(X_test, y_test, verbose=0)

In [22]:
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.23%

### Predict on one sample (2 Marks)

In [24]:
y_pred = model.predict(X_test[0:1000])
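
Each prediction is a sigmoid output, which can be read as the estimated probability that the review is positive. A common convention (an assumption here, since the notebook does not fix a threshold) is to call a review positive when that value is at least 0.5. The sketch below applies that rule to three values taken from the output printed in this notebook. `to_labels` is an illustrative helper.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert sigmoid outputs to 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in probabilities]

probs = [7.19748735e-02, 9.95612144e-01, 4.88594472e-01]
print(to_labels(probs))  # [0, 1, 0]
```

To predict on a single review rather than 1,000, one could pass a one-row slice such as `X_test[0:1]` to `model.predict`.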

In [25]:
print("y_pred:",y_pred)

y_pred: [[7.19748735e-02]
 [9.95612144e-01]
 [9.35924053e-01]
 [7.13260651e-01]
 [9.84752715e-01]
 [4.88594472e-01]
 [9.80475008e-01]
 [1.04460325e-02]
 [9.67068017e-01]
 [9.85937655e-01]
 [9.51796055e-01]
 [4.46973601e-03]
 [2.55335178e-02]
 [2.92021837e-02]
 [9.94786739e-01]
 [2.00439517e-05]
 [9.63796377e-01]
 [7.00112820e-01]
 [8.43494257e-04]
 [5.44285476e-02]
 [9.93236840e-01]
 [9.83757079e-01]
 [2.77747333e-01]
 [9.27279830e-01]
 [9.73556101e-01]
 [9.77747381e-01]
 [1.15065321e-01]
 [9.06312943e-01]
 [9.56836522e-01]
 [6.20346400e-05]
 [9.60055292e-01]
 [7.27528989e-01]
 [9.42295969e-01]
 [1.90717224e-02]
 [2.77844071e-02]
 [4.76310262e-03]
 [9.94342923e-01]
 [9.81414735e-01]
 [6.48407862e-02]
 [3.39158485e-03]
 [9.97551978e-01]
 [9.92290080e-01]
 [4.20457497e-02]
 [9.73885536e-01]
 [9.99269664e-01]
 [9.21485603e-01]
 [1.88878011e-02]
 [7.98192399e-04]
 [4.25334508e-03]
 [2.52845585e-01]
 [2.16692267e-03]
 [9.92900506e-02]
 [8.47985268e-01]
 [9.46551561e-01]
 [8.26567113e-01]
 [