# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
- As a convention, "0" does not stand for a specific word; it is reserved for padding. With the default `load_data()` settings, out-of-vocabulary words are encoded as "2".
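
The frequency-rank encoding turns such filtering into a simple comparison on the integers. Below is a minimal sketch on a made-up toy review, assuming filtered-out words are replaced by a placeholder id. The helper `filter_by_rank` and its parameter `oov_id` are hypothetical names, not part of Keras.

```python
def filter_by_rank(sequence, num_words, skip_top=0, oov_id=2):
    """Keep ids with skip_top <= id < num_words; replace the rest with oov_id.

    Assumption: smaller ids mean more frequent words, as described above.
    """
    return [w if skip_top <= w < num_words else oov_id for w in sequence]


# Toy review encoded as frequency ranks (made-up values)
toy_review = [5, 25, 12000, 300, 9999, 19]
filtered = filter_by_rank(toy_review, num_words=10000, skip_top=20)
print(filtered)  # [2, 25, 2, 300, 9999, 2]
```

`imdb.load_data()` exposes the same idea through its `num_words` and `skip_top` arguments.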

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (4 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

from tensorflow import keras
import tensorflow as tf

import matplotlib.pyplot as plt
from tensorflow.keras.datasets import imdb

In [22]:
print(tf.__version__)

2.3.1


In [20]:
data = pd.DataFrame(imdb.load_data())

In [9]:
most_frequent_words = 10000
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = most_frequent_words)

### Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [25]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
sq_length = 300
train_data = pad_sequences(train_data, maxlen=sq_length, padding='pre', truncating='pre')
test_data = pad_sequences(test_data, maxlen=sq_length, padding='pre', truncating='pre')
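
To make the effect of `padding='pre'` and `truncating='pre'` concrete, here is a plain-Python sketch of the same behaviour for a single sequence. The helper `pad_pre` is a hypothetical illustration, not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of ints to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    kept = list(sequence)[-maxlen:]               # 'pre' truncation: drop from the front
    return [value] * (maxlen - len(kept)) + kept  # 'pre' padding: fill in front


short = pad_pre([7, 8, 9], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 7, 8, 9]
print(long)   # [3, 4, 5, 6, 7]
```

With `'pre'` truncation the end of a long review is kept, and with `'pre'` padding the real tokens sit at the end of the sequence, closest to the final LSTM step.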

### Print shape of features & labels (4 Marks)

Number of reviews and number of words in each review

In [28]:
print(train_data.shape)
print(test_data.shape)

(25000, 300)
(25000, 300)


Each split now contains 25,000 reviews, each padded or truncated to a length of 300.

In [29]:
print(train_labels.shape)
print(test_labels.shape)

(25000,)
(25000,)


There is one label per review: 25,000 reviews and 25,000 labels.

Number of labels

In [37]:
print(np.unique(train_labels))
print(np.unique(test_labels))

[0 1]
[0 1]


There are only two label values: positive (1) and negative (0).

### Print value of any one feature and its label (4 Marks)

Feature value

In [38]:
train_data[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    1,  194, 1153,  194, 8255,   78,  228,    5,    6, 1463,
       4369, 5012,  134,   26,    4,  715,    8,  118, 1634,   14,  394,
         20,   13,  119,  954,  189,  102,    5,  207,  110, 3103,   21,
         14,   69,  188,    8,   30,   23,    7,   

Label value

In [39]:
train_labels[1]

0

### Decode the feature value to get original sentence (4 Marks)

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

In [49]:
word_index = keras.datasets.imdb.get_word_index()
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [55]:
# Invert the mapping (id -> word) under a separate name so word_index stays intact
reverse_word_index = {value: key for key, value in word_index.items()}
print(' '.join(reverse_word_index[i] for i in train_data[0]))

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend 
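
The offset of 3 and the reserved ids can be seen more easily on a tiny example. The vocabulary below is made up for illustration, and `decode` is a hypothetical helper; only the offset and the four special tokens mirror the cells above.

```python
# Hypothetical raw vocabulary: word -> frequency rank (1 = most frequent)
raw_index = {"the": 1, "film": 2, "great": 3}

OFFSET = 3  # ids 0..3 are reserved, so real words start at rank + 3
word_to_id = {word: rank + OFFSET for word, rank in raw_index.items()}
word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping to go from ids back to words
id_to_word = {i: w for w, i in word_to_id.items()}


def decode(ids):
    """Turn a list of ids back into text; ids not in the map fall back to <UNK>."""
    return " ".join(id_to_word.get(i, "<UNK>") for i in ids)


encoded = [0, 0, 1, 4, 5, 2, 6]
decoded = decode(encoded)
print(decoded)  # <PAD> <PAD> <START> the film <UNK> great
```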

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [53]:
train_labels[0]

1

### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer does not require one-hot encoded words; instead, each word must be given a unique integer id. This has already been done for the IMDB dataset loaded here; otherwise, a tool such as sklearn's `LabelEncoder` could be used.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
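
The layer list above can be sketched as follows. This is one possible reading of the specification, not a definitive solution: the 100 LSTM units and the sigmoid output activation are assumptions, since the list does not state them, and the input length is given through an `Input` layer rather than the Embedding layer's `input_length` argument.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM, TimeDistributed

vocab_size = 10000   # size of the vocabulary
embedding_dim = 100  # dimension of the dense embedding
seq_length = 300     # length of the padded input sequences

spec_model = Sequential([
    Input(shape=(seq_length,)),             # one integer word id per time step
    Embedding(vocab_size, embedding_dim),   # -> (batch, 300, 100)
    LSTM(100, return_sequences=True),       # assumed 100 units; one output per step
    TimeDistributed(Dense(100)),            # same Dense applied at every time step
    Flatten(),                              # -> (batch, 300 * 100)
    Dense(1, activation="sigmoid"),         # assumed: one probability per review
])
print(spec_model.output_shape)
```

Flattening before the final Dense layer gives a single prediction per review, which matches the one-label-per-review shape of `train_labels`.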

In [89]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed, Dropout, Bidirectional, Input

In [120]:
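inputs = Input(shape=(sq_length,))  # one integer word id per time step
# Map each word id to a 50-dimensional dense vector
x = Embedding(input_dim=10000, output_dim=50, input_length=sq_length)(inputs)
x = Dropout(0.1)(x)
# return_sequences=True keeps one output per time step: (None, 300, 200)
x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(x)
# NOTE: softmax over a single unit always outputs 1.0; sigmoid is the usual
# choice for a one-unit binary output. Without Flatten, the output is per time step.
out = Dense(1, activation="softmax")(x)

In [121]:
model = Model(inputs, out)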

### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [122]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]) 

### Print model summary (4 Marks)

In [123]:
model.summary()

Model: "functional_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        [(None, 300)]             0         
_________________________________________________________________
embedding_10 (Embedding)     (None, 300, 50)           500000    
_________________________________________________________________
dropout_6 (Dropout)          (None, 300, 50)           0         
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 300, 200)          120800    
_________________________________________________________________
dense_7 (Dense)              (None, 300, 1)            201       
Total params: 621,001
Trainable params: 621,001
Non-trainable params: 0
_________________________________________________________________
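
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers; the helper function names are illustrative only.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


emb = embedding_params(10000, 50)
bi_lstm = 2 * lstm_params(50, 100)  # Bidirectional wraps two LSTMs
dense = dense_params(200, 1)        # input is 2 * 100 concatenated LSTM outputs
total = emb + bi_lstm + dense
print(emb, bi_lstm, dense, total)  # 500000 120800 201 621001
```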


### Fit the model (4 Marks)

In [124]:
from tensorflow.keras.callbacks import ModelCheckpoint

save_at = "lstm_bidirectional.hdf5"
save_best = ModelCheckpoint(save_at, monitor='val_loss', verbose=1, 
                              save_best_only=True, save_weights_only=False, mode='min')

In [125]:
model.fit(train_data, train_labels, batch_size=32, epochs=3,
            validation_split=0.3, verbose=1, callbacks=[save_best])

Epoch 1/3
Epoch 00001: val_loss improved from inf to 7.72628, saving model to lstm_bidirectional.hdf5
Epoch 2/3
Epoch 00002: val_loss did not improve from 7.72628
Epoch 3/3
Epoch 00003: val_loss did not improve from 7.72628


<tensorflow.python.keras.callbacks.History at 0x17b08b72f40>

### Evaluate model (4 Marks)

In [126]:
model.evaluate(test_data, test_labels, batch_size=128)



[7.6246185302734375, 0.5]
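
A likely explanation for the chance-level accuracy of 0.5 is the output layer: softmax normalizes across units, so applied to a single unit it yields 1.0 whatever the input. The sketch below illustrates this in plain Python (the helper names are illustrative) and contrasts it with sigmoid, the usual activation for a one-unit binary output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logits = [-3.0, 0.0, 2.5]
softmax_outputs = [softmax([z])[0] for z in logits]       # one unit at a time
sigmoid_outputs = [round(sigmoid(z), 3) for z in logits]
print(softmax_outputs)  # [1.0, 1.0, 1.0]
print(sigmoid_outputs)  # [0.047, 0.5, 0.924]
```

With every prediction pinned at 1.0, about half of the labels in this balanced test set are matched, which is consistent with the 0.5 accuracy and the large, non-improving loss reported above.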

### Predict on one sample (4 Marks)

In [127]:
y_pred = model.predict(test_data)

In [130]:
y_pred[0][1]

array([1.], dtype=float32)

In [129]:
train_labels[0]

1

Note: `y_pred` was computed on `test_data`, so the matching ground truth is `test_labels[0]`, not `train_labels[0]`. Also, because the model outputs one value per time step (shape `(300, 1)` per review), `y_pred[0][1]` is the prediction at the second time step of the first test review, not a single review-level score.