![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
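- As a convention, "0" does not stand for a specific word. With the default `imdb.load_data()` settings (`start_char=1`, `oov_char=2`, `index_from=3`), 0 is left free for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.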
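
The filtering that frequency-ranked indexes make possible can be sketched in plain Python. This is only an illustration on a made-up sequence: `filter_by_rank` is a hypothetical helper, not a Keras function, although `imdb.load_data()` exposes comparable `num_words` and `skip_top` arguments.

```python
def filter_by_rank(sequence, num_words=10000, skip_top=20, oov_char=2):
    """Keep indexes in [skip_top, num_words); replace all others with oov_char.

    Hypothetical helper for illustration only.
    """
    return [w if skip_top <= w < num_words else oov_char for w in sequence]


# Toy review: indexes 1 and 5 are too frequent, 10000 and 12000 are too rare
toy_review = [1, 5, 25, 9999, 10000, 12000]
filtered = filter_by_rank(toy_review)
print(filtered)  # [2, 2, 25, 9999, 2, 2]
```

Because rank and index coincide, the whole filter reduces to a single range check per word.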

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (4 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [1]:
from tensorflow.keras.datasets import imdb
from sklearn.model_selection import train_test_split
import numpy as np

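WORDS = 10000    # keep only the 10,000 most frequent words
INDEX_FROM = 3   # word ids start at 3; ids 0, 1 and 2 are reserved
(Xtrain, ytrain), (Xtest, ytest) = imdb.load_data(num_words=WORDS, index_from=INDEX_FROM)

# Merge the default 25,000/25,000 split, then re-split 70/30 with balanced classes
data = np.concatenate((Xtrain, Xtest), axis=0)
targets = np.concatenate((ytrain, ytest), axis=0)

Xtrain, Xtest, ytrain, ytest = train_test_split(data, targets, test_size=0.3, stratify=targets, random_state=41)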

### Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [2]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
MAXLENGTH = 300

# By default, pad_sequences pads and truncates at the start of each sequence
Xtrain = pad_sequences(Xtrain, maxlen=MAXLENGTH, value=0)
Xtest = pad_sequences(Xtest, maxlen=MAXLENGTH, value=0)
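
To make the padding step concrete, here is a minimal pure-Python sketch of "pre" padding and "pre" truncation, which are the `pad_sequences` defaults. The helper `pad_pre` and the toy sequences are assumptions made for illustration; it is not the Keras implementation and assumes `maxlen >= 1`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded


toy = [[1, 2, 3], [4, 5, 6, 7, 8]]
result = pad_pre(toy, maxlen=4)
print(result)  # [[0, 1, 2, 3], [5, 6, 7, 8]]
```

Note that a long review loses its beginning, not its end, under these defaults.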

### Print shape of features & labels (4 Marks)

Number of reviews, number of words in each review

In [3]:
Xtrain.shape

(35000, 300)

In [4]:
Xtest.shape

(15000, 300)

There are 50,000 reviews in total: 35,000 for training and 15,000 for testing. Because every review was padded or truncated to a maximum length of 300, each review now contains exactly 300 word indexes.

Number of labels

In [5]:
ytrain.shape 

(35000,)

In [6]:
ytest.shape

(15000,)

Similarly, there are 50,000 labels, one per review (35,000 + 15,000).

In [7]:
print("TRAIN DATA: ")
print("Categories:", np.unique(ytrain))
print("Number of unique words:", len(np.unique(np.hstack(Xtrain))))

# Lengths are measured after padding, so every review has length 300
length = [len(i) for i in Xtrain]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))

TRAIN DATA: 
Categories: [0 1]
Number of unique words: 9999
Average Review length: 300.0
Standard Deviation: 0.0


In [8]:
print("TEST DATA: ")
print("Categories:", np.unique(ytest))
print("Number of unique words:", len(np.unique(np.hstack(Xtest))))

# Lengths are measured after padding, so every review has length 300
length = [len(i) for i in Xtest]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))

TEST DATA: 
Categories: [0 1]
Number of unique words: 9998
Average Review length: 300.0
Standard Deviation: 0.0


### Print value of any one feature and its label (4 Marks)

Feature value

In [9]:
index = 2
print(Xtrain[index])

[4490  405   22   12    9 3747    8    6  606 3666 1896 2012   43    6
  282    8   67    6 1922  232    5    6 1523  232   30  163  295   10
   10    4 1327 1126 2398    9    4 5003  212  356 7061 6900    2   15
   20    9   87   88    4  114  218 4801    5 5715   45 1121 1243    6
  232 4286  344   18    2  214 1571   19    6    2   21   45  147    2
   12  166   32    4 1474    4  105   26  147    4 1186   26  230   53
  147 4044  430    9 1050 2764    5   94  647 1186    2    4  105   75
  235  164   18   98    5   75   92  459   44  803 1448   23  268    2
 2529    4 4690  347  200 3599    5 1254  867    5 2198  502 4044  430
    9  331 1755   19  640   40    6  606 6135   11    4    2 1587   83
    6 2487 1587   83    6 1651   19    6  351 6135   15   66  218  351
   95    2    5    2   68  519   10   10    4  226    2  519  155    9
 2586  340   39    2    5   45 3270   89   76  538   11   14   22  165
  127 5124    4 1230 1593  308 1911   20   10   10    2    9  331   96
   99 

Label value

In [10]:
print("Sentiment is Positive !!!" if ytrain[index] == 1 else "Sentiment is Negative !!!")

Sentiment is Negative !!!


### Decode the feature value to get original sentence (4 Marks)

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

In [11]:
index = 1
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNKNOWN>"] = 2

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [12]:
# Invert the mapping; avoid `id` as a loop variable because it shadows the built-in
id_to_word = {value: key for key, value in word_to_id.items()}
decoded = ' '.join(id_to_word[i] for i in Xtrain[index])
print(decoded)

i have never before watched a film three times in one night after the third time at 3 00 am i knew i had just experienced a great film <UNKNOWN> now ranks 10 on my top twenty films of all time and in the very small universe of great gay or gay subtext film there is <UNKNOWN> mountain <UNKNOWN> <UNKNOWN> drive and maurice br br thank you mr bell <UNKNOWN> is brilliant and fully realized with a magnificent cast a wonderfully moving understated score excellent cinematography an entertaining touching totally appropriate and <UNKNOWN> song i can go on but i won't <UNKNOWN> too much more br br this film should have received oscar nominations certainly one for best picture the performances without exception were all wonderful ms lovely <UNKNOWN> voice was a surprising <UNKNOWN> and sir ian <UNKNOWN> said awesome br br <UNKNOWN> is the reason i <UNKNOWN> through over 200 mediocre to utterly horrendous films some in the 150 million plus range a year to find that one treasure that one exquisite 
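
The decoding logic can be checked on a tiny example without downloading anything. The sketch below uses a made-up three-word vocabulary (`toy_word_index` is a hypothetical stand-in for the dictionary returned by `imdb.get_word_index()`) and the same offset of 3 used above.

```python
INDEX_FROM = 3  # same offset as used when loading the data

# Made-up stand-in for imdb.get_word_index(): word -> frequency rank
toy_word_index = {"the": 1, "film": 2, "great": 3}

# Shift every rank by the offset, then register the three reserved ids
toy_word_to_id = {word: rank + INDEX_FROM for word, rank in toy_word_index.items()}
toy_word_to_id["<PAD>"] = 0
toy_word_to_id["<START>"] = 1
toy_word_to_id["<UNKNOWN>"] = 2

toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

encoded = [0, 0, 1, 4, 6, 5, 2]
decoded_toy = " ".join(toy_id_to_word[i] for i in encoded)
print(decoded_toy)  # <PAD> <PAD> <START> the great film <UNKNOWN>

# Dropping padding ids gives a cleaner sentence
decoded_no_pad = " ".join(toy_id_to_word[i] for i in encoded if i != 0)
print(decoded_no_pad)  # <START> the great film <UNKNOWN>
```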

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [13]:
print("Sentiment is Positive !!!" if ytrain[index] == 1 else "Sentiment is Negative !!!")

Sentiment is Positive !!!


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require one-hot encoded words; instead, each word needs a unique integer id. For the IMDB dataset this has already been done; otherwise we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
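
Conceptually, an embedding layer is a trainable lookup table with one row per word id. The following sketch illustrates only the lookup, using small random values and toy sizes; `make_embedding_table` and `embed` are hypothetical helpers, and a real layer would also learn the table during training.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a (vocab_size x dim) table of small random values (illustrative)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(table, token_ids):
    """Replace each integer id with its row of the table."""
    return [table[t] for t in token_ids]


table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed(table, [3, 1, 3])

print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: the same id maps to the same vector
```

With the settings listed above, the table would have 10,000 rows of 100 values each, and every review of 300 ids would become a 300 x 100 matrix.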

In [14]:
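# Import from tensorflow.keras throughout, consistent with the data-loading cells
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM, TimeDistributed

model = Sequential()
# Map each of the 10,000 word ids to a 100-dimensional vector
model.add(Embedding(input_dim=10000, output_dim=100, input_length=300))
# return_sequences=True keeps the LSTM output for every one of the 300 time steps
model.add(LSTM(256, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
# Apply the same 100-unit Dense layer to each time step
model.add(TimeDistributed(Dense(100, activation='relu')))
model.add(Flatten())
# Single sigmoid unit for binary sentiment
model.add(Dense(1, activation='sigmoid'))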



### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [15]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
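
For intuition about the loss and metric chosen here, the sketch below computes binary cross-entropy and accuracy by hand on four made-up predictions. The function names, the clipping constant `eps`, and the 0.5 threshold are assumptions for illustration, not the Keras internals.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == (y == 1))
    return hits / len(y_true)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]  # made-up sigmoid outputs

loss_value = binary_crossentropy(y_true, y_prob)
acc_value = accuracy(y_true, y_prob)
print(round(loss_value, 4))
print(acc_value)  # 0.75
```

The third sample is the only misclassification (0.4 is below the threshold for a positive label), and it also contributes most of the loss.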

### Print model summary (4 Marks)

In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 300, 256)          365568    
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          25700     
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 30001     
Total params: 1,421,269
Trainable params: 1,421,269
Non-trainable params: 0
_________________________________________________________________
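
The parameter counts in the summary can be reproduced with a few lines of arithmetic. The variable names below are chosen for illustration; the formulas are the standard ones for embedding, LSTM (four gates, each with input weights, recurrent weights, and a bias), and dense layers.

```python
vocab_size, embed_dim, seq_len = 10000, 100, 300
lstm_units, td_units = 256, 100

embedding_params = vocab_size * embed_dim
# 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# Dense(100) shared across all time steps: weights + bias
time_distributed_params = lstm_units * td_units + td_units
# Flatten has no parameters; it reshapes (300, 100) into 30000 values
flatten_size = seq_len * td_units
dense_params = flatten_size * 1 + 1

total_params = embedding_params + lstm_params + time_distributed_params + dense_params

print(embedding_params)         # 1000000
print(lstm_params)              # 365568
print(time_distributed_params)  # 25700
print(dense_params)             # 30001
print(total_params)             # 1421269
```

Most of the parameters sit in the embedding table, which is why the vocabulary size has such a strong effect on model size.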


### Fit the model (4 Marks)

In [17]:
batch_size = 64
model.fit(Xtrain, ytrain, epochs = 5, batch_size=batch_size, verbose = 1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f8d3e7997f0>

### Evaluate model (4 Marks)

In [18]:
# The model is evaluated on the held-out test split, so report these as test metrics
score, acc = model.evaluate(Xtest, ytest, verbose=2, batch_size=batch_size)
print("Test Loss: %.2f" % (score))
print("Test Accuracy: %.2f" % (acc))

235/235 - 17s - loss: 0.4468 - accuracy: 0.8901
Test Loss: 0.45
Test Accuracy: 0.89


### Predict on one sample (4 Marks)

In [19]:
def Predict(idx):
  # Reshape the single review into a batch of shape (1, MAXLENGTH)
  data = Xtest[idx].reshape(1, Xtest.shape[1])
  print("TEST DATA:")
  print(data)
  print("ACTUAL SENTIMENT: ")
  print("Sentiment is Positive !!!" if ytest[idx] == 1 else "Sentiment is Negative !!!")
  # model.predict returns an array of shape (1, 1); extract the scalar probability
  sentiment = model.predict(data, batch_size=1)[0][0]
  print("PREDICTED SENTIMENT: ")
  print("Sentiment is Positive !!!" if sentiment > 0.5 else "Sentiment is Negative !!!")
  print("")

In [20]:
Predict(1)
Predict(12)

TEST DATA:
[[  11 5731 1922 2281 7550   62   28  276   15  496  208   46 1011    5
   343   14  509   17    2 6228 2214   18 7329 3050   17    9  263 2787
    16 9404    2    2    5   69    8 4184   18    6   55 7687  611  395
  9956 5961  372   10   10    4  650  724    2    2   12  215   30  301
     9   24   55 1444    4 1385    9  327    5 3551   11    6 1162    2
    96   15 3355 6799   25   83  536   50  238   30   49 2812 1518 1663
    33  157  496    4  370 5898   45   43    6 3475  836 5235 1700   93
    18  248  708   15 1668 7652  201  381    8   97   49 1727 2208    5
  6585   49 1877    2  151   36  540  161  124   15   33    4   58    5
   123  125   68 2202  133 1106 1899    8    6  248 2407 3478  223    2
   125  429  248 2407 3959   37  191 1197  726  507    6 7040 1366   42
     6 4934 1215 1346    4 1057 3259    4 1474   34 1489   98   11    2
     2    5   28  115 2051   31    7    4 4101 3080    7 3469  699   92
   947   19 2801 9172   67    2    5 6934    4   64  