![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- A dataset of 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
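- As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data()`, "0" is reserved for padding, "1" marks the start of a review, and "2" encodes any unknown (out-of-vocabulary) word.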
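
The frequency-based indexing described above can be illustrated with a small sketch. The toy corpus, the `encode` helper, its `num_words` and `skip_top` parameters, and the placeholder id `0` for filtered-out words are assumptions made for this illustration; the exact boundary handling and reserved ids in `imdb.load_data()` may differ.

```python
from collections import Counter

# Toy corpus standing in for the reviews (hypothetical data, not from IMDB).
corpus = [
    "the movie was good",
    "the movie was bad",
    "the plot was thin",
]

# Rank words by overall frequency: rank 1 is the most frequent word.
counts = Counter(word for review in corpus for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
word_to_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, num_words=None, skip_top=0, placeholder=0):
    """Encode a review as frequency ranks; filtered-out words become `placeholder`."""
    encoded = []
    for word in review.split():
        rank = word_to_rank[word]
        too_common = rank <= skip_top
        too_rare = num_words is not None and rank > num_words
        encoded.append(placeholder if too_common or too_rare else rank)
    return encoded

print(encode("the movie was good"))                         # no filtering
print(encode("the movie was good", num_words=3, skip_top=1))  # keep ranks 2..3 only
```

Because the ids are ranks, both filters reduce to simple integer comparisons, which is what makes "keep the top N but drop the top K" cheap.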

Command to import data
- `from tensorflow.keras.datasets import imdb`

In [1]:
%tensorflow_version 2.x
import tensorflow
tensorflow.__version__

'2.3.0'

### Import the data (4 Marks)
- Use the `imdb.load_data()` method
- Get the train and test sets
- Take the 10,000 most frequent words

In [2]:
import numpy as np
import random

from tensorflow.keras.datasets import imdb
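# Import everything from tensorflow.keras to avoid mixing it with standalone keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout, TimeDistributed, Flatten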

In [3]:
#### Add your code here ####
(X_train,Y_train),(X_test,Y_test) = imdb.load_data(num_words=10000)

### Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [4]:
#### Add your code here ####
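# Note: X_train and X_test are object arrays of lists, so `+` joins review i of
# train with review i of test; this prints the longest such pair, not one review.
print("Maximum review length : {}".format(len(max((X_train+X_test),key=len))))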
maxwords = 300
X_train = pad_sequences(X_train, maxlen=maxwords)
X_test = pad_sequences(X_test, maxlen=maxwords)

Maximum review length : 2697
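
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_pre` and the sample sequences are hypothetical; the real function also returns a NumPy array rather than lists.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(sequence) >= maxlen:
        # Truncate from the front so the end of the review is preserved
        return sequence[len(sequence) - maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_pre(short_review, maxlen=5))  # zeros are added on the left
print(pad_pre(long_review, maxlen=5))   # the first two ids are dropped
```

With `maxlen=300`, reviews shorter than 300 tokens gain leading zeros and longer reviews lose their beginning.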


### Print shape of features & labels (4 Marks)

Number of reviews and number of words in each review

In [5]:
#### Add your code here ####
print(X_train.shape , X_test.shape)
print("Number of reviews in train: {}".format(len(X_train)))
print("Number of reviews in test: {}".format(len(X_test)))

(25000, 300) (25000, 300)
Number of reviews in train: 25000
Number of reviews in test: 25000


In [6]:
#### Add your code here ####
print('num of words in train review:', [len(sequence) for sequence in X_train])
print('num of words in test review:', [len(sequence) for sequence in X_test])

num of words in train review: [300, 300, 300, 300, 300, 300, 300, 300, 300, 300, ...

Number of labels

In [7]:
#### Add your code here ####
print(Y_train.shape, Y_test.shape)
print("Number of labels: {}".format(len(np.unique(Y_train))))

(25000,) (25000,)
Number of labels: 2


### Print value of any one feature and its label (4 Marks)

Feature value

In [8]:
#### Add your code here ####
print(X_train[13])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    1  259   37  100  169 1653 1107   11
   14  418    7  595 3882    8   28   68  419 8932   75   28    4    2
 8802 5227  173   58 7164  322   19    2   32  120   41  648    2 1990
   39 2448    2   34   35 4595  492  150   59    9    2 7143 5170   32
  120    4 3904 1873    4  766   38 6204  820 6133    8 3177    2 9106
   41  957   11  620 1093   75   28    4  658   37  517   46   34    2
    6 4057   37   43  571    8   30   27  577  442 3072   19   90   88
   29  385   99  946    5  630   34 5330   27  668 7698  260  383   19
   41 3586    5   95    2   41   56   75   28    4  554   37    9 6866
    2   34   27 8176    5   37  266  344    5 3936   27 1633   25   67
   45 

Label value

In [9]:
#### Add your code here ####
print(Y_train[13])

0


### Decode the feature value to get original sentence (4 Marks)

First, retrieve the dictionary that maps each word to its index in the IMDB dataset

In [10]:
#### Add your code here ####
word2id = imdb.get_word_index()

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [11]:
#### Add your code here ####
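# Map numeric labels to readable names
sentiments = {1: 'Positive', 0: 'Negative'}

def decode_feature(sequence):
    # Invert the word -> index dictionary to index -> word
    id2words = {value: key for (key, value) in word2id.items()}
    # Encoded ids are offset by 3 (0 = padding, 1 = start, 2 = unknown),
    # so subtract 3 before the lookup; ids without a word print as '?'
    review = ' '.join([id2words.get(i-3, '?') for i in sequence])
    print('\n',review)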
    
review_id1 = np.random.choice(X_train.shape[0])
decode_feature(X_train[review_id1])



 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? i saw saving grace right after it came out on video since then it's become one of my favorites the plot isn't particularly complex but it doesn't take away from the entertainment it's chuck full of comedic moments and has a very endearing quality to it the characters are what makes the movie so good they each have their own quirky qualities which adds to the humor the two old ladies played by linda kerr scott and ? law leaps to mind superb acting was done by all particularly brenda ? she and craig ? were great together in pulling off some of the funnier moments if you're looking for a good comedy i'd ? recommend this movie
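
The long run of `?` at the start of the decoded review comes from the padding zeros and the reserved ids. A minimal sketch of the offset, assuming the defaults of `imdb.load_data()` (`index_from=3`, with 0, 1, 2 reserved for padding, start, and unknown); the toy dictionary and the `<pad>`, `<start>`, `<unk>` labels are hypothetical and chosen only for this illustration:

```python
# Hypothetical miniature of the dictionary returned by imdb.get_word_index():
# word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "and": 2, "movie": 3, "great": 4}

INDEX_FROM = 3  # default offset applied by imdb.load_data()
RESERVED = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # labels chosen for this sketch

# Shift each rank by the offset so the ids match the encoded reviews.
id_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
id_to_word.update(RESERVED)

def decode(sequence):
    """Turn a list of ids back into text; ids with no entry become '?'."""
    return " ".join(id_to_word.get(i, "?") for i in sequence)

encoded = [0, 0, 1, 4, 6, 2, 7]
print(decode(encoded))
```

Naming the reserved ids explicitly, instead of letting them fall through to `?`, makes it easier to tell padding apart from words that were outside the vocabulary.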


Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [12]:
#### Add your code here ####
print('The sentiment for the above sentence is:', sentiments.get(Y_train[review_id1]))

The sentiment for the above sentence is: Positive


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require us to one-hot encode our words; instead, each word needs a unique integer id. This has already been done for the IMDB dataset we loaded; otherwise, we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer

In [13]:
#### Add your code here ####
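vocab_size=10000      # number of most frequent words kept by load_data
embedding_size=300    # dimension of the dense embedding vectors used in this run

model = Sequential()
# Turn each word index into a dense vector: output shape (batch, 300, 300)
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=maxwords))
model.add(Dropout(0.3))
# return_sequences=True emits one 128-d output per time step for TimeDistributed
model.add(LSTM(128, dropout=0.3, return_sequences=True))
# Apply the same 100-unit Dense layer to every time step
model.add(TimeDistributed(Dense(100, activation='relu')))
# Flatten (300, 100) into a single 30000-d vector per review
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
# A single sigmoid unit outputs the probability of the positive class
model.add(Dense(1, activation='sigmoid'))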

### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [14]:
#### Add your code here ####
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
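
For intuition about the loss chosen here, binary cross-entropy can be written out by hand. This is a plain-Python sketch, not the Keras implementation; the clipping constant `eps` and the sample probabilities are assumptions made for the illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # hypothetical predictions close to the labels
unsure = [0.6, 0.4, 0.5, 0.5]     # hypothetical predictions near 0.5

print(binary_crossentropy(labels, confident))
print(binary_crossentropy(labels, unsure))
```

The loss is lower for the confident, correct predictions, and a prediction of exactly 0.5 costs `log(2)` per sample regardless of the label.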

### Print model summary (4 Marks)

In [15]:
#### Add your code here ####
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 300)          3000000   
_________________________________________________________________
dropout (Dropout)            (None, 300, 300)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 300, 128)          219648    
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          12900     
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               3840128   
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
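
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers together with the sizes from the model definition above; the variable names are chosen for this illustration.

```python
vocab_size, embedding_dim = 10000, 300
lstm_units, td_units, dense_units, timesteps = 128, 100, 128, 300

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# TimeDistributed(Dense): the same weights are shared across all time steps
td_params = lstm_units * td_units + td_units

# Flatten has no parameters; it only reshapes (300, 100) into 30000 values
flatten_size = timesteps * td_units

# Dense on the flattened vector: weights plus one bias per unit
dense_params = flatten_size * dense_units + dense_units

print(embedding_params, lstm_params, td_params, flatten_size, dense_params)
```

The Dense layer after `Flatten` holds more parameters than the Embedding layer, because every one of the 30,000 flattened values connects to each of its 128 units.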

### Fit the model (4 Marks)

In [16]:
#### Add your code here ####
model.fit(X_train, Y_train, epochs=5, batch_size=256)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fb44cc30d30>
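
As a rough check on how much work this call does, the number of weight updates follows from the training-set size and the batch size. The sketch is simple arithmetic using the values passed to `model.fit` above; the variable names are chosen for this illustration.

```python
import math

num_samples = 25000  # reviews in X_train
batch_size = 256
epochs = 5

# The last batch of each epoch is smaller when the sizes do not divide evenly
steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)
```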

### Evaluate model (4 Marks)

In [17]:
#### Add your code here ####
tr_loss, tr_acc = model.evaluate(X_train, Y_train)
print('Training Loss: %.4f and Accuracy: %.2f%%' % (tr_loss, tr_acc * 100))

loss, acc = model.evaluate(X_test, Y_test)
print('Test Loss: %.4f and Accuracy: %.2f%%' % (loss, acc * 100))

Training Loss: 0.0293 and Accuracy: 99.31%
Test Loss: 0.4026 and Accuracy: 87.64%


In [18]:
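from sklearn.metrics import classification_report
# predict_classes is deprecated; see the replacement suggested in the warning below
y_pred = model.predict_classes(X_test)
# Note: classification_report expects (y_true, y_pred). With the order used here,
# precision and recall are swapped and support counts the predicted labels.
print('Classification Report:\n',classification_report(y_pred, Y_test))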

Instructions for updating:
Please use instead:
* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).
* `(model.predict(x) > 0.5).astype("int32")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.88      0.88     12252
           1       0.89      0.87      0.88     12748

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000
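
The two columns of the report depend on which argument is treated as the ground truth. Below is a minimal pure-Python sketch that thresholds sigmoid outputs at 0.5 and computes precision and recall by hand. The probabilities, labels, and the `precision_recall` helper are hypothetical and not part of scikit-learn.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical sigmoid outputs and true labels
probs = [0.9, 0.7, 0.4, 0.2, 0.3, 0.1]
y_true = [1, 1, 1, 0, 0, 0]

# Threshold at 0.5 to turn probabilities into class labels
y_pred = [int(p > 0.5) for p in probs]

print(precision_recall(y_true, y_pred))  # ground truth passed first
print(precision_recall(y_pred, y_true))  # arguments swapped
```

Swapping the arguments exchanges precision and recall, while accuracy and F1 are unaffected, so it is worth passing the true labels first.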



### Predict on one sample (4 Marks)

In [19]:
#### Add your code here ####
decode_feature(X_test[7])
print('The Predicted sentiment is:',sentiments.get(y_pred[7][0]))
print('The Actual sentiment is:', sentiments.get(Y_test[7]))


 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? the ? richard ? dog is ? to ? joan fontaine dog however when ? bing crosby arrives in town to sell a record player to the emperor his dog is attacked by ? dog after a revenge attack where ? is ? from town a ? insists that ? dog must confront dog so that she can overcome her ? fears this is arranged and the dogs fall in love so do ? and ? the rest of the film passes by with romance and at the end ? dog gives birth but who is the father br br the dog story is the very weak vehicle that is used to try and create a story between humans its a terrible storyline there are 3 main musical pieces all of which are rubbish bad songs and dreadful choreography its just an extremely boring film bing has too many words in each sentence and delivers them in an a