# Advanced NLP - Lab 3

In this lab, we will do some simple text classification using deep neural networks. We will use the `imdb` dataset that ships with Keras, which the Keras documentation describes as follows:

> This is a dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

https://keras.io/api/datasets/imdb/
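
The filtering operation mentioned in the quote can be sketched in plain Python. This is a minimal sketch, assuming the convention used by `imdb.load_data`, whose `num_words` and `skip_top` arguments replace out-of-range indices with an out-of-vocabulary marker (`oov_char`, 2 by default). The helper name `filter_indices` and the toy review are hypothetical.

```python
def filter_indices(seq, num_words=10000, skip_top=20, oov_char=2):
    """Keep indices in [skip_top, num_words); replace all others with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in seq]

# Hypothetical toy review: a mix of very frequent, mid-frequency, and rare indices
review = [1, 14, 22, 16, 43, 530, 22665, 9, 31050]
filtered = filter_indices(review)
print(filtered)  # [2, 2, 22, 2, 43, 530, 2, 2, 2]
```

In practice the same effect is obtained by passing `num_words=10000, skip_top=20` to `imdb.load_data`.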

We will start by loading the dataset:

In [1]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb

train, test = imdb.load_data()
train

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


(array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
        list([1, 194,

## Question 1

Convert each document into a bag-of-words vector (the sum of its 1-of-V word vectors) containing the frequency of each of the top 1,000 words.

In [2]:
from collections import Counter
import numpy as np

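def to_freq(v):
  """Return a length-1,000 vector with the count of each index 1..1000 in document v."""
  # Note: with the load_data() defaults, index 1 is the start-of-sequence marker
  # and word indices are offset by 3, so this range is an approximation of "top 1,000 words"
  f = Counter(v)
  return np.array([f[x] for x in range(1, 1001)])

X = np.array([to_freq(v) for v in train[0]])
y = np.array(train[1])
# Build the same features for the test set
X_test = np.array([to_freq(v) for v in test[0]])
y_test = np.array(test[1])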

In [3]:
X.shape

(25000, 1000)
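
To make the feature construction concrete, here is a small worked example that applies the same counting idea to a toy vocabulary of 5 indices instead of 1,000. The helper `to_freq_small` and the toy document are hypothetical, chosen only for illustration.

```python
from collections import Counter

def to_freq_small(doc, vocab_size=5):
    """Count how often each index 1..vocab_size occurs in doc."""
    counts = Counter(doc)
    # Counter returns 0 for missing keys, so absent words get a zero entry
    return [counts[i] for i in range(1, vocab_size + 1)]

doc = [1, 4, 4, 2, 9, 4]   # index 9 is outside the toy vocabulary and is ignored
vec = to_freq_small(doc)
print(vec)  # [1, 1, 0, 3, 0]
```

Each row of `X` is built the same way, which is why `X.shape` is `(25000, 1000)`: one row per review and one column per index.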

## Question 2

Now build a simple classifier on these features to predict the sentiment of the reviews using Keras. You can run this experiment with a subset of 100 examples from the training set. Try two variants, one with a softmax and one with a sigmoid activation function in the output layer, and compare their accuracy.

In [15]:
from keras.models import Sequential
from keras.layers import Dense
import tensorflow.compat.v1 as tf
import keras
import keras.utils

model = Sequential()
# Softmax normalizes across the layer's units, so with a single unit the output is always 1.0
model.add(Dense(1, activation='softmax', input_dim=1000))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=6)
score = model.evaluate(X_test, y_test)
print("Test Accuracy:", score[1])

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Test Accuracy: 0.5


In [16]:
from keras.models import Sequential
from keras.layers import Dense
import tensorflow.compat.v1 as tf
import keras
import keras.utils

model = Sequential()
model.add(Dense(1, activation='sigmoid', input_dim=1000))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=6)
score = model.evaluate(X_test, y_test)
print("Test Accuracy:", score[1])

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Test Accuracy: 0.8652399778366089
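
The gap between the two runs has a simple explanation: softmax normalizes over the units of a layer, so with a single output unit it always returns 1.0 regardless of the input, and the model predicts the positive class for every review. Since the IMDB test set is balanced, that yields 0.5 accuracy. The sketch below checks this numerically in plain Python; the function names and logit values are illustrative.

```python
import math

def softmax(logits):
    """Softmax over a list of logits (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic sigmoid for a single logit."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-3.0, 0.0, 2.5):
    # With one unit, softmax is exp(z) / exp(z) = 1.0 for every z
    print(z, softmax([z])[0], round(sigmoid(z), 4))
```

To use softmax for this task, the output layer would need two units, with a categorical cross-entropy loss and labels encoded to match.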


Look at the Keras documentation and add a hidden dense layer with 20 units and a ReLU activation to your model.

In [6]:
from keras.models import Sequential
from keras.layers import Dense
import tensorflow.compat.v1 as tf
import keras
import keras.utils

model = Sequential()
# Hidden layer: 1000 input features -> 20 ReLU units
model.add(Dense(20, activation='relu', input_dim=1000))
# Output layer: its input size (20) is inferred from the previous layer
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=6)
score = model.evaluate(X_test, y_test)
print("Test Accuracy:", score[1])

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Test Accuracy: 0.8646399974822998
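
As a quick sanity check on the size of this model, the number of trainable parameters of a dense layer is `inputs * units + units` (weights plus biases). The arithmetic below applies that formula to the two layers; `dense_params` is a hypothetical helper, and the totals can be compared against `model.summary()`.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(1000, 20)   # 1000 features -> 20 ReLU units
output = dense_params(20, 1)      # 20 hidden units -> 1 sigmoid unit
total = hidden + output
print(hidden, output, total)  # 20020 21 20041
```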


## Question 3

Look at the documentation for building and training a model:

https://keras.io/guides/training_with_built_in_methods/

How would you go about evaluating your model? Hint: try the built-in methods of the Keras `Model` class.

In [7]:
# evaluate() returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
score = model.evaluate(X_test, y_test)
print("Test Accuracy:", score[1])

Test Accuracy: 0.8646399974822998
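
For intuition about what `evaluate` reports, the two numbers can be recomputed by hand from predicted probabilities: the mean binary cross-entropy and the fraction of predictions on the correct side of 0.5. This is a minimal sketch on made-up labels and probabilities, not the Keras implementation; in the notebook the probabilities would come from `model.predict(X_test)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Made-up labels and predicted probabilities for four reviews
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
loss = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(round(loss, 4), acc)
```

Two of the four toy predictions fall on the wrong side of the threshold, so the accuracy here is 0.5 even though the loss still distinguishes confident from unconfident mistakes.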


## Question 4

Use the [Keras documentation search](https://keras.io/search.html) to make the following modifications to your code:

* Add a dropout layer
* Change the optimizer to AdaGrad
* Change the learning rate of the Adam optimizer
* Add L2 regularization to the output

In [8]:
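from keras.layers import Dropout
from keras.optimizers import Adam, Adagrad

model = Sequential()
model.add(Dense(20, activation='relu', input_dim=1000, kernel_regularizer='l2'))
# Dropout needs no input shape here; it is inferred from the previous layer
model.add(Dropout(0.2))
# L2 regularization on the output layer's weights
model.add(Dense(1, activation='sigmoid', kernel_regularizer='l2'))
# Adagrad with an explicit learning rate; Adam(learning_rate=...) can be substituted the same way
opt = Adagrad(learning_rate=0.001)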
model.compile(loss='binary_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=6)
score = model.evaluate(X_test, y_test)
print("Test Accuracy:", score[1])

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Test Accuracy: 0.6819599866867065
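
To see what the `Dropout(0.2)` layer does during training, here is a minimal sketch of inverted dropout in plain Python: each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected sum is unchanged. The function and values are illustrative and are not the Keras implementation; at evaluation time dropout is disabled and the inputs pass through unchanged.

```python
import random

def dropout(activations, rate=0.2, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
acts = [1.0] * 10
dropped = dropout(acts, rate=0.2, rng=rng)
print(dropped)                  # each entry is either 0.0 or about 1.25
print(dropout(acts, training=False))  # unchanged at evaluation time
```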


## Question 5

Experiment and find a setting that improves the test accuracy. What works best?

In [11]:
from keras.optimizers import RMSprop

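model = Sequential()
model.add(Dense(20, activation='relu', input_dim=1000, kernel_regularizer='l2'))
model.add(Dropout(0.2))
# Optional second hidden layer (left disabled in this run)
#model.add(Dense(20, activation='relu', kernel_regularizer='l2'))
model.add(Dense(1, activation='sigmoid', kernel_regularizer='l2'))

# `learning_rate` replaces the deprecated `lr` alias. The original run also passed
# decay=1e-6, which newer Keras optimizers may reject, so it is omitted here.
opt = RMSprop(learning_rate=0.0001)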
model.compile(loss='binary_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=6)
score = model.evaluate(X_test, y_test)
print("Test Accuracy:", score[1])

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Test Accuracy: 0.8406000137329102
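
One way to organize this kind of experimentation is a small grid search over settings. The sketch below is only a skeleton under stated assumptions: `grid_search`, the search space, and `placeholder_score` are hypothetical, and the placeholder stands in for "build the model, fit it, and return an accuracy", so it produces no real results. Comparing settings on a validation split (for instance via the `validation_split` argument of `model.fit`) rather than on the test set keeps the final test accuracy an honest estimate.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination in grid; return (best_score, best_setting)."""
    names = list(grid)
    best = None
    for values in product(*(grid[n] for n in names)):
        setting = dict(zip(names, values))
        score = score_fn(**setting)
        # Keep the first setting that reaches the highest score seen so far
        if best is None or score > best[0]:
            best = (score, setting)
    return best

# Hypothetical search space; the values are examples, not recommendations
grid = {"units": [20, 50], "dropout": [0.2, 0.5], "learning_rate": [1e-3, 1e-4]}

def placeholder_score(units, dropout, learning_rate):
    """Stand-in for 'build, fit, and return validation accuracy'. NOT real results."""
    return 0.0

best_score, best_setting = grid_search(placeholder_score, grid)
print(best_setting)
```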
