## NLP Task 2-b
In this step you will implement a “vanilla” model of the architecture. 
For this you will need to use Py- Torch or Tensorflow/Keras functional API and various layers
like Input, Embedding, Conv1D, Dropout, MaxPooling1D, Flatten, concatenate, Dense, etc. as well as
other utility functions such as Tokenizer. For some of the parameters, you should consider the values
suggested in Table 1. Please note that those values are “typical” but not necessarily optimal. First, you need to tokenize the texts and cut/pad them to a common max length size. Then you derive the train and test samples and labels. The “vanilla” model should contain the Embedding layer, a single convolution layer of only one block followed by a max-pooling layer and a single dense layer. You will report the classification accuracy of this model. (18 points)


In [36]:
# !pip install -r requirements.txt

In [24]:
import numpy as np
import random, sys
import tensorflow
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Dense, Conv1D, MaxPooling1D
from tensorflow.keras.utils import plot_model 
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import pydot
import graphviz
# Collect Text 
import pandas as pd
import keras_tuner as kt

#### Parameters (Recommendation)

W or number of convolution blocks in each layer: 2-5.  
L or number of consecuetive convolution-pooling layers: 2-4  
Maximal length of each review sequence: 300-500    
Dimension of word embbeddings: 150-300.  
Train : Test split of the data samples: 4:1 or 9:1.  
Number of filters in each convolution layer: 10-50.  
Kernel size in each convolution block: 1-5.  

Here we set up the parameters for our model.

In [25]:
# Number of convolution blocks in each layer
# W = 1
# Number of consecuetive convolution-pooling layers
# L = 1
#  Max lenght of each review sequence
UNIFORM_LENGTH = 600
# The dimension of the word embeddings
WORD_EMBEDDING_DIM = 150
# Number of filters in each convolution layer
CONV_FILTERS = 10
# Kernel size in each convolution block
KERNEL_SIZE = 1
# Vocabulary: number of most frequent words
VOCABULARY = 30000
# POOL_SIZE: Downsamples the input representation by taking the maximum value 
POOL_SIZE = 2
# Training and evaluation:
EPOCHS = 3
BATCH_SIZE = 64
VERBOSE = 1

### Loading and preparing data

Loading the pre-processed data and splitting them into train and test sets. Random state is fixed for reproducibility.

In [26]:
df = pd.read_csv('review_preprocessed.csv')
training_data = df.sample(frac=0.8, random_state=25)
testing_data = df.drop(training_data.index)

print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")

No. of training examples: 40000
No. of testing examples: 10000


In [27]:
### Converting the pandas dataframe to lists and numpy arrays
X_train = training_data.loc[:,'review'].to_list()
X_test = testing_data.loc[:,'review'].to_list()
y_train = training_data.loc[:,'polarity'].to_numpy()
y_test = testing_data.loc[:,'polarity'].to_numpy()

The Tokenize class helps us to vectorize a text corpus by tunring them into a sequence of integers. 

In [28]:
t  = Tokenizer(num_words = VOCABULARY)

t.fit_on_texts(X_train)
X_train_enc = t.texts_to_sequences(X_train)
X_test_enc = t.texts_to_sequences(X_test)

Here we force a uniform length for each review. Longer reviews are truncated and shorted reviews and padded with zeros.

In [29]:
X_train_pad = pad_sequences(X_train_enc, maxlen=UNIFORM_LENGTH)
X_test_pad = pad_sequences(X_test_enc, maxlen=UNIFORM_LENGTH)

X_train = X_train_pad
X_test = X_test_pad

## CNN model with keras tuner

In [38]:
def call_existing_code(VOCABULARY, WORD_EMBEDDING_DIM, UNIFORM_LENGTH, CONV_FILTERS, KERNEL_SIZE):
    inputs = Input(shape=(UNIFORM_LENGTH,))
    x = Embedding(VOCABULARY, WORD_EMBEDDING_DIM, input_length=UNIFORM_LENGTH)(inputs)
    x = Conv1D(filters=CONV_FILTERS, kernel_size=KERNEL_SIZE, activation='relu')(x)
    x = MaxPooling1D(pool_size=POOL_SIZE)(x)
    outputs = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=inputs, outputs=outputs)
    plot_model(model, to_file='model.png', show_shapes=True)
    model.compile(
        #optimizer=keras.optimizers.Adam(learning_rate=lr), ### Uncomment this to tune learning rate
        optimizer='adam',
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


def build_model(hp):
    VOCABULARY = 30000
    WORD_EMBEDDING_DIM = hp.Int("output_dim", min_value=150, max_value=300, step=50)
    UNIFORM_LENGTH = 600
    CONV_FILTERS = hp.Int("filters", min_value=10, max_value=50, step=10)
    KERNEL_SIZE = hp.Int("kernel_size", min_value=1, max_value=5, step=1)

    # call existing model-building code with the hyperparameter values.
    model = call_existing_code(
        VOCABULARY=VOCABULARY, 
        WORD_EMBEDDING_DIM=WORD_EMBEDDING_DIM, 
        UNIFORM_LENGTH=UNIFORM_LENGTH, 
        CONV_FILTERS=CONV_FILTERS, 
        KERNEL_SIZE=KERNEL_SIZE
        )

    return model

build_model(kt.HyperParameters())


('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')


<keras.engine.functional.Functional at 0x7f5034dac7c0>

In [39]:
tuner = kt.RandomSearch(
    build_model,
    objective='accuracy',
    max_trials=5,
    overwrite=True)
    
tuner.search_space_summary()

('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')
Search space summary
Default search space size: 3
output_dim (Int)
{'default': None, 'conditions': [], 'min_value': 150, 'max_value': 300, 'step': 50, 'sampling': None}
filters (Int)
{'default': None, 'conditions': [], 'min_value': 10, 'max_value': 50, 'step': 10, 'sampling': None}
kernel_size (Int)
{'default': None, 'conditions': [], 'min_value': 1, 'max_value': 5, 'step': 1, 'sampling': None}


In [33]:
tuner.search(X_train, y_train, epochs=2, batch_size=BATCH_SIZE, validation_data=(X_test, y_test))
best_model = tuner.get_best_models()[0]


Search: Running Trial #1

Hyperparameter    |Value             |Best Value So Far 
output_dim        |300               |?                 
filters           |20                |?                 
kernel_size       |3                 |?                 

Epoch 1/2


ValueError: in user code:

    File "/home/robert/.local/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "/home/robert/.local/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/robert/.local/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "/home/robert/.local/lib/python3.8/site-packages/keras/engine/training.py", line 809, in train_step
        loss = self.compiled_loss(
    File "/home/robert/.local/lib/python3.8/site-packages/keras/engine/compile_utils.py", line 201, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "/home/robert/.local/lib/python3.8/site-packages/keras/losses.py", line 141, in __call__
        losses = call_fn(y_true, y_pred)
    File "/home/robert/.local/lib/python3.8/site-packages/keras/losses.py", line 245, in call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "/home/robert/.local/lib/python3.8/site-packages/keras/losses.py", line 1807, in binary_crossentropy
        backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
    File "/home/robert/.local/lib/python3.8/site-packages/keras/backend.py", line 5158, in binary_crossentropy
        return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

    ValueError: `logits` and `labels` must have the same shape, received ((64, 299, 1) vs (64,)).


In [34]:
best_hps=tuner.get_best_hyperparameters()[0]
print("Optimal parameter for CONV_FILTERS: ", best_hps.get('filters'))
print("Optimal parameter for WORD_EMBEDDING_DIM: ", best_hps.get('output_dim'))
print("Optimal parameter for KERNEL_SIZE: ", best_hps.get('kernel_size'))

IndexError: list index out of range