# Sentiment analysis on the IMDB dataset

This notebook performs sentiment analysis on the Keras IMDB dataset. The dataset contains 50k movie reviews in English, each labelled as positive (thumbs up) or negative (thumbs down). We want to predict whether a review is positive or negative from its text.

![Picture title](http://flovv.github.io/figures/post25/imdb_classification.png)


## Import libraries and define symbolic constants

### Install wandb to keep track of model performance

In [1]:
!pip install --upgrade wandb
!wandb login WANDB_KEY

You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


### Import and initialize parameters

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow.keras import datasets
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import preprocessing

import wandb
from wandb.keras import WandbCallback

wandb.init(project="sentiment-analysis")

EPOCHS = 20  # number of passes over the training set; weights and biases are updated during each pass
BATCH_SIZE = 500  # number of training instances processed before each optimizer update
VERBOSE = 1  # print training progress
N_HIDDEN = 128  # neurons in the hidden layer
DROPOUT = 0.3  # fraction of units randomly dropped during training

ACTIVATION_FUNCTION_HIDDEN = 'relu'  # activation function for the hidden layer
ACTIVATION_FUNCTION_FINAL = 'sigmoid'  # activation function for the output layer
OPTIMIZER = 'adam'  # optimizer, this is how we search for the minimum of the loss function
LOSS_FUNCTION = 'binary_crossentropy'  # loss function, this is what is optimized

METRICS = ['accuracy']  # metric reported on training and validation data; comparing the two helps spot overfitting

max_len = 200  # every review is padded or truncated to this many tokens
n_words = 10000  # vocabulary size: only the most frequent words are kept
dim_embedding = 256  # length of the embedding vector for each word

# record the hyperparameters on the active wandb run
wandb.config.update({
  "epochs": EPOCHS,
  "batch_size": BATCH_SIZE,
  "n_hidden": N_HIDDEN,
  'activation_function_hidden': ACTIVATION_FUNCTION_HIDDEN,
  'activation_function_final': ACTIVATION_FUNCTION_FINAL,
  'optimizer': OPTIMIZER,
  'loss_function': LOSS_FUNCTION,
  'metric': METRICS,
})


2022-11-05 16:11:30.530909: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-05 16:11:30.652731: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-05 16:11:30.657834: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-11-05 16:11:30.657854: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if yo

## Load demo dataset from Keras


In [18]:
def load_data():
    #load dataset from keras
    (X_train, Y_train), (X_test, Y_test) = datasets.imdb.load_data(num_words=n_words)
    #pad data
    X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=max_len)
    X_test = preprocessing.sequence.pad_sequences(X_test, maxlen=max_len)
    return (X_train, Y_train), (X_test, Y_test)

(X_train, Y_train), (X_test, Y_test) = load_data()
print(X_test)

[[   0    0    0 ...   14    6  717]
 [1987    2   45 ...  125    4 3077]
 [4468  189    4 ...    9   57  975]
 ...
 [   0    0    0 ...   21  846 5518]
 [   0    0    0 ... 2302    7  470]
 [   0    0    0 ...   34 2005 2643]]
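
The leading zeros in the output above come from the padding step. As a rough illustration, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings (zeros are added at the front, and sequences longer than `maxlen` lose their earliest tokens). The helper `pad_pre` and the toy sequences are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper mimicking the pad_sequences defaults
    (padding='pre', truncating='pre')."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy, the row stays all padding
        trunc = seq[-maxlen:]         # keep only the last maxlen tokens
        out[i, -len(trunc):] = trunc  # right-align, padding stays on the left
    return out

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]  # made-up word indices
padded = pad_pre(toy, maxlen=4)
print(padded)
```

The short sequence is left-padded with zeros and the long one keeps only its final four tokens, which is why short reviews in `X_test` start with runs of `0`.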


## Build the model

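- We use an embedding layer as input; it maps each word index to a dense vector of `dim_embedding` features.

- We then use a global max pooling layer, which takes the maximum value of each embedding feature across the `max_len` positions of the review.

- We then have two dense layers: a hidden layer with `N_HIDDEN` neurons and the output layer. Dropout is applied after the embedding and after the hidden layer.

- The last layer is a single neuron with a sigmoid activation function, whose output we interpret as the probability that the review is favorable.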
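
To make the data flow through these layers concrete, the following is a minimal NumPy sketch of the forward pass at inference time (dropout is inactive then, so it is omitted). The toy sizes and the random weights are assumptions for illustration only; they stand in for the trained parameters of the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions; the notebook uses 10000, 256, 200 and 128)
n_words, dim_embedding, max_len, n_hidden = 50, 8, 6, 4

# Random parameters standing in for trained weights
E = rng.normal(size=(n_words, dim_embedding))    # embedding table
W1 = rng.normal(size=(dim_embedding, n_hidden))  # hidden dense layer
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))              # output layer
b2 = np.zeros(1)

def forward(x):
    """x: integer array of shape (batch, max_len) holding word indices."""
    emb = E[x]                                # (batch, max_len, dim_embedding)
    pooled = emb.max(axis=1)                  # global max pooling over positions
    hidden = np.maximum(0, pooled @ W1 + b1)  # dense layer with ReLU
    logits = hidden @ W2 + b2                 # single output neuron
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid

x = rng.integers(0, n_words, size=(3, max_len))
probs = forward(x)
print(probs.shape)
```

Each review yields one number between 0 and 1, regardless of how many positions the pooling layer collapsed.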

In [4]:
def build_model():
    model = models.Sequential()

    model.add(
        layers.Embedding(
        n_words,
        dim_embedding,
        input_length=max_len
        )
    )
    model.add(
        layers.Dropout(DROPOUT)
    )
    model.add(
        layers.GlobalMaxPooling1D()
    )
    model.add(
        layers.Dense(
            N_HIDDEN,
            activation = ACTIVATION_FUNCTION_HIDDEN
        ) 
    )
    model.add(
        layers.Dropout(DROPOUT+0.2)
    )    
    model.add(
        layers.Dense(
            1,
            activation = ACTIVATION_FUNCTION_FINAL
        ) 
    )

    return model
    
model = build_model()
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 256)          2560000   
                                                                 
 dropout (Dropout)           (None, 200, 256)          0         
                                                                 
 global_max_pooling1d (Globa  (None, 256)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 128)               32896     
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
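
The parameter counts in the summary can be checked by hand. The short sketch below recomputes them from the layer sizes; the variable names are chosen for this example only.

```python
n_words, dim_embedding, n_hidden = 10000, 256, 128

embedding_params = n_words * dim_embedding          # one vector per word
dense_params = dim_embedding * n_hidden + n_hidden  # weights plus biases
output_params = n_hidden * 1 + 1                    # weights plus one bias

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
```

The dropout and pooling layers have no trainable parameters, so the embedding table accounts for almost all of the model's weights.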
                                                        

## Compile the model

- We use Adam as the optimizer.

- The loss function is binary cross-entropy, which is well suited to two-class problems where the model outputs a single probability and the labels are 0 or 1.

- We use accuracy to evaluate the performance of the model
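
As an illustration of the loss configured in `LOSS_FUNCTION`, here is a minimal NumPy sketch of mean binary cross-entropy. The function name, the clipping constant `eps`, and the example probabilities are assumptions made for this sketch; Keras computes the loss internally.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; clipping avoids taking log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_sample))

y_true = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])  # made-up predictions
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])  # made-up predictions

loss_right = binary_crossentropy(y_true, mostly_right)
loss_wrong = binary_crossentropy(y_true, mostly_wrong)
print(loss_right, loss_wrong)
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalized heavily, which is what the optimizer exploits during training.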

In [5]:
model.compile(
    optimizer=OPTIMIZER,
    loss=LOSS_FUNCTION,
    metrics=METRICS
)

## Train the model

We are now ready to train the model. We need to define the number of epochs and the batch size. 

- Epochs are the number of times the model is exposed to the full training dataset. During each epoch, the optimizer (Adam) updates the weights after every batch to minimize the loss function.

- `BATCH_SIZE` is the number of instances that the optimizer processes before each update of the weights and biases. There are many batches per epoch.

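- Validation data is used to compute the loss and the metric after each epoch, which helps to tune hyperparameters and detect overfitting. Note that the cell below passes the test set as `validation_data`; holding out part of the training data instead (for example an 80% training and 20% validation split with `validation_split=0.2`) would keep the test set truly unseen.

- Early stopping on the validation loss, with a patience of N epochs, would stop the optimization if the loss does not go down for N consecutive epochs. It is not enabled in the cell below, so all 20 epochs run.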
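
The batch and early-stopping ideas above can be sketched in plain Python. The two helper functions and the list of validation losses are hypothetical and only illustrate the logic; the early-stopping rule is a simplified version of what a callback would do. The figure of 25,000 training reviews is the size of the Keras IMDB training set.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of optimizer updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)

def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop,
    or None if it runs to the end.

    Simplified rule: stop once the validation loss has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

steps_full = batches_per_epoch(25000, 500)   # all training reviews
steps_split = batches_per_epoch(20000, 500)  # after holding out 20%
made_up_losses = [0.60, 0.45, 0.40, 0.41, 0.43, 0.42, 0.44]
stop_epoch = early_stopping_epoch(made_up_losses, patience=3)
print(steps_full, steps_split, stop_epoch)
```

In Keras, a rule of this kind is available as the `tf.keras.callbacks.EarlyStopping` callback, which takes `monitor` and `patience` arguments and can be added to the `callbacks` list passed to `model.fit`.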

In [6]:
score = model.fit(
    X_train,
    Y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    verbose=VERBOSE,
    validation_data=(X_test, Y_test),
    callbacks=[WandbCallback()]
    )

Epoch 1/20
wandb: Adding directory to artifact (/work/wandb/run-20221105_161135-3r1q2yz1/files/model-best)... Done. 0.1s
Epoch 2/20
wandb: Adding directory to artifact (/work/wandb/run-20221105_161135-3r1q2yz1/files/model-best)... Done. 0.1s
Epoch 3/20
wandb: Adding directory to artifact (/work/wandb/run-20221105_161135-3r1q2yz1/files/model-best)... Done. 0.1s
Epoch 4/20
wandb: Adding directory to artifact (/work/wandb/run-20221105_161135-3r1q2yz1/files/model-best)... Done. 0.2s
Epoch 5/20
Epoch 6/20
wandb: Adding directory to artifact (/work/wandb/run-20221105_161135-3r1q2yz1/files/model-best)... Done. 0.1s
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## Test the model on unseen data

In [7]:
test_loss, test_accuracy = model.evaluate(X_test, Y_test, batch_size=BATCH_SIZE)
#track test results on wandb
wandb.log({
    "test_loss": test_loss, 
    "test_accuracy": test_accuracy
})
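
For reference, the accuracy reported by `model.evaluate` for a sigmoid output corresponds to thresholding the predicted probabilities at 0.5 and counting matches with the labels. The sketch below shows that computation in NumPy; the function name and the example values are assumptions for illustration.

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (y_prob > threshold).astype(int)
    return float(np.mean(y_pred == y_true))

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.92, 0.30, 0.41, 0.77, 0.08])  # made-up model outputs
acc = binary_accuracy(y_true, y_prob)
print(acc)
```

Here four of the five made-up predictions fall on the correct side of the threshold; the third review is positive but receives a probability below 0.5.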


