# Catching the sentiment

Let's see how well deep learning handles text classification, using sentiment analysis of movie reviews as the example.


Load the IMDB sentiment dataset:

In [7]:
from keras.datasets import imdb
from keras.preprocessing import sequence


(X_train, y_train), (X_test, y_test) = imdb.load_data()


Let's examine a document:


In [8]:
print(X_train[10])

[1, 785, 189, 438, 47, 110, 142, 7, 6, 7475, 120, 4, 236, 378, 7, 153, 19, 87, 108, 141, 17, 1004, 5, 30432, 883, 10789, 23, 8, 4, 136, 13772, 11631, 4, 7475, 43, 1076, 21, 1407, 419, 5, 5202, 120, 91, 682, 189, 2818, 5, 9, 1348, 31, 7, 4, 118, 785, 189, 108, 126, 93, 13772, 16, 540, 324, 23, 6, 364, 352, 21, 14, 9, 93, 56, 18, 11, 230, 53, 771, 74, 31, 34, 4, 2834, 7, 4, 22, 5, 14, 11, 471, 9, 17547, 34, 4, 321, 487, 5, 116, 15, 6584, 4, 22, 9, 6, 2286, 4, 114, 2679, 23, 107, 293, 1008, 1172, 5, 328, 1236, 4, 1375, 109, 9, 6, 132, 773, 14799, 1412, 8, 1172, 18, 7865, 29, 9, 276, 11, 6, 2768, 19, 289, 409, 4, 5341, 2140, 20250, 648, 1430, 10136, 8914, 5, 27, 3000, 1432, 7130, 103, 6, 346, 137, 11, 4, 2768, 295, 36, 7740, 725, 6, 3208, 273, 11, 4, 1513, 15, 1367, 35, 154, 14040, 103, 19100, 173, 7, 12, 36, 515, 3547, 94, 2547, 1722, 5, 3547, 36, 203, 30, 502, 8, 361, 12, 8, 989, 143, 4, 1172, 3404, 10, 10, 328, 1236, 9, 6, 55, 221, 2989, 5, 146, 165, 179, 770, 15, 50, 713, 53, 108, 448,

Not quite what we expected... Keras has already replaced each word with its index.
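To read a review again, the indices have to be mapped back to words. The sketch below assumes the default conventions of `imdb.load_data` (0 for padding, 1 for the start marker, 2 for out-of-vocabulary words, and an offset of 3 for every real word). The `decode_review` helper and the tiny `toy_word_index` are hypothetical stand-ins; with the real dataset the dictionary would come from `imdb.get_word_index()`.

```python
# Conventions assumed from the defaults of keras.datasets.imdb.load_data:
# 0 = padding, 1 = start of sequence, 2 = out-of-vocabulary,
# and every real word index is shifted by index_from = 3.
INDEX_FROM = 3
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}


def decode_review(encoded, word_index):
    """Map a list of integer indices back to a readable string."""
    # Invert the word -> rank mapping and apply the offset
    index_to_word = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    index_to_word.update(SPECIAL_TOKENS)
    return " ".join(index_to_word.get(i, "<unk>") for i in encoded)


# Hypothetical miniature vocabulary (word -> frequency rank), standing in
# for the dictionary returned by imdb.get_word_index()
toy_word_index = {"the": 1, "and": 2, "movie": 3, "great": 4}

decoded = decode_review([1, 4, 6, 2, 7], toy_word_index)
print(decoded)
```

With the real word index, calling `decode_review(X_train[10], imdb.get_word_index())` should give back an approximation of the original review text.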

Since the documents have different lengths and Keras trains on fixed-shape batches, we have to pad the shorter documents (and truncate the longer ones) to a common length:

In [9]:
# num_words -> consider only the top 10000 most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

X_train = sequence.pad_sequences(X_train, maxlen=500)
X_test = sequence.pad_sequences(X_test, maxlen=500)

In [10]:
print(X_train[10])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    1  785  189  438   47  110
  142    7    6 7475  120    4  236  378    7  153   19   87  108  141
   17 1004    5    2  883    2   23    8    4  136    2    2    4 7475
   43 1076   21 1407  419    5 5202  120   91  682  189 2818    5    9
 1348   31    7    4  118  785  189  108  126   93    2   16  540  324
   23    6  364  352   21   14    9   93   56   18   11  230   53  771
   74   31   34    4 2834    7    4   22    5   14   11  471    9    2
   34    4  321  487    5  116   15 6584    4   22    9    6 2286    4
  114 2679   23  107  293 1008 1172    5  328 1236    4 1375  109    9
    6  132  773    2 1412    8 1172   18 7865   29    9  276   11    6
 2768   19  289  409    4 5341 2140    2  648 1430    2 8914    5   27
 3000 
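The output above shows two effects: zeros are added in front of the document, and indices of 10000 or more have become 2. A minimal NumPy sketch of both steps follows, assuming the defaults of `pad_sequences` (`padding='pre'`, `truncating='pre'`) and the default out-of-vocabulary index 2 of `load_data`. The helper names `pad_or_truncate` and `cap_vocabulary` are made up for illustration and are not Keras functions.

```python
import numpy as np


def pad_or_truncate(seq, maxlen, pad_value=0):
    """Left-pad short sequences; keep only the last maxlen tokens of long ones."""
    seq = list(seq)[-maxlen:]                 # "pre" truncation: drop the earliest tokens
    out = np.full(maxlen, pad_value, dtype="int32")
    if seq:
        out[maxlen - len(seq):] = seq         # "pre" padding: zeros go in front
    return out


def cap_vocabulary(seq, num_words, oov_index=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in seq]


short = pad_or_truncate([1, 785, 189], maxlen=5)
long = pad_or_truncate([1, 785, 189, 438, 47, 110, 142], maxlen=5)
# A slice of the document printed earlier: 30432 and 10789 are too rare
capped = cap_vocabulary([1004, 5, 30432, 883, 10789], num_words=10000)

print(short, long, capped)
```

Padding in front keeps the actual words at the end of the sequence, which matters more for recurrent models than for the MLP used here.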

So, we are ready to extract the sentiment from the documents! We will use a simple MLP on top of word embeddings for the classification:


In [11]:
from keras.models import Sequential
from keras.layers import Input, Dense, Flatten, Embedding, AveragePooling1D
from keras.optimizers import Adam


model = Sequential()
# Each document is a sequence of 500 word indices
model.add(Input(shape=(500,)))
# Number of unique words, embedding dimension
model.add(Embedding(10000, 32))
# Just flatten the embedded sequence (does not take the padding into account!)
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
# summary() prints the table itself and returns None, so no print() is needed
model.summary()



In [12]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=1)

Epoch 1/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.6745 - loss: 0.5647 - val_accuracy: 0.8618 - val_loss: 0.3149
Epoch 2/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9480 - loss: 0.1493 - val_accuracy: 0.8465 - val_loss: 0.3790
Epoch 3/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9945 - loss: 0.0288 - val_accuracy: 0.8502 - val_loss: 0.4739
Epoch 4/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9996 - loss: 0.0047 - val_accuracy: 0.8552 - val_loss: 0.5201
Epoch 5/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 1.0000 - loss: 0.0011 - val_accuracy: 0.8545 - val_loss: 0.5619


<keras.src.callbacks.history.History at 0x72d642f566b0>

Usually, using just the mean embedding vector works equally well!
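To make the difference between the two approaches concrete, here is a small NumPy sketch of what happens to one batch. It assumes an embedding layer is essentially a table lookup. The random `embedding_matrix` and `batch` are placeholders for the learned weights and the real padded documents.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, doc_len = 10000, 32, 500
# Stand-in for the trainable embedding matrix (random here, learned in Keras)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# A hypothetical batch of two padded documents
batch = rng.integers(0, vocab_size, size=(2, doc_len))

embedded = embedding_matrix[batch]              # table lookup -> (2, 500, 32)
flattened = embedded.reshape(len(batch), -1)    # what Flatten produces -> (2, 16000)
mean_embedding = embedded.mean(axis=1)          # average over positions -> (2, 32)

print(embedded.shape, flattened.shape, mean_embedding.shape)
```

The flattened vector has one slot per (position, dimension) pair, so the same word at two different positions looks unrelated to the dense layer. The mean vector discards positions entirely.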

In [14]:
from keras.layers import GlobalAveragePooling1D, Input

model = Sequential()
model.add(Input(shape=(500,)))
model.add(Embedding(10000, 32))

# Calculate the mean embedding
model.add(GlobalAveragePooling1D())
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
# summary() prints the table itself and returns None, so no print() is needed
model.summary()


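The number of parameters is greatly reduced, since the first dense layer now receives a 32-dimensional mean vector instead of the 16,000-dimensional flattened sequence. Let's examine the performance of the model.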
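The reduction can be checked by hand. The arithmetic below counts weights and biases for both architectures, using only the layer sizes from the code cells above. The helper `dense_params` is just a convenience defined here.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit of a fully connected layer."""
    return n_inputs * n_units + n_units


vocab_size, embed_dim, doc_len, hidden = 10000, 32, 500, 500

embedding_params = vocab_size * embed_dim  # shared by both models

# Flatten: the first dense layer sees doc_len * embed_dim = 16000 inputs
flatten_total = (embedding_params
                 + dense_params(doc_len * embed_dim, hidden)
                 + dense_params(hidden, 1))

# Mean embedding: the first dense layer sees only embed_dim = 32 inputs
mean_total = (embedding_params
              + dense_params(embed_dim, hidden)
              + dense_params(hidden, 1))

print(flatten_total, mean_total)
```

Almost all of the flattened model's parameters sit in the first dense layer, while the mean-embedding model is dominated by the embedding table itself.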

In [15]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=1)


Epoch 1/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 6s 15ms/step - accuracy: 0.5425 - loss: 0.6840 - val_accuracy: 0.7586 - val_loss: 0.5060
Epoch 2/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7935 - loss: 0.4473 - val_accuracy: 0.7278 - val_loss: 0.5323
Epoch 3/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8542 - loss: 0.3380 - val_accuracy: 0.8260 - val_loss: 0.3747
Epoch 4/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8790 - loss: 0.2882 - val_accuracy: 0.7948 - val_loss: 0.4253
Epoch 5/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8718 - loss: 0.2862 - val_accuracy: 0.8837 - val_loss: 0.2897


<keras.src.callbacks.history.History at 0x72d64785f7c0>

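It actually works better: after five epochs the validation accuracy is 0.8837, versus 0.8545 for the flattened model. This is expected, since flattening keeps position-specific information that the MLP cannot exploit and overfits instead (the flattened model reached 1.0000 training accuracy while its validation loss rose from 0.3149 to 0.5619). Next, let's try to ignore the padded positions (masking):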
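What masking is supposed to achieve can be sketched in plain NumPy: the mean is taken over the real tokens only, so the zero padding no longer dilutes it. The function `masked_mean` and the toy arrays are illustrative assumptions, not part of Keras. In Keras itself, one commonly used route is `Embedding(..., mask_zero=True)` followed by `GlobalAveragePooling1D`, which accepts a mask. Whether the `Masking` plus `AveragePooling1D` combination in the next cell really excludes the padded positions is not verified here.

```python
import numpy as np


def masked_mean(embedded, token_ids, pad_id=0):
    """Average embeddings over non-padding positions only."""
    mask = (token_ids != pad_id).astype(embedded.dtype)       # (batch, time)
    summed = (embedded * mask[..., None]).sum(axis=1)         # (batch, dim)
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1)   # avoid division by zero
    return summed / counts


# One toy document: two padding positions followed by two real tokens
token_ids = np.array([[0, 0, 3, 5]])
# Toy embeddings; the padding index also has a (non-zero) embedding vector
embedded = np.array([[[9.0, 9.0], [9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]])

plain = embedded.mean(axis=1)                # padding vectors are averaged in
masked = masked_mean(embedded, token_ids)    # padding vectors are ignored

print(plain, masked)
```

In this toy case the plain mean is dominated by the padding vector, while the masked mean reflects only the two real tokens.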

In [16]:
from keras.layers import Input, Masking

model = Sequential()
model.add(Input(shape=(500,)))
model.add(Masking(mask_value=0))
model.add(Embedding(10000, 32))

# Calculate the mean embedding (one pooling window covering all 500 positions)
model.add(AveragePooling1D(pool_size=500))

# Drop the remaining length-1 time axis
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
# summary() prints the table itself and returns None, so no print() is needed
model.summary()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=1)


Epoch 1/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 5s 13ms/step - accuracy: 0.5441 - loss: 0.6827 - val_accuracy: 0.8016 - val_loss: 0.4988
Epoch 2/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8020 - loss: 0.4419 - val_accuracy: 0.8504 - val_loss: 0.3567
Epoch 3/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8612 - loss: 0.3295 - val_accuracy: 0.8463 - val_loss: 0.3427
Epoch 4/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8756 - loss: 0.2938 - val_accuracy: 0.8405 - val_loss: 0.3460
Epoch 5/5
196/196 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8951 - loss: 0.2569 - val_accuracy: 0.8838 - val_loss: 0.2884


<keras.src.callbacks.history.History at 0x72d649491630>

With this setup, masking does not seem to significantly impact the performance of the model (0.8838 vs. 0.8837 validation accuracy after five epochs). We can also use a CNN for text classification!
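Before building the model, it may help to see what a 1D convolution followed by global max pooling computes on a single embedded document. The NumPy sketch below assumes the `Conv1D` defaults of stride 1 and `'valid'` padding and no activation. `conv1d_valid` is a hypothetical helper written for clarity rather than speed, and the random arrays stand in for learned weights.

```python
import numpy as np


def conv1d_valid(x, kernels, bias):
    """Slide each filter over the sequence.

    x: (time, channels), kernels: (kernel_size, channels, filters), bias: (filters,)
    """
    kernel_size = kernels.shape[0]
    steps = x.shape[0] - kernel_size + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = x[t:t + kernel_size]   # kernel_size consecutive word vectors
        # Dot product of the window with every filter at once
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return out


rng = np.random.default_rng(0)
doc = rng.normal(size=(500, 32))        # one embedded document
kernels = rng.normal(size=(3, 32, 32))  # kernel_size=3, 32 input channels, 32 filters
bias = np.zeros(32)

feature_maps = conv1d_valid(doc, kernels, bias)  # (498, 32)
pooled = feature_maps.max(axis=0)                # global max pooling -> (32,)

print(feature_maps.shape, pooled.shape)
```

Each filter looks at three consecutive words, so it can respond to short phrases, and the max pooling keeps only the strongest response of each filter regardless of where in the review it occurred.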

In [17]:
from keras.layers import Input, Masking, Conv1D, GlobalMaxPool1D, Dropout

model = Sequential()
model.add(Input(shape=(500,)))
model.add(Masking(mask_value=0))
model.add(Embedding(10000, 32))
model.add(Dropout(0.3))
# 32 filters, each spanning 3 consecutive words
model.add(Conv1D(filters=32, kernel_size=3))
# Keep the strongest response of each filter over the whole document
model.add(GlobalMaxPool1D())

model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
model.summary()

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=1)




Epoch 1/10
196/196 ━━━━━━━━━━━━━━━━━━━━ 4s 12ms/step - accuracy: 0.5963 - loss: 0.6415 - val_accuracy: 0.8311 - val_loss: 0.3780
Epoch 2/10
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8590 - loss: 0.3293 - val_accuracy: 0.8788 - val_loss: 0.2881
Epoch 3/10
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9079 - loss: 0.2326 - val_accuracy: 0.8843 - val_loss: 0.2755
Epoch 4/10
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9341 - loss: 0.1760 - val_accuracy: 0.8822 - val_loss: 0.2905
Epoch 5/10
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9505 - loss: 0.1334 - val_accuracy: 0.8813 - val_loss: 0.3097
Epoch 6/10
196/196 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9604 - loss: 0.1053 - val_accuracy: 0.8809 - val_loss: 0.3328
Epoch 7/10
196/196

<keras.src.callbacks.history.History at 0x72d647c68e80>