In [1]:
#Import packages
from keras.layers import LSTM, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.datasets import imdb
import warnings
warnings.filterwarnings("ignore")

---
**Observations:** Keras comes with some built-in datasets. We’re going to use the IMDB movie review sentiment classification dataset. This dataset contains 50,000 movie reviews from the IMDB website, split into 25,000 for training and 25,000 for testing, each labeled with its sentiment (positive or negative). The model will examine the text in the training data and learn which characteristics define positive or negative sentiments.

The IMDB data provided in Keras have already been preprocessed, which comes in handy: each review is already encoded as a sequence of integer word indices. All we'll have to do before we give the reviews to our RNN is make the sequences the same length.

---

In [2]:
# Import data from imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = 20000)

---
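**Observations:** Rather than the usual Pandas dataframe, the load_data() function returns two tuples of Numpy arrays, (X_train, y_train) and (X_test, y_test). Each element of X_train and X_test is a list of integers representing one review. Setting the num_words argument limits the vocabulary to the most frequent words (indices below 20000 here) and replaces rarer words with an out-of-vocabulary marker, which keeps the model smaller and saves time.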

---

In [3]:
X_train

[[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173,
  36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5,
  150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38,
  13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469,
  4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17,
  515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5,
  4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25,
  124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5,
  14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766,
  5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118

---
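**Observations:** We can see that the X_train set contains numerous lists of numbers, one list per review. Each number represents a word ranked by its frequency, so lower numbers mean more common words. With the load_data() defaults, three indices are reserved: 0 is used for padding, 1 marks the start of a review (which is why the list above begins with 1) and 2 stands for any unknown (out-of-vocabulary) word. Real word indices are therefore offset by 3, so 4 is the most common word in the dataset. The labels stored in y_train are just 1s and 0s denoting whether the sentiment of the text was positive (1) or negative (0).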
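
To make the encoding concrete, here is a minimal sketch of how a review could be decoded back into words. The vocabulary below is a hypothetical toy stand-in (the real word-to-rank mapping comes from imdb.get_word_index()), and the offset of 3 and the special tokens assume the load_data() defaults.

```python
# Hypothetical toy vocabulary: word -> frequency rank (1 = most frequent).
# In the notebook the real mapping would come from imdb.get_word_index().
toy_word_rank = {"the": 1, "and": 2, "film": 3, "great": 4}

INDEX_FROM = 3  # assumed load_data() default offset added to every rank
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<unk>"}

# Invert the mapping and shift each rank by the offset.
index_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_rank.items()}
index_to_word.update(SPECIAL_TOKENS)


def decode_review(encoded):
    """Turn a list of integer indices back into a readable string."""
    return " ".join(index_to_word.get(i, "<unk>") for i in encoded)


encoded_review = [1, 4, 6, 2, 7]
print(decode_review(encoded_review))  # <start> the film <unk> great
```

The same dictionary inversion should work on the full vocabulary, at the cost of downloading the word index file.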

---

In [4]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

---
**Observations:** As recurrent neural networks can take a long time to train, and this dataset is fairly large, we can use the Keras preprocessing sequence package’s pad_sequences() function to modify the data and speed things up. The pad_sequences() function makes all of the sequences the same length, which the network needs in order to process them in batches. Shorter sequences are padded with zeros (at the beginning by default) and the maxlen argument truncates any sequences that are over a particular length (by default dropping tokens from the beginning, so the end of the review is kept). We’ll limit our sequences to 100 tokens (words, not characters) to see if this improves the speed.
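
As an illustration only, the default behaviour of pad_sequences() can be mimicked in a few lines of plain Python. The helper name pad_or_truncate is hypothetical, and the sketch assumes the default "pre" padding and "pre" truncating with a pad value of 0.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Mimic the default 'pre' padding and 'pre' truncating of pad_sequences()."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # keep only the last maxlen tokens
    return [pad_value] * (maxlen - len(seq)) + seq  # left-pad with zeros


short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 1, 14, 22]
print(pad_or_truncate(long_review, maxlen=5))   # [22, 16, 43, 530, 973]
```

Note that truncating from the front discards the start-of-review marker along with the opening words of long reviews.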

---

In [5]:
X_train = sequence.pad_sequences(X_train, maxlen = 100)
X_test = sequence.pad_sequences(X_test, maxlen = 100)

---
**Observations:** We can move on to the creation of the neural network. The specific neural network we’re going to use to analyse review sentiment is a recurrent neural network called LSTM or Long Short-Term Memory. This model is really useful and can be used for a variety of things, including time series analysis, anomaly detection and speech and handwriting recognition.

First, we’ll create a Sequential model and set up the embedding layer. This maps each integer word index in the padded sequences to a “dense vector” of a fixed size, and these vectors are learned during training, so words used in similar ways end up with similar representations. The embedding layer has a vocabulary size of 20000 words (because that’s the num_words argument we passed when we loaded the data), while the 128 value denotes a 128-dimensional output vector for each word.

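We then add the LSTM layer with 128 units, set the dropout rates for its inputs and its recurrent connections, and finally use a Dense layer with a single unit and the sigmoid function to output a probability between 0 and 1 that the review is positive.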
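
Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The sketch below uses toy sizes and random values rather than the notebook's 20000 x 128 trained weights; the names embedding_table and embed are hypothetical.

```python
import random

random.seed(0)

VOCAB_SIZE = 10    # toy value; the notebook uses 20000
EMBEDDING_DIM = 4  # toy value; the notebook uses 128

# One row of small random numbers per vocabulary index.
# In Keras these rows are weights that get updated during training.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(sequence_of_indices):
    """Replace each integer index with its row of the embedding table."""
    return [embedding_table[i] for i in sequence_of_indices]


padded_review = [0, 0, 1, 4, 7]
vectors = embed(padded_review)
print(len(vectors), len(vectors[0]))  # 5 4
```

A sequence of 5 indices becomes 5 vectors of length 4, and the LSTM then reads those vectors one time step at a time.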

---

In [6]:
model = Sequential()
model.add(Embedding(20000, 128))
model.add(LSTM(128, dropout = 0.2, recurrent_dropout = 0.2))
model.add(Dense(1, activation = "sigmoid"))

---
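**Observations:** The next step is to use compile() to configure how the model is trained: the loss function to minimise (binary cross-entropy, which suits a two-class problem with a sigmoid output), the optimizer that updates the weights (Nadam) and the metrics to report (accuracy).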
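
For intuition about the binary_crossentropy loss used in the next cell, here is a small hand-rolled version. It is a sketch of the standard formula, not Keras's implementation, and the clipping constant eps is an assumption chosen for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]
confident_and_right = [0.9, 0.1, 0.9]
confident_but_wrong = [0.1, 0.9, 0.1]

print(binary_crossentropy(labels, confident_and_right))  # about 0.105
print(binary_crossentropy(labels, confident_but_wrong))  # about 2.303
```

The loss punishes confident mistakes heavily, which is why loss can rise even while accuracy stays roughly flat.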

---

In [7]:
model.compile(loss = "binary_crossentropy", optimizer = "nadam", metrics = ["accuracy"])

---
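**Observations:** We can now fit our model to the training data. The batch_size argument tells the model how many samples to “propagate” through the neural network before each weight update, and the epochs argument tells Keras how many complete passes to make over the training data. We also pass the test set as validation_data, so the loss and accuracy on unseen reviews are reported at the end of every epoch.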
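
As a quick check of how batch_size and epochs interact, the number of weight updates can be worked out by hand. The figure of 25,000 training reviews is taken from the dataset's standard split.

```python
import math

n_train = 25000  # training reviews in the IMDB split
batch_size = 32
epochs = 10

# The final batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 782 7820
```

The 782 steps per epoch match the "782/782" counter shown in the training log.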

---

In [8]:
model.fit(X_train, y_train, batch_size = 32, epochs = 10, verbose = 2, validation_data = (X_test, y_test))

Epoch 1/10
782/782 - 66s - 84ms/step - accuracy: 0.7827 - loss: 0.4567 - val_accuracy: 0.8361 - val_loss: 0.3829
Epoch 2/10
782/782 - 69s - 88ms/step - accuracy: 0.8817 - loss: 0.2912 - val_accuracy: 0.8368 - val_loss: 0.3716
Epoch 3/10
782/782 - 67s - 86ms/step - accuracy: 0.9174 - loss: 0.2118 - val_accuracy: 0.8450 - val_loss: 0.3744
Epoch 4/10
782/782 - 66s - 85ms/step - accuracy: 0.9438 - loss: 0.1535 - val_accuracy: 0.8381 - val_loss: 0.4159
Epoch 5/10
782/782 - 64s - 82ms/step - accuracy: 0.9624 - loss: 0.1092 - val_accuracy: 0.8361 - val_loss: 0.5053
Epoch 6/10
782/782 - 65s - 83ms/step - accuracy: 0.9728 - loss: 0.0777 - val_accuracy: 0.8236 - val_loss: 0.5636
Epoch 7/10
782/782 - 65s - 83ms/step - accuracy: 0.9802 - loss: 0.0564 - val_accuracy: 0.8281 - val_loss: 0.7014
Epoch 8/10
782/782 - 67s - 86ms/step - accuracy: 0.9842 - loss: 0.0493 - val_accuracy: 0.8327 - val_loss: 0.7047
Epoch 9/10
782/782 - 67s - 85ms/step - accuracy: 0.9915 - loss: 0.0263 - val_accuracy: 0.8268 - 

<keras.src.callbacks.history.History at 0x167370f50>

In [9]:
score, accuracy = model.evaluate(X_test, y_test, batch_size = 32, verbose = 2)

782/782 - 13s - 17ms/step - accuracy: 0.8282 - loss: 0.8537


---
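**Observations:** We get an overall test accuracy of 0.8282 and a loss of 0.8537. The accuracy is reasonable for a first attempt, but the training log shows clear overfitting: training accuracy climbs to about 0.99 while validation accuracy peaks at 0.8450 in epoch 3, and validation loss rises steadily from its low of 0.3716 in epoch 2. Training for fewer epochs or stopping early would likely give a better model, and we could also try tweaking the compile() settings to see if we could generate any further improvements.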
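
The training log from cell 8 can be used to sketch how early stopping would have behaved. The helper below is a hypothetical plain-Python illustration of a patience rule, using the eight validation losses that are fully visible in the log; in practice Keras provides a keras.callbacks.EarlyStopping callback for this purpose.

```python
# Validation losses for epochs 1-8, copied from the training log above.
val_loss_by_epoch = [0.3829, 0.3716, 0.3744, 0.4159, 0.5053, 0.5636, 0.7014, 0.7047]


def early_stopping_epoch(losses, patience):
    """Return (best_epoch, stopped_epoch) for a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch  # stop: no improvement for `patience` epochs
    return best_epoch, len(losses)


best_epoch, stopped_epoch = early_stopping_epoch(val_loss_by_epoch, patience=2)
print(best_epoch, stopped_epoch)  # 2 4
```

Under these assumptions, training would have stopped after epoch 4 and the weights from epoch 2 would be the ones worth keeping.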

---

In [10]:
predictions = model.predict(X_test[:5])
predictions

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 78ms/step


array([[0.99741995],
       [0.99999964],
       [0.9140004 ],
       [0.10346764],
       [0.9999967 ]], dtype=float32)

---
**Observations:** To examine the predictions we can use model.predict(). Here, we get predictions for the first five rows in the X_test data. The predictions are returned as probabilities that the review is positive, so anything under 0.5 is treated as negative in sentiment and anything above 0.5 as positive. To check how good they are, we can compare them against the true labels in y_test[:5]. Although the dataset is preprocessed, the review text can still be recovered by mapping the integers back to words with imdb.get_word_index().
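
Turning the probabilities into class labels is a simple thresholding step. The sketch below copies the five values from the output above into a plain list; the 0.5 cut-off is the conventional choice for a sigmoid output rather than something the model enforces.

```python
# Probabilities copied from the model.predict() output above.
probabilities = [0.99741995, 0.99999964, 0.9140004, 0.10346764, 0.9999967]

THRESHOLD = 0.5  # conventional cut-off for a sigmoid output

predicted_labels = [1 if p > THRESHOLD else 0 for p in probabilities]
print(predicted_labels)  # [1, 1, 1, 0, 1]
```

Four of the five reviews are predicted positive and one negative; comparing this list with y_test[:5] would show which of them are correct.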

---