# Natural Language Processing (NLP) - IMDB

Build natural language processing systems using TensorFlow.
Prepare text to use in TensorFlow models.
Use word embeddings in your TensorFlow model.
Use RNN, LSTM, GRU, and CNN layers.

Build and train models for binary classification.

In [1]:
# Import Tensorflow
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(f'The current version of TensorFlow is {tf.__version__}.')
print(f'The current version of TFDS is {tfds.__version__}.')

The current version of TensorFlow is 2.12.0.
The current version of TFDS is 4.9.2.


# Import Data

Import the data from TF Datasets.

In [None]:
# Import the IMDB dataset from tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

In [3]:
# View the info
print(info)

tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_path='C:\\Users\\JNSea\\tensorflow_datasets\\imdb_reviews\\plain_text\\1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': Text(shape=(), dtype=string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=25000, num

In [4]:
# View the dataset
print(imdb)

{Split('train'): <_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>, Split('test'): <_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>, Split('unsupervised'): <_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>}


In [5]:
# View some examples
for ex in imdb['train'].take(3):
    print(ex)

(<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on

In [6]:
# View one review
obs = None
for ex in imdb['train'].take(1):
    obs = ex

print(obs[0])
print('\n', type(obs[0]))

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)

 <class 'tensorflow.python.framework.ops.EagerTensor'>


# Process Data

## Perform the Train-Test Split

Divide the data into training and testing reviews and labels.
A label of 0 marks a negative review, and a label of 1 marks a positive review.

In [7]:
# Separate the training and testing data
train_data, test_data = imdb['train'], imdb['test']

# Create python lists for reviews and labels
training_reviews = []
training_labels = []

testing_reviews = []
testing_labels = []

# Save the reviews and labels from the training dataset
for r, l in train_data:
    training_reviews.append(r.numpy().decode('utf8'))
    training_labels.append(l.numpy())

# Save the reviews and labels from the testing dataset
for r, l in test_data:
    testing_reviews.append(r.numpy().decode('utf8'))
    testing_labels.append(l.numpy())

# Convert label lists to numpy arrays
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)

In [8]:
# Let's look at this data
i = 50

print(training_reviews[i])
print(training_labels[i])
print(testing_reviews[i])
print(testing_labels[i])

Not the most successful television project John Cleese ever did, "Strange Case" has the feel of a first draft that was rushed into production before any revisions could be made. There are some silly ideas throughout and even a few clever ones, but the story as a whole unfortunately doesn't add up to much.<br /><br />Arthur Lowe is a hoot, though, as Dr. Watson, bionic bits and all. "Good Lord."
1
<br /><br />Very dull, laborious adaptation of Amis's amusing satire. The hero is portrayed not as a likeable loser but a merely oafish cretin. Most of the rest are pure caricatures with only Helen McCrory putting in real quality and providing something of the novel's wit. The period setting is camped up as if it were the 1920s, not the post-war period of horror comics and rock'n' roll. A real dud even by the standards of bad UK TV.<br /><br />
0


## Generate Padded Sequences

Tokenize the reviews, then pad the resulting integer sequences into fixed-width matrices.

In [9]:
# Data parameters
vocab_size = 88000
max_length = 2300
trunc_type = 'post'
oov_token = '<OOV>'

In [10]:
# Tokenize and word index

# Init the Tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)

# Generate a word index from the training reviews
tokenizer.fit_on_texts(training_reviews)
word_index = tokenizer.word_index

print(f'The index contains {len(word_index)} words.')

The index contains 88583 words.
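The index holds 88,583 entries even though `num_words` is 88,000 because the Keras `Tokenizer` records every word it sees; `num_words` is only applied later, when texts are converted to sequences. The following is a minimal pure-Python sketch of that behaviour, not the Keras implementation. The helper names `build_word_index` and `texts_to_sequences_sketch` and the toy reviews are hypothetical, and the sketch splits on whitespace only, whereas Keras also strips punctuation.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency. Index 0 is left free for padding and the
    OOV token takes index 1, so the most frequent word gets index 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = [word for word, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
    return {word: i + 1 for i, word in enumerate([oov_token] + ranked)}

def texts_to_sequences_sketch(texts, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown words and words whose index is
    num_words or higher are replaced by the OOV index."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            seq.append(oov_index if idx is None or idx >= num_words else idx)
        sequences.append(seq)
    return sequences

# Hypothetical toy corpus
toy_texts = ["good movie good cast", "bad movie"]
toy_index = build_word_index(toy_texts)
toy_seqs = texts_to_sequences_sketch(toy_texts, toy_index, num_words=4)

print(len(toy_index))  # every word is indexed, regardless of num_words
print(toy_seqs)        # rarer words collapse to the OOV index
```

Under this scheme, only indices below `num_words` ever reach the model, which is why `vocab_size` can safely be used as the `input_dim` of an embedding layer.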


In [11]:
# Make sequences and pad them

"""
Without a maximum length, each set is padded to the length of its own longest
review (2493 tokens for training, 2332 for testing), which yields large matrices
of different widths. Setting max_length gives both matrices the same fixed width.
"""

# Generate and pad the training sequences
train_seq = tokenizer.texts_to_sequences(training_reviews)
train_seq_pad = pad_sequences(train_seq, maxlen=max_length, truncating=trunc_type)

# Generate and pad the testing sequences
test_seq = tokenizer.texts_to_sequences(testing_reviews)
test_seq_pad = pad_sequences(test_seq, maxlen=max_length, truncating=trunc_type)

# Print some shapes
print(f'The shape of the padded training matrix is {train_seq_pad.shape}.')
print(f'The first sequence in the padded training matrix is {train_seq_pad[0]}.\n')

print(f'The shape of the padded testing matrix is {test_seq_pad.shape}.')
print(f'The first sequence in the padded testing matrix is {test_seq_pad[0]}.\n')

The shape of the padded training matrix is (25000, 2300).
The first sequence in the padded training matrix is [  0   0   0 ... 867 141  10].

The shape of the padded testing matrix is (25000, 2300).
The first sequence in the padded testing matrix is [  0   0   0 ...  56  46 214].
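The leading zeros in both printed sequences come from the `pad_sequences` default of `padding='pre'`, while `truncating='post'` cuts overlong reviews at the end. A minimal pure-Python sketch of that logic for a single sequence follows; the helper name `pad_one` and the toy inputs are hypothetical, and this illustrates the behaviour rather than reproducing the Keras code.

```python
def pad_one(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one integer sequence to exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'post' keeps the start of the sequence, 'pre' keeps the end
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front, 'post' appends it
    return pad + seq if padding == "pre" else seq + pad

# Same settings as the notebook: default pre-padding, post-truncation
short_review = pad_one([7, 8, 9], maxlen=5, truncating="post")
long_review = pad_one([1, 2, 3, 4, 5, 6, 7], maxlen=5, truncating="post")

print(short_review)  # zeros are added in front
print(long_review)   # the tail of the review is dropped
```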



# Build, Compile, and Train the Models

Model 1: Embed and Flatten
Model 2: Bi-Directional LSTM
Model 3: Gated Recurrent Unit (GRU)
Model 4: Convolution

In [12]:
# Model parameters
num_epochs = 10
batch_size = 128  # Defined but never passed to fit() below, so Keras uses its default batch size of 32
embedding_dim = 16
dense_dim = 64

## Model 1: Embed and Flatten

Flatten the embedding output and feed it to a small fully connected network.

In [13]:
"""
A Note on Embedding: The embedding layer receives a sequence of input_length integers.
Each integer is a word index ranging from 0 to input_dim - 1.
Each index is mapped to a trainable vector with output_dim dimensions.
These vector weights are adjusted during training to improve the classification.
"""

# Build the Flat Model
model_flat = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(dense_dim, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the Flat Model
model_flat.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Print the summary
model_flat.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 2300, 16)          1408000   
                                                                 
 flatten (Flatten)           (None, 36800)             0         
                                                                 
 dense (Dense)               (None, 64)                2355264   
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 3,763,329
Trainable params: 3,763,329
Non-trainable params: 0
_________________________________________________________________
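The parameter counts in the summary above can be checked by hand. The following sketch reproduces them from the notebook's hyper-parameters; the variable names ending in `_params` and `flatten_width` are introduced here for illustration only.

```python
# Hyper-parameters copied from the notebook
vocab_size, max_length, embedding_dim, dense_dim = 88000, 2300, 16, 64

# Embedding: one embedding_dim-sized vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# Flatten has no weights; it only reshapes (2300, 16) into 36800 values
flatten_width = max_length * embedding_dim

# Dense: one weight per input-output pair, plus one bias per unit
dense_params = flatten_width * dense_dim + dense_dim
output_params = dense_dim * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, flatten_width, dense_params, output_params, total_params)
```

The first dense layer dominates the total because its input width grows with `max_length`, which is why the flat model is the largest of the four.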


In [14]:
# Train the Flat model
history_flat = model_flat.fit(
    train_seq_pad,
    training_labels,
    epochs=num_epochs,
    validation_data=(test_seq_pad, testing_labels)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Model 2: LSTM

Run a bidirectional LSTM over whole-word token sequences and compare it with the flat model.

In [15]:
# LSTM Hyper-Parameters
lstm_dim = 64

# Build the LSTM Model
model_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_dim)),
    tf.keras.layers.Dense(dense_dim, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the LSTM Model
model_lstm.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Print the summary
model_lstm.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 2300, 16)          1408000   
                                                                 
 bidirectional (Bidirectiona  (None, 128)              41472     
 l)                                                              
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,457,793
Trainable params: 1,457,793
Non-trainable params: 0
_________________________________________________________________
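The 41,472 parameters of the bidirectional layer follow from the standard LSTM layout: four gates, each with an input kernel, a recurrent kernel, and a bias. A short sketch of the arithmetic, using hypothetical variable names:

```python
# Hyper-parameters copied from the notebook
embedding_dim, lstm_dim = 16, 64

# Each of the 4 gates sees the input (16) and the previous hidden state (64),
# and has one bias per unit
per_gate = (embedding_dim + lstm_dim) * lstm_dim + lstm_dim
lstm_params = 4 * per_gate

# Bidirectional trains two independent LSTMs and concatenates their outputs
bidirectional_params = 2 * lstm_params
bidirectional_output_width = 2 * lstm_dim

print(lstm_params, bidirectional_params, bidirectional_output_width)
```

Unlike the flat model, this count does not depend on `max_length`, because the same weights are reused at every time step.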


In [16]:
# Train the LSTM Model
history_lstm = model_lstm.fit(
    train_seq_pad,
    training_labels,
    epochs=num_epochs,
    validation_data=(test_seq_pad, testing_labels)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Model 3: GRU

A simpler and typically faster alternative to the LSTM.

In [17]:
# GRU Hyper-Parameters
gru_dim = 64

# Build the GRU Model
model_gru = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(gru_dim)),
    tf.keras.layers.Dense(dense_dim, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the GRU Model
model_gru.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Print the summary
model_gru.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 2300, 16)          1408000   
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              31488     
 nal)                                                            
                                                                 
 dense_4 (Dense)             (None, 64)                8256      
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,447,809
Trainable params: 1,447,809
Non-trainable params: 0
_________________________________________________________________
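The GRU count of 31,488 can be reproduced the same way. A GRU has three gates instead of four, and this sketch assumes the TensorFlow 2 default of `reset_after=True`, under which each gate carries two bias vectors (one for the input kernel, one for the recurrent kernel). Variable names are hypothetical.

```python
# Hyper-parameters copied from the notebook
embedding_dim, gru_dim = 16, 64

# Each of the 3 gates has an input kernel, a recurrent kernel,
# and (assuming reset_after=True) two bias vectors
per_gate = (embedding_dim + gru_dim) * gru_dim + 2 * gru_dim
gru_params = 3 * per_gate

# Bidirectional doubles the count: one GRU per direction
bidirectional_params = 2 * gru_params

print(gru_params, bidirectional_params)
```

The result matches the summary, which suggests the assumption about `reset_after` holds for this run.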


In [18]:
# Train the GRU Model
history_gru = model_gru.fit(
    train_seq_pad,
    training_labels,
    epochs=num_epochs,
    validation_data=(test_seq_pad, testing_labels)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Model 4: Convolutions

Apply 1D convolutions over sequences of whole-word tokens rather than subword tokens.

In [19]:
# Convolution Hyper-Parameters
filters = 128
kernel_size = 5

# Build the Conv Model
model_conv = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(filters, kernel_size, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(dense_dim, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the Conv Model
model_conv.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Print the summary
model_conv.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 2300, 16)          1408000   
                                                                 
 conv1d (Conv1D)             (None, 2296, 128)         10368     
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense_6 (Dense)             (None, 64)                8256      
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,426,689
Trainable params: 1,426,689
Non-trainable params: 0
_________________________________________________________________
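Both the output length of 2,296 and the 10,368 parameters of the `Conv1D` layer can be derived from the hyper-parameters. The sketch below assumes the layer's defaults of `padding='valid'` and a stride of 1, which the model definition does not override; variable names are hypothetical.

```python
# Hyper-parameters copied from the notebook
max_length, embedding_dim = 2300, 16
filters, kernel_size = 128, 5

# With 'valid' padding and stride 1, the window fits in this many positions
conv_output_length = max_length - kernel_size + 1

# Each filter spans kernel_size time steps across all embedding channels,
# plus one bias per filter
conv_params = kernel_size * embedding_dim * filters + filters

print(conv_output_length, conv_params)
```

`GlobalAveragePooling1D` then averages over the 2,296 positions, leaving one value per filter, so the dense layer receives 128 inputs.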

In [20]:
# Train the Conv Model
history_conv = model_conv.fit(
    train_seq_pad,
    training_labels,
    epochs=num_epochs,
    validation_data=(test_seq_pad, testing_labels)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# With vocab=10,000 and max_length=120

Model 1: Flatten, val_accuracy = 80.4%

Model 2: Bi-LSTM, val_accuracy = 79.1%

Model 3: GRU, val_accuracy = 79.3%

Model 4: Convolution, val_accuracy = 79.4%




# With vocab=88,000 and max_length=2,300

Model 1: Flatten, val_accuracy = 87.4%

Model 2: Bi-LSTM, val_accuracy = about 85.8%

Model 3: GRU, val_accuracy = about 84.0%

Model 4: Convolution, val_accuracy = 85.7%
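
To compare the two configurations directly, the reported validation accuracies can be tabulated and differenced. This sketch only rearranges the numbers listed above; the dictionary names are hypothetical, and the Bi-LSTM and GRU figures for the larger configuration are approximate.

```python
# Validation accuracies (%) as reported above
small_config = {"Flatten": 80.4, "Bi-LSTM": 79.1, "GRU": 79.3, "Convolution": 79.4}
large_config = {"Flatten": 87.4, "Bi-LSTM": 85.8, "GRU": 84.0, "Convolution": 85.7}

# Gain in percentage points from the larger vocabulary and sequence length
gains = {name: round(large_config[name] - small_config[name], 1)
         for name in small_config}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")

best_small = max(small_config, key=small_config.get)
best_large = max(large_config, key=large_config.get)
print(best_small, best_large)
```

Every model improves with the larger vocabulary and sequence length, and the flat model has the highest reported validation accuracy in both settings.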