# Text Classification

## Setup

In [1]:
try:
    # Colab-only magic that selects TensorFlow 2.x.
    %tensorflow_version 2.x
except Exception:
    pass

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


In [2]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

print("\u2022 Using TensorFlow Version:", tf.__version__)

• Using TensorFlow Version: 2.15.0


## Download the IMDB Dataset

In [3]:
splits = ['train[:60%]', 'train[-40%:]', 'test']

splits, info = tfds.load(name='imdb_reviews', with_info=True, split=splits, as_supervised=True)

train_data, validation_data, test_data = splits

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
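
The percent slices above carve the 25,000-example `train` split into a training part and a validation part. As a rough sketch of the arithmetic (the helper `split_bounds` is hypothetical and is not part of TFDS; it assumes the percentages divide the split evenly, which they do here):

```python
# Hypothetical helper: maps a percent slice onto example indices.
def split_bounds(total, start_pct, end_pct):
    """Return (start, end) indices for a percent slice of `total` examples."""
    start = total * start_pct // 100
    end = total * end_pct // 100
    return start, end

total_train = 25_000  # size of the 'train' split reported by the dataset info

train_range = split_bounds(total_train, 0, 60)         # 'train[:60%]'
validation_range = split_bounds(total_train, 60, 100)  # 'train[-40%:]'

num_training = train_range[1] - train_range[0]
num_validation = validation_range[1] - validation_range[0]
print(num_training, num_validation)
```

Under these assumptions the two slices do not overlap: the first 15,000 reviews are used for training and the last 10,000 for validation.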


## Explore the Data

In [4]:
num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples
num_classes = info.features['label'].num_classes

print('The Dataset has a total of:')
print('\u2022 {:,} classes'.format(num_classes))

print('\u2022 {:,} movie reviews for training'.format(num_train_examples))
print('\u2022 {:,} movie reviews for testing'.format(num_test_examples))

The Dataset has a total of:
• 2 classes
• 25,000 movie reviews for training
• 25,000 movie reviews for testing


In [5]:
class_names = ['negative', 'positive']

In [6]:
for review, label in train_data.take(1):
  review = review.numpy()
  label = label.numpy()

  print('\nMovie Review:\n\n', review)
  print('\nLabel:', class_names[label])


Movie Review:

 b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

Label: negative
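
The `b"..."` prefix in the output shows that `review.numpy()` yields a Python `bytes` object rather than a `str`. A minimal sketch of turning it into readable text, assuming the reviews are UTF-8 encoded (the shortened `raw_review` below is only an illustration):

```python
class_names = ['negative', 'positive']

# Shortened stand-in for the bytes value returned by review.numpy().
raw_review = b"This was an absolutely terrible movie."
raw_label = 0  # integer label as stored in the dataset

# Assumption: the text is UTF-8 encoded.
text = raw_review.decode('utf-8')

print(text)
print('Label:', class_names[raw_label])
```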


## Load Word Embeddings

In [7]:
# If you are running the notebook on Colab:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"

# If you are running the notebook on your local machine:
# embedding = "./models/tf2-preview_gnews-swivel-20dim_1"

hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)

## Build Pipeline

In [8]:
batch_size = 512

train_batches = train_data.shuffle(num_train_examples // 4).batch(batch_size).prefetch(1)
validation_batches = validation_data.batch(batch_size).prefetch(1)
test_batches = test_data.batch(batch_size)
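
With `batch_size = 512`, each dataset is cut into a fixed number of batches per pass. The sketch below estimates those counts in plain Python, assuming the 15,000 / 10,000 / 25,000 example counts implied by the splits and the default behavior of `batch`, which keeps the final partial batch:

```python
import math

batch_size = 512

# Example counts implied by the splits (60% and 40% of 25,000, plus the test split).
example_counts = {'train': 15_000, 'validation': 10_000, 'test': 25_000}

# The last batch may be smaller than batch_size, hence the ceiling.
batches_per_pass = {
    name: math.ceil(count / batch_size)
    for name, count in example_counts.items()
}

for name, num_batches in batches_per_pass.items():
    print('{}: {} batches'.format(name, num_batches))
```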

## Build the Model

In [9]:
model = tf.keras.Sequential([
    hub_layer,
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
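
The two `Dense` layers on top of the embedding are small. As a back-of-the-envelope sketch, their trainable parameter counts can be worked out by hand, assuming the hub layer outputs a 20-dimensional vector per review (as the `20dim` in the module name suggests). The embedding layer's own parameters are not counted here, and `dense_params` is a hypothetical helper:

```python
def dense_params(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units

embedding_dim = 20  # assumption: taken from the 'gnews-swivel-20dim' module name

hidden_params = dense_params(embedding_dim, 16)  # Dense(16, activation='relu')
output_params = dense_params(16, 1)              # Dense(1, activation='sigmoid')

print('hidden layer:', hidden_params)
print('output layer:', output_params)
```

Calling `model.summary()` in the notebook should report the actual counts, including the embedding layer.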

## Train the Model

In [10]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_batches,
                    epochs=20,
                    validation_data=validation_batches)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
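
The `binary_crossentropy` loss minimized during training compares each sigmoid output with its 0/1 label. The following is a plain-Python sketch of that formula for illustration, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid `log(0)`:

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(t*log(p) + (1-t)*log(1-p)) over all examples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0]
uncertain = binary_crossentropy(labels, [0.5, 0.5])  # model has no opinion
confident = binary_crossentropy(labels, [0.9, 0.1])  # confident and correct
mistaken = binary_crossentropy(labels, [0.1, 0.9])   # confident and wrong

print(uncertain, confident, mistaken)
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.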


## Evaluate the Model

In [11]:
eval_results = model.evaluate(test_batches, verbose=0)

for metric, value in zip(model.metrics_names, eval_results):
  print('{}: {:.3}'.format(metric, value))

loss: 0.329
accuracy: 0.859
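
The model's sigmoid output is a probability between 0 and 1, so turning it into a class name needs a cutoff. A minimal sketch, assuming a threshold of 0.5 and using made-up probabilities rather than real model output (`predict_labels` and `accuracy` are hypothetical helpers):

```python
class_names = ['negative', 'positive']

def predict_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (positive) if it exceeds the threshold, else 0."""
    return [int(p > threshold) for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Illustrative values only; these are not outputs of the trained model.
sample_probabilities = [0.91, 0.12, 0.55, 0.40]
sample_labels = [1, 0, 0, 0]

predicted = predict_labels(sample_probabilities)
predicted_names = [class_names[i] for i in predicted]

print(predicted_names)
print('accuracy:', accuracy(sample_labels, predicted))
```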
