In this notebook I discuss binary text classification on the "imdb_reviews" dataset from TensorFlow Datasets. The workflow for this task is as follows:

- Loading the dataset
- Text preprocessing
- Model architecture
- Training
- Evaluation

**Loading the dataset** 

In this section I load the data from TensorFlow Datasets. It is a text dataset for binary sentiment classification.

In [33]:
import os
import matplotlib.pyplot as plt
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds

In [34]:
print("tensorflow version",tf.__version__)
print("tf dataset version",tfds.__version__)

tensorflow version 2.13.0
tf dataset version 4.9.2


In [35]:
# physical_devices = tf.config.list_physical_devices("GPU")
# tf.config.experimental.set_memory_growth(physical_devices[0], True)

In [36]:

(ds_train, ds_test), ds_info = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    shuffle_files=True,
    as_supervised=True,  # return (text, label) tuples instead of a dict
    with_info=True,  # also return a DatasetInfo object describing the dataset
)

In [37]:
# info of the dataset
print(ds_info)

tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_path='C:\\Users\\klikh\\tensorflow_datasets\\imdb_reviews\\plain_text\\1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': Text(shape=(), dtype=string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=25000, num

In [38]:
# let's look at a single example
for text, label in ds_train:
    print(text)
    break

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)


**Preprocessing Data**

We cannot feed raw text to the model. First we tokenize it, which splits a sentence into a list of words. For example: "I love the movie" --> TOKENIZATION --> ["I", "love", "the", "movie"]. The model cannot consume strings either, so each token must then be mapped to an integer (numericalized). To make the text a compatible input for the ML model, it therefore needs two transformations:
- tokenize the text
- numericalize the tokenized text
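
The two transformations above can be sketched in plain Python. This is only an illustration: the regex tokenizer is a simple stand-in for the tfds tokenizer, and the toy vocabulary and the `unk_id` value are hypothetical.

```python
import re

def tokenize(text):
    # Lowercase, then keep runs of word characters.
    # Assumption: a simple regex stands in for the tfds tokenizer.
    return re.findall(r"\w+", text.lower())

def numericalize(tokens, word_to_id, unk_id):
    # Map each token to its integer ID; unknown words fall back to unk_id.
    return [word_to_id.get(tok, unk_id) for tok in tokens]

# Hypothetical toy vocabulary, for illustration only.
word_to_id = {"i": 1, "love": 2, "the": 3, "movie": 4}

tokens = tokenize("I love the movie")
ids = numericalize(tokens, word_to_id, unk_id=5)
print(tokens)  # ['i', 'love', 'the', 'movie']
print(ids)     # [1, 2, 3, 4]
```

A word that is not in the vocabulary, such as "film", would be mapped to `unk_id`.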



TensorFlow offers several built-in ways to do this. Here I use the helpers from TensorFlow Datasets. Note that I use tfds.deprecated.text.Tokenizer() rather than tfds.features.text.Tokenizer(): the text utilities live under different modules depending on the installed tfds version, so use whichever one runs without errors.

In [39]:
tokenizer = tfds.deprecated.text.Tokenizer()

**Build vocabulary**: I count how often each token occurs in the training set and keep only the words that appear at least min_appearance times.

In [40]:
def build_vocabulary(min_appearance):
    word_counts = {}
    for text, _ in ds_train:
        tokens = tokenizer.tokenize(text.numpy().lower()) #list
        for token in tokens:
            word_counts[token] = word_counts.get(token, 0) + 1
    
    vocabulary = {word for word, count in word_counts.items() if count >= min_appearance}
    
    return vocabulary

min_appearance=2
vocabulary = build_vocabulary(min_appearance)


**Numericalize the tokenized words**: the encoder below converts text into a sequence of integer IDs, one per token, using the vocabulary. Words outside the vocabulary are mapped to a reserved out-of-vocabulary ID. The vocabulary we created is a set, so we convert it to a list before passing it to TokenTextEncoder.

In [41]:
encoder = tfds.deprecated.text.TokenTextEncoder(
    list(vocabulary), oov_token="<UNK>", lowercase=True, tokenizer=tokenizer
)
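
To make the ID layout concrete, here is a plain-Python sketch of the scheme as I understand it: 0 is reserved for padding, the vocabulary words get IDs 1..N, and a single out-of-vocabulary bucket gets N+1. The helper `make_encoder` is hypothetical and is not part of tfds; you can compare its `vocab_size` with `encoder.vocab_size` in your own installation. The sketch sorts the vocabulary because iterating a set of strings can yield a different order between Python runs, which would change the IDs.

```python
def make_encoder(vocab_list, oov_buckets=1):
    # Assumed layout: 0 = padding, 1..N = vocabulary words, N+1 = out-of-vocabulary.
    token_to_id = {tok: i + 1 for i, tok in enumerate(vocab_list)}
    oov_id = len(vocab_list) + 1
    vocab_size = len(vocab_list) + 1 + oov_buckets

    def encode(tokens):
        # Known tokens map to their ID, everything else to the OOV ID.
        return [token_to_id.get(tok, oov_id) for tok in tokens]

    return encode, vocab_size

# Sorting gives a reproducible token-to-ID mapping.
vocab_list = sorted({"love", "movie", "the"})
encode, vocab_size = make_encoder(vocab_list)

print(encode(["i", "love", "the", "movie"]))  # [4, 1, 3, 2]
print(vocab_size)                             # 5
```

Under this layout the number of distinct IDs is N + 2, which is consistent with an embedding table of size len(vocabulary) + 2.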


In [42]:
def my_enc(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

The function encode_map_fn below wraps the Python function my_enc with tf.py_function so that it can be applied to TensorFlow tensors inside Dataset.map. tf.py_function does not set the shape of the tensors it returns, so the shapes are set explicitly afterwards: [None] for the encoded text, because the sequences have variable length, and [] for the scalar label.

In [43]:
def encode_map_fn(text, label):
    # tf.py_function doesn't set the shape of the returned tensors.
    encoded_text, label = tf.py_function(
        my_enc, inp=[text, label], Tout=(tf.int64, tf.int64)
    )
    # set the shapes explicitly: [None] for a variable-length sequence, [] for the scalar label
    encoded_text.set_shape([None])
    label.set_shape([])

    return encoded_text, label

In [44]:
batch_size=32
AUTOTUNE = tf.data.AUTOTUNE
ds_train = ds_train.map(encode_map_fn, num_parallel_calls=AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(1000)
ds_train = ds_train.padded_batch(batch_size, padded_shapes=([None], ()))
ds_train = ds_train.prefetch(AUTOTUNE)

ds_test = ds_test.map(encode_map_fn)
ds_test = ds_test.padded_batch(batch_size, padded_shapes=([None], ()))
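
What padded batching with a [None] shape does can be sketched in plain Python: every sequence in a batch is extended with zeros up to the length of the longest sequence in that batch. The helper `pad_batch` is hypothetical and only illustrates the idea.

```python
def pad_batch(sequences, pad_value=0):
    # Pad every sequence on the right up to the longest length in this batch.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[5, 2, 9], [7], [3, 4]])
print(batch)  # [[5, 2, 9], [7, 0, 0], [3, 4, 0]]
```

Because the padding length is decided per batch, different batches can have different sequence lengths.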

The preprocessing part is complete. We now have encoded text tensors with their corresponding labels.

**Model architecture**

Now I will define a simple sequential model using the Keras API.

In [45]:
model = keras.Sequential(
    [
        layers.Masking(mask_value=0),
        layers.Embedding(input_dim=len(vocabulary) + 2, output_dim=32),
        layers.GlobalAveragePooling1D(),
        layers.Dense(16, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
        layers.Dropout(.5),
        layers.Dense(1),
    ]
)
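
The pooling step averages the per-token embedding vectors of a review into one vector. The plain-Python sketch below shows how that average changes when padded positions (ID 0) are excluded. The helper `mean_pool` and the numbers are hypothetical. In Keras, padding is only excluded if a mask actually reaches the pooling layer, for example one produced by an Embedding layer created with mask_zero=True.

```python
def mean_pool(vectors, ids=None, pad_id=0):
    # Average the per-token vectors over the sequence dimension.
    # If ids is given, positions whose ID equals pad_id are skipped.
    if ids is not None:
        vectors = [v for v, i in zip(vectors, ids) if i != pad_id]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

ids = [7, 3, 0, 0]  # two real tokens followed by two padding positions
embedded = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5], [0.5, 0.5]]

print(mean_pool(embedded))       # [1.25, 1.75]  (padding included)
print(mean_pool(embedded, ids))  # [2.0, 3.0]    (padding excluded)
```

Without a mask, short reviews in a batch with long ones are pulled toward the embedding of the padding ID.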

In [46]:
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(3e-4, clipnorm=1),
    metrics=["accuracy"],
)
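
The last Dense layer has no activation, so the model outputs a raw score (a logit), and from_logits=True tells the loss to apply the sigmoid itself. The plain-Python sketch below illustrates that relationship with a common numerically stable form of binary cross-entropy on logits. The function names are hypothetical and this is not the Keras implementation.

```python
import math

def sigmoid(logit):
    # Convert a raw score into a probability of the positive class.
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(logit, label):
    # Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_from_prob(prob, label):
    # Textbook form on probabilities, shown for comparison.
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

logit, label = 2.0, 1
print(round(sigmoid(logit), 4))                        # 0.8808
print(round(bce_from_logit(logit, label), 4))          # 0.1269
print(round(bce_from_prob(sigmoid(logit), label), 4))  # 0.1269
```

To turn a model output into a probability at prediction time, apply the sigmoid to it. A logit above 0 corresponds to a probability above 0.5.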


**Training**

In [47]:
model.fit(ds_train, epochs=20, verbose=2)

Epoch 1/20


782/782 - 31s - loss: 0.7007 - accuracy: 0.5000 - 31s/epoch - 40ms/step
Epoch 2/20
782/782 - 26s - loss: 0.6692 - accuracy: 0.5044 - 26s/epoch - 33ms/step
Epoch 3/20
782/782 - 26s - loss: 0.5917 - accuracy: 0.6219 - 26s/epoch - 33ms/step
Epoch 4/20
782/782 - 25s - loss: 0.5141 - accuracy: 0.7508 - 25s/epoch - 33ms/step
Epoch 5/20
782/782 - 25s - loss: 0.4662 - accuracy: 0.8017 - 25s/epoch - 32ms/step
Epoch 6/20
782/782 - 25s - loss: 0.4300 - accuracy: 0.8337 - 25s/epoch - 32ms/step
Epoch 7/20
782/782 - 25s - loss: 0.4054 - accuracy: 0.8485 - 25s/epoch - 32ms/step
Epoch 8/20
782/782 - 25s - loss: 0.3848 - accuracy: 0.8609 - 25s/epoch - 32ms/step
Epoch 9/20
782/782 - 26s - loss: 0.3682 - accuracy: 0.8682 - 26s/epoch - 33ms/step
Epoch 10/20
782/782 - 25s - loss: 0.3527 - accuracy: 0.8773 - 25s/epoch - 33ms/step
Epoch 11/20
782/782 - 26s - loss: 0.3372 - accuracy: 0.8843 - 26s/epoch - 33ms/step
Epoch 12/20
782/782 - 27s - loss: 0.3254 - accuracy: 0.8904 - 27s/epoch - 34ms/step
Epoch 13/20


<keras.src.callbacks.History at 0x109c1aaaa90>

**Evaluation**

In [48]:
# evaluate on the test set
model.evaluate(ds_test, verbose=2)

782/782 - 14s - loss: 0.3291 - accuracy: 0.8822 - 14s/epoch - 18ms/step


[0.3291319012641907, 0.8822399973869324]