# 📦 Chapter 13: Loading & Preprocessing Data with TensorFlow — Practical Guide

Efficient data handling is crucial for scalable deep learning. This notebook covers the TensorFlow Data API (`tf.data`), TFRecord format, preprocessing techniques, and helpful tools like TF Transform and TensorFlow Datasets (TFDS).

## I. The Data API (`tf.data.Dataset`)

Let's explore how to create and manipulate data pipelines with `tf.data` for efficient training.

### A. Creating a dataset from in-memory tensors

In [1]:
import tensorflow as tf

# Generate dummy image data (1000 images of 28x28 with 3 channels)
X = tf.random.uniform((1000, 28, 28, 3))

# Generate dummy labels (integers from 0 to 9)
y = tf.random.uniform((1000,), maxval=10, dtype=tf.int32)

# Create a tf.data.Dataset from tensors
dataset = tf.data.Dataset.from_tensor_slices((X, y))

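
Before chaining transformations, it can help to confirm what a single element of the dataset looks like. The sketch below rebuilds a much smaller dataset with the same layout and inspects it. The names `X_small`, `y_small`, and `small_ds`, and the size of 8 examples, are assumptions chosen for illustration.

```python
import tensorflow as tf

# Small stand-in for the tensors above (8 examples instead of 1000, for illustration)
X_small = tf.random.uniform((8, 28, 28, 3))
y_small = tf.random.uniform((8,), maxval=10, dtype=tf.int32)

small_ds = tf.data.Dataset.from_tensor_slices((X_small, y_small))

# element_spec describes the shape and dtype of a single (image, label) element
image_spec, label_spec = small_ds.element_spec
print(image_spec)
print(label_spec)

# cardinality() reports the number of elements when it is statically known
num_elements = int(small_ds.cardinality())
print(num_elements)

# Iterating yields one (image, label) pair at a time, sliced along the first axis
for image, label in small_ds.take(1):
    print(image.shape, label.dtype)
```

Because `from_tensor_slices` slices along the first axis, each element is one image with its label rather than the whole array.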

### B. Chain transformations: shuffle, batch, data augmentation, prefetching

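This pipeline shuffles the data with a buffer that covers all 1000 examples (a full shuffle), groups the examples into batches of 32, randomly flips images horizontally as augmentation, and prefetches batches so that input preparation overlaps with training.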

In [2]:
dataset = (
    dataset
    .shuffle(buffer_size=1000)
    .batch(32)
    .map(lambda x, y: (tf.image.random_flip_left_right(x), y))
    .prefetch(tf.data.AUTOTUNE)
)
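
One common variation, sketched here under the assumption that augmentation should run per example rather than per batch, moves `map` before `batch` and lets `tf.data` run it in parallel. The names `augment`, `pipeline`, `X_demo`, and `y_demo` are illustrative and not part of the cell above.

```python
import tensorflow as tf

X_demo = tf.random.uniform((1000, 28, 28, 3))
y_demo = tf.random.uniform((1000,), maxval=10, dtype=tf.int32)

def augment(image, label):
    # Random horizontal flip applied to a single (28, 28, 3) image
    return tf.image.random_flip_left_right(image), label

pipeline = (
    tf.data.Dataset.from_tensor_slices((X_demo, y_demo))
    .shuffle(buffer_size=1000)  # buffer >= dataset size gives a full shuffle
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)  # per-example, in parallel
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # overlap input preparation with training
)

# 1000 examples in batches of 32: 31 full batches plus a final batch of 8
batch_shapes = [tuple(images.shape) for images, _ in pipeline]
print(len(batch_shapes), batch_shapes[0], batch_shapes[-1])
```

Mapping before batching keeps the augmentation function simple, at the cost of more function calls. Which order is faster depends on the transformation, so it is worth measuring.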

### C. Using the dataset with Keras model training

A batched dataset that yields `(features, labels)` pairs can be passed directly to `model.fit()` for both training and validation.

In [5]:
# Build a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

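# Train the model using the dataset.
# Note: validation_data here is 10 batches drawn from the training data, for demonstration only;
# use a held-out split to get a meaningful validation score.
model.fit(dataset, epochs=5, validation_data=dataset.take(10))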

Epoch 1/5
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.0879 - loss: 2.6782 - val_accuracy: 0.1187 - val_loss: 2.3856
Epoch 2/5
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.1389 - loss: 2.3991 - val_accuracy: 0.1000 - val_loss: 2.3436
Epoch 3/5
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.1259 - loss: 2.3233 - val_accuracy: 0.1344 - val_loss: 2.2946
Epoch 4/5
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.0948 - loss: 2.2974 - val_accuracy: 0.0875 - val_loss: 2.3095
Epoch 5/5
32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.0942 - loss: 2.3143 - val_accuracy: 0.0969 - val_loss: 2.2975


<keras.src.callbacks.history.History at 0x79dd3938d660>
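
The run above validates on batches drawn from the training data, so its validation metrics say little about generalization. A minimal sketch of a held-out split using `take` and `skip` follows. The 800/200 split and the names `full_ds`, `train_ds`, and `val_ds` are assumptions made for illustration.

```python
import tensorflow as tf

X_all = tf.random.uniform((1000, 28, 28, 3))
y_all = tf.random.uniform((1000,), maxval=10, dtype=tf.int32)
full_ds = tf.data.Dataset.from_tensor_slices((X_all, y_all))

# Split before shuffling so the two subsets never overlap
train_ds = full_ds.take(800).shuffle(800).batch(32).prefetch(tf.data.AUTOTUNE)
val_ds = full_ds.skip(800).batch(32).prefetch(tf.data.AUTOTUNE)

# Count the examples in each split by summing the batch sizes
n_train = sum(int(labels.shape[0]) for _, labels in train_ds)
n_val = sum(int(labels.shape[0]) for _, labels in val_ds)
print(n_train, n_val)
```

The two datasets could then be used as `model.fit(train_ds, epochs=5, validation_data=val_ds)`.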

## II. The TFRecord Format

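A TFRecord file is a simple binary format that stores a sequence of variable-length records, typically serialized `tf.train.Example` protocol buffers. It is well suited to large datasets that are read sequentially.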

### A. Writing TFRecord Files

In [8]:
def serialize_example(image, label):
    # Scale the float image from [0, 1] to uint8 in [0, 255]
    image_uint8 = tf.image.convert_image_dtype(image, dtype=tf.uint8, saturate=True)
    image_encoded = tf.io.encode_png(image_uint8)
    
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_encoded.numpy()])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)]))
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()


# Write TFRecord file
with tf.io.TFRecordWriter('data.tfrecord') as writer:
    for img, lbl in zip(X.numpy(), y.numpy()):
        writer.write(serialize_example(img, lbl))
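
TFRecord files can optionally be compressed, which may help when storage or network bandwidth is the bottleneck. The sketch below writes a few placeholder byte strings with GZIP compression and reads them back. The file name and the record payloads are made up for illustration.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical output path inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "data_gzip.tfrecord")

# Write three small byte records with GZIP compression
options = tf.io.TFRecordOptions(compression_type="GZIP")
records = [b"first", b"second", b"third"]
with tf.io.TFRecordWriter(path, options=options) as writer:
    for record in records:
        writer.write(record)

# The reader must be told which compression type was used
compressed_ds = tf.data.TFRecordDataset([path], compression_type="GZIP")
read_back = [r.numpy() for r in compressed_ds]
print(read_back)
```

Compression trades CPU time for smaller files, so whether it helps depends on where the pipeline is bottlenecked.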

### B. Reading TFRecord Files

In [9]:
# Create a dataset from TFRecord file
raw_ds = tf.data.TFRecordDataset(['data.tfrecord'])

# Define feature description for parsing
feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_fn(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_description)
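    # Decode the PNG bytes; channels=3 makes the channel dimension static
    image = tf.io.decode_png(parsed['image'], channels=3)
    label = parsed['label']
    # Rescale pixel values from [0, 255] to [0, 1]
    image = tf.cast(image, tf.float32) / 255.0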
    return image, label

parsed_ds = raw_ds.map(parse_fn).batch(32)
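
A quick way to gain confidence in a write/read pair is a round-trip check. The self-contained sketch below writes four random images, reads them back, and measures the reconstruction error. The file name and the helper `parse_image` are assumptions for illustration.

```python
import os
import tempfile
import tensorflow as tf

images = tf.random.uniform((4, 28, 28, 3))  # float32 values in [0, 1)
path = os.path.join(tempfile.mkdtemp(), "roundtrip.tfrecord")  # hypothetical file name

with tf.io.TFRecordWriter(path) as writer:
    for img in images:
        png = tf.io.encode_png(tf.image.convert_image_dtype(img, tf.uint8, saturate=True))
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[png.numpy()])),
        }))
        writer.write(example.SerializeToString())

def parse_image(serialized):
    parsed = tf.io.parse_single_example(
        serialized, {"image": tf.io.FixedLenFeature([], tf.string)})
    image = tf.io.decode_png(parsed["image"], channels=3)
    return tf.cast(image, tf.float32) / 255.0

# Records are read back in the order they were written
restored = tf.stack(list(tf.data.TFRecordDataset([path]).map(parse_image)))

# PNG is lossless, but the uint8 conversion quantizes each value to one of 256 levels
max_error = float(tf.reduce_max(tf.abs(restored - images)))
print(restored.shape, max_error)
```

The error stays below one quantization step (1/255), which is usually negligible for image data but worth remembering if the features need full float precision.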

### C. SequenceExample for Variable-Length Sequences

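Use `tf.train.SequenceExample` when each record holds one or more variable-length sequences, such as time series or tokenized text. It separates per-record `context` features (for example, a label) from per-step `feature_lists` (for example, the tokens of a sentence).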
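
As a rough sketch of how this might look, the code below serializes one sequence of token IDs with a label and parses it back. The feature names `"tokens"` and `"label"` and the helper `make_sequence_example` are hypothetical choices for illustration.

```python
import tensorflow as tf

def make_sequence_example(token_ids, label):
    # Context holds per-record features; feature_lists holds per-step features
    context = tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })
    tokens = tf.train.FeatureList(feature=[
        tf.train.Feature(int64_list=tf.train.Int64List(value=[t])) for t in token_ids
    ])
    return tf.train.SequenceExample(
        context=context,
        feature_lists=tf.train.FeatureLists(feature_list={"tokens": tokens}),
    )

serialized = make_sequence_example([5, 2, 9, 4], label=1).SerializeToString()

# Parsing returns two dicts: one for the context, one for the sequences
context, sequences = tf.io.parse_single_sequence_example(
    serialized,
    context_features={"label": tf.io.FixedLenFeature([], tf.int64)},
    sequence_features={"tokens": tf.io.FixedLenSequenceFeature([], tf.int64)},
)
print(context["label"].numpy(), sequences["tokens"].numpy())
```

Since sequences differ in length across records, batching them typically requires padding, for example with `padded_batch`.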

## III. Preprocessing the Input Features

### A. One-Hot Encoding Categorical Features

In [10]:
from tensorflow.keras.layers import StringLookup

# Example categorical data
categories = tf.constant(['red', 'green', 'blue', 'green', 'red'])

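# Create a StringLookup layer that outputs one-hot vectors.
# By default, index 0 is reserved for out-of-vocabulary (OOV) strings,
# so three categories produce vectors of length 4.
lookup = StringLookup(output_mode='one_hot')
lookup.adapt(categories)

# Encode new data
one_hot = lookup(tf.constant(['red', 'blue', 'green']))
print("One-hot encoded vectors:\n", one_hot.numpy())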

One-hot encoded vectors:
 [[0 1 0 0]
 [0 0 0 1]
 [0 0 1 0]]
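
To make the role of the extra column concrete, the sketch below uses a fixed vocabulary instead of `adapt()` so that the column order is explicit, then encodes a string the layer has never seen. The name `lookup_fixed` and the unseen value `"purple"` are illustrative assumptions.

```python
import tensorflow as tf

# Fixed vocabulary instead of adapt(), so the column order is explicit
lookup_fixed = tf.keras.layers.StringLookup(
    vocabulary=["red", "green", "blue"], output_mode="one_hot")

# "purple" is not in the vocabulary, so it falls into the OOV column (index 0)
encoded = lookup_fixed(tf.constant(["purple", "red"]))
print(lookup_fixed.get_vocabulary())
print(encoded.numpy())
```

The OOV column lets the model handle unseen categories at inference time instead of failing on them.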


### B. Embeddings for Categorical Data

In [11]:
from tensorflow.keras.layers import Embedding

# Suppose genre IDs range from 0 to 9
embedding = Embedding(input_dim=10, output_dim=4)

# Example input IDs
input_ids = tf.constant([[1, 3, 7]])
embedded = embedding(input_ids)
print("Embedded shape:", embedded.shape)

Embedded shape: (1, 3, 4)
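
In practice the integer IDs often come from a lookup layer. The sketch below chains a `StringLookup` in its default integer mode with an `Embedding`. The genre names and the variables `to_index` and `embed` are hypothetical.

```python
import tensorflow as tf

genres = ["action", "comedy", "drama"]  # hypothetical vocabulary

# Default output_mode="int": index 0 is reserved for out-of-vocabulary strings
to_index = tf.keras.layers.StringLookup(vocabulary=genres)

# input_dim must cover every index the lookup can emit (vocabulary + OOV slot)
embed = tf.keras.layers.Embedding(input_dim=to_index.vocabulary_size(), output_dim=4)

indices = to_index(tf.constant(["comedy", "drama", "western"]))
vectors = embed(indices)
print(indices.numpy(), vectors.shape)
```

Sizing `input_dim` from `vocabulary_size()` avoids out-of-range indices when the vocabulary changes.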


### C. Keras Preprocessing Layers

In [13]:
from tensorflow.keras.layers import Normalization

# Create a normalization layer
normalizer = Normalization()

# Adapt to some data
normalizer.adapt(tf.random.uniform((100, 1)))

# Normalize new data (shape (2, 1): two samples, one feature, matching the adapted data)
normalized_x = normalizer(tf.constant([[0.3], [0.7]]))
print("Normalized values:\n", normalized_x.numpy())

Normalized values:
 [[-0.63422656]
 [ 0.7468681 ]]
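
To see what the layer computes, the sketch below adapts it to three known values and compares its output with a manual calculation. The data and variable names are assumptions chosen so the statistics are easy to verify by hand.

```python
import numpy as np
import tensorflow as tf

data = np.array([[1.0], [2.0], [3.0]], dtype="float32")  # 3 samples, 1 feature

norm = tf.keras.layers.Normalization()
norm.adapt(data)  # learns mean = 2.0 and variance = 2/3 from the data

# The layer computes (x - mean) / sqrt(variance) for each feature
layer_out = norm(tf.constant([[2.0], [3.0]])).numpy()
manual_out = (np.array([[2.0], [3.0]]) - data.mean()) / data.std()
print(layer_out.ravel(), manual_out.ravel())
```

The statistics are fixed once `adapt()` has run, so the same transformation is applied at training and inference time.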


## IV. TF Transform (TFT)

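TF Transform (`tensorflow_transform`) lets you define preprocessing once and apply it identically during training and serving. Full-pass statistics, such as means or vocabularies, are computed over the training set ahead of time, and the resulting transformations are exported as a TensorFlow graph that can be attached to the served model. This avoids training/serving skew.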

## V. TensorFlow Datasets (TFDS)

TFDS (`tensorflow_datasets`) downloads and prepares many standard datasets on first use and exposes them as `tf.data.Dataset` objects with predefined splits and rich metadata, making experimentation easier.

In [None]:
import tensorflow_datasets as tfds

# Load the MNIST dataset
ds, info = tfds.load("mnist", split="train", with_info=True)

# Preprocess the dataset: normalize images
ds = ds.map(lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
ds = ds.batch(32)

# Show dataset info
print(info)


## Summary of Tools & APIs

| Tool/API                 | Use Case                                |
| ------------------------ | --------------------------------------- |
| **`tf.data.Dataset`**    | Fast, scalable data pipelines           |
| **TFRecord**             | Efficient binary storage with ProtoBufs |
| **Preprocessing Layers** | In-graph feature transformations        |
| **TF Transform**         | Consistent train/serve preprocessing    |
| **TFDS**                 | Ready-to-use datasets with metadata     |

## 💡 Exercises to Practice

1. Write TFRecords from a dataset, then load and decode them.
2. Use `tf.data` operations for feature standardization and augmentation.
3. Build a pipeline: load TFDS data, preprocess with Keras layers, feed into `model.fit`.
4. Explore TF Transform: build a Python preprocessing function, apply in offline & online modes.
5. Practice using TFRecord + `SequenceExample` for variable-length inputs (e.g., text sentences or time series).