# <font color="#418FDE" size="6.5" uppercase>**tf.data Basics**</font>

>Last update: 20260125.
    
By the end of this lecture, you will be able to:
- Create tf.data.Dataset pipelines from tensors, NumPy arrays, and TFRecord files. 
- Apply common tf.data transformations such as map, batch, shuffle, and prefetch. 
- Debug dataset shapes and types to ensure compatibility with Keras models. 


## **1. Building TensorFlow Datasets**

### **1.1. Creating Datasets from Tensors**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_01_01.jpg?v=1769401250" width="250">



>* Wrap in-memory tensors as simple datasets
>* Enable easy experimentation and later scalable pipelines

>* `from_tensor_slices` pairs tensors element by element along the first dimension
>* A consistent leading dimension keeps each feature row aligned with its label

>* Datasets expose shape and dtype problems early
>* Small in-memory datasets let you check that data matches model expectations
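
As a minimal sketch of the points above, the snippet below inspects `element_spec` on a small in-memory dataset and shows that `from_tensor_slices` rejects components whose leading dimensions disagree. The array names `x`, `y_ok`, and `y_bad` are illustrative and not part of the lecture code. The exact exception type is an assumption, so both likely candidates are caught.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy arrays: 4 feature rows, then labels with 4 and with 3 rows.
x = np.zeros((4, 2), dtype=np.float32)
y_ok = np.zeros((4,), dtype=np.int32)
y_bad = np.zeros((3,), dtype=np.int32)

# Matching leading dimensions: each element pairs one feature row with one label.
ds = tf.data.Dataset.from_tensor_slices((x, y_ok))
feature_spec, label_spec = ds.element_spec
print("feature spec:", feature_spec)
print("label spec:", label_spec)

# Mismatched leading dimensions should be rejected when the dataset is built.
mismatch_detected = False
try:
    tf.data.Dataset.from_tensor_slices((x, y_bad))
except (ValueError, tf.errors.InvalidArgumentError) as err:
    mismatch_detected = True
    print("Rejected mismatched tensors:", type(err).__name__)
```

Checking `element_spec` right after construction is cheap because it does not iterate over the data.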



In [None]:
#@title Python Code - Creating Datasets from Tensors

# This script shows tf.data basics with tensors.
# You will create simple datasets from memory.
# Shapes and dtypes are checked for safety.

# !pip install tensorflow==2.20.0

# Reduce TensorFlow C++ log output; set this before importing TensorFlow.
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

# Import TensorFlow and NumPy for this lesson.
import tensorflow as tf
import numpy as np

# Set seeds for deterministic behavior here.
tf.random.set_seed(7)
np.random.seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create small NumPy arrays for features and labels.
features_np = np.array([[1.0], [2.0], [3.0], [4.0]], dtype=np.float32)
labels_np = np.array([[0.0], [0.0], [1.0], [1.0]], dtype=np.float32)

# Validate that leading dimensions of arrays match.
assert features_np.shape[0] == labels_np.shape[0]

# Convert NumPy arrays to TensorFlow tensors.
features_tf = tf.convert_to_tensor(features_np, dtype=tf.float32)
labels_tf = tf.convert_to_tensor(labels_np, dtype=tf.float32)

# Show basic information about created tensors.
print("Features tensor shape:", features_tf.shape)
print("Labels tensor shape:", labels_tf.shape)

# Build dataset directly from a tuple of tensors.
base_dataset = tf.data.Dataset.from_tensor_slices((features_tf, labels_tf))

# Inspect one raw element from the base dataset.
for element in base_dataset.take(1):
    print("One raw element:", element)

# Define a simple mapping function for scaling features.
def scale_features(feature, label):
    scaled_feature = feature / tf.constant(4.0, dtype=tf.float32)
    return scaled_feature, label

# Apply map, shuffle, batch, and prefetch transformations.
train_dataset = (base_dataset
                 .map(scale_features)
                 .shuffle(buffer_size=4, seed=7)
                 .batch(2)
                 .prefetch(tf.data.AUTOTUNE))

# Inspect shapes and dtypes from the transformed dataset.
for batch_features, batch_labels in train_dataset.take(1):
    print("Batch features shape:", batch_features.shape)
    print("Batch labels shape:", batch_labels.shape)
    print("Batch features dtype:", batch_features.dtype)

# Build a tiny Keras model compatible with dataset.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

# Compile model with simple optimizer and loss.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Train model briefly using the dataset pipeline.
history = model.fit(train_dataset,
                    epochs=5,
                    verbose=0)

# Evaluate model on the same small dataset.
loss, acc = model.evaluate(train_dataset, verbose=0)

# Print final loss and accuracy in one line.
print("Final loss and accuracy:", float(loss), float(acc))



### **1.2. TFRecord Input Pipelines**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_01_02.jpg?v=1769401293" width="250">



>* TFRecord stores many serialized training examples efficiently
>* Stream large datasets from disk for scalable training

>* First read raw serialized examples from TFRecords
>* Then parse strings into typed, shaped tensors

>* Parsed TFRecords act like regular tensor datasets
>* Supports scaling, sharding, and distributed training efficiently
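
The read-then-parse steps above can also be tried without touching disk. The following is a minimal sketch, assuming the same two-field schema (`features`, `label`) used in this lecture; the helper names `serialize` and `parse_batch` are illustrative. It batches the serialized strings first and then parses a whole batch at once with `tf.io.parse_example`, which is one possible alternative to parsing record by record.

```python
import tensorflow as tf

def serialize(features, label):
    """Build one tf.train.Example and return its serialized bytes."""
    example = tf.train.Example(features=tf.train.Features(feature={
        "features": tf.train.Feature(
            float_list=tf.train.FloatList(value=features)),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }))
    return example.SerializeToString()

# Four serialized records held in memory instead of in a TFRecord file.
records = [serialize([float(i)] * 3, i % 2) for i in range(4)]

feature_description = {
    "features": tf.io.FixedLenFeature([3], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_batch(serialized_batch):
    # parse_example parses a vector of serialized strings in one call.
    parsed = tf.io.parse_example(serialized_batch, feature_description)
    return parsed["features"], parsed["label"]

ds = tf.data.Dataset.from_tensor_slices(records).batch(2).map(parse_batch)

first_features, first_labels = next(iter(ds))
print("features:", first_features.numpy())
print("labels:", first_labels.numpy())
```

With real files, the same `parse_batch` function could be mapped over a batched `tf.data.TFRecordDataset`.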



In [None]:
#@title Python Code - TFRecord Input Pipelines

# This script shows basic TFRecord input pipelines.
# It creates tiny TFRecord files and parses them.
# Use this as a gentle TensorFlow dataset introduction.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import pathlib

# Import TensorFlow and check version.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a small temporary directory for TFRecords.
base_dir = pathlib.Path("tfrecord_demo")
base_dir.mkdir(exist_ok=True)

# Define a helper to create one Example proto.
def make_example(features_tensor, label_int):
    feature_dict = {
        "features": tf.train.Feature(
            float_list=tf.train.FloatList(value=features_tensor)
        ),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label_int])
        ),
    }

    example_proto = tf.train.Example(
        features=tf.train.Features(feature=feature_dict)
    )
    return example_proto

# Write a tiny TFRecord file with few examples.
def write_tfrecord_file(path, num_examples, feature_dim):
    with tf.io.TFRecordWriter(str(path)) as writer:
        for i in range(num_examples):
            features = [float(i)] * feature_dim
            label = i % 2
            example = make_example(features, label)
            writer.write(example.SerializeToString())

# Define file paths for two small shards.
file_one = base_dir / "data_part_1.tfrecord"
file_two = base_dir / "data_part_2.tfrecord"

# Create two shards with few examples each.
write_tfrecord_file(file_one, num_examples=4, feature_dim=3)
write_tfrecord_file(file_two, num_examples=4, feature_dim=3)

# List TFRecord files as string paths.
file_pattern = str(base_dir / "data_part_*.tfrecord")
files_dataset = tf.data.Dataset.list_files(file_pattern, shuffle=False)

# Read records from multiple files using interleave.
raw_dataset = files_dataset.interleave(
    lambda f: tf.data.TFRecordDataset(f),
    cycle_length=2,
    num_parallel_calls=tf.data.AUTOTUNE,
)

# Define the feature description for parsing.
feature_description = {
    "features": tf.io.FixedLenFeature([3], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

# Create a parsing function for serialized examples.
def parse_example(serialized_example):
    parsed = tf.io.parse_single_example(
        serialized_example,
        feature_description,
    )
    features = parsed["features"]
    label = parsed["label"]
    return features, label

# Map parsing over the raw dataset.
parsed_dataset = raw_dataset.map(
    parse_example,
    num_parallel_calls=tf.data.AUTOTUNE,
)

# Shuffle, batch, and prefetch for efficiency.
final_dataset = parsed_dataset.shuffle(8).batch(2).prefetch(1)

# Inspect one batch to understand shapes and types.
for batch_features, batch_labels in final_dataset.take(1):
    print("Batch features shape:", batch_features.shape)
    print("Batch labels shape:", batch_labels.shape)
    print("Batch features dtype:", batch_features.dtype)
    print("Batch labels dtype:", batch_labels.dtype)

    # Show the actual numeric values for clarity.
    print("Example batch features:", batch_features.numpy())



### **1.3. Listing Dataset Files**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_01_03.jpg?v=1769401331" width="250">



>* Create datasets from file path lists only
>* Stream, order, and label files efficiently during training

>* Use filename patterns to automatically list files
>* Treat paths as dynamic, filterable views of storage

>* File-path datasets feed parsers for each file
>* Separation simplifies debugging, storage changes, and scaling
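
One common way to label files from their paths is to read the class folder name out of each path string. The sketch below assumes a `root/<class>/<file>` layout with `/` as the separator and uses hypothetical path strings, so no files are read. The function name `path_to_label` is illustrative.

```python
import tensorflow as tf

# Hypothetical paths following a root/<class>/<file> layout.
paths = [
    "demo_data_root/cats/sample_0.txt",
    "demo_data_root/dogs/sample_0.txt",
    "demo_data_root/cats/sample_1.txt",
]
class_names = tf.constant(["cats", "dogs"])

def path_to_label(path):
    # The second-to-last path component is the class folder name.
    parts = tf.strings.split(path, "/")
    folder = parts[-2]
    # The index of the matching class name becomes the integer label.
    matches = tf.cast(tf.equal(class_names, folder), tf.int32)
    label = tf.argmax(matches)
    return path, label

labeled_ds = tf.data.Dataset.from_tensor_slices(paths).map(path_to_label)

labels = [int(label.numpy()) for _, label in labeled_ds]
print("Labels derived from paths:", labels)
```

The same mapping function could be applied to a dataset produced by `tf.data.Dataset.list_files`, since its elements are also scalar string tensors.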



In [None]:
#@title Python Code - Listing Dataset Files

# This script shows how to list dataset files.
# We focus on tf.data and file patterns.
# Run cells sequentially to follow the flow.

# !pip install tensorflow==2.20.0

# Import TensorFlow and supporting modules.
import os
import pathlib
import tensorflow as tf

# Print TensorFlow version for quick verification.
print("TensorFlow version:", tf.__version__)

# Create a small temporary root directory.
root_dir = pathlib.Path("demo_data_root")
root_dir.mkdir(exist_ok=True)

# Define two subdirectories to mimic class folders.
class_names = ["cats", "dogs"]
for name in class_names:
    (root_dir / name).mkdir(exist_ok=True)

# Create a helper function to write tiny text files.
def write_dummy_file(path: pathlib.Path, text: str) -> None:
    path.write_text(text, encoding="utf-8")

# Populate each folder with a few tiny files.
for label in class_names:
    for index in range(3):
        file_path = root_dir / label / f"sample_{index}.txt"
        write_dummy_file(file_path, f"Dummy {label} file {index}.")

# Collect all created text files recursively with pathlib.
all_paths = sorted(root_dir.rglob("*.txt"))
print("Number of created files:", len(all_paths))

# Print a few example paths for orientation.
for path in all_paths[:4]:
    print("Example file:", path)

# Build a file pattern that matches all text files.
pattern_all = str(root_dir / "*" / "*.txt")
print("Pattern for all files:", pattern_all)

# Create a dataset from the file pattern.
files_ds_all = tf.data.Dataset.list_files(pattern_all, shuffle=False)

# Inspect the first few elements from the dataset.
print("Listing all dataset file paths:")
for path_tensor in files_ds_all.take(4):
    print("Path element:", path_tensor.numpy().decode("utf-8"))

# Build a pattern that only matches cat files.
pattern_cats = str(root_dir / "cats" / "*.txt")
print("Pattern for cat files:", pattern_cats)

# Create a dataset that lists only cat file paths.
files_ds_cats = tf.data.Dataset.list_files(pattern_cats, shuffle=False)

# Confirm that only cat paths are included.
print("Listing only cat file paths:")
for path_tensor in files_ds_cats.take(3):
    print("Cat path:", path_tensor.numpy().decode("utf-8"))

# Show the dtype and shape of a dataset element.
first_path = next(iter(files_ds_all))
print("Element dtype:", first_path.dtype, "shape:", first_path.shape)



## **2. Core Dataset Transformations**

### **2.1. Mapping for Preprocessing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_02_01.jpg?v=1769401370" width="250">



>* Map applies preprocessing function to each example
>* Transforms raw stored data into model-ready examples

>* Mapping centralizes complex, domain-specific preprocessing steps
>* Ensures every dataset example is processed consistently

>* Parallel mapping boosts preprocessing speed and throughput
>* Combines with batching and prefetching for scalable pipelines
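
A mapping function can also apply statistics computed once over the whole dataset. The sketch below is one possible variant, assuming the data fits in memory so that column-wise mean and standard deviation can be computed with NumPy beforehand. The function name `standardize` and the toy values are illustrative.

```python
import numpy as np
import tensorflow as tf

# Illustrative data: two feature columns on very different scales.
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], dtype=np.float32)
labels = np.array([0, 1, 0], dtype=np.int32)

# Column-wise statistics computed once, outside the pipeline.
mean = tf.constant(features.mean(axis=0))
std = tf.constant(features.std(axis=0))

def standardize(x, y):
    # Applied to each example; 1e-6 guards against a zero standard deviation.
    return (x - mean) / (std + 1e-6), y

ds = tf.data.Dataset.from_tensor_slices((features, labels)).map(
    standardize, num_parallel_calls=tf.data.AUTOTUNE
)

# Stack the mapped examples to check the column means.
standardized = np.stack([x.numpy() for x, _ in ds])
print("Column means after mapping:", standardized.mean(axis=0))
```

Because every example passes through the same function with the same constants, the scaling is consistent across the dataset.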



In [None]:
#@title Python Code - Mapping for Preprocessing

# This script shows tf.data mapping preprocessing.
# It focuses on simple numeric feature scaling.
# You can run it directly inside Google Colab.

# !pip install tensorflow==2.20.0

# Import required TensorFlow module.
import tensorflow as tf

# Set global random seed for determinism.
tf.random.set_seed(7)

# Print TensorFlow version in one line.
print("TensorFlow version:", tf.__version__)

# Create small numeric feature tensor.
features = tf.constant([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Create small numeric label tensor.
labels = tf.constant([[0.0], [1.0], [0.0]])

# Validate shapes before building dataset.
assert features.shape[0] == labels.shape[0]

# Build dataset from tensor slices.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Define preprocessing mapping function.
def preprocess_example(feature, label):
    # Compute the mean and standard deviation across this single example's values.
    mean = tf.reduce_mean(feature)
    std = tf.math.reduce_std(feature)

    # Avoid division by zero using epsilon.
    eps = tf.constant(1e-6, dtype=feature.dtype)

    # Standardize per example (not per feature column across the dataset).
    norm_feature = (feature - mean) / (std + eps)

    # Return normalized feature and original label.
    return norm_feature, label

# Apply mapping with num_parallel_calls autotune.
mapped_ds = base_ds.map(preprocess_example, num_parallel_calls=tf.data.AUTOTUNE)

# Shuffle dataset with small buffer size.
shuffled_ds = mapped_ds.shuffle(buffer_size=3, seed=7)

# Batch dataset into small batches.
batched_ds = shuffled_ds.batch(2, drop_remainder=False)

# Prefetch to overlap preprocessing and consumption.
final_ds = batched_ds.prefetch(tf.data.AUTOTUNE)

# Iterate over final dataset and inspect shapes.
for batch_features, batch_labels in final_ds:
    print("Batch features shape:", batch_features.shape)
    print("Batch labels shape:", batch_labels.shape)
    print("First batch features:", batch_features.numpy())
    break



### **2.2. Efficient Batching Techniques**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_02_02.jpg?v=1769401402" width="250">



>* Batching groups many examples for each step
>* Improves hardware efficiency and stabilizes training gradients

>* Batch size trades speed, memory, and updates
>* Treat batch size as tunable, hardware-aware hyperparameter

>* Decide how to handle leftover small batches
>* Order batching and preprocessing to balance efficiency
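
The effect of the leftover batch can be checked without iterating. This minimal sketch assumes a dataset whose size is known statically (here `tf.data.Dataset.range`), so `cardinality()` can report the number of batches. The sizes 12 and 5 are arbitrary illustration values.

```python
import math
import tensorflow as tf

num_examples, batch_size = 12, 5
ds = tf.data.Dataset.range(num_examples)

keep_last = ds.batch(batch_size, drop_remainder=False)
drop_last = ds.batch(batch_size, drop_remainder=True)

# cardinality() reports the batch count when it can be determined statically.
n_keep = int(keep_last.cardinality().numpy())
n_drop = int(drop_last.cardinality().numpy())
print("Batches when keeping the remainder:", n_keep)
print("Batches when dropping the remainder:", n_drop)

# The static batch dimension is only fixed when the remainder is dropped.
print("Spec shape (keep):", keep_last.element_spec.shape)
print("Spec shape (drop):", drop_last.element_spec.shape)
```

Keeping the remainder gives `ceil(n / batch_size)` batches with an unknown batch dimension, while dropping it gives `floor(n / batch_size)` batches of fixed shape.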



In [None]:
#@title Python Code - Efficient Batching Techniques

# This script demonstrates efficient batching techniques.
# It uses tf.data to build simple datasets.
# Focus on batch size and drop remainder behavior.

# !pip install tensorflow==2.20.0

# Import required TensorFlow module.
import tensorflow as tf

# Set a deterministic global random seed.
tf.random.set_seed(7)

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Create a small tensor of feature values.
features = tf.range(start=0, limit=12, delta=1, dtype=tf.int32)

# Create matching labels as simple multiples.
labels = features * 10

# Pair features and labels into a dataset.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Show dataset element shape and type.
for elem in base_ds.take(1):
    print("Single element shapes:", elem[0].shape, elem[1].shape)

# Define a small helper to describe batches.
def describe_batches(ds, name):
    print("\n", name)
    for batch in ds:
        x, y = batch
        print("batch shape:", x.shape, "labels shape:", y.shape)

# Create batches without dropping the remainder.
no_drop_ds = base_ds.batch(batch_size=5, drop_remainder=False)

# Describe batches when keeping the final smaller batch.
describe_batches(no_drop_ds, "No drop_remainder (keep last small batch)")

# Create batches while dropping the final smaller batch.
drop_ds = base_ds.batch(batch_size=5, drop_remainder=True)

# Describe batches when forcing all batches equal sized.
describe_batches(drop_ds, "With drop_remainder=True (fixed batch shapes)")

# Prefetch to overlap data preparation and consumption.
prefetch_ds = drop_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

# Take one batch to confirm shapes stay consistent.
for batch in prefetch_ds.take(1):
    x_batch, y_batch = batch
    print("\nPrefetched batch shape:", x_batch.shape, y_batch.shape)




### **2.3. Efficient Shuffle and Repeat**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_02_03.jpg?v=1769401435" width="250">



>* Shuffle randomizes example order to avoid patterns
>* Repeat cycles data for many randomized epochs

>* Balance shuffle buffer size with memory limits
>* Place repeat after shuffle for varied epochs

>* Shuffle, then repeat, then batch, then prefetch
>* Respect sequence boundaries while optimizing randomness and efficiency
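
To see why buffer size matters, it can help to model a shuffle buffer in plain Python. The sketch below is a simplified conceptual model, not TensorFlow's actual implementation: it keeps at most `buffer_size` items in memory and emits a random one whenever the buffer is full. The function name `buffered_shuffle` is hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a buffer-limited random order (simplified model)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: emit one randomly chosen element.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: drain the remaining elements in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

data = list(range(10))
no_shuffle = list(buffered_shuffle(data, buffer_size=1, seed=7))
small = list(buffered_shuffle(data, buffer_size=3, seed=7))
full = list(buffered_shuffle(data, buffer_size=len(data), seed=7))

print("buffer_size=1: ", no_shuffle)
print("buffer_size=3: ", small)
print("buffer_size=10:", full)
```

In this model a buffer of size 1 leaves the order unchanged, a small buffer only moves items a short distance forward, and a buffer as large as the dataset allows any ordering at the cost of holding everything in memory.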



In [None]:
#@title Python Code - Efficient Shuffle and Repeat

# This script demonstrates efficient shuffle and repeat.
# It uses a tiny synthetic dataset for clarity.
# Run all cells sequentially inside Google Colab.

# !pip install tensorflow==2.20.0

# Import TensorFlow and NumPy for data handling.
import tensorflow as tf
import numpy as np

# Set deterministic seeds for reproducible shuffling.
np.random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a small NumPy array of feature values.
features = np.arange(12, dtype=np.float32).reshape(6, 2)

# Create simple labels equal to row indices.
labels = np.arange(6, dtype=np.int32)

# Validate shapes before building the dataset.
assert features.shape[0] == labels.shape[0]

# Build a base dataset from tensors.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Define a tiny preprocessing map function.
def scale_features(x, y):
    x = x / 10.0
    return x, y

# Apply the map transformation to scale features.
mapped_ds = base_ds.map(scale_features)

# Build a pipeline with shuffle then repeat then batch.
shuffle_then_repeat = (
    mapped_ds.shuffle(buffer_size=6, seed=7, reshuffle_each_iteration=True)
    .repeat(2)
    .batch(3)
)

# Build a pipeline with repeat then shuffle then batch.
repeat_then_shuffle = (
    mapped_ds.repeat(2)
    .shuffle(buffer_size=6, seed=7, reshuffle_each_iteration=True)
    .batch(3)
)

# Helper function to collect the first num_batches batches (may span more than one epoch).
def collect_batches(ds, num_batches):
    batches = []
    for batch_index, (x_batch, y_batch) in enumerate(ds):
        batches.append((batch_index, y_batch.numpy().tolist()))
        if batch_index + 1 >= num_batches:
            break
    return batches

# Collect a few batches from shuffle then repeat pipeline.
str_shuffle_repeat = collect_batches(shuffle_then_repeat, num_batches=4)

# Collect a few batches from repeat then shuffle pipeline.
str_repeat_shuffle = collect_batches(repeat_then_shuffle, num_batches=4)

# Print results to compare ordering patterns.
print("\nBatches with shuffle().repeat():")
for idx, labels_list in str_shuffle_repeat:
    print("batch", idx, "labels", labels_list)

# Print second configuration ordering for contrast.
print("\nBatches with repeat().shuffle():")
for idx, labels_list in str_repeat_shuffle:
    print("batch", idx, "labels", labels_list)



## **3. Efficient Dataset Inspection**

### **3.1. Prefetching for Overlap**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_03_01.jpg?v=1769401480" width="250">



>* Prefetching overlaps data loading with model computation
>* Ensure prefetched batches keep consistent shapes and types

>* Prefetch keeps the data conveyor belt full
>* Verify shapes and dtypes stay unchanged with prefetch

>* Place prefetch after map, shuffle, batch
>* Temporarily remove prefetch when tracing shape or dtype bugs
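
A quick way to confirm that prefetching leaves shapes and dtypes untouched is to compare `element_spec` before and after. This is a minimal sketch with placeholder zero tensors; the variable names are illustrative.

```python
import tensorflow as tf

# Placeholder data: 12 examples with 4 float features and 1 integer label.
batched = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([12, 4]), tf.zeros([12, 1], dtype=tf.int32))
).batch(4, drop_remainder=True)

prefetched = batched.prefetch(tf.data.AUTOTUNE)

# Prefetch only changes when elements are produced, not what they look like.
same_spec = batched.element_spec == prefetched.element_spec
print("element_spec unchanged by prefetch:", same_spec)
print("Prefetched spec:", prefetched.element_spec)
```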



In [None]:
#@title Python Code - Prefetching for Overlap

# This script shows tf.data prefetching basics.
# We focus on shapes, types, and Keras compatibility.
# Run cells to inspect batches before and after prefetch.

# !pip install tensorflow==2.20.0

# Import TensorFlow and NumPy for this example.
import tensorflow as tf
import numpy as np

# Set deterministic seeds for reproducible behavior.
tf.random.set_seed(7)
np.random.seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create small synthetic feature and label arrays.
features = np.random.rand(12, 4).astype("float32")
labels = np.random.randint(0, 2, size=(12, 1)).astype("int32")

# Validate shapes before building the dataset.
assert features.shape[0] == labels.shape[0]

# Build a dataset from the NumPy arrays.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Shuffle with fixed buffer and seed for stability.
shuffled_ds = base_ds.shuffle(buffer_size=12, seed=7)

# Batch the dataset to match model expectations.
batched_ds = shuffled_ds.batch(4, drop_remainder=True)

# Inspect one batch before adding prefetch.
for batch_features, batch_labels in batched_ds.take(1):
    print("Before prefetch shapes:", batch_features.shape, batch_labels.shape)
    print("Before prefetch dtypes:", batch_features.dtype, batch_labels.dtype)

# Add prefetch to overlap input work and compute.
prefetched_ds = batched_ds.prefetch(tf.data.AUTOTUNE)

# Inspect one batch after adding prefetch.
for p_features, p_labels in prefetched_ds.take(1):
    print("After prefetch shapes:", p_features.shape, p_labels.shape)
    print("After prefetch dtypes:", p_features.dtype, p_labels.dtype)

# Define a tiny Keras model matching feature shape.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile the model with simple settings.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Run a very short silent training step.
history = model.fit(prefetched_ds, epochs=1, verbose=0)

# Confirm that training completed without shape issues.
print("Training finished with prefetching and stable shapes.")



### **3.2. Parallel Mapping Control**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_03_02.jpg?v=1769401516" width="250">



>* Parallel map controls speed and debugging difficulty
>* Lower parallelism simplifies tracing shapes and dtypes

>* Complex, resource-heavy map functions need careful parallelism
>* Debug with low parallelism first, then increase it safely

>* Less parallelism gives clearer, ordered debugging logs
>* Controlled mapping ensures consistent shapes across collaborators
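
When moving from sequential to parallel mapping, output order can be kept stable by requesting deterministic behavior. The sketch below uses the `deterministic` argument of `Dataset.map` and a trivial illustrative function `double`; it compares the element order of a sequential and a parallel pipeline.

```python
import tensorflow as tf

def double(x):
    # Trivial stand-in for a real preprocessing function.
    return x * 2

source = tf.data.Dataset.range(10)

sequential = source.map(double)
parallel = source.map(
    double,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=True,  # keep output order identical to input order
)

seq_values = [int(v.numpy()) for v in sequential]
par_values = [int(v.numpy()) for v in parallel]
print("Sequential:", seq_values)
print("Parallel:  ", par_values)
print("Same order:", seq_values == par_values)
```

Element values, shapes, and dtypes do not depend on the parallelism level; what can change is timing and the interleaving of any log output produced inside the mapped function.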



In [None]:
#@title Python Code - Parallel Mapping Control

# This script demonstrates parallel mapping control.
# It focuses on dataset shapes and types debugging.
# Run cells sequentially in a fresh Colab runtime.

# !pip install tensorflow==2.20.0

# Import TensorFlow and NumPy for this example.
import tensorflow as tf
import numpy as np

# Print TensorFlow version for reproducibility.
print("TensorFlow version:", tf.__version__)

# Set global random seeds for deterministic behavior.
tf.random.set_seed(7)
np.random.seed(7)

# Create small NumPy arrays for features and labels.
features_np = np.random.rand(8, 4).astype("float32")
labels_np = np.random.randint(0, 2, size=(8, 1)).astype("int32")

# Build a dataset from the NumPy arrays.
base_ds = tf.data.Dataset.from_tensor_slices((features_np, labels_np))

# Define a mapping function with shape and type logging.
def debug_map(example_features, example_labels):
    # Log shapes and dtypes; tf.print runs for every example passed through map.
    tf.print("map input shapes:", tf.shape(example_features),
             tf.shape(example_labels))
    tf.print("map input dtypes:", example_features.dtype,
             example_labels.dtype)

    # Add a simple transformation to features.
    scaled_features = example_features * tf.constant(2.0, dtype=tf.float32)

    # Ensure labels are int32 for model compatibility.
    safe_labels = tf.cast(example_labels, tf.int32)

    # Print shapes after transformation for verification.
    tf.print("map output shapes:", tf.shape(scaled_features),
             tf.shape(safe_labels))

    return scaled_features, safe_labels

# Create a dataset with sequential mapping for debugging.
debug_ds = base_ds.map(debug_map, num_parallel_calls=1)

# Batch and prefetch to match simple Keras model input.
debug_ds = debug_ds.batch(4).prefetch(1)

# Inspect one batch to confirm shapes and types.
for batch_features, batch_labels in debug_ds.take(1):
    print("Batch features shape:", batch_features.shape)
    print("Batch labels shape:", batch_labels.shape)

# Build a tiny Keras model matching the dataset shapes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile the model with simple binary classification settings.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Train briefly with verbose zero to avoid long logs.
history = model.fit(debug_ds, epochs=1, verbose=0)

# Now enable parallel mapping for performance after debugging.
fast_ds = base_ds.map(debug_map, num_parallel_calls=tf.data.AUTOTUNE)

# Batch and prefetch the faster dataset for training.
fast_ds = fast_ds.batch(4).prefetch(1)

# Train again briefly on the faster parallel dataset.
final_history = model.fit(fast_ds, epochs=1, verbose=0)

# Print final batch shapes to confirm compatibility remains.
for f_batch, l_batch in fast_ds.take(1):
    print("Fast batch features shape:", f_batch.shape)



### **3.3. Inspecting element_spec**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_05/Lecture_A/image_03_03.jpg?v=1769401561" width="250">



>* element_spec shows dataset structure, shapes, dtypes
>* Use it to ensure model-compatible inputs, outputs

>* Transformations change dataset shapes and structures
>* Inspect element_spec after steps to catch mismatches

>* Element spec defines a contract with models
>* Regular checks prevent late, hard-to-debug errors
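
The idea of a contract can be made explicit with a small check. This is a minimal sketch, assuming a model that expects batches of two float32 features; the helper `features_match` and the expected shape are hypothetical and would need to match your own model.

```python
import numpy as np
import tensorflow as tf

features = np.zeros((6, 2), dtype=np.float32)
labels = np.zeros((6,), dtype=np.int32)
base = tf.data.Dataset.from_tensor_slices((features, labels))

# Hypothetical contract: batches of 2-feature float32 rows, any batch size.
expected_shape = tf.TensorShape([None, 2])
expected_dtype = tf.float32

def features_match(ds):
    """Return True if the dataset's feature component fits the contract."""
    feature_spec, _ = ds.element_spec
    return (feature_spec.dtype == expected_dtype
            and feature_spec.shape.is_compatible_with(expected_shape))

unbatched_ok = features_match(base)           # feature shape (2,)
batched_ok = features_match(base.batch(2))    # feature shape (None, 2)
fixed_ok = features_match(base.batch(2, drop_remainder=True))  # shape (2, 2)

print("Unbatched matches:", unbatched_ok)
print("Batched matches:", batched_ok)
print("Fixed-size batches match:", fixed_ok)
```

Running a check like this after each transformation points to the step where the structure stopped matching, before any training call is made.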



In [None]:
#@title Python Code - Inspecting element_spec

# This script shows how to inspect element_spec.
# It focuses on dataset shapes and types debugging.
# Run cells in order to follow the explanation.

# Optional TensorFlow install for some environments.
# !pip install tensorflow==2.20.0

# Import TensorFlow and NumPy for this example.
import tensorflow as tf
import numpy as np

# Set a deterministic random seed for reproducibility.
tf.random.set_seed(7)

# Print TensorFlow version in a compact single line.
print("TensorFlow version:", tf.__version__)

# Create small NumPy arrays for features and labels.
features_np = np.arange(12, dtype=np.float32).reshape((6, 2))

# Create integer labels aligned with the features.
labels_np = np.array([0, 1, 0, 1, 0, 1], dtype=np.int32)

# Build a dataset from the NumPy feature and label arrays.
base_ds = tf.data.Dataset.from_tensor_slices((features_np, labels_np))

# Print the element_spec of the base dataset.
print("Base element_spec:", base_ds.element_spec)

# Shuffle the dataset with a small buffer size.
shuffled_ds = base_ds.shuffle(buffer_size=6, seed=7)

# Batch the shuffled dataset into small batches.
batched_ds = shuffled_ds.batch(batch_size=2, drop_remainder=False)

# Prefetch to overlap preprocessing and model execution.
final_ds = batched_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

# Print the element_spec after batching and prefetching.
print("Final element_spec:", final_ds.element_spec)

# Take one batch from the final dataset for inspection.
for batch_features, batch_labels in final_ds.take(1):
    # Print shapes and dtypes for the batch tensors.
    print("Batch features shape:", batch_features.shape)
    print("Batch features dtype:", batch_features.dtype)
    print("Batch labels shape:", batch_labels.shape)
    print("Batch labels dtype:", batch_labels.dtype)

# Define a simple Keras model compatible with the dataset.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile the model with a binary classification configuration.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Run a tiny silent training step to confirm compatibility.
model.fit(final_ds, epochs=1, verbose=0, steps_per_epoch=3)



# <font color="#418FDE" size="6.5" uppercase>**tf.data Basics**</font>


In this lecture, you learned to:
- Create tf.data.Dataset pipelines from tensors, NumPy arrays, and TFRecord files. 
- Apply common tf.data transformations such as map, batch, shuffle, and prefetch. 
- Debug dataset shapes and types to ensure compatibility with Keras models. 

In the next Lecture (Lecture B), we will go over 'Advanced Pipelines'.