# <font color="#418FDE" size="6.5" uppercase>**Datasets and Loaders**</font>

>Last update: 20260129.
    
By the end of this Lecture, you will be able to:
- Implement custom PyTorch Dataset classes that load and preprocess samples on demand. 
- Configure DataLoader instances with appropriate batch sizes, shuffling, and multiprocessing workers. 
- Use built‑in datasets from torchvision and torchtext as quick starting points for experiments. 


## **1. Custom Dataset Design**

### **1.1. Map and Iterable Datasets**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_01_01.jpg?v=1769669722" width="250">



>* Map-style datasets give index-based random access
>* Best for fixed collections and supervised training

>* Iterable-style datasets treat data as a one-way stream
>* They support huge, unbounded, or hard-to-index data sources

>* Map-style datasets enable shuffling, subsets, and efficient sampling
>* Iterable-style datasets suit complex streams without random access



In [None]:
#@title Python Code - Map and Iterable Datasets

# This script compares map and iterable datasets.
# It uses tiny synthetic data for clarity.
# Run cells sequentially to follow explanations.

# Optional install for PyTorch if missing.
# !pip install torch torchvision --quiet

# Import required standard libraries.
import random
import itertools
import math

# Import torch and dataset utilities.
import torch
from torch.utils.data import Dataset
from torch.utils.data import IterableDataset
from torch.utils.data import DataLoader

# Set deterministic random seeds.
random.seed(0)
torch.manual_seed(0)

# Define a simple map style dataset.
class SquareMapDataset(Dataset):
    # Initialize with a fixed size.
    def __init__(self, size: int = 10):
        self.size = int(size)

    # Return dataset length for indexing.
    def __len__(self) -> int:
        return self.size

    # Get one item by integer index.
    def __getitem__(self, index: int) -> torch.Tensor:
        if index < 0 or index >= self.size:
            raise IndexError("index out of range")
        value = float(index)
        return torch.tensor([value, value ** 2], dtype=torch.float32)

# Create a map dataset instance.
map_dataset = SquareMapDataset(size=8)

# Access a few items directly.
first_item = map_dataset[0]
third_item = map_dataset[2]

# Wrap map dataset in a DataLoader.
map_loader = DataLoader(
    map_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,
)

# Define a simple iterable style dataset.
class SquareIterableDataset(IterableDataset):
    # Initialize with a maximum count.
    def __init__(self, max_value: int = 8):
        self.max_value = int(max_value)

    # Create an iterator over the stream.
    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            start = 0
            end = self.max_value
        else:
            per_worker = int(
                math.ceil(self.max_value / worker_info.num_workers)
            )
            start = worker_info.id * per_worker
            end = min(start + per_worker, self.max_value)
        for i in range(start, end):
            value = float(i)
            yield torch.tensor([value, value ** 2], dtype=torch.float32)

# Create an iterable dataset instance.
iter_dataset = SquareIterableDataset(max_value=8)

# Wrap iterable dataset in a DataLoader.
iter_loader = DataLoader(
    iter_dataset,
    batch_size=4,
    shuffle=False,
    num_workers=0,
)

# Print framework version in one line.
print("PyTorch version:", torch.__version__)

# Show direct indexing from map dataset.
print("Map first item:", first_item.tolist())
print("Map third item:", third_item.tolist())

# Show one shuffled batch from map loader.
for batch in map_loader:
    print("Map loader batch shape:", batch.shape)
    break

# Show sequential batches from iterable loader.
for batch in iter_loader:
    print("Iterable loader batch shape:", batch.shape)
    break

# Confirm that iterable dataset has no length.
has_len = hasattr(iter_dataset, "__len__")
print("Iterable dataset has __len__:", has_len)
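
The claim that map-style datasets enable subsets and sampling can be illustrated with a minimal sketch. It assumes a hypothetical `SquaresDataset` (not part of the lecture code) and uses `Subset` and `random_split` from `torch.utils.data`; both rely only on `__len__` and integer indexing.

```python
import torch
from torch.utils.data import Dataset, Subset, random_split


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: item i is the pair (i, i**2)."""

    def __init__(self, size: int = 10):
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int) -> torch.Tensor:
        if not 0 <= index < self.size:
            raise IndexError("index out of range")
        return torch.tensor([float(index), float(index) ** 2])


full = SquaresDataset(size=10)

# Subset re-indexes the parent: position 0 of the subset is parent index 1.
odd_only = Subset(full, indices=[1, 3, 5, 7, 9])

# random_split partitions the index range; a seeded generator makes it repeatable.
generator = torch.Generator().manual_seed(0)
train_part, val_part = random_split(full, [8, 2], generator=generator)

print("Subset length:", len(odd_only), "first item:", odd_only[0].tolist())
print("Split sizes:", len(train_part), len(val_part))
```

Neither utility is a natural fit for an iterable-style dataset, because both need to look up samples by position.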




### **1.2. Core Dataset Methods**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_01_02.jpg?v=1769669801" width="250">



>* `__len__` and `__getitem__` define dataset size and samples
>* They index raw data into structured training examples

>* An accurate `__len__` lets samplers and loaders compute indices and batch counts
>* If the data changes, update the length so all pipeline components stay synchronized

>* Index method locates, loads, and decodes data
>* Transforms raw records into clean model-ready samples



In [None]:
#@title Python Code - Core Dataset Methods

# This script shows core dataset methods.
# It focuses on __len__ and __getitem__.
# It uses a tiny custom tensor dataset.

# Uncomment to install PyTorch in some environments.
# !pip install torch torchvision torchaudio

# Import required standard libraries.
import random
import os
import math

# Import torch for tensors and Dataset.
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Set deterministic random seeds.
random.seed(42)
torch.manual_seed(42)

# Define a simple custom dataset class.
class TinySquaresDataset(Dataset):
    # Initialize with a maximum integer value.
    def __init__(self, max_value: int = 10):
        assert max_value > 0, "max_value must be positive"
        self.values = list(range(1, max_value + 1))

    # Return the number of available samples.
    def __len__(self) -> int:
        return len(self.values)

    # Return one sample given an integer index.
    def __getitem__(self, index: int):
        if index < 0 or index >= len(self.values):
            raise IndexError("Index out of range for dataset")
        x_value = float(self.values[index])
        x_tensor = torch.tensor([x_value], dtype=torch.float32)
        y_tensor = torch.tensor([x_value ** 2], dtype=torch.float32)
        return {"input": x_tensor, "target": y_tensor}

# Create a tiny dataset instance.
dataset = TinySquaresDataset(max_value=8)

# Show dataset length using __len__.
print("Dataset length:", len(dataset))

# Inspect a single sample using __getitem__.
sample_index = 3
sample = dataset[sample_index]
print("Sample index:", sample_index, "input:", sample["input"].item())
print("Sample index:", sample_index, "target:", sample["target"].item())

# Create a DataLoader to batch samples.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Take one batch from the loader.
first_batch = next(iter(loader))

# Validate batch shapes before printing.
assert first_batch["input"].shape == (4, 1)
assert first_batch["target"].shape == (4, 1)

# Print a short summary of the first batch.
print("Batch inputs:", first_batch["input"].squeeze().tolist())
print("Batch targets:", first_batch["target"].squeeze().tolist())

# Confirm that dataset methods work as expected.
print("Verified __len__ and __getitem__ behavior.")
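
To show why an accurate length matters, here is a small sketch of how a `DataLoader` derives its batch count from `len(dataset)`. The dataset size of 10 and batch size of 4 are illustrative assumptions, and `TensorDataset` stands in for a custom class.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten samples, one feature each (illustrative values only).
ten_samples = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))

keep_last = DataLoader(ten_samples, batch_size=4, drop_last=False)
drop_last = DataLoader(ten_samples, batch_size=4, drop_last=True)

# The loader length is computed from len(dataset) and batch_size.
print("Samples:", len(ten_samples))
print("Batches (keep last):", len(keep_last), "expected", math.ceil(10 / 4))
print("Batches (drop last):", len(drop_last), "expected", 10 // 4)

# The final partial batch appears only when drop_last is False.
batch_sizes = [x.shape[0] for (x,) in keep_last]
print("Batch sizes:", batch_sizes)
```

If `__len__` reported a stale value, these counts (and anything built on them, such as progress bars or learning-rate schedules) would be off as well.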



### **1.3. On the fly preprocessing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_01_03.jpg?v=1769669857" width="250">



>* Transform each sample right when it’s loaded
>* Save raw data; keep preprocessing flexible, storage‑efficient

>* Random transforms create varied samples each epoch
>* Applying them inside the dataset keeps inputs fresh, which can help generalization

>* Balance preprocessing cost with data loading speed
>* Precompute heavy steps, apply cheap ones dynamically



In [None]:
#@title Python Code - On the fly preprocessing

# This script shows on the fly preprocessing.
# It uses a tiny TensorFlow tf.data pipeline as the example.
# Each sample is transformed right before use.

# !pip install tensorflow

# Import required standard libraries.
import random
import numpy as np
import tensorflow as tf

# Set deterministic random seeds.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Create a tiny synthetic numeric dataset.
base_features = np.arange(12, dtype=np.float32).reshape(6, 2)
base_labels = np.array([0, 1, 0, 1, 0, 1], dtype=np.int32)

# Validate shapes before building dataset.
assert base_features.shape[0] == base_labels.shape[0]

# Define a function for on the fly noise.
def add_random_noise(features):
    noise = tf.random.normal(tf.shape(features), stddev=0.1)
    return features + noise

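# Define a function for on the fly scaling.
# Dataset-wide statistics are used so the added noise stays visible;
# standardizing each two-feature example on its own would largely cancel it.
feature_mean = float(base_features.mean())
feature_std = float(base_features.std())

def scale_features(features):
    return (features - feature_mean) / (feature_std + 1e-6)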

# Define a mapping function combining both transforms.
def preprocess_example(features, label):
    noisy = add_random_noise(features)
    scaled = scale_features(noisy)
    return scaled, label

# Build a tf.data Dataset from tensors.
raw_ds = tf.data.Dataset.from_tensor_slices((base_features, base_labels))

# Apply on the fly preprocessing with map.
preprocessed_ds = raw_ds.map(preprocess_example)

# Batch the dataset for efficient loading.
preprocessed_ds = preprocessed_ds.batch(2)

# Take one epoch and collect two batches.
example_batches = list(preprocessed_ds.take(2))

# Print original features for comparison.
print("Original features:\n", base_features)

# Print first preprocessed batch.
first_batch_x, first_batch_y = example_batches[0]
print("First batch preprocessed features:\n", first_batch_x.numpy())

# Print first preprocessed batch labels.
print("First batch labels:", first_batch_y.numpy())

# Iterate again to show new random noise.
second_run_batches = list(preprocessed_ds.take(1))

# Print another run to highlight randomness.
second_x, second_y = second_run_batches[0]
print("Second run first batch features:\n", second_x.numpy())
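
The same idea can be written as a PyTorch `Dataset`, which is the form this lecture's objectives refer to. This is a minimal sketch, not a reference implementation: `NoisyFeaturesDataset` and `NoisyStandardize` are hypothetical names, and the noise level of 0.1 is an assumption carried over from the example above.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class NoisyStandardize:
    """Hypothetical transform: add Gaussian noise, then scale with fixed statistics."""

    def __init__(self, mean: float, std: float, noise_std: float = 0.1):
        self.mean = mean
        self.std = std
        self.noise_std = noise_std

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        noisy = x + self.noise_std * torch.randn_like(x)
        return (noisy - self.mean) / (self.std + 1e-6)


class NoisyFeaturesDataset(Dataset):
    """Stores raw features and applies the transform on every access."""

    def __init__(self, features, labels, transform=None):
        assert features.shape[0] == labels.shape[0]
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self) -> int:
        return self.features.shape[0]

    def __getitem__(self, index: int):
        x = self.features[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[index]


torch.manual_seed(42)
raw_features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
raw_labels = torch.tensor([0, 1, 0, 1, 0, 1])

transform = NoisyStandardize(raw_features.mean().item(), raw_features.std().item())
noisy_ds = NoisyFeaturesDataset(raw_features, raw_labels, transform=transform)

# Reading the same index twice draws fresh noise each time.
first_read, _ = noisy_ds[0]
second_read, _ = noisy_ds[0]
print("First read: ", first_read.tolist())
print("Second read:", second_read.tolist())

batch_x, batch_y = next(iter(DataLoader(noisy_ds, batch_size=2, shuffle=False)))
print("Batch shape:", tuple(batch_x.shape), "labels:", batch_y.tolist())
```

The raw tensors are never modified, so changing the preprocessing only means passing a different `transform`.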



## **2. Efficient DataLoader Configuration**

### **2.1. Smart Batching Strategies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_02_01.jpg?v=1769669891" width="250">



>* Batch size affects speed, memory, and gradients
>* Choose a size that fits hardware and stabilizes training

>* Group samples with similar size or length
>* This reduces padding, wasted memory, and computation

>* Mix shuffling with structured, size-based batching
>* Hybrid batching boosts efficiency while preserving robustness



In [None]:
#@title Python Code - Smart Batching Strategies

# This script demonstrates smart batching strategies simply.
# We simulate variable length sequences and batch them efficiently.
# Focus is on DataLoader configuration and padding behavior.

# Uncomment the next line if torch is not installed.
# !pip install torch torchvision

# Import required standard libraries.
import random
import math
import os

# Import torch for tensors and DataLoader utilities.
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Set deterministic seeds for reproducibility.
random.seed(42)
torch.manual_seed(42)

# Define a tiny custom dataset with variable lengths.
class ToySequenceDataset(Dataset):
    # Initialize with a list of sequence lengths.
    def __init__(self, lengths):
        self.lengths = list(lengths)

    # Return dataset size for DataLoader usage.
    def __len__(self):
        return len(self.lengths)

    # Return a sequence of ones with the requested length for each index.
    def __getitem__(self, idx):
        length = int(self.lengths[idx])
        values = torch.ones(length, dtype=torch.float32)
        return values, length

# Create a small list of varying sequence lengths.
sequence_lengths = [3, 7, 2, 5, 9, 4, 6, 8]

# Instantiate the dataset with these lengths.
dataset = ToySequenceDataset(sequence_lengths)

# Validate dataset length to avoid configuration mistakes.
assert len(dataset) == len(sequence_lengths)

# Define a simple padding collate function for naive batching.
def naive_collate(batch):
    sequences, lengths = zip(*batch)
    max_len = max(int(l) for l in lengths)
    padded = []
    for seq in sequences:
        pad_size = max_len - int(seq.shape[0])
        pad_tensor = torch.zeros(pad_size, dtype=torch.float32)
        padded.append(torch.cat((seq, pad_tensor)))
    batch_tensor = torch.stack(padded, dim=0)
    length_tensor = torch.tensor(lengths, dtype=torch.int64)
    return batch_tensor, length_tensor

# Define a bucketing function to group similar lengths.
def make_buckets(lengths, bucket_size):
    paired = list(enumerate(lengths))
    paired.sort(key=lambda x: x[1])
    buckets = []
    for i in range(0, len(paired), bucket_size):
        bucket = paired[i : i + bucket_size]
        buckets.append(bucket)
    return buckets

# Build buckets to support smart batching behavior.
buckets = make_buckets(sequence_lengths, bucket_size=4)

# Define a custom sampler that yields indices by buckets.
class BucketSampler(torch.utils.data.Sampler):
    # Store buckets and enable optional shuffling.
    def __init__(self, buckets, shuffle=True):
        self.buckets = list(buckets)
        self.shuffle = bool(shuffle)

    # Return total number of samples across buckets.
    def __len__(self):
        return sum(len(b) for b in self.buckets)

    # Yield indices, shuffling within each bucket.
    def __iter__(self):
        bucket_order = list(self.buckets)
        if self.shuffle:
            random.shuffle(bucket_order)
        for bucket in bucket_order:
            inner = list(bucket)
            if self.shuffle:
                random.shuffle(inner)
            for idx, _length in inner:
                yield int(idx)

# Create naive DataLoader with random shuffling only.
naive_loader = DataLoader(dataset=dataset,
                          batch_size=4,
                          shuffle=True,
                          collate_fn=naive_collate,
                          num_workers=0)

# Create smart DataLoader using bucket sampler strategy.
bucket_sampler = BucketSampler(buckets=buckets, shuffle=True)
smart_loader = DataLoader(dataset=dataset,
                          batch_size=4,
                          sampler=bucket_sampler,
                          collate_fn=naive_collate,
                          num_workers=0)

# Helper function to inspect one epoch of a loader.
def inspect_loader(loader, name):
    print(f"\n{name} batches:")
    for batch_idx, (batch, lengths) in enumerate(loader):
        shape = tuple(batch.shape)
        min_len = int(lengths.min().item())
        max_len = int(lengths.max().item())
        print(f"Batch {batch_idx}: shape={shape}, min_len={min_len}, max_len={max_len}")

# Print torch version and device information briefly.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Torch version {torch.__version__}, device {device}")

# Move one batch to device to show realistic usage.
first_batch, first_lengths = next(iter(naive_loader))
first_batch = first_batch.to(device)
first_lengths = first_lengths.to(device)

# Inspect naive and smart loaders to compare padding ranges.
inspect_loader(naive_loader, "Naive loader")
inspect_loader(smart_loader, "Smart bucketed loader")
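
To quantify what bucketing saves, the following pure-Python sketch counts padded positions for the same eight sequence lengths, once in their given order and once sorted by length. The helper `padding_waste` is a hypothetical name introduced only for this illustration.

```python
def padding_waste(lengths, batch_size):
    """Count padded positions when each batch is padded to its longest item."""
    waste = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        waste += sum(longest - n for n in batch)
    return waste


sequence_lengths = [3, 7, 2, 5, 9, 4, 6, 8]

in_given_order = padding_waste(sequence_lengths, batch_size=4)
sorted_by_length = padding_waste(sorted(sequence_lengths), batch_size=4)

print("Padding in given order:", in_given_order)      # 20
print("Padding when sorted:   ", sorted_by_length)    # 12
```

Shuffling within and across buckets, as the sampler above does, keeps most of this saving while still varying batch composition between epochs.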




### **2.2. num_workers tuning**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_02_02.jpg?v=1769669990" width="250">



>* Worker count controls parallel data loading processes
>* Tune workers to balance speed and system load

>* Optimal worker count depends on hardware and preprocessing
>* Increase workers until throughput stops improving or fluctuates

>* Optimal worker settings change with the environment and over time
>* Retune regularly using system metrics and profiling



In [None]:
#@title Python Code - num_workers tuning

# This script demonstrates DataLoader num_workers tuning.
# It uses TensorFlow to simulate data loading work.
# Focus is on timing different worker configurations.

# !pip install tensorflow

# Import standard libraries for timing and randomness.
import os
import time
import random
import numpy as np

# Import TensorFlow.
import tensorflow as tf

# Set deterministic random seeds for reproducibility.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version in a single concise line.
print("TensorFlow version:", tf.__version__)

# Define a small synthetic dataset size constant.
NUM_SAMPLES = 2048
BATCH_SIZE = 64

# Create a simple synthetic feature tensor dataset.
features = tf.random.normal((NUM_SAMPLES, 32))
labels = tf.random.uniform((NUM_SAMPLES, 1), 0, 2, dtype=tf.int32)

# Validate shapes before building the dataset pipeline.
assert features.shape[0] == labels.shape[0]

# Build a base tf.data Dataset from tensors.
base_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Define a small preprocessing function to simulate work.
def preprocess(x, y):
    x = tf.math.square(x)
    x = tf.math.reduce_mean(x, axis=-1, keepdims=True)
    return x, y

# Wrap preprocessing with map for later reuse.
def make_dataset(num_parallel_calls):
    ds = base_ds.shuffle(256, seed=42, reshuffle_each_iteration=False)
    ds = ds.map(preprocess, num_parallel_calls=num_parallel_calls)
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds

# Define a helper to time one full pass over the dataset.
def time_dataset(num_parallel_calls):
    ds = make_dataset(num_parallel_calls)
    start = time.time()
    batch_count = 0
    for batch_x, batch_y in ds:
        batch_count += 1
    end = time.time()
    elapsed = end - start
    return elapsed, batch_count

# Choose several worker like values to compare timings.
worker_like_values = [1, 2, 4, 8]

# Collect timing results for each configuration.
results = []

for workers in worker_like_values:
    elapsed, batches = time_dataset(workers)
    results.append((workers, elapsed, batches))

# Print a short summary explaining the measurements.
print("\nSimulated num_workers timing with tf.data map.")

# Print each configuration result in a compact format.
for workers, elapsed, batches in results:
    print(
        f"parallel_calls={workers}: {elapsed:.3f}s for {batches} batches"
    )

# Indicate which configuration was fastest overall.
best_cfg = min(results, key=lambda x: x[1])
print(
    f"\nFastest setting here uses parallel_calls={best_cfg[0]} workers."
)
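
For comparison, a PyTorch version of the same measurement might look like the sketch below. It assumes a hypothetical `SlowSquares` dataset that sleeps briefly in `__getitem__` to mimic decoding cost; the sizes, delay, and worker counts are illustrative. The sweep sits under a `__main__` guard because worker processes may re-import the main module on some platforms.

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class SlowSquares(Dataset):
    """Hypothetical dataset whose __getitem__ sleeps to mimic decoding cost."""

    def __init__(self, size: int = 256, delay: float = 0.0005):
        self.size = size
        self.delay = delay

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int) -> torch.Tensor:
        time.sleep(self.delay)
        return torch.tensor([float(index), float(index) ** 2])


def time_one_pass(dataset, num_workers, batch_size=32):
    """Return (seconds, batch_count) for one full pass over the dataset."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    batch_count = sum(1 for _ in loader)
    return time.perf_counter() - start, batch_count


if __name__ == "__main__":
    data = SlowSquares()
    for workers in (0, 1, 2):
        elapsed, batches = time_one_pass(data, workers)
        print(f"num_workers={workers}: {elapsed:.3f}s for {batches} batches")
```

The timings include worker start-up, so on a dataset this small `num_workers=0` may well come out fastest; the point of the sweep is to measure rather than guess.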



### **2.3. Memory Pinning and Dropping**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_02_03.jpg?v=1769670078" width="250">



>* Pinned memory speeds CPU‑to‑GPU data transfers
>* Enables overlapping data transfer with model computation

>* Pinned memory speeds GPU training on big batches
>* Overuse increases RAM pressure, so monitor carefully

>* Free batch memory promptly to prevent buildup
>* Avoid hidden references; combine pinning with cleanup



In [None]:
#@title Python Code - Memory Pinning and Dropping

# This script shows PyTorch DataLoader memory pinning.
# It compares pinned and unpinned batches on a device.
# It also demonstrates careful memory dropping practices.

# !pip install torch torchvision

# Import required standard libraries.
import os
import time
import random

# Import torch and torchvision safely.
import torch
import torchvision
import torchvision.transforms as T

# Set deterministic random seeds everywhere.
random.seed(0)
torch.manual_seed(0)

# Detect device preferring GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print framework version and selected device.
print("Torch", torch.__version__, "Device", device)

# Define a tiny transform converting images to tensors.
transform = T.Compose([T.ToTensor()])

# Download the MNIST training set (a small subset is taken below).
train_dataset = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)

# Limit dataset size to keep runtime very small.
small_size = 256
assert small_size <= len(train_dataset)

# Create a subset view of the dataset.
indices = list(range(small_size))
small_dataset = torch.utils.data.Subset(train_dataset, indices)

# Helper function to build a configured DataLoader.
def make_loader(pinned):
    return torch.utils.data.DataLoader(
        small_dataset,
        batch_size=64,
        shuffle=True,
        num_workers=2,
        pin_memory=pinned,
    )

# Create loaders with and without pinned memory.
loader_unpinned = make_loader(pinned=False)
loader_pinned = make_loader(pinned=True)

# Define a tiny model to consume batches.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(28 * 28, 10),
).to(device)

# Ensure model parameters are on the correct device type.
# Comparing types avoids a false mismatch between "cuda" and "cuda:0".
for p in model.parameters():
    assert p.device.type == device.type

# Function to time one pass over a loader.
def time_loader(loader, description):
    start = time.time()
    total = 0
    for batch_idx, (images, labels) in enumerate(loader):
        assert images.ndim == 4
        assert labels.ndim == 1

        # Move batch to device using non_blocking when pinned.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        # Run a tiny forward pass then drop references.
        outputs = model(images)
        total += outputs.shape[0]
        del images, labels, outputs

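    # Wait for queued GPU work so the measured time is complete.
    if device.type == "cuda":
        torch.cuda.synchronize()
    end = time.time()
    print(description, "items", total, "time", round(end - start, 3))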

# Time unpinned loader first for comparison.
time_loader(loader_unpinned, "Unpinned loader")

# Time pinned loader to observe potential speedup.
time_loader(loader_pinned, "Pinned loader")

# Explicitly clear caches when using CUDA devices.
if device.type == "cuda":
    torch.cuda.empty_cache()
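
One common source of hidden references is logging. The sketch below, which assumes a toy linear model and random data, contrasts storing loss tensors (each keeps its autograd graph reachable) with accumulating plain Python floats via `.item()`.

```python
import torch

torch.manual_seed(0)
toy_model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()

retained = []        # anti-pattern: keeps tensors and their graphs alive
running_loss = 0.0   # preferred: a plain Python float

for _ in range(3):
    inputs = torch.randn(8, 4)
    targets = torch.randn(8, 1)
    loss = loss_fn(toy_model(inputs), targets)

    retained.append(loss)          # each entry still references its graph
    running_loss += loss.item()    # copies only the scalar value out

    # Drop the per-batch names so nothing else pins the batch.
    del inputs, targets, loss

print("Stored tensors with a graph:", sum(t.grad_fn is not None for t in retained))
print("Running loss type:", type(running_loss).__name__)
```

In a real loop you would keep only the `running_loss` form; the `retained` list is there to show what to avoid.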




## **3. Using Built-in Datasets**

### **3.1. Working With torchvision**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_03_01.jpg?v=1769670159" width="250">



>* Torchvision offers ready-to-use image datasets quickly
>* Lets you test models without manual data handling

>* Torchvision datasets support flexible image transformations
>* Easily test different preprocessing without changing training code

>* Torchvision includes complex, richly annotated vision datasets
>* Consistent interfaces ease scaling from toy to real projects



In [None]:
#@title Python Code - Working With torchvision

# This script shows basic torchvision dataset usage.
# It focuses on small image classification examples.
# You can run everything directly inside Colab.

# !pip install torch torchvision torchaudio

# Import standard libraries for reproducibility.
import os
import random
import math

# Import torch and torchvision core modules.
import torch
import torchvision
from torchvision import datasets
from torchvision import transforms

# Set a deterministic random seed value.
seed_value = 42
random.seed(seed_value)

# Set seeds for torch random number generators.
torch.manual_seed(seed_value)

# Detect device type for potential later use.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print torch and torchvision versions briefly.
print("torch version:", torch.__version__)
print("torchvision version:", torchvision.__version__)

# Define a simple transform pipeline for images.
transform_pipeline = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Choose a small root directory for dataset files.
data_root = os.path.join(".", "data_torchvision_demo")

# Create the MNIST training dataset with transforms.
mnist_train = datasets.MNIST(
    root=data_root,
    train=True,
    download=True,
    transform=transform_pipeline,
)

# Verify dataset length is positive and reasonable.
assert len(mnist_train) > 0

# Inspect one sample to understand shapes.
sample_image, sample_label = mnist_train[0]

# Validate the image tensor shape for safety.
assert sample_image.ndim == 3
assert sample_image.shape[0] == 1

# Print a short summary about one sample.
print("Single sample shape:", tuple(sample_image.shape))
print("Single sample label:", int(sample_label))

# Create a DataLoader with small batch size.
train_loader = torch.utils.data.DataLoader(
    mnist_train,
    batch_size=16,
    shuffle=True,
    num_workers=0,
)

# Get one batch from the DataLoader iterator.
first_batch_images, first_batch_labels = next(iter(train_loader))

# Move batch to the selected device safely.
first_batch_images = first_batch_images.to(device)
first_batch_labels = first_batch_labels.to(device)

# Validate batch dimensions before further processing.
assert first_batch_images.shape[0] == 16
assert first_batch_images.shape[1] == 1

# Print concise information about the loaded batch.
print("Batch images shape:", tuple(first_batch_images.shape))
print("Batch labels shape:", tuple(first_batch_labels.shape))

# Show a few labels to confirm shuffling behavior.
print("First eight labels:", first_batch_labels[:8].tolist())




### **3.2. Working With Torchtext**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_03_02.jpg?v=1769670249" width="250">



>* Torchtext offers ready-made datasets for NLP tasks
>* Enables quick experiments and fair benchmark comparisons

>* Torchtext offers raw text plus basic preprocessing
>* You can customize preprocessing while interface stays consistent

>* Torchtext promotes reusable text processing components
>* Build intuition to adapt datasets and custom corpora
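
The reusable pieces mentioned above (tokenizing, vocabulary lookup, padding) can be sketched in plain PyTorch. This example deliberately does not import torchtext, since its availability for a given PyTorch version is something to check in your environment; the toy corpus, `build_vocab`, `TextClassificationDataset`, and `pad_collate` are all hypothetical names for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

# Hypothetical toy corpus of (text, label) pairs.
corpus = [("good movie", 1), ("bad plot and bad acting", 0), ("good acting", 1)]


def build_vocab(texts):
    """Map each whitespace token to an integer id; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


class TextClassificationDataset(Dataset):
    """Converts raw text to a tensor of token ids on access."""

    def __init__(self, samples, vocab):
        self.samples = list(samples)
        self.vocab = vocab

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int):
        text, label = self.samples[index]
        unk = self.vocab["<unk>"]
        ids = [self.vocab.get(tok, unk) for tok in text.lower().split()]
        return torch.tensor(ids, dtype=torch.long), label


def pad_collate(batch):
    """Pad variable-length id sequences to the longest one in the batch."""
    sequences, labels = zip(*batch)
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels, dtype=torch.long)


vocab = build_vocab(text for text, _ in corpus)
text_ds = TextClassificationDataset(corpus, vocab)
text_loader = DataLoader(text_ds, batch_size=3, shuffle=False, collate_fn=pad_collate)

token_ids, text_labels = next(iter(text_loader))
print("Token id batch shape:", tuple(token_ids.shape))
print("Labels:", text_labels.tolist())
```

A ready-made text dataset would replace `corpus`, while the vocabulary and collate steps stay the same.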



### **3.3. Dataset Download And Cache**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_03_03.jpg?v=1769670411" width="250">



>* Datasets download from remote servers on first use
>* Later runs use cached data, avoiding network issues

>* Libraries cache datasets locally, downloading only when missing
>* Shared or persistent caches save time and bandwidth across sessions

>* Monitor cache size and remove unused datasets
>* Share cache locations and avoid deleting active data



In [None]:
#@title Python Code - Dataset Download And Cache

# This script shows dataset download caching behavior.
# It uses TensorFlow builtin MNIST dataset safely.
# Focus is on download location and reuse.

# !pip install tensorflow

# Import required standard libraries.
import os
import pathlib
import random

# Import numpy for simple array inspection.
import numpy as np

# Import tensorflow and its datasets module.
import tensorflow as tf
from tensorflow import keras

# Set deterministic random seeds for reproducibility.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version in one concise line.
print("TensorFlow version:", tf.__version__)

# Define a small helper to describe arrays.
def describe_array(name, array):
    # Print array name, shape and data type.
    print(f"{name} shape {array.shape} dtype {array.dtype}")

# Choose a cache directory under the user home.
home_dir = pathlib.Path.home()
cache_dir = home_dir / "tf_mnist_cache_demo"

# Create the cache directory if it does not exist.
cache_dir.mkdir(parents=True, exist_ok=True)

# Show the chosen cache directory path.
print("Cache directory:", str(cache_dir))

# Build the full path for the MNIST npz file.
mnist_path = cache_dir / "mnist.npz"

# Check whether the dataset file already exists.
first_time_download = not mnist_path.exists()

# Print a short message about expected behavior.
if first_time_download:
    print("MNIST not cached yet, download will occur.")
else:
    print("MNIST already cached, load will be fast.")

# Load MNIST using keras with an explicit absolute path.
# A relative path would resolve under the Keras cache directory instead.
(train_images, train_labels), (test_images, test_labels) = (
    keras.datasets.mnist.load_data(path=str(mnist_path))
)

# Confirm whether file now exists after loading.
file_now_exists = mnist_path.exists()
print("File exists after load:", file_now_exists)

# Describe the loaded training and test arrays.
describe_array("train_images", train_images)
describe_array("train_labels", train_labels)

describe_array("test_images", test_images)
describe_array("test_labels", test_labels)

# Show a tiny sample of pixel values for intuition.
print("First image first row:", train_images[0, 0, :10])
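
The download-only-when-missing behavior described above can be sketched with the standard library alone. Here `ensure_cached` and `fake_fetch` are hypothetical helpers, and `fake_fetch` stands in for a real network download so the example runs offline.

```python
import pathlib
import tempfile


def ensure_cached(path, fetch):
    """Return (path, downloaded); call fetch only when the file is missing."""
    path = pathlib.Path(path)
    if path.exists():
        return path, False
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a failed download leaves no partial file.
    partial = path.with_name(path.name + ".part")
    partial.write_bytes(fetch())
    partial.replace(path)
    return path, True


def fake_fetch():
    # Hypothetical stand-in for a network download.
    return b"pretend these bytes are a dataset archive"


with tempfile.TemporaryDirectory() as cache_root:
    target = pathlib.Path(cache_root) / "demo" / "dataset.bin"
    _, first_downloaded = ensure_cached(target, fake_fetch)
    _, second_downloaded = ensure_cached(target, fake_fetch)

print("First call downloaded: ", first_downloaded)    # True
print("Second call downloaded:", second_downloaded)   # False
```

Library loaders generally follow a similar check-then-fetch pattern, which is why pointing them at a persistent directory avoids repeated downloads.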



# <font color="#418FDE" size="6.5" uppercase>**Datasets and Loaders**</font>


In this lecture, you learned to:
- Implement custom PyTorch Dataset classes that load and preprocess samples on demand. 
- Configure DataLoader instances with appropriate batch sizes, shuffling, and multiprocessing workers. 
- Use built‑in datasets from torchvision and torchtext as quick starting points for experiments. 

In the next lecture (Lecture B), we will go over 'Transforms and Augment'.