# <font color="#418FDE" size="6.5" uppercase>**Datasets and Loaders**</font>

>Last update: 20260129.
    
By the end of this lecture, you will be able to:
- Implement custom PyTorch Dataset classes that load and preprocess samples on demand. 
- Configure DataLoader instances with appropriate batch sizes, shuffling, and multiprocessing workers. 
- Use built‑in datasets from torchvision and torchtext as quick starting points for experiments. 


## **1. Custom Dataset Design**

### **1.1. Map and Iterable Datasets**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_01_01.jpg?v=1769709136" width="250">



>* Map datasets give random index-based access
>* Great for fixed pairs, shuffling, and splitting

>* Iterable datasets stream samples sequentially, without random indexing
>* Ideal for large, live, or unbounded data flows

>* Map datasets support length, indexing, easy shuffling
>* Iterable datasets favor streaming, scalability, flexible pipelines



In [None]:
#@title Python Code - Map and Iterable Datasets

# This script compares map and iterable datasets.
# It uses tiny synthetic data for clarity.
# Run cells sequentially to follow the explanation.

# Optional install for PyTorch if missing.
# !pip install torch torchvision --quiet

# Import standard libraries for randomness control.
import random
import math
import os

# Import torch and dataset utilities.
import torch
from torch.utils.data import Dataset
from torch.utils.data import IterableDataset
from torch.utils.data import DataLoader

# Set deterministic random seeds.
random.seed(0)
torch.manual_seed(0)

# Print torch version in one short line.
print("Torch version:", torch.__version__)

# Define a simple map style dataset.
class SquareMapDataset(Dataset):
    # Initialize with a fixed maximum integer.
    def __init__(self, max_n: int = 10):
        self.max_n = max_n

    # Return dataset length for indexing support.
    def __len__(self) -> int:
        return self.max_n

    # Get one item by integer index.
    def __getitem__(self, index: int):
        if index < 0 or index >= self.max_n:
            raise IndexError("Index out of range")
        x = torch.tensor(float(index))
        y = x ** 2
        return x, y

# Create a map dataset instance.
map_dataset = SquareMapDataset(max_n=8)

# Access a few items directly by index.
print("Map item 0:", map_dataset[0])
print("Map item 5:", map_dataset[5])

# Create a DataLoader for the map dataset.
map_loader = DataLoader(
    map_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,
)

# Fetch one shuffled batch from the map loader.
for batch_x, batch_y in map_loader:
    print("Map batch shapes:", batch_x.shape, batch_y.shape)
    break

# Define a simple iterable style dataset.
class SquareIterableDataset(IterableDataset):
    # Initialize with a start and stop range.
    def __init__(self, start: int = 0, stop: int = 8):
        super().__init__()
        self.start = start
        self.stop = stop

    # Implement the iterator that yields samples.
    def __iter__(self):
        for n in range(self.start, self.stop):
            x = torch.tensor(float(n))
            y = x ** 2
            yield x, y

# Create an iterable dataset instance.
iter_dataset = SquareIterableDataset(start=0, stop=8)

# Create a DataLoader for the iterable dataset.
iter_loader = DataLoader(
    iter_dataset,
    batch_size=4,
    shuffle=False,
    num_workers=0,
)

# Fetch one sequential batch from the iterable loader.
for batch_x, batch_y in iter_loader:
    print("Iter batch shapes:", batch_x.shape, batch_y.shape)
    break

# Summarize key behavioral differences.
print("Map has length:", len(map_dataset), "and supports indexing.")
print("Iterable has no __len__ or indexing; samples arrive in stream order.")
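
One practical consequence of streaming: when an iterable dataset is read by several DataLoader workers, each worker runs its own copy of `__iter__`, so samples are duplicated unless the stream is split. The sketch below shows one way to split it, using `torch.utils.data.get_worker_info()` and a simple strided shard. The `ShardedSquares` class and the `shard_range` helper are hypothetical names chosen for illustration.

```python
import torch
from torch.utils.data import IterableDataset, get_worker_info


def shard_range(start, stop, worker_id, num_workers):
    """Return the strided slice of [start, stop) owned by one worker."""
    return range(start + worker_id, stop, num_workers)


class ShardedSquares(IterableDataset):
    """Hypothetical iterable dataset that splits its range across workers."""

    def __init__(self, start=0, stop=8):
        super().__init__()
        self.start = start
        self.stop = stop

    def __iter__(self):
        info = get_worker_info()  # None when loading in the main process
        worker_id = 0 if info is None else info.id
        num_workers = 1 if info is None else info.num_workers
        for n in shard_range(self.start, self.stop, worker_id, num_workers):
            x = torch.tensor(float(n))
            yield x, x ** 2


# In the main process (num_workers=0) the dataset yields every sample once.
values = [int(x.item()) for x, _ in ShardedSquares(0, 8)]
print(values)

# Two workers would each see a disjoint strided shard of the same range.
print(list(shard_range(0, 8, 0, 2)), list(shard_range(0, 8, 1, 2)))
```

Strided sharding is only one option; contiguous blocks per worker work equally well when sample order within a shard matters.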




### **1.2. Core Dataset Methods**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_01_02.jpg?v=1769709186" width="250">



>* Initializer sets config, indexes and prepares data
>* Store lightweight metadata, avoid loading everything upfront

>* Length returns total number of indexable samples
>* Accurate length ensures correct batching and metrics

>* Sample method loads, preprocesses, structures each example
>* Must be efficient, deterministic, and support parallel workers



In [None]:
#@title Python Code - Core Dataset Methods

# This script demonstrates core dataset methods.
# It focuses on custom PyTorch style dataset design.
# We simulate behavior using simple Python structures.

# No extra installs are required for this script.
# All used modules are available in standard library.

# Import typing tools for clearer type hints.
from typing import List, Tuple, Any

# Import random and set a deterministic seed.
import random

# Set deterministic seed for reproducible sampling.
random.seed(42)

# Define a simple custom dataset class structure.
class TinyNumberDataset:

    # Initialize with raw numbers and optional transform.
    def __init__(self, numbers: List[int], transform=None):
        # Store the raw list of numbers internally.
        self.numbers: List[int] = numbers
        
        # Store an optional transform for each sample.
        self.transform = transform
        
        # Precompute lightweight metadata for teaching.
        self.length: int = len(self.numbers)

    # Implement the length method for dataset size.
    def __len__(self) -> int:
        # Return how many samples can be indexed.
        return self.length

    # Implement the sample retrieval method by index.
    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        # Validate index range defensively for safety.
        if index < 0 or index >= self.length:
            raise IndexError("Index out of range for dataset")
        
        # Get the raw input value from stored list.
        x_value: int = self.numbers[index]
        
        # Define a simple target as even or odd label.
        y_label: int = 1 if (x_value % 2 == 0) else 0
        
        # Apply optional transform to the input value.
        if self.transform is not None:
            x_value = self.transform(x_value)
        
        # Return input and target pair as a tuple.
        return x_value, y_label

# Define a tiny transform that scales numbers down.
def divide_by_ten(value: int) -> float:
    # Convert the integer to float and divide by ten.
    return float(value) / 10.0

# Create a small list of example integer values.
raw_numbers: List[int] = list(range(10))

# Instantiate dataset without transform for comparison.
plain_dataset = TinyNumberDataset(numbers=raw_numbers, transform=None)

# Instantiate dataset with scaling transform applied.
scaled_dataset = TinyNumberDataset(numbers=raw_numbers, transform=divide_by_ten)

# Print dataset lengths to show __len__ behavior.
print("Plain dataset length:", len(plain_dataset))

# Print scaled dataset length for confirmation.
print("Scaled dataset length:", len(scaled_dataset))

# Retrieve a few samples to show __getitem__ logic.
for idx in range(3):
    # Get plain sample pair from first dataset.
    x_plain, y_plain = plain_dataset[idx]
    
    # Get scaled sample pair from second dataset.
    x_scaled, y_scaled = scaled_dataset[idx]
    
    # Print both versions for side by side comparison.
    print(
        "Index", idx,
        "plain:", (x_plain, y_plain),
        "scaled:", (x_scaled, y_scaled),
    )

# Confirm that invalid index access raises an IndexError.
try:
    plain_dataset[len(plain_dataset)]
except IndexError as error:
    print("Index check works:", error)
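
The advice to store lightweight metadata and avoid loading everything upfront can be sketched with a dataset that records only file paths in its initializer and reads each file when it is indexed. The `LazyTextFileDataset` name, the word-count target, and the temporary files are illustrative assumptions.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple


class LazyTextFileDataset:
    """Hypothetical dataset: indexes paths up front, reads files on demand."""

    def __init__(self, root: str, transform: Optional[Callable[[str], str]] = None):
        # Only the sorted list of paths is kept in memory.
        self.paths: List[Path] = sorted(Path(root).glob("*.txt"))
        self.transform = transform

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int) -> Tuple[str, int]:
        path = self.paths[index]  # a list raises IndexError when out of range
        text = path.read_text(encoding="utf-8")  # the file is read only now
        label = len(text.split())  # toy target: number of words
        if self.transform is not None:
            text = self.transform(text)
        return text, label


# Build three tiny files in a temporary directory to exercise the class.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"sample_{i}.txt").write_text("word " * (i + 1), encoding="utf-8")
    lazy_ds = LazyTextFileDataset(tmp, transform=str.upper)
    n_items = len(lazy_ds)
    first = lazy_ds[0]
    last = lazy_ds[2]
    print(n_items, first, last)
```

Memory use in the initializer grows with the number of files, not with their contents, which is what keeps this pattern practical for large datasets.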




### **1.3. On the fly transforms**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_01_03.jpg?v=1769709240" width="250">



>* Apply preprocessing only when samples are requested
>* Store one raw dataset; transform dynamically during loading

>* Dynamic transforms enable fast, flexible experimentation
>* Separate raw data from preprocessing to protect sources

>* Random per-epoch transforms reduce model overfitting
>* Parallel workers apply just-in-time transforms efficiently



In [None]:
#@title Python Code - On the fly transforms

# This script shows on the fly transforms.
# It uses TensorFlow's tf.data API rather than PyTorch.
# We build a tiny custom dataset example.
# All transforms run when samples are requested.

# !pip install tensorflow==2.20.0

# Import standard library modules for reproducibility.
import random, os, math

# Import TensorFlow and NumPy for tensors.
import tensorflow as tf
import numpy as np

# Set deterministic random seeds for reproducibility.
random.seed(42)
np.random.seed(42)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a tiny synthetic numeric dataset.
base_values = np.arange(1, 11, dtype=np.float32)

# Define a simple on the fly transform class.
class OnTheFlyTransform:
    # Initialize with scale and noise parameters.
    def __init__(self, scale=1.0, noise_std=0.0):
        self.scale = float(scale)
        self.noise_std = float(noise_std)

    # Apply scaling and optional Gaussian noise.
    def __call__(self, x, training=False):
        x = float(x) * self.scale
        if training and self.noise_std > 0.0:
            noise = np.random.normal(0.0, self.noise_std)
            x = x + float(noise)
        return x

# Define a custom dataset that uses the transform.
class CustomOnTheFlyDataset(tf.data.Dataset):
    # Required _inputs method for TensorFlow subclassing.
    def _inputs(self):
        return []

    # Create dataset from tensor slices and map transform.
    def __new__(cls, values, transform, training=False):
        values = np.asarray(values, dtype=np.float32)
        assert values.ndim == 1
        base_ds = tf.data.Dataset.from_tensor_slices(values)

        # Wrap transform inside a TensorFlow py_function.
        def map_fn(x):
            y = tf.py_function(
                func=lambda v: transform(v.numpy(), training=training),
                inp=[x], Tout=tf.float32,
            )
            y.set_shape(())
            return x, y

        mapped = base_ds.map(map_fn)
        return mapped

# Instantiate one transform for training with noise.
train_transform = OnTheFlyTransform(scale=2.0, noise_std=0.5)

# Instantiate another transform for evaluation without noise.
eval_transform = OnTheFlyTransform(scale=2.0, noise_std=0.0)

# Build the training dataset with the noisy transform.
train_ds = CustomOnTheFlyDataset(
    values=base_values, transform=train_transform, training=True,
)

# Shuffle and batch the training dataset.
train_ds = train_ds.shuffle(buffer_size=10, seed=42).batch(4)

# Build evaluation dataset without shuffling or noise.
eval_ds = CustomOnTheFlyDataset(
    values=base_values, transform=eval_transform, training=False,
)

# Batch the evaluation dataset for efficient iteration.
eval_ds = eval_ds.batch(4)

# Show a few transformed training batches.
print("\nTraining batches with noisy on the fly transforms:")
for step, (x_batch, y_batch) in enumerate(train_ds.take(2)):
    print("Step", int(step), "x:", x_batch.numpy(), "y:", y_batch.numpy())

# Show a few transformed evaluation batches.
print("\nEvaluation batches with deterministic transforms:")
for step, (x_batch, y_batch) in enumerate(eval_ds.take(2)):
    print("Step", int(step), "x:", x_batch.numpy(), "y:", y_batch.numpy())
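
The cell above uses TensorFlow's `tf.data`. Because this lecture targets PyTorch, the same pattern can also be sketched with a `torch.utils.data.Dataset` that applies its transform inside `__getitem__`. The `ScaleAndNoise` and `OnTheFlyDataset` names are illustrative assumptions, not library classes.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ScaleAndNoise:
    """Hypothetical transform: scale the input and optionally add Gaussian noise."""

    def __init__(self, scale=1.0, noise_std=0.0):
        self.scale = scale
        self.noise_std = noise_std

    def __call__(self, x):
        x = x * self.scale
        if self.noise_std > 0.0:
            x = x + torch.randn_like(x) * self.noise_std
        return x


class OnTheFlyDataset(Dataset):
    """Keeps raw values untouched and transforms each sample when requested."""

    def __init__(self, values, transform=None):
        self.values = torch.as_tensor(values, dtype=torch.float32)
        self.transform = transform

    def __len__(self):
        return self.values.shape[0]

    def __getitem__(self, index):
        raw = self.values[index]
        out = raw if self.transform is None else self.transform(raw)
        return raw, out


torch.manual_seed(42)
raw_values = list(range(1, 11))
train_set = OnTheFlyDataset(raw_values, ScaleAndNoise(scale=2.0, noise_std=0.5))
eval_set = OnTheFlyDataset(raw_values, ScaleAndNoise(scale=2.0))

# The same raw sample gets fresh noise on every access in the training set.
print("train, two reads of item 0:", train_set[0][1].item(), train_set[0][1].item())

# The evaluation set is deterministic: each output is exactly twice its input.
eval_x, eval_y = next(iter(DataLoader(eval_set, batch_size=4)))
print("eval batch:", eval_x.tolist(), eval_y.tolist())
```

Using two transform objects, one noisy and one not, plays the role of the `training` flag in the TensorFlow version.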




## **2. Efficient DataLoader Setup**

### **2.1. Smart Batching Strategies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_02_01.jpg?v=1769709305" width="250">



>* Batch size controls hardware use and stability
>* Start small, increase until near memory limits

>* Group similar shapes or lengths when batching
>* Reduces padding, saves compute, stabilizes training

>* Shuffling makes batches diverse and improves generalization
>* Balance randomness with structure, monitor performance metrics



In [None]:
#@title Python Code - Smart Batching Strategies

# This script demonstrates smart batching strategies.
# We compare different batch sizes and shuffling options.
# The focus is on efficient DataLoader configuration.

# !pip install torch torchvision

# Import required standard libraries.
import random
import math
import os

# Import torch for tensors and dataloaders.
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Set deterministic random seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Select device if GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a tiny synthetic variable length dataset.
class VariableLengthDataset(Dataset):
    # Initialize dataset and fix one random length per sample.
    def __init__(self, num_samples, min_len, max_len):
        self.num_samples = num_samples
        self.min_len = min_len
        self.max_len = max_len
        # Draw lengths once so every access to a sample agrees on its length.
        self.lengths = [
            random.randint(min_len, max_len) for _ in range(num_samples)
        ]

    # Return dataset size for DataLoader.
    def __len__(self):
        return self.num_samples

    # Return the sample together with its precomputed length.
    def __getitem__(self, idx):
        length = self.lengths[idx]
        data = torch.ones(length, dtype=torch.float32) * idx
        return data, length

# Create a small dataset instance.
dataset = VariableLengthDataset(num_samples=20, min_len=4, max_len=12)

# Define a simple padding collate function.
def pad_collate(batch):
    # Unpack sequences and lengths from batch.
    sequences, lengths = zip(*batch)
    max_len = max(lengths)

    # Pad sequences to the same length.
    padded = []
    for seq in sequences:
        pad_size = max_len - seq.size(0)
        if pad_size > 0:
            pad_tensor = torch.zeros(pad_size, dtype=seq.dtype)
            seq = torch.cat((seq, pad_tensor), dim=0)
        padded.append(seq)

    # Stack padded sequences into batch tensor.
    batch_tensor = torch.stack(padded, dim=0)
    lengths_tensor = torch.tensor(lengths, dtype=torch.int64)
    return batch_tensor.to(device), lengths_tensor.to(device)

# Define a simple bucketing sampler by length.
def make_length_buckets(dataset, bucket_size):
    # Collect indices and their sequence lengths.
    lengths = []
    for idx in range(len(dataset)):
        _, length = dataset[idx]
        lengths.append((idx, length))

    # Sort indices by sequence length.
    lengths.sort(key=lambda x: x[1])

    # Group indices into buckets of similar lengths.
    buckets = []
    for i in range(0, len(lengths), bucket_size):
        bucket = [idx for idx, _ in lengths[i:i + bucket_size]]
        buckets.append(bucket)

    # Shuffle buckets but keep order inside buckets.
    random.shuffle(buckets)
    ordered_indices = [idx for bucket in buckets for idx in bucket]
    return ordered_indices

# Create three DataLoaders with different strategies.
small_batch_loader = DataLoader(dataset=dataset,
                                batch_size=2,
                                shuffle=True,
                                collate_fn=pad_collate,
                                num_workers=0)

# Configure a larger batch size with shuffling.
large_batch_loader = DataLoader(dataset=dataset,
                                batch_size=8,
                                shuffle=True,
                                collate_fn=pad_collate,
                                num_workers=0)

# Build a DataLoader using length based bucketing.
bucket_indices = make_length_buckets(dataset, bucket_size=4)

# Use the ordered index list directly as the sampler.
# SubsetRandomSampler would reshuffle the indices and break up the buckets.
bucket_sampler = bucket_indices

# DataLoader that uses smart length based batching.
bucket_loader = DataLoader(dataset=dataset,
                           batch_size=4,
                           sampler=bucket_sampler,
                           collate_fn=pad_collate,
                           num_workers=0)

# Helper to inspect one epoch of a loader.
def inspect_loader(name, loader, max_batches):
    print(f"\n{name}:")
    for batch_idx, (batch_data, lengths) in enumerate(loader):
        if batch_idx >= max_batches:
            break
        shape = tuple(batch_data.shape)
        min_len = int(lengths.min().item())
        max_len = int(lengths.max().item())
        print(f" batch {batch_idx}: shape={shape}, lens={min_len}-{max_len}")

# Print framework version and device information.
print(f"PyTorch version: {torch.__version__}, device: {device}")

# Inspect a few batches from each DataLoader.
inspect_loader("Small batch with shuffle", small_batch_loader, max_batches=3)
inspect_loader("Large batch with shuffle", large_batch_loader, max_batches=2)
inspect_loader("Bucketed smart batching", bucket_loader, max_batches=3)
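
The padding and bucketing steps above can also be written with PyTorch's built-in `torch.nn.utils.rnn.pad_sequence` and the DataLoader's `batch_sampler` argument, which accepts an iterable of index lists. This is a minimal sketch, assuming a hypothetical `FixedLengthSequences` dataset whose lengths are known in advance; `bucket_batches` and `collate` are illustrative helper names.

```python
import random

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class FixedLengthSequences(Dataset):
    """Hypothetical dataset: sample i is a vector of ones with a stored length."""

    def __init__(self, lengths):
        self.lengths = list(lengths)

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        return torch.ones(self.lengths[idx])


def bucket_batches(lengths, batch_size, seed=0):
    """Sort indices by length, cut into batches, then shuffle the batch order."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


def collate(batch):
    """Pad a list of 1-D tensors to the longest one in the batch."""
    lengths = torch.tensor([seq.size(0) for seq in batch])
    return pad_sequence(batch, batch_first=True, padding_value=0.0), lengths


seq_lengths = [12, 4, 9, 5, 11, 4, 10, 6]
seq_dataset = FixedLengthSequences(seq_lengths)
seq_loader = DataLoader(
    seq_dataset,
    batch_sampler=bucket_batches(seq_lengths, batch_size=4),
    collate_fn=collate,
)

# Count how many padded slots each batch wastes.
padding_per_batch = []
for padded, lengths in seq_loader:
    wasted = int(padded.numel() - lengths.sum())
    padding_per_batch.append(wasted)
    print(tuple(padded.shape), lengths.tolist(), "padded slots:", wasted)
```

A plain list of batches gives the same batch order every epoch, so in a real training loop the batches would typically be regenerated with a new seed at the start of each epoch.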




### **2.2. Tuning DataLoader Workers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_02_02.jpg?v=1769709383" width="250">



>* More DataLoader workers prepare batches in parallel
>* Too many workers hurt performance; tune carefully

>* Match worker count to pipeline bottlenecks, hardware
>* Use more workers for heavy per-sample preprocessing, fewer for light

>* Too many workers strain memory and bandwidth
>* Start modest, monitor metrics, adjust workers per setup



In [None]:
#@title Python Code - Tuning DataLoader Workers

# This script shows DataLoader worker tuning.
# It runs on the GPU when available, otherwise on the CPU.
# Output is short and easy to read.

# !pip install torch torchvision

# Import required standard libraries.
import os
import time
import random

# Import torch and torchvision modules.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision import transforms

# Set deterministic random seeds everywhere.
seed_value = 42
random.seed(seed_value)
torch.manual_seed(seed_value)

# Select device preferring GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple transform converting images.
transform = transforms.Compose([
    transforms.ToTensor(),
])

# Download the full MNIST training set (a subset is taken below).
full_train = datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)

# Keep only a small subset for speed.
subset_size = 2048
if subset_size > len(full_train):
    subset_size = len(full_train)
indices = list(range(subset_size))
train_subset = torch.utils.data.Subset(full_train, indices)

# Confirm subset length is as expected.
assert len(train_subset) == subset_size

# Define a tiny model for quick training.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
).to(device)

# Define loss function and optimizer objects.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Helper function to time one training epoch.
def run_one_epoch(num_workers: int) -> float:
    # Create DataLoader with given worker count.
    loader = DataLoader(
        train_subset,
        batch_size=64,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )

    # Record start time for this epoch.
    start_time = time.time()

    # Iterate once over the DataLoader.
    for batch_idx, (images, labels) in enumerate(loader):
        # Move data to selected device.
        images = images.to(device)
        labels = labels.to(device)

        # Validate batch shapes defensively.
        assert images.ndim == 4 and labels.ndim == 1

        # Forward pass through the model.
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass and parameter update.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

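    # Wait for queued GPU work so the timing is meaningful.
    if device.type == "cuda":
        torch.cuda.synchronize()

    # Compute elapsed time in seconds.
    elapsed = time.time() - start_time
    return float(elapsed)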

# Print framework version and device used.
print("Torch version:", torch.__version__, "Device:", device)

# Define worker settings to compare quickly.
worker_settings = [0, 1, 2, 4]
results = []

# Run one epoch for each worker setting.
for workers in worker_settings:
    # Time a single short epoch run.
    elapsed = run_one_epoch(workers)
    results.append((workers, elapsed))

# Print a compact summary of timings.
print("\nEpoch time by num_workers:")
for workers, elapsed in results:
    print(f"workers={workers}: {elapsed:.3f} seconds")

# Print a short interpretation for beginners.
print("\nUse these timings to choose efficient workers.")
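
Beyond `num_workers`, two related DataLoader options affect worker overhead: `persistent_workers` keeps worker processes alive between epochs, and `prefetch_factor` sets how many batches each worker loads ahead. Both apply only when `num_workers > 0`, so a small helper can add them conditionally. The `build_loader` helper and the toy tensors below are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, num_workers, batch_size=64, seed=0):
    """Hypothetical helper: add worker-only options when workers are used."""
    generator = torch.Generator()
    generator.manual_seed(seed)  # makes the shuffle order reproducible
    kwargs = dict(
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        generator=generator,
    )
    if num_workers > 0:
        kwargs["persistent_workers"] = True  # reuse workers across epochs
        kwargs["prefetch_factor"] = 2  # batches loaded ahead per worker
    return DataLoader(dataset, **kwargs)


toy_data = TensorDataset(
    torch.arange(256).float().unsqueeze(1),
    torch.arange(256) % 2,
)

# Same seed, same shuffle order; num_workers=0 keeps loading in the main process.
first_a = next(iter(build_loader(toy_data, num_workers=0)))[0]
first_b = next(iter(build_loader(toy_data, num_workers=0)))[0]
print(first_a.shape, torch.equal(first_a, first_b))
```

Persistent workers matter most when epochs are short, because the cost of starting worker processes is then a larger share of each epoch.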




### **2.3. Memory Pinning and Dropping**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_02_03.jpg?v=1769709457" width="250">



>* Pinned memory keeps data in fixed locations
>* Faster GPU transfers keep training hardware fully utilized

>* Pinned memory is a powerful but limited resource
>* Use moderate batch sizes to avoid system instability

>* Treat each batch as temporary, quickly releasable data
>* Drop unused tensors promptly to prevent memory buildup



In [None]:
#@title Python Code - Memory Pinning and Dropping

# This script shows basic memory pinning usage.
# It compares pinned and non pinned DataLoader behavior.
# It also demonstrates dropping batches after usage.

# Uncomment to install PyTorch if needed in Colab.
# !pip install torch torchvision

# Import required standard libraries.
import os
import random
import time

# Import torch and its data utilities.
import torch
from torch import nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Set deterministic random seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print framework version and selected device.
print("Torch version:", torch.__version__, "Device:", device)

# Define a tiny synthetic dataset class.
class TinyDataset(Dataset):
    # Initialize dataset with fixed length and shape.
    def __init__(self, length: int = 256, features: int = 128):
        self.length = length
        self.features = features

    # Return dataset length for DataLoader.
    def __len__(self) -> int:
        return self.length

    # Generate one sample on demand.
    def __getitem__(self, index: int):
        x = torch.randn(self.features, dtype=torch.float32)
        y = torch.tensor(index % 2, dtype=torch.long)
        return x, y

# Create a small dataset instance.
dataset = TinyDataset(length=512, features=128)

# Define a simple model to consume batches.
model = nn.Sequential(
    nn.Linear(128, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
).to(device)

# Define a helper to time one epoch iteration.
def run_epoch(dataloader: DataLoader, label: str) -> None:
    # Ensure model is in evaluation mode.
    model.eval()
    start = time.time()
    batch_count = 0

    # Iterate over batches and move to device.
    for batch in dataloader:
        inputs, targets = batch
        assert inputs.ndim == 2 and targets.ndim == 1

        # Move tensors to device with non_blocking flag.
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        # Run a forward pass to use the batch.
        with torch.no_grad():
            outputs = model(inputs)

        # Drop references so memory can be reused.
        del inputs, targets, outputs
        batch_count += 1

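    # Wait for queued GPU work so the timing is meaningful.
    if device.type == "cuda":
        torch.cuda.synchronize()

    # Compute elapsed time for this dataloader.
    elapsed = time.time() - start
    print(label, "batches:", batch_count, "time:", round(elapsed, 4), "s")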

# Create a non pinned DataLoader for baseline.
loader_no_pin = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0,
    pin_memory=False,
)

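# Create a pinned DataLoader for GPU acceleration.
# Pinning only helps when a CUDA device is present, so enable it conditionally.
loader_pin = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0,
    pin_memory=torch.cuda.is_available(),
)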

# Run one epoch without pinned memory.
run_epoch(loader_no_pin, "No pin_memory")

# Run one epoch with pinned memory enabled.
run_epoch(loader_pin, "With pin_memory")

# Explicitly clear cached GPU memory if available.
if torch.cuda.is_available():
    torch.cuda.empty_cache()




## **3. Using Built-in Datasets**

### **3.1. Working With torchvision**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_03_01.jpg?v=1769709529" width="250">



>* Torchvision offers ready-to-use, well-labeled image datasets
>* Lets you focus on models and experiments

>* Datasets plug easily into PyTorch loaders
>* Built in transforms ensure fast, consistent preprocessing

>* Torchvision includes challenging, realistic image datasets
>* They enable fair comparison and faster research progress



In [None]:
#@title Python Code - Working With torchvision

# This script shows torchvision dataset usage basics.
# It focuses on MNIST images and simple transforms.
# Use it to understand datasets and loaders.

# !pip install torch torchvision

# Import standard libraries for reproducibility.
import os
import random
import math

# Import torch and torchvision core modules.
import torch
from torch.utils.data import DataLoader

# Import torchvision datasets and transforms utilities.
from torchvision import datasets
from torchvision import transforms

# Set deterministic random seeds for reproducibility.
SEED = 42
random.seed(SEED)

# Set seeds for torch random number generators.
torch.manual_seed(SEED)

# Detect device type for potential later use.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print torch and torchvision versions in one line.
import torchvision
print("Torch version:", torch.__version__, "Torchvision version:", torchvision.__version__)

# Define a simple transform pipeline for MNIST images.
transform_mnist = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Choose a small root directory for dataset storage.
data_root = os.path.join(".", "data_mnist")

# Create the MNIST training dataset with transforms.
train_dataset = datasets.MNIST(
    root=data_root,
    train=True,
    download=True,
    transform=transform_mnist,
)

# Verify dataset length to ensure it loaded correctly.
num_samples = len(train_dataset)
print("Total MNIST training samples:", num_samples)

# Select a small subset size for quick experiments.
subset_size = 256
subset_size = min(subset_size, num_samples)

# Create indices for the small subset deterministically.
subset_indices = list(range(subset_size))

# Wrap the original dataset with a subset view.
subset_dataset = torch.utils.data.Subset(train_dataset, subset_indices)

# Define DataLoader parameters for batching and shuffling.
batch_size = 32
num_workers = 0

# Create a DataLoader that shuffles each training epoch.
train_loader = DataLoader(
    subset_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
)

# Fetch one batch from the DataLoader using iterator.
first_batch = next(iter(train_loader))

# Unpack images and labels from the first batch.
images, labels = first_batch

# Move batch to the selected device for completeness.
images = images.to(device)
labels = labels.to(device)

# Validate image batch shape and label batch shape.
print("Batch images shape:", tuple(images.shape))

# Print label batch shape to confirm correct batching.
print("Batch labels shape:", tuple(labels.shape))

# Compute simple statistics on the first image tensor.
first_image = images[0]

# Ensure the image has expected channel and size.
print("Single image shape:", tuple(first_image.shape))

# Compute mean and standard deviation of pixel values.
img_mean = float(first_image.mean().item())
img_std = float(first_image.std().item())

# Print rounded statistics to understand normalization.
print("First image mean:", round(img_mean, 4))

# Print standard deviation for the same normalized image.
print("First image std:", round(img_std, 4))

# Show a few labels from the first batch as integers.
print("First five labels:", labels[:5].tolist())
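
The `Normalize((0.1307,), (0.3081,))` values in the cell above are the commonly quoted mean and standard deviation of the MNIST training pixels. For another dataset, such statistics can be computed in one pass over a loader. The sketch below uses random tensors as a stand-in for real images, and `channel_stats` is a hypothetical helper name.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def channel_stats(loader):
    """Accumulate per-channel mean and std over all pixels in a loader."""
    total = None
    total_sq = None
    count = 0
    for images, _ in loader:
        # images has shape (batch, channels, height, width).
        images = images.double()
        sums = images.sum(dim=(0, 2, 3))
        sq_sums = (images ** 2).sum(dim=(0, 2, 3))
        total = sums if total is None else total + sums
        total_sq = sq_sums if total_sq is None else total_sq + sq_sums
        count += images.numel() // images.shape[1]  # pixels per channel
    mean = total / count
    std = (total_sq / count - mean ** 2).sqrt()
    return mean, std


torch.manual_seed(0)
fake_images = torch.rand(512, 1, 28, 28)  # synthetic stand-in for real images
fake_labels = torch.zeros(512, dtype=torch.long)
stats_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=64)

mean, std = channel_stats(stats_loader)
print("mean:", mean.tolist(), "std:", std.tolist())
```

The resulting tuples can be passed to `transforms.Normalize`. Statistics should be computed on the training split only, so that evaluation data does not leak into preprocessing.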




### **3.2. Working With Torchtext**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_03_02.jpg?v=1769709599" width="250">



>* Torchtext offers ready-made datasets for many NLP tasks
>* Datasets share splits and formats, easing reuse

>* Each sample includes raw text and target
>* You control tokenization and preprocessing for experiments

>* Torchtext supports advanced sequence tasks like modeling, translation
>* Prototype, validate, then reuse patterns on proprietary text



In [None]:
#@title Python Code - Working With Torchtext

# This script shows basic torchtext dataset usage.
# It focuses on small text classification examples.
# You can run everything directly inside Colab.

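# torchtext is no longer actively developed; 0.18.0 was its final release,
# and it may not import against newer PyTorch builds.
# !pip install torchtext==0.18.0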

# Import standard libraries for reproducibility.
import os, random, math, textwrap

# Import torch for tensors and device handling.
import torch

# Try importing torchtext and handle a missing or incompatible install.
try:
    import torchtext
    from torchtext.datasets import AG_NEWS
except Exception:
    torchtext, AG_NEWS = None, None

# Set deterministic random seeds for reproducibility.
random.seed(0)

# Set torch manual seed for reproducible tensors.
torch.manual_seed(0)

# Detect device, prefer cuda if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a small helper to safely print wrapped text.
def print_wrapped(prefix, text, width):
    wrapped = textwrap.fill(text, width=width)
    print(f"{prefix}{wrapped}")

# Check that torchtext and AG_NEWS are available.
if torchtext is None or AG_NEWS is None:
    print("torchtext not available, please install first.")
else:

    # Print torch and torchtext versions briefly.
    print("Torch version:", torch.__version__)

    # Print torchtext version for reference.
    print("Torchtext version:", torchtext.__version__)

    # Open the AG_NEWS training split as an iterator.
    train_iter = AG_NEWS(split="train")

    # Take only the first eight samples instead of materializing the full split.
    from itertools import islice
    small_subset = list(islice(train_iter, 8))

    # Validate subset size before further processing.
    if len(small_subset) == 0:
        print("Dataset appears empty, cannot continue.")
    else:

        # Show basic information about one sample.
        label0, text0 = small_subset[0]

        # Print label and truncated text for first sample.
        print("First sample label:", int(label0))

        # Print first sample text with wrapping.
        print_wrapped("First sample text: ", text0[:200], 60)

        # Build a simple vocabulary from subset tokens.
        token_lists = []

        # Tokenize each text using simple whitespace.
        for label, text in small_subset:
            tokens = text.lower().split()
            token_lists.append(tokens)

        # Create mapping from token to index with padding.
        vocab = {"<pad>": 0}

        # Populate vocabulary with unique tokens.
        for tokens in token_lists:
            for tok in tokens:
                if tok not in vocab:
                    vocab[tok] = len(vocab)

        # Numericalize tokens into index sequences.
        indexed_sequences = []

        # Convert each token list into tensor of indices.
        for tokens in token_lists:
            idxs = [vocab[t] for t in tokens]
            indexed_sequences.append(torch.tensor(idxs, dtype=torch.long))

        # Determine maximum sequence length for padding.
        max_len = max(seq.size(0) for seq in indexed_sequences)

        # Pad sequences to same length with pad index.
        padded = []

        # Create padded tensor for each sequence.
        for seq in indexed_sequences:
            pad_size = max_len - seq.size(0)
            if pad_size > 0:
                pad_tensor = torch.full((pad_size,), 0, dtype=torch.long)
                seq = torch.cat((seq, pad_tensor), dim=0)
            padded.append(seq)

        # Stack padded sequences into batch tensor.
        batch_tensor = torch.stack(padded, dim=0)

        # Move batch tensor to selected device.
        batch_tensor = batch_tensor.to(device)

        # Print final batch shape as (num_sequences, max_len).
        print("Batch tensor shape:", tuple(batch_tensor.shape))

        # Print first row of indices as small example.
        print("First sequence indices:", batch_tensor[0, :10].tolist())

        # Print vocabulary size to summarize preprocessing.
        print("Vocabulary size:", len(vocab))
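
The manual padding loop above can also be written with PyTorch's `pad_sequence` utility. The following is a minimal sketch under stated assumptions: the toy `token_lists` are hypothetical stand-ins for the tokenized texts, and the `mask` tensor is an extra illustration that is not part of the original cell.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy data standing in for whitespace-tokenized texts.
token_lists = [["the", "cat", "sat"], ["hello"], ["the", "dog"]]

# Build a vocabulary with index 0 reserved for padding.
vocab = {"<pad>": 0}
for tokens in token_lists:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

# Numericalize each token list into a 1-D LongTensor.
sequences = [
    torch.tensor([vocab[t] for t in tokens], dtype=torch.long)
    for tokens in token_lists
]

# Pad to the longest sequence; batch_first=True gives (batch, max_len).
batch = pad_sequence(sequences, batch_first=True, padding_value=vocab["<pad>"])

# Boolean mask that marks real tokens (True) versus padding (False).
mask = batch != vocab["<pad>"]

print("Batch shape:", tuple(batch.shape))
print("Batch:", batch.tolist())
print("Mask:", mask.tolist())
```

Using `vocab["<pad>"]` rather than a literal `0` keeps the padding value tied to the vocabulary, so the two cannot drift apart if the padding index changes.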




### **3.3. Managing Dataset Downloads**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_04/Lecture_A/image_03_03.jpg?v=1769709687" width="250">



>* Built-in datasets download once, then are cached
>* Plan storage, bandwidth, and backups to avoid issues

>* Control whether datasets download automatically or not
>* Choose settings differently for shared and individual environments

>* Dataset updates can break comparisons and reproducibility
>* Record versions, locations, dates, and preprocessing steps
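
One way to record versions, locations, dates, and preprocessing steps is to write a small manifest file next to each cached dataset. The sketch below uses only the standard library. The helper names (`sha256_of`, `write_manifest`, `verify_manifest`), the manifest fields, and the toy file are all hypothetical choices for illustration, not part of any library API.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Hash the file in chunks so large archives do not fill memory.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_file: Path, source: str, preprocessing: list) -> Path:
    # Record where the file lives, where it came from, and its checksum.
    manifest = {
        "file": data_file.name,
        "location": str(data_file.resolve()),
        "source": source,
        "sha256": sha256_of(data_file),
        "size_bytes": data_file.stat().st_size,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "preprocessing": preprocessing,
    }
    manifest_path = data_file.parent / (data_file.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


def verify_manifest(manifest_path: Path) -> bool:
    # Return True only if the file on disk still matches the recorded hash.
    manifest = json.loads(manifest_path.read_text())
    return sha256_of(Path(manifest["location"])) == manifest["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    # A toy file stands in for a downloaded dataset archive.
    data_file = Path(tmp) / "toy_dataset.bin"
    data_file.write_bytes(b"example bytes standing in for a dataset archive")

    manifest_path = write_manifest(
        data_file,
        source="hypothetical-mirror",
        preprocessing=["lowercase", "whitespace tokenize"],
    )
    unchanged = verify_manifest(manifest_path)

    # Simulate an upstream update that silently changes the file.
    data_file.write_bytes(b"modified upstream")
    change_detected = not verify_manifest(manifest_path)

print("Unchanged file verified:", unchanged)
print("Modified file detected:", change_detected)
```

Checking the manifest before each experiment turns a silent dataset update into an explicit failure, which keeps comparisons between runs meaningful.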



In [None]:
#@title Python Code - Managing Dataset Downloads

# This script shows how to manage dataset downloads.
# Note: it uses TensorFlow's built-in MNIST loader, not torchvision.
# Focus on download control, caching, and reuse.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import pathlib
import random

# Import TensorFlow and check version.
import tensorflow as tf

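# Seed Python's random module for reproducibility.
random.seed(42)
# Note: PYTHONHASHSEED set here affects child processes only,
# not the hashing of the already running interpreter.
os.environ["PYTHONHASHSEED"] = "42"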

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Choose a base directory for all datasets.
base_data_dir = pathlib.Path("data_download_demo")

# Create directory if it does not already exist.
base_data_dir.mkdir(parents=True, exist_ok=True)

# Show the resolved absolute path for clarity.
print("Dataset base directory:", base_data_dir.resolve())

# Define a small helper to describe directory contents.


def describe_dir(path: pathlib.Path, label: str) -> None:
    # List files and subdirectories in a safe way.
    entries = sorted(path.iterdir()) if path.exists() else []

    # Print a short summary line for the directory.
    print(f"{label} contains {len(entries)} entries.")

# Describe directory before any dataset download.
describe_dir(base_data_dir, "Before download")

# Configure a flag that controls automatic downloading.
auto_download = True

# Explain current download policy in one line.
print("Auto download enabled:", auto_download)

# Build a path where MNIST will be cached.
mnist_path = base_data_dir / "mnist_tf"

# Ensure the MNIST cache directory exists safely.
mnist_path.mkdir(parents=True, exist_ok=True)

# Define a function that loads MNIST with control.


def load_mnist(cache_dir: pathlib.Path, download: bool):
    # Keras interprets a relative path as relative to ~/.keras/datasets,
    # so resolve to an absolute path to keep the cache inside cache_dir.
    npz_file = (cache_dir / "mnist.npz").resolve()

    # If download is disabled and the cached file is missing, raise an error.
    if not download and not npz_file.exists():
        raise RuntimeError("MNIST cache empty, enable download first.")

    # Use TensorFlow builtin loader with custom path.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data(
        path=str(npz_file)
    )

    # Validate shapes to avoid surprises.
    if x_train.shape[0] != y_train.shape[0]:
        raise ValueError("Mismatched MNIST images and labels counts.")

    # Return only a tiny subset for quick inspection.
    return x_train[:100], y_train[:100]

# Try loading MNIST with automatic download allowed.
try:
    x_small, y_small = load_mnist(mnist_path, auto_download)
except Exception as exc:
    # Print a short message if something unexpected happens.
    print("Download or load failed:", type(exc).__name__)

# Describe directory after potential download.
describe_dir(base_data_dir, "After download attempt")

# If data loaded, print a few key properties.
if "x_small" in globals():
    # Print dataset subset shape information.
    print("Subset images shape:", x_small.shape)

    # Print dataset subset labels shape information.
    print("Subset labels shape:", y_small.shape)

    # Show unique labels present in the tiny subset.
    unique_labels = sorted(set(int(v) for v in y_small))

    # Print the unique labels in one concise line.
    print("Unique labels in subset:", unique_labels)

# Demonstrate how to switch off automatic downloading.
auto_download = False

# Print the new policy to emphasize reproducibility.
print("Auto download enabled:", auto_download)

# Attempt loading again, now expecting cached reuse.
# Skip the comparison if the first load failed or nothing was cached.
if "x_small" in globals() and any(mnist_path.iterdir()):
    x_small2, y_small2 = load_mnist(mnist_path, auto_download)

    # Confirm that cached data matches previous subset.
    same_data = bool((x_small2 == x_small).all())

    # Print whether cached reload produced identical data.
    print("Cached reload identical to first load:", same_data)

# Final line prints a short completion message.
print("Dataset download management demo finished.")



# <font color="#418FDE" size="6.5" uppercase>**Datasets and Loaders**</font>


In this lecture, you learned to:
- Implement custom PyTorch Dataset classes that load and preprocess samples on demand. 
- Configure DataLoader instances with appropriate batch sizes, shuffling, and multiprocessing workers. 
- Use built‑in datasets from torchvision and torchtext as quick starting points for experiments. 

In the next Lecture (Lecture B), we will go over 'Transforms and Augment'.