# 📘 Simple Dataset (PyTorch)

## 🎯 Objective
- Create a **custom dataset** in PyTorch.  
- Perform and **compose transforms** on dataset samples.

---

## 📋 Table of Contents
- [Simple dataset](#simple-dataset)
- [Transforms](#transforms)
- [Compose](#compose)

**Estimated Time:** **30 min**

---

## 🧰 Preparation

We’ll use:

- `torch` (PyTorch tensors)
- `torch.utils.data.Dataset` (abstract base for custom datasets)

We also set a manual seed so any randomness is repeatable.

```python
import torch
from torch.utils.data import Dataset

# Seed PyTorch's random number generator so any random draws are repeatable
torch.manual_seed(0)
```

## <a id="simple-dataset"></a> 🔹 Simple dataset

We’ll implement a tiny dataset class `toy_set` that:

- stores **features** `x` and **targets** `y`,
- accepts an optional **transform** callable,
- implements PyTorch's map-style dataset protocol:
  - `__len__(self)` → number of samples,
  - `__getitem__(self, index)` → returns the sample `(x_i, y_i)`.

### 📐 What’s inside the dataset?

For clarity, we’ll make:
- $ \color{#1f77b4}{x \in \mathbb{R}^{N \times 2}} $ with **all entries = 2**  
- $ \color{#d62728}{y \in \mathbb{R}^{N \times 1}} $ with **all entries = 1**

So the $i$-th sample is:
$$
\text{sample}[i] = \big(\;\color{#1f77b4}{x_i} \in \mathbb{R}^2,\; \color{#d62728}{y_i} \in \mathbb{R}^1\;\big)
$$

---


In [15]:
# ===============================
# 1) Imports & setup
# ===============================
import torch
from torch.utils.data import Dataset

# For reproducibility when randomness is used (e.g., noise)
torch.manual_seed(0)

<torch._C.Generator at 0x12e27c290>

In [16]:
# ===============================
# Define class for dataset
# ===============================

class toy_set(Dataset):
    """
    A minimal dataset:
      - N samples (default: 100)
      - x[i] is a 2D vector of twos: [2., 2.]
      - y[i] is a 1-element vector holding a one: [1.]
    Optionally applies a 'transform' to each (x, y) upon access.
    """
    
    def __init__(self, length=100, transform=None):
        """
        Args:
            length (int): number of samples N.
            transform (callable or None): if provided, will be applied to each (x, y).
        """
        # Save the declared length
        self.len = int(length)
        
        # Create feature matrix X: shape (N, 2), filled with 2's
        # Example row: tensor([2., 2.])
        self.x = 2 * torch.ones(self.len, 2)
        
        # Create target vector y: shape (N, 1), filled with 1's
        # Example row: tensor([1.])
        self.y = torch.ones(self.len, 1)
        
        # Store an optional transform; must be callable(sample)->sample
        self.transform = transform
     
    def __getitem__(self, index):
        """
        Return the (x[index], y[index]) pair.
        If a transform is provided, apply it before returning.
        """
        # 1) Grab tensors for this index
        sample = (self.x[index], self.y[index])
        
        # 2) If a transform pipeline exists, run it
        if self.transform is not None:
            sample = self.transform(sample)
        
        # 3) Return the (possibly transformed) sample
        return sample
    
    def __len__(self):
        """
        Return dataset size. Enables Python's len(ds) and is used by DataLoader.
        """
        return self.len


### 🔎 Create & Inspect the Dataset

Below we instantiate the dataset, check one element, and print its length.

- `__repr__` is inherited from `object`, so printing the dataset shows only its type and memory address.
- `__getitem__` powers `ds[index]` access.
- `__len__` powers `len(ds)`.


In [17]:
# Create Dataset Object. Inspect element at index 0 and dataset length.
our_dataset = toy_set()

print("Our toy_set object: ", our_dataset)
print("Value on index 0 of our toy_set object: ", our_dataset[0])  # calls __getitem__(0)
print("Our toy_set length: ", len(our_dataset))                    # calls __len__()

Our toy_set object:  <__main__.toy_set object at 0x132329e80>
Value on index 0 of our toy_set object:  (tensor([2., 2.]), tensor([1.]))
Our toy_set length:  100
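
To make the shape of a single sample concrete, here is a minimal sketch that inspects what `__getitem__` returns. `TinySet` is a hypothetical stand-in that mirrors `toy_set` without the transform, so the block runs on its own; the dtype comment assumes PyTorch's default floating-point dtype has not been changed.

```python
import torch
from torch.utils.data import Dataset


class TinySet(Dataset):
    """Hypothetical stand-in for toy_set: x rows are [2., 2.], y rows are [1.]."""

    def __init__(self, length=100):
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.shape[0]


ds = TinySet()
x0, y0 = ds[0]

# One sample: a length-2 feature vector and a length-1 target vector.
print(x0.shape, y0.shape)   # torch.Size([2]) torch.Size([1])
print(x0.dtype)             # torch.float32 with the default dtype

# The index is passed straight to the tensors, so a slice returns stacked rows.
xs, ys = ds[0:3]
print(xs.shape, ys.shape)   # torch.Size([3, 2]) torch.Size([3, 1])
```

Slicing works here only because the index is forwarded to tensor indexing; a dataset that reads files by index would need to handle slices explicitly.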


### ➿ Indexing & Iteration

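Because `toy_set` implements `__getitem__`, Python can iterate over it through the sequence protocol: it calls `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised (here, by indexing past the end of the stored tensors).  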
Let’s print the first 3 samples using indexing:


In [18]:
# Use loop to print out first 3 elements in dataset (via indexing)
print("\n first three samples")
for i in range(3):
    x, y = our_dataset[i]
    print(f"[{i}] --> x:{x.tolist()}, y:{y.item():.1f}")


 first three samples
[0] --> x:[2.0, 2.0], y:1.0
[1] --> x:[2.0, 2.0], y:1.0
[2] --> x:[2.0, 2.0], y:1.0


You can also iterate directly over the dataset:

In [19]:
count = 0
for x, y in our_dataset:
    print(f"x:{x.tolist()}, y:{y.item()}")
    count += 1

    if count == 10:
        break # (break early so we don't print all 100 lines)


x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
x:[2.0, 2.0], y:1.0
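
Direct iteration works even though the class defines no `__iter__`. The sketch below illustrates the mechanism with a plain Python class; `Squares` is a hypothetical example, not part of the notebook. Python falls back to calling `__getitem__` with 0, 1, 2, ... and stops at the first `IndexError`.

```python
class Squares:
    """Hypothetical sequence-like class with __getitem__ and __len__ but no __iter__."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        # Raising IndexError is what tells the for-loop to stop.
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index

    def __len__(self):
        return self.n


values = [v for v in Squares(4)]
print(values)  # [0, 1, 4, 9]
```

In `toy_set`, the `IndexError` comes from indexing the stored tensors out of range. A `__getitem__` that never raises it would make a bare `for` loop run forever, which is one reason to rely on `range(len(ds))` or a `DataLoader` instead.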


### 🧪 Practice

Create a `toy_set` with length **50** and print its length $N$.

> ✅ **Hint:** `ds = toy_set(length=50)`

In [20]:
toy_dataset = toy_set(length=50)

print(f"Length of the dataset {len(toy_dataset)}")

Length of the dataset 50


## <a id="transforms"></a> 🔹 Transforms

Transforms let you **preprocess** samples consistently.  
We’ll define a callable class `add_mult` that:

- adds a constant to $ \color{#1f77b4}{x} $:  $ \color{#1f77b4}{x'} = x + \text{addx} $
- multiplies $ \color{#d62728}{y} $: $ \color{#d62728}{y'} = (\text{muly}) \, y $


In [21]:
# ===============================
# Transform class: add_mult
# ===============================

class add_mult(object):
    """
    Simple transform:
      - x' = x + addx
      - y' = y * muly
    """

    def __init__(self, addx=1, muly=2):  # defaults: addx=1, muly=2
        self.addx = addx
        self.muly = muly
    
    def __call__(self, sample):
        """
        Args:
            sample: tuple (x, y)
        Returns:
            tuple (x', y') with transform applied
        """

        x, y = sample

        x = x + self.addx
        y = y * self.muly
        return (x, y)
    

### 🧪 Apply Transform (Manually vs. via Dataset)

1) **Manual**: call the transform object on a sample.  
2) **Automatic**: pass the transform into the dataset’s constructor to apply on every access.


In [22]:
a_m = add_mult(3, 7)
data_set = toy_set()

# Compare original vs. transformed samples (first 3 indices)

for i in range(3):
    x, y = data_set[i]
    x_, y_ = a_m((x, y))

    # Original vs. transformed
    print(f"Index Position: [{i}]\n Original x: {x.numpy()}, y: {y.numpy()}\n Transformed x: {x_.numpy()}, y: {y_.numpy()}")

Index Position: [0]
 Original x: [2. 2.], y: [1.]
 Transformed x: [5. 5.], y: [7.]
Index Position: [1]
 Original x: [2. 2.], y: [1.]
 Transformed x: [5. 5.], y: [7.]
Index Position: [2]
 Original x: [2. 2.], y: [1.]
 Transformed x: [5. 5.], y: [7.]


You can also **register the transform with the dataset** so it’s applied automatically in `__getitem__`:


In [27]:
custom_data_set = toy_set(transform=add_mult(8, 3))

for i in range(3):
    x, y = custom_data_set[i]

    print(f"[{i}] -> Transformed x: {x.numpy()}, y: {y.numpy()}")

[0] -> Transformed x: [10. 10.], y: [3.]
[1] -> Transformed x: [10. 10.], y: [3.]
[2] -> Transformed x: [10. 10.], y: [3.]
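
Two details of this pattern are easy to miss: the transform can be any callable, not only a class instance, and it runs on each access without changing the stored tensors. The sketch below assumes a hypothetical stand-in `TinySet` (mirroring `toy_set`) and a hypothetical plain function `add_then_scale` that uses the same numbers as `add_mult(8, 3)`.

```python
import torch
from torch.utils.data import Dataset


class TinySet(Dataset):
    """Hypothetical stand-in for toy_set with an optional transform."""

    def __init__(self, length=3, transform=None):
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)
        self.transform = transform

    def __getitem__(self, index):
        sample = (self.x[index], self.y[index])
        if self.transform is not None:
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return self.x.shape[0]


def add_then_scale(sample):
    """Plain function used as a transform: x + 8, y * 3."""
    x, y = sample
    return x + 8, y * 3


ds = TinySet(transform=add_then_scale)
x0, y0 = ds[0]

print(x0.tolist(), y0.tolist())              # [10.0, 10.0] [3.0]
print(ds.x[0].tolist(), ds.y[0].tolist())    # [2.0, 2.0] [1.0] (stored data unchanged)
```

The stored data stays intact because `x + 8` and `y * 3` build new tensors. A transform that used in-place operations such as `x.add_(8)` would modify the dataset itself, since indexing returns a view of the stored tensor.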


### 🧪 Practice

Write your own transform `my_add_mult` and apply it via the dataset.

> Example idea:
> - add `2` to both `x` and `y`, then multiply both by `10`.

```python
# Practice: Construct your own my_add_mult transform. Apply it on a new toy_set and print the first few elements.
```

In [None]:
class my_add_mult(object):
    def __init__(self, add_x=2, mul_y=10):
        self.add_x = add_x
        self.mul_y = mul_y
    
    def __call__(self, samples):
        x, y = samples
        x = (x + self.add_x) * self.mul_y
        y = (y + self.add_x) * self.mul_y
        return (x, y)

sample_data_set = toy_set(transform=my_add_mult())

# Print the first 5 transformed samples: x = (2 + 2) * 10, y = (1 + 2) * 10
for i in range(5):
    x, y = sample_data_set[i]
    print(f"Index [{i}], Transformed X: {x.numpy()}, Transformed Y: {y.numpy()}")


Index [0], Transformed X: [40. 40.], Transformed Y: [30.]
Index [1], Transformed X: [40. 40.], Transformed Y: [30.]
Index [2], Transformed X: [40. 40.], Transformed Y: [30.]
Index [3], Transformed X: [40. 40.], Transformed Y: [30.]
Index [4], Transformed X: [40. 40.], Transformed Y: [30.]


## <a id="compose"></a> 🔹 Compose

You can **chain multiple transforms** with `torchvision.transforms.Compose`.

We’ll also define another transform `mult`:
- $ x' = (\text{mult}) \, x $
- $ y' = (\text{mult}) \, y $

Then we’ll build:
$$
\text{Compose}\big([\; \texttt{add\_mult}(7, 2),\; \texttt{mult}() \;]\big)
$$
which applies `add_mult` **then** `mult` in sequence.

> 🔁 **Order matters**: the output of one transform is input to the next.
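
To see the effect of ordering on actual numbers, here is a small pure-Python sketch. The `compose` helper is a hypothetical stand-in that applies transforms left to right, as a composed pipeline does; `add_mult_fn` and `mult_fn` are hypothetical function versions of the two transforms, using `addx=7`, `muly=2`, and a factor of 100.

```python
def compose(*transforms):
    """Hypothetical stand-in for a composed pipeline: apply each transform left to right."""
    def run(sample):
        for t in transforms:
            sample = t(sample)
        return sample
    return run


def add_mult_fn(sample, addx=7, muly=2):
    """x' = x + addx, y' = y * muly."""
    x, y = sample
    return x + addx, y * muly


def mult_fn(sample, factor=100):
    """Multiply both x and y by the same factor."""
    x, y = sample
    return x * factor, y * factor


sample = (2.0, 1.0)  # one x entry and one y entry from the toy dataset

add_first = compose(add_mult_fn, mult_fn)(sample)
mult_first = compose(mult_fn, add_mult_fn)(sample)

print(add_first)   # (900.0, 200.0): (2 + 7) * 100 and (1 * 2) * 100
print(mult_first)  # (207.0, 200.0): 2 * 100 + 7 and (1 * 100) * 2
```

Only `x` changes between the two orders, because addition and multiplication do not commute; `y` is multiplied in both steps, so its result is the same either way.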


In [36]:
!pip install torchvision

Collecting torchvision
  Downloading torchvision-0.24.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (5.9 kB)
Downloading torchvision-0.24.0-cp312-cp312-macosx_11_0_arm64.whl (1.9 MB)
Installing collected packages: torchvision
Successfully installed torchvision-0.24.0


In [42]:
from torchvision import transforms


# ===============================
# Transform class: mult
# ===============================

class mult(object):
    """
    Multiply BOTH x and y by the same scalar value.
    """
    def __init__(self, mult=100):
        self.mult = mult
    
    def __call__(self, sample):
        x, y = sample
        x = x * self.mult
        y = y * self.mult
        return (x, y)

# Build a composed transform: add_mult -> mult
data_transformer = transforms.Compose([add_mult(7, 2), mult()])

print(f"The combination of transforms (Compose): {data_transformer}")

my_data_set = toy_set()

# Apply the composed transform manually to the first 5 samples
for i in range(5):
    x, y = my_data_set[i]
    x_, y_ = data_transformer((x, y))
    print(f"Index [{i}]\n Original x: {x.numpy()}, y:{y.numpy()}\n Transformed x: {x_.numpy()}, y:{y_.numpy()}")


The combination of transforms (Compose): Compose(
    <__main__.add_mult object at 0x1369ea7e0>
    <__main__.mult object at 0x1369eaf60>
)
Index [0]
 Original x: [2. 2.], y:[1.]
 Transformed x: [900. 900.], y:[200.]
Index [1]
 Original x: [2. 2.], y:[1.]
 Transformed x: [900. 900.], y:[200.]
Index [2]
 Original x: [2. 2.], y:[1.]
 Transformed x: [900. 900.], y:[200.]
Index [3]
 Original x: [2. 2.], y:[1.]
 Transformed x: [900. 900.], y:[200.]
Index [4]
 Original x: [2. 2.], y:[1.]
 Transformed x: [900. 900.], y:[200.]


### 📌 What happened at index 0?

- Original: $ x = [2, 2],\; y = [1] $
- After `add_mult(addx=7, muly=2)`:  
  $ x = [9, 9],\; y = [2] $
- After `mult(mult=100)` is applied **after** the above:  
  $ x = [900, 900],\; y = [200] $

Mathematically:
$$
\color{#1f77b4}{x_\text{final}} = \big(\,[2,2] + 7\,\big)\times 100 = [900, 900],\quad
\color{#d62728}{y_\text{final}} = \big(\,[1]\times2\,\big)\times100 = [200].
$$

---

### 🧪 Practice

Build a compose **in the opposite order**: `mult()` first, then `add_mult()`.

```python
# Practice: Compose with mult() THEN add_mult()
# my_compose = transforms.Compose([mult(), add_mult()])
# my_transformed_dataset = toy_set(transform=my_compose)
# for i in range(3):
#     x_, y_ = my_transformed_dataset[i]
#     print('Index:', i, 'Transformed x_:', x_, 'Transformed y_:', y_)
```


## ✅ Summary

- You created a **custom dataset** that implements:
  - `__len__` → returns $N$
  - `__getitem__` → returns a tuple $(\color{#1f77b4}{x_i}, \color{#d62728}{y_i})$
- You wrote **callable transforms** that operate on each sample.
- You learned to **compose** transforms with `torchvision.transforms.Compose` and how **order** changes outcomes.

📦 This pattern plugs directly into `torch.utils.data.DataLoader` for batching/shuffling during training.
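
As a rough illustration of that last point, the sketch below feeds a dataset of this shape to a `DataLoader`. `TinySet` is again a hypothetical stand-in for `toy_set`, and the length of 10 and batch size of 4 are arbitrary choices made so that the final, smaller batch is visible.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TinySet(Dataset):
    """Hypothetical stand-in for toy_set: x rows are [2., 2.], y rows are [1.]."""

    def __init__(self, length=10):
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.shape[0]


# The loader calls __len__ and __getitem__, then stacks samples into batches.
loader = DataLoader(TinySet(length=10), batch_size=4, shuffle=False)

shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(shapes)  # [((4, 2), (4, 1)), ((4, 2), (4, 1)), ((2, 2), (2, 1))]
```

With 10 samples and a batch size of 4, the last batch holds the remaining 2 samples; passing `drop_last=True` would discard it instead.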
