# Lab 1: PyTorch Foundations for Computer Vision

**Course**: Deep Learning for Image Analysis

**Class**: M2 IASD App  

**Professor**: Mehyar MLAWEH

---

## Objectives
By the end of this lab, you should be able to:

- Understand how **neurons and layers** are implemented in PyTorch
- Manipulate **tensors** and reason about shapes
- Use **autograd** to compute gradients
- Implement a **training loop** yourself
- Connect theory (neurons, loss, backprop) to actual code

⚠️ This notebook is **intentionally incomplete**.  
Whenever you see **`# TODO`**, you are expected to write code.


**Deadline:** 🗓️ **Saturday, February 7th (23:59)**

## 🤖 A small (honest) note before you start

Let’s be real for a second.

I know you **can use LLMs (ChatGPT, Copilot, Claude, etc.)** to help you with this lab.  
And yes, **I use them too**, so don’t worry 😄

👉 **You are allowed to use AI tools.**  
But here’s the deal:

- Don’t just **copy–paste** code you don’t understand  
- Take time to **read, question, and modify** what the model gives you  
- If you can solve a block **by yourself, without AI**, that’s excellent

Remember:

> AI can write code for you, but **only you can understand it**, and understanding is what matters for exams, projects, and real work.

Use these tools **as assistants, not as replacements for thinking**.

---

## 📚 Useful documentation (highly recommended)

You will often find answers faster (and more reliably) by checking the official documentation:

- **PyTorch main documentation**  
  https://pytorch.org/docs/stable/index.html

- **PyTorch tensors**  
  https://pytorch.org/docs/stable/tensors.html

- **Neural network modules (`torch.nn`)**  
  https://pytorch.org/docs/stable/nn.html

- **Loss functions** (`BCEWithLogitsLoss`, CrossEntropy, etc.)  
  https://pytorch.org/docs/stable/nn.html#loss-functions

- **Optimizers** (`SGD`, `Adam`, …)  
  https://pytorch.org/docs/stable/optim.html

If you learn how to **navigate the documentation**, you are already thinking like a real AI engineer 👌

---

## Part I

## 0) Colab setup: GPU check

**Instructions**
1. In Colab, open `Runtime → Change runtime type`
2. Select **GPU** (T4)
3. Save and restart the runtime

Then run the cell below.


In [1]:
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# TODO: set the device correctly (cuda if available, else cpu)
# CUDA is NVIDIA's platform for general-purpose computing on GPUs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

PyTorch version: 2.9.0+cu126
CUDA available: True
Using device: cuda


## 1) Imports and reproducibility


In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

# TODO: fix the random seed for reproducibility
torch.manual_seed(42)


<torch._C.Generator at 0x781134165190>

## 2) PyTorch tensors and shapes

Tensors are multi-dimensional arrays that support:
- GPU acceleration
- automatic differentiation

Understanding **shapes** is critical in deep learning.


In [3]:
# Examples
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.randn(4, 5) # Returns a tensor filled with random numbers from a standard normal distribution
print("a shape:", a.shape)
print("b shape:", b.shape)


a shape: torch.Size([3])
b shape: torch.Size([4, 5])




### 🔍 Question (answer inside the markdown)
- How many dimensions does tensor `b` have? --> `b` has two dimensions (a 2D tensor, i.e. a matrix).
- What does each dimension represent conceptually? --> By convention, the rows (dimension 0, size 4) index the samples, so the number of rows is the batch size; the columns (dimension 1, size 5) hold the features of each sample.
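As a small illustration of reading shapes, the sketch below inspects a tensor built like `b`. Treating dimension 0 as the batch and dimension 1 as the features is a convention assumed here, not something PyTorch enforces.

```python
import torch

b = torch.randn(4, 5)

print(b.ndim)       # number of dimensions
print(b.shape[0])   # size of dimension 0 (samples, by convention)
print(b.shape[1])   # size of dimension 1 (features, by convention)
print(b[0].shape)   # indexing one sample leaves a 1D tensor of features
print(b.numel())    # total number of elements
```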


### ✅ Tensor operations

Complete the following:

1. Create a tensor `x` of shape `(8, 3)` with random values  
2. Compute:
   - the **mean of each column**
   - the **L2 norm of each row**
3. Normalize `x` **row-wise** using the L2 norm

In [4]:
# TODO: create x
x = torch.randn([8,3])

# TODO: column mean
col_mean = torch.mean(x, dim=0)  # may look counter-intuitive, but dim is the dimension that gets reduced (removed)

# TODO: row-wise L2 norm
row_norm = torch.norm(x, p=2, dim=1)

# TODO: normalized tensor
x_normalized = x / row_norm.view(-1, 1)

print(x.shape, col_mean.shape, row_norm.shape, x_normalized.shape)


torch.Size([8, 3]) torch.Size([3]) torch.Size([8]) torch.Size([8, 3])
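One possible variant of the row-wise normalization, sketched under the assumption that `torch.linalg.vector_norm` and `torch.nn.functional.normalize` are available in your PyTorch version: `keepdim=True` keeps the reduced dimension as size 1, so broadcasting works without an explicit `view`. The seed value is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 3)

# keepdim=True gives shape (8, 1), which broadcasts against (8, 3)
row_norm = torch.linalg.vector_norm(x, ord=2, dim=1, keepdim=True)
x_normalized = x / row_norm

# After normalization, every row should have an L2 norm of about 1
unit_norms = torch.linalg.vector_norm(x_normalized, dim=1)
print(unit_norms)

# Cross-check against PyTorch's built-in helper
same = torch.allclose(x_normalized, F.normalize(x, p=2, dim=1))
print(same)
```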


## 3) Artificial neuron: from math to code

A neuron computes:

$$
z = \sum_i w_i x_i + b
$$

Then applies an activation function:

$$
y = g(z)
$$

This section connects directly to the theory seen in class.


In [5]:
x = torch.tensor([1.0, -2.0, 3.0])
w = torch.tensor([0.2, 0.4, -0.1])
b = torch.tensor(0.1)

z = torch.sum(x * w) + b
z


tensor(-0.8000)
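The weighted sum above is a dot product, and a single neuron corresponds to `nn.Linear(3, 1)`. The sketch below checks this with the same numbers; copying the weights into the layer by hand is only for illustration and is not how layers are normally initialized.

```python
import torch
import torch.nn as nn

x = torch.tensor([1.0, -2.0, 3.0])
w = torch.tensor([0.2, 0.4, -0.1])
b = torch.tensor(0.1)

z_sum = torch.sum(x * w) + b   # explicit weighted sum
z_dot = torch.dot(w, x) + b    # the same computation written as a dot product

# nn.Linear(3, 1) is one neuron with 3 inputs
neuron = nn.Linear(3, 1)
with torch.no_grad():
    neuron.weight.copy_(w.unsqueeze(0))  # weight shape: (out_features, in_features) = (1, 3)
    neuron.bias.copy_(b.unsqueeze(0))    # bias shape: (1,)
    z_linear = neuron(x)

print(z_sum, z_dot, z_linear)
```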

### Activation functions

1. Implement **ReLU**
2. Implement **Sigmoid**
3. Apply both to `z` and compare the outputs

Which activation preserves negative values? --> Neither: ReLU returns zero for any negative `z`, and Sigmoid maps every input into the open interval `(0, 1)`.


In [6]:
# TODO
def relu(z):
    return torch.max(z, torch.zeros_like(z))

def sigmoid(z):
    return torch.sigmoid(z)

y_relu = relu(z)
y_sigmoid = sigmoid(z)
y_relu, y_sigmoid


(tensor(0.), tensor(0.3100))
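For comparison, here is a sketch that writes both activations from their formulas and checks them against PyTorch's built-ins. The test values in `z` are arbitrary, apart from `-0.8`, which matches the neuron output above.

```python
import torch

z = torch.tensor([-0.8, 0.0, 2.0])

def relu(z):
    # max(0, z), elementwise
    return torch.clamp(z, min=0.0)

def sigmoid(z):
    # 1 / (1 + exp(-z)), elementwise
    return 1.0 / (1.0 + torch.exp(-z))

print(relu(z))
print(sigmoid(z))

relu_ok = torch.allclose(relu(z), torch.relu(z))
sigmoid_ok = torch.allclose(sigmoid(z), torch.sigmoid(z))
print(relu_ok, sigmoid_ok)
```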

## 4) Autograd and gradients

PyTorch uses **automatic differentiation** to compute gradients
using the **chain rule** (backpropagation).


In [7]:
x = torch.tensor([1.0, 2.0, -1.0], requires_grad=True) # NOTE: requires_grad=True means track everything involving this tensor so I can compute gradients later
w = torch.tensor([0.5, -0.3, 0.8], requires_grad=True)
b = torch.tensor(0.2, requires_grad=True)

z = torch.sum(x * w) + b
loss = (z - 1.0) ** 2

loss.backward() ## This asks torch to compute the gradients using backpropagation for everything having requires_grad=True

print("loss:", loss.item())
print("grad w:", w.grad)
print("grad b:", b.grad)


loss: 2.890000104904175
grad w: tensor([-3.4000, -6.8000,  3.4000])
grad b: tensor(-3.4000)
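The gradients printed above can be checked by applying the chain rule by hand: with `loss = (z - 1)^2` and `z = sum(x * w) + b`, we get `dL/dz = 2 (z - 1)`, `dz/dw = x`, and `dz/db = 1`. A minimal sketch of that check:

```python
import torch

x = torch.tensor([1.0, 2.0, -1.0], requires_grad=True)
w = torch.tensor([0.5, -0.3, 0.8], requires_grad=True)
b = torch.tensor(0.2, requires_grad=True)

z = torch.sum(x * w) + b
loss = (z - 1.0) ** 2
loss.backward()

# Chain rule by hand, outside of autograd tracking
with torch.no_grad():
    dL_dz = 2.0 * (z - 1.0)
    grad_w_manual = dL_dz * x   # dz/dw = x
    grad_b_manual = dL_dz       # dz/db = 1
    grad_x_manual = dL_dz * w   # dz/dx = w

print(torch.allclose(w.grad, grad_w_manual))
print(torch.allclose(b.grad, grad_b_manual))
print(torch.allclose(x.grad, grad_x_manual))
```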


### 🔍 Conceptual question

- If `b.grad > 0`, should `b` increase or decrease after a gradient descent step?
Explain **why** in one sentence --> `b` should decrease: gradient descent moves each parameter in the direction opposite to its gradient (`b ← b - lr * b.grad`), so a positive gradient lowers `b`.
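To make the update rule concrete, the sketch below performs one manual gradient descent step on the same neuron. The learning rate `lr = 0.01` is an arbitrary value chosen for illustration. In this example `b.grad` is negative, so `b` increases, which is the mirror image of the case in the question.

```python
import torch

x = torch.tensor([1.0, 2.0, -1.0])
w = torch.tensor([0.5, -0.3, 0.8], requires_grad=True)
b = torch.tensor(0.2, requires_grad=True)
lr = 0.01  # hypothetical learning rate

def compute_loss():
    z = torch.sum(x * w) + b
    return (z - 1.0) ** 2

loss_before = compute_loss()
loss_before.backward()
b_before = b.item()

# Move each parameter against its gradient, without tracking the update itself
with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad

loss_after = compute_loss()
print("b.grad:", b.grad.item())
print("b before/after:", b_before, b.item())
print("loss before/after:", loss_before.item(), loss_after.item())
```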


## 5) Toy classification dataset

We create a **linearly separable** dataset.

Label rule:
- class = 1 if `x₁ + x₂ + x₃ > 0`
- else class = 0

This mimics a very simple classification problem.


In [8]:
# TODO: generate a dataset of size N=500 with 3 features
N = 500
X = torch.randn(N, 3)
y = (X.sum(dim=1) > 0).float().unsqueeze(1) # shape (N, 1)

# TODO: split into train (80%) and validation (20%)
perm = torch.randperm(N)  # returns a random permutation of the integers 0 to N-1

train_size = int(0.8 * N)

train_idx = perm[:train_size]
val_idx = perm[train_size:]

X_train = X[train_idx]
y_train = y[train_idx]

X_val = X[val_idx]
y_val = y[val_idx]
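An alternative way to get the same 80/20 split, sketched with `TensorDataset` and `random_split` from `torch.utils.data`. The seed values are arbitrary assumptions; the resulting indices will differ from the `randperm` split above.

```python
import torch
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(42)
N = 500
X = torch.randn(N, 3)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)  # shape (N, 1)

dataset = TensorDataset(X, y)
generator = torch.Generator().manual_seed(42)  # makes the split reproducible
train_set, val_set = random_split(dataset, [400, 100], generator=generator)

print(len(train_set), len(val_set))
print("fraction of class 1:", y.mean().item())

# Sanity check: the two subsets share no sample index
overlap = set(train_set.indices) & set(val_set.indices)
print("overlapping indices:", len(overlap))
```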


## 6) Model definition

We define a small **MLP** (fully-connected network):

`3 → 16 → 8 → 1`

Activation: ReLU  
Output: raw logits (no sigmoid)


In [9]:
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 16),   # 3 → 16
            nn.ReLU(),
            nn.Linear(16, 8),   # 16 → 8
            nn.ReLU(),
            nn.Linear(8, 1)     # 8 → 1 (logit)
        )

    def forward(self, x):
        return self.net(x)

model = MLP().to(device)
print(model)


MLP(
  (net): Sequential(
    (0): Linear(in_features=3, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
    (4): Linear(in_features=8, out_features=1, bias=True)
  )
)
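Since the model outputs raw logits, it pairs naturally with `BCEWithLogitsLoss`. The sketch below, using made-up logits and targets, suggests why: the fused loss matches sigmoid followed by `BCELoss` on moderate values, but behaves differently once the sigmoid saturates.

```python
import torch
import torch.nn as nn

# Hypothetical logits and targets, for illustration only
logits = torch.tensor([[-2.0], [0.5], [3.0]])
targets = torch.tensor([[0.0], [1.0], [1.0]])

loss_fused = nn.BCEWithLogitsLoss()(logits, targets)
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)
print(loss_fused.item(), loss_two_step.item())  # nearly identical

# An extreme logit: sigmoid(200) rounds to exactly 1.0 in float32
big_logit = torch.tensor([[200.0]])
zero_target = torch.tensor([[0.0]])

fused_big = nn.BCEWithLogitsLoss()(big_logit, zero_target)
# BCELoss clamps its log terms, so the true magnitude of the error is lost
two_step_big = nn.BCELoss()(torch.sigmoid(big_logit), zero_target)
print(fused_big.item(), two_step_big.item())
```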


### Parameters

1. Compute **by hand** the total number of parameters
2. Verify your answer using PyTorch


In [10]:
# TODO: count parameters with PyTorch
total_params = sum(p.numel() for p in model.parameters())
total_params


209
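The by-hand count can be written out as follows: each `Linear` layer has `in_features * out_features` weights plus `out_features` biases. This is plain Python arithmetic, independent of PyTorch.

```python
# (in_features, out_features) for each Linear layer of the 3 → 16 → 8 → 1 MLP
layer_sizes = [(3, 16), (16, 8), (8, 1)]

# weights + biases per layer
per_layer = [n_in * n_out + n_out for n_in, n_out in layer_sizes]
total = sum(per_layer)

print(per_layer)  # [64, 136, 9]
print(total)      # 209
```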

## 7) Training loop

You must complete the full training loop:
- forward pass
- loss computation
- backward pass
- optimizer step

Loss: `BCEWithLogitsLoss`  
Optimizer: `SGD`


In [11]:
# move data to device
X_train_d = X_train.to(device)
y_train_d = y_train.to(device)
X_val_d = X_val.to(device)
y_val_d = y_val.to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    model.train()

    optimizer.zero_grad()

    # forward
    logits = model(X_train_d)

    # loss
    loss = criterion(logits, y_train_d)

    # backward
    loss.backward()

    # update
    optimizer.step()

    if epoch % 5 == 0:
        print("Epoch", epoch, "| loss =", loss.item())


Epoch 0 | loss = 0.6929869055747986
Epoch 5 | loss = 0.685593843460083
Epoch 10 | loss = 0.6791964769363403
Epoch 15 | loss = 0.6731548309326172
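The loop above performs only 20 full-batch updates, and the loss stays close to ln 2 ≈ 0.693, the value of binary cross-entropy when the model predicts a probability of 0.5 for every sample. The sketch below is one possible variant, not the expected solution: it runs more epochs on CPU and also tracks the validation loss. The seed, the 300 epochs, and the simple slicing split are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Same labelling rule as the toy dataset: class 1 if the features sum to > 0
X = torch.randn(500, 3)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

model = nn.Sequential(
    nn.Linear(3, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 1),
)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

num_epochs = 300  # assumption: more full-batch steps than the 20 used above
history = []
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    # Validation loss, computed without building a graph
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val)
    history.append((loss.item(), val_loss.item()))

    if epoch % 50 == 0:
        print(f"epoch {epoch:3d} | train {loss.item():.4f} | val {val_loss.item():.4f}")
```

How quickly the loss falls depends on the seed and the learning rate, so treat the printed numbers as indicative only.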


## 8) Evaluation

1. Apply `sigmoid` to the logits
2. Convert probabilities to predictions
3. Compute **accuracy** on the validation set


In [13]:
# TODO: evaluation
with torch.no_grad():
    logits = model(X_val_d)
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).float()
    accuracy = (preds == y_val_d).float().mean()

accuracy


tensor(0.4900, device='cuda:0')
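Because the sigmoid is monotonic and `sigmoid(0) = 0.5`, thresholding probabilities at 0.5 is mathematically the same as thresholding logits at 0. The sketch below checks this on a few made-up logits and labels; for logits extremely close to 0, floating-point rounding could make the two tests disagree.

```python
import torch

# Hypothetical logits and labels, for illustration only
logits = torch.tensor([[-1.2], [0.0], [0.3], [4.0]])
labels = torch.tensor([[0.0], [1.0], [1.0], [1.0]])

preds_from_probs = (torch.sigmoid(logits) > 0.5).float()
preds_from_logits = (logits > 0).float()

same = torch.equal(preds_from_probs, preds_from_logits)
accuracy = (preds_from_logits == labels).float().mean().item()
print(same, accuracy)
```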

## 9) Reflection questions (answer inside the markdown)

1. Why do we **not** apply sigmoid inside the model?

--> Because `BCEWithLogitsLoss` already applies the sigmoid internally and is more numerically stable.
Applying sigmoid twice would squash the outputs a second time and hurt training and gradient quality.

2. What would happen if we removed all ReLU activations?

--> The whole network would collapse into a single linear model, no matter how many layers it has.
It could only learn linear decision boundaries.

3. How does this toy problem relate to image classification?

--> Each input here (3 numbers) is like a tiny “image”; real images just have many more features (pixels).
The pipeline is identical: inputs → neural network → logits → loss → backpropagation.

Write short answers (2–3 lines each).
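The answer to question 2 can be checked numerically. The sketch below builds the same layer sizes without ReLU and composes the three affine maps into a single weight matrix and bias. The seed and the random inputs are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# The same 3 → 16 → 8 → 1 stack, but with the ReLU layers removed
linear_stack = nn.Sequential(nn.Linear(3, 16), nn.Linear(16, 8), nn.Linear(8, 1))
l1, l2, l3 = linear_stack

with torch.no_grad():
    # Compose the affine maps: W = W3 W2 W1, b = W3 (W2 b1 + b2) + b3
    W = l3.weight @ l2.weight @ l1.weight
    b = l3.weight @ (l2.weight @ l1.bias + l2.bias) + l3.bias

    x = torch.randn(5, 3)
    out_stack = linear_stack(x)
    out_single = x @ W.T + b

collapsed = torch.allclose(out_stack, out_single, atol=1e-5)
print(W.shape, b.shape, collapsed)
```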


## 10) Bridge to Computer Vision

So far:
- inputs = vectors of size 3
- layers = fully-connected

Next session:
- inputs = images `(B, C, H, W)`
- layers = convolutions
- same training logic

👉 **Architecture changes, learning principles stay the same.**
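As a preview of the `(B, C, H, W)` layout, the sketch below passes a random batch through a fully-connected view (flattening) and through a convolution. The batch size, image size, and channel counts are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 8 RGB images of 32x32 pixels, layout (B, C, H, W)
images = torch.randn(8, 3, 32, 32)

# Fully-connected view: flatten each image into one long feature vector
flat = images.flatten(start_dim=1)
print(flat.shape)      # (8, 3 * 32 * 32) = (8, 3072)

# Convolutional view: the spatial layout is kept;
# a 3x3 kernel with padding=1 preserves H and W
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
features = conv(images)
print(features.shape)  # (8, 16, 32, 32)
```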


## Part II: Training on MNIST

Check the next notebook