# Practical session n°5

Notions:
- Semantic segmentation
- Image Denoising
- Fully convolutional networks, U-Net
- Weak supervision (in Part II): the noise-to-noise and Neural Eggs Separation scenarios.

Duration: 1 h 30 + 2 h

In P2, we illustrated how Convolutional Neural Networks (CNNs) are trained for image classification tasks. In this practical session, we demonstrate how to achieve pixel-level predictions for tasks like semantic segmentation and image denoising.

To start, we’ll simply apply an off-the-shelf model. Then, we’ll focus on training a model from scratch (Part I, Exercise 2, and Part II).

In P3, we also introduced a crucial set of methods known as "transfer learning," which is particularly effective when there’s limited training data. In this session, we’ll explore another equally important set of methods called "weak supervision," which is well-suited for cases where ground truth is imperfectly known (Part II).


## Part I: Semantic Segmentation and Image Denoising with Fully Convolutional Networks

This part aims to familiarize you with a semantic segmentation task.

By definition, a Fully Convolutional Network (FCN) does not contain fully connected layers. As a result, the output retains spatial dimensions. This configuration is useful when the learning target itself is an image. This is the case for tasks such as:
- Semantic segmentation, where each pixel is assigned a semantic class (e.g., ground, sky, clouds, buildings, etc.).
- Pixel-wise regression
- Image denoising
- Super-resolution
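
The claim that a convolution-only network keeps a spatial output, whatever the input size, can be checked with simple shape arithmetic. The sketch below is only an illustration: the three-layer stack in `layers` is hypothetical (it is not the FCN-ResNet used later), and `conv_out_size` applies the output-size formula documented for PyTorch's `Conv2d`.

```python
def conv_out_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output size of a convolution along one spatial axis (Conv2d formula)."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Hypothetical conv-only stack, one (kernel, stride, padding) tuple per layer.
layers = [(3, 1, 1), (3, 2, 1), (3, 1, 1)]


def fcn_out_size(n):
    """Propagate an input size through every layer of the stack."""
    for kernel, stride, padding in layers:
        n = conv_out_size(n, kernel, stride, padding)
    return n


# Any input size is accepted; the output stays spatial (here, halved once).
sizes = {n: fcn_out_size(n) for n in (64, 128, 256)}
print(sizes)
```

By contrast, a fully connected layer placed after these convolutions would expect a flattened vector of fixed length, so the network would only accept the single input size it was built for.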

The first exercise features an FCN built from a ResNet50 for a simple segmentation task defined from a set of real images segmented by hand.

The second exercise proposes a fully supervised pixel-wise regression task, defined on a set of dynamically generated synthetic images.

### **Exercise 1: Semantic Segmentation with FCN-ResNet (no training)**

**A.** Presentation of the Dataset

In the following cells, we load the necessary libraries, download the set of segmented images prepared for the [Pascal VOC 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/segexamples/index.html) challenge, and visualize input-target pairs from the training set:

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import time
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, models, transforms
import torch.optim as optim
from PIL import Image

In [2]:
# Check GPU availability

if torch.cuda.is_available():
  device = torch.device("cuda")
  print("You are on GPU !")
else:
  print('Switch the runtime to GPU, or continue on CPU (training will be slower)')
  device = torch.device("cpu")

You are on GPU !


In [3]:
import os
root = "/content/data"
os.makedirs(root, exist_ok=True)

# Download from BrainChip mirror
!wget http://data.brainchip.com/dataset-mirror/voc/VOCtrainval_06-Nov-2007.tar \
     -O /content/data/VOCtrainval_06-Nov-2007.tar
! tar -xf /content/data/VOCtrainval_06-Nov-2007.tar -C /content/data

--2025-11-28 11:06:02--  http://data.brainchip.com/dataset-mirror/voc/VOCtrainval_06-Nov-2007.tar
Resolving data.brainchip.com (data.brainchip.com)... 146.59.209.152, 2001:41d0:301::31
Connecting to data.brainchip.com (data.brainchip.com)|146.59.209.152|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://data.brainchip.com/dataset-mirror/voc/VOCtrainval_06-Nov-2007.tar [following]
--2025-11-28 11:06:02--  https://data.brainchip.com/dataset-mirror/voc/VOCtrainval_06-Nov-2007.tar
Connecting to data.brainchip.com (data.brainchip.com)|146.59.209.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 460032000 (439M) [application/x-tar]
Saving to: ‘/content/data/VOCtrainval_06-Nov-2007.tar’


2025-11-28 11:06:09 (67.2 MB/s) - ‘/content/data/VOCtrainval_06-Nov-2007.tar’ saved [460032000/460032000]



In [4]:
# Check-up

! ls /content/data
# should show: VOCdevkit  VOCtrainval_06-Nov-2007.tar

! ls /content/data/VOCdevkit
# should show: VOC2007

! ls /content/data/VOCdevkit/VOC2007
# should show: JPEGImages  SegmentationClass  SegmentationObject  ImageSets  ...


VOCdevkit  VOCtrainval_06-Nov-2007.tar
VOC2007
Annotations  ImageSets	JPEGImages  SegmentationClass  SegmentationObject


In [6]:
# Build the training dataset
root = "/content/data"
input_resize = transforms.Resize((128, 128))
target_resize = transforms.Resize((128, 128))

train_dataset_viz = datasets.VOCSegmentation(
    root,
    year='2007',
    image_set='train',
    download = False,
    transform=input_resize,
    target_transform=target_resize,
)

In [None]:
# Viz some images

import math

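def plot_images(images, num_per_row=4, title=None):
    """Show a list of images on a grid with `num_per_row` columns."""
    num_rows = int(math.ceil(len(images) / num_per_row))

    # squeeze=False keeps `axes` 2-D even when the grid has a single row
    fig, axes = plt.subplots(num_rows, num_per_row,
                             figsize=(4 * num_per_row, 4 * num_rows), squeeze=False)

    for ax in axes.flat:
        ax.axis('off')  # also hides the unused cells of the grid
    for image, ax in zip(images, axes.flat):
        ax.imshow(image)

    if title is not None:
        fig.suptitle(title)

    return fig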

# Sampling for viz:

inputs, ground_truths = list(zip(*[train_dataset_viz[i] for i in range(8)]))

_ = plot_images(inputs)

In [None]:
# Viz some targets

_ = plot_images(ground_truths)

**Q1:** How many classes are there? \
Search the web for the difference between semantic segmentation and instance segmentation. Which type of segmentation does this dataset correspond to?

In [None]:
# num_classes = ...


1. **Semantic Segmentation:**
   - **Objective:** Assign a class label to each pixel in an image, indicating the category or type of the object to which it belongs.
   - **Target:** The segmentation masks indicate the class or category of each pixel.


2. **Instance Segmentation:**
   - **Objective:** Identify and outline individual objects in an image.
   - **Target:** The mask contains the same value for each set of pixels associated with the same physical object.



Here, the task is semantic segmentation: each pixel is assigned a class label representing the category of the object to which it belongs. In Pascal VOC, the classes are object categories (e.g., person, car, dog, aeroplane) plus a background class.
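
The difference between the two kinds of targets can be made concrete on a toy example. The sketch below uses a made-up 4x6 scene containing two objects of the same class; the label values (`BACKGROUND`, `CAT`) are hypothetical and do not follow the Pascal VOC indexing.

```python
import numpy as np

# Hypothetical label values, for illustration only.
BACKGROUND, CAT = 0, 1

# Semantic mask: both objects share the class label.
semantic_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance mask: each physical object gets its own identifier.
instance_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Listing the distinct values of a mask shows what it encodes.
semantic_labels = np.unique(semantic_mask)
instance_labels = np.unique(instance_mask)
print("semantic labels:", semantic_labels)
print("instance labels:", instance_labels)
```

Applying `np.unique` to a few targets of the dataset (after converting them to arrays) is one possible way to explore the label values they contain.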


**B.** Presentation of an FCN-ResNet

In the next cell, we load an [FCN](https://pytorch.org/vision/stable/models/fcn.html) built from a [ResNet50](https://arxiv.org/pdf/1512.03385.pdf). This model achieved state-of-the-art performance in 2015 according to [paperswithcode](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k).

In [None]:
fcn = torchvision.models.segmentation.fcn_resnet50(weights_backbone = None)

**Q2:** What is different from a standard ResNet50? Does the FCN provide an output of the same size as the input? Test and explain.

In [None]:
# sample the dataset, convert to torch.tensor:
batch_size = 4
imagenet_mean, imagenet_std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
input_resize = transforms.Resize([64,64])
target_resize = transforms.Resize([64,64])

# Transforms used during the training :
input_transform = transforms.Compose(
    [
        input_resize,
        transforms.ToTensor(),
        transforms.Normalize(imagenet_mean, imagenet_std),
    ]
)

def replace_tensor_value_(tensor, a, b):
    tensor[tensor == a] = b
    return tensor

target_transform = transforms.Compose(
    [
        target_resize,
        transforms.PILToTensor(),
        transforms.Lambda(lambda x: replace_tensor_value_(x.squeeze(0).long(), 255, 21)),
    ]
)

# Def of the Dataset object :
train_dataset = datasets.VOCSegmentation(
    root,  # dataset root defined above ("/content/data")
    year='2007',
    download=False,
    image_set='train',
    transform=input_transform,
    target_transform=target_transform,
)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
inputs, targets  = next(iter(train_loader))


# Test here :
...

**C. Testing a Pretrained Network**

In this exercise, we simply **test** a model trained on another segmentation dataset. An extension to this exercise provides an opportunity to train the model introduced in exercise sheet #2.

In [None]:
fcn = torchvision.models.segmentation.fcn_resnet50(weights='COCO_WITH_VOC_LABELS_V1')

In [None]:
batch_size = 16
imagenet_mean, imagenet_std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
input_resize = transforms.Resize((256,256))
target_resize = transforms.Resize((256,256))

input_transform = transforms.Compose(
    [
        input_resize,
        transforms.ToTensor(),
        transforms.Normalize(imagenet_mean, imagenet_std),
    ]
)

target_transform = transforms.Compose(
    [
        target_resize,
        transforms.PILToTensor(),
    ]
)

test_dataset = datasets.VOCSegmentation(
    root,  # dataset root defined above ("/content/data")
    year='2007',
    download=False,
    image_set='val',
    transform=input_transform,
    target_transform=target_transform,
)

In [None]:
# Creating loaders
test_loader = DataLoader(test_dataset, batch_size=batch_size,
                         shuffle=False, num_workers=2)

**Q3:** According to the preceding code lines, which set (validation or test) of PascalVOC2007 are we testing the model on? Why?



Now, let's visualize some model outputs on this set:

In [None]:
# Color palette for segmentation masks
PALETTE = np.array(
    [
        [0, 0, 0],
        [128, 0, 0],
        [0, 128, 0],
        [128, 128, 0],
        [0, 0, 128],
        [128, 0, 128],
        [0, 128, 128],
        [128, 128, 128],
        [64, 0, 0],
        [192, 0, 0],
        [64, 128, 0],
        [192, 128, 0],
        [64, 0, 128],
        [192, 0, 128],
        [64, 128, 128],
        [192, 128, 128],
        [0, 64, 0],
        [128, 64, 0],
        [0, 192, 0],
        [128, 192, 0],
        [0, 64, 128],
    ]
    + [[0, 0, 0] for i in range(256 - 22)]
    + [[255, 255, 255]],
    dtype=np.uint8,
)


def array1d_to_pil_image(array):
    pil_out = Image.fromarray(array.astype(np.uint8), mode='P')
    pil_out.putpalette(PALETTE)
    return pil_out

In [None]:
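inputs, targets = next(iter(test_loader))

# Inference only: eval mode (batch norm uses its running statistics)
# and no gradient tracking
fcn.eval()
with torch.no_grad():
    outputs = fcn(inputs)['out']
outputs = outputs.argmax(1)  # most likely class for each pixel

# Map an extra "void" label (21), if present, back to 255 for display
outputs = replace_tensor_value_(outputs, 21, 255)
targets = replace_tensor_value_(targets, 21, 255)
targets = targets.squeeze(dim=1)  # (B, 1, H, W) -> (B, H, W)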

In [None]:
plt_inputs = np.clip(inputs.numpy().transpose((0, 2, 3, 1)) * imagenet_std + imagenet_mean, 0, 1)
fig = plot_images(plt_inputs)
fig.suptitle("Images")

pil_outputs = [array1d_to_pil_image(out) for out in outputs.numpy()]
fig = plot_images(pil_outputs)
fig.suptitle("Predictions")

pil_targets = [array1d_to_pil_image(gt) for gt in targets.numpy()]
fig = plot_images(pil_targets)
_ = fig.suptitle("Ground truths")

Finally, let's evaluate the model on the entire set.

In [None]:
# For the test metric
!pip install torchmetrics

In [None]:
import torchmetrics
IoU = torchmetrics.JaccardIndex(num_classes=21, ignore_index=255,task="multiclass")

**Q4:** The Jaccard index is used here instead of accuracy. How is it defined? What is its other name? What is its advantage?
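
As a pointer for Q4, the sketch below computes, for each class, the ratio between the number of pixels in the intersection of the predicted and ground-truth regions and the number of pixels in their union, on a tiny hand-made example. The helper `iou_per_class` is hypothetical and written for illustration only; it is not the torchmetrics implementation, whose averaging conventions may differ.

```python
import numpy as np


def iou_per_class(pred, target, num_classes, ignore_index=255):
    """Per-class intersection over union; NaN when a class is absent from both."""
    valid = target != ignore_index  # pixels labelled `ignore_index` are skipped
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        target_c = (target == c) & valid
        inter = np.logical_and(pred_c, target_c).sum()
        union = np.logical_or(pred_c, target_c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious


# Toy 2x4 example with two classes and one ignored pixel (255).
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 255]])
pred = np.array([[0, 1, 1, 1],
                 [0, 0, 0, 1]])

ious = iou_per_class(pred, target, num_classes=2)
valid = target != 255
pixel_accuracy = (pred[valid] == target[valid]).mean()
print("IoU per class:", ious)
print("mean IoU:", np.mean(ious))
print("pixel accuracy:", pixel_accuracy)
```

Comparing the mean of the per-class values with the pixel accuracy on examples where one class covers most of the image is a good way to see why the two metrics behave differently.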

**Q5:** Modify the following code to obtain the average IoU over the entire set.

In [None]:
fcn = fcn.to(device)
fcn.eval()
nbatch = 0
sum_batch_IoU = 0
for i, (inputs, targets) in enumerate(test_loader):
  ...

