## I. Fundamentals of Image Segmentation

Image Segmentation is a crucial and popular technique within Computer Vision that differs fundamentally from traditional classification methods.

### A. Classification vs. Segmentation

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRNsuxp7p5vNkxun-RGPA2TwgZ2hJKR9NnHBQ&s">


| Feature | Traditional CNN Classification | Image Segmentation |
| :--- | :--- | :--- |
| **Focus Level** | Works on the **image level**. | Works on the **pixel level**. |
| **Objective** | Determines the probability that an image belongs to a particular class (e.g., 99% Cat). | Aims to **extract the specific regions** where an object is present. |
| **Utility** | Does not cater to finding multiple instances of an object or determining their precise boundaries. | Provides the necessary **pixel-level information** required for advanced tasks. |

### B. Segmentation Applications

Segmentation is vital because it provides high-fidelity, boundary-specific data that simple bounding boxes cannot offer.

1.  **Autonomous Vehicles:** It is used to extract elements like **road boundaries** and the precise location of objects (cars, buses) for obstacle avoidance. It is also employed for lane segmentation.
2.  **Medical Imaging:** Segmentation is essential for biomedical tasks, including segmenting cell membranes, detecting tumors, and **brain segmentation**.
3.  **Other Uses:** Encoder-decoder architectures developed for segmentation, such as U-Net, are also used as the denoising network in diffusion models and can be applied to image upscaling.

### C. Segmentation Masks (Ground Truth)

Segmentation models are trained using labeled data where the input image ($X$) is paired with a desired output mask ($Y$), known as the ground truth. Creating a mask means assigning a class label to every pixel of the input image.

1.  **Binary Mask:** Used when the focus is on a single class (e.g., giraffe versus background). The background is typically assigned a value of **zero (Black)**, and the region of focus is assigned a high value, such as **255 (White)**. The output will have one channel.
2.  **Multi-Class Mask (RGB):** Used when multiple elements need to be segmented (e.g., giraffe, sky, grass). In the labeled image, a unique color value is assigned to every distinct class. For training, the model output has $N$ channels, where $N$ equals the number of classes, and each channel effectively acts as a binary mask for one specific class.
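
The two mask types can be related with a short NumPy sketch. The color palette, the class order, and the helper names `rgb_mask_to_one_hot` and `binary_mask` are assumptions made for illustration; a real dataset defines its own color coding.

```python
import numpy as np

# Hypothetical color code: one RGB color per class (assumed for this example).
PALETTE = {
    (255, 0, 0): 0,  # giraffe
    (0, 0, 255): 1,  # sky
    (0, 255, 0): 2,  # grass
}

def rgb_mask_to_one_hot(rgb_mask, palette):
    """Convert an (H, W, 3) color-coded mask into an (H, W, N) stack of 0/1 masks."""
    height, width, _ = rgb_mask.shape
    one_hot = np.zeros((height, width, len(palette)), dtype=np.uint8)
    for color, class_index in palette.items():
        # A pixel belongs to the class when all three color components match.
        matches = np.all(rgb_mask == np.array(color, dtype=rgb_mask.dtype), axis=-1)
        one_hot[..., class_index] = matches
    return one_hot

def binary_mask(one_hot, class_index):
    """Single-channel mask for one class: 0 (black) background, 255 (white) object."""
    return one_hot[..., class_index] * np.uint8(255)

# A 2x2 toy mask: sky on the top row, giraffe and grass on the bottom row.
rgb = np.array(
    [[[0, 0, 255], [0, 0, 255]],
     [[255, 0, 0], [0, 255, 0]]],
    dtype=np.uint8,
)
one_hot = rgb_mask_to_one_hot(rgb, PALETTE)
print(one_hot.shape)            # (2, 2, 3)
print(binary_mask(one_hot, 0))  # giraffe mask: 255 only at the bottom-left pixel
```

Each of the $N$ channels of `one_hot` is exactly the binary mask for one class, scaled to 0/1 instead of 0/255.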

## II. U-Net Architecture Deep Dive

<img src="https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-architecture.png">

The U-Net architecture, introduced in 2015 for biomedical image segmentation, is named for its characteristic ‘U’ shape and remains one of the most widely used segmentation models. It fundamentally employs an **Encoder-Decoder technique**.

### A. The Encoder (Contracting Path)

The left side of the U-Net architecture is the Encoder or Contracting Path.

1.  **Function:** Its purpose is to compress or **encode the input information** into a smaller, embedded representation (downscaling).
2.  **Convolutional Technique:** It consists of the repeated application of **3x3 unpadded convolutions**, each followed by a ReLU activation. Because no padding is used, every 3x3 convolution shrinks the height and width of its input by 2 pixels.
3.  **Downsampling:** Max Pooling (2x2 with a stride of 2) is used to downsample or halve the size of the image representation.
4.  **Channel Management:** At every downsampling step, the number of feature channels (kernels) is **doubled**.

### B. The Decoder (Expansive Path)

The right side of the U-Net architecture is the Decoder or Expansive Path.

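1.  **Function:** It takes the smaller, encoded representation and scales it back up (up-scaling) toward the input resolution. With padded convolutions the output shape matches the input shape; with the unpadded convolutions of the original paper the output is somewhat smaller (388 x 388 for a 572 x 572 input).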
2.  **Upsampling Technique:** It uses **up-convolution** (transposed convolution), which doubles the height and width of the feature map using learnable parameters rather than a fixed interpolation rule.
3.  **Channel Management:** At every up-sampling step, the number of feature channels is **halved**.
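
To make the upsampling step concrete, here is a minimal NumPy sketch of a 2x2 transposed convolution with stride 2. It is deliberately simplified to a single channel with no bias, so it does not show the channel halving, and the kernel values merely stand in for learned weights. The function name `up_conv_2x2` is an assumption for this example.

```python
import numpy as np

def up_conv_2x2(feature_map, kernel):
    """2x2 transposed convolution with stride 2 on a single-channel map.

    Each input pixel is multiplied by the 2x2 kernel and written to its own
    non-overlapping 2x2 block of the output, which is what np.kron computes.
    """
    feature_map = np.asarray(feature_map, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    if kernel.shape != (2, 2):
        raise ValueError("this sketch only covers a 2x2 kernel with stride 2")
    return np.kron(feature_map, kernel)

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
k = np.array([[1.0, 0.5],
              [0.5, 0.25]])  # stands in for learned weights
y = up_conv_2x2(x, k)
print(y.shape)  # (4, 4)
print(y)
```

Because the stride equals the kernel size, the output blocks never overlap, so the spatial size exactly doubles.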

### C. Copy and Crop (Skip Connections)

The key feature giving U-Net its powerful localization ability is the use of **skip connections**, referred to as "Copy and Crop" in the paper.

1.  **Mechanism:** In the decoder path, the architecture merges the up-sampled data with feature maps taken directly from the corresponding stage in the encoder path.
2.  **Process:** The feature map from the encoder is "Copied and Cropped" to match the spatial size of the up-sampled feature map from the decoder.
3.  **Concatenation:** Both feature maps are then **concatenated**, effectively doubling the number of channels at that specific stage. This process allows the re-use of fine-grained details lost during the encoder's downsampling steps.
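4.  **Input Constraints:** Concatenation requires the two feature maps to have identical height and width, so the input size must be chosen such that every 2x2 max-pooling step receives an even-sized feature map. With padded convolutions, this means a height and width divisible by $2^d$ for $d$ pooling steps (e.g., 512 or 1024 for $d = 4$). With the unpadded convolutions of the original paper, the 4 pixels lost per pair of convolutions must also be accounted for, which is why the paper uses a 572 x 572 input tile. If an odd size reaches a pooling step, it is rounded down and the up-sampled map no longer lines up with its encoder counterpart.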
5.  **Worked Example:** The encoder feature map is spatially larger than the decoder feature map at the same level, so it is center-cropped to match, and the two are then concatenated along the channel axis. For example, an encoder feature map of shape 64 x 64 x 512 is cropped to 56 x 56 x 512 and concatenated with the up-sampled decoder feature map of shape 56 x 56 x 512, giving 56 x 56 x 1024.
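
The copy-and-crop step can be sketched in NumPy as follows. The channels-last layout (height x width x channels) follows the shapes quoted above, and the helper names `center_crop` and `copy_and_crop` are assumptions for this example rather than names from the paper.

```python
import numpy as np

def center_crop(feature_map, target_height, target_width):
    """Crop the central target_height x target_width window of a feature map."""
    height, width = feature_map.shape[:2]
    if target_height > height or target_width > width:
        raise ValueError("target size is larger than the feature map")
    top = (height - target_height) // 2
    left = (width - target_width) // 2
    return feature_map[top:top + target_height, left:left + target_width]

def copy_and_crop(encoder_map, decoder_map):
    """Crop the encoder map to the decoder's size and concatenate along channels."""
    cropped = center_crop(encoder_map, *decoder_map.shape[:2])
    return np.concatenate([cropped, decoder_map], axis=-1)

encoder_map = np.zeros((64, 64, 512), dtype=np.float32)  # from the contracting path
decoder_map = np.ones((56, 56, 512), dtype=np.float32)   # after up-convolution
merged = copy_and_crop(encoder_map, decoder_map)
print(merged.shape)  # (56, 56, 1024)
```

The first 512 channels of `merged` come from the encoder and the last 512 from the decoder; only the channel count changes, not the height or width.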

## III. Model Evaluation

Once a U-Net model is trained (which involves passing the input image $X$ and the corresponding ground truth mask $Y$ in a supervised manner), its performance is evaluated using metrics specifically designed for segmentation.

### Intersection Over Union (IoU)

The primary metric for image segmentation is Intersection Over Union (IoU).

1.  **Objective:** IoU determines how accurately the **predicted mask overlaps** with the manually labeled ground truth mask.
2.  **Calculation:** It is calculated as the ratio of the area of intersection between the predicted mask and the ground truth mask, divided by the area of their union (Intersection / Union).
3.  **Score:** The IoU yields a score between 0 and 1, where 1 indicates perfect overlap. This score is typically calculated for multiple segments and then averaged (mean IoU) to gauge overall model accuracy.
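
A minimal NumPy sketch of IoU and mean IoU is shown below. The function names are made up for this example, and ignoring a class that appears in neither mask when averaging is one common convention that is assumed here, not something the text above prescribes.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over Union of two binary masks of the same shape."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred_mask, true_mask).sum()
    if union == 0:
        return float("nan")  # assumed convention: class absent from both masks
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return float(intersection / union)

def mean_iou(pred_labels, true_labels, num_classes):
    """Average the per-class IoU over label maps holding class indices."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    scores = [iou(pred_labels == c, true_labels == c) for c in range(num_classes)]
    return float(np.nanmean(scores))  # skips classes marked as nan above

pred = np.array([[1, 1],
                 [0, 0]])
true = np.array([[1, 0],
                 [0, 0]])
print(iou(pred == 1, true == 1))  # 0.5 (1 shared pixel out of 2 in the union)
print(mean_iou(pred, true, num_classes=2))
```

In this toy case class 1 scores 1/2 and class 0 scores 2/3, so the mean IoU is their average.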



### Pixel-wise Softmax and Cross-Entropy Loss in Segmentation
<img src="../images/loss_unet.png">

In semantic segmentation, each pixel $x$ in the image domain $\Omega \subset \mathbb{Z}^2$ is classified into one of $K$ classes. The network outputs an activation $a_k(x)$ for each class $k$ at every pixel. A **pixel-wise softmax** converts these activations into probabilities:

$$
p_k(x) = \frac{e^{a_k(x)}}{\sum_{k'=1}^{K} e^{a_{k'}(x)}}
$$

Here, $p_k(x) \approx 1$ for the class $k$ with the highest activation $a_k(x)$, and $p_k(x) \approx 0$ for all other classes. Because the softmax is applied independently at every pixel, the output is a full probability distribution over the $K$ classes at each pixel.
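
The softmax above can be sketched in NumPy for an activation array of shape (height, width, K). The channels-last layout and the function name `pixelwise_softmax` are assumptions for this example; subtracting the per-pixel maximum is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def pixelwise_softmax(activations):
    """Softmax over the last (class) axis, applied independently at each pixel."""
    activations = np.asarray(activations, dtype=float)
    # Shifting by the per-pixel maximum avoids overflow in exp().
    shifted = activations - activations.max(axis=-1, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=-1, keepdims=True)

# Toy activations for a 2x2 image with K = 3 classes.
activations = np.array(
    [[[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]],
     [[0.0, 0.0, 5.0], [1.0, 1.0, 1.0]]]
)
probabilities = pixelwise_softmax(activations)
print(probabilities.shape)            # (2, 2, 3)
print(probabilities.argmax(axis=-1))  # predicted class index per pixel
```

The probabilities at each pixel sum to 1, and a pixel with equal activations (bottom right) receives a uniform distribution over the classes.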

The **cross-entropy loss** (also called the energy function $E$) measures how close the predicted probability for the true class is to 1:

$$
E = -\sum_{x \in \Omega} w(x) \log\left(p_{y(x)}(x)\right)
$$

where $y(x)$ is the ground truth label of pixel $x$, and $w(x)$ is an optional weighting factor to handle class imbalance or emphasize important regions. Minimizing $E$ trains the network to assign high probabilities to the correct class at each pixel, improving segmentation accuracy.
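
The energy function can be sketched in NumPy as below. This is an illustration under assumptions: activations are laid out as (height, width, K), labels are integer class indices, and the weights default to 1 at every pixel when no weight map is given. The function name `weighted_cross_entropy` is made up for this example.

```python
import numpy as np

def weighted_cross_entropy(activations, labels, weights=None):
    """E = -sum over pixels of w(x) * log p_{y(x)}(x), from raw activations."""
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels, dtype=np.int64)
    # Log-softmax over the class axis, shifted for numerical stability.
    shifted = activations - activations.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the ground truth class at every pixel.
    true_log_probs = np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    if weights is None:
        weights = np.ones_like(true_log_probs)  # assumed default: uniform weights
    return float(-(np.asarray(weights, dtype=float) * true_log_probs).sum())

labels = np.array([[0, 1],
                   [2, 0]])
uninformed = np.zeros((2, 2, 3))           # equal activations for all classes
confident = 20.0 * np.eye(3)[labels]       # large activation on the true class
print(weighted_cross_entropy(uninformed, labels))  # 4 * log(3)
print(weighted_cross_entropy(confident, labels))   # close to 0
```

With equal activations every pixel contributes $\log 3$ to the loss, and the loss approaches 0 as the probability of the true class approaches 1.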