# [IAPR][iapr]: Final project - Chocolate Recognition


**Moodle group ID:** *3*

**Kaggle challenge:** *Deep learning*

**Kaggle team name (exact):** "*Byte the Bar*"

**Author 1 (sciper):** Nathann Morand (296190)

**Author 2 (sciper):** David Croce (327277)

**Author 3 (sciper):** Felipe Ramirez (331471)

**Due date:** 21.05.2025 (11:59 pm)


## Key Submission Guidelines:
- **Before submitting your notebook, <span style="color:red;">rerun</span> it from scratch!** Go to: `Kernel` > `Restart & Run All`
- **Only groups of three will be accepted**, except in exceptional circumstances.


[iapr]: https://github.com/LTS5/iapr2025

---

# Introduction
We were tasked with writing a program that counts how many instances of each of 13 praline classes appear in a cluttered image.
The model must be trained from scratch, and only a very limited number of training images (90) is provided.
The score is computed using a modified F1 score that takes into account the difference between the predicted and the true number of pralines.

For our approach, we built a convolutional model based on the YOLO architecture, but rewrote the network head to directly predict the number of instances of each class. We named our architecture YOCO: You Only Count Once. To train it, we wrote a synthetic dataset generator that pastes pralines cropped from the training dataset on top of the empty backgrounds that were extracted from it.

# Dataset & Preprocessing
The original dataset offers 90 JPEG images of 6000×4000 px. The images were taken under similar lighting conditions and are relatively well lit.
The inference dataset has the same properties.

## EDA
Images from the dataset look like the following: varying backgrounds, miscellaneous objects scattered around, and a few pralines.
<img src="chocolate_data/dataset_project_iapr2025/train/L1000957.JPG" width="600" height="400"/>

Using the provided CSV, we computed the histogram of the number of chocolates per image and the histogram of the number of instances per class, to see how well the classes are balanced. We also show how many individual praline instances are available across the dataset and the maximum number of chocolates of each class present in a single image.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV file
df = pd.read_csv('chocolate_data/dataset_project_iapr2025/train.csv')

# Calculate the total number of chocolates per image
df['total_chocolates'] = df.iloc[:, 1:].sum(axis=1)

# Print the total number of chocolates in the dataset
total_chocolates_in_dataset = df['total_chocolates'].sum()
print(f"Total number of chocolates in the dataset: {total_chocolates_in_dataset}")

# Get the maximum number of instances per class
# (columns 1..13 only, so the 'total_chocolates' column added above is excluded)
max_per_class = df.iloc[:, 1:14].max()

# Print the results
print("Maximum number of instances for each chocolate class in a single image:")
print(max_per_class)

# Plot the histogram for total chocolates per image
plt.figure(figsize=(12, 6))
# Bin edges run to max + 1 so that every integer total gets its own bin
plt.hist(df['total_chocolates'], bins=range(df['total_chocolates'].min(), df['total_chocolates'].max() + 2), edgecolor='black')
plt.title('Histogram of Total Chocolates per Image')
plt.xlabel('Total Number of Chocolates')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Plot the histogram for class distribution (13 class columns, excluding total chocolates column)
class_counts = df.iloc[:, 1:14].sum(axis=0)
plt.figure(figsize=(12, 6))
class_counts.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Class Balance Histogram')
plt.xlabel('Chocolate Class')
plt.ylabel('Number of Chocolates')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()



## Instance extraction
To build the synthetic dataset generator, we manually cropped the 583 pralines present in the 90 images, using a helper script to draw each box and save the crop to a new file. A second helper script displays each crop and moves it to the correct folder once the operator types the class id, which makes sorting faster.

Once the pralines were cropped, we spent many hours removing the background from each of them using Paint or GIMP. We then wrote another helper script to re-orient, center and rescale the pralines. The rescaling factors allowed us to measure the size variation between pralines: it stays within ±20%, so a single detection head is sufficient. We did the same with the miscellaneous objects and patched the holes they left in the backgrounds.

Here is an overview of the cleaned pralines:

In [None]:
import os
import matplotlib.pyplot as plt
from PIL import Image

# Define path and ignored folders
base_path = 'chocolate_data/praline_clean'
ignored_folders = {"MiscObjects", "raw_praline", "references", "Background"}

# Get valid subfolders
valid_folders = [f for f in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, f)) and f not in ignored_folders]

# Function to display a 6x6 image mosaic
def display_mosaic(images, title):
    fig, axes = plt.subplots(6, 6, figsize=(12, 12))
    fig.suptitle(title, fontsize=16)
    for i in range(36):
        ax = axes[i // 6, i % 6]
        if i < len(images):
            ax.imshow(images[i])
        ax.axis('off')
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()

# Process each valid folder
for folder in valid_folders:
    folder_path = os.path.join(base_path, folder)
    image_files = [f for f in os.listdir(folder_path) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
    image_files = image_files[:36]  # Limit to first 36 images

    images = []
    for img_file in image_files:
        img_path = os.path.join(folder_path, img_file)
        try:
            img = Image.open(img_path).convert('RGB')
            img = img.resize((200, 200))
            images.append(img)
        except Exception as e:
            print(f"Error loading image {img_file}: {e}")

    display_mosaic(images, title=folder)


## Synthetic dataset generation

To train our chocolate detection and counting model, we developed a synthetic dataset generator that creates realistic scenes by compositing high-quality, transparent PNG cutouts of pralines and clutter onto large photographic backgrounds. The generator is designed to mimic natural variations in object placement, orientation, scale, and density while ensuring dataset consistency and coverage across all 13 chocolate classes.
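
The compositing step can be illustrated with a minimal Pillow sketch. The function name `paste_cutout` and its parameters are hypothetical and are not taken from our generator; the sketch only shows how a transparent cutout can be rotated, scaled and blended onto a background using its own alpha channel.

```python
from PIL import Image


def paste_cutout(background, cutout, center, scale=1.0, angle=0.0):
    """Rotate, scale and alpha-blend an RGBA cutout onto `background` (in place).

    Returns the bounding box (x0, y0, x1, y1) of the pasted cutout.
    Hypothetical helper, written for illustration only.
    """
    # expand=True enlarges the canvas so rotated corners are not clipped
    rotated = cutout.rotate(angle, resample=Image.BICUBIC, expand=True)
    new_size = (
        max(1, round(rotated.width * scale)),
        max(1, round(rotated.height * scale)),
    )
    resized = rotated.resize(new_size, Image.LANCZOS)

    # Top-left corner such that the cutout ends up centred on `center`
    x0 = center[0] - resized.width // 2
    y0 = center[1] - resized.height // 2

    # Passing the RGBA image as the mask blends with its alpha channel
    background.paste(resized, (x0, y0), resized)
    return (x0, y0, x0 + resized.width, y0 + resized.height)
```

In a real scene the returned box would be kept so that later objects can be checked for overlap against it.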

#### Directory Structure

The image assets are organized as follows:

```
../chocolate_data/
├── praline_clean/
│   ├── <ChocolateClass>/        # 1000x1000 transparent PNGs per class
│   ├── MiscObjects/             # 1000x1000 PNGs of clutter (non-chocolates)
│   └── Background/              # 6000x4000 high-res background images
└── syntheticDataset/
    ├── images/train/            # Generated training images
    ├── images/val/              # Generated validation images
    ├── train.csv                # YOLO-style count labels
    └── val.csv
```

#### Scene Generation Logic

For each synthetic scene, the generator performs the following steps:

1. **Background Selection**: A random high-resolution background (6000×4000 px) is selected.

2. **Misc Object Placement**:
   - Randomly place 0–6 miscellaneous objects per image.
   - Each object receives a random rotation (0–360°) and is scaled with ±20% jitter applied to base scale factors.
   - Objects are not allowed to overlap but may touch. Up to 20 retry attempts are made to find valid positions.

3. **Chocolate Placement**:
   - Each of the 13 chocolate classes is assigned 0–5 instances per image based on a skewed probability distribution favoring 0 or 1.
   - Each chocolate instance is rescaled (with class-specific base factors and jitter), rotated randomly, and placed while checking that overlaps do not exceed 20% with any existing chocolates (touching is allowed).
   - At least one pair of chocolates (if more than two are present) is forced to touch to reflect realistic clutter.

4. **Label Generation**:
   - Labels are saved in CSV format compatible with YOLO count training, with each row representing a synthetic image and columns encoding the number of instances per class.
   - Example:
     ```
     id,Jelly White,Jelly Milk,...,Stracciatella
     1000001,2,1,...,0
     ```

5. **Scene Saving**:
   - The final composite image can optionally be resized using a configurable downscaling factor.
   - Image and corresponding label are saved in the appropriate `train` or `val` directory, based on a configurable split ratio (default: 80/20).
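
The placement logic of steps 2 and 3 can be sketched with rejection sampling. This is a simplified illustration, not our generator: it uses axis-aligned square boxes instead of the real cutout shapes, the count weights are made up (the text only states that 0 and 1 are favoured), the overlap measure is one possible definition, and the retry limit stated for miscellaneous objects is reused for chocolates as an assumption.

```python
import random

MAX_OVERLAP = 0.20  # maximum allowed overlap between chocolates (from the text)
MAX_RETRIES = 20    # retry limit (stated for misc objects, assumed for chocolates)


def sample_count(rng):
    """Draw a per-class count in 0..5; the weights are hypothetical."""
    return rng.choices(range(6), weights=[45, 30, 12, 7, 4, 2])[0]


def overlap_ratio(a, b):
    """Intersection area divided by the smaller box area (assumed definition).

    Boxes are (x0, y0, x1, y1).
    """
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (w * h) / smaller


def place_boxes(n, size, canvas=(6000, 4000), rng=None):
    """Place up to `n` square boxes; a box is dropped after MAX_RETRIES failures."""
    rng = rng or random.Random()
    placed = []
    for _ in range(n):
        for _ in range(MAX_RETRIES):
            x = rng.randint(0, canvas[0] - size)
            y = rng.randint(0, canvas[1] - size)
            box = (x, y, x + size, y + size)
            if all(overlap_ratio(box, other) <= MAX_OVERLAP for other in placed):
                placed.append(box)
                break
    return placed
```

Because a box that cannot be placed is simply dropped, the labels must be written from the boxes actually placed rather than from the counts initially sampled.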

#### Performance & Scalability

- The generator uses multi-threading to parallelize image composition, utilizing `N-2` CPU cores to avoid overloading the system.
- Progress is tracked using `tqdm` to provide live feedback.
- The total number of generated scenes is configurable (default: 10,000), and all key parameters (e.g., scaling jitter, image size, split ratio) can be tuned easily.
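
The parallelization can be sketched with the standard library. `generate_scene` is a placeholder for the real compositing work, and the starting id is only inferred from the file names shown in this report; both are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def generate_scene(scene_id):
    """Placeholder for composing and saving one scene; returns its file name."""
    return f"{scene_id}.JPG"


def generate_all(n_scenes, first_id=1000000):
    # Leave two cores free, but always keep at least one worker
    workers = max(1, (os.cpu_count() or 1) - 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order; wrapping it in
        # tqdm(..., total=n_scenes) would add the live progress bar
        return list(pool.map(generate_scene, range(first_id, first_id + n_scenes)))
```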

#### Result
Using the technique described above, we generated datasets of between 1,000 and 20,000 images similar to the following.
<img src="chocolate_data/syntheticDataset/images/train/1000000.JPG" width="600" height="400"/>



# Model Architecture

YOCO Architecture – You Only Count Once

### Objective
The **YOCO** model is a custom convolutional neural network designed to **predict per-class object counts** in high-resolution images containing chocolate pralines. It avoids doing object detection + instance counting by directly classifying the number of instances for each of the 13 classes.


### Model Architecture

The YOCO network processes an RGB image of shape `(3, 800, 1200)` and outputs a tensor of shape `(13, 6)`, where:
- 13 is the number of chocolate classes
- 6 is the number of count classes: `[0, 1, 2, 3, 4, 5 or more]`

Each row of the output (one per chocolate class) is a **logit vector** scoring the 6 possible count bins.

#### Feature Extractor

| Layer | Details |
|-------|---------|
| Conv2d(3 → 16)  | Kernel=3×3, Stride=1, Padding=1 |
| LeakyReLU(0.1)  | Non-linearity |
| MaxPool2d       | 2×2 downsampling |
| Conv2d(16 → 32) | Same pattern repeated |
| Conv2d(32 → 64) | |
| Conv2d(64 → 128) | |
| Conv2d(128 → 256) | |
| Conv2d(256 → 256) | Final feature map shape: **[B, 256, 12, 18]** for an 800×1200 input, assuming each of the six blocks ends with 2×2 pooling |
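
As a quick sanity check on feature map sizes, the spatial dimensions can be computed by hand. The sketch assumes that every convolution preserves the size (3×3 kernel, stride 1, padding 1) and that every block ends with a 2×2 max pooling that rounds down, which is PyTorch's default. The function name is hypothetical.

```python
def feature_map_size(height, width, n_pools):
    """Spatial size after `n_pools` 2x2 poolings with floor rounding."""
    for _ in range(n_pools):
        height //= 2
        width //= 2
    return height, width


print(feature_map_size(800, 1200, n_pools=6))  # -> (12, 18)
```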

Unlike classical YOLO, which expects a square image and pads the input to fit, our network can take the image directly after a simple rescaling that preserves the aspect ratio. We chose to downscale the images by a factor of 4, which makes the smallest praline ~50 px across; this was deemed sufficient to keep small features visible.

#### Head

| Layer | Details |
|-------|---------|
| Conv2d(256 → 128) | 3×3 conv with LeakyReLU(0.3) |
| Conv2d(128 → 78)  | 1×1 conv (78 = 13 classes × 6 count bins) |
| AdaptiveAvgPool2d | Global average pooling to [B, 78, 1, 1] |
| Reshape           | Final output shape: **[B, 13, 6]** |

These parameters worked well in practice, although some tuning might have allowed a smaller network.
We only have one head because the pralines have roughly the same size and do not vary much across images, which makes a multi-scale system unnecessary.
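
The two tables can be turned into a compact PyTorch sketch. This is a reconstruction from the tables only, not the code in `src/yoco.py`: the class name `YOCOSketch`, the padding of the 3×3 head convolution and the presence of pooling in every block are assumptions.

```python
import torch
from torch import nn

NUM_CLASSES, NUM_BINS = 13, 6


def conv_block(c_in, c_out):
    """Conv 3x3 -> LeakyReLU(0.1) -> 2x2 max pooling, as in the feature extractor table."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.LeakyReLU(0.1),
        nn.MaxPool2d(2),
    )


class YOCOSketch(nn.Module):
    def __init__(self):
        super().__init__()
        channels = [3, 16, 32, 64, 128, 256, 256]
        self.features = nn.Sequential(
            *[conv_block(a, b) for a, b in zip(channels, channels[1:])]
        )
        self.head = nn.Sequential(
            nn.Conv2d(256, 128, kernel_size=3, padding=1),  # padding is assumed
            nn.LeakyReLU(0.3),
            nn.Conv2d(128, NUM_CLASSES * NUM_BINS, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),  # global average pooling -> [B, 78, 1, 1]
        )

    def forward(self, x):
        # Reshape the 78 channels into 13 classes x 6 count bins
        return self.head(self.features(x)).view(-1, NUM_CLASSES, NUM_BINS)


model = YOCOSketch()
with torch.no_grad():
    print(model(torch.randn(1, 3, 800, 1200)).shape)  # torch.Size([1, 13, 6])
```

Thanks to the global average pooling, the same network accepts any input resolution large enough to survive the six poolings.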

### Output Format and Count Encoding

For each class, the model predicts a **probability distribution over 6 count classes**:

```
0, 1, 2, 3, 4, 5+
```

> This is a **one-hot classification** over possible counts. For example, if there are exactly 3 pralines of class 5, the target vector is `[0, 0, 0, 1, 0, 0]` for that class.

This encoding has multiple advantages:
- It reflects the **discrete nature** of count prediction.
- It's more robust than regression for small integer counts.
- The `5+` bin handles the practical upper bound seen in the training data (no class had more than 5 pralines in a single image).
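
The encoding and decoding of counts can be written in a few lines of plain Python. The helper names are hypothetical and only illustrate the scheme described above.

```python
NUM_BINS = 6  # bins 0, 1, 2, 3, 4 and "5 or more"


def count_to_bin(count):
    """Clamp a raw count to the last bin, which stands for 5 or more."""
    return min(count, NUM_BINS - 1)


def one_hot(count):
    """One-hot target vector for a single chocolate class."""
    target = [0] * NUM_BINS
    target[count_to_bin(count)] = 1
    return target


def decode(logits):
    """Predicted count = index of the largest logit (5 means 5 or more)."""
    return max(range(len(logits)), key=lambda i: logits[i])


print(one_hot(3))  # [0, 0, 0, 1, 0, 0]
```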

---

### Activation Function

- **LeakyReLU** is used throughout the model:
  - `LeakyReLU(0.1)` in the feature extractor
  - `LeakyReLU(0.3)` in the head
- Unlike ReLU, LeakyReLU allows a small gradient when the unit is not active, which helps **avoid dead neurons** and improves training stability.

---

### Loss Function

The final loss is computed as the **average cross-entropy loss across all 13 classes**:

```python
loss = sum(criterion(logits[:, i], targets[:, i]) for i in range(NUM_CLASSES)) / NUM_CLASSES
```

Where:
- `logits[:, i]` is the 6-dimensional output for class `i`
- `targets[:, i]` is the ground-truth one-hot encoded target vector for class `i`
- `criterion` is `nn.CrossEntropyLoss()`

We used a softmax (applied internally by `nn.CrossEntropyLoss`) because:
- Count values are **mutually exclusive**.
- Softmax creates a valid **probability distribution** over the possible counts.
- It is robust and simple to implement.
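
The loss formula above can be checked on random data. Note one deliberate difference from our description: this sketch stores each target as a class index (0 to 5) rather than a one-hot vector, a form that `nn.CrossEntropyLoss` also accepts. The tensors are random stand-ins, not real data.

```python
import torch
from torch import nn

NUM_CLASSES, NUM_BINS = 13, 6
criterion = nn.CrossEntropyLoss()

torch.manual_seed(0)
logits = torch.randn(4, NUM_CLASSES, NUM_BINS)          # [B, 13, 6] raw scores
targets = torch.randint(0, NUM_BINS, (4, NUM_CLASSES))  # [B, 13] count bins

# Per-class loop, as in the formula above
loop_loss = sum(
    criterion(logits[:, i], targets[:, i]) for i in range(NUM_CLASSES)
) / NUM_CLASSES

# Same value in a single call: merge the batch and class dimensions
flat_loss = criterion(logits.reshape(-1, NUM_BINS), targets.reshape(-1))
```

Both forms average the same `B × 13` cross-entropy terms, so they agree up to floating-point rounding; the flattened call simply avoids the Python loop.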


Here is the full schematic of the architecture:
<img src="yoco_arch.png" width="1000" height="1000"/>



In [16]:
# The following (disabled) snippet generates the network architecture image.
"""
from torchviz import make_dot
import torch
from src.yoco import YOCO
model = YOCO()
x = torch.randn(1, 3, 800, 1200)
y = model(x)
make_dot(y, params=dict(model.named_parameters())).render("yoco_arch", format="png")
"""


# Training

### Dataset Split

The dataset was split into **80% training** and **20% validation**.
Since synthetic data generation is no longer a bottleneck, this ratio was chosen for convenience and has **little practical impact on final performance**.


### Optimization
We used the default choice, Adam:
- **Learning Rate:** 1e-3
  - Higher values were unstable.
  - Lower values (1e-4) led to extremely slow convergence.

| Parameter     | Value                  |
|---------------|------------------------|
| Batch Size    | 16 (GPU), 1 (CPU)      |
| Epochs        | 50–100                 |
| Learning Rate | 1e-3                   |
| Optimizer     | Adam                   |
| Scheduler     | None                   |
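
A single optimization step with these settings might look as follows. The network here is a tiny stand-in so that the sketch runs quickly, and the images and labels are random; only the optimizer, the learning rate and the batch size come from the table above.

```python
import torch
from torch import nn

NUM_CLASSES, NUM_BINS = 13, 6

# Tiny stand-in network (NOT the real YOCO model), same output layout
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.LeakyReLU(0.1),
    nn.Conv2d(8, NUM_CLASSES * NUM_BINS, kernel_size=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # [B, 78]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()


def train_step(images, counts):
    """One Adam step; `counts` holds integer count bins of shape [B, 13]."""
    model.train()
    optimizer.zero_grad()
    logits = model(images).view(-1, NUM_BINS)  # [B * 13, 6]
    loss = criterion(logits, counts.view(-1))
    loss.backward()
    optimizer.step()
    return loss.item()


images = torch.randn(16, 3, 32, 48)  # batch size 16, random stand-in images
counts = torch.randint(0, NUM_BINS, (16, NUM_CLASSES))
losses = [train_step(images, counts) for _ in range(5)]
```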


### Training Loss Plot
<img src="src/loss.jpeg" width="600" height="400"/>


# Evaluation

### Custom F1-Score Metric

To evaluate the performance of our YOCO (You Only Count Once) model, we were provided with a custom **F1-score** metric tailored for multi-class object counting.
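
The exact formula of the provided metric is not reproduced in this report. As an illustration only, one plausible count-aware F1 treats, for each class, the overlap `min(true, pred)` as true positives and the absolute difference `|true - pred|` as the sum of false positives and false negatives. The sketch below implements that assumed form; the function names are hypothetical.

```python
def counting_f1(true_counts, pred_counts):
    """Count-aware F1 for one image (assumed form, not the official definition)."""
    tp = sum(min(t, p) for t, p in zip(true_counts, pred_counts))
    # |t - p| equals false positives plus false negatives for that class
    fpn = sum(abs(t - p) for t, p in zip(true_counts, pred_counts))
    if 2 * tp + fpn == 0:
        return 1.0  # empty image predicted as empty: treated as perfect (assumption)
    return 2 * tp / (2 * tp + fpn)


def mean_counting_f1(all_true, all_pred):
    """Average of the per-image scores over a dataset."""
    scores = [counting_f1(t, p) for t, p in zip(all_true, all_pred)]
    return sum(scores) / len(scores)
```

For example, with true counts `[2, 1, 0]` and predictions `[1, 1, 1]`, the overlap is 2 and the total error is 2, giving 4 / 6 ≈ 0.67.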

### Evaluation Process

During training, evaluation was conducted on the **synthetic validation set**, which is structurally similar to the training images but randomly generated. However, to verify the generalization of the model, we also computed the custom F1-score on the **real training images**. This helped test whether the model overfit to synthetic artifacts or learned transferable features.

### Quantitative Results

**F1 Scores per Class:**

| Class              | F1 Score |
|--------------------|----------|
| Jelly_White        | 0.8073   |
| Jelly_Milk         | 0.9487   |
| Jelly_Black        | 0.8602   |
| Amandina           | 0.9315   |
| Crème_brulée       | 0.9750   |
| Triangolo          | 1.0000   |
| Tentation_noir     | 0.9892   |
| Comtesse           | 0.9204   |
| Noblesse           | 1.0000   |
| Noir_authentique   | 1.0000   |
| Passion_au_lait    | 1.0000   |
| Arabia             | 0.9897   |
| Stracciatella      | 1.0000   |

**Global F1 Score:** **0.9555**

> The lower F1 scores observed for the *Jelly* series may be due to their **reflective surfaces**, which cause inconsistent appearances under lighting variation. We hypothesize that modifying the synthetic generator to include **color jittering or reflective noise simulation** could improve performance on these classes. Comtesse also scores noticeably worse than most other chocolates. It is a large, round, white chocolate, so one hypothesis is that it is hard to tell apart from a white background and that the model tends to see it where there is none.

### Visual Evaluation

Visual comparison between predictions and ground truth on real images was performed manually. However, due to the counting-only nature of the model and absence of bounding boxes, typical detection visualizations (e.g., masks or heatmaps) are not applicable. The results confirmed that the model is **visually accurate** in object presence and counts, particularly on well-lit and non-reflective samples.


# Inference on Original Images & Result
We wrote a script that loads a checkpoint, runs inference on the test dataset and formats the results as a CSV file for Kaggle. It also recomputes the per-class F1 score using the real images from the training dataset and the provided CSV.
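
The CSV formatting step can be sketched with the standard library. The function name is hypothetical, and the header spelling (class names with spaces, in the order of the results table) is an assumption based on the label example shown earlier; it should be checked against the official sample submission.

```python
import csv
import io

# Assumed header names; verify against the official sample submission
CLASS_NAMES = [
    "Jelly White", "Jelly Milk", "Jelly Black", "Amandina", "Crème brulée",
    "Triangolo", "Tentation noir", "Comtesse", "Noblesse", "Noir authentique",
    "Passion au lait", "Arabia", "Stracciatella",
]


def write_submission(predictions, stream):
    """Write `predictions` (image id -> list of 13 counts) as CSV to `stream`."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["id"] + CLASS_NAMES)
    for image_id, counts in sorted(predictions.items()):
        if len(counts) != len(CLASS_NAMES):
            raise ValueError(f"expected {len(CLASS_NAMES)} counts for {image_id}")
        writer.writerow([image_id] + list(counts))


buffer = io.StringIO()
write_submission({1000001: [2, 1] + [0] * 11}, buffer)
```

Writing to a stream keeps the function easy to test; in practice the stream would be a file opened with `newline=""` and `encoding="utf-8"`.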

- Show per-image table: `image ID | GT counts | Predicted counts | F1`

- Final F1 score on original dataset.
- Example success cases (model does great).
- Example failure cases (too cluttered, occlusions, etc.)
- Insights about how well model generalizes to real scenes.


# Discussion & Limitations
- What worked well (e.g., synthetic scene generation).
- What didn’t (e.g., failure on specific chocolate classes?).
- Limitations of training from scratch.
- Ideas for future work (e.g., more complex scene synthesis, weak supervision, semi-supervised learning).


## Appendix

### Development Log – What We Tried

This project went through several phases of experimentation. Here's a chronological overview of the different approaches we explored, with some thoughts on each one:

1. **Autoencoder**
   Our first (slightly naïve) attempt was to use an autoencoder, without fully understanding how it might work in the context of object counting. The idea was to reconstruct the image and spot anomalies that might correspond to pralines. Unsurprisingly, this didn’t yield meaningful results.

2. **Ultralytics YOLO**
   We then tried something more out-of-the-box: **Ultralytics YOLO**, using a pretrained model fine-tuned on our data. While promising at first, we realised later on that it was not allowed due to requiring the Ultralytics package.

3. **Classical Machine Learning**
   We also explored traditional ML methods (such as regression, k-NN and random forests) using handcrafted features from the images. As expected, without proper feature extraction or spatial context, and given our own lack of experience with these methods, the models struggled to generalize and offered only limited performance. We did find a stupid heuristic that almost reaches the baseline, but the real-world usefulness of such a solution remains unclear.

4. **YOLOv1 Reimplementation in PyTorch**
   To get a better grasp of the detection pipeline, we reimplemented **YOLOv1 from scratch in PyTorch**. This helped us understand how detection and classification interact, and gave us more control over the architecture and training process.

5. **YOCO – “You Only Count Once” (Custom Model)**
   Building on our previous attempts, we designed our own architecture specifically for **multi-class counting without localization**. Our model, YOCO, features a custom prediction head that outputs, for each class, a distribution over 6 count bins (0 to 5+).
   Although inspired by YOLO in principle (for the convolution part), this model was built entirely from scratch and tailored to our needs: high-resolution image input, accurate per-class counts, and no bounding boxes.


# Bonus

Although we competed in the ML challenge, we also came up with a simple solution for the classical challenge by doing simple statistics on the training labels only.
We wrote a script to find the "universal" answer that yields the highest F1 score in O(1) time, and thus managed to reach an F1 of ~0.4 by always predicting 1 instance for each of the 13 classes. Although of little practical use, we found it original, funny and stupid enough to deserve a mention here.
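
The search for such a constant answer can be sketched as follows. The score used here is an assumed count-aware F1 (overlap as true positives, absolute difference as errors), not necessarily the official metric, the search is restricted to one shared value for all classes, and the toy labels are invented for the example.

```python
def f1_for_constant(labels, k):
    """Mean assumed count-aware F1 when predicting `k` for every class of every image."""
    scores = []
    for counts in labels:
        tp = sum(min(c, k) for c in counts)
        err = sum(abs(c - k) for c in counts)
        scores.append(1.0 if 2 * tp + err == 0 else 2 * tp / (2 * tp + err))
    return sum(scores) / len(scores)


def best_constant(labels, max_count=5):
    """Constant prediction in 0..max_count with the highest mean score."""
    return max(range(max_count + 1), key=lambda k: f1_for_constant(labels, k))


# Invented toy labels: 3 images, 4 classes
toy_labels = [[1, 0, 2, 0], [0, 1, 0, 0], [3, 0, 0, 1]]
print(best_constant(toy_labels))  # -> 1
```

Once the best constant is known, "inference" no longer looks at the image at all, which is where the O(1) cost comes from.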