# [IAPR][iapr]: Final project - Chocolate Recognition


**Moodle group ID:** *3*
**Kaggle challenge:** *Deep learning*
**Kaggle team name (exact):** "*Byte the Bar*"

**Author 1 (sciper):** Nathann Morand (296190)

**Author 2 (sciper):** David Croce (327277)

**Author 3 (sciper):** Felipe Ramirez (331471)

**Due date:** 21.05.2025 (11:59 pm)


## Key Submission Guidelines:
- **Before submitting your notebook, <span style="color:red;">rerun</span> it from scratch!** Go to: `Kernel` > `Restart & Run All`
- **Only groups of three will be accepted**, except in exceptional circumstances.


[iapr]: https://github.com/LTS5/iapr2025

---

# Introduction
We are tasked with writing a program that counts how many instances of each of the 13 praline classes appear in a cluttered image.
We must train our model from scratch and are provided with only a very limited number of training images (90).
The score is computed using a modified F1 score that takes into account the difference between the predicted and the true number of pralines.

For our approach, we built a convolutional model based on the YOLO architecture, but rewrote the network head so that it directly predicts the number of instances of each class. We named our architecture YOCO: You Only Count Once. To train it, we built a synthetic dataset generator that pastes pralines cropped from the training dataset on top of the empty backgrounds that were extracted.

# Dataset & Preprocessing
The original dataset provides 90 JPG images of 6000x4000 px. The images were taken in similar lighting conditions and are relatively well lit.
The inference dataset has the same properties.

## EDA
Images from the dataset look like the following: varying backgrounds, miscellaneous objects scattered around, and a few pralines.
<img src="chocolate_data/dataset_project_iapr2025/train/L1000957.JPG" width="600" height="400"/>

Using the provided CSV, we computed the histogram of the number of chocolates per image and the histogram of the number of instances per class to see how well the classes are balanced. We also report how many individual praline instances are available across the dataset and the maximum number of chocolates of each class present in a single image.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV file
df = pd.read_csv('chocolate_data/dataset_project_iapr2025/train.csv')

# Class columns: everything after the image id (captured before the total column is added)
class_cols = df.columns[1:]

# Calculate the total number of chocolates per image
df['total_chocolates'] = df[class_cols].sum(axis=1)

# Print the total number of chocolates in the dataset
total_chocolates_in_dataset = df['total_chocolates'].sum()
print(f"Total number of chocolates in the dataset: {total_chocolates_in_dataset}")

# Get the maximum number of instances per class (the total column is excluded)
max_per_class = df[class_cols].max()

# Print the results
print("Maximum number of instances for each chocolate class in a single image:")
print(max_per_class)

# Plot the histogram for total chocolates per image
# (one bin per integer count; "+ 2" keeps the largest count in its own bin)
plt.figure(figsize=(12, 6))
plt.hist(df['total_chocolates'], bins=range(df['total_chocolates'].min(), df['total_chocolates'].max() + 2), align='left', edgecolor='black')
plt.title('Histogram of Total Chocolates per Image')
plt.xlabel('Total Number of Chocolates')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Plot the histogram for class distribution (excluding total chocolates column)
# Columns 1 to 13 hold the 13 class counts; the slice end is exclusive
class_counts = df.iloc[:, 1:14].sum(axis=0)
plt.figure(figsize=(12, 6))
class_counts.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Class Balance Histogram')
plt.xlabel('Chocolate Class')
plt.ylabel('Number of Chocolates')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()



## Instance extraction
To build the synthetic dataset generator, we manually cropped the 583 pralines present in the 90 images, using a helper script to draw each box and save the crop to a new file. We wrote a second helper script that displays each crop and moves it to the correct folder once the operator types the class ID, which made sorting faster.

Once the pralines were cropped, we spent many hours removing the background from each of them using Paint or GIMP. We then wrote another helper script to re-orient, center, and rescale the pralines. The rescaling factors allowed us to measure the size variation between pralines: it stays within ±20%, so a single detection head is sufficient. We did the same with the miscellaneous objects and patched the holes they left in the backgrounds.

Here is an overview of the cleaned pralines:

In [None]:
import os
import matplotlib.pyplot as plt
from PIL import Image

# Define path and ignored folders
base_path = 'chocolate_data/praline_clean'
ignored_folders = {"MiscObjects", "raw_praline", "references", "Background"}

# Get valid subfolders
valid_folders = [f for f in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, f)) and f not in ignored_folders]

# Function to display a 6x6 image mosaic
def display_mosaic(images, title):
    fig, axes = plt.subplots(6, 6, figsize=(12, 12))
    fig.suptitle(title, fontsize=16)
    for i in range(36):
        ax = axes[i // 6, i % 6]
        if i < len(images):
            ax.imshow(images[i])
        ax.axis('off')
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()

# Process each valid folder
for folder in valid_folders:
    folder_path = os.path.join(base_path, folder)
    image_files = [f for f in os.listdir(folder_path) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
    image_files = image_files[:36]  # Limit to first 36 images

    images = []
    for img_file in image_files:
        img_path = os.path.join(folder_path, img_file)
        try:
            img = Image.open(img_path).convert('RGB')
            img = img.resize((200, 200))
            images.append(img)
        except Exception as e:
            print(f"Error loading image {img_file}: {e}")

    display_mosaic(images, title=folder)


## Synthetic dataset generation

To train our chocolate detection and counting model, we developed a synthetic dataset generator that creates realistic scenes by compositing high-quality, transparent PNG cutouts of pralines and clutter onto large photographic backgrounds. The generator is designed to mimic natural variations in object placement, orientation, scale, and density while ensuring dataset consistency and coverage across all 13 chocolate classes.

#### Directory Structure

The image assets are organized as follows:

```
../chocolate_data/
├── praline_clean/
│   ├── <ChocolateClass>/        # 1000x1000 transparent PNGs per class
│   ├── MiscObjects/             # 1000x1000 PNGs of clutter (non-chocolates)
│   └── Background/              # 6000x4000 high-res background images
└── syntheticDataset/
    ├── images/train/            # Generated training images
    ├── images/val/              # Generated validation images
    ├── train.csv                # YOLO-style count labels
    └── val.csv
```

#### Scene Generation Logic

For each synthetic scene, the generator performs the following steps:

1. **Background Selection**: A random high-resolution background (6000×4000 px) is selected.

2. **Misc Object Placement**:
   - Randomly place 0–6 miscellaneous objects per image.
   - Each object receives a random rotation (0–360°) and is scaled with ±20% jitter applied to base scale factors.
   - Objects are not allowed to overlap but may touch. Up to 20 retry attempts are made to find valid positions.

3. **Chocolate Placement**:
   - Each of the 13 chocolate classes is assigned 0–5 instances per image based on a skewed probability distribution favoring 0 or 1.
   - Each chocolate instance is rescaled (with class-specific base factors and jitter), rotated randomly, and placed while checking that overlaps do not exceed 20% with any existing chocolates (touching is allowed).
   - At least one pair of chocolates (if more than two are present) is forced to touch to reflect realistic clutter.

4. **Label Generation**:
   - Labels are saved in CSV format compatible with YOLO count training, with each row representing a synthetic image and columns encoding the number of instances per class.
   - Example:
     ```
     id,Jelly White,Jelly Milk,...,Stracciatella
     1000001,2,1,...,0
     ```

5. **Scene Saving**:
   - The final composite image can optionally be resized using a configurable downscaling factor.
   - Image and corresponding label are saved in the appropriate `train` or `val` directory, based on a configurable split ratio (default: 80/20).
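
The placement rules of steps 2 and 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the generator itself: objects are reduced to axis-aligned squares, overlap is measured as the intersection area divided by the area of the smaller box, and the names `sample_count`, `overlap_ratio`, `place_objects` as well as the weights in `COUNT_WEIGHTS` are hypothetical.

```python
import random

# Hypothetical weights for 0..5 instances per class, skewed towards 0 and 1.
COUNT_WEIGHTS = [0.55, 0.25, 0.10, 0.05, 0.03, 0.02]
MAX_OVERLAP = 0.20   # maximum tolerated overlap between two chocolates
MAX_RETRIES = 20     # placement attempts before an object is dropped


def sample_count(rng):
    """Draw a per-class instance count in 0..5 from the skewed distribution."""
    return rng.choices(range(6), weights=COUNT_WEIGHTS, k=1)[0]


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box.

    Boxes are (x, y, width, height) tuples.
    """
    ix = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    iy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if ix <= 0 or iy <= 0:
        return 0.0  # disjoint or merely touching
    return (ix * iy) / min(a[2] * a[3], b[2] * b[3])


def place_objects(n, size, canvas=(6000, 4000), seed=0):
    """Place up to n square boxes with +/-20% size jitter and limited overlap."""
    rng = random.Random(seed)
    placed = []
    for _ in range(n):
        side = round(size * rng.uniform(0.8, 1.2))
        for _attempt in range(MAX_RETRIES):
            box = (rng.randint(0, canvas[0] - side),
                   rng.randint(0, canvas[1] - side), side, side)
            if all(overlap_ratio(box, other) <= MAX_OVERLAP for other in placed):
                placed.append(box)
                break  # valid position found, move on to the next object
    return placed


rng = random.Random(42)
counts = [sample_count(rng) for _ in range(13)]   # one count per class
boxes = place_objects(sum(counts), size=400)
print(counts, len(boxes))
```

An object that finds no valid position within the retry budget is simply skipped in this sketch, so a real generator would need to update the label counts accordingly.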

#### Performance & Scalability

- The generator uses multi-threading to parallelize image composition, utilizing `N-2` CPU cores to avoid overloading the system.
- Progress is tracked using `tqdm` to provide live feedback.
- The total number of generated scenes is configurable (default: 10,000), and all key parameters (e.g., scaling jitter, image size, split ratio) can be tuned easily.
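
The `N-2` worker pattern could look roughly like the sketch below. It assumes a thread pool from the standard library; `render_scene` is a placeholder for the actual composition step and `generate_all` is an illustrative name. Progress reporting with `tqdm` is left out to keep the example dependency-free.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def render_scene(scene_id):
    """Placeholder for composing and saving one synthetic scene."""
    return scene_id, f"{scene_id}.JPG"


def generate_all(n_scenes, first_id=1000000):
    # Leave two cores free, but always keep at least one worker.
    workers = max(1, (os.cpu_count() or 1) - 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the scene ids.
        return list(pool.map(render_scene, range(first_id, first_id + n_scenes)))


results = generate_all(8)
print(results[0], len(results))
```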

#### Result
Using the technique described above, we could generate between 1,000 and 20,000 images similar to the following.
<img src="chocolate_data/syntheticDataset/images/train/1000000.JPG" width="600" height="400"/>



# Model Architecture

- Motivation for direct counting (not object detection).
- Description of YOLO-style architecture:
  - Include diagram or schematic of model.
- Explanation of output layer: `13 classes × 6 neurons`
- Input size, normalization, padding strategy if relevant (to be specified later), number of parameters.
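
The `13 classes × 6 neurons` output can be read as one 6-way classification per class, where neuron `k` stands for "k instances" (counts 0 to 5, matching the range used by the generator). Below is a minimal decoding sketch, assuming the head output is flattened class by class; `decode_counts` is an illustrative name and the layout is an assumption.

```python
NUM_CLASSES = 13
NUM_BINS = 6  # neuron k of a class encodes "k instances" (0..5)


def decode_counts(flat_logits):
    """Turn 13 x 6 raw scores into one predicted count per class."""
    if len(flat_logits) != NUM_CLASSES * NUM_BINS:
        raise ValueError("expected 78 values")
    counts = []
    for c in range(NUM_CLASSES):
        scores = flat_logits[c * NUM_BINS:(c + 1) * NUM_BINS]
        # argmax over the 6 bins; softmax is monotonic, so it is not needed here
        counts.append(max(range(NUM_BINS), key=scores.__getitem__))
    return counts


# Toy example: class 0 votes for 2 instances, every other class for 0.
logits = [0.0] * (NUM_CLASSES * NUM_BINS)
logits[2] = 5.0
print(decode_counts(logits))
```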



# Training
- Train/val data split of 80/20 (data is no longer a limiting factor, so this has little impact).
- Hyperparameters (batch size, epochs, learning rate, etc.): 10⁻³, otherwise training takes forever.
- Loss function & why softmax per class is used.
- Optimizer & scheduler if used.
- Include training loss plot.
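
One way to read "softmax per class" is that the 6 neurons of each class form a probability distribution over the counts 0 to 5, trained with cross-entropy against the true count. The sketch below assumes exactly that, with the loss averaged over the 13 classes; `log_softmax` and `count_loss` are illustrative names, and the actual training code may differ.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax of a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def count_loss(logits, targets):
    """Mean cross-entropy over classes.

    logits: one list of 6 scores per class; targets: true count (0..5) per class.
    """
    losses = [-log_softmax(row)[t] for row, t in zip(logits, targets)]
    return sum(losses) / len(losses)


# With uniform scores the loss equals log(6) regardless of the targets.
uniform = [[0.0] * 6 for _ in range(13)]
print(count_loss(uniform, [0] * 13))
```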


# Evaluation
- Description of the custom F1-score metric.
- Evaluation process (on synthetic validation or original images?)
- Show:
  - F1 score per image and average.
  - Class-wise accuracy/confusion matrix.
  - Visual comparison of ground truth vs prediction on sample images.
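
As a reference for the metric, here is a minimal sketch of a count-based F1. The exact formula is an assumption on our side: true positives are taken as the per-class minimum of true and predicted counts, and every unit of count difference is treated as one error. The names `count_f1` and `mean_f1` are illustrative.

```python
def count_f1(y_true, y_pred):
    """F1 for one image from per-class counts.

    TP  = sum over classes of min(true, predicted)
    FPN = sum over classes of |true - predicted|  (false positives + negatives)
    """
    tp = sum(min(t, p) for t, p in zip(y_true, y_pred))
    fpn = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    if tp == 0 and fpn == 0:
        return 1.0  # empty image predicted as empty (convention assumed here)
    return 2 * tp / (2 * tp + fpn)


def mean_f1(truths, preds):
    """Average the per-image score over a set of images."""
    return sum(count_f1(t, p) for t, p in zip(truths, preds)) / len(truths)


print(count_f1([2, 1, 0], [1, 1, 1]))
```

In the toy call, two chocolates are matched and two counts are off, so the score is 4 / 6.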


# Inference on Original Images & Result
- Description of final inference setup.
- Load final model.
- Parse original images & corresponding CSV ground truth.
- Predict and compare with real counts.
- Show per-image table: `image ID | GT counts | Predicted counts | F1`
- Final F1 score on original dataset.
- Example success cases (model does great).
- Example failure cases (too cluttered, occlusions, etc.)
- Insights about how well model generalizes to real scenes.


# Discussion & Limitations
- What worked well (e.g., synthetic scene generation).
- What didn’t (e.g., failure on specific chocolate classes?).
- Limitations of training from scratch.
- Ideas for future work (e.g., more complex scene synthesis, weak supervision, semi-supervised learning).


# Appendix
- Logbook (describe what we tried and in which order):
	- autoencoder (without understanding how it works)
	- Ultralytics YOLO
	- classical ML
	- YOLO v1 (torch)
	- YOCO (custom head)

# Bonus

Although we competed in the ML challenge, we also came up with a simple solution for the classical challenge, based on simple statistics of the training labels only.
We wrote a script to find the "universal" answer that yields the highest F1 score in O(1) time, and reached an F1 of ~0.4 by always predicting 1 instance for each of the 13 classes. Although of little practical use, we found it original, funny, and stupid enough to deserve a mention here.
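
A search for such a constant answer could be sketched as below. Several things are assumptions here: the count-based F1 formula, the use of coordinate ascent as the search strategy (a stand-in, not necessarily what our script does), the names `count_f1` and `best_constant`, and the toy labels, which are made up for illustration only.

```python
def count_f1(y_true, y_pred):
    """Assumed count-based F1: 2*TP / (2*TP + FPN) for one image."""
    tp = sum(min(t, p) for t, p in zip(y_true, y_pred))
    fpn = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return 1.0 if tp + fpn == 0 else 2 * tp / (2 * tp + fpn)


def best_constant(labels, max_count=5, sweeps=3):
    """Coordinate ascent over one fixed count per class."""
    n_classes = len(labels[0])
    guess = [0] * n_classes

    def score(g):
        return sum(count_f1(row, g) for row in labels) / len(labels)

    for _ in range(sweeps):
        for c in range(n_classes):
            # Try every count for class c while keeping the others fixed.
            guess[c] = max(range(max_count + 1),
                           key=lambda k: score(guess[:c] + [k] + guess[c + 1:]))
    return guess, score(guess)


# Toy labels (3 classes, 4 images), made up for illustration only.
toy = [[1, 0, 2], [1, 1, 0], [0, 1, 1], [2, 0, 1]]
guess, f1 = best_constant(toy)
print(guess, round(f1, 3))
```

Once the constant vector is found, "inference" is a lookup that ignores the image entirely, which is where the O(1) cost comes from.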