# Image Modeling with Transfer Learning

## **1. Introduction**  

This project aims to develop a **multimodal classification model** that leverages both **text and image data** for product matching and classification in the **Rakuten dataset**. We have already completed the **text modeling phase**, where we explored both **Machine Learning** and **Deep Learning models**.  

Now, we shift our focus to the **image modeling phase**, utilizing **Convolutional Neural Networks (CNNs)** and **Transfer Learning** to enhance model performance. 

----
Given the complexity of the **image dataset** and the similarities between different classes, we opted for **Transfer Learning** using **CNN-based architectures**. These models, pre-trained on large-scale datasets (such as ImageNet), provide a strong **feature extraction** capability, reducing computation time and improving classification accuracy. 
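
The feature-extraction idea can be sketched as follows. This is a minimal illustration, assuming the `tensorflow.keras.applications` API; the helper name `build_feature_extractor`, the `SUPPORTED_MODELS` set, and the choice to freeze the whole base are assumptions made for the example, not a record of the exact training code.

```python
# Names of the Keras applications considered in this report.
SUPPORTED_MODELS = {
    "MobileNetV2", "NASNetMobile", "DenseNet121", "Xception", "InceptionV3",
    "ResNet50", "InceptionResNetV2", "EfficientNetB7", "VGG16", "VGG19",
}


def build_feature_extractor(name, input_size, weights="imagenet"):
    """Return a pre-trained convolutional base with frozen weights."""
    if name not in SUPPORTED_MODELS:
        raise ValueError(f"Unknown model: {name!r}")

    # Imported lazily so the name check above works without TensorFlow installed.
    import tensorflow as tf

    constructor = getattr(tf.keras.applications, name)
    base = constructor(
        include_top=False,   # drop the 1000-class ImageNet classifier
        weights=weights,     # "imagenet" or None (random initialization)
        input_shape=(input_size, input_size, 3),
    )
    base.trainable = False   # assumption: the base is used as a fixed feature extractor
    return base
```

For instance, `build_feature_extractor("MobileNetV2", 224)` would return a frozen MobileNetV2 base on which a classification head can be stacked.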


## **2. Choice of Pre-trained Models** 

Among the models listed [here](https://keras.io/api/applications/), we favored those with the **fewest parameters** to **reduce training time** and **stay within our computational constraints**. The table below ranks the candidate models by parameter count.  

| **Models**                     | **Number of Parameters** |
|---------------------------------|-------------------------|
| MobileNetV2                     | 3,538,984              |
| NASNetMobile                     | 5,326,716              |
| DenseNet121                      | 8,062,504              |
| Xception                          | 22,910,480             |
| InceptionV3                       | 23,851,784             |
| ResNet50 (Reference Model)        | 25,636,712             |
| InceptionResNetV2                 | 55,873,736             |
| EfficientNetB7                    | 66,658,687             |
| VGG16                              | 138,357,544            |
| VGG19                              | 143,667,240            |
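
As a small illustration of this selection criterion, the sketch below filters the models by a parameter budget. The counts are copied from the table above; the 10-million budget and the function name `models_within_budget` are hypothetical and only serve the example.

```python
# Parameter counts copied from the table above.
PARAM_COUNTS = {
    "MobileNetV2": 3_538_984,
    "NASNetMobile": 5_326_716,
    "DenseNet121": 8_062_504,
    "Xception": 22_910_480,
    "InceptionV3": 23_851_784,
    "ResNet50": 25_636_712,
    "InceptionResNetV2": 55_873_736,
    "EfficientNetB7": 66_658_687,
    "VGG16": 138_357_544,
    "VGG19": 143_667_240,
}


def models_within_budget(counts, max_params):
    """Return the model names with at most `max_params` parameters, lightest first."""
    eligible = [name for name, n_params in counts.items() if n_params <= max_params]
    return sorted(eligible, key=counts.get)


# Hypothetical budget of 10 million parameters.
light_models = models_within_budget(PARAM_COUNTS, max_params=10_000_000)
print(light_models)  # ['MobileNetV2', 'NASNetMobile', 'DenseNet121']
```

In Keras, the count for a given model can be checked with `model.count_params()`.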


## **3. Training Parameters and Data Augmentation Strategies**  

### ✅ Training Parameters  

- **Optimizer**: Adam, a widely used optimizer that adapts the learning rate of each parameter.  
- **Loss Function**: *Categorical Cross-Entropy*, well suited to our multi-class classification problem.  
- **Batch Size**: Both **32 and 64** were tested.  
- **Learning Rate (LR)**: Models were first trained with an initial **LR of 0.001** (the default value), then with a much smaller value of **0.00001**, and finally with the best values found through **Learning Rate Optimization**.  
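
The sketch below shows what the loss computes for a single sample and how these settings could be passed to Keras. It is illustrative only: `categorical_cross_entropy` and `compile_model` are hypothetical helper names, and the `accuracy` metric is an assumption.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def compile_model(model, learning_rate=0.001):
    """Compile a Keras model with Adam and categorical cross-entropy."""
    # Imported lazily so the pure-Python loss above runs without TensorFlow.
    from tensorflow.keras.optimizers import Adam

    model.compile(
        optimizer=Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],  # assumption: accuracy is tracked during training
    )
    return model


# The true class is index 1; the model assigns it a probability of 0.7.
loss = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```

Here the loss reduces to `-log(0.7)`, so it shrinks as the probability assigned to the true class approaches 1.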

### ✅ **Data Augmentation** Applied to the Training Dataset  

- **Pixel scaling** between **0 and 1**.  
- **Image shearing** at a certain angle, giving the image a more stretched or skewed appearance.  
- **Random rotation** range: **[-45°, +45°]**.  
- **Horizontal translation** (image shifting along the width).  
- **Vertical translation** (image shifting along the height).  
- **Zoom range**: **Zoom out (-20%) & Zoom in (+20%)**.  
- **Horizontal flipping**.  
- **Vertical flipping**.  
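
These transformations map onto the arguments of Keras' legacy `ImageDataGenerator`, the class that provides `flow_from_directory`. The sketch below is a possible configuration, not the exact one used: the shear and shift magnitudes are not stated in this report, so the values shown are placeholders, and `build_train_datagen` is a hypothetical helper name.

```python
AUGMENTATION_KWARGS = {
    "rescale": 1.0 / 255,       # pixel scaling to [0, 1]
    "shear_range": 0.2,         # placeholder: shear intensity not given in the report
    "rotation_range": 45,       # random rotation in [-45°, +45°]
    "width_shift_range": 0.1,   # placeholder: horizontal shift as a fraction of width
    "height_shift_range": 0.1,  # placeholder: vertical shift as a fraction of height
    "zoom_range": 0.2,          # zoom factor sampled in [0.8, 1.2]
    "horizontal_flip": True,
    "vertical_flip": True,
}


def build_train_datagen(**overrides):
    """Create the training generator; keyword overrides replace the defaults above."""
    # Imported lazily so the configuration can be inspected without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    return ImageDataGenerator(**{**AUGMENTATION_KWARGS, **overrides})
```

Validation and test images would typically only be rescaled, for example with `ImageDataGenerator(rescale=1.0 / 255)`, so that metrics are computed on unmodified images.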

---

## **Classification Layers**  

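```python
# Classification head stacked on top of the pre-trained convolutional base.
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D

# `model` is the Sequential model that already contains the pre-trained base.
model.add(GlobalAveragePooling2D())  # average each feature map down to one value
model.add(Dense(units=1024, activation='relu'))

model.add(Dropout(rate=0.2))
model.add(Dense(units=512, activation='relu'))

model.add(Dropout(rate=0.2))
model.add(Dense(units=27, activation='softmax'))  # one output per product class
```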

## **Results of Tested Models**  

The table below presents the results of all the tested models, including their input size, training parameters, and performance metrics (**accuracy and weighted F1-score**).  

| **Model**            | **Original Input Shape** | **Training Input Shape** | **Source Image Size** | **LR (Default)** | **Epochs** | **Batch Size** | **Accuracy** | **Weighted F1-score** |
|----------------------|------------------------|-------------------------|----------------------|---------------|--------|------------|-----------|------------------|
| **InceptionResNetV2** | 299×299 | 299×299 | 500×500 | 0.001 | 40 | 64 | 0.62 | 0.61 |
| **InceptionResNetV2** | 299×299 | 224×224 | 256×256 | 0.001 | 40 | 64 | 0.58 | 0.57 |
| **DenseNet121**      | 224×224 | 224×224 | 256×256 | 0.001 | 40 | 64 | 0.61 | 0.61 |
| **Xception**         | 299×299 | 299×299 | 500×500 | 0.001 | 40 | 64 | 0.61 | 0.60 |
| **Xception**         | 299×299 | 224×224 | 256×256 | 0.001 | 40 | 64 | 0.58 | 0.56 |
| **InceptionV3**      | 299×299 | 299×299 | 500×500 | 0.001 | 40 | 64 | 0.60 | 0.59 |
| **InceptionV3**      | 299×299 | 256×256 | 256×256 | 0.001 | 40 | 64 | 0.57 | 0.56 |
| **MobileNetV2**      | 224×224 | 224×224 | 256×256 | 0.001 | 40 | 64 | 0.59 | 0.59 |
| **MobileNetV2**      | 224×224 | 224×224 | 256×256 | 0.001 | 20 | 32 | 0.59 | 0.58 |
| **VGG16**            | 224×224 | 224×224 | 256×256 | 0.001 | 40 | 64 | 0.58 | 0.57 |
| **VGG16**            | 224×224 | 224×224 | 256×256 | 0.001 | 20 | 64 | 0.58 | 0.57 |
| **VGG16**            | 224×224 | 128×128 | 256×256 | 0.001 | 10 | 32 | 0.54 | 0.53 |
| **VGG19**            | 224×224 | 224×224 | 256×256 | 0.001 | 40 | 64 | 0.57 | 0.56 |
| **NASNetMobile**     | 224×224 | 224×224 | 256×256 | 0.001 | 40 | 64 | 0.58 | 0.57 |
| **ResNet50** (to be reviewed: re-run the benchmark) | 224×224 | 128×128 | 500×500 | 0.001 | 50 | 32 | 0.33 | 0.29 |
| **ResNet50** (to be reviewed: re-run the benchmark) | 224×224 | 256×256 | 256×256 | 0.001 | 40 | 64 | 0.33 | 0.29 |
| **EfficientNetB7**   | 600×600 | 224×224 | 256×256 | 0.001 | 40 | 64 | 0.12 | 0.03 |
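
For reference, the weighted F1-score averages the per-class F1-scores, weighting each class by its number of true samples. The pure-Python sketch below illustrates the computation on toy labels; the function names are illustrative, and in practice a library implementation such as scikit-learn's `f1_score(..., average="weighted")` would normally be used.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of the per-class F1-scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: a model that always predicts the majority class 0.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]
toy_accuracy = accuracy(y_true, y_pred)  # 4 correct out of 6
toy_f1 = weighted_f1(y_true, y_pred)     # lower: classes 1 and 2 score 0
```

In this toy case the weighted F1-score falls below the accuracy because two classes are never predicted, which is why both metrics are reported side by side.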

### Explanation of Table Columns

The table below describes the key input size parameters used in training the models, including their original pre-trained input shape, the actual input shape used during training, and the raw image size.

| **Concept**              | **Definition**                                                               | **Example in this project**                        |
|--------------------------|-----------------------------------------------------------------------------|---------------------------------------------------|
| **Original Input Shape** | The input size the pre-trained model was originally trained on with ImageNet. | EfficientNetB7 → 600×600, ResNet50 → 224×224 |
| **Training Input Shape** | The actual input size used for training (defined by `target_size` in `flow_from_directory`). It sometimes differs from the original input shape to test performance variations. | `target_size=(224, 224)` or `target_size=(299, 299)` depending on the experiment |
| **Source Image Size**    | The raw size of the dataset images, before any preprocessing or resizing. | 500×500 (256×256 was also tested for performance comparison) |


####  *Original Input Shape* 
This is the input size the **pre-trained model** was originally trained on using **ImageNet**. It is recommended to keep this size for optimal performance in **transfer learning**.

| **Model**          | **Original Input Shape** |
|-------------------|------------------------|
| EfficientNetB7    | 600×600  |
| InceptionResNetV2 | 299×299  |
| InceptionV3       | 299×299  |
| Xception          | 299×299  |
| DenseNet121       | 224×224  |
| ResNet50          | 224×224  |
| MobileNetV2       | 224×224  |
| NASNetMobile      | 224×224  |
| VGG16 / VGG19     | 224×224  |

**Why is this important?**  
If using **transfer learning**, it is best to match this input size to get the most accurate results.
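
One simple way to keep the two aligned is to look up the original input shape when building the data generators. The sketch below is illustrative: the sizes are copied from the table above, and `target_size_for` is a hypothetical helper name.

```python
# Original (ImageNet) input sizes, copied from the table above.
ORIGINAL_INPUT_SIZE = {
    "EfficientNetB7": 600,
    "InceptionV3": 299,
    "Xception": 299,
    "DenseNet121": 224,
    "ResNet50": 224,
    "MobileNetV2": 224,
    "VGG16": 224,
    "VGG19": 224,
}


def target_size_for(model_name):
    """Return the (height, width) tuple to pass as `target_size` for a model."""
    size = ORIGINAL_INPUT_SIZE[model_name]
    return (size, size)


print(target_size_for("Xception"))  # (299, 299)
```

The returned tuple can be passed directly as the `target_size` argument of `flow_from_directory`.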

---

####  *Training Input Shape*
This is the **actual size of images used during training** after preprocessing. It is defined in `target_size` when using **`flow_from_directory`** in TensorFlow/Keras.

**Why can it be different from the Original Input Shape?**  
- Sometimes, a **smaller input size is used** to **reduce memory usage** and speed up training.
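- Example: InceptionV3 was originally trained on **299×299**, but a smaller training input shape was also tested (see the results table) to measure how it affects performance.

**Example usage in `flow_from_directory`:**
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescaling only, for brevity; the training generator also applies augmentation.
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = datagen.flow_from_directory(
    'data/train',
    target_size=(224, 224),  # Training Input Shape (testing 224×224 instead of 299×299)
    batch_size=32,
    class_mode='categorical'  # one-hot labels, as required by categorical cross-entropy
)
```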
---

### **Key Observations**  

✅ **The top 5 models based on weighted F1-score** are:  

| **Model**            | **Training Input Shape** | **Source Image Size** | **LR (Default)** | **Epochs** | **Batch Size** | **Accuracy** | **Weighted F1-score** |
|----------------------|----------------------|----------------------|---------------|--------|------------|-----------|------------------|
| **InceptionResNetV2** | 299×299  | 500×500  | 0.001 | 40 | 64 | **0.62** | **0.61** |
| **DenseNet121**      | 224×224  | 256×256  | 0.001 | 40 | 64 | **0.61** | **0.61** |
| **Xception**         | 299×299  | 500×500  | 0.001 | 40 | 64 | **0.61** | **0.60** |
| **InceptionV3**      | 299×299  | 500×500  | 0.001 | 40 | 64 | **0.60** | **0.59** |
| **MobileNetV2**      | 224×224  | 256×256  | 0.001 | 40 | 64 | **0.59** | **0.59** |


---

### **Impact of Training Input Shape and Source Image Size**
📌 **Observations**:

1️⃣ **Models trained at their Original Input Shape perform better**  
   - Models trained at their **original input shape** (e.g., **InceptionResNetV2 at 299×299**, **Xception at 299×299**) achieved **higher F1-scores** than when trained with smaller input sizes.  
   - **Example:** **InceptionResNetV2 (299×299) obtained an F1-score of 0.61**, while reducing the training size to **224×224 lowered it to 0.57**.  

2️⃣ **Using a larger Source Image Size (500×500) improves performance for models with a 299×299 input shape**  
   - **InceptionResNetV2, Xception, and InceptionV3 trained from 500×500 source images performed better** than when trained from **256×256 images**.  
   - **Example:** **Xception (500×500 → 299×299) achieved an F1-score of 0.60**, whereas **Xception (256×256 → 224×224) only reached 0.56**.  
   - **Explanation:** Larger source images **preserve more image detail** before resizing, which reduces information loss.  
   - **Caveat:** In these runs the source image size and the training input shape changed together, so the two effects cannot yet be separated.  

3️⃣ **For models with a 224×224 input shape, a 256×256 Source Image Size had no visible negative impact**  
   - **DenseNet121 and MobileNetV2 reached F1-scores of 0.61 and 0.59 from 256×256 source images**, on par with the 299×299 models trained from 500×500 images.  
   - This suggests that **as long as the source image is reasonably larger than the final training size (224×224), no major performance drop occurs**.  

---

📌 **Recommendation:**  
- **For models requiring a 299×299 training input, use a larger Source Image Size (e.g., 500×500) to maintain high performance.**  
- **For models with a 224×224 training input, a 256×256 Source Image Size is sufficient and does not negatively impact performance.**  
- **Always align the Training Input Shape with the Original Input Shape to maximize performance.** 🚀  


➡️ **Next Steps**: We will focus on the top-performing models for **fine-tuning** and **hyperparameter optimization** to further improve results.  


## Optimized Table with Relevant Configurations

| **Model**            | **Original Input Shape** | **Training Input Shape** | **Source Image Size** | **LR (Default)** | **Epochs** | **Batch Size** | **Accuracy** | **Weighted F1-score** |
|----------------------|------------------------|-------------------------|----------------------|---------------|--------|------------|-----------|------------------|
| **InceptionResNetV2** | 299×299 | 299×299 | **500×500** | 0.001 | 40 | 64 | 0.62 | 0.61 |
| **InceptionResNetV2** | 299×299 | 299×299 | **256×256** | 0.001 | 40 | 64 | ? | ? |
| **DenseNet121**       | 224×224 | 224×224 | **500×500** | 0.001 | 40 | 64 | ? | ? |
| **Xception**         | 299×299 | 299×299 | **500×500** | 0.001 | 40 | 64 | 0.61 | 0.60 |
| **Xception**         | 299×299 | 299×299 | **256×256** | 0.001 | 40 | 64 | ? | ? |
| **InceptionV3**      | 299×299 | 299×299 | **500×500** | 0.001 | 40 | 64 | 0.60 | 0.59 |
| **InceptionV3**      | 299×299 | 299×299 | **256×256** | 0.001 | 40 | 64 | ? | ? |
| **MobileNetV2**      | 224×224 | 224×224 | **500×500** | 0.001 | 40 | 64 | ? | ? |
| **MobileNetV2**      | 224×224 | 224×224 | **256×256** | 0.001 | 40 | 64 | 0.59 | 0.59 |
| **VGG16**            | 224×224 | 224×224 | **500×500** | 0.001 | 40 | 64 | ? | ? |
| **VGG16**            | 224×224 | 224×224 | **256×256** | 0.001 | 40 | 64 | 0.58 | 0.57 |
| **VGG19**            | 224×224 | 224×224 | **500×500** | 0.001 | 40 | 64 | ? | ? |
| **NASNetMobile**     | 224×224 | 224×224 | **500×500** | 0.001 | 40 | 64 | ? | ? |
| **ResNet50** (to be reviewed: re-run the benchmark) | 224×224 | 224×224 | **500×500** | 0.001 | 50 | 32 | ? | ? |
| **EfficientNetB7**   | 600×600 | 600×600 | **500×500** | 0.001 | 40 | 64 | ? | ? |


# REVIEW


### ✅ Configurations Worth Testing

| **Model**       | **Original Input Shape** | **Training Input Shape** | **Source Image Size** | **LR (Default)** | **Epochs** | **Batch Size** |
|---------------|------------------------|-------------------------|----------------------|---------------|--------|------------|
| **DenseNet121**  | 224×224 | 224×224 | **500×500** | 0.001 | 40 | 64 |
| **MobileNetV2**  | 224×224 | 224×224 | **500×500** | 0.001 | 40 | 64 |
| **EfficientNetB7** (to be replaced by EfficientNetB5, 456×456) | 600×600 | 600×600 | **500×500** → 456×456 | 0.001 | 40 | 64 |
 
The **EfficientNet B0 to B7** base models each expect a different input shape. The expected input shape for each variant is listed [here](https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/).


#### Why?
- Justification: these configurations respect the original input dimensions of the pre-trained models, which is essential to take full advantage of transfer learning. Using a 500×500 source image size preserves rich visual information. This would also justify keeping:

| **Model**       | **Original Input Shape** | **Training Input Shape** | **Source Image Size** |
|---------------|------------------------|-------------------------|----------------------|
| **DenseNet121**  | 224×224 | 224×224 | **256×256** |
| **MobileNetV2**  | 224×224 | 224×224 | **256×256** |
| **EfficientNetB7** (to be replaced by EfficientNetB5, 456×456) | 600×600 | 600×600 | **500×500** |

![image.png](attachment:image.png)

- It would also show that resizing from 500×500 to 256×256 had no impact.




### 🚨 Tests to Remove Because They Are Not Relevant

| ❌ **Model**           | **Original Input Shape** | **Training Input Shape** | **Source Image Size** |
|----------------------|------------------------|-------------------------|----------------------|
| **InceptionResNetV2** | 299×299 | 224×224 | **256×256** |
| **Xception**         | 299×299 | 224×224 | **256×256** |
| **EfficientNetB7**   | 600×600 | 224×224 | **256×256** |

#### Why are these tests not sensible?
1️⃣ **The Source Image Size is too small relative to the Training Input Shape**  
   - 256×256 is smaller than the input shape the model expects, which forces artificial upscaling (e.g., 256×256 → 299×299 for Inception/Xception). This upscaling adds no new information and may even distort the features of the model pre-trained on ImageNet.  

2️⃣ **EfficientNetB7 was designed for 600×600**  

3️⃣ **InceptionResNetV2 and Xception are optimized for 299×299**  
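
A quick check of this kind can flag configurations that would require upscaling before any training run is launched. The sketch below is illustrative only; `resize_kind` and the example configurations are hypothetical and use square image sizes given as a single integer.

```python
def resize_kind(source_size, training_size):
    """Classify the resize from the source image size to the training input size."""
    if training_size > source_size:
        return "upscale"    # pixels are interpolated; no new information is added
    if training_size < source_size:
        return "downscale"  # detail is reduced, but every pixel comes from real data
    return "none"


# Hypothetical configurations: (model, source image size, training input size).
configs = [
    ("Xception", 256, 299),
    ("Xception", 500, 299),
    ("MobileNetV2", 256, 224),
]
flagged = [name for name, src, tgt in configs if resize_kind(src, tgt) == "upscale"]
print(flagged)  # ['Xception']
```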

---