# Introduction

------------------------------
## Mission - Perform a supervised image classification
------------------------------

You are continuing your work at "Place du marché". You shared the work from your previous mission with the Lead Data Scientist, Linda. She now invites you to go further with image analysis.

Here is the email she sent you.

*Hello,*

*Thank you very much for your work! Here is the next part of your mission:*

*Could you build a **supervised classification from the images**? I would like you to set up **data augmentation** in order to optimize the model.*

*In addition, we want to expand our product range to include fine foods.*

*To that end, could you test collecting "champagne"-based products via the [API available here](https://developer.edamam.com/food-database-api) or via the attached Open Food Facts API (which requires no registration)?*

*Could you then provide a Python **script** or **notebook** that extracts the first 10 products into a ".csv" file, containing the following data for each product: foodId, label, category, foodContentsLabel, image.*

*Finally, could you document **your whole approach** and the most relevant analysis **results** in a **presentation** of at most 30 slides, in PDF format?*

*Thanks again, and good luck!*

*Linda*

*PS: Attached you will find an **example** implementation of supervised classification on another dataset to help you.*

# Importing the libraries

In [1]:
## Global 
import os
import time
import pandas as pd
import numpy as np

# Classification and scoring
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Tensorflow
import tensorflow as tf
import tensorflow.keras
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import metrics as kmetrics
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Pillow
import PIL
from PIL import Image
from PIL.ImageOps import *
from PIL.ImageFilter import *

# Keras: only decode_predictions is new here, the other names
# (load_img, img_to_array, preprocess_input, VGG16, Model) are already imported above
from tensorflow.keras.applications.vgg16 import decode_predictions

# Pickle
from pickle import dump

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.image import imread


In [2]:
# Initialize sns
sns.set()

# Reading the dataset

In [3]:
# Read the dataset from the csv file 
df = pd.read_csv("./data/flipkart_com-ecommerce_sample_1050.csv")

In [4]:
# Define a list of columns to keep
selected_columns = ["product_category_tree", "image"]
# Filter the dataframe with the selected columns
df = df[selected_columns]
df.head()

| | product_category_tree | image |
|---|---|---|
| 0 | ["Home Furnishing >> Curtains & Accessories >>... | 55b85ea15a1536d46b7190ad6fff8ce7.jpg |
| 1 | ["Baby Care >> Baby Bath & Skin >> Baby Bath T... | 7b72c92c2f6c40268628ec5f14c6d590.jpg |
| 2 | ["Baby Care >> Baby Bath & Skin >> Baby Bath T... | 64d5d4a258243731dc7bbb1eef49ad74.jpg |
| 3 | ["Home Furnishing >> Bed Linen >> Bedsheets >>... | d4684dcdc759dd9cdf41504698d737d8.jpg |
| 4 | ["Home Furnishing >> Bed Linen >> Bedsheets >>... | 6325b6870c54cd47be6ebfbffa620ec7.jpg |


## Reworking the image field

In [5]:
# Define path to images
path_to_images = "./data/images/"

In [6]:
# Concatenate the path to the name of the image
df["image"] = path_to_images + df["image"]
df.head()

| | product_category_tree | image |
|---|---|---|
| 0 | ["Home Furnishing >> Curtains & Accessories >>... | ./data/images/55b85ea15a1536d46b7190ad6fff8ce7... |
| 1 | ["Baby Care >> Baby Bath & Skin >> Baby Bath T... | ./data/images/7b72c92c2f6c40268628ec5f14c6d590... |
| 2 | ["Baby Care >> Baby Bath & Skin >> Baby Bath T... | ./data/images/64d5d4a258243731dc7bbb1eef49ad74... |
| 3 | ["Home Furnishing >> Bed Linen >> Bedsheets >>... | ./data/images/d4684dcdc759dd9cdf41504698d737d8... |
| 4 | ["Home Furnishing >> Bed Linen >> Bedsheets >>... | ./data/images/6325b6870c54cd47be6ebfbffa620ec7... |
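
Before loading anything, it can be worth checking that every path built this way points to an existing file. A minimal sketch, in which `find_missing_files` is a hypothetical helper name and not part of the notebook:

```python
import os

def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Usage with the dataframe built above (uncomment inside the notebook):
# missing = find_missing_files(df["image"])
# print(f"{len(missing)} image file(s) missing")
```

An empty result means that every row of the dataframe has a matching image on disk.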


## Reworking the categories

In [7]:
# Remove the leading [" and trailing "] characters
df["product_category_tree"] = df["product_category_tree"].apply(lambda i : i.replace('["', ''))
df["product_category_tree"] = df["product_category_tree"].apply(lambda i : i.replace('"]', ''))
# Get the values of the product_category_tree in a series
product_category_tree = df["product_category_tree"]
# Split the category tree on ">>" and strip the whitespace left around each level
categories_list = product_category_tree.apply(lambda i : [c.strip() for c in i.split(">>")])
# Go through categories_list and get the biggest list
max_size = 0
for i in range(len(categories_list)):
    # Check if the size is greater than the max_size already found
    if len(categories_list[i]) > max_size:
        # Set the max_size to the new max size
        max_size = len(categories_list[i])
# Generate the columns names based on the max_size
columns = list(range(1,max_size+1))
# Create a dataframe from the list of categories
categories_df = pd.DataFrame(categories_list.to_list(), columns=columns)
# Get the first column of categories_df and set it in a new column in df
df["category"] = categories_df.iloc[:,0]
df = df.drop("product_category_tree", axis=1)
df.head()

| | image | category |
|---|---|---|
| 0 | ./data/images/55b85ea15a1536d46b7190ad6fff8ce7... | Home Furnishing |
| 1 | ./data/images/7b72c92c2f6c40268628ec5f14c6d590... | Baby Care |
| 2 | ./data/images/64d5d4a258243731dc7bbb1eef49ad74... | Baby Care |
| 3 | ./data/images/d4684dcdc759dd9cdf41504698d737d8... | Home Furnishing |
| 4 | ./data/images/6325b6870c54cd47be6ebfbffa620ec7... | Home Furnishing |
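
The same top-level category can also be extracted with pandas' vectorized string methods, which avoids the explicit loop over the rows. A sketch on two made-up rows shaped like the `product_category_tree` values (the sample strings are illustrative, not taken from the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "product_category_tree": [
        '["Home Furnishing >> Curtains & Accessories >> Curtains"]',
        '["Baby Care >> Baby Bath & Skin"]',
    ]
})

sample["category"] = (
    sample["product_category_tree"]
    .str.strip('[]"')      # drop the surrounding [" and "]
    .str.split(">>")       # one list of levels per row
    .str[0]                # keep the top-level category
    .str.strip()           # remove the whitespace left by the split
)
print(sample["category"].tolist())
```

The final `.str.strip()` matters for the encoding step: without it, a category followed by a sub-level keeps a trailing space and would be treated as a different class from the same name without one.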


## Encoding the category

In [8]:
# Define a LabelEncoder
le = LabelEncoder()
# Fit the LabelEncoder on our categories
le.fit(df["category"])
# Generate the labels
df["label"] = le.transform(df["category"])
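
`LabelEncoder` assigns integers following the sorted order of the class names, and the fitted encoder can map integer predictions back to category names. A short sketch on made-up category values:

```python
from sklearn.preprocessing import LabelEncoder

categories = ["Home Furnishing", "Baby Care", "Watches", "Baby Care"]

encoder = LabelEncoder()
labels = encoder.fit_transform(categories)

# classes_ holds the sorted category names; the index of each name is its label
print(list(encoder.classes_))
print(labels.tolist())
# inverse_transform maps integer labels back to category names
print(encoder.inverse_transform([2, 0]).tolist())
```

Keeping the fitted encoder around (here `le` in the notebook) is what makes it possible to report model predictions as readable category names later on.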

# Supervised classification

## Creating a validation set

In [14]:
# Split data and labels
data = df["image"]
labels = df["label"]
# Split into work data and validation data
X, X_val, y, y_val = train_test_split(data, labels, test_size=0.15, random_state=8)

## Creating the training and test sets

In [15]:
# Split the work data (X, y) into train data and test data,
# so that the validation samples set aside above stay out of both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=8)
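
A three-way split only keeps the validation set independent if the second call operates on the remainder of the first one. A minimal sketch on synthetic data, assuming 1050 samples in 7 balanced classes; the `stratify` argument is an optional addition that the notebook does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the image paths and labels: 7 classes x 150 images (assumption)
paths = np.array([f"img_{i}.jpg" for i in range(1050)])
labels = np.repeat(np.arange(7), 150)

# First split: set aside 15% of all samples for validation
X, X_val, y, y_val = train_test_split(
    paths, labels, test_size=0.15, random_state=8, stratify=labels
)
# Second split: divide the remaining samples (X, y), not the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=8, stratify=y
)

print(len(X_train), len(X_test), len(X_val))
```

Stratifying keeps the class proportions close to identical across the three subsets, which makes accuracy figures easier to compare between them.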

## Creating the classification model

In [9]:
def create_model():
    # Load the pretrained VGG16 convolutional base, without its classification head
    model0 = VGG16(include_top=False, weights="imagenet", input_shape=(224, 224, 3))

    # Freeze the layers to keep the weights of the pretrained model
    for layer in model0.layers:
        layer.trainable = False

    # Get the output of the convolutional base
    output = model0.output
    # Add a new classification head
    output = GlobalAveragePooling2D()(output)
    output = Dense(256, activation="relu")(output)
    output = Dropout(0.5)(output)
    # Define the new output layer with 7 classes and a softmax activation
    predictions = Dense(7, activation="softmax")(output)

    # Build the full model from the base input to the new output
    model = Model(inputs=model0.input, outputs=predictions)
    # Compile the new model (categorical cross-entropy expects one-hot encoded labels)
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

    # summary() prints the architecture itself and returns None
    model.summary()

    return model
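
The model ends in a 7-unit softmax and is compiled with a categorical cross-entropy loss, which expects one-hot encoded targets, whereas `LabelEncoder` produces plain integers. A minimal NumPy sketch of the conversion, assuming the labels are not already one-hot encoded elsewhere in the notebook (`one_hot` is a hypothetical helper name):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

y = [0, 3, 6]
print(one_hot(y, num_classes=7))
```

Keras offers the same conversion as `tf.keras.utils.to_categorical`. An alternative is to keep the integer labels and compile the model with `loss="sparse_categorical_crossentropy"`.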

## First approach

For the first approach, we apply a simple initial preparation to all the images before running a supervised classification.
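
The preparation code itself is not shown in this section. The notebook's imports suggest `load_img`, `img_to_array` and `preprocess_input` for this step. The sketch below reproduces the same operations with Pillow and NumPy only, so that it runs without TensorFlow; `prepare_image` and `prepare_batch` are hypothetical helper names, and the 224x224 size is taken from the model's `input_shape`.

```python
import numpy as np
from PIL import Image

# Per-channel ImageNet means in BGR order, as used by VGG16's "caffe"-style preprocessing
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype="float32")

def prepare_image(img, size=(224, 224)):
    """Resize a PIL image and apply VGG16-style preprocessing."""
    img = img.convert("RGB").resize(size, Image.NEAREST)
    arr = np.asarray(img, dtype="float32")   # shape (224, 224, 3), RGB, values 0..255
    arr = arr[..., ::-1]                      # RGB -> BGR
    return arr - IMAGENET_BGR_MEAN            # zero-center each channel

def prepare_batch(paths):
    """Load image files and stack them into one array of shape (n, 224, 224, 3)."""
    batch = []
    for p in paths:
        with Image.open(p) as img:
            batch.append(prepare_image(img))
    return np.stack(batch)
```

Calling `prepare_batch(X_train)` would then give an array that can be passed to the model directly. For a dataset that does not fit in memory, a generator-based loader would be a better fit than stacking everything at once.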