# Exploratory Data Analysis: Leaf Classification

This notebook presents an exploratory analysis of the *Leaf Classification* dataset.
The goal is to understand the structure of the data and the distribution of the classes,
and to justify the methodological choices made in the final pipeline.


In [22]:
import sys
from pathlib import Path

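# The notebook runs from a direct subdirectory of the project,
# so the project root is the parent of the working directory.
PROJECT_ROOT = Path().resolve().parent

# Make the packages under src/ importable from the notebook.
SRC_PATH = PROJECT_ROOT / "src"
sys.path.append(str(SRC_PATH))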

print("PROJECT_ROOT :", PROJECT_ROOT)



PROJECT_ROOT : /home/ubuntu/techniques_apprentissage


In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from leaf_classification.gestion_donnees.leaf_data_loader import LeafDataLoader

sns.set(style="whitegrid")


In [24]:
loader = LeafDataLoader(
    chemin_train_zip=str(PROJECT_ROOT / "data/raw/train.csv.zip"),
    chemin_test_zip=str(PROJECT_ROOT / "data/raw/test.csv.zip"),
    dossier_extraction=str(PROJECT_ROOT / "data/raw/extracted"),
)

train_df, _ = loader.charger_donnees()
X, y = loader.obtenir_X_y_train()

train_df.head()


Unnamed: 0,id,species,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,Acer_Opalus,0.007812,0.023438,0.023438,0.003906,0.011719,0.009766,0.027344,0.0,...,0.007812,0.0,0.00293,0.00293,0.035156,0.0,0.0,0.004883,0.0,0.025391
1,2,Pterocarya_Stenoptera,0.005859,0.0,0.03125,0.015625,0.025391,0.001953,0.019531,0.0,...,0.000977,0.0,0.0,0.000977,0.023438,0.0,0.0,0.000977,0.039062,0.022461
2,3,Quercus_Hartwissiana,0.005859,0.009766,0.019531,0.007812,0.003906,0.005859,0.068359,0.0,...,0.1543,0.0,0.005859,0.000977,0.007812,0.0,0.0,0.0,0.020508,0.00293
3,5,Tilia_Tomentosa,0.0,0.003906,0.023438,0.005859,0.021484,0.019531,0.023438,0.0,...,0.0,0.000977,0.0,0.0,0.020508,0.0,0.0,0.017578,0.0,0.047852
4,6,Quercus_Variabilis,0.005859,0.003906,0.048828,0.009766,0.013672,0.015625,0.005859,0.0,...,0.09668,0.0,0.021484,0.0,0.0,0.0,0.0,0.0,0.0,0.03125


In [25]:
print("Dataset dimensions:", train_df.shape)
print("Number of features:", X.shape[1])
print("Number of classes:", y.nunique())


Dataset dimensions: (990, 194)
Number of features: 193
Number of classes: 99


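The training set contains 990 observations and 194 columns. `X` keeps 193 of them:
the `id` column, which is an identifier rather than a morphological descriptor
(it appears in the summary statistics of the next cell), and 192 numeric
descriptors of leaf morphology.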
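To see how the columns split into descriptor families, one option is to group column names by their prefix. The sketch below is only illustrative: `count_feature_families` is a hypothetical helper, and the example column list assumes the layout of the public Kaggle release (`margin`, `shape` and `texture`, 64 columns each), of which the truncated preview above only shows `margin` and `texture`.

```python
import re
from collections import Counter


def count_feature_families(columns):
    """Count columns per name prefix, e.g. 'margin12' -> 'margin'."""
    families = Counter()
    for name in columns:
        match = re.fullmatch(r"([A-Za-z_]+?)(\d+)", name)
        # Columns without a numeric suffix (id, species) are grouped as 'other'.
        families[match.group(1) if match else "other"] += 1
    return families


# Hypothetical column list (assumption): id, species, then three families of 64.
example_columns = ["id", "species"] + [
    f"{family}{i}" for family in ("margin", "shape", "texture") for i in range(1, 65)
]
families = count_feature_families(example_columns)
print(dict(families))
```

In the notebook itself, the same helper could be called as `count_feature_families(train_df.columns)`.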


In [26]:
X.describe().T[["mean", "std", "min", "max"]].head()


Unnamed: 0,mean,std,min,max
id,799.59596,452.477568,1.0,1584.0
margin1,0.017412,0.019739,0.0,0.087891
margin2,0.028539,0.038855,0.0,0.20508
margin3,0.031988,0.025847,0.0,0.15625
margin4,0.02328,0.028411,0.0,0.16992


In [27]:
y_counts = y.value_counts()
display(y_counts.head(10))
print("Min / max per class:", y_counts.min(), "/", y_counts.max())


species
Sorbus_Aria              10
Acer_Opalus              10
Pterocarya_Stenoptera    10
Viburnum_Tinus           10
Morus_Nigra              10
Quercus_Vulcanica        10
Alnus_Viridis            10
Betula_Pendula           10
Olea_Europaea            10
Quercus_Ellipsoidalis    10
Name: count, dtype: int64

Min / max per class: 10 / 10
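Since the smallest and largest class sizes are both 10, every one of the 99 classes has exactly 10 observations, which accounts for all 990 rows (99 × 10 = 990). A minimal sketch of that consistency check follows; `class_balance` is a hypothetical helper and the labels are synthetic stand-ins for `y`.

```python
from collections import Counter


def class_balance(labels):
    """Return (number of classes, smallest class size, largest class size)."""
    counts = Counter(labels)
    return len(counts), min(counts.values()), max(counts.values())


# Synthetic labels (assumption): 99 classes with 10 samples each, as reported above.
example_labels = [f"species_{k}" for k in range(99) for _ in range(10)]
n_classes, smallest, largest = class_balance(example_labels)

# With equal min and max, classes * size must equal the number of rows.
print(n_classes, smallest, largest, n_classes * smallest == len(example_labels))
```

A perfectly balanced target like this one means plain accuracy is not distorted by a majority class, and stratified splits keep the same number of samples per class in each fold.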


The features are numeric but span very different scales: in the summary statistics above,
`id` ranges from 1 to 1584, while the `margin` descriptors shown stay below 0.21.
This justifies normalizing the data in the experimental pipeline.
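The notebook does not say which normalization the pipeline applies. As one common possibility, the sketch below shows z-score standardization (subtract the mean, divide by the population standard deviation), which is also what scikit-learn's `StandardScaler` computes. The `standardize` helper is hypothetical, and the values are the first five rows of `id` and `margin1` from the preview above.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score a column: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread to rescale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# First five rows of two columns with very different raw scales.
ids = [1, 2, 3, 5, 6]
margin1 = [0.007812, 0.005859, 0.005859, 0.0, 0.005859]

ids_z = standardize(ids)
margin1_z = standardize(margin1)
print([round(v, 2) for v in ids_z])
print([round(v, 2) for v in margin1_z])
```

After standardization both columns have mean 0 and unit variance, so distance-based or gradient-based models no longer weight a column by its raw magnitude. In a real pipeline the mean and standard deviation should be estimated on the training split only and reused on validation and test data.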


## Conclusion of the exploratory analysis

- All features are numeric (the only non-numeric column is the `species` label),
  and the dataset contains no missing values.
- The classes are perfectly balanced: each of the 99 species has exactly 10 observations.
- The provided descriptors are already informative, so advanced feature
  engineering is not necessary.
- Normalization is nevertheless needed before training the models.
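The cells above do not display a missing-value check, so here is a minimal sketch of one. `count_missing` is a hypothetical helper that works on rows given as dictionaries (for example from `csv.DictReader`), and the example rows are made up for illustration.

```python
import math


def count_missing(rows):
    """Count cells that are None, empty strings, or NaN in an iterable of dict rows."""
    missing = 0
    for row in rows:
        for value in row.values():
            if value is None or value == "":
                missing += 1
            elif isinstance(value, float) and math.isnan(value):
                missing += 1
    return missing


# Made-up rows (assumption): the second one has two missing cells.
example_rows = [
    {"id": 1, "species": "Acer_Opalus", "margin1": 0.007812},
    {"id": 2, "species": "", "margin1": float("nan")},
]
print(count_missing(example_rows))
```

On the DataFrame loaded in this notebook, the pandas equivalent is `train_df.isna().sum().sum()`, which should return 0 if the first conclusion holds.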

