# Notebook: Evaluating the model over different time periods
This notebook loads a machine learning model saved in MLflow and evaluates its performance over two distinct periods. The results are then logged to MLflow.



## Importing the required libraries and configuring MLflow


In [1]:
import os
import joblib
import mlflow
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


In [2]:
# Configure MLflow
mlflow.set_tracking_uri("https://mlflowp51-975919512217.us-central1.run.app")
mlflow.set_experiment("Text_Processing_Experiment")


<Experiment: artifact_location='gs://apip5bucket/artifacts/1', creation_time=1728821200655, experiment_id='1', last_update_time=1728821200655, lifecycle_stage='active', name='Text_Processing_Experiment', tags={}>

## Artifacts
Loads the artifacts stored in MLflow (models and data).


In [3]:
# Load the artifacts (models and data)
def load_mlflow_artifact(artifact_path):
    """Resolve the artifact to a local path and deserialize it with joblib."""
    local_path = mlflow.artifacts.download_artifacts(artifact_path)
    with open(local_path, 'rb') as f:
        return joblib.load(f)


In [4]:
# Models and artifacts
X_reduced = load_mlflow_artifact("mlflow_artifacts/X_reduced.pkl")
y = load_mlflow_artifact("mlflow_artifacts/y.pkl")
model_path = "mlflow_artifacts/bow_svd_model.h5"

# Load the model from the MLflow artifacts directory
model = tf.keras.models.load_model(model_path)

# Compile the model so that evaluate() reports the loss and the accuracy
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])




## Splitting the data into two periods (first 6 months and last 6 months)
Splits the data into two periods: January to June 2023 (first period) and July to December 2023 (second period).


In [5]:
# Data for the monitoring period
df = pd.read_csv('db/cleaned_data_sample.csv')
df['CreationDate'] = pd.to_datetime(df['CreationDate'])

# Split the data into periods (half-open intervals, so that all of 31 December is included)
df_first_half = df[(df['CreationDate'] >= '2023-01-01') & (df['CreationDate'] < '2023-07-01')]
df_second_half = df[(df['CreationDate'] >= '2023-07-01') & (df['CreationDate'] < '2024-01-01')]

# Split X and y for each period.
# Positional split: assumes X_reduced and y are row-aligned with df and sorted by CreationDate
X_first_half = X_reduced[:len(df_first_half)]
X_second_half = X_reduced[len(df_first_half):]
y_first_half = y[:len(df_first_half)]
y_second_half = y[len(df_first_half):]
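The positional slicing in this cell only picks the right rows if `X_reduced` and `y` follow the row order of `df` and `df` is sorted by `CreationDate`. Below is a minimal sketch of a mask-based alternative that needs row alignment only. The helper `split_by_period` and the synthetic data are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd


def split_by_period(df, X, y, start, end, date_col="CreationDate"):
    """Return the rows of X and y whose date lies in the half-open interval [start, end).

    Assumes X and y have one row per row of df, in the same order.
    """
    dates = pd.to_datetime(df[date_col])
    mask = ((dates >= pd.Timestamp(start)) & (dates < pd.Timestamp(end))).to_numpy()
    return X[mask], y[mask]


# Synthetic, deliberately unsorted data (illustration only)
demo_df = pd.DataFrame({
    "CreationDate": [
        "2023-08-15 09:00:00",
        "2023-02-01 12:00:00",
        "2023-12-31 18:30:00",
        "2023-06-30 23:59:59",
    ]
})
demo_X = np.arange(8).reshape(4, 2)
demo_y = np.array([1, 0, 1, 0])

X_h1, y_h1 = split_by_period(demo_df, demo_X, demo_y, "2023-01-01", "2023-07-01")
X_h2, y_h2 = split_by_period(demo_df, demo_X, demo_y, "2023-07-01", "2024-01-01")
```

Because the mask is computed from the dates themselves, rows outside 2023 are dropped from both halves instead of silently ending up in the second one.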


In [6]:
def log_descriptive_statistics(df, period_name):
    """Compute and log descriptive statistics for the numeric columns, using metric names that are valid in MLflow."""
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    descriptive_stats = df[numeric_columns].describe()

    # Normalize the statistic names for MLflow (e.g. '25%' becomes '25percent')
    valid_stats = descriptive_stats.rename(index=lambda x: x.replace('%', 'percent').replace(' ', '_'))

    # Log the statistics to MLflow
    with mlflow.start_run(run_name=f"Descriptive Statistics {period_name}"):
        for col in valid_stats.columns:
            for stat in valid_stats.index:
                metric_name = f"{period_name}_{col}_{stat}"
                mlflow.log_metric(metric_name, float(valid_stats.loc[stat, col]))

    print(f"Descriptive statistics for {period_name} logged to MLflow.")

# Compute and log the descriptive statistics for both periods
log_descriptive_statistics(df_first_half, "first_half_2023")
log_descriptive_statistics(df_second_half, "second_half_2023")


2024/10/13 14:22:12 INFO mlflow.tracking._tracking_service.client: 🏃 View run Descriptive Statistics first_half_2023 at: https://mlflowp51-975919512217.us-central1.run.app/#/experiments/1/runs/3c9e656644754c288b365b64a556721e.
2024/10/13 14:22:12 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://mlflowp51-975919512217.us-central1.run.app/#/experiments/1.


RestException: INVALID_PARAMETER_VALUE: Invalid metric name: 'first_half_2023_Id_25%'. Names may only contain alphanumerics, underscores (_), dashes (-), periods (.), spaces ( ), and slashes (/).
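
The error message lists the characters MLflow accepts in a metric name. The following is a minimal sketch of a general sanitizer built only from that list. `sanitize_metric_name` is a hypothetical helper, not an MLflow function, and restricting "alphanumerics" to ASCII is a conservative assumption.

```python
import re

# Characters named as allowed in the error message:
# alphanumerics, underscores, dashes, periods, spaces and slashes.
# Restricting alphanumerics to ASCII is an assumption made here.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_\-. /]")


def sanitize_metric_name(name):
    """Hypothetical helper: spell out '%' and replace any other disallowed character with '_'."""
    name = name.replace("%", "percent")
    return _DISALLOWED.sub("_", name)


print(sanitize_metric_name("first_half_2023_Id_25%"))
```

Applying such a helper to the full metric name, rather than to the statistic label alone, would also cover column names that contain disallowed characters.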

## Evaluating the model without retraining

This function evaluates the model's performance on both periods without retraining it (data drift and model drift).


In [6]:
# Evaluate the model without retraining, for each period
def evaluate_model(model, X_test, y_test, dataset_name):
    """Evaluate the model's performance, including the loss, without retraining it."""
    # Loss and accuracy as reported by Keras
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    # Model predictions, thresholded at 0.5
    y_pred = model.predict(X_test)
    y_pred = (y_pred > 0.5).astype("int32")

    # Compute the metrics
    f1 = f1_score(y_test, y_pred, average='macro')
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')

    # Log the results to MLflow
    with mlflow.start_run(run_name=f"Model Evaluation {dataset_name}"):
        mlflow.log_metric(f"{dataset_name}_loss", loss)
        mlflow.log_metric(f"{dataset_name}_accuracy", accuracy)
        mlflow.log_metric(f"{dataset_name}_f1_score", f1)
        mlflow.log_metric(f"{dataset_name}_precision", precision)
        mlflow.log_metric(f"{dataset_name}_recall", recall)

    print(f"{dataset_name} - Loss: {loss}, Accuracy: {accuracy}, F1 Score: {f1}")
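
With `average='macro'`, scikit-learn computes the metric separately for each class and takes the unweighted mean, so a rare class weighs as much as a frequent one. The pure-Python sketch below illustrates that averaging for the single-label case. `macro_f1` and the toy labels are invented for illustration and are not the notebook's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive because
        # every label occurs at least once in y_true or y_pred
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Toy labels (illustration only): class 0 is twice as frequent as class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(macro_f1(y_true, y_pred))
```

Here class 0 scores 0.75 and class 1 scores 0.5, so the macro average is 0.625 even though 4 of the 6 predictions are correct.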


## Evaluating the model on the two halves of the year (data drift and model drift)
Compares the model's performance on the two periods to detect data drift and model drift.


In [7]:
# Performance for each period, without retraining
evaluate_model(model, X_first_half, y_first_half, "first_half_2023")
evaluate_model(model, X_second_half, y_second_half, "second_half_2023")

# Confirmation message
print("Model evaluation logged to MLflow.")


499/499 ━━━━━━━━━━━━━━━━━━━━ 0s 562us/step


2024/10/13 14:06:44 INFO mlflow.tracking._tracking_service.client: 🏃 View run Model Evaluation first_half_2023 at: https://mlflowp51-975919512217.us-central1.run.app/#/experiments/1/runs/7f9e997515a843378ff0cbe9f8e49ccc.
2024/10/13 14:06:44 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://mlflowp51-975919512217.us-central1.run.app/#/experiments/1.


first_half_2023 - Accuracy: 0.38625963768570176, F1 Score: 0.46978230124867953
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 875us/step


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
2024/10/13 14:06:45 INFO mlflow.tracking._tracking_service.client: 🏃 View run Model Evaluation second_half_2023 at: https://mlflowp51-975919512217.us-central1.run.app/#/experiments/1/runs/8d48cc8ea246474db8030ea10020c9f1.
2024/10/13 14:06:45 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://mlflowp51-975919512217.us-central1.run.app/#/experiments/1.


second_half_2023 - Accuracy: 0.3811188811188811, F1 Score: 0.45644860539824234
Model evaluation logged to MLflow.
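
To compare the two periods, the logged metrics can be turned into absolute and relative changes. The sketch below does this with the accuracy and F1 values printed in the output; `metric_change` is a hypothetical helper, and the notebook does not define a threshold for what counts as drift.

```python
def metric_change(before, after):
    """Return the absolute and relative change of a metric between two periods."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Values copied from the evaluation output of this notebook
accuracy = {"first_half_2023": 0.38625963768570176, "second_half_2023": 0.3811188811188811}
f1 = {"first_half_2023": 0.46978230124867953, "second_half_2023": 0.45644860539824234}

for name, values in (("accuracy", accuracy), ("f1_score", f1)):
    absolute, relative = metric_change(values["first_half_2023"], values["second_half_2023"])
    print(f"{name}: {absolute:+.4f} ({relative:+.2%})")
```

Both metrics are slightly lower in the second half. Whether a change of this size indicates drift depends on a tolerance that would have to be chosen for the use case.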
