# Models Training Model Registry (experiments)

En este código, se entrena diferentes modelos para explorar distintas opciones para resolver la predicción sportify
Apache Airflow
MLflow
MinIO (ventana de administración de Buckets)
FAST API
Documentación de la API
Gradio, Posgresql

Este proyecto utiliza MLflow para el seguimiento detallado de los procesos de ETL, el tuneo de hiperparámetros (Runs y Runs Anidados, Modelos y Artefactos), registro de procesos del DAG de ETL y el reentrenamiento de modelos (Información de los Runs de Reentrenamiento, comparación de Métricas). 

## Importar datos y librerias

In [17]:
import mlflow
from datetime import datetime
import awswrangler as wr
import random
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from mlflow import MlflowClient

from plots import plot_information_gain_with_target, plot_correlation_with_target
from utils import get_or_create_experiment


Exporta las variables de entorno necesarias para trabajar con Minio.

In [None]:
# Exportar variables de entorno
%env AWS_ACCESS_KEY_ID=minio   
%env AWS_SECRET_ACCESS_KEY=minio123 
%env MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
%env AWS_ENDPOINT_URL_S3=http://localhost:9000

env: AWS_ACCESS_KEY_ID=minio
env: AWS_SECRET_ACCESS_KEY=minio123
env: MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
env: AWS_ENDPOINT_URL_S3=http://localhost:9000


In [19]:
mlflow_server = "http://localhost:5000"

mlflow.set_tracking_uri(mlflow_server)

Carga los datos procesados desde Minio.


In [20]:
X_train_df = wr.s3.read_csv("s3://data/train/bike_sharing_demand_X_train_scaled.csv")
y_train_df = wr.s3.read_csv("s3://data/train/bike_sharing_demand_y_train.csv")
X_test_df = wr.s3.read_csv("s3://data/test/bike_sharing_demand_X_test_scaled.csv")
y_test_df = wr.s3.read_csv("s3://data/test/bike_sharing_demand_y_test.csv")

In [21]:
corr_plot = plot_correlation_with_target(X_train_df, y_train_df)
information_gain_plot = plot_information_gain_with_target(X_train_df, y_train_df)

In [22]:
X_train = X_train_df.to_numpy()
y_train = y_train_df.to_numpy().ravel()
X_test = X_test_df.to_numpy()
y_test = y_test_df.to_numpy().ravel()

# Model training

Inicializar el experimento mlflow

In [23]:
experiment_id = get_or_create_experiment("Bike Sharing Demand")

print(f"Experiment ID: {experiment_id}")

Experiment ID: 1


In [24]:
run_name_parent = "best_hyperparam_"  + datetime.today().strftime('%Y/%m/%d-%H:%M:%S"')

Muestra el tamaño de los conjuntos de entrenamiento y prueba

In [25]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((12165, 20), (12165,), (5214, 20), (5214,))

La normalización de los datos es un paso importante antes de entrenar el modelo. 

In [None]:
# Definir la red de parámetros para Random Forest
param_grid_rf = {
    'n_estimators': [100, 150], 
    'max_depth': [10, 15, 17],
}

# Inicializar el modelo regresor Random Forest
rf_model = RandomForestRegressor()

# Configurar la búsqueda en cuadrícula con validación cruzada quíntuple
grid_search_rf = GridSearchCV(estimator=rf_model, param_grid=param_grid_rf, cv=5, scoring='neg_mean_squared_error')

#  Iniciar la ejecución de MLflow
with mlflow.start_run(experiment_id=experiment_id, run_name=run_name_parent, nested=True):
    # Realizar la búsqueda en la cuadrícula y ajustar el modelo
    grid_search_rf.fit(X_train, y_train)
    
    # Obtener el mejor modelo a partir de la búsqueda en la parrilla
    best_rf_model = grid_search_rf.best_estimator_
    
    # Hacer predicciones utilizando el mejor modelo
    rf_predictions = best_rf_model.predict(X_test)
    
    # Calcular métricas
    mse_rf = mean_squared_error(y_test, rf_predictions)
    rmse_rf = mean_squared_error(y_test, rf_predictions, squared=True)
    mae_rf = mean_absolute_error(y_test, rf_predictions)
    r2_rf = r2_score(y_test, rf_predictions)
    
    # Registrar los mejores parámetros y métricas en MLflow
    mlflow.log_param("best_rf_n_estimators", best_rf_model.n_estimators)
    mlflow.log_param("best_rf_max_depth", best_rf_model.max_depth)
    mlflow.log_param("best_rf_min_samples_split", best_rf_model.min_samples_split)
    mlflow.log_param("best_rf_min_samples_leaf", best_rf_model.min_samples_leaf)
    mlflow.log_param("best_rf_bootstrap", best_rf_model.bootstrap)
    
    mlflow.log_metric("best_rf_mse", mse_rf)
    mlflow.log_metric("best_rf_rmse", rmse_rf)
    mlflow.log_metric("best_rf_mae", mae_rf)
    mlflow.log_metric("best_rf_r2", r2_rf)
    
    mlflow.log_figure(corr_plot, artifact_file="correlation_with_target.png")
    mlflow.log_figure(information_gain_plot, artifact_file="information_gain_with_target.png")
    
    # Obtener la primera fila del conjunto de prueba y registrarla en MLflow
    input_example = X_test[0:1]
    
    # Definir la ruta del artefacto
    artifact_path = "best_rf_model"
    
    # Inferir el esquema del ejemplo de entrada
    signature = mlflow.models.infer_signature(X_train, best_rf_model.predict(X_train))
    
    # Log the best Random Forest model to the MLflow server
    mlflow.sklearn.log_model(
        sk_model=best_rf_model,
        artifact_path=artifact_path,
        signature=signature,
        serialization_format='cloudpickle',
        registered_model_name='bike_sharing_model_dev',
        metadata={'model_data_version': 1}
    )
    
    # Obtener el URI del modelo registrado
    model_uri = mlflow.get_artifact_uri(artifact_path)
    
    # Imprimir los resultados
    print(f"Best Random Forest model logged with MSE: {mse_rf}, RMSE: {rmse_rf}, MAE: {mae_rf}, R²: {r2_rf}")
    print(f"Best Random Forest parameters: {grid_search_rf.best_params_}")

Successfully registered model 'bike_sharing_model_dev'.
2024/08/25 18:41:09 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: bike_sharing_model_dev, version 1
Created version '1' of model 'bike_sharing_model_dev'.
2024/08/25 18:41:09 INFO mlflow.tracking._tracking_service.client: 🏃 View run best_hyperparam_2024/08/25-18:39:07" at: http://localhost:5000/#/experiments/1/runs/16508c7082cf45d79ce7515e1a8524cc.
2024/08/25 18:41:09 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/1.


Best Random Forest model logged with MSE: 0.14709518748886086, RMSE: 0.14709518748886086, MAE: 0.25692102312919873, R²: 0.9336163145130275
Best Random Forest parameters: {'max_depth': 15, 'n_estimators': 100}


## Testing the model

In [27]:
loaded_model = mlflow.sklearn.load_model(model_uri)

In [28]:
X_test = np.array(X_test)

In [None]:
# Obtener un elemento aleatorio del conjunto de prueba
input_example = X_test[random.randint(0, X_test.shape[0])] 

print(f"Input example: {input_example}")

Input example: [-1.003541   -0.16956604 -1.46237645  1.56384119  0.16826865 -0.45203217
 -0.59618097 -0.29878575 -0.01282315 -0.58212799  1.68342244 -0.56997785
 -0.40854189 -0.39931795 -0.40771959 -0.40895277 -0.40840489 -0.4120971
  0.58631601 -1.3282028 ]


In [30]:
int(np.exp(loaded_model.predict(input_example.reshape(1, -1))))

  int(np.exp(loaded_model.predict(input_example.reshape(1, -1))))


276

## Register the model

In [31]:
client = MlflowClient()

name = "bike_sharing_model_prod"
desc = "Production model for bike sharing demand prediction"

client.create_registered_model(name=name, description=desc)

tags = best_rf_model.get_params()
tags["model"] = type(best_rf_model).__name__
tags["mse"] = mse_rf
tags["r2"] = r2_rf

result = client.create_model_version(
    name=name,
    source=model_uri,
    run_id=model_uri.split("/")[-3],
    tags=tags
)

client.set_registered_model_alias(name, "champion", result.version)

2024/08/25 18:42:32 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: bike_sharing_model_prod, version 1


In [None]:
Otra opción


In [None]:
def get_or_create_experiment(experiment_name):
    """
    Retrieve the ID of an existing MLflow experiment or create a new one if it doesn't exist.

    This function checks if an experiment with the given name exists within MLflow.
    If it does, the function returns its ID. If not, it creates a new experiment
    with the provided name and returns its ID.

    Parameters:
    - experiment_name (str): Name of the MLflow experiment.

    Returns:
    - str: ID of the existing or newly created MLflow experiment.
    """

    if experiment := mlflow.get_experiment_by_name(experiment_name):
        return experiment.experiment_id
    else:
        return mlflow.create_experiment(experiment_name)