# MLOps on GCP course - MLflow introduction

<img src=https://www.headmind.com/wp-content/uploads/2024/01/logo_dark.png width="200">

<img src=https://www.ensta-paris.fr/profiles/createur_profil/themes/createur/dist/images/logo_ensta_new.jpg.pagespeed.ce.ERsGv8BS3M.jpg width="200">

*Context*

Credit risk is the risk that a customer doesn't pay back the money they borrowed from a bank. Banks do credit risk modelling to minimize their expected credit loss. ML models can be trained to classify whether a customer is at risk or not.

*Dataset*

The German Credit Risk dataset is used.

The dataset is anonymized because it contains personally identifiable information (PII) about the bank's customers. The features are described in the [dataset documentation](https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data).

*Objectives*

- Dataset exploration: explore the relevant data using EDA
- ML implementation: train a Random Forest classifier and tune it with Optuna

*Notebook made by Headmind Partners AI & Blockchain*

## Libraries

### You should have Python 3.11.0 installed to run this lab and the others correctly!

In [None]:
%pip install -r requirements.txt

In [None]:
import pickle

import numpy as np
import optuna
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils.class_weight import compute_sample_weight

from IPython.display import Image
pd.set_option("display.max_columns", 500)

## Running MLFlow server
MLflow lets us track information about ML model runs through a UI. To start the server, run the following command from the root of the project:

```
mlflow server --host 127.0.0.1 --port 8080
```
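
Before pointing the notebook at the server, it can help to confirm that something is actually listening on the chosen host and port. The sketch below is a minimal check, assuming the MLflow tracking server answers on a `/health` endpoint; the helper name `mlflow_server_is_up` is hypothetical and not part of MLflow.

```python
from urllib.error import URLError
from urllib.request import urlopen


def mlflow_server_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers 200 on /health at host:port."""
    # Assumption: the tracking server exposes a /health route.
    url = f"http://{host}:{port}/health"
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error or malformed URL.
        return False


# Same host and port as in the `mlflow server` command above.
print(mlflow_server_is_up("127.0.0.1", 8080))
```

If this prints `False`, the tracking URI set in the next cell will most likely not reach the server either.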



In [None]:
import mlflow
# By default, the logs will be saved in the current folder. To link your notebook computations to the mlflow server, set the tracking uri to the same uri as the server
host = "0.0.0.0" #TODO
port = "6000" #TODO
mlflow.set_tracking_uri(uri = f"http://{host}:{port}")

## Data Exploration

In [None]:
filename = "data/dataset.parquet"

df = pd.read_parquet(filename)
df.head()

The goal is to predict whether a bank can grant credit to a customer based on their profile.

Question: Identify the target field

In [None]:
# Identify target field
#########################
target_field = "" # TODO
#########################

In [None]:
# Let's rename the target field
df = df.rename(columns={target_field:'risk'})
# And change the label values 
df['risk'] = df['risk'].map({1:0,2:1})

y = df['risk']
X = df.drop(columns=['risk'])

This is a binary classification problem where
-  y = 1 if the customer is at risk
-  y = 0 if the customer is "bankable"

In real life, banks assess customer risk with more than two values (risky or not risky).

In our case, what trick would you suggest to obtain n risk levels (with n > 2), for instance using probabilities?

--------------------------
ANSWER HERE

--------------------------

### Using seaborn to explore data 

The correlation matrix and the feature distributions by credit risk are displayed using the *seaborn* library.

In [None]:
# Correlation matrix
corr = df.corr(numeric_only = True)
plt.figure(figsize=(12,12))
sns.heatmap(corr, cmap="Blues", annot=True, linewidths=.5, cbar_kws={"shrink": .5})
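
The heatmap above covers correlations only. To compare feature distributions between the two risk classes, one option is a per-class summary table. The sketch below uses a hypothetical toy DataFrame; the column names `duration` and `credit_amount` and all values are illustrative assumptions, not taken from the lab dataset.

```python
import pandas as pd

# Hypothetical toy data: 'risk' is the binary label, the other columns are numeric features.
toy = pd.DataFrame(
    {
        "risk": [0, 0, 0, 1, 1],
        "duration": [12, 24, 18, 36, 48],
        "credit_amount": [1000, 2500, 1800, 6000, 7500],
    }
)

# One row per risk class, one (feature, statistic) column per aggregate.
summary = toy.groupby("risk")[["duration", "credit_amount"]].agg(["mean", "median"])
print(summary)
```

On the real data, the same `groupby` call can be applied to `df` once the target column has been renamed to `risk`.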


Question: Do you consider the dataset imbalanced? Compute the label proportions. If a dataset is imbalanced, what are the risks for the model? Which method would you use to handle an imbalanced dataset?

--------------------------
ANSWER HERE

--------------------------

## Encoding

Preliminary data exploration helped us discover all the features in the dataset, their distributions and correlations.

The categorical features now have to be encoded.

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_feat = X.select_dtypes(include=numerics).columns.tolist()

##############################################
# Pick the right categorical features to encode
categorical_feat = ["checking_account_status", ...] # TODO
##############################################

onehot_encoder = OneHotEncoder()

# fit_transform - create an X_enc dataframe from the X dataframe
X_enc_array = onehot_encoder.fit_transform(X[categorical_feat])
X_enc = pd.DataFrame(
    X_enc_array.toarray(),
    columns=onehot_encoder.get_feature_names_out(input_features=categorical_feat),
    index=X.index,  # keep the rows aligned with X when adding the numeric features
)
X_enc[numeric_feat] = X[numeric_feat]

display(X_enc.head())


What is a one-hot encoder? How would it transform the following pandas Series: ['Cat','Cat','Dog','Cat','Bird','Dog']?

--------------------------
ANSWER HERE

--------------------------

In [None]:
with open("data/one_hot_encoder.pkl", 'wb') as file:
    pickle.dump(onehot_encoder, file)

## ML Modeling

### Train/test split

Question: Split the encoded features X_enc and y to fit the model. Make sure the risk proportion is the same in the train set and in the test set by using the *stratify* argument. Use random_state = 16.

In [None]:
X_train,X_test,y_train,y_test =  ... # TODO

### Training an ML model
During the rest of this workshop, we'll train a random forest classifier. What other models would be appropriate for the current problem? Justify your answer.

--------------------------
ANSWER HERE

--------------------------

#### Training with default hyperparameters


In [None]:
# Basic configuration

rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)

Before running the following cell, make sure you have started the server using the command:
```mlflow server --host 127.0.0.1 --port 8080```

In [None]:

mlflow.set_experiment(experiment_name="finetune-creditrisk")
with mlflow.start_run(run_name="RandomForest_NoOptimization"):
    # log params
    params = rf_clf.get_params()
    mlflow.log_param("n_estimators", params["n_estimators"])
    mlflow.log_param("bootstrap", params["bootstrap"])
    mlflow.log_param("min_samples_leaf", params["min_samples_leaf"])
    mlflow.log_param("max_depth", params["max_depth"])

    # log metrics
    y_pred = rf_clf.predict_proba(X_test)[:,1]
    mlflow.log_metric("auc", roc_auc_score(y_test,y_pred))
    mlflow.log_metric("f1-score", f1_score(y_test, rf_clf.predict(X_test)))
    
    mlflow.sklearn.log_model(rf_clf, artifact_path="sklearn-model",
        registered_model_name="sk-learn-random-forest")
    
    mlflow.log_artifact(local_path='data/one_hot_encoder.pkl', artifact_path="")
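
The cell above computes the AUC from `predict_proba` output and the F1-score from `predict` output. The sketch below, on hypothetical toy arrays, illustrates why the two metrics take different inputs: the AUC ranks scores, so thresholding them into hard labels first discards information.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 1]
y_proba = [0.2, 0.6, 0.55, 0.7, 0.9]

# Hard labels obtained with a 0.5 threshold.
y_label = [int(p >= 0.5) for p in y_proba]

auc_from_proba = roc_auc_score(y_true, y_proba)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_label)  # ranking collapsed to 0/1
f1 = f1_score(y_true, y_label)                    # defined on hard labels

print(auc_from_proba, auc_from_labels, f1)
```

On this toy example the AUC computed from hard labels is lower than the one computed from probabilities, because ties between thresholded predictions only count for half.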

Using only the MLFlow UI, what are the basic parameters of a random forest classifier? (Justify by writing the path you took in the UI to read them)

--------------------------
ANSWER HERE

--------------------------

#### Optimizing hyperparameters by hand


Based on the results of the previous run, fine-tune the model by hand using the code provided below, and record each result you obtain in a table.

--------------------------
ANSWER HERE

--------------------------

In [None]:
# Modify here to fine-tune the model
params = {
    "n_estimators":1, # TODO
    "bootstrap":False,
    "min_samples_leaf":1,
    "max_depth":1,
}

rf_clf = RandomForestClassifier(**params,random_state=42)
rf_clf.fit(X_train, y_train)

mlflow.set_experiment(experiment_name="finetune-creditrisk")
with mlflow.start_run(run_name="RandomForest_manualOptim"):
    # log params
    params = rf_clf.get_params()
    mlflow.log_param("n_estimators", params["n_estimators"])
    mlflow.log_param("bootstrap", params["bootstrap"])
    mlflow.log_param("min_samples_leaf", params["min_samples_leaf"])
    mlflow.log_param("max_depth", params["max_depth"])

    # log metrics
    y_pred = rf_clf.predict_proba(X_test)[:,1]
    mlflow.log_metric("auc", roc_auc_score(y_test,y_pred))
    mlflow.log_metric("f1-score", f1_score(y_test, rf_clf.predict(X_test)))
    
    mlflow.sklearn.log_model(rf_clf, artifact_path="sklearn-model",
        registered_model_name="sk-learn-random-forest")
    
    mlflow.log_artifact(local_path='data/one_hot_encoder.pkl', artifact_path="")

#### Optimizing hyperparameters with Optuna

<a href=https://optuna.readthedocs.io/en/stable/index.html> Optuna </a> is a hyperparameter fine-tuning framework.

To use it, you first define an objective function that takes a trial, computes a score, and declares the hyperparameters to tune through the trial's 'suggest' methods.

Then, you choose a search heuristic (a sampler), and Optuna tries different sets of hyperparameters and records the resulting scores.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, stratify = y_train, random_state=16)

def objective_rf(trial):
    rf_params = {
            # Parameter space definition
            #################################################################
            # TODO: based on your previous results, set a search range for each hyperparameter
            'n_estimators' : trial.suggest_int('n_estimators',low=...,high=...),
            'max_depth' : trial.suggest_int('max_depth',low=...,high=...),
            'bootstrap' : trial.suggest_categorical('bootstrap', []),
            'min_samples_leaf' : trial.suggest_float("min_samples_leaf", low = ..., high = ...)
            #################################################################
            }

    rf_classifier = RandomForestClassifier(random_state=42)
    rf_classifier.set_params(**rf_params)

    rf_classifier.fit(X_train, y_train)

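    # Log metrics: F1 on hard labels, AUC on predicted probabilities
    y_pred = rf_classifier.predict(X_val)
    y_proba = rf_classifier.predict_proba(X_val)[:, 1]
    score = f1_score(y_val, y_pred)
    mlflow.log_metric("auc", roc_auc_score(y_val, y_proba), step=trial.number)
    mlflow.log_metric("f1-score", score, step=trial.number)
    return score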

In [None]:
study = optuna.create_study(direction="maximize")
mlflow.set_experiment(experiment_name="finetune-creditrisk")
with mlflow.start_run(run_name="RandomForest_Finetuning_exp"):
    study.optimize(objective_rf, n_trials=30, timeout=600)
rf_params = study.best_trial.params

What is the difference between a train, a validation, and a test set? What are the risks if the validation and test sets overlap?

--------------------------
ANSWER HERE

--------------------------

In [None]:
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.set_params(**rf_params)

X_train_val, y_train_val = pd.concat((X_train, X_val)), pd.concat((y_train, y_val))

rf_classifier.fit(X_train_val, y_train_val)
with mlflow.start_run(run_name="RandomForest_Optimization"):
    # log params (the four hyperparameters tuned by Optuna)
    mlflow.log_param("n_estimators", rf_params["n_estimators"])
    mlflow.log_param("bootstrap", rf_params["bootstrap"])
    mlflow.log_param("min_samples_leaf", rf_params["min_samples_leaf"])
    mlflow.log_param("max_depth", rf_params["max_depth"])

    # log metrics
    y_pred = rf_classifier.predict_proba(X_test)[:,1]
    mlflow.log_metric("auc", roc_auc_score(y_test,y_pred))
    mlflow.log_metric("f1-score", f1_score(y_test, rf_classifier.predict(X_test)))
    
    # Log the fine-tuned model (rf_classifier), not the manually tuned rf_clf
    mlflow.sklearn.log_model(rf_classifier, artifact_path="sklearn-model",
        registered_model_name="sk-learn-random-forest-finetuned")
    mlflow.log_artifact(local_path='data/one_hot_encoder.pkl', artifact_path="")

## Retrieving a model logged on MLflow by run ID

In [None]:
import mlflow
from IPython.display import display

experiment_name = ["finetune-creditrisk"]
run_name = "RandomForest_Optimization"

# Search for the run using the experiment name and run name
runs = mlflow.search_runs(experiment_names=experiment_name)

display(runs)

In [None]:
# Keep the runs with the expected name, most recent first
matching_runs = runs.loc[runs["tags.mlflow.runName"] == run_name]
matching_runs = matching_runs.sort_values(by="end_time", ascending=False)
run_id = matching_runs.iloc[0]["run_id"]

In [None]:
# Retrieve the model logged by the selected run
model = mlflow.sklearn.load_model(f"runs:/{run_id}/sklearn-model")
model.predict(X_test)

## Upgrading the model stage

In [None]:
from mlflow import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="sk-learn-random-forest-finetuned",
    version=...,  # TODO: choose the latest version based on the UI
    stage="Production",
)

## Retrieving the registered model

In [None]:
import mlflow

model_name = ...  # TODO: load the right model
model_version = ...  # TODO: load the right version
model = mlflow.sklearn.load_model(model_uri=f"models:/{model_name}/{model_version}")