I have already trained the model and searched for the best parameters with hyperopt. All of the experiments and runs are saved in `mlflow.db`. Using the MLflow UI, I will register the 5 models with the best AUC in the model registry, move some of them to 'Production' and leave the rest in 'Staging'. Then I will use the MLflow client to query the registry from code, evaluate the models in 'Staging' on the test data, and promote the best one to 'Production'.

---

Before that, I will do one more run with the best model to test the idea of saving the preprocessor (DictVectorizer) in a separate folder, since `autolog()` does not save it.

Also, all of the preprocessing lives in `preprocessing.py` so that I can declutter this notebook and focus on the parts that deal with MLflow only. I know this is not how it is normally done, but this notebook serves as a POC, not as production-level code.
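
To make the preprocessor idea concrete before the real run, here is a minimal sketch of the pickle round trip on its own. The helper names `save_preprocessor` and `load_preprocessor`, the feature names, and the plain dict standing in for the fitted DictVectorizer are all assumptions for illustration; the real object comes from `preprocessing.py`.

```python
import os
import pickle
import tempfile


def save_preprocessor(obj, folder, filename="preprocessor.bin"):
    """Pickle obj into folder (created if missing) and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as f_out:
        pickle.dump(obj, f_out)
    return path


def load_preprocessor(path):
    """Load a previously pickled preprocessor from path."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)


# Hypothetical stand-in for the fitted DictVectorizer; any picklable object works.
fake_dv = {"feature_names": ["income", "seniority"], "sparse": False}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_preprocessor(fake_dv, os.path.join(tmp, "models"))
    restored = load_preprocessor(saved_path)
```

Once the file exists on disk, it can be attached to a run with `mlflow.log_artifact`, which is what the training cell below does.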

In [2]:
#To access the scope of preprocessing.py
%run preprocessing.py

In [13]:
import mlflow
import pickle
import xgboost as xgb

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("credit-risk-scoring")

<Experiment: artifact_location='/Users/goceovoono/Desktop/programming-stuff/mlz-latest/week-6/mlops-part/mlruns/1', creation_time=1702676022413, experiment_id='1', last_update_time=1702676022413, lifecycle_stage='active', name='credit-risk-scoring', tags={}>

In [15]:
best_params = {'learning_rate': 0.057786841452234214,
               'max_depth': 5.0,
               'min_child_weight': 11.353071298640767,
               'reg_alpha': 0.008039345251325283,
               'reg_lambda': 0.003694981097974786}

In [20]:
import os

with mlflow.start_run():

    booster = xgb.XGBClassifier(
        max_depth=int(best_params['max_depth']),
        learning_rate=best_params['learning_rate'],
        reg_alpha=best_params['reg_alpha'],
        reg_lambda=best_params['reg_lambda'],
        min_child_weight=best_params['min_child_weight'],
        objective='binary:logistic',
        eval_metric='auc',
        seed=RANDOM_STATE,
        n_estimators=1000,
        early_stopping_rounds=50
    )

    # Pickle the DictVectorizer, since autolog() does not save it
    os.makedirs("./models", exist_ok=True)
    with open("./models/preprocessor.bin", "wb") as f_out:
        pickle.dump(dv, f_out)

    mlflow.autolog()

    # Early stopping monitors AUC on the validation set
    booster.fit(X_train, y_train,
                eval_set=[(X_valid, y_val)])

    y_pred = booster.predict_proba(X_valid)[:, 1]
    roc_auc = roc_auc_score(y_val, y_pred)

    # Log the DictVectorizer as an artifact
    mlflow.log_artifact("./models/preprocessor.bin", artifact_path="preprocessors")

2023/12/16 23:58:16 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


[0]	validation_0-auc:0.76757
[1]	validation_0-auc:0.77520
[2]	validation_0-auc:0.79604
[3]	validation_0-auc:0.79677
[4]	validation_0-auc:0.80054
[5]	validation_0-auc:0.80208
[6]	validation_0-auc:0.80581
[7]	validation_0-auc:0.80793
[8]	validation_0-auc:0.80848
[9]	validation_0-auc:0.80954
[10]	validation_0-auc:0.81088
[11]	validation_0-auc:0.81302
[12]	validation_0-auc:0.81364
[13]	validation_0-auc:0.81375
[14]	validation_0-auc:0.81405
[15]	validation_0-auc:0.81445
[16]	validation_0-auc:0.81390
[17]	validation_0-auc:0.81454
[18]	validation_0-auc:0.81422
[19]	validation_0-auc:0.81418
[20]	validation_0-auc:0.81548
[21]	validation_0-auc:0.81538
[22]	validation_0-auc:0.81625
[23]	validation_0-auc:0.81737
[24]	validation_0-auc:0.81788
[25]	validation_0-auc:0.81759
[26]	validation_0-auc:0.81806
[27]	validation_0-auc:0.81847
[28]	validation_0-auc:0.81884
[29]	validation_0-auc:0.81919
[30]	validation_0-auc:0.81903
[31]	validation_0-auc:0.81933
[32]	validation_0-auc:0.81904
[33]	validation_0-au



In [43]:
# Accessing MLflow via code to get all the model versions and their stages

from mlflow.tracking import MlflowClient

MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"

client = MlflowClient(tracking_uri=MLFLOW_TRACKING_URI)

model_name = 'credit-scoring'
all_versions = client.search_model_versions(f"name='{model_name}'")

In [51]:
staging_runs = []
for ver in all_versions:
    if ver.current_stage == 'Staging':
        staging_runs.append(ver.run_id)

In [52]:
staging_runs

['a298ae01dc784deca4d17b22ccf5faf4',
 'c0dad59a9148498c8c991be737b26dd0',
 'aeb61853b70f49f58deede317277a2bb']
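
The same filter can also keep each run ID paired with its registry version in a single pass, so the two never drift out of order. This is only a sketch: `FakeVersion`, `staging_run_versions`, and the sample run IDs are hypothetical stand-ins, and only the three attributes already used in this notebook (`run_id`, `version`, `current_stage`) are assumed.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by search_model_versions.
FakeVersion = namedtuple("FakeVersion", ["run_id", "version", "current_stage"])


def staging_run_versions(all_versions, stage="Staging"):
    """Map run_id -> version for every model version in the given stage."""
    return {v.run_id: v.version for v in all_versions if v.current_stage == stage}


# Made-up sample data, not the real registry contents.
sample_versions = [
    FakeVersion("run-a", 4, "Staging"),
    FakeVersion("run-b", 2, "Production"),
    FakeVersion("run-c", 1, "Staging"),
]

staging = staging_run_versions(sample_versions)
print(staging)  # {'run-a': 4, 'run-c': 1}
```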

In [90]:
for run_id in staging_runs:
    # Load the model from MLflow
    logged_model = f'runs:/{run_id}/model'
    loaded_model = mlflow.xgboost.load_model(logged_model)

    # Retrieve the parameters from the MLflow run
    run_info = mlflow.get_run(run_id)
    params = run_info.data.params
    filtered_dict = {key: value for key, value in params.items() if value != 'None'}

    loaded_model.set_params(**filtered_dict)

    y_pred = loaded_model.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test, y_pred)
    print(f"run_id: {run_id} --> roc_auc: {roc_auc}")

run_id: a298ae01dc784deca4d17b22ccf5faf4 --> roc_auc: 0.8287115588547189
run_id: c0dad59a9148498c8c991be737b26dd0 --> roc_auc: 0.8287115588547189
run_id: aeb61853b70f49f58deede317277a2bb --> roc_auc: 0.8287115588547189


Yes, I know they are all the same: the last 5 runs are simply 5 models trained on the `best_params` above. This is just to keep things easy to follow!

In [97]:
versions = []

for ver in all_versions:
    if ver.current_stage == 'Staging':
        versions.append(ver.version)

In [104]:
runs_versions = dict(zip(staging_runs, versions))
runs_versions

{'a298ae01dc784deca4d17b22ccf5faf4': 4,
 'c0dad59a9148498c8c991be737b26dd0': 1,
 'aeb61853b70f49f58deede317277a2bb': 3}
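
If only the best staging model should be promoted, as planned at the start of this notebook, the selection step could be sketched as below. The helper `pick_best_version` and the `auc_by_run` dict are assumptions for illustration; the scores and versions are copied from the outputs above. Since all three scores are identical here, the tie goes to the first run in the dict.

```python
def pick_best_version(auc_by_run, version_by_run):
    """Return (run_id, version) of the run with the highest AUC.

    On a tie, the first run in auc_by_run's insertion order wins.
    """
    best_run = max(auc_by_run, key=auc_by_run.get)
    return best_run, version_by_run[best_run]


# Test-set AUC per run, copied from the evaluation output above.
auc_by_run = {
    'a298ae01dc784deca4d17b22ccf5faf4': 0.8287115588547189,
    'c0dad59a9148498c8c991be737b26dd0': 0.8287115588547189,
    'aeb61853b70f49f58deede317277a2bb': 0.8287115588547189,
}

# run_id -> registry version, copied from the output above.
runs_versions = {
    'a298ae01dc784deca4d17b22ccf5faf4': 4,
    'c0dad59a9148498c8c991be737b26dd0': 1,
    'aeb61853b70f49f58deede317277a2bb': 3,
}

best_run, best_version = pick_best_version(auc_by_run, runs_versions)
print(f"best run: {best_run}, version: {best_version}")
```

The resulting `best_version` could then be passed as the `version` argument of `client.transition_model_version_stage` with `stage="Production"`.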

In [105]:
# Move every staging version to 'Production', since they all 'passed' the test as good models
for ver in versions:
    new_stage = "Production"

    client.transition_model_version_stage(
        name=model_name,  # registered model name defined above
        version=ver,      # registry version to move
        stage=new_stage
    )

