## Data Modeling with Hyperopt and MLflow

This notebook explores Databricks functionality that does not require distributed computing, although distributed computing is also useful here. Consequently, the data is handled with pandas APIs rather than PySpark or Koalas, as done in other notebooks. The main objective is to use Hyperopt and MLflow in a binary classification task that would otherwise rely on standard approaches, such as using only libraries like scikit-learn and XGBoost together with a handmade API for model tracking.

As mentioned, the data is handled with pandas, given that its volume is relatively small and the pre-processing code was built on that Python framework. The pre-processing procedures are wrapped in a single function whose components can be found in the module-like notebook "FunctionsClasses". The response variable is binary and moderately imbalanced with respect to the positive class. Concerning the learning task, two learning algorithms are used: logistic regression (from scikit-learn) and gradient boosting (from XGBoost).

[MLflow](https://www.mlflow.org/docs/latest/index.html) is an API that, together with the Databricks UI, provides a comprehensive and user-friendly framework for model tracking and registry. Within an MLflow context, several elements of a training run (from parameters and metrics to the fitted models) can be readily logged and later retrieved through the MLflow API. This helps to explore different models and to retrieve the most appropriate one in order to register it and then deploy it to production.

[Hyperopt](http://hyperopt.github.io/hyperopt/) is an efficient library for optimizing hyper-parameters of machine learning models; it can also optimize the choice of the model itself. Its basic usage is simple: create a function that takes hyper-parameter values as arguments and returns the value of an objective function, then define the search space and the other elements of the optimization (such as the algorithm and the maximum number of evaluations). Even though distributed computing is not required, once the notebook is attached to a Spark cluster the search for the best hyper-parameter values can be distributed for faster results. The synergy between MLflow and Hyperopt is strong: the former keeps track of all models fitted during the search conducted by the latter.
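
To make the objective-function contract concrete, here is a minimal sketch that uses only the standard library. It is not Hyperopt: `random_search` is a hypothetical stand-in for `fmin` that performs plain random search instead of TPE, `STATUS_OK` only mirrors the role of Hyperopt's status flag, and the toy score replaces a real cross-validated ROC-AUC. What it does share with the notebook is the shape of the objective: a dictionary of hyper-parameter values goes in, and a dictionary with a `loss` to minimize comes out.

```python
import random

STATUS_OK = "ok"  # placeholder playing the role of Hyperopt's status flag


def objective(params):
    """Toy objective: the score peaks at C = 1.0, so the loss is its negative."""
    score = 1.0 - (params["C"] - 1.0) ** 2  # stand-in for a CV ROC-AUC
    return {"status": STATUS_OK, "loss": -score}


def random_search(fn, space, max_evals, seed=0):
    """Plain random search (NOT TPE): sample, evaluate, keep the lowest loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_evals):
        params = {name: rng.choice(options) for name, options in space.items()}
        result = fn(params)
        trials.append((result["loss"], params))
    return min(trials, key=lambda trial: trial[0])


search_space = {"C": [0.001, 0.01, 0.1, 1.0, 10.0]}
best_loss, best_params = random_search(objective, search_space, max_evals=50)
print(best_params, -best_loss)
```

Because the optimizer minimizes, a metric that should be maximized (such as ROC-AUC) is returned with its sign flipped, which is the same convention the objective functions in this notebook follow.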

Below is the code that imports and pre-processes the data, followed by a large section on data modeling. First, default models are fitted using fixed hyper-parameter values, with MLflow logging different elements of the model training. Then, Hyperopt is used to search for the best hyper-parameter values for each learning algorithm (logistic regression and XGBoost). K-folds CV is the validation strategy, instead of a train-validation-test split. Finally, using those best values, final models are fitted, and the code illustrates how to register them in Databricks DBFS.

## Libraries

In [0]:
%run "./FunctionsClasses"

In [0]:
import pandas as pd
import numpy as np
import os

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn import __version__ as sk_version

import xgboost as xgb
from xgboost import __version__ as xgb_version

import mlflow
from mlflow.models.signature import infer_signature
from mlflow.utils.environment import _mlflow_conda_env

from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK, space_eval
from hyperopt.pyll import scope

import cloudpickle

## Functions and classes

## Settings

Python interpreter will be restarted.
Collecting unidecode
  Downloading Unidecode-1.3.2-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.2
Python interpreter will be restarted.


## Importing data

In [0]:
df = spark.read.format('csv').\
           options(header='true', delimiter = ',', inferSchema='true').\
           load("/FileStore/shared_uploads/matheusf.rosso@gmail.com/fraud_data_sample.csv")
df = df.toPandas().sort_values('epoch', ascending=True)

# Converting epoch to datetime:
df['date'] = df['epoch'].apply(lambda x: epoch_to_date(float(x)))

print(f'Type of df: {type(df)}.')
print(f'Shape of df: {df.shape}.')
print(f'Number of unique orders: {df.order_id.nunique()}.')
print('Time interval: from {0} to {1}.'.format(str(df.date.apply(lambda x: x.date()).min()),
                                               str(df.date.apply(lambda x: x.date()).max())))

# Support variables:
drop_vars = ['y', 'order_amount', 'store_id', 'order_id', 'status', 'epoch', 'date', 'weight']

# df.head(3)

Type of df: <class 'pandas.core.frame.DataFrame'>.
Shape of df: (8361, 1429).
Number of unique orders: 8361.
Time interval: from 2021-05-17 to 2021-06-27.


### Train-test split

In [0]:
df_train, df_test = train_test_split(df, preserve_date=True, date_var='date', test_ratio=0.25, shuffle=False, seed=None)

# Time interval of the training and test data:
first_date_train = str(df_train['date'].min().date())
last_date_train = str(df_train['date'].max().date())
first_date_test = str(df_test['date'].min().date())
last_date_test = str(df_test['date'].max().date())

print(f'Shape of df_train: {df_train.shape}.')
print(f'Number of unique instances: {df_train.order_id.nunique()}.')
print(f'Time interval: ({first_date_train}, {last_date_train}).\n')

print(f'Shape of df_test: {df_test.shape}.')
print(f'Number of unique instances: {df_test.order_id.nunique()}.')
print(f'Time interval: ({first_date_test}, {last_date_test}).')

# df_train.head(3)

Shape of df_train: (6313, 1429).
Number of unique instances: 6313.
Time interval: (2021-05-17, 2021-06-09).

Shape of df_test: (2048, 1429).
Number of unique instances: 2048.
Time interval: (2021-06-10, 2021-06-27).
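
The `train_test_split` helper used above comes from the "FunctionsClasses" notebook and is not shown here. As a rough illustration of what a date-preserving split could look like, the sketch below sorts rows chronologically and cuts at a date boundary so that no day is shared between training and test data. The function name `chronological_split`, its arguments, and the toy rows are all assumptions made for this example, not the author's implementation.

```python
from datetime import date, timedelta


def chronological_split(rows, date_key, test_ratio):
    """Split rows by time, keeping every calendar day entirely on one side."""
    ordered = sorted(rows, key=lambda row: row[date_key])
    # Position where the test share would start if days could be broken up:
    cut = int(round(len(ordered) * (1 - test_ratio)))
    cut = min(max(cut, 1), len(ordered) - 1)
    # Move the boundary to the first row of that day:
    cutoff_date = ordered[cut][date_key]
    train = [row for row in ordered if row[date_key] < cutoff_date]
    test = [row for row in ordered if row[date_key] >= cutoff_date]
    return train, test


# Toy data: three orders on day 0, three on day 1, two on day 2.
start = date(2021, 5, 17)
rows = [{"order_id": i, "date": start + timedelta(days=d)}
        for i, d in enumerate([0, 0, 0, 1, 1, 1, 2, 2])]

train_rows, test_rows = chronological_split(rows, "date", test_ratio=0.25)
print(len(train_rows), len(test_rows))
```

A boundary aligned to whole days would explain why the realized test share in the output above (2048 of 8361 rows) is close to, but not exactly, the requested 25%.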


### Data pre-processing

In [0]:
df_train, df_test, df_train_scaled, df_test_scaled = pre_process(training_data=df_train, test_data=df_test,
                                                                 vars_to_drop=drop_vars,
                                                                 log_transform=True, standardize=True)

---------------------------------------------------------------------------------------------------------
CLASSIFYING FEATURES AND EARLY SELECTION


Initial number of features: 1421.
3 features were dropped for excessive number of missings!
360 features were dropped for having no variance!
1058 remaining features.


---------------------------------------------------------------------------------------------------------


---------------------------------------------------------------------------------------------------------
ASSESSING MISSING VALUES


Training data:
Number of features with missings: 271 out of 1066 features (25.42%).
Average number of missings: 443 out of 6313 observations (7.02%).

Test data:
Number of features with missings: 158 out of 1066 features (14.82%).
Average number of missings: 128 out of 2048 observations (6.25%).


------------------------------------------------------------------------------

## Data modeling

In [0]:
# Training and test data:
X_train, y_train = (df_train_scaled.drop(drop_vars, axis=1), df_train_scaled['y'])
X_test, y_test = (df_test_scaled.drop(drop_vars, axis=1), df_test_scaled['y'])

### Default models

The [MLflow API](https://www.mlflow.org/docs/latest/python_api/index.html) reference is the main documentation for its implementation. Below, only some of its main functions and classes are used: [autologging](https://www.mlflow.org/docs/latest/tracking.html#automatic-logging), which logs standard elements of model training; [logging](https://mlflow.org/docs/0.4.2/python_api/mlflow.html) of user-defined elements; and logging of elements specific to the model flavor, here [sklearn](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html) and [XGBoost](https://www.mlflow.org/docs/latest/python_api/mlflow.xgboost.html) models.

#### Logistic regression

In [0]:
# Initializing the autologging of model parameters and metrics:
mlflow.autolog()

2021/10/24 18:50:32 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.
2021/10/24 18:50:32 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [0]:
# Creating the MLflow context for logging additional information about the model estimation:
with mlflow.start_run(run_name='lr_default') as lr_run:
  # Creating model estimator object:
  lr_model = LogisticRegression(penalty='l1', C=1.0, solver='liblinear', warm_start=True)

  # Training the model:
  lr_model.fit(X_train, y_train)

  # Predictions and ROC-AUC evaluated on test data:
  pred_test = [p[1] for p in lr_model.predict_proba(X_test)]
  test_roc_auc = roc_auc_score(y_test, pred_test)
  
  # Logging the model artifact:
  mlflow.sklearn.log_model(artifact_path='lr_default_model', sk_model=lr_model)
  
  # Logging the test ROC-AUC:
  mlflow.log_metric("test_roc_auc", test_roc_auc)
  print(f'\nTest ROC-AUC : {test_roc_auc:.4f}.')


Test ROC-AUC : 0.8706.
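
ROC-AUC is the metric tracked throughout this notebook, so a small worked example may help in reading the numbers. The sketch below computes it from its pairwise definition: the share of (positive, negative) pairs in which the positive instance receives the higher score, with ties counting one half. The helper `roc_auc` is written for illustration only; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly:
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Since the metric depends only on the ranking of the scores, it is unaffected by the class imbalance threshold choice, which makes it a convenient objective for comparing the models here.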


Loading the logged model

In [0]:
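# Loaded model ('model' is the artifact path written by autologging; the explicitly logged copy is under 'lr_default_model'):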
lr_model_loaded = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(
    run_id=lr_run.info.run_id
  )
)

# Predictions on test data:
pred_test_loaded = lr_model_loaded.predict(X_test)
pred_test_loaded[0:10]

Out[30]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

#### XGBoost

In [0]:
# Creating the MLflow context for logging additional information about the model estimation:
with mlflow.start_run(run_name='xgboost_default') as xgb_run:
  # Creating the objects containing training and test data (inputs and labels):
  train = xgb.DMatrix(data=X_train, label=y_train)
  test = xgb.DMatrix(data=X_test, label=y_test)

  # Creating and training the model:
  xgb_model = xgb.train(params={'subsample': 0.75, 'eta': 0.1, 'max_depth': 3, 'objective': 'binary:logistic'},
                        dtrain=train, num_boost_round=500, evals=[(test, "test")], early_stopping_rounds=50)

  # Predictions and ROC-AUC evaluated on test data:
  pred_test = xgb_model.predict(test)
  test_roc_auc = roc_auc_score(y_test, pred_test)
  
  # Logging the model artifact:
  mlflow.xgboost.log_model(artifact_path='xgboost_default_model', xgb_model=xgb_model)
  
  # Logging the test ROC-AUC:
  mlflow.log_metric("test_roc_auc", test_roc_auc)
  print(f'\nTest ROC-AUC : {test_roc_auc:.4f}.')

[0]	test-logloss:0.60759
[1]	test-logloss:0.53743
[2]	test-logloss:0.47931
[3]	test-logloss:0.42967
[4]	test-logloss:0.38732
[5]	test-logloss:0.35132
[6]	test-logloss:0.31941
[7]	test-logloss:0.29195
[8]	test-logloss:0.26799
[9]	test-logloss:0.24653
[10]	test-logloss:0.22833
[11]	test-logloss:0.21176
[12]	test-logloss:0.19719
[13]	test-logloss:0.18390
[14]	test-logloss:0.17216
[15]	test-logloss:0.16178
[16]	test-logloss:0.15233
[17]	test-logloss:0.14418
[18]	test-logloss:0.13714
[19]	test-logloss:0.13026
[20]	test-logloss:0.12360
[21]	test-logloss:0.11815
[22]	test-logloss:0.11388
[23]	test-logloss:0.11004
[24]	test-logloss:0.10622
[25]	test-logloss:0.10287
[26]	test-logloss:0.10091
[27]	test-logloss:0.09778
[28]	test-logloss:0.09561
[29]	test-logloss:0.09301
[30]	test-logloss:0.09118
[31]	test-logloss:0.08929
[32]	test-logloss:0.08763
[33]	test-logloss:0.08626
[34]	test-logloss:0.08485
[35]	test-logloss:0.08314
[36]	test-logloss:0.08193
[37]	test-logloss:0.08038
[38]	test-logloss:0.07
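
The training call above passes `early_stopping_rounds=50` together with an evaluation set, so boosting stops once the monitored log-loss has not improved for 50 consecutive rounds. The sketch below reproduces that stopping rule in plain Python on a made-up sequence of losses; `early_stopping_round` and the loss values are hypothetical and only illustrate the logic, not XGBoost's internals.

```python
def early_stopping_round(losses, patience):
    """Return (best_round, last_round_run) for a patience-based stopping rule."""
    best_loss = float("inf")
    best_round = -1
    for rnd, loss in enumerate(losses):
        if loss < best_loss:
            # New best value: remember it and reset the patience window.
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, rnd
    return best_round, len(losses) - 1


# The loss improves until round 2, then degrades; with patience=2 we stop at round 4.
toy_losses = [0.50, 0.40, 0.30, 0.31, 0.32, 0.33]
print(early_stopping_round(toy_losses, patience=2))  # (2, 4)
```

Note that the notebook monitors the test set for early stopping, so the reported test ROC-AUC may be slightly optimistic; monitoring a separate validation split would avoid that.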

Loading the logged model

In [0]:
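# Loaded model ('model' is the artifact path written by autologging; the explicitly logged copy is under 'xgboost_default_model'):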
xgb_model_loaded = mlflow.xgboost.load_model(
  'runs:/{run_id}/model'.format(
    run_id=xgb_run.info.run_id
  )
)

# Predictions on test data:
pred_test_loaded = xgb_model_loaded.predict(test)
pred_test_loaded[0:10]

Out[32]: array([0.00524007, 0.00260701, 0.00251006, 0.04042081, 0.00893994,
       0.00692396, 0.00344518, 0.00198182, 0.00233956, 0.01562605],
      dtype=float32)

### Optimizing hyper-parameters

Here, [Hyperopt API](http://hyperopt.github.io/hyperopt/#documentation) is used for optimizing hyper-parameters of machine learning models. First, the [search space](http://hyperopt.github.io/hyperopt/getting-started/search_spaces/) is declared, and then optimization takes place inside an MLflow context. Using an object from [SparkTrials class](http://hyperopt.github.io/hyperopt/scaleout/spark/), the search is conducted according to a distributed plan of computing.
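
Each candidate evaluated during the search is scored with 3-fold CV on the training data through `KFold(3)`. With its default settings, scikit-learn's `KFold` does not shuffle, so the folds are contiguous blocks of rows, and the first `n_samples % n_splits` folds receive one extra row. The sketch below reproduces that index logic in plain Python; `kfold_indices` is a name chosen for this example.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional observation.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in folds:
    print(val_idx)
```

If the training data keeps its chronological order, each validation fold is then a contiguous period of time, and the reported CV metric is the plain average of the per-fold ROC-AUC values.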

#### Logistic regression

In [0]:
# Search space for the hyper-parameter search:
lr_search_space = {
  'C': hp.choice('C', [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.5, 0.75, 1.0, 10.0])
}

In [0]:
# Function that takes hyper-parameter values as arguments and returns the objective value for minimization:
def train_model(params):
  # With MLflow autologging, hyperparameters and the trained model are automatically logged to MLflow.
  mlflow.autolog()
  
  # Creating the MLflow context for logging additional information about the model estimation:
  with mlflow.start_run(nested=True, run_name='opt_lr'):
    val_roc_auc = []

    # Loop over folds of data:
    for train, val in KFold(3).split(X_train):
        # Creating model estimator object:
        model = LogisticRegression(penalty='l1', C=float(params['C']), solver='liblinear', warm_start=True)

        # Training the model:
        model.fit(X_train.iloc[train, :], y_train.iloc[train])

        # Validation data:
        X_val, y_val = X_train.iloc[val, :], y_train.iloc[val]

        # Predictions and ROC-AUC evaluated on validation data:
        pred_val = [p[1] for p in model.predict_proba(X_val)]
        val_roc_auc.append(roc_auc_score(y_val, pred_val))

    # ROC-AUC calculated through K-folds CV:
    val_roc_auc = np.nanmean(val_roc_auc)

    # Logging the K-folds CV ROC-AUC:
    mlflow.log_metric('val_roc_auc', val_roc_auc)
    
    # Returning the objective function for minimization:
    return {'status': STATUS_OK, 'loss': -1*val_roc_auc}
  
# Defining the strategy of distributed computing:
spark_trials = SparkTrials(parallelism=10)
 
# MLflow context for tracking hyper-parameters tuning:
with mlflow.start_run(run_name='opt_lr'):
  best_params = fmin(
    fn=train_model, 
    space=lr_search_space, 
    algo=tpe.suggest, 
    max_evals=96,
    trials=spark_trials
  )

Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs.
To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks. Click the 'stderr' link for a task to view trial logs.
  0%|          | 0/96 [00:00<?, ?trial/s, best loss=?]
  1%|          | 1/96 [00:56<1:28:53, 56.14s/trial, best loss: -0.8554495819187787]
  2%|▏         | 2/96 [01:01<1:04:12, 40.99s/trial, best loss: -0.9179919962611008]
  4%|▍         | 4/96 [01:05<44:52, 29.26s/trial, best loss: -0.9275053222316195]
  5%|▌         | 5/96 [01:51<52:07, 34.37s/trial, best loss: -0.9275053222316195]
  6%|▋         | 6/96 [01:56<37:58, 25.31s/trial, best loss: -0.9275053222316195]
  7%|▋         | 7/96 [01:58<

Assessing the optimization

In [0]:
# Best model according to the ROC-AUC of K-folds CV:
best_run_lr = mlflow.search_runs(order_by=['metrics.val_roc_auc DESC']).iloc[0]

print(f'ROC-AUC from K-folds CV of the best run: {best_run_lr["metrics.val_roc_auc"]:.4f}.')

# Best hyper-parameter values:
lr_opt_params = space_eval(lr_search_space, best_params)
print(f'Best hyper-parameters: {lr_opt_params}.')

# Best models:
mlflow.search_runs(order_by=['metrics.val_roc_auc DESC'])[['run_id', 'experiment_id', 'status', 'artifact_uri', 'start_time', 'end_time',
                                                           'metrics.loss', 'metrics.val_roc_auc', 'params.C',
                                                           'tags.mlflow.runName']].head(3)

ROC-AUC from K-folds CV of the best run: 0.9352.
Best hyper-parameters: {'C': 0.1}.


| | run_id | experiment_id | status | artifact_uri | start_time | end_time | metrics.loss | metrics.val_roc_auc | params.C | tags.mlflow.runName |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c55358779d1544a8bf8b15dc32e8781a | 721477946143311 | FINISHED | dbfs:/databricks/mlflow-tracking/7214779461433... | 2021-10-24 19:14:37.989000+00:00 | 2021-10-24 19:15:54.967000+00:00 | -0.935206 | 0.935206 | 4 | |
| 1 | 6461e742461e4d0fa30c2e702cc208ec | 721477946143311 | FINISHED | dbfs:/databricks/mlflow-tracking/7214779461433... | 2021-10-24 19:13:49.577000+00:00 | 2021-10-24 19:15:27.555000+00:00 | -0.935206 | 0.935206 | 4 | |
| 2 | 5b03728c07f6454bb7463ea12efde2fd | 721477946143311 | FINISHED | dbfs:/databricks/mlflow-tracking/7214779461433... | 2021-10-24 19:12:28.148000+00:00 | 2021-10-24 19:14:24.429000+00:00 | -0.935206 | 0.935206 | 4 | |
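
For a dimension declared with `hp.choice`, `fmin` returns the index of the selected option rather than its value, which is why `space_eval` is called before printing the best hyper-parameters. The value 4 in the `params.C` column is consistent with this: it appears to be the position of 0.1 in the list of candidates. The sketch below shows the index-to-value mapping in plain Python; `resolve_choice` is a hypothetical helper and `raw_best` is assumed to match what `fmin` returned in this run.

```python
# Candidate values, in the same order as in lr_search_space:
c_options = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.5, 0.75, 1.0, 10.0]


def resolve_choice(options, index):
    """Map the index reported for a choice dimension back to its value."""
    return options[int(index)]


raw_best = {"C": 4}  # assumed raw result: an index, not a value
resolved = {"C": resolve_choice(c_options, raw_best["C"])}
print(resolved)  # {'C': 0.1}
```

Passing the raw index directly to `LogisticRegression(C=...)` would silently fit a very different model, so the final training step should always use the resolved values.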


Training the final model

In [0]:
# Class to reconcile the prediction method with the trained sklearn classification model:
class SklearnModelWrapper(mlflow.pyfunc.PythonModel):
  def __init__(self, model):
    self.model = model
    
  def predict(self, context, model_input):
    return self.model.predict_proba(model_input)[:,1]

# Creating the MLflow context for logging additional information about the model estimation:
with mlflow.start_run(run_name='lr_model_final') as lr_run_final:
  # Creating model estimator object:
  lr_model = LogisticRegression(penalty='l1', C=float(lr_opt_params['C']), solver='liblinear', warm_start=True)
  
  # Training the model:
  lr_model.fit(X_train, y_train)

  # Predictions and ROC-AUC evaluated on test data:
  pred_test = [p[1] for p in lr_model.predict_proba(X_test)]
  test_roc_auc = roc_auc_score(y_test, pred_test)
  
  # Python object of the trained model with predict method:
  wrappedModel = SklearnModelWrapper(lr_model)

  # Signature that defines the schema of the model's inputs and outputs in order to validate inputs after deployment:
  signature = infer_signature(X_train, wrappedModel.predict(None, X_train))

  # Defining the conda environment for model serving:
  conda_env = _mlflow_conda_env(
        additional_conda_deps=None,
        additional_pip_deps=["cloudpickle=={}".format(cloudpickle.__version__), "scikit-learn=={}".format(sk_version)],
        additional_conda_channels=None,
    )

  # Logging the model artifact:
  mlflow.pyfunc.log_model(
    "lr_model_final",
    python_model=wrappedModel,
    conda_env=conda_env,
    signature=signature
  )
  
  # Logging the test ROC-AUC:
  mlflow.log_metric("test_roc_auc", test_roc_auc)
  print(f"\nTest ROC-AUC: {test_roc_auc:.4f}.")


Test ROC-AUC: 0.9120.


Loading the model

In [0]:
# Loaded model (the wrapped pyfunc model logged above under 'lr_model_final'):
lr_model_loaded = mlflow.pyfunc.load_model(
  'runs:/{run_id}/lr_model_final'.format(
    run_id=lr_run_final.info.run_id
  )
)

# Predictions on test data:
pred_test_loaded = lr_model_loaded.predict(X_test)

Model registry

In [0]:
# # Path to the model inside the DBFS:
# lr_uri = lr_run_final.info.artifact_uri
# lr_model_name = 'lr_model'

# # Registering the model:
# lr_model_version = mlflow.register_model(f'{lr_uri}/lr_model_final', lr_model_name)

#### XGBoost

In [0]:
# Search space for the hyper-parameter search:
xgb_search_space = {
  'subsample': hp.uniform('subsample', 0.5, 1.0),
  'max_depth': hp.choice('max_depth', [i+1 for i in range(10)]),
  'eta': hp.uniform('eta', 0.0001, 0.1)
}

In [0]:
# Function that takes hyper-parameter values as arguments and returns the objective value for minimization:
def train_model(params):
  # With MLflow autologging, hyperparameters and the trained model are automatically logged to MLflow.
  mlflow.xgboost.autolog()
  
  # Creating the MLflow context for logging additional information about the model estimation:
  with mlflow.start_run(nested=True, run_name='opt_xgb'):
    val_roc_auc = []

    # Loop over folds of data:
    for train, val in KFold(3).split(X_train):
      # Creating the objects containing training and validation data (inputs and labels):
      train_data = xgb.DMatrix(data=X_train.iloc[train, :], label=y_train.iloc[train])
      val_data = xgb.DMatrix(data=X_train.iloc[val, :], label=y_train.iloc[val])

      # Creating and training the model:
      model = xgb.train(params={'subsample': params['subsample'], 'eta': params['eta'], 'max_depth': params['max_depth'],
                                'objective': 'binary:logistic'},
                        dtrain=train_data, num_boost_round=500, evals=[(val_data, "val")], early_stopping_rounds=50)
      
      # Predictions and ROC-AUC evaluated on validation data:
      pred_val = model.predict(val_data)
      val_roc_auc.append(roc_auc_score(y_train.iloc[val], pred_val))

    # ROC-AUC calculated through K-folds CV:
    val_roc_auc = np.nanmean(val_roc_auc)

    # Logging the K-folds CV ROC-AUC:
    mlflow.log_metric('val_roc_auc', val_roc_auc)
    
    # Returning the objective function for minimization:
    return {'status': STATUS_OK, 'loss': -1*val_roc_auc}
  
# Defining the strategy of distributed computing:
spark_trials = SparkTrials(parallelism=10)
 
# MLflow context for tracking hyper-parameters tuning:
with mlflow.start_run(run_name='opt_xgb'):
  best_params = fmin(
    fn=train_model, 
    space=xgb_search_space, 
    algo=tpe.suggest, 
    max_evals=36,
    trials=spark_trials
  )

Assessing the optimization

In [0]:
# Best model according to the ROC-AUC of K-folds CV:
best_run_xgb = mlflow.search_runs(order_by=['metrics.val_roc_auc DESC']).iloc[0]

print(f'ROC-AUC from K-folds CV of the best run: {best_run_xgb["metrics.val_roc_auc"]:.4f}.')

# Best hyper-parameter values:
opt_params_xgb = space_eval(xgb_search_space, best_params)
print(f'Best hyper-parameters: {opt_params_xgb}.')

# Best models:
mlflow.search_runs(order_by=['metrics.val_roc_auc DESC'])[['run_id', 'experiment_id', 'status', 'artifact_uri', 'start_time', 'end_time',
                                                           'metrics.loss', 'metrics.val_roc_auc', 'params.subsample',
                                                           'params.max_depth', 'params.eta', 'tags.mlflow.runName']].head(3)

Training the final model

In [0]:
# Creating the MLflow context for logging additional information about the model estimation:
with mlflow.start_run(run_name='xgboost_final') as xgb_run_final:
  # Creating the objects containing training and test data (inputs and labels):
  train = xgb.DMatrix(data=X_train, label=y_train)
  test = xgb.DMatrix(data=X_test, label=y_test)

  # Creating and training the model:
  xgb_model = xgb.train(params={'subsample': opt_params_xgb['subsample'], 'eta': opt_params_xgb['eta'],
                                'max_depth': opt_params_xgb['max_depth'],
                                'objective': 'binary:logistic'},
                        dtrain=train, num_boost_round=500, evals=[(test, "test")], early_stopping_rounds=50)
  
  # Predictions and ROC-AUC evaluated on test data:
  pred_test = xgb_model.predict(test)
  test_roc_auc = roc_auc_score(y_test, pred_test)

  # Signature that defines the schema of the model's inputs and outputs in order to validate inputs after deployment:
  signature = infer_signature(X_train, xgb_model.predict(train))

  # Defining the conda environment for model serving:
  conda_env = _mlflow_conda_env(
        additional_conda_deps=None,
        additional_pip_deps=["cloudpickle=={}".format(cloudpickle.__version__), "xgboost=={}".format(xgb_version)],
        additional_conda_channels=None,
    )

  # Logging the model artifact:
  mlflow.xgboost.log_model(artifact_path='xgb_model_final', xgb_model=xgb_model, conda_env=conda_env, signature=signature)
  
  # Logging the test ROC-AUC:
  mlflow.log_metric("test_roc_auc", test_roc_auc)
  print(f"\nTest ROC-AUC: {test_roc_auc:.4f}.")

Loading the model

In [0]:
# Loaded model:
xgb_model_loaded = mlflow.xgboost.load_model(
  'runs:/{run_id}/model'.format(
    run_id=xgb_run_final.info.run_id
  )
)

# Predictions on test data:
pred_test_loaded = xgb_model_loaded.predict(test)

Model registry

In [0]:
# # Path to the model inside the DBFS:
# xgb_uri = xgb_run_final.info.artifact_uri
# xgb_model_name = 'xgb_model'

# # Registering the model:
# xgb_model_version = mlflow.register_model(f'{xgb_uri}/xgb_model_final', xgb_model_name)