![](/files/images/dbxatscale.png)

# Overview

This notebook serves as our model training job. The model structure is taken directly from our AutoML experiment; however, we need to make a couple of changes for deployment. The main one is that the training data is loaded directly from a Spark DataFrame generated by an AtScale query rather than from a stored table. This means any updates to our semantic model or source dataset will be reflected in the most recent version of our model.

In [0]:
import mlflow
import databricks.automl_runtime

target_col = "item type"

We have created a notebook in the same directory as this demo called `XX_Establish_AtScale_Connection`, which other notebooks can run with `%run` as shown below. Running it creates our AtScale connection and generates the query using AI Link. The query it returns is identical to the one we generated in 01.

In [0]:
%run ./XX_Establish_AtScale_Connection

We need to do some data engineering here because this model is fed by a Spark DataFrame generated by our AtScale query. The advantage is that training always reflects the current data in the underlying table. The downside is that we have to redo the train, validation, and test split, as well as the item type labeling, locally.

In [0]:
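from sklearn.model_selection import train_test_split

# Pull the AtScale query result into a pandas DataFrame
df = spark.sql(query).toPandas()

# Derive the class label from the first three characters of the item name
item_type_codes = {"FOO": 1, "HOB": 2, "HOU": 3}
item_cat = df["item"].str[:3].map(item_type_codes)

# Stop early if any item has a prefix outside the three known categories
if item_cat.isna().any():
  raise ValueError("Found items with an unrecognized category prefix")

df[target_col] = item_cat.astype(int)
df = df.drop(["item", "date"], axis=1)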

df.head(5)

| | average_sales | average_units_sold | max_sales | max_units_sold | population_variance_sales | population_variance_units_sold | sample_standard_deviation_sales | sample_standard_deviation_units_sold | sample_variance_units_sold | total_categories | total_departments | total_items | total_sales | total_states | total_stores | total_transactions | total_units_sold | day_over_day_units_sold | previous_days_units_sold | total_sales_30_prd_mv_avg | total_units_sold_28_day_max | total_units_sold_30_prd_mv_avg | item type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.96 | 7.5 | 1.96 | 19.0 | -2.131628e-15 | 36.45 | 0.0 | 6.363961 | 40.5 | 1.0 | 1.0 | 1.0 | 19.6 | 3.0 | 10.0 | 10.0 | 75.0 | 0.0 | | 19.6 | 75.0 | 75 | 1 |
| 1 | 0.98 | 8.3 | 0.98 | 23.0 | -5.329071e-16 | 35.01 | 0.0 | 6.236986 | 38.9 | 1.0 | 1.0 | 1.0 | 9.8 | 3.0 | 10.0 | 10.0 | 83.0 | 0.0 | | 9.8 | 83.0 | 83 | 1 |
| 2 | 3.16 | 4.3 | 3.28 | 10.0 | 0.0336 | 11.01 | 0.193218 | 3.497618 | 12.233333 | 1.0 | 1.0 | 1.0 | 31.6 | 3.0 | 10.0 | 10.0 | 43.0 | 0.0 | | 31.6 | 43.0 | 43 | 1 |
| 3 | 3.16 | 4.4 | 3.28 | 10.0 | 0.0336 | 10.24 | 0.193218 | 3.373096 | 11.377778 | 1.0 | 1.0 | 1.0 | 31.6 | 3.0 | 10.0 | 10.0 | 44.0 | 0.0 | | 31.6 | 44.0 | 44 | 1 |
| 4 | 1.379 | 0.0 | 1.48 | 0.0 | 0.021129 | 0.0 | 0.153221 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 13.79 | 3.0 | 10.0 | 10.0 | 0.0 | 0.0 | | 13.79 | 0.0 | 0 | 1 |
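
The prefix-to-label step can also be written as a small lookup that fails loudly on an unexpected prefix. The following is a minimal sketch using only the standard library. The helper name `item_type_code` and the sample item IDs are assumptions for illustration (the IDs follow the M5-style naming suggested by the model name) and are not part of the notebook.

```python
# Hypothetical helper: map an item ID to the numeric class label stored in "item type".
ITEM_TYPE_CODES = {"FOO": 1, "HOB": 2, "HOU": 3}

def item_type_code(item_id: str) -> int:
    """Return the class label for an item based on its three-letter prefix."""
    prefix = item_id[:3]
    if prefix not in ITEM_TYPE_CODES:
        # Raising here avoids silently producing a label list shorter than the data
        raise ValueError(f"Unrecognized item prefix: {prefix!r}")
    return ITEM_TYPE_CODES[prefix]

# Assumed example IDs, not taken from the actual dataset
sample_items = ["FOODS_1_001", "HOBBIES_2_010", "HOUSEHOLD_1_005"]
labels = [item_type_code(item) for item in sample_items]
print(labels)
```

Raising on an unknown prefix keeps the labels aligned with the rows, which matters because the label list is assigned back to the DataFrame as a column.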


## Feeding into our AutoML Model
Now that we have a pandas DataFrame containing our training data, we can feed it straight into our AutoML model. The AutoML-generated code reads from a DataFrame called `df_loaded`, so instead of pointing it at the table created in 01, we set `df_loaded` equal to our pandas DataFrame.

In [0]:
import os
import uuid
import shutil
import pandas as pd

df_loaded = df



In [0]:
from databricks.automl_runtime.sklearn.column_selector import ColumnSelector
supported_cols = ["population_variance_units_sold", "total_units_sold_28_day_max", "previous_days_units_sold", "total_sales_30_prd_mv_avg", "max_sales", "average_units_sold", "population_variance_sales", "max_units_sold", "sample_standard_deviation_sales", "average_sales", "total_sales", "total_units_sold_30_prd_mv_avg", "total_units_sold", "sample_standard_deviation_units_sold", "sample_variance_units_sold"]
col_selector = ColumnSelector(supported_cols)

In [0]:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

num_imputers = []
num_imputers.append(("impute_mean", SimpleImputer(), ["average_sales", "average_units_sold", "max_sales", "max_units_sold", "population_variance_sales", "population_variance_units_sold", "previous_days_units_sold", "sample_standard_deviation_sales", "sample_standard_deviation_units_sold", "sample_variance_units_sold", "total_sales", "total_sales_30_prd_mv_avg", "total_units_sold", "total_units_sold_28_day_max", "total_units_sold_30_prd_mv_avg"]))

numerical_pipeline = Pipeline(steps=[
    ("converter", FunctionTransformer(lambda df: df.apply(pd.to_numeric, errors='coerce'))),
    ("imputers", ColumnTransformer(num_imputers)),
    ("standardizer", StandardScaler()),
])

numerical_transformers = [("numerical", numerical_pipeline, ["population_variance_units_sold", "total_units_sold_28_day_max", "previous_days_units_sold", "total_sales_30_prd_mv_avg", "max_sales", "average_units_sold", "population_variance_sales", "max_units_sold", "sample_standard_deviation_sales", "average_sales", "total_sales", "total_units_sold_30_prd_mv_avg", "total_units_sold", "sample_standard_deviation_units_sold", "sample_variance_units_sold"])]

In [0]:
from sklearn.compose import ColumnTransformer

transformers = numerical_transformers

preprocessor = ColumnTransformer(transformers, remainder="passthrough", sparse_threshold=0)
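
To make the numerical pipeline concrete, here is a minimal standard-library sketch of what mean imputation followed by standardization does to a single column. It is an illustration only: the function names and the toy values are assumptions, and it mirrors the default behavior of `SimpleImputer` (mean strategy) and `StandardScaler` (population standard deviation) rather than calling them.

```python
import math
from statistics import fmean

def impute_mean(values):
    """Replace NaN entries with the mean of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    fill = fmean(observed)
    return [fill if math.isnan(v) else v for v in values]

def standardize(values):
    """Center to zero mean and scale by the population standard deviation."""
    mean = fmean(values)
    std = math.sqrt(fmean([(v - mean) ** 2 for v in values]))
    scale = std if std > 0 else 1.0  # leave constant columns unscaled
    return [(v - mean) / scale for v in values]

# Toy column with one missing value (assumed data)
column = [1.0, float("nan"), 3.0]
imputed = impute_mean(column)
scaled = standardize(imputed)
print(imputed, scaled)
```

The missing entry takes the column mean, so after standardization it lands exactly at zero and does not pull the model in either direction.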

In [0]:
from sklearn.model_selection import train_test_split

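# Hold out 40% of the rows, leaving 60% for training
split_train_df, split_test_df = train_test_split(df_loaded, test_size = 0.4)
# Split the held-out rows evenly into test and validation sets (roughly 20% each)
split_test_df, split_val_df = train_test_split(split_test_df, test_size = 0.5)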

X_train = split_train_df.drop([target_col], axis=1)
y_train = split_train_df[target_col]

X_val = split_val_df.drop([target_col], axis=1)
y_val = split_val_df[target_col]

X_test = split_test_df.drop([target_col], axis=1)
y_test = split_test_df[target_col]
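
The two-stage split works out to roughly 60/20/20. The sketch below reproduces the row counts with plain arithmetic, assuming that `train_test_split` rounds the held-out share up and gives the remainder to the other partition. The function name is hypothetical, and the total of 7,126 rows is inferred from this run's output (4,275 training rows plus 1,426 and 1,425 evaluation examples) rather than stated anywhere in the notebook.

```python
import math

def split_sizes(n_rows, first_test_size=0.4, second_test_size=0.5):
    """Return (train, validation, test) row counts for the two-stage split."""
    # Stage 1: hold out a share of all rows; the rest is the training set
    holdout = math.ceil(first_test_size * n_rows)
    n_train = n_rows - holdout
    # Stage 2: the second split's "test" side becomes the validation set
    n_val = math.ceil(second_test_size * holdout)
    n_test = holdout - n_val
    return n_train, n_val, n_test

print(split_sizes(7126))
```

Because neither call passes a `random_state`, the rows that land in each partition differ from run to run even though the sizes stay the same.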

In [0]:
from mlflow.models import Model, infer_signature, ModelSignature
from mlflow.pyfunc import PyFuncModel
from mlflow import pyfunc
import sklearn
from sklearn import set_config
from sklearn.pipeline import Pipeline
import lightgbm
from lightgbm import LGBMClassifier

from hyperopt import hp, tpe, fmin, STATUS_OK, Trials

# Create a separate pipeline to transform the validation dataset. This is used for early stopping.
mlflow.sklearn.autolog(disable=True)
pipeline_val = Pipeline([
    ("column_selector", col_selector),
    ("preprocessor", preprocessor),
])
pipeline_val.fit(X_train, y_train)
X_val_processed = pipeline_val.transform(X_val)

def objective(params):
  with mlflow.start_run(experiment_id="1272847532445219") as mlflow_run:
    lgbmc_classifier = LGBMClassifier(**params)

    model = Pipeline([
        ("column_selector", col_selector),
        ("preprocessor", preprocessor),
        ("classifier", lgbmc_classifier),
    ])

    # Enable automatic logging of input samples, metrics, parameters, and models
    mlflow.sklearn.autolog(
        log_input_examples=True,
        silent=True)

    model.fit(X_train, y_train, classifier__callbacks=[lightgbm.early_stopping(5), lightgbm.log_evaluation(0)], classifier__eval_set=[(X_val_processed,y_val)])

    
    # Log metrics for the training set
    mlflow_model = Model()
    pyfunc.add_to_model(mlflow_model, loader_module="mlflow.sklearn")
    pyfunc_model = PyFuncModel(model_meta=mlflow_model, model_impl=model)
    training_eval_result = mlflow.evaluate(
        model=pyfunc_model,
        data=X_train.assign(**{str(target_col):y_train}),
        targets=target_col,
        model_type="classifier",
        evaluator_config = {"log_model_explainability": False,
                            "metric_prefix": "training_"  }
    )
    lgbmc_training_metrics = training_eval_result.metrics
    # Log metrics for the validation set
    val_eval_result = mlflow.evaluate(
        model=pyfunc_model,
        data=X_val.assign(**{str(target_col):y_val}),
        targets=target_col,
        model_type="classifier",
        evaluator_config = {"log_model_explainability": False,
                            "metric_prefix": "val_"  }
    )
    lgbmc_val_metrics = val_eval_result.metrics
    # Log metrics for the test set
    test_eval_result = mlflow.evaluate(
        model=pyfunc_model,
        data=X_test.assign(**{str(target_col):y_test}),
        targets=target_col,
        model_type="classifier",
        evaluator_config = {"log_model_explainability": False,
                            "metric_prefix": "test_"  }
    )
    lgbmc_test_metrics = test_eval_result.metrics

    loss = -lgbmc_val_metrics["val_f1_score"]

    # Truncate metric key names so they can be displayed together
    lgbmc_val_metrics = {k.replace("val_", ""): v for k, v in lgbmc_val_metrics.items()}
    lgbmc_test_metrics = {k.replace("test_", ""): v for k, v in lgbmc_test_metrics.items()}

    return {
      "loss": loss,
      "status": STATUS_OK,
      "val_metrics": lgbmc_val_metrics,
      "test_metrics": lgbmc_test_metrics,
      "model": model,
      "run": mlflow_run,
    }
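
The objective returns the negated validation F1 score because hyperopt's `fmin` minimizes the loss. A minimal sketch of that convention, using made-up trial names and scores rather than anything from this run:

```python
# Assumed validation F1 scores for three hypothetical trials
candidate_f1 = {"trial_a": 0.91, "trial_b": 0.97, "trial_c": 0.94}

# Negate the score so that a higher F1 becomes a lower loss
losses = {name: -f1 for name, f1 in candidate_f1.items()}

# Minimizing the loss selects the trial with the highest F1
best_trial = min(losses, key=losses.get)
print(best_trial, losses[best_trial])
```

This is why the progress bar reports a negative "best loss": its magnitude is the validation F1 score of the best trial.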

In [0]:
space = {
  "colsample_bytree": 0.7210340071691833,
  "lambda_l1": 0.2347409701792684,
  "lambda_l2": 0.176191006422464,
  "learning_rate": 0.050834776202825935,
  "max_bin": 459,
  "max_depth": 7,
  "min_child_samples": 133,
  "n_estimators": 1876,
  "num_leaves": 9,
  "path_smooth": 55.8697083019294,
  "subsample": 0.6014265349166578,
  "random_state": 702788258,
}

In [0]:
trials = Trials()
fmin(objective,
     space=space,
     algo=tpe.suggest,
     max_evals=1,  # Increase this when widening the hyperparameter search space.
     trials=trials)

best_result = trials.best_trial["result"]
model = best_result["model"]
mlflow_run = best_result["run"]

display(
  pd.DataFrame(
    [best_result["val_metrics"], best_result["test_metrics"]],
    index=["validation", "test"]))

set_config(display="diagram")
model

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4184
[LightGBM] [Info] Number of data points in the train set: 4275, number of used features: 15
[LightGBM] [Info] Start training from score -0.840847
[LightGBM] [Info] Start training from score -1.247397
[LightGBM] [Info] Start training from score -1.267966
2023/08/04 17:51:55 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/08/04 17:51:55 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as multiclass dataset, number of classes is inferred as 3
2023/08/04 17:51:57 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/08/04 17:51:57 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as multiclass dataset, number of classes is inferred as 3
2023/08/04 17:51:59 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/08/04 17:51:59 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as multiclass dataset, number of classes is inferred as 3
100%|██████████| 1/1 [00:14<00:00, 14.48s/trial, best loss: -0.9992985811918921]


| | score | example_count | accuracy_score | recall_score | precision_score | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|---|---|
| validation | 0.9992987377279102 | 1426 | 0.9992987377279102 | 0.9992987377279102 | 0.9992999202949796 | 0.999298581191892 | 0.0073121432363053 | 0.9999930788492296 |
| test | 0.9978947368421052 | 1425 | 0.9978947368421052 | 0.9978947368421052 | 0.9979009311647826 | 0.9978921919040918 | 0.0081158346560743 | 0.9999736335057298 |



In [0]:
# model_uri for the generated model
print(f"runs:/{ mlflow_run.info.run_id }/model")

runs:/bdef39ee21b04ddbab4cf667b270e4f3/model


In [0]:
model_name = "Databricks_AI_Link_M5_Demo"
model_uri = f"runs:/{ mlflow_run.info.run_id }/model"
registered_model_version = mlflow.register_model(model_uri, model_name)

Registered model 'Databricks_AI_Link_M5_Demo' already exists. Creating a new version of this model...
2023/08/04 17:52:01 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: Databricks_AI_Link_M5_Demo, version 13
Created version '13' of model 'Databricks_AI_Link_M5_Demo'.
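
Once a version is registered, it can be addressed by name and version instead of by run ID. The sketch below only builds the registry URI. The helper name is an assumption, and the commented-out load call would need `mlflow` installed and access to the model registry.

```python
def model_version_uri(name: str, version) -> str:
    """Build a Model Registry URI of the form models:/<name>/<version>."""
    return f"models:/{name}/{version}"

# Name and version taken from the registration output above
registry_uri = model_version_uri("Databricks_AI_Link_M5_Demo", 13)
print(registry_uri)

# With mlflow available, the registered model could then be loaded for scoring:
# loaded_model = mlflow.pyfunc.load_model(registry_uri)
```

Referring to the model through the registry keeps downstream jobs independent of the specific training run that produced it.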


![](/files/images/model.png)

### Create Webhook
This lets our developers know when the model has been retrained, so they can keep an eye on things like compute cost.

In [0]:
%run ./06_Webhook_Set_Up

In [0]:
from datetime import datetime

# Collect the run details for the notification message
now = datetime.now()
user = "RUN BY: " + str(registered_model_version.user_id)
version = "VERSION: " + str(registered_model_version.version)
uri = "MODEL URI: " + model_uri
# Use a separate name so the trained `model` pipeline is not overwritten
model_label = "MODEL_NAME: " + model_name
date = "DATE: " + str(now)

In [0]:
# Format the success message
messages = ["V--------------TRAINING COMPLETE----------------V"]
messages += [model_label]
messages += [date]
messages += [user]
messages += [version]
messages += ["*******************************************"]
messages += [" "]
messages += [" "]
messages += [" "]

Webhook Set Up Complete.


In [0]:
send_slack_messages(slack_url, channel, messages)
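
`send_slack_messages` is defined in `06_Webhook_Set_Up`, which is not shown here. As a rough idea of what such a helper might do, the sketch below joins the message lines and posts them as JSON to an incoming-webhook URL using only the standard library. The function names, the payload fields, and the channel name are assumptions, not the notebook's actual implementation.

```python
import json
import urllib.request

def build_slack_payload(channel, messages):
    """Join the message lines into a single text payload (assumed payload shape)."""
    return {"channel": channel, "text": "\n".join(messages)}

def send_slack_messages_sketch(slack_url, channel, messages, timeout=10):
    """POST the payload to the webhook URL and return the HTTP status code."""
    body = json.dumps(build_slack_payload(channel, messages)).encode("utf-8")
    request = urllib.request.Request(
        slack_url,
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Build a payload without sending it; the channel name is hypothetical
payload = build_slack_payload("#model-alerts", ["TRAINING COMPLETE", "VERSION: 13"])
print(json.dumps(payload))
```

Keeping payload construction separate from the network call makes the message format easy to check without posting to a real channel.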

![](/files/images/atscale_logo.png)