# Example notebook for data and model versioning using MLflow and Delta Lake

This notebook is based on the [Databricks example notebook](https://docs.databricks.com/_static/notebooks/mlflow/mlflow-delta-training.html).

The data used is LendingClub loan status data, 2007-2017, which can be found [here](https://www.kaggle.com/husainsb/lendingclub-issued-loans?select=lc_loan.csv).

**This notebook demonstrates**
- Data versioning using Delta Lake and MLflow
- Versioning a model and its dependent model with MLflow (here, Logistic Regression and LIME)
- Copying both model artifacts from MLflow to a specified location


**Cluster Configuration**
- Databricks Runtime Version `6.4 ML (includes Apache Spark 2.4.5, Scala 2.11)`
- Additional libraries needed, both installed from the `default` repository
  - From the Maven repo, coordinate `Azure:mmlspark:0.17`
  - From pip, `mlflow==1.14.1`
  
**Storage Requirements**
- If credential passthrough is not used, mount the containers of the storage account and change the container names accordingly
- If credential passthrough is used, make sure that
  - the "credential passthrough" box is ticked when the cluster is created
  - the correct permissions are set on the storage account

In [None]:
from distutils.version import LooseVersion
import pyspark
from mmlspark import TabularLIME, TabularLIMEModel
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder, StandardScaler, Imputer
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql.types import FloatType
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from typing import Tuple, List
from pathlib import Path

import mlflow
import mlflow.spark

In [None]:
print(f"MLflow version = {mlflow.__version__}")

# Load data from Delta Lake

To load data:
- Mount the Azure Delta Lake storage to Databricks (skip this step if you use credential passthrough)
  - If you use credential passthrough, set `IS_CRED_PASS` to `True`
- Specify the path and the data version needed

In [None]:
IS_CRED_PASS = True
data_path = "/mnt"
cred_passthrough_configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
container_name = "datalake"
if IS_CRED_PASS:
  data_path = f"abfss://{container_name}@dlsloandev.dfs.core.windows.net/lc_loan"  

In [None]:
dbutils.fs.ls(data_path)

## Specify path and version

- The Azure Delta Lake storage is mounted under `/mnt/delta-ds-test/lc_loan` (or accessed directly through credential passthrough using container `delta-ds-test`)
- Data is ingested by year, from 2007 to 2017, using the `issue_d` column as the watermark for versioning.
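
The path and version handling described above can be sketched without Spark or `dbutils`. This is a minimal illustration, not the notebook's own code: the function names `resolve_table_path` and `parse_version` are hypothetical, the defaults are assumed from the widget defaults, and splitting on only the first `/` is a design choice that lets a table live in a nested folder.

```python
from typing import Optional

STORAGE_ACCOUNT = "dlsloandev"  # storage account name that appears in this notebook


def resolve_table_path(widget_value: str,
                       use_passthrough: bool,
                       default_container: str = "delta-ds-test",
                       default_table: str = "lc_loan") -> str:
    """Return an abfss:// URI (credential passthrough) or a /mnt path (mounted container)."""
    value = widget_value.strip().strip("/")
    if value:
        # Split on the first "/" only, so nested table folders stay intact.
        container, _, table = value.partition("/")
        if not table:
            raise ValueError(f"expected '<container>/<table>', got {widget_value!r}")
    else:
        # Assumption: fall back to the widget defaults when the widget is empty.
        container, table = default_container, default_table
    if use_passthrough:
        return f"abfss://{container}@{STORAGE_ACCOUNT}.dfs.core.windows.net/{table}"
    return f"/mnt/{container}/{table}"


def parse_version(widget_value: str) -> Optional[int]:
    """Return None for an empty widget value, otherwise the integer table version."""
    text = widget_value.strip()
    return int(text) if text else None
```

For example, `resolve_table_path("delta-ds-test/lc_loan", use_passthrough=False)` gives the mounted path `/mnt/delta-ds-test/lc_loan`.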

In [None]:
DEFAULT_DATA_CONTAINER = "delta-ds-test"
DEFAULT_ARTIFACT_CONTAINER = "model-artifacts"
dbutils.widgets.text(name="deltaVersion", defaultValue="6", label="Table version, default=6")
dbutils.widgets.text(name="tbl_name", defaultValue="delta-ds-test/lc_loan", label="tbl_name,default=delta-ds-test/lc_loan")
data_version = None if dbutils.widgets.get("deltaVersion") == "" else int(dbutils.widgets.get("deltaVersion"))
DELTA_TABLE_DEFAULT_PATH =  f"/mnt/{DEFAULT_DATA_CONTAINER}/lc_loan"
input_str = dbutils.widgets.get("tbl_name")
data_path = ""
if input_str == "":
  data_path = f"abfss://{container_name}@dlsloandev.dfs.core.windows.net/lc_loan" if IS_CRED_PASS else DELTA_TABLE_DEFAULT_PATH
else:
  container_name, tbl_name = input_str.strip().split("/")
  data_path = f"abfss://{container_name}@dlsloandev.dfs.core.windows.net/{tbl_name}" if IS_CRED_PASS else f"/mnt/{input_str}"
displayHTML(f"Current data path {data_path}, version {data_version}")

In [None]:
displayHTML(data_path)

## Load from Delta Lake

In [None]:
dataset = spark.read.format("delta").option("versionAsOf", data_version).load(data_path)
display(dataset.select('issue_d').withColumn('year', F.year('issue_d')).select('year').distinct())

year
2007
2009
2010
2011
2008


In [None]:
spark.catalog.clearCache()

# Data Transformation

## Create bad loan label

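Create the `bad_loan` label. Only loans with status `Default`, `Charged Off`, or `Fully Paid` are kept, and a loan is labelled bad when its status is not `Fully Paid`, that is, when it is charged off or defaulted.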
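
The labelling rule used in the next cell can be written as a plain Python function. This is only a sketch of the rule for a single status value; `bad_loan_label` is a hypothetical name, and the notebook itself applies the rule with Spark column expressions.

```python
from typing import Optional

# Statuses kept by the filter in the notebook cell.
KEPT_STATUSES = {"Default", "Charged Off", "Fully Paid"}


def bad_loan_label(loan_status: str) -> Optional[bool]:
    """Return True for a bad loan, False for a fully paid one, None if the row is filtered out."""
    if loan_status not in KEPT_STATUSES:
        return None  # any other status (e.g. "Current") is dropped before labelling
    return loan_status != "Fully Paid"
```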

In [None]:
dataset = dataset.filter(dataset.loan_status.isin(["Default", "Charged Off", "Fully Paid"]))\
                       .withColumn("bad_loan", (~(dataset.loan_status == "Fully Paid")).cast("string"))

## Feature Engineering

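- keep only the year of `issue_d` and `earliest_cr_line`
- compute `credit_length_in_years` as the issue year minus the year of the earliest credit line
- add a `net` column (`total_pymnt` minus `loan_amnt`, rounded to 2 decimals) as a new feature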
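
As a worked example of these three features, the sketch below recomputes them in plain Python for one loan, using the values of loan `643218` from the displayed output further down. The function name `engineer_features` is hypothetical. Note that Python's `round` and Spark's `F.round` can differ on exact ties, which does not matter for this row.

```python
from datetime import date
from typing import Dict


def engineer_features(issue_d: date, earliest_cr_line: date,
                      total_pymnt: float, loan_amnt: float) -> Dict[str, float]:
    """Compute the year-based features and the net payment for a single loan."""
    issue_year = float(issue_d.year)
    earliest_year = float(earliest_cr_line.year)
    return {
        "issue_year": issue_year,
        "earliest_year": earliest_year,
        "credit_length_in_years": issue_year - earliest_year,
        "net": round(total_pymnt - loan_amnt, 2),
    }


# Values taken from loan 643218 in the displayed dataset.
row = engineer_features(date(2010, 12, 1), date(1989, 5, 1), 24421.72, 20000.0)
print(row["credit_length_in_years"], row["net"])  # 21.0 4421.72
```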

In [None]:
dataset = (
  dataset.withColumn('issue_year', F.year(F.col('issue_d')).cast('double'))
         .withColumn('earliest_year', F.year(F.col('earliest_cr_line')).cast('double'))
         .withColumn('credit_length_in_years', F.col('issue_year') - F.col('earliest_year'))
         .withColumn('net', F.round(F.col('total_pymnt') - F.col('loan_amnt'), 2))
)

In [None]:
display(dataset)

id,loan_amnt,annual_inc,dti,delinq_2yrs,total_acc,total_pymnt,issue_d,earliest_cr_line,loan_status,bad_loan,issue_year,earliest_year,credit_length_in_years,net
643218,20000.0,45000.0,13.36,0.0,33.0,24421.72,2010-12-01,1989-05-01,Fully Paid,False,2010.0,1989.0,21.0,4421.72
642872,7475.0,145000.0,17.39,0.0,30.0,10370.7427269,2010-12-01,1998-10-01,Fully Paid,False,2010.0,1998.0,12.0,2895.74
642861,5575.0,120000.0,16.4,1.0,48.0,6819.09,2010-12-01,1996-08-01,Fully Paid,False,2010.0,1996.0,14.0,1244.09
642859,2150.0,120000.0,6.16,0.0,6.0,2816.83,2010-12-01,2002-04-01,Fully Paid,False,2010.0,2002.0,8.0,666.83
642857,7050.0,140000.0,4.26,0.0,17.0,9619.30000103,2010-12-01,1992-10-01,Fully Paid,False,2010.0,1992.0,18.0,2569.3
642855,7375.0,140000.0,5.25,1.0,5.0,7482.1,2010-12-01,2002-06-01,Fully Paid,False,2010.0,2002.0,8.0,107.1
642844,1375.0,37500.0,12.03,0.0,6.0,1432.72,2010-12-01,2004-09-01,Fully Paid,False,2010.0,2004.0,6.0,57.72
642841,9050.0,71000.0,24.05,0.0,21.0,12570.2299982,2010-12-01,1997-01-01,Fully Paid,False,2010.0,1997.0,13.0,3520.23
642825,1900.0,42000.0,23.06,0.0,22.0,2048.69,2010-12-01,1987-11-01,Fully Paid,False,2010.0,1987.0,23.0,148.69
642823,6525.0,325000.0,6.4,0.0,52.0,7599.06,2010-12-01,1994-03-01,Fully Paid,False,2010.0,1994.0,16.0,1074.06


# Helper functions for training

## Data Transformation

- impute columns
- create feature vector
- convert target column to label
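
To make the three steps concrete, here is a Spark-free sketch of what a fitted transformation does, assuming the Spark defaults of mean imputation for `Imputer` and most-frequent-label-first indexing for `StringIndexer`. All function names are hypothetical, the rows are made-up toy data, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence

Row = Dict[str, object]


def fit_means(rows: Sequence[Row], features: Sequence[str]) -> Dict[str, float]:
    """Learn one mean per feature from the non-missing training values."""
    means = {}
    for name in features:
        present = [r[name] for r in rows if r[name] is not None]
        means[name] = sum(present) / len(present)
    return means


def fit_label_index(rows: Sequence[Row], target: str) -> Dict[str, float]:
    """Map each target value to an index, most frequent value first."""
    counts = Counter(r[target] for r in rows)
    # Tie-break alphabetically (an assumption made for determinism).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(rows: Sequence[Row], features: Sequence[str], target: str,
              means: Dict[str, float], label_index: Dict[str, float]) -> List[Row]:
    """Impute, assemble the feature vector, and attach the numeric label."""
    out = []
    for r in rows:
        vector = [r[n] if r[n] is not None else means[n] for n in features]
        out.append({"features": vector, "label": label_index[r[target]]})
    return out


train_rows = [
    {"loan_amnt": 4000.0, "dti": 1.97, "bad_loan": "false"},
    {"loan_amnt": None, "dti": 23.69, "bad_loan": "false"},
    {"loan_amnt": 8000.0, "dti": 5.93, "bad_loan": "true"},
]
feature_names = ["loan_amnt", "dti"]
means = fit_means(train_rows, feature_names)
label_index = fit_label_index(train_rows, "bad_loan")
transformed = transform(train_rows, feature_names, "bad_loan", means, label_index)
```

As in the notebook, the statistics are learned from the training rows only and then reused unchanged on the test rows, so no information from the held-out year leaks into the transformation.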

In [None]:
def data_transform(features: list,
                   target: str,
                   train_df: DataFrame) -> PipelineModel:
  """
  - impute missing values in the feature columns
  - transform feature columns into a single vector-type column named `features`
  - convert the target column to `label` using a StringIndexer
  - fit the transformation pipeline using the training data

  :param features: list of feature column names
  :type features: list
  :param target: name of the target column
  :type target: str
  :param train_df: training DataFrame used to fit the transformation pipeline
  :type train_df: DataFrame
  :return: the fitted transformation pipeline
  :rtype: PipelineModel
  """
  model_matrix_stages = [
    Imputer(inputCols = features, outputCols = features),
    VectorAssembler(inputCols=features, outputCol='features'),
    StringIndexer(inputCol=target, outputCol="label")
  ]
  transform_pipeline = Pipeline(stages=model_matrix_stages)
  transform_pipeline_model = transform_pipeline.fit(train_df)
  return transform_pipeline_model
  
  

## Train Function

In [None]:
def train(train_df: DataFrame,
          test_df: DataFrame,
          lr_params: dict,
          lime_params: dict,
          lime_output_col: str = "weights",
          lime_prediction_col: str = "prediction") -> Tuple[PipelineModel, TabularLIMEModel]:
  """
  Helper function that fits a CrossValidator model to predict the binary `label`
  column from the `features` vector column, then fits a LIME model on top of the
  best logistic regression model. Parameters, metrics and both models are logged
  to the active MLflow run.
  :param train_df: transformed Spark DataFrame containing training data
  :param test_df: transformed Spark DataFrame used to evaluate the best model
  :param lr_params: keyword arguments passed to `LogisticRegression`
  :param lime_params: keyword arguments passed to `TabularLIME.setParams`
  :param lime_output_col: output column of the LIME model
  :param lime_prediction_col: prediction column to be input to LIME
  :return: the best pipeline model from cross-validation and the fitted LIME model
  """
  # log hyperparameters with a prefix identifying the model they belong to
  for param, val in lr_params.items():
    mlflow.log_param("lr_" + param, val)
  for param, val in lime_params.items():
    mlflow.log_param("lime_" + param, val)
  #   
  lime_input_col="features"
  #   
  lr = LogisticRegression(**lr_params, featuresCol = "features")
  #   
  training_pipeline =  Pipeline(stages=[lr])
  paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0, 0.001, 1, 10]).build()
  crossval = CrossValidator(estimator=training_pipeline,
                            estimatorParamMaps=paramGrid,
                            evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                            numFolds=5)
 
  cvModel = crossval.fit(train_df)
  # evaluate on the test data
  evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
  validation_res = evaluator.evaluate(cvModel.transform(test_df))
  # log using mlflow
  mlflow.log_metric('test_' + evaluator.getMetricName(), validation_res)

  # Train LIME model
  lr_model = cvModel.bestModel
  # get the final model parameter for regParam
  mlflow.log_param("lr_regParam", lr_model.stages[-1]._java_obj.getRegParam())
  lime = (TabularLIME().
                setModel(lr_model.stages[-1]).
                setPredictionCol(lime_prediction_col).
                setOutputCol(lime_output_col).
                setInputCol(lime_input_col).
                setParams(**lime_params)
               )

  lime_model = lime.fit(train_df)
  # log models
  mlflow.spark.log_model(lr_model, "lr_model")
  mlflow.spark.log_model(lime_model, "lime_model")
  return lr_model, lime_model

## LIME Helper Function

Helper function that splits the LIME output weight vector into one column per feature.
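
The flattening can be illustrated for a single row in plain Python. This sketch is not the Spark helper itself: `split_vector_row` is a hypothetical name, and the explicit length check is an addition that the notebook's helper leaves as a comment. The weights are the first row of the LIME output shown later in this notebook.

```python
from typing import Dict, Sequence


def split_vector_row(row: Dict[str, object], new_features: Sequence[str],
                     vector_col: str = "splitcol") -> Dict[str, object]:
    """Copy the row and add one entry per feature, taken from the vector column."""
    weights = list(row[vector_col])
    if len(weights) != len(new_features):
        raise ValueError(
            f"{len(new_features)} column names given for a vector of length {len(weights)}"
        )
    flat = dict(row)  # keep the original columns, including the vector column
    flat.update(zip(new_features, weights))
    return flat


feature_names = ["loan_amnt", "annual_inc", "dti", "delinq_2yrs",
                 "total_acc", "credit_length_in_years", "net"]
lime_row = {"splitcol": [-6.101539e-06, -2.0702684e-08, 0.0078022867, 0.007482727,
                         0.0014448563, 0.006528615, -5.8396578e-05]}
flat_row = split_vector_row(lime_row, feature_names)
```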

In [None]:
#Helper function for LIME
def splitVector(df_split: pyspark.sql.DataFrame, new_features: list) -> pyspark.sql.DataFrame: 
  """Flatten the LIME output column "splitcol" so that the importance of each feature
  used in the model is represented as a single column in the returned DataFrame,
  while retaining the rest of the input columns, e.g. prediction

  :param df_split: Dataframe to be split column wise
  :type df_split: pyspark.sql.DataFrame
  :param new_features: column name list in the split columns
  :type new_features: list
  :return: Dataframe converted
  :rtype: pyspark.sql.DataFrame 
  """
  
  schema = df_split.schema
  cols = df_split.columns

  for col in new_features: # new_features should be the same length as vector column length
    schema = schema.add(col,FloatType(),True)

  return spark.createDataFrame(df_split.rdd.map(lambda row: [row[i] for i in cols]+row.splitcol.tolist()), schema)

# Training

- Split train/test by year: the latest year is always reserved for validation.
- Start training
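
The year-based split can be sketched on a plain list of years. The helper `split_by_year` is a hypothetical name, and the input is the set of distinct years shown in the output of the load step above; the second-latest year becomes the boundary, so only the latest year ends up in the test set.

```python
from typing import List, Sequence, Tuple


def split_by_year(years: Sequence[int]) -> Tuple[int, List[int], List[int]]:
    """Return the boundary year, the training years and the test years."""
    ordered = sorted(set(years))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct years to hold one out")
    split_year = ordered[-2]  # second-latest year is the last training year
    train_years = [y for y in ordered if y <= split_year]
    test_years = [y for y in ordered if y > split_year]
    return split_year, train_years, test_years


split_year, train_years, test_years = split_by_year([2007, 2009, 2010, 2011, 2008])
print(split_year, train_years, test_years)  # 2010 [2007, 2008, 2009, 2010] [2011]
```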

In [None]:
tot_year = sorted([x.issue_year for x in dataset.select('issue_year').distinct().collect()])
split_year = tot_year[-2]

In [None]:
# split train and validation based on the year
feature_cols = ["loan_amnt",  "annual_inc", "dti", "delinq_2yrs","total_acc", "credit_length_in_years", 'net']
target_col = 'bad_loan'
train_df = dataset.select(feature_cols + [target_col]).where(F.col('issue_year')<= split_year)
test_df = dataset.select(feature_cols + [target_col]).where(F.col('issue_year')>split_year)
transform_model = data_transform(train_df=train_df, features=feature_cols, target=target_col)
train_t_df = transform_model.transform(train_df)
test_t_df = transform_model.transform(test_df)

In [None]:
# cache dataframe to avoid lazy eval running multiple times
train_t_df.cache().count()
test_t_df.cache().count()

In [None]:
display(test_t_df)

loan_amnt,annual_inc,dti,delinq_2yrs,total_acc,credit_length_in_years,net,bad_loan,features,label
4000.0,34800.0,1.97,0.0,30.0,25.0,486.83,False,"List(1, 7, List(), List(4000.0, 34800.0, 1.97, 0.0, 30.0, 25.0, 486.83))",0.0
3600.0,65000.0,23.69,0.0,24.0,35.0,397.0,False,"List(1, 7, List(), List(3600.0, 65000.0, 23.69, 0.0, 24.0, 35.0, 397.0))",0.0
8000.0,35000.0,5.93,1.0,24.0,15.0,1474.88,False,"List(1, 7, List(), List(8000.0, 35000.0, 5.93, 1.0, 24.0, 15.0, 1474.88))",0.0
4400.0,95459.0,9.57,0.0,51.0,21.0,694.73,False,"List(1, 7, List(), List(4400.0, 95459.0, 9.57, 0.0, 51.0, 21.0, 694.73))",0.0
10000.0,68000.0,6.49,0.0,24.0,13.0,897.35,False,"List(1, 7, List(), List(10000.0, 68000.0, 6.49, 0.0, 24.0, 13.0, 897.35))",0.0
4000.0,36000.0,5.0,0.0,9.0,11.0,906.38,False,"List(1, 7, List(), List(4000.0, 36000.0, 5.0, 0.0, 9.0, 11.0, 906.38))",0.0
5500.0,48000.0,20.57,0.0,39.0,15.0,-3143.4,True,"List(1, 7, List(), List(5500.0, 48000.0, 20.57, 0.0, 39.0, 15.0, -3143.4))",1.0
2500.0,70800.0,7.95,0.0,39.0,18.0,-1021.0,True,"List(1, 7, List(), List(2500.0, 70800.0, 7.95, 0.0, 39.0, 18.0, -1021.0))",1.0
1000.0,43200.0,19.5,0.0,9.0,5.0,173.67,False,"List(1, 7, List(), List(1000.0, 43200.0, 19.5, 0.0, 9.0, 5.0, 173.67))",0.0
6000.0,84000.0,17.51,0.0,16.0,28.0,514.25,False,"List(1, 7, List(), List(6000.0, 84000.0, 17.51, 0.0, 16.0, 28.0, 514.25))",0.0


In [None]:
user_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
experiment_name = f"/Users/{user_name}/loan_classification"
mlflow.set_experiment(experiment_name)
with mlflow.start_run():
  lr_params = {"labelCol": "label", "maxIter":50}
  lime_params = {"nSamples": 1000, "samplingFraction": 0.3, "regularization":0.0}
  tags = {
          "data_path" : data_path,
          "data_version": data_version,
          "train_test_split": split_year
      }
  mlflow.set_tags(tags)
  # log note
  mlflow.set_tag("mlflow.note.content", 
                 "one sub-run for income prediction with lr and lime models with expr for param tuning")
  mlflow.log_param("train_test_split", split_year)
  mlflow.log_param("data_version", data_version)
  lr_model, lime_model = train(train_t_df, test_t_df,lr_params, lime_params)


# Check prediction output for LR and LIME

## Check output from LR

In [None]:
pred_df = lr_model.transform(test_t_df)
pred_df.cache().count()

In [None]:
display(pred_df)

loan_amnt,annual_inc,dti,delinq_2yrs,total_acc,credit_length_in_years,net,bad_loan,features,label,rawPrediction,probability,prediction
4000.0,34800.0,1.97,0.0,30.0,25.0,486.83,False,"List(1, 7, List(), List(4000.0, 34800.0, 1.97, 0.0, 30.0, 25.0, 486.83))",0.0,"List(1, 2, List(), List(2.6086087426405293, -2.6086087426405293))","List(1, 2, List(), List(0.9314135728418803, 0.06858642715811962))",0.0
3600.0,65000.0,23.69,0.0,24.0,35.0,397.0,False,"List(1, 7, List(), List(3600.0, 65000.0, 23.69, 0.0, 24.0, 35.0, 397.0))",0.0,"List(1, 2, List(), List(1.886456062452655, -1.886456062452655))","List(1, 2, List(), List(0.8683509247408665, 0.13164907525913358))",0.0
8000.0,35000.0,5.93,1.0,24.0,15.0,1474.88,False,"List(1, 7, List(), List(8000.0, 35000.0, 5.93, 1.0, 24.0, 15.0, 1474.88))",0.0,"List(1, 2, List(), List(3.5230176938029008, -3.5230176938029008))","List(1, 2, List(), List(0.9713356444606166, 0.02866435553938348))",0.0
4400.0,95459.0,9.57,0.0,51.0,21.0,694.73,False,"List(1, 7, List(), List(4400.0, 95459.0, 9.57, 0.0, 51.0, 21.0, 694.73))",0.0,"List(1, 2, List(), List(3.0121470469430767, -3.0121470469430767))","List(1, 2, List(), List(0.9531198828577707, 0.04688011714222933))",0.0
10000.0,68000.0,6.49,0.0,24.0,13.0,897.35,False,"List(1, 7, List(), List(10000.0, 68000.0, 6.49, 0.0, 24.0, 13.0, 897.35))",0.0,"List(1, 2, List(), List(3.5201096493428246, -3.5201096493428246))","List(1, 2, List(), List(0.9712545655463964, 0.02874543445360345))",0.0
4000.0,36000.0,5.0,0.0,9.0,11.0,906.38,False,"List(1, 7, List(), List(4000.0, 36000.0, 5.0, 0.0, 9.0, 11.0, 906.38))",0.0,"List(1, 2, List(), List(2.503854140225429, -2.503854140225429))","List(1, 2, List(), List(0.9244115682406961, 0.0755884317593039))",0.0
5500.0,48000.0,20.57,0.0,39.0,15.0,-3143.4,True,"List(1, 7, List(), List(5500.0, 48000.0, 20.57, 0.0, 39.0, 15.0, -3143.4))",1.0,"List(1, 2, List(), List(-0.2391862133174849, 0.2391862133174849))","List(1, 2, List(), List(0.4404869054027806, 0.5595130945972193))",1.0
2500.0,70800.0,7.95,0.0,39.0,18.0,-1021.0,True,"List(1, 7, List(), List(2500.0, 70800.0, 7.95, 0.0, 39.0, 18.0, -1021.0))",1.0,"List(1, 2, List(), List(1.3029828033990283, -1.3029828033990283))","List(1, 2, List(), List(0.7863365559624684, 0.2136634440375317))",0.0
1000.0,43200.0,19.5,0.0,9.0,5.0,173.67,False,"List(1, 7, List(), List(1000.0, 43200.0, 19.5, 0.0, 9.0, 5.0, 173.67))",0.0,"List(1, 2, List(), List(1.1925589220269597, -1.1925589220269597))","List(1, 2, List(), List(0.7671984129759291, 0.23280158702407097))",0.0
6000.0,84000.0,17.51,0.0,16.0,28.0,514.25,False,"List(1, 7, List(), List(6000.0, 84000.0, 17.51, 0.0, 16.0, 28.0, 514.25))",0.0,"List(1, 2, List(), List(2.3361868469435674, -2.3361868469435674))","List(1, 2, List(), List(0.9118300037667347, 0.08816999623326524))",0.0


## Check output from LIME

In [None]:
lime_df = lime_model.transform(pred_df.select('features'))

dfLimeSel = lime_df.select('weights').withColumnRenamed('weights', 'splitcol')
dfSplit = splitVector(dfLimeSel, feature_cols )

dfResult = dfSplit.drop("splitcol")

In [None]:
display(dfResult)

loan_amnt,annual_inc,dti,delinq_2yrs,total_acc,credit_length_in_years,net
-6.101539e-06,-2.0702684e-08,0.0078022867,0.007482727,0.0014448563,0.006528615,-5.8396578e-05
-6.4293235e-06,7.938342e-08,0.008590306,0.020816052,0.0006815597,0.0057702996,-5.058443e-05
-4.523055e-06,1.9230049e-07,0.0053175283,-0.0019843301,0.0018729902,0.0066902516,-5.906675e-05
-4.389909e-06,5.1562093e-08,0.006676693,0.025176546,0.0015294288,0.0046820506,-5.734387e-05
-6.4601713e-06,1.7303162e-07,0.0091542145,-0.030455729,0.0008506463,0.0063956208,-5.6991703e-05
-6.4365063e-06,1.5282039e-07,0.00875194,0.036463205,0.0015828876,0.0042121736,-5.8502803e-05
-4.8162424e-06,1.3161107e-07,0.0071770106,0.013265747,0.00033537875,0.0066317725,-5.2705844e-05
-6.921041e-06,-1.2050283e-07,0.008857339,0.02705266,0.0015371128,0.006917811,-6.301544e-05
-5.746143e-06,1.6620866e-07,0.008103372,-0.000973935,0.0026340438,0.0037639732,-5.5965676e-05
-5.008559e-06,-9.5246754e-08,0.007631727,0.033976477,0.0023911465,0.0040916028,-5.840895e-05


# Retrieve and Copy Artifact
- Retrieve models for specific runs and experiments
- Copy both the LR and LIME models to a specified location

## Helper function for copying artifacts

In [None]:
def copy_model(
    exp_id: str,
    run_id: str,
    model_name: str,
    filepath: str
):
    """
    Copy a model from the MLflow artifact store to a file location
    :param exp_id: ID of the experiment that contains the run
    :type exp_id: str
    :param run_id: ID of the run that logged the model
    :type run_id: str
    :param model_name: name of the model artifact to retrieve
    :type model_name: str
    :param filepath: destination path where the model is stored
    :type filepath: str
    """
    
    model_path = f"dbfs:/databricks/mlflow-tracking/{exp_id}/{run_id}/artifacts/{model_name}"
    model = mlflow.spark.load_model(model_path)
    model.write().overwrite().save(filepath)
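
`copy_model` hard-codes the DBFS layout of the tracking store. MLflow model loaders also accept a `runs:/<run_id>/<relative path>` URI, which may be a more portable way to address the same artifact. The sketch below only builds the strings involved, using hypothetical helper names; the storage account and container names are the ones used in this notebook.

```python
def tracking_artifact_uri(exp_id: str, run_id: str, model_name: str) -> str:
    """DBFS location of a logged model, as hard-coded in copy_model."""
    return f"dbfs:/databricks/mlflow-tracking/{exp_id}/{run_id}/artifacts/{model_name}"


def runs_uri(run_id: str, model_name: str) -> str:
    """Run-relative URI that does not depend on the tracking store layout."""
    return f"runs:/{run_id}/{model_name}"


def destination_path(container: str, exp_id: str, run_id: str, model_name: str,
                     use_passthrough: bool, storage_account: str = "dlsloandev") -> str:
    """Target location of the copied model, keyed by experiment and run."""
    if use_passthrough:
        root = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    else:
        root = f"/mnt/{container}"
    return f"{root}/{exp_id}/{run_id}/{model_name}"
```

Keying the destination by experiment ID and run ID keeps the LR model and its dependent LIME model side by side, so a copied pair can always be traced back to the run that produced it.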

## Query all runs

In [None]:
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
runs = mlflow.search_runs(experiment_ids=[experiment_id], order_by=['metrics.avg_accuracy'])

In [None]:
runs

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.avg_accuracy,metrics.std_accuracy,metrics.test_accuracy,params.regParam,params.mlEstimatorUid,params.mlModelClass,params.train_test_split,params.lr_regParam,params.lr_maxIter,params.evaluator,params.data_version,params.estimator,params.numFolds,params.lime_regularization,params.lr_labelCol,params.lime_nSamples,params.estimatorParamMapsLength,params.lime_samplingFraction,tags.fit_uuid,tags.mlflow.parentRunId,tags.mlflow.user,tags.mlflow.rootRunId,tags.runSource,tags.train_test_split,tags.mlflow.databricks.cluster.id,tags.mlflow.databricks.notebookRevisionID,tags.mlflow.source.name,tags.mlflow.databricks.notebookPath,tags.mlflow.databricks.cluster.libraries,tags.data_version,tags.mlflow.log-model.history,tags.mlflow.databricks.cluster.info,tags.data_path,tags.mlflow.databricks.notebookID,tags.mlflow.source.type,tags.mlflow.databricks.webappURL,tags.mlflow.note.content
0,26189f75314e4d9faf3db1f3a1fa5c71,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-14 00:13:19.146000+00:00,NaT,0.817198,0.002042,,10.0,Pipeline_18ffaf555e2c,Pipeline,,,,,,,,,,,,,fae056,9e384d987cfe419596331727bb21e5f6,xiaolulu@microsoft.com,9e384d987cfe419596331727bb21e5f6,mllibAutoTracking,,,,,,,,,,,,,,
1,795da589bfef45e3b935460ed4298fb6,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-14 00:13:18.374000+00:00,NaT,0.817792,0.002057,,1.0,Pipeline_18ffaf555e2c,Pipeline,,,,,,,,,,,,,fae056,9e384d987cfe419596331727bb21e5f6,xiaolulu@microsoft.com,9e384d987cfe419596331727bb21e5f6,mllibAutoTracking,,,,,,,,,,,,,,
2,25ad52721945422787321a6942e5cee5,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-13 23:44:23.003000+00:00,NaT,0.819119,0.000863,,10.0,Pipeline_ebd594477ed6,Pipeline,,,,,,,,,,,,,56b122,88e43402bf904f4a9de3266ef47236eb,xiaolulu@microsoft.com,88e43402bf904f4a9de3266ef47236eb,mllibAutoTracking,,,,,,,,,,,,,,
3,c794ce40b06948bfbffa6b87b8bfd3a9,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-13 23:44:22.328000+00:00,NaT,0.819736,0.000797,,1.0,Pipeline_ebd594477ed6,Pipeline,,,,,,,,,,,,,56b122,88e43402bf904f4a9de3266ef47236eb,xiaolulu@microsoft.com,88e43402bf904f4a9de3266ef47236eb,mllibAutoTracking,,,,,,,,,,,,,,
4,1068d4aa450e4c6882ecaef1a077af53,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-15 04:54:53.005000+00:00,NaT,0.852034,0.002804,,10.0,Pipeline_82d7f773f2cf,Pipeline,,,,,,,,,,,,,b70459,250e0c32dd4e4dfaa0ce02d415b7b1c6,xiaolulu@microsoft.com,250e0c32dd4e4dfaa0ce02d415b7b1c6,mllibAutoTracking,,,,,,,,,,,,,,
5,eb8aecf1f85f4d4f9a8e6798a5848ddd,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-15 04:54:52.296000+00:00,NaT,0.852034,0.002804,,1.0,Pipeline_82d7f773f2cf,Pipeline,,,,,,,,,,,,,b70459,250e0c32dd4e4dfaa0ce02d415b7b1c6,xiaolulu@microsoft.com,250e0c32dd4e4dfaa0ce02d415b7b1c6,mllibAutoTracking,,,,,,,,,,,,,,
6,9de7bc5149c0429985180c53791512f0,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-13 23:48:33.465000+00:00,NaT,0.852034,0.002804,,10.0,Pipeline_dda14af8bcf6,Pipeline,,,,,,,,,,,,,9f2193,40589b732b5f4853bf33b04511c88e2f,xiaolulu@microsoft.com,40589b732b5f4853bf33b04511c88e2f,mllibAutoTracking,,,,,,,,,,,,,,
7,1f20516f1c504f5d9273f3fb79921df0,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-13 23:48:32.801000+00:00,NaT,0.852034,0.002804,,1.0,Pipeline_dda14af8bcf6,Pipeline,,,,,,,,,,,,,9f2193,40589b732b5f4853bf33b04511c88e2f,xiaolulu@microsoft.com,40589b732b5f4853bf33b04511c88e2f,mllibAutoTracking,,,,,,,,,,,,,,
8,03badabf9f6a437f878fcf0779b4de83,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-19 05:18:39.261000+00:00,NaT,0.862609,0.007591,,10.0,Pipeline_da4079cc59eb,Pipeline,,,,,,,,,,,,,1c1c4f,c9d496f49b064d3fb6bf0643edabfb3c,xiaolulu@microsoft.com,c9d496f49b064d3fb6bf0643edabfb3c,mllibAutoTracking,,,,,,,,,,,,,,
9,16ddef1fab8a4fcba418060e6a11c9c4,1293095669402272,RUNNING,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-19 05:18:38.619000+00:00,NaT,0.862609,0.007591,,1.0,Pipeline_da4079cc59eb,Pipeline,,,,,,,,,,,,,1c1c4f,c9d496f49b064d3fb6bf0643edabfb3c,xiaolulu@microsoft.com,c9d496f49b064d3fb6bf0643edabfb3c,mllibAutoTracking,,,,,,,,,,,,,,


## Only get "best" runs for each CV run

In [None]:
runs[runs['tags.mlflow.parentRunId'].isnull()]

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.avg_accuracy,metrics.std_accuracy,metrics.test_accuracy,params.regParam,params.mlEstimatorUid,params.mlModelClass,params.train_test_split,params.lr_regParam,params.lr_maxIter,params.evaluator,params.data_version,params.estimator,params.numFolds,params.lime_regularization,params.lr_labelCol,params.lime_nSamples,params.estimatorParamMapsLength,params.lime_samplingFraction,tags.fit_uuid,tags.mlflow.parentRunId,tags.mlflow.user,tags.mlflow.rootRunId,tags.runSource,tags.train_test_split,tags.mlflow.databricks.cluster.id,tags.mlflow.databricks.notebookRevisionID,tags.mlflow.source.name,tags.mlflow.databricks.notebookPath,tags.mlflow.databricks.cluster.libraries,tags.data_version,tags.mlflow.log-model.history,tags.mlflow.databricks.cluster.info,tags.data_path,tags.mlflow.databricks.notebookID,tags.mlflow.source.type,tags.mlflow.databricks.webappURL,tags.mlflow.note.content
28,0b0be4e6020a4475a7c2678b7ec33de9,1293095669402272,FINISHED,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-19 05:28:10.050000+00:00,2021-04-19 05:29:02.123000+00:00,,,0.932759,,CrossValidator_d31357e16015,CrossValidator,2010.0,0.0,50,MulticlassClassificationEvaluator,5,Pipeline,5,0.0,label,1000,4,0.3,7e0f96,,xiaolulu@microsoft.com,0b0be4e6020a4475a7c2678b7ec33de9,mllibAutoTracking,2010.0,0415-033547-pear331,1618810142227,/Users/xiaolulu@microsoft.com/loan_classification,/Users/xiaolulu@microsoft.com/loan_classification,"{""installable"":[{""maven"":{""coordinates"":""Azure...",5,"[{""run_id"":""0b0be4e6020a4475a7c2678b7ec33de9"",...","{""cluster_name"":""ds-dev-credpassXL"",""spark_ver...",abfss://datalake@dlsloandev.dfs.core.windows.n...,1293095669402272,NOTEBOOK,https://eastus-c3.azuredatabricks.net,one sub-run for income prediction with lr and ...
29,c9d496f49b064d3fb6bf0643edabfb3c,1293095669402272,FINISHED,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-19 05:18:10.364000+00:00,2021-04-19 05:18:46.593000+00:00,,,0.922324,,CrossValidator_ba15b3288728,CrossValidator,2009.0,0.0,50,MulticlassClassificationEvaluator,4,Pipeline,5,0.0,label,1000,4,0.3,1c1c4f,,xiaolulu@microsoft.com,c9d496f49b064d3fb6bf0643edabfb3c,mllibAutoTracking,2009.0,0415-033547-pear331,1618809526690,/Users/xiaolulu@microsoft.com/loan_classification,/Users/xiaolulu@microsoft.com/loan_classification,"{""installable"":[{""maven"":{""coordinates"":""Azure...",4,"[{""run_id"":""c9d496f49b064d3fb6bf0643edabfb3c"",...","{""cluster_name"":""ds-dev-credpassXL"",""spark_ver...",abfss://datalake@dlsloandev.dfs.core.windows.n...,1293095669402272,NOTEBOOK,https://eastus-c3.azuredatabricks.net,one sub-run for income prediction with lr and ...
30,780740d9ee4e4b14a982e2163cd7c064,1293095669402272,FINISHED,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-19 04:55:09.092000+00:00,2021-04-19 04:55:53.935000+00:00,,,0.922324,,CrossValidator_89215cb7bfc4,CrossValidator,2009.0,0.0,50,MulticlassClassificationEvaluator,4,Pipeline,5,0.0,label,1000,4,0.3,170877,,xiaolulu@microsoft.com,780740d9ee4e4b14a982e2163cd7c064,mllibAutoTracking,2009.0,0415-033547-pear331,1618808154035,/Users/xiaolulu@microsoft.com/loan_classification,/Users/xiaolulu@microsoft.com/loan_classification,"{""installable"":[{""maven"":{""coordinates"":""Azure...",4,"[{""run_id"":""780740d9ee4e4b14a982e2163cd7c064"",...","{""cluster_name"":""ds-dev-credpassXL"",""spark_ver...",abfss://datalake@dlsloandev.dfs.core.windows.n...,1293095669402272,NOTEBOOK,https://eastus-c3.azuredatabricks.net,one sub-run for income prediction with lr and ...
31,250e0c32dd4e4dfaa0ce02d415b7b1c6,1293095669402272,FINISHED,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-15 04:54:03.085000+00:00,2021-04-15 04:55:06.915000+00:00,,,0.955192,,CrossValidator_f360c98fd488,CrossValidator,2011.0,0.0,50,MulticlassClassificationEvaluator,6,Pipeline,5,0.0,label,1000,4,0.3,b70459,,xiaolulu@microsoft.com,250e0c32dd4e4dfaa0ce02d415b7b1c6,mllibAutoTracking,2011.0,0415-033547-pear331,1618462507018,/Users/xiaolulu@microsoft.com/loan_classification,/Users/xiaolulu@microsoft.com/loan_classification,"{""installable"":[{""maven"":{""coordinates"":""Azure...",6,"[{""run_id"":""250e0c32dd4e4dfaa0ce02d415b7b1c6"",...","{""cluster_name"":""ds-dev-credpassXL"",""spark_ver...",abfss://delta-ds-test@dlsloandev.dfs.core.wind...,1293095669402272,NOTEBOOK,https://eastus-c3.azuredatabricks.net,one sub-run for income prediction with lr and ...
32,9e384d987cfe419596331727bb21e5f6,1293095669402272,FINISHED,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-14 00:12:20.016000+00:00,2021-04-14 00:13:29.902000+00:00,,,0.995445,,CrossValidator_99667d4cb631,CrossValidator,2015.0,0.0,50,MulticlassClassificationEvaluator,10,Pipeline,5,0.0,label,1000,4,0.3,fae056,,xiaolulu@microsoft.com,9e384d987cfe419596331727bb21e5f6,mllibAutoTracking,2015.0,0413-014144-peaks516,1618359210077,/Users/xiaolulu@microsoft.com/loan_classification,/Users/xiaolulu@microsoft.com/loan_classification,"{""installable"":[{""maven"":{""coordinates"":""Azure...",10,"[{""run_id"":""9e384d987cfe419596331727bb21e5f6"",...","{""cluster_name"":""ds-dev"",""spark_version"":""6.4....",/mnt/delta-ds-test/lc_loan,1293095669402272,NOTEBOOK,https://eastus-c3.azuredatabricks.net,one sub-run for income prediction with lr and ...
33,40589b732b5f4853bf33b04511c88e2f,1293095669402272,FINISHED,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-13 23:47:47.784000+00:00,2021-04-13 23:48:42.447000+00:00,,,0.955192,,CrossValidator_a71208b88434,CrossValidator,2011.0,0.0,50,MulticlassClassificationEvaluator,6,Pipeline,5,0.0,label,1000,4,0.3,9f2193,,xiaolulu@microsoft.com,40589b732b5f4853bf33b04511c88e2f,mllibAutoTracking,2011.0,0413-014144-peaks516,1618357722543,/Users/xiaolulu@microsoft.com/loan_classification,/Users/xiaolulu@microsoft.com/loan_classification,"{""installable"":[{""maven"":{""coordinates"":""Azure...",6,"[{""run_id"":""40589b732b5f4853bf33b04511c88e2f"",...","{""cluster_name"":""ds-dev"",""spark_version"":""6.4....",/mnt/delta-ds-test/lc_loan,1293095669402272,NOTEBOOK,https://eastus-c3.azuredatabricks.net,one sub-run for income prediction with lr and ...
34,88e43402bf904f4a9de3266ef47236eb,1293095669402272,FINISHED,dbfs:/databricks/mlflow-tracking/1293095669402...,2021-04-13 23:43:18.672000+00:00,2021-04-13 23:44:35.337000+00:00,,,0.99407,,CrossValidator_7032bca41d64,CrossValidator,2013.0,0.0,50,MulticlassClassificationEvaluator,8,Pipeline,5,0.0,label,1000,4,0.3,56b122,,xiaolulu@microsoft.com,88e43402bf904f4a9de3266ef47236eb,mllibAutoTracking,2013.0,0413-014144-peaks516,1618357475478,/Users/xiaolulu@microsoft.com/loan_classification,/Users/xiaolulu@microsoft.com/loan_classification,"{""installable"":[{""maven"":{""coordinates"":""Azure...",8,"[{""run_id"":""88e43402bf904f4a9de3266ef47236eb"",...","{""cluster_name"":""ds-dev"",""spark_version"":""6.4....",/mnt/delta-ds-test/lc_loan,1293095669402272,NOTEBOOK,https://eastus-c3.azuredatabricks.net,one sub-run for income prediction with lr and ...
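
The filter keeps the top-level runs, i.e. those without a `tags.mlflow.parentRunId`, and drops the per-parameter child runs created by cross-validation tracking. The same selection can be sketched on plain dictionaries, here with a few rows copied from the two tables above; the helper names and the shortened key names are hypothetical.

```python
from typing import Dict, List, Optional

Run = Dict[str, Optional[object]]

# A child run and three top-level runs, copied from the tables above.
RUNS: List[Run] = [
    {"run_id": "26189f75314e4d9faf3db1f3a1fa5c71",
     "parent_run_id": "9e384d987cfe419596331727bb21e5f6", "test_accuracy": None},
    {"run_id": "0b0be4e6020a4475a7c2678b7ec33de9",
     "parent_run_id": None, "test_accuracy": 0.932759},
    {"run_id": "88e43402bf904f4a9de3266ef47236eb",
     "parent_run_id": None, "test_accuracy": 0.99407},
    {"run_id": "9e384d987cfe419596331727bb21e5f6",
     "parent_run_id": None, "test_accuracy": 0.995445},
]


def parent_runs(runs: List[Run]) -> List[Run]:
    """Keep only top-level runs, which have no parent run ID."""
    return [r for r in runs if r.get("parent_run_id") is None]


def best_run(runs: List[Run], metric: str) -> Run:
    """Return the run with the highest value of `metric`, ignoring runs without it."""
    candidates = [r for r in runs if r.get(metric) is not None]
    return max(candidates, key=lambda r: r[metric])


parents = parent_runs(RUNS)
best = best_run(parents, "test_accuracy")
```

In the table above each top-level run was trained on a different data version and tested on a different year, so ranking them by `test_accuracy` alone is only meaningful among runs that share the same `data_version` tag.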


## Copy LR model and its dependent LIME model

In [None]:
run_id = "88e43402bf904f4a9de3266ef47236eb"
model_names=['lime_model', 'lr_model']

artifact_path = ( f"abfss://{DEFAULT_ARTIFACT_CONTAINER}@dlsloandev.dfs.core.windows.net/{experiment_id}/{run_id}" if 
                 IS_CRED_PASS else 
                 f'/mnt/{DEFAULT_ARTIFACT_CONTAINER}/{experiment_id}/{run_id}')
for model in model_names:
  try: 
    dbutils.fs.ls(f"{artifact_path}/{model}")
    displayHTML("Already copied!")
  except Exception:
    # listing failed, so the model has not been copied to the target path yet
    displayHTML(f"Copying {model} to {artifact_path}/{model}")
    copy_model(experiment_id,run_id,model_name=model, filepath=f"{artifact_path}/{model}")

In [None]:
display(dbutils.fs.ls(artifact_path))

path,name,size
abfss://model-artifacts@dlsloandev.dfs.core.windows.net/1293095669402272/88e43402bf904f4a9de3266ef47236eb/lime_model/,lime_model/,0
abfss://model-artifacts@dlsloandev.dfs.core.windows.net/1293095669402272/88e43402bf904f4a9de3266ef47236eb/lr_model/,lr_model/,0


## Re-load a model and test

- We use LIME model as an example, re-load it from the artifact path and test it

In [None]:
test_model = 'lime_model'
model_path = f'{artifact_path}/{test_model}'
lime_reloaded = PipelineModel.load(model_path)
lime_df2 = lime_reloaded.transform(pred_df.select('features'))
dfLimeSel = lime_df2.select('weights').withColumnRenamed('weights', 'splitcol')
dfSplit = splitVector(dfLimeSel, feature_cols )
dfResult = dfSplit.drop("splitcol")
dfResult.show()