## Module 3: Train and register Diabetes Prediction Machine Learning Model
In this module, you will train a machine learning model that predicts the likelihood of an individual developing diabetes, based on the historical diagnostic measurements available in the training dataset.

Once the model is trained, you will register it and log the hyperparameters used and the evaluation metrics, using Trident's native integration with the MLflow framework.

[MLflow](https://mlflow.org/docs/latest/index.html) is an open source platform for managing the machine learning lifecycle with features like Tracking, Models, and Model Registry. MLflow is natively integrated with Trident Data Science Experience.

### Import mlflow and create an experiment to log the run


In [1]:
# Create an experiment to track and register the model with MLflow
import mlflow
print(f"mlflow library version: {mlflow.__version__}")
EXPERIMENT_NAME = "diabetes_prediction"
mlflow.set_experiment(EXPERIMENT_NAME)

mlflow library version: 2.1.1


<Experiment: artifact_location='', creation_time=1686771370492, experiment_id='8a4f09f3-597d-41aa-b891-c20e9bf7fda8', last_update_time=None, lifecycle_stage='active', name='diabetes_prediction', tags={}>

### Read data from the lakehouse Delta table (saved in the previous module)

In [2]:
data_df = spark.read.format("delta").load("Tables/diabetes_processed")

data_df = data_df.drop("insulin", "bmi") 
display(data_df)


### Perform a random split to get train and test datasets, and identify the feature columns to be used for model training

In [3]:
train_test_split = [0.75, 0.25]
seed = 1234
train_df, test_df = data_df.randomSplit(train_test_split, seed=seed)

print(f"Train set record count: {train_df.count()}")
print(f"Test set record count: {test_df.count()}")


Train set record count: 570
Test set record count: 198
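
The record counts above (570 and 198 out of 768 rows, roughly 74.2% and 25.8%) do not match the requested 0.75/0.25 weights exactly. That is expected when each row is assigned to a split independently at random, which is how a sampling-based split such as `randomSplit` behaves, instead of the data being cut at a fixed position. The sketch below is a simplified pure-Python analogy, not Spark's actual implementation; the helper name `bernoulli_split` is hypothetical, and its counts will not reproduce Spark's.

```python
import random

def bernoulli_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform random number per row."""
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.75, 1.0] for weights [0.75, 0.25]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any value left over by rounding
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(768))  # same total row count as the dataset above
train, test = bernoulli_split(rows, [0.75, 0.25], seed=1234)
print(f"Train: {len(train)}, Test: {len(test)}")
```

The split sizes vary around the requested proportions from seed to seed, while the same seed always yields the same split. This is why the tutorial fixes `seed = 1234`.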


### Define steps to perform feature engineering and train the model using Spark ML Pipelines and Microsoft SynapseML Library

To learn more, see the [Spark ML Pipelines documentation](https://spark.apache.org/docs/latest/ml-pipeline.html) and the [SynapseML documentation](https://microsoft.github.io/SynapseML/docs/about/).

The algorithm used in this tutorial, [LightGBM](https://lightgbm.readthedocs.io/en/v3.3.2/), is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. It is an open source project developed by Microsoft and supports regression, classification, and many other machine learning scenarios. According to its documentation, its main advantages are faster training speed, lower memory usage, better accuracy, and support for distributed learning.

In [4]:
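# Every column except the label ("diabetes") and the categorical columns is a numeric feature.
# "bmi" and "insulin" were already dropped when the data was read, so listing them is a harmless no-op.
numeric_cols = train_df.drop("diabetes", "bmi", "insulin", "insulin_level", "obesity_level").columns
print(numeric_cols)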


['pregnancies', 'plasma_glucose', 'blood_pressure', 'triceps_skin_thickness', 'diabetes_pedigree', 'age']


In [5]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categorical_cols = ["insulin_level","obesity_level"]

# One-hot encode the categorical columns listed above
stages = []
for col in categorical_cols:
    stringIndexer = StringIndexer(inputCol=col, outputCol=col + "index")
    encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(), outputCol=col + "_vec")

    stages += [stringIndexer, encoder]

# Use VectorAssembler to generate the feature column passed to the ML model for training
assemblerInputs = numeric_cols + [col + "_vec" for col in categorical_cols]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features", handleInvalid="skip")

stages += [assembler]

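
To make the three feature-engineering stages above more concrete, here is a minimal pure-Python sketch of what they do conceptually. It assumes the Spark defaults: `StringIndexer` orders labels by descending frequency, and `OneHotEncoder` drops the last category so that it is represented by an all-zero vector. The category values and the helper names (`fit_string_index`, `one_hot_drop_last`, `assemble`) are hypothetical and only illustrate the idea; Spark itself produces vector columns, not Python lists.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1.

    The last category is encoded as all zeros, mirroring a drop-last encoding.
    """
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector

def assemble(numeric_values, *encoded_vectors):
    """Concatenate numeric values and encoded vectors into one flat feature list."""
    features = list(numeric_values)
    for vector in encoded_vectors:
        features.extend(vector)
    return features

# Hypothetical category values, for illustration only
obesity_level = ["obese", "normal", "obese", "overweight", "obese", "normal"]
mapping = fit_string_index(obesity_level)
print(mapping)  # {'obese': 0, 'normal': 1, 'overweight': 2}

# Two hypothetical numeric features followed by the encoded category "normal"
row_features = assemble([2.0, 120.0], one_hot_drop_last(mapping["normal"], len(mapping)))
print(row_features)  # [2.0, 120.0, 0.0, 1.0]
```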

In [6]:
from synapse.ml.lightgbm import LightGBMClassifier

# Model training stage: LightGBM hyperparameters
learningRate = 0.3
numIterations = 100
numLeaves = 31

lgr = LightGBMClassifier(
    learningRate=learningRate,
    numIterations=numIterations,
    numLeaves=numLeaves,
    labelCol="diabetes",
)
stages += [lgr]

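
For intuition about how `learningRate` and `numIterations` interact, the sketch below implements a toy version of gradient boosting in pure Python. It is a deliberate simplification: it fits two-leaf trees (stumps) to a one-dimensional regression problem with squared error, whereas LightGBM grows larger trees (bounded by `numLeaves`) and, for classification, optimizes a different loss. The data and the helper names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    # Exclude the largest value so that the right-hand side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, learning_rate, num_iterations):
    """Fit an additive model: each new stump corrects the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    trees = []
    predictions = [base] * len(xs)
    for _ in range(num_iterations):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # The learning rate shrinks each tree's contribution
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: the target switches from 0 to 1 between x = 3 and x = 4
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

model = boost(xs, ys, learning_rate=0.3, num_iterations=20)
mse = mean_squared_error(model, xs, ys)
mse_one = mean_squared_error(boost(xs, ys, learning_rate=0.3, num_iterations=1), xs, ys)
print(f"MSE after 1 iteration: {mse_one:.4f}, after 20 iterations: {mse:.8f}")
```

With a learning rate below 1, each iteration removes only part of the remaining error, so a smaller learning rate generally needs more iterations to reach the same fit.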

In [7]:
from pyspark.ml import Pipeline
from synapse.ml.train import ComputeModelStatistics

# Start an MLflow run to capture parameters and metrics, and to log the model
with mlflow.start_run():

    # Define the pipeline
    ml_pipeline = Pipeline(stages=stages)

    # Log the parameters used in training for tracking purposes
    mlflow.log_param("train_test_split", train_test_split)
    mlflow.log_param("learningRate", learningRate)
    mlflow.log_param("numIterations", numIterations)
    mlflow.log_param("numLeaves", numLeaves)

    # Call fit on the pipeline with the training subset to create the ML model
    lg_model = ml_pipeline.fit(train_df)

    # Generate predictions on the test subset of the data
    lg_predictions = lg_model.transform(test_df)

    # Measure and log metrics to track the performance of the model
    metrics = ComputeModelStatistics(
        evaluationMetric="classification",
        labelCol="diabetes",
        scoredLabelsCol="prediction",
    ).transform(lg_predictions)

    # Fetch the single metrics row once instead of once per metric
    metrics_row = metrics.first()
    for metric_name in ["precision", "recall", "accuracy", "AUC"]:
        mlflow.log_metric(metric_name, round(metrics_row[metric_name], 4))

    # Log and register the model for subsequent use
    model_name = "diabetes-lgbm"
    mlflow.spark.log_model(
        lg_model,
        artifact_path=model_name,
        registered_model_name=model_name,
        dfs_tmpdir="Files/tmp/mlflow",
    )

Successfully registered model 'diabetes-lgbm'.
2023/06/23 19:59:48 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: diabetes-lgbm, version 5
Created version '5' of model 'diabetes-lgbm'.


In [8]:
#display raw predictions generated by model
display(lg_predictions)


In [9]:
#display performance metrics for the trained ML Model
display(metrics)

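
As a reminder of what the `precision`, `recall`, and `accuracy` metrics measure, the sketch below computes them from scratch, assuming the standard binary-classification definitions based on the confusion matrix. The labels and predictions are hypothetical and unrelated to the diabetes dataset, and `classification_metrics` is an illustrative helper, not part of SynapseML.

```python
def classification_metrics(labels, predictions):
    """Compute precision, recall, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    return {
        # Of the rows predicted positive, how many really are positive
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the rows that really are positive, how many were found
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of all rows that were classified correctly
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical labels and predictions, for illustration only
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 1]

result = classification_metrics(labels, predictions)
print(result)  # {'precision': 0.6, 'recall': 0.75, 'accuracy': 0.625}
```

For a screening-style problem such as diabetes prediction, recall is often worth watching closely, because a false negative means a person at risk is missed.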