## Module 3: Train and register Heart Failure Prediction Machine Learning Model
In this module, you will train a machine learning model that predicts the likelihood of heart failure for an individual, based on the historical diagnostic measurements available in the training dataset.

Once the model is trained, you will register it and log its hyperparameters and evaluation metrics using Fabric's native integration with the MLflow framework.

[MLflow](https://mlflow.org/docs/latest/index.html) is an open source platform for managing the machine learning lifecycle with features like Tracking, Models, and Model Registry. MLflow is natively integrated with Fabric Data Science Experience.

### Import mlflow and create an experiment to log the run


In [1]:
# Create Experiment to Track and register model with mlflow
import mlflow
print(f"mlflow library version: {mlflow.__version__}")
EXPERIMENT_NAME = "heartfailure_prediction"
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(exclusive=False)

mlflow library version: 2.6.0


In [2]:
# Import the required libraries for model training
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, confusion_matrix, recall_score, roc_auc_score, classification_report

2024/07/15 14:56:11 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2024/07/15 14:56:11 INFO mlflow.tracking.fluent: Autologging successfully enabled for lightgbm.


### Read data from lakehouse delta table (saved in previous module)

In [3]:
data_df = spark.read.format("delta").load("Tables/heartfailure_processed")
data_df.printSchema()
display(data_df)

root
 |-- Sex: long (nullable = true)
 |-- ChestPainType: long (nullable = true)
 |-- RestingECG: long (nullable = true)
 |-- ExerciseAngina: long (nullable = true)
 |-- ST_Slope: long (nullable = true)
 |-- Age: integer (nullable = true)
 |-- RestingBP: double (nullable = true)
 |-- Cholesterol: double (nullable = true)
 |-- FastingBS: double (nullable = true)
 |-- MaxHR: integer (nullable = true)
 |-- Oldpeak: double (nullable = true)
 |-- HeartDisease: integer (nullable = true)



### Perform a random split to get train, validation, and test datasets, and identify the feature columns to be used for model training

In [5]:
data_df = data_df.toPandas()

In [6]:
from sklearn.model_selection import train_test_split
SEED = 12345
y = data_df["HeartDisease"]
X = data_df.drop("HeartDisease",axis=1)
# Split the dataset to 60%, 20%, 20% for training, validation, and test datasets
# Train-Test Separation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=SEED)
# Train-Validation Separation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=SEED)
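
The 60/20/20 proportions come from chaining two splits: the second `test_size=0.25` applies only to the 80% that remains after the first split. The following is a minimal sketch of that arithmetic using exact fractions; the helper name `split_fractions` is hypothetical and is not part of the notebook or of scikit-learn.

```python
from fractions import Fraction


def split_fractions(test_size, val_size_of_remainder):
    """Return the overall (train, validation, test) fractions of a two-step split."""
    test = test_size                       # first split: taken from the full dataset
    remainder = 1 - test                   # what is left after removing the test set
    val = val_size_of_remainder * remainder  # second split: taken from the remainder
    train = remainder - val
    return train, val, test


train, val, test = split_fractions(Fraction(20, 100), Fraction(25, 100))
print(train, val, test)  # 3/5 1/5 1/5
```

If you change either `test_size` value in the cell above, the same arithmetic tells you the resulting proportions before you run the split.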

In [7]:
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, confusion_matrix, recall_score, roc_auc_score, classification_report
# Enable sklearn autologging and register the trained model under the name 'rfc1_sm'
mlflow.sklearn.autolog(registered_model_name='rfc1_sm')
rfc1_sm = RandomForestClassifier(max_depth=4, max_features=4, min_samples_split=3, random_state=1) # Pass hyperparameters
with mlflow.start_run(run_name="rfc1_sm") as run:
    rfc1_sm_run_id = run.info.run_id # Capture run_id for model prediction later
    print("run_id: {}; status: {}".format(rfc1_sm_run_id, run.info.status))
    rfc1_sm.fit(X_train, y_train) # Imbalanced training data
    rfc1_sm.score(X_test, y_test) # Mean accuracy on the test set; the return value is discarded here
    y_pred = rfc1_sm.predict(X_test)
    cr_rfc1_sm = classification_report(y_test, y_pred)
    cm_rfc1_sm = confusion_matrix(y_test, y_pred)
    # Note: this AUC is computed on the training data, so it measures fit rather than generalization
    roc_auc_rfc1_sm = roc_auc_score(y_train, rfc1_sm.predict_proba(X_train)[:, 1])

run_id: 9111e9e2-ba82-4ba0-8a8f-f2c5d4ab8fb0; status: RUNNING

Registered model 'rfc1_sm' already exists. Creating a new version of this model...
2024/07/15 14:56:55 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: rfc1_sm, version 2
Created version '2' of model 'rfc1_sm'.






In [8]:
display(cm_rfc1_sm)

array([[53, 17],
       [ 8, 60]])
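
To read this matrix, recall that scikit-learn places true labels on the rows and predicted labels on the columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives the usual summary metrics from the counts shown above; the helper `binary_metrics` is a hypothetical name introduced only for illustration.

```python
def binary_metrics(cm):
    """Derive summary metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,    # share of all predictions that are correct
        "precision": tp / (tp + fp),      # share of predicted positives that are true positives
        "recall": tp / (tp + fn),         # share of actual positives that were found
        "specificity": tn / (tn + fp),    # share of actual negatives that were found
    }


cm_test = [[53, 17], [8, 60]]  # counts copied from the output above
metrics = binary_metrics(cm_test)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a heart-disease screen, the 8 false negatives (patients with the condition predicted as healthy) are often the more costly error, which is why recall is worth reading alongside accuracy.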

In [9]:
display(roc_auc_rfc1_sm)

0.9750763941940412
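
Note that the training cell computes this AUC from `y_train` and `X_train`, so the value describes how well the model fits data it has already seen. As a rough sketch of the difference, the example below trains a small random forest on synthetic data and compares the AUC on the training split with the AUC on a held-out split. Everything here is an assumption for illustration: the data comes from `make_classification`, not from the heart failure table, and the numbers it prints say nothing about the model above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the heart failure dataset)
X, y = make_classification(n_samples=600, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

model = RandomForestClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)

# AUC needs scores, so use the predicted probability of the positive class
train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
heldout_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"AUC on training data: {train_auc:.3f}")
print(f"AUC on held-out data: {heldout_auc:.3f}")
```

In the notebook, the equivalent held-out figure would come from passing `X_test` or `X_val` to `predict_proba` instead of `X_train`.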

### Load the model to generate predictions and recalculate the evaluation metrics on the validation dataset



In [9]:
load_model = mlflow.sklearn.load_model(f"runs:/{rfc1_sm_run_id}/model")
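
The string passed to `load_model` is an MLflow model URI. The `runs:/<run_id>/<artifact_path>` form used above points at the artifact logged by a specific run, while MLflow also accepts a `models:/<name>/<version>` form for registered models. The sketch below only builds these strings; the helper names are hypothetical, and whether version 2 of `rfc1_sm` is the version you want to load is an assumption based on the registration log shown earlier.

```python
def run_model_uri(run_id, artifact_path="model"):
    """Build a URI that points at a model artifact logged by a specific run."""
    return f"runs:/{run_id}/{artifact_path}"


def registered_model_uri(name, version):
    """Build a URI that points at a specific version of a registered model."""
    return f"models:/{name}/{version}"


# Run ID and model name copied from the output earlier in this module
print(run_model_uri("9111e9e2-ba82-4ba0-8a8f-f2c5d4ab8fb0"))
print(registered_model_uri("rfc1_sm", 2))
```

Either string could then be passed to `mlflow.sklearn.load_model`. Loading by run ID pins you to one training run, whereas loading by registered name and version ties the code to the registry entry.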

In [11]:
# Generate predictions on the validation dataset
y_pred_val = load_model.predict(X_val)

Recalculate the classification report, confusion matrix and AUC score with the newly generated predictions.

In [12]:
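# Calculate the classification report
cr_val = classification_report(y_val, y_pred_val)
# Calculate the confusion matrix
cm_val = confusion_matrix(y_val, y_pred_val)
# Calculate the AUC score. Note: y_pred_val holds hard class labels (not probabilities),
# so this value equals the balanced accuracy and is not directly comparable to the
# probability-based AUC computed in the training cell
roc_auc_val = roc_auc_score(y_val, y_pred_val)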


Display the recalculated values.

In [14]:
print(cr_val)

              precision    recall  f1-score   support

           0       0.84      0.71      0.77        52
           1       0.84      0.92      0.88        86

    accuracy                           0.84       138
   macro avg       0.84      0.82      0.82       138
weighted avg       0.84      0.84      0.84       138

In [15]:
display(cm_val)

array([[37, 15],
       [ 7, 79]])

In [16]:
display(roc_auc_val)

0.815071556350626
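
This value is lower than the earlier AUC partly because `roc_auc_score` was given hard 0/1 predictions here. With only two distinct scores, the ROC curve has a single interior point and the AUC reduces to the balanced accuracy, that is, the mean of recall and specificity. The sketch below checks this against the validation confusion matrix shown above, using a plain pairwise definition of AUC; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Counts copied from the validation confusion matrix [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 37, 15, 7, 79

# Rebuild label vectors that produce exactly this confusion matrix
y_true = [0] * (tn + fp) + [1] * (fn + tp)
y_pred = [0] * tn + [1] * fp + [0] * fn + [1] * tp

auc_hard = pairwise_auc(y_true, y_pred)
balanced_accuracy = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

print(f"AUC from hard labels: {auc_hard:.6f}")
print(f"Balanced accuracy:    {balanced_accuracy:.6f}")
```

Both numbers agree with the value displayed above. To obtain a ranking-based AUC on the validation set, one option is to score with `load_model.predict_proba(X_val)[:, 1]` instead of `predict`.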