## Module 3: Train and register Heart Failure Prediction Machine Learning Model
In this module you will learn to train a machine learning model that predicts the likelihood of an individual developing heart failure, based on the historical diagnostic measurements available in the training dataset.

Once the model is trained, you will learn to register it and to log the hyperparameters used and the evaluation metrics, using Fabric's native integration with the MLflow framework.

[MLflow](https://mlflow.org/docs/latest/index.html) is an open source platform for managing the machine learning lifecycle with features like Tracking, Models, and Model Registry. MLflow is natively integrated with Fabric Data Science Experience.

#### Import the required libraries


In [2]:
# Run this code cell to import the required libraries for model training
import mlflow
print(f"mlflow library version: {mlflow.__version__}")
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, confusion_matrix, recall_score, roc_auc_score, classification_report

mlflow library version: 2.6.0


#### Part 1 Instructions
In the cell below, create an MLflow experiment to track and register the model. The experiment name has been provided to you in the first variable. [Documentation link](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.set_experiment)

In [4]:
# Create Experiment to Track and register model with mlflow
EXPERIMENT_NAME = "heartfailure_prediction"

#Enter code here to create an experiment
mlflow.

#This line of code enables user-started runs to be tracked
mlflow.autolog(exclusive=False)

2024/06/19 13:29:06 INFO mlflow.tracking.fluent: Autologging successfully enabled for lightgbm.
2024/06/19 13:29:06 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


#### Part 2 Instructions
Now, read the Delta table that you saved into the lakehouse in notebook 2. Refer to the previous notebooks if needed.

In [3]:
#Enter code to read the delta table from the lakehouse
data_df = 

#Display the schema and a preview of the loaded dataframe
data_df.printSchema()
display(data_df)

root
 |-- Sex: long (nullable = true)
 |-- ChestPainType: long (nullable = true)
 |-- RestingECG: long (nullable = true)
 |-- ExerciseAngina: long (nullable = true)
 |-- ST_Slope: long (nullable = true)
 |-- Age: integer (nullable = true)
 |-- RestingBP: integer (nullable = true)
 |-- Cholesterol: integer (nullable = true)
 |-- FastingBS: integer (nullable = true)
 |-- MaxHR: integer (nullable = true)
 |-- Oldpeak: double (nullable = true)
 |-- HeartDisease: integer (nullable = true)


#### Part 3 Instructions
You now have to perform a random split of your data to obtain training and testing subsets. Run the following 2 code cells as they are; the comments explain what is happening at each stage.

In [5]:
#First, convert your dataframe to a Pandas dataframe
data_df = data_df.toPandas()

In [6]:
#Import the train_test_split function from sklearn
from sklearn.model_selection import train_test_split

#Set the seed for the random splitter
SEED = 12345

#Separate your dataframe into y, the label data, and X, the feature data
y = data_df["HeartDisease"]
X = data_df.drop("HeartDisease",axis=1)

Now split the data into three groups: 60% training (X_train and y_train), 20% testing (X_test and y_test) and 20% validation (X_val and y_val). Use the train_test_split function you previously imported. [Documentation Link](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
# First split the data into training and testing. Make the training portion 60% and the rest testing. 

X_train, ... =
# Now split your existing testing dataset evenly into testing and validation.

X_test, ... =
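
The two-stage split described above can be sketched on synthetic data as follows. This is an illustration rather than the notebook's solution cell: the toy lists `X` and `y` are made up, and only the 60/20/20 proportions and the seed come from the instructions.

```python
from sklearn.model_selection import train_test_split

SEED = 12345

# Assumption: 100 synthetic rows stand in for the heart failure dataframe.
X = [[i, i * 2] for i in range(100)]
y = [i % 2 for i in range(100)]

# Stage 1: keep 60% for training; the remaining 40% is held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=SEED
)

# Stage 2: split the held-out 40% evenly into testing and validation (20% each).
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=SEED
)

print(len(X_train), len(X_test), len(X_val))
```

Splitting the held-out portion in half is what turns a 60/40 split into 60/20/20. Fixing `random_state` makes both splits reproducible.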

#### Part 4 Instructions
You will now create a run in the experiment you previously started. You will be using a random forest classifier from the scikit-learn library.

[Quick RandomForestClassifier function documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

[In-depth explanation of random forest classifiers](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [None]:
#Import the required libraries by running this code cell
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, confusion_matrix, recall_score, roc_auc_score, classification_report

Before starting, enable autologging so that the trained model is registered in MLflow. Use the mlflow.sklearn.autolog function, which allows you to track sklearn models in MLflow. Make sure to use 'rfc1_sm' as your registered model name. [Documentation Link](https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.autolog)

In [None]:
# Register the trained model with autologging
mlflow.

Now, assign a Random Forest Classifier to the variable rfc1_sm. Use the link under the Part 4 instructions for more information about the RandomForestClassifier function. For the hyperparameters, use a maximum depth (max_depth) of 4, a maximum number of features (max_features) of 4, a minimum samples split (min_samples_split) of 3 and a random state (random_state) of 1.

In [None]:
rfc1_sm = 

Once you have set the model, you will start a training run in mlflow to fit your model to the data. Name your run 'rfc1_sm' [Documentation Link](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.start_run).

You will need to perform a variety of different actions in the run, but they must all be in the same code cell. Here you will find a labeled list of helpful links for each step. You must add one line of code in between each comment line.

1. [model.fit](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit)
1. [model.score](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.score)
1. [model.predict](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict)
1. [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#classification-report)
1. [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#confusion-matrix)
1. [roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#roc-auc-score) & [predict_proba](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba)


In [None]:
with "use the start_run function here" as run:
    rfc1_sm_run_id = #Enter code here to capture run id
    print("run_id: {}; status: {}".format(rfc1_sm_run_id, run.info.status)) #This line prints the run ID and status
    #1 Fit the model on the train data you previously split
    
    #2 Score the model with the training data (X_train,...) to obtain the mean accuracy of the model
    
    #3 Generate predictions using the training feature data
    y_pred = 
    #4 Create a classification report comparing your step 3 predictions (y_pred) with the actual values (y_train)
    cr_rfc1_sm = 
    #5 Create a confusion matrix comparing your step 3 predictions with the actual values (y_train)
    cm_rfc1_sm = 
    #6 Calculate the AUC score with the training data. Use the predict_proba function for your predictions.
    roc_auc_rfc1_sm = 
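
The six steps listed above can be sketched without MLflow as follows. This is a hedged illustration, not the notebook's solution: `make_classification` generates synthetic data with 11 features as a stand-in for the heart failure table, and the variable names are hypothetical. Only the hyperparameters come from the Part 4 instructions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic binary-classification data replaces the real dataset.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=12345
)

model = RandomForestClassifier(
    max_depth=4, max_features=4, min_samples_split=3, random_state=1
)

model.fit(X_train, y_train)                      # 1. fit on the training data
train_accuracy = model.score(X_train, y_train)   # 2. mean accuracy, not precision
y_pred = model.predict(X_train)                  # 3. predicted class labels
report = classification_report(y_train, y_pred)  # 4. per-class precision/recall/F1 as text
matrix = confusion_matrix(y_train, y_pred)       # 5. rows = actual, columns = predicted
# 6. AUC needs scores, so use the predicted probability of the positive class (column 1).
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

print(report)
print(matrix)
print(train_accuracy, train_auc)
```

Passing probabilities rather than hard labels to `roc_auc_score` lets the metric evaluate the ranking of the predictions across all thresholds.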

Display the classification report

In [None]:
display(cr_rfc1_sm)

Display the confusion matrix

In [8]:
display(cm_rfc1_sm)

array([[69, 18],
       [ 8, 89]])
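
To read the matrix shown above, recall that scikit-learn places the actual classes on the rows and the predicted classes on the columns, with labels in sorted order (0, then 1). The following short sketch derives accuracy, precision and recall for class 1 from those four counts. The variable names are illustrative.

```python
# Counts copied from the confusion matrix output above.
cm = [[69, 18],
      [8, 89]]

tn, fp = cm[0]  # actual 0: predicted 0 (true negatives), predicted 1 (false positives)
fn, tp = cm[1]  # actual 1: predicted 0 (false negatives), predicted 1 (true positives)

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted positives that are truly positive
recall = tp / (tp + fn)        # share of actual positives that the model found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```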

Display the AUC score

In [9]:
display(roc_auc_rfc1_sm)

0.9669833380099144

#### Part 5 Instructions
You now have a trained and saved model. You have explored its performance on the training data; you will now complete a final check with the validation dataset. In this section you will also explore loading a previously trained model.

Load the model using MLflow. [Documentation Link](https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.load_model)

In [None]:
#Load the model
load_model=

Now, generate predictions with the validation data (X_val). Refer to Part 4 for documentation.

In [None]:
#Generate predictions
y_pred_val=

To check the model's performance, recalculate the classification report, confusion matrix and AUC score. Refer to Part 4 for documentation. Use y_val (actual values) and y_pred_val (predicted values).

In [None]:
#Calculate the classification report
cr_val=
#Calculate the confusion matrix
cm_val=
#Calculate the AUC score
roc_auc_val=

Display the classification report

In [None]:
display(cr_val)

Display the confusion matrix

In [None]:
display(cm_val)

Display the AUC score

In [None]:
display(roc_auc_val)