<a href="https://colab.research.google.com/github/mimrancomsats/ProgrammingforAI_SPRING25/blob/main/Lab_9_MLflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MLflow**

**MLflow is an open-source platform, purpose-built to assist machine learning practitioners and teams in handling the complexities of the machine learning process. MLflow focuses on the full lifecycle for machine learning projects, ensuring that each phase is manageable, traceable, and reproducible.**

**In this notebook, we use MLflow for experiment tracking.**

**The installation process for MLflow is described at the following link:**

https://mlflow.org/docs/latest/getting-started/intro-quickstart/index.html

# **MLflow Library Installation**

In [3]:
!pip install --quiet mlflow

# **Sklearn Pipeline Implementation (KNN)**

In [2]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
titanic_data = pd.read_csv('titanic.csv')

# Custom function to impute missing values in 'Embarked' column
def impute_embarked(X):
    X = X.copy()  # Work on a copy so the caller's DataFrame is not modified
    X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])  # Fill with the most frequent value
    return X

# Custom function to create the 'FamilySize' feature
def create_family_size(X):
    X = X.copy()  # Work on a copy so the caller's DataFrame is not modified
    X['FamilySize'] = X['SibSp'] + X['Parch'] + 1  # Add 1 for the individual themselves
    return X

# Custom function to drop columns that are not needed for model training
def drop_columns(X):
    return X.drop(['SibSp', 'Parch'], axis=1)

# Function to create 'FamilySize' and drop 'SibSp' and 'Parch' columns
def family_size(X):
    X = create_family_size(X)
    X = drop_columns(X)
    return X

# Pipeline to preprocess 'Age' column
age_pipeline = Pipeline(steps=[
    ('age_imputer', SimpleImputer(strategy='mean')),  # Impute missing 'Age' values
    ('age_scaler', MinMaxScaler())  # Scale 'Age' feature
])

# Pipeline to preprocess 'Fare' column
fare_pipeline = Pipeline(steps=[
    #('fare_imputer', SimpleImputer(strategy='mean')),  # Optionally impute missing 'Fare'
    ('fare_scaler', MinMaxScaler())  # Scale 'Fare' feature
])

# Pipeline to create and scale the 'FamilySize' feature
family_size_pipeline = Pipeline(steps=[
    ('family_size_creator', FunctionTransformer(family_size)),
    ('family_size_scaler', MinMaxScaler()),  # Scale 'FamilySize'
])

# Pipeline to preprocess 'Embarked' column
embarked_pipeline = Pipeline(steps=[
    ('embarked_imputer', FunctionTransformer(impute_embarked)),  # Impute missing 'Embarked' values
    ('embarked_onehot', OneHotEncoder())  # One-hot encode 'Embarked'
])

# Create a ColumnTransformer to preprocess all relevant features
knn_preprocessor = ColumnTransformer(transformers=[
    ('age_encoder', age_pipeline, ['Age']),  # Preprocess 'Age'
    ('fare_encoder', fare_pipeline, ['Fare']),  # Preprocess 'Fare'
    ('family_size', family_size_pipeline, ['SibSp', 'Parch']),  # Preprocess 'FamilySize'
    ('embarked_encoder', embarked_pipeline, ['Embarked']),  # Preprocess 'Embarked'
    ('sex_encoder', OneHotEncoder(), ['Sex']),  # One-hot encode 'Sex'
    ('pclass_scaler', MinMaxScaler(), ['Pclass']),  # Scale 'Pclass'
], remainder='passthrough')

# Create a complete pipeline with preprocessing and the KNN classifier
knn_pipeline = Pipeline(steps=[
    ('knn_preprocessor', knn_preprocessor),  # Data preprocessing steps
    ('knn_classifier', KNeighborsClassifier(n_neighbors=5))  # KNN Classifier
])

# Separate features and target variable, dropping identifier-like columns
X = titanic_data.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
y = titanic_data['Survived']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
knn_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn_pipeline.predict(X_test)

# Evaluate the model performance
knn_accuracy = accuracy_score(y_test, y_pred)
print(f"\nKNN Model Accuracy: {knn_accuracy:.2f}")

# Confusion matrix for evaluating the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))



KNN Model Accuracy: 0.80
Confusion Matrix:
[[90 15]
 [21 53]]


# **Experiment Tracking (KNN)**

In [None]:
import mlflow
import mlflow.sklearn

# Set the tracking URI and experiment name
mlflow.set_tracking_uri(uri="http://3.87.94.8:5000")
mlflow.set_experiment("KNN Experiment")

# Start a new MLflow run
with mlflow.start_run():

    # Log the parameters related to the KNN model
    mlflow.log_param("model","KNN")
    mlflow.log_param("n_neighbors", 5)
    # scikit-learn's default metric (Minkowski with p=2) is the Euclidean distance
    mlflow.log_param("metric", 'euclidean')

    # Log the accuracy metric
    mlflow.log_metric("accuracy", knn_accuracy)

    # Log the KNN model (use the knn_pipeline variable)
    mlflow.sklearn.log_model(knn_pipeline, "KNN Algorithm")


2025/05/02 10:37:30 INFO mlflow.tracking.fluent: Experiment with name 'KNN Experiment' does not exist. Creating a new experiment.


🏃 View run unleashed-pig-897 at: http://3.87.94.8:5000/#/experiments/346416385151607627/runs/c4b68d93f72142fdbf7948de32b9d8d4
🧪 View experiment at: http://3.87.94.8:5000/#/experiments/346416385151607627
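
The hyperparameters in the tracking cell are typed in by hand, so they can drift from what the estimator actually uses. As a minimal sketch (not part of the original lab), the values can instead be read from the classifier with `get_params()`. The `pipeline` and `tracked_params` names and the choice of which keys to keep are assumptions made for illustration; the single-step pipeline stands in for `knn_pipeline`.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the notebook's knn_pipeline; only the classifier step matters here.
pipeline = Pipeline(steps=[
    ('knn_classifier', KNeighborsClassifier(n_neighbors=5)),
])

# Read the hyperparameters from the estimator itself instead of hard-coding them.
all_params = pipeline.named_steps['knn_classifier'].get_params()

# Keep only the values worth tracking (this selection is an illustrative choice).
tracked_params = {key: all_params[key] for key in ('n_neighbors', 'metric', 'p', 'weights')}
print(tracked_params)

# Inside an active run, the dictionary could then be logged in one call:
#     mlflow.log_params(tracked_params)
```

scikit-learn reports the default metric as `'minkowski'` with `p=2`, which is the same distance as the `'euclidean'` value logged by hand.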


# **Sklearn Pipeline Implementation (Decision Tree)**

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the dataset
data = pd.read_csv('titanic.csv')

# Custom function to impute missing values in the 'Embarked' column
def impute_embarked(X):
    X = X.copy()  # Work on a copy so the caller's DataFrame is not modified
    X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])  # Fill with the most frequent value
    return X

# Custom function to create the 'FamilySize' feature
def create_family_size(X):
    X = X.copy()  # Work on a copy so the caller's DataFrame is not modified
    X['FamilySize'] = X['SibSp'] + X['Parch'] + 1  # Adding 1 for the individual themselves
    return X

# Custom function to drop specified columns
def drop_columns(X):
    return X.drop(['SibSp', 'Parch'], axis=1)

# Function to create 'FamilySize' and drop 'SibSp' and 'Parch' columns
def family_size(X):
    X = create_family_size(X)
    X = drop_columns(X)
    return X

# Create pipelines for 'Age'
age_pipeline = Pipeline(steps=[
    ('age_imputer', SimpleImputer(strategy='mean')),  # Impute missing 'Age' values
    ('age_scaler', MinMaxScaler())  # Scale 'Age' feature
])

# Create pipelines for 'Fare'
fare_pipeline = Pipeline(steps=[
    ('fare_scaler', MinMaxScaler())  # Scale 'Fare' feature
])

# Create pipelines for 'FamilySize'
family_size_pipeline = Pipeline(steps=[
    ('family_size_creator', FunctionTransformer(family_size)),
    ('family_size_scaler', MinMaxScaler())  # Scale 'FamilySize' feature
])

# Create pipelines for 'Embarked'
embarked_pipeline = Pipeline(steps=[
    ('embarked_imputer', FunctionTransformer(impute_embarked)),  # Impute missing 'Embarked' values
    ('embarked_onehot', OneHotEncoder())  # One-hot encode 'Embarked'
])

# Create a ColumnTransformer to preprocess the data
dt_preprocessor = ColumnTransformer(transformers=[
    ('drop_columns', 'drop', ['PassengerId', 'Name', 'Ticket', 'Cabin']),  # Drop irrelevant columns
    ('age_encoder', age_pipeline, ['Age']),  # Preprocess 'Age'
    ('fare_encoder', fare_pipeline, ['Fare']),  # Preprocess 'Fare'
    ('family_size', family_size_pipeline, ['SibSp', 'Parch']),  # Preprocess 'FamilySize'
    ('embarked_encoder', embarked_pipeline, ['Embarked']),  # Preprocess 'Embarked'
    ('sex_encoder', OneHotEncoder(), ['Sex']),  # One-hot encode 'Sex'
    ('pclass_scaler', MinMaxScaler(), ['Pclass']),  # Scale 'Pclass'
], remainder='passthrough')

# Create a complete pipeline that includes preprocessing and the Decision Tree classifier
dt_pipeline = Pipeline(steps=[
    ('dt_preprocessor', dt_preprocessor),  # Data preprocessing steps
    ('dt_classifier', DecisionTreeClassifier(criterion='entropy'))  # Decision Tree Classifier
])

# Separate features and target variable
X = data.drop('Survived', axis=1)
y = data['Survived']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
dt_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_pipeline.predict(X_test)

# Evaluate the model performance
dt_accuracy = accuracy_score(y_test, y_pred)
print(f"\nDecision Tree Model Accuracy: {dt_accuracy:.2f}")

# Confusion matrix for evaluating the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))



Decision Tree Model Accuracy: 0.77
Confusion Matrix:
[[82 23]
 [19 55]]


# **Experiment Tracking (Decision Tree)**

In [None]:
import mlflow
import mlflow.sklearn

# Set the tracking URI and experiment name
mlflow.set_tracking_uri(uri="http://3.87.94.8:5000")
mlflow.set_experiment("Decision Tree Experiment")

# Start a new MLflow run
with mlflow.start_run():

    # Log the parameters related to Decision Tree model
    mlflow.log_param("model","Decision Tree")
    mlflow.log_param("criterion", "entropy")

    # Log the accuracy metric
    mlflow.log_metric("accuracy", dt_accuracy)

    # Log the Decision Tree model (use the dt_pipeline variable)
    mlflow.sklearn.log_model(dt_pipeline, "Decision Tree Algorithm")


2025/05/02 10:39:34 INFO mlflow.tracking.fluent: Experiment with name 'Decision Tree Experiment' does not exist. Creating a new experiment.


🏃 View run gregarious-goose-418 at: http://3.87.94.8:5000/#/experiments/312805377520945777/runs/e6dc6769f8d847f382c6d97fdd131946
🧪 View experiment at: http://3.87.94.8:5000/#/experiments/312805377520945777


# **Sklearn Pipeline Implementation (Random Forest)**

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier  # Import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the dataset
data = pd.read_csv('titanic.csv')

# Custom function to impute missing values in the 'Embarked' column
def impute_embarked(X):
    X = X.copy()  # Work on a copy so the caller's DataFrame is not modified
    X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])  # Fill with the most frequent value
    return X

# Custom function to create the 'FamilySize' feature
def create_family_size(X):
    X = X.copy()  # Work on a copy so the caller's DataFrame is not modified
    X['FamilySize'] = X['SibSp'] + X['Parch'] + 1  # Adding 1 for the individual themselves
    return X

# Custom function to drop specified columns
def drop_columns(X):
    return X.drop(['SibSp', 'Parch'], axis=1)

# Function to create 'FamilySize' and drop 'SibSp' and 'Parch' columns
def family_size(X):
    X = create_family_size(X)
    X = drop_columns(X)
    return X

# Create pipelines for 'Age'
age_pipeline = Pipeline(steps=[
    ('age_imputer', SimpleImputer(strategy='mean')),  # Impute missing 'Age' values
    ('age_scaler', MinMaxScaler())  # Scale 'Age' feature
])

# Create pipelines for 'Fare'
fare_pipeline = Pipeline(steps=[
    ('fare_scaler', MinMaxScaler())  # Scale 'Fare' feature
])

# Create pipelines for 'FamilySize'
family_size_pipeline = Pipeline(steps=[
    ('family_size_creator', FunctionTransformer(family_size)),
    ('family_size_scaler', MinMaxScaler())  # Scale 'FamilySize' feature
])

# Create pipelines for 'Embarked'
embarked_pipeline = Pipeline(steps=[
    ('embarked_imputer', FunctionTransformer(impute_embarked)),  # Impute missing 'Embarked' values
    ('embarked_onehot', OneHotEncoder())  # One-hot encode 'Embarked'
])

# Create a ColumnTransformer to preprocess the data
rf_preprocessor = ColumnTransformer(transformers=[
    ('drop_columns', 'drop', ['PassengerId', 'Name', 'Ticket', 'Cabin']),  # Drop irrelevant columns
    ('age_encoder', age_pipeline, ['Age']),  # Preprocess 'Age'
    ('fare_encoder', fare_pipeline, ['Fare']),  # Preprocess 'Fare'
    ('family_size', family_size_pipeline, ['SibSp', 'Parch']),  # Preprocess 'FamilySize'
    ('embarked_encoder', embarked_pipeline, ['Embarked']),  # Preprocess 'Embarked'
    ('sex_encoder', OneHotEncoder(), ['Sex']),  # One-hot encode 'Sex'
    ('pclass_scaler', MinMaxScaler(), ['Pclass']),  # Scale 'Pclass'
], remainder='passthrough')

# Create a complete pipeline that includes preprocessing and the Random Forest classifier
rf_pipeline = Pipeline(steps=[
    ('rf_preprocessor', rf_preprocessor),  # Data preprocessing steps
    ('rf_classifier', RandomForestClassifier(n_estimators=100, random_state=42))  # Random Forest Classifier
])

# Separate features and target variable
X = data.drop('Survived', axis=1)
y = data['Survived']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
rf_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_pipeline.predict(X_test)

# Evaluate the model performance
rf_accuracy = accuracy_score(y_test, y_pred)
print(f"\nRandom Forest Model Accuracy: {rf_accuracy:.2f}")

# Confusion matrix for evaluating the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Random Forest Model Accuracy: 0.82
Confusion Matrix:
[[91 14]
 [19 55]]


# **Experiment Tracking (Random Forest)**

In [None]:
import mlflow
import mlflow.sklearn

# Set the tracking URI and experiment name for Random Forest
mlflow.set_tracking_uri(uri="http://3.87.94.8:5000")
mlflow.set_experiment("Random Forest Experiment")

# Start a new MLflow run
with mlflow.start_run():

    # Log the hyperparameters
    mlflow.log_param("model","Random Forest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("random_state", 42)

    # Log the accuracy metric
    mlflow.log_metric("accuracy", rf_accuracy)

    # Log the Random Forest model (use the rf_pipeline variable)
    mlflow.sklearn.log_model(rf_pipeline, "Random Forest Algorithm")


2025/05/02 10:41:09 INFO mlflow.tracking.fluent: Experiment with name 'Random Forest Experiment' does not exist. Creating a new experiment.


🏃 View run redolent-shoat-348 at: http://3.87.94.8:5000/#/experiments/707553264580851716/runs/2009dc12adbe4c049cb25872ed038a81
🧪 View experiment at: http://3.87.94.8:5000/#/experiments/707553264580851716


# **Lab Task**

1. Extend the experiment tracking code to log precision, recall, and F1 score metrics to MLflow.
2. Register the best-performing model in MLflow.
3. Run inference on the test set using the model registered in Step 2.
4. Repeat the steps above on the following dataset:

https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease
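
As a starting point for items 1 and 2, precision, recall, and F1 can be derived from the confusion matrices printed earlier in this notebook. The sketch below uses plain Python and assumes scikit-learn's layout for labels `[0, 1]`, namely `[[TN, FP], [FN, TP]]`, with `Survived = 1` as the positive class. The helper name `binary_metrics` is hypothetical, and picking the best model by F1 is an assumption, since the task does not say which metric decides.

```python
# Confusion matrices copied from the cell outputs in this notebook.
# scikit-learn layout for labels [0, 1]: [[TN, FP], [FN, TP]]
confusion_matrices = {
    "KNN": [[90, 15], [21, 53]],
    "Decision Tree": [[82, 23], [19, 55]],
    "Random Forest": [[91, 14], [19, 55]],
}


def binary_metrics(matrix):
    """Derive accuracy, precision, recall and F1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1_score": 2 * tp / (2 * tp + fp + fn),
    }


results = {name: binary_metrics(matrix) for name, matrix in confusion_matrices.items()}
for name, metrics in results.items():
    print(name, {key: round(value, 2) for key, value in metrics.items()})

# Illustrative selection rule: highest F1 score wins.
best_model = max(results, key=lambda name: results[name]["f1_score"])
print("Best model by F1:", best_model)
```

In the notebook itself, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics` would give the same values directly from `y_test` and `y_pred`, and a dictionary like the ones above could be passed to `mlflow.log_metrics` inside the existing `with mlflow.start_run():` block.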