# **Project Overview and Objectives**

## **Project Overview:**

This project develops a machine learning model that predicts whether a patient is likely to be diabetic from a set of health metrics. It trains several algorithms, compares their performance, and selects the best-performing model for potential deployment. Runs are tracked with MLflow, an open-source platform for managing the machine learning lifecycle, and Azure Machine Learning Studio provides the workspace for experiment tracking, model management, and deployment.

## **Objectives:**

1. **Data Preparation:**
   - Load and preprocess the diabetes dataset, ensuring the data is clean and ready for model training.
   - Split the dataset into training and test sets to evaluate the model's performance.

2. **Model Training and Evaluation:**
   - Train multiple machine learning models, including Logistic Regression, Decision Tree, and Random Forest.
   - Use MLflow to track the parameters, metrics, and artifacts of each model. This notebook logs them manually; MLflow's autologging feature can capture them automatically instead.
   - Log the test-set accuracy of each model and compare the results.

3. **Model Selection:**
   - Analyze the logged metrics to identify the best-performing model based on accuracy.
   - Register the best model in Azure ML for future use, ensuring it is ready for deployment.

4. **Model Deployment (Optional):**
   - Deploy the best model as a web service in Azure so that external applications can request real-time predictions.

## **Process Overview:**

1. **Environment Setup:**
   - Verify that the necessary libraries and SDKs are installed, including `azure-ai-ml`, `mlflow`, `scikit-learn`, and others required for model training and tracking.

2. **Data Loading and Preprocessing:**
   - Load the diabetes dataset from a CSV file.
   - Split the data into features (input variables) and labels (output variable) and further split it into training and test sets.

3. **Experiment Setup:**
   - Initialize an MLflow experiment to group all related model training runs.
   - Log the dataset as a run artifact. Model parameters and metrics are logged manually during training, with MLflow autologging as an optional alternative.

4. **Model Training:**
   - Train multiple machine learning models using different algorithms.
   - For each model, log the accuracy and other relevant metrics to MLflow.
   - Save the trained model as an artifact for future reference or deployment.

5. **Model Evaluation and Selection:**
   - Compare the performance of the different models based on the logged metrics.
   - Select the model with the highest test-set accuracy.

6. **Model Registration and Deployment:**
   - Register the best-performing model in Azure ML to ensure it is easily accessible for deployment.
   - Optionally, deploy the model as a web service for real-time predictions.

7. **Project Documentation:**
   - Document the entire process, including data exploration, model selection criteria, and the final decision-making process.

## **Expected Outcomes:**

- A well-documented machine learning pipeline that can be used to predict diabetes in patients.
- The best-performing model registered in Azure ML, ready for deployment or further experimentation.
- A clear comparison of various machine learning algorithms in the context of diabetes prediction, providing insights into which algorithm performs best under given conditions.

## **Conclusion:**

By the end of this project, you will have a comprehensive understanding of how to build, track, and deploy machine learning models using Azure ML and MLflow. The project's primary goal is to identify the best model for predicting diabetes, ensuring it is ready for deployment and further use in real-world applications.


In [1]:
# Verify that the necessary packages are installed
!pip show azure-ai-ml
!pip show mlflow

from importlib import metadata

# Distribution names as published on PyPI
required_packages = ['azure-ai-ml', 'mlflow']

# Look up a distribution in the installed-package metadata.
# importlib.import_module is not suitable here: 'azure-ai-ml' is a
# distribution name, not an importable module name (that is azure.ai.ml).
def check_package_installed(package_name):
    try:
        version = metadata.version(package_name)
        print(f"Package '{package_name}' is installed (version {version}).")
    except metadata.PackageNotFoundError:
        print(f"Package '{package_name}' is NOT installed. You may need to install it using pip.")

# Verify each package
for package in required_packages:
    check_package_installed(package)

Name: azure-ai-ml
Version: 1.19.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/azureml_py38/lib/python3.9/site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, opencensus-ext-logging, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by: 
Name: mlflow
Version: 2.15.1
Summary: MLflow is an open source platform for the complete machine learning lifecycle
Home-page: 
Author: 
Author-email: 
License: Copyright 2018 Databricks, Inc.  All rights reserved.
        
                                        Apache License
                                   Version 2.0, January 2004
                                http

In [2]:
# Connect to your Azure ML workspace
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

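try:
    credential = DefaultAzureCredential()
    # Check that the default credential can actually obtain a token
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to an interactive browser login if the default credential fails
    credential = InteractiveBrowserCredential()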

In [3]:
# Get a handle to the workspace
ml_client = MLClient.from_config(credential=credential)

Found the config file in: /config.json
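
`MLClient.from_config` reads the workspace details from a `config.json` file. The sketch below checks such a file for the three keys a workspace config normally carries (`subscription_id`, `resource_group`, `workspace_name`) before connecting. The helper name `missing_workspace_keys` and the placeholder values are hypothetical and are not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path

# Keys that an Azure ML workspace config.json is expected to carry.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_workspace_keys(config_path):
    """Return the required keys that are absent or empty in config_path."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with placeholder values written to a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text(
        json.dumps({
            "subscription_id": "<subscription-id>",
            "resource_group": "<resource-group>",
        }),
        encoding="utf-8",
    )
    demo_missing = missing_workspace_keys(demo_path)
    print(f"Missing keys: {demo_missing}")
```

In the notebook, the same helper could be pointed at the real `config.json` before calling `MLClient.from_config`, so that an incomplete file is reported up front.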


In [4]:
# Import necessary libraries
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

In [5]:
# Load the diabetes dataset
print("Reading data...")
df = pd.read_csv('./data/diabetes.csv')
df.head()

Reading data...


|   | PatientID | Pregnancies | PlasmaGlucose | DiastolicBloodPressure | TricepsThickness | SerumInsulin | BMI | DiabetesPedigree | Age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1354778 | 0 | 171 | 80 | 34 | 23 | 43.509726 | 1.213191 | 21 | 0 |
| 1 | 1147438 | 8 | 92 | 93 | 47 | 36 | 21.240576 | 0.158365 | 23 | 0 |
| 2 | 1640031 | 7 | 115 | 47 | 52 | 35 | 41.511523 | 0.079019 | 23 | 0 |
| 3 | 1883350 | 9 | 103 | 78 | 25 | 304 | 29.582192 | 1.28287 | 43 | 1 |
| 4 | 1424119 | 1 | 85 | 59 | 27 | 35 | 42.604536 | 0.549542 | 22 | 0 |


In [6]:
# Set up the experiment
experiment_name = "mlflow-experiment-diabetes"
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='', creation_time=1723789911873, experiment_id='06172f1b-0794-4170-a26f-7350eab418ea', last_update_time=None, lifecycle_stage='active', name='mlflow-experiment-diabetes', tags={}>

In [7]:
# Log the dataset manually as an artifact
with mlflow.start_run(run_name="dataset_logging"):
    # Save the dataset to a CSV file
    dataset_path = "./data/diabetes_logged.csv"
    df.to_csv(dataset_path, index=False)
    
    # Log the dataset file
    mlflow.log_artifact(dataset_path, artifact_path="datasets")



In [8]:
# Prepare the data for model training (PatientID is an identifier, not a feature)
feature_columns = [
    'Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure', 'TricepsThickness',
    'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age'
]
X, y = df[feature_columns].values, df['Diabetic'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
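
The split above shuffles rows without regard to the label. If the classes turn out to be imbalanced (the notebook does not show the class distribution), passing `stratify` to `train_test_split` keeps the positive rate roughly equal in both partitions. A minimal sketch on synthetic labels, where the one-third positive rate is an arbitrary assumption chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 rows, 8 features, 100 positives and 200 negatives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = np.array([1] * 100 + [0] * 200)

# stratify=y_demo asks the splitter to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

print(f"Positive rate, full set:  {y_demo.mean():.3f}")
print(f"Positive rate, train set: {y_tr.mean():.3f}")
print(f"Positive rate, test set:  {y_te.mean():.3f}")
```

Applied to the notebook, this would mean adding `stratify=y` to the existing call. That changes which rows land in the test set, so the reported accuracies would differ from the ones shown here.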

In [9]:
# Initialize models
models = {
    "Logistic Regression": LogisticRegression(C=1/0.1, solver="liblinear"),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100)
}

best_accuracy = 0
best_model_run = None
best_model_name = None

In [10]:
# Train and log models manually
for model_name, model in models.items():
    with mlflow.start_run(run_name=model_name):
        # Log key hyperparameters, read from the estimator so they cannot
        # drift from the values set in the `models` dictionary
        params = model.get_params()
        if model_name == "Logistic Regression":
            mlflow.log_param("C", params["C"])
            mlflow.log_param("solver", params["solver"])
        elif model_name == "Decision Tree":
            mlflow.log_param("criterion", params["criterion"])
        elif model_name == "Random Forest":
            mlflow.log_param("n_estimators", params["n_estimators"])

        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)

        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        mlflow.log_metric("accuracy", accuracy)

        # Save the model to a file
        model_filename = f"{model_name.replace(' ', '')}_model.pkl"
        joblib.dump(model, model_filename)

        # Log the model as an artifact
        mlflow.log_artifact(model_filename, artifact_path="models")

        # Check if this model is the best so far
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_model_run = mlflow.active_run().info.run_id
            best_model_name = model_name
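
The loop above records accuracy only. For a medical classifier it can also be useful to see precision and recall, since accuracy alone does not show how many diabetic patients are missed. A minimal sketch using scikit-learn's metric functions follows. The toy labels are made up for illustration and the helper name `classification_metrics` is hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels and predictions, invented for illustration only.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]


def classification_metrics(y_true, y_pred):
    """Collect several binary-classification metrics in one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


demo_metrics = classification_metrics(y_true_demo, y_pred_demo)
for name, value in demo_metrics.items():
    print(f"{name}: {value:.3f}")
```

Inside the training loop, a dictionary like this could be passed to `mlflow.log_metrics` in place of the single `mlflow.log_metric("accuracy", ...)` call, so that all four values appear next to each run.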

In [11]:
import re

# Rebuild the filename under which the best model was saved
model_filename = f"{best_model_name.replace(' ', '')}_model.pkl"

# Build the URI of the pickled model inside the best run's artifacts
model_uri = f"runs:/{best_model_run}/models/{model_filename}"

# Clean the model name to ensure it's valid
def clean_model_name(name):
    return re.sub(r'[^a-zA-Z0-9_-]', '', name.replace(' ', '_'))

cleaned_model_name = clean_model_name(f"{best_model_name}_Model")

# Verify the model URI exists
try:
    artifacts = mlflow.artifacts.download_artifacts(artifact_uri=f"runs:/{best_model_run}/models")
    print(f"Available artifacts: {artifacts}")
except Exception as e:
    print(f"Error verifying model URI: {str(e)}")
    print("Please check that the model file exists and the URI is correct.")

# Attempt to register the model
try:
    model_details = mlflow.register_model(
        model_uri=model_uri,
        name=cleaned_model_name
    )
    print(f"Model registered: {model_details.name} version {model_details.version}")
except mlflow.exceptions.MlflowException as e:
    # mlflow.register_model adds a new version on its own when the name is
    # already registered, so that case needs no retry here
    if "RESOURCE_DOES_NOT_EXIST" in str(e):
        print(f"The specified model file does not exist. Please check the model URI: {model_uri}")
        print("Available files in the 'models' directory:")
        try:
            files = mlflow.artifacts.list_artifacts(f"runs:/{best_model_run}/models")
            for file in files:
                print(file.path)
        except Exception as e2:
            print(f"Error listing artifacts: {str(e2)}")
    else:
        print(f"Failed to register model: {str(e)}")

# Print the best model
print(f"Best model: {best_model_name} with accuracy {best_accuracy:.4f}")

Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00,  4.34it/s]
Registered model 'Random_Forest_Model' already exists. Creating a new version of this model...
2024/08/26 04:46:09 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: Random_Forest_Model, version 8
Created version '8' of model 'Random_Forest_Model'.


Available artifacts: /tmp/tmp12_jd1p4/models
Model registered: Random_Forest_Model version 8
Best model: Random Forest with accuracy 0.9323
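
Because the model was logged as a plain joblib pickle rather than in MLflow's model format, a consumer would most likely download the `.pkl` artifact and load it with `joblib.load`. The sketch below shows that round trip with a stand-in model trained on synthetic data. The stand-in model, the temporary path, and the variable names are assumptions for illustration. In the notebook, the local path would instead come from `mlflow.artifacts.download_artifacts`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the trained model: 8 synthetic features, binary label.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
stand_in_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # In the notebook this path would come from
    # mlflow.artifacts.download_artifacts(artifact_uri=model_uri).
    local_path = os.path.join(tmp_dir, "RandomForest_model.pkl")
    joblib.dump(stand_in_model, local_path)

    # Reload the pickle the way a consumer of the artifact would.
    restored_model = joblib.load(local_path)

# Score one row that has the eight feature columns in training order.
sample = X_demo[:1]
prediction = restored_model.predict(sample)
probability = restored_model.predict_proba(sample)[0, 1]
print(f"Predicted class: {prediction[0]}, P(class 1) = {probability:.2f}")
```

The loading environment needs a scikit-learn version compatible with the one used for training, since pickled estimators are not guaranteed to load across versions.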
