A healthcare clinic is looking to enhance its decision-making process for predicting the likelihood of diabetes in patients based on a set of health-related attributes. Diabetes is a growing concern, especially among certain age groups and populations with risk factors such as high blood pressure, obesity, and a sedentary lifestyle. The clinic has collected patient data, including metrics like plasma glucose concentration, blood pressure, body mass index (BMI), and others, which can be used to predict diabetes. However, they struggle with accurately identifying high-risk patients who may need further medical intervention or lifestyle changes.

The clinic seeks to leverage machine learning models to predict diabetes risk and ultimately improve patient care. By accurately predicting the probability of a patient having diabetes based on their health metrics, the clinic aims to provide early diagnosis, enabling timely intervention and better healthcare outcomes.

**Objective:**
You have been hired as a data scientist to help the clinic build a predictive model that classifies whether a patient is likely to have diabetes based on their health attributes. Your goal is to develop a machine learning pipeline that uses a Gradient Boosting classifier to analyze patient data and predict diabetes outcomes (class label 1: diabetes, 0: no diabetes). The final deliverable should include deploying this model so that it enables real-time predictions for clinical use.

The dataset consists of health-related attributes of patients, with the following features:

- Pregnancies (preg): Number of times the patient has been pregnant
- Plasma glucose concentration (plas): Plasma glucose concentration in an oral glucose tolerance test
- Blood pressure (pres): Diastolic blood pressure (mm Hg)
- Skin thickness (skin): Triceps skin fold thickness (mm)
- Serum insulin (test): 2-Hour serum insulin (mu U/ml)
- Body mass index (mass): BMI (weight in kg / (height in m)²)
- Diabetes pedigree function (pedi): A function that scores the likelihood of diabetes based on family history
- Age (age): Age of the patient (years)
- Class (class): Diabetes outcome (1: diabetes, 0: no diabetes)
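
The short column names in the list above are what the later scripts rely on (`class` as the target, the other eight as features). A minimal sketch of how a CSV header could be checked against that list before any training is attempted; `EXPECTED_COLUMNS` and `missing_columns` are illustrative names that do not appear in the notebook, and the sample row holds made-up values rather than clinic data.

```python
import csv
import io

# Short column names taken from the feature list; `class` is the target.
EXPECTED_COLUMNS = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Illustrative header plus one made-up row.
    sample = "preg,plas,pres,skin,test,mass,pedi,age,class\n2,120,70,20,80,25.0,0.5,30,0\n"
    print(missing_columns(sample))
```

An empty list means every expected column is present; anything else names the columns that would make the later `df.drop(columns=['class'])` or the scaler step fail.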

**Note: When working in the Google Colab environment, files on your local machine are not accessible. Colab runs as a virtual machine, so every file the notebook creates lives on the Colab server. Those files are not visible from your PC, although a shell command that lists the folders will show them. To work around this, this notebook stores the folders and the code in Google Drive. The Google Drive is synced to the local PC and is also mounted on the Colab server.**

In [1]:
# Mounting the google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
import os
os.makedirs('/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops', exist_ok=True)
os.makedirs("/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/data", exist_ok=True)
# Create a folder for storing the model building files
os.makedirs("/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/model_building", exist_ok=True)
# This takes a couple of minutes to be reflected locally
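
The long Drive path is repeated in every cell, which makes typos easy. A small sketch, assuming a hypothetical helper named `ensure_project_dirs`, that builds the subfolders from one base path with `pathlib`. The notebook itself creates `deployment` and `hosting` in later cells; this variant simply creates all four at once. A temporary directory stands in for the Drive path so the example runs outside Colab.

```python
from pathlib import Path
import tempfile

# Subfolder names used across the notebook.
SUBFOLDERS = ("data", "model_building", "deployment", "hosting")


def ensure_project_dirs(base_dir, subfolders=SUBFOLDERS):
    """Create each subfolder under base_dir (if missing) and return their paths."""
    base = Path(base_dir)
    created = {}
    for name in subfolders:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)  # same effect as os.makedirs(..., exist_ok=True)
        created[name] = path
    return created


if __name__ == "__main__":
    # In Colab, base_dir would be the mounted Drive project folder.
    with tempfile.TemporaryDirectory() as tmp:
        dirs = ensure_project_dirs(Path(tmp) / "self_paced_courses_1_mlops")
        print(sorted(dirs))
```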

In [11]:
repo_id = "maheshnn/PIMA-Diabetes-Prediction"  # Hugging Face repo ID: <username>/<repo name>
print(repo_id)

In [9]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/model_building/data_register.py"
from huggingface_hub.utils import RepositoryNotFoundError, HfHubHTTPError
from huggingface_hub import HfApi, create_repo
import os

repo_id = "maheshnn/PIMA-Diabetes-Prediction" 
repo_type = "dataset"

# Initialize API client
api = HfApi(token=os.getenv("HF_TOKEN"))

# Step 1: Check if the dataset repo exists; create it if it does not
try:
    api.repo_info(repo_id=repo_id, repo_type=repo_type)
    print(f"Dataset repo '{repo_id}' already exists. Using it.")
except RepositoryNotFoundError:
    print(f"Dataset repo '{repo_id}' not found. Creating new dataset repo...")
    create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
    print(f"Dataset repo '{repo_id}' created.")

api.upload_folder(
    folder_path="self_paced_courses_1_mlops/data", # Uploading the data 
    repo_id=repo_id,
    repo_type=repo_type,
)

Overwriting /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/model_building/data_register.py


**Data Preparation**

In [23]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/model_building/prep.py"
import pandas as pd
import sklearn
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from huggingface_hub import login, HfApi

api = HfApi(token=os.getenv("HF_TOKEN"))
DATASET_PATH = "hf://datasets/maheshnn/PIMA-Diabetes-Prediction/pima.csv"
df = pd.read_csv(DATASET_PATH)
print("Dataset loaded successfully.")

target_col = 'class'

# Split into X (features) and y (target)
X = df.drop(columns=[target_col])
y = df[target_col]

# Perform train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Xtrain.to_csv("Xtrain.csv",index=False)
Xtest.to_csv("Xtest.csv",index=False)
ytrain.to_csv("ytrain.csv",index=False)
ytest.to_csv("ytest.csv",index=False)


files = ["Xtrain.csv","Xtest.csv","ytrain.csv","ytest.csv"]


for file_path in files:
    api.upload_file(
        path_or_fileobj=file_path,
        path_in_repo=file_path.split("/")[-1],  # just the filename
        repo_id="maheshnn/PIMA-Diabetes-Prediction",                                    
        repo_type="dataset",
    )

Writing /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/model_building/prep.py
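
`prep.py` uses a plain random split. When one class is rarer than the other, a stratified split keeps the class ratio the same in both parts; scikit-learn exposes this through the `stratify` argument of `train_test_split`. The pure-Python sketch below only illustrates the idea on made-up labels. `stratified_split` is a hypothetical helper and is not what the script does.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    labels = [0] * 65 + [1] * 35  # made-up labels, 35% positive
    train_idx, test_idx = stratified_split(labels)
    test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
    print(len(train_idx), len(test_idx), test_positive_rate)
```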


**Model Training**
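
The training script that follows tunes the model with `scoring='recall'`. Recall is the fraction of truly diabetic patients that the model flags, TP / (TP + FN), which is a plausible choice when missing a high-risk patient is the costlier error. A minimal pure-Python sketch on made-up labels; `recall` here is an illustrative helper, not scikit-learn's `recall_score`.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no actual positives; defined as 0.0 here by convention
    return tp / (tp + fn)


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 1]  # made-up ground truth
    y_pred = [1, 0, 1, 0, 1, 1]  # made-up predictions
    # 3 of the 4 actual positives are found, so this prints 0.75
    print(recall(y_true, y_pred))
```

Note that the false positive at the fifth position does not lower recall at all, which is why the classification report in the script is still worth reading for precision.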

In [24]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/model_building/train.py"
# For data manipulation and environment access
import os
import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# For model training, tuning and evaluation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, recall_score

# For model serialization
import joblib

# For Hugging Face Hub authentication and file uploads
from huggingface_hub import login, HfApi, create_repo
from huggingface_hub.utils import RepositoryNotFoundError, HfHubHTTPError

# Authenticated Hub client; HF_TOKEN is read from the environment
api = HfApi(token=os.getenv("HF_TOKEN"))

Xtrain_path = "hf://datasets/maheshnn/PIMA-Diabetes-Prediction/Xtrain.csv"                   
Xtest_path = "hf://datasets/maheshnn/PIMA-Diabetes-Prediction/Xtest.csv"                      
ytrain_path = "hf://datasets/maheshnn/PIMA-Diabetes-Prediction/ytrain.csv"                    
ytest_path = "hf://datasets/maheshnn/PIMA-Diabetes-Prediction/ytest.csv"

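Xtrain = pd.read_csv(Xtrain_path)
Xtest = pd.read_csv(Xtest_path)
# Targets are stored as single-column CSVs; squeeze them to 1-D Series for scikit-learn
ytrain = pd.read_csv(ytrain_path).squeeze("columns")
ytest = pd.read_csv(ytest_path).squeeze("columns")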

# scale numeric features
numeric_features = [
    'preg',
    'plas',
    'pres',
    'skin',
    'test',
    'mass',
    'pedi',
    'age'
]

# Preprocessing pipeline
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features)
)

# Define GB model
gb_model = GradientBoostingClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'gradientboostingclassifier__n_estimators': [75, 100, 125],
    'gradientboostingclassifier__max_depth': [2, 3, 4],
    'gradientboostingclassifier__subsample': [0.5, 0.6]
}

# Create pipeline
model_pipeline = make_pipeline(preprocessor, gb_model)

# Grid search with cross-validation
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='recall', n_jobs=-1)
grid_search.fit(Xtrain, ytrain)


# Best model
best_model = grid_search.best_estimator_
print("Best Params:\n", grid_search.best_params_)

# Predict on training set
y_pred_train = best_model.predict(Xtrain)

# Predict on test set
y_pred_test = best_model.predict(Xtest)

# Evaluation
print("\nTraining Classification Report:")
print(classification_report(ytrain, y_pred_train))

print("\nTest Classification Report:")
print(classification_report(ytest, y_pred_test))

# Save best model
joblib.dump(best_model, "best_pima_diabetes_model_v1.joblib")

# Upload to Hugging Face
repo_id = "maheshnn/PIMA-Diabetes-Prediction"                                        
repo_type = "model"

api = HfApi(token=os.getenv("HF_TOKEN"))

# Step 1: Check if the model repo exists; create it if it does not
try:
    api.repo_info(repo_id=repo_id, repo_type=repo_type)
    print(f"Model repo '{repo_id}' already exists. Using it.")
except RepositoryNotFoundError:
    print(f"Model repo '{repo_id}' not found. Creating new model repo...")
    create_repo(repo_id=repo_id, repo_type=repo_type, private=False)
    print(f"Model repo '{repo_id}' created.")

# Step 2: Upload the serialized model file
api.upload_file(
    path_or_fileobj="best_pima_diabetes_model_v1.joblib",
    path_in_repo="best_pima_diabetes_model_v1.joblib",
    repo_id=repo_id,
    repo_type=repo_type,
)

Writing /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/model_building/train.py
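
The keys in `param_grid` carry the prefix `gradientboostingclassifier__` because `make_pipeline` names each step after its lowercased class name, and `GridSearchCV` addresses a step's parameters as `<step>__<parameter>`. To get a feel for the cost of the search, the sketch below enumerates the grid with `itertools.product`; `grid_combinations` is an illustrative helper, not part of scikit-learn.

```python
from itertools import product

# Same grid and fold count as in train.py.
param_grid = {
    "gradientboostingclassifier__n_estimators": [75, 100, 125],
    "gradientboostingclassifier__max_depth": [2, 3, 4],
    "gradientboostingclassifier__subsample": [0.5, 0.6],
}
cv_folds = 5


def grid_combinations(grid):
    """Expand a dict of value lists into one dict per parameter combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = grid_combinations(param_grid)
total_fits = len(combos) * cv_folds

if __name__ == "__main__":
    # 3 * 3 * 2 = 18 combinations, each fitted on 5 folds = 90 fits
    print(len(combos), total_fits)
```

With the default `refit=True`, the search also refits the best combination once on the full training set, so `best_estimator_` comes from one additional fit beyond those 90.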


**Deployment**

In [25]:
os.makedirs("/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/deployment", exist_ok=True)

In [26]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/deployment/Dockerfile"
# Use the official Python 3.9 image as the base
FROM python:3.9

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
	PATH=/home/user/.local/bin:$PATH

WORKDIR $HOME/app

COPY --chown=user . $HOME/app

# Define the command to run the Streamlit app on port "8501" and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

Writing /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/deployment/Dockerfile


In [28]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/deployment/app.py"
import streamlit as st
import pandas as pd
from huggingface_hub import hf_hub_download
import joblib

# Download and load the model from the Hugging Face model repo
model_path = hf_hub_download(repo_id="maheshnn/PIMA-Diabetes-Prediction", filename="best_pima_diabetes_model_v1.joblib")
model = joblib.load(model_path)

# Streamlit UI for PIMA Diabetes Prediction
st.title("PIMA Diabetes Prediction App")
st.write("""
This application predicts the likelihood of a patient having diabetes based on their health attributes.
Please enter the patient's health metrics below to get a prediction.
""")

# User inputs
preg = st.number_input("Number of Pregnancies", min_value=0, max_value=20, value=1)
plas = st.number_input("Plasma Glucose Concentration", min_value=0, max_value=300, value=120)
pres = st.number_input("Diastolic Blood Pressure (mm Hg)", min_value=0, max_value=200, value=70)
skin = st.number_input("Triceps Skinfold Thickness (mm)", min_value=0, max_value=100, value=20)
test = st.number_input("2-Hour Serum Insulin (mu U/ml)", min_value=0, max_value=900, value=80)
mass = st.number_input("Body Mass Index (BMI)", min_value=0.0, max_value=70.0, value=25.0, step=0.1)
pedi = st.number_input("Diabetes Pedigree Function", min_value=0.0, max_value=2.5, value=0.5, step=0.01)
age = st.number_input("Age", min_value=1, max_value=120, value=30)

# Assemble input into DataFrame
input_data = pd.DataFrame([{
    'preg': preg,
    'plas': plas,
    'pres': pres,
    'skin': skin,
    'test': test,
    'mass': mass,
    'pedi': pedi,
    'age': age
}])

# Prediction button
if st.button("Predict Diabetes"):
    prediction = model.predict(input_data)[0]
    result = "Diabetic" if prediction == 1 else "Non-Diabetic"
    st.subheader("Prediction Result:")
    st.success(f"The model predicts: **{result}**")

Writing /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/deployment/app.py


In [29]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/deployment/requirements.txt"
pandas==2.2.2
huggingface_hub==0.32.6
streamlit==1.43.2
joblib==1.5.1
scikit-learn==1.6.0

Writing /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/deployment/requirements.txt


In [30]:
os.makedirs("/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/hosting", exist_ok=True)

In [31]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/hosting/hosting.py"
from huggingface_hub import HfApi
import os

api = HfApi(token=os.getenv("HF_TOKEN"))
api.upload_folder(
    folder_path="self_paced_courses_1_mlops/deployment",
    repo_id="maheshnn/PIMA-Diabetes-Prediction",
    repo_type="space",
    path_in_repo="",                          # optional: subfolder path inside the repo
)

Writing /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/hosting/hosting.py


In [2]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/requirements.txt"
huggingface_hub==0.32.6
datasets==3.6.0
pandas==2.2.2
scikit-learn==1.6.0

Writing /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/self_paced_courses_1_mlops/requirements.txt
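
The model is serialized with `joblib` in the training job and loaded again inside the Space, so the two `requirements.txt` files should agree on the packages they share, scikit-learn in particular. A small sketch that compares the pins of both files; `parse_pins` and `pin_mismatches` are illustrative helpers, and the two strings copy the file contents written above.

```python
def parse_pins(requirements_text):
    """Map package name -> pinned version for every `name==version` line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def pin_mismatches(a_text, b_text):
    """Return {package: (version_a, version_b)} for shared packages pinned differently."""
    a, b = parse_pins(a_text), parse_pins(b_text)
    return {name: (a[name], b[name]) for name in sorted(a.keys() & b.keys()) if a[name] != b[name]}


DEPLOYMENT = """pandas==2.2.2
huggingface_hub==0.32.6
streamlit==1.43.2
joblib==1.5.1
scikit-learn==1.6.0
"""

PIPELINE = """huggingface_hub==0.32.6
datasets==3.6.0
pandas==2.2.2
scikit-learn==1.6.0
"""

if __name__ == "__main__":
    # Prints {} because the shared packages are pinned identically.
    print(pin_mismatches(DEPLOYMENT, PIPELINE))
```

One thing the check cannot see: `joblib` is pinned only for deployment. In the pipeline it is installed as a dependency of scikit-learn, so its version there may differ from 1.5.1.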


In [7]:
import os
os.makedirs("/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/.github/workflows", exist_ok=True)

In [8]:
%%writefile "/content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/.github/workflows/pipeline.yml"
name: MLOps pipeline

on:
  workflow_dispatch:

jobs:

  register-dataset:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: pip install -r self_paced_courses_1_mlops/requirements.txt
      - name: Upload Dataset to Hugging Face Hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python self_paced_courses_1_mlops/model_building/data_register.py

  data-prep:
    needs: register-dataset
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: pip install -r self_paced_courses_1_mlops/requirements.txt
      - name: Run Data Preparation
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python self_paced_courses_1_mlops/model_building/prep.py


  model-training:
    needs: data-prep
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: pip install -r self_paced_courses_1_mlops/requirements.txt
      - name: Model Building
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python self_paced_courses_1_mlops/model_building/train.py


  deploy-hosting:
    runs-on: ubuntu-latest
    needs: [model-training,data-prep,register-dataset]
    steps:
      - uses: actions/checkout@v3
      - name: Install Dependencies
        run: pip install -r self_paced_courses_1_mlops/requirements.txt
      - name: Push files to Frontend Hugging Face Space
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python self_paced_courses_1_mlops/hosting/hosting.py

Writing /content/drive/My Drive/PGP-AI-UT-Austin/Week11-MLOps/PIMA_Diabetes_Prediction/.github/workflows/pipeline.yml
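
The `needs` keys in the workflow define a dependency graph between the four jobs. The sketch below feeds that graph to `graphlib.TopologicalSorter` (standard library, Python 3.9+) to show the order the jobs have to run in. The `needs` dictionary mirrors `pipeline.yml`; `execution_order` is an illustrative helper.

```python
from graphlib import TopologicalSorter

# job -> jobs it waits for, copied from the `needs` entries in pipeline.yml
needs = {
    "register-dataset": [],
    "data-prep": ["register-dataset"],
    "model-training": ["data-prep"],
    "deploy-hosting": ["model-training", "data-prep", "register-dataset"],
}


def execution_order(graph):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(needs))
```

Because `model-training` already depends on the two earlier jobs, listing `data-prep` and `register-dataset` again under `deploy-hosting` does not change the order; it only makes the dependency explicit.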
