This notebook demonstrates a CI/CD workflow for Machine Learning (MLOps). The workflow automates the pipeline for training, versioning, testing, and deploying machine learning models.

The main components covered:

* Train a Model: Prepare the dataset, train a machine learning model, and evaluate its performance.
* Version Control: Set up a version control system using GitHub to track changes to the codebase, models, and configurations.
* Repository Layout: Organise the repository with dedicated folders for different model versions (e.g., v1, v2).
* GitHub Actions Workflow: Create a YAML file to define the CI/CD pipeline steps, including running tests, handling artifacts, and deploying the model.
* Unit Testing: Implement unit tests to validate model functionality.
* Deployment: Deploy the model using Streamlit, which provides an interactive interface for stakeholders to make predictions with the trained models.


In this notebook, we start by training a machine learning model.
The trained model is saved with metadata in a versioned folder structure (e.g., v1/model_v1.pkl).

**GitHub Setup:**

[A separate notebook](https://colab.research.google.com/drive/1oaahB5IoubWv9OR1dXuWcK1s9bMuUu16?usp=sharing) walks you through the process of creating a GitHub repository and setting up version control.
It also includes steps to create necessary files for the CI/CD workflow, such as the .github/workflows/ci.yml file.

This separation ensures that sensitive information, such as GitHub credentials, is not pushed to the repository, maintaining security and privacy.


In [23]:
!pip install ucimlrepo

# Data access
from ucimlrepo import fetch_ucirepo
import pandas as pd

# Modelling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error  # requires scikit-learn >= 1.4

# Model persistence
import joblib



### The Student Performance Data Set
[The Student Performance](https://archive.ics.uci.edu/dataset/320/student+performance) Data Set is a collection of data gathered from secondary school students in Portugal. It was compiled to analyse the factors that influence academic success, particularly in mathematics and Portuguese language courses. The data was collected from two schools and includes a wide range of attributes related to student demographics, social and economic factors, and academic records.

### Load the data

In [24]:
# fetch dataset
student_performance = fetch_ucirepo(id=320)
# get df
student_performance_df = student_performance.data.original
# variable information
# student_performance.variables

In [25]:
student_performance_df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
644,MS,F,19,R,GT3,T,2,3,services,other,...,5,4,2,1,2,5,4,10,11,10
645,MS,F,18,U,LE3,T,3,1,teacher,services,...,4,3,4,1,1,1,4,15,15,16
646,MS,F,18,U,GT3,T,1,1,other,other,...,1,1,1,1,1,5,6,11,12,9
647,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,6,10,10,10


### Train a simple model
Here I train a basic model to predict the final grade (G3). To keep it simple, I will use only three features for prediction. The primary objective here is not to build an optimized model or achieve high performance, but rather to demonstrate the CI/CD workflow as part of an MLOps pipeline.

In [26]:
# Keep only selected columns
selected_columns = ['G1', 'studytime', 'famsup', 'G3']
student_performance_df = student_performance_df[selected_columns]
# features and target
target = 'G3'
features = [col for col in student_performance_df.columns if col != target]

In [27]:
# Handle categorical variables via one-hot encoding
data_prepared = pd.get_dummies(student_performance_df[features])
X = data_prepared
y = student_performance_df[target]

# Split into training (70%) and testing (30%) sets.
# No random_state is set, so the split and the resulting RMSE vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train a Random Forest model with default hyperparameters
# (pass e.g. n_estimators=200 to use more trees)
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)

# Evaluate the model on the held-out test set
y_pred = rf_model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
rmse

2.0614239875433307
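
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so the figure above is expressed in grade points. The sketch below spells out that formula in plain Python on made-up numbers; `rmse_by_hand` is an illustrative helper, not part of scikit-learn or of this notebook.

```python
import math

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up grades, for illustration only (not taken from the dataset)
actual = [10, 12, 14, 8]
predicted = [11, 12, 12, 9]
example_rmse = rmse_by_hand(actual, predicted)  # sqrt((1 + 0 + 4 + 1) / 4) = sqrt(1.5)
print(round(example_rmse, 4))
```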

In [None]:
# Go to CI_CD_GitHub_process to create the GitHub repository


## Save model and version
The goal is to create a function that saves the model, its performance metrics, and metadata whenever changes occur. Each time a model is retrained or modified, it will be saved under a new version and folder. The performance metrics will also be stored in JSON files for easy access and tracking.
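
One way to pick the new version label automatically is to scan the repository folder for existing `v1`, `v2`, ... directories and add one to the highest number found. The sketch below assumes that naming scheme; `next_version` is a hypothetical helper that this notebook does not define, and the temporary directory only stands in for the real repository folder.

```python
import os
import re
import tempfile

def next_version(repo_dir):
    """Return the next free version label ("v1", "v2", ...) under repo_dir."""
    pattern = re.compile(r"^v(\d+)$")
    existing = []
    if os.path.isdir(repo_dir):
        for name in os.listdir(repo_dir):
            match = pattern.match(name)
            # Only count directories named exactly v<number>
            if match and os.path.isdir(os.path.join(repo_dir, name)):
                existing.append(int(match.group(1)))
    # Compare numerically so that v10 ranks above v9
    return f"v{max(existing, default=0) + 1}"

# Demo on a throwaway directory that stands in for the repository
with tempfile.TemporaryDirectory() as demo_repo:
    os.makedirs(os.path.join(demo_repo, "v1"))
    os.makedirs(os.path.join(demo_repo, "v2"))
    os.makedirs(os.path.join(demo_repo, "tests"))  # ignored: not a version folder
    demo_next = next_version(demo_repo)
```

The returned label could then be passed as the `version` argument when saving a retrained model.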

In [None]:
# Repository folder:
# /content/drive/MyDrive/01.MLops/ci_cd_demo
# Unit test file:
# /content/drive/MyDrive/01.MLops/ci_cd_demo/tests/test_model.py
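
The contents of `tests/test_model.py` are not shown in this notebook. As a rough idea of what such a test might check, the sketch below verifies that a model returns one prediction per input row and that each prediction lies in the 0 to 20 grade range used by the dataset. `ConstantModel`, `check_predictions`, and the feature rows are all hypothetical; in the repository, the stand-in model would be replaced by the saved model loaded with `joblib.load`.

```python
class ConstantModel:
    """Stand-in for the trained model; a real test would load the saved .pkl file."""

    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        return [self.value for _ in rows]


def check_predictions(model, rows, low=0.0, high=20.0):
    """Raise AssertionError unless the model returns one in-range prediction per row."""
    predictions = list(model.predict(rows))
    assert len(predictions) == len(rows), "expected one prediction per input row"
    for value in predictions:
        assert low <= value <= high, f"prediction {value} outside [{low}, {high}]"
    return predictions


def test_model_predictions_are_valid():
    # Two made-up feature rows in the assumed order: G1, studytime, famsup_no, famsup_yes
    rows = [[12, 2, 0, 1], [8, 1, 1, 0]]
    check_predictions(ConstantModel(11.0), rows)
```

A test runner such as pytest collects functions whose names start with `test_`, so a file like this can run as one step of the CI pipeline.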

In [28]:
import os
import joblib
import json
import datetime

def save_model_and_metadata(model, metrics, version, repo_name="ci_cd_demo",
                            base_dir="/content/drive/MyDrive/01.MLops"):
    """
    Save model and metadata into the specified directory structure under the given base directory.

    Args:
        model: Trained machine learning model.
        metrics: Dictionary containing evaluation metrics (e.g., RMSE, accuracy).
        version: Model version (e.g., "v1", "v2").
        repo_name: Name of the repository (e.g., here it's "ci_cd_demo").
        base_dir: Base directory where the repository resides.
    """
    # Construct the full repository path
    repo_dir = os.path.join(base_dir, repo_name)

    # Create paths for saving
    model_dir = os.path.join(repo_dir, version)
    os.makedirs(model_dir, exist_ok=True)

    # Save model
    model_path = os.path.join(model_dir, f"model_{version}.pkl")
    joblib.dump(model, model_path)

    # Save metadata
    metadata = {
        "model_version": version,
        "metrics": metrics,
        "timestamp": str(datetime.datetime.now())
    }
    metadata_path = os.path.join(model_dir, f"metadata_{version}.json")
    with open(metadata_path, "w") as f:
        json.dump(metadata, f)

    print(f"Model saved to {model_path}")
    print(f"Metadata saved to {metadata_path}")

Call the function after each retraining run.

In [29]:
metrics = {"rmse": rmse}  # metrics to store alongside the model
save_model_and_metadata(rf_model, metrics, repo_name="ci_cd_demo", version="v1") # change to your repository name

Model saved to /content/drive/MyDrive/01.MLops/ci_cd_demo/v1/model_v1.pkl
Metadata saved to /content/drive/MyDrive/01.MLops/ci_cd_demo/v1/metadata_v1.json
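
Because each version's metrics sit in a small JSON file, they can be gathered into one dictionary to compare versions. The sketch below assumes the folder layout produced above (`<repo>/vN/metadata_vN.json`) and the `model_version` and `metrics` keys written by `save_model_and_metadata`; `collect_metrics` is a hypothetical helper, not part of the notebook.

```python
import glob
import json
import os

def collect_metrics(repo_dir):
    """Map each version label to the metrics stored in its metadata JSON file."""
    history = {}
    pattern = os.path.join(repo_dir, "v*", "metadata_v*.json")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            metadata = json.load(f)
        history[metadata["model_version"]] = metadata["metrics"]
    return history
```

Calling `collect_metrics("/content/drive/MyDrive/01.MLops/ci_cd_demo")` after the cell above would be expected to return a dictionary with a single `"v1"` entry holding the saved RMSE.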


The Colab notebook is saved in my Drive, so I copy it to the project folder. That way, any changes to this notebook are also pushed to GitHub.

In [22]:
!cp '/content/drive/MyDrive/Colab Notebooks/CI_CD_workflow_v2.ipynb' /content/drive/MyDrive/01.MLops/ci_cd_demo/

cp: cannot stat '/content/drive/MyDrive/Colab Notebooks/CI_CD_workflow_v2.ipynb': No such file or directory
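
The `cp` error above means the source file did not exist at that path when the cell ran. A Python alternative can check for the file first and report the problem without attempting the copy. This is a minimal sketch using only the standard library; `copy_notebook` is a hypothetical helper, and the notebook's file name and location still have to be confirmed by hand.

```python
import shutil
from pathlib import Path

def copy_notebook(source, target_dir):
    """Copy source into target_dir; return the new path, or None if source is missing."""
    source = Path(source)
    target_dir = Path(target_dir)
    if not source.is_file():
        print(f"Notebook not found: {source}")
        return None
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file metadata such as modification time
    return Path(shutil.copy2(source, target_dir))
```

It would be called with the same two paths as the `cp` command, for example `copy_notebook('/content/drive/MyDrive/Colab Notebooks/CI_CD_workflow_v2.ipynb', '/content/drive/MyDrive/01.MLops/ci_cd_demo')`.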
