# Iris Classifier using Vertex AI


## Overview

In this tutorial, you build a scikit-learn model and deploy it on infer in local environment using Google Cloud Storage for logging and tracking model and data


### Dataset

This tutorial uses R.A. Fisher's Iris dataset, a small and popular dataset for machine learning experiments. Each instance has four numerical features, which are different measurements of a flower, and a target label that
categorizes the flower into: **Iris setosa**, **Iris versicolour** and **Iris virginica**.

This tutorial uses [a version of the Iris dataset available in the
scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris).

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), 

## Get started

## Week 1

Setting up the ML pipeline for IRIS Classifier in Vertex AI platform using GCS as demonstrated in the lecture (Hands-on: Introduction to Google Cloud, Vertex AI) in your GCP account.

1. Activate your GCP Trial
2. Setup Vertex AI Workbench (Enable appropriate services/api as required)
3. Store Training Data in Google Storage Bucket 
4. Fetch the data from Google Storage Bucket and Successfully execute the IRIS Machine Learning Training Pipeline
5. Store the Output artifacts(Models, logs, etc) in Google cloud storage bucket with folders organized by their training execution timestamp
6. Create a new script for inference and run the inference on eval set after fetching the models from GCS Output Artifacts Bucket
7. Run this Training and inference for 2 times resulting in two output artifact folders in Google cloud storage bucket
8. (Optional) Run this pipeline for two versions of data provided in github data folder

### Install Vertex AI SDK for Python and other required packages



In [1]:

# Vertex SDK for Python
! pip3 install --upgrade --quiet  google-cloud-aiplatform

### Set Google Cloud project information
Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
PROJECT_ID = "mlops-sept25"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [3]:
BUCKET_URI = f"gs://mlops-sept25"  # @param {type:"string"}
BUCKET_NAME = "mlops-sept25"
MODEL_ARTIFACT_DIR="iris_classifier/model"

**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [4]:
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}

Creating gs://mlops-sept25/...
ServiceException: 409 A Cloud Storage bucket named 'mlops-sept25' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


Copying the v1 and v2 datasets to the storage bucket 

In [6]:
!gsutil cp data/v1/data.csv {BUCKET_URI}/data/v1/data.csv

Copying file://data/v1/data.csv [Content-Type=text/csv]...
/ [1 files][  2.6 KiB/  2.6 KiB]                                                
Operation completed over 1 objects/2.6 KiB.                                      


In [7]:
!gsutil cp data/v2/data.csv {BUCKET_URI}/data/v2/data.csv

Copying file://data/v2/data.csv [Content-Type=text/csv]...
/ [1 files][  1.3 KiB/  1.3 KiB]                                                
Operation completed over 1 objects/1.3 KiB.                                      


In [8]:
!gsutil cp data/raw/iris.csv {BUCKET_URI}/data/raw/iris.csv

Copying file://data/raw/iris.csv [Content-Type=text/csv]...
/ [1 files][  3.9 KiB/  3.9 KiB]                                                
Operation completed over 1 objects/3.9 KiB.                                      


In [5]:
IRIS_DATA = f"gs://mlops-sept25/data/raw/iris.csv"

In [6]:
DATA_V1 = f"gs://mlops-sept25/data/v1/data.csv"
DATA_V2 = f"gs://mlops-sept25/data/v2/data.csv" 

Initially the data was overwritten due to same file name, used different directories after 

### Initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

In [7]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

### Import the required libraries

In [8]:
import os
import sys

## Simple Decision Tree model
Build a Decision Tree model on iris data

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pandas.plotting import parallel_coordinates
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics

In [10]:
def train_data(dataset):
    data = pd.read_csv(dataset)
    print(data.head(5))
    
    train, test = train_test_split(data, test_size = 0.4, stratify = data['species'], random_state = 42)
    X_train = train[['sepal_length','sepal_width','petal_length','petal_width']]
    y_train = train.species
    X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]
    y_test = test.species
    
    mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
    mod_dt.fit(X_train,y_train)
    prediction=mod_dt.predict(X_test)
    print('\nThe accuracy of the Decision Tree is',"{:.3f}".format(metrics.accuracy_score(prediction,y_test)))
    
    return mod_dt

In [55]:
# data = pd.read_csv(IRIS_DATA)
# data.head(5)

In [15]:
# data = pd.read_csv(DATA_V1)
# data.head(5)

In [53]:
# train, test = train_test_split(data, test_size = 0.4, stratify = data['species'], random_state = 42)
# X_train = train[['sepal_length','sepal_width','petal_length','petal_width']]
# y_train = train.species
# X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]
# y_test = test.species

In [54]:
# mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
# mod_dt.fit(X_train,y_train)
# prediction=mod_dt.predict(X_test)
# print('The accuracy of the Decision Tree is',"{:.3f}".format(metrics.accuracy_score(prediction,y_test)))

### Upload model artifacts and custom code to Cloud Storage

Before you can deploy your model for serving, Vertex AI needs access to the following files in Cloud Storage:

* `model.joblib` (model artifact)
* `preprocessor.pkl` (model artifact)

Run the following commands to upload your files:

In [18]:
# import pickle
# import joblib

# joblib.dump(mod_dt, "artifacts/model.joblib")

In [19]:
# !gsutil cp artifacts/model.joblib {BUCKET_URI}/{MODEL_ARTIFACT_DIR}/

In [20]:
# import joblib, os, datetime
# from google.cloud import storage

# timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
# output_dir = f"{MODEL_ARTIFACT_DIR}/artifacts/{timestamp}-iris"
# os.makedirs(output_dir, exist_ok=True)

# # Save model
# joblib.dump(mod_dt, f"{output_dir}/iris_model.joblib")

# # Save metrics
# with open(f"{output_dir}/metrics.txt", "w") as f:
#     f.write(f"accuracy: {metrics.accuracy_score(prediction, y_test):.3f}\n")

# # Upload to GCS
# client = storage.Client()
# bucket = client.bucket(BUCKET_URI.split('gs://')[1])
# for file in os.listdir(output_dir):
#     blob = bucket.blob(f"{output_dir}/{file}")
#     blob.upload_from_filename(f"{output_dir}/{file}")


In [11]:
import joblib, os, datetime
from google.cloud import storage

def store_to_gcs(model):
    timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    output_dir = f"{MODEL_ARTIFACT_DIR}/artifacts/{timestamp}-iris"
    os.makedirs(output_dir, exist_ok=True)

    # Save model
    joblib.dump(mod_dt, f"{output_dir}/iris_model.joblib")

    # Save metrics
    with open(f"{output_dir}/metrics.txt", "w") as f:
        f.write(f"accuracy: {metrics.accuracy_score(prediction, y_test):.3f}\n")

    # Upload to GCS
    client = storage.Client()
    bucket = client.bucket(BUCKET_URI.split('gs://')[1])
    for file in os.listdir(output_dir):
        blob = bucket.blob(f"{output_dir}/{file}")
        blob.upload_from_filename(f"{output_dir}/{file}")


### Inference script:

In [12]:
from google.cloud import storage

def download_model(bucket_name, model_path, local_file="iris_model.joblib"):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(model_path)
    blob.download_to_filename(local_file)
    return joblib.load(local_file)

In [13]:
def get_inference(data, model_artifact):
    # Load evaluation data - for data v1
    eval_df = pd.read_csv(data)  # use test set
    X_eval = eval_df[['sepal_length','sepal_width','petal_length','petal_width']]
    
    # Load model from GCS and predict
    model = download_model(BUCKET_NAME, model_artifact)
    preds = model.predict(X_eval)
    eval_df['predictions'] = preds
    print(eval_df.head())
    
    print('\nAccuracy:', "{:.3f}".format(metrics.accuracy_score(eval_df['predictions'], eval_df['species'])))

In [65]:
# # Load evaluation data - for data v1
# eval_df = pd.read_csv(DATA_V1)  # use test set
# X_eval = eval_df[['sepal_length','sepal_width','petal_length','petal_width']]

In [43]:
# # Load model from GCS and predict
# model = download_model(BUCKET_NAME, "iris_classifier/model/artifacts/20251005-022441-iris/iris_model.joblib")
# preds = model.predict(X_eval)
# eval_df['predictions'] = preds
# print(eval_df.head())

In [42]:
# print('Accuracy:', "{:.3f}".format(metrics.accuracy_score(eval_df['predictions'], eval_df['species'])))

### Training and Inference 1

In [69]:
trained_model = train_data(IRIS_DATA)
store_to_gcs(trained_model)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

The accuracy of the Decision Tree is 0.983


In [84]:
get_inference(DATA_V1, "iris_classifier/model/artifacts/20251005-024852-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.8          4.0           1.2          0.2  setosa      setosa
1           5.7          4.4           1.5          0.4  setosa      setosa
2           5.4          3.9           1.3          0.4  setosa      setosa
3           5.1          3.5           1.4          0.3  setosa      setosa
4           5.7          3.8           1.7          0.3  setosa      setosa

Accuracy: 0.970


In [85]:
get_inference(DATA_V2, "iris_classifier/model/artifacts/20251005-024852-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 1.000


In [86]:
get_inference(IRIS_DATA, "iris_classifier/model/artifacts/20251005-024852-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 0.980


In [88]:
rm iris_model.joblib

### Taining and Inference 2

In [73]:
trained_model = train_data(DATA_V1)
store_to_gcs(trained_model)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.8          4.0           1.2          0.2  setosa
1           5.7          4.4           1.5          0.4  setosa
2           5.4          3.9           1.3          0.4  setosa
3           5.1          3.5           1.4          0.3  setosa
4           5.7          3.8           1.7          0.3  setosa

The accuracy of the Decision Tree is 0.951


In [90]:
get_inference(DATA_V2, "iris_classifier/model/artifacts/20251005-025002-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 1.000


In [91]:
get_inference(IRIS_DATA, "iris_classifier/model/artifacts/20251005-025002-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 0.980


In [89]:
get_inference(DATA_V1, "iris_classifier/model/artifacts/20251005-025002-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.8          4.0           1.2          0.2  setosa      setosa
1           5.7          4.4           1.5          0.4  setosa      setosa
2           5.4          3.9           1.3          0.4  setosa      setosa
3           5.1          3.5           1.4          0.3  setosa      setosa
4           5.7          3.8           1.7          0.3  setosa      setosa

Accuracy: 0.970


In [83]:
rm iris_model.joblib

rm: cannot remove 'iris_model.joblib': No such file or directory


### Taining and Inference 3

In [77]:
trained_model = train_data(DATA_V2)
store_to_gcs(trained_model)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

The accuracy of the Decision Tree is 1.000


In [78]:
get_inference(DATA_V1, "iris_classifier/model/artifacts/20251005-025150-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.8          4.0           1.2          0.2  setosa      setosa
1           5.7          4.4           1.5          0.4  setosa      setosa
2           5.4          3.9           1.3          0.4  setosa      setosa
3           5.1          3.5           1.4          0.3  setosa      setosa
4           5.7          3.8           1.7          0.3  setosa      setosa

Accuracy: 0.970


In [81]:
get_inference(IRIS_DATA, "iris_classifier/model/artifacts/20251005-025150-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 0.980


In [79]:
get_inference(DATA_V2, "iris_classifier/model/artifacts/20251005-025150-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 1.000


## Week 2

Incorporate DVC for the local data into the homework pipeline.
Setup DVC in IRIS Pipeline we have setup as part of Week-2 Assignment

1. Setup the git repository
2. Configure DVC to use Google Cloud storage bucket as Remote storage
3. Augment the IRIS data to simulate the data additions and start training
4. Demonstrate storing data and model files as part of DVC
5. Demonstrate the ability to traverse through data versions effortlessly using dvc checkout
* DVC Command sheet - here

#### Initialize Git Repository

In [14]:
ls

21F1001937_SEPT_2025_MLOps.ipynb  [0m[01;34mdata[0m/             iris_model.joblib
[01;34martifacts[0m/                        [01;34miris_classifier[0m/


In [19]:
# !git init

In [20]:
# !git config --global user.name "jemma-mg"
# !git config --global user.email "jemmamariyageorge@gmail.com"

In [15]:
!git status

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31m.bashrc[m
	[31m.cache/[m
	[31m.config/[m
	[31m.docker/[m
	[31m.gitconfig[m
	[31m.gsutil/[m
	[31m.ipynb_checkpoints/[m
	[31m.ipython/[m
	[31m.jupyter/[m
	[31m.local/[m
	[31m.npm/[m
	[31m21F1001937_SEPT_2025_MLOps.ipynb[m
	[31martifacts/[m
	[31mdata/[m
	[31miris_classifier/[m
	[31miris_model.joblib[m

nothing added to commit but untracked files present (use "git add" to track)


In [21]:
!touch .gitignore

In [140]:
%%bash
cat << 'EOF' > .gitignore
.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
*/.ipynb_checkpoints/*
*/.ipython/*
.jupyter/*
.local/*
.npm/*
EOF

In [141]:
cat .gitignore

.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
*/.ipynb_checkpoints/*
*/.ipython/*
.jupyter/*
.local/*
.npm/*


In [142]:
!git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   .dvc/config[m
	[31mmodified:   .gitignore[m
	[31mmodified:   21F1001937_SEPT_2025_MLOps.ipynb[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31martifacts/20251005-004237-v1/iris_model.joblib[m
	[31martifacts/20251005-005426-iris/iris_model.joblib[m
	[31mdata/v1/.gitignore[m
	[31mdata/v1/.ipynb_checkpoints/data.csv-checkpoint.dvc[m
	[31mdata/v2/.gitignore[m
	[31mdata/v2/.ipynb_checkpoints/[m
	[31miris_classifier/[m
	[31miris_model.joblib[m

no changes added to commit (use "git add" and/or "git commit -a")


In [154]:
%%bash
cat << 'EOF' >> .gitignore
iris_classifier/*
iris_model.joblib
EOF

In [105]:
cat .gitignore

.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
.jupyter/*
.local/*
.npm/*
iris_classifier/*
iris_model.joblib


In [41]:
!git status

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31m.gitignore[m
	[31m21F1001937_SEPT_2025_MLOps.ipynb[m
	[31martifacts/[m
	[31mdata/[m

nothing added to commit but untracked files present (use "git add" to track)


In [151]:
%%bash
cat << 'EOF' >> .gitignore
artifacts/*
EOF

In [152]:
cat .gitignore

.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
*/.ipynb_checkpoints/*
*/.ipython/*
.jupyter/*
.local/*
.npm/*
artifacts/*
artifacts/*


In [45]:
!git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   .gitignore[m
	[31mmodified:   21F1001937_SEPT_2025_MLOps.ipynb[m

no changes added to commit (use "git add" and/or "git commit -a")


In [46]:
!git add .
!git commit -m "Initial commit of IRIS ML pipeline"

[master 05913fd] Initial commit of IRIS ML pipeline
 2 files changed, 104 insertions(+), 12 deletions(-)


#### Configure DVC with GCS Remote

In [19]:
!pip install --quiet dvc[gcs]
!pip install --quiet dvc[gdrive]

  You can safely remove it manually.[0m[33m
[0m

In [47]:
!dvc init

Initialized DVC repository.

You can now commit the changes to git.

[31m+---------------------------------------------------------------------+
[0m[31m|[0m                                                                     [31m|[0m
[31m|[0m        DVC has enabled anonymous aggregate usage analytics.         [31m|[0m
[31m|[0m     Read the analytics documentation (and how to opt-out) here:     [31m|[0m
[31m|[0m             <[36mhttps://dvc.org/doc/user-guide/analytics[39m>              [31m|[0m
[31m|[0m                                                                     [31m|[0m
[31m+---------------------------------------------------------------------+
[0m
[33mWhat's next?[39m
[33m------------[39m
- Check out the documentation: <[36mhttps://dvc.org/doc[39m>
- Get help and share ideas: <[36mhttps://dvc.org/chat[39m>
- Star us on GitHub: <[36mhttps://github.com/iterative/dvc[39m>
[0m

In [49]:
!git commit -m "Initialize DVC"


[master 57eb339] Initialize DVC
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore


In [52]:
# !dvc remote add -f -d gcsremote gs://mlops-sept25/dvc-store


Setting 'gcsremote' as a default remote.
[0m

In [53]:
# !dvc remote add -d gcsremote gs://mlops-sept25/dvc-store

In [54]:
!dvc remote list


[32mgcsremote       [0m[32mgs://mlops-sept25/dvc-store     [0m[32m(default)[0m
[0m

In [55]:
!cat .dvc/config


[core]
    remote = gcsremote
['remote "gcsremote"']
    url = gs://mlops-sept25/dvc-store


In [60]:
# !dvc remote modify gcsremote project $PROJECT_ID

In [59]:
!gcloud auth list

                  Credentialed Accounts
ACTIVE  ACCOUNT
*       451836298879-compute@developer.gserviceaccount.com

To set the active account, run:
    $ gcloud config set account `ACCOUNT`



#### Track Iris Dataset with DVC

In [73]:
# %%bash
# cat << 'EOF' >> .gitignore
# data/*
# !data/**/*.dvc
# EOF

In [74]:
# !echo -e "data/**\n!data/**/*.dvc" >> .gitignore

In [75]:
cat .gitignore

.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
.jupyter/*
.local/*
.npm/*
iris_classifier/*
iris_model.joblib
data/*
artifacts/*
!data/**/*.dvc


In [76]:
!dvc add data/v1/data.csv

 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in data/v1/data.csv |0.00 [00:00,     ?fil[A
Adding...                                                                       [A
[31mERROR[39m:  output 'data/v1/data.csv' is already tracked by SCM (e.g. Git).
    You can remove it from Git, then add to DVC.
        To stop tracking from Git:
            git rm -r --cached 'data/v1/data.csv'
            git commit -m "stop tracking data/v1/data.csv" 
[0m

In [77]:
!git rm -r --cached 'data/v1/data.csv'

rm 'data/v1/data.csv'


In [78]:
!git commit -m "stop tracking data/v1/data.csv"

[master aa36b67] stop tracking data/v1/data.csv
 1 file changed, 102 deletions(-)
 delete mode 100644 data/v1/data.csv


In [79]:
!dvc add data/v1/data.csv

 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in data/v1/data.csv |0.00 [00:00,     ?fil[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
  0%|          |Adding data/v1/data.csv to cache      0/1 [00:00<?,     ?file/s][A
                                                                                [A
![A
  0%|          |Checking out /home/jupyter/data/v1/dat0/1 [00:00<?,    ?files/s][A
100% Adding...|████████████████████████████████████████|1/1 [00:00,  8.46file/s][A

To track the changes with git, run:

	git add data/v1/data.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true
[0m

In [80]:
!git add data/v1/data.csv.dvc .gitignore

The following paths are ignored by one of your .gitignore files:
data/v1
[33mhint: Use -f if you really want to add them.[m
[33mhint: Turn this message off by running[m
[33mhint: "git config advice.addIgnoredFile false"[m


In [81]:
!git commit -m "Track data/v1/data.csv with DVC"

[master d69a849] Track data/v1/data.csv with DVC
 1 file changed, 1 insertion(+)


In [81]:
# !git tag -a "v1.0.0" -m "Track data/v1/data.csv with DVC"

[master d69a849] Track data/v1/data.csv with DVC
 1 file changed, 1 insertion(+)


In [83]:
!pip install --quiet "dvc[gcs]"


[0m

In [84]:
!dvc push

[31mERROR[39m: unexpected error - gs is supported, but requires 'dvc-gs' to be installed: No module named 'dvc_gs'

[33mHaving any troubles?[0m Hit us up at [34mhttps://dvc.org/support[0m, we are always happy to help!
[0m

In [85]:
!pip install dvc-gs


Collecting dvc-gs
  Downloading dvc_gs-3.0.2-py3-none-any.whl.metadata (1.3 kB)
Downloading dvc_gs-3.0.2-py3-none-any.whl (10 kB)
Installing collected packages: dvc-gs
Successfully installed dvc-gs-3.0.2


In [86]:
!pip show dvc-gs

Name: dvc-gs
Version: 3.0.2
Summary: gs plugin for dvc
Home-page: 
Author: 
Author-email: Iterative <support@dvc.org>
License: Apache License 2.0
Location: /opt/conda/lib/python3.10/site-packages
Requires: dvc, gcsfs
Required-by: 


In [87]:
!dvc push


Collecting                                            |0.00 [00:00,    ?entry/s]
Pushing
Everything is up to date.
[0m

#### Augment the Dataset

In [108]:
import pandas as pd

data_v1 = pd.read_csv('data/v1/data.csv')
data_v2 = pd.read_csv('data/v2/data.csv')

# # Simulate new rows (data augmentation)
# new_data = data.sample(20, replace=True)  # duplicate some rows for example
augmented_data = pd.concat([data_v1, data_v2], ignore_index=True)

augmented_data.to_csv('data/v2/data_augmented.csv', index=False)

In [115]:
### Track augmented dataset with DVC:

!dvc add data/v2/data_augmented.csv
!git add data/v2/data_augmented.csv.dvc
!git commit -m "Add augmented Iris dataset v2"
!dvc push

 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in data/v2/data_augmented.csv |0.00 [00:00[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
  0%|          |Checking out /home/jupyter/data/v2/dat0/1 [00:00<?,    ?files/s][A
100% Adding...|████████████████████████████████████████|1/1 [00:00, 28.47file/s][A

To track the changes with git, run:

	git add data/v2/data_augmented.csv.dvc data/v2/.gitignore

To enable auto staging, run:

	dvc config core.autostage true
[0m[master 4b4b681] Add augmented Iris dataset v2
 1 file changed, 5 insertions(+)
 create mode 100644 data/v2/data_augmented.csv.dvc
Collecting                                            |2.00 [00:00,  130

In [124]:
# !dvc list .


In [120]:
!dvc status

Data and pipelines are up to date.                                              
[0m

In [97]:
!dvc checkout


Building workspace index                              |0.00 [00:00,    ?entry/s]
Comparing indexes                                    |1.00 [00:00, 1.95kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
[31mD[0m       data/v1/data.csv
[31mD[0m       data/v2/data_augmented.csv
[0m

In [164]:
!gsutil ls gs://mlops-sept25/dvc-store

gs://mlops-sept25/dvc-store/files/


In [99]:
!dvc remote list


[32mgcsremote       [0m[32mgs://mlops-sept25/dvc-store     [0m[32m(default)[0m
[0m

In [100]:
!pip install --quiet dvc-gs


In [101]:
!dvc add data/v1/data.csv
!git add data/v1/data.csv.dvc .gitignore
!git commit -m "Track Iris v1 dataset with DVC"


 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in data/v1/data.csv |0.00 [00:00,     ?fil[A
Adding...                                                                       [A
[31mERROR[39m: output 'data/v1/data.csv' does not exist: [Errno 2] No such file or directory: '/home/jupyter/data/v1/data.csv'
[0mThe following paths are ignored by one of your .gitignore files:
data/v1
[33mhint: Use -f if you really want to add them.[m
[33mhint: Turn this message off by running[m
[33mhint: "git config advice.addIgnoredFile false"[m
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   .dvc/config[m
	[31mmodified:   21F1001937_SEPT_2025_MLOps.ipynb[m

no changes added to commit (use "git add" and/or "git commit -a")


In [109]:
!dvc add data/v1/data.csv
!git add data/v1/data.csv.dvc .gitignore
!git commit -m "Track Iris v1 dataset with DVC"


 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in data/v1/data.csv |0.00 [00:00,     ?fil[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
  0%|          |Checking out /home/jupyter/data/v1/dat0/1 [00:00<?,    ?files/s][A
100% Adding...|████████████████████████████████████████|1/1 [00:00, 17.85file/s][A

To track the changes with git, run:

	git add data/v1/data.csv.dvc data/v1/.gitignore

To enable auto staging, run:

	dvc config core.autostage true
[0m[master 9195d98] Track Iris v1 dataset with DVC
 2 files changed, 5 insertions(+), 2 deletions(-)
 create mode 100644 data/v1/data.csv.dvc


In [110]:
!dvc push

Collecting                                            |2.00 [00:00,  132entry/s]
Pushing
![A
  0% Checking cache in 'mlops-sept25/dvc-store/files/md5'| |0/? [00:00<?,    ?fi[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
  0%|          |Pushing to gs                         0/2 [00:00<?,     ?file/s][A

![A[A

  0%|          |/home/jupyter/.dvc/cache/files/0.00/3.77k [00:00<?,        ?B/s][A[A

                                                                                [A[A
 50%|█████     |Pushing to gs                     1/2 [00:00<00:00,  6.93file/s][A

![A[A

  0%|          |/home/jupyter/.dvc/cache/files/0.00/2.64k [00:00<?,        ?B/s][A[A

                                                                                [A[A
Pushing               

#### Now we have two versions of data tracked in DVC

* version1 - data/v1/data.csv.dvc
* version2 - data/v2/data_augmented.csv.dvc

In [123]:
!gsutil ls gs://mlops-sept25/dvc-store/files/md5/*

gs://mlops-sept25/dvc-store/files/md5/92/:
gs://mlops-sept25/dvc-store/files/md5/92/03b75e931cbba1e74a1028025169bf

gs://mlops-sept25/dvc-store/files/md5/97/:
gs://mlops-sept25/dvc-store/files/md5/97/e5854ee4196b617ce57e311bf88962


In [114]:
!dvc pull      # fetch files from GCS


Collecting                                            |0.00 [00:00,    ?entry/s]
Fetching
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
Fetching                                                                        [A
Building workspace index                              |5.00 [00:00,  873entry/s]
Comparing indexes                                    |6.00 [00:00, 1.05kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
Everything is up to date.
[0m

In [113]:
!dvc checkout # apply correct versions to workspace

Building workspace index                              |5.00 [00:00, 17.5entry/s]
Comparing indexes                                     |6.00 [00:00,  705entry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
[0m

#### Integrate DVC into Training Pipeline

##### The joblib file contains the trained model artifacts including model weights

In [127]:
import joblib

model = joblib.load("iris_model.joblib")

In [128]:
# Access the trained tree
tree = model.tree_

# Number of nodes
print("Number of nodes:", tree.node_count)

# Feature indices used at each split
print("Feature indices:", tree.feature)

# Thresholds at each split
print("Thresholds:", tree.threshold)

# Values at each leaf (class counts)
print("Leaf values:", tree.value)


Number of nodes: 9
Feature indices: [ 3 -2  2  3 -2 -2  3 -2 -2]
Thresholds: [ 0.7        -2.          4.95000005  1.64999998 -2.         -2.
  1.69999999 -2.         -2.        ]
Leaf values: [[[0.33333333 0.33333333 0.33333333]]

 [[1.         0.         0.        ]]

 [[0.         0.5        0.5       ]]

 [[0.         0.93548387 0.06451613]]

 [[0.         1.         0.        ]]

 [[0.         0.33333333 0.66666667]]

 [[0.         0.03448276 0.96551724]]

 [[0.         0.33333333 0.66666667]]

 [[0.         0.         1.        ]]]


In [129]:
for i in range(tree.node_count):
    print(f"Node {i}: feature={tree.feature[i]}, threshold={tree.threshold[i]}, value={tree.value[i]}")


Node 0: feature=3, threshold=0.7000000029802322, value=[[0.33333333 0.33333333 0.33333333]]
Node 1: feature=-2, threshold=-2.0, value=[[1. 0. 0.]]
Node 2: feature=2, threshold=4.950000047683716, value=[[0.  0.5 0.5]]
Node 3: feature=3, threshold=1.649999976158142, value=[[0.         0.93548387 0.06451613]]
Node 4: feature=-2, threshold=-2.0, value=[[0. 1. 0.]]
Node 5: feature=-2, threshold=-2.0, value=[[0.         0.33333333 0.66666667]]
Node 6: feature=3, threshold=1.699999988079071, value=[[0.         0.03448276 0.96551724]]
Node 7: feature=-2, threshold=-2.0, value=[[0.         0.33333333 0.66666667]]
Node 8: feature=-2, threshold=-2.0, value=[[0. 0. 1.]]


##### Previously I had missed to tag the commits -> re running the parts again

In [130]:
!git log --oneline

[33m4b4b681[m[33m ([m[1;36mHEAD -> [m[1;32mmaster[m[33m)[m Add augmented Iris dataset v2
[33m9195d98[m Track Iris v1 dataset with DVC
[33md69a849[m Track data/v1/data.csv with DVC
[33maa36b67[m stop tracking data/v1/data.csv
[33m57eb339[m Initialize DVC
[33m05913fd[m Initial commit of IRIS ML pipeline
[33m3cc7d90[m Initial commit of IRIS ML pipeline


In [146]:
!cat .dvc/config

[core]
    remote = gcsremote
['remote "gcsremote"']
    url = gs://mlops-sept25/dvc-store


In [147]:
!dvc add data/v1/data.csv
!git add data/v1/data.csv.dvc .gitignore data/v1/.gitignore data/v1/.ipynb_checkpoints/data.csv-checkpoint.dvc .dvc/config 21F1001937_SEPT_2025_MLOps.ipynb 
!git commit -m "Track Iris v1 dataset with DVC"
!git tag -a "v1.1.0" -m "Data version 1 - Track data/v1/data.csv with DVC"

 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in data/v1/data.csv |0.00 [00:00,     ?fil[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
  0%|          |Checking out /home/jupyter/data/v1/dat0/1 [00:00<?,    ?files/s][A
100% Adding...|████████████████████████████████████████|1/1 [00:00, 30.36file/s][A

To track the changes with git, run:

	git add data/v1/data.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true
[0m[master c3be2db] Track Iris v1 dataset with DVC
 5 files changed, 1424 insertions(+), 42 deletions(-)
 create mode 100644 data/v1/.gitignore
 create mode 100644 data/v1/.ipynb_checkpoints/data.csv-checkpoint.dvc


In [148]:
!dvc push

Collecting                                            |3.00 [00:00,  177entry/s]
Pushing
![A
  0% Checking cache in 'mlops-sept25/dvc-store/files/md5'| |0/? [00:00<?,    ?fi[A
 50% Querying cache in 'mlops-sept25/dvc-store/files/md5'|▌|1/2 [00:00<00:00,  8[A
Pushing                                                                         [A
Everything is up to date.
[0m

In [143]:
!cat data/v1/.gitignore

/data.csv


In [144]:
!cat data/v2/.gitignore

/data_augmented.csv


In [159]:
### Track augmented dataset with DVC:

!dvc add data/v2/data_augmented.csv 
!git add data/v2/data_augmented.csv.dvc .gitignore data/v2/.gitignore data/v2/.ipynb_checkpoints/* data/v2/.ipynb_checkpoints/data.csv-checkpoint.dvc .dvc/config 21F1001937_SEPT_2025_MLOps.ipynb 
!git commit -m "Add augmented Iris dataset v2"
!git tag -a "v1.2.0" -m "Data version 2 - Track data/v2/data_augmented.csv with DVC"
!dvc push

 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in data/v2/data_augmented.csv |0.00 [00:00[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
  0%|          |Checking out /home/jupyter/data/v2/dat0/1 [00:00<?,    ?files/s][A
100% Adding...|████████████████████████████████████████|1/1 [00:00, 25.65file/s][A

To track the changes with git, run:

	git add data/v2/data_augmented.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true
[0mfatal: pathspec 'data/v2/.ipynb_checkpoints/data.csv-checkpoint.dvc' did not match any files
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore 

In [160]:
### Track augmented dataset with DVC:

!dvc add data/v2/data_augmented.csv 
!git add .

 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in data/v2/data_augmented.csv |0.00 [00:00[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
  0%|          |Checking out /home/jupyter/data/v2/dat0/1 [00:00<?,    ?files/s][A
100% Adding...|████████████████████████████████████████|1/1 [00:00, 18.44file/s][A

To track the changes with git, run:

	git add data/v2/data_augmented.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true
[0m

In [161]:
!git status

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	[32mmodified:   .gitignore[m
	[32mmodified:   21F1001937_SEPT_2025_MLOps.ipynb[m
	[32mnew file:   data/v2/.gitignore[m
	[32mnew file:   data/v2/.ipynb_checkpoints/data-checkpoint.csv[m
	[32mnew file:   data/v2/.ipynb_checkpoints/data_augmented-checkpoint.csv[m



In [163]:
!git commit -m "Add augmented Iris dataset v2"
!git tag -a "v1.2.1" -m "Data version 2 - Track data/v2/data_augmented.csv with DVC"
!dvc push

On branch master
nothing to commit, working tree clean
Collecting                                            |0.00 [00:00,    ?entry/s]
Pushing
![A
  0% Checking cache in 'mlops-sept25/dvc-store/files/md5'| |0/? [00:00<?,    ?fi[A
 50% Querying cache in 'mlops-sept25/dvc-store/files/md5'|▌|1/2 [00:00<00:00,  8[A
Pushing                                                                         [A
Everything is up to date.
[0m

In [166]:
!gsutil ls gs://mlops-sept25/dvc-store/files/md5/*

gs://mlops-sept25/dvc-store/files/md5/92/:
gs://mlops-sept25/dvc-store/files/md5/92/03b75e931cbba1e74a1028025169bf

gs://mlops-sept25/dvc-store/files/md5/97/:
gs://mlops-sept25/dvc-store/files/md5/97/e5854ee4196b617ce57e311bf88962


In [166]:
### Pu

gs://mlops-sept25/dvc-store/files/md5/92/:
gs://mlops-sept25/dvc-store/files/md5/92/03b75e931cbba1e74a1028025169bf

gs://mlops-sept25/dvc-store/files/md5/97/:
gs://mlops-sept25/dvc-store/files/md5/97/e5854ee4196b617ce57e311bf88962


#### Pulling the version 2 data from dvc with remote GCS bucket - running the training and inference

In [170]:
!git checkout "v1.2.1"
!dvc pull
!dvc checkout

M	21F1001937_SEPT_2025_MLOps.ipynb
HEAD is now at 4816e0d Add augmented Iris dataset v2
Collecting                                            |3.00 [00:00,  141entry/s]
Fetching
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
Fetching                                                                        [A
Building workspace index                              |7.00 [00:00,  808entry/s]
Comparing indexes                                    |8.00 [00:00, 1.38kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
Everything is up to date.
Building workspace index                              |7.00 [00:00, 27.5entry/s]
Comparing indexes                                    |8.00 [00:00, 1.17kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
[0m

In [171]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
import joblib, os, datetime

MODEL_DIR = "artifacts"

def train_data(dataset):
    # Load dataset
    data = pd.read_csv(dataset)
    
    # Train/test split
    train, test = train_test_split(data, test_size=0.4, stratify=data['species'], random_state=42)
    X_train = train[['sepal_length','sepal_width','petal_length','petal_width']]
    y_train = train['species']
    X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]
    y_test = test['species']
    
    # Train model
    model = DecisionTreeClassifier(max_depth=3, random_state=1)
    model.fit(X_train, y_train)
    
    # Evaluate
    prediction = model.predict(X_test)
    acc = metrics.accuracy_score(prediction, y_test)
    print(f"\nAccuracy: {acc:.3f}")
    
    # Save model and metrics
    timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    output_dir = os.path.join(MODEL_DIR, f"{timestamp}-iris")
    os.makedirs(output_dir, exist_ok=True)
    
    model_file = os.path.join(output_dir, "iris_model.joblib")
    metrics_file = os.path.join(output_dir, "metrics.txt")
    
    joblib.dump(model, model_file)
    with open(metrics_file, "w") as f:
        f.write(f"accuracy: {acc:.3f}\n")
    
    return output_dir


In [172]:
train_data("data/v2/data_augmented.csv")


Accuracy: 0.917


'artifacts/20251005-153723-iris'

In [178]:
# !sed -i '/artifacts\/20251005-153723-iris.dvc/d' .gitignore
!sed -i '/artifacts\/*/d' .gitignore


In [179]:
!cat .gitignore

.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
*/.ipynb_checkpoints/*
*/.ipython/*
.jupyter/*
.local/*
.npm/*
iris_classifier/*
iris_model.joblib


In [181]:
TIMESTAMPED_ARTIFACT="artifacts/20251005-153723-iris"

# Replace <timestamped-folder> with the folder returned by train_data()
!dvc add {TIMESTAMPED_ARTIFACT}

# Track DVC metadata in Git
!git add {TIMESTAMPED_ARTIFACT}.dvc .gitignore
!git commit -m "Add model trained on v2 dataset with metrics"

# Push to DVC remote (GCS)
!dvc push


 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in artifacts/20251005-153723-iris |0.00 [0[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
  0%|          |Adding artifacts/20251005-153723-iris 0/2 [00:00<?,     ?file/s][A
                                                                                [A
![A
Checking out /home/jupyter/artifacts/20251005-153723-iris |0.00 [00:00,    ?file[A
100% Adding...|████████████████████████████████████████|1/1 [00:00, 28.74file/s][A

To track the changes with git, run:

	git add artifacts/.gitignore artifacts/20251005-153723-iris.dvc

To enable auto staging, run:

	dvc config core.autostage true
[0m[detached HEAD 1b8d28b] 

In [180]:
!git status

[31mHEAD detached at [mv1.2.1
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   .gitignore[m
	[31mmodified:   21F1001937_SEPT_2025_MLOps.ipynb[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31martifacts/20251005-153723-iris/[m
	[31mdata/v1/.ipynb_checkpoints/data.csv[m

no changes added to commit (use "git add" and/or "git commit -a")


In [190]:
TIMESTAMPED_ARTIFACT="artifacts/20251005-153723-iris"

# Replace <timestamped-folder> with the folder returned by train_data()
!dvc add {TIMESTAMPED_ARTIFACT}

# Track DVC metadata in Git
!git add .
!git commit -m "Add model trained on v2 dataset with metrics"
!git tag -a "v1.2.2" -m "Data version 2 - Artifacts added to DVC"

# Push to DVC remote (GCS)
!dvc push


 [?25l[32m⠋[0m Checking graph
Adding...                                                                       
![A
Collecting files and computing hashes in artifacts/20251005-153723-iris |0.00 [0[A
                                                                                [A
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
                                                                                [A
![A
Checking out /home/jupyter/artifacts/20251005-153723-iris |0.00 [00:00,    ?file[A
100% Adding...|████████████████████████████████████████|1/1 [00:00, 31.69file/s][A

To track the changes with git, run:

	git add artifacts/.gitignore artifacts/20251005-153723-iris.dvc

To enable auto staging, run:

	dvc config core.autostage true
[0m[detached HEAD f03e76e] Add model trained on v2 dataset with metrics
 2 files changed, 103 insertions(+)
 create mode 100644 artifacts/.gitignore
 create mode 100644 data/v1/.ipynb_checkpoints/data

In [191]:
!git checkout "v1.2.2"
!dvc pull
!dvc checkout

any of your branches:

  f03e76e Add model trained on v2 dataset with metrics

If you want to keep it by creating a new branch, this may be a good time
to do so with:

 git branch <new-branch-name> f03e76e

HEAD is now at 1b8d28b Add model trained on v2 dataset with metrics
Collecting                                            |0.00 [00:00,    ?entry/s]
Fetching
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
Fetching                                                                        [A
Building workspace index                              |11.0 [00:00,  781entry/s]
Comparing indexes                                    |12.0 [00:00, 1.32kentry/s]
Applying changes                                      |1.00 [00:00,   166file/s]
[32mA[0m       data/v1/.ipynb_checkpoints/data.csv
1 file added
Building workspace index                              |12.0 [00:00, 39.0entry/s]
Comparing indexes                                    |12.0 [00:00, 1.03k

In [194]:
# Pull the versioned model folder from DVC remote
!dvc pull artifacts/20251005-153723-iris
!dvc checkout


Collecting                                            |3.00 [00:00,  132entry/s]
Fetching
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
Fetching                                                                        [A
Building workspace index                              |5.00 [00:00,  982entry/s]
Comparing indexes                                    |5.00 [00:00, 1.18kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
Everything is up to date.
Building workspace index                              |12.0 [00:00, 34.7entry/s]
Comparing indexes                                     |12.0 [00:00,  953entry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
[0m

In [195]:
import joblib
import os

def get_inference(data_file, model_folder):
    # Load evaluation data
    eval_df = pd.read_csv(data_file)
    X_eval = eval_df[['sepal_length','sepal_width','petal_length','petal_width']]
    
    # Load the model from local DVC folder
    model_file = os.path.join(model_folder, "iris_model.joblib")
    model = joblib.load(model_file)
    
    preds = model.predict(X_eval)
    eval_df['predictions'] = preds
    print(eval_df.head())
    
    acc = metrics.accuracy_score(eval_df['predictions'], eval_df['species'])
    print(f"\nAccuracy: {acc:.3f}")


In [196]:
get_inference("data/v2/data_augmented.csv", TIMESTAMPED_ARTIFACT)

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.8          4.0           1.2          0.2  setosa      setosa
1           5.7          4.4           1.5          0.4  setosa      setosa
2           5.4          3.9           1.3          0.4  setosa      setosa
3           5.1          3.5           1.4          0.3  setosa      setosa
4           5.7          3.8           1.7          0.3  setosa      setosa

Accuracy: 0.953


#### Demonstrate the ability to traverse through data versions effortlessly using dvc checkout

In [197]:
!git checkout "v1.1.0"
!dvc pull
!dvc checkout

error: Your local changes to the following files would be overwritten by checkout:
	21F1001937_SEPT_2025_MLOps.ipynb
Please commit your changes or stash them before you switch branches.
Aborting
Collecting                                            |0.00 [00:00,    ?entry/s]
Fetching
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
Fetching                                                                        [A
Building workspace index                             |12.0 [00:00, 1.23kentry/s]
Comparing indexes                                    |12.0 [00:00, 1.34kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
Everything is up to date.
Building workspace index                              |12.0 [00:00, 32.4entry/s]
Comparing indexes                                    |12.0 [00:00, 1.01kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
[0m

In [198]:
!git status

[31mHEAD detached at [mv1.2.2
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   21F1001937_SEPT_2025_MLOps.ipynb[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31martifacts/20251005-153723-iris/[m
	[31mdata/v1/.ipynb_checkpoints/data.csv[m

no changes added to commit (use "git add" and/or "git commit -a")


In [199]:
!git add .
!git commit -m "add part to traverse through data versions"

[detached HEAD 86690b0] add part to traverse through data versions
 3 files changed, 819 insertions(+), 29 deletions(-)
 create mode 100644 artifacts/20251005-153723-iris/metrics.txt
 create mode 100644 data/v1/.ipynb_checkpoints/data.csv


In [201]:
!dvc pull data/v1/data.csv
!dvc checkout

Collecting                                            |1.00 [00:00, 71.8entry/s]
Fetching
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
Fetching                                                                        [A
Building workspace index                              |3.00 [00:00,  740entry/s]
Comparing indexes                                    |4.00 [00:00, 1.03kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
Everything is up to date.
Building workspace index                              |12.0 [00:00, 42.9entry/s]
Comparing indexes                                    |12.0 [00:00, 1.03kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
[0m

In [202]:
!git log --oneline

[33m86690b0[m[33m ([m[1;36mHEAD[m[33m)[m add part to traverse through data versions
[33m1b8d28b[m[33m ([m[1;33mtag: v1.2.2[m[33m)[m Add model trained on v2 dataset with metrics
[33m4816e0d[m[33m ([m[1;33mtag: v1.2.1[m[33m, [m[1;32mmaster[m[33m)[m Add augmented Iris dataset v2
[33mc3be2db[m[33m ([m[1;33mtag: v1.2.0[m[33m, [m[1;33mtag: v1.1.0[m[33m)[m Track Iris v1 dataset with DVC
[33m4b4b681[m[33m ([m[1;33mtag: v1.0.0[m[33m)[m Add augmented Iris dataset v2
[33m9195d98[m Track Iris v1 dataset with DVC
[33md69a849[m Track data/v1/data.csv with DVC
[33maa36b67[m stop tracking data/v1/data.csv
[33m57eb339[m Initialize DVC
[33m05913fd[m Initial commit of IRIS ML pipeline
[33m3cc7d90[m Initial commit of IRIS ML pipeline


## Week 4

1. Setup IRIS homework pipeline into a GitHub repository with two branches dev and main
2. create evaluation and data validation unit tests using pytest or unittest
3. for evaluation and testing, configure the Continuous Integration (CI) with GitHub Actions to fetch the model and data needed for evaluation from DVC configured in Week-3
4. push inclusion of pytest code changes to dev branch and raise Pull Request to main branch
5. Every branch should have its own CI on push or PR merge
6. Run a sanity test using GitHub actions printing a report as a comment using cml.

In [1]:
!git status

[31mHEAD detached from [mv1.2.2
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   21F1001937_SEPT_2025_MLOps.ipynb[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31mtrain.py[m

no changes added to commit (use "git add" and/or "git commit -a")


In [2]:
!git log --oneline

[33m86690b0[m[33m ([m[1;36mHEAD[m[33m)[m add part to traverse through data versions
[33m1b8d28b[m[33m ([m[1;33mtag: v1.2.2[m[33m)[m Add model trained on v2 dataset with metrics
[33m4816e0d[m[33m ([m[1;33mtag: v1.2.1[m[33m, [m[1;32mmaster[m[33m)[m Add augmented Iris dataset v2
[33mc3be2db[m[33m ([m[1;33mtag: v1.2.0[m[33m, [m[1;33mtag: v1.1.0[m[33m)[m Track Iris v1 dataset with DVC
[33m4b4b681[m[33m ([m[1;33mtag: v1.0.0[m[33m)[m Add augmented Iris dataset v2
[33m9195d98[m Track Iris v1 dataset with DVC
[33md69a849[m Track data/v1/data.csv with DVC
[33maa36b67[m stop tracking data/v1/data.csv
[33m57eb339[m Initialize DVC
[33m05913fd[m Initial commit of IRIS ML pipeline
[33m3cc7d90[m Initial commit of IRIS ML pipeline


In [3]:
!git checkout "v1.2.2"
!dvc pull
!dvc checkout

error: Your local changes to the following files would be overwritten by checkout:
	21F1001937_SEPT_2025_MLOps.ipynb
Please commit your changes or stash them before you switch branches.
Aborting
Collecting                                            |0.00 [00:00,    ?entry/s]
Fetching
![A
  0% Checking cache in '/home/jupyter/.dvc/cache/files/md5'| |0/? [00:00<?,    ?[A
Fetching                                                                        [A
Building workspace index                             |12.0 [00:00, 1.73kentry/s]
Comparing indexes                                    |12.0 [00:00, 1.04kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
Everything is up to date.
Building workspace index                              |12.0 [00:00, 69.8entry/s]
Comparing indexes                                    |12.0 [00:00, 2.24kentry/s]
Applying changes                                      |0.00 [00:00,     ?file/s]
[0m

In [8]:
!git remote -v

In [6]:
!git status

[31mHEAD detached from [mv1.2.2
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   21F1001937_SEPT_2025_MLOps.ipynb[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31mtrain.py[m

no changes added to commit (use "git add" and/or "git commit -a")


In [10]:
# !git config --global user.name "jemma-mg"
# !git config --global user.email "jemmamariyageorge@gmail.com"

In [12]:
!git remote remove origin

error: No such remote: 'origin'


In [13]:
# https://github.com/jemma-mg/mlops-learning.git

In [15]:
!git remote add origin https://github.com/jemma-mg/mlops-learning.git

In [16]:
!git remote -v

origin	https://github.com/jemma-mg/mlops-learning.git (fetch)
origin	https://github.com/jemma-mg/mlops-learning.git (push)


In [20]:
!git branch -M main

fatal: Invalid branch name: 'HEAD'


In [18]:
!git push -u origin main

error: src refspec main does not match any
[31merror: failed to push some refs to 'https://github.com/jemma-mg/mlops-learning.git'
[m