# Iris Classifier using Vertex AI


## Overview

In this tutorial, you build a scikit-learn model and run inference with it in a local environment, using Google Cloud Storage to log and track the model and data.


### Dataset

This tutorial uses R.A. Fisher's Iris dataset, a small and popular dataset for machine learning experiments. Each instance has four numerical features, which are different measurements of a flower, and a target label that categorizes the flower into one of three species: **Iris setosa**, **Iris versicolor**, and **Iris virginica**.

This tutorial uses [a version of the Iris dataset available in the
scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris).

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing).

## Get started

## Week 1

Set up the ML pipeline for the IRIS classifier on the Vertex AI platform using GCS, as demonstrated in the lecture (Hands-on: Introduction to Google Cloud, Vertex AI), in your GCP account.

1. Activate your GCP trial.
2. Set up Vertex AI Workbench (enable the required services and APIs).
3. Store the training data in a Google Cloud Storage bucket.
4. Fetch the data from the bucket and successfully execute the IRIS machine learning training pipeline.
5. Store the output artifacts (models, logs, etc.) in a Google Cloud Storage bucket, with folders organized by training execution timestamp.
6. Create a new script for inference, and run inference on the eval set after fetching the models from the GCS output artifacts bucket.
7. Run training and inference twice, resulting in two output artifact folders in the Google Cloud Storage bucket.
8. (Optional) Run this pipeline for the two versions of data provided in the GitHub data folder.

### Install Vertex AI SDK for Python and other required packages



In [1]:

# Vertex SDK for Python
! pip3 install --upgrade --quiet  google-cloud-aiplatform

### Set Google Cloud project information
Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
PROJECT_ID = "mlops-sept25"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [3]:
BUCKET_NAME = "mlops-sept25"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"
MODEL_ARTIFACT_DIR = "iris_classifier/model"

**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [4]:
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}

Creating gs://mlops-sept25/...
ServiceException: 409 A Cloud Storage bucket named 'mlops-sept25' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
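
The 409 error above shows that `gsutil mb` fails when the bucket name is already taken. As a minimal sketch of an alternative, the check-then-create logic can be written in Python. The helper name `ensure_bucket` is hypothetical, and the `client` argument is assumed to behave like `google.cloud.storage.Client`, whose `lookup_bucket()` returns `None` for a missing bucket. If the name belongs to another project, the real client may raise a permission error instead, which this sketch does not handle.

```python
def ensure_bucket(client, bucket_name, location):
    """Return (bucket, created), creating the bucket only if it is missing.

    `client` is assumed to offer lookup_bucket() and create_bucket() with the
    same meaning as google.cloud.storage.Client.
    """
    bucket = client.lookup_bucket(bucket_name)
    if bucket is not None:
        # The bucket already exists, so there is nothing to create.
        return bucket, False
    bucket = client.create_bucket(bucket_name, location=location)
    return bucket, True
```

With a real client this could be called as `ensure_bucket(storage.Client(project=PROJECT_ID), BUCKET_NAME, LOCATION)`.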


Copy the v1, v2, and raw datasets to the storage bucket:

In [6]:
!gsutil cp data/v1/data.csv {BUCKET_URI}/data/v1/data.csv

Copying file://data/v1/data.csv [Content-Type=text/csv]...
/ [1 files][  2.6 KiB/  2.6 KiB]                                                
Operation completed over 1 objects/2.6 KiB.                                      


In [7]:
!gsutil cp data/v2/data.csv {BUCKET_URI}/data/v2/data.csv

Copying file://data/v2/data.csv [Content-Type=text/csv]...
/ [1 files][  1.3 KiB/  1.3 KiB]                                                
Operation completed over 1 objects/1.3 KiB.                                      


In [8]:
!gsutil cp data/raw/iris.csv {BUCKET_URI}/data/raw/iris.csv

Copying file://data/raw/iris.csv [Content-Type=text/csv]...
/ [1 files][  3.9 KiB/  3.9 KiB]                                                
Operation completed over 1 objects/3.9 KiB.                                      


In [5]:
IRIS_DATA = f"{BUCKET_URI}/data/raw/iris.csv"

In [6]:
DATA_V1 = f"{BUCKET_URI}/data/v1/data.csv"
DATA_V2 = f"{BUCKET_URI}/data/v2/data.csv"

Initially, both versions were uploaded under the same file name, so one overwrote the other. Each version is now stored in its own directory (`data/v1/` and `data/v2/`).
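
One way to keep versions from colliding is to build every object path from the version name. The sketch below assumes a hypothetical helper, `data_uri`, which is not part of the notebook. It reproduces the three URIs defined above.

```python
def data_uri(bucket_uri, version, filename="data.csv"):
    """Build a per-version object URI such as gs://bucket/data/v1/data.csv."""
    return f"{bucket_uri.rstrip('/')}/data/{version}/{filename}"


BUCKET_URI = "gs://mlops-sept25"
DATA_V1 = data_uri(BUCKET_URI, "v1")
DATA_V2 = data_uri(BUCKET_URI, "v2")
IRIS_DATA = data_uri(BUCKET_URI, "raw", filename="iris.csv")

for uri in (DATA_V1, DATA_V2, IRIS_DATA):
    print(uri)
```

Because the version is part of the path, two files that share the name `data.csv` can no longer overwrite each other.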

### Initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

In [7]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

### Import the required libraries

In [8]:
import os
import sys

## Simple Decision Tree model
Build a Decision Tree model on iris data

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from pandas.plotting import parallel_coordinates
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics

In [10]:
def train_data(dataset):
    """Train a decision tree on the CSV at `dataset` and return the fitted model."""
    data = pd.read_csv(dataset)
    print(data.head(5))

    # Hold out 40% of the rows for testing, stratified by species
    train, test = train_test_split(data, test_size=0.4, stratify=data['species'], random_state=42)
    features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    X_train = train[features]
    y_train = train.species
    X_test = test[features]
    y_test = test.species

    # Fit a shallow tree and report its accuracy on the held-out split
    mod_dt = DecisionTreeClassifier(max_depth=3, random_state=1)
    mod_dt.fit(X_train, y_train)
    prediction = mod_dt.predict(X_test)
    print('\nThe accuracy of the Decision Tree is', "{:.3f}".format(metrics.accuracy_score(y_test, prediction)))

    return mod_dt

In [55]:
# data = pd.read_csv(IRIS_DATA)
# data.head(5)

In [15]:
# data = pd.read_csv(DATA_V1)
# data.head(5)

In [53]:
# train, test = train_test_split(data, test_size = 0.4, stratify = data['species'], random_state = 42)
# X_train = train[['sepal_length','sepal_width','petal_length','petal_width']]
# y_train = train.species
# X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]
# y_test = test.species

In [54]:
# mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
# mod_dt.fit(X_train,y_train)
# prediction=mod_dt.predict(X_test)
# print('The accuracy of the Decision Tree is',"{:.3f}".format(metrics.accuracy_score(prediction,y_test)))

### Upload model artifacts and custom code to Cloud Storage

After training, the following files are saved locally and then uploaded to Cloud Storage, in a folder named after the training timestamp:

* `iris_model.joblib` (model artifact)
* `metrics.txt` (accuracy on the test split)

The cells below save and upload these files:

In [18]:
# import pickle
# import joblib

# joblib.dump(mod_dt, "artifacts/model.joblib")

In [19]:
# !gsutil cp artifacts/model.joblib {BUCKET_URI}/{MODEL_ARTIFACT_DIR}/

In [20]:
# import joblib, os, datetime
# from google.cloud import storage

# timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
# output_dir = f"{MODEL_ARTIFACT_DIR}/artifacts/{timestamp}-iris"
# os.makedirs(output_dir, exist_ok=True)

# # Save model
# joblib.dump(mod_dt, f"{output_dir}/iris_model.joblib")

# # Save metrics
# with open(f"{output_dir}/metrics.txt", "w") as f:
#     f.write(f"accuracy: {metrics.accuracy_score(prediction, y_test):.3f}\n")

# # Upload to GCS
# client = storage.Client()
# bucket = client.bucket(BUCKET_URI.split('gs://')[1])
# for file in os.listdir(output_dir):
#     blob = bucket.blob(f"{output_dir}/{file}")
#     blob.upload_from_filename(f"{output_dir}/{file}")


In [11]:
import joblib, os, datetime
from google.cloud import storage

def store_to_gcs(model):
    timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    output_dir = f"{MODEL_ARTIFACT_DIR}/artifacts/{timestamp}-iris"
    os.makedirs(output_dir, exist_ok=True)

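    # Save the model that was passed in
    joblib.dump(model, f"{output_dir}/iris_model.joblib")

    # Save metrics
    # NOTE: `prediction` and `y_test` are not parameters. They must already exist
    # as notebook-level variables, because train_data() does not return them.
    with open(f"{output_dir}/metrics.txt", "w") as f:
        f.write(f"accuracy: {metrics.accuracy_score(prediction, y_test):.3f}\n")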

    # Upload to GCS
    client = storage.Client()
    bucket = client.bucket(BUCKET_URI.split('gs://')[1])
    for file in os.listdir(output_dir):
        blob = bucket.blob(f"{output_dir}/{file}")
        blob.upload_from_filename(f"{output_dir}/{file}")
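
The function above takes the accuracy from notebook-level variables. As a hedged sketch of a more self-contained layout, the folder name and the metrics file content can be produced by small pure functions that receive everything they need as arguments. The names `run_prefix` and `metrics_text` are hypothetical and do not appear in the notebook.

```python
import datetime


def run_prefix(base_dir, tag, now=None):
    """Return a timestamped folder path, e.g. base/artifacts/20251005-024852-iris."""
    now = now or datetime.datetime.now()
    return f"{base_dir}/artifacts/{now.strftime('%Y%m%d-%H%M%S')}-{tag}"


def metrics_text(accuracy):
    """Render the one-line metrics file in the same format as metrics.txt."""
    return f"accuracy: {accuracy:.3f}\n"


print(run_prefix("iris_classifier/model", "iris"))
print(metrics_text(0.98333), end="")
```

Passing the accuracy in explicitly means the stored metrics always describe the model being saved, rather than whatever values happen to be in the kernel.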


### Inference script:

In [12]:
import joblib
from google.cloud import storage

def download_model(bucket_name, model_path, local_file="iris_model.joblib"):
    """Download a model object from GCS to `local_file` and load it with joblib."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(model_path)
    blob.download_to_filename(local_file)
    return joblib.load(local_file)

In [13]:
def get_inference(data, model_artifact):
    # Load the evaluation data from the CSV passed in as `data`
    eval_df = pd.read_csv(data)
    X_eval = eval_df[['sepal_length','sepal_width','petal_length','petal_width']]
    
    # Load model from GCS and predict
    model = download_model(BUCKET_NAME, model_artifact)
    preds = model.predict(X_eval)
    eval_df['predictions'] = preds
    print(eval_df.head())
    
    print('\nAccuracy:', "{:.3f}".format(metrics.accuracy_score(eval_df['predictions'], eval_df['species'])))
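
The inference calls in this notebook pass the full artifact path, including the run timestamp, by hand. A minimal sketch of picking the most recent run automatically is shown below. It assumes object names follow the `.../artifacts/<YYYYMMDD-HHMMSS>-<tag>/iris_model.joblib` layout used here, and the helper name `latest_model_path` is hypothetical.

```python
import re

# Matches the run folder layout written by the training step.
_RUN_RE = re.compile(r"/artifacts/(\d{8}-\d{6})-[^/]+/iris_model\.joblib$")


def latest_model_path(blob_names):
    """Return the model object whose run folder has the newest timestamp."""
    candidates = []
    for name in blob_names:
        match = _RUN_RE.search(name)
        if match:
            candidates.append((match.group(1), name))
    if not candidates:
        raise ValueError("no timestamped model artifacts found")
    # YYYYMMDD-HHMMSS strings sort chronologically as plain text.
    return max(candidates)[1]
```

The list of names could come from `client.list_blobs(BUCKET_NAME, prefix="iris_classifier/model/artifacts/")` by reading each blob's `name` attribute.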

In [65]:
# # Load evaluation data - for data v1
# eval_df = pd.read_csv(DATA_V1)  # use test set
# X_eval = eval_df[['sepal_length','sepal_width','petal_length','petal_width']]

In [43]:
# # Load model from GCS and predict
# model = download_model(BUCKET_NAME, "iris_classifier/model/artifacts/20251005-022441-iris/iris_model.joblib")
# preds = model.predict(X_eval)
# eval_df['predictions'] = preds
# print(eval_df.head())

In [42]:
# print('Accuracy:', "{:.3f}".format(metrics.accuracy_score(eval_df['predictions'], eval_df['species'])))

### Training and Inference 1

In [69]:
trained_model = train_data(IRIS_DATA)
store_to_gcs(trained_model)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

The accuracy of the Decision Tree is 0.983


In [84]:
get_inference(DATA_V1, "iris_classifier/model/artifacts/20251005-024852-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.8          4.0           1.2          0.2  setosa      setosa
1           5.7          4.4           1.5          0.4  setosa      setosa
2           5.4          3.9           1.3          0.4  setosa      setosa
3           5.1          3.5           1.4          0.3  setosa      setosa
4           5.7          3.8           1.7          0.3  setosa      setosa

Accuracy: 0.970


In [85]:
get_inference(DATA_V2, "iris_classifier/model/artifacts/20251005-024852-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 1.000


In [86]:
get_inference(IRIS_DATA, "iris_classifier/model/artifacts/20251005-024852-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 0.980


In [88]:
rm iris_model.joblib

### Training and Inference 2

In [73]:
trained_model = train_data(DATA_V1)
store_to_gcs(trained_model)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.8          4.0           1.2          0.2  setosa
1           5.7          4.4           1.5          0.4  setosa
2           5.4          3.9           1.3          0.4  setosa
3           5.1          3.5           1.4          0.3  setosa
4           5.7          3.8           1.7          0.3  setosa

The accuracy of the Decision Tree is 0.951


In [90]:
get_inference(DATA_V2, "iris_classifier/model/artifacts/20251005-025002-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 1.000


In [91]:
get_inference(IRIS_DATA, "iris_classifier/model/artifacts/20251005-025002-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 0.980


In [89]:
get_inference(DATA_V1, "iris_classifier/model/artifacts/20251005-025002-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.8          4.0           1.2          0.2  setosa      setosa
1           5.7          4.4           1.5          0.4  setosa      setosa
2           5.4          3.9           1.3          0.4  setosa      setosa
3           5.1          3.5           1.4          0.3  setosa      setosa
4           5.7          3.8           1.7          0.3  setosa      setosa

Accuracy: 0.970


In [83]:
rm iris_model.joblib

rm: cannot remove 'iris_model.joblib': No such file or directory


### Training and Inference 3

In [77]:
trained_model = train_data(DATA_V2)
store_to_gcs(trained_model)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

The accuracy of the Decision Tree is 1.000


In [78]:
get_inference(DATA_V1, "iris_classifier/model/artifacts/20251005-025150-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.8          4.0           1.2          0.2  setosa      setosa
1           5.7          4.4           1.5          0.4  setosa      setosa
2           5.4          3.9           1.3          0.4  setosa      setosa
3           5.1          3.5           1.4          0.3  setosa      setosa
4           5.7          3.8           1.7          0.3  setosa      setosa

Accuracy: 0.970


In [81]:
get_inference(IRIS_DATA, "iris_classifier/model/artifacts/20251005-025150-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 0.980


In [79]:
get_inference(DATA_V2, "iris_classifier/model/artifacts/20251005-025150-iris/iris_model.joblib")

   sepal_length  sepal_width  petal_length  petal_width species predictions
0           5.1          3.5           1.4          0.2  setosa      setosa
1           4.9          3.0           1.4          0.2  setosa      setosa
2           4.7          3.2           1.3          0.2  setosa      setosa
3           4.6          3.1           1.5          0.2  setosa      setosa
4           5.0          3.6           1.4          0.2  setosa      setosa

Accuracy: 1.000


## Week 2

Incorporate DVC for the local data into the homework pipeline.
As part of the Week 2 assignment, set up DVC in the IRIS pipeline built so far.

1. Set up the Git repository.
2. Configure DVC to use a Google Cloud Storage bucket as remote storage.
3. Augment the IRIS data to simulate data additions, then start training.
4. Demonstrate storing data and model files with DVC.
5. Demonstrate traversing data versions using `dvc checkout`.
* DVC Command sheet - here
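
The steps above map onto a short sequence of Git and DVC commands. The sketch below only builds and prints that sequence. The remote name `gcsremote` and the `dvcstore` path inside the bucket are assumptions chosen for illustration, and `run_all` is a hypothetical helper that would execute the commands in a real repository.

```python
import shlex
import subprocess


def dvc_setup_commands(remote_url, data_path="data", remote_name="gcsremote",
                       already_in_git=False):
    """Commands to put `data_path` under DVC with a GCS remote."""
    cmds = [
        ["git", "init"],
        ["dvc", "init"],
        ["dvc", "remote", "add", "-d", remote_name, remote_url],
    ]
    if already_in_git:
        # DVC refuses to track a path that Git still tracks.
        cmds.append(["git", "rm", "-r", "--cached", data_path])
    cmds += [
        ["dvc", "add", data_path],
        ["git", "add", f"{data_path}.dvc", ".gitignore", ".dvc/config"],
        ["git", "commit", "-m", f"Track {data_path} with DVC"],
        ["dvc", "push"],
    ]
    return cmds


def dvc_checkout_commands(git_rev, data_path="data"):
    """Commands to restore the data version recorded at `git_rev`."""
    return [
        ["git", "checkout", git_rev, "--", f"{data_path}.dvc"],
        ["dvc", "checkout", f"{data_path}.dvc"],
    ]


def run_all(commands):
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


for cmd in dvc_setup_commands("gs://mlops-sept25/dvcstore"):
    print(shlex.join(cmd))
```

Switching between data versions then comes down to checking out the `.dvc` file from the desired Git revision and running `dvc checkout`, which is what `dvc_checkout_commands` lists.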

#### Initialize Git Repository

In [14]:
ls

21F1001937_SEPT_2025_MLOps.ipynb  data/             iris_model.joblib
artifacts/                        iris_classifier/


In [19]:
# !git init

In [20]:
# !git config --global user.name "jemma-mg"
# !git config --global user.email "jemmamariyageorge@gmail.com"

In [15]:
!git status

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.bashrc
	.cache/
	.config/
	.docker/
	.gitconfig
	.gsutil/
	.ipynb_checkpoints/
	.ipython/
	.jupyter/
	.local/
	.npm/
	21F1001937_SEPT_2025_MLOps.ipynb
	artifacts/
	data/
	iris_classifier/
	iris_model.joblib

nothing added to commit but untracked files present (use "git add" to track)


In [21]:
!touch .gitignore

In [32]:
%%bash
cat << 'EOF' >> .gitignore
.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
.jupyter/*
.local/*
.npm/*
EOF

In [33]:
cat .gitignore

.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
.jupyter/*
.local/*
.npm/*


In [34]:
!git status

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitignore
	21F1001937_SEPT_2025_MLOps.ipynb
	artifacts/
	data/
	iris_classifier/
	iris_model.joblib

nothing added to commit but untracked files present (use "git add" to track)


In [39]:
%%bash
cat << 'EOF' >> .gitignore
iris_classifier/*
iris_model.joblib
EOF

In [40]:
cat .gitignore

.bashrc
.gitconfig
.viminfo
.cache/*
.config/*
.docker/*
.gitconfig/*
.gsutil/*
.ipynb_checkpoints/*
.ipython/*
.jupyter/*
.local/*
.npm/*
iris_classifier/*
iris_model.joblib


In [41]:
!git status

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitignore
	21F1001937_SEPT_2025_MLOps.ipynb
	artifacts/
	data/

nothing added to commit but untracked files present (use "git add" to track)


In [42]:
!git add .
!git commit -m "Initial commit of IRIS ML pipeline"

[master (root-commit) 3cc7d90] Initial commit of IRIS ML pipeline
 11 files changed, 1946 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 21F1001937_SEPT_2025_MLOps.ipynb
 create mode 100644 artifacts/20251005-003900/iris_model_v1.joblib
 create mode 100644 artifacts/20251005-003900/metrics.txt
 create mode 100644 artifacts/20251005-004237-v1/metrics.txt
 create mode 100644 artifacts/20251005-005426-iris/metrics.txt
 create mode 100644 data/raw/.ipynb_checkpoints/iris-checkpoint.csv
 create mode 100644 data/raw/iris.csv
 create mode 100644 data/v1/.ipynb_checkpoints/data-checkpoint.csv
 create mode 100644 data/v1/data.csv
 create mode 100644 data/v2/data.csv


In [43]:
%%bash
cat << 'EOF' >> .gitignore
data/*
artifacts/*
EOF

#### Configure DVC with GCS Remote

In [19]:
!pip install --quiet "dvc[gcs]"
!pip install --quiet "dvc[gdrive]"

  You can safely remove it manually.