# 02-01: Kaggle + Vertex AI AutoML with Fruit and Vegetable Disease (Healthy vs Rotten) Dataset

Train an image classification model using [Google Cloud Vertex AI](https://cloud.google.com/vertex-ai) and AutoML with data from [Kaggle](https://www.kaggle.com/datasets/muhammad0subhan/fruit-and-vegetable-disease-healthy-vs-rotten/data).


* Kaggle page:  https://www.kaggle.com/datasets/muhammad0subhan/fruit-and-vegetable-disease-healthy-vs-rotten
* dataset: https://www.kaggle.com/datasets/muhammad0subhan/fruit-and-vegetable-disease-healthy-vs-rotten/data
* notebook: https://www.kaggle.com/code/osamaabobakr/fruit-and-vegetable-disease-healthy-vs-rotten

by: Justin Marciszewski | justinjm@google.com | AI/ML Specialist CE

refs:

* https://cloud.google.com/vertex-ai/docs/training-overview
* https://cloud.google.com/vertex-ai/docs/tutorials/image-classification-automl/overview
* https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_image_classification_online_prediction.ipynb


## Setup



### Install packages

In [None]:
packages = [
    ('numpy', 'numpy'),
    ('cv2', 'opencv-python'),
    ('matplotlib.pyplot', 'matplotlib'),
    ('seaborn', 'seaborn'),
    ('kaggle.api.kaggle_api_extended', 'kaggle'),
    ('sklearn.model_selection', 'scikit-learn'),
    ('sklearn.utils', 'scikit-learn'),
    ('keras', 'keras'),
    ('tensorflow.keras', 'tensorflow'),
    ('tensorflow.keras.layers', 'tensorflow'),
    ('tensorflow.keras.models', 'tensorflow'),
    ('tensorflow.keras.applications', 'tensorflow'),
    ('tensorflow.keras.preprocessing.image', 'tensorflow')
]

import importlib
install = False
for package in packages:
    try:
        importlib.import_module(package[0])
    except ImportError:
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user

if install:
    print("Installation of missing packages complete. Please run the next cell to restart the kernel before proceeding.")
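
The cell above imports each module to find out whether it is installed. As a lighter alternative, the check can be sketched with `importlib.util.find_spec`, which locates a module without importing it (parent packages of dotted names are still imported). The helper name `missing_packages` and the example package list are hypothetical and only serve the illustration.

```python
import importlib.util


def missing_packages(packages):
    """Return the pip names of packages whose import name cannot be located."""
    missing = []
    for import_name, pip_name in packages:
        try:
            found = importlib.util.find_spec(import_name) is not None
        except (ImportError, ValueError):
            # find_spec imports the parent of a dotted name; a missing parent
            # raises ModuleNotFoundError (a subclass of ImportError).
            found = False
        if not found and pip_name not in missing:
            missing.append(pip_name)
    return missing


# Hypothetical (import name, pip name) pairs in the same shape as `packages` above.
example = [
    ("json", "json-stdlib"),
    ("no_such_module_xyz", "no-such-package"),
    ("no_such_module_xyz.sub", "no-such-package"),
]
print(missing_packages(example))
```

Each pip name is reported once even when several import names map to it, which avoids installing `tensorflow` or `scikit-learn` repeatedly.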

### Restart Kernel (If Installs Occurred)
After the kernel restarts, resume execution from the cell after the restart cell.

In [None]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)


### Set constants

In [None]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

In [None]:
LOCATION = "us-central1"  
REGION = 'us-central1' 

SERIES = "02-kaggle-vertex-ai"
EXPERIMENT = "02-automl" # notebook number 

BUCKET_NAME = f"{PROJECT_ID}-fruit-and-veg-image-model"

## model training 
DESIRED_LABELS = [
    'Apple__Healthy', 'Apple__Rotten',
    'Banana__Healthy', 'Banana__Rotten',
    'Bellpepper__Healthy', 'Bellpepper__Rotten'
]
NUM_CLASSES = len(DESIRED_LABELS)

### Packages

In [None]:
# Data Ingestion
from datetime import datetime
import os
from pathlib import Path
import subprocess
import time
import json
import re
import random
import tempfile
import threading
import pandas as pd

from google.cloud import storage
from google.cloud.exceptions import NotFound

from kaggle.api.kaggle_api_extended import KaggleApi

# Data pre-processing
from PIL import Image  # For image loading and preprocessing
from concurrent.futures import ThreadPoolExecutor

# Modeling 
from google.cloud import aiplatform
import base64
import tensorflow as tf
import numpy as np

### Parameters

In [None]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
print(f"TIMESTAMP: {TIMESTAMP}")
URI = f"gs://{BUCKET_NAME}/{SERIES}/{EXPERIMENT}" 
DIR = f"temp/{EXPERIMENT}"

LOCAL_DATA_DIR = f"{DIR}/data"
LOCAL_CSV_IMAGE_DATA_PATH = f"{LOCAL_DATA_DIR}/labels.csv"

DATASET_CSV = f"{URI}/{TIMESTAMP}/labels.csv"

DATASET_DISPLAY_NAME = f"{SERIES}-{TIMESTAMP}"

### Experiment Tracking 

In [None]:
FRAMEWORK = 'tf'
TASK = 'classification'
MODEL_TYPE = 'tl'
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-{FRAMEWORK}-{TASK}-{MODEL_TYPE}'
RUN_NAME = f'run-{TIMESTAMP}'

### Create local directories for staging files

* data files from creating labels.csv
* build files for creating custom container and running a custom job 
* model training output files and example input images for local inference

In [None]:
! rm -rf $LOCAL_DATA_DIR
! mkdir -p $LOCAL_DATA_DIR

In [None]:
os.makedirs(f"{DIR}/build", exist_ok=True)

In [None]:
os.makedirs(f"{DIR}/output", exist_ok=True)

## Clients 

In [None]:
storage_client = storage.Client(project=PROJECT_ID)
aiplatform.init(project=PROJECT_ID, location=REGION)

## Create Storage Bucket

In [None]:
def check_and_create_bucket(bucket_name, location):
    """Create the bucket in the given location unless it already exists."""
    try:
        storage_client.get_bucket(bucket_name)
        print(f"Bucket {bucket_name} already exists.")
    except NotFound:
        storage_client.create_bucket(bucket_or_name=bucket_name, location=location)
        print(f"Bucket {bucket_name} created.")

In [None]:
check_and_create_bucket(BUCKET_NAME, LOCATION)
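
Because `BUCKET_NAME` is derived from `PROJECT_ID`, it can be useful to check the name before calling the API. The sketch below covers only a basic subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The helper name `is_plausible_bucket_name` is hypothetical, and passing this check does not guarantee that the service accepts the name.

```python
import re

# Basic subset of the bucket naming rules; not a complete validator.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name):
    """Return True if the name passes the basic character and length checks."""
    return _BUCKET_RE.fullmatch(name) is not None


print(is_plausible_bucket_name("my-project-fruit-and-veg-image-model"))
print(is_plausible_bucket_name("example.com:my-project-fruit-and-veg-image-model"))
```

The second example mimics a domain-scoped project ID, whose colon is not a valid bucket name character.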

## Get Data from Kaggle

### Setup Kaggle credentials

You need a Kaggle account and a `kaggle.json` credentials file in the directory `/home/jupyter/.config/kaggle`.

Steps:

* manually download your credential file from kaggle.com -> Profile
* run this command in a terminal, from your home directory, to move it to the correct location: `mv kaggle.json .config/kaggle/kaggle.json`


### Download images 

In [None]:
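# Optional: set Kaggle credentials through environment variables instead of kaggle.json.
# When both are present the environment variables are used, so replace the
# placeholders below or remove these two lines when relying on kaggle.json.
os.environ['KAGGLE_USERNAME'] = 'YOUR_KAGGLE_USERNAME' 
os.environ['KAGGLE_KEY'] = 'YOUR_KAGGLE_API_KEY'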

# Initialize the Kaggle API
api = KaggleApi()
api.authenticate()

# Specify the dataset you want to download
dataset_slug = 'muhammad0subhan/fruit-and-vegetable-disease-healthy-vs-rotten'

# Download the dataset
api.dataset_download_files(dataset_slug, path=LOCAL_DATA_DIR, unzip=True)

### Convert images

In [None]:
def convert_image_to_rgb_and_jpeg(image_path):
    """Converts and saves an image to RGB JPEG format, overwriting the original."""
    try:
        img = Image.open(image_path)

        if img.mode != 'RGB':
            img = img.convert('RGB')

        img.save(image_path, format='JPEG')  # Overwrite the original
        # print(f'Converted and saved: {image_path}')

    except Exception as e:
        print(f'Error processing {image_path}: {e}')

def process_directory(root_dir, subdirs_to_convert, max_workers=None):
    """Processes images within specified subdirectories using multithreading."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for root, dirs, files in os.walk(root_dir):
            # Filter directories based on the provided list
            dirs[:] = [d for d in dirs if d in subdirs_to_convert]

            for file in files:
                if file.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif')):  # Add more extensions if needed
                    image_path = Path(root) / file
                    executor.submit(convert_image_to_rgb_and_jpeg, image_path)

In [None]:
root_directory = f"{LOCAL_DATA_DIR}/Fruit And Vegetable Diseases Dataset"
subdirectories_to_convert = DESIRED_LABELS

process_directory(root_directory, subdirectories_to_convert)

## Load to GCS

Load only a subset of images (set by the `DESIRED_LABELS` list) for demonstration purposes. To include all the images in the Kaggle dataset, extend `DESIRED_LABELS` with the remaining labels.

In [None]:
# Loop over each subdirectory (label) and copy the contents using gsutil
for subdir in DESIRED_LABELS:
    source = f'"{LOCAL_DATA_DIR}/Fruit And Vegetable Diseases Dataset/{subdir}/*"'
    destination = f"{URI}/data/{subdir}/"
    print(destination)
    command = f"gsutil -m cp -r {source} {destination} > /dev/null 2>&1"
    
    # Execute the command using subprocess
    process = subprocess.run(command, shell=True)
    
    if process.returncode == 0:
        print(f"Successfully copied {subdir}")
    else:
        print(f"Failed to copy {subdir}")
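
The dataset folder name contains spaces, which is why the command above wraps the source path in quotes. A minimal sketch of an alternative is to pass the command as an argument list, so no shell quoting is needed and `gsutil` expands the trailing `*` itself. The helper name `build_copy_command` and the example paths are hypothetical.

```python
def build_copy_command(local_root, label, gcs_prefix):
    """Build a gsutil copy command as an argument list for one label folder."""
    source = f"{local_root}/{label}/*"
    destination = f"{gcs_prefix}/data/{label}/"
    return ["gsutil", "-m", "cp", "-r", source, destination]


cmd = build_copy_command(
    "temp/02-automl/data/Fruit And Vegetable Diseases Dataset",
    "Apple__Healthy",
    "gs://demo-bucket/02-kaggle-vertex-ai/02-automl",
)
print(cmd)
# To execute it (not run here): subprocess.run(cmd, capture_output=True, text=True)
```

Capturing the output instead of redirecting it to `/dev/null` keeps the error message available when a copy fails.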

## Prepare data 

refs:

* https://cloud.google.com/vertex-ai/docs/image-data/classification/prepare-data 

### Create csv labels file and upload for use in model training

Create a CSV file called `labels.csv` in which each row has the form `gs://path/to/image.jpg,label`.

The file must have no header row and must be stored in GCS.
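
To make the expected file layout concrete, here is a minimal standard-library sketch that turns a list of GCS URIs into header-less `uri,label` rows. It assumes the label is the name of the folder that directly contains the image; the helper names and the example URIs are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def label_from_uri(uri):
    """Return the name of the folder that directly contains the file."""
    return PurePosixPath(uri).parent.name


def build_labels_csv(uris, allowed_labels, suffixes=(".jpg",)):
    """Return header-less CSV text with one `uri,label` row per kept image."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for uri in sorted(uris):
        label = label_from_uri(uri)
        if label in allowed_labels and uri.lower().endswith(suffixes):
            writer.writerow([uri, label])
    return buffer.getvalue()


prefix = "gs://demo-bucket/02-kaggle-vertex-ai/02-automl"
uris = [
    f"{prefix}/data/Banana__Rotten/b1.jpg",
    f"{prefix}/data/Apple__Healthy/a1.jpg",
    f"{prefix}/data/Mango__Healthy/m1.jpg",  # label not in the allowed set
    f"{prefix}/20240101000000/labels.csv",   # not an image
]
csv_text = build_labels_csv(uris, {"Apple__Healthy", "Banana__Rotten"})
print(csv_text)
```

Deriving the label from the parent folder avoids depending on a fixed position in the path, so the result stays correct if the prefix gains or loses a segment.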

In [None]:
def get_file_list(bucket_name):
    # get list of all files from bucket
    bucket = storage_client.bucket(bucket_name)
    blobs = bucket.list_blobs()
    file_list = ['gs://' + bucket_name + '/' + blob.name for blob in blobs]
    
    return file_list

In [None]:
file_list = get_file_list(BUCKET_NAME)
file_list[:10]

In [None]:
def create_dataframe(file_list, filter_pattern):
    # keep only files with a .jpg extension
    image_files = [file for file in file_list if file.endswith(('.jpg'))]
    df = pd.DataFrame(image_files, columns=['filename'])
    
    ## keep only the labels listed in DESIRED_LABELS (set above) for demonstration purposes 
    df = df[df['filename'].str.contains(filter_pattern, regex=True)]
    
    # Extract the label from the GCS path. Splitting
    # gs://<bucket>/<SERIES>/<EXPERIMENT>/data/<label>/<file> on '/' puts the label at index 6.
    df['label'] = df['filename'].apply(lambda x: x.split('/')[6])
    
    return df

In [None]:
pd.options.display.max_colwidth = 100 # set option to view long strings 

df_labels = create_dataframe(file_list, 
                             filter_pattern = '|'.join(DESIRED_LABELS))
df_labels.head()

In [None]:
df_labels.shape[0]

In [None]:
df_labels['label'].value_counts()

### Save labels.csv

Save `labels.csv` locally and upload it to the GCS bucket for use in Vertex AI training in the next step.

In [None]:
df_labels.to_csv(LOCAL_CSV_IMAGE_DATA_PATH, index=False, header=False)

In [None]:
bucket = storage_client.bucket(BUCKET_NAME)
blob = bucket.blob(f"{SERIES}/{EXPERIMENT}/{TIMESTAMP}/labels.csv")
blob.upload_from_filename(LOCAL_CSV_IMAGE_DATA_PATH)

## Create Vertex AI Dataset

Create a managed Vertex AI dataset. 

refs:

* https://cloud.google.com/vertex-ai/docs/image-data/classification/create-dataset#aiplatform_create_dataset_image_sample-python_vertex_ai_sdk

In [None]:
dataset = aiplatform.ImageDataset.create(
        display_name=f"{SERIES}_{EXPERIMENT}_{TIMESTAMP}",
        gcs_source=[DATASET_CSV],
        import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification, 
        sync=True,
    )

## Model Training

Submit the AutoML training job to Vertex AI

refs

* https://cloud.google.com/vertex-ai/docs/image-data/classification/train-model#aiplatform_create_training_pipeline_image_classification_sample-python_vertex_ai_sdk



In [None]:
job = aiplatform.AutoMLImageTrainingJob(
    display_name=f"{SERIES}_{EXPERIMENT}_{TIMESTAMP}",
    model_type="CLOUD",
    prediction_type="classification",
    multi_label=False,
)

In [None]:
## set the dataset manually here if needed (for example, to reuse an existing dataset)
# dataset = aiplatform.ImageDataset(dataset_id)

In [None]:
model = job.run(
    dataset=dataset,
    model_display_name=f"{SERIES}_{EXPERIMENT}_{TIMESTAMP}",
    training_fraction_split=0.4,
    validation_fraction_split=0.3,
    test_fraction_split=0.3,
    budget_milli_node_hours=8000,
    disable_early_stopping=False,
    sync=True)
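
Two of the arguments above are easy to get wrong: the split fractions and the budget unit. The sketch below checks the fractions and converts the budget, where 1,000 milli node hours equal 1 node hour. It assumes that each fraction should lie between 0 and 1 and that their total should not exceed 1; the helper name `check_training_config` is hypothetical.

```python
import math


def check_training_config(train, validation, test, budget_milli_node_hours):
    """Validate the split fractions and return the budget in node hours."""
    fractions = (train, validation, test)
    if any(f < 0 or f > 1 for f in fractions):
        raise ValueError(f"each fraction must be within [0, 1]: {fractions}")
    total = sum(fractions)
    # Allow for floating-point rounding when comparing the total with 1.
    if total > 1 and not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"fractions sum to {total}, which exceeds 1")
    return budget_milli_node_hours / 1000


print(check_training_config(0.4, 0.3, 0.3, 8000))
```

With the values used in this notebook, the budget of `8000` corresponds to 8 node hours.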

## Evaluate Model

Evaluate your AutoML image classification model here if needed so that you can iterate on your model.

Vertex AI provides model evaluation metrics to help you determine the performance of your models, such as precision and recall metrics. Vertex AI calculates evaluation metrics by using the [test set](https://cloud.google.com/vertex-ai/docs/general/ml-use).
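
As a reminder of what the two metrics mean, here is a minimal sketch that computes precision and recall for a single label from hypothetical counts of true positives, false positives, and false negatives. It illustrates the definitions only and is not how Vertex AI computes or reports its evaluation.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for one label; 0.0 when a denominator is zero."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for one label, e.g. Apple__Rotten.
precision, recall = precision_recall(true_positives=9, false_positives=1, false_negatives=3)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the images predicted as this label, how many were right", while recall answers "of the images that truly have this label, how many were found".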

refs:

* https://cloud.google.com/vertex-ai/docs/image-data/classification/evaluate-model

In [None]:
try:
    model
    print("model object set!")
except NameError:
    print("model object not set, fetching...")
    models = aiplatform.Model.list(filter=f"display_name={SERIES}_{EXPERIMENT}_{TIMESTAMP}")
    model = models[0]

In [None]:
model_evaluations = model.list_model_evaluations()

for model_evaluation in model_evaluations:
    print(json.dumps(model_evaluation.to_dict(), indent=4))

## Get Predictions - Online

We will now get an online (real-time) prediction from our model.

Online predictions are synchronous requests made to a model endpoint. Use online predictions when you are making requests in response to application input or in situations that require timely inference.

You can read more about this at the references linked below.

refs:

* https://cloud.google.com/vertex-ai/docs/image-data/classification/get-predictions



### Deploy model

In [None]:
endpoint = model.deploy()

### Get a test item

Get the first image from our `labels.csv` file to use as a test item to ensure our model returns the expected response.

In practice, we would use an image that our trained model has not seen before.

In [None]:
test_item = !gsutil cat $DATASET_CSV | head -n1
if len(str(test_item[0]).split(",")) == 3:
    _, test_item, test_label = str(test_item[0]).split(",")
else:
    test_item, test_label = str(test_item[0]).split(",")

print(f"\nTest Item:\n")
print(f"  Test Item Source: {test_item}\n")  
print(f"  Test Item Actual Label: {test_label}")
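
The cell above handles rows with either two columns (`uri,label`) or three columns, where the first column is the ML use. The same parsing can be sketched as a small function based on the `csv` module, which also copes with quoted fields. The function name `parse_label_row` and the example rows are hypothetical.

```python
import csv


def parse_label_row(line):
    """Split one import-file row into (ml_use, uri, label); ml_use may be None."""
    fields = [field.strip() for field in next(csv.reader([line]))]
    if len(fields) == 3:
        ml_use, uri, label = fields
    elif len(fields) == 2:
        ml_use = None
        uri, label = fields
    else:
        raise ValueError(f"expected 2 or 3 columns, got {len(fields)}: {line!r}")
    return ml_use, uri, label


print(parse_label_row("gs://demo-bucket/data/Apple__Healthy/a1.jpg,Apple__Healthy"))
print(parse_label_row("test,gs://demo-bucket/data/Banana__Rotten/b1.jpg,Banana__Rotten"))
```

Raising an error for any other column count surfaces malformed rows early instead of failing later during unpacking.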

### Make prediction

Print raw prediction output to see raw model output:

In [None]:
with tf.io.gfile.GFile(test_item, "rb") as f:
    content = f.read()

# The format of each instance should conform to the deployed model's prediction input schema.
instances = [{"content": base64.b64encode(content).decode("utf-8")}]

prediction = endpoint.predict(instances=instances)

print(json.dumps(prediction, indent=4))

Finally, print only the label with the highest confidence score, formatted for readability, for demonstration purposes.

In [None]:
# Extract the relevant data from the prediction response
confidences = prediction.predictions[0]['confidences']
display_names = prediction.predictions[0]['displayNames']

# Find the index of the highest confidence score
max_confidence_index = confidences.index(max(confidences))

# Extract the label with the highest confidence and its score
top_label = display_names[max_confidence_index]
top_confidence = confidences[max_confidence_index]

# Print the result in a pretty format
print(f"\nPrediction Result:\n")
print(f"  Top Label: {top_label}")
print(f"  Confidence: {top_confidence:.2f}\n")  

print(f"\nTest Item Actuals:\n")
print(f"  Test Item Actual Label: {test_label}")
print(f"  Test Item Source: {test_item}\n")  
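
When the top label alone is not enough, the same response fields can be ranked to show the runners-up. The sketch below works on a mock dictionary shaped like one entry of `prediction.predictions` as used above, with `displayNames` and `confidences` keys; the values are made up and the helper name `top_k_labels` is hypothetical.

```python
def top_k_labels(prediction, k=3):
    """Return the k (label, confidence) pairs with the highest confidence."""
    pairs = zip(prediction["displayNames"], prediction["confidences"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Mock values standing in for one entry of prediction.predictions.
mock_prediction = {
    "displayNames": ["Apple__Healthy", "Apple__Rotten", "Banana__Healthy"],
    "confidences": [0.02, 0.91, 0.07],
}
for label, confidence in top_k_labels(mock_prediction, k=2):
    print(f"{label}: {confidence:.2f}")
```

Seeing the second-best label next to the first helps spot images where the model is nearly undecided between two classes.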

## Cleanup !!danger zone!!

In [None]:
# undeploy endpoint only 
# endpoint.undeploy_all()

In [None]:
## ! warning - running the code below deletes objects that require 
## long running processes to recreate

# Delete the dataset using the Vertex dataset object
## dataset.delete()

# Delete the endpoint using the Vertex endpoint object
## endpoint.delete()

# Delete the model using the Vertex model object
## model.delete()

# Delete the AutoML training job
## job.delete()

# Delete Cloud Storage objects that were created
## delete_bucket = False  # Set True for deletion
## if delete_bucket:
##     ! gsutil -m rm -r gs://$BUCKET_NAME