# Vignette for image classification 

This vignette explores the progression from classical machine learning to deep learning for image classification using the Cats vs Dogs dataset from Kaggle. We start by building two baseline models, SVM and XGBoost, trained on flattened image vectors. These baselines show the limitations of classical ML when images are reduced to tabular form. We then introduce a Convolutional Neural Network (CNN), which uses the spatial structure of images, and briefly discuss Vision Transformers (ViT), which apply transformer architectures to sequences of image patches. Together, these models illustrate the evolution from traditional approaches to modern architectures and highlight the strengths and weaknesses of each for image classification.

Data: The Cats vs Dogs dataset from Kaggle contains 25,000 real-world images of cats and dogs, labeled 0 (cat) or 1 (dog). Half of the images are in train1 and half are in test. Because the dataset is too large to store on GitHub, it is kept locally under data/ and excluded from version control through .gitignore.

## Preprocessing

Before training any models, the images must be standardized and prepared in a format suitable for both classical machine learning and deep learning methods. This involves resizing the images, extracting labels from filenames, and converting each image into numerical arrays that serve as model inputs.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

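# Preprocessing for classical (non-deep-learning) models
def prepare_ml_data(data_path, target_size=(64, 64)):
    """Load .jpg images from data_path; return flattened features X and labels y."""
    filenames = [f for f in os.listdir(data_path) if f.endswith('.jpg')]
    X, y = [], []
    for filename in filenames:
        # Force 3 channels and a common size so every image yields the same number of features
        img = Image.open(os.path.join(data_path, filename)).convert('RGB').resize(target_size)
        # Scale pixel values from [0, 255] to [0, 1]
        img_array = np.array(img).astype(np.float32) / 255.0
        # Flatten (height, width, 3) into a single 1D feature vector
        X.append(img_array.flatten())
        # The text before the first '.' in the filename is the class: 0 = cat, 1 = dog
        y.append(0 if filename.split('.')[0] == 'cat' else 1)
    return np.array(X), np.array(y)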

The prepare_ml_data function loads all .jpg images from a given folder and preprocesses them for classical machine learning models. Each image is opened, converted to RGB, resized to 64×64 pixels, and normalized to values between 0 and 1. The image is then flattened into a one-dimensional vector so models like SVM and XGBoost can use it as tabular input. Labels are extracted from the filename, assigning 0 for cats and 1 for dogs. The function returns two NumPy arrays: X containing the flattened image data and y containing the corresponding labels.
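To make the flattening step concrete, the sketch below applies the same normalize-and-flatten operations to a synthetic array instead of a real file. The random image `fake_img` and the two example filenames are illustrative assumptions, not files from the dataset. Each 64×64 RGB image becomes a vector of 64 × 64 × 3 = 12,288 features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a resized RGB image: uint8 values in [0, 255]
fake_img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Same steps as prepare_ml_data: scale to [0, 1], then flatten to 1D
img_array = fake_img.astype(np.float32) / 255.0
flat = img_array.flatten()
print(flat.shape)  # (12288,) = 64 * 64 * 3

# Label rule from prepare_ml_data, applied to hypothetical filenames
labels = {}
for name in ["cat.0.jpg", "dog.0.jpg"]:
    labels[name] = 0 if name.split('.')[0] == 'cat' else 1
print(labels)
```

Any filename whose first dot-separated token is not exactly "cat" is labeled 1, so this rule assumes the folder contains only cat and dog images.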

In [None]:
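# Preprocessing for deep learning models
def prepare_dl_data(data_path, target_size=(224, 224)):
    """Load .jpg images from data_path; return image tensors X and labels y."""
    filenames = [f for f in os.listdir(data_path) if f.endswith('.jpg')]
    X, y = [], []
    for filename in filenames:
        # Force 3 channels and a common size so all images can be stacked into one array
        img = Image.open(os.path.join(data_path, filename)).convert('RGB').resize(target_size)
        # Scale pixel values from [0, 255] to [0, 1]
        img_array = np.array(img).astype(np.float32) / 255.0
        # Keep the (height, width, channels) layout; no flattening
        X.append(img_array)
        # The text before the first '.' in the filename is the class: 0 = cat, 1 = dog
        y.append(0 if filename.split('.')[0] == 'cat' else 1)
    return np.array(X), np.array(y)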

The prepare_dl_data function loads all .jpg images from a folder and preprocesses them for deep learning models such as CNNs. Each image is opened, converted to RGB, resized to 224×224 pixels, and normalized to values between 0 and 1. Unlike the classical ML version, the images are not flattened. They are kept as 3D arrays (height × width × channels) so that convolutional layers can learn spatial features. Labels are extracted from the filenames in the same way, assigning 0 for cats and 1 for dogs. The function returns two NumPy arrays: X containing the processed image tensors and y containing the labels.
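The shape of the returned array, and a rough idea of its memory cost, can be sketched with synthetic data. The batch of four all-zero images and the figure of 12,500 images (half of the 25,000 mentioned earlier) are assumptions for illustration. The estimate covers only the final float32 array, not temporary copies made while loading.

```python
import numpy as np

# Hypothetical batch of 4 synthetic images standing in for resized RGB photos
batch = [np.zeros((224, 224, 3), dtype=np.float32) for _ in range(4)]
X = np.array(batch)
print(X.shape)  # (4, 224, 224, 3): images, height, width, channels

# Memory per float32 image, extrapolated to a larger folder
bytes_per_image = X[0].nbytes   # 224 * 224 * 3 values * 4 bytes each
n_images = 12_500               # assumption: half of the 25,000 images
total_gib = bytes_per_image * n_images / 2**30
print(f"{bytes_per_image} bytes per image, about {total_gib:.1f} GiB for {n_images} images")
```

If the full array does not fit in memory, loading images in batches is a common alternative, though that is outside what prepare_dl_data does as written.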

## Classical Machine Learning Models

To establish a starting point for image classification performance, we first train two classical machine learning models: Support Vector Machines (SVM) and XGBoost. These models serve as baselines before introducing deep learning methods like CNNs. Because these algorithms expect each sample as a fixed-length feature vector rather than an image grid, we use the flattened 64×64 image vectors prepared earlier.

### XGBoost

In [None]:
import sys
sys.path.append("../src")  

# Calling the preprocessing function
from preprocessing import prepare_ml_data

# Import necessary libraries for modeling
from sklearn.model_selection import train_test_split 
from xgboost import XGBClassifier 
from sklearn.svm import SVC 
from sklearn.metrics import accuracy_score

In [None]:
# XGBoost and SVM both expect a 2D array of shape (n_samples, n_features), so each image is flattened
X, y = prepare_ml_data("../data_sample", target_size=(64, 64))

In [None]:
import os
import numpy as np

# Hold out 20% of the data for testing; stratify so both splits keep the same cat/dog ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Optional: cache the split to disk so later runs can skip preprocessing
#save_dir = "../data/processed"
#os.makedirs(save_dir, exist_ok=True)

#np.save(os.path.join(save_dir, "X_train.npy"), X_train)
#np.save(os.path.join(save_dir, "X_test.npy"), X_test)
#np.save(os.path.join(save_dir, "y_train.npy"), y_train)
#np.save(os.path.join(save_dir, "y_test.npy"), y_test)

In [None]:
# Dimensions of the datasets
print("Training set:", X_train.shape, y_train.shape)
print("Testing set:", X_test.shape, y_test.shape)

In [None]:
# Training an XGBoost classifier

xgb_model = XGBClassifier(
    n_estimators=100,     # number of boosted trees
    max_depth=5,          # tree depth (controls complexity)
    learning_rate=0.1,    # boosting shrinkage
    subsample=0.8,        # use 80% of samples per tree
    colsample_bytree=0.8, # use 80% of features per tree
    eval_metric="logloss" # set explicitly; some XGBoost versions warn when it is left unset
)

xgb_model.fit(X_train, y_train)

# Predictions on test set
y_pred_xgb = xgb_model.predict(X_test) 

# Evaluate accuracy
xgb_acc = accuracy_score(y_test, y_pred_xgb)
print(f"XGBoost Test Accuracy: {xgb_acc:.4f}")
# Test accuracy = 75%

XGBoost treats each pixel value as an independent feature, so it does not naturally capture spatial patterns (edges, textures, shapes) the way CNNs do. We therefore use it as a baseline against which to compare the CNN.

In [None]:
# Train an SVM model

svm_model = SVC(kernel="linear") # a linear kernel is a common, simple baseline
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
print("SVM accuracy:", accuracy_score(y_test, y_pred_svm))
# Test accuracy = 50%

A linear SVM looks for the maximum-margin hyperplane that separates the two classes (cat vs dog). We obtained a test accuracy of 50%, meaning that the model does no better than a coin toss, which highlights the limitations of applying classical machine learning directly to raw image data. Because the images were flattened into long vectors, the spatial relationships between pixels were lost, making it difficult for the SVM to capture meaningful patterns such as fur texture or shapes. The linear decision boundary is likely a limiting factor as well, since XGBoost reached about 75% on the same flattened vectors. This result motivates the need for models that can exploit the inherent structure of images, such as convolutional neural networks and vision transformers.
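One way to see what flattening does to spatial structure is to track where neighboring pixels end up in the flattened vector. The sketch below is an illustration on index arithmetic only, not part of the models above. It labels every entry of a 64×64×3 array with its flattened position and compares two pairs of adjacent pixels (the chosen coordinates are arbitrary).

```python
import numpy as np

H, W, C = 64, 64, 3

# Label each entry of an H x W x C array with its position in the flattened vector
# (row-major order, matching the default behavior of ndarray.flatten)
flat_index = np.arange(H * W * C).reshape(H, W, C)

# Horizontally adjacent pixels (same row, red channel)
horizontal_gap = flat_index[10, 21, 0] - flat_index[10, 20, 0]

# Vertically adjacent pixels (same column, red channel)
vertical_gap = flat_index[11, 20, 0] - flat_index[10, 20, 0]

print(horizontal_gap)  # 3   = C
print(vertical_gap)    # 192 = W * C
```

Two pixels that touch vertically in the image sit 192 positions apart in the feature vector, and nothing in the vector itself tells a tabular model that they are neighbors. Convolutional layers, by contrast, operate on local 2D windows and so have this neighborhood information built in.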

## Deep Learning Methods