## CS610: Applied Machine Learning Project

This notebook serves to detail our workflow and code on our project objective of building an image classification model to fulfil the business objective of identifying different shoe models. The dataset can be found in the repository and is sourced from Kaggle <br>(<i>**state source?**</i>).

### Install and import packages

In [None]:
!pip install opencv-python
!pip install cv2

Collecting opencv-python
  Downloading opencv_python-4.11.0.86-cp37-abi3-win_amd64.whl.metadata (20 kB)
Downloading opencv_python-4.11.0.86-cp37-abi3-win_amd64.whl (39.5 MB)
   ---------------------------------------- 0.0/39.5 MB ? eta -:--:--
   - -------------------------------------- 1.0/39.5 MB 6.3 MB/s eta 0:00:07
   --- ------------------------------------ 3.1/39.5 MB 8.4 MB/s eta 0:00:05
   ----- ---------------------------------- 5.5/39.5 MB 9.1 MB/s eta 0:00:04
   ------- -------------------------------- 7.9/39.5 MB 9.7 MB/s eta 0:00:04
   ---------- ----------------------------- 10.5/39.5 MB 10.4 MB/s eta 0:00:03
   ------------- -------------------------- 13.4/39.5 MB 10.9 MB/s eta 0:00:03
   ---------------- ----------------------- 16.3/39.5 MB 11.4 MB/s eta 0:00:03
   ------------------- -------------------- 19.4/39.5 MB 11.8 MB/s eta 0:00:02
   ---------------------- ----------------- 22.5/39.5 MB 12.2 MB/s eta 0:00:02
   -------------------------- ------------- 25.7/39.5

ERROR: Could not find a version that satisfies the requirement cv2 (from versions: none)
ERROR: No matching distribution found for cv2


In [2]:
import cudf
import numpy as np
import pandas as pd
from PIL import Image
import cv2
import os
import tqdm
from skimage.feature import hog
from cuml.model_selection import train_test_split
from cuml.metrics import accuracy_score
from cuml.svm import SVC
import torch
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader
from cuml import LogisticRegression
from cuml.ensemble import RandomForestClassifier
from cuml.neighbors import KNeighborsClassifier
from sklearn.ensemble import StackingClassifier
import xgboost as xgb
import time
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score
import warnings
warnings.filterwarnings("ignore")

In [3]:
import cuml
print(cuml.__version__)
%load_ext cuml.accel

25.02.01
[2025-06-18 17:01:05.396] [CUML] [info] cuML: Installed accelerator for sklearn.
[2025-06-18 17:01:22.591] [CUML] [info] cuML: Installed accelerator for umap.
[2025-06-18 17:01:22.699] [CUML] [info] cuML: Installed accelerator for hdbscan.
[2025-06-18 17:01:22.699] [CUML] [info] cuML: Successfully initialized accelerator.


### Loading image summary

Do note that the pathnames have been changed to assume that you are in the `CS610_AML_Group_Project` directory

In [2]:
image_csv = pd.read_csv('../CS610_AML_Group_Project/dataset_stats.csv')
image_csv.head()

Unnamed: 0,class,image_count,avg_width,avg_height,min_width,min_height,max_width,max_height,formats,corrupt_files
0,adidas_forum_high,150,143,124,78,81,162,140,jpeg,0
1,adidas_ultraboost,150,142,128,93,49,162,140,jpeg,0
2,new_balance_550,150,134,129,79,40,162,140,jpeg,0
3,new_balance_574,150,131,133,78,68,162,140,jpeg,0
4,converse_one_star,150,138,130,82,67,162,140,jpeg,0


In [3]:
# check images format
image_csv['formats'].unique()

array(['jpeg', 'jpeg, png'], dtype=object)

In [6]:
def count_images(datasource_path):
    image_counts = {}
    image_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp'}

    if not os.path.isdir(datasource_path):
        print(f"Error: Path '{datasource_path}' is not a directory.")
        return image_counts

    for subfolder_name in os.listdir(datasource_path):
        subfolder_path = os.path.join(datasource_path, subfolder_name)

        if os.path.isdir(subfolder_path):
            count = 0
            for file_name in os.listdir(subfolder_path):
                file_path = os.path.join(subfolder_path, file_name)
                if os.path.isfile(file_path):
                    _, ext = os.path.splitext(file_name)
                    if ext.lower() in image_extensions:
                        count += 1
            image_counts[subfolder_name] = count
    return image_counts

image_dir = 'datasource'
print(f"Scanning: {image_dir}")
counts = count_images(image_dir)

if counts:
    for folder, count in counts.items():
        print(f"{folder}: {count} images")
else:
    print("No images found or path is incorrect/empty.")
# else:

Scanning: datasource
adidas_forum_high: 150 images
adidas_forum_low: 115 images
adidas_gazelle: 149 images
adidas_nmd_r1: 115 images
adidas_samba: 115 images
adidas_stan_smith: 147 images
adidas_superstar: 115 images
adidas_ultraboost: 150 images
asics_gel-lyte_iii: 91 images
converse_chuck_70_high: 115 images
converse_chuck_70_low: 148 images
converse_chuck_taylor_all-star_high: 114 images
converse_chuck_taylor_all-star_low: 114 images
converse_one_star: 150 images
new_balance_327: 108 images
new_balance_550: 150 images
new_balance_574: 150 images
new_balance_990: 113 images
new_balance_992: 150 images
nike_air_force_1_high: 115 images
nike_air_force_1_low: 147 images
nike_air_force_1_mid: 148 images
nike_air_jordan_11: 113 images
nike_air_jordan_1_high: 115 images
nike_air_jordan_1_low: 115 images
nike_air_jordan_3: 100 images
nike_air_jordan_4: 150 images
nike_air_max_1: 106 images
nike_air_max_270: 149 images
nike_air_max_90: 150 images
nike_air_max_95: 115 images
nike_air_max_97: 

### Image Processing

#### Resizing

In [None]:
# resize images
def resize_image_in_folder(input_dir, output_dir, size=(224, 224), desc='resizing images'):
    if not os.path.exists(input_dir):
        print(f"Input directory {input_dir} does not exist. Please check the path.")
        return

    os.makedirs(output_dir, exist_ok=True)
    supported_formats = ('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.tiff', '.webp')
    for filename in os.listdir(input_dir):
        if filename.lower().endswith(supported_formats):
            img_input_path = os.path.join(input_dir, filename)
            img_output_path = os.path.join(output_dir, filename)
            try:
                img = cv2.imread(img_input_path, cv2.IMREAD_UNCHANGED)

                if img is None:
                    print(f"Error loading {img_input_path}")
                    continue
                resized_img = cv2.resize(img, size, interpolation=cv2.INTER_LANCZOS4)

                if img_output_path.lower().endswith(('.jpg', '.jpeg')) and resized_img.shape[-1] == 4:
                    resized_img = cv2.cvtColor(resized_img, cv2.COLOR_BGRA2BGR)
                cv2.imwrite(img_output_path, resized_img)
            except Exception as e:
                print(f"Error processing {img_input_path}: {e}")

In [None]:
# process all folders
def batch_resize_images(base_input_dir, base_output_dir, size=(224, 224)):
    if not os.path.exists(base_input_dir):
        print(f"Base directory {base_input_dir} does not exist. Please check the path.")
        return

    os.makedirs(base_output_dir, exist_ok=True) # if output directory does not exist, create it.

    for folder in tqdm.tqdm(os.listdir(base_input_dir)):
        current_input_subfolder = os.path.join(base_input_dir, folder)
        current_output_subfolder = os.path.join(base_output_dir, folder)

        if os.path.isdir(current_input_subfolder):
            resize_image_in_folder(current_input_subfolder, current_output_subfolder, size=size)
        else:
            print(f"Skipping {current_input_subfolder} as it is not a directory.")

    print("Batch resizing completed.")

In [None]:
base_input_dir = '../CS610_AML_Group_Project/datasource'
base_output_dir = '../CS610_AML_Group_Project/resized_images'

In [None]:
batch_resize_images(base_input_dir, base_output_dir, size=(224, 224))

100%|██████████| 50/50 [00:05<00:00,  8.70it/s]

Batch resizing completed.





#### Gray Scaling

In [None]:
def grayscale_image_in_folder(input_dir, output_dir):
    if not os.path.exists(input_dir):
        print(f"Input directory {input_dir} does not exist. Please check the path.")
        return

    os.makedirs(output_dir, exist_ok=True)
    supported_formats = ('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.tiff', '.webp')
    for filename in os.listdir(input_dir):
        if filename.lower().endswith(supported_formats):
            img_input_path = os.path.join(input_dir, filename)
            img_output_path = os.path.join(output_dir, filename)
            try:
                img = cv2.imread(img_input_path)
                if img is None:
                    print(f"Error loading {img_input_path}")
                    continue
                # Convert to grayscale
                gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                cv2.imwrite(img_output_path, gray_img)
            except Exception as e:
                print(f"Error processing {img_input_path}: {e}")

In [None]:
def batch_grayscale_images(base_input_dir, base_output_dir):
    if not os.path.exists(base_input_dir):
        print(f"Base directory {base_input_dir} does not exist. Please check the path.")
        return

    os.makedirs(base_output_dir, exist_ok=True)

    for folder in tqdm.tqdm(os.listdir(base_input_dir)):
        current_input_subfolder = os.path.join(base_input_dir, folder)
        current_output_subfolder = os.path.join(base_output_dir, folder)

        if os.path.isdir(current_input_subfolder):
            grayscale_image_in_folder(current_input_subfolder, current_output_subfolder)
        else:
            print(f"Skipping {current_input_subfolder} as it is not a directory.")

    print("Batch grayscale completed.")

In [None]:
base_input_dir = '../CS610_AML_Group_Project/resized_images'
base_output_dir = '../CS610_AML_Group_Project/grayscale_images'
batch_grayscale_images(base_input_dir, base_output_dir)

100%|██████████| 50/50 [00:01<00:00, 28.81it/s]

Batch grayscale completed.





#### Flatten

In [None]:
def flatten_images_recursive(input_dir):
    features = []
    filenames = []
    supported_formats = ('.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.webp')
    for root, dirs, files in os.walk(input_dir):
        for filename in files:
            if filename.lower().endswith(supported_formats):
                img_path = os.path.join(root, filename)
                img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
                if img is None:
                    continue
                img_flat = img.flatten()
                features.append(img_flat)
                # Save relative path for label or traceability
                rel_path = os.path.relpath(img_path, input_dir)
                filenames.append(rel_path)
    flattened = np.array(features)
    return flattened, filenames

# Example usage:
input_dir = '../CS610_AML_Group_Project/grayscale_images'
flattened, filenames = flatten_images_recursive(input_dir)
print(flattened.shape)  # (num_images, 224*224)

(5953, 50176)


## Feature extraction and Model Training
Feature extraction serves as an important part of the data processing step as the correct method used will help the models to learn the features better and hence produce higher accuracy. To investigate on which method is the better feature extraction method, two RandomForestClassifier models with the same set of parameters (found using RandomizedSearchCV previously) was used. The accuracy score will be used to determine which method is better for this use case.

#### Pipeline Models using Feature Extraction Method 1 - By HOG

In [25]:
from skimage.feature import hog

def extract_hog_features_recursive(input_dir, pixels_per_cell=(16, 16), cells_per_block=(2, 2)):
    features = []
    filenames = []
    supported_formats = ('.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.webp')
    for root, dirs, files in os.walk(input_dir):
        for filename in files:
            if filename.lower().endswith(supported_formats):
                img_path = os.path.join(root, filename)
                img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
                if img is None:
                    continue
                # Extract HOG features
                hog_feature = hog(img, pixels_per_cell=pixels_per_cell, cells_per_block=cells_per_block, feature_vector=True)
                features.append(hog_feature)
                rel_path = os.path.relpath(img_path, input_dir)
                filenames.append(rel_path)
    hogged = np.array(features)
    return hogged, filenames

# Example usage:
input_dir = '../CS610_AML_Group_Project/grayscale_images'
hogged, filenames = extract_hog_features_recursive(input_dir)
print(hogged.shape)  # (num_images, hog_feature_dim)

(5953, 6084)


In [None]:
#DO NOT RUN - save filenames for easier access (without needing to run hog function again)
np.savetxt("filename.csv",
        filenames,
        delimiter =", ",
        fmt ='% s')

In [26]:
#get filenames
import csv
filenames = []
with open("./filename.csv", 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        filenames.extend(row)

In [27]:
#Labeling
y = [f.split(os.sep)[0] for f in filenames]
#split data into train_test split
x = hogged.astype(np.float32)
y = np.array(y)
y, uniques = pd.factorize(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
x_train = pd.DataFrame(x_train, dtype=np.float32)
y_train = pd.Series(y_train, dtype=np.int32)
x_test = pd.DataFrame(x_test, dtype=np.float32)
y_test = pd.Series(y_test, dtype=np.int32)

In [28]:
#Check if data is prepared successfully
print("Number of Samples:", len(y_train))
print("Number of Labels:", len(np.unique(y_train)))
counts = y_train.value_counts()
print("Label Distribution:")
print(counts)

Number of Samples: 4763
Number of Labels: 50
Label Distribution:
0     120
7     120
15    120
16    120
13    120
36    120
37    120
29    120
26    120
18    120
21    119
33    119
28    119
10    119
2     119
48    119
45    119
43    119
39    118
41    118
5     118
42    118
20    118
49    116
34     86
47     86
14     86
27     85
25     80
40     78
31     78
35     78
3      76
17     76
22     74
44     74
1      74
32     74
6      74
12     74
8      73
46     70
19     70
30     69
38     64
24     63
11     62
23     62
9      60
4      59
Name: count, dtype: int64


#### 1) RandomForestClassifier - feature extraction by hog

In [None]:
param_distributions = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [10, 20, 30, 40, None],
    'max_features': ['sqrt', 'log2', 0.5, 0.8, 1.0]
}

rf = RandomForestClassifier(random_state=42)

random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

random_search.fit(x_train, y_train)

print("✅ Best params found:", random_search.best_params_)

In [None]:
rf = RandomForestClassifier(n_estimators=100, max_depth=25, random_state=42)
rf.fit(x_train, y_train)

y_pred = rf.predict(x_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.2397582269979852
                                     precision    recall  f1-score   support

                  adidas_forum_high       0.22      0.42      0.29        38
                   adidas_forum_low       0.50      0.17      0.26        23
                     adidas_gazelle       0.33      0.35      0.34        37
                      adidas_nmd_r1       0.39      0.29      0.33        24
                       adidas_samba       0.00      0.00      0.00        18
                  adidas_stan_smith       0.23      0.24      0.23        37
                   adidas_superstar       0.11      0.04      0.06        23
                  adidas_ultraboost       0.26      0.38      0.31        37
                 asics_gel-lyte_iii       0.53      0.39      0.45        23
             converse_chuck_70_high       0.40      0.21      0.28        19
              converse_chuck_70_low       0.20      0.30      0.24        37
converse_chuck_taylor_all-star_high       0.33

#### 2) KNNClassifier - feature extraction by hog

In [None]:
# Start timing
start_time = time.time()

# Base model
base_model = cuml.neighbors.KNeighborsClassifier()

# Hyperparameters
param_dist = {
    'n_neighbors': randint(1, 30),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Randomized search tuning
random_search = RandomizedSearchCV(
    base_model,
    param_dist,
    n_iter=50,
    scoring='accuracy',
    cv=5,
    verbose=2,
    random_state=42,
    error_score='raise'
)
random_search.fit(x_train, y_train)

# End timing
end_time = time.time()
training_time = end_time - start_time

[2025-06-16 19:29:27.663] [CUML] [info] cuML: Installed accelerator for sklearn.
[2025-06-16 19:29:38.496] [CUML] [info] cuML: Installed accelerator for umap.
[2025-06-16 19:29:38.510] [CUML] [info] cuML: Installed accelerator for hdbscan.
[2025-06-16 19:29:38.510] [CUML] [info] cuML: Successfully initialized accelerator.
[2025-06-16 19:29:38.530] [CUML] [info] Unused keyword parameter: handle during CPU estimator initialization
[2025-06-16 19:29:38.530] [CUML] [info] Unused keyword parameter: verbose during CPU estimator initialization
[2025-06-16 19:29:38.530] [CUML] [info] Unused keyword parameter: output_type during CPU estimator initialization
[2025-06-16 19:29:38.533] [CUML] [info] Unused keyword parameter: leaf_size during cuML estimator initialization
[2025-06-16 19:29:38.533] [CUML] [info] Unused keyword parameter: n_jobs during cuML estimator initialization
Fitting 5 folds for each of 50 candidates, totalling 250 fits
[2025-06-16 19:29:38.540] [CUML] [info] Unused keyword par

In [34]:
# Best model
best_model = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)
print(f"Best Accuracy: {random_search.best_score_:.6f}", )
print(f"Total Training Time: {training_time/60:.2f} minutes")

Best Parameters: {'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'distance'}
Best Accuracy: 0.288472
Total Training Time: 0.38 minutes


In [35]:
beta = 0.5  # mis-labelled sneakers are more costly than missing labels

# Predictions
y_pred_train = best_model.predict(x_train)
y_pred_test = best_model.predict(x_test)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\n TEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.998110434600042
Precision (macro): 0.9977670903469599
Recall (macro): 0.9976976318551373
F0.5-Score (macro): 0.9977159510029364

 TEST METRICS
Accuracy: 0.32605042016806723
Precision (macro): 0.3353867836211309
Recall (macro): 0.32119180107284623
F0.5-Score (macro): 0.326634977153412


#### 3) XGBoostClassifier - feature extraction by hog

In [None]:
# Further split into validation set for early stopping
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
x_train = pd.DataFrame(x_train, dtype=np.float32)
y_train = pd.Series(y_train, dtype=np.int32)
x_valid = pd.DataFrame(x_valid, dtype=np.float32)
y_valid = pd.Series(y_valid, dtype=np.int32)
x_test = pd.DataFrame(x_test, dtype=np.float32)
y_test = pd.Series(y_test, dtype=np.int32)

# Start timing
start_time = time.time()

# Balance class weights
sample_weights = compute_sample_weight(
    class_weight="balanced",
    y=y_train
)

# Base model
base_model = xgb.XGBClassifier(
    device="cuda",
    tree_method="hist",
    objective="multi:softprob",
    num_class=len(np.unique(y_train)),
    early_stopping_rounds=10,
    eval_metric=['merror','mlogloss'],
    random_state=42
)

# Hyperparameters
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 12),
    'learning_rate': uniform(0.01, 0.19),  # range: 0.01 to 0.2
    'subsample': uniform(0.7, 0.3),        # range: 0.7 to 1.0
    'colsample_bytree': uniform(0.7, 0.3)  # range: 0.7 to 1.0
}


# Randomized search tuning
random_search = RandomizedSearchCV(
    base_model,
    param_dist,
    n_iter=50,
    scoring='accuracy',
    cv=5,
    verbose=2,
    random_state=42,
    error_score='raise'
)
random_search.fit(
    x_train, y_train,
    sample_weight=sample_weights,
    eval_set=[(x_valid, y_valid)],
    verbose=0)

# End timing
end_time = time.time()
training_time = end_time - start_time

The cuml.accel extension is already loaded. To reload it, use:
  %reload_ext cuml.accel
Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  31.7s
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  34.9s
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  30.3s
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  35.6s
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  34.4s
[CV] END colsample_bytree=0.8337498258560773,

In [None]:
# Best model
best_model = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)
print(f"Best Accuracy: {random_search.best_score_:.6f}", )
print(f"Total Training Time: {training_time/60:.2f} minutes")

Best Parameters: {'colsample_bytree': np.float64(0.9950269422684528), 'learning_rate': np.float64(0.08577664406446507), 'max_depth': 3, 'n_estimators': 250, 'subsample': np.float64(0.7962340194915207)}
Best Accuracy: 0.2938822397645927
Total Training Time: 193.18 minutes


In [None]:
beta = 0.5  # mis-labelled sneakers are more costly than missing labels

# Predictions
y_pred_train = best_model.predict(x_train)
y_pred_test = best_model.predict(x_test)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\nTEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.9996637525218561
Precision (macro): 0.9995192307692307
Recall (macro): 0.9995967741935484
F0.5-Score (macro): 0.9995322002674942

 TEST METRICS
Accuracy: 0.3041722745625841
Precision (macro): 0.3239029516593595
Recall (macro): 0.2903715446023294
F0.5-Score (macro): 0.30795705037248194


### Pipeline Models using Feature Extraction Method 2 - Using pretrained CNN

ResNet50 will be used as the feature extractor due to its pre-trained weights, derived from large datasets like ImageNet, and is a popular choice to use for computer vision applications such as image classification.
Reference:
1) https://medium.com/@meetkalathiya1301/feature-extraction-using-pre-trained-models-for-image-classification-16e6ff43f268
2) https://stackoverflow.com/questions/62117707/extract-features-from-pretrained-resnet50-in-pytorch

In [15]:
#Process image data for feature extraction using CNN
base_output_dir = '../CS610_AML_Group_Project/resized_images'
img_transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=[0.485,0.456,0.406],std=[0.229,0.224,0.225])]) #mean and std based on ImageNet - normalise image data closer to normal distribution
img_dataset = datasets.ImageFolder(base_output_dir, transform=img_transform)
data_loader = DataLoader(img_dataset, batch_size=32, num_workers=4)

input_dir = '../CS610_AML_Group_Project/grayscale_images'

In [16]:
#define function for CNN feature extraction
def cnn_feature_extract(cnn_feature_extractor, data_loader):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    #prepare cnn model to use for feature extraction
    cnn_feature_extractor.eval()
    cnn_feature_extractor.fc = torch.nn.Identity() #replace fully connected layer of pretrained cnn with Identity layer
    for para in cnn_feature_extractor.parameters():
        para.requires_grad = False #freeze weights
    #feature extraction
    features_list, labels_list = [], []
    cnn_feature_extractor.to(device)
    with torch.no_grad():
        for images, labels in data_loader:
            images = images.to(device)
            feature = cnn_feature_extractor(images)
            feature = feature.view(feature.size(0),-1) #flatten into (n_samples, n_features) for non-CNN models
            #convert tensors into numpy for fitting into non-CNN models and add into lists
            features_list.append(feature.cpu().numpy())
            labels_list.append(labels.numpy())

    return cnn_feature_extractor, np.vstack(features_list), np.hstack(labels_list)

In [17]:
#initialise and extract features using CNN feature extractor
weights = models.ResNet50_Weights.IMAGENET1K_V2
resnet50_extractor = models.resnet50(weights=weights)
resnet50_extractor, X, y = cnn_feature_extract(resnet50_extractor, data_loader) #X = features, y =labels
#no need labelling as the numpy array is generated from the data_loader

Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 200MB/s]


In [18]:
#CNN training and test split
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)
x_train = pd.DataFrame(x_train, dtype=np.float32)
y_train = pd.Series(y_train, dtype=np.int32)
x_test = pd.DataFrame(x_test, dtype=np.float32)
y_test = pd.Series(y_test, dtype=np.int32)
#same as original flow
print("Number of Samples:", len(y_train))
print("Number of Labels:", len(np.unique(y_train)))
counts = y_train.value_counts()
print("Label Distribution:")
print(counts)

Number of Samples: 4763
Number of Labels: 50
Label Distribution:
0     120
7     120
15    120
16    120
13    120
36    120
37    120
29    120
26    120
18    120
21    119
33    119
28    119
10    119
2     119
48    119
45    119
43    119
39    118
41    118
5     118
42    118
20    118
49    116
34     86
47     86
14     86
27     85
25     80
40     78
31     78
35     78
3      76
17     76
22     74
44     74
1      74
32     74
6      74
12     74
8      73
46     70
19     70
30     69
38     64
24     63
11     62
23     62
9      60
4      59
Name: count, dtype: int64


#### 1) RandomForestClassifier - feature extraction by CNN

In [None]:
#test using method 2 feature extraction - cnn
%load_ext cuml.accel
rf2 = RandomForestClassifier(max_depth=25, max_features=np.float64(0.9849549260809971), n_estimators=87, random_state=42)

rf2.fit(x_train, y_train)
y_pred = rf2.predict(x_test)
print("Best score:", accuracy_score(y_test, y_pred))

The cuml.accel extension is already loaded. To reload it, use:
  %reload_ext cuml.accel
Best score: 0.37563025210084033


#### 2) KNNClassifier - feature extraction by CNN

In [19]:
# Start timing
start_time = time.time()

# Base model
base_model = cuml.neighbors.KNeighborsClassifier()

# Hyperparameters
param_dist = {
    'n_neighbors': randint(1, 30),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Randomized search tuning
random_search = RandomizedSearchCV(
    base_model,
    param_dist,
    n_iter=50,
    scoring='accuracy',
    cv=5,
    verbose=0,
    random_state=42,
    error_score='raise',
)
random_search.fit(x_train, y_train)

# End timing
end_time = time.time()
training_time = end_time - start_time

[2025-06-18 17:11:57.475] [CUML] [info] Unused keyword parameter: handle during CPU estimator initialization
[2025-06-18 17:11:57.475] [CUML] [info] Unused keyword parameter: verbose during CPU estimator initialization
[2025-06-18 17:11:57.475] [CUML] [info] Unused keyword parameter: output_type during CPU estimator initialization
[2025-06-18 17:11:57.479] [CUML] [info] Unused keyword parameter: leaf_size during cuML estimator initialization
[2025-06-18 17:11:57.479] [CUML] [info] Unused keyword parameter: n_jobs during cuML estimator initialization
[2025-06-18 17:11:57.489] [CUML] [info] Unused keyword parameter: leaf_size during cuML estimator initialization
[2025-06-18 17:11:57.489] [CUML] [info] Unused keyword parameter: n_jobs during cuML estimator initialization
[2025-06-18 17:11:58.645] [CUML] [info] Unused keyword parameter: leaf_size during cuML estimator initialization
[2025-06-18 17:11:58.645] [CUML] [info] Unused keyword parameter: n_jobs during cuML estimator initializatio

In [20]:
# Best model
best_model = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)
print(f"Best Accuracy: {random_search.best_score_:.6f}", )
print(f"Total Training Time: {training_time/60:.2f} minutes")

Best Parameters: {'metric': 'cosine', 'n_neighbors': 22, 'weights': 'uniform'}
Best Accuracy: 0.368048
Total Training Time: 0.16 minutes


In [21]:
beta = 0.5  # mis-labelled sneakers are more costly than missing labels

# Predictions
y_pred_train = best_model.predict(x_train)
y_pred_test = best_model.predict(x_test)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\nTEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.4940163762334663
Precision (macro): 0.5302002197059281
Recall (macro): 0.4782061778595254
F0.5-Score (macro): 0.5103194341030439

TEST METRICS
Accuracy: 0.39915966386554624
Precision (macro): 0.43488781188326037
Recall (macro): 0.381419474026205
F0.5-Score (macro): 0.40741327451969217


#### 3) XGBoostClassifier - feature extraction by CNN

In [22]:
# Further split into validation set for early stopping
x_train, x_valid, y_train, y_valid = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)
x_train, x_test, y_train, y_test = train_test_split(x_train,y_train, test_size=0.2, random_state=42, stratify=y)
x_train = pd.DataFrame(x_train, dtype=np.float32)
y_train = pd.Series(y_train, dtype=np.int32)
x_valid = pd.DataFrame(x_valid, dtype=np.float32)
y_valid = pd.Series(y_valid, dtype=np.int32)
x_test = pd.DataFrame(x_test, dtype=np.float32)
y_test = pd.Series(y_test, dtype=np.int32)

# Start timing
start_time = time.time()

# Balance class weights
sample_weights = compute_sample_weight(
    class_weight="balanced",
    y=y_train
)

# Base model
base_model = xgb.XGBClassifier(
    device="cuda",
    tree_method="hist",
    objective="multi:softprob",
    num_class=len(np.unique(y_train)),
    early_stopping_rounds=10,
    eval_metric=['merror','mlogloss'],
    random_state=42
)

# Hyperparameters
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 12),
    'learning_rate': uniform(0.01, 0.19),  # range: 0.01 to 0.2
    'subsample': uniform(0.7, 0.3),        # range: 0.7 to 1.0
    'colsample_bytree': uniform(0.7, 0.3)  # range: 0.7 to 1.0
}


# Randomized search tuning
random_search = RandomizedSearchCV(
    base_model,
    param_dist,
    n_iter=20,
    scoring='accuracy',
    cv=5,
    verbose=2,
    random_state=42,
    error_score='raise'
)
random_search.fit(
    x_train, y_train,
    sample_weight=sample_weights,
    eval_set=[(x_valid, y_valid)],
    verbose=0)

# End timing
end_time = time.time()
training_time = end_time - start_time

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  25.1s
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  24.8s
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  28.7s
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  28.1s
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time=  24.7s
[CV] END colsample_bytree=0.8337498258560773, learning_rate=0.028995234005420548, max_depth=10, n_estimators=422, subsample=0.8803345

In [23]:
# Best model
best_model = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)
print(f"Best Accuracy: {random_search.best_score_:.6f}", )
print(f"Total Training Time: {training_time/60:.2f} minutes")

Best Parameters: {'colsample_bytree': np.float64(0.7692681476866446), 'learning_rate': np.float64(0.05579483854494223), 'max_depth': 9, 'n_estimators': 477, 'subsample': np.float64(0.848553073033381)}
Best Accuracy: 0.480193
Total Training Time: 51.91 minutes


In [24]:
beta = 0.5  # mis-labelled sneakers are more costly than missing labels

# Predictions
y_pred_train = best_model.predict(x_train)
y_pred_test = best_model.predict(x_test)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\nTEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.9986880083967462
Precision (macro): 0.9980447662936142
Recall (macro): 0.998463719663305
F0.5-Score (macro): 0.9981037815524161

TEST METRICS
Accuracy: 0.49789915966386555
Precision (macro): 0.49190364467781644
Recall (macro): 0.4849040774557971
F0.5-Score (macro): 0.4810487944253243


# <i> Shift to end of both pipelines</i>
<i>The CNN method proved to be more beneficial in feature extraction than the usual hog feature extraction method, as the RandomForestClassifier model (both with the same parameters set) trained with the CNN-extracted features had produced a higher accuracy of ~37.56%, as compared to the model trained with hog-extracted data which produced an accuracy of ~23.36%.

### Model Stacking

Stacking is a method that help to improve the overall performance of models as the weakness of a certain models can be compensated by the strengths of other models. Hence, we decided to utilise stacking to improve the overall performance of the model. For this technique, only the CNN-feature extraction method will be used as it has been proven to provide better model performance (in terms of accuracy).
<br>
<br>
The models used earlier in the code will be used in this stacking technique to determine if stacking improves the overall performance.

In [None]:
estimators = [('rcf_model',RandomForestClassifier(random_state=42)),("xgboost",xgb.XGBClassifier(random_state=42)),("knn", KNeighborsClassifier())]
stacking_cf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5, passthrough=False, verbose=1)
#split dataset into training and test dataset and fit into model via cross_val_score (cross validation approach)
stacking_cf.fit(x_train,y_train)

NameError: name 'x_train' is not defined

In [None]:
print("Accuracy Score:",stacking_cf.score(x_test, y_encoded_test))

Accuracy Score: 0.48823529411764705


### CNN model