## CS610: Applied Machine Learning Project

This notebook documents our workflow and code for building an image classification model that meets the business objective of identifying different shoe models. The dataset can be found in the repository and is sourced from Kaggle <br>(<i>**state source?**</i>).

### Install and import packages

In [6]:
!pip install opencv-python torch torchvision

Collecting torchvision
  Downloading torchvision-0.22.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Downloading torchvision-0.22.1-cp312-cp312-manylinux_2_28_x86_64.whl (7.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.5/7.5 MB 10.9 MB/s eta 0:00:00
Installing collected packages: torchvision
Successfully installed torchvision-0.22.1


In [28]:
# import cudf
import numpy as np
import pandas as pd
from PIL import Image
import cv2
import os
import tqdm
import xgboost as xgb
import time
from skimage.feature import hog
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score
import torch
from torch import nn, optim
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader, random_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.class_weight import compute_sample_weight
import warnings
import pickle
warnings.filterwarnings("ignore")

In [None]:
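# GPU only (RAPIDS cuML): skip this cell when running on CPU.
# Note: the imports below shadow scikit-learn's train_test_split and accuracy_score.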
import cuml
print(cuml.__version__)
from cuml.model_selection import train_test_split
from cuml.metrics import accuracy_score
%load_ext cuml.accel

25.06.00
cuML: Accelerator installed.


## Check Data Source

In [4]:
def count_images(datasource_path):
    image_counts = {}
    image_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp'}

    if not os.path.isdir(datasource_path):
        print(f"Error: Path '{datasource_path}' is not a directory.")
        return image_counts

    for subfolder_name in os.listdir(datasource_path):
        subfolder_path = os.path.join(datasource_path, subfolder_name)

        if os.path.isdir(subfolder_path):
            count = 0
            for file_name in os.listdir(subfolder_path):
                file_path = os.path.join(subfolder_path, file_name)
                if os.path.isfile(file_path):
                    _, ext = os.path.splitext(file_name)
                    if ext.lower() in image_extensions:
                        count += 1
            image_counts[subfolder_name] = count
    return image_counts

image_dir = 'datasource'
print(f"Scanning: {image_dir}")
counts = count_images(image_dir)

if counts:
    for folder, count in counts.items():
        print(f"{folder}: {count} images")
else:
    print("No images found or path is incorrect/empty.")

Scanning: datasource
adidas_forum_high: 150 images
adidas_forum_low: 115 images
adidas_gazelle: 149 images
adidas_nmd_r1: 115 images
adidas_samba: 115 images
adidas_stan_smith: 147 images
adidas_superstar: 115 images
adidas_ultraboost: 150 images
asics_gel-lyte_iii: 91 images
converse_chuck_70_high: 115 images
converse_chuck_70_low: 148 images
converse_chuck_taylor_all-star_high: 114 images
converse_chuck_taylor_all-star_low: 114 images
converse_one_star: 150 images
new_balance_327: 108 images
new_balance_550: 150 images
new_balance_574: 150 images
new_balance_990: 113 images
new_balance_992: 150 images
nike_air_force_1_high: 115 images
nike_air_force_1_low: 147 images
nike_air_force_1_mid: 148 images
nike_air_jordan_11: 113 images
nike_air_jordan_1_high: 115 images
nike_air_jordan_1_low: 115 images
nike_air_jordan_3: 100 images
nike_air_jordan_4: 150 images
nike_air_max_1: 106 images
nike_air_max_270: 149 images
nike_air_max_90: 150 images
nike_air_max_95: 115 images
nike_air_max_97: 

## Data Augmentation

Data augmentation can improve the performance and robustness of image classification models.  
Specifically, it offers the following benefits:  
1. Preventing overfitting
2. Increasing dataset size and diversity
3. Improving generalisation capabilities
4. Enhancing model robustness
5. Addressing class imbalance


**Common Data Augmentation Techniques for Images**  
1. Geometric Transformations:
    a. Flipping  
    b. Rotation  
    c. Cropping  
2. Photometric Transformations:
    a. Brightness Adjustment  
    b. Contrast Adjustment  
    c. Saturation Adjustment  
    d. Hue Adjustment  
    e. Grayscaling  
3. Noise Injection
4. Random Erasing

In our project, we are building models that identify sneakers from an uploaded full image of the sneaker.  
To improve robustness for this use case, we adopted three techniques: flipping, rotation and brightness adjustment.

In [5]:
def image_augmentation(image, augmentation_type, angle_range=(-15, 15), brightness_range=(0.7, 1.3)):
    """Return an augmented copy of a BGR image: 'flip', 'rotate' or 'brightness'."""
    if augmentation_type == 'flip':
        return cv2.flip(image, 1)
    elif augmentation_type == 'rotate':
        angle = np.random.randint(angle_range[0], angle_range[1] + 1)
        (h, w) = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        return cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)
    elif augmentation_type == 'brightness':
        brightness_factor = np.random.uniform(brightness_range[0], brightness_range[1])
        return np.clip(image * brightness_factor, 0, 255).astype(np.uint8)
    return image  # unknown augmentation type: return the original image
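
To make the pixel-level effect of the `flip` and `brightness` branches concrete, here is a minimal NumPy-only sketch on a made-up 2x3 array (`tiny` is hypothetical and not part of the dataset). It assumes that a horizontal flip simply reverses the column order, which is what `cv2.flip(image, 1)` does, and it shows why clipping is needed before casting back to `uint8`.

```python
import numpy as np

# Hypothetical single-channel "image" with values near both ends of the uint8 range
tiny = np.array([[10, 200, 250],
                 [0, 100, 255]], dtype=np.uint8)

# Horizontal flip: reverse the column order (equivalent to cv2.flip(image, 1))
flipped = tiny[:, ::-1]

# Brightness scaling: multiply, clip to [0, 255], then cast back to uint8.
# Without np.clip, values above 255 would not be representable in uint8.
brightened = np.clip(tiny * 1.3, 0, 255).astype(np.uint8)

print(flipped)
print(brightened)
```

Bright pixels saturate at 255 rather than overflowing, so a brightness factor above 1 can wash out highlights but never produces dark artefacts.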

In [6]:
# Create a directory to store the augmented images
aug_dir = 'augmented_images'
os.makedirs(aug_dir, exist_ok = True)

all_images_paths = []
all_images_labels = []



sneaker_names_list = os.listdir(image_dir)
print("====== Image Augmentation Starts ======")
for sneaker_name in tqdm.tqdm(sneaker_names_list, desc="Augmenting"):
    original_path = os.path.join(image_dir, sneaker_name)
    if os.path.isdir(original_path):
        aug_path = os.path.join(aug_dir, sneaker_name)
        os.makedirs(aug_path, exist_ok = True)
        
        for image in os.listdir(original_path):

            # read the original image; skip files OpenCV cannot decode
            image_full_path = os.path.join(original_path, image)
            original_image = cv2.imread(image_full_path)
            if original_image is None:
                print(f'WARNING: CANNOT READ IMAGE {image_full_path}, SKIPPED!')
                continue
            
            base, ext = os.path.splitext(image)

            # build the output paths
            # 1. original
            image_name_original = f'{base}_original{ext}'
            original_image_saved_path = os.path.join(aug_path,image_name_original)
            # 2. flipped
            image_name_flipped = f'{base}_flipped{ext}'
            flipped_image_saved_path = os.path.join(aug_path,image_name_flipped)
            # 3. rotated
            image_name_rotated = f'{base}_rotated{ext}'
            rotated_image_saved_path = os.path.join(aug_path,image_name_rotated)
            # 4. bright
            image_name_brightened = f'{base}_brightened{ext}'
            brightened_image_saved_path = os.path.join(aug_path,image_name_brightened)

            # augmentation operations
            # 1. original
            cv2.imwrite(original_image_saved_path, original_image)
            all_images_paths.append(original_image_saved_path)
            all_images_labels.append(sneaker_name)

            # 2. flipped
            img_flipped = image_augmentation(original_image, augmentation_type = 'flip')
            cv2.imwrite(flipped_image_saved_path, img_flipped)
            all_images_paths.append(flipped_image_saved_path)
            all_images_labels.append(sneaker_name)

            # 3. rotated
            img_rotated = image_augmentation(original_image, augmentation_type = 'rotate')
            cv2.imwrite(rotated_image_saved_path, img_rotated)
            all_images_paths.append(rotated_image_saved_path)
            all_images_labels.append(sneaker_name)

            # 4. brightness
            img_bright = image_augmentation(original_image, augmentation_type = 'brightness')
            cv2.imwrite(brightened_image_saved_path, img_bright)
            all_images_paths.append(brightened_image_saved_path)
            all_images_labels.append(sneaker_name)

print("====== Image Augmentation Completed ======")

image_df_augmented = pd.DataFrame({
    'path': all_images_paths,
    'label': all_images_labels
})

print(f"We have now {len(image_df_augmented)} images for modelling")

image_df_augmented



Augmenting: 100%|██████████| 50/50 [01:54<00:00,  2.28s/it]

We have now 25920 images for modelling





| | path | label |
|---|---|---|
| 0 | augmented_images\adidas_forum_high\0001_origin... | adidas_forum_high |
| 1 | augmented_images\adidas_forum_high\0001_flippe... | adidas_forum_high |
| 2 | augmented_images\adidas_forum_high\0001_rotate... | adidas_forum_high |
| 3 | augmented_images\adidas_forum_high\0001_bright... | adidas_forum_high |
| 4 | augmented_images\adidas_forum_high\0002_origin... | adidas_forum_high |
| ... | ... | ... |
| 25915 | augmented_images\yeezy_slide\0144_brightened.jpg | yeezy_slide |
| 25916 | augmented_images\yeezy_slide\0145_original.jpg | yeezy_slide |
| 25917 | augmented_images\yeezy_slide\0145_flipped.jpg | yeezy_slide |
| 25918 | augmented_images\yeezy_slide\0145_rotated.jpg | yeezy_slide |


In [8]:
image_df_augmented.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25920 entries, 0 to 25919
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   path    25920 non-null  object
 1   label   25920 non-null  object
dtypes: object(2)
memory usage: 405.1+ KB


## Image Processing

### Resizing

In [9]:
# resize images
def resize_image_in_folder(input_dir, output_dir, size=(224, 224), desc='resizing images'):
    if not os.path.exists(input_dir):
        print(f"Input directory {input_dir} does not exist. Please check the path.")
        return

    os.makedirs(output_dir, exist_ok=True)
    supported_formats = ('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.tiff', '.webp')
    for filename in os.listdir(input_dir):
        if filename.lower().endswith(supported_formats):
            img_input_path = os.path.join(input_dir, filename)
            img_output_path = os.path.join(output_dir, filename)
            try:
                img = cv2.imread(img_input_path, cv2.IMREAD_UNCHANGED)

                if img is None:
                    print(f"Error loading {img_input_path}")
                    continue
                resized_img = cv2.resize(img, size, interpolation=cv2.INTER_LANCZOS4)

                if img_output_path.lower().endswith(('.jpg', '.jpeg')) and resized_img.shape[-1] == 4:
                    resized_img = cv2.cvtColor(resized_img, cv2.COLOR_BGRA2BGR)
                cv2.imwrite(img_output_path, resized_img)
            except Exception as e:
                print(f"Error processing {img_input_path}: {e}")

In [10]:
# process all folders
def batch_resize_images(base_input_dir, base_output_dir, size=(128, 128)):
    if not os.path.exists(base_input_dir):
        print(f"Base directory {base_input_dir} does not exist. Please check the path.")
        return

    os.makedirs(base_output_dir, exist_ok=True) # if output directory does not exist, create it.

    for folder in tqdm.tqdm(os.listdir(base_input_dir)):
        current_input_subfolder = os.path.join(base_input_dir, folder)
        current_output_subfolder = os.path.join(base_output_dir, folder)

        if os.path.isdir(current_input_subfolder):
            resize_image_in_folder(current_input_subfolder, current_output_subfolder, size=size)
        else:
            print(f"Skipping {current_input_subfolder} as it is not a directory.")

    print("Batch resizing completed.")

In [11]:
input_dir = '../CS610_AML_Group_Project/augmented_images'
output_dir = '../CS610_AML_Group_Project/resized_images'

In [12]:
batch_resize_images(input_dir, output_dir, size=(128, 128))

100%|██████████| 50/50 [05:53<00:00,  7.07s/it]

Batch resizing completed.





### Gray Scaling

In [None]:
def grayscale_image_in_folder(input_dir, output_dir):
    if not os.path.exists(input_dir):
        print(f"Input directory {input_dir} does not exist. Please check the path.")
        return

    os.makedirs(output_dir, exist_ok=True)
    supported_formats = ('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.tiff', '.webp')
    for filename in os.listdir(input_dir):
        if filename.lower().endswith(supported_formats):
            img_input_path = os.path.join(input_dir, filename)
            img_output_path = os.path.join(output_dir, filename)
            try:
                img = cv2.imread(img_input_path)
                if img is None:
                    print(f"Error loading {img_input_path}")
                    continue
                # Convert to grayscale
                gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                cv2.imwrite(img_output_path, gray_img)
            except Exception as e:
                print(f"Error processing {img_input_path}: {e}")

In [14]:
def batch_grayscale_images(base_input_dir, base_output_dir):
    if not os.path.exists(base_input_dir):
        print(f"Base directory {base_input_dir} does not exist. Please check the path.")
        return

    os.makedirs(base_output_dir, exist_ok=True)

    for folder in tqdm.tqdm(os.listdir(base_input_dir)):
        current_input_subfolder = os.path.join(base_input_dir, folder)
        current_output_subfolder = os.path.join(base_output_dir, folder)

        if os.path.isdir(current_input_subfolder):
            grayscale_image_in_folder(current_input_subfolder, current_output_subfolder)
        else:
            print(f"Skipping {current_input_subfolder} as it is not a directory.")

    print("Batch grayscale completed.")

In [15]:
base_input_dir = '../CS610_AML_Group_Project/resized_images'
base_output_dir = '../CS610_AML_Group_Project/grayscale_images'
batch_grayscale_images(base_input_dir, base_output_dir)

100%|██████████| 50/50 [04:57<00:00,  5.95s/it]

Batch grayscale completed.
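
As a rough illustration of what the `cv2.COLOR_BGR2GRAY` conversion computes, the sketch below applies the standard luma weights (0.299 R + 0.587 G + 0.114 B) to two made-up pixels with NumPy. The pixel values are hypothetical, and OpenCV's own fixed-point arithmetic may differ from this floating-point version by a rounding step on some inputs.

```python
import numpy as np

# Two hypothetical pixels in OpenCV's BGR channel order: pure blue and pure white
bgr_pixels = np.array([[[255, 0, 0], [255, 255, 255]]], dtype=np.uint8)

# Luma weights reordered to match B, G, R
weights_bgr = np.array([0.114, 0.587, 0.299])

# Weighted sum over the channel axis collapses (H, W, 3) to (H, W)
gray = np.round(bgr_pixels.astype(np.float64) @ weights_bgr).astype(np.uint8)

print(gray.shape, gray)
```

Blue contributes least to perceived brightness, so a saturated blue pixel becomes a dark grey while white stays at full intensity.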





## Feature extraction and Model Training
Feature extraction is an important part of data processing: a suitable method helps the models learn the relevant features and therefore reach higher accuracy. To investigate which feature extraction method is better, two RandomForestClassifier models with the same set of parameters (found previously using RandomizedSearchCV) were used. The accuracy score is used to determine which method is better for this use case.

### Pipeline Models using Feature Extraction Method 1 - By HOG

#### Feature Extraction by HOG

In [3]:
def extract_hog_features_recursive(input_dir, force_size = (128, 128), pixels_per_cell=(16, 16), cells_per_block=(2, 2)):
    features = []
    filenames = []
    supported_formats = ('.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.webp')
    for root, dirs, files in tqdm.tqdm(os.walk(input_dir)):
        for filename in tqdm.tqdm(files):
            if filename.lower().endswith(supported_formats):
                img_path = os.path.join(root, filename)
                img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
                if img is None:
                    continue
                # force a fixed size so every HOG vector has the same length
                img_resized = cv2.resize(img, force_size, interpolation=cv2.INTER_AREA)
                # pixel normalisation
                img_normalised = img_resized.astype(np.float32) / 255.0
                # Extract HOG features
                try:
                    hog_feature = hog(img_normalised, pixels_per_cell=pixels_per_cell, cells_per_block=cells_per_block, feature_vector=True)
                    features.append(hog_feature)
                    rel_path = os.path.relpath(img_path, input_dir)
                    filenames.append(rel_path)
                except Exception as e:
                    print(f"WARNING: {img_path} failed HOG feature extraction: {e}")
                    continue
    hogged = np.array(features)
    return hogged, filenames

# Example usage:
input_dir = '../CS610_AML_Group_Project/grayscale_images'
print('====== HOG Extraction Starts! ======')
hogged, filenames = extract_hog_features_recursive(input_dir)
print('====== HOG Extraction Completed! ======')
print(hogged.shape)  # (num_images, hog_feature_dim)



0it [00:00, ?it/s]
100%|██████████| 600/600 [00:08<00:00, 70.32it/s]
100%|██████████| 460/460 [00:07<00:00, 65.12it/s]
100%|██████████| 596/596 [00:08<00:00, 66.39it/s]
100%|██████████| 460/460 [00:06<00:00, 70.68it/s]
100%|██████████| 460/460 [00:05<00:00, 81.87it/s]
100%|██████████| 588/588 [00:09<00:00, 65.17it/s]
100%|██████████| 456/456 [00:06<00:00, 72.51it/s]
100%|██████████| 600/600 [00:08<00:00, 71.36it/s]
100%|██████████| 364/364 [00:05<00:00, 70.69it/s]
100%|██████████| 460/460 [00:06<00:00, 75.32it/s]
100%|██████████| 592/592 [00:08<00:00, 69.15it/s]
100%|██████████| 456/456 [00:05<00:00, 79.02it/s]
100%|██████████| 456/456 [00:05<00:00, 78.55it/s]
100%|██████████| 600/600 [00:09<00:00, 65.09it/s]
100%|██████████| 432/432 [00:06<00:00, 63.86it/s]
100%|██████████| 600/600 [00:09<00:00, 63.39it/s]
100%|██████████| 600/600 [00:08<00:00, 71.59it/s]
100%|██████████| 452/452 [00:06<00:00, 73.23it/s]
100%|██████████| 600/600 [00:08<00:00, 71.36it/s]
100%|██████████| 460/460 [00:06

(25920, 1764)


In [4]:
print("====== Feature Matrix Check ======")
print(hogged)
print(f"Shape: {hogged.shape}")

[[0.16098824 0.         0.11383587 ... 0.01137023 0.00299632 0.        ]
 [0.08114031 0.03628705 0.06884983 ... 0.         0.         0.        ]
 [0.19807696 0.09491005 0.16007036 ... 0.00457401 0.         0.        ]
 ...
 [0.         0.         0.         ... 0.00290865 0.         0.        ]
 [0.11182898 0.         0.         ... 0.         0.         0.        ]
 [0.02163972 0.         0.09180957 ... 0.02596537 0.0046525  0.        ]]
Shape: (25920, 1764)
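
The feature length of 1764 follows directly from the HOG parameters. The short calculation below reproduces it, assuming `orientations=9` (the scikit-image default, which the call above does not override) and blocks that slide one cell at a time.

```python
image_size = (128, 128)
pixels_per_cell = (16, 16)
cells_per_block = (2, 2)
orientations = 9  # assumption: scikit-image default, not set explicitly in the notebook

# Number of cells along each axis
cells = (image_size[0] // pixels_per_cell[0], image_size[1] // pixels_per_cell[1])

# Blocks slide over the cell grid with a stride of one cell
blocks = (cells[0] - cells_per_block[0] + 1, cells[1] - cells_per_block[1] + 1)

# Each block concatenates one orientation histogram per cell it covers
values_per_block = cells_per_block[0] * cells_per_block[1] * orientations

hog_length = blocks[0] * blocks[1] * values_per_block
print(cells, blocks, values_per_block, hog_length)  # (8, 8) (7, 7) 36 1764
```

Halving `pixels_per_cell` to (8, 8) would give a 16x16 cell grid and 15x15 blocks, so the vector would grow to 8100 values per image.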


In [5]:
print("====== File Check ======")
print(filenames)

['adidas_forum_high/0001_brightened.jpg', 'adidas_forum_high/0001_flipped.jpg', 'adidas_forum_high/0001_original.jpg', 'adidas_forum_high/0001_rotated.jpg', 'adidas_forum_high/0002_brightened.jpg', 'adidas_forum_high/0002_flipped.jpg', 'adidas_forum_high/0002_original.jpg', 'adidas_forum_high/0002_rotated.jpg', 'adidas_forum_high/0003_brightened.jpg', 'adidas_forum_high/0003_flipped.jpg', 'adidas_forum_high/0003_original.jpg', 'adidas_forum_high/0003_rotated.jpg', 'adidas_forum_high/0004_brightened.jpg', 'adidas_forum_high/0004_flipped.jpg', 'adidas_forum_high/0004_original.jpg', 'adidas_forum_high/0004_rotated.jpg', 'adidas_forum_high/0005_brightened.jpg', 'adidas_forum_high/0005_flipped.jpg', 'adidas_forum_high/0005_original.jpg', 'adidas_forum_high/0005_rotated.jpg', 'adidas_forum_high/0006_brightened.jpg', 'adidas_forum_high/0006_flipped.jpg', 'adidas_forum_high/0006_original.jpg', 'adidas_forum_high/0006_rotated.jpg', 'adidas_forum_high/0007_brightened.jpg', 'adidas_forum_high/000

In [10]:
#DO NOT RUN - save filenames and hogged for easier access (without needing to run hog function again)
np.savetxt("filename.csv", filenames, delimiter=",", fmt='%s')
np.save("hogged.npy", hogged)

In [5]:
#get filenames
import csv
filenames = []
with open("./filename.csv", 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        filenames.extend(row)
len(filenames)

25920

In [15]:
#get hogged
hogged = np.load("hogged.npy")
hogged.shape

(25920, 1764)

In [6]:
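# Labelling: the class name is the first folder in each relative path.
# Normalise separators so this works for paths written on Windows or POSIX.
y = [f.replace('\\', '/').split('/')[0] for f in filenames]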
#split data into train_test split
x = hogged.astype(np.float32)
y = np.array(y)
y, uniques = pd.factorize(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
x_train = pd.DataFrame(x_train, dtype=np.float32)
y_train = pd.Series(y_train, dtype=np.int32)
x_test = pd.DataFrame(x_test, dtype=np.float32)
y_test = pd.Series(y_test, dtype=np.int32)
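
Because every source photo contributes four files (original, flipped, rotated, brightened), a purely random split can place variants of the same photo in both the training and the test set, which may inflate test scores. One possible safeguard is to split by source photo instead. The sketch below is an illustration only: `source_image_id` and `demo_filenames` are hypothetical, and it assumes filenames follow the `<label>/<id>_<variant>.<ext>` pattern seen above.

```python
import re

import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def source_image_id(rel_path):
    """Map 'label/0001_flipped.jpg' to 'label/0001' (hypothetical helper)."""
    normalised = rel_path.replace("\\", "/")
    return re.sub(r"_(original|flipped|rotated|brightened)\.[^.]+$", "", normalised)


# Hypothetical stand-in for the notebook's `filenames` list
demo_filenames = [
    f"{label}/{i:04d}_{variant}.jpg"
    for label in ("adidas_samba", "yeezy_slide")
    for i in range(1, 11)
    for variant in ("original", "flipped", "rotated", "brightened")
]
groups = np.array([source_image_id(f) for f in demo_filenames])

# All four variants of a photo share one group, so they stay on the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo_filenames, groups=groups))

shared = set(groups[train_idx]) & set(groups[test_idx])
print(len(train_idx), len(test_idx), len(shared))
```

Note that `GroupShuffleSplit` does not stratify by class; if class balance across the split matters, `StratifiedGroupKFold` from scikit-learn may be a closer fit.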

In [7]:
#Check if data is prepared successfully
print("Number of Samples:", len(y_train))
print("Number of Labels:", len(np.unique(y_train)))
counts = y_train.value_counts()
print("Label Distribution:")
print(counts)

Number of Samples: 20736
Number of Labels: 50
Label Distribution:
15    480
37    480
16    480
0     480
13    480
18    480
29    480
26    480
36    480
7     480
45    477
33    477
28    477
2     477
21    474
10    474
48    473
41    473
39    473
43    473
42    470
20    470
5     470
49    464
35    368
31    368
40    368
9     368
3     368
32    368
24    368
30    368
1     368
38    368
46    368
19    368
4     368
6     365
12    365
23    365
11    365
44    362
17    362
22    362
47    346
14    346
34    342
27    339
25    320
8     291
Name: count, dtype: int64
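
The label distribution above is indexed by integer codes rather than sneaker names. `pd.factorize` assigns codes in order of first appearance and also returns the lookup array, so codes (or model predictions) can be mapped back to names. A minimal sketch with made-up labels (`demo_labels` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical labels standing in for the notebook's `y` before factorisation
demo_labels = np.array(["yeezy_slide", "adidas_samba", "yeezy_slide", "nike_air_max_90"])

# codes: integer label per sample; uniques: code -> original name
codes, uniques = pd.factorize(demo_labels)

# Indexing `uniques` with codes (or with predicted codes) recovers the names
decoded = uniques[codes]

print(codes)
print(list(decoded))
```

In the notebook, the same lookup would be `uniques[y_test_pred]`, assuming the `uniques` array returned by the earlier `pd.factorize(y)` call is still in scope.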


#### Feature Standardisation

In [7]:
print("\n====== Feature Standardisation Started! ======")
scaler = StandardScaler()
scaler.fit(x_train) 

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print("\n====== Feature Standardisation Completed! ======")
print(f"The Shape for Training Set after Feature Standardisation: {x_train_scaled.shape}")
print(f"The Shape for Testing Set after Feature Standardisation: {x_test_scaled.shape}")




The Shape for Training Set after Feature Standardisation: (20736, 1764)
The Shape for Testing Set after Feature Standardisation: (5184, 1764)
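
The key point in the cell above is that the scaler is fitted on the training set only and then reused for the test set. The NumPy sketch below spells out that arithmetic on random toy data (the arrays are made up); it uses the population standard deviation, which is what `StandardScaler` computes.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy training features
test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))    # toy test features

# Statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0)

# The same statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled test set will generally not have exactly zero mean or unit variance, which is expected: it is measured against the training distribution rather than its own.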


#### Dimensionality Reduction by PCA

In [8]:
print("\n====== Dimensionality Reduction by PCA Started! ======")
pca = PCA(n_components=0.85, random_state=42) 
pca.fit(x_train_scaled)


x_train_pca = pca.transform(x_train_scaled)
x_test_pca = pca.transform(x_test_scaled)

print("\n====== Dimensionality Reduction by PCA Completed! ======")
print(f"The Shape for Training Set after Dimensionality Reduction by PCA: {x_train_pca.shape}")
print(f"The Shape for Testing Set after Dimensionality Reduction by PCA: {x_test_pca.shape}")
print(f"The Number of Chosen PCA: {pca.n_components_}")
print(f"The Explained Variance Ratio: {np.sum(pca.explained_variance_ratio_):.4f}")



The Shape for Training Set after Dimensionality Reduction by PCA: (20736, 242)
The Shape for Testing Set after Dimensionality Reduction by PCA: (5184, 242)
The Number of Chosen PCA: 242
The Explained Variance Ratio: 0.8504
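
Passing a float such as `n_components=0.85` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction, which is how 1764 features reduce to 242 here. The sketch below reproduces the selection rule with NumPy on toy data; the data and variable names are hypothetical, and it is meant as an illustration of the idea rather than scikit-learn's exact code path.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 10 features with steadily decreasing spread
toy = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

# PCA works on centred data; squared singular values are proportional to variance
centred = toy - toy.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Keep the first k components whose cumulative ratio passes the target
cumulative = np.cumsum(explained_variance_ratio)
target = 0.85
n_keep = int(np.searchsorted(cumulative, target, side="right")) + 1

print(n_keep, cumulative[n_keep - 1])
```

This also explains why the reported explained variance (0.8504) sits just above the requested 0.85: the last component added is the one that crosses the threshold.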


#### 1) RandomForestClassifier - feature extraction by hog

In [None]:
param_distributions = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [10, 20, 30, 40],
    'max_features': ['sqrt', 'log2', 0.5, 0.8, 1.0]
}

rf = RandomForestClassifier(random_state=42)

random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

random_search.fit(x_train_pca, y_train)

print("Best params found:", random_search.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best params found: {'n_estimators': 150, 'max_features': 0.5, 'max_depth': 40}


In [None]:
best_hog_rf = random_search.best_estimator_

# Accuracy
print("====== Accuracy ======")
y_train_pred = best_hog_rf.predict(x_train_pca)
print("Accuracy on Training Set:", accuracy_score(y_train, y_train_pred))

y_test_pred = best_hog_rf.predict(x_test_pca)
print("Accuracy on Testing Set:", accuracy_score(y_test, y_test_pred))

# Precision
print("====== Precision (Macro) ======")
precision_macro_train = precision_score(y_train, y_train_pred, average='macro')
print("Precision (Macro) on Training Set: ", precision_macro_train)
precision_macro_test = precision_score(y_test, y_test_pred, average='macro')
print("Precision (Macro) on Testing Set: ", precision_macro_test)

print("====== Precision (Micro) ======")
precision_micro_train = precision_score(y_train, y_train_pred, average='micro')
print("Precision (Micro) on Training Set: ", precision_micro_train)
precision_micro_test = precision_score(y_test, y_test_pred, average='micro')
print("Precision (Micro) on Testing Set: ", precision_micro_test)

print("====== Precision (Weighted) ======")
precision_weighted_train = precision_score(y_train, y_train_pred, average='weighted')
print("Precision (Weighted) on Training Set: ", precision_weighted_train)
precision_weighted_test = precision_score(y_test, y_test_pred, average='weighted')
print("Precision (Weighted) on Testing Set: ", precision_weighted_test)


# Recall
print("====== Recall (Macro) ======")
recall_macro_train = recall_score(y_train, y_train_pred, average='macro')
print("Recall (Macro) on Training Set: ", recall_macro_train)
recall_macro_test = recall_score(y_test, y_test_pred, average='macro')
print("Recall (Macro) on Testing Set: ", recall_macro_test)

print("====== Recall (Micro) ======")
recall_micro_train = recall_score(y_train, y_train_pred, average='micro')
print("Recall (Micro) on Training Set: ", recall_micro_train)
recall_micro_test = recall_score(y_test, y_test_pred, average='micro')
print("Recall (Micro) on Testing Set: ", recall_micro_test)

print("====== Recall (Weighted) ======")
recall_weighted_train = recall_score(y_train, y_train_pred, average='weighted')
print("Recall (Weighted) on Training Set: ", recall_weighted_train)
recall_weighted_test = recall_score(y_test, y_test_pred, average='weighted')
print("Recall (Weighted) on Testing Set: ", recall_weighted_test)


Accuracy on Training Set: 0.9990837191358025
Accuracy on Testing Set: 0.6695601851851852
Precision (Macro) on Training Set:  0.9990048915029334
Precision (Macro) on Testing Set:  0.6865944002161385
Precision (Micro) on Training Set:  0.9990837191358025
Precision (Micro) on Testing Set:  0.6695601851851852
Precision (Weighted) on Training Set:  0.9990921373757488
Precision (Weighted) on Testing Set:  0.678597464685919
Recall (Macro) on Training Set:  0.9989885645058653
Recall (Macro) on Testing Set:  0.6654562436631593
Recall (Micro) on Training Set:  0.9990837191358025
Recall (Micro) on Testing Set:  0.6695601851851852
Recall (Weighted) on Training Set:  0.9990837191358025
Recall (Weighted) on Testing Set:  0.6695601851851852
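
In the output above, accuracy, micro precision, micro recall and weighted recall are identical on each split. That is expected when every sample has exactly one label: each misclassification counts as one false positive for the predicted class and one false negative for the true class. A small NumPy check on made-up labels:

```python
import numpy as np

# Hypothetical true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
classes = np.unique(y_true)

# Per-class true positives, false positives and false negatives
tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in classes])
fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in classes])
fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in classes])

accuracy = np.mean(y_true == y_pred)
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())

# Weighted recall: per-class recall weighted by class support
support = tp + fn
weighted_recall = np.sum(support * (tp / support)) / support.sum()

print(accuracy, micro_precision, micro_recall, weighted_recall)
```

Only the macro averages (and weighted precision) carry information beyond accuracy here, so they are the ones worth comparing across models.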


In [50]:
model_bank_dir = '../CS610_AML_Group_Project/model_bank'
os.makedirs(model_bank_dir, exist_ok=True)

model_filename_pickle = 'best_hog_rf_model.pkl'
model_path = os.path.join(model_bank_dir, model_filename_pickle)
with open(model_path, 'wb') as file: 
    pickle.dump(best_hog_rf, file)
print(f"Model Saved Successfully {model_path}")

Model Saved Successfully ../CS610_AML_Group_Project/model_bank\best_hog_rf_model.pkl


#### 2) KNeighborsClassifier - feature extraction by hog

In [9]:
# Start timing
start_time = time.time()

# Base model
base_model = KNeighborsClassifier()

# Hyperparameters
param_dist = {
    'n_neighbors': randint(1, 30),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Randomized search tuning
random_search = RandomizedSearchCV(
    base_model,
    param_dist,
    n_iter=50,
    scoring='accuracy',
    cv=5,
    verbose=2,
    random_state=42,
    error_score='raise'
)
random_search.fit(x_train_pca, y_train)

# End timing
end_time = time.time()
training_time = end_time - start_time

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END .....metric=cosine, n_neighbors=20, weights=uniform; total time=   0.7s
[CV] END .....metric=cosine, n_neighbors=20, weights=uniform; total time=   0.6s
[CV] END .....metric=cosine, n_neighbors=20, weights=uniform; total time=   0.6s
[CV] END .....metric=cosine, n_neighbors=20, weights=uniform; total time=   0.6s
[CV] END .....metric=cosine, n_neighbors=20, weights=uniform; total time=   0.7s
[CV] END ....metric=cosine, n_neighbors=11, weights=distance; total time=   0.5s
[CV] END ....metric=cosine, n_neighbors=11, weights=distance; total time=   0.5s
[CV] END ....metric=cosine, n_neighbors=11, weights=distance; total time=   0.5s
[CV] END ....metric=cosine, n_neighbors=11, weights=distance; total time=   0.5s
[CV] END ....metric=cosine, n_neighbors=11, weights=distance; total time=   0.5s
[CV] END ..metric=euclidean, n_neighbors=21, weights=uniform; total time=   0.6s
[CV] END ..metric=euclidean, n_neighbors=21, we

In [10]:
# Best model
best_hog_knn = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)
# best_score_ is the mean cross-validated accuracy on the training folds, not the test accuracy
print(f"Best Accuracy: {random_search.best_score_:.6f}")
print(f"Total Training Time: {training_time/60:.2f} minutes")

Best Parameters: {'metric': 'cosine', 'n_neighbors': 1, 'weights': 'uniform'}
Best Accuracy: 0.728540
Total Training Time: 4.32 minutes
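
With `n_neighbors=1`, each training sample is its own nearest neighbour, so training accuracy is close to 1 by construction and says little about generalisation. The sketch below demonstrates this with NumPy on random toy data using cosine distance; the features and labels are made up, and the labels are deliberately random to show that the effect does not depend on the data being learnable.

```python
import numpy as np

rng = np.random.default_rng(42)
features = rng.normal(size=(30, 5))    # toy feature vectors
labels = rng.integers(0, 3, size=30)   # random labels: nothing to learn

# Cosine distance = 1 - cosine similarity of unit-normalised vectors
unit = features / np.linalg.norm(features, axis=1, keepdims=True)
cosine_distance = 1.0 - unit @ unit.T

# 1-NN prediction on the training set: the closest point is the sample itself
nearest = np.argmin(cosine_distance, axis=1)
train_accuracy = np.mean(labels[nearest] == labels)

print(train_accuracy)
```

For this reason the cross-validated score and the held-out test metrics are the meaningful numbers for this model.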


In [11]:
beta = 0.5  # beta < 1 weights precision above recall: a wrong label costs more than a missed one

# Predictions
y_pred_train = best_hog_knn.predict(x_train_pca)
y_pred_test = best_hog_knn.predict(x_test_pca)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\n TEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.9990837191358025
Precision (macro): 0.9989975485597107
Recall (macro): 0.9990002715446642
F0.5-Score (macro): 0.9989963449787603

 TEST METRICS
Accuracy: 0.7764274691358025
Precision (macro): 0.7803785498903784
Recall (macro): 0.7741923854665811
F0.5-Score (macro): 0.7780426151227446
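
For reference, the F-beta score for a single class combines precision P and recall R as (1 + β²)·P·R / (β²·P + R). The helper below is a hypothetical plain-Python version of that formula, not scikit-learn's implementation; it shows how β = 0.5 rewards precision more than recall.

```python
def fbeta(precision, recall, beta):
    """F-beta for one class from its precision and recall (illustrative helper)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Same two numbers, swapped: with beta = 0.5 the high-precision case scores higher
high_precision = fbeta(0.8, 0.4, beta=0.5)
high_recall = fbeta(0.4, 0.8, beta=0.5)

# With beta = 1 the formula reduces to the usual F1 (harmonic mean)
f1 = fbeta(0.5, 1.0, beta=1.0)

print(high_precision, high_recall, f1)
```

The macro score reported above averages per-class F-beta values, so it will not in general equal this formula applied to the macro precision and macro recall.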


In [12]:
model_bank_dir = '../CS610_AML_Group_Project/model_bank'
os.makedirs(model_bank_dir, exist_ok=True)

model_filename_pickle = 'best_hog_knn_model.pkl'
model_path = os.path.join(model_bank_dir, model_filename_pickle)
with open(model_path, 'wb') as file: 
    pickle.dump(best_hog_knn, file)
print(f"Model Saved Successfully {model_path}")

Model Saved Successfully ../CS610_AML_Group_Project/model_bank\best_hog_knn_model.pkl


#### 3) XGBClassifier - feature extraction by hog

In [55]:
# Note: early stopping is disabled below, so no separate validation split is made here
# Start timing
start_time = time.time()

# Balance class weights
sample_weights = compute_sample_weight(
    class_weight="balanced",
    y=y_train
)

# Base model
base_model = xgb.XGBClassifier(
    device="cuda",
    tree_method="hist",
    objective="multi:softprob",
    num_class=len(np.unique(y_train)),
    # early_stopping_rounds=10,
    eval_metric=['merror','mlogloss'],
    random_state=42
)

# Hyperparameters
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 12),
    'learning_rate': uniform(0.01, 0.19),  # range: 0.01 to 0.2
    'subsample': uniform(0.7, 0.3),        # range: 0.7 to 1.0
    'colsample_bytree': uniform(0.7, 0.3)  # range: 0.7 to 1.0
}


# Randomized search tuning
random_search = RandomizedSearchCV(
    base_model,
    param_dist,
    n_iter=10,
    scoring='accuracy',
    cv=3,
    verbose=2,
    random_state=42,
    error_score='raise'
)
random_search.fit(
    x_train_pca, y_train,
    sample_weight = sample_weights)

# End timing
end_time = time.time()
training_time = end_time - start_time

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time= 1.4min
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time= 1.1min
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time= 1.1min
[CV] END colsample_bytree=0.8337498258560773, learning_rate=0.028995234005420548, max_depth=10, n_estimators=422, subsample=0.8803345035229626; total time= 4.8min
[CV] END colsample_bytree=0.8337498258560773, learning_rate=0.028995234005420548, max_depth=10, n_estimators=422, subsample=0.8803345035229626; total time= 3.9min
[CV] END colsample_bytree=0.8337498258560773, learning_rate=0.028995234005420548, max_depth=10, n_estimators=422, subsample=0.8803

In [58]:
# Best model
best_hog_xgboost = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)
print(f"Best Accuracy: {random_search.best_score_:.6f}")
print(f"Total Training Time: {training_time/60:.2f} minutes")

Best Parameters: {'colsample_bytree': 0.7692681476866446, 'learning_rate': 0.05579483854494223, 'max_depth': 9, 'n_estimators': 477, 'subsample': 0.848553073033381}
Best Accuracy: 0.569396
Total Training Time: 42.59 minutes


In [59]:
beta = 0.5  # mis-labelled sneakers are more costly than missing labels

# Predictions
y_pred_train = best_hog_xgboost.predict(x_train_pca)
y_pred_test = best_hog_xgboost.predict(x_test_pca)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\nTEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.9990837191358025
Precision (macro): 0.9989704675180797
Recall (macro): 0.9990254724001061
F0.5-Score (macro): 0.9989792702358958

TEST METRICS
Accuracy: 0.6712962962962963
Precision (macro): 0.682041800412016
Recall (macro): 0.6681013001410357
F0.5-Score (macro): 0.676605634132652
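
The cells above set `beta = 0.5` so that precision counts for more than recall. A minimal sketch of the per-class F-beta formula shows the effect; the precision and recall values are hypothetical and not taken from our models. scikit-learn's macro average is the unweighted mean of these per-class scores.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta for one class: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when the class is never predicted correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical class with high precision but low recall
p, r = 0.8, 0.5
print(f"F0.5 = {f_beta(p, r, 0.5):.3f}")  # beta < 1 leans towards precision
print(f"F1   = {f_beta(p, r, 1.0):.3f}")  # balanced
print(f"F2   = {f_beta(p, r, 2.0):.3f}")  # beta > 1 leans towards recall
```

With precision higher than recall, the score falls as beta grows, which is why a low beta suits a setting where a wrong label costs more than a missed one.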


In [60]:
model_bank_dir = '../CS610_AML_Group_Project/model_bank'
os.makedirs(model_bank_dir, exist_ok=True)

model_filename_pickle = 'best_hog_xgboost.pkl'
model_path = os.path.join(model_bank_dir, model_filename_pickle)
with open(model_path, 'wb') as file: 
    pickle.dump(best_hog_xgboost, file)
print(f"Model Saved Successfully {model_path}")

Model Saved Successfully ../CS610_AML_Group_Project/model_bank\best_hog_xgboost.pkl


### Pipeline Models using Feature Extraction Method 2 - Using pretrained CNN

ResNet50 is used as the feature extractor because it comes with weights pre-trained on a large dataset (ImageNet) and is a popular choice for computer vision applications such as image classification.

References:
1) https://medium.com/@meetkalathiya1301/feature-extraction-using-pre-trained-models-for-image-classification-16e6ff43f268
2) https://stackoverflow.com/questions/62117707/extract-features-from-pretrained-resnet50-in-pytorch

In [None]:
#Process image data for feature extraction using CNN
input_dir = '../CS610_AML_Group_Project/resized_images'
img_transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=[0.485,0.456,0.406],std=[0.229,0.224,0.225])]) #ImageNet channel mean and std - standardises each channel to match the statistics the pretrained weights were trained with
img_dataset = datasets.ImageFolder(input_dir, transform=img_transform)
data_loader = DataLoader(img_dataset, batch_size=32, num_workers=4)
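
The `transforms.Normalize` step above applies `(value - mean) / std` to each RGB channel. As a rough illustration in plain Python, the sketch below does this for a single pixel; the helper name `normalise_pixel` and the sample pixel are hypothetical.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalise_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


mid_grey = [0.5, 0.5, 0.5]  # hypothetical pixel after ToTensor has scaled it to [0, 1]
print([round(v, 3) for v in normalise_pixel(mid_grey)])

# A pixel equal to the channel means maps to zero in every channel
print(normalise_pixel(IMAGENET_MEAN))
```

This shifts and rescales the channels; it does not change the shape of the pixel distribution.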

In [33]:
#define function for CNN feature extraction
def cnn_feature_extract(cnn_feature_extractor, data_loader):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    #prepare cnn model to use for feature extraction
    cnn_feature_extractor.eval()
    cnn_feature_extractor.fc = torch.nn.Identity() #replace fully connected layer of pretrained cnn with Identity layer
    for para in cnn_feature_extractor.parameters():
        para.requires_grad = False #freeze weights
    #feature extraction
    features_list, labels_list = [], []
    cnn_feature_extractor.to(device)
    with torch.no_grad():
        for images, labels in data_loader:
            images = images.to(device)
            feature = cnn_feature_extractor(images)
            feature = feature.view(feature.size(0),-1) #flatten into (n_samples, n_features) for non-CNN models
            #convert tensors into numpy for fitting into non-CNN models and add into lists
            features_list.append(feature.cpu().numpy())
            labels_list.append(labels.numpy())

    return cnn_feature_extractor, np.vstack(features_list), np.hstack(labels_list)

In [18]:
#initialise and extract features using CNN feature extractor
weights = models.ResNet50_Weights.IMAGENET1K_V2
resnet50_extractor = models.resnet50(weights=weights)
resnet50_extractor, X, y = cnn_feature_extract(resnet50_extractor, data_loader) #X = features, y =labels
#labels come directly from the ImageFolder data_loader, so no separate labelling step is needed

In [19]:
#CNN training and test split
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)
x_train = pd.DataFrame(x_train, dtype=np.float32)
y_train = pd.Series(y_train, dtype=np.int32)
x_test = pd.DataFrame(x_test, dtype=np.float32)
y_test = pd.Series(y_test, dtype=np.int32)
#same as original flow
print("Number of Samples:", len(y_train))
print("Number of Labels:", len(np.unique(y_train)))
counts = y_train.value_counts()
print("Label Distribution:")
print(counts)

Number of Samples: 20736
Number of Labels: 50
Label Distribution:
15    480
37    480
16    480
0     480
13    480
18    480
29    480
26    480
36    480
7     480
45    477
33    477
28    477
2     477
21    474
10    474
48    473
41    473
39    473
43    473
42    470
20    470
5     470
49    464
35    368
31    368
40    368
9     368
3     368
32    368
24    368
30    368
1     368
38    368
46    368
19    368
4     368
6     365
12    365
23    365
11    365
44    362
17    362
22    362
47    346
14    346
34    342
27    339
25    320
8     291
Name: count, dtype: int64
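
The label distribution above is mildly imbalanced (291 to 480 samples per class). The XGBoost cells use `compute_sample_weight(class_weight="balanced", ...)` to compensate. scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * class_count)`. The sketch below applies that formula to three of the counts printed above; the helper name is hypothetical.

```python
def balanced_class_weight(n_samples: int, n_classes: int, class_count: int) -> float:
    """Weight given to every sample of a class under the 'balanced' heuristic."""
    return n_samples / (n_classes * class_count)


n_samples, n_classes = 20736, 50  # taken from the printed output above
for label, count in [(15, 480), (35, 368), (8, 291)]:
    weight = balanced_class_weight(n_samples, n_classes, count)
    print(f"label {label:>2}: count={count}, weight={weight:.3f}")
```

Samples from the smallest class get a weight above 1 and samples from the largest classes a weight below 1, so each class contributes roughly equally to the training loss.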


#### 1) RandomForestClassifier - feature extraction by CNN

In [None]:
param_distributions = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [10, 20, 30, 40],
    'max_features': ['sqrt', 'log2', 0.5, 0.8, 1.0]
}

rf2 = RandomForestClassifier(random_state=42)

random_search = RandomizedSearchCV(
    estimator=rf2,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

random_search.fit(x_train, y_train)

print("Best params found:", random_search.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best params found: {'n_estimators': 150, 'max_features': 'sqrt', 'max_depth': 20}


In [67]:
best_cnn_rf = random_search.best_estimator_

# Accuracy
print("====== Accuracy ======")
y_train_pred = best_cnn_rf.predict(x_train)
print("Accuracy on Training Set:", accuracy_score(y_train, y_train_pred))

y_test_pred = best_cnn_rf.predict(x_test)
print("Accuracy on Testing Set:", accuracy_score(y_test, y_test_pred))

# Precision
print("====== Precision (Macro) ======")
precision_macro_train = precision_score(y_train, y_train_pred, average='macro')
print("Precision (Macro) on Training Set: ", precision_macro_train)
precision_macro_test = precision_score(y_test, y_test_pred, average='macro')
print("Precision (Macro) on Testing Set: ", precision_macro_test)

print("====== Precision (Micro) ======")
precision_micro_train = precision_score(y_train, y_train_pred, average='micro')
print("Precision (Micro) on Training Set: ", precision_micro_train)
precision_micro_test = precision_score(y_test, y_test_pred, average='micro')
print("Precision (Micro) on Testing Set: ", precision_micro_test)

print("====== Precision (Weighted) ======")
precision_weighted_train = precision_score(y_train, y_train_pred, average='weighted')
print("Precision (Weighted) on Training Set: ", precision_weighted_train)
precision_weighted_test = precision_score(y_test, y_test_pred, average='weighted')
print("Precision (Weighted) on Testing Set: ", precision_weighted_test)


# Recall
print("====== Recall (Macro) ======")
recall_macro_train = recall_score(y_train, y_train_pred, average='macro')
print("Recall (Macro) on Training Set: ", recall_macro_train)
recall_macro_test = recall_score(y_test, y_test_pred, average='macro')
print("Recall (Macro) on Testing Set: ", recall_macro_test)

print("====== Recall (Micro) ======")
recall_micro_train = recall_score(y_train, y_train_pred, average='micro')
print("Recall (Micro) on Training Set: ", recall_micro_train)
recall_micro_test = recall_score(y_test, y_test_pred, average='micro')
print("Recall (Micro) on Testing Set: ", recall_micro_test)

print("====== Recall (Weighted) ======")
recall_weighted_train = recall_score(y_train, y_train_pred, average='weighted')
print("Recall (Weighted) on Training Set: ", recall_weighted_train)
recall_weighted_test = recall_score(y_test, y_test_pred, average='weighted')
print("Recall (Weighted) on Testing Set: ", recall_weighted_test)


Accuracy on Training Set: 0.9990837191358025
Accuracy on Testing Set: 0.8130787037037037
Precision (Macro) on Training Set:  0.998995619641934
Precision (Macro) on Testing Set:  0.8302587494646678
Precision (Micro) on Training Set:  0.9990837191358025
Precision (Micro) on Testing Set:  0.8130787037037037
Precision (Weighted) on Training Set:  0.9990918929613705
Precision (Weighted) on Testing Set:  0.8223946071517559
Recall (Macro) on Training Set:  0.9990002641408887
Recall (Macro) on Testing Set:  0.8076955634496913
Recall (Micro) on Training Set:  0.9990837191358025
Recall (Micro) on Testing Set:  0.8130787037037037
Recall (Weighted) on Training Set:  0.9990837191358025
Recall (Weighted) on Testing Set:  0.8130787037037037


In [68]:
model_bank_dir = '../CS610_AML_Group_Project/model_bank'
os.makedirs(model_bank_dir, exist_ok=True)

model_filename_pickle = 'best_cnn_rf_model.pkl'
model_path = os.path.join(model_bank_dir, model_filename_pickle)
with open(model_path, 'wb') as file: 
    pickle.dump(best_cnn_rf, file)
print(f"Model Saved Successfully {model_path}")

Model Saved Successfully ../CS610_AML_Group_Project/model_bank\best_cnn_rf_model.pkl


#### 2) KNNClassifier - feature extraction by CNN

In [21]:
# Start timing
start_time = time.time()

# Base model
base_model = KNeighborsClassifier()

# Hyperparameters
param_dist = {
    'n_neighbors': randint(1, 30),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

# Randomized search tuning
random_search = RandomizedSearchCV(
    base_model,
    param_dist,
    n_iter=50,
    scoring='accuracy',
    cv=5,
    verbose=0,
    random_state=42,
    error_score='raise',
)
random_search.fit(x_train, y_train)

# End timing
end_time = time.time()
training_time = end_time - start_time

In [22]:
# Best model
best_cnn_knn = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)
print(f"Best Accuracy: {random_search.best_score_:.6f}")
print(f"Total Training Time: {training_time/60:.2f} minutes")

Best Parameters: {'metric': 'cosine', 'n_neighbors': 1, 'weights': 'uniform'}
Best Accuracy: 0.946181
Total Training Time: 28.06 minutes


In [23]:
beta = 0.5  # mis-labelled sneakers are more costly than missing labels

# Predictions
y_pred_train = best_cnn_knn.predict(x_train)
y_pred_test = best_cnn_knn.predict(x_test)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\nTEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.9990837191358025
Precision (macro): 0.9989970845602143
Recall (macro): 0.9989993707519668
F0.5-Score (macro): 0.9989957762392467

TEST METRICS
Accuracy: 0.9780092592592593
Precision (macro): 0.97815453728823
Recall (macro): 0.9773857376940281
F0.5-Score (macro): 0.9778842703492533


In [24]:
model_bank_dir = '../CS610_AML_Group_Project/model_bank'
os.makedirs(model_bank_dir, exist_ok=True)

model_filename_pickle = 'best_cnn_knn_model.pkl'
model_path = os.path.join(model_bank_dir, model_filename_pickle)
with open(model_path, 'wb') as file: 
    pickle.dump(best_cnn_knn, file)
print(f"Model Saved Successfully {model_path}")

Model Saved Successfully ../CS610_AML_Group_Project/model_bank\best_cnn_knn_model.pkl


#### 3) XGBoostClassifier - feature extraction by CNN

In [69]:
# Start timing
start_time = time.time()

# Balance class weights
sample_weights = compute_sample_weight(
    class_weight="balanced",
    y=y_train
)

# Base model
base_model = xgb.XGBClassifier(
    device="cuda",
    tree_method="hist",
    objective="multi:softprob",
    num_class=len(np.unique(y_train)),
    # early_stopping_rounds=10,
    eval_metric=['merror','mlogloss'],
    random_state=42
)

# Hyperparameters
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 12),
    'learning_rate': uniform(0.01, 0.19),  # range: 0.01 to 0.2
    'subsample': uniform(0.7, 0.3),        # range: 0.7 to 1.0
    'colsample_bytree': uniform(0.7, 0.3)  # range: 0.7 to 1.0
}

# Randomized search tuning
random_search = RandomizedSearchCV(
    base_model,
    param_dist,
    n_iter=10,
    scoring='accuracy',
    cv=3,
    verbose=2,
    random_state=42,
    error_score='raise'
)
random_search.fit(
    x_train, y_train,
    sample_weight = sample_weights)
# End timing
end_time = time.time()
training_time = end_time - start_time

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time= 6.9min
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time= 7.0min
[CV] END colsample_bytree=0.8123620356542087, learning_rate=0.19063571821788408, max_depth=10, n_estimators=238, subsample=0.879055047383946; total time= 6.8min
[CV] END colsample_bytree=0.8337498258560773, learning_rate=0.028995234005420548, max_depth=10, n_estimators=422, subsample=0.8803345035229626; total time=20.8min
[CV] END colsample_bytree=0.8337498258560773, learning_rate=0.028995234005420548, max_depth=10, n_estimators=422, subsample=0.8803345035229626; total time=19.8min
[CV] END colsample_bytree=0.8337498258560773, learning_rate=0.028995234005420548, max_depth=10, n_estimators=422, subsample=0.8803

In [72]:
# Best model
best_cnn_xgb = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)
print(f"Best Accuracy: {random_search.best_score_:.6f}")
print(f"Total Training Time: {training_time/60:.2f} minutes")

Best Parameters: {'colsample_bytree': 0.7195154778955838, 'learning_rate': 0.19028825207813332, 'max_depth': 4, 'n_estimators': 314, 'subsample': 0.7047898756660642}
Best Accuracy: 0.796345
Total Training Time: 216.38 minutes


In [73]:
beta = 0.5  # mis-labelled sneakers are more costly than missing labels

# Predictions
y_pred_train = best_cnn_xgb.predict(x_train)
y_pred_test = best_cnn_xgb.predict(x_test)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\nTEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.9990837191358025
Precision (macro): 0.9989785471633986
Recall (macro): 0.9990241249129476
F0.5-Score (macro): 0.9989850858454915

TEST METRICS
Accuracy: 0.8715277777777778
Precision (macro): 0.8744753441038603
Recall (macro): 0.8687617975947292
F0.5-Score (macro): 0.8726552890290106


In [74]:
model_bank_dir = '../CS610_AML_Group_Project/model_bank'
os.makedirs(model_bank_dir, exist_ok=True)

model_filename_pickle = 'best_cnn_xgb_model.pkl'
model_path = os.path.join(model_bank_dir, model_filename_pickle)
with open(model_path, 'wb') as file: 
    pickle.dump(best_cnn_xgb, file)
print(f"Model Saved Successfully {model_path}")

Model Saved Successfully ../CS610_AML_Group_Project/model_bank\best_cnn_xgb_model.pkl


### Results Interpretation
\<Add graph\>
Features extracted by the pretrained CNN proved more useful than HOG features (coupled with PCA): every model achieved a higher test score with CNN features. For example, KNN test accuracy rose from 0.776 to 0.978 and XGBoost test accuracy from 0.671 to 0.872. The models fed with HOG-extracted features also showed clear signs of overfitting, with test scores much lower than their training scores (about 0.999).
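
To make the comparison concrete, the sketch below tabulates the gap between training and test accuracy using the figures printed in the cells above. HOG with RandomForest is left out because its scores are not shown in this part of the notebook.

```python
# (train accuracy, test accuracy) copied from the printed outputs above
reported = {
    "HOG + KNN": (0.9990837191358025, 0.7764274691358025),
    "HOG + XGBoost": (0.9990837191358025, 0.6712962962962963),
    "CNN + RandomForest": (0.9990837191358025, 0.8130787037037037),
    "CNN + KNN": (0.9990837191358025, 0.9780092592592593),
    "CNN + XGBoost": (0.9990837191358025, 0.8715277777777778),
}

# A large train-test gap is a sign of overfitting
gaps = {name: train - test for name, (train, test) in reported.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    train, test = reported[name]
    print(f"{name:<20} train={train:.3f}  test={test:.3f}  gap={gap:.3f}")
```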


### Model Stacking

Stacking is an ensemble method that can improve overall performance, because the weaknesses of one model can be compensated for by the strengths of the others. Hence, we decided to use stacking to improve the overall performance of the model. For this technique, only the CNN feature extraction method is used, as it gave better model performance (in terms of accuracy) in the experiments above.
<br>
<br>
The models tuned earlier in the code are combined in a stacking classifier, trained on the CNN-extracted feature set, to determine whether stacking improves the overall performance.
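
As a rough illustration of the mechanism, the sketch below builds a two-level stack by hand on a small synthetic dataset. The data, the model settings and the variable names are all assumptions made for the example, not our project setup. The base models produce out-of-fold class probabilities, and a logistic regression is then trained on those probabilities. This is broadly what `StackingClassifier` does internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic 3-class problem (stand-in for the CNN feature set)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

base_models = [RandomForestClassifier(n_estimators=20, random_state=42),
               KNeighborsClassifier(n_neighbors=3)]

# Level 0: out-of-fold probabilities, so the meta-model never sees
# predictions made on data the base model was fitted on
meta_train = np.hstack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")
    for m in base_models
])

# Level 1: the meta-model learns how to combine the base predictions
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)

# At prediction time the base models are refitted on the full training set
meta_test = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te) for m in base_models])
stacked_accuracy = meta_model.score(meta_test, y_te)

print("meta-feature shape:", meta_train.shape)
print("stacked test accuracy:", stacked_accuracy)
```

Each base model contributes one probability column per class, so two base models on three classes give six meta-features.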

#### Import the models

In [25]:
model_paths = {"rf_model":"./model_bank/best_cnn_rf_model.pkl",
               "knn_model":"./model_bank/best_cnn_knn_model.pkl",
               "xgb_model":"./model_bank/best_cnn_xgb_model.pkl"}
loaded_models = {}  # not named `models`, which would shadow the torchvision `models` import
for model_name, path in model_paths.items():
    print(path)
    # with open(path, "rb") as f:
    #     loaded_models[model_name] = pickle.load(f)

print(loaded_models)

./model_bank/best_cnn_rf_model.pkl
./model_bank/best_cnn_knn_model.pkl
./model_bank/best_cnn_xgb_model.pkl
{}
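
The loading step in the cell above is commented out, so the printed dictionary is empty. One possible way to load the saved models while tolerating missing files is sketched below. The helper `load_pickled_models` is hypothetical, and the demo writes a throwaway pickle to a temporary folder rather than touching `model_bank`.

```python
import os
import pickle
import tempfile


def load_pickled_models(paths: dict) -> dict:
    """Load every pickle that exists; report and skip the ones that do not."""
    loaded = {}
    for name, path in paths.items():
        if not os.path.exists(path):
            print(f"Skipping {name}: {path} not found")
            continue
        with open(path, "rb") as f:
            loaded[name] = pickle.load(f)  # only unpickle files you trust
    return loaded


# Demo with a temporary file standing in for a saved model
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump({"k": 1}, f)
    demo = load_pickled_models({
        "present": demo_path,
        "absent": os.path.join(tmp, "missing.pkl"),
    })

print(sorted(demo))
```

Calling the helper on `model_paths` would return only the models whose `.pkl` files are present.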


#### Set up stacking

In [27]:
estimators = [('rcf_model',RandomForestClassifier(n_estimators= 150, max_features="sqrt", max_depth= 20, random_state=42)),
              ("xgboost",xgb.XGBClassifier(colsample_bytree= 0.7195154778955838, learning_rate= 0.19028825207813332, max_depth= 4, n_estimators= 314, subsample=0.7047898756660642,device="cuda",tree_method="hist", objective="multi:softprob", num_class=len(np.unique(y_train)),eval_metric=['merror','mlogloss'],random_state=42)),
              ("knn", KNeighborsClassifier(metric= "cosine", n_neighbors= 1, weights="uniform"))]
stacking_cf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5, passthrough=False, verbose=1)

# Start timing
start_time = time.time()
stacking_cf.fit(x_train,y_train)
# End timing
end_time = time.time()
training_time = end_time - start_time
print(f"Total Training Time: {training_time/60:.2f} minutes")

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  6.1min finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 23.8min finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    8.8s finished


Total Training Time: 37.50 minutes


In [28]:
beta = 0.5  # mis-labelled sneakers are more costly than missing labels

# Predictions
y_pred_train = stacking_cf.predict(x_train)
y_pred_test = stacking_cf.predict(x_test)

# --- Train Scores ---
print("TRAIN METRICS")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision (macro):", precision_score(y_train, y_pred_train, average='macro'))
print("Recall (macro):", recall_score(y_train, y_pred_train, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_train, y_pred_train, beta=beta, average='macro'))

# --- Test Scores ---
print("\nTEST METRICS")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision (macro):", precision_score(y_test, y_pred_test, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred_test, average='macro'))
print(f"F{beta}-Score (macro):", fbeta_score(y_test, y_pred_test, beta=beta, average='macro'))

TRAIN METRICS
Accuracy: 0.9990837191358025
Precision (macro): 0.99901375545105
Recall (macro): 0.9989867703242455
F0.5-Score (macro): 0.99900619939097

TEST METRICS
Accuracy: 0.9782021604938271
Precision (macro): 0.9784393146580949
Recall (macro): 0.9775371626284405
F0.5-Score (macro): 0.9781527083673724


In [29]:
model_bank_dir = '../CS610_AML_Group_Project/model_bank'
os.makedirs(model_bank_dir, exist_ok=True)

model_filename_pickle = 'stacked_model_pipeline.pkl'
model_path = os.path.join(model_bank_dir, model_filename_pickle)
with open(model_path, 'wb') as file: 
    pickle.dump(stacking_cf, file)
print(f"Model Saved Successfully {model_path}")

Model Saved Successfully ../CS610_AML_Group_Project/model_bank/stacked_model_pipeline.pkl


### CNN model

The ResNet50 model is used to compare the performance of a Convolutional Neural Network (CNN) against non-CNN models on the image classification task. It was chosen over deeper models because it has been reported to reach comparable accuracy and F1-score while requiring less training time.
<br> https://www.researchgate.net/figure/Comprative-result-of-Gender-Detection-of-LFW-Dataset_tbl4_379684426

Code Reference: https://www.kaggle.com/code/nikolasgegenava/resnet18-for-sneakers-classification

#### Data Preparation

In [2]:
#Process image data for training and validating the CNN
input_dir = '../CS610_AML_Group_Project/resized_images'
train_transform = transforms.Compose([transforms.ToTensor(),transforms.Resize((256, 256)),transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),transforms.RandomHorizontalFlip(),transforms.ColorJitter(brightness=0.1, contrast=0.1), transforms.Normalize(mean=[0.485,0.456,0.406],std=[0.229,0.224,0.225])]) #ImageNet channel mean and std - standardises each channel to match the statistics the pretrained weights were trained with
val_transform = transforms.Compose([transforms.ToTensor(), transforms.Resize((224, 224)), transforms.Normalize(mean=[0.485,0.456,0.406],std=[0.229,0.224,0.225])])
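#two views of the same folder, one per transform, so the train and validation transforms stay independent
img_dataset = datasets.ImageFolder(input_dir, transform=train_transform)
val_view_dataset = datasets.ImageFolder(input_dir, transform=val_transform)

train_size = int(0.7*len(img_dataset))
val_size = len(img_dataset)-train_size
train_img_dataset, val_split = random_split(img_dataset, [train_size, val_size], generator=torch.Generator().manual_seed(42))

#both subsets from random_split share one underlying dataset, so setting .dataset.transform would also change the training transform
#instead, reuse the validation indices on the view that carries val_transform
val_img_dataset = torch.utils.data.Subset(val_view_dataset, val_split.indices)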

train_loader = DataLoader(train_img_dataset, batch_size=32, shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_img_dataset, batch_size=32, num_workers=4, pin_memory=True)

In [3]:
#get classes from directory
num_classes = len(img_dataset.classes)
print("Number of classes:", num_classes,"\n", img_dataset.classes)

Number of classes: 50 
 ['adidas_forum_high', 'adidas_forum_low', 'adidas_gazelle', 'adidas_nmd_r1', 'adidas_samba', 'adidas_stan_smith', 'adidas_superstar', 'adidas_ultraboost', 'asics_gel-lyte_iii', 'converse_chuck_70_high', 'converse_chuck_70_low', 'converse_chuck_taylor_all-star_high', 'converse_chuck_taylor_all-star_low', 'converse_one_star', 'new_balance_327', 'new_balance_550', 'new_balance_574', 'new_balance_990', 'new_balance_992', 'nike_air_force_1_high', 'nike_air_force_1_low', 'nike_air_force_1_mid', 'nike_air_jordan_11', 'nike_air_jordan_1_high', 'nike_air_jordan_1_low', 'nike_air_jordan_3', 'nike_air_jordan_4', 'nike_air_max_1', 'nike_air_max_270', 'nike_air_max_90', 'nike_air_max_95', 'nike_air_max_97', 'nike_air_max_plus_(tn)', 'nike_air_vapormax_flyknit', 'nike_air_vapormax_plus', 'nike_blazer_mid_77', 'nike_cortez', 'nike_dunk_high', 'nike_dunk_low', 'puma_suede_classic', 'reebok_classic_leather', 'reebok_club_c_85', 'salomon_xt-6', 'vans_authentic', 'vans_old_skool',

#### Model Training
<i>Comparison between our self-made model and the pretrained model(?)</i>

In [4]:
#check if cuda is available to use
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device, "is used")

cuda is used


In [19]:
#Initialise and setup pre-trained model
weights = models.ResNet50_Weights.IMAGENET1K_V2 # IMAGENET1K_V2 used as it has shown to improve the performance of ResNet50
cnn_model = models.resnet50(weights=weights)
for parameter in cnn_model.parameters():
    parameter.requires_grad = False #freeze the pretrained layers so their learned features are kept and only the new head is trained

#Replace fully connected layer with new nn.Linear to allow it to be trained on our dataset
cnn_model.fc = nn.Linear(cnn_model.fc.in_features, num_classes)
#switch to gpu if available
cnn_model.to(device)


ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

In [20]:
#Set up Loss and Optimiser
criterion = nn.CrossEntropyLoss()
optimiser = optim.Adam(cnn_model.fc.parameters(), lr=1e-3)
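
Because the backbone is frozen and only `cnn_model.fc.parameters()` is passed to the optimiser, the number of trained weights is small. A quick back-of-the-envelope count, assuming the standard torchvision ResNet50 whose final layer takes 2048 input features:

```python
in_features = 2048  # input width of torchvision ResNet50's final fully connected layer
num_classes = 50    # number of sneaker classes printed in the data preparation cell

# nn.Linear holds one weight per (input, output) pair plus one bias per output
head_params = in_features * num_classes + num_classes
print("Trainable parameters in the new head:", head_params)
```

This comes to roughly a hundred thousand trainable parameters, a small fraction of the full network, which has over twenty million.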

In [21]:
#set up training loop
for epoch in range(10):
    cnn_model.train()
    running_loss, running_corrects = 0.0, 0
    for i, data in enumerate(train_loader):
        images, labels = data
        images, labels = images.to(device), labels.to(device)
        optimiser.zero_grad()
        outputs = cnn_model(images)
        loss    = criterion(outputs, labels)
        loss.backward()
        optimiser.step()
        running_loss    += loss.item() * images.size(0)
        running_corrects+= (outputs.argmax(1) == labels).sum().item()

    epoch_loss = running_loss / len(train_loader.dataset)
    epoch_acc  = running_corrects / len(train_loader.dataset)
    print("Epoch %d, Epoch loss: %.3f, Epoch accuracy: %.3f"%(epoch+1, epoch_loss, epoch_acc))
    torch.save(cnn_model.state_dict(), "./model_bank/best_resnet50.pth")

Epoch 1, Epoch loss: 2.884, Epoch accuracy: 0.360
Epoch 2, Epoch loss: 1.872, Epoch accuracy: 0.599
Epoch 3, Epoch loss: 1.450, Epoch accuracy: 0.690
Epoch 4, Epoch loss: 1.193, Epoch accuracy: 0.748
Epoch 5, Epoch loss: 1.009, Epoch accuracy: 0.794
Epoch 6, Epoch loss: 0.874, Epoch accuracy: 0.818
Epoch 7, Epoch loss: 0.770, Epoch accuracy: 0.849
Epoch 8, Epoch loss: 0.701, Epoch accuracy: 0.860
Epoch 9, Epoch loss: 0.621, Epoch accuracy: 0.881
Epoch 10, Epoch loss: 0.556, Epoch accuracy: 0.891


#### Model Evaluation

In [29]:
#set up validation loop
holder = {}
holder['y_true'], holder['y_hat'] = [], []
best_val_acc = 0
best_val_epoch = float("inf")
cnn_model.eval()
running_loss, running_corrects = 0.0, 0
with torch.no_grad():
    for data in val_loader:
        images, labels = data
        images, labels = images.to(device), labels.to(device)

        outputs = cnn_model(images)
        loss = criterion(outputs, labels)  # loss on this validation batch
        _, preds = torch.max(outputs, 1)   # logits not required, index pos is sufficient
        running_loss += loss.item() * images.size(0)
        running_corrects += (preds == labels).sum().item()
        holder['y_true'].extend(labels.cpu().numpy().tolist())
        holder['y_hat'].extend(preds.cpu().numpy().tolist())

    y_true_all = holder['y_true']
    y_pred_all = holder['y_hat']
    M = confusion_matrix(y_true_all, y_pred_all)
    conf_matrix = ConfusionMatrixDisplay(M)
    print("Confusion matrix: \n", conf_matrix)
    print(classification_report(y_true_all, y_pred_all))
    val_epoch_loss = running_loss / len(val_loader.dataset)
    val_epoch_acc  = running_corrects / len(val_loader.dataset)
    print("Epoch loss: %.3f, Epoch accuracy: %.3f"%(val_epoch_loss, val_epoch_acc))

Confusion matrix: 
 <sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x7c3269a282f0>
              precision    recall  f1-score   support

           0       0.78      0.84      0.81       171
           1       0.85      0.62      0.71       136
           2       0.83      0.78      0.80       169
           3       0.92      0.76      0.83       135
           4       0.81      0.89      0.85       137
           5       0.72      0.85      0.78       173
           6       0.77      0.83      0.80       127
           7       0.74      0.92      0.82       163
           8       0.82      0.73      0.77       102
           9       0.93      0.78      0.85       143
          10       0.75      0.92      0.83       168
          11       0.79      0.84      0.81       152
          12       0.82      0.79      0.80       124
          13       0.84      0.79      0.81       201
          14       0.87      0.85      0.86       123
          15       0.81   