## 1. Prepare Data for Training - MNIST Dataset

- Dataset: MNIST (28x28 handwritten digit images)
- Classes: 10 (digits 0-9)
- Original features: 784 (28x28 pixels)
- Reduced features: 8 (using PCA)
- Random seed: 42
- Scaling: MinMaxScaler to [0, 1], fit on the training set only
- **Total samples: 500 (stratified subset of the original 70,000)**
- **Data Split: 300 Train (60%) / 100 Validation (20%) / 100 Test (20%)**
- **Step-by-step hyperparameter tuning using validation set**
- SVM Pipeline with preprocessing and systematic parameter optimization
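
The setup listed above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the notebook's actual tuning code: it uses scikit-learn's bundled 8x8 `load_digits` data as a stand-in for MNIST so it runs without a download, and the candidate values for `C` are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data (assumption): bundled 8x8 digits instead of MNIST.
X_demo, y_demo = load_digits(return_X_y=True)

# 60% train / 20% validation / 20% test, stratified by class.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, train_size=0.6, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)


def make_model(C):
    """Scaler -> PCA(8) -> RBF SVM; every step is fit on training data only."""
    return Pipeline([
        ("scale", MinMaxScaler()),
        ("pca", PCA(n_components=8, random_state=42)),
        ("svm", SVC(kernel="rbf", C=C, gamma="scale")),
    ])


# Tune C on the validation set (candidate grid is hypothetical).
val_scores = {}
for C in [0.1, 1.0, 10.0, 100.0]:
    val_scores[C] = make_model(C).fit(X_tr, y_tr).score(X_va, y_va)

best_C = max(val_scores, key=val_scores.get)

# Touch the test set once, with the selected setting.
test_acc = make_model(best_C).fit(X_tr, y_tr).score(X_te, y_te)
print(f"Validation accuracy per C: {val_scores}")
print(f"Best C: {best_C}, test accuracy: {test_acc:.3f}")
```

Wrapping the scaler and PCA in a `Pipeline` keeps their statistics from leaking out of the validation and test sets, since `fit` only ever sees the training split.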

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA


# Load MNIST dataset (28x28 handwritten digit images)
mnist = datasets.fetch_openml('mnist_784', version=1, parser='auto')
X = mnist.data.to_numpy() if hasattr(mnist.data, 'to_numpy') else mnist.data
y = mnist.target.astype(int).to_numpy() if hasattr(mnist.target, 'to_numpy') else mnist.target.astype(int)
print(f"Original dataset shape: {X.shape}")
print(f"Original number of samples: {X.shape[0]}")

# Limit to 500 samples for training/validation/test
X_subset, _, y_subset, _ = train_test_split(
    X, y, train_size=500, random_state=42, stratify=y
)

print(f"\nSubset dataset shape: {X_subset.shape}")
print(f"Number of samples: {X_subset.shape[0]}")
print(f"Number of features: {X_subset.shape[1]}")
print(f"Number of classes: {len(np.unique(y_subset))}")
print(f"Classes: {np.unique(y_subset)}")

# Show sample distribution
print("\nClass distribution in subset:")
unique, counts = np.unique(y_subset, return_counts=True)
for i, count in zip(unique, counts):
    print(f"Class {i}: {count} samples")

# Split data into train/validation/test sets
# First split: 300 train (60%), 200 temp (40%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X_subset, y_subset, train_size=300, random_state=42, stratify=y_subset
)

# Second split: 100 validation, 100 test (from the 200 temp)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(f"\nFinal split (300 train, 100 val, 100 test):")
print(f"Training samples: {len(y_train)}")
print(f"Validation samples: {len(y_val)}")
print(f"Test samples: {len(y_test)}")


# Normalize features to [0,1]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)


# PCA for dimensionality reduction (from 784 to 8 dimensions)
print("\nApplying PCA for dimensionality reduction...")
pca = PCA(n_components=8)
X_train = pca.fit_transform(X_train)
X_val = pca.transform(X_val)
X_test = pca.transform(X_test)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {sum(pca.explained_variance_ratio_):.4f}")

print(f"\nData splits:")
print(f"Training set: {X_train.shape} (300 samples, 60%)")
print(f"Validation set: {X_val.shape} (100 samples, 20%)")
print(f"Test set: {X_test.shape} (100 samples, 20%)")

Original dataset shape: (70000, 784)
Original number of samples: 70000

Subset dataset shape: (500, 784)
Number of samples: 500
Number of features: 784
Number of classes: 10
Classes: [0 1 2 3 4 5 6 7 8 9]

Class distribution in subset:
Class 0: 49 samples
Class 1: 56 samples
Class 2: 50 samples
Class 3: 51 samples
Class 4: 49 samples
Class 5: 45 samples
Class 6: 49 samples
Class 7: 52 samples
Class 8: 49 samples
Class 9: 50 samples

Final split (300 train, 100 val, 100 test):
Training samples: 300
Validation samples: 100
Test samples: 100

Applying PCA for dimensionality reduction...
Explained variance ratio: [0.09909844 0.07003463 0.06263133 0.05757995 0.05025231 0.04532871
 0.03584532 0.02996696]
Total explained variance: 0.4507

Data splits:
Training set: (300, 8) (300 samples, 60%)
Validation set: (100, 8) (100 samples, 20%)
Test set: (100, 8) (100 samples, 20%)
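
The output above shows that 8 components retain roughly 45% of the variance. One way to see how many components a given variance target would need is to inspect the cumulative explained variance ratio. The sketch below is an illustration only: it uses the bundled `load_digits` data as a stand-in for MNIST, fits PCA on the whole demo set for brevity, and the 90% target is a hypothetical choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): bundled 8x8 digits, scaled to [0, 1].
X_demo, _ = load_digits(return_X_y=True)
X_demo = MinMaxScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the variance ratios.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.90  # hypothetical threshold
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(f"Variance kept by the first 8 components: {cumulative[7]:.4f}")
print(f"Components needed for {target:.0%} variance: {n_needed}")
```

In a real workflow the PCA would be fit on the training split only, as in the cell above.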


In [7]:
# Save processed data
np.savez_compressed(
    "../data/mnist_8features_data.npz",
    X_train=X_train,
    X_val=X_val,
    X_test=X_test,
    y_train=y_train,
    y_val=y_val,
    y_test=y_test
)
print("Data saved to ../data/mnist_8features_data.npz")

Data saved to ../data/mnist_8features_data.npz


In [8]:
# Load processed MNIST data
data = np.load("../data/mnist_8features_data.npz")

X_train = data['X_train']
y_train = data['y_train']
X_val = data['X_val']
y_val = data['y_val']
X_test = data['X_test']
y_test = data['y_test']

print(f"Loaded data shapes:")
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

Loaded data shapes:
X_train: (300, 8), y_train: (300,)
X_val: (100, 8), y_val: (100,)
X_test: (100, 8), y_test: (100,)


In [9]:
import sys
import os
# Add the parent of the working directory to the import path so that `src` can be imported
sys.path.append(os.path.dirname(os.getcwd()))

In [10]:
from sklearn.metrics.pairwise import rbf_kernel
from src.utils import calculate_accuracy

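# Precompute RBF Gram matrices. With gamma unset, rbf_kernel uses gamma = 1 / n_features.
# Validation and test kernels are computed against the training samples.
rbf_K_train = rbf_kernel(X_train)
rbf_K_val = rbf_kernel(X_val, X_train)
rbf_K_test = rbf_kernel(X_test, X_train)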

classical_val_acc, classical_test_acc, _ = calculate_accuracy(
    rbf_K_train, rbf_K_val, rbf_K_test,
    y_train, y_val, y_test
)
print("Val acc | Test acc")
print(f"{classical_val_acc} | {classical_test_acc}")

Val acc | Test acc
0.79 | 0.75
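
The implementation of `calculate_accuracy` in `src.utils` is not shown here. A plausible sketch of such a helper, assuming it trains an SVM on a precomputed kernel and scores it on the validation and test kernels, might look like the following. The function name `accuracy_from_kernels`, the `C` parameter, and the third return value are all hypothetical, and the bundled `load_digits` data stands in for the saved MNIST features.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def accuracy_from_kernels(K_train, K_val, K_test, y_train, y_val, y_test, C=1.0):
    """Hypothetical stand-in for src.utils.calculate_accuracy.

    K_train has shape (n_train, n_train); K_val and K_test have shape
    (n_samples, n_train), i.e. kernels evaluated against the training set.
    """
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K_train, y_train)
    # Returning the fitted classifier as the third value is a guess.
    return clf.score(K_val, y_val), clf.score(K_test, y_test), clf


# Stand-in data (assumption): bundled 8x8 digits, pixel values 0..16.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0

# Same 300 / 100 / 100 split sizes as the notebook.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, train_size=300, test_size=200, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

K_tr = rbf_kernel(X_tr)        # (300, 300)
K_va = rbf_kernel(X_va, X_tr)  # (100, 300)
K_te = rbf_kernel(X_te, X_tr)  # (100, 300)

val_acc, test_acc, _ = accuracy_from_kernels(K_tr, K_va, K_te, y_tr, y_va, y_te)
print("Val acc | Test acc")
print(f"{val_acc} | {test_acc}")
```

Passing precomputed Gram matrices makes it straightforward to swap in a different kernel later while reusing the same evaluation helper.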
