# Running a Machine Learning Model in a QSVM and Classic SVM Simulator With Comparasons

### Imports and Environment Setup
This cell imports all necessary libraries for data handling (`pandas`, `NumPy`), preprocessing and modelling (`scikit-learn`), plotting (`matplotlib`), and quantum machine learning (`Qiskit` and its machine-learning extensions). It also ensures the `Heatmaps/` directory exists for saving later figures.

In [1]:
##------------ Import Required Python Packages ------------
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
import time

##------------ Import Required Machine Learning Packages ------------
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.svm import SVC as ClassicalSVC
from sklearn.metrics.pairwise import rbf_kernel


##------------ Import Required Qiskit Packages ------------
from qiskit.circuit.library import ZZFeatureMap
from qiskit_machine_learning.kernels import FidelityStatevectorKernel
from qiskit_machine_learning.algorithms import QSVC

##------------ Create/Find OS Directory to Save the Heatmap Results ------------
os.makedirs('../Heatmaps', exist_ok=True)

### Data Loading and Preprocessing
This cell reads the raw dataset (`bots_vs_users.csv`) into a DataFrame, drops features with over 75 % missing values, imputes and flags missing entries, encodes boolean and categorical variables, removes duplicate rows, filters out features with very low variance, and finally imputes and standardises the cleaned feature matrix in preparation for PCA and modelling.


In [2]:
#----------- Load Data into DataFrame -----------
df = pd.read_csv('../Dataset/bots_vs_users.csv')    # Read CSV file into pandas DataFrame

#----------- Drop Very Sparse Columns (>75% missing) -----------
missing_frac = df.isnull().mean().sort_values(ascending=False)    # Compute fraction of missing values per column
high_na      = missing_frac[missing_frac > 0.75].index.tolist()   # Identify columns with >75% NaNs
df           = df.drop(columns=high_na)                           # Remove those sparse columns


#----------- Numeric Imputation & Flags -----------
num_cols = df.select_dtypes(include=['float64','int64']).columns.drop('target')   # Numeric columns excluding target
df[num_cols] = df[num_cols].fillna(0)                                             # Impute NaNs with zero
for c in num_cols:
    df[c + '_was_na'] = (df[c] == 0).astype(int)                                  # Add binary flag for imputed entries


#----------- Boolean Mapping & One-Hot Encoding for Categoricals -----------
bool_cols = [
    c for c in df.columns
    if df[c].dtype == 'object' and set(df[c].dropna().unique()) <= {'True','False'}
]                                                                                 # Detect True/False columns
for c in bool_cols:
    df[c] = df[c].map({'True':1,'False':0})                                       # Map boolean strings to 0/1
cat_cols = df.select_dtypes(include=['object']).columns                           # Remaining categorical columns
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)                        # One-hot encode, drop first level

#----------- Drop Duplicate Rows -----------
df = df.reset_index(drop=True)                       # Reset index after modifications
y_full = df['target']                                # Save target series for later
before = len(df)                                     # Count rows before deduplication
df = df.drop_duplicates()                            # Remove exact duplicate records
print(f"Dropped {before - len(df)} duplicate rows")  # Report number of duplicates removed


#----------- Low-Variance Feature Filter (<1%) -----------
X_full = df.drop(columns=['target'])                              # Separate features
sel    = VarianceThreshold(threshold=0.01)                        # Initialise low-variance selector
X_sel  = sel.fit_transform(X_full)                                # Filter out features with <1% variance
kept   = X_full.columns[sel.get_support()]                        # List of retained feature names
print(f"Kept {len(kept)} features after low-variance filtering")   # Report retained feature count

#----------- Rebuild Cleaned DataFrame & Reattach Target -----------
df_clean = pd.DataFrame(X_sel, columns=kept)                       # Construct DataFrame from filtered features
df_clean['target'] = y_full.loc[df.index].values                  # Reattach the target values

#----------- Final Split into Features/Target, Imputation, Normalization & Standardization -----------
X = df_clean.drop(columns=['target'])                             # Final feature matrix
y = df_clean['target']                                            # Final target vector

# Impute any remaining missing values with the mean
imputer = SimpleImputer(strategy='mean')                          # Mean-value imputer
X_imp   = imputer.fit_transform(X)                                # Impute features

# Normalise features to [0,1]
norm_scaler = MinMaxScaler()                                       # Min–Max normaliser
X_norm      = norm_scaler.fit_transform(X_imp)                     # Scale each feature into [0,1]

# Standardise to zero mean and unit variance
std_scaler = StandardScaler()                                      # Z-score standardiser
X_scaled  = std_scaler.fit_transform(X_norm)                       # Transform normalised data

Dropped 2651 duplicate rows
Kept 74 features after low-variance filtering


### Experiment Parameter Configuration
Here we define the grid of experimental conditions: a list of PCA component counts (`n_components_list`) and dataset sample sizes (`sample_sizes`). We also initialise an empty list (`results`) to collect accuracy metrics for each run.


In [3]:
#----------- Initialise Parameters -----------
n_components_list = [2, 3, 4, 5, 6, 7, 8]   # Matrix used for adjusting the PCA values (Fits the required range 2-4)
sample_sizes      = [10, 100, 1000, 3000]   # Matrix used for adjusting the Sample size used for processing (Fits the required 100 samples)
results = []                                # Used to store each of the accuracies from all of the n components and samples

### Model Training and Evaluation Loop
This cell implements the core experiment. For each combination of PCA dimension and sample size, it:
1. Applies PCA to reduce dimensionality.
2. Subsamples and stratifies the data, then splits into training and test sets.
3. Trains a quantum SVM (`QSVC` with `ZZFeatureMap` + `FidelityStatevectorKernel`) and records its test accuracy.
4. Trains a classical RBF-kernel SVM on the same splits and records its test accuracy.
5. Computes the quantum kernel matrix on the training set, plots it as a heatmap, and saves the figure under `Heatmaps/`.


In [4]:
#----------- Experimental Loop over PCA Dimensions and Sample Sizes -----------
for n_components in n_components_list:
    # 1.) PCA reduction
    pca   = PCA(n_components=n_components)                 # Initialise PCA transformer to reduce to n_components
    X_red = pca.fit_transform(X_scaled)                    # Apply PCA to the scaled feature matrix
    
    for sample_size in sample_sizes:
        #----------- Subsampling and Train/Test Split -----------
        # 2a.) Subsample
        X_sub, _, y_sub, _ = train_test_split(
            X_red, y,
            train_size=sample_size,
            stratify=y,
            random_state=42
        )                                                   # Draw stratified subsample of size sample_size

        # 2b.) Train/test split
        X_train, X_test, y_train, y_test = train_test_split(
            X_sub, y_sub,
            test_size=0.2,
            stratify=y_sub,
            random_state=42
        )                                                   # Split subsample into 80/20 stratified train and test sets
        
        #----------- Quantum Support Vector Classifier -----------
        # 3.) Quantum QSVC
        feature_map    = ZZFeatureMap(feature_dimension=n_components, reps=2)   # Create ZZ feature map circuit (2 reps)
        quantum_kernel = FidelityStatevectorKernel(feature_map=feature_map)     # Build fidelity kernel from feature map
        t0 = time.perf_counter()
        qsvc           = QSVC(quantum_kernel=quantum_kernel)                    # Initialise QSVC with quantum kernel
        qsvc.fit(X_train, y_train)                                              # Train QSVC on training data
        quantum_acc = qsvc.score(X_test, y_test)                                # Evaluate QSVC accuracy on test set
        quantum_train_time = time.perf_counter() - t0
        sv_idx_q         = qsvc.support_                    # indices of SVs in X_train
        alpha_q          = qsvc.dual_coef_[0]               # dual coefficients (y*alpha)
        y_sv_q           = y_train.iloc[sv_idx_q].to_numpy() if hasattr(y_train, 'iloc') else y_train[sv_idx_q]
        # compute quantum margin
        K_sv_q           = quantum_kernel.evaluate(x_vec=X_train[sv_idx_q])
        margin_sq_q      = (alpha_q * y_sv_q) @ K_sv_q @ (alpha_q * y_sv_q)
        margin_q         = 1.0 / np.sqrt(margin_sq_q)
        
        #----------- Classical Support Vector Classifier Baseline -----------
        # 4.) Classical SVM (RBF)
        classical_svc = ClassicalSVC(kernel='rbf', gamma='scale')             # Instantiate classical SVM with RBF kernel
        t1 = time.perf_counter()
        classical_svc.fit(X_train, y_train)                                   # Train classical SVM on training data
        classical_acc = classical_svc.score(X_test, y_test)                   # Evaluate classical SVM accuracy on test set
        classical_train_time = time.perf_counter() - t1   
        sv_idx_c         = classical_svc.support_
        alpha_c          = classical_svc.dual_coef_[0]
        y_sv_c           = y_train.iloc[sv_idx_c].to_numpy() if hasattr(y_train, 'iloc') else y_train[sv_idx_c]
        # compute classical margin using RBF kernel with gamma=1/scale (same as SVC default)
        K_sv_c           = rbf_kernel(X_train[sv_idx_c], X_train[sv_idx_c])
        margin_sq_c      = (alpha_c * y_sv_c) @ K_sv_c @ (alpha_c * y_sv_c)
        margin_c         = 1.0 / np.sqrt(margin_sq_c)
        
        #----------- Print timing & accuracies -----------
        print(f"n={n_components}, m={sample_size} → "
          f"QSVC: acc={quantum_acc:.2f}, Support Vectors={len(sv_idx_q)}, margin={margin_q:.3f} │ "
          f"SVC: acc={classical_acc:.2f}, Support Vectors={len(sv_idx_c)}, margin={margin_c:.3f}")
        
        #----------- Record Results -----------
        results.append({                                                       # Append metrics dictionary to results list
            'n_components'      : n_components,                                #   PCA component count
            'sample_size'       : sample_size,                                 #   Number of samples used
            'quantum_accuracy'  : quantum_acc,                                 #   QSVC test accuracy
            'classical_accuracy': classical_acc                                #   Classical SVM test accuracy
        })
        
        #----------- Kernel Matrix Computation and Heatmap Saving -----------
        # 5.) Printing Heatmaps
        K_train = quantum_kernel.evaluate(x_vec=X_train)                     # Compute quantum kernel matrix on training set
        plt.figure(figsize=(4,4))                                            # Create new figure for heatmap
        plt.imshow(K_train, cmap='viridis')                                  # Display kernel matrix as a heatmap
        plt.title(f"Kernel (n={n_components}, size={sample_size})")          # Set heatmap title
        plt.xlabel("train idx")                                              # Label x-axis
        plt.ylabel("train idx")                                              # Label y-axis
        plt.colorbar(label="K value")                                        # Add colorbar with label
        fname = f"../Heatmaps/kernel_n{n_components}_size{sample_size}_sim.png"     # Define output filename
        plt.savefig(fname, bbox_inches='tight')                              # Save heatmap without extra margins
        plt.close()                                                          # Close figure to prevent inline display

        X_var  = X_train.var()                                                
        gamma  = 1.0 / (X_train.shape[1] * X_var)                             
        K_train_cl = rbf_kernel(X_train, X_train, gamma=gamma)                # now gamma is a float

        plt.figure(figsize=(4,4))
        plt.imshow(K_train_cl, cmap='viridis')
        plt.title(f"Classical RBF Kernel (n={n_components}, size={sample_size})")
        plt.xlabel("train idx")
        plt.ylabel("train idx")
        plt.colorbar(label="Kernel value")

        fname_cl = f"../Heatmaps/kernel_n{n_components}_size{sample_size}_classical.png"
        plt.savefig(fname_cl, bbox_inches='tight')
        plt.close()

n=2, m=10 → QSVC: acc=1.00, Support Vectors=6, margin=0.591 │ SVC: acc=1.00, Support Vectors=5, margin=0.707
n=2, m=100 → QSVC: acc=0.80, Support Vectors=42, margin=0.093 │ SVC: acc=0.85, Support Vectors=35, margin=0.101
n=2, m=1000 → QSVC: acc=0.78, Support Vectors=376, margin=0.010 │ SVC: acc=0.90, Support Vectors=234, margin=0.016
n=2, m=3000 → QSVC: acc=0.78, Support Vectors=1084, margin=0.003 │ SVC: acc=0.90, Support Vectors=631, margin=0.006
n=3, m=10 → QSVC: acc=1.00, Support Vectors=7, margin=0.683 │ SVC: acc=1.00, Support Vectors=5, margin=0.707
n=3, m=100 → QSVC: acc=0.80, Support Vectors=47, margin=0.119 │ SVC: acc=0.90, Support Vectors=33, margin=0.110
n=3, m=1000 → QSVC: acc=0.84, Support Vectors=334, margin=0.016 │ SVC: acc=0.91, Support Vectors=221, margin=0.020
n=3, m=3000 → QSVC: acc=0.84, Support Vectors=986, margin=0.005 │ SVC: acc=0.90, Support Vectors=608, margin=0.007
n=4, m=10 → QSVC: acc=1.00, Support Vectors=8, margin=0.706 │ SVC: acc=1.00, Support Vectors=6, m

### Results Aggregation and Display
Once all runs are complete, this cell converts the collected `results` into a pandas DataFrame, displays the long-format table of metrics, and then pivots it to generate a consolidated accuracy comparison table across PCA dimensions and sample sizes.

In [5]:
#----------- Build & Display Results Table -----------
df_results = pd.DataFrame(results)                        # Construct a DataFrame from the collected results list

#----------- Pivot Into Multi-Level Column Grid -----------
df_pivot = df_results.pivot_table(                        # Create a pivot table indexed by n_components and columns by sample_size
    index='n_components',
    columns='sample_size',
    values=['quantum_accuracy','classical_accuracy']      # Include both quantum and classical accuracy values
)

print("Pivoted accuracy table:")                          # Print header for the pivoted results
display(df_pivot)                                         # Render the pivoted accuracy comparison table

Pivoted accuracy table:


Unnamed: 0_level_0,classical_accuracy,classical_accuracy,classical_accuracy,classical_accuracy,quantum_accuracy,quantum_accuracy,quantum_accuracy,quantum_accuracy
sample_size,10,100,1000,3000,10,100,1000,3000
n_components,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
2,1.0,0.85,0.9,0.898333,1.0,0.8,0.78,0.778333
3,1.0,0.9,0.905,0.896667,1.0,0.8,0.845,0.84
4,1.0,0.9,0.905,0.911667,1.0,0.85,0.84,0.855
5,1.0,0.9,0.905,0.908333,1.0,0.85,0.86,0.875
6,1.0,0.85,0.905,0.916667,1.0,0.85,0.87,0.903333
7,1.0,0.9,0.925,0.928333,1.0,0.85,0.875,0.896667
8,1.0,0.95,0.935,0.935,1.0,0.85,0.875,0.905
