
Bagging a K-Nearest Neighbors (KNN) classifier with bootstrap resampling often yields little benefit. KNN predictions depend only on the nearest training points, and bootstrap samples (drawn with replacement from the same data) overlap heavily, so the resampled models tend to make nearly identical predictions. Bagging reduces variance by averaging models that differ from one another, so this similarity limits what it can gain in stability and accuracy. (Source: stats.stackexchange.com)
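
The overlap between bootstrap samples can be checked numerically. The sketch below is illustrative only: it resamples row indices rather than real data, and `n_resamples` and the seed are arbitrary choices. It measures the fraction of distinct original rows that land in each bootstrap sample and compares it with the closed-form expectation 1 - (1 - 1/n)^n, which approaches 1 - 1/e for large n.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 847    # same size as the training set loaded in this notebook
n_resamples = 200  # arbitrary number of bootstrap draws for the check

# For each bootstrap resample, compute the fraction of distinct original rows it contains
unique_fractions = np.array([
    np.unique(rng.integers(0, n_samples, size=n_samples)).size / n_samples
    for _ in range(n_resamples)
])

# Closed-form expectation of that fraction
expected = 1 - (1 - 1 / n_samples) ** n_samples

print(f"mean unique fraction: {unique_fractions.mean():.3f}")
print(f"theoretical value:    {expected:.3f}")
```

Since every resample covers roughly the same large share of the rows, any two resamples share many points, and for most queries the nearest neighbours change very little between them.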

Instead of bootstrapping the rows, consider the Random Subspace Method, which trains each model on a random subset of the features. Each KNN model then measures distance in a different feature space, which makes the ensemble members more diverse and can improve performance. (Source: stats.stackexchange.com)

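In scikit-learn, BaggingClassifier can wrap KNN to build an ensemble of KNN classifiers trained on different subsets of the data, which can help reduce variance and improve robustness (source: educationalresearchtechniques.com). By default it bootstraps the rows (bootstrap=True). Setting max_features below 1.0 additionally gives each estimator a random subset of the features, and combining that with bootstrap=False yields the Random Subspace Method on its own.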

In summary, bootstrap resampling of the rows alone is usually not the most effective way to ensemble KNN, because the resampled models are too similar to one another. Subsampling the features (the Random Subspace Method), which BaggingClassifier supports through max_features, is typically the more promising way to improve model performance.

In [2]:
import pandas as pd
import numpy as np

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [3]:
train_data = pd.read_csv('train_data.csv')
y_train = train_data['Grade'] # series not df
X_train = train_data.drop(columns=['Grade'])

test_data = pd.read_csv('test_data.csv')
y_test = test_data['Grade']
X_test = test_data.drop(columns=['Grade'])

X_train.info()
y_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pH           847 non-null    float64
 1   Temperature  847 non-null    float64
 2   Colour       847 non-null    float64
 3   Taste        847 non-null    int64  
 4   Odor         847 non-null    int64  
 5   Fat          847 non-null    int64  
 6   Turbidity    847 non-null    int64  
dtypes: float64(3), int64(4)
memory usage: 46.4 KB
<class 'pandas.core.series.Series'>
RangeIndex: 847 entries, 0 to 846
Series name: Grade
Non-Null Count  Dtype
--------------  -----
847 non-null    int64
dtypes: int64(1)
memory usage: 6.7 KB


In [5]:
# Initialize the KNN classifier
knn = KNeighborsClassifier()

# Initialize the BaggingClassifier with KNN as the base estimator.
# max_features=0.8 limits each base estimator to a random subset of the feature columns;
# rows are still bootstrapped because bootstrap=True is the default.
bagging = BaggingClassifier(estimator=knn, n_estimators=10, max_features=0.8, random_state=42)

# Fit the BaggingClassifier on the training data
bagging.fit(X_train, y_train)

# Evaluate the model on the test data
test_accuracy = bagging.score(X_test, y_test)
print(f"Test set accuracy: {test_accuracy:.4f}")

Test set accuracy: 0.9858


hyperparameters, cross-validation