# K-Nearest Neighbor Classification for NBA Rookies

This notebook uses first-season game statistics for NBA rookies to predict whether a player will last at least 5 years in the league. We treat this as a binary classification problem and solve it with a K-Nearest Neighbors (KNN) classifier implemented from scratch. The target variable is 1 if the player stayed in the NBA for 5+ years and 0 otherwise.

We will proceed through the following steps:

* Import Necessary Libraries
* Load and Prepare the Dataset
* KNN Implementation
* Apply KNN to Classification Problem
* Tune the Hyperparameter k
* Evaluate Final Model Performance

## 1. Import necessary libraries

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)

## 2. Load and prepare the dataset

### Display dataset information

In [None]:
nba_data = pd.read_csv('nba-rookies.csv')

print("Dataset Overview:")
print(nba_data.head())
print("\nInformation about the Dataset:")
nba_data.info()

### Remove duplicates, display target class distribution and features

In [None]:
nba_data_clean = nba_data.drop_duplicates(subset=['name'], keep='first')
print(f"\nDataset size before removing duplicates: {nba_data.shape}")
print(f"Dataset size after removing duplicates: {nba_data_clean.shape}")

print("\nTarget class distribution:")
print(nba_data_clean['target_5yrs'].value_counts())
print()
print((nba_data_clean['target_5yrs'].value_counts(normalize=True) * 100).round(2))

X = nba_data_clean.drop(['name', 'Unnamed: 0', 'target_5yrs'], axis=1)
y = nba_data_clean['target_5yrs']
print("\nAll features:")
print(X.columns.tolist())
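
The class distribution printed above also gives a useful reference point: a classifier that always predicts the most frequent class reaches an accuracy equal to that class's share of the data. Below is a minimal sketch on a small made-up label array; `majority_baseline` and `toy_labels` are illustrative names, not part of the dataset or the notebook.

```python
import numpy as np

def majority_baseline(labels):
    """Return (majority_class, accuracy) for an always-predict-majority rule."""
    classes, counts = np.unique(labels, return_counts=True)
    best = np.argmax(counts)
    return classes[best], counts[best] / counts.sum()

# Hypothetical labels for illustration only: six 1s and four 0s
toy_labels = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 1])
majority_class, baseline_acc = majority_baseline(toy_labels)
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_acc:.2f}")
```

On the real data, `majority_baseline(y)` would give the accuracy that the KNN model needs to beat.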

### Split into training and test sets, then standardize features

In [None]:
# Split first so that the test set does not influence the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("\nSplit into training (80%) and test sets (20%)")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
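
Standardization matters for KNN because Euclidean distance is dominated by whichever feature has the largest numeric range. As a sketch of what `StandardScaler` computes, the z-score below subtracts each column's mean and divides by its population standard deviation (`ddof=0`). In this sketch the statistics come from the training rows only and are reused for the test row. The arrays are made up for illustration.

```python
import numpy as np

# Hypothetical features on very different scales (a count and a fraction)
train = np.array([[80.0, 0.40], [40.0, 0.50], [60.0, 0.30]])
test = np.array([[70.0, 0.45]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_z = (train - mean) / std
test_z = (test - mean) / std  # reuse the training statistics for the test rows

print(train_z.mean(axis=0))  # approximately 0 in every column
print(train_z.std(axis=0))   # approximately 1 in every column
```

After scaling, both columns contribute on a comparable scale to the distance between two players.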

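## 3. KNN Implementation

In [None]:
class KNN:
    """K-Nearest Neighbors classifier using Euclidean distance and majority vote."""

    def __init__(self, X_train, y_train, k=5):
        # Store the data as NumPy arrays so positional indexing works
        # whether a pandas object or an array is passed in
        self.X_train = np.asarray(X_train)
        self.y_train = np.asarray(y_train)
        self.classes = np.unique(self.y_train)
        self.k = k

    def predict(self, X_test):
        y_pred = []

        for x in np.asarray(X_test):
            # Euclidean distance from x to every training sample
            distances = np.linalg.norm(self.X_train - x, axis=1)
            # Indices of the k closest training samples
            k_nearest_indices = distances.argsort()[:self.k]
            k_nearest_labels = self.y_train[k_nearest_indices]

            # Majority vote; on a tie, np.argmax picks the smallest class label
            unique_classes, counts = np.unique(k_nearest_labels, return_counts=True)
            most_common_class = unique_classes[np.argmax(counts)]
            y_pred.append(most_common_class)

        return np.array(y_pred)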
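
To make the voting rule concrete, here is a small self-contained sketch of the same idea on made-up 2-D points: compute Euclidean distances, take the k closest training points, and return the most frequent label. `knn_predict_one` and the toy arrays are illustrative names, not part of the notebook's class.

```python
import numpy as np

def knn_predict_one(X_train, y_train, x, k=3):
    """Classify a single point x by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Two well-separated toy clusters: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict_one(X_toy, y_toy, np.array([0.5, 0.5]))  # query near the first cluster
pred_b = knn_predict_one(X_toy, y_toy, np.array([5.5, 5.5]))  # query near the second cluster
print(pred_a, pred_b)
```

A check like this on data with an obvious answer is a quick way to catch indexing or distance mistakes before running on the real features.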
        

## 4. Apply KNN to Classification Problem with default k=5

In [None]:
knn = KNN(X_train, y_train, k=5)

y_pred = knn.predict(X_test)

accuracy = np.mean(y_test == y_pred)
print(f"\nTest accuracy with k=5: {accuracy:.4f}")

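## 5. Tune the Hyperparameter k

In [None]:
# Hold out part of the training data for validation so that the test set
# is not used to choose k and remains an unbiased check of the final model
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)

# Odd values of k avoid ties in a two-class majority vote
k_values = range(1, 50, 2)

k_accuracies = []

for k in k_values:
    knn = KNN(X_tr, y_tr, k=k)
    y_pred = knn.predict(X_val)
    accuracy = np.mean(y_val == y_pred)
    k_accuracies.append(accuracy)

# np.argmax returns the first maximum, so ties go to the smallest k
optimal_k = k_values[np.argmax(k_accuracies)]

plt.figure(figsize=(10, 6))
plt.plot(k_values, k_accuracies, 'o-', linewidth=2, markersize=8)
plt.axvline(x=optimal_k, color='r', linestyle='--', label=f'Optimal k={optimal_k}')
plt.xlabel('k Value (Number of Neighbors)', fontsize=12)
plt.ylabel('Validation Accuracy', fontsize=12)
plt.title('Validation Accuracy vs. k Value', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()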
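
A single accuracy curve can be noisy, so the best k may shift with the particular split. One common alternative is k-fold cross-validation on the training data, which averages the accuracy over several folds. The sketch below is a minimal, unstratified version on synthetic data; `knn_predict`, `cross_val_accuracy`, and the demo arrays are illustrative names and not part of the notebook.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote KNN predictions for every row of X_query."""
    preds = []
    for x in X_query:
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def cross_val_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean validation accuracy of KNN over n_folds shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_hat = knn_predict(X[train_idx], y[train_idx], X[val_idx], k)
        scores.append(np.mean(y_hat == y[val_idx]))
    return float(np.mean(scores))

# Synthetic two-class data for illustration only
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

cv_scores = {k: cross_val_accuracy(X_demo, y_demo, k) for k in (1, 5, 15)}
print(cv_scores)
```

On the notebook's data, `X_train` and `np.asarray(y_train)` would take the place of the synthetic arrays, and the k with the highest mean score would be chosen.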

## 6. Evaluate Final Model Performance with optimal k

In [None]:
final_knn = KNN(X_train, y_train, k=optimal_k)

final_y_pred = final_knn.predict(X_test)

final_accuracy = np.mean(y_test == final_y_pred)
print(f"\nFinal model accuracy with k={optimal_k}: {final_accuracy:.4f}")
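
Accuracy alone can hide how the errors are split between the two classes. Below is a minimal sketch of a confusion matrix with precision and recall for the positive class (1 = stayed 5+ years), computed with NumPy on made-up labels. `binary_confusion` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Return (tn, fp, fn, tp) for labels encoded as 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tn, fp, fn, tp

# Hypothetical labels and predictions for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = binary_confusion(y_true_demo, y_pred_demo)
# Guard against division by zero when a class is never predicted or never present
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision={precision:.2f} Recall={recall:.2f}")
```

Because the helper converts its inputs with `np.asarray`, `binary_confusion(y_test, final_y_pred)` should work directly on the notebook's test labels and predictions.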