# k-NN Classification on Synthetic Moons Dataset



GitHub link: https://github.com/jeentk17/introto-ai-thanakrit

## Problem & Representation

**Goal:** Given a new 2D point, predict which moon (class 0 or class 1) it belongs to.

**Data:** Two interleaving moons generated via `sklearn.datasets.make_moons(n_samples=500, noise=0.25, random_state=42)`.

**Representation passed to the algorithm:**
- **Features** `X`: shape (n_samples, 2) → columns are `(x1, x2)` coordinates.
- **Labels** `y`: shape (n_samples,) with values `{0, 1}`.

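**Why scaling?** k-NN classifies by distance, so a feature with a larger numeric range would dominate the neighbor search. `StandardScaler` standardizes each feature to zero mean and unit variance so that both coordinates contribute comparably to the distance.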
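
To make the scaling argument concrete, here is a minimal sketch on hypothetical toy data (not the moons dataset, whose two coordinates are already on roughly similar scales). The array `X_toy`, the point `query`, and the helper `nearest_index` are illustrative assumptions. The sketch shows that the nearest neighbor of a query can change once the features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: feature 0 is in the thousands, feature 1 in single digits.
X_toy = np.array([
    [1000.0, 1.0],
    [1010.0, 9.0],
    [1200.0, 1.2],
    [1190.0, 8.8],
])
query = np.array([[1020.0, 1.1]])


def nearest_index(points, q):
    """Return the index of the row in `points` closest to `q` (Euclidean)."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


# On raw features, the large-valued feature 0 dominates the distance.
raw_nn = nearest_index(X_toy, query)

# After standardization, both features contribute on a comparable scale.
scaler = StandardScaler().fit(X_toy)
scaled_nn = nearest_index(scaler.transform(X_toy), scaler.transform(query))

print("nearest neighbor (raw):   ", raw_nn)
print("nearest neighbor (scaled):", scaled_nn)
```

On the raw data the query is matched to row 1, which is close in feature 0 but far in feature 1; after scaling it is matched to row 0, which is close in both.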


## 1. Import Libraries & Setup Output Folder

In [1]:
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

SCRIPT_DIR = os.getcwd()
OUT_DIR = os.path.join(SCRIPT_DIR, "introai1")
os.makedirs(OUT_DIR, exist_ok=True)
print("Saving to:", OUT_DIR)

Saving to: c:\Users\Jeen\Desktop\introai\introai1\introai1


## 2. Generate Synthetic Moons Dataset & Split

In [None]:
# Step 1: Generate dataset
# make_moons creates a synthetic 2D dataset shaped like two moons.
# Inputs: n_samples=500, noise=0.25 to make the dataset less trivial.
# Outputs: X (features), y (labels: 0 or 1)
X, y = make_moons(n_samples=500, noise=0.25, random_state=42)

# Split dataset into training and testing sets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
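
As a quick sanity check on the split, the sketch below repeats the same calls and counts the labels on each side. It assumes nothing beyond the parameters already used above; `train_counts` and `test_counts` are illustrative names. With `stratify=y`, the class counts within each subset should differ by at most one.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("train:", X_train.shape, train_counts)
print("test: ", X_test.shape, test_counts)
```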

## 3. Visualize Dataset

In [None]:
# Step 2: Visualize dataset 
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)  # color each point by its class label
plt.title("Dataset: Two Interleaving Moons")
plt.xlabel("x1")
plt.ylabel("x2")
plt.tight_layout()
plt.savefig(os.path.join(OUT_DIR, "dataset.png"), dpi=160)
plt.close()

## 4. Model Selection: Cross-Validation over k

In [None]:
# Step 3: Model selection (tuning k)
# To find the best value of k, we perform cross-validation.
# For each candidate k (number of neighbors), we:
# 1. Build a pipeline with scaling + k-NN.
# 2. Run 5-fold cross-validation on the training set.
# 3. Compute average accuracy.
# Finally, we select the k with the highest mean CV score.
k_values = [1, 3, 5, 7, 9, 11, 13, 15]
cv_scores = []
for k in k_values:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k))
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
    cv_scores.append(scores.mean())

# Identify best k based on CV performance.
# np.argmax returns the first maximum, so ties go to the smallest k.
best_idx = int(np.argmax(cv_scores))
best_k = k_values[best_idx]

plt.figure(figsize=(6, 5))
plt.plot(k_values, cv_scores, marker="o")
plt.title("Model Selection: CV Accuracy vs k")
plt.xlabel("k (neighbors)")
plt.ylabel("Cross-validated Accuracy")
plt.grid(True, linewidth=0.4)
plt.tight_layout()
plt.savefig(os.path.join(OUT_DIR, "model_selection.png"), dpi=160)
plt.close()
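
The manual loop over `k_values` can also be expressed with scikit-learn's `GridSearchCV`. The following is a sketch of that alternative, not the approach used in this notebook; the names `search`, `param_grid`, and `grid_best_k` are illustrative. The `knn__n_neighbors` key follows the pipeline's `<step name>__<parameter>` convention.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Same candidate values as k_values in the loop above.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15]}

# 5-fold CV on the training set only; the test set stays untouched.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

grid_best_k = search.best_params_["knn__n_neighbors"]
print("best k:", grid_best_k)
print("mean CV accuracy:", round(search.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no information from the validation fold leaks into the scaling statistics. The same holds for the explicit loop above.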

## 5. Train k-NN Model with Best k

In [None]:
# Step 4: Fit best model, evaluate, visualize, and save outputs
# We now instantiate a fresh Pipeline using the best k selected by cross-validation.
# The pipeline has two stages:
# (1) StandardScaler: standardizes features to zero-mean/unit-variance (crucial for distance-based k-NN).
# (2) KNeighborsClassifier: k-NN classifier using the tuned number of neighbors (best_k).
# Note: "Training" for k-NN means storing the (scaled) training data; there is
# no parameter optimization, and the distance work happens at prediction time.
best_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=best_k))
])

# Fit the pipeline on the training data
best_pipe.fit(X_train, y_train)
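
To illustrate what prediction does with the stored training data, here is a minimal NumPy sketch of k-NN majority voting, compared against `KNeighborsClassifier`. The function `knn_predict` and the simple index-based split are assumptions made for illustration; scaling is left out to keep the example short, and the vote rule assumes binary labels `{0, 1}` with an odd `k`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier


def knn_predict(X_stored, y_stored, X_query, k):
    """Hypothetical helper: predict binary labels by majority vote of k neighbors."""
    # Pairwise squared Euclidean distances, shape (n_query, n_stored).
    diffs = X_query[:, None, :] - X_stored[None, :, :]
    d2 = (diffs ** 2).sum(axis=2)
    # Indices of the k closest stored points for each query point.
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_stored[nearest]  # shape (n_query, k)
    # With labels in {0, 1} and odd k, the mean is never exactly 0.5.
    return (votes.mean(axis=1) > 0.5).astype(int)


X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_stored, y_stored = X[:375], y[:375]  # "training" is just keeping these arrays
X_query = X[375:]

manual = knn_predict(X_stored, y_stored, X_query, k=5)
reference = (
    KNeighborsClassifier(n_neighbors=5).fit(X_stored, y_stored).predict(X_query)
)

agreement = float((manual == reference).mean())
print("agreement with scikit-learn:", agreement)
```

The brute-force distance matrix is fine at this size. scikit-learn may use a tree-based index instead, which changes the speed of the neighbor search but not the voting rule.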

## 6. Evaluate Model Performance

In [None]:
# Predictions & basic metrics
# Compute predictions on the held-out test set, then evaluate with accuracy and
# a confusion matrix.
y_pred = best_pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
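
The relationship between the confusion matrix and the reported metrics can be sketched on a small example. The labels below are hypothetical and chosen only for illustration; they are not the notebook's test set. In scikit-learn's convention, rows are true labels and columns are predicted labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only.
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# For binary labels, ravel() yields (tn, fp, fn, tp).
tn, fp, fn, tp = cm_toy.ravel()

# Accuracy is the share of samples on the diagonal.
acc_from_cm = np.trace(cm_toy) / cm_toy.sum()

# Precision and recall for class 1, as in the classification report.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

print(cm_toy)
print("accuracy:", acc_from_cm)
print("precision (class 1):", round(precision_1, 3))
print("recall (class 1):", round(recall_1, 3))
```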

## 7. Plot Decision Boundary

In [None]:
# Decision boundary visualization 
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 500),
    np.linspace(y_min, y_max, 500)
)
# Stack grid coordinates into (N, 2) so the pipeline can predict on them.
grid = np.c_[xx.ravel(), yy.ravel()]
Z = best_pipe.predict(grid).reshape(xx.shape)

plt.figure(figsize=(6, 5))
plt.contourf(xx, yy, Z, alpha=0.25)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=15)
plt.title(f"Decision Boundary (k={best_k})")
plt.xlabel("x1")
plt.ylabel("x2")
plt.tight_layout()
plt.savefig(os.path.join(OUT_DIR, "decision_boundary.png"), dpi=160)
plt.close()

## 8. Plot Confusion Matrix

In [8]:
plt.figure(figsize=(5.8, 5.2))
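im = plt.imshow(cm, interpolation='nearest')
plt.colorbar(im)    # use the image handle to show the count scale
plt.xticks([0, 1])  # one tick per class instead of fractional ticks
plt.yticks([0, 1])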
plt.title("Confusion Matrix")
plt.xlabel("Predicted label")
plt.ylabel("True label")
for (i, j), v in np.ndenumerate(cm):
    plt.text(j, i, str(v), ha='center', va='center')
plt.tight_layout()
plt.savefig(os.path.join(OUT_DIR, "confusion_matrix.png"), dpi=160)
plt.close()

## 9. Display Classification Report and Summary

In [9]:
summary = f"""Best k: {best_k}\nTest Accuracy: {acc:.3f}\n\nClassification Report:\n{classification_report(y_test, y_pred)}"""
print(summary)
with open(os.path.join(OUT_DIR, "results.txt"), "w", encoding="utf-8") as f:
    f.write(summary)

Best k: 5
Test Accuracy: 0.944

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.95      0.94        63
           1       0.95      0.94      0.94        62

    accuracy                           0.94       125
   macro avg       0.94      0.94      0.94       125
weighted avg       0.94      0.94      0.94       125

