

---

### 📌 **Step 1: Importing Required Libraries**
```python
import numpy as np  # 📊 NumPy for numerical operations
import pandas as pd  # 📝 Pandas for handling datasets
import matplotlib.pyplot as plt  # 📈 Matplotlib for plotting graphs
import seaborn as sns  # 🎨 Seaborn for better-looking plots
from sklearn.model_selection import train_test_split, cross_val_score  # 🎯 Splitting data & cross-validation
from sklearn.preprocessing import StandardScaler  # 📏 Standardization of features
from sklearn.neighbors import KNeighborsClassifier, BallTree, KDTree  # 🤖 k-NN model and efficient search trees
from sklearn.metrics import accuracy_score  # ✅ Measuring model performance
```
- These libraries help with data manipulation, visualization, and applying machine learning models.

---

### 📌 **Step 2: Generating a Synthetic Dataset**
```python
np.random.seed(42)  # 🎲 Setting seed for reproducibility (same random numbers every time)
X = np.random.rand(15, 4) * 10  # 🔢 Generating a 15×4 matrix with values between 0-10
y = np.random.choice([0, 1], size=15)  # 🎯 Generating 15 random binary labels (0 or 1)
```
- We generate **15 rows** and **4 features** (values between **0 and 10**).
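- The target variable `y` consists of **random binary values (0 or 1)**, representing **two classes**. Because the labels are random, the features carry no real signal, so accuracies in the later steps should be read as a demonstration of the workflow, not as meaningful results.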

---

### 📌 **Step 3: Splitting Dataset into Training & Testing Sets**
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
- `train_test_split()` splits the data into:
  - **Training data** (about 70%, here 10 of the 15 rows), used to fit the model.
  - **Testing data** (about 30%, here 5 of the 15 rows), used to evaluate the model.
- `random_state=42` makes the split reproducible.
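With only 15 rows the split cannot be exactly 70/30. The short check below reuses the seed and shapes from Step 2 and prints the sizes that `train_test_split()` actually produces; it is a sanity check, not part of the original walkthrough.

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.random.rand(15, 4) * 10
y = np.random.choice([0, 1], size=15)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 0.3 * 15 = 4.5, and scikit-learn rounds the test count up to 5
print("Train rows:", X_train.shape[0])
print("Test rows:", X_test.shape[0])
```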

---

### 📌 **Step 4: Feature Scaling (Standardization)**
```python
scaler = StandardScaler()  # 📏 Initializing the scaler
X_train_scaled = scaler.fit_transform(X_train)  # 🔄 Standardizing training data
X_test_scaled = scaler.transform(X_test)  # ✅ Transforming test data using same scaler
```
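- **Standardization** rescales each feature to **mean=0 and variance=1**. The mean and standard deviation are learned from the training data only (`fit_transform`) and then reused on the test data (`transform`), so no information from the test set leaks into the model.
- **Why?** Because k-NN uses distances, a feature with a larger numeric range would otherwise dominate the result.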
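To make the fit/transform distinction concrete, here is a minimal sketch on illustrative random data (not the tutorial's dataset). It checks that `transform()` applies the training mean and standard deviation, which `StandardScaler` exposes as `mean_` and `scale_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)          # illustrative data, not the tutorial's
X_train = rng.random((10, 4)) * 10
X_test = rng.random((5, 4)) * 10

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# transform() reuses the mean and std learned from the training rows
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_test_scaled))

print(X_train_scaled.mean(axis=0).round(6))   # approximately 0 per column
print(X_train_scaled.std(axis=0).round(6))    # approximately 1 per column
```

The scaled test columns will generally not have mean 0 or variance 1 themselves, which is expected: they are measured against the training statistics.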

---

### 📌 **Step 5: Applying k-NN with k=3**
```python
knn = KNeighborsClassifier(n_neighbors=3)  # 🏹 Initializing k-NN with k=3
knn.fit(X_train_scaled, y_train)  # 🏋️ Training the model on training data
y_pred = knn.predict(X_test_scaled)  # 🤖 Making predictions on test data
accuracy = accuracy_score(y_test, y_pred)  # 🎯 Calculating accuracy
print("Initial k-NN Accuracy:", accuracy)
```
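- **k-NN finds the 3 closest training points (`k=3`)** and predicts the majority class among them.
- `fit()` stores the training data (k-NN has no real training phase), and `predict()` does the neighbor search and the voting.
- `accuracy_score()` returns the fraction of test labels that were predicted correctly.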
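The majority vote can be sketched by hand in a few lines. The toy points, the query, and the helper `knn_predict` below are hypothetical and only illustrate the idea; scikit-learn's own implementation is more general and uses faster search structures.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: four training points with labels
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
query = np.array([0.2, 0.1])

def knn_predict(X, y, point, k=3):
    """Predict by majority vote among the k nearest training points."""
    distances = np.linalg.norm(X - point, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]             # indices of the k closest
    votes = Counter(y[nearest].tolist())
    return votes.most_common(1)[0][0]

# The three nearest points have labels 0, 0, 1, so class 0 wins
print(knn_predict(X_train, y_train, query))
```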

---

### 📌 **Step 6: Exploring Different Distance Metrics**
```python
metrics = ['euclidean', 'manhattan', 'minkowski']  # 📏 Distance metrics ('minkowski' defaults to p=2, i.e. Euclidean)
results = {}

for metric in metrics:
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric)  # 🏹 Trying different distance metrics
    knn.fit(X_train_scaled, y_train)  # 🏋️ Training
    y_pred = knn.predict(X_test_scaled)  # 🤖 Predicting
    results[metric] = accuracy_score(y_test, y_pred)  # 🎯 Storing accuracy

print("Metric Comparison:", results)
```
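- **Distance metrics affect how "close" neighbors are**:
  - **Euclidean**: Straight-line distance 🏹
  - **Manhattan**: Block-wise distance (sum of absolute differences) 🚶‍♂️
  - **Minkowski**: Generalized distance with order `p` (`p=1` is Manhattan, `p=2` is Euclidean) 🔢
- We test each metric and store the **accuracy**. Since scikit-learn uses `p=2` by default, `'minkowski'` gives the same accuracy as `'euclidean'` here.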
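As a small worked example, the three distances can be computed directly with NumPy for two made-up points. The helper `minkowski` is written here for illustration and is not a scikit-learn function.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

def minkowski(u, v, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(3^2 + 4^2)
manhattan = np.sum(np.abs(a - b))           # 3 + 4

print(euclidean)            # 5.0
print(manhattan)            # 7.0
print(minkowski(a, b, 1))   # same as Manhattan
print(minkowski(a, b, 2))   # same as Euclidean
```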

---

### 📌 **Step 7: Exploring Different Weighting Schemes**
```python
weighting_schemes = ['uniform', 'distance']
weight_results = {}

for weight in weighting_schemes:
    knn = KNeighborsClassifier(n_neighbors=3, weights=weight)  # ⚖️ Different weighting methods
    knn.fit(X_train_scaled, y_train)  # 🏋️ Training
    y_pred = knn.predict(X_test_scaled)  # 🤖 Predicting
    weight_results[weight] = accuracy_score(y_test, y_pred)  # 🎯 Storing accuracy

print("Weighting Scheme Comparison:", weight_results)
```
- **Weighting schemes decide how votes from neighbors are counted**:
  - **Uniform**: Each neighbor gets **equal weight** ⚖️
  - **Distance**: Closer neighbors have **higher influence** 🔥
- Accuracy is stored for each scheme.
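The two schemes can disagree on the same neighbors. The sketch below uses hypothetical distances and labels for the 3 neighbors of one query point, and assumes all distances are nonzero so that inverse-distance weights are defined.

```python
import numpy as np

# Hypothetical neighbors of one query point: distances and labels
distances = np.array([0.5, 2.0, 2.5])
labels = np.array([1, 0, 0])

def vote(distances, labels, weights="uniform"):
    """Return the winning class under uniform or inverse-distance voting."""
    w = np.ones_like(distances) if weights == "uniform" else 1.0 / distances
    totals = {c: w[labels == c].sum() for c in np.unique(labels)}
    return max(totals, key=totals.get)

# Uniform: class 0 wins with 2 votes against 1
print(vote(distances, labels, "uniform"))
# Distance: class 1 wins, 1/0.5 = 2.0 against 1/2.0 + 1/2.5 = 0.9
print(vote(distances, labels, "distance"))
```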

---

### 📌 **Step 8: Finding the Best k (Hyperparameter Tuning)**
```python
k_values = range(1, 9)  # 🔢 Testing k from 1 to 8 (each CV fold trains on only 8 of the 10 training rows, so k cannot exceed 8)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5)  # 🤹 5-fold cross-validation
    cv_scores.append(scores.mean())  # 📊 Storing mean accuracy

best_k = k_values[np.argmax(cv_scores)]  # 🏆 Finding k with highest accuracy

plt.plot(k_values, cv_scores, marker='o', linestyle='-')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Optimal k selection')
plt.grid()
plt.show()

print("Best k:", best_k)
```
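- We **test different k values** and use **5-fold cross-validation** on the training set to pick the `k` with the highest mean accuracy.
- **Why?** Too small `k` → **Overfitting** (the boundary follows noise), too large `k` → **Underfitting** (the boundary is over-smoothed).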
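One way to see the overfitting end of this trade-off is to score the model on its own training data. The sketch below uses illustrative random data with random labels (not the tutorial's dataset) and assumes no two rows are identical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)            # illustrative data with random labels
X = rng.random((40, 4))
y = rng.integers(0, 2, size=40)

for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    train_acc = model.score(X, y)
    print(f"k={k:2d}  training accuracy={train_acc:.2f}")

# With k=1 every training point is its own nearest neighbor,
# so training accuracy is 1.0 even though the labels are pure noise.
k1_train_acc = KNeighborsClassifier(n_neighbors=1).fit(X, y).score(X, y)
```

A perfect training score on noise is exactly why `k` is chosen with cross-validation instead of training accuracy.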

---

### 📌 **Step 9: Decision Boundary Visualization (2 Features Only)**
```python
X_train_2D = X_train_scaled[:, :2]  # 📊 Selecting only 2 features for visualization
X_test_2D = X_test_scaled[:, :2]

knn_2D = KNeighborsClassifier(n_neighbors=best_k)
knn_2D.fit(X_train_2D, y_train)  # 🎯 Training k-NN with the best k

x_min, x_max = X_train_2D[:, 0].min() - 1, X_train_2D[:, 0].max() + 1
y_min, y_max = X_train_2D[:, 1].min() - 1, X_train_2D[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

Z = knn_2D.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
sns.scatterplot(x=X_train_2D[:, 0], y=X_train_2D[:, 1], hue=y_train, palette='coolwarm', edgecolor='k')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.title(f'Decision Boundary with k={best_k}')
plt.show()
```
- **Only 2 features** are used for plotting decision boundaries.
- **Meshgrid** is created for prediction.
- **Decision boundary** is drawn based on class predictions.
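The meshgrid step is easier to follow on a tiny grid. This sketch uses a 3×2 grid in place of the 200×200 one above, purely to show the shapes involved.

```python
import numpy as np

xs = np.linspace(0, 1, 3)    # 3 values along the x-axis
ys = np.linspace(0, 1, 2)    # 2 values along the y-axis
xx, yy = np.meshgrid(xs, ys)

# Each row of grid_points is one (x, y) coordinate to classify
grid_points = np.c_[xx.ravel(), yy.ravel()]

print(xx.shape)            # (2, 3)
print(grid_points.shape)   # (6, 2)
print(grid_points[:3])     # first grid row: y = 0 with x = 0, 0.5, 1
```

After predicting a class for every row of `grid_points`, reshaping the result back to `xx.shape` gives the 2D array that `contourf` colors.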

---

### 🎯 **Final Summary**
1. **Data Preparation** ✅  
2. **Feature Scaling** ✅  
3. **Applying k-NN** ✅  
4. **Exploring Different Distance Metrics & Weighting Schemes** ✅  
5. **Hyperparameter Tuning (Finding Best k)** ✅  
6. **Decision Boundary Visualization** ✅  



# **📌 Scaled vs. Non-Scaled Dataset in Machine Learning 🚀**  

In machine learning, **scaling** means transforming numerical features so they are on a similar scale. This is super important for models that rely on **distance calculations** like **k-NN, SVM, and K-Means**! Let's break it down.  

---

## **🔹 Non-Scaled Dataset (Original Data) 🛑**  
🔹 The raw dataset where features **retain their original values**.  
🔹 Example:  
```
Feature 1  |  Feature 2
-------------------------
     10    |     500  
      2    |      50  
     15    |    1000  
```
⚠️ **Problems**:  
❌ **Different scales** (Feature 1 ranges from **2 to 15**, Feature 2 from **50 to 1000**).  
❌ **Distance-based models get biased** toward larger values.  
❌ **Poor model performance** for k-NN, K-Means, and SVM.  
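
To put a number on that bias, the sketch below takes the three rows from the table above and measures how much of the squared Euclidean distance between the first two rows comes from each feature.

```python
import numpy as np

# The three rows from the table above
X = np.array([[10.0, 500.0],
              [2.0, 50.0],
              [15.0, 1000.0]])

diff = X[0] - X[1]                 # per-feature difference between rows 1 and 2
squared = diff ** 2
distance = np.sqrt(squared.sum())
share = squared / squared.sum()    # fraction of squared distance per feature

print(distance)
print(share)   # Feature 2 accounts for well over 99% of the squared distance
```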

---

## **🔹 Scaled Dataset (Normalized/Standardized Data) ✅**  
✔️ All features are transformed to have **similar scales**.  
✔️ Prevents bias toward large numbers.  
✔️ Helps **distance-based models perform better**.  

### **📌 Two Common Scaling Techniques 📏**
### **1️⃣ Standardization (Z-score scaling)**
- **Formula**:  
  \[
  X_{\text{scaled}} = \frac{X - \text{mean}(X)}{\text{std}(X)}
  \]
- 📊 **Mean = 0, Standard Deviation = 1**  
- ✅ **Used in k-NN, SVM, PCA**
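
The formula can be checked by hand on the three-row table from the non-scaled section. This sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0, 500.0],
              [2.0, 50.0],
              [15.0, 1000.0]])

# Population standard deviation (ddof=0), matching StandardScaler
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_sklearn = StandardScaler().fit_transform(X)

print(np.round(X_manual, 2))
print(np.allclose(X_manual, X_sklearn))
```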

---

### **2️⃣ Min-Max Scaling (Normalization)**
- **Formula**:  
  \[
  X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
  \]
- 📏 **Values transformed between 0 and 1**  
- ✅ **Used in Neural Networks & Deep Learning**  
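
The same three-row table can be used to try min-max scaling. The sketch below applies the formula manually and compares it with scikit-learn's `MinMaxScaler` at its default range of 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 500.0],
              [2.0, 50.0],
              [15.0, 1000.0]])

# Column-wise (X - min) / (max - min)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_sklearn = MinMaxScaler().fit_transform(X)

print(np.round(X_manual, 3))
print(np.allclose(X_manual, X_sklearn))
```

In each column the smallest value maps to 0 and the largest to 1, so a single extreme outlier can squeeze the remaining values into a narrow band.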

🔹 **Example (the non-scaled table after standardization, rounded to 2 decimals)**:  
```
Feature 1  |  Feature 2
-------------------------
    0.19   |   -0.04  
   -1.31   |   -1.20  
    1.12   |    1.25  
```
---

## **📌 Why Should You Scale Data? ⚖️**
| 🔍 **Algorithm**        | 🏆 **Needs Scaling?** | 🤔 **Why?** |
|------------------------|--------------------|-------------|
| ✅ k-NN (k-Nearest Neighbors) | ✅ Yes | Uses **Euclidean distance**, so scale matters |
| ✅ SVM (Support Vector Machine) | ✅ Yes | Uses **dot products & distances** |
| ✅ K-Means Clustering | ✅ Yes | Distance-based clustering |
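| ❌ Linear Regression | ❌ No | For ordinary least squares, coefficients rescale to compensate, so predictions are unchanged (regularized variants such as Ridge and Lasso do benefit from scaling) |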
| ❌ Decision Trees | ❌ No | Splitting is **not scale-dependent** |

---

## **🔹 Python Code Example 🐍**
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # 📏 Initialize scaler
X_scaled = scaler.fit_transform(X)  # 🔄 Transform dataset
```
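✔️ This **standardizes** each feature of the dataset to **mean = 0 and variance = 1**. When a test set is held out, fit the scaler on the training split only and reuse it on the test split, as in Step 4.  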
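
One common way to keep scaling and the model together is a scikit-learn pipeline. The sketch below is one possible arrangement, using illustrative random data and `k=3`; it is not the only way to structure this.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)             # illustrative data only
X = rng.random((30, 4)) * 10
y = rng.integers(0, 2, size=30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation before every prediction.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred.shape)
```

Because the scaler lives inside the pipeline, passing `model` to `cross_val_score` would also refit the scaler within each fold.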

---

### **🚀 Conclusion**
✔️ Always **scale features** when using **distance-based models** like **k-NN, SVM, and Clustering algorithms**!  
✔️ **Skipping scaling can lead to incorrect predictions and poor model performance!** 😱  

