In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In [0]:
data = pd.read_csv("column_2C_weka.csv")
data.head()

In [0]:
data.info()

In [0]:
data.describe()

In [0]:
data.isnull().sum()

In [0]:
data['class'].value_counts()

In [0]:
color_list = ['red' if i == 'Abnormal' else 'green' for i in data.loc[:, 'class']]

pd.plotting.scatter_matrix(data.loc[:, data.columns != 'class'],
                           c=color_list,
                           figsize=[10, 10],
                           diagonal='hist',
                           alpha=0.5,
                           s=200,
                           marker='*',
                           edgecolor='black')

### Similar to Demo: Basic Machine Learning Part – 1


In [0]:
x = data.loc[:, data.columns!='class']
y = data.loc[:, 'class']

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
log_reg.predict(x_test)
log_reg.score(x_test, y_test)

### Similar to Demo: Basic Machine Learning Part – 2

In [0]:
x = data.loc[:, data.columns!='class']
y = data.loc[:, 'class']

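x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training split only so test-set statistics do not leak into training
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)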

log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
log_reg.predict(x_test)
log_reg.score(x_test, y_test)

# 📌 What is the K-Nearest Neighbor (KNN) Algorithm?

🎥 **Video Title:** What is the K-Nearest Neighbor (KNN) Algorithm?  
👨‍🏫 **Creator:** [IBM Technology](https://www.youtube.com/@IBMTechnology)  
🔗 **Watch here:** [YouTube Video](https://www.youtube.com/watch?v=b6uHw7QW_n4)  
🖼️ **Thumbnail:**  
![KNN](https://i.ytimg.com/vi/b6uHw7QW_n4/hqdefault.jpg)

---

## 🚀 KNN in a Nutshell

K-Nearest Neighbors (KNN) is a **supervised learning algorithm** used for both **classification and regression**. It rests on the idea of **proximity**: points that are close together in feature space tend to share the same label.

---

## 📊 Key Concepts Explained

### 📌 1. Core Principle
- KNN classifies a new data point by the **majority label among its "K" nearest neighbors** in feature space.
- It assumes that similar things exist near each other.

### 🍏 2. Fruit Example
- Features: **Sweetness (x-axis)** & **Crunchiness (y-axis)**.
- New data points are classified by checking proximity to existing labeled data (e.g., apples vs oranges).
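
The fruit example can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the sweetness and crunchiness values, the 0-10 scale, and the helper name `knn_predict` are all assumptions made up for this sketch.

```python
from collections import Counter

import numpy as np

# Hypothetical fruit data: columns are (sweetness, crunchiness) on a 0-10 scale.
X_train = np.array([
    [7.0, 8.0], [6.5, 9.0], [7.5, 7.5],   # apples: crunchy
    [8.0, 2.0], [9.0, 1.5], [8.5, 3.0],   # oranges: soft
])
y_train = np.array(["apple", "apple", "apple", "orange", "orange", "orange"])


def knn_predict(X, y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = np.sqrt(((X - query) ** 2).sum(axis=1))  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    return str(Counter(y[nearest]).most_common(1)[0][0])  # most frequent label wins


new_fruit = np.array([7.2, 8.4])  # sweet and crunchy
print(knn_predict(X_train, y_train, new_fruit, k=3))
```

The new fruit sits inside the crunchy cluster, so all three of its nearest neighbors are apples and the vote is unanimous.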

### 🧮 3. Distance Metrics
- Measures proximity using:
  - **Euclidean Distance**
  - **Manhattan Distance**
- Visualization: **Voronoi Diagrams** are used to show decision boundaries.
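
To make the two metrics concrete, here is a small sketch that computes both for a pair of made-up points (the coordinates are chosen only so the arithmetic is easy to check by hand).

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean: straight-line distance, sqrt(3^2 + 4^2)
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan: sum of absolute per-axis differences, 3 + 4
manhattan = np.sum(np.abs(a - b))

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```

In scikit-learn, `KNeighborsClassifier` uses Euclidean distance by default; passing `metric="manhattan"` switches to the other one.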

### 🔢 4. Choosing the Right ‘K’
- **K = 1**: Assign class of nearest neighbor.
- Larger K values smooth the model’s decisions.
- Use **odd K values** to avoid tie votes in binary classification (with more than two classes, ties are still possible).

---

## ⚖️ Pros & Cons of KNN

### ✅ Strengths
- ✅ Simple and intuitive
- ✅ Few hyperparameters (just K & distance metric)
- ✅ No explicit training phase: new labeled data can be added without retraining a model

### ❌ Weaknesses
- ❌ Poor scalability with large datasets (lazy learning keeps the whole training set in memory and searches it at prediction time)
- ❌ Suffers from the **curse of dimensionality**
- ❌ High sensitivity to noisy or irrelevant features
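
The curse of dimensionality can be seen with a rough experiment on uniformly random points (an illustrative setup, not taken from the video): as the number of features grows, the nearest and farthest neighbors of a query end up at almost the same distance, so "nearest" carries less information. The helper name `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_features, n_points=500):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()


# The contrast shrinks sharply as dimensionality grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```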

---
## 🔍 Exploring the Impact of K: Overfitting vs. Underfitting

Choosing the right value of **K** in the K-Nearest Neighbor algorithm is crucial: it directly controls the model’s **bias-variance tradeoff** and therefore its overall performance.

---

### 🔢 What Does ‘K’ Really Mean?

- The **K** in KNN refers to the number of nearest data points used to determine a prediction for a new instance.
- You can think of it as "how many friends you're asking for advice" before deciding.

---

### 📈 Low K Value = High Variance (Overfitting)

- Example: **K = 1**
  - The model simply chooses the **closest neighbor**.
  - This can lead to **overfitting** because it's overly sensitive to noise or outliers.
  - Every tiny fluctuation in the dataset affects predictions.
  
📉 **Overfitting Symptoms:**
- Excellent training accuracy, but poor generalization to new data.
- Highly irregular decision boundaries.
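
A quick way to see these symptoms is to fit K = 1 on noisy data. The sketch below uses a synthetic dataset from `make_classification` (an assumption for illustration, not the notebook's data) with some labels deliberately randomized.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; flip_y randomly reassigns about 20% of the labels to act as noise.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Each training point is its own nearest neighbor, so the model memorizes the noise.
train_acc = knn_1.score(X_train, y_train)
test_acc = knn_1.score(X_test, y_test)
print(f"K=1 train accuracy: {train_acc:.2f}")
print(f"K=1 test accuracy:  {test_acc:.2f}")
```

Training accuracy is perfect by construction, while test accuracy is lower because the memorized noisy labels do not generalize.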

---

### 📉 High K Value = High Bias (Underfitting)

- Example: **K = 15 or 20** (how large is "too large" depends on the dataset size)
  - The model averages over many neighbors, potentially from different classes.
  - Can lead to **underfitting**, as it smooths out local patterns.
  
📉 **Underfitting Symptoms:**
- Model is too simplistic to capture complex patterns.
- Poor training **and** test performance.
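
The extreme case makes the effect easy to see: when K equals the number of training points, every prediction is simply the majority class. The sketch below uses two made-up, well-separated clusters (60 points vs. 40 points) to show this.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters: 60 points of class 0, 40 points of class 1.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(60, 2)),
    rng.normal(5.0, 0.5, size=(40, 2)),
])
y = np.array([0] * 60 + [1] * 40)

small_k = KNeighborsClassifier(n_neighbors=5).fit(X, y)
huge_k = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)

print(small_k.score(X, y))            # local votes separate the clusters
print(huge_k.score(X, y))             # every vote includes all 100 points
print(np.unique(huge_k.predict(X)))   # only the majority class is ever predicted
```

Even on its own training data, the K = 100 model scores only the majority-class share, which is the "poor training and test performance" symptom.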

---

### 🎯 How to Choose the Best K?

✅ **Best Practices:**
- Use **cross-validation** to test different K values and find the sweet spot.
- Choose an **odd value** of K to avoid tie votes in binary classification.
- Try plotting an **error rate vs. K** graph; it shows where the model moves from overfitting (small K) to underfitting (large K).

🛠️ **Typical Range:**  
- Start testing with K values in the range **3–15**.
- Use **grid search** or other hyperparameter tuning tools for optimization.
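
One possible way to combine these practices is `GridSearchCV` over a scaling-plus-KNN pipeline. This is a sketch on synthetic data (the dataset, the step names `scale` and `knn`, and the odd-K grid are assumptions for illustration); putting the scaler inside the pipeline means it is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in dataset for illustration only.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # scaling is learned per training fold
    ("knn", KNeighborsClassifier()),
])

# Odd K values in the 3-15 range.
param_grid = {"knn__n_neighbors": list(range(3, 16, 2))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_["mean_test_score"]` holds the per-K cross-validated accuracy, which can be plotted against K.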

---

### ⚠️ Bonus Tips

- **Data Scaling Matters:** Scale features (e.g., with StandardScaler) before applying KNN; otherwise features with large numeric ranges dominate the distance calculation. Fit the scaler on the training split only.
- **Dimensionality:** Too many features can dilute distance comparisons (curse of dimensionality), making the choice of K even harder.
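
The scaling tip can be illustrated with a toy table of ages and incomes (hypothetical values chosen for this sketch). On raw data the income column dominates the distance; after standardization the nearest neighbor of the first person changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: columns are (age in years, income in dollars).
X = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],   # very different age, similar income
    [26.0, 52_000.0],   # similar age, somewhat different income
    [40.0, 58_000.0],
    [55.0, 44_000.0],
])


def nearest_to_first(data):
    """Row index of the point closest (Euclidean) to row 0."""
    d = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(d)) + 1


X_scaled = StandardScaler().fit_transform(X)

print(nearest_to_first(X))         # raw: the income gap decides almost everything
print(nearest_to_first(X_scaled))  # scaled: age and income contribute comparably
```

On the raw values the 60-year-old is the nearest neighbor of the 25-year-old; after scaling it is the 26-year-old.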

---



## 🏥 Real-World Applications

- 🧬 **Healthcare:** Predicting disease risk (e.g., heart attacks, prostate cancer)
- 💸 **Finance:** Stock prediction, fraud detection
- 🛠️ **Missing Data Imputation:** Estimating unknown values
- 📺 **Recommendation Systems:** Suggesting products, movies, etc.

---


## Classification using KNeighborsClassifier

In [0]:
x = data.loc[:, data.columns!='class']
y = data.loc[:, 'class']

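x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training split only so test-set statistics do not leak into training
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)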

neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []

for k in neig:
    knn = KNeighborsClassifier(n_neighbors=k)

    knn.fit(x_train, y_train)

    train_accuracy.append(knn.score(x_train, y_train))

    test_accuracy.append(knn.score(x_test, y_test))

plt.figure(figsize=[13,8])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('k value VS Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
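# Look up the best K through neig instead of assuming the index maps to K by adding 1.
# Note: a K picked on the test set gives an optimistic score; cross-validation is safer.
best_idx = int(np.argmax(test_accuracy))
print("Best accuracy is {} with K = {}".format(test_accuracy[best_idx], neig[best_idx]))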

In [0]:
weather