In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### 1. Overview

K-Nearest Neighbors (KNN) is a simple and intuitive machine learning algorithm used for **classification and regression**.

Its core idea is that **similar data points tend to have similar labels**.

KNN is known as a **lazy learner** because it does not build an explicit model during training. Instead, it stores the training data and performs computations only at prediction time.

---

### 2. How the Algorithm Works

The prediction process follows these steps:

1. Choose the number of neighbors, $K$.
2. Compute the distance between the new data point and all training points (commonly Euclidean distance).
3. Select the $K$ closest data points.
4. For classification, assign the class that appears most frequently among the neighbors.
   For regression, take the average of their target values.
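
The steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements it; the function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the label (or value) of x_new from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Step 4 (classification): majority vote among the neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Step 4 (regression): mean of the neighbors' target values
    return float(neighbor_targets.mean())

# Toy data (made up for illustration): two small, well-separated clusters
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
y_toy = ["A", "A", "A", "B", "B", "B"]

label = knn_predict(X_toy, y_toy, [2.0, 2.0], k=3)
value = knn_predict(X_toy, [1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [6.5, 6.5], k=3,
                    task="regression")
print(label, value)
```

Note that nothing is "trained" here: the function only stores and scans the data, which is what makes KNN a lazy learner.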

---

### 3. Choosing the Value of K

The choice of $K$ has a significant impact on model performance.

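* **Small K (e.g., K = 1):**
  Tends to overfit. Predictions follow individual training points, so the model becomes sensitive to noise and outliers.

* **Large K (e.g., K = n, the number of training points):**
  Tends to underfit. With K = n, every prediction is simply the majority class of the training set.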

**Common strategies to select K:**

* Rule of thumb: $K = \sqrt{n}$, where $n$ is the number of observations.
* Empirical approach: Evaluate performance for multiple K values using validation or cross-validation and select the best one.
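
Both strategies can be compared with a short sketch. It assumes scikit-learn is available and uses synthetic data from `make_classification` as a stand-in, not this notebook's dataset; the range of K values and the 5-fold setting are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

k_values = range(1, 26, 2)  # odd values avoid tied votes in a binary problem
cv_means = []
for k in k_values:
    # Scaling lives inside the pipeline so each fold is scaled using only its training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means.append(cross_val_score(model, X_demo, y_demo, cv=5).mean())

best_k = list(k_values)[int(np.argmax(cv_means))]
rule_of_thumb_k = int(round(np.sqrt(len(X_demo))))
print(f"best K by 5-fold CV: {best_k}, sqrt(n) rule: {rule_of_thumb_k}")
```

The rule of thumb gives a reasonable starting point, while the cross-validated choice depends on the data, so the two values need not agree.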

---

### 4. Importance of Data Preprocessing

KNN is highly sensitive to feature scales.

* **Feature Scaling:**
  Features with larger numeric ranges dominate the distance calculation. Standardize the features (for example with `StandardScaler`), fitting the scaler on the training data only.

* **Removing Irrelevant Features:**
  Columns such as IDs or constant features should be removed as they add noise to distance calculations.
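
The effect of scale on distances can be seen with a small NumPy sketch. The four points and the query are made-up values, with one column in the hundreds and one in single digits; the helper `nearest_index` is hypothetical and exists only for this example.

```python
import numpy as np

# Made-up points with two features on very different scales:
# column 0 ranges over hundreds, column 1 over single digits
points = np.array([
    [200.0, 0.5],
    [205.0, 4.0],
    [240.0, 0.6],
    [300.0, 2.0],
])
query = np.array([210.0, 0.5])

def nearest_index(data, q):
    """Index of the row in data closest to q by Euclidean distance."""
    return int(np.argmin(np.linalg.norm(data - q, axis=1)))

# Raw features: the large-scale column dominates the distance
raw_nearest = nearest_index(points, query)

# Standardize with the mean and standard deviation of the reference points,
# then apply the same transformation to the query
mean, std = points.mean(axis=0), points.std(axis=0)
scaled_nearest = nearest_index((points - mean) / std, (query - mean) / std)

print("nearest neighbor (raw):   ", raw_nearest)
print("nearest neighbor (scaled):", scaled_nearest)
```

On the raw values the nearest neighbor is the second point, even though its second feature is far from the query's. After standardization the first point, which matches the query on the second feature, becomes the nearest.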

---

### 5. Decision Boundaries

KNN creates **non-linear decision boundaries**.

* With smaller K values, boundaries are complex and irregular.
* With larger K values, boundaries become smoother.
* Decision boundaries can be visualized in low-dimensional data to understand model behavior.
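
One way to see this is to color a 2-D grid by predicted class for several values of K. The sketch below assumes scikit-learn and matplotlib are available and uses synthetic `make_moons` data rather than this notebook's dataset; the K values 1, 15, and 75 are arbitrary picks.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D data (an assumption for illustration, not the notebook's dataset)
X_2d, y_2d = make_moons(n_samples=300, noise=0.3, random_state=0)

# Grid of points covering the data, used to color each region by predicted class
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5, 200),
    np.linspace(X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

k_choices = (1, 15, 75)
fig, axes = plt.subplots(1, len(k_choices), figsize=(15, 4), sharex=True, sharey=True)
grid_predictions = {}
for ax, k in zip(axes, k_choices):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_2d, y_2d)
    # Predicted class for every grid point, reshaped back to the grid layout
    grid_predictions[k] = model.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, grid_predictions[k], alpha=0.3)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, s=10)
    ax.set_title(f"K = {k}")
plt.show()
```

With K = 1 the plot typically shows small islands around individual noisy points, while the larger K values give progressively smoother regions.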

---

In [None]:
import matplotlib.pyplot as plt

In [None]:
heart = pd.read_csv('/kaggle/input/heartdisease/Heart_Disease_Prediction.csv')

In [None]:
heart.shape

In [None]:
heart.info()

In [None]:
heart.duplicated().sum()

In [None]:
heart.head()

In [None]:
heart.describe()

In [None]:
# DataFrame.hist creates its own figure, so a separate plt.figure() call is not needed
heart.hist(bins=50, figsize=(20, 14))
plt.show()

In [None]:
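# Features: every column except the target ('Heart Disease'); 'Age' is dropped as well
X = heart.drop(columns=['Age', 'Heart Disease'])
# Target: the 'Heart Disease' label
y = heart['Heart Disease']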

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
X_train.shape

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

# Fit the scaler on the training split only, then apply the same transformation to the test split
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=7)

In [None]:
knn.fit(X_train,y_train)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
y_pred = knn.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
# Test-set accuracy for each K from 1 to 15
scores = []

for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))

In [None]:
plt.plot(range(1, 16), scores, marker='o')
plt.gca().set(xlabel='K (n_neighbors)', ylabel='Test accuracy')

---

### 6. Limitations of KNN

KNN may perform poorly in the following situations:

1. **Large datasets:** Prediction is slow because distances are computed against all training points.
2. **High-dimensional data:** Distances lose meaning due to the curse of dimensionality.
3. **Presence of outliers:** Small K values amplify their impact.
4. **Imbalanced datasets:** Majority classes dominate predictions.
5. **Unscaled features:** Features with large numeric ranges dominate the distance calculation.
6. **Interpretability requirements:** KNN does not provide feature importance or model explanations.
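
The high-dimensional case (item 2) can be illustrated with a small NumPy experiment. It is a sketch under simple assumptions: points are drawn uniformly from the unit hypercube, and the helper `distance_contrast` is hypothetical and exists only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_points, n_dims):
    """Ratio of farthest to nearest distance from one random query to random points."""
    data = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(data - query, axis=1)
    return distances.max() / distances.min()

contrast = {d: distance_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, ratio in contrast.items():
    print(f"{d:>4} dimensions: farthest / nearest = {ratio:.2f}")
```

As the number of dimensions grows, the ratio moves toward 1: the nearest and farthest points end up at almost the same distance, so "nearest neighbor" carries less information.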

---

### Summary

KNN is a straightforward and effective algorithm when used on **small, well-scaled, low-dimensional datasets**. However, its computational cost and sensitivity to data characteristics make it unsuitable for large-scale or high-dimensional Kaggle problems without careful preprocessing and tuning.

---