***
## kNN - k Nearest Neighbour

1. can be used for both classification and regression; usually used for classification

2. k = hyperparameter ->> how many nearest neighbours to consider. The best value depends on the data, so we usually try several values and select the best.

3. k is usually chosen odd to avoid tied votes (this guarantees a majority only for binary classification)

4. distance is calculated from the new point to every training point and the k nearest neighbours are found

5. for regression ->> after finding the k nearest neighbours, the prediction is the mean of their target values (not of their distances)

6. distance calculation ->> Euclidean distance = sqrt((x1 - x2)^2 + (y1 - y2)^2)

7. Manhattan Distance = |x1 - x2| + |y1 - y2| ->> often preferred when the data is high dimensional

8. Lazy learner ->> during training it only memorizes (stores) the data

9. the calculation is done at prediction time

10. scaling is very important, because the distance is dominated by features with large ranges

***
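The points above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements kNN internally; the function names (`euclidean`, `manhattan`, `k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def k_nearest(X_train, y_train, query, k, distance=euclidean):
    """Return the targets of the k training points closest to `query`."""
    # Lazy learning: nothing is computed until a prediction is requested.
    ranked = sorted(zip(X_train, y_train), key=lambda pair: distance(pair[0], query))
    return [target for _, target in ranked[:k]]


def knn_classify(X_train, y_train, query, k=3, distance=euclidean):
    """Majority vote over the labels of the k nearest neighbours."""
    votes = Counter(k_nearest(X_train, y_train, query, k, distance))
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, query, k=3, distance=euclidean):
    """Mean of the k nearest target values (not of the distances)."""
    nearest = k_nearest(X_train, y_train, query, k, distance)
    return sum(nearest) / len(nearest)


# Toy data, made up for illustration only
X_toy = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = [0, 0, 0, 1, 1, 1]
values = [10.0, 12.0, 14.0, 50.0, 55.0, 60.0]

print(knn_classify(X_toy, labels, (2.0, 2.0), k=3))  # 0
print(knn_regress(X_toy, values, (2.0, 2.0), k=3))   # 12.0
```

Passing `distance=manhattan` switches the metric without touching the rest of the logic.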

***
## Limitations

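1. slow at prediction (not suitable for huge datasets), because every prediction compares the new point against the stored training data

2. sensitive to outliers

3. sensitive to feature scaling - if not scaled ->> features with large values dominate the distance ->> model becomes biased towards them

4. curse of dimensionality (high-dimensional data) ->> distances become less informative as the number of features grows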

***
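The scaling limitation can be seen on a small example. The rows below are hypothetical (age in years, income in currency units), and `standardize` is a hand-written z-score helper, assumed here to mirror what `StandardScaler` does (column mean removed, divided by the population standard deviation). Treat it as a sketch, not a benchmark.

```python
import math
import statistics


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest_index(rows, query_idx):
    """Index of the row closest to rows[query_idx], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_idx]
    return min(others, key=lambda i: euclidean(rows[i], rows[query_idx]))


# Hypothetical rows: (age, income)
people = [
    (30, 50_000),  # query row
    (31, 52_000),  # almost the same age, slightly different income
    (60, 50_100),  # 30 years older, almost the same income
    (25, 20_000),
    (45, 90_000),
]

# Unscaled: income differences swamp age differences, so row 2 looks closest
print(nearest_index(people, 0))               # 2

# Scaled: both features contribute comparably, so row 1 is closest
print(nearest_index(standardize(people), 0))  # 1
```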

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.preprocessing import StandardScaler

In [2]:
# importing kNN classifier; KNeighborsRegressor can be imported the same way for regression tasks
from sklearn.neighbors import KNeighborsClassifier
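For the regression case mentioned in the comment above, a minimal sketch with `KNeighborsRegressor` could look like this. The data is synthetic (y = 2x) and made up for illustration; it is not part of the heart dataset used in this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic 1-D data, made up for illustration: y = 2x
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Default weights="uniform" means the prediction is the plain mean of the neighbours' targets
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_demo, y_demo)

# The 3 nearest points to x = 3.1 are x = 3, 4 and 2, with targets 6, 8 and 4
print(reg.predict([[3.1]]))  # mean of 6, 8, 4 -> 6.0
```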

In [4]:
heart_df = pd.read_csv("heart.csv")

X = heart_df.drop("target" , axis = 1)
y = heart_df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y , test_size = 0.2, random_state = 42
)

scaler = StandardScaler()

# fit the scaler on the training split only, then reuse the same statistics on the test split
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [34]:
knn = KNeighborsClassifier(n_neighbors = 7)
knn.fit(X_train_scaled, y_train)

In [35]:
y_pred = knn.predict(X_test_scaled)

In [36]:
# Evaluation

print("recall score : ", recall_score(y_test, y_pred))
print("accuracy score : ", accuracy_score(y_test, y_pred))
print("precision score : ", precision_score(y_test, y_pred))

recall score :  0.90625
accuracy score :  0.9180327868852459
precision score :  0.9354838709677419
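
To make the three scores concrete, here is how they follow from confusion-matrix counts. The counts below (TP=29, FP=2, FN=3, TN=27) are not printed anywhere in this notebook; they are inferred as the counts that reproduce the scores shown above, assuming a test set of 61 rows. The helper name `precision_recall_accuracy` is made up for this sketch.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute the three scores from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # of predicted positives, how many are right
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall fraction of correct predictions
    return precision, recall, accuracy


# Inferred counts (assumption, see the note above)
precision, recall, accuracy = precision_recall_accuracy(tp=29, fp=2, fn=3, tn=27)

print("recall score : ", recall)
print("accuracy score : ", accuracy)
print("precision score : ", precision)
```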


In [48]:
# Cross Validation for Hyperparameter tuning using GridSearchCV

from sklearn.model_selection import GridSearchCV

classifier = KNeighborsClassifier()
param_grid = {"n_neighbors" : [3,5,7,9]}

classifierCV = GridSearchCV(
    classifier,
    param_grid,
    cv = 5,
    scoring="recall"
)

classifierCV.fit(X_train_scaled, y_train)
# predict() uses the best estimator, refit on the whole training set (refit=True by default)
y_pred = classifierCV.predict(X_test_scaled)

# Evaluation

print("recall score : ", recall_score(y_test, y_pred))
print("accuracy score : ", accuracy_score(y_test, y_pred))
print("precision score : ", precision_score(y_test, y_pred))


recall score :  0.90625
accuracy score :  0.9180327868852459
precision score :  0.9354838709677419


In [49]:
res = pd.DataFrame(classifierCV.cv_results_)
print(res[["param_n_neighbors", "mean_test_score"]])

   param_n_neighbors  mean_test_score
0                  3         0.864387
1                  5         0.857550
2                  7         0.871795
3                  9         0.856980


In [50]:
print(classifierCV.best_params_)

{'n_neighbors': 7}


***
# Sklearn Pipeline
***

In [51]:
from sklearn.pipeline import Pipeline

In [52]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y , test_size = 0.2, random_state = 42
)

In [53]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

In [57]:
param_grid = {"knn__n_neighbors" : [3,5,7,9]} # <step name>__<parameter>: double underscore ->> hierarchical separator

classifierCV = GridSearchCV(
    pipeline, # pass the pipeline instead of the bare classifier
    param_grid,
    cv = 5,
    scoring="recall"
)

classifierCV.fit(X_train, y_train) # pass unscaled data ->> the pipeline fits the scaler inside each CV fold, which avoids data leakage
y_pred = classifierCV.predict(X_test)

In [58]:
print("recall score : ", recall_score(y_test, y_pred))
print("accuracy score : ", accuracy_score(y_test, y_pred))
print("precision score : ", precision_score(y_test, y_pred))

recall score :  0.90625
accuracy score :  0.9180327868852459
precision score :  0.9354838709677419


In [59]:
print(classifierCV.best_params_)

{'knn__n_neighbors': 7}
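
The `knn__n_neighbors` key follows the pipeline's `<step name>__<parameter>` naming. A short sketch of how to inspect and use those names outside of `GridSearchCV` is given below; `demo_pipeline` is a fresh pipeline created only for this illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Every parameter of a step is exposed as "<step name>__<parameter name>"
knn_params = [name for name in demo_pipeline.get_params() if name.startswith("knn__")]
print(sorted(knn_params))

# The same naming works with set_params
demo_pipeline.set_params(knn__n_neighbors=7)
print(demo_pipeline.named_steps["knn"].n_neighbors)  # 7
```

Any key listed by `get_params()` can be used in `param_grid`, so the scaler's settings could be tuned in the same grid search.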
