# KNN & PCA

**Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

Ans. K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based learning algorithm. It makes predictions based on the K closest data points in the feature space using a distance metric.

Classification:

* The class is decided by majority voting among the K nearest neighbors.

* Example: If most neighbors belong to class A, the new point is classified as A.

Regression:

* The prediction is the average (or weighted average) of the target values of the K nearest neighbors.
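
The two prediction rules above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: unweighted mean of the k nearest target values."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / k

# Hypothetical toy data: two well-separated clusters in 2-D.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["A", "A", "A", "B", "B", "B"]
targets = [10.0, 12.0, 11.0, 50.0, 55.0, 52.0]

predicted_class = knn_classify(points, labels, (1.2, 1.4), k=3)
predicted_value = knn_regress(points, targets, (1.2, 1.4), k=3)
print(predicted_class, predicted_value)
```

The query point sits inside the first cluster, so all three neighbors carry label A and the regression output is the mean of their three targets.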

**Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

Ans. The Curse of Dimensionality refers to the problems that arise as the number of features grows: the volume of the feature space increases so quickly that a fixed number of samples becomes sparse.

Effect on KNN:

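* Distances between points become very similar, so the nearest and farthest neighbors are almost equally far away

* The "nearest" neighbors therefore carry little information about the query point

* Model accuracy degrades unless the number of training samples grows very quickly with the number of features

* Computational cost increases, because every distance calculation touches every feature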

KNN relies heavily on distance calculations, so high-dimensional data significantly reduces its effectiveness. Dimensionality reduction techniques like PCA help mitigate this issue.
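
The "distances become very similar" effect can be checked numerically. The sketch below assumes points drawn uniformly from the unit hypercube and uses a made-up helper, `distance_contrast`, that reports how much farther the farthest point is than the nearest one, relative to the nearest distance. The exact numbers depend on the random seed; only the downward trend matters.

```python
import numpy as np

def distance_contrast(n_points, n_dims, rng):
    """Relative contrast (d_max - d_min) / d_min from one query to random points."""
    data = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
contrasts = {d: distance_contrast(500, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

As the number of dimensions grows, the contrast shrinks toward zero, which is why a distance-based vote becomes unreliable.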

**Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

Ans. PCA is an unsupervised dimensionality reduction technique that transforms the original features into a smaller set of uncorrelated variables, called principal components, while retaining as much of the data's variance as possible.

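PCA:
* creates new features (linear combinations of the originals)
* is unsupervised (ignores the labels)
* removes linear correlation between features
* maximizes retained variance

Feature Selection:
* selects a subset of the existing features
* is often supervised (uses the labels)
* keeps the original meaning of each feature
* maximizes relevance to the target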
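
The contrast can be seen side by side on the Wine dataset. This is only a sketch: `SelectKBest` with `f_classif` is one of several feature selection methods in scikit-learn and is chosen here as an assumption, and `k=2` is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# PCA: each new feature is a linear combination of all 13 original features.
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)

# Feature selection: keeps 2 of the original columns, chosen using the labels.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, wine.target)
X_selected = selector.transform(X_scaled)
kept_names = [name for name, keep in zip(wine.feature_names, selector.get_support()) if keep]

print("PCA loadings shape:", pca.components_.shape)
print("Selected original features:", kept_names)
```

Both outputs have two columns, but only the selected columns still correspond to named measurements; the PCA columns are mixtures described by the loadings in `pca.components_`.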

**Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?**

Ans. Eigenvectors of the data's covariance matrix define the directions of the principal components.

Eigenvalues represent the amount of variance captured by each component.

Importance:

* Eigenvectors determine the new feature space

* Eigenvalues help rank components

* A higher eigenvalue means that component retains more of the data's variance
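
A small NumPy sketch makes the roles of the two quantities concrete. The toy data is hypothetical (three features, two of them strongly correlated), and the decomposition is done directly on the covariance matrix rather than through scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical toy data: 200 samples, 3 features, the first two strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    0.8 * base + 0.2 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # rank components by variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_variance_ratio = eigenvalues / eigenvalues.sum()
projected = X_centered @ eigenvectors[:, :2]   # keep the top two directions

print("Explained variance ratio:", explained_variance_ratio)
```

The columns of `eigenvectors` are the new axes, and each eigenvalue equals the variance of the data projected onto the matching axis, which is why sorting by eigenvalue ranks the components.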

**Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**

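Ans. PCA reduces dimensionality and can remove noise carried by low-variance directions. Running KNN on the PCA-transformed data gives:

* Faster distance computation, since each distance uses fewer dimensions

* Accuracy that is often comparable or better, although discarding components can also discard useful signal

* Less exposure to the curse of dimensionality

Together, PCA + KNN form an efficient pipeline for high-dimensional datasets.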
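
One way to combine the two steps is scikit-learn's `Pipeline`, which fits the scaler and PCA on the training data only and reuses them at prediction time. The sketch below assumes the Wine dataset, 2 components, and K=5; the step names (`scale`, `pca`, `knn`) are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

pipeline = Pipeline([
    ("scale", StandardScaler()),                   # PCA and KNN are both scale-sensitive
    ("pca", PCA(n_components=2)),                  # reduce 13 features to 2 components
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # classify in the reduced space
])
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline test accuracy: {test_accuracy:.3f}")
```

Keeping the steps in one object also means the whole chain can be passed to cross-validation utilities without refitting the preprocessing by hand.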

In [18]:
# Q6. Train a KNN Classifier on the Wine dataset with and without feature
# scaling. Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X,y = load_wine(return_X_y=True)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
knn = KNeighborsClassifier(n_neighbors=6)

#with scaling

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit the scaler on the training data only, then transform the training data
X_train_scaled = scaler.fit_transform(X_train)
# the test data is only transformed (never fit) to prevent data leakage
X_test_scaled = scaler.transform(X_test)
knn.fit(X_train_scaled,y_train)
y_pred_scaled = knn.predict(X_test_scaled)
print('-----Accuracy Score with Scaling------')
print(accuracy_score(y_test,y_pred_scaled))


#without scaling

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
print('-----Accuracy Score without scaling-------')
print(accuracy_score(y_test,y_pred))

-----Accuracy Score with Scaling------
0.9722222222222222
-----Accuracy Score without scaling-------
0.6388888888888888


In [43]:
#Q7 Train a PCA model on the Wine dataset and print the explained variance
#ratio of each principal component.

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X,y = load_wine(return_X_y=True)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA()
pca.fit(X_train_scaled)
print('----------------Explained Variance Ratio-----------')
print(pca.explained_variance_ratio_[0:4])  # first four components only

----------------Explained Variance Ratio-----------
[0.35684314 0.19825228 0.11659894 0.07517421]


In [57]:
#Q8 Train a KNN Classifier on the PCA-transformed dataset (retain top 2
# components). Compare the accuracy with the original dataset.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X,y = load_wine(return_X_y=True)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


#PCA with 2 components

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

#model

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca,y_train)
y_pred = knn.predict(X_test_pca)
print(f'Accuracy with PCA(2 components) : {accuracy_score(y_test,y_pred)}')

Accuracy with PCA(2 components) : 0.9444444444444444


In [69]:
#Q9 Train a KNN Classifier with different distance metrics (euclidean,
# manhattan) on the scaled Wine dataset and compare the results.

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X,y=load_wine(return_X_y=True)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_manhattan = KNeighborsClassifier(metric='manhattan')

knn_euclidean.fit(X_train_scaled,y_train)
knn_manhattan.fit(X_train_scaled,y_train)

print('Euclidean Accuracy :', accuracy_score(y_test,knn_euclidean.predict(X_test_scaled)))
print('Manhattan Accuracy :', accuracy_score(y_test,knn_manhattan.predict(X_test_scaled)))

Euclidean Accuracy : 0.9722222222222222
Manhattan Accuracy : 0.9722222222222222


**Q10**

Ans.

1. Use PCA

* Reduce thousands of gene features

* Remove noise and correlation

2. Choose Components

* Use cumulative explained variance (e.g., 95%)

3. Apply KNN

* Train on PCA-transformed data

* Choose optimal K using cross-validation

4. Evaluation

* Accuracy

* ROC-AUC

* Cross-validation score

5. Business / Medical Justification

* Reduces overfitting

* Improves generalization

* Faster computation

* More reliable diagnosis
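
The "cumulative explained variance" rule in step 2 can be sketched as follows. The Wine dataset stands in for the gene data here, which is an assumption made only so the example runs; the 0.95 threshold is the one mentioned in the step.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the variance ratios.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# Passing a float in (0, 1) as n_components asks scikit-learn to make this choice itself.
pca_auto = PCA(n_components=threshold).fit(X_scaled)
print("Components needed:", n_needed, "| chosen by PCA:", pca_auto.n_components_)
```

Computing the cumulative curve by hand is useful when the threshold itself is up for discussion, since it shows how many components each candidate threshold would keep.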

In [78]:
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

X,y=load_wine(return_X_y=True)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


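# Keep enough components to explain 95% of the variance.
# Note: the scaler and PCA are fit on the full training split before
# cross-validation, so each CV fold is preprocessed with statistics that
# include its own validation rows; passing a Pipeline to cross_val_score avoids this.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)  # not used by the cross-validation below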

knn = KNeighborsClassifier(n_neighbors=5)

scores = cross_val_score(knn, X_pca, y_train, cv=5, scoring='accuracy')
print("Mean CV Accuracy:", scores.mean())

Mean CV Accuracy: 0.9719211822660098
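
Step 3 asks for K to be chosen by cross-validation. A hedged sketch of one way to do that is a `Pipeline` wrapped in `GridSearchCV`, so that scaling and PCA are refit inside every fold. The candidate K values, the stratified split, and the step names are assumptions made for this example, and the Wine dataset again stands in for the gene data.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),   # keep 95% of the variance
    ("knn", KNeighborsClassifier()),
])

# Candidate K values are illustrative, not tuned recommendations.
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
held_out_accuracy = search.score(X_test, y_test)
print("Best K:", best_k)
print("Mean CV accuracy for best K:", search.best_score_)
print("Held-out test accuracy:", held_out_accuracy)
```

The held-out test set is touched only once, after K has been selected, so the final score is not influenced by the tuning.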
