## KNN & PCA | Assignment


1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems? Answer: K-Nearest Neighbors (KNN) is a non-parametric, lazy learning algorithm used for both classification and regression.

Classification: KNN identifies the 'k' closest data points to a new input. The new point is assigned the class that is most common (majority vote) among its neighbors.


Regression: Instead of a majority vote, KNN takes the average (mean) of the values of the 'k' nearest neighbors to predict the value for the new input.
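The two prediction rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for scikit-learn's estimators: the function name `knn_predict`, the toy arrays, and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point x_new using its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_values = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_values.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_values))

# Toy data (hypothetical): two well-separated groups of points
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([10.0, 12.0, 11.0, 50.0, 55.0, 60.0])

x_new = np.array([1.2, 1.5])
predicted_class = knn_predict(X_train, y_class, x_new, k=3, task="classification")
predicted_value = knn_predict(X_train, y_reg, x_new, k=3, task="regression")
print(predicted_class, predicted_value)
```

The only difference between the two tasks is the last step: a vote for classification, a mean for regression. The neighbor search is identical.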

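2: What is the Curse of Dimensionality and how does it affect KNN performance? Answer: The "Curse of Dimensionality" refers to the problems that arise as the number of features (dimensions) in a dataset increases: the volume of the feature space grows exponentially, so a fixed number of samples covers it more and more sparsely.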

Effect on KNN: KNN relies on a distance measure (such as Euclidean distance) to find neighbors. In high-dimensional space, the distances from a point to its nearest and farthest neighbors become almost equal, making it hard for KNN to distinguish between "near" and "far" neighbors. This tends to reduce model accuracy, and every distance computation also becomes more expensive as features are added.
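A small experiment can make the "near" versus "far" effect visible. The sketch below assumes uniformly random points in a unit hypercube, which is only a stand-in for real data; the helper name `nearest_to_farthest_ratio` and the chosen dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the nearest to the farthest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

# A ratio close to 1 means the nearest and farthest points are almost equally far
ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>4} dimensions: nearest/farthest = {ratio:.3f}")
```

In low dimensions the ratio should be small, since the nearest point is much closer than the farthest. As the number of dimensions grows, the ratio moves toward 1.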

3: What is Principal Component Analysis (PCA)? How is it different from feature selection? Answer: PCA is a dimensionality reduction technique that transforms a large set of variables into a smaller one that still contains most of the original information.


Difference:

Feature Selection: It keeps some features and completely removes others (e.g., keeping "Height" and deleting "Age").

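PCA: It creates entirely new variables (Principal Components), each a linear combination of the existing features. No original feature is simply "deleted"; all of them contribute to the new components, although dropping the lower-ranked components does discard some information.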
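The contrast can be checked directly on the Wine dataset. In this sketch, `SelectKBest` with `f_classif` is just one possible feature-selection method, chosen here for illustration; keeping 2 features or components is likewise an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

# Feature selection: keeps 2 of the original 13 columns unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X, wine.target)
kept_idx = selector.get_support(indices=True)
kept = [wine.feature_names[i] for i in kept_idx]
print("Selected original features:", kept)

# PCA: each component is a weighted combination of all 13 features
pca = PCA(n_components=2).fit(X)
print("Weights of PC1 over the original features:")
for name, weight in zip(wine.feature_names, pca.components_[0]):
    print(f"  {name:>30}: {weight:+.3f}")
```

The selector's output columns are exact copies of two input columns, while each row of `pca.components_` holds one weight per original feature.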

4: What are eigenvalues and eigenvectors in PCA, and why are they important?  Answer:

Eigenvectors: The eigenvectors of the data's covariance matrix give the directions of the new feature space (Principal Components). They show how the data is oriented.


Eigenvalues: Each eigenvalue gives the amount of variance in the data along its corresponding eigenvector.


Importance: Ranking the eigenvectors by eigenvalue identifies which directions (Principal Components) capture the most variance, so we can reduce dimensions by keeping the top-ranked components and dropping the rest while retaining most of the information.
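The roles of eigenvalues and eigenvectors can be seen by doing the decomposition by hand. The sketch below uses synthetic 2-D data (an assumption made to keep the output small) and NumPy's `np.linalg.eigh` on the covariance matrix; scikit-learn's `PCA` reaches the same components through an SVD instead.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 2-D data stretched along one direction (illustrative only)
X = rng.normal(size=(300, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

# Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # columns are the principal directions

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_variance_ratio)

# Projecting onto the first eigenvector gives the first principal component
pc1_scores = X_centered @ eigenvectors[:, 0]
print("Variance of PC1 scores:", pc1_scores.var(ddof=1))
```

The variance of the projected scores equals the largest eigenvalue, which is why the eigenvalues can be read as "variance explained" per component.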

5: How do KNN and PCA complement each other when applied in a single pipeline? Answer: KNN and PCA complement each other because PCA helps mitigate the "Curse of Dimensionality" that hurts KNN.


PCA acts as a pre-processing step to reduce the number of noisy or redundant features and compress the data.


KNN then works on this reduced data, where distances between points are more meaningful and cheaper to compute, so the algorithm usually runs faster and can be more accurate.
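One way to combine the two steps is scikit-learn's `make_pipeline`, sketched below on the Wine dataset. The choices of 2 components, 5 neighbors, and 5-fold cross-validation are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and PCA are fitted inside each training fold, so no information
# from the held-out fold leaks into the preprocessing steps
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.4f}")
```

Wrapping the steps in one pipeline object also means a single `fit` and `predict` call applies the same transformations to training and new data.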

6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases. Answer:

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the data and split it into train and test sets
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# 1. Without Scaling
knn = KNeighborsClassifier().fit(X_train, y_train)
acc_no_scale = accuracy_score(y_test, knn.predict(X_test))

# 2. With Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier().fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f"Accuracy without scaling: {acc_no_scale:.4f}")
print(f"Accuracy with scaling: {acc_scaled:.4f}")

Accuracy without scaling: 0.7407
Accuracy with scaling: 0.9630


7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component. Answer:

In [2]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scaling is required before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data.data)

# PCA model
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"Principal Component {i+1}: {ratio:.4f}")

Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080


8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset. Answer:

In [4]:
# PCA with 2 components
pca_2 = PCA(n_components=2)
X_train_pca = pca_2.fit_transform(X_train_scaled)
X_test_pca = pca_2.transform(X_test_scaled)

# KNN on PCA data
knn_pca = KNeighborsClassifier().fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print(f"Original Scaled Accuracy: {acc_scaled:.4f}")
print(f"PCA (2 components) Accuracy: {acc_pca:.4f}")

Original Scaled Accuracy: 0.9630
PCA (2 components) Accuracy: 0.9815


9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results. Answer:

In [5]:
for metric in ['euclidean', 'manhattan']:
    knn_metric = KNeighborsClassifier(metric=metric).fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, knn_metric.predict(X_test_scaled))
    print(f"Accuracy with {metric} distance: {acc:.4f}")

Accuracy with euclidean distance: 0.9630
Accuracy with manhattan distance: 0.9630


10: Explain how you would use PCA and KNN for high-dimensional gene expression data (Biomedical case study). Answer: For high-dimensional biomedical data, I would follow this pipeline:

Standardization: Scale the data using StandardScaler, as both PCA and KNN are sensitive to the scale of the features. Fit the scaler on the training samples only.

PCA for Dimensionality Reduction: Apply PCA to reduce thousands of genes into a few Principal Components.

Choosing Components: Use a "Scree Plot" or the cumulative explained variance (for example, keeping enough components to retain 95% of the variance) to decide the number of components.

KNN Classification: Train KNN on the reduced components to classify cancer types.

Justification: This pipeline reduces the risk of overfitting, mitigates the Curse of Dimensionality, removes noise, and speeds up classification, making it a sensible approach for high-dimensional medical data.
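The steps above can be sketched as a single scikit-learn pipeline. No real gene expression data is used here: `make_classification` generates a synthetic stand-in with many more features than samples, and the sample counts, the 95% variance threshold, and `n_neighbors=5` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix: 200 samples x 2000 "genes"
X, y = make_classification(
    n_samples=200, n_features=2000, n_informative=20,
    n_classes=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

gene_pipeline = make_pipeline(
    StandardScaler(),
    # A float n_components keeps enough components to reach that variance share
    PCA(n_components=0.95, svd_solver="full"),
    KNeighborsClassifier(n_neighbors=5),
)
# Scaler and PCA are fitted on the training samples only
gene_pipeline.fit(X_train, y_train)

n_kept = gene_pipeline.named_steps["pca"].n_components_
test_accuracy = gene_pipeline.score(X_test, y_test)
print(f"Components kept: {n_kept} of {X.shape[1]} features")
print(f"Test accuracy: {test_accuracy:.4f}")
```

On real expression data, the number of components and the value of k would normally be chosen by cross-validation rather than fixed in advance.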