KNN & PCA

 **Assignment**

 Question 1:  What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Answer 1 : K-Nearest Neighbors (KNN) is a non-parametric, supervised learning algorithm. It operates on a simple principle: "Tell me who your neighbors are, and I'll tell you who you are." It does not fit an explicit mathematical function; it simply stores the training data and, at prediction time, finds the $k$ training points closest to a new input according to a distance measure (usually Euclidean).

How It Works (The General Process)

1) Choose $k$: Select the number of nearest neighbors to check (e.g., $k=3$).

2) Calculate Distance: When a new data point arrives, the algorithm calculates its distance to every point in the training set (usually using Euclidean distance).

3) Find Neighbors: It identifies the $k$ points that are closest to the new data point.

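4) Make Prediction: It looks at those $k$ neighbors to decide the output. In classification, the prediction is the majority class among the $k$ neighbors. In regression, it is the average (optionally distance-weighted) of the neighbors' target values.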
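The four steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements KNN: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example. Classification takes a majority vote among the neighbors, and regression takes their unweighted mean.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to `query` (Euclidean)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Regression: unweighted mean of the k nearest target values."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / len(neighbors)


# Hypothetical toy data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]      # targets for classification
values = [10.0, 12.0, 11.0, 50.0, 55.0, 60.0]  # targets for regression

query = (1.8, 1.5)
print("Predicted class:", knn_classify(points, labels, query))
print("Predicted value:", knn_regress(points, values, query))
```

The same neighbor search serves both tasks; only the final aggregation step differs.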

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

Answer 2 : The Curse of Dimensionality refers to a set of phenomena that occur when analyzing data in high-dimensional spaces (many features) that do not occur in low-dimensional settings. For KNN, which relies entirely on "closeness," this is a critical problem.

Here is how it affects performance:
1. Distances Become Equal (Loss of Contrast)

In high-dimensional space, the distance to the nearest neighbor and the distance to the farthest neighbor start to converge.
* The Problem: If every point is roughly 10 units away, the "nearest" neighbor is no more relevant than any other random point.
* The Result: KNN loses its ability to distinguish between similar and dissimilar data, leading to random-like predictions.

2. Data Becomes Sparse (The "Lonely" Space)

As dimensions increase, the volume of the space grows exponentially, but your amount of data usually stays the same.
* The Problem: To keep the same "density" of data as you move from 2D to 10D, you would need billions of additional points. Without them, the "nearest" neighbor might actually be located very far away in the vast empty space.
* The Result: The neighbor is no longer "local" enough to give a meaningful prediction, causing high error rates.

3. Computational Explosion

* The Problem: In its brute-force form, KNN calculates the distance to every point in the training set for every new prediction.
* The Result: Each distance calculation costs $O(d)$ for $d$ features, so one prediction over $n$ training points costs $O(nd)$, and the stored training set grows with $d$ as well. With many features, prediction becomes slow and memory-heavy.
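The loss of contrast can be observed with a small simulation. The sketch below assumes points drawn uniformly at random from the unit hypercube, an illustrative choice rather than a model of any real dataset. It measures the relative contrast (farthest distance minus nearest distance, divided by nearest distance) from one random query point. The function name `relative_contrast` is made up for this example.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrast.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the printed contrast shrinks toward zero, which is the "every point is roughly equally far away" effect described above.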

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Answer 3 : Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a high-dimensional dataset into a smaller set of uncorrelated variables called Principal Components.

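It works by identifying the directions (axes) in the data where the variance (information) is highest and projecting the data onto those new axes.

Feature selection, by contrast, keeps a subset of the original features and discards the rest, so the retained columns keep their original meaning. PCA instead creates new features, each a linear combination of all the original ones, which makes them harder to interpret.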

**Which one should you choose for KNN?**

* Use Feature Selection if you have a few specific features you suspect are "noise" and you want to keep your model simple and explainable.

* Use PCA if you have dozens or hundreds of features that are highly correlated (multi-collinearity), as PCA will merge that redundant information into a few powerful components, often leading to better KNN performance.
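The difference can be seen side by side on the Wine dataset. This is a sketch under one assumption: `SelectKBest` with the ANOVA F-score (`f_classif`) is used here as one possible feature selection method among many; the answer above does not prescribe a specific one.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature selection: keeps 2 of the 13 original columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, y)
selected_names = [
    name for name, keep in zip(wine.feature_names, selector.get_support()) if keep
]
X_selected = selector.transform(X_scaled)

# PCA: builds 2 new columns, each a weighted mix of all 13 original features.
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Selected original features:", selected_names)
print("Shape after selection:", X_selected.shape)
print("Shape after PCA:      ", X_pca.shape)
print("Weights of PC1 over the original features:")
for name, weight in zip(wine.feature_names, pca.components_[0]):
    print(f"  {name:<30} {weight:+.3f}")
```

The selected columns can still be read as named chemical measurements, while each principal component carries a weight for every original feature.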

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Answer 4:

1. Eigenvectors: The Directions

Eigenvectors of the data's covariance matrix are the new axes for your data. They represent the directions of maximum spread (variance).
* Role: They tell you where the information is oriented.
* Key Fact: Every eigenvector is perpendicular (orthogonal) to the others, ensuring no redundant information.

2. Eigenvalues: The Magnitude
Each eigenvector has a corresponding eigenvalue that represents a score of importance.
* Role: They tell you how much information (variance) is captured in that specific direction.
* Key Fact: A large eigenvalue means that direction captures much of the data's variance and is worth keeping; a tiny eigenvalue suggests that direction is mostly "noise."

Why They Are Important
They allow for intelligent compression:
1) Ranking: You sort eigenvalues from highest to lowest to rank your components by importance.
2) Selection: You keep the top $k$ eigenvectors that account for the most variance (e.g., 90%) and discard the rest.
3) Efficiency: This can reduce a 100-feature problem to a few components while preserving most of the variance, the "soul" of the data.
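The ranking and selection steps can be sketched directly with NumPy. The data here is synthetic and hypothetical (three features, two of them strongly correlated), and the 90% threshold mirrors the example figure above. This is an illustration of the eigendecomposition view of PCA, not scikit-learn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features; the first two are strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    0.5 * rng.normal(size=(200, 1)),
])

# Center the data and compute the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Ranking: sort eigenvalues (and their eigenvectors) from highest to lowest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Selection: keep the smallest k whose cumulative variance share reaches 90%.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
k = int(np.argmax(cumulative >= 0.90) + 1)

# Projection: express the data in the top-k eigenvector directions.
X_reduced = X_centered @ eigenvectors[:, :k]

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Components kept:", k)
print("Reduced shape:", X_reduced.shape)
```

The columns of `eigenvectors` are mutually orthogonal, which is why the projected components carry no redundant (correlated) information.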

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

Answer 5 : When used together, PCA prepares the data and KNN performs the prediction. This combination solves the weaknesses of using KNN alone.

1. Solving the "Curse of Dimensionality"
KNN struggles in high dimensions because distances become nearly uniform. PCA shrinks the feature space to only the most important axes, which helps restore the "contrast" between near and far neighbors.

2. Boosting Computational Speed
KNN is "expensive" because it calculates distances to every point. By using PCA to reduce 100 features down to 5, you drastically reduce the mathematical operations required for every single prediction.

3. Noise Reduction
High-dimensional data often contains redundant or noisy features. PCA filters this out by focusing on the highest variance, allowing KNN to find neighbors based on signal rather than noise.

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.



In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load Data
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. KNN WITHOUT Scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
acc_unscaled = accuracy_score(y_test, y_pred_unscaled)

# 3. KNN WITH Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy (Unscaled): {acc_unscaled:.2%}")
print(f"Accuracy (Scaled):   {acc_scaled:.2%}")

Accuracy (Unscaled): 74.07%
Accuracy (Scaled):   96.30%


Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.


In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Load and Scale Data
wine = load_wine()
X = wine.data
X_scaled = StandardScaler().fit_transform(X)

# 2. Train PCA
# We keep all components initially to see the full distribution
pca = PCA()
pca.fit(X_scaled)

# 3. Print Explained Variance Ratio
print("Explained Variance Ratio per Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.2%}")

# Total variance explained by first two components
total_2 = sum(pca.explained_variance_ratio_[:2])
print(f"\nTotal variance explained by first two components: {total_2:.2%}")

Explained Variance Ratio per Component:
PC1: 36.20%
PC2: 19.21%
PC3: 11.12%
PC4: 7.07%
PC5: 6.56%
PC6: 4.94%
PC7: 4.24%
PC8: 2.68%
PC9: 2.22%
PC10: 1.93%
PC11: 1.74%
PC12: 1.30%
PC13: 0.80%

Total variance explained by first two components: 55.41%


Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

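# 1. Load and Scale Data (Essential for PCA and KNN)
# Note: for brevity the scaler is fitted on all rows before splitting. To avoid
# leaking test-set statistics, fit it on the training split only in real work.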
wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# 3. Scenario A: Original Dataset (All 13 Scaled Features)
knn_orig = KNeighborsClassifier(n_neighbors=5)
knn_orig.fit(X_train, y_train)
acc_orig = accuracy_score(y_test, knn_orig.predict(X_test))

# 4. Scenario B: PCA-Transformed Dataset (Top 2 Components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print(f"Accuracy (Original 13 features): {acc_orig:.2%}")
print(f"Accuracy (PCA 2 components):     {acc_pca:.2%}")

Accuracy (Original 13 features): 96.30%
Accuracy (PCA 2 components):     98.15%


Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load and Scale Data
# Note: for brevity the scaler sees all rows; in practice fit it on X_train only.
wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# 2. KNN with Euclidean Distance (L2 Norm)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test))

# 3. KNN with Manhattan Distance (L1 Norm)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test))

print(f"Accuracy (Euclidean): {acc_euclidean:.2%}")
print(f"Accuracy (Manhattan): {acc_manhattan:.2%}")

Accuracy (Euclidean): 96.30%
Accuracy (Manhattan): 96.30%


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.

Due to the large number of features and a small number of samples, traditional models
overfit.

Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data

Answer 10 : In high-dimensional biomedical data (like gene expressions), the number of features ($p$) often far exceeds the number of samples ($n$). This leads to overfitting and the "Curse of Dimensionality."

1. Dimensionality Reduction & Selection

* Standardization: Before PCA, you must scale the data. Genes (or wine chemicals) have different units; scaling ensures no single high-value feature dominates.
* PCA Transformation: Transform the features into Principal Components.
* Deciding $k$ Components: Use the Scree Plot or Cumulative Explained Variance. In biomedical data, we typically aim to retain 90-95% of the total variance.

In [5]:
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load and scale
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# PCA
pca = PCA().fit(X_scaled)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative_variance >= 0.90) + 1

print(f"Number of components to keep 90% variance: {n_components}")

Number of components to keep 90% variance: 8


2. KNN Classification & Evaluation:

Once the dimensions are reduced (e.g., from 13 features to 8 in the Wine dataset, as computed above), you train the KNN on the reduced coordinates.

Evaluation: Use Cross-Validation (specifically Stratified K-Fold) rather than a single split. In medical datasets with small samples, a single split might be unrepresentative. Use a Confusion Matrix to ensure no specific cancer type (or wine class) is being misclassified consistently.
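One way to set up this evaluation is sketched below, again using the Wine dataset as a stand-in for gene expression data. Wrapping the scaler, PCA, and KNN in a single pipeline means the scaling statistics and principal components are re-fitted on the training part of each fold, so no information from the held-out fold leaks in. The specific choices (5 folds, `n_neighbors=5`, a 90% variance target, `random_state=42`) are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> PCA (keep enough components for 90% of the variance) -> KNN.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90, svd_solver="full"),
    KNeighborsClassifier(n_neighbors=5),
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipeline, X, y, cv=cv)    # accuracy per fold
y_pred = cross_val_predict(pipeline, X, y, cv=cv)  # out-of-fold predictions
cm = confusion_matrix(y, y_pred)

print(f"Accuracy per fold: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.2%} (std {scores.std():.2%})")
print("Confusion matrix (rows = true class, columns = predicted class):")
print(cm)
```

Reporting the spread across folds alongside the mean gives a rough sense of how much the estimate depends on which samples land in the test set, which matters most when samples are scarce.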

3. Justifying the Pipeline to Stakeholders:

To justify this to a non-technical audience, emphasize these four "Robustness Pillars":

* Noise Filtration: High-dimensional data is "noisy." PCA acts as a filter, keeping the biological "signal" and discarding random fluctuations (low-variance components).

* Preventing "Distance Collapse": In high dimensions, every patient looks equally different from every other patient. By reducing dimensions, we restore the "similarity" metric that KNN needs to work.

* Computational Efficiency: Medical diagnostics require speed. Processing 10 components instead of 20,000 genes allows for near-instant results.

* Reduced Overfitting: By limiting the "degrees of freedom" the model has, we force KNN to focus on the most significant patterns, making the model generalize better to new patients.