# KNN & PCA | Assignment

Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Ans. K-Nearest Neighbors (KNN):

KNN is a supervised learning algorithm used for classification and regression.
It’s a lazy learner (no explicit training phase: it stores the training data and defers the work to prediction time) and non-parametric (it assumes no fixed functional form for the data distribution).

How it works (General Idea):

1. Choose a number K (the number of neighbors).
2. For a new data point:
   - Calculate the distance (commonly Euclidean) between the new point and all points in the training data.
   - Identify the K closest data points (neighbors).
   - Predict the output based on those neighbors.

In Classification:

Each neighbor "votes" for its class.

The class with the majority votes becomes the predicted class.

👉 Example:
If K=5 and neighbors’ classes are [A, A, B, A, B] → Majority is A, so prediction = A.

In Regression:

Instead of voting, take the average (or weighted average) of the neighbors’ values.

👉 Example:
If K=3 and neighbors’ target values are [10, 12, 14] → Prediction = (10+12+14)/3 = 12.
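The voting and averaging rules above can be sketched from scratch with NumPy. This is a minimal illustration, not a replacement for a library implementation: the function names and the 1-D toy data are made up for this example, chosen so that the neighbors reproduce the `[A, A, B, A, B]` and `[10, 12, 14]` cases.

```python
from collections import Counter

import numpy as np


def k_nearest_indices(X_train, x_new, k):
    """Return indices of the k training points closest to x_new (Euclidean)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x_new, k):
    """Classification: majority vote among the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x_new, k)
    votes = Counter(str(y_train[i]) for i in idx)
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, x_new, k):
    """Regression: unweighted mean of the k nearest neighbors' targets."""
    idx = k_nearest_indices(X_train, x_new, k)
    return float(np.mean(y_train[idx]))


# Hypothetical toy data: one feature, six training points
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [20.0]])
labels = np.array(["A", "A", "B", "A", "B", "B"])
values = np.array([10.0, 12.0, 14.0, 30.0, 40.0, 100.0])

# K=5 around x=2.5: the neighbors' classes are A, A, B, A, B -> majority A
predicted_class = knn_classify(X, labels, np.array([2.5]), k=5)

# K=3 around x=2.0: the neighbors' targets are 10, 12, 14 -> mean 12
predicted_value = knn_regress(X, values, np.array([2.0]), k=3)

print(predicted_class, predicted_value)
```

A weighted average would replace `np.mean` with `np.average(..., weights=1 / distance)`, with some guard for zero distances.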

Key Points about KNN:

Distance Metrics: Euclidean, Manhattan, Minkowski, cosine distance (1 − cosine similarity).

Choice of K:

Small K → sensitive to noise (overfitting).

Large K → smoother decision boundary but may underfit.

Feature Scaling: Very important, since features with larger numeric ranges dominate the distance → use normalization/standardization.
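As a worked example of the distance metrics and the scaling point above, the sketch below computes each distance by hand for two made-up vectors and then rescales one feature to show how it takes over the Euclidean distance. The vectors and the factor of 1000 are arbitrary choices for illustration.

```python
import numpy as np

# Two hypothetical points with three features each
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])
diff = np.abs(a - b)  # per-feature differences: [3, 4, 5]

manhattan = diff.sum()                        # Minkowski with p = 1
euclidean = np.sqrt((diff ** 2).sum())        # Minkowski with p = 2
minkowski_p3 = (diff ** 3).sum() ** (1 / 3)   # Minkowski with p = 3
cosine_distance = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Manhattan      : {manhattan:.4f}")
print(f"Euclidean      : {euclidean:.4f}")
print(f"Minkowski (p=3): {minkowski_p3:.4f}")
print(f"Cosine distance: {cosine_distance:.4f}")

# Express the third feature in units 1000x smaller (e.g. metres -> millimetres).
# Its difference becomes 5000 and dominates the distance almost completely.
scale = np.array([1.0, 1.0, 1000.0])
euclidean_rescaled = np.linalg.norm(a * scale - b * scale)
print(f"Euclidean after rescaling feature 3: {euclidean_rescaled:.4f}")
```

The cosine distance is close to zero here because the two vectors point in nearly the same direction, even though they are far apart in Euclidean terms.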


Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

Ans. Curse of Dimensionality

The curse of dimensionality refers to the problems that arise when data has too many features (dimensions).
As dimensions increase:

Data becomes sparse (spread out).

Distances between points become less meaningful.

Models relying on distance similarity (like KNN) start to perform poorly.

- How It Affects KNN Performance

Distance Becomes Less Discriminative

In high dimensions, the relative difference between the distances to the nearest and farthest neighbors shrinks.

All points appear almost equally distant → KNN can’t distinguish neighbors well.

Increased Computation

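Brute-force KNN computes the distance to every training point, so each prediction costs roughly O(n·d) for n training points and d features.

As dimensions grow, computation becomes expensive, and tree-based indexes (KD-tree, ball tree) that speed up the search in low dimensions lose their advantage.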

Overfitting Risk

With many irrelevant features, KNN may consider noisy dimensions in distance calculation.

This misleads the algorithm → poor generalization.

- Example:

Imagine classifying points in:

2D (x,y): You can easily find "close neighbors."

100D: Almost every point is far away → "nearest" loses meaning.
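The 2D versus 100D contrast can be checked with a small simulation. This is a sketch under simple assumptions (points drawn uniformly from the unit hypercube, one random query point); `relative_contrast` is a name introduced here, defined as (farthest − nearest) / nearest.

```python
import numpy as np


def relative_contrast(n_points, n_dims, seed=0):
    """(max distance - min distance) / min distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrast_2d = relative_contrast(n_points=1000, n_dims=2)
contrast_100d = relative_contrast(n_points=1000, n_dims=100)

print(f"Relative contrast in 2D  : {contrast_2d:.2f}")
print(f"Relative contrast in 100D: {contrast_100d:.2f}")
```

A large value means the nearest neighbor is much closer than the farthest point; a value well below 1 means all points sit at roughly the same distance, which is the situation that hurts KNN.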

- How to Reduce Curse of Dimensionality in KNN

Feature Selection: Keep only important features.

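Dimensionality Reduction: Use PCA or autoencoders. t-SNE is mainly a visualization tool and, in its standard form, cannot transform unseen points, so it is rarely used as a preprocessing step for KNN.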

Scaling/Normalization: Ensures no feature dominates distance.

Use distance-weighted voting: Closer neighbors get higher weight, which softens (but does not remove) the influence of distant, less relevant neighbors.

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Ans. Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique used in machine learning and statistics.
It transforms high-dimensional data into a new set of features (called principal components) that capture the maximum variance in the data.

- How PCA Works (Steps):

1. Standardize the data (so all features are on the same scale).
2. Compute the covariance matrix to understand feature relationships.
3. Find eigenvalues & eigenvectors of the covariance matrix.
   - Eigenvectors = directions of new features (principal components).
   - Eigenvalues = how much variance each component explains.
4. Select top k components that explain most of the variance.
5. Transform original data into this new reduced feature space.

- Example: If you have 100 features, PCA may reduce them to 10 principal components while still keeping ~90% of the variance.
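The PCA steps above can be followed line by line in NumPy. This is a minimal sketch, not how a production library does it (scikit-learn's `PCA` uses an SVD rather than an explicit covariance matrix); the function name `pca_eig` and the synthetic data are assumptions made for this example.

```python
import numpy as np


def pca_eig(X, k):
    """PCA via eigen-decomposition of the covariance matrix."""
    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition (eigh is for symmetric matrices, ascending order)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]          # sort descending
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 4. Keep the top-k eigenvectors as the principal directions
    components = eigenvectors[:, :k]
    explained_ratio = eigenvalues / eigenvalues.sum()
    # 5. Project the data onto those directions
    return X_std @ components, explained_ratio


# Synthetic data: 6 observed features driven by 2 hidden factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

X_reduced, ratio = pca_eig(X, k=2)
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", np.round(ratio, 4))
```

Because the data were generated from two hidden factors, the first two ratios should carry most of the variance, which mirrors the "100 features → 10 components" idea on a smaller scale.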

Principal Component Analysis (PCA) and feature selection both reduce dimensionality, but they work differently.

PCA is a feature extraction method: it transforms the original features into new ones called principal components, which are linear combinations of the existing features and capture the maximum variance in the data. Feature selection does not create new features; it identifies and retains only the most relevant original features and removes the less useful or redundant ones.

PCA can improve performance by removing correlations between features and discarding low-variance noise, but it reduces interpretability because each component mixes several original features. Feature selection maintains interpretability because it works directly with the original feature set.
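The difference can be seen side by side on the Wine dataset. The sketch below uses `SelectKBest` with an ANOVA F-test as one possible feature selection method (many others exist) and compares it with a 2-component PCA; the choice of 2 is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Feature selection: keeps 2 of the 13 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_scaled, wine.target)
mask = selector.get_support()
selected_names = [name for name, keep in zip(wine.feature_names, mask) if keep]
X_selected = selector.transform(X_scaled)

# Feature extraction: builds 2 new columns, each a weighted mix of all 13
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Selected original features:", selected_names)
print("PC1 weights on the original features:")
for name, weight in zip(wine.feature_names, pca.components_[0]):
    print(f"  {name:30s} {weight:+.3f}")
```

The selected columns keep their names and units, while each principal component has a non-zero weight on most of the 13 features, which is where the loss of interpretability comes from.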

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Ans. Eigenvalues and Eigenvectors in PCA

Eigenvectors:
These represent the directions of the new feature space (principal components).
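The first eigenvector points in the direction where the data varies the most; each subsequent eigenvector points in the direction of greatest remaining variance that is orthogonal to the previous ones.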

Eigenvalues:
These represent the amount of variance captured by their corresponding eigenvectors.
Larger eigenvalue → more information (variance) that component holds.

- Why They Are Important in PCA

Determine Principal Components

PCA computes eigenvectors of the covariance matrix of the data.

Each eigenvector is a principal component direction.

Rank Components by Importance

Eigenvalues tell us how much variance each component explains.

- Example: If PC1’s eigenvalue = 5 and PC2’s eigenvalue = 2, then PC1 captures more variance.

Dimensionality Reduction

We keep only the top k eigenvectors (with the largest eigenvalues).

This way, we reduce dimensions while keeping most of the variance.
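A small worked example makes the roles of eigenvalues and eigenvectors concrete. The 2×2 covariance matrix below is made up for illustration; its eigenvalues are 3 and 1, so the first component explains 3 / (3 + 1) = 75% of the variance.

```python
import numpy as np

# Hypothetical covariance matrix of two positively correlated features
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]          # largest first
eigenvectors = eigenvectors[:, order]     # columns are the principal directions

# Defining property: cov @ v equals lambda * v for each pair
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(cov @ v, lam * v)

explained_ratio = eigenvalues / eigenvalues.sum()

print("Eigenvalues     :", eigenvalues)
print("PC1 direction   :", eigenvectors[:, 0])
print("Explained ratio :", explained_ratio)
```

PC1 lies along the diagonal (both features increase together, up to an overall sign), and keeping only PC1 would retain three quarters of the total variance.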

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

Ans. How KNN and PCA Complement Each Other

1. KNN struggles in high dimensions

KNN relies on distance calculations.

In high-dimensional spaces (curse of dimensionality), distances lose meaning → performance drops.

2. PCA reduces dimensions before KNN

PCA transforms the data into a smaller set of uncorrelated features (principal components).

This removes redundancy and low-variance noise, making distance calculations more reliable.

3. Improved Efficiency

With fewer dimensions, KNN computes distances faster.

This is important since KNN has high prediction-time cost: every query is compared against the stored training points.

4. Better Generalization

By keeping only the top principal components, PCA can reduce overfitting.

KNN then works with the highest-variance directions. PCA is unsupervised, so these are not guaranteed to be the most class-relevant ones, and the number of components is best chosen by validation.

- In short:

When combined, PCA reduces data complexity and noise, and KNN uses the cleaner, lower-dimensional space to make more accurate distance-based predictions.
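One way to combine the two is a scikit-learn `Pipeline`, sketched below on the Wine dataset. The choices of 5 components, 5 neighbors, and 5 folds are arbitrary values for illustration, not tuned settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling -> PCA -> KNN, chained so each step is fit on training data only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# For classifiers, an integer cv uses stratified folds
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy  :", scores.mean().round(3))
```

Wrapping the steps in a pipeline means the scaler and PCA are refit inside every fold, so no information from the validation part leaks into preprocessing.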

Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.


```python
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ------------------ Without Feature Scaling ------------------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ------------------ With Feature Scaling ------------------
# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaling = KNeighborsClassifier(n_neighbors=5)
knn_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = knn_scaling.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaling)

# Print results
print("Accuracy without scaling:", acc_no_scaling)
print("Accuracy with scaling   :", acc_scaling)
```

Output:

```
Accuracy without scaling: 0.7407407407407407
Accuracy with scaling   : 0.9629629629629629
```


Expected Result

Without scaling: Accuracy is usually much lower (around ~0.65–0.75; 0.74 in the run above).

With scaling: Accuracy improves significantly (often 0.95+; 0.96 in the run above).

- Conclusion: Feature scaling is crucial for KNN because it puts all features on a comparable scale, so no single large-valued feature dominates the distance metric.
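To see why the unscaled model struggles, it helps to look at the spread of each Wine feature. The sketch below simply lists the per-feature standard deviations; the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# List features from the widest to the narrowest spread
for name, std in sorted(zip(wine.feature_names, stds), key=lambda pair: -pair[1]):
    print(f"{name:30s} std = {std:10.3f}")

largest = wine.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

The `proline` column varies over hundreds of units while several others vary by less than one, so without scaling the Euclidean distance is driven almost entirely by `proline`.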

Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

```python
# Import libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features (important before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (no n_components given, so all 13 components are kept)
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")
```

Output:

```
Explained Variance Ratio of each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080
```


What this does:

explained_variance_ratio_ → fraction of variance explained by each principal component.

The values add up to 1 (100%) when all components are kept.

The first few PCs capture most of the variance: here PC1 + PC2 explain about 55%, and the first five PCs about 80%.

- Conclusion: This output tells you how many components you need to keep while still preserving most of the dataset’s information.
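Turning the ratios into a component count is a short cumulative-sum calculation. The sketch below hard-codes the rounded values printed above, and the 90% and 95% thresholds are common conventions rather than fixed rules; `components_for` is a helper name introduced here.

```python
import numpy as np

# Explained variance ratios copied from the output above (rounded to 4 decimals)
ratios = np.array([0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
                   0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080])
cumulative = np.cumsum(ratios)


def components_for(threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)


for i, total in enumerate(cumulative, start=1):
    print(f"First {i:2d} PCs: {total:.4f}")

print("Components for 90% variance:", components_for(0.90))  # 8
print("Components for 95% variance:", components_for(0.95))  # 10
```

scikit-learn can apply the same rule directly: passing a float between 0 and 1 as `n_components` to `PCA` keeps just enough components to reach that share of the variance.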

Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.


```python
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features (important for PCA & KNN)
# Note: the scaler (and the PCA below) is fit on the full dataset before the
# split, so test-set statistics leak into preprocessing. Fitting on the
# training split only is the stricter setup.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# ------------------ KNN on Original Data ------------------
knn_orig = KNeighborsClassifier(n_neighbors=5)
knn_orig.fit(X_train, y_train)
y_pred_orig = knn_orig.predict(X_test)
acc_orig = accuracy_score(y_test, y_pred_orig)

# ------------------ PCA Transformation (Top 2 Components) ------------------
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Same test_size and random_state as above, so the split contains the same samples
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca, y, test_size=0.3, random_state=42
)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train_pca)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test_pca, y_pred_pca)

# ------------------ Print Results ------------------
print("Accuracy on Original Dataset :", acc_orig)
print("Accuracy on PCA (2 components):", acc_pca)
```

Output:

```
Accuracy on Original Dataset : 0.9629629629629629
Accuracy on PCA (2 components): 0.9814814814814815
```
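The comparison above fits the scaler and PCA on the full dataset before splitting. A variant that fits every preprocessing step on the training split only can be sketched with `make_pipeline`; the split settings mirror the ones used above, and the resulting accuracies may differ slightly from the printed ones.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Split the raw data first, so the test split never influences preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# All 13 scaled features
knn_all = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Top 2 principal components of the scaled features
knn_pca2 = make_pipeline(
    StandardScaler(), PCA(n_components=2), KNeighborsClassifier(n_neighbors=5)
)

# fit() learns scaling and PCA from X_train; score() reuses them on X_test
acc_all = knn_all.fit(X_train, y_train).score(X_test, y_test)
acc_pca2 = knn_pca2.fit(X_train, y_train).score(X_test, y_test)

print("Accuracy on all scaled features:", acc_all)
print("Accuracy on 2 PCA components   :", acc_pca2)
```

With only 54 test samples, a difference of one or two correct predictions moves accuracy by several points, so a single split is weak evidence that one variant is better than the other.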


Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.


```python
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features (important for KNN)
# Note: the scaler is fit on the full dataset before the split; fitting it on
# the training split only is the stricter setup.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# ------------------ KNN with Euclidean Distance ------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ------------------ KNN with Manhattan Distance ------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# ------------------ Print Results ------------------
print("Accuracy with Euclidean Distance:", acc_euclidean)
print("Accuracy with Manhattan Distance:", acc_manhattan)
```

Output:

```
Accuracy with Euclidean Distance: 0.9629629629629629
Accuracy with Manhattan Distance: 0.9629629629629629
```
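Both metrics score the same on this single split, which does not say much about whether they differ in general. A cross-validated comparison, sketched below, averages over several splits; the 5 shuffled stratified folds and the seed are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The same folds are reused for both metrics so the comparison is paired
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for metric in ("euclidean", "manhattan"):
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric),
    )
    results[metric] = cross_val_score(model, X, y, cv=cv).mean()

for metric, mean_accuracy in results.items():
    print(f"{metric:10s} mean CV accuracy: {mean_accuracy:.4f}")
```

If the two means are still close, the choice of metric matters little for this dataset and other settings, such as K, are likely to have more effect.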


Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data

Ans.  Problem Context

Dataset: High-dimensional gene expression data (thousands of genes = features).

Samples: Relatively few patient cases (small n, large p problem).

Challenge: Models tend to overfit due to high dimensionality and noise.

- 1 Use PCA to Reduce Dimensionality

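Apply feature scaling first (standardization), fitting the scaler on the training data only.

Run PCA on the scaled training data and apply the same fitted transform to validation and test samples.

PCA will transform thousands of gene features into a smaller set of principal components (PCs) that capture most of the variance. High variance often reflects biological signal, but it can also reflect batch effects or technical noise.

This reduces redundancy among correlated genes, suppresses low-variance noise, and makes the data more manageable.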

- 2 Decide How Many Components to Keep

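Look at the explained variance ratio from PCA.

Plot the per-component variance (scree plot) and the cumulative explained variance curve.

Choose the smallest number of PCs that explain, say, 90–95% of the variance, or treat the number of components as a hyperparameter tuned by cross-validation.

Example: From 10,000 genes → maybe only 50–100 PCs are enough. Note that PCA can return at most min(n_samples, n_features) components, so with few patients the sample count is the upper bound.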

- 3 Use KNN for Classification (Post-PCA)

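Train a KNN classifier on the PCA-transformed dataset.

Use grid search with cross-validation to tune hyperparameters, with scaling, PCA, and KNN wrapped in one pipeline so that each fold refits the preprocessing on its own training part:

k (number of neighbors).

Distance metric (Euclidean, Manhattan).

Since PCA discards low-variance directions and decorrelates the features, distance measures in KNN tend to be more reliable.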

- 4 Evaluate the Model

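Use stratified cross-validation (important with small samples), so each fold keeps the class proportions.

Metrics: Accuracy, plus per-class Precision, Recall, and F1-score (macro-averaged across cancer types), since misclassification in medical settings has high cost.

Compare results with and without PCA to check whether generalization actually improves.

If hyperparameters are tuned by cross-validation, report final performance on a held-out test set or via nested cross-validation, because the tuned score is optimistically biased.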
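Steps 1–4 can be sketched end to end as follows. No real gene expression data is used: `make_classification` generates a synthetic stand-in (80 samples, 2,000 features, 3 classes), and the grid values and macro-F1 scoring are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic "small n, large p" data standing in for gene expression profiles
X, y = make_classification(
    n_samples=80, n_features=2000, n_informative=20, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Scaling, PCA, and KNN are refit inside every fold, which avoids leakage
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# n_components must stay below the size of each training fold (64 samples here)
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ["euclidean", "manhattan"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean macro-F1 (CV):", round(search.best_score_, 3))
```

The score reported by the search is the one used to pick the hyperparameters, so it tends to be optimistic; an outer cross-validation loop or an untouched test set gives a fairer final estimate.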

- 5 Justification to Stakeholders

Why PCA?
Gene expression datasets are high-dimensional and noisy. PCA compresses the data into a few uncorrelated components, reducing overfitting risk. The components are mixtures of genes, so any biological interpretation needs to be checked against the loadings.

Why KNN?
KNN is a simple and effective algorithm when features are reduced and scaled, and each prediction can be explained by pointing to the most similar patients. It doesn’t assume linearity, which is important for complex biological patterns.

Why this pipeline is robust?

Handles curse of dimensionality.

Prevents overfitting by keeping only the most informative signals.

Computationally efficient (fewer features).

Transparent: PCA variance ratios and the neighboring patients behind each KNN prediction can be communicated to clinicians, although individual components are harder to explain than single genes.

- Summary in one line:
We reduce thousands of gene features into a small set of principal components with PCA, train a tuned KNN classifier on this lower-dimensional space, and evaluate with cross-validation, aiming for a robust, interpretable, and generalizable cancer classification model.