# KNN & PCA

# 1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
 - **K-Nearest Neighbors (KNN)** is a simple yet powerful **supervised machine learning algorithm** used for both **classification and regression** problems, and it works on the idea that **similar data points exist close to each other in the feature space**. KNN is a **lazy learning algorithm**, meaning it does not build an explicit model during training but instead stores all the training data and performs computation only at the time of prediction. When a new data point is given, KNN calculates the **distance** (usually Euclidean distance) between this point and all points in the training dataset, then selects the **K closest neighbors**. In **classification**, the new data point is assigned the class that is **most common among its K nearest neighbors** using majority voting, while in **regression**, the prediction is made by taking the **average of the target values** of the K nearest neighbors. The value of K plays a crucial role: a **small K** makes the model sensitive to noise and may cause overfitting, while a **large K** makes it more stable but may lead to underfitting. KNN is easy to understand and implement, works well for **small to medium-sized datasets**, and is widely used in applications such as **recommendation systems, pattern recognition, medical diagnosis, and image classification**.
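
The prediction step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    neighbours = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbours)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbours) / k

# Toy data (made up for illustration): two well-separated clusters in 2-D
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

print(knn_classify(X, labels, (2.0, 2.0), k=3))  # a
print(knn_regress(X, values, (8.2, 8.4), k=3))   # 51.0
```

Note that there is no training step: all the work happens inside `k_nearest` at prediction time, which is what "lazy learning" refers to.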


# 2. What is the Curse of Dimensionality and how does it affect KNN performance?
 - The **Curse of Dimensionality** refers to the set of problems that arise when working with **high-dimensional data**, where the number of features becomes very large. As the number of dimensions increases, the **distance between data points becomes less meaningful** because all points tend to appear almost equally far from each other. Since **K-Nearest Neighbors (KNN)** relies entirely on distance calculations to find the closest neighbors, this directly affects its performance. In high dimensions, KNN struggles to clearly distinguish between nearby and faraway points, which leads to **poor classification or regression accuracy**. Additionally, the volume of the feature space increases exponentially, making the available data appear **sparse**, so KNN needs a much larger amount of data to maintain reliable predictions. High dimensionality also increases **computational cost** and slows down prediction time because distances must be calculated for many features. As a result, the curse of dimensionality makes KNN **less efficient, less accurate, and more sensitive to noise**, and this is why **feature selection, dimensionality reduction techniques like PCA, and proper data scaling** are often applied before using KNN.
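
The loss of distance contrast can be observed with a small simulation. The sketch below assumes uniformly random points in the unit hypercube and uses a hypothetical helper, `relative_contrast`, that compares the farthest and nearest neighbour of a random query; the exact numbers depend on the seed, but the contrast should shrink sharply as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Same number of points, increasing dimensionality
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

A contrast close to zero means the "nearest" neighbour is barely nearer than the farthest point, which is exactly the situation in which KNN's neighbour choice stops being informative.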


# 3. What is Principal Component Analysis (PCA)? How is it different from feature selection?
 - **Principal Component Analysis (PCA)** is a powerful **dimensionality reduction technique** used in machine learning and data analysis to reduce the number of input features while preserving as much important information (variance) from the original dataset as possible. PCA works by transforming the original correlated features into a new set of **uncorrelated variables called principal components**, where each component is a linear combination of the original features and is ranked according to the amount of variance it captures. The first principal component captures the maximum variance, the second captures the next highest variance, and so on. By selecting only the top few principal components, we can represent the data in a lower-dimensional space with minimal loss of information, which helps in **reducing noise, improving computational efficiency, and avoiding the curse of dimensionality**. In contrast, **feature selection** does not create new features but instead **selects a subset of the original features** based on their importance, relevance, or statistical relationship with the output variable. The key difference is that **PCA transforms and combines features**, which may reduce interpretability, while **feature selection keeps the original features**, making the model easier to interpret. In simple terms, PCA creates **new compressed features**, whereas feature selection **chooses the best existing features**.
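
The contrast between the two approaches can be made concrete with NumPy. This is a sketch on made-up data: the feature selection step uses a simple "absolute correlation with the target" ranking (one of many possible criteria, chosen here only for illustration), and PCA is computed through an SVD of the centred data. Note that PCA never looks at `y`, whereas this selection criterion does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 200 samples, 4 features; the target depends on features 0 and 2
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Feature selection: rank the original columns by |correlation with y|, keep the top 2
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
selected = np.sort(np.argsort(corr)[-2:])
X_selected = X[:, selected]            # still the original, interpretable columns

# PCA: centre the data, then project onto the top-2 right singular vectors
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = X_centered @ Vt[:2].T          # each new column mixes all four features

print("Selected original columns:", selected)
print("Weights of PC1 on the original features:", np.round(Vt[0], 3))
```

`X_selected` is a plain subset of the input columns, while every column of `X_pca` is a weighted combination of all four inputs, which is why the PCA output is harder to interpret.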


# 4. What are eigenvalues and eigenvectors in PCA, and why are they important?
 - In **Principal Component Analysis (PCA)**, **eigenvalues and eigenvectors** are fundamental mathematical concepts that determine how the data is transformed and reduced in dimensionality. When PCA is applied, it first computes the **covariance matrix** of the dataset to understand how the features vary with respect to each other. From this covariance matrix, **eigenvectors** represent the **directions (axes)** along which the data varies the most, while the corresponding **eigenvalues** represent the **amount of variance (importance)** captured in those directions. In simple terms, an eigenvector shows the **new direction of a principal component**, and its eigenvalue tells us **how much information or variance is present along that direction**. The eigenvector with the **largest eigenvalue becomes the first principal component**, capturing the maximum variance in the data, the second largest eigenvalue gives the second principal component, and so on. These are important because PCA selects only the top eigenvectors with the highest eigenvalues to form a lower-dimensional space, ensuring that **maximum useful information is retained while reducing the number of features**. Therefore, eigenvalues and eigenvectors are crucial because they directly control **how data is compressed, which features are emphasized, and how much information is preserved in PCA**.
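
The steps above (covariance matrix, eigendecomposition, ranking by eigenvalue, projection) can be sketched directly with NumPy. The data below is simulated for illustration, and library implementations typically use an SVD instead of an explicit covariance matrix, so treat this as a conceptual sketch rather than how scikit-learn computes PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 300 samples, 3 features, two of them strongly correlated
latent = rng.normal(size=(300, 1))
X = np.hstack([
    2.0 * latent + rng.normal(scale=0.3, size=(300, 1)),
    -1.0 * latent + rng.normal(scale=0.3, size=(300, 1)),
    rng.normal(scale=0.5, size=(300, 1)),
])

# 1. Centre the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigendecomposition; eigh is for symmetric matrices and returns ascending eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort directions from largest to smallest variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Each eigenvalue's share of the total is the explained variance ratio
explained_ratio = eigenvalues / eigenvalues.sum()

# 5. Project onto the top-2 eigenvectors to reduce 3 features to 2
X_reduced = X_centered @ eigenvectors[:, :2]

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The variance of the data along each eigenvector equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the principal components by importance.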


# 5. How do KNN and PCA complement each other when applied in a single pipeline?
 - **KNN and PCA complement each other very effectively when used together in a single machine learning pipeline because PCA improves the quality and efficiency of KNN’s distance-based predictions.** Since KNN relies completely on distance calculations between data points, its performance is highly affected by the **curse of dimensionality**, where high-dimensional data makes all points appear similarly distant, leading to poor accuracy and high computation time. PCA helps solve this problem by **reducing the number of features while retaining the most important variance in the data**, which removes noise, redundancy, and less useful information. When PCA is applied before KNN, the dataset becomes **lower-dimensional, cleaner, and more structured**, making distance calculations more meaningful and reliable. This results in **faster predictions, lower memory usage, and often higher accuracy** for KNN. In simple terms, **PCA prepares and simplifies the data**, while **KNN performs better classification or regression on this optimized feature space**, making them a powerful combination for real-world datasets with many features such as images, medical data, and sensor data.
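
One way to combine the two steps is a scikit-learn pipeline, sketched below on the Wine dataset. The choice of 5 components and 5 neighbours is an arbitrary assumption for illustration, not a tuned setting, and whether PCA helps or hurts accuracy depends on the data.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Inside cross_val_score the scaler and PCA are refit on each training fold,
# so no statistics from the validation fold leak into preprocessing
models = {
    "scaler + KNN": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "scaler + PCA + KNN": make_pipeline(
        StandardScaler(), PCA(n_components=5), KNeighborsClassifier(n_neighbors=5)
    ),
}

results = {name: cross_val_score(model, X, y, cv=5) for name, model in models.items()}
for name, scores in results.items():
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Wrapping the steps in a single estimator also means one `fit`/`predict` call applies the same transformations to new data in the same order.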


# 6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.
 - The KNN classifier performs much better after feature scaling because KNN relies on distance calculations. Without scaling, features with large numeric ranges (in the Wine dataset, proline in particular) dominate the distance computation, so the neighbors are chosen almost entirely by those features. After applying StandardScaler, every feature has zero mean and unit variance and contributes comparably to the distance. On this 80/20 split, test accuracy rises from about 0.72 to about 0.94 on the 36 test samples, which shows how important feature scaling is for KNN.

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load Wine Dataset
data = load_wine()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ----------- KNN WITHOUT FEATURE SCALING -----------
knn_without_scaling = KNeighborsClassifier(n_neighbors=5)
knn_without_scaling.fit(X_train, y_train)

y_pred_without = knn_without_scaling.predict(X_test)
accuracy_without = accuracy_score(y_test, y_pred_without)

# ----------- KNN WITH FEATURE SCALING -----------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_with_scaling = KNeighborsClassifier(n_neighbors=5)
knn_with_scaling.fit(X_train_scaled, y_train)

y_pred_with = knn_with_scaling.predict(X_test_scaled)
accuracy_with = accuracy_score(y_test, y_pred_with)

# Print Results
print("Accuracy without Feature Scaling:", accuracy_without)
print("Accuracy with Feature Scaling:", accuracy_with)


Accuracy without Feature Scaling: 0.7222222222222222
Accuracy with Feature Scaling: 0.9444444444444444
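
The gap in the output above can be traced back to the feature scales. For two randomly drawn samples, the expected squared Euclidean distance is proportional to the sum of the per-feature variances, so each feature's share of the total variance is a rough proxy for how much it drives the distance. The sketch below computes that share before and after standardization; `variance_share` is a helper defined only for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = data.data
names = data.feature_names

def variance_share(X):
    """Fraction of the total feature variance contributed by each column."""
    var = X.var(axis=0)
    return var / var.sum()

raw_share = variance_share(X)
scaled_share = variance_share(StandardScaler().fit_transform(X))

top = int(np.argmax(raw_share))
print(f"Largest share before scaling: {names[top]} ({raw_share[top]:.3f})")
print("Shares after scaling:", np.round(scaled_share, 3))
```

After standardization every feature has unit variance, so all 13 features contribute equally on average.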


# 7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
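 - The explained variance ratio shows what fraction of the total variance each principal component captures. In the output below, the first component captures about 36%, the first two about 55%, and the first three about 67%; ten components are needed to pass 95%. Variance is concentrated in the leading components, which is why PCA can reduce dimensionality, although how many components to keep depends on how much of the variance the downstream task needs.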

In [2]:
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load Wine dataset
data = load_wine()
X = data.data

# Feature Scaling (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print Explained Variance Ratio
print("Explained Variance Ratio of Each Principal Component:")
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f"Principal Component {i+1}: {var:.4f}")


Explained Variance Ratio of Each Principal Component:
Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080
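
To decide how many components to keep, it is often easier to look at the cumulative ratio. The sketch below reuses the ratios printed above (copied by hand, so they carry four-decimal rounding) and finds the smallest number of components that reaches a 95% threshold; the threshold itself is an assumption chosen for illustration.

```python
import numpy as np

# Ratios copied from the output above (rounded to 4 decimals)
ratios = np.array([0.3620, 0.1921, 0.1112, 0.0707, 0.0656, 0.0494, 0.0424,
                   0.0268, 0.0222, 0.0193, 0.0174, 0.0130, 0.0080])

cumulative = np.cumsum(ratios)
for i, c in enumerate(cumulative, start=1):
    print(f"First {i:>2} components: {c:.4f}")

# Smallest number of components whose cumulative ratio reaches 95%
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% variance:", n_95)  # 10
```

With a fitted model, the same calculation can be applied to `pca.explained_variance_ratio_` directly.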


# 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
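 - In this run, the KNN classifier scores 0.944 on the 13 scaled features and 1.0 on the top 2 principal components, so reducing to two dimensions did not hurt accuracy here and even helped slightly. The gap is only 2 of the 36 test samples, so it should not be read as PCA being more accurate in general: the two retained components capture only part of the total variance, and on other splits or datasets the discarded information can cost accuracy. A plausible reading is that the first two components already separate the three wine classes well and that dropping the remaining dimensions removes some noise from the distance computation. Note also that the scaler and PCA are fitted on the full dataset before the train/test split, which leaks test-set statistics into preprocessing and makes both scores slightly optimistic. With fewer dimensions, KNN also needs less memory and fewer distance computations per prediction, although on a dataset this small the saving is negligible.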

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine Dataset
data = load_wine()
X = data.data
y = data.target

# ------------------ ORIGINAL DATA (WITH SCALING) ------------------
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train, y_train)

y_pred_original = knn_original.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred_original)

# ------------------ PCA WITH TOP 2 COMPONENTS ------------------
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca, y, test_size=0.2, random_state=42
)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train_pca)

y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test_pca, y_pred_pca)

# ------------------ RESULTS ------------------
print("Accuracy on Original Dataset:", accuracy_original)
print("Accuracy on PCA (2 Components) Dataset:", accuracy_pca)


Accuracy on Original Dataset: 0.9444444444444444
Accuracy on PCA (2 Components) Dataset: 1.0


# 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
 - In this run, Euclidean and Manhattan distance give identical test accuracy (0.944 each) on the scaled Wine dataset, so the choice of metric makes no measurable difference on this split. The two metrics differ in how they aggregate per-feature differences: Euclidean distance squares them, so a large difference in a single feature weighs more heavily, while Manhattan distance sums absolute differences and is less affected by one large deviation. Because standardization puts all features on a comparable scale, both metrics tend to rank neighbors similarly here. On other datasets the metric can change KNN's predictions, so it is best treated as a hyperparameter and chosen by cross-validation rather than assumed in advance.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load Wine Dataset
data = load_wine()
X = data.data
y = data.target

# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# -------- KNN with Euclidean Distance --------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -------- KNN with Manhattan Distance --------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print Results
print("Accuracy using Euclidean Distance:", accuracy_euclidean)
print("Accuracy using Manhattan Distance:", accuracy_manhattan)


Accuracy using Euclidean Distance: 0.9444444444444444
Accuracy using Manhattan Distance: 0.9444444444444444
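
Both metrics are special cases of the Minkowski distance (p=1 for Manhattan, p=2 for Euclidean), and they can disagree about which point is nearest. The sketch below uses a hand-written `minkowski` helper and two made-up candidate points to show such a disagreement; it is an illustration of the formulas, not of scikit-learn's internals.

```python
def minkowski(a, b, p):
    """Minkowski distance between two points; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Made-up points: A differs moderately in both features, B differs a lot in one
query = (0.0, 0.0)
candidates = {"A": (3.0, 3.0), "B": (0.0, 5.0)}

nearest_by_metric = {}
for p, name in [(1, "Manhattan"), (2, "Euclidean")]:
    dists = {label: minkowski(query, pt, p) for label, pt in candidates.items()}
    nearest_by_metric[name] = min(dists, key=dists.get)
    print(f"{name}: {dists} -> nearest is {nearest_by_metric[name]}")
```

Under Manhattan distance B is nearer (5 versus 6), while under Euclidean distance A is nearer (about 4.24 versus 5), so the neighbour set, and therefore the prediction, can depend on the metric.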


# 10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit.
 - For a high-dimensional, low-sample-size gene expression problem I would first standardize the features (both PCA and KNN are sensitive to feature scale), then apply PCA to remove noisy and redundant dimensions while preserving most of the variance; this mitigates the curse of dimensionality and lowers the overfitting risk. I would choose the number of components using a mix of (a) an explained-variance threshold (e.g., 90–99%) to see how many PCs capture most of the variance, and (b) cross-validation (GridSearchCV over candidate n_components) to select the dimensionality that gives the best downstream performance. Next I would train KNN on the PCA-transformed data (tuning k and optionally the distance metric) inside a pipeline so that scaling → PCA → KNN is fitted and evaluated together within each fold. For evaluation I'd use stratified cross-validation (or nested CV for final model selection) plus a hold-out test set, and report accuracy along with per-class metrics (precision, recall, F1) and the confusion matrix; I'd also check stability across folds and learning curves to assess generalization. To justify this pipeline to stakeholders, I would emphasize that PCA reduces noise and dimensionality (so fewer spurious patterns), that KNN is simple and its predictions can be explained through the nearest training samples, that cross-validation guards against overfitting, and that the hold-out evaluation estimates performance on unseen patients, all of which matter in biomedical settings where reproducibility and robustness are essential. Below is a Python script that implements this pipeline on simulated data (replace the simulation with a real gene expression matrix), followed by the output of one run. In that run, 15 of the 75 fits fail, which is consistent with the candidate n_components=72 exceeding the number of samples in each cross-validation training fold (roughly four fifths of the 80 training samples), and both the CV accuracy (0.3875) and the hold-out accuracy (0.20) are close to the chance level of about 0.33 for three classes. The run therefore demonstrates the mechanics of the pipeline rather than good predictive performance on this synthetic data.

In [5]:
# Full reproducible example pipeline for a high-dimensional gene-expression-like dataset.
# - Simulates a high-dim dataset (replace simulation with your real gene-expression matrix X,y)
# - StandardScaler -> PCA -> KNN in Pipeline
# - GridSearchCV tunes n_components and n_neighbors (nested CV recommended for production)
# - Final evaluation on hold-out test set with classification report & confusion matrix

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np

RANDOM_STATE = 42

# ---------- Replace this with your real data ----------
# For demo: simulate 100 samples and 1000 features (50 informative)
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_redundant=50, n_classes=3, random_state=RANDOM_STATE)

# ---------- Train / Hold-out split ----------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=RANDOM_STATE
)

# ---------- Inspect explained variance to guide choices ----------
scaler_for_pca = StandardScaler().fit(X_train)
X_train_scaled = scaler_for_pca.transform(X_train)
pca_full = PCA().fit(X_train_scaled)
explained_ratio = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained_ratio)
# Example: how many PCs to reach 95% variance
n_comp_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components to reach 95% variance (approx):", n_comp_95)

# ---------- Pipeline + GridSearch to jointly pick PCA components and K ----------
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),           # n_components will be tuned
    ("knn", KNeighborsClassifier())  # n_neighbors will be tuned
])

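param_grid = {
    # Each n_components candidate must not exceed the number of samples in a CV
    # training fold; if n_comp_95 is larger than that, those fits fail with NaN scores.
    "pca__n_components": [2, 5, 10, 20, n_comp_95],
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ['euclidean']   # optionally include 'manhattan'
}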

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)

print("Best params from CV:", grid.best_params_)
print("Best CV accuracy on training folds:", grid.best_score_)

# ---------- Final evaluation on hold-out test set ----------
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Hold-out test accuracy:", accuracy_score(y_test, y_pred))
print("Classification report on test set:\n", classification_report(y_test, y_pred, digits=4))
print("Confusion matrix (rows=true, cols=pred):\n", confusion_matrix(y_test, y_pred))

# ---------- Quick summary ----------
print("Selected PCA components:", best_model.named_steps['pca'].n_components_)
print("Selected K for KNN:", best_model.named_steps['knn'].n_neighbors)


Components to reach 95% variance (approx): 72
Fitting 5 folds for each of 15 candidates, totalling 75 fits


15 fits failed out of a total of 75.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.12/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sklearn/pipeline.py", line 654, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/pyt

Best params from CV: {'knn__metric': 'euclidean', 'knn__n_neighbors': 5, 'pca__n_components': 5}
Best CV accuracy on training folds: 0.3875
Hold-out test accuracy: 0.2
Classification report on test set:
               precision    recall  f1-score   support

           0     0.2500    0.4286    0.3158         7
           1     0.1429    0.1429    0.1429         7
           2     0.0000    0.0000    0.0000         6

    accuracy                         0.2000        20
   macro avg     0.1310    0.1905    0.1529        20
weighted avg     0.1375    0.2000    0.1605        20

Confusion matrix (rows=true, cols=pred):
 [[3 4 0]
 [5 1 1]
 [4 2 0]]
Selected PCA components: 5
Selected K for KNN: 5
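
The 15 failed fits reported in the output above are consistent with one candidate (`n_components=72`) being larger than the number of samples available in each cross-validation training fold. A possible way to avoid this, sketched below under that assumption, is to cap the candidates at the size of the smallest training fold minus one (centred data with n samples has rank at most n - 1). The value 72 is copied from the output above; `smallest_fold`, `max_components`, and `candidates` are names introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42

# Same simulated data and split as in the script above
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_redundant=50, n_classes=3, random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=RANDOM_STATE
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# PCA is refit on each training fold, so the fold size bounds n_components
smallest_fold = min(len(train_idx) for train_idx, _ in cv.split(X_train, y_train))
max_components = min(smallest_fold - 1, X_train.shape[1])

n_comp_95 = 72  # value reported in the output above
candidates = sorted({min(c, max_components) for c in [2, 5, 10, 20, n_comp_95]})

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])
param_grid = {"pca__n_components": candidates, "knn__n_neighbors": [3, 5, 7]}

# error_score="raise" surfaces any remaining failure instead of recording NaN
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy", error_score="raise")
grid.fit(X_train, y_train)

print("Smallest training fold:", smallest_fold)
print("n_components candidates:", candidates)
print("Any NaN scores:", bool(np.isnan(grid.cv_results_["mean_test_score"]).any()))
```

Capping the grid only removes the failures; it does not by itself improve accuracy on this synthetic data, where PCA's variance-based directions need not align with the class signal.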


 - **Why PCA first:** it reduces noise, removes redundant features, mitigates the curse of dimensionality, and speeds up KNN prediction.

 - **How many components:** use an explained-variance threshold (e.g., 90–99%) as a first guide, then tune with cross-validation to pick the number that gives the best downstream predictive performance (a tradeoff between information retained and model complexity).

 - **Why KNN after PCA:** KNN is simple and benefits when distances are meaningful; PCA can make distances more meaningful by removing low-variance, noisy dimensions.

 - **Evaluation:** use stratified cross-validation, report per-class metrics (precision, recall, F1) and the confusion matrix, and test on a held-out set; for high-stakes biomedical tasks, perform nested CV, repeated CV, and external validation (an independent cohort) where possible.

 - **Robustness & reproducibility:** tune hyperparameters and report the variance across folds; document pipeline steps, random seeds, and preprocessing so results can be reproduced.

 - **Biological interpretability:** although PCA components are linear combinations (less interpretable than original genes), you can inspect the loadings of the top PCs to find which genes contribute most and follow up with domain analysis (pathways, prior biological knowledge).
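
Inspecting loadings can be sketched as follows. The expression matrix here is random noise standing in for real data, and the gene names (`gene_0`, `gene_1`, ...) and the helper `top_genes` are hypothetical; with a real dataset the names would come from the expression matrix's column labels.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulated stand-in for an expression matrix: 60 samples x 200 genes
X = rng.normal(size=(60, 200))
gene_names = np.array([f"gene_{i}" for i in range(X.shape[1])])  # hypothetical names

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5).fit(X_scaled)

def top_genes(component_index, n=10):
    """Genes with the largest absolute loading on one principal component."""
    loadings = pca.components_[component_index]   # one weight per gene
    order = np.argsort(np.abs(loadings))[::-1][:n]
    return list(zip(gene_names[order], loadings[order]))

for name, weight in top_genes(0, n=5):
    print(f"{name}: {weight:+.3f}")
```

The sign of a loading only indicates direction along the component; it is the magnitude that suggests which genes to follow up on, and any such shortlist is a starting point for domain analysis rather than evidence of biological relevance.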