# KNN & PCA Assignment
---

**Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

->

- K-Nearest Neighbors (KNN) is a supervised learning algorithm that can be used for both classification and regression.
- It is a lazy learner, meaning it does not build an explicit model; instead, it stores the training data and makes predictions based on similarity (distance) to known data points.
---
**How it works:**

1. For a new data point, compute the distance (commonly Euclidean or Manhattan) between the point and every training sample.
2. Select the K nearest neighbors (the K closest data points).
3. Make a prediction:
   - Classification: The class label is chosen by majority voting among the K neighbors. Example: If K=5 and neighbors = [A, A, B, A, C] then predicted class = A.
   - Regression: The prediction is the average (or weighted average) of the target values of the K neighbors. Example: If neighbor values = [50, 55, 60] and K=3 then predicted value = 55.
---
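The KNN prediction steps from Question 1 (compute distances, pick the K nearest points, then vote or average) can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the function names and the toy data are hypothetical, Euclidean distance is assumed, and the toy points are chosen so the neighbors reproduce the two examples above ([A, A, B, A, C] and [50, 55, 60]).

```python
import math
from collections import Counter


def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to `query` (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return [train_y[i] for i in order[:k]]


def knn_classify(train_X, train_y, query, k):
    """Classification: majority vote among the k nearest labels."""
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Regression: unweighted mean of the k nearest target values."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)


# Hypothetical toy data: five points close to the origin and one far away.
X = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.0, 0.3), (0.4, 0.0), (5.0, 5.0)]
labels = ["A", "A", "B", "A", "C", "B"]
values = [50, 55, 60, 90, 95, 100]

predicted_class = knn_classify(X, labels, (0.0, 0.0), k=5)  # neighbors: A, A, B, A, C
predicted_value = knn_regress(X, values, (0.0, 0.0), k=3)   # neighbors: 50, 55, 60
print(predicted_class, predicted_value)  # prints: A 55.0
```

Because all work happens at prediction time (the "lazy learner" behaviour), each query scans every training point, which is the cost that grows with dataset size and dimensionality.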

**Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?**

->
- The Curse of Dimensionality refers to challenges that arise when data has a very high number of features (dimensions). As the number of dimensions increases:
1. Data sparsity: Points become widely spread out, making the feature space sparse.
2. Distance metrics lose effectiveness: In high dimensions, the difference between nearest and farthest neighbors becomes negligible, so “closeness” is not meaningful.
3. Higher computational cost: More features mean more distance calculations, increasing complexity.
---
**Effect on KNN performance:**
- Since KNN depends on distance measures, in high dimensions it struggles to identify truly close neighbors.
- As a result, neighbors become less meaningful, leading to poor classification/regression accuracy and a higher risk of overfitting.
---
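The loss of distance contrast described in Question 2 can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast, (farthest - nearest) / nearest, from one random query to 500 random points. The function name, sample size, and dimension list are illustrative choices.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    query = tuple(rng.random() for _ in range(n_dims))
    dists = [
        math.dist(query, tuple(rng.random() for _ in range(n_dims)))
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrast.items():
    print(f"dims={d:4d}  relative contrast={c:.2f}")
```

The contrast should shrink sharply as the number of dimensions grows: in 2 dimensions the nearest point is far closer than the farthest one, while in 1000 dimensions all points sit at nearly the same distance, so the "nearest" neighbors carry little information.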

**Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?**

->
- PCA is an unsupervised dimensionality reduction technique.
- It transforms the original correlated features into a new set of uncorrelated variables called principal components.
- These components are linear combinations of the original features and are ordered such that:
  - The first principal component (PC1) captures the maximum variance.
  - The second (PC2) captures the next highest variance, and so on.
- The goal is to reduce the dataset to fewer dimensions while retaining as much information (variance) as possible.
---
**Difference between PCA and Feature Selection:**

**1. PCA:**

- Creates new features (principal components) as combinations of original features.
- New features are less interpretable since they are mixes of original features.
- Focus: reduce dimensions while preserving variance in the data.

**2. Feature Selection:**

- Selects a subset of the original features (no new features are created).
- Retains interpretability since features remain in their original form.
- Focus: choose only the most relevant features and discard redundant/irrelevant ones.
---
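The contrast between PCA and feature selection from Question 3 can be made concrete on the Wine dataset used later in this assignment. The sketch below reduces the data to two columns both ways. `SelectKBest` with `f_classif` is only one of several possible feature selection methods and is an assumption here, as is the choice of two columns.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)

# PCA: each output column is a linear combination of all original features.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: each output column is one original feature, unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, data.target)
selected_names = [data.feature_names[i] for i in selector.get_support(indices=True)]

print("PCA loadings shape:", pca.components_.shape)  # weights over original features
print("Selected original features:", selected_names)
```

Both results have the same shape, but only the selected columns keep their original names and units. The PCA columns need the loadings in `pca.components_` to be interpreted.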

**Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?**

->
- In PCA, we calculate the covariance matrix of the (centered, usually standardized) data, and then perform an eigendecomposition of that matrix.
- This gives us eigenvalues and eigenvectors.
---
**1. Eigenvectors (Principal Components):**
- Eigenvectors represent the directions (axes) in the feature space along which the data varies the most.
- Each eigenvector is a principal component.
- They define the new coordinate system for the transformed data.

**2. Eigenvalues (Variance Explained):**
- Each eigenvalue tells us how much variance in the data is captured by its corresponding eigenvector (principal component).
- Larger eigenvalue → that component explains more variance in the dataset.
- Example: If eigenvalue of PC1 = 3.2 and PC2 = 1.1 → PC1 explains roughly three times as much variance as PC2.
---
**Importance in PCA:**
1. Eigenvectors = directions of maximum variance (new axes).
2. Eigenvalues = amount of variance captured along each eigenvector.
3. By sorting eigenvalues (largest → smallest), we decide which principal components to keep.
---
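The role of eigenvalues and eigenvectors from Question 4 can be traced step by step with NumPy. This is a sketch on hypothetical toy data (three features, two of them strongly correlated). It shows the covariance matrix, its eigendecomposition, the sort by eigenvalue, and the projection onto the top components. scikit-learn's `PCA` reaches an equivalent result through an SVD rather than this explicit route.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features, the first two correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are the principal directions

explained_variance_ratio = eigenvalues / eigenvalues.sum()
X_projected = X_centered @ eigenvectors[:, :2]  # keep the top 2 components

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Explained variance ratio:", np.round(explained_variance_ratio, 3))
```

Sorting by eigenvalue is what turns the decomposition into a ranking: the columns kept in `X_projected` are the directions with the largest variance.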

**Question 5: How do KNN and PCA complement each other when applied in a single pipeline?**

->
**1. Reducing Dimensionality:**
- High-dimensional data causes the curse of dimensionality, where distances become less meaningful for KNN.
- PCA reduces the number of dimensions, making distance comparisons in KNN more reliable.

**2. Noise Reduction:**

- PCA removes low-variance components (which are often noise).
- Cleaner data improves KNN’s accuracy.

**3. Computation Efficiency:**
- KNN requires distance calculation with all features → expensive in high dimensions.
- PCA reduces features, lowering computation cost.

**4. Better Generalization:**
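- PCA can reduce KNN's tendency to overfit by discarding redundant, low-variance directions; because PCA ignores the labels, it does not guarantee that the discarded directions are irrelevant to the target.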
- Helps KNN focus only on the most informative components.
---
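The combination of PCA and KNN described in Question 5 is commonly expressed as a single scikit-learn `Pipeline`, so that scaling and PCA are re-fitted on the training folds only. The sketch below uses the Wine dataset. The step names, `n_components=5`, and `n_neighbors=5` are illustrative assumptions rather than tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),                   # put features on a common scale
    ("pca", PCA(n_components=5)),                  # illustrative number of components
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # distances computed in PCA space
])

# For a classifier, an integer cv uses stratified folds by default.
scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Keeping all three steps inside one estimator means the test fold never influences the scaler or the PCA axes, which is the main practical benefit of the pipeline form.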

**Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.**

->

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

In [2]:
data = load_wine()
X, y = data.data, data.target

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [4]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [5]:
y_pred = knn.predict(X_test)

In [6]:
print('Accuracy without scaling:', accuracy_score(y_test, y_pred))

Accuracy without scaling: 0.7037037037037037


In [7]:
# Fit the scaler on the training split only, then reuse it on the test split (avoids leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [8]:
knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)

In [9]:
print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))

Accuracy with scaling: 0.9814814814814815


**Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.**

->

In [10]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [11]:
data = load_wine()
X, y = data.data, data.target

In [12]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [13]:
pca = PCA()
pca.fit(X_scaled)  # fit() returns the fitted estimator, not the transformed data

In [14]:
print("Explained Variance Ratio of each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.2f}")

Explained Variance Ratio of each Principal Component:
PC1: 0.36
PC2: 0.19
PC3: 0.11
PC4: 0.07
PC5: 0.07
PC6: 0.05
PC7: 0.04
PC8: 0.03
PC9: 0.02
PC10: 0.02
PC11: 0.02
PC12: 0.01
PC13: 0.01


**Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.**

->


In [15]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

In [16]:
data = load_wine()
X, y = data.data, data.target

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [18]:
# 1. KNN on original dataset

In [19]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [20]:
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

In [21]:
y_pred = knn.predict(X_test_scaled)

In [22]:
print("Accuracy on original dataset:", accuracy_score(y_test, y_pred))

Accuracy on original dataset: 0.9814814814814815


In [23]:
# 2. KNN on PCA-transformed dataset

In [24]:
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

In [25]:
knn.fit(X_train_pca, y_train)
y_pred_pca = knn.predict(X_test_pca)

In [26]:
print("Accuracy with PCA (2 components):", accuracy_score(y_test, y_pred_pca))

Accuracy with PCA (2 components): 0.9629629629629629


**Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.**

->


In [27]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

In [28]:
data = load_wine()
X, y = data.data, data.target

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [30]:
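# 1. Euclidean distance
# Note: X_train and X_test are passed unscaled in the cells below, so the reported accuracies are for unscaled features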

In [31]:
knn = KNeighborsClassifier(metric='euclidean')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_euclidean = accuracy_score(y_test, y_pred)

In [32]:
# 2. Manhattan distance
knn = KNeighborsClassifier(metric='manhattan')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc_manhattan = accuracy_score(y_test, y_pred)

In [33]:
print('Accuracy with euclidean:', acc_euclidean)
print('Accuracy with manhattan:', acc_manhattan)

Accuracy with euclidean: 0.7037037037037037
Accuracy with manhattan: 0.7407407407407407
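
Question 9 asks for the comparison on scaled features, while the cells above fit KNN on the raw features. A sketch of the scaled variant is shown here, using the same split (`test_size=0.3`, `random_state=1`) and wrapping `StandardScaler` and KNN in a pipeline. The `scaled_accuracy` dictionary is a hypothetical helper name, and no output is listed because this variant was not part of the original run.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scaled_accuracy = {}
for metric in ("euclidean", "manhattan"):
    # The scaler is fitted on the training split only, inside the pipeline.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(metric=metric))
    model.fit(X_train, y_train)
    scaled_accuracy[metric] = accuracy_score(y_test, model.predict(X_test))

for metric, acc in scaled_accuracy.items():
    print(f"Accuracy with {metric} (scaled):", acc)
```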


**Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.**

Explain how you would:
- Use PCA to reduce dimensionality
- Decide how many components to keep
- Use KNN for classification post-dimensionality reduction
- Evaluate the model
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

->

**1. Use PCA to reduce dimensionality**
- Gene expression data has thousands of features but few samples → risk of overfitting.
- PCA transforms genes into principal components (uncorrelated features) that capture maximum variance.
- This reduces noise and removes redundancy.

**2. Decide how many components to keep**
- Look at the explained variance ratio (scree plot).
- Keep enough components to explain roughly 90–95% of the cumulative variance.
- Or tune the number of components using cross-validation with the classifier.
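
The variance-threshold rule for choosing the number of components can be sketched in two equivalent ways. The Wine dataset stands in for the gene expression data, which is not available here, and the 95% target is one point in the range mentioned above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Option 1: inspect the cumulative explained variance (the numbers behind a
# scree plot) and take the first component count that reaches the target.
full_pca = PCA().fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# Option 2: pass the target as a float and let scikit-learn choose the count.
auto_pca = PCA(n_components=0.95).fit(X_scaled)

print("Components for 95% variance:", n_for_95, auto_pca.n_components_)
```

A variance threshold is label-free, so it says nothing about classification accuracy. Tuning the count by cross-validation, as in the last bullet above, addresses that directly.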

**3. Use KNN for classification**
- After PCA, the dataset is lower-dimensional → distances between patients become more meaningful.
- Apply KNN, where each patient is classified based on the majority class of its nearest neighbors.
- Tune k (number of neighbors) for best performance.

**4. Evaluate the model**
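- Use stratified k-fold cross-validation (samples are limited, and stratification keeps class proportions similar in every fold).
- Metrics: Accuracy, Balanced Accuracy, F1-score, ROC-AUC.
- This gives a more reliable performance estimate than a single train/test split.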

**5. Justify to stakeholders**
- Why PCA?: Reduces dimensionality, denoises data, and lowers the risk of overfitting.
- Why KNN?: Simple, interpretable, distance-based method.
- Why this pipeline?: Robust, leakage-free (scaling + PCA inside CV), and suitable for biomedical data where features >> samples.
---
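The Question 10 workflow (scaling and PCA inside cross-validation, with the number of components and K tuned together) can be sketched as follows. The data is synthetic: `make_classification` generates a hypothetical stand-in with 100 samples and 500 features, since no gene expression dataset is part of this assignment. The parameter grid and the `balanced_accuracy` scorer are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features.
X, y = make_classification(
    n_samples=100, n_features=500, n_informative=20,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid: number of components and number of neighbors tuned jointly.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="balanced_accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best balanced accuracy:", search.best_score_)
```

Because the best score is selected on the same folds used for tuning, it is somewhat optimistic. Wrapping the search in an outer cross-validation loop (nested CV) or holding out a final test set would give stakeholders a less biased estimate.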