In [None]:
                                                            KNN & PCA ASSIGNMENT

In [None]:
Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

In [None]:
 **K-Nearest Neighbors (KNN)**

K-Nearest Neighbors is a **supervised machine learning algorithm** used for both **classification** and **regression** tasks.
It is a **non-parametric, instance-based (lazy learning)** method, meaning it does not explicitly learn a model during training but instead stores the data and makes predictions based on the similarity between new and existing points.

---

### **How it works**

1. Choose a value of **K** (number of neighbors).
2. Compute the **distance** (commonly Euclidean, Manhattan, or Minkowski) between the new data point and all points in the training set.
3. Select the **K nearest neighbors** to the new data point.
4. Make predictions based on those neighbors.

---

### **KNN in Classification**

* Each neighbor "votes" for its class.
* The class with the **majority vote** among the K neighbors is assigned to the new data point.

**Example:**
If K=5 and the nearest neighbors have classes `[Dog, Dog, Cat, Dog, Cat]`, the majority is **Dog**, so the new sample is classified as **Dog**.

---

### **KNN in Regression**

* Instead of voting, KNN takes the **average (or weighted average)** of the neighbors’ values.
* The predicted value is the mean of the target values of the K closest points.

**Example:**
If K=3 and the neighbors have house prices `[200k, 220k, 210k]`, the predicted price is **(200k + 220k + 210k)/3 = 210k**.
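
The voting and averaging rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def k_nearest(X_train, x_new, k):
    """Return indices of the k training points closest to x_new (Euclidean)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x_new, k):
    """Classification: majority vote among the k nearest labels."""
    idx = k_nearest(X_train, x_new, k)
    return Counter(y_train[i] for i in idx).most_common(1)[0][0]


def knn_regress(X_train, y_train, x_new, k):
    """Regression: unweighted mean of the k nearest target values."""
    idx = k_nearest(X_train, x_new, k)
    return float(np.mean(y_train[idx]))


# Toy classification data (hypothetical): five nearby animals and one far away
X_cls = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                  [1.1, 1.3], [1.4, 1.2], [5.0, 5.0]])
y_cls = np.array(["Dog", "Dog", "Cat", "Dog", "Cat", "Cat"])
label = knn_classify(X_cls, y_cls, np.array([1.1, 1.1]), k=5)
print(label)  # Dog (3 votes against 2)

# Toy regression data (hypothetical): floor area -> price in thousands
X_reg = np.array([[1000.0], [1100.0], [1050.0], [2000.0]])
y_reg = np.array([200.0, 220.0, 210.0, 400.0])
price = knn_regress(X_reg, y_reg, np.array([1040.0]), k=3)
print(price)  # 210.0
```

Note that nothing is "trained" here: both functions search the stored data at prediction time, which is what "lazy learning" means.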

---

### **Key Points to Remember**

* **Choice of K:**

  * Small K → more sensitive to noise.
  * Large K → smoother decision boundaries, but may overlook local patterns.
* **Distance metric matters** (Euclidean for continuous data, Hamming for categorical, etc.).
* **Feature scaling is important** (since distances are affected by feature magnitudes).
* KNN can be **computationally expensive at prediction time** for large datasets (the whole training set must be stored, and a brute-force search compares each query with every stored point).

---

✅ **In short:**

* **Classification:** Majority vote of neighbors.
* **Regression:** Average of neighbors’ values.

---



In [None]:
Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

In [None]:

### **Curse of Dimensionality**

The *curse of dimensionality* refers to problems that arise when data has **too many features (dimensions)**.
As the number of dimensions increases:

* Data becomes **sparse** (points are spread far apart).
* Distances between points become less meaningful.
* Algorithms that rely on distance or density (like **KNN**) struggle.

---

### **Why it happens?**

* In high dimensions, the **volume of space grows exponentially**.
* To cover the same proportion of the space, you’d need exponentially more data.
* Distances between nearest and farthest neighbors tend to become **almost the same** → making it difficult to distinguish which points are actually “close.”

---

### **How it affects KNN performance**

Since **KNN relies on distance** to find the “nearest” neighbors, the curse of dimensionality causes:

1. **Reduced distance contrast**:

   * In high dimensions, nearest and farthest neighbors look equally distant.
   * Example: In 2D, you can clearly see which points are close. In 100D, distances flatten out.

2. **Overfitting risk**:

   * With sparse data, KNN may pick up noise instead of true patterns.

3. **Increased computation cost**:

   * More features → more distance calculations → slower KNN.

---

### **Example (Intuition)**

* Imagine points in 1D (a line): easy to say which are close.
* In 2D (a square): still manageable.
* In 100D: almost every point seems equally far → “nearest neighbor” loses meaning.
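
The intuition above can be checked with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast `(d_max - d_min) / d_min` between a random query and 500 random points; the function name `relative_contrast` and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(dim, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()


contrast = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrast.items():
    print(f"{dim:>4}D  relative contrast = {value:.2f}")
```

The contrast shrinks sharply as the dimension grows: in high dimensions the nearest and farthest points end up at nearly the same distance from the query.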

---

### **How to reduce the curse in KNN**

* **Feature selection** → keep only relevant features.
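* **Dimensionality reduction** → PCA or autoencoders (t-SNE is mainly a visualization method and offers no transform for unseen points, so it is rarely used as a preprocessing step for KNN).
* **Scaling/normalization** → does not reduce dimensionality, but ensures all features contribute fairly to the distance.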

---

✅ **In short:**
The curse of dimensionality makes distance measures unreliable in high dimensions, which **hurts KNN’s accuracy and efficiency**.


In [None]:
Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

In [None]:


## **Principal Component Analysis (PCA)**

PCA is a **dimensionality reduction technique** used to transform high-dimensional data into a smaller set of uncorrelated variables called **principal components**.

* It finds new axes (directions) in the data that capture the **maximum variance**.
* These new axes are **linear combinations** of the original features.
* The first principal component captures the most variance, the second captures the next most (orthogonal to the first), and so on.
* You can keep the top *k* components to reduce dimensionality while preserving most information.

---

### **Steps of PCA (simplified)**

1. Standardize the data (so features are on the same scale).
2. Compute the **covariance matrix** of the data.
3. Find **eigenvalues and eigenvectors** of the covariance matrix.
4. Sort the eigenvectors by decreasing eigenvalue; they are the new axes (principal components).
5. Project the data onto top *k* principal components.
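
The five steps can be written out directly with NumPy. This is a sketch on synthetic data (the 200 x 4 matrix is generated here purely for illustration); in practice `sklearn.decomposition.PCA` does the same job using an SVD.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 200 samples, 4 correlated features built from 2 latent factors
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))

# 1. Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (4 x 4)
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project onto the top k principal components
k = 2
X_reduced = X_std @ eigenvectors[:, :k]

explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Reduced shape:", X_reduced.shape)
```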

---

## **PCA vs. Feature Selection**

| Aspect                 | **PCA (Dimensionality Reduction)**                                                       | **Feature Selection**                              |
| ---------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------------------- |
| **Definition**         | Creates new features (principal components) as linear combinations of original features. | Selects a subset of the original features.         |
| **Nature of Features** | Transformed features (not directly interpretable).                                       | Original features (easy to interpret).             |
| **Goal**               | Reduce dimensionality while retaining variance (information).                            | Remove irrelevant or redundant features.           |
| **Type**               | **Feature extraction** (creates new features).                                           | **Feature selection** (keeps existing ones).       |
| **Interpretability**   | Harder, since new components are combinations.                                           | Easier, since selected features are original ones. |

---

### **Example**

* Suppose you have features: Height, Weight, Arm length, Leg length.
* **Feature Selection**: Might keep only *Height* and *Weight* if they are most informative.
* **PCA**: Would create new features like *PC1 = 0.7(Height) + 0.6(Weight) + …* that captures maximum variance.

---

✅ **In short:**

* **PCA = feature extraction** (creates new combined features).
* **Feature selection = keeps the most useful original features**.

---



In [None]:
Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

In [None]:
## **Eigenvalues and Eigenvectors in PCA**

When we perform PCA, we compute the **covariance matrix** of the data.

* The **eigenvectors** of this covariance matrix give the **directions** (axes) of the new feature space (the principal components).
* The **eigenvalues** tell us how much **variance (information)** is captured along each eigenvector.

---

### **Eigenvectors (Directions of Maximum Variance)**

* Think of eigenvectors as the "arrows" that point in the directions where the data varies the most.
* In PCA, each eigenvector corresponds to a **principal component**.
* They are orthogonal (perpendicular) to each other, ensuring the new components are uncorrelated.

---

### **Eigenvalues (Magnitude of Variance)**

* Each eigenvalue corresponds to an eigenvector.
* It represents the **amount of variance** captured in that direction.
* Larger eigenvalue = more important principal component.

---

### **Why They’re Important in PCA**

1. **Identify principal components** → eigenvectors = new axes (PC1, PC2, …).
2. **Rank components by importance** → eigenvalues show how much variance each component explains.
3. **Dimensionality reduction** → keep only components with the largest eigenvalues (discard low-variance ones).

---

### **Example (Intuition)**

Imagine a 2D dataset shaped like an ellipse:

* The **long axis** of the ellipse = eigenvector with the largest eigenvalue (PC1, max variance).
* The **short axis** = eigenvector with the smaller eigenvalue (PC2, less variance).
* PCA would likely keep PC1 if reducing to 1D, since it explains most of the spread in the data.
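
A small numeric case makes this concrete. The covariance matrix below is a made-up example of an ellipse tilted at 45 degrees; its eigenvalues are 3 and 1, so PC1 explains 75% of the variance.

```python
import numpy as np

# Hypothetical 2D covariance matrix (ellipse stretched along the diagonal)
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending order: [1, 3]

pc1 = eigenvectors[:, -1]  # long axis, proportional to (1, 1) up to sign
pc2 = eigenvectors[:, 0]   # short axis, proportional to (1, -1) up to sign

# Defining property of an eigenpair: C v = lambda v
print(np.allclose(C @ pc1, eigenvalues[-1] * pc1))  # True

# Share of total variance along each axis, largest first
ratio = eigenvalues[::-1] / eigenvalues.sum()
print(ratio)  # 0.75 and 0.25
```

The sign of each eigenvector is arbitrary, which is why only the direction of a principal component is meaningful.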

---

✅ **In short:**

* **Eigenvectors = directions of new axes (principal components).**
* **Eigenvalues = how much variance each new axis explains.**
* They are the mathematical backbone of PCA.

---


In [None]:
Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

In [None]:

## **How KNN and PCA Work Together**

* **KNN** is a distance-based algorithm → it decides labels or predictions by comparing distances between points.
* **PCA** reduces dimensionality by projecting data into fewer, more informative features.

👉 Since high-dimensional data can hurt KNN (curse of dimensionality), **PCA is often used before KNN** to improve performance.

---

### **Pipeline: PCA + KNN**

1. **Preprocess data** (scaling/normalization, handle missing values).
2. **Apply PCA** to reduce dimensionality (keep top *k* components that explain most variance).
3. **Run KNN** on the reduced feature space.
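
One way to express these three steps is a scikit-learn pipeline, sketched here on the Wine dataset that the later questions use. The choices of 2 components and K=5 are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> PCA -> KNN chained into one estimator
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# Each fold refits the scaler and PCA on its own training part only
scores = cross_val_score(pca_knn, X, y, cv=5)
print("Fold accuracies:", scores.round(2))
print("Mean accuracy  :", round(scores.mean(), 2))
```

Chaining the steps means the scaler and PCA are learned from training data only, so the validation folds stay unseen until prediction.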

---

### **Why This Helps**

1. **Mitigates Curse of Dimensionality**

   * PCA removes redundant/noisy features → distances in KNN become more meaningful.

2. **Improves Efficiency**

   * Fewer dimensions → faster distance calculations in KNN.

3. **Noise Reduction**

   * PCA filters out low-variance directions (often noise) → KNN works with cleaner signals.

4. **Better Generalization**

   * Reduces overfitting since KNN won’t rely on irrelevant features.

---

### **Example**

* Suppose you have 100 features in a dataset (many are correlated, e.g., height, arm length, leg length).
* PCA reduces it to 10 principal components that explain ~95% of variance.
* KNN running on these 10 PCs is typically:

  * Faster,
  * Often more accurate (distances are more meaningful),
  * Less prone to overfitting.

---

✅ **In short:**

* **PCA = dimensionality reduction / feature extraction.**
* **KNN = classification/regression using distance.**
* Together → PCA makes the feature space more compact and less noisy, so KNN can compute more reliable distances.

---



In [None]:
Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().
Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.
(Include your Python code and output in the code box below.)

In [1]:
# Question 6: KNN on Wine Dataset with and without Feature Scaling

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# -------------------------
# 1. KNN WITHOUT SCALING
# -------------------------
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no_scale = knn_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

# -------------------------
# 2. KNN WITH SCALING
# -------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print("KNN Accuracy WITHOUT scaling: {:.2f}".format(acc_no_scale))
print("KNN Accuracy WITH scaling   : {:.2f}".format(acc_scaled))


KNN Accuracy WITHOUT scaling: 0.72
KNN Accuracy WITH scaling   : 0.94


In [None]:
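Key Takeaway

Without scaling: KNN performs poorly (accuracy 0.72) because features with large numeric ranges, such as “proline” (roughly 300–1700), dominate the distance calculation and drown out small-range features such as “flavanoids” (roughly 0–5).

With scaling: each feature is standardized to zero mean and unit variance, so all features contribute on a comparable scale and accuracy rises to 0.94.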
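
To see which features drive the unscaled distances, one can list the raw value range of every Wine feature. This is a small diagnostic sketch; the variable names are arbitrary.

```python
from sklearn.datasets import load_wine
import numpy as np

wine = load_wine()

# Range (max - min) of each raw feature, largest first
ranges = wine.data.max(axis=0) - wine.data.min(axis=0)
order = np.argsort(ranges)[::-1]

for i in order:
    print(f"{wine.feature_names[i]:<30} range = {ranges[i]:.2f}")

widest = wine.feature_names[order[0]]
print("Feature with the widest range:", widest)
```

The feature at the top of this list contributes the largest raw differences to a Euclidean distance, which is why standardization changes the KNN result so much.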

In [None]:
Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.

In [1]:


from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Standardize features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (keep all components for analysis)
pca = PCA(n_components=X.shape[1])
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"PC{i+1}: {ratio:.4f}")

# Also print cumulative variance for clarity
print("\nCumulative Explained Variance:")
print(np.cumsum(pca.explained_variance_ratio_))


Explained Variance Ratio of each Principal Component:
PC1: 0.3620
PC2: 0.1921
PC3: 0.1112
PC4: 0.0707
PC5: 0.0656
PC6: 0.0494
PC7: 0.0424
PC8: 0.0268
PC9: 0.0222
PC10: 0.0193
PC11: 0.0174
PC12: 0.0130
PC13: 0.0080

Cumulative Explained Variance:
[0.36198848 0.55406338 0.66529969 0.73598999 0.80162293 0.85098116
 0.89336795 0.92017544 0.94239698 0.96169717 0.97906553 0.99204785
 1.        ]


In [None]:
Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.
(Include your Python code and output in the code box below.)

In [2]:


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

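# Standardize features
# (note: the scaler is fitted on the full dataset before splitting, so test-set
# statistics leak into training; fit on X_train only for a stricter evaluation)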
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# -------------------------
# 1. KNN on ORIGINAL dataset (all 13 standardized features)
# -------------------------
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train, y_train)
y_pred_original = knn_original.predict(X_test)
acc_original = accuracy_score(y_test, y_pred_original)

# -------------------------
# 2. PCA Transformation (top 2 components)
# -------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

# Print results
print("KNN Accuracy on ORIGINAL dataset: {:.2f}".format(acc_original))
print("KNN Accuracy on PCA (2 components): {:.2f}".format(acc_pca))


KNN Accuracy on ORIGINAL dataset: 0.94
KNN Accuracy on PCA (2 components): 0.96


In [None]:
Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.
(Include your Python code and output in the code box below.)


In [3]:


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

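# Scale features
# (note: the scaler is fitted on the full dataset before splitting, so test-set
# statistics leak into training; fit on X_train only for a stricter evaluation)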
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# -------------------------
# 1. KNN with EUCLIDEAN distance (p=2)
# -------------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# -------------------------
# 2. KNN with MANHATTAN distance (p=1)
# -------------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print("KNN Accuracy with Euclidean distance : {:.2f}".format(acc_euclidean))
print("KNN Accuracy with Manhattan distance : {:.2f}".format(acc_manhattan))


KNN Accuracy with Euclidean distance : 0.94
KNN Accuracy with Manhattan distance : 0.98


In [None]:
Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
(Include your Python code and output in the code box below.)
                                                                 

In [None]:
Step 1: Use PCA to Reduce Dimensionality

Gene expression datasets often have thousands of features but only tens or hundreds of samples.

PCA helps by extracting the directions (principal components) that capture the most variance, reducing noise and redundancy.

Step 2: Decide How Many Components to Keep

Use explained variance ratio to decide the number of components: keep enough PCs to retain ~90–95% variance.

This reduces dimensionality but keeps most biological signal.
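
The variance-threshold rule can be sketched as follows. Because no real gene expression data is available here, the example uses the Wine data from the earlier questions as a stand-in; the 95% threshold is the assumption stated above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed for 95% variance:", n_keep)  # 10 for the Wine data
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make this choice itself. With few samples it is usually safer to treat the number of components as a hyperparameter and pick it by cross-validation rather than by variance alone.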

Step 3: Use KNN for Classification Post-PCA

KNN then operates in a much lower-dimensional space, where distances are more informative and the risk of overfitting is lower.

Step 4: Evaluate the Model

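Use a stratified train-test split or, preferably with so few samples, stratified k-fold cross-validation, with scaling and PCA fitted inside each fold so that no validation data influences the transform.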

Metrics: accuracy, precision, recall, F1-score, and confusion matrix, depending on clinical relevance.

Step 5: Justify to Stakeholders

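PCA compresses thousands of noisy, correlated gene measurements into a smaller set of components, which tends to improve generalization when samples are scarce.

KNN is easy to explain (a patient is classified by similarity to previously seen patients).

The pipeline limits overfitting and is reproducible when scaling, PCA, and KNN are fitted together inside cross-validation. Its value should still be demonstrated by held-out metrics: on the simulated data below it reaches only 0.47 test accuracy, so tuning and validation on real data are needed before claiming robustness.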


In [4]:
# Example pipeline: PCA + KNN on a simulated high-dimensional gene expression dataset

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Simulate high-dimensional gene expression data
# 200 samples, 1000 features, 3 classes
X, y = make_classification(n_samples=200, n_features=1000, n_informative=50, 
                           n_redundant=50, n_classes=3, random_state=42)

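# Standardize features
# (note: fitting the scaler before the split lets test-set statistics leak into
# training; fit on X_train only for a stricter evaluation)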
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# -------------------------
# PCA for dimensionality reduction
# -------------------------
pca = PCA(n_components=0.95)  # retain 95% variance
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print("Original feature dimension:", X_train.shape[1])
print("Reduced feature dimension:", X_train_pca.shape[1])

# -------------------------
# KNN classification
# -------------------------
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)
y_pred = knn.predict(X_test_pca)

# Evaluate
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

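# Optional: cross-validation for robustness
# (note: PCA was already fitted on all of X_train, so every validation fold
# influenced the projection; a Pipeline refitted per fold avoids this)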
cv_scores = cross_val_score(knn, X_train_pca, y_train, cv=5)
print("\n5-fold CV Accuracy: {:.2f} ± {:.2f}".format(cv_scores.mean(), cv_scores.std()))


Original feature dimension: 1000
Reduced feature dimension: 124

Classification Report:

              precision    recall  f1-score   support

           0       0.46      0.57      0.51        21
           1       0.44      0.55      0.49        20
           2       0.56      0.26      0.36        19

    accuracy                           0.47        60
   macro avg       0.49      0.46      0.45        60
weighted avg       0.48      0.47      0.45        60

Confusion Matrix:
 [[12  8  1]
 [ 6 11  3]
 [ 8  6  5]]

5-fold CV Accuracy: 0.39 ± 0.06
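
A stricter variant of this experiment keeps scaling, PCA, and KNN inside one `Pipeline` and tunes the number of components and neighbors by cross-validation, so no step ever sees held-out data. This is a sketch on the same simulated data; the candidate grids below are assumptions chosen for illustration, and no particular accuracy is implied.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as above: 200 samples, 1000 features, 3 classes
X, y = make_classification(n_samples=200, n_features=1000, n_informative=50,
                           n_redundant=50, n_classes=3, random_state=42)

# Split the raw data first; all preprocessing is learned from the training part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search grid; every candidate is refitted inside each fold
param_grid = {
    "pca__n_components": [10, 20, 50],
    "knn__n_neighbors": [3, 5, 7, 9],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters :", search.best_params_)
print("Best CV accuracy: {:.2f}".format(search.best_score_))
print("Test accuracy   : {:.2f}".format(test_accuracy))
```

Because the test set is touched only once, after model selection, the final score is a fairer estimate of how the pipeline would behave on new patients.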
