# Question 1: What is K-Nearest Neighbors (KNN) and how does it work in classification and regression?

KNN is a simple supervised learning algorithm used for classification and regression.

- It stores all training data.
- To predict a new point, it finds the K closest points (neighbors) using a distance metric (like Euclidean).
- For classification, it assigns the most common class among the neighbors.
- For regression, it averages the values of the neighbors.

Example for classification:  
New point’s 3 nearest neighbors have labels A, A, B → predict class A.

Example for regression:  
Nearest neighbors’ values are 10, 12, 14 → predicted value = (10+12+14)/3 = 12.
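
The two examples above can be reproduced with a minimal from-scratch sketch. The function name `knn_predict` and the toy data are assumptions made for illustration; this is not how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point from its k nearest training points (Euclidean)."""
    # Distance from the new point to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: mean of the neighbors' values
    return float(np.mean(neighbor_targets))


# Hypothetical 1-D data: the three nearest neighbors of 2.0 are the first three points
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
labels = ["A", "A", "B", "B", "B"]
values = [10.0, 12.0, 14.0, 50.0, 52.0]

x_new = np.array([2.0])
print(knn_predict(X_train, labels, x_new, k=3))                     # A
print(knn_predict(X_train, values, x_new, k=3, task="regression"))  # 12.0
```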


# Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

The Curse of Dimensionality refers to problems when data has many features (high dimensions):

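- Data becomes sparse, and the distances to the nearest and farthest points become nearly equal, so "closest" carries less meaning.
- KNN relies on distance, so it struggles to find truly similar neighbors in high dimensions.
- The amount of data needed to cover the feature space grows rapidly (roughly exponentially) with the number of features.
- Each distance computation costs more as features are added.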

To improve, use feature selection or dimensionality reduction (like PCA) to reduce dimensions.
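
The loss of distance contrast can be seen in a small simulation. This is a rough sketch on uniformly random points, with `relative_contrast` being a helper name made up for this example; real data usually has more structure than uniform noise.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_points, n_dims):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


# The contrast shrinks as the number of dimensions grows
contrasts = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:4d}  relative contrast={c:.2f}")
```

When the contrast approaches zero, the nearest neighbor is barely closer than any other point, which is why KNN degrades.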


# Question 3: What is Principal Component Analysis (PCA)? How is it different from Feature Selection?

Principal Component Analysis (PCA) is a method used to reduce the number of features in a dataset by creating new features (called principal components) that capture the most important information (variance) in the data.

PCA is unsupervised, meaning it doesn't use the target labels. It transforms the data into a new coordinate system where the first few components keep most of the original information.

Steps in PCA:
1. Standardize the data
2. Calculate the covariance matrix
3. Compute eigenvectors and eigenvalues
4. Choose top components
5. Project data onto these components
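
The five steps above can be sketched directly in NumPy. This is an illustrative version only (the function name `pca_fit_transform` and the synthetic data are assumptions); scikit-learn's `PCA` uses an SVD-based routine and does not standardize for you.

```python
import numpy as np


def pca_fit_transform(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (illustrative only)."""
    # 1. Standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition (eigh is intended for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue, descending, and keep the top components
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    explained_ratio = eigenvalues[order] / eigenvalues.sum()
    # 5. Project the data onto the selected components
    return X_std @ components, explained_ratio


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)  # make two features redundant

X_reduced, ratio = pca_fit_transform(X, n_components=2)
print(X_reduced.shape)  # (200, 2)
print(ratio)
```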

PCA is useful when there are many features and you want to reduce noise or make the data simpler for algorithms like KNN.

Difference between PCA and Feature Selection:

- PCA creates new features as combinations of original ones.
- Feature selection keeps some of the original features and removes the rest.
- PCA is unsupervised. Feature selection can be supervised (using the target variable).
- PCA changes the meaning of features, while feature selection keeps them as they are.


# Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

- PCA starts by calculating the covariance matrix of the data to understand how features vary with each other.
- Eigenvectors of the covariance matrix are mutually orthogonal directions in feature space. Ordered by eigenvalue, they define the new axes (principal components), starting with the direction in which the data varies the most.
- Eigenvalues are numbers associated with each eigenvector that measure the amount of variance (spread or information) along that direction.
- In PCA, we find all eigenvectors and their eigenvalues from the covariance matrix.
- The eigenvectors with the largest eigenvalues represent the most important directions where the data has the highest variance.
- By choosing the top eigenvectors (principal components), we reduce the data’s dimensionality but keep most of its important information.
- This process helps in removing noise and redundant features while preserving the structure of the data.
- Without eigenvalues and eigenvectors, PCA would not be able to find the best new axes to represent the data efficiently.

In summary, eigenvectors determine the directions of maximum variance, and eigenvalues tell us how important those directions are.
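
A quick numerical check of that summary, as a sketch on synthetic two-feature data (the data and variable names are assumptions for illustration): the largest eigenvalue of the covariance matrix equals the variance of the data projected onto its eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated features: most of the spread lies along one diagonal direction
x = rng.normal(size=500)
X = np.column_stack([x, x + 0.3 * rng.normal(size=500)])
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order

# Largest eigenvalue and its eigenvector (the first principal component)
top_value = eigenvalues[-1]
top_vector = eigenvectors[:, -1]

# Variance of the data projected onto that eigenvector
projected_variance = np.var(X_centered @ top_vector, ddof=1)

print(top_value, projected_variance)          # the two numbers agree
print(eigenvalues[::-1] / eigenvalues.sum())  # explained variance ratio
```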


# Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

- KNN works by finding the nearest neighbors based on distance between data points.
- When there are many features (high dimensions), distances become less meaningful because of the curse of dimensionality.
- PCA helps by reducing the number of features while keeping most of the important information.
- By applying PCA before KNN, we get a lower-dimensional representation of the data.
- This makes distance calculations more reliable and speeds up the computation.
- The combination can improve KNN’s accuracy on high-dimensional data, although keeping too few components may discard useful signal and lower it.
- So, PCA reduces noise and redundancy, and KNN uses the transformed data to make better predictions.


# Dataset:
Use the Wine Dataset from sklearn.datasets.load_wine().
# Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases

In [50]:
pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.


In [51]:
#import the library 
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

#load the dataset

wine = load_wine() 
x , y = wine.data, wine.target

#train test split
x_train , x_test, y_train, y_test = train_test_split(
    x , y, test_size = 0.2, random_state=42, stratify=y
)

#1. KNN without scaling 
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(x_train, y_train)
y_pred_no_scale = knn_no_scale.predict(x_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

#2 KNN with scaling
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(x_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(x_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

#print results

print('KNN Accuracy without scaling:', acc_no_scale)
print('KNN Accuracy with scaling :', acc_scaled)

KNN Accuracy without scaling: 0.8055555555555556
KNN Accuracy with scaling : 0.9722222222222222


# Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component

In [52]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#load the wine set
wine = load_wine()
x, y = wine.data, wine.target

#standardize the features (important before PCA, which is sensitive to feature scale)
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

#apply PCA
pca = PCA()
x_pca = pca.fit_transform(x_scaled)

#print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
print(pca.explained_variance_ratio_)

#cumulative explained variance
import numpy as np
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("\nCumulative Explained Variance:")
print(cum_var)


Explained Variance Ratio of each Principal Component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]

Cumulative Explained Variance:
[0.36198848 0.55406338 0.66529969 0.73598999 0.80162293 0.85098116
 0.89336795 0.92017544 0.94239698 0.96169717 0.97906553 0.99204785
 1.        ]
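
The cumulative curve is one common way to choose how many components to keep. The sketch below assumes a 95% variance target, which is a convention rather than a rule; it reads the count off the curve and also lets scikit-learn pick it by passing a float to `n_components`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Option 1: smallest number of components whose cumulative variance reaches 95%
cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_keep = int(np.argmax(cum_var >= 0.95)) + 1

# Option 2: pass the target fraction as n_components (a float between 0 and 1)
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print(n_keep, pca_95.n_components_)
```

With the cumulative values printed above, the threshold is first crossed at the tenth component (0.9617), so both options should report 10.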


# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

In [53]:
# import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

#load dataset
wine = load_wine()
x , y = wine.data, wine.target

#train-test split
x_train , x_test, y_train, y_test = train_test_split(
    x, y, test_size = 0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

knn_orig = KNeighborsClassifier(n_neighbors=5)
knn_orig.fit(x_train_scaled, y_train)
y_pred_origin = knn_orig.predict(x_test_scaled)
acc_orig = accuracy_score(y_test, y_pred_origin)

pca = PCA(n_components=2)
x_train_pca = pca.fit_transform(x_train_scaled)
x_test_pca = pca.transform(x_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(x_train_pca, y_train)
y_pred_pca = knn_pca.predict(x_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

#print results

print("KNN Accuracy on Original (Scaled) dataset:" , acc_orig)
print("KNN Accuracy on PCA-reduced (2 components) dataset :" , acc_pca)

KNN Accuracy on Original (Scaled) dataset: 0.9722222222222222
KNN Accuracy on PCA-reduced (2 components) dataset : 0.9166666666666666


# Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.


In [None]:
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features (important for distance-based methods)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ------------------------------
# KNN with Euclidean Distance (p=2)
# ------------------------------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

# ------------------------------
# KNN with Manhattan Distance (p=1)
# ------------------------------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print("KNN Accuracy with Euclidean distance:", acc_euclidean)
print("KNN Accuracy with Manhattan distance:", acc_manhattan)


Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:
- Use PCA to reduce dimensionality
- Decide how many components to keep
- Use KNN for classification post-dimensionality reduction
- Evaluate the model
- Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

# Question 10: PCA + KNN on High-Dimensional Gene Expression Dataset

We have a dataset with many more **features (genes)** than **samples (patients)**.  
This causes overfitting for traditional models.  

We will:
1. Use **PCA** to reduce dimensionality.  
2. Tune the **number of principal components**.  
3. Apply **KNN classification**.  
4. Evaluate performance with **nested cross-validation**.  


In [None]:
pip install matplotlib

In [None]:
# Step 1: Import libraries
import numpy as np
from sklearn.datasets import load_wine   # placeholder, replace with your gene data
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score

# (For plotting results later if needed)
import matplotlib.pyplot as plt


### Step 2: Load dataset

Here we use the **Wine dataset** (13 features, 178 samples) as a placeholder.  
In your case, replace this with your **gene expression matrix**:
- `X` → rows = patients, columns = genes  
- `y` → labels (cancer type)  


In [None]:
# Load Wine dataset (replace this with your gene expression data)
wine = load_wine()
X, y = wine.data, wine.target

print("Shape of data:", X.shape)   # (samples, features)
print("Number of classes:", len(np.unique(y)))


### Step 3: Build a pipeline

Pipeline:
- StandardScaler → PCA → KNN  

We will tune:
- Number of PCA components  
- Number of neighbors in KNN  
- Distance metric (Euclidean vs Manhattan)  


In [None]:
# Build pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(svd_solver="full", random_state=42)),
    ("knn", KNeighborsClassifier())
])

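# Hyperparameter grid
# PCA cannot return more components than min(n_samples, n_features) of the data
# it is fitted on. With 5-fold outer and 5-fold inner CV, the smallest training
# fold holds roughly 64% of the samples, so larger candidates are filtered out.
max_components = min(X.shape[1], int(X.shape[0] * 0.8 * 0.8) - 1)
param_grid = {
    "pca__n_components": [n for n in [5, 10, 20, 30, 50, 75, 100] if n <= max_components],
    "knn__n_neighbors": [3, 5, 7, 9, 11, 15],
    "knn__metric": ["minkowski"],
    "knn__p": [1, 2],   # 1 = Manhattan, 2 = Euclidean
    "knn__weights": ["uniform", "distance"]
}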


### Step 4: Nested Cross-Validation

- **Inner loop (GridSearchCV):** finds the best parameters.  
- **Outer loop:** gives unbiased performance estimate.  
- We use **Macro-F1 score** (fair to all classes).  


In [None]:
# Inner CV for hyperparameter tuning
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scorer: Macro-F1 (good for imbalanced classes)
scorer = make_scorer(f1_score, average="macro")

# GridSearch with inner CV
gs = GridSearchCV(pipeline, param_grid, scoring=scorer, cv=inner, n_jobs=-1)

# Outer CV for unbiased evaluation
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)

# Run nested CV
scores = cross_val_score(gs, X, y, scoring=scorer, cv=outer, n_jobs=-1)

print(f"Macro-F1 (Nested CV): {np.mean(scores):.3f} ± {np.std(scores):.3f}")


### Step 5: Justification for stakeholders

- **Dimensionality reduction:** PCA compresses thousands of gene features into a small set of informative signals.  
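- **Controls overfitting:** Fewer input dimensions leave less room to fit noise, and nested CV keeps tuning separate from evaluation, so the reported score is not optimistically biased.  
- **KNN simplicity:** Non-parametric, easy to update with new patients, and each prediction can be explained by pointing to the most similar known cases.  
- **Clinical trust:** PCs can be mapped back to genes through their loadings → pathway analysis → biological interpretability.  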
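
The mapping from PCs back to original features can be sketched with the loadings stored in `pca.components_`. The Wine dataset again stands in for gene expression data, and showing three features per component is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()  # placeholder for a gene expression matrix
feature_names = np.array(wine.feature_names)  # would be gene identifiers

X_scaled = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=2).fit(X_scaled)

top_features = {}
for i, loadings in enumerate(pca.components_):
    # Rank features by the absolute size of their loading on this component
    top_idx = np.argsort(np.abs(loadings))[::-1][:3]
    top_features[f"PC{i + 1}"] = list(feature_names[top_idx])
    print(f"PC{i + 1}:", [(feature_names[j], round(float(loadings[j]), 3)) for j in top_idx])
```

In a real study, the genes with the largest absolute loadings would be the candidates to pass on to pathway analysis.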
