## Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

**Answer:**  
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for classification and regression.  
- **Classification:** It assigns a class label based on the majority vote of the k closest data points in the feature space.  
- **Regression:** It predicts a continuous value by averaging the target values of the k nearest neighbors.  
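KNN relies on a distance metric (e.g., Euclidean) to determine proximity, so features should be on comparable scales. It is a lazy learner: fitting only stores the training data and all distance computations happen at prediction time, which makes it simple to apply but increasingly expensive to query as the dataset grows.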
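
The voting and averaging steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for `KNeighborsClassifier`: the function name `knn_predict` and the toy arrays are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for a single query point using the k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy data (invented for illustration): two clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.1, 1.0])
label = knn_predict(X_toy, y_class, query, k=3, task="classification")
value = knn_predict(X_toy, y_value, query, k=3, task="regression")
print("Predicted class:", label)
print("Predicted value:", value)
```

For this query the three nearest points all belong to class 0, so the vote returns 0, and the regression estimate is the mean of 10, 12 and 11, which is 11.0.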

---

## Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

**Answer:**  
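The Curse of Dimensionality refers to the way data becomes increasingly sparse as the number of features grows: the volume of the feature space grows exponentially with the dimension, so a fixed number of samples covers it ever more thinly.  
- In high dimensions, the distances from a point to its nearest and farthest neighbors become nearly equal, making it hard for KNN to identify meaningful neighbors.  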
- It leads to poor generalization, increased computational cost, and reduced accuracy.  
Dimensionality reduction techniques like PCA are often used to mitigate this issue.
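
The distance-concentration effect can be observed directly with random data. The sketch below is an illustration under assumed conditions (uniformly distributed points, a hypothetical helper named `distance_contrast`); real datasets with structure may behave less extremely.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest neighbor of one query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    d = np.linalg.norm(points - query, axis=1)
    # Large values: nearest and farthest points are clearly distinguishable.
    # Values near zero: all points are almost equally far away.
    return (d.max() - d.min()) / d.min()

for dims in (2, 10, 100, 1000):
    print(dims, "dimensions -> contrast", round(distance_contrast(500, dims), 3))
```

The contrast shrinks as the dimension grows, which is why "nearest" loses much of its meaning for KNN in very high-dimensional spaces.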

---

## Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Answer:**  
PCA is a statistical technique that transforms the original features into a new set of orthogonal components, ordered so that each successive component captures the maximum remaining variance.  
- **PCA:** Creates new features (principal components) based on linear combinations of original features.  
- **Feature Selection:** Chooses a subset of existing features based on relevance.  
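In short, PCA is a feature extraction method whose components mix all original features and are therefore harder to interpret, whereas feature selection keeps a subset of the original features along with their meaning.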
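
The contrast can be made concrete with scikit-learn. This sketch uses `SelectKBest` with the ANOVA F-score as one possible selection method (an assumption made for illustration; many other selectors exist) and compares it with a two-component PCA on the Wine data.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X_wine)

# Feature selection: keep 2 of the 13 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_std, y_wine)
kept = selector.get_support(indices=True)

# Feature extraction: build 2 new columns, each a mix of all 13 originals
pca_demo = PCA(n_components=2)
X_extracted = pca_demo.fit_transform(X_std)

print("Selected column indices:", kept)
print("Selected columns are copies of originals:",
      np.allclose(X_selected, X_std[:, kept]))
print("PCA loadings shape (components x original features):",
      pca_demo.components_.shape)
```

Both results have two columns, but the selected columns are exact copies of existing features, while each PCA column has a loading on every original feature.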

---

## Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?

**Answer:**  
- **Eigenvectors:** The eigenvectors of the data's covariance matrix give the directions of the new feature axes (principal components).  
- **Eigenvalues:** Each eigenvalue is the variance of the data along its corresponding eigenvector.  
In PCA, eigenvectors define the principal components, and eigenvalues rank them by importance. Keeping the components with the largest eigenvalues retains the most variance for a given number of components.
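
The roles of eigenvectors and eigenvalues can be checked by carrying out PCA by hand. The following is a minimal sketch on synthetic two-feature data (generated here purely for illustration) using the covariance-matrix route; library implementations typically use an SVD instead, which gives equivalent components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with correlated features (illustrative only)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
X_demo = np.column_stack([x1, x2])

# 1. Center the data, then form the covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigen-decomposition (eigh is intended for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Share of total variance captured by each component
explained_ratio = eigenvalues / eigenvalues.sum()

# 5. Project the data onto the principal axes
scores = X_centered @ eigenvectors

print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_ratio)
```

The sample variance of each column of `scores` equals the matching eigenvalue, which is the sense in which eigenvalues measure the variance captured along each eigenvector.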

---

## Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

**Answer:**  
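PCA reduces dimensionality, which mitigates the Curse of Dimensionality and discards low-variance directions that are often noise.  
KNN then operates in a lower-dimensional space, where distances are more meaningful and neighbor searches are cheaper.  
Together, they form a simple pipeline (with features standardized first, since both steps are scale-sensitive):  
1. PCA extracts a small set of high-variance components.  
2. KNN classifies based on those components.  
The benefit is largest on high-dimensional datasets. Because PCA ignores the labels, the retained components are not guaranteed to be the most discriminative, so the number of components should be validated.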
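
One way to express these steps is a scikit-learn pipeline evaluated with cross-validation, so that the scaler and PCA are refit on each training fold only. The choices below (Wine data, 95% retained variance, 5 neighbors, 5 folds) are assumptions for illustration rather than tuned settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Scale -> PCA -> KNN, bundled so every step is fit on training folds only
pca_knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),              # keep components explaining 95% of variance
    KNeighborsClassifier(n_neighbors=5),
)

cv_scores = cross_val_score(pca_knn, X_wine, y_wine, cv=5)
print("Fold accuracies:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())
```

Wrapping the steps in one estimator also makes it straightforward to tune the number of components and neighbors together later on.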

---



In [1]:
# Q6: KNN Classifier on Wine Dataset – With and Without Feature Scaling
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
data = load_wine()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
knn_raw = KNeighborsClassifier()
knn_raw.fit(X_train, y_train)
y_pred_raw = knn_raw.predict(X_test)
print("Accuracy without scaling:", accuracy_score(y_test, y_pred_raw))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
print("Accuracy with scaling:", accuracy_score(y_test, y_pred_scaled))


Accuracy without scaling: 0.7222222222222222
Accuracy with scaling: 0.9444444444444444
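
The gap between the two accuracies comes from how Euclidean distance treats features on different scales. The sketch below uses invented values for two Wine-like features (alcohol in percent, proline in the hundreds) and assumed per-feature standard deviations; none of these numbers are taken from the dataset.

```python
import numpy as np

# Hypothetical samples: [alcohol, proline]
query = np.array([13.0, 1000.0])
candidates = np.array([[14.8, 1005.0],   # very different alcohol, similar proline
                       [13.1, 1060.0]])  # similar alcohol, moderately different proline

# Assumed per-feature standard deviations used for scaling (illustrative)
feature_std = np.array([0.8, 315.0])

# Raw distances are driven almost entirely by the large-scale proline column
raw_dist = np.linalg.norm(candidates - query, axis=1)

# Dividing differences by the standard deviation is equivalent to
# standardizing both points first (the mean cancels in the difference)
scaled_dist = np.linalg.norm((candidates - query) / feature_std, axis=1)

print("Nearest by raw distance:", int(np.argmin(raw_dist)))
print("Nearest after scaling:  ", int(np.argmin(scaled_dist)))
```

On raw values the first candidate is nearest because only proline matters; after scaling, the second candidate is nearest because both features contribute comparably.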


In [3]:
# Q7: PCA Explained Variance Ratio
from sklearn.decomposition import PCA

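# Apply PCA to the raw (unscaled) features.
# Note: without standardization, large-scale features such as proline
# dominate the variance, so the first component absorbs almost all of it.
pca = PCA()
pca.fit(X)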

# Explained variance ratio
print("Explained variance ratio of each component:")
print(pca.explained_variance_ratio_)


Explained variance ratio of each component:
[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]
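
The first ratio above is close to 1 because PCA was fit on unscaled features. As a sketch of the alternative, the snippet below standardizes the data first and reports the cumulative explained variance. The variable names are new to this example, and no output is shown since it was not part of the original run.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)

# Standardize so every feature contributes on the same scale
X_wine_std = StandardScaler().fit_transform(X_wine)

pca_std = PCA().fit(X_wine_std)
cumulative = np.cumsum(pca_std.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print("Explained variance ratio (standardized):")
print(pca_std.explained_variance_ratio_)
print("Components needed for 95% variance:", n_for_95)
```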


In [4]:
# Q8: KNN on PCA-transformed data (Top 2 Components)
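# PCA transform
# Caveat: PCA is fit here on the full, unscaled dataset before splitting, so the
# projection has seen the test rows and is dominated by the largest-scale features.
# A stricter evaluation would split first and fit the scaler and PCA on the training split only.
pca_2 = PCA(n_components=2)
X_pca = pca_2.fit_transform(X)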

# Train-test split
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# KNN on PCA data
knn_pca = KNeighborsClassifier()
knn_pca.fit(X_train_pca, y_train_pca)
y_pred_pca = knn_pca.predict(X_test_pca)
print("Accuracy on PCA-transformed data (2 components):", accuracy_score(y_test_pca, y_pred_pca))


Accuracy on PCA-transformed data (2 components): 0.7222222222222222


In [5]:
# Q9: KNN with Different Distance Metrics
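# Euclidean (equivalent to the default metric='minkowski' with p=2,
# so this reproduces the scaled result from Q6)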
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
print("Accuracy with Euclidean distance:", accuracy_score(y_test, y_pred_euclidean))

# Manhattan
knn_manhattan = KNeighborsClassifier(metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
print("Accuracy with Manhattan distance:", accuracy_score(y_test, y_pred_manhattan))


Accuracy with Euclidean distance: 0.9444444444444444
Accuracy with Manhattan distance: 0.9444444444444444
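
Both metrics are special cases of the Minkowski distance, with order p=1 for Manhattan and p=2 for Euclidean. A minimal sketch of the computation, using a hypothetical helper `minkowski_distance` and two made-up vectors:

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

manhattan = minkowski_distance(a, b, p=1)  # sum of absolute differences
euclidean = minkowski_distance(a, b, p=2)  # straight-line distance

print("Manhattan:", manhattan)
print("Euclidean:", euclidean)
```

For this pair the coordinate differences are (3, 4, 0), giving 7 for Manhattan and 5 for Euclidean. The two metrics can rank neighbors differently, although on this test split they happened to produce the same accuracy.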


In [6]:
# Q10: High-Dimensional Gene Expression Pipeline (Simulated Example)
from sklearn.datasets import make_classification

# Simulate high-dimensional data
X_gene, y_gene = make_classification(n_samples=100, n_features=1000, n_informative=50, random_state=42)

# Split data
X_train_gene, X_test_gene, y_train_gene, y_test_gene = train_test_split(X_gene, y_gene, test_size=0.2, random_state=42)

# Scale data
scaler_gene = StandardScaler()
X_train_gene_scaled = scaler_gene.fit_transform(X_train_gene)
X_test_gene_scaled = scaler_gene.transform(X_test_gene)

# PCA: Keep components explaining 95% variance
pca_gene = PCA(n_components=0.95)
X_train_gene_pca = pca_gene.fit_transform(X_train_gene_scaled)
X_test_gene_pca = pca_gene.transform(X_test_gene_scaled)

# KNN classification
knn_gene = KNeighborsClassifier()
knn_gene.fit(X_train_gene_pca, y_train_gene)
y_pred_gene = knn_gene.predict(X_test_gene_pca)

# Evaluation (the test split holds only 20 samples, so each prediction shifts accuracy by 0.05)
print("Accuracy on gene expression data after PCA:", accuracy_score(y_test_gene, y_pred_gene))
print("Number of components retained:", pca_gene.n_components_)


Accuracy on gene expression data after PCA: 0.65
Number of components retained: 73
