**1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?**

K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm
that makes predictions based on the similarity between data points.

It works by:

- Finding the K training points nearest to the query, using a distance metric such as Euclidean distance
- Assigning the most frequent class among those neighbors for classification
- Predicting the average of the neighbors' target values for regression
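The three steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for a single query point with a brute-force KNN (illustrative helper)."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))


# Toy data (made up for illustration): two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

query = np.array([1.1, 1.0])
predicted_class = knn_predict(X_train, y_class, query, k=3)
predicted_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print(predicted_class, predicted_value)
```

The query sits inside the first cluster, so its three nearest neighbors all carry class 0 and the regression prediction is the mean of their three target values.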

---

**2. What is the Curse of Dimensionality and how does it affect KNN performance?**

The Curse of Dimensionality refers to the challenges that arise when the number of
features in a dataset becomes very large.

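It affects KNN by:

- Making data points sparse in high-dimensional space, so even the "nearest" neighbors can be far away
- Making the nearest and farthest distances nearly equal, so distance measures lose their ability to discriminate
- Increasing computational cost and, as a result of the points above, often reducing accuracy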
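The loss of contrast between near and far points can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube, which is an idealized setting; the helper name `relative_contrast` is made up for this example.

```python
import numpy as np


def relative_contrast(n_dims, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

A large contrast means the nearest neighbor is much closer than the farthest point. As the number of dimensions grows the contrast should shrink, which is why a neighbor-based vote becomes less informative.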

---

**3. What is Principal Component Analysis (PCA)? How is it different from feature selection?**

Principal Component Analysis (PCA) is a dimensionality reduction technique that
transforms original features into a smaller set of new, uncorrelated variables
called principal components.

PCA is different from feature selection because:

- PCA creates new features by combining the original ones
- Feature selection keeps a subset of the original features without transformation
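The contrast can be made concrete with scikit-learn. In this sketch, `SelectKBest` with the `f_classif` score is just one possible feature selection method, chosen for illustration, and keeping 2 columns is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Feature selection: keep 2 of the original 13 columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
kept_columns = selector.get_support(indices=True)

# PCA: build 2 new columns, each a weighted combination of all 13 originals
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X_scaled)

print("Selected original column indices:", kept_columns)
print("Selected columns are copies of originals:",
      np.array_equal(X_selected, X_scaled[:, kept_columns]))
print("PCA loadings shape (components x original features):", pca.components_.shape)
```

The selected columns remain interpretable as original measurements, while each PCA component has a loading on every original feature.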

---

**4. What are eigenvalues and eigenvectors in PCA, and why are they important?**

In PCA, the eigenvectors of the data's covariance matrix give the directions
along which the data varies most, while each eigenvalue is the amount of
variance captured along its eigenvector.

They are important because:

- Eigenvectors define the principal components (the new axes the data is projected onto)
- Eigenvalues rank those components and help decide which ones to keep
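A minimal sketch of the underlying computation, using NumPy on synthetic data made up for this example. It decomposes the covariance matrix directly; library implementations such as scikit-learn's `PCA` typically use an SVD instead, which gives equivalent components.

```python
import numpy as np

# Synthetic correlated 2-D data (made up for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=200)])

# Center the data and form the covariance matrix
centered = data - data.mean(axis=0)
covariance = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]          # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]          # columns are principal directions

explained_variance_ratio = eigenvalues / eigenvalues.sum()
projected = centered @ eigenvectors            # coordinates along the principal axes

print("Eigenvalues:", eigenvalues)
print("Explained variance ratio:", explained_variance_ratio)
```

The variance of the data along each projected axis equals the corresponding eigenvalue, which is why eigenvalues can be read directly as "variance captured".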

---

**5. How do KNN and PCA complement each other when applied in a single pipeline?**

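PCA reduces dimensionality and can filter out low-variance noise,
which tends to make distance calculations more meaningful.

KNN benefits from the reduced feature space through faster neighbor
searches and, in many cases, better accuracy. The gain is not guaranteed:
PCA is unsupervised, so a low-variance component it discards can still
carry class information.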
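One way to combine the two steps is a scikit-learn pipeline, sketched below on the Wine dataset. The choice of 2 components, 5 neighbors, and a stratified 70/30 split are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling and PCA are fit on the training split only, then reused on the test split
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),          # 2 components is an arbitrary illustrative choice
    KNeighborsClassifier(n_neighbors=5),
)
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Pipeline test accuracy:", test_accuracy)
```

Wrapping the steps in one estimator means `fit` never sees the test rows, so the scaler statistics and principal axes are learned from training data alone.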

---

**10. You are working with a high-dimensional gene expression dataset to classify cancer types. Explain your approach.**

PCA would be used to reduce thousands of gene features into a smaller number
of principal components, removing noise and redundancy.

The number of components would be chosen based on explained variance,
typically retaining 90–95% of the total variance.

KNN would then be applied on the reduced feature space for classification.

Model performance would be evaluated using cross-validation and metrics such as
accuracy, precision, recall, and ROC-AUC.

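This pipeline lowers the risk of overfitting and tends to improve
generalization on biomedical data with many features and few samples,
provided that scaling and PCA are refit inside each cross-validation fold
so that no information from the held-out samples leaks into the transform.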
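The approach could look roughly like the sketch below. No gene expression data is part of this document, so `make_classification` generates a synthetic stand-in; the sample count, feature count, class count, 95% variance threshold, and `n_neighbors=5` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
X_genes, y_types = make_classification(
    n_samples=120, n_features=2000, n_informative=30,
    n_classes=3, random_state=0,
)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),   # keep 95% of the variance
    KNeighborsClassifier(n_neighbors=5),
)

# Refitting the whole pipeline in every fold keeps test folds out of scaling and PCA
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metric_names = ["accuracy", "precision_macro", "recall_macro", "roc_auc_ovr"]
scores = cross_validate(model, X_genes, y_types, cv=cv, scoring=metric_names)

for name in metric_names:
    print(f"{name}: {scores['test_' + name].mean():.3f}")
```

Macro-averaged precision and recall and one-vs-rest ROC-AUC are used here because the problem is multiclass; with real data, the number of neighbors and the variance threshold would normally be tuned inside the same cross-validation.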

---


# Practical Questions

In [1]:
# 6. Train a KNN Classifier on the Wine dataset with and without feature scaling.
# Compare model accuracy in both cases.

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Without scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc_no_scale = accuracy_score(y_test, knn.predict(X_test))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn.predict(X_test_scaled))

print("Accuracy without scaling:", acc_no_scale)
print("Accuracy with scaling:", acc_scaled)


Accuracy without scaling: 0.7407407407407407
Accuracy with scaling: 0.9629629629629629
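A likely reason for the gap is that the Wine features live on very different numeric scales, so the unscaled Euclidean distance is dominated by the features with the largest spread. The sketch below simply ranks the features by standard deviation to check this; the variable names are made up for this example.

```python
from sklearn.datasets import load_wine

wine = load_wine()
feature_std = wine.data.std(axis=0)

# Features sorted from largest to smallest standard deviation
ranked = sorted(zip(wine.feature_names, feature_std),
                key=lambda pair: pair[1], reverse=True)
for name, std in ranked:
    print(f"{name:<30} std = {std:10.3f}")

largest_scale_feature = ranked[0][0]
```

After `StandardScaler`, every feature has unit variance, so each one contributes comparably to the distance.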


In [2]:
# 7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale data
X_scaled = StandardScaler().fit_transform(X)

# PCA
pca = PCA()
pca.fit(X_scaled)

print("Explained Variance Ratio:")
print(pca.explained_variance_ratio_)


Explained Variance Ratio:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
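The ratios become easier to act on once accumulated. The sketch below reuses the values printed above to find how many leading components are needed to reach a given share of the variance; the helper `components_needed` is made up for this example.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([
    0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294, 0.04935823,
    0.04238679, 0.02680749, 0.02222153, 0.01930019, 0.01736836, 0.01298233,
    0.00795215,
])

cumulative = np.cumsum(ratios)


def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    return int(np.argmax(cumulative >= threshold) + 1)


for threshold in (0.90, 0.95):
    print(f"{threshold:.0%} of variance: {components_needed(threshold)} components")
```

With these values, 8 components reach 90% and 10 components reach 95% of the variance, while the first two components together account for about 55%.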


In [4]:
# 8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare
# the accuracy with the original dataset.

from sklearn.decomposition import PCA

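# PCA with 2 components
# Note: X_scaled and this PCA are fit on the full dataset before the split,
# so the test rows influence the transform. For a strict evaluation, fit the
# scaler and PCA on the training split only.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)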

X_train_pca, X_test_pca, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn.predict(X_test_pca))

print("Accuracy with PCA:", acc_pca)


Accuracy with PCA: 0.9814814814814815


In [5]:
# 9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_manhattan = KNeighborsClassifier(metric='manhattan')

knn_euclidean.fit(X_train_scaled, y_train)
knn_manhattan.fit(X_train_scaled, y_train)

acc_eu = accuracy_score(y_test, knn_euclidean.predict(X_test_scaled))
acc_man = accuracy_score(y_test, knn_manhattan.predict(X_test_scaled))

print("Euclidean Accuracy:", acc_eu)
print("Manhattan Accuracy:", acc_man)


Euclidean Accuracy: 0.9629629629629629
Manhattan Accuracy: 0.9629629629629629
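Identical accuracy on a single 54-sample test split does not show that the two metrics always behave the same. A cross-validated comparison is one way to look closer; the sketch below uses `GridSearchCV`, and the list of `n_neighbors` values is a hypothetical grid chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search grid: the k values are arbitrary illustrative choices
param_grid = {
    "knn__metric": ["euclidean", "manhattan"],
    "knn__n_neighbors": [3, 5, 7, 9],
}

search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_wine, y_wine)

for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, f"{mean_score:.3f}")
print("Best:", search.best_params_)
```

Because the scaler is part of the pipeline, it is refit on the training portion of every fold, and each metric is scored on five held-out folds rather than one.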
