# **QUESTIONS**

#1. What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

**K-Nearest Neighbors (KNN)** is a simple, non-parametric, and lazy learning algorithm used for both classification and regression. It does not "learn" a model (like a line or tree) during training; instead, it memorizes the entire training dataset. Predictions are made at runtime by finding the k closest data points to the new input.

- **In Classification:**

  -  The algorithm identifies the k nearest neighbors to the query point.

  -  It looks at the class labels of these neighbors.

  -  It assigns the class that is most frequent (mode) among the neighbors to the query point (Majority Voting).

- **In Regression:**

  -  The algorithm identifies the k nearest neighbors.

  -  It looks at the continuous values (targets) of these neighbors.

  -  It predicts the average (mean) of these values for the query point.
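
The two prediction rules above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Predict for a single query point using brute-force Euclidean KNN."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Distance from the query to every stored training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote: the most common label among the neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))


# Toy data (hypothetical, for illustration only)
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0], [8.5, 9.0]]
labels = ["a", "a", "a", "b", "b"]
values = [10.0, 12.0, 14.0, 50.0, 54.0]

print(knn_predict(X_toy, labels, [1.2, 1.4], k=3))                     # a
print(knn_predict(X_toy, values, [1.2, 1.4], k=3, task="regression"))  # 12.0
```

Note that "training" here is just storing `X_train` and `y_train`; all the work happens inside the prediction call.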

#2. What is the Curse of Dimensionality and how does it affect KNN performance?

The **Curse of Dimensionality** refers to the various phenomena that arise when analyzing and organizing data in high-dimensional spaces (many features).

- **Effect on KNN:** KNN relies entirely on measuring distances (like Euclidean distance) between data points. In high-dimensional space:

  1. **Distance Concentration:** As dimensions increase, the gap between the distance to the nearest point and the distance to the farthest point shrinks relative to the distances themselves. All points start to look "equally far" away, so the label "nearest neighbor" carries little information.

  2. **Sparseness:** Data becomes incredibly sparse. To cover the space effectively, the amount of data needed grows exponentially with each added dimension.

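  3. **Computational Cost:** A brute-force query computes the distance to every training point, and each distance costs time proportional to the number of features, so prediction slows down as dimensions are added.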
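
The distance effect can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and measures the "relative contrast" (farthest distance minus nearest distance, divided by the nearest distance) from one random query point. The helper name `relative_contrast` and the sample sizes are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


for d in (2, 10, 100, 1000):
    print(f"{d:>5} dims: relative contrast = {relative_contrast(d):.2f}")
```

The contrast should shrink sharply as the number of dimensions grows, which is why the nearest neighbor becomes hard to distinguish from any other point.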

#3. What is Principal Component Analysis (PCA)? How is it different from feature selection?

**Principal Component Analysis (PCA)** is an unsupervised dimensionality reduction technique. It transforms the original variables into a new set of uncorrelated variables called **Principal Components.** These new components are linear combinations of the original features and are ordered by the amount of variance (information) they capture from the data.

**Difference from Feature Selection:**

- **Feature Selection:** Selects a subset of the original features and discards the rest (e.g., keeping "Age" and dropping "Height"). The meaning of the features is preserved.

- **PCA (Feature Extraction):** Creates new features by blending the original ones. For example, PC1 might be `0.5*Age + 0.8*Height`. You lose the original feature names and interpretability, but the retained components keep most of the variance (information).
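
The contrast can be made concrete on the Wine dataset used later in this notebook. The sketch below uses `SelectKBest` with the ANOVA F-score as one possible feature selection method (an assumption; the text above does not prescribe a method) and fits both on the full dataset purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection: keep 2 of the 13 original columns (names preserved)
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, wine.target)
kept_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", kept_names)

# Feature extraction: each component mixes all 13 original columns
pca_demo = PCA(n_components=2).fit(X_std)
print("PC1 weights per original feature:")
for name, weight in zip(wine.feature_names, pca_demo.components_[0]):
    print(f"  {name:<30} {weight:+.3f}")
```

The selector returns two named columns, while PC1 is a weighted sum over every original feature.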

#4. What are eigenvalues and eigenvectors in PCA, and why are they important?

In the context of PCA, these linear algebra concepts define the new feature space:

- **Eigenvectors:** The eigenvectors of the data's covariance matrix give the **directions** of the new axes (the Principal Components). The eigenvector with the largest eigenvalue points along the direction of greatest spread (variance), and each subsequent one is orthogonal to those before it.

- **Eigenvalues:** Each eigenvalue gives the **amount of variance** in the data along its corresponding eigenvector. It measures the importance of that direction, not the length of the eigenvector (eigenvectors are normalized to unit length).

**Importance:** We use eigenvalues to decide how many components to keep. A component with a very small eigenvalue carries very little variance (information) and can usually be dropped to reduce dimensionality.
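
As a rough check of this relationship, the eigen-decomposition can be done by hand with NumPy and compared against scikit-learn. The sketch assumes standardized Wine features and uses the full dataset for illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance carried by each direction
explained_ratio = eigenvalues / eigenvalues.sum()
print("Top 3 eigenvalues:      ", np.round(eigenvalues[:3], 3))
print("Top 3 explained ratios: ", np.round(explained_ratio[:3], 3))

# Cross-check against scikit-learn's PCA
pca_check = PCA().fit(X_std)
print("Matches sklearn variances:", np.allclose(eigenvalues, pca_check.explained_variance_))
```

The eigenvectors agree with `pca_check.components_` only up to sign, since an eigenvector and its negation describe the same axis.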

#5. How do KNN and PCA complement each other when applied in a single pipeline?

KNN and PCA are often paired because PCA mitigates the biggest weakness of KNN: **The Curse of Dimensionality.**

1. **Noise Reduction:** KNN is sensitive to noise. PCA can filter out noise by discarding components with low variance (small eigenvalues), as long as the noise is concentrated in those components rather than in the leading ones.

2. **Improved Metric:** With fewer, uncorrelated dimensions, distance metrics (like Euclidean distance) become more discriminative again, allowing KNN to find true "neighbors" more effectively.

3. **Speed:** KNN is computationally expensive at prediction time. By reducing the number of features (e.g., from 100 to 10), PCA drastically speeds up the distance calculations required by KNN.

# **Dataset:**

Use the **Wine Dataset** from `sklearn.datasets.load_wine()`.

#6. Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.

Scaling is critical for KNN because it calculates distances. If one feature has a large range (e.g., Proline ~1000) and another has a small range (e.g., Hue ~1.0), the large feature will dominate the distance calculation.

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
data = load_wine()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Without Scaling
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
acc_no_scale = accuracy_score(y_test, knn.predict(X_test))

# 2. With Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f"Accuracy without scaling: {acc_no_scale:.4f}")
print(f"Accuracy with scaling:    {acc_scaled:.4f}")

Accuracy without scaling: 0.7407
Accuracy with scaling:    0.9630


**Observation:** Feature scaling improved accuracy substantially on this split (from ~74% to ~96%) because standardization puts all features on a comparable scale, so no single large-range feature such as Proline dominates the distance metric.
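
To see the dominance directly, the sketch below measures what fraction of the squared Euclidean distance between two wines comes from Proline, before and after standardization. The helper `squared_distance_share` and the choice of rows 0 and 100 are arbitrary, made only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_raw = wine.data
X_scaled_demo = StandardScaler().fit_transform(X_raw)


def squared_distance_share(X, i, j):
    """Fraction of the squared Euclidean distance between rows i and j due to each feature."""
    squared_diffs = (X[i] - X[j]) ** 2
    return squared_diffs / squared_diffs.sum()


proline = wine.feature_names.index("proline")
raw_share = squared_distance_share(X_raw, 0, 100)
scaled_share = squared_distance_share(X_scaled_demo, 0, 100)
print(f"Proline share of squared distance, raw:    {raw_share[proline]:.3f}")
print(f"Proline share of squared distance, scaled: {scaled_share[proline]:.3f}")
```

On raw data Proline should account for most of the distance; after scaling, the other twelve features get a real say in who counts as a neighbor.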

#7. Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.

In [2]:
from sklearn.decomposition import PCA

# Fit PCA on the scaled training data
pca = PCA()
pca.fit(X_train_scaled)

print("Explained Variance Ratio of Principal Components:")
print(pca.explained_variance_ratio_)

Explained Variance Ratio of Principal Components:
[0.36196226 0.18763862 0.11656548 0.07578973 0.07043753 0.04552517
 0.03584257 0.02646315 0.02174942 0.01958347 0.01762321 0.01323825
 0.00758114]


**Observation:** Variance is concentrated in the leading components, but not in just two of them: the first 2 components capture about 55% (36.2% + 18.8%) of the total variance, and the first 5 capture about 81%.
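
A cumulative sum makes it easier to read off how many components are needed for a given variance target. The sketch below repeats the same split and scaling as the cells above so it can run on its own; the thresholds of 80%, 90%, and 95% are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_scaled = StandardScaler().fit_transform(X_train)

pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for n, total in enumerate(cumulative, start=1):
    print(f"{n:>2} components: {total:.3f}")

# Smallest number of components whose cumulative ratio reaches each threshold
for threshold in (0.80, 0.90, 0.95):
    n_needed = int(np.argmax(cumulative >= threshold)) + 1
    print(f"{threshold:.0%} variance -> {n_needed} components")
```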

#8. Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.

In [3]:
# Apply PCA to keep only top 2 components
pca_2 = PCA(n_components=2)
X_train_pca = pca_2.fit_transform(X_train_scaled)
X_test_pca = pca_2.transform(X_test_scaled)

# Train KNN on PCA data
knn_pca = KNeighborsClassifier()
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print(f"Accuracy with PCA (2 Components): {acc_pca:.4f}")

Accuracy with PCA (2 Components): 0.9815


**Comparison:**

- **Original Scaled Accuracy:** 96.30%

- **PCA (2 Components) Accuracy:** 98.15%

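**Conclusion:** Reducing the dataset from 13 features down to just 2 did not hurt accuracy on this split, and the score was slightly higher. With 54 test samples, the gap between 96.30% and 98.15% is a single wine (52 vs. 53 correct), so it is weak evidence of a real improvement. It does suggest that the 13 original features are largely redundant for separating the three classes.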
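
One way to check whether the difference holds beyond a single split is cross-validation. The sketch below is an addition to the original exercise: it wraps each variant in a pipeline so that scaling and PCA are refit inside every fold, and uses 5 stratified folds as an assumed setting.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Both variants share the same scaler and classifier; only the PCA step differs
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier())

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_scaled = cross_val_score(scaled_knn, X, y, cv=cv)
scores_pca = cross_val_score(pca_knn, X, y, cv=cv)

print(f"Scaled KNN:      {scores_scaled.mean():.4f} +/- {scores_scaled.std():.4f}")
print(f"Scaled + PCA(2): {scores_pca.mean():.4f} +/- {scores_pca.std():.4f}")
```

If the two means sit within one standard deviation of each other, the single-split gap above should be read as noise rather than as a gain from PCA.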

#9. Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.

In [4]:
# Euclidean Distance
knn_euclidean = KNeighborsClassifier(metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
acc_euclidean = accuracy_score(y_test, knn_euclidean.predict(X_test_scaled))

# Manhattan Distance
knn_manhattan = KNeighborsClassifier(metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
acc_manhattan = accuracy_score(y_test, knn_manhattan.predict(X_test_scaled))

print(f"Accuracy with Euclidean: {acc_euclidean:.4f}")
print(f"Accuracy with Manhattan: {acc_manhattan:.4f}")

Accuracy with Euclidean: 0.9630
Accuracy with Manhattan: 0.9630


**Observation:** Both metrics performed equally well on this specific split of the Wine dataset.
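
For reference, the two metrics differ only in how per-feature differences are combined. Both are special cases of the Minkowski distance (p=2 for Euclidean, p=1 for Manhattan). The sketch below uses two made-up vectors and a hypothetical helper `minkowski` to show the arithmetic.

```python
import numpy as np


def minkowski(u, v, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Per-feature differences are 3, 4, and 0
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2 + 0^2) = 5.0
manhattan = minkowski(a, b, p=1)   # 3 + 4 + 0 = 7.0

print(f"Euclidean: {euclidean:.1f}")
print(f"Manhattan: {manhattan:.1f}")
```

Because Euclidean distance squares each difference, it is more sensitive to one large gap in a single feature, while Manhattan distance treats all gaps linearly.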

#10. You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer. Due to the large number of features and a small number of samples, traditional models overfit. Explain how you would:

1. Use PCA to reduce dimensionality
2. Decide how many components to keep
3. Use KNN for classification post-dimensionality reduction
4. Evaluate the model
5. Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

**Answer:**

1. **Use PCA to reduce dimensionality:** Gene expression datasets often suffer severely from the **Curse of Dimensionality** (e.g., 20,000 genes/features but only 100 patients/samples). After standardizing the features, I would use PCA to project this high-dimensional space onto a lower-dimensional subspace. With 100 samples, at most 99 components can have non-zero variance. The hope is that the leading components concentrate the dominant structure (ideally biological variation), while much of the noise (e.g., technical variation) falls into the low-variance components that are discarded.

2. **Decide how many components to keep:** I would use the **Cumulative Explained Variance** plot. I would select the number of components ($k$) such that they explain a significant threshold of the total variance, typically **90% or 95%.** Alternatively, I could use the "Elbow Method" on the scree plot to find the point where adding more components yields diminishing returns.

3. **Use KNN for classification post-dimensionality reduction:** After fitting the scaler and PCA on the training data only and applying the same transformation to the test data, I would train the KNN classifier on the top $k$ principal components. This matters because KNN distances computed in 20,000 dimensions would be dominated by noise and expensive to compute.

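4. **Evaluate the model:** With so few samples, a single train/test split gives a noisy estimate, so I would use stratified k-fold cross-validation with scaling, PCA, and KNN wrapped in one pipeline. That way each fold refits the preprocessing on its own training portion, which avoids data leakage. I would report per-class precision, recall, and F1 alongside accuracy and inspect the confusion matrix, since cancer types are often imbalanced.

5. **Justification to Stakeholders:** "Our raw data is too complex and noisy for standard models, which leads to 'memorizing' the data rather than learning from it (overfitting). By applying this pipeline, we first 'distill' the strongest patterns in the data using PCA, removing much of the background noise. We then use KNN to classify patients based on these clearer patterns. This gives us a model that is faster and less prone to overfitting, and we verify with cross-validation that it generalizes to new patients."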

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Define the pipeline steps
pipeline = Pipeline([
    ('scaler', StandardScaler()),        # Step 1: Scale features (Crucial for PCA & KNN)
    ('pca', PCA(n_components=0.95)),     # Step 2: Keep components explaining 95% variance
    ('knn', KNeighborsClassifier(n_neighbors=5)) # Step 3: Classifier
])

# Fit the pipeline on high-dimensional gene data
# (X_train_gene, y_train_gene, and X_test_gene are placeholders for the real dataset)
# pipeline.fit(X_train_gene, y_train_gene)
# predictions = pipeline.predict(X_test_gene)
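
Since no gene expression data is available here, the evaluation step can only be sketched on a synthetic stand-in. The example below assumes a dataset generated with `make_classification` (100 samples, 2,000 features, 3 classes; all sizes are arbitrary choices) and scores the same scaler, PCA, and KNN pipeline with stratified 5-fold cross-validation. The numbers it prints say nothing about real biomedical data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data (assumed sizes, not real data)
X_gene, y_gene = make_classification(
    n_samples=100, n_features=2000, n_informative=20, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

gene_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# The whole pipeline is refit inside each fold, so the held-out fold never
# influences the scaler or the PCA projection
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    gene_pipeline, X_gene, y_gene, cv=cv, scoring=["accuracy", "f1_macro"]
)

print(f"Accuracy: {results['test_accuracy'].mean():.3f}")
print(f"Macro F1: {results['test_f1_macro'].mean():.3f}")
```

Macro F1 is included because it weights every class equally, which matters when some cancer types have far fewer patients than others.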