Ques 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Ans: K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a supervised, non-parametric, and instance-based machine learning algorithm. It does not build an explicit model during training. Instead, it stores the entire training dataset and makes predictions based on the similarity (distance) between data points.

How KNN Works (General Steps):

Choose the number of neighbors (K).

Calculate the distance between the new data point and all training points
(commonly Euclidean distance).

Select the K nearest neighbors.

Make a prediction based on these neighbors.

KNN for Classification:

In classification, KNN predicts the class label of a new data point by majority voting.

Steps:

Compute distances between the new point and all training samples.

Select the K closest data points.

Count the class labels among these K neighbors.

Assign the class with the highest frequency.

Example:
If K = 5 and among the 5 neighbors,

3 belong to Class A

2 belong to Class B

➡ The new data point is classified as Class A.
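The voting steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn implements it; the function name `knn_classify` and the toy points are assumptions made up for this example.

```python
from collections import Counter
import math


def knn_classify(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Count the labels of those neighbors and return the most frequent one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): the 5 nearest neighbors of the query are 3 x "A" and 2 x "B"
train_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (2.0, 2.0), (2.1, 1.9), (8.0, 8.0), (9.0, 9.0)]
train_labels = ["A", "A", "A", "B", "B", "B", "B"]

prediction = knn_classify(train_points, train_labels, query=(1.3, 1.3), k=5)
print(prediction)  # prints "A"
```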

KNN for Regression:

In regression, KNN predicts a continuous value by averaging the values of the K nearest neighbors.

Steps:

Compute distances to all training points.

Select the K nearest neighbors.

Take the mean (or weighted mean) of their target values.

Assign this value as the prediction.

Example:
If K = 3 and neighbor values are: 50, 55, 60

➡ Predicted value = (50 + 55 + 60) / 3 = 55
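The same idea for regression, as a minimal sketch in plain Python. The function name `knn_regress`, the `weighted` flag, and the one-dimensional toy data are assumptions for illustration; the three nearest targets are chosen to match the 50, 55, 60 example above.

```python
import math


def knn_regress(train_points, train_targets, query, k, weighted=False):
    """Predict a continuous value as the (optionally distance-weighted) mean of k neighbors."""
    distances = [math.dist(p, query) for p in train_points]
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    if not weighted:
        # Plain mean of the neighbors' target values
        return sum(train_targets[i] for i in nearest) / k
    # Inverse-distance weights; the small constant avoids division by zero
    weights = [1.0 / (distances[i] + 1e-9) for i in nearest]
    return sum(w * train_targets[i] for w, i in zip(weights, nearest)) / sum(weights)


# Toy data (hypothetical): the three points closest to the query have targets 50, 55, 60
train_points = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_targets = [50.0, 55.0, 60.0, 200.0]

prediction = knn_regress(train_points, train_targets, query=(2.0,), k=3)
print(prediction)  # prints 55.0
```

With `weighted=True`, closer neighbors count more, which can help when the K neighbors sit at very different distances from the query.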

Ques 2: What is the Curse of Dimensionality and how does it affect KNN
performance?


Ans: The Curse of Dimensionality refers to the problems that arise when working with high-dimensional data (data with many features). As the number of dimensions increases, the data becomes sparse, and many machine learning algorithms struggle to learn meaningful patterns.

How the Curse of Dimensionality Affects KNN

K-Nearest Neighbors (KNN) relies entirely on distance calculations to find similar data points. In high-dimensional spaces, this becomes problematic.

1. Distance Becomes Less Meaningful

In high dimensions, the distance between the nearest and farthest data points becomes almost the same.

As a result, KNN cannot clearly distinguish between close and distant neighbors.

➡ This reduces the reliability of neighbor selection.
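A small simulation can make this effect visible. The sketch below assumes uniformly random points in a unit hypercube, which is only a stand-in for real data; `relative_contrast` is a hypothetical helper that measures how much farther the farthest point is than the nearest one.

```python
import numpy as np


def relative_contrast(n_dims, n_points=500, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


# The contrast should shrink as the number of dimensions grows
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

A contrast close to zero means the nearest and farthest points are almost equally far away, so "nearest neighbor" carries little information.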

2. Increased Sparsity of Data

With more features, data points spread out and become sparse.

Each data point has fewer nearby neighbors.

➡ KNN predictions become unstable and less accurate.

3. Higher Computational Cost

KNN calculates the distance from a test point to every training point.

The number of distance calculations depends on the training set size, but each calculation becomes more expensive as dimensions are added.

➡ This leads to slower predictions and higher memory usage.

4. Sensitivity to Irrelevant Features

Extra or irrelevant features add noise.

Because every feature contributes to the distance, many irrelevant features can dominate distance calculations.

➡ Important features lose influence, reducing model accuracy.

Ques 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Ans: Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique used to reduce the number of features in a dataset while preserving as much variance (information) as possible.

PCA transforms the original correlated features into a new set of uncorrelated variables called principal components.

Feature Selection

Feature selection is the process of selecting a subset of original features that are most relevant to the prediction task.

It does not create new features; it only keeps or removes existing ones.

| Aspect                    | PCA                                         | Feature Selection                 |
| ------------------------- | ------------------------------------------- | --------------------------------- |
| Type                      | Feature extraction                          | Feature selection                 |
| Supervised/Unsupervised   | Unsupervised                                | Can be supervised or unsupervised |
| Feature creation          | Creates new features (principal components) | Uses original features            |
| Interpretability          | Low (components are combinations)           | High (original features)          |
| Goal                      | Maximize variance                           | Improve model performance         |
| Handles multicollinearity | Yes                                         | Partially                         |
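The contrast in the table can be seen side by side on the Wine dataset. This is an illustrative sketch: `SelectKBest` with `f_classif` is just one of many feature selection methods and is an assumption here, not something prescribed by the text above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature extraction: 2 new columns, each a weighted mix of all 13 original features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: 2 of the original columns, chosen using the labels (supervised)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, y)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]

print("PCA output shape:", X_pca.shape)
print("PCA loadings shape (components x original features):", pca.components_.shape)
print("Selected output shape:", X_selected.shape)
print("Selected original features:", selected_names)
```

Both outputs have two columns, but only the selected columns keep their original names and meaning.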


Ques 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Ans: Eigenvalues and Eigenvectors in PCA

In Principal Component Analysis (PCA), eigenvalues and eigenvectors are mathematical concepts used to identify the most important directions in the data. They help PCA decide which principal components to keep and which to discard.

Eigenvectors

Eigenvectors represent the directions (axes) along which the data varies the most.

In PCA, each eigenvector corresponds to a principal component.

Each eigenvector holds the weights (loadings) of a linear combination of the original features.

Eigenvectors determine the orientation of the new feature space.

➡ Interpretation:
An eigenvector shows where the data is spread the most.

Eigenvalues

Eigenvalues represent the amount of variance carried in the direction of their corresponding eigenvectors.

A larger eigenvalue means more information (variance) is captured.

Eigenvalues help rank principal components in order of importance.

➡ Interpretation:
An eigenvalue shows how important its eigenvector is.

Role of Eigenvalues and Eigenvectors in PCA

PCA computes the covariance matrix of the data.

Eigenvectors and eigenvalues are calculated from this matrix.

Eigenvectors with highest eigenvalues are selected.

These selected eigenvectors form the principal components.

Data is projected onto these components for dimensionality reduction.
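The steps above can be reproduced by hand with NumPy and compared against scikit-learn. This is a sketch on the standardized Wine data; scikit-learn's own `PCA` may use a different internal solver, so only the results are compared, not the method.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# 1. Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# 2. Eigen-decomposition; eigh is meant for symmetric matrices and returns ascending eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort eigenvalues (and matching eigenvector columns) from largest to smallest
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Keep the top 2 eigenvectors and project the data onto them
top2 = eigenvectors[:, :2]
X_projected = X_scaled @ top2

# Each eigenvalue divided by the total gives the explained variance ratio
explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio (top 3):", np.round(explained_ratio[:3], 4))

# Cross-check against scikit-learn's PCA
pca = PCA(n_components=2).fit(X_scaled)
print("Matches sklearn:", np.allclose(explained_ratio[:2], pca.explained_variance_ratio_))
```

The projected coordinates can differ from scikit-learn's by a sign flip per component, because an eigenvector and its negative describe the same direction.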

Why Are They Important in PCA?

Help identify the most informative directions in data

Enable dimensionality reduction with minimal information loss

Reduce noise and redundancy

Improve model efficiency and performance

Remove multicollinearity among features

Simple Example

If PCA produces:

Eigenvalue₁ = 5.2

Eigenvalue₂ = 1.1

Eigenvalue₃ = 0.2

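➡ The first principal component (Eigenvalue₁) explains the most variance: 5.2 / (5.2 + 1.1 + 0.2) = 80% of the total, so it is kept first. The third component explains only about 3% and is the natural candidate to drop.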

Ques 5: How do KNN and PCA complement each other when applied in a single
pipeline?

Ans: KNN and PCA are often used together because PCA helps overcome many limitations of KNN. When combined in a single pipeline, they can improve accuracy, efficiency, and robustness.

Role of PCA in the KNN Pipeline

KNN is a distance-based algorithm, so its performance depends heavily on:

Number of features

Feature scale

Noise and irrelevant variables

PCA addresses these issues before KNN is applied.

Step-by-Step KNN + PCA Pipeline

Data Preprocessing

Handle missing values

Encode categorical variables

Standardize features (important for both PCA and KNN)

Apply PCA

Reduce dimensionality

Decorrelate the features and discard low-variance (often noisy) directions

Retain components that explain most variance

Apply KNN

Compute distances in the reduced feature space

Find nearest neighbors

Perform classification or regression

Model Evaluation

Measure accuracy, precision, recall, RMSE, etc.
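The pipeline steps above map directly onto scikit-learn's `Pipeline`. The sketch below uses the Wine dataset as a stand-in; the step names, the 95% variance threshold, `n_neighbors=5`, and the stratified split are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling -> PCA -> KNN, chained so every step is fitted on training data only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # keep enough components for 95% of the variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)

n_kept = pipeline.named_steps["pca"].n_components_
test_accuracy = pipeline.score(X_test, y_test)
print("Components kept:", n_kept)
print("Test accuracy:", test_accuracy)
```

Wrapping the steps in one object means the scaler and PCA never see the test data during fitting, which avoids data leakage.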

Ques 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.


In [1]:
# Ans 6
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# --------- KNN WITHOUT Feature Scaling ---------
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
knn_no_scale.fit(X_train, y_train)
y_pred_no_scale = knn_no_scale.predict(X_test)

accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)

# --------- KNN WITH Feature Scaling ---------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)

accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print("Accuracy without feature scaling:", accuracy_no_scale)
print("Accuracy with feature scaling:", accuracy_scaled)


Accuracy without feature scaling: 0.7407407407407407
Accuracy with feature scaling: 0.9629629629629629


In [2]:
# Ques 7: Train a PCA model on the Wine dataset and print the explained variance
# ratio of each principal component
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Wine dataset
wine = load_wine()
X = wine.data

# Standardize the data (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (keep all components)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio of each Principal Component:")
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f"Principal Component {i+1}: {var:.4f}")


Explained Variance Ratio of each Principal Component:
Principal Component 1: 0.3620
Principal Component 2: 0.1921
Principal Component 3: 0.1112
Principal Component 4: 0.0707
Principal Component 5: 0.0656
Principal Component 6: 0.0494
Principal Component 7: 0.0424
Principal Component 8: 0.0268
Principal Component 9: 0.0222
Principal Component 10: 0.0193
Principal Component 11: 0.0174
Principal Component 12: 0.0130
Principal Component 13: 0.0080


In [3]:
# Ques 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
# components). Compare the accuracy with the original dataset.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# ------------------ KNN on Original Scaled Data ------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)

accuracy_original = accuracy_score(y_test, y_pred_original)

# ------------------ PCA Transformation (Top 2 Components) ------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Print results
print("Accuracy on original dataset:", accuracy_original)
print("Accuracy on PCA-transformed dataset (2 components):", accuracy_pca)


Accuracy on original dataset: 0.9629629629629629
Accuracy on PCA-transformed dataset (2 components): 0.9814814814814815


In [4]:
# Ques 9: Train a KNN Classifier with different distance metrics (euclidean,
# manhattan) on the scaled Wine dataset and compare the results.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --------- KNN with Euclidean Distance ---------
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)

accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# --------- KNN with Manhattan Distance ---------
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)

accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print("Accuracy with Euclidean distance:", accuracy_euclidean)
print("Accuracy with Manhattan distance:", accuracy_manhattan)


Accuracy with Euclidean distance: 0.9629629629629629
Accuracy with Manhattan distance: 0.9629629629629629


| Distance Metric | Accuracy |
| --------------- | -------- |
| Euclidean       | 0.9630   |
| Manhattan       | 0.9630   |

On this train-test split, both distance metrics give the same accuracy.


Ques 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data



Ans: Dimensionality Reduction and Classification Pipeline for Cancer Detection

Gene expression datasets typically contain thousands of genes (features) but very few patient samples, which makes traditional models prone to overfitting. To handle this, I would use a PCA + KNN pipeline.

1. Using PCA to Reduce Dimensionality

Gene expression features are often highly correlated.

PCA transforms the original gene features into a smaller set of uncorrelated principal components.

These components capture the maximum biological variation in the data while removing noise and redundancy.

Steps:

Normalize gene expression values (standardization).

Apply PCA to the scaled data.

Project samples into the reduced feature space.

➡ This reduces model complexity and improves generalization.

2. Deciding How Many Principal Components to Keep

To choose the optimal number of components:

Explained Variance Ratio

Retain components that explain 90–95% of total variance.

Scree Plot

Look for the elbow point, where additional components add minimal information.

Cross-Validation Performance

Test different numbers of components and select the value that maximizes validation accuracy.

➡ This ensures minimal information loss while preventing overfitting.
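The explained-variance rule can be sketched as follows. No gene expression data is available here, so the Wine dataset is used as a stand-in and the 95% threshold is an assumption taken from the range mentioned above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data; in the gene expression setting X would be samples x genes
X_scaled = StandardScaler().fit_transform(load_wine().data)

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print("Cumulative explained variance:", np.round(cumulative, 4))
print("Components needed for 95% variance:", n_components)

# scikit-learn can apply the same rule when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold).fit(X_scaled)
print("Chosen by PCA(n_components=0.95):", pca_auto.n_components_)
```

Plotting `cumulative` against the component index gives the curve used to spot the elbow point in a scree-style plot.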

3. Using KNN for Classification After PCA

After PCA, data lies in a low-dimensional, noise-reduced space.

KNN classifies patients based on similar gene expression patterns.

Why KNN works well here:

Non-parametric (no strong assumptions about data distribution)

Effective in reduced dimensions

Captures local similarities between patients

Key choices:

Use Euclidean distance

Tune K using cross-validation

Optionally use distance-weighted KNN
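These choices can be tuned together with cross-validation. The sketch below again uses the Wine dataset as a stand-in for gene expression data; the grid values and step names are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier(metric="euclidean")),
])

# Candidate settings (hypothetical): number of components, K, and the voting scheme
param_grid = {
    "pca__n_components": [2, 5, 8],
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Because the search runs over the whole pipeline, the scaler and PCA are refitted inside each fold rather than on the full dataset.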

4. Evaluating the Model

Because misdiagnoses are costly and samples are few, evaluation must be robust:

Cross-validation (e.g., stratified k-fold), with scaling and PCA fitted inside each training fold to avoid data leakage

Accuracy for overall performance

Precision, Recall, F1-score to handle class imbalance

Confusion Matrix to understand misclassifications

ROC-AUC for diagnostic reliability

➡ These metrics ensure the model is clinically meaningful, not just accurate.
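The metrics listed above can be computed from out-of-fold predictions. This sketch uses the Wine dataset as a stand-in; the pipeline settings (2 components, K = 5) and the one-vs-rest macro-averaged ROC-AUC are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X, y = wine.data, wine.target

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Every sample is predicted by a model that did not see it during fitting
y_pred = cross_val_predict(pipeline, X, y, cv=cv)
y_proba = cross_val_predict(pipeline, X, y, cv=cv, method="predict_proba")

cm = confusion_matrix(y, y_pred)
print("Confusion matrix:\n", cm)
print(classification_report(y, y_pred, target_names=wine.target_names))

# Multiclass ROC-AUC, one-vs-rest, macro-averaged over classes
auc = roc_auc_score(y, y_proba, multi_class="ovr")
print("ROC-AUC (one-vs-rest):", auc)
```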

5. Justifying This Pipeline to Stakeholders

To stakeholders (doctors, researchers, management), I would explain:

PCA reduces noise and redundancy, improving model reliability

Lower risk of overfitting due to fewer features

KNN decisions can be explained by pointing to the most similar known cases, although the principal components themselves are harder to interpret than individual genes

Cheaper to compute than KNN on the full set of gene features

Clinically robust due to strong validation strategies

Business & Clinical Value

More reliable predictions

Better generalization to new patients

Potentially fewer false diagnoses, subject to validation on independent patient data

Supports data-driven medical decision-making

Conclusion

Using PCA + KNN creates a robust, efficient, and interpretable pipeline for high-dimensional gene expression data. It balances statistical rigor with clinical practicality, making it suitable for real-world biomedical applications.

“PCA mitigates overfitting by reducing dimensionality, while KNN leverages similarity in gene expression patterns for accurate cancer classification.”