Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

Answer: The KNN algorithm is one of the simplest yet most effective machine learning algorithms. Its strength lies in its versatility: it can be used for both classification and regression tasks, although it is more commonly applied to classification.

KNN is a supervised learning algorithm that predicts outcomes from the closest training data points. In classification, it uses majority voting among the neighbors, while in regression, it averages their values. Despite being simple and intuitive, it requires careful selection of K and proper feature scaling to perform effectively.

Let us take an example to understand how it works.
Suppose we have to predict the weight of a new data point (say, ID12) from the given data points:

1. The distance between the new point and each training point is calculated.
2. The k closest data points are selected based on that distance; for example, the three closest points if k is 3.
3. The average of these data points' weights is the final prediction for the new point.

For classification, the mode (the most common class among the neighbors) would be the final prediction, whereas in regression the average of the neighbors' weights is taken.
To calculate the distance between data points, we can use either Euclidean distance or Manhattan distance.
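
The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used by scikit-learn: the function name `knn_predict`, the toy height/age values, and the weights and labels are all made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="regression", metric="euclidean"):
    """Predict for one new point from its k nearest training points."""
    diffs = np.asarray(X_train, dtype=float) - np.asarray(x_new, dtype=float)

    # Step 1: distance from the new point to every training point
    if metric == "euclidean":
        dists = np.sqrt((diffs ** 2).sum(axis=1))
    elif metric == "manhattan":
        dists = np.abs(diffs).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")

    # Step 2: indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    neighbors = [y_train[i] for i in nearest]

    # Step 3: average for regression, mode (majority vote) for classification
    if task == "regression":
        return float(np.mean(neighbors))
    return Counter(neighbors).most_common(1)[0][0]


# Hypothetical toy data: [height, age] -> weight
X_toy = [[5.0, 45], [5.11, 26], [5.6, 30], [5.9, 34], [4.8, 40]]
weights = [77, 47, 55, 59, 72]
labels = ["heavy", "light", "light", "light", "heavy"]
new_point = [5.5, 38]

predicted_weight = knn_predict(X_toy, weights, new_point, k=3, task="regression")
predicted_label = knn_predict(X_toy, labels, new_point, k=3, task="classification")
print(predicted_weight, predicted_label)
```

Note that the age column dominates the distance here because the features are not scaled, which is exactly why feature scaling matters for KNN.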

Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?

Answer: The Curse of Dimensionality refers to a set of problems that arise when data lives in a high-dimensional space (i.e., when the number of features/variables increases). As the dimensionality increases, distances between points become less meaningful, and algorithms that rely on distance or density, like KNN, face performance issues.
Consider a case to better understand it:

- Model_1 consists of only two features, say the circuit name and the country name.
- Model_2 consists of 4 features: the two above plus, say, the weather and the max speed of the car.
- Model_3 consists of 8 features: all of the above plus, say, the driver's experience, number of wins, car condition, and the driver's physical fitness.
- Model_4 consists of 16 features: all of the above plus, say, the driver's age, latitude, longitude, driver's height, hair color, car color, the car company, and the driver's marital status.
- Model_5 consists of 32 features.
- Model_6 consists of 64 features.
- Model_7 consists of 128 features.
- Model_8 consists of 256 features.
- Model_9 consists of 512 features.
- Model_10 consists of 1024 features.

Assuming the training data remains constant, it is observed that as the number of features increases, accuracy tends to rise up to a certain threshold, after which it plateaus or starts to decrease.
Accuracy:
M1 < M2 < M3, but this trend cannot be extrapolated to the models with more than eight features.

Effect on KNN performance:
1. Loss of distance meaning: When distances converge, KNN struggles to distinguish between near and far neighbors.
2. High risk of overfitting: With sparse data, the nearest neighbors are not truly similar, so KNN may overfit to noise.
3. Computational cost: Distance calculations become more expensive as the number of dimensions grows, slowing the algorithm.
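
The loss of distance meaning can be seen in a small simulation. The sketch below is an illustration under assumptions of my own choosing (uniformly random points in the unit cube, 500 points, a helper named `relative_contrast`); it measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np


def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform points in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


# The contrast shrinks as the number of dimensions grows
for n_dims in (2, 16, 128, 1024):
    print(n_dims, relative_contrast(500, n_dims))
```

When this ratio approaches zero, the "nearest" neighbor is barely closer than any other point, so the neighbors KNN picks carry little information.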

Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

Answer: Principal Component Analysis is a powerful technique used in machine learning for transforming high-dimensional data into a more manageable form. It works by extracting new features known as principal components, which capture the maximum variance in the data. These components are linear combinations of the original features and provide a new coordinate system for the data.
Moreover, PCA simplifies complex data while retaining the information that is essential for the analysis. However, it's important to note that PCA assumes linear relationships between variables, which means it may not perform optimally with nonlinear data.
Nonetheless, it remains a valuable tool for visualizing data and speeding up algorithms by reducing input dimensions.
The steps involved in PCA include:
- data standardization,
- computation of the covariance matrix,
- eigenvalue decomposition,
- selection of principal components based on eigenvalues,
- projection of data onto these components.

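PCA serves as a fundamental technique for dimensionality reduction and feature extraction in machine learning. This is also how it differs from feature selection: feature selection keeps a subset of the original features unchanged, whereas PCA builds new features (the principal components) as linear combinations of all the original features.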
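
The PCA steps listed above can be followed one by one in NumPy. This is a rough sketch for illustration rather than a replacement for `sklearn.decomposition.PCA`; the helper name `pca_fit_transform` and the synthetic three-column dataset are assumptions made for the example, and the code assumes no feature is constant (otherwise standardization would divide by zero).

```python
import numpy as np


def pca_fit_transform(X, n_components):
    X = np.asarray(X, dtype=float)

    # 1. Standardize each feature (zero mean, unit variance)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigenvalue decomposition (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort components by eigenvalue, largest first, and keep the top ones
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained_ratio = eigenvalues / eigenvalues.sum()
    components = eigenvectors[:, :n_components]

    # 5. Project the data onto the selected components
    return X_std @ components, explained_ratio


# Synthetic data: two strongly correlated columns and one independent column
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base, 2 * base + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

projected, ratio = pca_fit_transform(X_demo, n_components=2)
print(projected.shape, ratio)
```

Every projected column mixes all three original columns, which is the practical difference from feature selection, where the kept columns would be original ones.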

Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?

Answer: A square matrix can have special vectors associated with it, called eigenvectors. When the matrix acts on an eigenvector, it keeps the eigenvector on the same line and only scales it by a scalar value called the eigenvalue.
In mathematical terms, for a square matrix A, a non-zero vector v is an eigenvector if:

    A⋅v = λ⋅v

where v is the eigenvector,
      λ is the eigenvalue (a scalar value),
      A is the matrix.

Imagine you have a matrix A representing a linear transformation, such as stretching, rotating, or scaling a 2D space. When this transformation is applied to a vector v:
- Most vectors will change both their direction and magnitude.
- Some special vectors, however, are only scaled and stay on the same line through the origin. These special vectors are the eigenvectors:
  - if λ > 1, the eigenvector is stretched.
  - if 0 < λ < 1, the eigenvector is compressed.
  - if λ = -1, the eigenvector flips its direction but keeps the same length.
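
The defining relation A⋅v = λ⋅v can be checked numerically. In the sketch below the matrix `A` is an arbitrary symmetric example chosen so that one eigenvector is compressed (λ = 0.5) and the other is stretched (λ = 2); it is not taken from any dataset.

```python
import numpy as np

# Arbitrary symmetric 2x2 matrix used only for illustration
A = np.array([[1.25, 0.75],
              [0.75, 1.25]])

# eigh returns eigenvalues in ascending order; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eigh(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Applying A only rescales v by its eigenvalue
    print("lambda =", lam, "| A @ v =", A @ v, "| lambda * v =", lam * v)
```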

Importance of eigenvectors and eigenvalues:

1. PCA is a widely used technique for dimensionality reduction. The eigenvectors of the data's covariance matrix are the principal components, and each eigenvalue gives the variance captured along its eigenvector, which is how the directions of maximum variance are identified and ranked.
2. PageRank, the algorithm that ranks web pages, uses eigenvectors of a matrix representing the links between web pages. The principal eigenvector helps determine the relative importance of each page.
3. In physics, eigenvectors and eigenvalues describe the states of a system and their measurable properties, such as energy levels.
4. Eigenvectors are used in facial recognition systems, particularly in techniques like Eigenfaces, where they help represent images as linear combinations of significant features.
5. In engineering, eigenvectors describe the modes of vibration in structures like bridges and buildings.

Question 5: How do KNN and PCA complement each other when applied in a single pipeline?

Answer: PCA complements KNN in a pipeline by first performing dimensionality reduction, which helps to overcome the curse of dimensionality and can improve the efficiency and accuracy of KNN by transforming high-dimensional data into a lower-dimensional space.
1. Dimensionality Reduction with PCA:
- Noise Reduction: PCA identifies the principal components that capture the most variance in the data; dropping the low-variance components reduces noise that might be present in the original high-dimensional features.
- Feature Extraction: PCA transforms the original features into a new set of uncorrelated principal components. These components represent the most significant underlying patterns in the data.
- Reduced Dimensionality: By selecting a subset of the most important principal components, PCA projects the data into a lower-dimensional space, which is more computationally tractable.

2. Improved KNN Performance:
- Faster Computation: With fewer features to consider in the lower-dimensional space, the distance calculations performed by KNN become faster, reducing the algorithm's overall computational cost.
- Enhanced Accuracy: By removing redundancy and noise, PCA can make the data patterns more discernible. This can lead to better separation between classes in the reduced feature space, allowing KNN to make more accurate predictions.
- Avoidance of the Curse of Dimensionality: In high-dimensional spaces, KNN can struggle because the concept of 'nearness' becomes less meaningful as data points become sparse. PCA mitigates this by creating a more informative, lower-dimensional representation.
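
One way to chain the two steps is scikit-learn's `Pipeline`, sketched below on the Wine dataset. The step names (`scale`, `pca`, `knn`), the choice of 2 components, K = 5, and 5-fold cross-validation are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling -> PCA -> KNN, refitted from scratch on each training fold
knn_pca = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

scores = cross_val_score(knn_pca, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```

Wrapping the steps in a single pipeline means the scaler and PCA are fitted only on the training portion of each fold, so the held-out data never influences the transformation.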



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, mean_squared_error, confusion_matrix, classification_report, mean_absolute_error


In [5]:
#Question 6: Train a KNN Classifier on the Wine dataset with and without feature
#scaling. Compare model accuracy in both cases.

# data preprocessing
data = load_wine()
X = data.data
y = data.target

# train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# model_training without scaling
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_no_scale = accuracy_score(y_test, y_pred)
print("Accuracy without scaling:", accuracy_no_scale)

# model_training with scaling (the scaler is fitted on the training split only)
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

clf.fit(x_train_scaled, y_train)
y_pred_scaled = clf.predict(x_test_scaled)
accuracy_with_scale = accuracy_score(y_test, y_pred_scaled)

print("Accuracy with scaling:", accuracy_with_scale)




Accuracy without scaling: 0.7037037037037037
Accuracy with scaling: 0.9814814814814815


In [12]:
#Question 7: Train a PCA model on the Wine dataset and print the explained variance
# ratio of each principal component.
#data preprocessing
wine = load_wine()
X, y = wine.data, wine.target

# train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y , test_size = 0.3, random_state = 1)

# scaling
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# PCA (fitted on the scaled training data; only the fitted variances are needed here)
pca = PCA()
pca.fit(x_train)

exp_variance = pca.explained_variance_ratio_
for i, ratio in enumerate(exp_variance):
    print(f"Principal Component {i+1}: {ratio}")

Principal Component 1: 0.35168280854725753
Principal Component 2: 0.19739102667462324
Principal Component 3: 0.11318948926631382
Principal Component 4: 0.07729222031371188
Principal Component 5: 0.06125163621403974
Principal Component 6: 0.05129145201022682
Principal Component 7: 0.04229865732302875
Principal Component 8: 0.026249245002008818
Principal Component 9: 0.02426133805913909
Principal Component 10: 0.018242677233753037
Principal Component 11: 0.015803322570635214
Principal Component 12: 0.013243352038909289
Principal Component 13: 0.007802774746352733


In [14]:
# Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
# components). Compare the accuracy with the original dataset.

#data preprocessing
wine = load_wine()
X, y = wine.data, wine.target

# train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

#scaling
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# apply PCA (fitted on the training split, then applied to the test split)
pca = PCA(n_components = 2)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)

# model_training without PCA
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_no_PCA = accuracy_score(y_test, y_pred)

# model_training with PCA
clf.fit(x_train_pca, y_train)
y_pred_PCA = clf.predict(x_test_pca)
accuracy_with_PCA = accuracy_score(y_test, y_pred_PCA)

# comparing values
print("Accuracy of the model with original dataset:", accuracy_no_PCA)
print(f"Accuracy of model after applying PCA: {accuracy_with_PCA}")

Accuracy of the model with original dataset: 0.9814814814814815
Accuracy of model after applying PCA: 0.9629629629629629


In [29]:
# Question 9: Train a KNN Classifier with different distance metrics (euclidean,
# manhattan) on the scaled Wine dataset and compare the results.

# data preprocessing
wine = load_wine()
X, y = wine.data, wine.target

#train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# scaling
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# define_model_with_metrics
knn_euclidean = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')
knn_manhattan = KNeighborsClassifier(n_neighbors = 5, metric = 'manhattan')

# model_training and evaluation
knn_euclidean.fit(x_train, y_train)
y_pred_euclidean = knn_euclidean.predict(x_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

knn_manhattan.fit(x_train, y_train)
y_pred_manhattan = knn_manhattan.predict(x_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# comparing accuracy
print("Accuracy with Euclidean distance:", accuracy_euclidean)
print("Accuracy with Manhattan distance:", accuracy_manhattan)

Accuracy with Euclidean distance: 0.9814814814814815
Accuracy with Manhattan distance: 0.9814814814814815


Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.

Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data.

Answer: With a large number of features and a small number of samples, traditional models tend to overfit. This is a consequence of the curse of dimensionality, and PCA can be used to mitigate it.

Step 1:- Using PCA for dimensionality reduction: Biomedical datasets often contain hundreds or thousands of features (e.g., gene expressions, imaging biomarkers, lab test results). Many of these features are correlated or redundant, which increases noise and makes models less efficient.

- Principal Component Analysis transforms the original features into new uncorrelated features called principal components, ordered by the amount of variance they explain.

- Before applying PCA, we standardize features (zero mean, unit variance), since PCA is sensitive to scale.

- PCA reduces data complexity while preserving most of the information.

This step improves computational efficiency and reduces the risk of overfitting.

Step 2:- Deciding the number of components:
- Explained Variance Ratio: We calculate how much variance each principal component explains.

- Scree Plot / Cumulative Variance Plot: Used to find the "elbow point," beyond which adding more components no longer increases the explained variance significantly.

- Typical Criterion: Keep enough components to retain 90 to 95% of the variance.

Example: If the first 10 components explain 94% of the variance, we use those instead of all 100 original features.

This balances dimensionality reduction against retaining meaningful biomedical signals.
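
The variance criterion amounts to a cumulative sum over the explained variance ratios. As a small sketch, the snippet below applies it to the Wine ratios printed for Question 7 (rounded here to four decimals) rather than to gene expression data; the helper name `components_needed` is made up for the example.

```python
import numpy as np

# Explained variance ratios from the Question 7 output, rounded to 4 decimals
ratios = np.array([0.3517, 0.1974, 0.1132, 0.0773, 0.0613, 0.0513, 0.0423,
                   0.0262, 0.0243, 0.0182, 0.0158, 0.0132, 0.0078])


def components_needed(explained_ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_ratios)
    # First position where the cumulative variance is >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(explained_ratios))


print("Components for 90% variance:", components_needed(ratios, 0.90))
print("Components for 95% variance:", components_needed(ratios, 0.95))
```

For these ratios the 90% criterion keeps 8 of the 13 components and the 95% criterion keeps 10.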

Step 3:- Apply KNN for classification:
- Once PCA has reduced the dataset, we train a K-Nearest Neighbors classifier on the principal components.
- After PCA, each new patient is classified based on the "closest" existing patient profiles in the reduced feature space.
- PCA removes much of the irrelevant noise, making these distances more meaningful.

With fewer dimensions, KNN is less affected by the curse of dimensionality and performs more reliably.

Step 4:- Model Evaluation: For evaluation purposes we can use:
- Cross Validation: Use (stratified) k-fold cross-validation for reliable estimates, with Grid Search CV or Randomized Search CV to tune K and the number of components. Scaling and PCA should be fitted inside each fold so that no information from the validation data leaks into training.
- Metrics to measure the performance of the model:
  - Accuracy for overall correctness.
  - Precision and Recall are critical in medical tasks, e.g., high recall means fewer missed patients with the disease.
  - F1-score balances precision and recall.
  - ROC-AUC measures the ability to distinguish between classes, e.g., diseased and healthy patients.

We could also show the performance of KNN with and without PCA to demonstrate the benefit of dimensionality reduction.
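
A possible shape for this evaluation is sketched below. Since no gene expression data is available here, it uses a synthetic stand-in from `make_classification` (120 samples, 300 features, 3 classes); the grid values, the macro-averaged recall scoring, and all variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a "many features, few samples" dataset
X_syn, y_syn = make_classification(n_samples=120, n_features=300, n_informative=10,
                                   n_classes=3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search grid; real values would depend on the dataset
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
}

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="recall_macro", cv=cv)
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print("Best macro recall:", search.best_score_)
```

Because the scaler and PCA live inside the pipeline, they are refitted on the training part of every fold, which keeps the cross-validated recall an honest estimate.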
Step 5:- Justification:
- Biomedical data is large, complex, and noisy. PCA helps us compress the data into fewer meaningful patterns.
- This makes predictions faster, less likely to overfit, and often more accurate.
- KNN is intuitive: the model predicts based on the closest patient profiles, which is easy for medical experts to understand.
- By evaluating with recall and ROC-AUC, we ensure the model prioritizes catching as many disease cases as possible, even if it slightly lowers accuracy.

Business/Healthcare Value:
- Helps doctors with early disease detection.
- Enables personalized treatments by grouping similar patients.
- Improves trust and interpretability, since the PCA loadings can show which biomedical signals contribute most to each component.
- Reduces computational cost → scalable to hospital-level data.