Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?

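* The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning method that uses proximity to classify an individual data point or to predict a numeric value for it. It has no explicit training phase: it stores the training data and does all of its work at prediction time.
* For a new data point, it works in three steps: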
1. Calculate distances: A distance metric, like Euclidean distance, is used to calculate the distance between the new data point and all the points in the training dataset.
2. Identify neighbors: The algorithm finds the 'k' data points in the training set that are closest to the new point.
3. Make a prediction:
For classification: The new data point is assigned the class that is most frequent among its 'k' neighbors.
For regression: The algorithm predicts a continuous value by taking the average (or sometimes a weighted average) of the values of its 'k' neighbors.
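
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the helper name `knn_predict` and the toy data are made up for this example, and plain Euclidean distance with unweighted voting/averaging is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point (hypothetical helper, Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Step 1: distance from the new point to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote for classification, plain mean for regression
    if task == "classification":
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    return float(np.mean(y_train[nearest]))

# Toy data: two well-separated groups
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

print(knn_predict(X_toy, labels, [1.1, 1.0], k=3))                      # class of the 3 nearest points
print(knn_predict(X_toy, values, [5.1, 5.0], k=3, task="regression"))  # mean of the 3 nearest values
```

The only difference between the two tasks is the last step: the same neighbors are found either way, and only the aggregation (vote versus average) changes.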

Question 2: What is the Curse of Dimensionality and how does it affect KNN
performance?

* The Curse of Dimensionality refers to various phenomena that arise when dealing with high-dimensional data.
As the number of features or dimensions increases, the volume of the feature space grows exponentially, leading to sparsity in the data distribution.
This sparsity can result in several challenges such as increased computational complexity, overfitting, and deteriorating performance of certain algorithms.
* Dimensionality affects KNN performance in the following ways:

1. Increased Sparsity: As the number of dimensions increases, the volume of the space grows exponentially. Consequently, the available data becomes sparser, meaning that data points are spread farther apart from each other. This sparsity can lead to difficulties in finding meaningful nearest neighbors, as there may be fewer neighboring points within a given distance.
2. Equal Distances: In high-dimensional spaces, the concept of distance becomes less meaningful. As the number of dimensions increases, the distances from a point to its nearest and to its farthest neighbor converge, so all points look roughly equidistant. This happens because a distance adds up contributions from every dimension; with many roughly independent dimensions, those sums concentrate around their average and the relative differences between them shrink.
3. Degraded Performance: KNN relies on the assumption that nearby points in the feature space are likely to have similar labels. However, in high-dimensional spaces, this assumption may no longer hold true due to the increased sparsity and equalization of distances. As a result, KNN may struggle to accurately classify data points, leading to degraded performance.
4. Increased Computational Complexity: The cost of each distance computation grows linearly with the number of dimensions, and KNN must compare the query against many training points at prediction time. Tree-based indexes such as KD-trees and ball trees also lose much of their advantage over brute-force search as dimensionality grows. This makes KNN slower and less efficient, especially on large datasets.
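
The "equal distances" effect can be observed with a small experiment. The sketch below assumes points drawn uniformly at random from the unit hypercube (an idealized setting, real data is usually more structured) and uses a made-up helper, `relative_contrast`, that measures how much farther the farthest point is than the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as the number of dimensions grows
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"dims={dim:5d}  relative contrast={c:.3f}")
```

A contrast close to zero means the nearest neighbor is barely closer than the farthest point, which is exactly the situation in which "nearest" stops carrying useful information.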


Question 3: What is Principal Component Analysis (PCA)? How is it different from
feature selection?

* Principal component analysis, or PCA, reduces the number of dimensions in large datasets while retaining most of the original information. It does this by transforming potentially correlated variables into a smaller set of new, uncorrelated variables called principal components.
* Key aspects of PCA:

1. Dimensionality Reduction: PCA helps manage high-dimensional datasets by extracting essential information and discarding less relevant features, simplifying analysis.
2. Data Exploration and Visualization: It plays a significant role in data exploration and visualization, aiding in uncovering hidden patterns and insights.
3. Linear Transformation: PCA performs a linear transformation of data, seeking directions of maximum variance.
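4. Difference from Feature Selection: Feature selection keeps a subset of the original features unchanged and discards the rest. PCA is feature extraction: each principal component is a new variable built as a linear combination of all the original features. Components are ranked by the variance they explain, so the leading ones are kept, but they no longer correspond to individual original features and are harder to interpret.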
5. Data Compression: PCA can compress data while preserving most of the original information.
6. Clustering and Classification: It finds applications in clustering and classification tasks by reducing noise and highlighting underlying structure.
7. Advantages: PCA offers linearity, computational efficiency, and scalability for large datasets.
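8. Limitations: It captures only linear structure, is sensitive to feature scale and outliers, and discarding components may lose information. It does not strictly require normally distributed data.
9. Matrix Requirements: PCA works with the symmetric covariance or correlation matrix of the data and requires numeric input; standardizing is recommended when features are on different scales.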
10. Eigenvalues and Eigenvectors: Eigenvalues represent variance magnitude, and eigenvectors indicate variance direction.
11. Number of Components: The number of principal components kept is the number of leading eigenvectors (those with the largest eigenvalues) that are retained.
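
To make the contrast with feature selection concrete, here is a small sketch on the Wine dataset. It assumes `SelectKBest` with the ANOVA F-score as one possible selection method (many others exist) and, for brevity, fits on the whole dataset rather than a training split.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)

# Feature selection: keeps 3 of the ORIGINAL columns, values unchanged
selector = SelectKBest(score_func=f_classif, k=3).fit(X_std, wine.target)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected original features:", selected_names)

# PCA: builds 3 NEW columns, each a weighted mix of all 13 original features
pca = PCA(n_components=3).fit(X_std)
print("Loadings shape (components x original features):", pca.components_.shape)
print("Weights of PC1 on the original features:", pca.components_[0].round(2))
```

Note also that the selector here uses the class labels, while PCA never looks at them: it is unsupervised and only considers the variance of the features.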


Question 4: What are eigenvalues and eigenvectors in PCA, and why are they
important?
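* A square matrix is associated with special vectors called eigenvectors. When the matrix acts on an eigenvector, the vector stays on the same line and is only scaled by a scalar value called the eigenvalue. In PCA, the matrix is the covariance (or correlation) matrix of the data: its eigenvectors are the principal component directions, and each eigenvalue is the variance of the data along the corresponding eigenvector.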
* Eigenvectors play a crucial role in various mathematical and real-world applications:

1. Principal Component Analysis (PCA): PCA is a widely used technique for dimensionality reduction. The eigenvectors of the covariance matrix define the principal components, and the eigenvalues rank them: the component with the largest eigenvalue captures the most variance, so the ranking tells us which components to keep.
2. Google PageRank: The algorithm that ranks web pages uses eigenvectors of a matrix representing the links between web pages. The principal eigenvector helps determine the relative importance of each page.
3. Quantum Mechanics: In physics, eigenvectors and eigenvalues describe the states of a system and their measurable properties, such as energy levels.
4. Computer Vision: Eigenvectors are used in facial recognition systems, particularly in techniques like Eigenfaces, where they help represent images as linear combinations of significant features.
5. Vibrational Analysis: In engineering, eigenvectors describe the modes of vibration in structures like bridges and buildings.
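
For the PCA case, the link between eigenvalues, eigenvectors and explained variance can be checked numerically. The sketch below assumes the standardized Wine data and compares a manual eigendecomposition of the covariance matrix with scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13, symmetric)
cov = np.cov(X_std, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Defining property: cov @ v equals lambda * v for each eigenpair
v, lam = eigenvectors[:, 0], eigenvalues[0]
print("cov @ v == lambda * v:", np.allclose(cov @ v, lam * v))

# Each eigenvalue's share of the total is that component's explained variance ratio
ratios = eigenvalues / eigenvalues.sum()
pca = PCA().fit(X_std)
print("Matches sklearn PCA:", np.allclose(ratios, pca.explained_variance_ratio_))
```

The eigenvectors themselves may differ from `pca.components_` by a sign flip, since an eigenvector multiplied by -1 is still an eigenvector.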

Question 5: How do KNN and PCA complement each other when applied in a single
pipeline?

* When combined in a single pipeline, PCA and KNN complement each other in the following ways:

1. Dimensionality Reduction for Efficiency:
PCA reduces the number of features before applying KNN, which lowers computational cost since KNN’s complexity increases with the number of dimensions and samples.

2. Improved Distance Calculation:
KNN depends heavily on its distance metric, typically Euclidean distance. In high-dimensional spaces, distances become less meaningful (“curse of dimensionality”). PCA helps by projecting data onto the directions of highest variance, which can make distance comparisons more reliable.

3. Noise Reduction:
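PCA removes low-variance components (often noise), which can make KNN more robust and may improve classification accuracy. Because PCA ignores the labels, a low-variance component can occasionally be useful for classification, so the number of components should be validated rather than assumed.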

4. Avoiding Overfitting:
Reducing the number of features through PCA can prevent KNN from overfitting to noise or redundant attributes in the dataset.

5. Better Visualization and Interpretability:
PCA can project data into 2D or 3D space, allowing easier visualization of how KNN classifies clusters of similar data points.
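
A minimal sketch of such a pipeline with scikit-learn is shown below. The step names and the choices `n_components=5` and `n_neighbors=5` are arbitrary placeholders for illustration, not tuned values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling -> PCA -> KNN, chained so every step is fit on training folds only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Wrapping the steps in a `Pipeline` matters for correctness as well as convenience: the scaler and PCA are refit inside each cross-validation fold, so no information from the held-out fold leaks into the preprocessing.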



In [2]:
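'''Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.'''

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling: default KNN on the raw features
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Hyperparameter search, still on the unscaled features
param_grid = {
    'n_neighbors': [3, 5, 6, 7, 9, 11, 13],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [20, 30, 40, 50]
}

grid = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, verbose=3)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV score:", grid.best_score_)

best_model = grid.best_estimator_
y_pred_tuned = best_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_tuned)}")

# With scaling: KNN is distance-based, so features with large numeric ranges
# (e.g. proline) dominate the distance unless every feature is standardized.
# The scaler is fit on the training split only and reused for the test split.
scaled_clf = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_clf.fit(X_train, y_train)
y_pred_scaled = scaled_clf.predict(X_test)
print(f"Accuracy with scaling: {accuracy_score(y_test, y_pred_scaled)}")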


0.7222222222222222
Fitting 5 folds for each of 224 candidates, totalling 1120 fits
[CV 1/5] END algorithm=auto, leaf_size=20, n_neighbors=3, weights=uniform;, score=0.621 total time=   0.0s
[CV 2/5] END algorithm=auto, leaf_size=20, n_neighbors=3, weights=uniform;, score=0.655 total time=   0.0s
[CV 3/5] END algorithm=auto, leaf_size=20, n_neighbors=3, weights=uniform;, score=0.643 total time=   0.0s
[CV 4/5] END algorithm=auto, leaf_size=20, n_neighbors=3, weights=uniform;, score=0.679 total time=   0.0s
[CV 5/5] END algorithm=auto, leaf_size=20, n_neighbors=3, weights=uniform;, score=0.714 total time=   0.0s
[CV 1/5] END algorithm=auto, leaf_size=20, n_neighbors=3, weights=distance;, score=0.655 total time=   0.0s
[CV 2/5] END algorithm=auto, leaf_size=20, n_neighbors=3, weights=distance;, score=0.690 total time=   0.0s
[CV 3/5] END algorithm=auto, leaf_size=20, n_neighbors=3, weights=distance;, score=0.679 total time=   0.0s
[CV 4/5] END algorithm=auto, leaf_size=20, n_neighbors=3, 

In [4]:
'''Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.
'''
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Standardize (mean=0, std=1): fit on the training split only, reuse for the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# PCA() with no arguments keeps as many components as there are features (13 here);
# n_components=3 keeps only the three leading components
pca = PCA(n_components=3)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Explained variance ratio of each retained component
pca.explained_variance_ratio_

array([0.35383437, 0.20465595, 0.11223578])
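
Since the question asks for the ratio of each principal component, a variant that keeps all 13 components may also be useful. The sketch below reuses the same split settings as the cell above (`test_size=0.33`, `random_state=1`); the 95% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
X_train_scaled = StandardScaler().fit_transform(X_train)

# No n_components argument: keep every component (13 for the Wine dataset)
pca_full = PCA().fit(X_train_scaled)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)
for i, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i:2d}: ratio={ratio:.4f}  cumulative={total:.4f}")

# Smallest number of components whose cumulative ratio reaches the threshold
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% of the variance:", n_95)
```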

In [5]:
'''Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.'''
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Standardize: fit on the training split only, reuse for the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Keep the top 2 components; fit on the scaled training data and apply the
# same projection to the test data
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

pca.explained_variance_ratio_


array([0.35168281, 0.19739103])
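
The accuracy comparison the question asks for could be done along the following lines. This is a sketch using the same split settings as the cell above (`test_size=0.3`, `random_state=1`) and default KNN settings; the dictionary keys are just descriptive labels, and no results are quoted here because they depend on the run.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Three variants of the same classifier, differing only in preprocessing
models = {
    "original (unscaled, 13 features)": KNeighborsClassifier(),
    "scaled, 13 features": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "scaled + PCA (2 components)": make_pipeline(
        StandardScaler(), PCA(n_components=2), KNeighborsClassifier()
    ),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)                      # preprocessing is fit on train only
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.4f}")
```

Including the scaled 13-feature model separates the effect of scaling from the effect of PCA, which would otherwise be confounded in a two-way comparison.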

In [6]:
'''Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.'''
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
wine = load_wine()
X = wine.data
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)
print("Accuracy with Euclidean distance:", acc_euclidean)
print("Accuracy with Manhattan distance:", acc_manhattan)

print("\nClassification Report (Euclidean):\n", classification_report(y_test, y_pred_euclidean))
print("\nClassification Report (Manhattan):\n", classification_report(y_test, y_pred_manhattan))


Accuracy with Euclidean distance: 0.9444444444444444
Accuracy with Manhattan distance: 0.9444444444444444

Classification Report (Euclidean):
               precision    recall  f1-score   support

           0       0.93      1.00      0.97        14
           1       1.00      0.86      0.92        14
           2       0.89      1.00      0.94         8

    accuracy                           0.94        36
   macro avg       0.94      0.95      0.94        36
weighted avg       0.95      0.94      0.94        36


Classification Report (Manhattan):
               precision    recall  f1-score   support

           0       0.93      1.00      0.97        14
           1       1.00      0.86      0.92        14
           2       0.89      1.00      0.94         8

    accuracy                           0.94        36
   macro avg       0.94      0.95      0.94        36
weighted avg       0.95      0.94      0.94        36



Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data


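1. Standardize the features and fit PCA on the training data only (inside each cross-validation fold) to reduce noise and dimensionality. With far more genes than patients, the number of usable components is bounded by the number of samples.

2. Choose the number of components using cumulative explained variance and a scree plot as a starting point, then confirm the choice by cross-validated predictive performance (nested CV), which is the criterion that matters most.

3. Fit KNN on the PCA-transformed data; tune k, the distance metric and n_components together inside a single pipeline so that scaling and PCA never see validation data (no leakage).

4. Evaluate with nested CV; report macro F1 and AUC with confidence intervals, run permutation tests to confirm the score beats chance and, if possible, validate on an independent cohort.

5. Justify to stakeholders: fewer dimensions reduce overfitting and improve generalization, PC loadings can be checked against known biology, and the strict validation protocol gives an honest estimate of real-world performance.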
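
A minimal sketch of steps 1 to 4 is given below. It uses synthetic data from `make_classification` as a stand-in for a gene expression matrix (it is not real biomedical data), and the grid values, fold counts and sample sizes are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 120 "patients", 1000 "genes", 3 classes (NOT real data)
X, y = make_classification(n_samples=120, n_features=1000, n_informative=30,
                           n_classes=3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("knn", KNeighborsClassifier()),
])

# n_components, k and the metric are tuned together inside the pipeline
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ["euclidean", "manhattan"],
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # model selection
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="f1_macro")
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1_macro")
print(f"Nested CV macro F1: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

The outer loop never influences hyperparameter choice, so its score is an estimate of how the whole procedure (scaling, PCA, tuning, KNN) would perform on unseen patients. Permutation tests and independent cohort validation would be added on top of this and are not shown here.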