#Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?
Answer:
K-Nearest Neighbors is a supervised machine learning algorithm used for
classification and regression tasks. It’s called a lazy learner because it doesn’t explicitly
build a model during training; it just stores the data and makes predictions at query
time. To predict for a new point, it measures the distance from that point to every
stored training point and picks the k closest ones (its neighbors).
*In classification, the prediction is the majority class among the k neighbors.
*In regression, the prediction is the average target value of the k neighbors.
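
The two prediction rules can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements KNN; the helper name `knn_predict` and the toy data are invented for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Hypothetical helper: predict for one query point with brute-force KNN."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(y_train[nearest]))

# Toy data (invented for illustration): two well-separated groups
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

query = np.array([1.2, 1.3])
pred_class = knn_predict(X_train, y_class, query, k=3, task="classification")
pred_value = knn_predict(X_train, y_value, query, k=3, task="regression")
print("Predicted class:", pred_class)
print("Predicted value:", pred_value)
```

The same three neighbors are used in both cases; only the way their targets are combined differs.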

#Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?
Answer:
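The curse of dimensionality refers to the problems that arise when working with
high-dimensional data: as the number of features grows, the data becomes sparse.
*Nearest neighbors become less meaningful, because the distances to the nearest and farthest points become nearly equal.
*Increased risk of overfitting, because more samples are needed to cover the feature space.
*Higher computational cost, because every distance calculation involves more features.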
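
The first effect can be seen with random data. The sketch below draws uniform random points (an assumption made purely for illustration) and measures how much farther the farthest point is than the nearest one, relative to the nearest; the function name `relative_contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Compare a low-dimensional space with increasingly high-dimensional ones
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

As the contrast shrinks, "nearest" carries less information, which is the situation KNN struggles with.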

#Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?
Answer:
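Principal Component Analysis (PCA) is a dimensionality reduction technique that
transforms the original features into a new set of uncorrelated features called principal
components, ordered by how much variance each one captures.
Feature selection: selecting a subset of the original features that are most relevant.
*Keeps some of the original features unchanged.
*Removes irrelevant/redundant features.
*Easy to interpret.
PCA, in contrast, builds each component as a linear combination of all the original features, so the components are harder to interpret.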
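
The difference shows up in what each method returns. The sketch below uses the Wine dataset and `SelectKBest` with the ANOVA F-score as one possible selection criterion; that criterion and the choice of 2 features are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Feature selection: keep the 2 original columns with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_scaled, wine.target)
kept = selector.get_support(indices=True)
print("Selected original features:", [wine.feature_names[i] for i in kept])

# PCA: build 2 new features, each a weighted mix of all 13 original columns
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("PCA component weights shape:", pca.components_.shape)
```

The selected columns are copies of original features, while each row of `pca.components_` holds one weight per original feature.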

#Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?
Answer:
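For a square matrix A (in PCA, the covariance matrix of the data):
Aν = λν
Here ν = eigenvector (a non-zero vector whose direction A does not change)
λ = eigenvalue (a scalar)
Importance:
*Eigenvectors define the directions of the principal components.
*Eigenvalues tell us the importance of each component: the variance captured along its eigenvector.
*Dimensionality reduction: keep only the components with the largest eigenvalues.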
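
The relationship can be checked numerically. This sketch computes the eigenpairs of the covariance matrix of the standardized Wine data with NumPy and compares the eigenvalues with what scikit-learn's `PCA` reports; using the Wine data here is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# A = covariance matrix of the standardized features (13 x 13)
A = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)
order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Check the defining equation A v = lambda v for the first eigenpair
v, lam = eigenvectors[:, 0], eigenvalues[0]
print("A v == lambda v:", np.allclose(A @ v, lam * v))

# Compare the eigenvalues with the variances reported by scikit-learn's PCA
pca = PCA().fit(X_scaled)
print("Matches PCA.explained_variance_:", np.allclose(eigenvalues, pca.explained_variance_))
```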

#Question 5: How do KNN and PCA complement each other when applied in a single pipeline?
Answer:
*PCA transforms high-dimensional data into a smaller set of informative features.
*KNN then classifies or predicts based on distance in this reduced space.
*Both complement each other because PCA reduces dimensionality and makes
distances more meaningful, while KNN uses those distances for prediction.
*The combination mitigates the curse of dimensionality, improves efficiency, and can
improve accuracy.
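
One way to chain the two steps is a scikit-learn pipeline, sketched below on the Wine dataset. Keeping 2 components and using 5 neighbors are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce -> classify; the pipeline refits every step inside each CV fold
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),                  # assumed value for the sketch
    KNeighborsClassifier(n_neighbors=5),  # assumed value for the sketch
)

scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```

Putting the scaler and PCA inside the pipeline keeps the test folds out of the fitting of those steps.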

In [2]:
#Dataset:
"""Use the Wine Dataset from sklearn.datasets.load_wine().
Question 6: Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.
(Include your Python code and output in the code box below.)"""
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

# Hold out 30% of the data, keeping class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# KNN on the raw, unscaled features
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# KNN on standardized features; the scaler is fit on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaling = KNeighborsClassifier(n_neighbors=5)
knn_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = knn_scaling.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaling)

print("Accuracy without Scaling:", acc_no_scaling)
print("Accuracy with Scaling   :", acc_scaling)


Accuracy without Scaling: 0.7222222222222222
Accuracy with Scaling   : 0.9444444444444444
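
A likely reason for the gap is that the Wine features are measured on very different scales, so the widest-ranging feature dominates unscaled distances. The sketch below lists the standard deviation of each feature to make that visible.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# Print features from largest to smallest spread
for i in np.argsort(stds)[::-1]:
    print(f"{wine.feature_names[i]:<30} std = {stds[i]:.2f}")

# Name of the feature with the largest spread
largest = wine.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

After `StandardScaler`, every feature has unit variance and contributes comparably to the distance.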


In [3]:
"""Question 7: Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.
(Include your Python code and output in the code box below.)"""
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
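wine = load_wine()
X, y = wine.data, wine.target

# Standardize first so that no feature dominates the variance because of its units
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Keep all 13 components to see how the variance is distributed
pca = PCA()
pca.fit(X_scaled)
print("Explained Variance Ratio of each Principal Component:")
print(pca.explained_variance_ratio_)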


Explained Variance Ratio of each Principal Component:
[0.36198848 0.1920749  0.11123631 0.0706903  0.06563294 0.04935823
 0.04238679 0.02680749 0.02222153 0.01930019 0.01736836 0.01298233
 0.00795215]
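
A common way to read these ratios is through their running total. The sketch below reuses the values printed above and counts how many components are needed to reach a 95% threshold; the threshold is an assumed choice.

```python
import numpy as np

# Explained variance ratios copied from the printed output
ratios = np.array([0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294,
                   0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
                   0.01736836, 0.01298233, 0.00795215])

cumulative = np.cumsum(ratios)

# First position where the running total reaches the threshold
threshold = 0.95  # assumed cut-off
n_components = int(np.argmax(cumulative >= threshold)) + 1

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for 95%:", n_components)
```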


In [4]:
"""Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.
(Include your Python code and output in the code box below.)"""
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
acc_original = accuracy_score(y_test, y_pred_original)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

print("Accuracy with Original Dataset:", acc_original)
print("Accuracy with PCA (2 components):", acc_pca)


Accuracy with Original Dataset: 0.9444444444444444
Accuracy with PCA (2 components): 0.9444444444444444


In [5]:
"""Question 9: Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results. (Include your
Python code and output in the code box below.)"""
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
acc_euclidean = accuracy_score(y_test, y_pred_euclidean)

knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
acc_manhattan = accuracy_score(y_test, y_pred_manhattan)

print("Accuracy with Euclidean Distance:", acc_euclidean)
print("Accuracy with Manhattan Distance:", acc_manhattan)


Accuracy with Euclidean Distance: 0.9444444444444444
Accuracy with Manhattan Distance: 0.9814814814814815
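
For reference, the two metrics differ only in how coordinate differences are combined. A minimal sketch with two made-up points:

```python
import numpy as np

# Two made-up points for illustration
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean (L2): straight-line distance
euclidean = np.sqrt(np.sum((a - b) ** 2))
# Manhattan (L1): sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))

print("Euclidean:", euclidean)  # 5.0
print("Manhattan:", manhattan)  # 7.0
```

With 54 test samples, the gap in the accuracies above corresponds to two predictions, so the comparison on a single split should be read with caution.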


In [6]:
"""Question 10: You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.
Explain how you would:
● Use PCA to reduce dimensionality
● Decide how many components to keep
● Use KNN for classification post-dimensionality reduction
● Evaluate the model
● Justify this pipeline to your stakeholders as a robust solution for real-world
biomedical data
(Include your Python code and output in the code box below.)"""
from sklearn.datasets import load_wine  # using Wine dataset as example, similar process for gene expression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

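# Step 1: standardize, fitting the scaler on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: PCA, keeping the smallest number of components that explain 95% of the variance
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Step 3: KNN in the reduced space
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_pca, y_train)

# Step 4: evaluate on the held-out test split
y_pred = knn.predict(X_test_pca)
accuracy = accuracy_score(y_test, y_pred)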

print("Number of PCA Components Retained:", pca.n_components_)
print("Explained Variance Ratio (each component):", pca.explained_variance_ratio_)
print("Model Accuracy after PCA + KNN:", accuracy)



Number of PCA Components Retained: 10
Explained Variance Ratio (each component): [0.35730453 0.19209164 0.11006755 0.07250719 0.06973166 0.05341402
 0.04555029 0.0241568  0.02040417 0.01976974]
Model Accuracy after PCA + KNN: 0.9444444444444444
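
With few samples, a single train/test split gives a noisy estimate, so cross-validation is a reasonable way to evaluate the pipeline and to choose the number of components. The sketch below again uses the Wine dataset as a stand-in for gene expression data; the parameter grid is an illustrative assumption, not a tuned recommendation.

```python
from sklearn.datasets import load_wine  # stand-in data, as in the cell above
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling and PCA live inside the pipeline so they are refit on each training fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are assumptions chosen for illustration
param_grid = {
    "pca__n_components": [2, 5, 0.95],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

For stakeholders, the point of this setup is that every preprocessing step is learned from training folds only, so the reported accuracy is not inflated by information from the held-out samples.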
