Q1: What is KNN and how does it work in classification & regression?

- Answer:
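KNN (k-nearest neighbors) is a simple, instance-based (lazy) algorithm: it stores the training data and defers all computation to prediction time.
Classification: Assigns the majority class among the K nearest neighbors.
Regression: Predicts the mean (or distance-weighted mean) of the neighbors’ values.
Performance depends on the choice of K, the distance metric, and feature scaling.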
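
The voting and averaging steps described in this answer can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`), the toy data, and the choice of Euclidean distance with inverse-distance weights are all assumptions made for the example.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return (distance, index) pairs for the k training points closest to query."""
    dists = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return dists[:k]


def knn_classify(train_X, train_y, query, k=3):
    """Majority vote among the k nearest neighbors (ties go to the label seen first)."""
    votes = Counter(train_y[i] for _, i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3, weighted=False):
    """Plain or inverse-distance-weighted mean of the k nearest targets."""
    neighbors = k_nearest(train_X, query, k)
    if not weighted:
        return sum(train_y[i] for _, i in neighbors) / len(neighbors)
    # A small epsilon avoids division by zero when the query equals a training point
    weights = [1.0 / (d + 1e-12) for d, _ in neighbors]
    return sum(w * train_y[i] for w, (_, i) in zip(weights, neighbors)) / sum(weights)


# Toy data (hypothetical): two well-separated groups in 2D
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]

print("class:", knn_classify(train_X, labels, (1.1, 1.0), k=3))
print("mean:", knn_regress(train_X, targets, (1.1, 1.0), k=3))
print("weighted mean:", knn_regress(train_X, targets, (1.1, 1.0), k=3, weighted=True))
```

Sorting every distance keeps the sketch short; library implementations typically use tree or partial-sort structures to avoid that cost.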

Q2: What is the Curse of Dimensionality and how does it affect KNN?

- Answer:
In high dimensions, distances become less meaningful: the nearest and farthest neighbors of a point end up almost equally far away.
For KNN, this tends to reduce accuracy, increase variance, and make the model unreliable. Dimensionality reduction (e.g. PCA) or feature selection helps.
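
The "neighbors look equally far" effect can be observed with a small simulation. The sketch below is an illustration under stated assumptions: points are drawn uniformly from the unit hypercube, the helper `relative_contrast` is a hypothetical name, and the sample size and dimensions are arbitrary. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(p, query) for p in points]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

A contrast close to zero means the nearest neighbor is barely closer than the farthest point, which is the situation in which KNN's notion of "nearest" stops being informative.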

Q3: What is PCA? How is it different from feature selection?

- Answer:
PCA is a linear technique that creates new, mutually orthogonal features (principal components), ordered so that each captures as much of the remaining variance as possible.
PCA: Creates new features (linear combinations of the original ones).
Feature selection: Keeps a subset of the original features unchanged.
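
The contrast between the two approaches can be shown on a small synthetic matrix. This is a sketch, assuming NumPy is available; the data, the variance-based selection rule, and the choice of 2 retained features are illustrative assumptions, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (illustrative only): 100 samples, 4 features with different spreads
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 0.5, 0.1])
X_centered = X - X.mean(axis=0)

# Feature selection: keep the 2 original columns with the highest variance
keep = np.sort(np.argsort(X_centered.var(axis=0))[-2:])
X_selected = X_centered[:, keep]

# PCA: project onto the top 2 right singular vectors (the principal axes)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                 # each row has a weight for all 4 original features
X_pca = X_centered @ components.T

print("Selected original columns:", keep)
print("PCA loadings (rows = components):")
print(components.round(2))
```

The selected columns are copies of original features, while each PCA column is a weighted combination of all four, which is why PCA features are harder to interpret directly.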

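Q4: What are eigenvalues and eigenvectors in PCA, and why are they important?

- Answer:
Eigenvectors (of the data's covariance matrix): The principal axes; the first one points in the direction of maximum variance.
Eigenvalues: The amount of variance captured along the corresponding eigenvector.
Components are ranked by eigenvalue, which determines which ones to keep for dimensionality reduction.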
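
The role of eigenvalues and eigenvectors can be made concrete by computing PCA directly from the covariance matrix. The following is a minimal sketch assuming NumPy; the synthetic data and the 95% cumulative-variance threshold are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Correlated toy data (illustrative only): 200 samples, 3 features
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, rng.normal(size=(200, 1))])
X = X + 0.1 * rng.normal(size=(200, 3))

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # re-sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
# Smallest number of components whose cumulative ratio reaches 95% (assumed threshold)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the centered data onto the retained eigenvectors
X_reduced = X_centered @ eigenvectors[:, :n_keep]

print("Explained variance ratio:", explained_ratio.round(3))
print("Components kept:", n_keep)
```

The variance of each projected column equals the corresponding eigenvalue, which is what "amount of variance captured" means in practice.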

Q5: How do KNN & PCA complement each other?

- Answer:
PCA reduces noise and the number of dimensions, so distances become more meaningful.
KNN then often works better and faster in this reduced space.

In [1]:
#Question 6: Train a KNN Classifier on the Wine dataset with and without feature
#scaling. Compare model accuracy in both cases.
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
# Stratify so each class keeps its proportion in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Without scaling: features with large ranges (e.g. proline) dominate the distance
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Without scaling:", accuracy_score(y_test, knn.predict(X_test)))

# With scaling: the scaler is fit on the training data only, inside the pipeline
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)
print("With scaling   :", accuracy_score(y_test, pipe.predict(X_test)))

In [2]:
#Question 7: Train a PCA model on the Wine dataset and print the explained variance
#ratio of each principal component.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the data here so the cell also runs on its own
X, y = load_wine(return_X_y=True)

# Standardize first: PCA is variance-based, so large-scale features would dominate
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_)

In [3]:
#Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2
#components). Compare the accuracy with the original dataset.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same split as in Question 6, repeated so the cell runs on its own
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Baseline: scaled features, no dimensionality reduction
pipe_orig = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe_orig.fit(X_train, y_train)
print("Original scaled:", accuracy_score(y_test, pipe_orig.predict(X_test)))

# Scale, project onto the top 2 principal components, then classify
pipe_pca = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2)), ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe_pca.fit(X_train, y_train)
print("PCA (2 comps):", accuracy_score(y_test, pipe_pca.predict(X_test)))

In [4]:
#Question 9: Train a KNN Classifier with different distance metrics (euclidean,
#manhattan) on the scaled Wine dataset and compare the results.
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same split as in Question 6, repeated so the cell runs on its own
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit one scaled pipeline per distance metric and report test accuracy
for metric in ("euclidean", "manhattan"):
    pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5, metric=metric))])
    pipe.fit(X_train, y_train)
    print(f"{metric.capitalize()}:", accuracy_score(y_test, pipe.predict(X_test)))

In [5]:
#Question 10: You are working with a high-dimensional gene expression dataset to
#classify patients with different types of cancer.
#Due to the large number of features and a small number of samples, traditional models overfit.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data: few samples, many features
Xg, yg = make_classification(n_samples=200, n_features=500, n_informative=30, n_classes=3, random_state=42)

# Scaling and PCA live inside the pipeline so they are re-fit on each training fold
pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=30)), ("knn", KNeighborsClassifier(n_neighbors=5))])
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

acc = cross_val_score(pipe, Xg, yg, cv=cv, scoring="accuracy")
print("CV Accuracy mean:", acc.mean())