1️⃣ What is K-Nearest Neighbors (KNN) and how does it work?
KNN is a supervised learning algorithm used for classification and regression. It works by:

Storing all the data points during training.

For a new data point, it calculates the distance (like Euclidean distance) to all training points.

It selects the K nearest neighbors and predicts the output:

Classification: by majority voting.

Regression: by averaging their values.
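
The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements it; the function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the target for one query point with brute-force KNN."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the neighbors' labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(y_train[nearest]))

# Toy data (invented for illustration): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_labels = np.array([0, 0, 0, 1, 1, 1])
y_values = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

label = knn_predict(X_toy, y_labels, np.array([1.1, 1.0]), k=3)
value = knn_predict(X_toy, y_values, np.array([5.0, 5.0]), k=3, task="regression")
print("Predicted class:", label)
print("Predicted value:", value)
```

Note that "training" here is just keeping `X_train` and `y_train` around; all the work happens at prediction time.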

2️⃣ What is the difference between KNN Classification and KNN Regression?
Aspect	KNN Classification	KNN Regression
Target Variable	Categorical (e.g., labels like 'cat', 'dog')	Continuous (e.g., price, temperature)
Prediction	Based on majority class of neighbors	Based on average of neighbors' values
Voting Strategy	Mode (most common label)	Mean (or weighted average)

3️⃣ What is the role of the distance metric in KNN?
The distance metric (like Euclidean, Manhattan, or Minkowski) determines which points are considered "close". It heavily impacts which neighbors are selected:

Different metrics capture different notions of similarity.

E.g., Euclidean works well when features are continuous; Manhattan may suit high-dimensional data better.
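
To make the metrics concrete, here is a small sketch that computes them by hand. The helper `minkowski` is written for this example; Minkowski distance with p=1 is Manhattan and with p=2 is Euclidean.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)

print("Manhattan:", manhattan)
print("Euclidean:", euclidean)
```

The same pair of points is 7 units apart under Manhattan and 5 under Euclidean, so the ranking of neighbors can change with the metric.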

4️⃣ What is the Curse of Dimensionality in KNN?
In high dimensions:

Distances become less meaningful (all points seem equidistant).

KNN relies on distance to find neighbors, so its performance degrades.

The more dimensions, the sparser the data becomes.
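
A rough way to see this effect is to compare the nearest and farthest distances from a random query point as the number of dimensions grows. The sketch below uses uniformly random points, which is an assumption chosen for simplicity; real data may behave less extremely.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the smallest to the largest distance from a random query."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return d.min() / d.max()

ratios = {n_dims: nearest_to_farthest_ratio(n_dims) for n_dims in (2, 10, 100, 1000)}
for n_dims, ratio in ratios.items():
    print(f"dims={n_dims:4d}  nearest/farthest = {ratio:.3f}")
```

A ratio close to 1 means the nearest neighbor is barely closer than the farthest point, which is why distance-based neighbor selection loses its meaning.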

5️⃣ How can we choose the best value of K in KNN?
Use cross-validation:

Try multiple values of K (e.g., 1 to 20).

Choose K that minimizes validation error.

Common heuristic:

Odd values of K for binary classification (to avoid tied votes).

K = √(number of samples) is a rough starting point.
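
A minimal sketch of the cross-validation approach on Iris follows. The variable names (`X_cv`, `cv_scores`, `best_k`) are chosen for this example, and scaling is put inside the pipeline so that each fold is scaled using only its own training part.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

# Mean 5-fold accuracy for each candidate K
cv_scores = {}
for k in range(1, 21):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_cv, y_cv, cv=5).mean()

# K with the highest mean validation accuracy (lowest validation error)
best_k = max(cv_scores, key=cv_scores.get)
print(f"Best K: {best_k}, mean CV accuracy: {cv_scores[best_k]:.3f}")
```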

6️⃣ What are KD Tree and Ball Tree in KNN?
KD Tree:

A binary tree that partitions the data by splitting along dimensions.

Works well in low dimensions (up to ~20-30).

Ball Tree:

Partitions space into hyperspheres (balls).

Better for high-dimensional or clustered data.

They are used to speed up nearest-neighbor searches.
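
Both structures are available directly in scikit-learn as `KDTree` and `BallTree`. The sketch below, on random data made up for this example, checks that they return the same neighbors as a brute-force search; the trees only change how quickly the answer is found.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(42)
points = rng.random((1000, 3))   # 1000 random points in 3 dimensions
query = rng.random((1, 3))       # a single query point

# Build each index once, then query the 5 nearest neighbors
kd = KDTree(points, leaf_size=30)
ball = BallTree(points, leaf_size=30)
kd_dist, kd_idx = kd.query(query, k=5)
ball_dist, ball_idx = ball.query(query, k=5)

# Brute force for comparison: distance to every point, then sort
brute_idx = np.argsort(np.linalg.norm(points - query, axis=1))[:5]

print("KD Tree neighbors:  ", sorted(kd_idx[0].tolist()))
print("Ball Tree neighbors:", sorted(ball_idx[0].tolist()))
print("Brute force:        ", sorted(brute_idx.tolist()))
```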

7️⃣ When should you use KD Tree vs. Ball Tree?
Aspect	KD Tree	Ball Tree
Data Dimensionality	Low-dimensional (<30)	Higher-dimensional (>30)
Data Structure	Uniform, continuous data	Clustered, irregular data
Performance	Faster in low dimensions	Better in high dimensions

8️⃣ What are the disadvantages of KNN?
Slow prediction (needs to compute distance for all points).

Sensitive to irrelevant features and feature scaling.

Curse of dimensionality affects performance.

Memory intensive (stores the entire dataset).

Doesn’t handle missing values natively.

9️⃣ How does feature scaling affect KNN?
KNN is distance-based, so:

If features have different scales, larger-scale features will dominate.

Standardization (z-score) or Min-Max scaling is essential for KNN to perform well.
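
A small sketch of the effect, using invented [age, income] rows: before scaling, income differences swamp age differences, so the nearest neighbor of person 0 is chosen almost entirely by income. The helper `nearest_other` is written for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: [age in years, income in dollars]
people = np.array([[25.0, 50_000.0],
                   [60.0, 50_500.0],
                   [27.0, 52_000.0],
                   [45.0, 90_000.0],
                   [35.0, 30_000.0]])

def nearest_other(data, i):
    """Index of the row closest to row i (Euclidean), excluding i itself."""
    d = np.linalg.norm(data - data[i], axis=1)
    d[i] = np.inf
    return int(np.argmin(d))

raw_neighbor = nearest_other(people, 0)
scaled_neighbor = nearest_other(StandardScaler().fit_transform(people), 0)

print("Nearest to person 0 without scaling:", raw_neighbor)
print("Nearest to person 0 with scaling:   ", scaled_neighbor)
```

Without scaling, the 60-year-old with a nearly identical income is "closest"; after standardization, the 27-year-old with a similar income is.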

🔟 What is PCA (Principal Component Analysis)?
PCA is a dimensionality reduction technique that:

Transforms data into a new coordinate system (principal components).

The first component captures the maximum variance, the second the next highest, and so on.

Helps in visualization, noise reduction, and speeding up models.

1️⃣1️⃣ How does PCA work?
Steps:

Center the data (subtract mean).

Compute covariance matrix.

Find eigenvectors and eigenvalues of covariance matrix.

Select top-k principal components based on largest eigenvalues.

Transform data into the new lower-dimensional space.
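
The steps above can be followed literally in NumPy. This is a sketch on synthetic correlated data (generated here only for illustration), compared against scikit-learn's `PCA` as a sanity check; scikit-learn itself uses an SVD rather than an explicit covariance matrix, so the two routes are mathematically equivalent rather than identical in implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 observed features driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -1.0]])
data = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# 1. Center the data
centered = data - data.mean(axis=0)
# 2. Covariance matrix (features in columns)
cov = np.cov(centered, rowvar=False)
# 3. Eigen-decomposition (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort by eigenvalue, largest first, and keep the top k directions
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
components = eigvecs[:, :k]
# 5. Project onto the new lower-dimensional space
projected = centered @ components

explained_ratio = eigvals[:k] / eigvals.sum()
sk_pca = PCA(n_components=k).fit(data)
print("Manual explained variance ratio: ", explained_ratio)
print("sklearn explained variance ratio:", sk_pca.explained_variance_ratio_)
```

The projected coordinates may differ from scikit-learn's by a sign flip per component, since an eigenvector's direction is only defined up to sign.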

1️⃣2️⃣ What is the geometric intuition behind PCA?
Imagine a cloud of points in space:

PCA finds the directions (axes) along which the data varies the most (principal components).

These directions form a new rotated coordinate system.

PCA projects the data onto this system, capturing as much variance as possible in fewer dimensions.

1️⃣3️⃣ What are Eigenvalues and Eigenvectors in PCA?
Eigenvectors: The directions (principal axes) along which variance is maximized.

Eigenvalues: The amount of variance captured by each eigenvector.

In PCA:

We select eigenvectors corresponding to the largest eigenvalues.

1️⃣4️⃣ What is the difference between Feature Selection and Feature Extraction?
Aspect	Feature Selection	Feature Extraction
Definition	Choose a subset of original features	Create new features by combining existing ones
Method Example	Select top-10 features based on importance	PCA, LDA (reduce dimensionality)
Nature	Retains original features	Transforms features

1️⃣5️⃣ How do you decide the number of components to keep in PCA?
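Look at the explained variance plot (scree plot).

Choose the smallest number of components that captures a desired percentage (e.g., 95%) of the total variance.

Rule of thumb: use the elbow method, keeping components up to the point where the curve flattens and each extra component adds little variance.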
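
The variance-threshold approach can be sketched as follows on Iris. Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep just enough components to reach that fraction of variance; the manual cumulative-sum version is shown next to it for comparison. Variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris_all, _ = load_iris(return_X_y=True)
X_iris_std = StandardScaler().fit_transform(X_iris_all)

# Manual route: cumulative explained variance, first index reaching 95%
full_pca = PCA().fit(X_iris_std)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# Built-in route: let PCA choose the count for a 95% target
auto_pca = PCA(n_components=0.95).fit(X_iris_std)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components for 95% (manual):  ", n_for_95)
print("Components for 95% (PCA):     ", auto_pca.n_components_)
```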

1️⃣6️⃣ Can PCA be used for classification?
PCA itself is unsupervised.

However, you can:

Preprocess data with PCA to reduce dimensions.

Then feed the reduced data into a classifier (like KNN, SVM).

It can improve performance by reducing noise.

1️⃣7️⃣ What are the limitations of PCA?
Only captures linear relationships.

May discard features important for classification but with low variance.

Sensitive to scaling of features.

Principal components are often hard to interpret.

1️⃣8️⃣ How do KNN and PCA complement each other?
PCA reduces dimensionality → mitigates curse of dimensionality → KNN performs better.

PCA handles correlated features → creates orthogonal features → improves KNN’s distance calculations.

Pipeline: scaling → PCA → KNN (scale first, because PCA itself is sensitive to feature scale).
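
A minimal sketch of chaining these steps with scikit-learn's `Pipeline` on Iris. Scaling is placed first here because PCA is sensitive to feature scale; the step names and the choice of 2 components and K=5 are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Each step is fitted on the training data only, then applied to the test data
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_tr, y_tr)
pipe_acc = pipe.score(X_te, y_te)
print(f"Pipeline accuracy: {pipe_acc:.2f}")
```

Wrapping the steps in one object also means the whole chain can be passed to cross-validation or grid search without leaking test-fold statistics into the scaler or PCA.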

1️⃣9️⃣ How does KNN handle missing values in a dataset?
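KNN doesn’t handle missing values natively, because distances cannot be computed on missing entries. You can:

Impute missing values with a simple statistic (mean or median) before training.

Or use a KNN imputer: find the K nearest neighbors using the non-missing features and fill each gap with the neighbors' average for that feature.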

2️⃣0️⃣ What are the key differences between PCA and Linear Discriminant Analysis (LDA)?
Aspect	PCA	LDA
Type	Unsupervised	Supervised
Goal	Maximize variance	Maximize class separation
Labels Needed?	No	Yes (class labels required)
Use Case	Dimensionality reduction, visualization	Classification, dimensionality reduction
Criterion	Eigenvectors of covariance matrix	Eigenvectors of between-class/within-class scatter
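
The difference in inputs shows up directly in the scikit-learn API: `PCA` is fitted on the features alone, while `LinearDiscriminantAnalysis` also needs the class labels. A small sketch on Iris, with variable names chosen for this example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_cmp, y_cmp = load_iris(return_X_y=True)

# PCA: unsupervised, only the feature matrix is passed
X_by_pca = PCA(n_components=2).fit_transform(X_cmp)

# LDA: supervised, labels are required; at most (n_classes - 1) components
X_by_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_cmp, y_cmp)

print("PCA output shape:", X_by_pca.shape)
print("LDA output shape:", X_by_lda.shape)
```

With 3 Iris classes, LDA can produce at most 2 components, whereas PCA could keep up to 4 (the number of features).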



2️⃣1️⃣ Train a KNN Classifier on the Iris dataset and print model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Accuracy
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
2️⃣2️⃣ Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)

from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (separate variable names, so the Iris X, y and split are not overwritten)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train KNN Regressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train_reg, y_train_reg)

# Evaluate
y_pred_reg = knn_reg.predict(X_test_reg)
print("MSE:", mean_squared_error(y_test_reg, y_pred_reg))
2️⃣3️⃣ Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy

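# Use the Iris split for classification
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

for metric in ['euclidean', 'manhattan']:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test))
    print(f"Accuracy ({metric}): {acc:.2f}")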
2️⃣4️⃣ Train a KNN Classifier with different values of K and visualize decision boundaries

import matplotlib.pyplot as plt
import numpy as np

# Reduce Iris to 2 features for visualization
X_vis, y_vis = iris.data[:, :2], iris.target
X_train_v, X_test_v, y_train_v, y_test_v = train_test_split(X_vis, y_vis, test_size=0.2, random_state=42)

# Build the prediction grid once; it is the same for every K
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

ks = [1, 5, 15]
plt.figure(figsize=(12,4))

for i, k in enumerate(ks):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_v, y_train_v)

    # Predict the class at every grid point to draw the decision regions
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.subplot(1, 3, i+1)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y_vis)
    plt.title(f"K={k}")

plt.tight_layout()
plt.show()
2️⃣5️⃣ Apply Feature Scaling before training a KNN model and compare results with unscaled data

# Unscaled KNN
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn_unscaled.predict(X_test))

# Scaled KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f"Accuracy without scaling: {acc_unscaled:.2f}")
print(f"Accuracy with scaling: {acc_scaled:.2f}")
2️⃣6️⃣ Train a PCA model on synthetic data and print the explained variance ratio for each component

from sklearn.decomposition import PCA

# Synthetic data
X_synth, _ = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# PCA
pca = PCA()
pca.fit(X_synth)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)
2️⃣7️⃣ Apply PCA before training a KNN Classifier and compare accuracy with and without PCA

# Without PCA
knn_no_pca = KNeighborsClassifier(n_neighbors=5)
knn_no_pca.fit(X_train_scaled, y_train)
acc_no_pca = accuracy_score(y_test, knn_no_pca.predict(X_test_scaled))

# With PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))

print(f"Accuracy without PCA: {acc_no_pca:.2f}")
print(f"Accuracy with PCA: {acc_pca:.2f}")
2️⃣8️⃣ Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [3, 5, 7, 9], 'metric': ['euclidean', 'manhattan']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)
2️⃣9️⃣ Train a KNN Classifier and check the number of misclassified samples

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

misclassified = (y_test != y_pred).sum()
print(f"Number of misclassified samples: {misclassified}")
3️⃣0️⃣ Train a PCA model and visualize the cumulative explained variance

import matplotlib.pyplot as plt

pca = PCA()
pca.fit(X_train_scaled)

cum_var = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(6,4))
plt.plot(range(1, len(cum_var)+1), cum_var, marker='o')
plt.title("Cumulative Explained Variance by PCA")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance")
plt.grid(True)
plt.show()

3️⃣1️⃣ Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy

for weight in ['uniform', 'distance']:
    knn = KNeighborsClassifier(n_neighbors=5, weights=weight)
    knn.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test_scaled))
    print(f"Accuracy (weights={weight}): {acc:.2f}")
3️⃣2️⃣ Train a KNN Regressor and analyze the effect of different K values on performance

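# Regenerate the synthetic regression data so this cell does not depend on what X_train/y_train currently hold
X_k, y_k = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X_train_k, X_test_k, y_train_k, y_test_k = train_test_split(X_k, y_k, test_size=0.2, random_state=42)

ks = [1, 3, 5, 10, 20]
for k in ks:
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train_k, y_train_k)
    mse = mean_squared_error(y_test_k, knn_reg.predict(X_test_k))
    print(f"K={k}, MSE={mse:.2f}")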
3️⃣3️⃣ Implement KNN Imputation for handling missing values in a dataset

from sklearn.impute import KNNImputer
import numpy as np

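# Create a copy of the Iris features with missing values
X_missing = iris.data.copy()
X_missing[::10, 0] = np.nan  # Blank out the first feature in every 10th row

# Each gap is filled with the mean of that feature over the 5 nearest rows,
# where distance is computed on the features that are present
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X_missing)
print("Imputed dataset (first 5 rows):\n", X_imputed[:5])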
3️⃣4️⃣ Train a PCA model and visualize the data projection onto the first two principal components

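# Standardize the full Iris feature matrix before projecting
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, cmap='viridis')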
plt.title("Data projected onto first two PCA components")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
3️⃣5️⃣ Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance

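import time

# Both are exact searches, so accuracy should match (ties aside); they differ mainly in speed
for algo in ['kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(algorithm=algo)
    knn.fit(X_train_scaled, y_train)
    start = time.perf_counter()
    acc = accuracy_score(y_test, knn.predict(X_test_scaled))
    elapsed = time.perf_counter() - start
    print(f"Accuracy (algorithm={algo}): {acc:.2f}, predict time: {elapsed:.4f}s")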
3️⃣6️⃣ Train a PCA model on a high-dimensional dataset and visualize the Scree plot

X_hd, _ = make_regression(n_samples=200, n_features=50, noise=5, random_state=42)
X_hd_scaled = StandardScaler().fit_transform(X_hd)

pca_hd = PCA()
pca_hd.fit(X_hd_scaled)

plt.plot(np.arange(1, 51), pca_hd.explained_variance_ratio_, marker='o')
plt.title("Scree Plot")
plt.xlabel("Component Number")
plt.ylabel("Variance Explained")
plt.show()
3️⃣7️⃣ Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score

from sklearn.metrics import classification_report

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

print(classification_report(y_test, y_pred))
3️⃣8️⃣ Train a PCA model and analyze the effect of different numbers of components on accuracy

for n in [1, 2, 3, 4]:
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)
    
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_pca, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test_pca))
    print(f"Components={n}, Accuracy={acc:.2f}")
3️⃣9️⃣ Train a KNN Classifier with different leaf_size values and compare accuracy

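# leaf_size tunes tree construction/query speed and memory, not which neighbors are found,
# so accuracy is expected to stay the same across these values
for leaf in [10, 30, 50, 70]: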
    knn = KNeighborsClassifier(n_neighbors=5, leaf_size=leaf)
    knn.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test_scaled))
    print(f"Leaf Size={leaf}, Accuracy={acc:.2f}")
4️⃣0️⃣ Train a PCA model and visualize how data points are transformed before and after PCA

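# Left panel: first two raw Iris features
X_2d = iris.data[:, :2]

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.scatter(X_2d[:,0], X_2d[:,1], c=iris.target)
plt.title("Original Data (first two features)")

# Right panel: all four standardized features projected onto two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(StandardScaler().fit_transform(iris.data))
plt.subplot(1,2,2)
plt.scatter(X_pca[:,0], X_pca[:,1], c=iris.target)
plt.title("Data after PCA")
plt.show()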
