Q1. What is K-Nearest Neighbors (KNN) and how does it work?

Ans1. K-Nearest Neighbors (KNN) is a simple, intuitive, and widely used supervised machine learning algorithm for classification and regression tasks.

KNN is a non-parametric and instance-based learning algorithm that classifies or predicts a data point based on the labels of its nearest neighbors in the feature space.

Key points:
KNN is simple and effective, but can be slow with large datasets since it computes distances to all training points.

The choice of K affects performance: a small K can be noisy, a large K can smooth out details.

It works best when data is well-distributed and features are normalized/scaled.
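
To make the mechanics concrete, here is a minimal from-scratch sketch of KNN classification: compute the distance from the query to every training point, keep the K closest, and take a majority vote. The function name `knn_predict` and the toy data are made up for illustration; this is not how scikit-learn implements it internally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point (the costly step)
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```

Note that there is no training step beyond storing the data, which is why KNN is called a lazy learner.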





Q2. What is the difference between KNN Classification and KNN Regression?

Ans2. The difference between KNN Classification and KNN Regression lies in the type of output they produce and how they aggregate the values of the nearest neighbors:

🔹 KNN Classification
Purpose: Predict a category/label (discrete output).

Output: A class label (e.g., "spam" or "not spam", "dog" or "cat").

Aggregation: Majority vote among the K nearest neighbors.

🔹 KNN Regression
Purpose: Predict a numerical value (continuous output).

Output: A real number (e.g., price, temperature, height).

Aggregation: Average of the K nearest neighbors' target values.
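
The only step that differs between the two variants is how the neighbors are combined. A small sketch, assuming the K = 5 nearest neighbors have already been found (the labels and prices are made-up values):

```python
from collections import Counter
from statistics import mean

# Labels/values of the K=5 nearest neighbors of some query point (made-up values)
neighbor_classes = ["spam", "not spam", "spam", "spam", "not spam"]
neighbor_prices = [200.0, 220.0, 210.0, 190.0, 230.0]

# Classification: majority vote over the neighbors' labels
predicted_class = Counter(neighbor_classes).most_common(1)[0][0]

# Regression: average of the neighbors' target values
predicted_price = mean(neighbor_prices)

print(predicted_class)
print(predicted_price)
```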





Q3. What is the role of the distance metric in KNN?

Ans3. The distance metric in K-Nearest Neighbors (KNN) is crucial because it determines how "close" or "similar" two data points are in the feature space.

🔹 Role of Distance Metric:
When a new data point is given, KNN uses the distance metric to compare it to all points in the training dataset.

Based on these distances, it selects the K nearest neighbors.

These neighbors are then used to make a prediction (via majority vote in classification or averaging in regression).
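
The choice of metric can change which point counts as "nearest". The sketch below compares Euclidean and Manhattan distance on two hand-picked points; the helper names and coordinates are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

query = (0.0, 0.0)
p = (3.0, 0.0)
q = (2.0, 2.0)

# Euclidean: q (about 2.83) is closer to the query than p (3.0)
# Manhattan: p (3.0) is closer to the query than q (4.0)
print(euclidean(query, p), euclidean(query, q))
print(manhattan(query, p), manhattan(query, q))
```

With K = 1, the two metrics would therefore pick different neighbors for the same query.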





Q4. What is the Curse of Dimensionality in KNN?

Ans4. The Curse of Dimensionality refers to problems that arise when working with data in high-dimensional spaces (i.e., when the number of features/variables is very large). In the context of K-Nearest Neighbors (KNN), it negatively affects the algorithm's performance and accuracy.

KNN relies on distance calculations to find the nearest neighbors. But in high dimensions:

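Distances become less meaningful: the nearest and the farthest points end up at nearly the same distance, so "nearest" carries little information.

Sparsity of data: a fixed number of samples covers an ever smaller fraction of the space, so the neighbors are far away and less representative.

Increased computation: every distance calculation involves more features, so each query costs more.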
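
The loss of contrast between near and far points can be observed directly. The sketch below draws uniform random points in a unit hypercube and reports (farthest - nearest) / nearest for a random query; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the number of dimensions grows
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```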





Q5. How can we choose the best value of K in KNN?

Ans5. Choosing the right value of K (the number of nearest neighbors) is critical for the performance of a KNN model. A poor choice can lead to overfitting or underfitting.


| Value of K          | Behavior                | Effect                          |
| ------------------- | ----------------------- | ------------------------------- |
| Small K (e.g., K=1) | Very sensitive to noise | High variance → **overfitting** |
| Large K             | Smoother predictions    | High bias → **underfitting**    |
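
One common way to pick K between these two extremes is cross-validation: score several candidate values and keep the one with the best mean validation accuracy. A sketch using scikit-learn on the Iris dataset; the candidate range and `cv=5` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Candidate K values; the range is an arbitrary choice
k_values = list(range(1, 22, 2))
mean_scores = {}
for k in k_values:
    # Scaling lives inside the pipeline so each CV fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(f"Best K: {best_k} (mean CV accuracy {mean_scores[best_k]:.3f})")
```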



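Q6. What are KD Tree and Ball Tree in KNN?

Ans6. KD Tree and Ball Tree are data structures used to speed up the process of finding nearest neighbors in K-Nearest Neighbors (KNN), especially for large datasets.

KNN is a lazy, instance-based algorithm. Without optimization, it must compute the distance between the test point and every point in the training set, which is slow. KD Tree and Ball Tree help reduce this computational cost.

🔹 1. KD Tree (K-Dimensional Tree)
A binary tree that recursively splits the data along one feature axis at a time, so a query can skip whole regions that cannot contain a closer neighbor.

🔹 2. Ball Tree
A binary tree that recursively groups points into nested hyperspheres ("balls"); a query skips any ball that cannot contain a closer neighbor than the best one found so far.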



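Q6. When should you use KD Tree vs. Ball Tree?

Ans6. The choice between KD Tree and Ball Tree depends mainly on the dimensionality of your data and its distribution. KD Tree is the better fit under the conditions below; Ball Tree tends to cope better with higher-dimensional or unevenly distributed (clustered) data.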

| Condition                         | Reason                                                                   |
| --------------------------------- | ------------------------------------------------------------------------ |
| **Low-dimensional data** (≤ 20)   | KD Tree performs fast axis-aligned splits efficiently in low dimensions. |
| Data is **uniformly distributed** | KD Tree divides space cleanly when data is evenly spread.                |
| You need **simple, fast queries** | KD Tree has lower overhead in simple spaces.                             |
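
In scikit-learn the search structure is selected with the `algorithm` parameter. The sketch below runs the same query with brute force, KD Tree, and Ball Tree on random synthetic data (the data shape and seed are arbitrary) and checks that all three return the same neighbor distances.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 3))      # low-dimensional synthetic data
queries = rng.random((5, 3))

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=3, algorithm=algorithm).fit(X)
    distances, indices = nn.kneighbors(queries)
    results[algorithm] = distances

# The tree structures change how neighbors are found, not which ones are found
print(np.allclose(results["brute"], results["kd_tree"]))
print(np.allclose(results["brute"], results["ball_tree"]))
```

The same `algorithm` values are accepted by `KNeighborsClassifier` and `KNeighborsRegressor`; the default, `'auto'`, lets the library choose.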




Q7. What are the disadvantages of KNN?

Ans7. While KNN is simple and effective for many problems, it also has several limitations that can affect its performance and scalability.

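🔹 1. Slow Prediction Time
Every prediction computes distances to the stored training points, so queries get slower as the dataset grows.

🔹 2. Sensitive to Irrelevant or Redundant Features
All features contribute to the distance, so uninformative ones can drown out the useful ones.

🔹 3. Curse of Dimensionality
Distances become less meaningful as the number of features grows.

🔹 4. Memory Intensive
The whole training set must be kept in memory, since there is no compact trained model.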



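Q8. How does feature scaling affect KNN?

Ans8. Feature scaling is critical for the performance of K-Nearest Neighbors (KNN) because KNN is a distance-based algorithm.

🔹 Why is scaling important?
KNN uses distance metrics like Euclidean or Manhattan to compute the closeness between data points, so a feature with a large numeric range (e.g., income) dominates a feature with a small range (e.g., age) unless both are brought to a comparable scale. Common scaling methods: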



| Method             | Description                              | Use Case                             |
| ------------------ | ---------------------------------------- | ------------------------------------ |
| **StandardScaler** | Mean = 0, Std Dev = 1                    | Common for normally distributed data |
| **MinMaxScaler**   | Scales features to [0, 1] range          | Best for bounded feature ranges      |
| **RobustScaler**   | Uses median and IQR (robust to outliers) | Good when data has outliers          |
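
The effect is easy to see on a tiny made-up dataset where one column has a much larger range than the other. The sketch below prints two pairwise distances before scaling and after each of the three scalers in the table; the numbers and the `euclidean` helper are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up data: column 0 is age (years), column 1 is income (currency units)
X = np.array([
    [25.0, 40_000.0],
    [45.0, 42_000.0],
    [26.0, 90_000.0],
    [50.0, 60_000.0],
])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Unscaled: income dominates, so the 20-year age gap between rows 0 and 1 barely registers
print("unscaled", euclidean(X[0], X[1]), euclidean(X[0], X[2]))

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled[name] = scaler.fit_transform(X)
    print(name, euclidean(scaled[name][0], scaled[name][1]), euclidean(scaled[name][0], scaled[name][2]))
```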



Q10. What is PCA (Principal Component Analysis)?

Ans10. Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets by transforming them into a smaller number of uncorrelated variables called principal components, while preserving as much variance (information) as possible.




Q11. How does PCA work?

Ans11. Principal Component Analysis (PCA) works by finding new axes (directions) — called principal components — that best capture the maximum variance in the data. It then projects the data onto these axes to reduce the number of dimensions while retaining as much information as possible.

✅ 1. Standardize the Data

✅ 2. Compute the Covariance Matrix

✅ 3. Compute Eigenvectors and Eigenvalues

✅ 4. Sort Eigenvalues and Select Top k Components

✅ 5. Project the Data onto the Selected Components
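
The steps above can be sketched with plain NumPy. The synthetic data, the seed, and `k = 2` are assumptions for illustration; in practice `sklearn.decomposition.PCA` does the equivalent work (it uses an SVD rather than an explicit covariance matrix).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 features, with the second feature correlated to the first
X = rng.normal(size=(200, 3))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

# 1. Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by descending eigenvalue and keep the top k components
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 2
components = eigenvectors[:, :k]

# Project the data onto the selected components
X_reduced = X_std @ components

print(X_reduced.shape)
print(eigenvalues / eigenvalues.sum())  # explained variance ratio per component
```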



Q12. What is the geometric intuition behind PCA?

Ans12. Geometric Intuition of PCA:

Imagine you have a cloud of data points scattered in a high-dimensional space. The goal of PCA is to find new axes (directions) along which the data varies the most, and then project the data onto those axes to simplify it while preserving the main structure.
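
A small 2-D experiment illustrates this picture. The sketch builds a synthetic cloud stretched along the diagonal y = x and checks that the first principal axis lines up with that diagonal; the data, scales, and seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D cloud stretched along the diagonal y = x, with a little noise across it
t = rng.normal(scale=3.0, size=500)
noise = rng.normal(scale=0.3, size=500)
X = np.column_stack([t + noise, t - noise])

X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# eigh returns eigenvalues in ascending order, so the last column is the first principal axis
first_axis = eigenvectors[:, -1]
diagonal = np.array([1.0, 1.0]) / np.sqrt(2.0)

# Close to 1.0 when the first principal axis lines up with the diagonal (sign is arbitrary)
alignment = abs(first_axis @ diagonal)
print(alignment)
```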

Q13. What is the difference between Feature Selection and Feature Extraction?

Ans13.

| Aspect         | Feature Selection                                                       | Feature Extraction                                                  |
| -------------- | ----------------------------------------------------------------------- | ------------------------------------------------------------------- |
| **Definition** | Selecting a **subset** of the original features based on some criteria. | Creating **new features** by transforming the original features.    |
| **Output**     | Subset of original features (unchanged features).                       | New features (combinations or projections of original features).    |
| **Goal**       | Keep the most relevant original features.                               | Create compact, informative features summarizing the original data. |
| **Examples**   | Filter methods (e.g., correlation, Chi-square)                          | PCA, LDA                                                            |
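
The contrast shows up clearly in code. In the sketch below, `SelectKBest` with the chi-square score keeps two of the original Iris columns unchanged, while PCA builds two new columns from all four; the choice of `k=2` and of the Iris dataset is only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the 4 original columns, values unchanged
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

# Feature extraction: build 2 new columns as linear combinations of all 4
X_extracted = PCA(n_components=2).fit_transform(X)

print(kept)
print(np.array_equal(X_selected, X[:, kept]))
print(X_extracted.shape)
```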

Q14. What are Eigenvalues and Eigenvectors in PCA?

Ans14. In PCA (Principal Component Analysis), eigenvalues and eigenvectors come from the covariance matrix of the data and are fundamental to identifying the directions (principal components) that capture the most variance.

🔹 What are Eigenvectors?
An eigenvector is a direction (a vector) in the feature space.

🔹 What are Eigenvalues?
An eigenvalue is a scalar associated with each eigenvector.

It measures the amount of variance in the data along its corresponding eigenvector (principal component).

🔹 Role in PCA:
PCA computes eigenvectors and eigenvalues of the covariance matrix of the data. The eigenvectors with the largest eigenvalues are kept as the principal components.
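
In symbols, the relationship can be written as follows. The notation is an assumption introduced here: $X_c$ is the centered $n \times d$ data matrix, $\Sigma$ its covariance matrix, and $(\lambda_i, v_i)$ the $i$-th eigenvalue and eigenvector pair.

```latex
\Sigma = \frac{1}{n-1} X_c^{\top} X_c, \qquad
\Sigma v_i = \lambda_i v_i, \qquad
\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}
```

Each $v_i$ gives a direction, and $\lambda_i$ is the variance of the data projected onto that direction.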






Q15. How do you decide the number of components to keep in PCA?

Ans15. To decide how many principal components to keep in PCA, you typically use one or more of these approaches:

Explained Variance Threshold:
Keep the smallest number of components whose cumulative explained variance reaches a chosen target (90-95% is a common choice).

Scree Plot (Elbow Method):
Plot the cumulative explained variance or eigenvalues versus the number of components. Pick the number at the “elbow” point where additional components add little extra variance.

Kaiser Criterion:
Keep components with eigenvalues greater than 1 (mostly in factor analysis contexts).

Cross-Validation / Downstream Performance:
Experiment with different numbers of components and choose the number that gives the best performance in your specific task (e.g., classification accuracy).
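
The variance-based rules are straightforward to apply with scikit-learn. A sketch on the standardized Iris data; the 95% threshold is an example value, and passing a float to `n_components` is scikit-learn's built-in way of applying such a threshold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Explained variance threshold: smallest number of components reaching 95% (example value)
threshold = 0.95
n_by_threshold = int(np.argmax(cumulative >= threshold)) + 1

# Eigenvalue-greater-than-1 rule (meaningful when the features were standardized)
n_by_eigenvalue = int(np.sum(pca.explained_variance_ > 1.0))

# scikit-learn can apply the threshold itself when n_components is a float in (0, 1)
pca_95 = PCA(n_components=threshold).fit(X_scaled)

print(n_by_threshold, n_by_eigenvalue, pca_95.n_components_)
```

The rules do not have to agree; when they differ, downstream performance is usually the deciding factor.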


Q16. Can PCA be used for classification?


Ans16. PCA itself is not a classification algorithm, but it can be very useful as a preprocessing step for classification tasks.


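How PCA helps in classification:
Dimensionality Reduction: the classifier works with fewer input features.
Noise Reduction: low-variance components, which often carry mostly noise, are discarded.
Improved Model Performance: many classifiers (like KNN, SVM, logistic regression) perform better or faster with fewer, uncorrelated features.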




Q17. What are the limitations of PCA?

Ans17.
Linearity Assumption
PCA assumes that the data’s structure can be captured by linear combinations of features. It fails to capture non-linear relationships in the data.

Loss of Interpretability
Principal components are linear combinations of original features and can be hard to interpret, especially in high dimensions.

Variance Does Not Equal Importance
PCA ranks directions by variance alone and ignores the labels, so a low-variance direction that matters for the task can be discarded.





Q18. How do KNN and PCA complement each other?

Ans18.
1. PCA reduces dimensionality for KNN


KNN performance degrades with high-dimensional data (curse of dimensionality).

PCA reduces the number of features by projecting data onto fewer principal components while preserving most of the variance.

2. PCA speeds up KNN
KNN computes distances between points, which gets slower with more features.
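
A convenient way to combine the two is a scikit-learn pipeline that scales, projects with PCA, and then runs KNN. The sketch below compares KNN on all 64 features of the digits dataset with KNN on 20 principal components; the dataset, the 20 components, and `cv=5` are illustrative choices, and no particular accuracy difference is implied.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Digits has 64 features, so there is room for PCA to shrink the distance computations
X, y = load_digits(return_X_y=True)

knn_only = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pca_knn = make_pipeline(StandardScaler(), PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))

results = {}
for name, model in (("KNN on 64 features", knn_only), ("PCA(20) + KNN", pca_knn)):
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy {results[name]:.3f}")
```

Keeping PCA inside the pipeline means it is refit on the training part of every fold, which avoids leaking information from the validation part.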



Q19. How does KNN handle missing values in a dataset?

Ans19. KNN itself does not inherently handle missing values. You need to preprocess the data before applying KNN. Here are common strategies:

1. Imputation Before KNN
Mean/Median/Mode Imputation:

Replace each missing value with the mean, median, or mode of its feature.

KNN Imputation:

Use KNN itself to impute missing values by finding nearest neighbors based on other features and averaging their values.

2. Remove Samples or Features
Drop rows or columns that have too many missing values, at the cost of losing data.


3. Distance Calculation Adjustments
Some advanced KNN implementations adjust distance calculations to ignore missing values or estimate distances based on available features, but this is not standard.
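
For the KNN imputation strategy, scikit-learn provides `KNNImputer`. A minimal sketch on a made-up array with missing entries marked as NaN; `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data with missing entries marked as NaN
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the 2 nearest rows,
# where nearness is measured on the features both rows have observed
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The imputed array can then be scaled and passed to a KNN model as usual.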




Q20.  What are the key differences between PCA and Linear Discriminant Analysis (LDA)?

Ans20.

| Aspect                   | PCA (Principal Component Analysis)                      | LDA (Linear Discriminant Analysis)                                                               |
| ------------------------ | ------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| **Purpose**              | Unsupervised dimensionality reduction                   | Supervised dimensionality reduction and classification                                           |
| **Goal**                 | Maximize variance in data (capture most information)    | Maximize class separability (maximize between-class variance and minimize within-class variance) |
| **Use of Labels**        | Does **not** use class labels                           | Uses class labels                                                                                |
| **Components**           | Principal components are directions of maximum variance | Discriminant components maximize class separability                                              |
| **Number of Components** | Up to number of original features                       | Up to (number of classes - 1) components                                                         |
| **Assumptions**          | No assumptions about data distribution                  | Assumes normally distributed classes with equal covariance matrices                              |
| **Typical Use Cases**    | Data visualization, noise reduction, preprocessing      | Classification, supervised feature extraction                                                    |
| **Feature Extraction**   | Finds axes capturing most overall data variance         | Finds axes that best separate classes                                                            |
| **Interpretability**     | Components represent major variance directions          | Components represent directions maximizing class separation                                      |
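
The difference in label usage and component count is visible in the scikit-learn API. A sketch on the Iris dataset (3 classes, 4 features), chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA is unsupervised: fitting uses only X
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: fitting needs the labels y.
# With 3 classes it can produce at most 3 - 1 = 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
# LDA also works as a classifier on its own
print(f"LDA training accuracy: {lda.score(X, y):.3f}")
```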



 Practical




Q21.  Train a KNN Classifier on the Iris dataset and print model accuracy

Ans21.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for KNN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize KNN with k=5 (default)
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train, y_train)

# Predict on test set
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Classifier Accuracy on Iris dataset: {accuracy:.2f}")




Q22.  Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)


Ans22.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Create synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for KNN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize KNN Regressor with k=5
knn_regressor = KNeighborsRegressor(n_neighbors=5)

# Train the model
knn_regressor.fit(X_train, y_train)

# Predict on test set
y_pred = knn_regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"KNN Regressor Mean Squared Error on synthetic data: {mse:.2f}")


Q23. Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy


Ans23.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize KNN with Euclidean distance (default)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# Initialize KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print(f"Accuracy with Euclidean distance: {accuracy_euclidean:.2f}")
print(f"Accuracy with Manhattan distance: {accuracy_manhattan:.2f}")



Q24. Train a KNN Classifier with different values of K and visualize decision boundaries

Ans24.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load Iris dataset and select first two features for visualization
iris = load_iris()
X = iris.data[:, :2]  # Only first two features (sepal length, sepal width)
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

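# Function to plot decision boundaries
def plot_decision_boundary(clf, X, y, ax, title):
    h = 0.02  # step size in mesh
    # Build a grid covering the feature range with a small margin
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict a class for every grid point and draw the resulting regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', s=20)
    ax.set_title(title)

# Train one classifier per K and plot its decision boundary
# (the K values below are illustrative choices)
k_values = [1, 5, 15]
fig, axes = plt.subplots(1, len(k_values), figsize=(15, 4))
for ax, k in zip(axes, k_values):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train_scaled, y_train)
    plot_decision_boundary(clf, X_train_scaled, y_train, ax, f"K = {k}")
plt.show()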


Q25. Apply Feature Scaling before training a KNN model and compare results with unscaled data

Ans25.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Without scaling ---
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
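y_pred_unscaled = knn_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)

# --- With scaling ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
accuracy_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f"Accuracy without scaling: {accuracy_unscaled:.2f}")
print(f"Accuracy with scaling: {accuracy_scaled:.2f}")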



Q26. Train a PCA model on synthetic data and print the explained variance ratio for each component


Ans26.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Generate synthetic data (e.g., classification data with 10 features)
X, _ = make_classification(n_samples=500, n_features=10, random_state=42)

# Train PCA
pca = PCA()
pca.fit(X)

# Print explained variance ratio for each component
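explained_variance_ratio = pca.explained_variance_ratio_
for i, ratio in enumerate(explained_variance_ratio, start=1):
    print(f"Component {i}: {ratio:.4f}")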


Q27.  Apply PCA before training a KNN Classifier and compare accuracy with and without PCA


Ans27.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (important for both PCA and KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- KNN without PCA ---
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
accuracy_without_pca = accuracy_score(y_test, y_pred)

# --- Apply PCA ---
pca = PCA(n_components=2)  # Reduce to 2 components for example
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# --- KNN with PCA ---
knn_pca = KNeighborsClassifier(n_neighbors=5)
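knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_with_pca = accuracy_score(y_test, y_pred_pca)

print(f"Accuracy without PCA: {accuracy_without_pca:.2f}")
print(f"Accuracy with PCA (2 components): {accuracy_with_pca:.2f}")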


Q28. Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV


Ans28.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define KNN classifier
knn = KNeighborsClassifier()

# Define parameter grid to search
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Set up GridSearchCV
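# cv=5 and accuracy scoring are assumed here; the original line was cut off
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.2f}")

# Evaluate the best model on the held-out test set
y_pred = grid_search.best_estimator_.predict(X_test_scaled)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.2f}")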



Q29.  Train a KNN Classifier and check the number of misclassified samples

Ans29.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict on test data
y_pred = knn.predict(X_test_scaled)

# Calculate misclassified samples
misclassified = (y_test != y_pred).sum()
print(f"Number of misclassified samples: {misclassified}")

Q30.  Train a PCA model and visualize the cumulative explained variance.

Ans30.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Generate synthetic data with 10 features
X, _ = make_classification(n_samples=500, n_features=10, random_state=42)

# Train PCA
pca = PCA()
pca.fit(X)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance
plt.figure(figsize=(8, 5))
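plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Cumulative Explained Variance by PCA Components')
plt.grid(True)
plt.show()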