*1.  What is K-Nearest Neighbors (KNN) and how does it work ?*

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression tasks. It's one of the simplest and most intuitive algorithms in machine learning.

How KNN Works
Training Phase:

KNN doesn't explicitly learn a model during training (it's a lazy learner). It simply stores the training dataset.

Prediction Phase:

To classify (or predict) a new data point:

Measure Distance: Calculate the distance (typically Euclidean) between the new data point and all the points in the training set.

Find Neighbors: Select the K closest data points (neighbors) to the new point.

Vote (for Classification):

The new point is assigned the most common class among its K neighbors.

Average (for Regression):

The new point's value is the average of the values of its K nearest neighbors.
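
The prediction steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration rather than how any library implements it; the function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single point with brute-force KNN (illustrative sketch)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Measure distance: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Find neighbors: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Vote: most common class among the k neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Average: mean target value of the k neighbors
    return float(np.mean(neighbor_targets))

# Toy data (hypothetical): two well-separated groups
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
y_class = ["A", "A", "A", "B", "B", "B"]
y_value = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

label = knn_predict(X_toy, y_class, [1.2, 1.4], k=3)
value = knn_predict(X_toy, y_value, [1.2, 1.4], k=3, task="regression")
print(label, value)
```

The query point sits inside the first group, so its three nearest neighbors all carry class "A" and the regression output is the mean of their three target values.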

*2.  What is the difference between KNN Classification and KNN Regression ?*

The main difference between KNN Classification and KNN Regression lies in how they make predictions based on the nearest neighbors:

KNN Classification
Purpose: Assigns a class label to the input data point.

Output: A categorical value (e.g., "spam" or "not spam").

Prediction Method:

Find the K nearest neighbors.

Count the number of neighbors belonging to each class.

Assign the class with the majority vote.

Example:
Classify a flower as setosa, versicolor, or virginica based on petal measurements.

KNN Regression
Purpose: Predicts a continuous value for the input data point.

Output: A numerical value (e.g., price, temperature).

Prediction Method:

Find the K nearest neighbors.

Compute the average (or weighted average) of their output values.

Return this average as the prediction.

Example:
Predict the price of a house based on size, number of rooms, and location.
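
To make the "average (or weighted average)" step concrete, here is a small sketch using scikit-learn's `KNeighborsRegressor` with its `weights` parameter. The one-dimensional toy data is invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data (hypothetical): the target is 10 times the feature
X_train = np.array([[0.0], [1.0], [3.0]])
y_train = np.array([0.0, 10.0, 30.0])
query = np.array([[0.8]])

# Plain average of the 2 nearest targets (the points at x=1.0 and x=0.0)
uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X_train, y_train)
# Inverse-distance weighting: the closer neighbor (x=1.0) counts more
weighted = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X_train, y_train)

uniform_pred = uniform.predict(query)[0]
weighted_pred = weighted.predict(query)[0]
print(f"uniform: {uniform_pred:.2f}, distance-weighted: {weighted_pred:.2f}")
```

With uniform weights the prediction is the plain mean of 10 and 0. With distance weights the neighbor at x=1.0 (distance 0.2) outweighs the one at x=0.0 (distance 0.8), which pulls the prediction toward 10.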

*3.  What is the role of the distance metric in KNN ?*

The distance metric in K-Nearest Neighbors (KNN) plays a crucial role in determining which training data points are considered "nearest" to a given test point. Since KNN relies on proximity to make predictions, the way distance is calculated directly impacts model accuracy.

Role of the Distance Metric

It measures similarity: Closer points are assumed to be more similar.

It determines which K neighbors are selected.

It affects both classification (via voting) and regression (via averaging).
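
A short sketch of how the choice of metric can change which point counts as "nearest". The helper `minkowski_distance` and the three points are assumptions made for this example; p=1 gives Manhattan distance and p=2 gives Euclidean distance.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

query = [0.0, 0.0]
point_a = [3.0, 3.0]  # lies on the diagonal from the query
point_b = [0.0, 5.0]  # lies along a single axis

euclid_a = minkowski_distance(query, point_a, p=2)
euclid_b = minkowski_distance(query, point_b, p=2)
manhattan_a = minkowski_distance(query, point_a, p=1)
manhattan_b = minkowski_distance(query, point_b, p=1)

print(f"Euclidean: a={euclid_a:.3f}, b={euclid_b:.3f}")
print(f"Manhattan: a={manhattan_a:.3f}, b={manhattan_b:.3f}")
```

Under Euclidean distance `point_a` is the closer of the two, while under Manhattan distance `point_b` is closer, so a 1-nearest-neighbor prediction for this query would depend on the metric.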



*4.  What is the Curse of Dimensionality in KNN?*

The Curse of Dimensionality in K-Nearest Neighbors (KNN) refers to the problems that arise when working with data that has many features (high dimensions). As the number of dimensions increases, the performance and effectiveness of KNN (and many other machine learning algorithms) can degrade.

The Curse of Dimensionality means that high-dimensional data can break KNN because distances become less meaningful, neighbors become less distinguishable, and the algorithm struggles to generalize well.
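
One way to see distances becoming "less meaningful" is to measure how much farther the farthest point is than the nearest one as the dimension grows. The sketch below uses uniformly random points, which is an assumption chosen for simplicity; the helper name `relative_contrast` is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(max - min) / min distance from a random query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrasts = {dims: relative_contrast(1000, dims) for dims in (2, 10, 100, 1000)}
for dims, contrast in contrasts.items():
    print(f"{dims:>5} dimensions: relative contrast = {contrast:.3f}")
```

In low dimensions the nearest point is far closer than the farthest one, so the contrast is large. In very high dimensions the contrast shrinks toward zero, which is the sense in which neighbors become hard to distinguish.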

*5.  How can we choose the best value of K in KNN ?*

Choosing the best value of K in K-Nearest Neighbors (KNN) is critical for achieving optimal model performance. A poor choice can lead to overfitting or underfitting.

How to Choose the Best K :

1. Use Cross-Validation (Best Practice)
Split your dataset into training and validation sets (e.g., using k-fold cross-validation).

Try different values of K (e.g., 1 to 40).

Measure the validation accuracy or error for each value.

Choose the K with the best performance on the validation set.

2. Use the Elbow Method (for visualization)
Plot accuracy vs. K or error rate vs. K.

Look for the "elbow point" where the accuracy levels off or error starts increasing.

3. Heuristic Rules (Quick Estimates)
Try odd values of K (to avoid ties in classification).

A common starting point:

$K = \sqrt{n}$

where $n$ = number of training samples.
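
The cross-validation approach can be sketched as a simple loop over candidate values of K. The dataset (Iris), the candidate range, and the use of a scaling pipeline are choices made for this illustration; the square-root heuristic is computed only for comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

k_candidates = list(range(1, 32, 2))  # odd values only, to reduce voting ties
mean_scores = []
for k in k_candidates:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores.append(scores.mean())

best_k = k_candidates[int(np.argmax(mean_scores))]
sqrt_k = int(round(np.sqrt(len(X))))  # square-root rule of thumb

print(f"Best K by 5-fold CV: {best_k} (accuracy {max(mean_scores):.3f})")
print(f"Square-root heuristic suggests K near {sqrt_k}")
```

Plotting `mean_scores` against `k_candidates` gives the accuracy-vs-K curve used by the elbow method.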

*6.  What are KD Tree and Ball Tree in KNN?*

KD Tree and Ball Tree are data structures used to speed up the K-Nearest Neighbors (KNN) algorithm by efficiently organizing and searching multi-dimensional data.

They are especially useful when the dataset is large, as brute-force KNN (which compares every test point to all training points) becomes slow.

KD Tree (K-Dimensional Tree)

What is it :

A binary tree that recursively splits the data along one dimension at a time.

Designed for low to moderate-dimensional data (usually < 20 dimensions).

 How it works:
At each level of the tree, the dataset is split by a median value along one axis (e.g., x, y, z...).

Each node contains:

A data point

A splitting axis

Pointers to left/right child nodes

 Use case:

Faster nearest neighbor search in low-dimensional space.

Used by scikit-learn's KNeighborsClassifier(algorithm='kd_tree').

 Limitation:

Becomes inefficient in high dimensions due to the curse of dimensionality.

Ball Tree

 What is it

A tree-based structure that partitions data into hyperspheres (balls) rather than axis-aligned boxes like in KD Trees.

Designed for higher-dimensional data.

 How it works:

At each node:

Data is grouped into two clusters (balls) based on centroid and radius.

The process recursively builds subtrees.

 Use case:
Better than KD Tree for medium-to-high dimensional data or when using non-Euclidean distance metrics.
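
Both structures are available directly in scikit-learn as `sklearn.neighbors.KDTree` and `sklearn.neighbors.BallTree`. The sketch below builds each on random 3-D points (an arbitrary choice for illustration) and checks that they return the same neighbors as a brute-force search.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(42)
X = rng.random((500, 3))      # 500 random points in 3 dimensions
query = rng.random((1, 3))    # a single query point

# Build both trees once; queries can then skip most of the data
kd_tree = KDTree(X, leaf_size=30)
ball_tree = BallTree(X, leaf_size=30)

kd_dist, kd_idx = kd_tree.query(query, k=5)
ball_dist, ball_idx = ball_tree.query(query, k=5)

# Brute force for comparison: distance to every point, then sort
brute_idx = np.argsort(np.linalg.norm(X - query, axis=1))[:5]

print("KD Tree neighbors:  ", kd_idx[0])
print("Ball Tree neighbors:", ball_idx[0])
print("Brute-force result: ", brute_idx)
```

The trees change how fast the neighbors are found, not which neighbors are found, so all three index lists should match.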

*7.  When should you use KD Tree vs. Ball Tree ?*

Choosing between KD Tree and Ball Tree depends primarily on your dataset’s dimensionality, size, and the distance metric you plan to use. Here's a clear guide to help you decide :

Use KD Tree When:

Low-Dimensional Data:

Works best when the number of features (dimensions) is < 20.

In low dimensions, KD Trees are very fast at nearest-neighbor search.

Euclidean Distance:

Optimized for axis-aligned splits, which suit Euclidean or similar metrics.

Balanced or Moderately Sized Datasets:

KD Trees are efficient when the dataset can be recursively split fairly evenly.

Use Ball Tree When:

Medium to High-Dimensional Data:

Ball Trees handle 20+ dimensions better than KD Trees.

Still affected by the curse of dimensionality, but less so than KD Trees.

Non-Euclidean Distance Metrics:

Supports a wider range of metrics (e.g., Haversine, Mahalanobis) than KD Trees, which are mainly suited to Minkowski-family metrics such as Euclidean and Manhattan.

Unstructured or Unevenly Distributed Data:

Ball Trees cluster data in balls (hyperspheres), which can better adapt to irregular distributions.
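
As an example of a non-Euclidean metric, the sketch below uses scikit-learn's `BallTree` with `metric="haversine"` to find the nearest city by great-circle distance. The city coordinates are approximate and the Earth radius is the usual mean-radius approximation; both are assumptions made for this illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

# (latitude, longitude) in degrees; approximate coordinates
cities = ["Paris", "London", "New York"]
coords_deg = np.array([
    [48.8566, 2.3522],
    [51.5074, -0.1278],
    [40.7128, -74.0060],
])
query_deg = np.array([[50.8503, 4.3517]])  # roughly Brussels

# The haversine metric expects [latitude, longitude] in radians
tree = BallTree(np.radians(coords_deg), metric="haversine")
dist_rad, idx = tree.query(np.radians(query_deg), k=1)

earth_radius_km = 6371.0  # mean Earth radius (approximation)
nearest_city = cities[idx[0, 0]]
nearest_km = dist_rad[0, 0] * earth_radius_km

print(f"Nearest city: {nearest_city} (about {nearest_km:.0f} km)")
```

The tree returns distances in radians on the unit sphere, which is why the result is multiplied by the Earth's radius to get kilometres.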

*8. What are the disadvantages of KNN ?*

K-Nearest Neighbors (KNN) is simple and intuitive, but it comes with several significant disadvantages that can limit its performance, especially on large or complex datasets.

Major Disadvantages of KNN
1. Computationally Expensive at Prediction Time
KNN is a lazy learner: it does no training, but must compute distances to all training points at prediction time.

This makes it slow and inefficient for large datasets.

Time Complexity:

Training: $O(1)$

Prediction: $O(n \cdot d)$, where $n$ is the number of training samples and $d$ is the number of features.

2. Sensitive to Irrelevant Features
KNN treats all features equally when calculating distance.

Irrelevant or noisy features can distort distance metrics, leading to poor predictions.

3. Needs Feature Scaling
Features on different scales (e.g., height in cm and income in dollars) can skew distance calculations.

Normalization or standardization is essential before applying KNN.

4. Curse of Dimensionality
As the number of features grows:

All points tend to become equally distant.

It becomes hard to find meaningful neighbors.

Performance and accuracy can degrade sharply.

5. Storage and Memory Intensive
Needs to store the entire training dataset in memory.

Not practical for very large datasets or memory-constrained environments.

6. No Model Interpretability
KNN doesn’t provide a model with interpretable parameters or coefficients.

Hard to explain how a prediction was made beyond “it looked like these neighbors.”

7. Struggles with Imbalanced Data
If one class dominates, KNN may bias toward the majority class, especially with larger K values.

8. Can be Affected by Outliers
Outliers in the training data can mislead predictions, particularly with small K values (e.g., K = 1).

*9.  How does feature scaling affect KNN ?*

Feature scaling has a critical impact on the performance of the K-Nearest Neighbors (KNN) algorithm because KNN relies on distance calculations to identify nearest neighbors.

Why Feature Scaling Matters in KNN :

KNN uses distance metrics like:

Euclidean distance:

$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$

Manhattan, Minkowski, etc.

Problem:

If features have different scales or units, one feature can dominate the distance calculation, regardless of its importance.
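
A tiny numeric sketch of this effect: with height in cm and income in dollars, almost all of the squared Euclidean distance comes from income until the features are standardized. The three data rows and the helper `income_share` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: height (cm), income (dollars); made-up values
people = np.array([
    [160.0, 50_000.0],
    [190.0, 50_500.0],
    [161.0, 53_000.0],
])

def income_share(a, b):
    """Fraction of the squared Euclidean distance contributed by income (column 1)."""
    squared_diff = (a - b) ** 2
    return squared_diff[1] / squared_diff.sum()

# Persons 0 and 1 differ a lot in height (30 cm) but little in income (1%)
raw_share = income_share(people[0], people[1])

scaled = StandardScaler().fit_transform(people)
scaled_share = income_share(scaled[0], scaled[1])

print(f"Income share of squared distance, raw:    {raw_share:.3f}")
print(f"Income share of squared distance, scaled: {scaled_share:.3f}")
```

On the raw data the 500-dollar gap swamps the 30 cm gap purely because of units. After standardization the large height difference accounts for most of the distance.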

*10.  What is PCA (Principal Component Analysis)?*

Principal Component Analysis (PCA) is a powerful dimensionality reduction technique used in machine learning and data analysis. It transforms high-dimensional data into a lower-dimensional space while preserving as much variance (information) as possible.

Purpose of PCA :

Reduce the number of features (dimensions) while retaining important patterns.

Remove multicollinearity between features.

Speed up algorithms and improve visualization.

Help combat the curse of dimensionality (especially useful for KNN).


*11.  How does PCA work?*

How PCA Works Step-by-Step :

1. Standardize the Data
Center each feature by subtracting its mean so that the data has zero mean.

Often, scale to unit variance (standard deviation = 1) so all features contribute equally.

2. Compute the Covariance Matrix
Calculate the covariance matrix $C$ of the standardized data.

The covariance matrix shows how features vary together:

$C = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$

If features are independent, covariances are zero.

3. Calculate Eigenvalues and Eigenvectors
Solve the equation:

$Cv = \lambda v$

where:

$v$ = eigenvector (principal component direction)

$\lambda$ = eigenvalue (variance explained by this component)

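The eigenvector with the largest eigenvalue points in the direction of maximal variance; each later eigenvector points in the direction of greatest remaining variance that is orthogonal to the earlier ones.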

4. Sort Eigenvectors by Eigenvalues
Rank the eigenvectors by their eigenvalues from largest to smallest.

The top eigenvectors correspond to the directions that explain the most variance.

5. Select Top K Principal Components
Choose the first $k$ eigenvectors that capture the majority of variance (e.g., 95%).

These form the new reduced feature space.

6. Project Data onto Principal Components
Transform original data $X$ onto the new space:

$X_{\text{PCA}} = X \times W$

where $W$ is the matrix of selected eigenvectors.
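
The six steps can be followed almost line by line in NumPy. This is a minimal sketch on the Iris features (a dataset chosen here for illustration), with scikit-learn's `PCA` used only as a cross-check; library implementations typically use an SVD rather than an explicit eigendecomposition.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix (features are columns, hence rowvar=False)
C = np.cov(X_std, rowvar=False)

# 3. Eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(C)

# 4. Sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Select the top k principal components
k = 2
W = eigenvectors[:, :k]

# 6. Project the data onto the new space
X_pca = X_std @ W

# Cross-check against scikit-learn (the sign of each component is arbitrary)
X_sklearn = PCA(n_components=k).fit_transform(X_std)
explained = eigenvalues[:k].sum() / eigenvalues.sum()
print(f"Variance explained by {k} components: {explained:.3f}")
print("Matches scikit-learn up to sign:", np.allclose(np.abs(X_pca), np.abs(X_sklearn)))
```

The comparison uses absolute values because an eigenvector and its negation describe the same axis, so two correct implementations may differ by a sign flip per component.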

*12.  What is the geometric intuition behind PCA ?*

Geometric Intuition of PCA
Imagine your data points scattered in space (picture 2D or 3D for easy visualization):

1. Data Cloud in Space
Your dataset is like a cloud of points in space.

Each axis represents one feature (dimension).

The points are spread out unevenly, with some directions having more spread (variance) than others.

2. Finding the Direction of Maximum Variance
PCA tries to find a new axis (line) that best fits the data in terms of spread.

This new axis is the direction along which the data varies the most.

Think of it as the direction along which the cloud of points is stretched out the furthest.

3. First Principal Component
The first principal component is this line of maximum variance.

If you were to project all points onto this line, the projections would have the largest possible spread compared to any other direction.

It captures the most important pattern or information in the data.

4. Subsequent Components
The second principal component is another axis, orthogonal (at right angles) to the first, which captures the next highest variance.

This ensures new components add new, non-redundant information.

You can think of this as the second-best line along which the data spreads out, but perpendicular to the first.

5. Dimensionality Reduction
By choosing just the top $k$ principal components, you flatten your data cloud onto a lower-dimensional subspace (e.g., a plane or a line).

This retains most of the shape and structure of the original data but with fewer dimensions.
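
The geometric picture can be checked on a synthetic 2D cloud. The sketch below builds a cloud that is long in one direction and narrow in the other, rotates it by 45 degrees, and asks PCA for its axes; the spreads, the rotation angle, and the sample size are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# A cloud that is wide along one axis and narrow along the other...
long_axis = rng.normal(scale=3.0, size=500)
short_axis = rng.normal(scale=0.5, size=500)

# ...rotated by 45 degrees so the stretch lies along the diagonal
theta = np.pi / 4
X = np.column_stack([
    long_axis * np.cos(theta) - short_axis * np.sin(theta),
    long_axis * np.sin(theta) + short_axis * np.cos(theta),
])

pca = PCA(n_components=2).fit(X)
pc1, pc2 = pca.components_

# Direction of the first component as an angle in [0, 180) degrees
angle_deg = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180

print(f"First component angle: {angle_deg:.1f} degrees")
print(f"Dot product of PC1 and PC2: {np.dot(pc1, pc2):.2e}")
print(f"Variance explained by PC1: {pca.explained_variance_ratio_[0]:.3f}")
```

The first component should line up with the direction in which the cloud was stretched, the second should be perpendicular to it, and most of the variance should fall on the first.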


*13.  What is the difference between Feature Selection and Feature Extraction ?*

Feature Selection vs. Feature Extraction :

| Aspect                | Feature Selection                                                                                                                 | Feature Extraction                                                                                 |
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| **What it does**      | Selects a **subset of original features**                                                                                         | Creates **new features** by transforming original ones                                             |
| **Original features** | **Kept as-is** (some dropped)                                                                                                     | **Combined or transformed** into new features                                                      |
| **Goal**              | Remove irrelevant or redundant features                                                                                           | Reduce dimensionality by capturing essential information                                           |
| **Interpretability**  | Usually easier to interpret (features stay the same)                                                                              | Harder to interpret (new features are combinations)                                                |
| **Examples**          | - Filter methods (correlation, chi-square) <br> - Wrapper methods (recursive feature elimination) <br> - Embedded methods (Lasso) | - Principal Component Analysis (PCA) <br> - Linear Discriminant Analysis (LDA) <br> - Autoencoders |
| **When to use**       | When original features are meaningful and you want to keep them                                                                   | When dimensionality is very high and you want to compress info                                     |
| **Effect on data**    | Dataset dimension is reduced by **dropping** features                                                                             | Dataset dimension is reduced by **transforming** features                                          |
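
The contrast in the table can be seen in code. In this sketch, `SelectKBest` (a filter-style selection method) keeps two of the original Iris columns unchanged, while `PCA` builds two new columns from all four; the dataset and the choice of two features are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Feature selection: keep the 2 original columns with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_selected = selector.transform(X)
kept = selector.get_support(indices=True)
print("Selected original features:", [iris.feature_names[i] for i in kept])

# Feature extraction: build 2 new features as combinations of all 4 columns
pca = PCA(n_components=2).fit(X)
X_extracted = pca.transform(X)
print("Weights of the first extracted feature:", pca.components_[0].round(2))

# Selected columns are untouched copies of the originals
print("Selected columns equal originals:", (X_selected == X[:, kept]).all())
```

Both outputs have two columns, but only the selected ones still carry their original meaning and units.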


*14.  What are Eigenvalues and Eigenvectors in PCA?*

Eigenvectors and Eigenvalues :

1. Eigenvectors
Eigenvectors are special vectors that, when multiplied by a matrix (like a covariance matrix), do not change direction—only their magnitude might change.

In PCA, each eigenvector represents a principal component — a direction in feature space along which the data varies.

Geometrically, these eigenvectors define the new axes (coordinate system) for your data.

2. Eigenvalues
Each eigenvector has a corresponding eigenvalue, a scalar that indicates how much variance (information) there is along that eigenvector.

Larger eigenvalues mean the corresponding eigenvector captures more variance in the data.

In PCA, eigenvalues tell you the importance of each principal component.

In PCA Context
Compute the covariance matrix of your (usually standardized) data.

Solve the equation:

$Cv = \lambda v$

where:

$C$ = covariance matrix

$v$ = eigenvector

$\lambda$ = eigenvalue

This means that applying the covariance matrix to $v$ simply scales it by $\lambda$.
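
The eigenvalue equation can be verified numerically. The sketch below computes the covariance matrix of the Iris features (an arbitrary example dataset), checks that the leading eigenpair satisfies the equation, and compares the eigenvalues with the variances reported by scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
C = np.cov(X, rowvar=False)  # covariance matrix, features as columns

# eigh returns eigenvalues in ascending order; reverse to get largest first
eigenvalues, eigenvectors = np.linalg.eigh(C)
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

# Check the eigenvalue equation for the leading eigenpair
v, lam = eigenvectors[:, 0], eigenvalues[0]
residual = np.linalg.norm(C @ v - lam * v)
print(f"Norm of (C v - lambda v): {residual:.2e}")

# The eigenvalues are the variances along each principal component
pca = PCA().fit(X)
print("Eigenvalues:           ", eigenvalues.round(4))
print("PCA explained variance:", pca.explained_variance_.round(4))
```

Dividing each eigenvalue by the sum of all eigenvalues gives the explained variance ratio of the corresponding component.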

*15.  How do you decide the number of components to keep in PCA?*

Deciding how many principal components to keep in PCA is a key step because it balances dimensionality reduction and information retention.

How to Decide the Number of Components in PCA :

1. Explained Variance Ratio
Each principal component explains a certain percentage of the total variance in the data.

The explained variance ratio tells you how much information each component captures.

You typically look at the cumulative explained variance to decide how many components to keep.

2. Choose Components Based on Variance Threshold
Common practice: choose the minimum number of components that explain at least 90% to 95% of the variance.

Example:

First 3 PCs explain 92% of variance → keep 3 components.

Keeping more might add little additional information but increase complexity.

3. Scree Plot (Elbow Method)
Plot the explained variance against the number of components.

Look for an “elbow” point where adding more components yields diminishing returns.

Components after the elbow contribute very little additional variance.

4. Domain Knowledge
Sometimes, practical or domain-specific considerations affect how many components to keep.

For visualization, you might pick 2 or 3 components regardless.

For modeling, balancing accuracy and complexity matters.

5. Cross-Validation
Evaluate downstream model performance (e.g., classification accuracy) using different numbers of components.

Choose the number that gives the best trade-off between performance and dimensionality.
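
The variance-threshold rule can be sketched as follows. The Wine dataset and the 95% threshold are choices made for this illustration; passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to pick the number of components for that share of variance.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Fit with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_95} of {X.shape[1]}")

# Same rule, delegated to scikit-learn
pca_95 = PCA(n_components=0.95).fit(X)
print(f"PCA(n_components=0.95) kept: {pca_95.n_components_}")
```

Plotting `cumulative` against the component index gives the curve that the scree-plot method inspects for an elbow.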

*16. Can PCA be used for classification?*

PCA itself is not a classification algorithm, but it can be very useful as a preprocessing step for classification tasks.

How PCA Relates to Classification :

1. PCA for Dimensionality Reduction Before Classification
PCA reduces the number of features by projecting data into a lower-dimensional space while preserving most of the variance.

This can help:

Speed up classification algorithms.

Reduce noise and irrelevant information.

Mitigate overfitting by simplifying the feature space.

2. Improving Classifier Performance
By removing redundant or less informative features, PCA can make classes more separable in the transformed space.

Some classifiers (like KNN or SVM) benefit from a smaller, cleaner feature set.

3. Limitations
PCA is unsupervised — it doesn't use class labels when finding components.

It focuses on variance, not on class separation.

Therefore, PCA might not always find the directions that best discriminate between classes.

4. Alternatives: Supervised Dimension Reduction
Methods like Linear Discriminant Analysis (LDA) take class labels into account and aim to maximize class separability.

Sometimes LDA works better than PCA for classification.


*17.  What are the limitations of PCA?*

Limitations of PCA
1. Linearity Assumption
PCA assumes that the principal components are linear combinations of the original features.

It cannot capture non-linear relationships in the data.

For complex patterns, non-linear techniques (e.g., Kernel PCA, t-SNE, UMAP) may be better.

2. Unsupervised Method
PCA does not use class labels or any target information.

It maximizes variance, but the directions of maximum variance are not always the most relevant for classification or prediction tasks.

Important discriminative features might have low variance and be ignored.

3. Sensitivity to Scaling
PCA is sensitive to the scale of features.

Features with larger scales dominate variance unless data is properly standardized before applying PCA.

4. Interpretability Issues
Principal components are linear combinations of all original features.

This can make it hard to interpret the transformed features, especially for domain experts.

5. Loss of Information
Dimensionality reduction inevitably causes some loss of information.

Choosing too few components may omit important details, hurting model performance.

6. Outlier Sensitivity
PCA can be sensitive to outliers, which can distort the directions of maximum variance.


*18.  How do KNN and PCA complement each other ?*

KNN and PCA can work really well together because their strengths complement each other, especially when dealing with high-dimensional data.

How KNN and PCA Complement Each Other :

1. PCA Reduces Dimensionality for KNN
KNN relies on distance calculations (e.g., Euclidean distance) to find nearest neighbors.

In high-dimensional spaces, distances become less meaningful due to the curse of dimensionality.

PCA reduces the number of features by projecting data into a lower-dimensional space that retains most variance.

This makes the distance computations in KNN more meaningful and efficient.

2. PCA Helps Remove Noise and Redundancy
High-dimensional data often contains noisy or redundant features.

PCA captures the most important patterns and discards noise.

KNN benefits from this by focusing on cleaner, more informative features for neighbor selection.

3. Faster KNN Computation
With fewer dimensions, KNN requires less computation for distance calculations.

This improves KNN’s speed and scalability.

4. Improves KNN Performance
By reducing irrelevant features and noise, PCA can help KNN achieve better accuracy and generalization.

*19.  How does KNN handle missing values in a dataset?*

How KNN Deals with Missing Values
1. KNN Cannot Directly Handle Missing Values
KNN relies on computing distances between points.

Missing values mean incomplete feature vectors, so the distance calculation breaks down.

2. Common Strategies to Handle Missing Data Before KNN
a) Imputation
Fill in missing values before applying KNN.

Common imputation methods:

Mean/Median Imputation: Replace missing values with the mean or median of the feature.

KNN Imputation: Use KNN itself to estimate missing values by looking at nearest neighbors with complete data.

Model-Based Imputation: Use regression or other predictive models.

b) Remove Samples or Features
Drop rows (samples) with missing values if only a few are missing.

Drop features (columns) with too many missing values.

3. KNN Variants That Handle Missing Values
Some modified KNN algorithms can handle missing values by:

Computing distances using only the features present in both samples (partial distance).

Weighting the distance by the number of available features.

These are less common and require careful implementation.
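
KNN imputation and the partial-distance idea are both available in scikit-learn, as `KNNImputer` and `nan_euclidean_distances`. The small matrix below is a toy example for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Partial distance: uses only features present in both rows,
# rescaled by (total features / features actually compared)
partial = nan_euclidean_distances(X[[0]], X[[1]])[0, 0]
print(f"Partial distance between rows 0 and 1: {partial:.3f}")

# KNN imputation: each missing value becomes the mean of that feature
# over the 2 nearest rows that have it
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Rows 0 and 1 share two of three features, so the squared differences (4 + 4) are scaled by 3/2 before taking the square root. Once imputation has removed the missing values, an ordinary KNN classifier or regressor can be trained on `X_filled`.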

*20.  What are the key differences between PCA and Linear Discriminant Analysis (LDA)?*

Both PCA and LDA are popular dimensionality reduction techniques, but they have different goals and approaches. Here are the key differences:

| Aspect                   | Principal Component Analysis (PCA)                                | Linear Discriminant Analysis (LDA)                                         |
| ------------------------ | ----------------------------------------------------------------- | -------------------------------------------------------------------------- |
| **Goal**                 | Find directions that **maximize variance** in data (unsupervised) | Find directions that **maximize class separability** (supervised)          |
| **Supervision**          | **Unsupervised** — ignores class labels                           | **Supervised** — uses class labels                                         |
| **Focus**                | Captures overall data structure                                   | Focuses on differences between classes                                     |
| **Criteria optimized**   | Maximizes total variance                                          | Maximizes ratio of **between-class variance** to **within-class variance** |
| **Output components**    | Principal components ordered by variance explained                | Linear discriminants ordered by class separability                         |
| **Dimensionality limit** | Can extract up to $\min(n_{samples}, n_{features})$ components    | Limited to $\leq (C - 1)$ components, where $C$ = number of classes        |
| **Typical use cases**    | Data compression, visualization, noise reduction                  | Classification, feature extraction for supervised tasks                    |
| **Interpretability**     | Components are directions of max variance                         | Components maximize class separation                                       |
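
The supervision and dimensionality-limit rows can be checked on Iris, which has 4 features and 3 classes (the dataset is chosen here only for illustration). PCA is fitted without labels and can return up to 4 components, while LDA needs the labels and returns at most 2.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
n_classes = len(set(y))

# PCA: unsupervised, fit on X alone
X_pca = PCA().fit_transform(X)

# LDA: supervised, fit on X and y; default keeps min(n_classes - 1, n_features)
X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)

print(f"PCA components: {X_pca.shape[1]}")
print(f"LDA components: {X_lda.shape[1]} (number of classes - 1 = {n_classes - 1})")
```

For a 2D plot either method can be used on this dataset, but only LDA chooses its axes with class separation in mind.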


In [None]:
# 21. Train a KNN Classifier on the Iris dataset and print model accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize KNN classifier with k=5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn.fit(X_train, y_train)

# Predict on test set
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"KNN Classifier Accuracy on Iris dataset: {accuracy:.2f}")


In [None]:
# 22.  Train a KNN Regressor on a synthetic dataset and evaluate using Mean Squared Error (MSE)

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize KNN regressor with k=5 neighbors
knn_reg = KNeighborsRegressor(n_neighbors=5)

# Train the model
knn_reg.fit(X_train, y_train)

# Predict on test set
y_pred = knn_reg.predict(X_test)

# Evaluate using Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"KNN Regressor Mean Squared Error: {mse:.2f}")


In [None]:
# 23.  Train a KNN Classifier using different distance metrics (Euclidean and Manhattan) and compare accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize KNN with Euclidean distance (default)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# Initialize KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Print results
print(f"Accuracy with Euclidean distance: {accuracy_euclidean:.2f}")
print(f"Accuracy with Manhattan distance: {accuracy_manhattan:.2f}")


In [None]:
# 24.  Train a KNN Classifier with different values of K and visualize decision boundaries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load Iris data
iris = load_iris()
X = iris.data[:, :2]  # Use only first two features for visualization
y = iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create color maps
cmap_light = plt.cm.Pastel2
cmap_bold = plt.cm.Set1

# Values of K to try
k_values = [1, 3, 7, 15]

plt.figure(figsize=(15, 12))

for i, k in enumerate(k_values, 1):
    # Train KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Create mesh grid for plotting decision boundaries
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))

    # Predict class for each point in the mesh grid
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot decision boundary
    plt.subplot(2, 2, i)
    plt.contourf(xx, yy, Z, cmap=cmap_light)

    # Plot training points
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold, edgecolor='k', s=50)
    plt.title(f"KNN with k={k}")
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

plt.tight_layout()
plt.show()


In [None]:
# 25.  Apply Feature Scaling before training a KNN model and compare results with unscaled data

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Without scaling ---
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
y_pred_unscaled = knn_unscaled.predict(X_test)
acc_unscaled = accuracy_score(y_test, y_pred_unscaled)

# --- With scaling ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy without scaling: {acc_unscaled:.2f}")
print(f"Accuracy with scaling: {acc_scaled:.2f}")


In [None]:
# 26.  Train a PCA model on synthetic data and print the explained variance ratio for each component

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Generate synthetic dataset
X, _ = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=42)

# Train PCA model
pca = PCA()
pca.fit(X)

# Print explained variance ratio for each component
explained_variance = pca.explained_variance_ratio_
for i, var_ratio in enumerate(explained_variance, 1):
    print(f"Principal Component {i}: Explained Variance Ratio = {var_ratio:.4f}")


In [None]:
# 27.  Apply PCA before training a KNN Classifier and compare accuracy with and without PCA

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for PCA and KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Without PCA ---
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
acc_without_pca = accuracy_score(y_test, y_pred)

# --- With PCA ---
pca = PCA(n_components=2)  # Reduce to 2 components for illustration
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
acc_with_pca = accuracy_score(y_test, y_pred_pca)

print(f"Accuracy without PCA: {acc_without_pca:.2f}")
print(f"Accuracy with PCA: {acc_with_pca:.2f}")


In [None]:
# 28.  Perform Hyperparameter Tuning on a KNN Classifier using GridSearchCV

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define KNN classifier
knn = KNeighborsClassifier()

# Define hyperparameter grid to search
param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9],
    'metric': ['euclidean', 'manhattan']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV to training data
grid_search.fit(X_train, y_train)

# Best parameters found
print("Best parameters:", grid_search.best_params_)

# Best cross-validation accuracy
print(f"Best cross-validation accuracy: {grid_search.best_score_:.2f}")

# Evaluate best model on test data
best_knn = grid_search.best_estimator_
test_accuracy = best_knn.score(X_test, y_test)
print(f"Test set accuracy with best parameters: {test_accuracy:.2f}")


In [None]:
# 29.  Train a KNN Classifier and check the number of misclassified samples

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on test set
y_pred = knn.predict(X_test)

# Calculate number of misclassified samples
num_misclassified = (y_test != y_pred).sum()
print(f"Number of misclassified samples: {num_misclassified}")


In [None]:
# 30.  Train a PCA model and visualize the cumulative explained variance.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load dataset
iris = load_iris()
X = iris.data

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA()
pca.fit(X_scaled)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Cumulative Explained Variance')
plt.grid(True)
plt.show()


In [None]:
# 31. Train a KNN Classifier using different values of the weights parameter (uniform vs. distance) and compare accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# KNN with uniform weights (all neighbors weighted equally)
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_uniform.fit(X_train, y_train)
y_pred_uniform = knn_uniform.predict(X_test)
acc_uniform = accuracy_score(y_test, y_pred_uniform)

# KNN with distance weights (closer neighbors weighted more)
knn_distance = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_distance.fit(X_train, y_train)
y_pred_distance = knn_distance.predict(X_test)
acc_distance = accuracy_score(y_test, y_pred_distance)

print(f"Accuracy with uniform weights: {acc_uniform:.2f}")
print(f"Accuracy with distance weights: {acc_distance:.2f}")


In [None]:
# 32.  Train a KNN Regressor and analyze the effect of different K values on performance

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic regression data
X, y = make_regression(n_samples=300, n_features=5, noise=15, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try different values of K
k_values = range(1, 31)
mse_values = []

for k in k_values:
    knn_reg = KNeighborsRegressor(n_neighbors=k)
    knn_reg.fit(X_train, y_train)
    y_pred = knn_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)

# Plot K vs MSE
plt.figure(figsize=(8, 5))
plt.plot(k_values, mse_values, marker='o')
plt.title("Effect of K on KNN Regressor Performance")
plt.xlabel("Number of Neighbors (K)")
plt.ylabel("Mean Squared Error (MSE)")
plt.grid(True)
plt.show()


In [None]:
# 33.  Implement KNN Imputation for handling missing values in a dataset

import numpy as np
from sklearn.impute import KNNImputer

# Example dataset with missing values (np.nan)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 1.0],
    [np.nan, 0.0, 2.0],
    [4.0, 2.0, 3.0],
    [5.0, 3.0, np.nan]
])

# Initialize KNNImputer with 2 neighbors (the library default is n_neighbors=5)
imputer = KNNImputer(n_neighbors=2)

# Fit and transform the dataset to impute missing values
X_imputed = imputer.fit_transform(X)

print("Original data with missing values:")
print(X)
print("\nData after KNN imputation:")
print(X_imputed)
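
To make the imputation step less of a black box, here is a minimal sketch of what the imputer appears to do for each missing entry: measure NaN-aware Euclidean distances, keep the nearest rows that have the column observed, and average their values. The helper `impute_by_hand` and the names `X_demo`, `X_manual`, and `X_sklearn` are hypothetical. The sketch assumes every pair of rows shares at least one observed feature, which holds for this small array.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same toy array as the cell above
X_demo = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 1.0],
    [np.nan, 0.0, 2.0],
    [4.0, 2.0, 3.0],
    [5.0, 3.0, np.nan],
])

def impute_by_hand(data, n_neighbors=2):
    """Fill each NaN with the mean of that column over the nearest donor rows."""
    filled = data.copy()
    # Distances skip missing coordinates and rescale by the share of usable ones
    dist = nan_euclidean_distances(data, data)
    for row, col in zip(*np.where(np.isnan(data))):
        # Donors are the rows where this column is observed
        donors = np.where(~np.isnan(data[:, col]))[0]
        order = np.argsort(dist[row, donors], kind="stable")
        nearest = donors[order[:n_neighbors]]
        filled[row, col] = data[nearest, col].mean()
    return filled

X_manual = impute_by_hand(X_demo)
X_sklearn = KNNImputer(n_neighbors=2).fit_transform(X_demo)
print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

For example, the missing third value in the first row is filled from the second and third rows, which are equally close to it, giving (1.0 + 2.0) / 2 = 1.5.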


In [None]:
# 34.  Train a PCA model and visualize the data projection onto the first two principal components

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Scale the data (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the projection
plt.figure(figsize=(8,6))
colors = ['navy', 'turquoise', 'darkorange']

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
                color=color, lw=2, label=target_name, alpha=0.7)

plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA Projection of Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# 35.  Train a KNN Classifier using the KD Tree and Ball Tree algorithms and compare performance

import time
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a helper function to train and time the model
# (for KNN, fit() only builds the search structure; neighbor lookups happen in predict())
def train_and_evaluate(algorithm):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    start_time = time.perf_counter()  # high-resolution timer suited to short intervals
    knn.fit(X_train, y_train)
    train_time = time.perf_counter() - start_time

    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy, train_time

# Train using KD Tree
acc_kd, time_kd = train_and_evaluate('kd_tree')

# Train using Ball Tree
acc_ball, time_ball = train_and_evaluate('ball_tree')

print(f"KD Tree -> Accuracy: {acc_kd:.2f}, Training Time: {time_kd:.6f} seconds")
print(f"Ball Tree -> Accuracy: {acc_ball:.2f}, Training Time: {time_ball:.6f} seconds")
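
Iris is small enough that timing differences between the two trees are mostly noise. As a rough sketch under that assumption, the block below uses the `KDTree` and `BallTree` classes directly on a larger random dataset and times tree construction and querying separately. The data sizes, the helper `build_and_query`, and all variable names are illustrative choices, not part of the original cell.

```python
import time
import numpy as np
from sklearn.neighbors import BallTree, KDTree

# Synthetic data (sizes chosen arbitrarily for illustration)
rng = np.random.default_rng(42)
X_index = rng.normal(size=(5000, 5))
X_query = rng.normal(size=(500, 5))

def build_and_query(tree_cls, k=5):
    """Return neighbor distances plus build and query times for one tree type."""
    start = time.perf_counter()
    tree = tree_cls(X_index, leaf_size=30)
    build_time = time.perf_counter() - start

    start = time.perf_counter()
    dist, _ = tree.query(X_query, k=k)
    query_time = time.perf_counter() - start
    return dist, build_time, query_time

dist_kd, build_kd, query_kd = build_and_query(KDTree)
dist_ball, build_ball, query_ball = build_and_query(BallTree)

print(f"KD Tree   -> build: {build_kd:.6f} s, query: {query_kd:.6f} s")
print(f"Ball Tree -> build: {build_ball:.6f} s, query: {query_ball:.6f} s")
print("Same neighbor distances:", np.allclose(dist_kd, dist_ball))
```

Both structures perform an exact search, so they should return the same neighbor distances; only the time taken is expected to differ.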


In [None]:
# 36.  Train a PCA model on a high-dimensional dataset and visualize the Scree plot

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Generate a high-dimensional synthetic dataset
X, _ = make_classification(n_samples=300, n_features=50, n_informative=20, random_state=42)

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio for each component
explained_variance = pca.explained_variance_ratio_

# Scree plot
plt.figure(figsize=(10,6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.7, color='b')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')
plt.xticks(range(1, len(explained_variance) + 1))
plt.show()


In [None]:
# 37.  Train a KNN Classifier and evaluate performance using Precision, Recall, and F1-Score

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on test set
y_pred = knn.predict(X_test)

# Print classification report (includes precision, recall, f1-score)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
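
The numbers in the report can be derived from the confusion matrix. The sketch below does that by hand for the same split and model; the variable names (`cm`, `tp`, `precision`, `recall`, `f1`) are illustrative, and it assumes every class is both present and predicted at least once in the test set so that no division by zero occurs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)
y_hat = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
tp = np.diag(cm)                      # correctly predicted samples per class

precision = tp / cm.sum(axis=0)       # TP / everything predicted as the class
recall = tp / cm.sum(axis=1)          # TP / everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(load_iris().target_names, precision, recall, f1):
    print(f"{name:<12} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```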


In [None]:
# 38.  Train a PCA model and analyze the effect of different numbers of components on accuracy

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Try different PCA components and record accuracy
component_range = range(1, X.shape[1] + 1)  # 1 to number of features
accuracies = []

for n_components in component_range:
    # Apply PCA
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    # Train KNN on PCA-transformed data
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_pca, y_train)
    y_pred = knn.predict(X_test_pca)

    # Calculate accuracy
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

# Plot results
plt.figure(figsize=(8,5))
plt.plot(component_range, accuracies, marker='o')
plt.title('Effect of Number of PCA Components on KNN Accuracy')
plt.xlabel('Number of Principal Components')
plt.ylabel('Accuracy')
plt.xticks(component_range)
plt.grid(True)
plt.show()


In [None]:
# 39.  Train a KNN Classifier with different leaf_size values and compare accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

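# Note: leaf_size tunes the speed and memory use of the tree, not which neighbors
# are found, so accuracy is expected to stay essentially constant across these values
leaf_sizes = [1, 10, 20, 30, 40, 50]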
accuracies = []

for leaf_size in leaf_sizes:
    knn = KNeighborsClassifier(n_neighbors=5, leaf_size=leaf_size, algorithm='kd_tree')
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Leaf size: {leaf_size}, Accuracy: {acc:.2f}")

# Optionally, plot accuracy vs leaf_size
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.plot(leaf_sizes, accuracies, marker='o')
plt.title('Effect of leaf_size on KNN Accuracy (kd_tree)')
plt.xlabel('leaf_size')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()


In [None]:
# 40.  Train a PCA model and visualize how data points are transformed before and after PCA

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Scale the data (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA and transform data to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot original data (first two features)
plt.figure(figsize=(14,6))

plt.subplot(1, 2, 1)
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X[y == i, 0], X[y == i, 1], color=color, lw=2, label=target_name, alpha=0.7)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('Original Data (First Two Features)')
plt.legend()
plt.grid(True)

# Plot PCA transformed data (first two principal components)
plt.subplot(1, 2, 2)
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], color=color, lw=2, label=target_name, alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Transformed Data (2 Components)')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()


In [None]:
# 41.  Train a KNN Classifier on a real-world dataset (Wine dataset) and print classification report

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict on test data
y_pred = knn.predict(X_test_scaled)

# Print classification report
print(classification_report(y_test, y_pred, target_names=wine.target_names))


In [None]:
# 42.  Train a KNN Regressor and analyze the effect of different distance metrics on prediction error

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Generate synthetic regression dataset
X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=42)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Distance metrics to test
metrics = ['euclidean', 'manhattan']
mse_scores = []

for metric in metrics:
    knn = KNeighborsRegressor(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"Distance metric: {metric}, Mean Squared Error: {mse:.2f}")

# Optional: Bar plot to visualize MSE for each metric
plt.bar(metrics, mse_scores, color=['skyblue', 'salmon'])
plt.title('Effect of Distance Metric on KNN Regression Error')
plt.ylabel('Mean Squared Error')
plt.show()


In [None]:
# 43.  Train a KNN Classifier and evaluate using ROC-AUC score

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict probabilities for positive class
y_probs = knn.predict_proba(X_test_scaled)[:, 1]

# Calculate ROC-AUC
auc = roc_auc_score(y_test, y_probs)
print(f"ROC-AUC Score: {auc:.3f}")

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
plt.plot(fpr, tpr, label=f'KNN (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
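
One way to read the ROC-AUC score: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks that interpretation on the same data and model. All names here (`scores`, `pos`, `neg`, `manual_auc`, and so on) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.2, random_state=42)

scaler_bc = StandardScaler().fit(X_tr)
knn_bc = KNeighborsClassifier(n_neighbors=5).fit(scaler_bc.transform(X_tr), y_tr)
scores = knn_bc.predict_proba(scaler_bc.transform(X_te))[:, 1]

# Compare every positive sample's score with every negative sample's score
pos = scores[y_te == 1]
neg = scores[y_te == 0]
diff = pos[:, None] - neg[None, :]

# Wins count fully, ties count half
manual_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(f"Pairwise AUC:  {manual_auc:.3f}")
print(f"roc_auc_score: {roc_auc_score(y_te, scores):.3f}")
```

With k=5 and uniform weights the predicted probabilities can only take the values 0, 0.2, ..., 1, so ties are common and the half-credit term matters.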


In [None]:
# 44.  Train a PCA model and visualize the variance captured by each principal component

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X = iris.data

# Scale features (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA()
pca.fit(X_scaled)

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Plot variance explained by each component
plt.figure(figsize=(8,5))
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.7, color='teal')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained Ratio')
plt.title('Variance Explained by Each Principal Component')
plt.xticks(range(1, len(explained_variance) + 1))
plt.show()


In [None]:
# 45.  Train a KNN Classifier and perform feature selection before training

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Selection: Select top 2 features
selector = SelectKBest(score_func=f_classif, k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Scale features (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
print(classification_report(y_test, y_pred, target_names=iris.target_names))


In [None]:
# 46.  Train a PCA model and visualize the data reconstruction error after reducing dimensions

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Load dataset
iris = load_iris()
X = iris.data

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Try different numbers of components and calculate reconstruction error
max_components = X.shape[1]
errors = []

for n_components in range(1, max_components + 1):
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X_scaled)
    X_reconstructed = pca.inverse_transform(X_reduced)
    mse = mean_squared_error(X_scaled, X_reconstructed)
    errors.append(mse)

# Plot reconstruction error vs number of components
plt.figure(figsize=(8,5))
plt.plot(range(1, max_components + 1), errors, marker='o', color='red')
plt.xlabel('Number of PCA Components')
plt.ylabel('Mean Squared Reconstruction Error')
plt.title('PCA Reconstruction Error vs Number of Components')
plt.grid(True)
plt.show()
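
For standardized data, the reconstruction error plotted above should mirror the cumulative explained variance: the mean squared error with k components equals one minus the cumulative explained variance ratio at k. The sketch below checks that relationship numerically; the names `X_std`, `cumulative_ratio`, and `reconstruction_mse` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Standardize so every feature has zero mean and unit variance
X_std = StandardScaler().fit_transform(load_iris().data)

# Cumulative share of variance captured by the first k components
cumulative_ratio = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)

# Reconstruction error for k = 1 .. number of features
reconstruction_mse = []
for k in range(1, X_std.shape[1] + 1):
    pca_k = PCA(n_components=k).fit(X_std)
    X_back = pca_k.inverse_transform(pca_k.transform(X_std))
    reconstruction_mse.append(mean_squared_error(X_std, X_back))

for k, (mse, cum) in enumerate(zip(reconstruction_mse, cumulative_ratio), 1):
    print(f"k={k}: MSE={mse:.4f}, 1 - cumulative ratio={1 - cum:.4f}")
```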


In [None]:
# 47.  Train a KNN Classifier and visualize the decision boundary

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X = iris.data[:, :2]  # Use first two features for 2D plot
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Create a mesh grid for plotting decision boundaries
h = 0.02  # step size in the mesh
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict class for each point in the mesh grid
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundary and training points
plt.figure(figsize=(10, 6))
colors = ['red', 'green', 'blue']
# One filled level per class so that region colors match the scatter colors
plt.contourf(xx, yy, Z, levels=[-0.5, 0.5, 1.5, 2.5], colors=colors, alpha=0.3)
for idx, color in enumerate(colors):
    plt.scatter(X_train_scaled[y_train == idx, 0], X_train_scaled[y_train == idx, 1],
                c=color, label=iris.target_names[idx], edgecolor='k')
plt.xlabel('Feature 1 (standardized)')
plt.ylabel('Feature 2 (standardized)')
plt.title('KNN Decision Boundary (k=5)')
plt.legend()
plt.show()


In [None]:
# 48.  Train a PCA model and analyze the effect of different numbers of components on data variance.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load data
iris = load_iris()
X = iris.data

# Scale data before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA with all components
pca = PCA()
pca.fit(X_scaled)

# Calculate cumulative explained variance
cumulative_variance = pca.explained_variance_ratio_.cumsum()

# Plot cumulative variance vs number of components
plt.figure(figsize=(8,5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Effect of Number of PCA Components on Variance Captured')
plt.xticks(range(1, len(cumulative_variance) + 1))
plt.grid(True)
plt.show()
