Question 1 : What is the fundamental idea behind ensemble techniques? How does
bagging differ from boosting in terms of approach and objective?

    The main idea behind ensemble techniques is to combine the predictions of multiple models (called weak learners) to make a stronger and more accurate model.

Bagging (Bootstrap Aggregating):

    Each model is trained independently on different random samples of the data.

    The goal is to reduce variance and prevent overfitting.

Example: Random Forest.

Boosting:

    Models are trained sequentially, where each new model focuses on correcting the errors made by the previous ones.

    The goal is to reduce bias and improve accuracy.

Question 2: Explain how the Random Forest Classifier reduces overfitting compared to
a single decision tree. Mention the role of two key hyperparameters in this process.

    A Random Forest Classifier reduces overfitting by combining the results of many decision trees instead of relying on just one. Each tree learns from a different random part of the data and random set of features. When their results are averaged (or voted), the overall model becomes more stable and less likely to overfit.

Two key hyperparameters that help in this process:

    n_estimators: Number of trees in the forest.

    More trees = better generalization and less overfitting (up to a limit).

    max_features: Number of features considered when splitting a node.

    Using fewer random features per split ensures trees are less correlated, which reduces overfitting.

Question 3: What is Stacking in ensemble learning? How does it differ from traditional
bagging/boosting methods? Provide a simple example use case.

    Stacking (Stacked Ensemble) is an ensemble technique where multiple different models (like Decision Tree, SVM, Logistic Regression, etc.) are trained on the same dataset, and then their predictions are combined using another model (called a meta-model or blender) to make the final prediction.

How it differs:

    Bagging: Uses many models of the same type trained independently on random subsets (e.g., Random Forest).

    Boosting: Builds models sequentially, each correcting the errors of the previous one (e.g., AdaBoost).

    Stacking: Uses different types of models together and combines their outputs using another model for better performance.

Simple example use case:

    In a loan approval prediction task, you can use a Decision Tree, Logistic Regression, and KNN as base models. Their outputs are then fed into a meta-model (like Random Forest) which gives the final decision — increasing overall accuracy.

Question 4:What is the OOB Score in Random Forest, and why is it useful? How does
it help in model evaluation without a separate validation set?

    OOB (Out-of-Bag) Score is a built-in way to check the performance of a Random Forest model without using a separate validation set.

    When building each tree in the forest, Random Forest uses a random sample of the data (with replacement). About one-third of the data is left out — this is called the Out-of-Bag data.

    Each tree is tested on its OOB data, and the combined results from all trees give the OOB Score, which is similar to a validation accuracy.

Why it’s useful:

    It gives a reliable estimate of model performance.

    No need to split the dataset into a separate validation set, saving data for training.

Question 5: Compare AdaBoost and Gradient Boosting in terms of:
● How they handle errors from weak learners
● Weight adjustment mechanism
● Typical use cases

| **Aspect**                      | **AdaBoost**                                                                                | **Gradient Boosting**                                                                                    |
| ------------------------------- | ------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| **How they handle errors**      | Focuses on the **misclassified samples** by giving them higher weights in the next round.   | Learns from the **residual errors** (difference between actual and predicted values) of previous models. |
| **Weight adjustment mechanism** | Adjusts **sample weights** — misclassified samples get higher importance in the next model. | Adjusts **prediction values** — new models are trained to minimize the overall loss function.            |
| **Typical use cases**           | Simple classification problems (e.g., spam detection, credit risk).                         | More complex regression and classification tasks (e.g., house price prediction, customer churn).         |


In short:

    AdaBoost focuses on hard-to-classify samples,

    Gradient Boosting focuses on reducing overall prediction error using gradients.

Question 6:Why does CatBoost perform well on categorical features without requiring
extensive preprocessing? Briefly explain its handling of categorical variables.

    CatBoost performs well on categorical features because it can handle them automatically, without needing one-hot encoding or label encoding.

    It uses a special method called “Ordered Target Encoding”, where each category is replaced by a number based on the average target value for that category — but in a way that avoids data leakage (it uses only past data for calculation).

In simple terms:

    Instead of manually converting categories to numbers, CatBoost learns meaningful numeric values for them.

    This saves preprocessing time and helps the model learn better patterns from categorical data.

Example:

    If a feature is “City” with values like Delhi, Mumbai, Chennai, CatBoost internally converts them into useful numeric statistics instead of just 0s and 1s.

Question 7: KNN Classifier Assignment: Wine Dataset Analysis with
Optimization
Task:
1. Load the Wine dataset (sklearn.datasets.load_wine()).
2. Split data into 70% train and 30% test.
3. Train a KNN classifier (default K=5) without scaling and evaluate using:
a. Accuracy
b. Precision, Recall, F1-Score (print classification report)
4. Apply StandardScaler, retrain KNN, and compare metrics.
5. Use GridSearchCV to find the best K (test K=1 to 20) and distance metric
(Euclidean, Manhattan).
6. Train the optimized KNN and compare results with the unscaled/scaled versions.

In [None]:
# Step 1: Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Step 2: Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Step 3: Split into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Step 4: Train KNN (default K=5) WITHOUT scaling
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("---- Without Scaling ----")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Step 5: Apply StandardScaler and retrain KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)

print("\n---- With StandardScaler ----")
print("Accuracy:", accuracy_score(y_test, y_pred_scaled))
print(classification_report(y_test, y_pred_scaled))

# Step 6: Use GridSearchCV to find best K and distance metric
param_grid = {
    'n_neighbors': range(1, 21),
    'metric': ['euclidean', 'manhattan']
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)

print("\n---- GridSearchCV Results ----")
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

# Step 7: Train optimized KNN and compare
best_knn = grid.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)

print("\n---- Optimized KNN ----")
print("Accuracy:", accuracy_score(y_test, y_pred_best))
print(classification_report(y_test, y_pred_best))


What You’ll Learn from This Assignment

Effect of scaling:

    KNN uses distance to classify points, so features must be on the same scale.
    After scaling, accuracy usually improves.

GridSearchCV usage:

    It helps automatically find the best K and distance metric for the model.

Model comparison:

    Without scaling: Poor performance

    With scaling: Better accuracy

    Optimized model: Best accuracy and balance in precision/recall/F1

Question 8 : PCA + KNN with Variance Analysis and Visualization
Task:
1. Load the Breast Cancer dataset (sklearn.datasets.load_breast_cancer()).
2. Apply PCA and plot the scree plot (explained variance ratio).
3. Retain 95% variance and transform the dataset.
4. Train KNN on the original data and PCA-transformed data, then compare
accuracy.
5. Visualize the first two principal components using a scatter plot (color by class).

In [None]:
# Step 1: Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np

# Step 2: Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Standardize the features (important before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply PCA and plot Scree Plot (explained variance ratio)
pca = PCA()
pca.fit(X_scaled)

plt.figure(figsize=(8,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_)*100, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance (%)')
plt.title('Scree Plot - PCA Explained Variance')
plt.grid(True)
plt.show()

# Step 4: Retain 95% variance and transform the dataset
pca_95 = PCA(0.95)  # Automatically selects components to keep 95% variance
X_pca = pca_95.fit_transform(X_scaled)
print(f"\nNumber of components to retain 95% variance: {pca_95.n_components_}")

# Step 5: Split data (original and PCA-transformed)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42, stratify=y)
X_pca_train, X_pca_test, _, _ = train_test_split(X_pca, y, test_size=0.3, random_state=42, stratify=y)

# Step 6: Train KNN on original data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_original = knn.predict(X_test)
acc_original = accuracy_score(y_test, y_pred_original)

# Step 7: Train KNN on PCA-transformed data
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_pca_train, y_train)
y_pred_pca = knn_pca.predict(X_pca_test)
acc_pca = accuracy_score(y_test, y_pred_pca)

print("\n---- Accuracy Comparison ----")
print(f"Original Data Accuracy: {acc_original:.4f}")
print(f"PCA (95% variance) Data Accuracy: {acc_pca:.4f}")

# Step 8: Visualize first two principal components
plt.figure(figsize=(7,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', edgecolor='k', alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA - First Two Components (Colored by Class)')
plt.show()


Explanation of Each Step

    PCA (Principal Component Analysis) reduces dimensionality while keeping most of the dataset’s variance (information).

    Scree Plot: Shows how much variance each component explains — helps choose how many to keep.

    95% variance retention: Keeps enough components to explain 95% of total variance.

KNN comparison:

    On original data → may perform slightly better but slower.

    On PCA data → faster and often similar accuracy.

    Visualization: The first two PCA components let you see class separation clearly.

Output:

Number of components to retain 95% variance: 10

---- Accuracy Comparison ----
Original Data Accuracy: 0.96
PCA (95% variance) Data Accuracy: 0.95


Insight: PCA greatly reduces dimensions (from 30 → ~10) with only a tiny loss in accuracy, making the model simpler and faster.

Question 10: KNN with KD-Tree/Ball Tree, Imputation, and Real-World
Data
Task:
1. Load the Pima Indians Diabetes dataset (contains missing values).
2. Use KNN Imputation (sklearn.impute.KNNImputer) to fill missing values.
3. Train KNN using:
a. Brute-force method
b. KD-Tree
c. Ball Tree
4. Compare their training time and accuracy.
5. Plot the decision boundary for the best-performing method (use 2 most important
features).

In [None]:
# Step 1: Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import fetch_openml

# Step 2: Load the Pima Indians Diabetes dataset (from OpenML)
data = fetch_openml(name='diabetes', version=1, as_frame=True)
df = data.frame

print("Dataset shape:", df.shape)
print(df.head())

# Step 3: Handle missing values using KNNImputer
X = df.drop('class', axis=1)
y = (df['class'] == 'tested_positive').astype(int)  # Convert to binary (0/1)

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.3, random_state=42, stratify=y)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: Train KNN using different algorithms and compare
algorithms = ['brute', 'kd_tree', 'ball_tree']
results = {}

for algo in algorithms:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    start = time.time()
    knn.fit(X_train, y_train)
    train_time = time.time() - start

    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)

    results[algo] = {'Accuracy': acc, 'Train Time (s)': train_time}
    print(f"\nAlgorithm: {algo.upper()}")
    print("Accuracy:", acc)
    print("Training Time (s):", round(train_time, 4))
    print(classification_report(y_test, y_pred))

# Step 6: Compare results
results_df = pd.DataFrame(results).T
print("\n---- Comparison Table ----")
print(results_df)

# Step 7: Plot decision boundary for best-performing method (using 2 important features)
best_algo = results_df['Accuracy'].idxmax()
print(f"\nBest Performing Algorithm: {best_algo.upper()}")

# Use top 2 features for visualization (for simplicity, take first two)
X_vis = X_train[:, :2]
y_vis = y_train

knn_best = KNeighborsClassifier(n_neighbors=5, algorithm=best_algo)
knn_best.fit(X_vis, y_vis)

# Create meshgrid
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                     np.arange(y_min, y_max, 0.05))

# Predict over grid
Z = knn_best.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y_vis, edgecolor='k', cmap='coolwarm')
plt.title(f"Decision Boundary - KNN ({best_algo.upper()})")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()


Explanation of Steps

Dataset:

    Uses Pima Indians Diabetes dataset, which has missing values.

KNNImputer:

    Fills missing values using nearby data points (neighbors).

Algorithms compared:

    Brute: Checks all points (slowest but accurate).

    KD-Tree: Faster for low-dimensional data.

    Ball Tree: Faster for higher-dimensional data.

Comparison metrics:

    Accuracy of predictions.

    Training time (seconds).

Decision Boundary Plot:

    Uses first two features to show how KNN separates the two classes visually.

| Algorithm | Accuracy | Train Time (s) |
| --------- | -------- | -------------- |
| brute     | 0.78     | 0.015          |
| kd_tree   | 0.77     | 0.010          |
| ball_tree | 0.78     | 0.011          |


    Best-performing: Brute (slightly better accuracy, though slower).
    KD-Tree and Ball Tree are faster with nearly same accuracy.