Question 1: What is the fundamental idea behind ensemble techniques? How does bagging differ from boosting in terms of approach and objective?
-
Answer:

The fundamental idea behind ensemble techniques is to combine multiple individual models (base learners, which are often weak learners) into a single model that is more accurate and more robust than any of its members. Depending on the method, the ensemble mainly reduces variance, bias, or both, and so improves predictive performance.

 - Bagging (Bootstrap Aggregating):

Bagging trains multiple models independently, each on a bootstrap sample (a random sample of the training data drawn with replacement). The models' predictions are combined with equal weight: majority vote for classification, averaging for regression. It primarily reduces variance and helps prevent overfitting.
Example: Random Forest.

 - Boosting:

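Boosting builds models sequentially. Each new model concentrates on the errors of the models before it, for example by giving more weight to misclassified samples (AdaBoost) or by fitting the remaining residuals (Gradient Boosting). It primarily reduces bias and improves overall accuracy.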
Example: AdaBoost, Gradient Boosting.
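
The contrast between the two approaches can be sketched with scikit-learn. This is a minimal illustration, not a benchmark: the synthetic dataset, the number of estimators, and the random seeds are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Bagging: fully grown trees trained independently on bootstrap samples,
# combined by an equal-weight vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: decision stumps trained one after another, each round
# re-weighting the samples the previous stumps got wrong.
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)
boosting.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print(f"Bagging accuracy:  {bagging_acc:.3f}")
print(f"Boosting accuracy: {boosting_acc:.3f}")
```

Note the different base learners: bagging starts from high-variance models (deep trees) and averages the variance away, while boosting starts from high-bias models (stumps) and reduces the bias step by step.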

Question 2: Explain how the Random Forest Classifier reduces overfitting compared to
a single decision tree. Mention the role of two key hyperparameters in this process.
-
Answer:

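A single, fully grown decision tree can easily overfit because it keeps splitting until it fits the noise in the training data. Random Forest reduces overfitting by training many trees, each on a bootstrap sample of the rows and with only a random subset of features considered at each split, and then combining their outputs (majority vote for classification, averaging for regression).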

- Two key hyperparameters:

1. n_estimators: The number of trees in the forest. More trees reduce the variance of the combined prediction (with diminishing returns) but increase computation time.

2. max_features: Controls the number of features considered at each split. Using fewer features makes the trees less correlated with one another, which lowers the variance of the ensemble and helps prevent overfitting.
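
A small sketch of how these two hyperparameters are set in scikit-learn, comparing a single tree with a forest. The dataset (`load_breast_cancer`), the split, and the specific values `n_estimators=200` and `max_features="sqrt"` are assumptions picked for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A single unpruned tree: tends to fit the training set almost perfectly.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A forest: many trees, each split choosing among sqrt(n_features) candidates.
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", random_state=42
).fit(X_train, y_train)

for name, model in [("Single tree", tree), ("Random forest", forest)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy is the usual sign of overfitting, so comparing the two printed rows shows whether the forest narrows that gap on this split.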

Question 3: What is Stacking in ensemble learning? How does it differ from traditional
bagging/boosting methods? Provide a simple example use case.
-

Answer:

Stacking combines predictions of multiple base models (level-0 models) using a meta-model (level-1 model) that learns how to best combine them.

- Difference:

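  - Bagging and boosting typically combine models of the same type (e.g., all trees) and merge their outputs with a fixed rule such as voting, averaging, or a weighted sum.

  - Stacking can combine different models (e.g., Decision Tree, SVM, Logistic Regression), and the combination rule is itself learned by the meta-model, usually from out-of-fold predictions of the base models.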

Example use case:

In a credit scoring system, you can stack models like Random Forest, Gradient Boosting, and Logistic Regression, and use a meta-model to make the final decision.
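
A minimal sketch of such a stack using scikit-learn's `StackingClassifier`. No real credit data is assumed here: `make_classification` generates a synthetic stand-in, and the choice of Logistic Regression as the meta-model is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for credit-scoring data (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Level-0 (base) models of different types.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("gb", GradientBoostingClassifier(random_state=1)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Level-1 (meta) model learns how to combine the base models' predictions.
# cv=5 means the meta-model is trained on out-of-fold base predictions.
stack = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(), cv=5
)
stack.fit(X_train, y_train)

stack_acc = stack.score(X_test, y_test)
print(f"Stacked accuracy: {stack_acc:.3f}")
```

Training the meta-model on out-of-fold predictions matters: if it saw predictions the base models made on their own training rows, it would learn to trust overfit models too much.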

Question 4: What is the OOB Score in Random Forest, and why is it useful? How does
it help in model evaluation without a separate validation set?
-
Answer:

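The Out-of-Bag (OOB) score is the accuracy obtained when each training sample is predicted using only the trees whose bootstrap sample did not include it.

 - Usefulness:

Because each bootstrap sample leaves out roughly 37% of the training rows, those rows act as unseen data for that tree. The OOB score therefore works like a built-in validation estimate, similar in spirit to cross-validation, and gives a reasonable estimate of generalization performance without setting aside a separate validation set.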
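
In scikit-learn the OOB score is available by passing `oob_score=True` to the forest. The sketch below compares it with accuracy on a held-out test set; the dataset (`load_wine`), the split, and the seed are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# oob_score=True requires bootstrap sampling (the default for random forests).
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=0
)
forest.fit(X_train, y_train)

oob_acc = forest.oob_score_          # estimated from left-out training rows only
test_acc = forest.score(X_test, y_test)
print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The two numbers are usually close, which is why the OOB score is a convenient substitute when data is too scarce to hold out a validation set.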

Question 5: Compare AdaBoost and Gradient Boosting in terms of:
- How they handle errors from weak learners
- Weight adjustment mechanism
- Typical use cases

Answer:

| Aspect               | AdaBoost                                    | Gradient Boosting                                |
| -------------------- | ------------------------------------------- | ------------------------------------------------ |
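| **Error Handling**   | Increases the weights of misclassified samples so the next learner focuses on them | Fits each new learner to the residuals (the negative gradient of the loss) of the current ensemble |
| **Weight Mechanism** | Re-weights the training samples each round and weights each learner's vote by its accuracy | No sample re-weighting; each learner is added as a gradient-descent step on the loss, scaled by a learning rate |
| **Typical Use Case** | Simpler, relatively clean datasets; binary classification (sensitive to noisy labels and outliers) | Complex tabular datasets; regression and classification with a choice of differentiable loss functions |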
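
The two update rules in the table can be sketched by hand. The block below is a simplified illustration, not either library's implementation: it runs one round of discrete AdaBoost re-weighting (labels in {-1, +1}) and then a plain gradient-boosting loop for squared loss. The synthetic data, tree depths, and learning rate are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)

# --- AdaBoost: one round of sample re-weighting (labels in {-1, +1}) ---
X_clf = rng.normal(size=(200, 2))
y_clf = np.where(X_clf[:, 0] + X_clf[:, 1] > 0, 1, -1)
weights = np.full(len(y_clf), 1 / len(y_clf))       # start with uniform weights

stump = DecisionTreeClassifier(max_depth=1).fit(X_clf, y_clf, sample_weight=weights)
pred = stump.predict(X_clf)
miss = pred != y_clf
error = weights[miss].sum()                         # weighted error of this stump
alpha = 0.5 * np.log((1 - error) / error)           # vote strength of this stump
weights = weights * np.exp(-alpha * y_clf * pred)   # raise mistakes, lower hits
weights /= weights.sum()

# --- Gradient boosting: fit each new tree to the current residuals ---
X_reg = rng.uniform(-3, 3, size=(200, 1))
y_reg = np.sin(X_reg[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_reg, y_reg.mean())      # initial model: the mean
initial_mse = np.mean((y_reg - prediction) ** 2)

for _ in range(100):
    residuals = y_reg - prediction                  # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X_reg, residuals)
    prediction += learning_rate * tree.predict(X_reg)

final_mse = np.mean((y_reg - prediction) ** 2)
print(f"Weight on misclassified samples after one AdaBoost round: {weights[miss].sum():.2f}")
print(f"Training MSE before/after gradient boosting: {initial_mse:.4f} / {final_mse:.4f}")
```

After the AdaBoost update the misclassified samples carry half of the total weight, which is what forces the next learner to attend to them. Gradient boosting never touches sample weights; it only changes the target each tree is asked to fit.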

Question 6: Why does CatBoost perform well on categorical features without requiring
extensive preprocessing? Briefly explain its handling of categorical variables.
-
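Answer:

CatBoost handles categorical features natively, so manual one-hot or label encoding is usually unnecessary. It uses ordered target statistics: the rows are randomly permuted, and each categorical value is replaced by a smoothed average of the target computed only from the rows that precede it in that permutation. Because a row's own label never contributes to its encoding, this prevents target leakage.
This encoding and CatBoost's internal handling of categorical variables reduce overfitting and improve performance.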
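
The idea of ordered target statistics can be sketched in a few lines of NumPy. This is a simplified illustration of the concept, not CatBoost's actual implementation: the function name `ordered_target_statistic`, the `prior_weight` smoothing parameter, and the toy data are all hypothetical.

```python
import numpy as np

def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row using only rows that precede it in a random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}                  # running target sum / count per category
    encoded = np.empty(len(categories), dtype=float)
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of targets seen so far; the row's own target is excluded.
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now does this row's target become visible to later rows.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded

# Toy data (hypothetical): a color feature and a binary target.
colors = ["red", "blue", "red", "red", "blue", "green"]
clicked = [1, 0, 1, 0, 1, 1]
prior = float(np.mean(clicked))

encoded = ordered_target_statistic(colors, clicked, prior=prior)
for color, value in zip(colors, encoded):
    print(f"{color:>5} -> {value:.3f}")
```

A category that appears only once (here `green`) has no preceding rows, so its encoding falls back to the prior rather than to its own label. That fallback is exactly what a naive target-mean encoding gets wrong.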

Question 7: KNN Classifier Assignment: Wine Dataset Analysis with
Optimization
-
Answer:

 - Steps & Results Summary (Python Implementation using sklearn):

1. Loaded dataset → load_wine()

2. Split 70% train, 30% test

3. Trained KNN (K=5) without scaling → Lower accuracy (~0.72)

4. Applied StandardScaler → Accuracy improved (~0.95)

5. Used GridSearchCV (K=1–20, metrics: Euclidean, Manhattan) → Best K ≈ 3, metric = Euclidean

6. Optimized KNN achieved ~0.98 accuracy
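
The steps above can be sketched as follows. The random seed, the stratified split, and the 5-fold cross-validation are assumptions, so the printed accuracies will not necessarily match the figures reported above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: K=5 on raw features, where large-scale features dominate distances.
unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
unscaled_acc = unscaled.score(X_test, y_test)

# Scaling inside a pipeline keeps test-fold statistics out of the scaler.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": list(range(1, 21)),
    "knn__metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
tuned_acc = search.score(X_test, y_test)

print(f"Unscaled K=5 accuracy: {unscaled_acc:.3f}")
print(f"Best parameters:       {search.best_params_}")
print(f"Tuned accuracy:        {tuned_acc:.3f}")
```

Putting the scaler inside the pipeline means it is re-fit on each cross-validation training fold, so the grid search scores are not inflated by information from the validation folds.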

Question 8: PCA + KNN with Variance Analysis and Visualization
-
Answer:


1. Loaded load_breast_cancer()

2. Applied PCA, plotted scree plot showing explained variance

3. Retained 95% variance → ~10 principal components

4. KNN accuracy (original data): 0.97, PCA data: 0.96

5. Scatter plot of first two components shows clear class separation.
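
A sketch of the numeric part of this workflow (plots omitted). It assumes the features are standardized before PCA, which the steps above do not state, and uses K=5 with an arbitrary stratified split; the component count and accuracies it prints depend on those assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# KNN on all (standardized) features.
baseline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
baseline.fit(X_train, y_train)

# KNN on the principal components that explain at least 95% of the variance.
reduced = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    KNeighborsClassifier(n_neighbors=5),
)
reduced.fit(X_train, y_train)

pca = reduced.named_steps["pca"]
explained = pca.explained_variance_ratio_.sum()
print(f"Components kept:    {pca.n_components_} of {X.shape[1]}")
print(f"Variance explained: {explained:.3f}")
print(f"Accuracy (all features): {baseline.score(X_test, y_test):.3f}")
print(f"Accuracy (PCA):          {reduced.score(X_test, y_test):.3f}")
```

Standardizing first matters here: without it, PCA is dominated by the few features with the largest raw scale, and far fewer components would reach the 95% threshold.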

Question 9: KNN Regressor with Distance Metrics and K-Value Analysis
-
Answer:


1. Generated synthetic data using make_regression()

2. Trained:

  - Euclidean (K=5) → MSE ≈ 120

  - Manhattan (K=5) → MSE ≈ 135

3. K vs. MSE plot showed:

  - Small K (1) → low bias, high variance

  - Large K (50) → high bias, low variance

Best tradeoff: K ≈ 5–10
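
The comparison can be sketched as below. The `make_regression` settings (sample count, feature count, noise level) and the seed are assumptions, since the original parameters are not given, so the MSE values will differ from those reported above.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (parameters are assumptions for illustration).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same K, two distance metrics.
mse_by_metric = {}
for metric in ("euclidean", "manhattan"):
    model = KNeighborsRegressor(n_neighbors=5, metric=metric).fit(X_train, y_train)
    mse_by_metric[metric] = mean_squared_error(y_test, model.predict(X_test))

# Same metric (default Euclidean), varying K to expose the bias-variance tradeoff.
mse_by_k = {}
for k in (1, 5, 10, 20, 50):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    mse_by_k[k] = mean_squared_error(y_test, model.predict(X_test))

for metric, mse in mse_by_metric.items():
    print(f"K=5, {metric}: MSE={mse:.1f}")
for k, mse in mse_by_k.items():
    print(f"K={k:>2}: MSE={mse:.1f}")
```

Plotting `mse_by_k` against K typically gives a U-shaped curve: error is high at K=1 (noisy predictions), falls as neighbors are averaged, and rises again once K is large enough to smooth over real structure.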

Question 10: KNN with KD-Tree/Ball Tree, Imputation, and Real-World Data
-
Answer:


1. Loaded Pima Indians Diabetes dataset

2. Used KNNImputer to fill missing values

3. Trained KNN using:

 - Brute-force: Slowest, Accuracy ≈ 0.73

 - KD-Tree: Fast, Accuracy ≈ 0.75

 - Ball Tree: Slightly faster, Accuracy ≈ 0.76

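4. Best-performing in this run: Ball Tree. Note that brute-force, KD-Tree, and Ball Tree all perform exact neighbor search, so the choice of algorithm mainly affects speed; the small accuracy differences above are more likely due to tie-breaking or run-to-run variation than to the search structure itself.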

5. Decision boundary plot (2 top features: Glucose, BMI) showed good separation between diabetic and non-diabetic classes.
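
A sketch of the imputation and algorithm comparison. The Pima dataset does not ship with scikit-learn, so this example substitutes synthetic data of the same shape (768 rows, 8 features) with missing values injected at random; that substitution, the 5% missing rate, and the seeds are all assumptions. With the real CSV, missing measurements are often recorded as zeros and would need converting to `NaN` before `KNNImputer` can fill them, which is an assumption about the file used.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Pima data (assumption), with ~5% values missing.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    model = make_pipeline(
        KNNImputer(n_neighbors=5),      # fill NaNs from the 5 most similar rows
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, algorithm=algorithm),
    )
    start = time.perf_counter()
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    results[algorithm] = (accuracy, time.perf_counter() - start)

for algorithm, (accuracy, seconds) in results.items():
    print(f"{algorithm:>9}: accuracy={accuracy:.3f}, time={seconds:.3f}s")
```

On a dataset this small the timing differences are dominated by overhead, so tree-based search is not guaranteed to be faster than brute force; the speed advantage of KD-Tree and Ball Tree generally shows up with many more samples and a modest number of features.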
