#Ensemble Learning Assignment

##Assignment Questions

#Theoretical


#Q1) Can we use Bagging for regression problems?

Yes. Bagging (Bootstrap Aggregating) works for regression as well as classification. Multiple base regressors (such as decision trees) are trained on different bootstrap samples of the training data, and their predictions are averaged to produce the final output. Averaging reduces variance and makes the model more stable, which is why bagging is particularly effective for high-variance models like decision trees. It reduces overfitting and usually improves generalization, although it does not eliminate overfitting entirely. In scikit-learn, the relevant implementations are BaggingRegressor and RandomForestRegressor.



---



#Q2) What is the difference between multiple model training and single model training?

Single model training builds one model on the training data, and that model alone makes the predictions. Multiple model training (ensemble learning) combines the predictions of several models to improve accuracy, robustness, and generalization. Ensembles such as Bagging, Boosting, and Stacking exploit the strengths of their member models and offset their weaknesses. Which error component shrinks depends on the method: bagging mainly reduces variance, while boosting mainly reduces bias. The cost is that an ensemble is computationally more expensive than a single model.



---



#Q3) Explain the concept of feature randomness in Random Forest.

Feature randomness is a key concept in Random Forest that enhances diversity among trees. Instead of considering all features for the best split at each node, Random Forest randomly selects a subset of features. This randomness prevents all trees from becoming similar and reduces correlation among them. It helps improve model generalization and reduces overfitting. As a result, even if one feature dominates the data, others still get a chance to influence the model. This technique improves the ensemble’s overall predictive power.
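
The per-split feature sampling described above can be sketched in a few lines. This is an illustration only, not scikit-learn's implementation: the helper name `candidate_features` is hypothetical, and the square-root subset size is an assumption (a common choice for classification).

```python
import math
import random

def candidate_features(n_features, rng):
    """Return the feature indices one split may consider (hypothetical helper)."""
    k = max(1, int(math.sqrt(n_features)))  # assumed subset size: sqrt(n_features)
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
n_features = 16

# Each split draws its own subset, so a dominant feature is not always available.
subsets = [candidate_features(n_features, rng) for _ in range(5)]
for split_id, subset in enumerate(subsets):
    print(f"split {split_id}: consider features {subset}")
```

Because every split sees a different subset, trees that would otherwise all split on the same strong feature are pushed toward different structures.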



---



#Q4) What is OOB (Out-of-Bag) Score?

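The Out-of-Bag (OOB) score is a built-in validation estimate for bagging methods such as Random Forest. Each tree is trained on a bootstrap sample that contains, on average, about 63% of the distinct training rows, so the remaining ~37% (the OOB samples for that tree) can serve as validation data. Each training row is predicted using only the trees that did not see it, and these predictions are aggregated into a single score (accuracy for classifiers, R² for regressors in scikit-learn). OOB scoring avoids setting aside a separate validation set, which saves data and computation. It resembles cross-validation but is not the same procedure, and it is best read as a reasonable, sometimes slightly pessimistic, estimate of generalization performance rather than a strictly unbiased one.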
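
To make the aggregation step concrete, here is a minimal pure-Python sketch of an OOB accuracy estimate. Everything in it is an assumption made for illustration: the toy one-dimensional data, the nearest-centroid base learner, and the helper names `fit_centroids`, `predict_centroid`, and `oob_accuracy`. It is not how scikit-learn computes `oob_score_` internally.

```python
import random
from collections import Counter, defaultdict

def fit_centroids(xs, ys):
    """Toy base learner: remember the mean x value of each class."""
    sums, counts = defaultdict(float), Counter()
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    return {label: sums[label] / counts[label] for label in counts}

def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def oob_accuracy(xs, ys, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        model = fit_centroids([xs[i] for i in idx], [ys[i] for i in idx])
        # Only rows the model never saw are allowed to vote.
        for i in range(n):
            if i not in in_bag:
                votes[i][predict_centroid(model, xs[i])] += 1
    scored = [i for i in range(n) if votes[i]]  # rows that were OOB at least once
    correct = sum(votes[i].most_common(1)[0][0] == ys[i] for i in scored)
    return correct / len(scored)

data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(100)] + [data_rng.gauss(4, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100
score = oob_accuracy(xs, ys)
print("OOB accuracy estimate:", round(score, 3))
```

The key point is that no row is ever scored by a model that trained on it, which is what lets the training set double as validation data.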



---



#Q5 - How can you measure the importance of features in a Random Forest model?

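Random Forest provides feature importance scores based on how much each feature reduces impurity (such as Gini or entropy) across all trees. For each tree, a feature’s importance is the total impurity decrease over the splits that use it, weighted by the share of samples reaching each split; these values are then averaged over the trees and normalized to sum to 1. This is what feature_importances_ reports in scikit-learn. A second method is permutation importance, which measures the drop in model performance when a feature’s values are randomly shuffled. Impurity-based scores are computed on training data and can favor high-cardinality features, so permutation importance on held-out data is a useful cross-check. Both approaches help identify key predictors, aiding feature selection and interpretability.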
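
Permutation importance is simple enough to sketch by hand. The example below is an assumption-laden illustration: the "model" is a fixed rule that only looks at feature 0, the data are synthetic, and `permutation_importance` here is a hypothetical helper written for this sketch, not the scikit-learn function of the same name.

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, rng):
    """Accuracy drop after shuffling one feature column (hypothetical helper)."""
    baseline = accuracy(model, rows, labels)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return baseline - accuracy(model, shuffled, labels)

rng = random.Random(0)
rows = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(500)]
labels = [int(r[0] > 0) for r in rows]  # the label depends on feature 0 only

def model(row):
    """Stand-in for a trained model: it uses feature 0 and ignores feature 1."""
    return int(row[0] > 0)

importances = [permutation_importance(model, rows, labels, f, rng) for f in range(2)]
for f, imp in enumerate(importances):
    print(f"feature {f}: accuracy drop {imp:.3f}")
```

Shuffling the feature the model relies on destroys its accuracy, while shuffling the ignored feature changes nothing, which is exactly the signal permutation importance measures.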



---



#Q6) Explain the working principle of a Bagging Classifier.

A Bagging Classifier builds multiple instances of a base classifier (e.g., Decision Trees) on different bootstrapped subsets of the training data. Each classifier independently learns patterns and makes predictions. The final prediction is typically determined by majority voting among all models. This ensemble approach reduces variance and improves robustness compared to a single model. It’s particularly useful for unstable models prone to overfitting. Bagging also enhances generalization without significantly increasing bias.
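
The voting step can be sketched as follows. The predictions are made-up values and `majority_vote` is a hypothetical helper; note also that scikit-learn's BaggingClassifier averages predicted probabilities when the base estimator provides them and only falls back to plain voting otherwise, so this shows the idea rather than that library's exact behavior.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined

# Three models, four samples (illustrative values only)
model_predictions = [
    ["cat", "dog", "dog", "cat"],
    ["cat", "cat", "dog", "dog"],
    ["dog", "dog", "dog", "cat"],
]
final = majority_vote(model_predictions)
print(final)
```

Each model is wrong on one sample here, yet no two models are wrong on the same sample, so the vote recovers a consistent answer for every row.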



---



#Q7) How do you evaluate a Bagging Classifier’s performance?

The performance of a Bagging Classifier is evaluated using standard classification metrics like accuracy, precision, recall, F1-score, and the confusion matrix. Cross-validation can also be applied to get more reliable estimates. Additionally, the OOB score is a convenient way to assess model accuracy without a separate validation set. Visualization of ROC-AUC curves and precision-recall curves is helpful for imbalanced data. Comparing results against a single base model shows the improvement gained from bagging.



---



#Q8) How does a Bagging Regressor work?

A Bagging Regressor follows the same principle as a Bagging Classifier but for regression tasks. It trains multiple base regressors on different bootstrapped samples of the data. Each regressor produces a continuous prediction, and the final output is the average of all predictions. This ensemble approach reduces variance and improves predictive stability. It is especially effective for high-variance models like decision trees. Bagging can also improve performance on noisy datasets.
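
A minimal sketch of the same procedure in plain Python, assuming a 1-nearest-neighbour base regressor and synthetic data (both chosen only to keep the example short; `fit_1nn`, `fit_bagged`, and `predict_bagged` are hypothetical helpers):

```python
import random
from statistics import mean

def fit_1nn(xs, ys):
    """Toy high-variance base regressor: return the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]
    return predict

def fit_bagged(xs, ys, n_models=50, seed=0):
    """Train one base regressor per bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_bagged(models, x):
    """The ensemble prediction is the average of the base predictions."""
    return mean(m(x) for m in models)

data_rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2 * x + data_rng.gauss(0, 1) for x in xs]  # noisy line y = 2x

models = fit_bagged(xs, ys)
individual = [m(5.05) for m in models]
bagged = predict_bagged(models, 5.05)
print("range of individual predictions:", round(min(individual), 2), "to", round(max(individual), 2))
print("bagged prediction:", round(bagged, 2))
```

The individual regressors disagree because each saw a different bootstrap sample; the bagged output is their mean and therefore always lies inside that range.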



---



#Q9 - What is the main advantage of ensemble techniques?

The main advantage of ensemble methods is their ability to improve model performance by combining multiple models’ strengths. They often achieve higher accuracy, better generalization, and increased robustness compared to single models. Ensembles can reduce variance (bagging), reduce bias (boosting), or balance both (stacking). They’re less prone to overfitting and usually outperform individual models in complex problems. This makes them widely used in real-world applications like finance, healthcare, and recommendation systems.



---



#Q10 - What is the main challenge of ensemble methods?

The primary challenge of ensemble techniques is their computational complexity. Training multiple models requires more time and resources than a single model. They can also be harder to interpret, as the final decision is a result of many models’ outputs. Hyperparameter tuning and model selection become more complicated. Additionally, ensembles risk overfitting if not properly regularized or if weak learners are too complex. Despite these challenges, their performance benefits often outweigh the drawbacks.



---



#Q11 - Explain the key idea behind ensemble techniques.

The key idea behind ensemble techniques is to combine multiple models to create a stronger, more accurate, and more robust predictive model than any single model alone. Individual models, called base learners or weak learners, may have high bias, high variance, or limited predictive power. However, when combined, they can complement each other’s strengths and compensate for their weaknesses.
Ensemble learning works on the principle that “the wisdom of the crowd” is often more accurate than the opinion of a single member. Techniques like Bagging reduce variance by averaging predictions from many models trained on different data samples. Boosting reduces bias by sequentially training models where each one focuses on correcting the errors of the previous one. Stacking combines different types of models to leverage their diverse learning capabilities.
Overall, ensemble methods increase accuracy, improve generalization, handle noise better, and are more robust in real-world scenarios. They are widely used in applications like fraud detection, medical diagnosis, stock market prediction, and recommendation systems. However, they require more computational resources and are harder to interpret. Despite that, their performance improvements make them one of the most powerful tools in machine learning.
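
The "wisdom of the crowd" argument can be checked with a short calculation. The sketch below assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability `p`, and the models err independently. The function name `majority_accuracy` is hypothetical.

```python
from math import comb

def majority_accuracy(p, n_models):
    """Probability that more than half of n independent voters are correct."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Individually weak models (60% accurate) become a strong majority vote.
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> majority correct with probability {majority_accuracy(0.6, n):.4f}")
```

The same formula shows the limit of the idea: if each model is worse than chance (`p` below 0.5), adding more voters makes the majority worse, and correlated errors weaken the gain in either direction.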



---



#Q12 - What is a Random Forest Classifier?

A Random Forest Classifier is an ensemble learning algorithm that builds multiple decision trees and combines their predictions to improve classification accuracy and robustness. It works on the principle of bagging (bootstrap aggregating) and feature randomness. Each tree is trained on a random subset of the training data and considers a random subset of features when splitting nodes.
During prediction, all trees vote, and the class with the majority vote becomes the final output. This reduces the risk of overfitting that a single decision tree might suffer from. Random Forests also provide a measure of feature importance, helping us understand which features contribute most to the prediction.
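It is highly versatile and works with large datasets and with both numerical and categorical data, although in scikit-learn categorical features must be encoded numerically first and native handling of missing values depends on the library version. It is less sensitive to noise than a single tree and often performs well even with default parameters. The algorithm is widely used in credit scoring, spam detection, medical diagnosis, and recommendation systems.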
Its key advantages include improved accuracy, reduced variance, and good generalization. However, it can be computationally expensive and harder to interpret compared to a single tree.



---



#Q13 - What are the main types of ensemble techniques?

Ensemble techniques can be broadly categorized into three main types: Bagging, Boosting, and Stacking.

Bagging (Bootstrap Aggregating): It involves training multiple models on different random subsets of the data (with replacement) and combining their predictions through voting (classification) or averaging (regression). Examples: Random Forest, BaggingClassifier.

Boosting: It builds models sequentially, where each new model focuses on correcting the mistakes of the previous ones. Boosting reduces bias and improves performance. Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM.

Stacking (Stacked Generalization): It combines predictions from multiple different models (e.g., decision trees, logistic regression, SVM) using a meta-learner that learns how to best combine them.

Other variations include Voting Classifiers (simple majority/weighted voting) and Blending (similar to stacking but uses a holdout set). Each technique has its own strengths — Bagging reduces variance, Boosting reduces bias, and Stacking leverages multiple model types. The choice depends on data size, complexity, and performance goals.



---



#Q14 - What is ensemble learning in machine learning?

Ensemble learning is a machine learning technique that combines multiple models to produce a stronger predictive model with better accuracy, generalization, and stability. The fundamental idea is that a group of weak learners can collectively act as a strong learner.
Instead of relying on one model, ensemble methods aggregate the predictions of several models through techniques like voting, averaging, or weighted combinations. Depending on the method, this reduces variance, bias, or both, and it usually lowers the risk of overfitting.
There are three major forms of ensemble learning: Bagging, Boosting, and Stacking. Bagging trains models in parallel on random subsets of data, Boosting trains them sequentially focusing on errors, and Stacking blends different models using a meta-model.
Ensemble methods are highly effective in solving complex problems and are often used in competitions like Kaggle due to their superior performance. They are widely used in domains like healthcare, finance, fraud detection, and recommendation systems.
However, they are computationally expensive, harder to interpret, and may require extensive hyperparameter tuning. Despite these drawbacks, ensemble learning remains a cornerstone of modern predictive analytics and machine learning.



---



#Q15 - When should we avoid using ensemble methods?

While ensemble methods are powerful, there are certain situations where they might not be ideal. If the dataset is small or simple, a single well-tuned model might perform just as well without the added complexity. Ensemble models are computationally expensive, requiring more memory and training time, so they may not be suitable for real-time or resource-constrained environments.
They also make the model harder to interpret — if explainability is a priority (like in healthcare or finance), simpler models like logistic regression or decision trees are preferred. Overfitting can occur if base models are too complex or if ensembles are not properly regularized.
In cases where quick deployment and interpretability are critical, a single model may be better. Additionally, if the performance gain from an ensemble is marginal compared to a single model, the added complexity might not justify the trade-off.
In short, ensemble methods should be avoided when interpretability, computational cost, simplicity, or data size constraints outweigh the need for higher accuracy.



---



#Q16 - How does Bagging help in reducing overfitting?

Bagging helps reduce overfitting by training multiple models on different random bootstrapped subsets of the training data. Since each model sees a slightly different dataset, they learn slightly different patterns. When their predictions are combined (by averaging or voting), the variance of the final model decreases.
This ensemble approach ensures that errors specific to individual models are averaged out, leading to more stable and generalized predictions. Bagging is particularly effective for high-variance models like decision trees, which tend to overfit the training data.
It also reduces sensitivity to noise, as not all models are influenced by the same outliers. As a result, the ensemble performs better on unseen data. Additionally, the diversity created by bootstrapping ensures that models complement each other’s weaknesses.
Random Forest is a classic example of how bagging reduces overfitting while maintaining strong predictive performance. This makes bagging a go-to technique when overfitting is a major concern.
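
The variance reduction from averaging can be seen in a small simulation. This is a sketch under a strong simplifying assumption: each model's prediction is the true value plus independent Gaussian noise. Models trained on bootstrap samples of the same data are correlated, so the reduction in practice is smaller than shown here.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)
true_value = 10.0
n_models = 25
n_trials = 2000

single = []    # prediction of one model per trial
averaged = []  # mean prediction of all models per trial
for _ in range(n_trials):
    preds = [true_value + rng.gauss(0, 1) for _ in range(n_models)]
    single.append(preds[0])
    averaged.append(mean(preds))

var_single = pvariance(single)
var_avg = pvariance(averaged)
print(f"variance of a single model : {var_single:.3f}")
print(f"variance of the average    : {var_avg:.3f}")
```

With independent errors the variance of the average shrinks roughly in proportion to the number of models, while the expected prediction stays the same; this is why bagging lowers variance without changing bias much.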



---



#Q17 - Why is Random Forest better than a single Decision Tree?

Random Forest is often better than a single decision tree because it reduces variance, improves generalization, and delivers higher accuracy. A single tree can easily overfit, learning noise and patterns specific to the training data. Random Forest mitigates this by building multiple trees on different bootstrapped samples and averaging their predictions.
It introduces randomness in feature selection, which ensures that trees are diverse and less correlated. This diversity makes the ensemble more robust and stable. Random Forest also provides a built-in estimate of feature importance, helping with feature selection and interpretability.
With suitable preprocessing or library support, it copes with missing data, categorical and numerical variables, and large datasets. Additionally, it is less sensitive to noise and outliers than a single tree.
Overall, Random Forest strikes a balance between bias and variance, offering better accuracy and generalization. The trade-off is higher computational cost and reduced interpretability, but the performance benefits usually outweigh these drawbacks.



---



#Q18 - What is the role of bootstrap sampling in Bagging?

Bootstrap sampling is the foundation of the bagging technique. It involves creating multiple training datasets by randomly sampling (with replacement) from the original dataset. Each new dataset (called a bootstrap sample) is likely to contain some repeated instances and some excluded ones.
This randomness ensures that each model is trained on a slightly different subset of the data, introducing diversity among the base learners. The diversity is crucial because it reduces the correlation between models, lowering variance when predictions are aggregated.
Bootstrap sampling also allows us to estimate model performance using Out-of-Bag (OOB) samples (data not included in a particular bootstrap sample). This eliminates the need for a separate validation set.
The ensemble prediction, typically an average or majority vote, leverages this diversity to achieve better generalization. Without bootstrap sampling, all models would see the same data, leading to similar predictions and reduced ensemble effectiveness.
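
Sampling with replacement is easy to demonstrate directly. The sketch below draws one bootstrap sample of row indices and measures how many distinct rows it contains; the helper name `bootstrap_indices` and the dataset size are assumptions for illustration. For large n, the share of rows left out approaches (1 - 1/n)^n, which tends to 1/e, or about 37%.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (hypothetical helper)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)
n = 10_000
sample = bootstrap_indices(n, rng)

in_bag = set(sample)                 # rows that appear at least once
unique_fraction = len(in_bag) / n    # share of distinct rows in the sample
oob_fraction = 1 - unique_fraction   # share of rows never drawn (out-of-bag)
expected_oob = (1 - 1 / n) ** n      # theoretical value, close to 1/e

print(f"distinct rows in sample: {unique_fraction:.3f}")
print(f"out-of-bag rows:         {oob_fraction:.3f} (theory: {expected_oob:.3f})")
```

The sample has the same size as the original data, but roughly a third of the rows are missing from it and the rest appear one or more times, which is the source of diversity among the base learners.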



---



#Q19 - What are some real-world applications of ensemble techniques?

Ensemble methods are widely used in real-world applications across various industries due to their high accuracy and robustness. In finance, they are used for credit scoring, fraud detection, and stock price prediction. In healthcare, they help in disease diagnosis, patient risk prediction, and medical image classification.
In e-commerce, ensembles power recommendation systems, customer segmentation, and churn prediction. Cybersecurity uses them for intrusion detection and malware classification. They are also crucial in natural language processing tasks like sentiment analysis, spam detection, and text classification.
In self-driving cars, ensemble models improve object detection and decision-making accuracy. Manufacturing uses them for predictive maintenance and quality control.
Ensembles are often among the winning approaches in data science competitions (like Kaggle) because they frequently outperform single models. Their ability to handle complex, noisy, and high-dimensional data makes them highly valuable in solving real-world business problems.



---



#Q20 - What is the difference between Bagging and Boosting?

Bagging and Boosting are both ensemble methods but differ fundamentally in how they train models and combine results.

Bagging (Bootstrap Aggregating): Trains multiple models in parallel on different bootstrapped subsets of data. Each model is independent, and their results are combined by averaging (regression) or voting (classification). Bagging mainly reduces variance and improves stability. Random Forest is a classic example.

Boosting: Trains models sequentially, where each new model focuses on correcting the errors of the previous ones. It gives more weight to misclassified instances, reducing bias and improving accuracy. Examples include AdaBoost, Gradient Boosting, and XGBoost.

Bagging is less prone to overfitting but may not significantly improve bias, while Boosting often achieves higher accuracy but risks overfitting if not regularized.

Bagging is simpler and easy to parallelize because its models are independent, while Boosting is often more accurate but must train its models one after another. Both are valuable, and the choice depends on the dataset and performance requirements.
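
The "more weight to misclassified instances" step of boosting can be sketched with the classic discrete AdaBoost update for binary classification. This is an illustrative sketch, not a full boosting implementation: the helper `adaboost_reweight` is hypothetical, the weights are assumed to sum to 1, and the weak learner's weighted error is assumed to be strictly between 0 and 1. Gradient boosting methods use a different mechanism (fitting each new model to the current errors) and are not covered by this sketch.

```python
import math

def adaboost_reweight(weights, is_correct):
    """One AdaBoost round: return the new sample weights and the model weight alpha."""
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)  # weighted error
    alpha = 0.5 * math.log((1 - err) / err)                       # say of this weak learner
    updated = [
        w * math.exp(-alpha if ok else alpha)  # shrink correct rows, grow wrong rows
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.1] * 10                     # start with uniform weights
is_correct = [True] * 8 + [False] * 2    # the weak learner misses the last two rows
new_weights, alpha = adaboost_reweight(weights, is_correct)

print("alpha:", round(alpha, 3))
print("weight of a correct row:      ", round(new_weights[0], 4))
print("weight of a misclassified row:", round(new_weights[-1], 4))
```

After the update, the misclassified rows carry half of the total weight, so the next weak learner cannot do well without addressing them. Bagging has no counterpart to this step because its models never see each other's errors.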



---



#Practical

In [None]:
#Q21: Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Classifier
bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # parameter was named base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)
bag_clf.fit(X_train, y_train)

# Predict & Accuracy
y_pred = bag_clf.predict(X_test)
print("Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred))

In [None]:
#Q22: Train a Bagging Regressor using Decision Trees and evaluate using MSE
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging Regressor
bag_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # parameter was named base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)
bag_reg.fit(X_train, y_train)

# Predict & Evaluate
y_pred = bag_reg.predict(X_test)
print("Bagging Regressor MSE:", mean_squared_error(y_test, y_pred))

In [None]:
#Q23: Train a Random Forest Classifier on Breast Cancer dataset and print feature importance
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Print feature importance
importances = pd.Series(rf_clf.feature_importances_, index=data.feature_names).sort_values(ascending=False)
print("Feature Importances:\n", importances)

In [None]:
#Q24: Train a Random Forest Regressor and compare its performance with a single Decision Tree
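from sklearn.ensemble import RandomForestRegressor

# Use a regression dataset; separate variable names keep the classification split from Q23 intact
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(Xr_train, yr_train)
dt_pred = dt_reg.predict(Xr_test)

# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(Xr_train, yr_train)
rf_pred = rf_reg.predict(Xr_test)

# Accuracy is not defined for continuous targets, so compare MSE (lower is better)
print("Decision Tree MSE:", mean_squared_error(yr_test, dt_pred))
print("Random Forest MSE:", mean_squared_error(yr_test, rf_pred))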

In [None]:
#Q25: Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print("OOB Score:", rf_oob.oob_score_)

In [None]:
#Q26: Train a Bagging Classifier using SVM as a base estimator and print accuracy
from sklearn.svm import SVC

bag_svm = BaggingClassifier(
    estimator=SVC(),  # parameter was named base_estimator before scikit-learn 1.2
    n_estimators=30,
    random_state=42
)
bag_svm.fit(X_train, y_train)
y_pred = bag_svm.predict(X_test)
print("Bagging SVM Accuracy:", accuracy_score(y_test, y_pred))

In [None]:
#Q27: Train a Random Forest Classifier with different numbers of trees and compare accuracy
for trees in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=trees, random_state=42)
    rf.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"Trees: {trees} --> Accuracy: {acc:.4f}")

In [None]:
#Q28: Train a Bagging Classifier using Logistic Regression as base estimator and print AUC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

bag_log = BaggingClassifier(
    estimator=LogisticRegression(max_iter=1000),  # named base_estimator before scikit-learn 1.2
    n_estimators=30,
    random_state=42
)
bag_log.fit(X_train, y_train)
y_prob = bag_log.predict_proba(X_test)[:, 1]
print("Bagging Logistic Regression AUC:", roc_auc_score(y_test, y_prob))

In [None]:
#Q29: Train a Random Forest Regressor and analyze feature importance
from sklearn.ensemble import RandomForestRegressor

# Load regression dataset
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)

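# Label each importance with its feature name instead of a bare column index
importances = pd.Series(rf_reg.feature_importances_, index=load_diabetes().feature_names)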
print("Feature Importances:\n", importances.sort_values(ascending=False))

In [None]:
#Q30: Train an ensemble model using both Bagging and Random Forest and compare accuracy
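# Reload a classification dataset: X_train/y_train currently hold the diabetes regression split from Q29
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # parameter was named base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)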
bag_model.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag_model.predict(X_test))

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, rf_model.predict(X_test))

print("Bagging Model Accuracy:", bag_acc)
print("Random Forest Accuracy:", rf_acc)

In [None]:
#Q31 Train a Random Forest Classifier and tune hyperparameters using GridSearchCV.

#Answer (brief): Use GridSearchCV to search parameters like n_estimators, max_depth, max_features, and min_samples_split; use a pipeline or direct RF with stratified CV and pick metrics (accuracy, f1).

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'max_features': ['sqrt', 'log2', 0.5],
    'min_samples_split': [2, 5, 10]
}
grid = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1, scoring='f1')
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
best = grid.best_estimator_
print("Test accuracy:", accuracy_score(y_test, best.predict(X_test)))

In [None]:
#Q32 Train a Bagging Regressor with different numbers of base estimators and compare performance.

#Answer (brief): Vary n_estimators and compute MSE on test set; plot or print to compare.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

for n in [5, 10, 30, 50, 100]:
    # "estimator" replaces "base_estimator" (renamed in scikit-learn 1.2)
    model = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"n_estimators={n} -> MSE: {mean_squared_error(y_test, pred):.4f}")

In [None]:
#Q33 Train a Random Forest Classifier and analyze misclassified samples.

#Answer (brief): Train RF, predict on test set, locate indices where y_true != y_pred, inspect features or texts for those rows to understand errors.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import classification_report

data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))
mis_idx = [i for i,(t,p) in enumerate(zip(y_test, y_pred)) if t!=p]
df_mis = pd.DataFrame(X_test[mis_idx], columns=data.feature_names)
df_mis['true'] = [y_test[i] for i in mis_idx]
df_mis['pred'] = [y_pred[i] for i in mis_idx]
print("Misclassified samples (features):\n", df_mis)

In [None]:
#Q34 Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier.

#Answer (brief): Fit both on same train/test split; compare accuracy/precision/recall or cross-validated scores.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

# "estimator" replaces "base_estimator" (renamed in scikit-learn 1.2)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
bag_pred = bag.predict(X_test)

print("Decision Tree acc:", accuracy_score(y_test, dt_pred))
print("Bagging acc:", accuracy_score(y_test, bag_pred))
print("\nBagging classification report:\n", classification_report(y_test, bag_pred))

In [None]:
#Q35 Train a Random Forest Classifier and visualize the confusion matrix.

#Answer (brief): Use confusion_matrix and ConfusionMatrixDisplay (or seaborn heatmap) to visualize errors across classes.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
# In this dataset class 0 is malignant and class 1 is benign
disp = ConfusionMatrixDisplay(cm, display_labels=['malignant', 'benign'])
disp.plot(cmap='Blues')
plt.title("Random Forest Confusion Matrix")
plt.show()

In [None]:
#Q36 Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy.

#Answer (brief): Use StackingClassifier with base learners and a final estimator (e.g., LogisticRegression), compare to individual learners.

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

estimators = [
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('svm', SVC(kernel='rbf', probability=True, random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), n_jobs=-1)
stack.fit(X_train, y_train)
print("Stacking accuracy:", accuracy_score(y_test, stack.predict(X_test)))

# Compare single RandomForest for baseline
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("RandomForest accuracy:", accuracy_score(y_test, rf.predict(X_test)))

In [None]:
#Q37 Train a Random Forest Classifier and print the top 5 most important features.

#Answer (brief): After training, sort feature_importances_ and print top-5 with names.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42, stratify=data.target)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
fi = pd.Series(rf.feature_importances_, index=data.feature_names).sort_values(ascending=False)
print("Top 5 features:\n", fi.head(5))

In [None]:
#Q38 Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score.

#Answer (brief): Use classification_report or compute metrics individually for test set predictions.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
y_pred = bag.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
#Q39 Train a Random Forest Classifier and analyze the effect of max_depth on accuracy.

#Answer (brief): Sweep max_depth values, record CV or test accuracies, then plot/print to analyze bias-variance tradeoff.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

for depth in [None, 2, 4, 6, 8, 10]:
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    rf.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"max_depth={depth} -> Test accuracy: {acc:.4f}")

In [None]:
#Q40 Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance.

#Answer (brief): Fit bagging with two different base estimators and compare MSE or R² on test set.

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for base in [DecisionTreeRegressor(), KNeighborsRegressor()]:
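    # Pass the base model positionally: the keyword was renamed from
    # base_estimator to estimator in scikit-learn 1.2, and the old name was later removed.
    bag = BaggingRegressor(base, n_estimators=50, random_state=42)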
    bag.fit(X_train, y_train)
    pred = bag.predict(X_test)
    print(f"Base: {base.__class__.__name__} -> MSE: {mean_squared_error(y_test, pred):.4f}")
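
#Note (sketch): The brief answer names R² as an alternative metric. A minimal variant of the comparison using r2_score; the r2_results dictionary is introduced here only for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# R^2 per base estimator; higher is better, 1.0 is a perfect fit.
r2_results = {}
for base in [DecisionTreeRegressor(), KNeighborsRegressor()]:
    bag = BaggingRegressor(base, n_estimators=50, random_state=42)
    bag.fit(X_train, y_train)
    name = base.__class__.__name__
    r2_results[name] = r2_score(y_test, bag.predict(X_test))
    print(f"Base: {name} -> R2: {r2_results[name]:.4f}")
```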

In [None]:
#Q41 Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score.

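#Answer (brief): Use predict_proba with roc_auc_score (RandomForestClassifier has no decision_function). For binary problems, pass the positive-class probabilities; for multiclass, pass the full probability matrix with multi_class='ovr' and macro averaging.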

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
y_prob = rf.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
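
#Note (sketch): The multiclass case mentioned in the brief answer might look like the following. It assumes the Wine dataset as a stand-in three-class problem and one-vs-rest with macro averaging; the auc_ovr name is introduced here for illustration:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# For multiclass AUC, pass the whole probability matrix
# (one column per class), not a single column.
y_prob = rf.predict_proba(X_test)
auc_ovr = roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro")
print(f"Macro one-vs-rest ROC-AUC: {auc_ovr:.4f}")
```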

In [None]:
#Q42 Train a Bagging Classifier and evaluate its performance using cross-validation.

#Answer (brief): Use cross_val_score with stratified folds and metrics like accuracy or f1.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
scores = cross_val_score(bag, X, y, cv=5, scoring='accuracy', n_jobs=-1)
print("Cross-val accuracies:", scores)
print("Mean accuracy:", scores.mean())
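
#Note (sketch): With an integer cv and a classifier, cross_val_score already uses stratified folds, but without shuffling. A variant that makes the stratified splitter explicit, shuffles it, and scores with macro F1 (the shuffle setting and the f1_scores name are choices made for this sketch):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Explicit stratified splitter so the class ratio is preserved in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(bag, X, y, cv=cv, scoring="f1_macro")
print("Cross-val macro F1:", f1_scores)
print(f"Mean macro F1: {f1_scores.mean():.4f}")
```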

In [None]:
#Q43 Train a Random Forest Classifier and plot the Precision-Recall curve.

#Answer (brief): Use precision_recall_curve and plot precision vs recall for the positive class; compute average precision.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_prob = rf.predict_proba(X_test)[:, 1]
prec, recall, _ = precision_recall_curve(y_test, y_prob)
ap = average_precision_score(y_test, y_prob)

plt.plot(recall, prec)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"Precision-Recall curve (AP={ap:.3f})")
plt.show()

In [None]:
#Q44 Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy.

#Answer (brief): Stacking can combine RF and LR as base learners (and possibly others); the meta-learner blends their predictions for final output.

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=42))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), n_jobs=-1)
stack.fit(X_train, y_train)
print("Stacking accuracy:", accuracy_score(y_test, stack.predict(X_test)))

# Compare with a standalone Random Forest (note: 200 trees here vs. 100 inside the stack)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("RandomForest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
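
#Note (sketch): To complete the comparison, the other base learner can be scored on its own as well. This sketch assumes a StandardScaler in front of the logistic regression, which is not in the original cell, because the unscaled Wine features can slow lbfgs convergence:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standalone logistic regression; scaling is an added assumption of this sketch.
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
lr.fit(X_train, y_train)
lr_acc = accuracy_score(y_test, lr.predict(X_test))
print(f"LogisticRegression accuracy: {lr_acc:.4f}")
```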

In [None]:
#Q45 Train a Bagging Regressor with different levels of bootstrap samples and compare performance.

#Answer (brief): Vary bootstrap (True/False) or change sample sizes with max_samples to see effect on performance and variance.

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for max_samples in [0.5, 0.7, 1.0]:
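    # Base model passed positionally so the call works whether the keyword
    # is named base_estimator (older scikit-learn) or estimator (1.2+).
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                           bootstrap=True, max_samples=max_samples, random_state=42)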
    bag.fit(X_train, y_train)
    pred = bag.predict(X_test)
    print(f"bootstrap=True, max_samples={max_samples} -> MSE: {mean_squared_error(y_test, pred):.4f}")

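# Also try bootstrap=False for comparison. With the default max_samples=1.0,
# every tree is then trained on the full training set, so the ensemble
# behaves much like a single decision tree.
bag_nb = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                          bootstrap=False, random_state=42)
bag_nb.fit(X_train, y_train)
print(f"bootstrap=False -> MSE: {mean_squared_error(y_test, bag_nb.predict(X_test)):.4f}")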