#1. What is Ensemble Learning in machine learning? Explain the key idea behind it.

    ->Ensemble learning is a method that combines the predictions of multiple individual models (called "base learners") into a single prediction that is usually more accurate and robust than that of any one base learner. The key idea is to leverage the "wisdom of crowds": when the models are diverse and make different errors, combining them by voting or averaging lets many of those errors cancel out.
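
    The effect can be illustrated with a small simulation. This is a minimal sketch, not a real ensemble: the number of models, their accuracy of 0.6, and the assumption that they err independently on a binary task are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 25        # hypothetical number of base learners
n_samples = 10_000   # simulated test cases
p_correct = 0.6      # assumed accuracy of each individual model

# True where a model gets a case right; models are assumed to err independently.
correct = rng.random((n_models, n_samples)) < p_correct

single_accuracy = correct[0].mean()
# On a binary task the majority vote is right when more than half of the models are right.
majority_accuracy = (correct.sum(axis=0) > n_models / 2).mean()

print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority vote accuracy: {majority_accuracy:.3f}")
```

    Real base learners are correlated, so the gain in practice is smaller than in this idealized setting.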

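#2. What is the difference between Bagging and Boosting?

    ->Bagging (bootstrap aggregating) trains base learners independently, each on a different bootstrap sample of the training data, and combines them by voting or averaging; its main effect is to reduce variance. Boosting trains base learners sequentially: each new learner concentrates on the errors of the previous ones, for example by increasing the weights of misclassified observations; its main effect is to reduce bias.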
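
    As a rough side-by-side, the sketch below cross-validates one bagging model and one boosting model from scikit-learn. The dataset (Breast Cancer), the choice of AdaBoost as the boosting method, and the settings are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: trees are fitted independently on bootstrap samples
# (a decision tree is the default base estimator).
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: AdaBoost fits shallow trees one after another, giving more
# weight to the observations that earlier trees misclassified.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)

print(f"Bagging  mean CV accuracy: {bagging_scores.mean():.4f}")
print(f"Boosting mean CV accuracy: {boosting_scores.mean():.4f}")
```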

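#3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

    ->Bootstrap sampling is a resampling technique in which a sample, typically the same size as the original dataset, is drawn at random from that dataset with replacement, so some observations appear more than once and others not at all. In Bagging methods like Random Forest, each base learner is trained on its own bootstrap sample. This makes the learners differ from one another, which is what allows averaging their predictions to reduce variance.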
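
    A minimal sketch of drawing one bootstrap sample with NumPy. The toy dataset of 1000 integers is an assumption for illustration. When the sample is as large as the dataset, roughly 63% of the distinct observations are expected to appear in it.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1000
data = np.arange(n)  # toy dataset: the integers 0..999

# Draw n indices with replacement, so repeats are allowed.
indices = rng.integers(0, n, size=n)
bootstrap_sample = data[indices]

# Fraction of distinct original observations that made it into the sample.
unique_fraction = np.unique(indices).size / n

print(f"Sample size: {bootstrap_sample.size}")
print(f"Fraction of distinct observations drawn: {unique_fraction:.3f}")
```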

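#4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

    ->Out-of-bag samples are the training observations that were not drawn into the bootstrap sample of a given base learner (about 37% of the data per learner when the bootstrap sample is as large as the dataset). The OOB score is calculated by predicting each observation using only the learners that did not see it during training and then scoring those predictions. It gives an estimate of generalization performance without a separate validation set.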
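
    In scikit-learn, a Random Forest can report this score directly through its oob_score option. The sketch below assumes the Breast Cancer dataset and 200 trees purely as an example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training observation using
# only the trees whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

oob_accuracy = rf.oob_score_
print(f"OOB accuracy: {oob_accuracy:.4f}")
```

    No train/test split is needed here, because every observation is scored only by trees that never saw it.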

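#5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

    ->Single Decision Tree:

    Impurity-based importance: each feature is scored by the total impurity reduction of the splits that use it.
    Sensitivity to individual splits: the scores come from one tree, so a small change in the training data can change the splits and therefore the ranking.

    ->Random Forest:

    Ensemble averaging: impurity-based importances are averaged over all trees, which gives a more stable ranking.
    Permutation importance: often used as a complement; it measures how much the model's score drops when one feature's values are shuffled.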
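
    The comparison can be sketched as follows. The Breast Cancer dataset, the 70/30 split, and the model settings are assumptions for illustration; permutation_importance comes from sklearn.inspection.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importances: a feature that is never split on gets exactly 0.
tree_used = int((tree.feature_importances_ > 0).sum())
forest_used = int((forest.feature_importances_ > 0).sum())
print(f"Features with non-zero importance: tree={tree_used}, forest={forest_used}")

# Permutation importance of the forest, measured on held-out data.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)
top = perm.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {perm.importances_mean[i]:.4f}")
```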

#6. Write a Python program to:
    ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
    ● Train a Random Forest Classifier
    ● Print the top 5 most important features based on feature importance scores.

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the classifier to the data
rf_classifier.fit(X, y)
importances = rf_classifier.feature_importances_
feature_names = cancer.feature_names
feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
})

feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)

print("Top 5 Most Important Features in the Breast Cancer Dataset:")
print(feature_importance_df.head(5))



Top 5 Most Important Features in the Breast Cancer Dataset:
                 feature  importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


#7. Write a Python program to:
    ● Train a Bagging Classifier using Decision Trees on the Iris dataset
    ● Evaluate its accuracy and compare with a single Decision Tree

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

single_dt_classifier = DecisionTreeClassifier(random_state=42)
single_dt_classifier.fit(X_train, y_train)

y_pred_single_dt = single_dt_classifier.predict(X_test)

accuracy_single_dt = accuracy_score(y_test, y_pred_single_dt)
print(f"Accuracy of a single Decision Tree: {accuracy_single_dt:.4f}")
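
# Bagging ensemble of 10 trees; BaggingClassifier uses a decision tree as its default base estimator
bagging_classifier = BaggingClassifier(n_estimators=10, random_state=42)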
bagging_classifier.fit(X_train, y_train)

y_pred_bagging = bagging_classifier.predict(X_test)

accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f"Accuracy of Bagging Classifier: {accuracy_bagging:.4f}")
if accuracy_bagging > accuracy_single_dt:
    print("\nThe Bagging Classifier achieved higher accuracy.")
elif accuracy_bagging < accuracy_single_dt:
    print("\nThe single Decision Tree achieved higher accuracy.")
else:
    print("\nBoth classifiers achieved the same accuracy.")

Accuracy of a single Decision Tree: 1.0000
Accuracy of Bagging Classifier: 1.0000

Both classifiers achieved the same accuracy.


#8. Write a Python program to:
    ● Train a Random Forest Classifier
    ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
    ● Print the best parameters and final accuracy

In [4]:

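# Imports repeated so that this cell also runs on its own
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
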
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf_classifier = RandomForestClassifier(random_state=42)
param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [None, 5, 10]}

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters found by GridSearchCV:")
print(grid_search.best_params_)

best_rf_model = grid_search.best_estimator_
y_pred = best_rf_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print(f"Final accuracy on the test set: {final_accuracy:.4f}")

Best parameters found by GridSearchCV:
{'max_depth': None, 'n_estimators': 100}
Final accuracy on the test set: 1.0000


#9. Write a Python program to:
    ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
    ● Compare their Mean Squared Errors (MSE)

In [5]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()
X = housing.data
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
bagging_reg = BaggingRegressor(random_state=42)
bagging_reg.fit(X_train, y_train)
y_pred_bagging = bagging_reg.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Mean Squared Error (Bagging Regressor): {mse_bagging:.4f}")
print(f"Mean Squared Error (Random Forest Regressor): {mse_rf:.4f}")

Mean Squared Error (Bagging Regressor): 0.2824
Mean Squared Error (Random Forest Regressor): 0.2554


#10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.
    You decide to use ensemble techniques to increase model performance.
    Explain your step-by-step approach to:
    ● Choose between Bagging or Boosting
    ● Handle overfitting
    ● Select base models
    ● Evaluate performance using cross-validation
    ● Justify how ensemble learning improves decision-making in this real-world context.


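    ->(1) Choose between Bagging and Boosting based on the nature of the data and the base models. Bagging (for example, a Random Forest) suits high-variance base models such as deep trees and is more tolerant of noisy labels. Boosting suits weak, high-bias base models such as shallow trees and often reaches higher accuracy on clean data, at the cost of more careful tuning.

    (2) Handle overfitting by pruning or limiting the depth of the base models, adjusting ensemble parameters (number of estimators, learning rate, subsampling), and using regularization within the base models.

    (3) Select base models for their diversity and their ability to model complex relationships in the data. Decision trees are a common choice because they handle mixed demographic and transaction features with little preprocessing.

    (4) Evaluate performance with cross-validation, calculating metrics such as precision, recall, and AUC on multiple splits of the data to get a robust estimate. Because defaults are usually much rarer than repayments, stratified splits keep the class ratio the same in every fold.

    (5) Decision-making improves because ensembles provide more stable, accurate, and reliable predictions of default risk than a single model, so lending decisions depend less on the quirks of any one model.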
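
    The steps above can be sketched end to end. Everything in this example is an assumption made for illustration: the data is synthetic (make_classification with about 10% positives standing in for defaults), and the two candidate models and their settings are arbitrary starting points rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: class 1 ("default") is about 10% of rows.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

# One bagging-style and one boosting-style candidate; limited depth and a
# small learning rate are the overfitting controls in this sketch.
models = {
    "Bagging (Random Forest)": RandomForestClassifier(
        n_estimators=100, max_depth=8, class_weight="balanced", random_state=42
    ),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.05, max_depth=3, random_state=42
    ),
}

# Stratified folds keep the default rate the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["roc_auc", "precision", "recall"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    summary = ", ".join(f"{m}={v:.3f}" for m, v in results[name].items())
    print(f"{name}: {summary}")
```

    On real loan data the same loop would run on the actual feature matrix, and the model with the better cross-validated AUC and recall on defaults would be the natural one to carry forward.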