
 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.


          Ensemble Learning is a technique in machine learning where multiple models (often called base learners or weak learners) are combined to solve a particular problem and improve the overall performance compared to individual models.

          

          The key idea is:

          "A group of weak models, when combined properly, can produce a strong model."

          Just like in a team, where each member contributes something unique, ensemble methods combine the strengths of different models to achieve better accuracy, robustness, and generalization.

          Why it works:

          Different models may make different errors on the same data. By averaging or voting among them, ensemble learning can cancel out individual errors, leading to more stable and accurate predictions.
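The error-cancelling effect of voting can be made concrete with a small calculation. The sketch below assumes a binary task and models whose errors are fully independent, which real models only approximate; the function name `majority_vote_accuracy` and the 0.7 accuracy figure are illustrative choices, not taken from the text above.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Assumption: every model is right 70% of the time, independently of the others.
single = majority_vote_accuracy(1, 0.7)
ensemble_11 = majority_vote_accuracy(11, 0.7)

print(f"1 model:   {single:.3f}")
print(f"11 models: {ensemble_11:.3f}")
```

Under these assumptions the majority vote is right noticeably more often than any single model. The gain shrinks as the models' errors become correlated, which is why ensemble methods try to make their members diverse.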

          
          | Method | Description |
          |---|---|
          | Bagging | Builds multiple models (usually of the same type) on different subsets of the training data and averages their outputs. Example: Random Forest |
          | Boosting | Builds models sequentially, where each new model corrects the errors made by the previous ones. Examples: AdaBoost, Gradient Boosting, XGBoost |
          | Stacking | Combines predictions from multiple models using another model (meta-learner) to make the final prediction. |
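Stacking is the only method in the table without an example, so here is a minimal sketch using scikit-learn's `StackingClassifier`. The choice of base models (a Random Forest and a k-nearest-neighbours classifier) and of Logistic Regression as the meta-learner is an assumption made for illustration; any mix of reasonably diverse models could be used.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Base models produce predictions; the meta-learner is trained on those
# predictions (generated out-of-fold with cv=5) to make the final call.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print("Stacking accuracy:", round(stack_accuracy, 4))
```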


 Code:


                from sklearn.datasets import load_iris
                from sklearn.model_selection import train_test_split
                from sklearn.ensemble import RandomForestClassifier
                from sklearn.linear_model import LogisticRegression
                from sklearn.metrics import accuracy_score

                # Load data
                X, y = load_iris(return_X_y=True)
                X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

                # Train base model
                model1 = LogisticRegression(max_iter=200)
                model1.fit(X_train, y_train)
                pred1 = model1.predict(X_test)
                acc1 = accuracy_score(y_test, pred1)

                # Train ensemble model
                ensemble = RandomForestClassifier()
                ensemble.fit(X_train, y_train)
                pred2 = ensemble.predict(X_test)
                acc2 = accuracy_score(y_test, pred2)



                # Print results
                print("Logistic Regression Accuracy:", acc1)
                print("Random Forest (Ensemble) Accuracy:", acc2)


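 Output (exact values may vary between runs because the Random Forest is trained without a fixed random_state):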

      Logistic Regression Accuracy: 0.92
      Random Forest (Ensemble) Accuracy: 0.97






 2: What is the difference between Bagging and Boosting?



          | Feature | Bagging | Boosting |
          |---|---|---|
          | Full Name | Bootstrap Aggregating | n/a |
          | Model Building | Independent models in parallel | Sequential models (each corrects the previous) |
          | Goal | Reduce variance | Reduce bias and variance |
          | Training Data | Random subsets (with replacement) | Weighted data focusing on hard examples |
          | Example Algorithms | Random Forest | AdaBoost, Gradient Boosting, XGBoost |
          | Overfitting | Less prone to overfitting | Can overfit if not tuned |
          | Speed | Faster (parallelizable) | Slower (sequential) |

          💡 Key Differences:

          Bagging: Aims to reduce variance by averaging multiple models trained independently on random subsets.

          Boosting: Aims to reduce bias by training models sequentially, where each model learns from the mistakes of the previous one.
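The sequential "learn from the previous mistakes" idea can be sketched by hand. The example below is a simplified squared-error gradient boosting loop in which each decision stump is fitted to the residuals of the ensemble built so far; it is not AdaBoost's reweighting scheme. The synthetic sine data, the learning rate of 0.5 and the 20 rounds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_per_round = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction                      # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1)      # weak learner
    stump.fit(X, residuals)                         # learn to predict those errors
    prediction += learning_rate * stump.predict(X)  # add a damped correction
    mse_per_round.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting:", round(mse_per_round[0], 4))
print("Training MSE after 20 rounds:", round(mse_per_round[-1], 4))
```

Each round depends on the output of the previous one, which is why boosting cannot be parallelized across models the way bagging can.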

          🧪 Python Code Example with Output:

          Let's compare Bagging (Random Forest) vs Boosting (AdaBoost) using the Iris dataset:

          from sklearn.datasets import load_iris
          from sklearn.model_selection import train_test_split
          from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
          from sklearn.metrics import accuracy_score

          # Load dataset
          X, y = load_iris(return_X_y=True)
          X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

          # Bagging - Random Forest
          bagging_model = RandomForestClassifier()
          bagging_model.fit(X_train, y_train)
          bagging_preds = bagging_model.predict(X_test)
          bagging_acc = accuracy_score(y_test, bagging_preds)

          # Boosting - AdaBoost
          boosting_model = AdaBoostClassifier()
          boosting_model.fit(X_train, y_train)
          boosting_preds = boosting_model.predict(X_test)
          boosting_acc = accuracy_score(y_test, boosting_preds)


          # Print results
          print("Random Forest (Bagging) Accuracy:", bagging_acc)
          print("AdaBoost (Boosting) Accuracy:", boosting_acc)


  Output (exact values may vary between runs because the models are trained without a fixed random_state):

      Random Forest (Bagging) Accuracy: 0.97
      AdaBoost (Boosting) Accuracy: 0.94





 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?



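            Bootstrap sampling creates a new dataset of the same size as the original by drawing rows at random with replacement, so some rows appear several times and others are left out.

            Role in Bagging (e.g., Random Forest):

            In Bagging (Bootstrap Aggregating) methods such as Random Forest: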

            Multiple bootstrap samples are created from the training data.

            Each sample is used to train a separate model (e.g., a decision tree).

            The final prediction is made by averaging (regression) or voting (classification) the outputs of all models.

           

            This helps to:

            Reduce variance (because each model sees a slightly different view of the data)

            Improve generalization

            Reduce overfitting
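The bootstrap, train and vote steps described above can be written out by hand. This is a minimal sketch of plain bagging with decision trees, not Random Forest's actual implementation (which also subsamples features at each split); the 25 trees and the seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_train = len(X_train)
all_votes = []

for i in range(25):
    # Step 1: draw a bootstrap sample (row indices sampled with replacement).
    idx = rng.integers(0, n_train, size=n_train)
    # Step 2: train a separate model on that sample.
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

# Step 3: majority vote across the trees for every test point.
votes = np.stack(all_votes)  # shape: (n_trees, n_test)
majority = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

bagged_accuracy = np.mean(majority == y_test)
print("Hand-rolled bagging accuracy:", round(float(bagged_accuracy), 4))
```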

            🧪 Python Code Example: Bootstrap Sampling Demo + Random Forest

            from sklearn.utils import resample
            from sklearn.datasets import load_iris
            from sklearn.model_selection import train_test_split
            from sklearn.ensemble import RandomForestClassifier
            from sklearn.metrics import accuracy_score

            # Load Iris dataset
            X, y = load_iris(return_X_y=True)
            X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

            # Show bootstrap sample (manually using resample)
            X_bootstrap, y_bootstrap = resample(X_train, y_train, replace=True, n_samples=len(X_train), random_state=1)

            # Train Random Forest (which uses bootstrap internally)
            rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
            rf_model.fit(X_train, y_train)
            preds = rf_model.predict(X_test)
            accuracy = accuracy_score(y_test, preds)

            # Print results
            print("Original Training Data Size:", X_train.shape[0])
            print("Bootstrap Sample Size:", X_bootstrap.shape[0])
            print("Example of Bootstrap Sampling (first 5 rows):")
            print(X_bootstrap[:5])
            print("Random Forest Accuracy using Bootstrap Sampling:", accuracy)



 Output:

            Original Training Data Size: 112
            Bootstrap Sample Size: 112
            Example of Bootstrap Sampling (first 5 rows):
            [[5.1 3.5 1.4 0.2]
            [6.4 3.2 4.5 1.5]
            [5.1 3.5 1.4 0.2]
            [6.1 2.9 4.7 1.4]
            [6.1 2.9 4.7 1.4]]
            Random Forest Accuracy using Bootstrap Sampling: 0.97




 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?




              In Bagging methods like Random Forest, each tree is trained on a bootstrap sample: a random sample drawn with replacement from the training data.

              Out-of-Bag (OOB) samples are the data points not included in the bootstrap sample for a particular tree.

              On average, about 36.8% (roughly 1/3) of the training points are left out of each bootstrap sample and become OOB samples for that tree.
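The "about one third" figure follows from the sampling itself: a given point is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The short check below compares that formula with a simulation; the dataset size of 1000 and the 200 repetitions are arbitrary choices.

```python
import math

import numpy as np

n = 1000
theoretical = (1 - 1 / n) ** n  # chance a given point is never drawn in n draws

rng = np.random.default_rng(0)
oob_fractions = []
for _ in range(200):
    sample_idx = rng.integers(0, n, size=n)   # one bootstrap sample
    n_unique = np.unique(sample_idx).size     # points that made it into the bag
    oob_fractions.append(1 - n_unique / n)    # share left out of the bag
empirical = float(np.mean(oob_fractions))

print(f"(1 - 1/n)^n for n={n}: {theoretical:.4f}")
print(f"1/e:                    {math.exp(-1):.4f}")
print(f"Simulated OOB share:    {empirical:.4f}")
```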

              

              The OOB score is a performance metric computed using the OOB samples:

              For each data point, we take only the trees that did not see it during training.

              We aggregate their predictions into a single OOB prediction for that point.

              The OOB score is the accuracy of these OOB predictions, averaged over all training points.

              It serves as a built-in alternative to cross-validation, so you don't need a separate validation set.

              

              Advantages:

              Efficient: No need for explicit cross-validation

              Honest: Only evaluates using trees that haven’t seen the sample
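To make the OOB procedure concrete, the sketch below computes an OOB score by hand for a bagged ensemble of decision trees. It is an illustration of the idea, not scikit-learn's internal code: it counts hard votes, whereas `RandomForestClassifier` aggregates predicted class probabilities, so the two scores can differ slightly. The 100 trees and the seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_samples, n_classes = len(X), len(np.unique(y))
rng = np.random.default_rng(42)

# oob_votes[i, c] counts trees that never saw point i and predicted class c.
oob_votes = np.zeros((n_samples, n_classes), dtype=int)

for i in range(100):
    in_bag = rng.integers(0, n_samples, size=n_samples)  # bootstrap indices
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[in_bag] = False                              # True = out-of-bag
    tree = DecisionTreeClassifier(random_state=i).fit(X[in_bag], y[in_bag])
    oob_pred = tree.predict(X[oob_mask])
    oob_votes[np.flatnonzero(oob_mask), oob_pred] += 1

# Score only points that were out-of-bag for at least one tree.
has_vote = oob_votes.sum(axis=1) > 0
manual_oob_score = np.mean(oob_votes[has_vote].argmax(axis=1) == y[has_vote])
print("Hand-computed OOB score:", round(float(manual_oob_score), 4))
```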

              Python Code Example: OOB Score in Random Forest

              from sklearn.datasets import load_iris
              from sklearn.ensemble import RandomForestClassifier

              # Load Iris dataset
              X, y = load_iris(return_X_y=True)

              # Train Random Forest with OOB score enabled
              rf_model = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True, random_state=42)
              rf_model.fit(X, y)

              # Get OOB score
              oob_score = rf_model.oob_score_


              # Print the OOB score
              print("OOB Score (approximate accuracy without cross-validation):", round(oob_score, 4))


 Output:

      OOB Score (approximate accuracy without cross-validation): 0.9533




 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.



                Feature Importance tells us how valuable each feature is in predicting the target variable.

                Both Decision Trees and Random Forests can compute feature importance based on how much each feature contributes to reducing impurity (like Gini or entropy).

                ⚖️ Comparison Table:

                | Aspect | Decision Tree | Random Forest |
                |---|---|---|
                | Basis of Importance | Single model’s splits | Averaged over many trees |
                | Stability | Can vary significantly (high variance) | More stable and reliable |
                | Overfitting Risk | Higher; importance may be misleading | Lower, due to aggregation |
                | Bias to Dominant Features | More prone | Less prone due to averaging |
                | Interpretability | Easier (only one tree to inspect) | Harder (many trees), but better generalization |

                🧪 Python Code Example: Comparing Feature Importances

                We’ll train:

                A Decision Tree Classifier

                A Random Forest Classifier

                And compare their feature importances using the Iris dataset.

                import pandas as pd
                import matplotlib.pyplot as plt
                from sklearn.datasets import load_iris
                from sklearn.tree import DecisionTreeClassifier
                from sklearn.ensemble import RandomForestClassifier

                # Load dataset
                iris = load_iris()
                X, y = iris.data, iris.target
                feature_names = iris.feature_names

                # Train Decision Tree
                tree = DecisionTreeClassifier(random_state=42)
                tree.fit(X, y)
                tree_importance = tree.feature_importances_

                # Train Random Forest
                forest = RandomForestClassifier(n_estimators=100, random_state=42)
                forest.fit(X, y)
                forest_importance = forest.feature_importances_

                # Create DataFrame for comparison
                importance_df = pd.DataFrame({
                    'Feature': feature_names,
                    'Decision Tree Importance': tree_importance,
                    'Random Forest Importance': forest_importance
                })

                # Sort by Random Forest importance for better visual comparison
                importance_df = importance_df.sort_values('Random Forest Importance', ascending=False)

                # Output
                print(importance_df)

                # Plot
                importance_df.set_index('Feature').plot(kind='bar', figsize=(10, 5), title='Feature Importance: Decision Tree vs Random Forest')
                plt.ylabel('Importance Score')
                plt.tight_layout()
                plt.show()



 Output :

                        Feature  Decision Tree Importance  Random Forest Importance
        2     petal length (cm)                     0.649                  0.433824
        3      petal width (cm)                     0.294                  0.433659
        1      sepal width (cm)                     0.000                  0.068256
        0     sepal length (cm)                     0.057                  0.064261
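One way to probe the stability claim in the comparison table is to look at how much the importance of each feature varies from tree to tree inside the forest. The sketch below uses the fitted forest's `estimators_` list; a large spread for a feature suggests that any single tree's ranking should be read with caution. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)

# One row of importances per tree in the forest.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
mean_importance = per_tree.mean(axis=0)  # what the forest reports
std_importance = per_tree.std(axis=0)    # disagreement between individual trees

for name, mean, std in zip(iris.feature_names, mean_importance, std_importance):
    print(f"{name:<20} mean={mean:.3f}  std across trees={std:.3f}")
```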





 6: Write a Python program to:

 - Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
 - Train a Random Forest Classifier
 - Print the top 5 most important features based on feature importance scores



              import pandas as pd
              from sklearn.datasets import load_breast_cancer
              from sklearn.ensemble import RandomForestClassifier

              # Load Breast Cancer dataset
              data = load_breast_cancer()
              X, y = data.data, data.target
              feature_names = data.feature_names

              # Train a Random Forest Classifier
              rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
              rf_model.fit(X, y)

              # Get feature importances
              importances = rf_model.feature_importances_

              # Create a DataFrame for easier analysis
              importance_df = pd.DataFrame({
                  'Feature': feature_names,
                  'Importance': importances
              })

              # Sort by importance descending and select top 5
              top_features = importance_df.sort_values(by='Importance', ascending=False).head(5)

              # Print result
              print("🔝 Top 5 Most Important Features:")
              print(top_features.to_string(index=False))

 Output:


 Top 5 Most Important Features:
              Feature  Importance
      worst perimeter    0.150314
 worst concave points    0.140639
  mean concave points    0.111233
      worst concavity    0.098124
       mean perimeter    0.071129





 7: Write a Python program to:

 - Train a Bagging Classifier using Decision Trees on the Iris dataset
 - Evaluate its accuracy and compare with a single Decision Tree




          from sklearn.datasets import load_iris
          from sklearn.model_selection import train_test_split
          from sklearn.tree import DecisionTreeClassifier
          from sklearn.ensemble import BaggingClassifier
          from sklearn.metrics import accuracy_score

          # Load Iris dataset
          X, y = load_iris(return_X_y=True)

          # Split into training and testing sets
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

          # Train a single Decision Tree
          tree_model = DecisionTreeClassifier(random_state=42)
          tree_model.fit(X_train, y_train)
          tree_preds = tree_model.predict(X_test)
          tree_acc = accuracy_score(y_test, tree_preds)

          # Train a Bagging Classifier using Decision Trees
          bagging_model = BaggingClassifier(
              estimator=DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
              n_estimators=100,
              random_state=42
          )
          bagging_model.fit(X_train, y_train)
          bagging_preds = bagging_model.predict(X_test)
          bagging_acc = accuracy_score(y_test, bagging_preds)

          # Print the results
          print("Decision Tree Accuracy:", round(tree_acc, 4))
          print("Bagging Classifier Accuracy:", round(bagging_acc, 4))


 Output:


      Decision Tree Accuracy: 0.9333
      Bagging Classifier Accuracy: 0.9556



 8: Write a Python program to:

 - Train a Random Forest Classifier
 - Tune hyperparameters max_depth and n_estimators using GridSearchCV
 - Print the best parameters and final accuracy




              from sklearn.datasets import load_iris
              from sklearn.model_selection import train_test_split, GridSearchCV
              from sklearn.ensemble import RandomForestClassifier
              from sklearn.metrics import accuracy_score

              # Load dataset
              X, y = load_iris(return_X_y=True)
              X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

              # Define parameter grid
              param_grid = {
                  'n_estimators': [10, 50, 100],
                  'max_depth': [2, 4, 6, None]
              }

              # Initialize Random Forest
              rf = RandomForestClassifier(random_state=42)

              # GridSearchCV to find best parameters
              grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')
              grid_search.fit(X_train, y_train)

              # Best parameters
              best_params = grid_search.best_params_

              # Evaluate best model
              best_model = grid_search.best_estimator_
              y_pred = best_model.predict(X_test)
              final_accuracy = accuracy_score(y_test, y_pred)


              # Print results
              print("Best Hyperparameters:", best_params)
              print("Final Accuracy on Test Set:", round(final_accuracy, 4))


 Output:


      Best Hyperparameters: {'max_depth': 4, 'n_estimators': 100}
      Final Accuracy on Test Set: 0.9778





 9: Write a Python program to:

 - Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
 - Compare their Mean Squared Errors (MSE)




            from sklearn.datasets import fetch_california_housing
            from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import mean_squared_error
            from sklearn.tree import DecisionTreeRegressor
            import numpy as np

            # Load California Housing dataset
            data = fetch_california_housing()
            X, y = data.data, data.target

            # Split dataset
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

            # Train Bagging Regressor using Decision Tree as base estimator
            bagging_model = BaggingRegressor(
                estimator=DecisionTreeRegressor(),  # named base_estimator before scikit-learn 1.2
                n_estimators=100,
                random_state=42
            )
            bagging_model.fit(X_train, y_train)
            bagging_preds = bagging_model.predict(X_test)
            bagging_mse = mean_squared_error(y_test, bagging_preds)

            # Train Random Forest Regressor
            rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
            rf_model.fit(X_train, y_train)
            rf_preds = rf_model.predict(X_test)
            rf_mse = mean_squared_error(y_test, rf_preds)

            # Output comparison
            print("📊 Mean Squared Error Comparison:")
            print("🧺 Bagging Regressor MSE:", round(bagging_mse, 4))
            print("🌲 Random Forest Regressor MSE:", round(rf_mse, 4))



 Output:


        Mean Squared Error Comparison:
        Bagging Regressor MSE: 0.2375
        Random Forest Regressor MSE: 0.2063






 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:

 - Choose between Bagging or Boosting
 - Handle overfitting
 - Select base models
 - Evaluate performance using cross-validation
 - Justify how ensemble learning improves decision-making in this real-world context





              🔹 Step 1: Choose Between Bagging and Boosting

              | Factor | Bagging | Boosting |
              |---|---|---|
              | Goal | Reduce variance | Reduce bias and variance |
              | Data Noise | Works well with noisy data | Sensitive to noise |
              | Overfitting Risk | Lower | Higher (if not tuned) |
              | Interpretability | Moderate | Less (but explainable with SHAP) |
              | Example Algorithms | Random Forest | XGBoost, LightGBM, AdaBoost |

              Decision: Use Boosting (e.g., XGBoost)
              Reason: Loan-default data is typically structured (tabular) and class-imbalanced, and the metrics that matter are recall and precision on the minority (default) class. Boosting often performs well on such datasets.

              🔹 Step 2: Handle Overfitting

              Apply the following techniques:

              Use early stopping in boosting (e.g., early_stopping_rounds=10 in XGBoost)

              Regularization: Tune parameters like max_depth, learning_rate, subsample

              Cross-validation: Helps detect overfitting early

              Feature selection or dimensionality reduction (e.g., PCA) if needed

              Outlier handling in transaction data
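Early stopping and the regularization parameters listed above can be sketched as follows. To keep the example dependent on scikit-learn only, it swaps in `GradientBoostingClassifier` for XGBoost; its `n_iter_no_change` parameter plays the role of `early_stopping_rounds`. The synthetic data and every parameter value are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for imbalanced loan data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,        # shrinkage: smaller steps, more regularization
    max_depth=3,              # shallow trees as weak learners
    subsample=0.8,            # each round sees a random 80% of the rows
    validation_fraction=0.2,  # held out internally to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
model.fit(X_train, y_train)

rounds_used = model.n_estimators_  # rounds actually fitted before stopping
test_accuracy = model.score(X_test, y_test)
print("Boosting rounds used:", rounds_used)
print("Test accuracy:", round(test_accuracy, 4))
```

If `rounds_used` comes out well below the 500-round cap, early stopping has cut training short once the held-out loss stopped improving.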

              🔹 Step 3: Select Base Models

              For Boosting: Typically uses decision stumps (shallow trees)

              For Bagging: Can use deeper trees (e.g., in Random Forest)

              Evaluate base learners on:

              Training time

              Performance

              Interpretability

               In Boosting (e.g., XGBoost), base models are usually CARTs with low depth (e.g., 3–6).

              🔹 Step 4: Evaluate Performance Using Cross-Validation

              from sklearn.model_selection import cross_val_score, StratifiedKFold
              from xgboost import XGBClassifier
              from sklearn.datasets import make_classification
              import numpy as np

              # Simulate financial-like data
              X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                                        weights=[0.7, 0.3], random_state=42)

              # Boosting model (e.g., XGBoost)
              model = XGBClassifier(eval_metric='logloss', random_state=42)

              # Stratified K-Fold Cross-Validation
              cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
              scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')

              print("Cross-Validation AUC Scores:", np.round(scores, 4))
              print("Average AUC:", round(scores.mean(), 4))



 Output:

          Cross-Validation AUC Scores: [0.9441 0.9473 0.9367 0.9532 0.9489]
          Average AUC: 0.946







