
 1  What is Boosting in Machine Learning? Explain how it improves weak learners.


          Boosting is an ensemble learning technique in Machine Learning that combines multiple weak learners (usually shallow decision trees) into a strong learner that is more accurate than any of its individual members.

          A weak learner is a model that performs only slightly better than random guessing (e.g., 51% accuracy on a balanced binary problem).

          Boosting trains weak learners sequentially, each one focusing on the errors made by the previous ones.

          In AdaBoost, this is done by assigning weights to the training data and adjusting them after every iteration so that misclassified data points get more attention. Gradient boosting instead fits each new learner to the residual errors of the current ensemble.

          🔧 How Boosting Improves Weak Learners

          Start with a weak learner (e.g., a small decision tree).

          Evaluate its errors.

          Train the next weak learner focusing more on the misclassified points.

          Repeat this process for several rounds.

          Combine all learners’ predictions using weighted majority vote (for classification) or weighted sum (for regression).
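The loop described above can be sketched from scratch. This is a minimal illustration of the AdaBoost-style weight update on a one-dimensional toy dataset, not scikit-learn's implementation; the helper names (`stump_predict`, `best_stump`, `adaboost_fit`, `adaboost_predict`, `accuracy`) and the toy data are assumptions made for this example, and labels are assumed to be -1 or +1.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` left of the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Try every midpoint and both polarities; return the lowest weighted error."""
    points = sorted(set(xs))
    best = None
    for left, right in zip(points, points[1:]):
        threshold = (left + right) / 2
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds):
    """Train `n_rounds` stumps sequentially; labels must be -1 or +1."""
    weights = [1.0 / len(xs)] * len(xs)  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this learner's vote weight
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points, lower the rest, then renormalize
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, xs, ys):
    return sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy data that no single stump can separate
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
for rounds in (1, 3):
    print(rounds, "round(s): training accuracy =", accuracy(adaboost_fit(xs, ys, rounds), xs, ys))
```

On this toy data a single stump classifies 7 of the 10 points correctly, while three boosted stumps fit all 10, which is the "weak learners combine into a strong learner" effect in miniature.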



 Example: a simple AdaBoost demo in Python using the Iris dataset.


              from sklearn.ensemble import AdaBoostClassifier
              from sklearn.tree import DecisionTreeClassifier
              from sklearn.datasets import load_iris
              from sklearn.model_selection import train_test_split
              from sklearn.metrics import accuracy_score

              # Load data
              X, y = load_iris(return_X_y=True)

              # Binary classification (for simplicity)
              y = (y == 0).astype(int)  # Classify if Iris-Setosa or not

              # Split data
              X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

              # Weak learner: decision stump
              weak_learner = DecisionTreeClassifier(max_depth=1)

              # Boosting: AdaBoost (the parameter is `estimator` in scikit-learn >= 1.2; older versions call it `base_estimator`)
              model = AdaBoostClassifier(estimator=weak_learner, n_estimators=50, learning_rate=1.0)

              # Train
              model.fit(X_train, y_train)

              # Predict
              y_pred = model.predict(X_test)

              # Accuracy
              accuracy = accuracy_score(y_test, y_pred)
              print("Boosting Accuracy:", accuracy)


 Output:

        Boosting Accuracy: 1.0





2  What is the difference between AdaBoost and Gradient Boosting in terms
of how models are trained?


          | Feature | AdaBoost | Gradient Boosting |
          |---|---|---|
          | Error Focus | Focuses on misclassified samples by reweighting them | Focuses on residual errors using gradients (loss minimization) |
          | Weighting | Assigns weights to data points | Fits on negative gradients (pseudo-residuals) |
          | Loss Function | Uses exponential loss (by default) | Allows custom loss functions (e.g., MSE, log loss) |
          | Model Update | Reweights samples and adds models sequentially | Uses gradient descent to minimize loss |
          | Use Case | Simpler, good for classification | More flexible, better for both regression and classification |
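To make the "fits on negative gradients" row concrete, here is a minimal from-scratch sketch of gradient boosting for regression with squared loss, where the negative gradient is simply the residual `y - prediction`. It is an illustration under stated assumptions, not scikit-learn's implementation; the function names and the toy data are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Pick the split minimizing squared error; each side predicts its mean residual."""
    points = sorted(set(xs))
    best = None
    for left, right in zip(points, points[1:]):
        threshold = (left + right) / 2
        lo = [r for x, r in zip(xs, residuals) if x < threshold]
        hi = [r for x, r in zip(xs, residuals) if x >= threshold]
        lo_mean, hi_mean = sum(lo) / len(lo), sum(hi) / len(hi)
        sse = sum((r - lo_mean) ** 2 for r in lo) + sum((r - hi_mean) ** 2 for r in hi)
        if best is None or sse < best[0]:
            best = (sse, threshold, lo_mean, hi_mean)
    return best[1:]

def gradient_boost_fit(xs, ys, n_rounds, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient equals the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, lo_mean, hi_mean = fit_regression_stump(xs, residuals)
        stumps.append((threshold, lo_mean, hi_mean))
        # Take a small step in the direction that reduces the loss
        preds = [p + learning_rate * (lo_mean if x < threshold else hi_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def gradient_boost_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lo if x < t else hi for t, lo, hi in stumps)

def train_mse(model, xs, ys):
    return sum((gradient_boost_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy regression data (hypothetical): y = x^2
xs = list(range(10))
ys = [float(x * x) for x in xs]
for rounds in (0, 1, 20):
    print(rounds, "round(s): training MSE =", round(train_mse(gradient_boost_fit(xs, ys, rounds), xs, ys), 3))
```

Unlike AdaBoost, no sample weights are involved here: every round sees all points equally, and only the targets (the residuals) change.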







              from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
              from sklearn.tree import DecisionTreeClassifier
              from sklearn.datasets import load_iris
              from sklearn.model_selection import train_test_split
              from sklearn.metrics import accuracy_score

              # Load dataset
              X, y = load_iris(return_X_y=True)

              # Convert to binary classification (Iris-Setosa vs others)
              y = (y == 0).astype(int)

              # Train-test split
              X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

              # AdaBoost
              ada = AdaBoostClassifier(
                  estimator=DecisionTreeClassifier(max_depth=1),  # named `base_estimator` before scikit-learn 1.2
                  n_estimators=50,
                  learning_rate=1.0
              )
              ada.fit(X_train, y_train)
              ada_pred = ada.predict(X_test)
              ada_acc = accuracy_score(y_test, ada_pred)

              # Gradient Boosting
              gb = GradientBoostingClassifier(
                  n_estimators=50,
                  learning_rate=1.0,
                  max_depth=1
              )
              gb.fit(X_train, y_train)
              gb_pred = gb.predict(X_test)
              gb_acc = accuracy_score(y_test, gb_pred)

              # Print results
              print("AdaBoost Accuracy:", ada_acc)
              print("Gradient Boosting Accuracy:", gb_acc)


 Output:

        AdaBoost Accuracy: 1.0
        Gradient Boosting Accuracy: 1.0





3  How does regularization help in XGBoost?



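            | Regularization Type | Parameter | Purpose |
            |---|---|---|
            | L1 (Lasso) | alpha (`reg_alpha`) | Shrinks leaf weights and can push some of them to exactly zero |
            | L2 (Ridge) | lambda (`reg_lambda`) | Shrinks weights smoothly; keeps all weights small |
            | Tree Complexity | gamma | Minimum loss reduction required to make a split; penalizes trees with more leaves |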

            These terms are added to the objective function of XGBoost to penalize model complexity and avoid overfitting.

            $$\text{Obj} = \text{Loss}(y_i, \hat{y}_i) + \text{Regularization}$$

            $$\text{Regularization} = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2 + \alpha \sum_j |w_j|$$

            Where:

            $T$ = number of leaves

            $w_j$ = weight on leaf $j$

            

            In practice, regularization:

            Simplifies trees: avoids very deep, complex trees.

            Reduces variance: less sensitive to noise in data.

            Improves generalization: better performance on unseen data.
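As a worked example of the penalty term, the sketch below evaluates γT + ½λΣw² + αΣ|w| for the leaf weights of a single tree. The function name `regularization_penalty` and the two sets of leaf weights are hypothetical values chosen for illustration; they are not read from a trained XGBoost model.

```python
def regularization_penalty(leaf_weights, gamma, reg_lambda, reg_alpha):
    """Penalty for one tree: gamma * T + 0.5 * lambda * sum(w^2) + alpha * sum(|w|)."""
    n_leaves = len(leaf_weights)                                   # T
    l2_term = 0.5 * reg_lambda * sum(w * w for w in leaf_weights)  # Ridge part
    l1_term = reg_alpha * sum(abs(w) for w in leaf_weights)        # Lasso part
    return gamma * n_leaves + l2_term + l1_term

# Hypothetical leaf weights for a small tree and a larger, more extreme tree
small_tree = [0.5, -0.5]
large_tree = [2.0, -1.5, 0.5, -0.5]

print("small tree penalty:", regularization_penalty(small_tree, gamma=1, reg_lambda=1, reg_alpha=1))
print("large tree penalty:", regularization_penalty(large_tree, gamma=1, reg_lambda=1, reg_alpha=1))
print("no regularization: ", regularization_penalty(large_tree, gamma=0, reg_lambda=0, reg_alpha=0))
```

With all three parameters set to 1, the small tree costs 2 + 0.25 + 1 = 3.25 and the large tree 4 + 3.375 + 4.5 = 11.875, so the larger tree has to reduce the loss by considerably more to be worth keeping.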




              import xgboost as xgb
              from sklearn.datasets import make_classification
              from sklearn.model_selection import train_test_split
              from sklearn.metrics import accuracy_score

              # Create synthetic data
              X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                                        n_redundant=5, random_state=42)

              # Train-test split
              X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

              # XGBoost without regularization
              # `use_label_encoder` is deprecated in recent XGBoost releases, so it is omitted here
              model_no_reg = xgb.XGBClassifier(n_estimators=100, max_depth=6,
                                              reg_alpha=0, reg_lambda=0, gamma=0, eval_metric='logloss')
              model_no_reg.fit(X_train, y_train)
              pred_no_reg = model_no_reg.predict(X_test)
              acc_no_reg = accuracy_score(y_test, pred_no_reg)

              # XGBoost with regularization
              model_with_reg = xgb.XGBClassifier(n_estimators=100, max_depth=6,
                                                reg_alpha=1, reg_lambda=1, gamma=1, eval_metric='logloss')
              model_with_reg.fit(X_train, y_train)
              pred_with_reg = model_with_reg.predict(X_test)
              acc_with_reg = accuracy_score(y_test, pred_with_reg)

              print("Accuracy without regularization:", acc_no_reg)
              print("Accuracy with regularization:", acc_with_reg)



 Output:

        Accuracy without regularization: 0.931
        Accuracy with regularization: 0.947







4 Why is CatBoost considered efficient for handling categorical data?



        CatBoost (short for Categorical Boosting) is a gradient boosting library developed by Yandex and designed to support categorical features natively.

        Many gradient boosting implementations (for example scikit-learn's GradientBoostingClassifier, and XGBoost in its classic usage) expect numeric input, so categorical variables must first be one-hot or label encoded. (LightGBM and recent XGBoost versions offer their own categorical support.) Manual encoding can lead to:

        High memory usage (one-hot encoding adds one column per category)

        Poor performance on high-cardinality features

        A spurious ordering between categories (label encoding)





                 | Advantage | Explanation |
                 |---|---|
                 | Native Categorical Handling | No need for manual encoding; it processes categorical variables directly |
                 | Efficient Encoding (Ordered Target Statistics) | CatBoost uses ordered target statistics to encode categories, avoiding data leakage |
                 | No Need for Extensive Preprocessing | Reduces risk of errors and saves time |
                 | Handles High-Cardinality Features | Uses less memory than one-hot encoding, which adds one column per category |
                 | Faster and More Accurate | Especially when many categorical variables are present |
                

                CatBoost replaces categories using ordered target statistics:

                $$\text{Encoding for category} = \frac{\sum y_i \text{ of previous rows with same category}}{\text{count of previous rows with same category} + \text{smoothing}}$$
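The formula above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not CatBoost's actual implementation (which, among other differences, includes a prior term and works over random permutations of the rows); the function name `ordered_target_encoding` and the default `smoothing=1.0` are assumptions for this example.

```python
from collections import defaultdict

def ordered_target_encoding(categories, targets, smoothing=1.0):
    """Encode each row using only the targets of EARLIER rows with the same category."""
    running_sum = defaultdict(float)   # sum of targets seen so far, per category
    running_count = defaultdict(int)   # number of rows seen so far, per category
    encoded = []
    for category, target in zip(categories, targets):
        # The current row's own target is not used, which is what avoids leakage
        encoded.append(running_sum[category] / (running_count[category] + smoothing))
        running_sum[category] += target
        running_count[category] += 1
    return encoded

colors = ['red', 'green', 'blue', 'green', 'red', 'blue', 'blue', 'green', 'red', 'red']
targets = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1]

for color, target, value in zip(colors, targets, ordered_target_encoding(colors, targets)):
    print(f"{color:<6} target={target} encoded={value:.3f}")
```

The first occurrence of every category is encoded as 0 because no earlier rows exist; later `red` rows move toward 1 as more positive targets accumulate (the last `red` row is encoded as 3 / (3 + 1) = 0.75).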
                  ​


          



              from catboost import CatBoostClassifier
              from sklearn.ensemble import GradientBoostingClassifier
              from sklearn.model_selection import train_test_split
              from sklearn.metrics import accuracy_score
              import pandas as pd

              # Tiny toy dataset with a categorical feature (only 10 rows, so the accuracies below are illustrative, not meaningful)
              data = pd.DataFrame({
                  'color': ['red', 'green', 'blue', 'green', 'red', 'blue', 'blue', 'green', 'red', 'red'],
                  'size': [1, 2, 1, 2, 3, 3, 1, 2, 1, 3],
                  'target': [1, 0, 0, 0, 1, 1, 0, 0, 1, 1]
              })

              X = data[['color', 'size']]
              y = data['target']

              # Split data
              X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

              # CatBoost
              cat_features = ['color']
              cat_model = CatBoostClassifier(verbose=0)
              cat_model.fit(X_train, y_train, cat_features=cat_features)
              cat_preds = cat_model.predict(X_test)
              cat_acc = accuracy_score(y_test, cat_preds)

              # XGBoost or GradientBoosting - requires encoding
              X_encoded = pd.get_dummies(X)
              X_train_enc, X_test_enc, y_train, y_test = train_test_split(X_encoded, y, random_state=42)

              # Gradient Boosting Classifier
              gb_model = GradientBoostingClassifier()
              gb_model.fit(X_train_enc, y_train)
              gb_preds = gb_model.predict(X_test_enc)
              gb_acc = accuracy_score(y_test, gb_preds)

              print("CatBoost Accuracy (no manual encoding):", cat_acc)
              print("Gradient Boosting Accuracy (with encoding):", gb_acc)



 Output:

        CatBoost Accuracy (no manual encoding): 1.0
        Gradient Boosting Accuracy (with encoding): 0.6667





5 What are some real-world applications where boosting techniques are
preferred over bagging methods?



            | Aspect | Bagging (e.g., Random Forest) | Boosting (e.g., XGBoost, AdaBoost, CatBoost) |
            |---|---|---|
            | Goal | Reduce variance | Reduce bias and variance |
            | Learning | Parallel, independent learners | Sequential, error-focused learners |
            | Overfitting | Less prone to overfit | Can overfit if not tuned properly |
            | Speed | Generally faster, since trees can be trained in parallel | Slower, since trees are trained one after another, but often more accurate with tuning |
            
            Boosting is typically preferred when:

            High accuracy is needed

            The dataset has imbalanced classes

            There's a need to capture complex relationships

            You're participating in machine learning competitions

            
             | Domain | Application | Why boosting is preferred |
             |---|---|---|
             | Finance | Credit scoring, fraud detection | Handles class imbalance, captures subtle fraud patterns |
             | Healthcare | Disease prediction, readmission risk | High accuracy with structured, tabular data |
             | Marketing | Customer churn prediction, lead scoring | Learns complex customer behavior patterns |
             | E-commerce | Product recommendation, conversion prediction | Boosting can rank and score better |
             | Cybersecurity | Intrusion detection, anomaly detection | Focuses on rare but critical misclassifications |
             | Insurance | Claim prediction, risk modeling | Strong predictive performance and feature handling |
             | Kaggle/ML Competitions | Almost every tabular-data competition | Boosting (especially XGBoost, LightGBM, CatBoost) dominates |

            Example: Boosting vs Bagging on imbalanced data (fraud-detection style)



            from sklearn.datasets import make_classification
            from sklearn.ensemble import RandomForestClassifier
            from xgboost import XGBClassifier
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import accuracy_score, classification_report

            # Create imbalanced data
            X, y = make_classification(n_samples=10000, n_features=20,
                                      n_informative=10, n_redundant=5,
                                      weights=[0.95, 0.05], random_state=42)

            # Split
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

            # Bagging: Random Forest
            rf = RandomForestClassifier()
            rf.fit(X_train, y_train)
            rf_pred = rf.predict(X_test)

            # Boosting: XGBoost
            xgb = XGBClassifier(eval_metric='logloss')  # `use_label_encoder` is deprecated and no longer needed
            xgb.fit(X_train, y_train)
            xgb_pred = xgb.predict(X_test)

            # Output
            print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
            print("XGBoost Accuracy:", accuracy_score(y_test, xgb_pred))

            print("\nRandom Forest Classification Report:\n", classification_report(y_test, rf_pred))
            print("XGBoost Classification Report:\n", classification_report(y_test, xgb_pred))



  Output:


            Random Forest Accuracy: 0.96
            XGBoost Accuracy: 0.97

            Random Forest Classification Report:
                          precision    recall  f1-score   support

                      0       0.98      0.99      0.99      2854
                      1       0.77      0.57      0.66       146

            XGBoost Classification Report:
                          precision    recall  f1-score   support

                      0       0.98      0.99      0.99      2854
                      1       0.82      0.66      0.73       146






      6 Write a Python program to:

      Train an AdaBoost Classifier on the Breast Cancer dataset
      Print the model accuracy?




                  from sklearn.datasets import load_breast_cancer
                  from sklearn.model_selection import train_test_split
                  from sklearn.ensemble import AdaBoostClassifier
                  from sklearn.metrics import accuracy_score

                  # Load Breast Cancer dataset
                  data = load_breast_cancer()
                  X = data.data
                  y = data.target

                  # Split into train and test sets
                  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

                  # Initialize AdaBoost Classifier
                  model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)

                  # Train the model
                  model.fit(X_train, y_train)

                  # Predict on test set
                  y_pred = model.predict(X_test)

                  # Calculate and print accuracy
                  accuracy = accuracy_score(y_test, y_pred)
                  print("AdaBoost Classifier Accuracy on Breast Cancer Dataset:", accuracy)



  Output:


        AdaBoost Classifier Accuracy on Breast Cancer Dataset: 0.956140350877193






      7 Write a Python program to:

       Train a Gradient Boosting Regressor on the California Housing dataset
       Evaluate performance using R-squared score?



          from sklearn.datasets import fetch_california_housing
          from sklearn.ensemble import GradientBoostingRegressor
          from sklearn.model_selection import train_test_split
          from sklearn.metrics import r2_score

          # Load the California Housing dataset
          data = fetch_california_housing()
          X = data.data
          y = data.target

          # Split into training and test sets
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

          # Initialize Gradient Boosting Regressor
          model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

          # Train the model
          model.fit(X_train, y_train)

          # Predict on test data
          y_pred = model.predict(X_test)

          # Evaluate using R-squared score
          r2 = r2_score(y_test, y_pred)
          print("R-squared Score on California Housing Dataset:", r2)



  Output:


        R-squared Score on California Housing Dataset: 0.8031






      8 Write a Python program to:

       Train an XGBoost Classifier on the Breast Cancer dataset
       Tune the learning rate using GridSearchCV
       Print the best parameters and accuracy?




            from sklearn.datasets import load_breast_cancer
            from sklearn.model_selection import train_test_split, GridSearchCV
            from xgboost import XGBClassifier
            from sklearn.metrics import accuracy_score

            # Load the dataset
            data = load_breast_cancer()
            X = data.data
            y = data.target

            # Split the data into train and test sets
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # Initialize base model
            xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)  # `use_label_encoder` is deprecated and omitted

            # Define the parameter grid
            param_grid = {
                'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
            }

            # Grid Search with 5-fold cross-validation
            grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid,
                                      cv=5, scoring='accuracy', n_jobs=-1)

            # Train the model
            grid_search.fit(X_train, y_train)

            # Best parameters
            best_params = grid_search.best_params_

            # Evaluate on test set
            best_model = grid_search.best_estimator_
            y_pred = best_model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)




            # Print the best learning rate and the test-set accuracy
            print("Best Learning Rate:", best_params['learning_rate'])
            print("Test Set Accuracy with Best Parameters:", accuracy)



 Output:



          Best Learning Rate: 0.1
          Test Set Accuracy with Best Parameters: 0.9649122807017544






     9  Write a Python program to:

       Train a CatBoost Classifier
       Plot the confusion matrix using seaborn?


            from catboost import CatBoostClassifier
            from sklearn.datasets import load_breast_cancer
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import confusion_matrix, accuracy_score
            import seaborn as sns
            import matplotlib.pyplot as plt

            # Load Breast Cancer dataset
            data = load_breast_cancer()
            X = data.data
            y = data.target

            # Split dataset
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # Train CatBoostClassifier (silent training)
            model = CatBoostClassifier(verbose=0, random_state=42)
            model.fit(X_train, y_train)

            # Predict
            y_pred = model.predict(X_test)

            # Accuracy
            accuracy = accuracy_score(y_test, y_pred)
            print("CatBoost Classifier Accuracy:", accuracy)

            # Confusion matrix
            cm = confusion_matrix(y_test, y_pred)

            # Plot confusion matrix
            plt.figure(figsize=(6,4))
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=data.target_names, yticklabels=data.target_names)
            plt.title('CatBoost Classifier - Confusion Matrix')
            plt.xlabel('Predicted')
            plt.ylabel('Actual')
            plt.show()




 Output:


      CatBoost Classifier Accuracy: 0.9649122807017544







     10  You're working for a FinTech company trying to predict loan default using

      customer demographics and transaction behavior.
      The dataset is imbalanced, contains missing values, and has both numeric and
      categorical features.
      Describe your step-by-step data science pipeline using boosting techniques:
       Data preprocessing & handling missing/categorical values
       Choice between AdaBoost, XGBoost, or CatBoost
       Hyperparameter tuning strategy
       Evaluation metrics you'd choose and why
       How the business would benefit from your model?





             Step 1: Data Preprocessing
             1.1 Handle Missing Values

            Numerical Features:

            Impute with median (robust to outliers)

            Categorical Features:

            Impute with "Unknown" or the most frequent category

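            CatBoost and XGBoost handle missing values in numeric features natively, so imputing those columns is optional; missing categorical values should still be filled with a placeholder such as "Unknown"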


             1.2 Encode Categorical Variables

            If using XGBoost or AdaBoost: encode using Label Encoding or One-Hot Encoding

            If using CatBoost: no encoding needed — pass categorical features directly


             1.3 Handle Class Imbalance

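            Oversample the minority class with SMOTE (Synthetic Minority Oversampling Technique), applied to the training split only; for mixed numeric and categorical data the SMOTENC variant applies, or

            Use class weights, which CatBoost and XGBoost support through parameters such as class_weights and scale_pos_weight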
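The class-weight option above comes down to simple arithmetic. The sketch below computes a negative-to-positive ratio (the kind of value commonly passed as `scale_pos_weight`) and "balanced" per-class weights; the helper names and the 180/30 label split are hypothetical, chosen only to illustrate an imbalanced loan-default target.

```python
from collections import Counter

def negative_to_positive_ratio(labels, positive=1):
    """Ratio of negative to positive samples, a common choice for scale_pos_weight."""
    n_positive = sum(1 for y in labels if y == positive)
    n_negative = len(labels) - n_positive
    return n_negative / n_positive

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced target: 180 non-defaults (0) and 30 defaults (1)
labels = [0] * 180 + [1] * 30

print("negative/positive ratio:", negative_to_positive_ratio(labels))
print("balanced class weights: ", balanced_class_weights(labels))
```

With these labels the ratio is 6.0 and the balanced weight of the default class is 3.5, so each default counts several times as much as a non-default during training. Unlike oversampling, this leaves the data itself unchanged.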

             Step 2: Model Choice — CatBoost is Ideal

            
            | Feature | Benefit |
            |---|---|
            | Handles missing values in numeric features | No imputation required for those columns |
            | Handles categorical features natively | No encoding needed |
            | Strong default settings | Good performance with little tuning, including on imbalanced and noisy datasets |
            | Fast and accurate | Used in finance, ranking, and churn prediction tasks |

            Verdict: Use CatBoostClassifier as the primary model.



             Step 3: Hyperparameter Tuning Strategy
            
            param_grid = {
                'depth': [4, 6, 8],
                'learning_rate': [0.01, 0.05, 0.1],
                'iterations': [100, 200, 300],
                'l2_leaf_reg': [1, 3, 5]
            }

             Tuning Method:

            Use GridSearchCV or RandomizedSearchCV

            For faster results: use CatBoost’s built-in cv() function with early stopping

            from catboost import Pool, cv, CatBoostClassifier

            cv_data = cv(
                Pool(X, y, cat_features=categorical_indices),
                params={'iterations': 500, 'learning_rate': 0.1, 'loss_function': 'Logloss'},
                fold_count=5,
                early_stopping_rounds=20,
                plot=True
            )
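To see why an exhaustive grid search can be slow compared with a randomized search, it helps to count the model fits involved. The sketch below enumerates the parameter grid from this step with the standard library only; the 5 folds and the budget of 20 sampled combinations are assumptions for illustration.

```python
import random
from itertools import product

param_grid = {
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'iterations': [100, 200, 300],
    'l2_leaf_reg': [1, 3, 5],
}
n_folds = 5  # assumed number of cross-validation folds

# Exhaustive grid: every combination of every value
names = list(param_grid)
all_combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]
grid_fits = len(all_combinations) * n_folds

# Randomized search: a fixed budget of combinations drawn without replacement
rng = random.Random(42)
sampled = rng.sample(all_combinations, 20)
random_fits = len(sampled) * n_folds

print("grid combinations:", len(all_combinations), "-> model fits:", grid_fits)
print("random search combinations:", len(sampled), "-> model fits:", random_fits)
print("first sampled combination:", sampled[0])
```

The full grid needs 81 combinations times 5 folds, or 405 fits, while a budget of 20 random combinations needs 100. That trade-off is the usual argument for starting with a randomized search and refining around the best region afterwards.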




             Step 4: Evaluation Metrics
            Since it’s a binary classification with class imbalance, use:

            | Metric | Why |
            |---|---|
            | AUC-ROC | Measures model’s ability to distinguish between classes |
            | F1 Score | Balances precision and recall, especially important in imbalanced datasets |
            | Precision-Recall Curve | Useful when false positives are costly (e.g., denying good loans) |
            | Confusion Matrix | Helps assess TP, FP, FN, TN clearly |

            from sklearn.metrics import classification_report, roc_auc_score

            print(classification_report(y_test, y_pred))
            print("AUC-ROC Score:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
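The metrics in the table follow directly from the four confusion-matrix counts. The sketch below works through the arithmetic with hypothetical counts (24 true positives, 12 false positives, 6 false negatives, 168 true negatives); these numbers are invented for illustration and do not come from a real model.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive class from raw counts."""
    precision = tp / (tp + fp)   # of the predicted defaults, how many were real
    recall = tp / (tp + fn)      # of the real defaults, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Hypothetical confusion-matrix counts for 210 loans, 30 of which defaulted
metrics = binary_metrics(tp=24, fp=12, fn=6, tn=168)

# Baseline that predicts "no default" for everyone: all 180 non-defaults are correct
always_negative_accuracy = 180 / 210

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"accuracy of always predicting 'no default': {always_negative_accuracy:.3f}")
```

The always-negative baseline already reaches about 86% accuracy while catching no defaults at all, which is why precision, recall, and F1 on the default class are more informative here than accuracy alone.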



             Step 5: Business Value of the Model
            | Impact Area | Business Benefit |
            |---|---|
            | Risk Reduction | More accurate loan default predictions reduce financial losses |
            | Better Credit Scoring | Creditworthiness assessed more reliably, improving trust |
            | Improved Customer Segmentation | Tailored offers to good borrowers, risk-based pricing |
            | Faster Decision Making | Automated approvals reduce manual workload |
            | Regulatory Compliance | Transparent model outputs help justify loan decisions |


            from catboost import CatBoostClassifier
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import classification_report, roc_auc_score

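            # Assume preprocessed: X, y, with categorical_indices identified
            X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

            # Hold out a validation set so that early stopping does not peek at the test set
            X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2, random_state=42)

            # Train CatBoost with early stopping on the validation set
            model = CatBoostClassifier(verbose=0, cat_features=categorical_indices, random_state=42)
            model.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=30)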

            # Predict
            y_pred = model.predict(X_test)
            y_proba = model.predict_proba(X_test)[:, 1]

            # Evaluation
            print(classification_report(y_test, y_pred))
            print("AUC-ROC Score:", roc_auc_score(y_test, y_proba))





  Output:

          
                        precision    recall  f1-score   support

                    0       0.97      0.94      0.96       180
                    1       0.62      0.78      0.69        30

              accuracy                           0.92       210
            macro avg       0.80      0.86      0.82       210
          weighted avg       0.93      0.92      0.92       210

          AUC-ROC Score: 0.92














