#Boosting Techniques Assignment

##Assignment Questions



-- Assignment Code: DA-AG-015

#Q1) What is Boosting in Machine Learning? Explain how it improves weak learners.

Boosting is an ensemble technique that builds a strong predictive model by sequentially training many weak learners (often shallow decision trees) where each new learner focuses on the mistakes of the previous ones. Instead of training models independently, boosting assigns higher weight to misclassified examples (or fits the negative gradients of a chosen loss), forcing subsequent learners to concentrate on hard cases. The final prediction is a weighted sum (or vote) of the weak learners’ outputs. Boosting reduces bias (and often variance), can learn complex patterns, and typically yields high accuracy when properly regularized. However, it is more sensitive to noisy labels and outliers and can overfit without regularization/early stopping.
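
To make the reweighting idea concrete, here is a minimal sketch of one round of discrete AdaBoost-style reweighting in plain Python. It assumes labels in {-1, +1} and uniform starting weights; the function name `adaboost_reweight` and the toy data are illustrative and do not come from any library.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost reweighting for labels in {-1, +1}."""
    # Weighted error rate of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    # Clip so that a perfect (or perfectly wrong) learner does not cause log(0)
    eps = 1e-10
    err = min(max(err, eps), 1 - eps)
    # Learner weight: larger when the weighted error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake,
    # so mistakes are scaled up and correct samples are scaled down
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # the weak learner misclassifies the sample at index 1
weights = [0.2] * 5           # uniform starting weights

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.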



---



#Q2) What is the difference between AdaBoost and Gradient Boosting in terms of how models are trained?

AdaBoost trains weak learners sequentially by reweighting training examples: after each learner, it increases weights of misclassified samples so the next learner focuses on them; final predictions are a weighted vote of learners. AdaBoost effectively minimizes an exponential loss. Gradient Boosting, by contrast, trains learners to fit the negative gradient (pseudo-residuals) of a differentiable loss function — each new learner models the residual errors of the ensemble so far; predictions update by adding the learner scaled by a learning rate. Gradient Boosting is more flexible (any differentiable loss), supports shrinkage (learning rate), and is the basis for modern boosted-tree libraries (XGBoost, LightGBM, CatBoost).
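
The residual-fitting loop of Gradient Boosting can be sketched in a few lines for the squared-error case, where the negative gradient is simply the residual `y - F(x)`. This is a simplified illustration on 1-D data with decision stumps, not how scikit-learn or XGBoost implement it; the helper names (`fit_stump`, `gradient_boost`) and the toy data are assumptions made for this example.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error on 1-D data."""
    best = None
    # Candidate thresholds exclude the largest x so the right side is never empty
    for thr in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]

def stump_predict(stump, xi):
    thr, lv, rv = stump
    return lv if xi <= thr else rv

def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    base = sum(y) / len(y)            # initial prediction: the mean target
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Additive update, scaled by the learning rate (shrinkage)
        preds = [pi + learning_rate * stump_predict(stump, xi) for xi, pi in zip(x, preds)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    return base, stumps, history

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, mse_history = gradient_boost(x, y)
print("training MSE after round 1: ", mse_history[0])
print("training MSE after round 20:", mse_history[-1])
```

Unlike AdaBoost, no sample weights change here: each round fits a new learner to what the current ensemble still gets wrong, and the training error shrinks as learners are added.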




---



#Q3) How does regularization help in XGBoost?

XGBoost combines several regularization mechanisms to prevent overfitting and improve generalization. Its objective adds L1 (alpha, Lasso-style) and L2 (lambda, Ridge-style) penalties on leaf weights, which shrink large weights; in the scikit-learn wrapper these are `reg_alpha` and `reg_lambda`. gamma (minimum loss reduction) requires each split to improve the loss by at least that amount. max_depth and min_child_weight limit tree complexity, while subsample and colsample_bytree/colsample_bylevel add row and column sampling. learning_rate (shrinkage) scales each new tree's contribution. Together these controls limit model complexity, reduce variance, and stabilize learning on noisy data. XGBoost also supports early stopping on a validation set to prevent overfitting.
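
The effect of lambda, alpha, and gamma can be seen in the closed-form leaf weight and split gain from XGBoost's regularized objective, where `G` and `H` are the sums of first- and second-order gradients in a leaf. The sketch below is a plain-Python illustration of those formulas, not a call into the library; the function names and the numbers are made up for this example.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight for gradient sum G and Hessian sum H."""
    # L1 (alpha) soft-thresholds the gradient sum toward zero
    if G > reg_alpha:
        g = G - reg_alpha
    elif G < -reg_alpha:
        g = G + reg_alpha
    else:
        g = 0.0
    # L2 (lambda) inflates the denominator, shrinking the weight
    return -g / (H + reg_lambda)

def leaf_score(G, H, reg_lambda):
    return G * G / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left/right children, minus gamma."""
    gain = 0.5 * (leaf_score(GL, HL, reg_lambda)
                  + leaf_score(GR, HR, reg_lambda)
                  - leaf_score(GL + GR, HL + HR, reg_lambda))
    return gain - gamma

# Larger lambda and alpha shrink the same leaf toward zero
print(leaf_weight(-4.0, 2.0, reg_lambda=0.0))                 # no regularization
print(leaf_weight(-4.0, 2.0, reg_lambda=2.0))                 # L2 only
print(leaf_weight(-4.0, 2.0, reg_lambda=2.0, reg_alpha=1.0))  # L2 and L1

# A split is kept only if its gain stays positive after subtracting gamma
print(split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=0.0))
print(split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=5.0))
```

With `gamma=5.0` the same split has negative gain, which is how gamma acts as a pruning threshold.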



---



#Q4) Why is CatBoost considered efficient for handling categorical data?

CatBoost is designed specifically to handle categorical features natively — you pass categorical column indices and CatBoost applies robust encodings (ordered target statistics, permutations) that avoid target leakage. It uses ordered boosting to reduce prediction shift and bias, which improves generalization on categorical-rich tabular data. CatBoost also implements efficient handling for high-cardinality features, supports symmetric trees for faster inference, and provides GPU acceleration. Because it removes the need for manual target encoding/one-hot explosion and reduces leakage risk, CatBoost often outperforms other models on datasets with many categorical variables.
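
The idea behind ordered target statistics can be sketched as follows: each row's category is encoded using only the targets of rows that come before it in a random permutation, so a row never sees its own label. This is a simplified illustration, assuming a single permutation and a smoothing weight `a`; CatBoost's actual implementation differs in detail, and the function name is hypothetical.

```python
import random

def ordered_target_stats(categories, targets, prior, a=1.0, seed=0):
    """Encode each row from the targets of rows that precede it in a random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded[i] = (s + a * prior) / (n + a)
        # Only now add the current row's target to the running statistics
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

categories = ['A', 'A', 'B', 'A', 'B']
targets = [1, 0, 1, 1, 0]
prior = sum(targets) / len(targets)   # overall target mean, used here as the prior

encoded = ordered_target_stats(categories, targets, prior)
print(encoded)
```

Because a row's own target is added to the running sums only after the row has been encoded, changing that target leaves its encoding unchanged, which is the leakage protection that a plain mean encoding lacks.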




---



#Q5 - What are some real-world applications where boosting techniques are preferred over bagging methods?

Boosting is preferred when maximum predictive accuracy is needed and simpler models underfit because the signal is subtle, e.g.: credit scoring and fraud detection, customer churn prediction, targeted marketing (response modeling), demand forecasting, risk modeling in finance/insurance, and many Kaggle/tabular competition problems. Boosting often outperforms bagging on tabular data because it reduces bias and can learn complex interactions, whereas bagging mainly reduces variance. It’s especially useful for imbalanced problems when tuned properly (with appropriate evaluation metrics and class-weighting). The tradeoff: boosting is slower to train, more sensitive to noisy labels, and requires careful regularization.



---



#Q6) Write a Python program to:
● Train an AdaBoost Classifier on the Breast Cancer dataset

● Print the model accuracy.



In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

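# Decision stumps (max_depth=1) are AdaBoost's default weak learner; set explicitly here for clarity.
# The parameter is named `estimator` in scikit-learn >= 1.2 (it was `base_estimator` before).
adb = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    learning_rate=0.8,
    random_state=42
)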
adb.fit(X_train, y_train)
y_pred = adb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred, digits=4))



---



#Q7) Write a Python program to:
● Train a Gradient Boosting Regressor on the California Housing dataset

● Evaluate performance using R-squared score

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Try to fetch California Housing; fallback to diabetes if unavailable
try:
    data = fetch_california_housing()
    X, y = data.data, data.target
except Exception as e:
    print("fetch_california_housing failed, falling back to diabetes dataset. Error:", e)
    from sklearn.datasets import load_diabetes
    data = load_diabetes()
    X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
print("R^2 score:", r2_score(y_test, y_pred))




---



#Q8) Write a Python program to:
● Train an XGBoost Classifier on the Breast Cancer dataset

● Tune the learning rate using GridSearchCV

● Print the best parameters and accuracy


In [None]:
# Install first if needed (in a notebook cell): %pip install xgboost

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
import numpy as np

try:
    from xgboost import XGBClassifier
except ImportError:
    raise ImportError("Please install xgboost: pip install xgboost")

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# use_label_encoder is deprecated in recent XGBoost releases, so it is not passed here
xgb = XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1)

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 6],
    'n_estimators': [50, 100, 200]
}

grid = GridSearchCV(xgb, param_grid, cv=4, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
best = grid.best_estimator_
print("Test accuracy:", accuracy_score(y_test, best.predict(X_test)))



---



#Q9 - Write a Python program to:
● Train a CatBoost Classifier

● Plot the confusion matrix using seaborn

In [None]:
# Install first if needed (in a notebook cell): %pip install catboost seaborn

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

try:
    from catboost import CatBoostClassifier
except ImportError:
    raise ImportError("Please install catboost: pip install catboost")

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

cat = CatBoostClassifier(iterations=200, learning_rate=0.05, depth=6, verbose=0, random_state=42)
cat.fit(X_train, y_train)

y_pred = cat.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

# Confusion matrix plot (in this dataset, label 0 = malignant and 1 = benign)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('CatBoost Confusion Matrix')
plt.show()



---



#Q10 - You're working for a FinTech company trying to predict loan default using customer demographics and transaction behavior. The dataset is imbalanced, contains missing values, and has both numeric and categorical features.







Describe your step-by-step data science pipeline using boosting techniques:


● Data preprocessing & handling missing/categorical values

● Choice between AdaBoost, XGBoost, or CatBoost

● Hyperparameter tuning strategy

● Evaluation metrics you'd choose and why

● How the business would benefit from your model

####Answer: Step-by-step pipeline summary

EDA & target definition: examine class imbalance, missingness patterns, feature distributions, correlations, and business costs (FN vs FP).

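Missing values: for numeric features use SimpleImputer (median) or model-based imputation; XGBoost and CatBoost can also handle missing numeric values natively. For categorical features use SimpleImputer(strategy='constant', fill_value='missing'); CatBoost likewise needs missing categorical values converted to a string placeholder such as 'missing' before fitting. Create missing-indicator features to capture informative missingness.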

Categorical features: Prefer CatBoost (native handling) OR for XGBoost/LightGBM use one-hot for low-cardinality and target/mean encoding (with regularization and CV) or frequency encoding for high-cardinality features. Beware target leakage.

Feature engineering: add aggregates from transactions (avg transaction, max, recency), behavioral flags, transaction counts, ratios, and time-based features. Normalize only if using non-tree models.

Class imbalance: compute class weights or set scale_pos_weight in XGBoost; tune the decision threshold; prefer cost-sensitive learning over blind resampling for transactional data. If using resampling, apply it to the training folds only. SMOTE/ADASYN interpolate numeric features, so with categorical columns use a mixed-type variant (e.g., SMOTENC) or simple random oversampling of the minority class.

Model choice: start with CatBoost (native categorical support + ordered boosting) or XGBoost/LightGBM (fast, powerful). AdaBoost is rarely the best choice for heavily imbalanced, complex tabular datasets.

Hyperparameter tuning: use RandomizedSearchCV or Bayesian (Optuna) on parameters like learning_rate, max_depth, n_estimators, subsample, colsample_bytree, min_child_weight, and regularization lambda/alpha; use early stopping on validation set. Use stratified CV and scoring aligned to business metric (e.g., recall@precision threshold).

Evaluation metrics: prefer Precision, Recall, F1, PR-AUC (average precision), and ROC-AUC; evaluate calibration (Brier score) if probabilities are acted on. Use confusion-matrix and cost-sensitive metrics (expected monetary loss). Monitor fairness and feature leakage.

Model explainability & deployment: compute SHAP values to explain predictions for adjudication and compliance. Establish monitoring for drift and periodic retraining. Use conservative thresholds and human review for borderline/high-impact decisions.

Business impact: reduce default losses by early flagging of high-risk applicants, improving decision automation and risk-based pricing, and enabling targeted interventions (e.g., reminders, different terms) — while maintaining low false-positive rates to avoid rejecting good customers.
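
The threshold tuning and cost-sensitive evaluation mentioned in the steps above can be sketched as a search for the probability cutoff that minimizes total misclassification cost. The cost figures (a missed default costing 10 times a wrongly flagged customer), the helper names, and the toy scores below are hypothetical; in practice the costs would come from the business and the search would be run on a validation set.

```python
def total_cost(y_true, y_prob, threshold, cost_fn, cost_fp):
    """Total cost of flagging every case whose predicted default probability >= threshold."""
    cost = 0.0
    for t, p in zip(y_true, y_prob):
        flagged = p >= threshold
        if t == 1 and not flagged:
            cost += cost_fn          # missed default (false negative)
        elif t == 0 and flagged:
            cost += cost_fp          # good customer flagged (false positive)
    return cost

def best_threshold(y_true, y_prob, cost_fn, cost_fp, grid=None):
    """Return the lowest threshold on the grid that achieves the minimum total cost."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda th: total_cost(y_true, y_prob, th, cost_fn, cost_fp))

# Toy validation labels (1 = default) and predicted default probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

threshold = best_threshold(y_true, y_prob, cost_fn=10.0, cost_fp=1.0)
print("chosen threshold:", threshold)
print("cost at chosen threshold:", total_cost(y_true, y_prob, threshold, 10.0, 1.0))
print("cost at default 0.5:", total_cost(y_true, y_prob, 0.5, 10.0, 1.0))
```

Because a missed default is assumed to be far more expensive than a false alarm, the chosen cutoff falls below 0.5 on this toy data, trading one extra false positive for catching every defaulter.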

In [None]:
# This uses sklearn's make_classification to simulate data, introduces missing values,
# demonstrates imputation, computes class_weight/scale_pos_weight, and trains CatBoost/XGBoost.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, average_precision_score

# Generate synthetic dataset (numeric + categorical)
X_num, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                               weights=[0.95, 0.05], flip_y=0.01, random_state=42)

df = pd.DataFrame(X_num, columns=[f"num_{i}" for i in range(X_num.shape[1])])
# Add synthetic categorical features
np.random.seed(42)
df['cat_1'] = np.random.choice(['A','B','C'], size=len(df), p=[0.7,0.2,0.1])
df['cat_2'] = np.random.choice(['X','Y'], size=len(df), p=[0.85,0.15])

# Introduce missing values at random (a different 5% of rows for each column)
for i, col in enumerate(['num_1', 'num_3', 'cat_1']):
    df.loc[df.sample(frac=0.05, random_state=42 + i).index, col] = np.nan

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2,
                                                    stratify=y, random_state=42)

# Preprocessing for XGBoost approach (numerics + one-hot for small-cardinality cats)
numeric_features = [c for c in df.columns if c.startswith('num_')]
categorical_features = ['cat_1', 'cat_2']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    # `sparse_output` replaces the older `sparse` argument in scikit-learn >= 1.2
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
], remainder='drop')

# Class counts for weighting the minority class: scale_pos_weight = n_neg / n_pos
# (computed outside the try block so they exist even if xgboost is not installed)
n_pos = int((y_train == 1).sum())
n_neg = int((y_train == 0).sum())
scale_pos_weight = n_neg / max(1, n_pos)

# XGBoost pipeline (if xgboost installed)
try:
    from xgboost import XGBClassifier
    # use_label_encoder is deprecated in recent XGBoost releases, so it is not passed here
    clf_xgb = Pipeline(steps=[
        ('preproc', preprocessor),
        ('clf', XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1,
                              scale_pos_weight=scale_pos_weight,
                              n_estimators=200, learning_rate=0.05))
    ])
    clf_xgb.fit(X_train, y_train)
    y_prob = clf_xgb.predict_proba(X_test)[:,1]
    print("XGBoost ROC-AUC:", roc_auc_score(y_test, y_prob))
    print("XGBoost Average Precision (PR-AUC):", average_precision_score(y_test, y_prob))
    print(classification_report(y_test, clf_xgb.predict(X_test), digits=4))
except Exception as e:
    print("XGBoost not available or failed:", e)

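# CatBoost approach (native categorical handling)
try:
    from catboost import CatBoostClassifier
    # CatBoost rejects NaN in categorical columns, so replace it with an explicit category;
    # missing numeric values are handled natively.
    X_train_cb, X_test_cb = X_train.copy(), X_test.copy()
    for part in (X_train_cb, X_test_cb):
        part[categorical_features] = part[categorical_features].fillna('missing')
    # Hold out a validation split for best-iteration selection so the test set stays untouched
    X_tr, X_val, y_tr, y_val = train_test_split(X_train_cb, y_train, test_size=0.2,
                                                stratify=y_train, random_state=42)
    # Weight the positive class by n_neg / n_pos, computed on the actual training part
    pos_weight = float((y_tr == 0).sum()) / max(1, int((y_tr == 1).sum()))
    model_cat = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, random_state=42,
                                   class_weights=[1, pos_weight], verbose=0)
    model_cat.fit(X_tr, y_tr, cat_features=categorical_features,
                  eval_set=(X_val, y_val), use_best_model=True)
    y_prob_cat = model_cat.predict_proba(X_test_cb)[:,1]
    print("CatBoost ROC-AUC:", roc_auc_score(y_test, y_prob_cat))
    print("CatBoost Average Precision (PR-AUC):", average_precision_score(y_test, y_prob_cat))
    print(classification_report(y_test, model_cat.predict(X_test_cb), digits=4))
except Exception as e:
    print("CatBoost not available or failed:", e)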