# Boosting Techniques

1.   What is Boosting in Machine Learning? Explain how it improves weak
learners.

Ans- Boosting is an ensemble technique that combines multiple weak learners (models that perform only slightly better than random guessing) into a strong learner. It works sequentially: each new learner focuses on the errors of the previous ones, either by giving more weight to misclassified data points (AdaBoost) or by fitting the remaining errors directly (gradient boosting). The final prediction is a weighted combination of all the weak models, which mainly reduces bias and improves accuracy.
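The reweighting idea can be made concrete with a small numeric sketch. It assumes the classic discrete AdaBoost update with labels in {-1, +1}; the toy labels and predictions are made up for illustration, and the helper name `adaboost_reweight` is hypothetical.

```python
import numpy as np

def adaboost_reweight(y_true, y_pred, weights):
    """One round of the classic discrete AdaBoost update (labels in {-1, +1})."""
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)   # weighted error of this learner
    alpha = 0.5 * np.log((1.0 - err) / err)         # the learner's vote in the ensemble
    # Misclassified points are scaled up by e^alpha, correct ones down by e^-alpha.
    new_w = weights * np.exp(np.where(miss, alpha, -alpha))
    return alpha, new_w / new_w.sum()

# Toy round: five equally weighted samples, the weak learner gets the last one wrong.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])
w0 = np.full(5, 0.2)

alpha, w1 = adaboost_reweight(y_true, y_pred, w0)
print("alpha:", round(float(alpha), 3))
print("new weights:", np.round(w1, 3))
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.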

2.  What is the difference between AdaBoost and Gradient Boosting in terms
of how models are trained?

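Ans- AdaBoost trains weak learners sequentially by adjusting the weights of data points: misclassified samples get more weight in the next round, and each learner's vote in the final ensemble depends on its weighted error.

Gradient Boosting also trains weak learners sequentially, but leaves the sample weights alone. Each new learner is fitted to the negative gradient of the loss function with respect to the current ensemble's predictions; for squared-error loss this is simply the residual error of the previous model.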
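To show the residual-fitting side of this contrast, here is a minimal sketch of a single gradient boosting stage for squared-error loss. The four data points and the learning rate are made-up values, and a scikit-learn decision stump stands in for the weak learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative values only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0, 10.0])
learning_rate = 0.5

# Stage 0: a constant prediction (the mean minimizes squared error).
pred0 = np.full_like(y, y.mean())

# For the loss 0.5 * (y - F)^2 the negative gradient is the residual y - F.
residuals = y - pred0

# Stage 1: fit a stump to the residuals, then take a shrunken step.
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)
pred1 = pred0 + learning_rate * stump.predict(X)

mse0 = np.mean((y - pred0) ** 2)
mse1 = np.mean((y - pred1) ** 2)
print("MSE before:", mse0, "MSE after one stage:", mse1)
```

No sample is reweighted here; the targets of the next learner change instead, which is the key difference from AdaBoost.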

3.  How does regularization help in XGBoost?

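Ans- Regularization in XGBoost adds a complexity penalty to the training objective, which discourages overfitting. The penalty covers the number of leaves in each tree (gamma, the minimum loss reduction required to make a split) and the size of the leaf weights (L1 via reg_alpha, L2 via reg_lambda). Together with structural limits such as max_depth and min_child_weight, shrinkage (learning_rate), and row/column subsampling, this leads to better generalization and more robust predictions.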
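The effect of these penalties can be sketched with the closed-form leaf weight and split gain from the XGBoost documentation's boosted-trees introduction. This is a simplified illustration in plain Python, not the library's code: the function names are hypothetical, G and H stand for the summed gradients and Hessians of the samples in a leaf, and the L1 term is applied as soft-thresholding of G.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight for summed gradient G and summed Hessian H."""
    # L1 soft-thresholds the gradient sum; L2 inflates the denominator.
    if G > reg_alpha:
        g = G - reg_alpha
    elif G < -reg_alpha:
        g = G + reg_alpha
    else:
        g = 0.0
    return -g / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a leaf into (left, right); gamma is the cost of the extra leaf."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Larger reg_lambda shrinks the leaf weight toward zero.
print(leaf_weight(-4.0, 4.0, reg_lambda=0.0))   # unregularized
print(leaf_weight(-4.0, 4.0, reg_lambda=4.0))   # L2-shrunk

# A split is only worth making if its gain stays positive after paying gamma.
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=0.0))
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=10.0))
```

A negative gain in the last call means the split would be rejected, which is how gamma keeps trees from growing leaves that barely reduce the loss.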

4. Why is CatBoost considered efficient for handling categorical data?

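Ans- CatBoost handles categorical data efficiently because it encodes categories with ordered target statistics: each row's category is replaced by a target statistic computed only from the rows that precede it in a random permutation of the data. This avoids the target leakage of naive target encoding and the sparsity that one-hot encoding creates for high-cardinality features, so categorical columns can be passed to the model directly without manual preprocessing.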
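A simplified sketch of an ordered target statistic is shown below. It is an illustration of the idea, not CatBoost's implementation: the function name, the `prior` and `a` smoothing parameters, and the toy data are assumptions, and CatBoost itself averages over several permutations.

```python
import numpy as np

def ordered_target_stat(categories, targets, prior=0.5, a=1.0, seed=None):
    """Encode each row using only the targets of rows that precede it in a permutation."""
    n = len(categories)
    # With seed=None the given row order is used; otherwise a random permutation.
    order = np.arange(n) if seed is None else np.random.default_rng(seed).permutation(n)
    sums, counts = {}, {}
    encoded = np.empty(n, dtype=float)
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + a * prior) / (k + a)   # the row's own target is not used
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
y = [1, 0, 0, 1, 1]
encoded = ordered_target_stat(cats, y)
print(encoded)
```

The first occurrence of each category gets only the prior, and later rows see progressively more history, so no row's encoding ever depends on its own label.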

5. What are some real-world applications where boosting techniques are
preferred over bagging methods?

Ans- Boosting is preferred over bagging in applications where high accuracy and complex pattern detection are needed, such as:

Credit scoring & fraud detection (finance)

Customer churn prediction (telecom, marketing)

Medical diagnosis (healthcare)

Search ranking & recommendations (Google, YouTube, e-commerce)

Risk modeling & insurance

6.  Write a Python program to:

● Train an AdaBoost Classifier on the Breast Cancer dataset
● Print the model accuracy

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("AdaBoost Classifier Accuracy:", accuracy)


AdaBoost Classifier Accuracy: 0.9736842105263158


7. Write a Python program to:

● Train a Gradient Boosting Regressor on the California Housing dataset
● Evaluate performance using R-squared score

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Gradient Boosting Regressor
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate performance using R-squared score
r2 = r2_score(y_test, y_pred)
print("Gradient Boosting Regressor R-squared Score:", r2)


Gradient Boosting Regressor R-squared Score: 0.8004451261281281


8.  Write a Python program to:

● Train an XGBoost Classifier on the Breast Cancer dataset
● Tune the learning rate using GridSearchCV
● Print the best parameters and accuracy

In [3]:
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define model
# use_label_encoder is ignored by recent XGBoost releases, so it is not passed
xgb = XGBClassifier(eval_metric='logloss', random_state=42)

# Define parameter grid for learning rate
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}

# GridSearchCV for tuning
grid = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_

# Predict on test set
y_pred = best_model.predict(X_test)

# Print results
print("Best Parameters:", grid.best_params_)
print("XGBoost Classifier Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'learning_rate': 0.2}
XGBoost Classifier Accuracy: 0.956140350877193




9. Write a Python program to:

● Train a CatBoost Classifier
● Plot the confusion matrix using seaborn

In [13]:
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

# 1) Load data
data = load_breast_cancer()
X, y = data.data, data.target  # 0 = malignant, 1 = benign (same order as target_names)

# 2) Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3) Train CatBoost (binary logloss for clear probabilistic outputs)
model = CatBoostClassifier(
    loss_function="Logloss",
    iterations=300,
    learning_rate=0.1,
    depth=6,
    random_state=42,
    verbose=0
)
model.fit(X_train, y_train)

# 4) Predict robustly as class labels
# (avoid shape/type issues by thresholding probabilities explicitly)
y_proba = model.predict_proba(X_test)[:, 1]        # P(class=1)
y_pred = (y_proba >= 0.5).astype(int)              # convert to 0/1 ints

# 5) Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"CatBoost Accuracy: {acc:.4f}")

# 6) Confusion matrix with fixed label order to match target_names
labels = [0, 1]  # 0=malignant, 1=benign
cm = confusion_matrix(y_test, y_pred, labels=labels)

# 7) Plot
plt.figure(figsize=(6, 4))
sns.heatmap(
    cm, annot=True, fmt='d', cbar=False,
    xticklabels=data.target_names,  # ['malignant','benign']
    yticklabels=data.target_names
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - CatBoost Classifier")
plt.tight_layout()
plt.show()


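ModuleNotFoundError: No module named 'catboost'

Note: the cell above needs the catboost package, which was not installed in this environment. Install it (pip install catboost) and re-run to get the accuracy and the confusion-matrix plot.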

10. You're working for a FinTech company trying to predict loan default using
customer demographics and transaction behavior.
The dataset is imbalanced, contains missing values, and has both numeric and
categorical features.

Describe your step-by-step data science pipeline using boosting techniques:

● Data preprocessing & handling missing/categorical values
● Choice between AdaBoost, XGBoost, or CatBoost
● Hyperparameter tuning strategy
● Evaluation metrics you'd choose and why
● How the business would benefit from your model

Ans- 1) Data preprocessing & handling missing / categorical values

a. Quick data audit

Check class balance, missingness by column, cardinality of categorical features, skewness, outliers.

b. Missing values

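If using CatBoost or XGBoost: both handle missing numeric values natively, so NaNs can be left in place for the model to use. CatBoost does not accept NaN in categorical columns, so fill those with a placeholder string such as "MISSING".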

If you prefer explicit imputation (for analysis or other models):

Numeric: median (robust) or KNN/Iterative imputer for complex patterns.

Categorical: new category "MISSING" or frequency-based imputation.

Record missingness as a separate binary indicator feature for important columns; the fact that a value is missing can itself be predictive.
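The missing-indicator and imputation steps above might look like the following sketch. The helper name and the `income` column are hypothetical; the point is that the medians are learned on the training rows only and then reused for the test rows.

```python
import numpy as np
import pandas as pd

def add_missing_flags_and_impute(train, test, num_cols):
    """Add <col>_missing flags, then fill NaNs with medians learned on train only."""
    train, test = train.copy(), test.copy()
    medians = train[num_cols].median()
    for col in num_cols:
        for part in (train, test):
            part[f"{col}_missing"] = part[col].isna().astype(int)
            part[col] = part[col].fillna(medians[col])
    return train, test

# Toy frames with a hypothetical numeric column.
train = pd.DataFrame({"income": [30.0, np.nan, 50.0, 70.0]})
test = pd.DataFrame({"income": [np.nan, 40.0]})

train_f, test_f = add_missing_flags_and_impute(train, test, ["income"])
print(train_f)
print(test_f)
```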

c. Categorical features

CatBoost: pass raw categorical columns (no encoding).

XGBoost / AdaBoost: encode:

Low-cardinality: one-hot / target encoding with careful leakage control (use CV-based/leave-one-out or mean/impact encoding with smoothing).

High-cardinality: target encoding, hashing, or embeddings. Use ordered/regularized target stats to avoid leakage.

Always use encodings computed only on training folds (no target leakage).
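One way to keep target encoding leakage-free is to compute it out-of-fold, as in this sketch. The function name, the additive smoothing scheme, and the random toy data are assumptions for illustration; libraries such as category_encoders offer their own variants.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold mean target encoding, smoothed toward the fold's global mean."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        prior = y.iloc[fit_idx].mean()
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Rows are encoded with statistics from the other folds only;
        # categories unseen in those folds fall back to the prior.
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded

rng = np.random.default_rng(0)
cat = rng.choice(["A", "B", "C"], size=30)
y = rng.integers(0, 2, size=30)
enc = oof_target_encode(cat, y, n_splits=3)
print(enc.head())
```

Because each row's encoding never uses its own label, the encoded column does not leak the target into training.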

d. Numeric features

Trees don’t require scaling; consider transformations (log) for heavy skew; create interaction/ratio features if domain-relevant.

e. Imbalance handling

Preferred (for boosting):

Use class weights: scale_pos_weight in XGBoost, class_weights or auto_class_weights in CatBoost.

Use stratified CV to preserve class ratios.

If needed, combine with sampling: SMOTE (on training folds only) or under-sampling majority + boosting.

Consider optimizing on a cost-sensitive loss (or custom objective) that reflects business cost of FN vs FP.
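As a starting point for the weighting options above, the weights can be derived from the class counts. This sketch uses made-up labels with 10% positives; the variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)   # toy labels: 10% defaults

# Ratio commonly used as the initial scale_pos_weight in XGBoost.
n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())
scale_pos_weight = n_neg / n_pos

# "Balanced" weights: n_samples / (n_classes * class_count).
balanced = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weights = dict(zip([0, 1], balanced))

print("scale_pos_weight:", scale_pos_weight)
print("balanced class weights:", class_weights)
```

Both approaches give the positive class nine times the weight of the negative class here; the exact value is worth tuning rather than taking as fixed.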

f. Feature selection / engineering

Remove leaky features. Create aggregates from transaction history (e.g., avg balance, volatility, recent delinquencies). Use domain rules (e.g., overdue ratios).

Check for multicollinearity (not critical for trees) and drop redundant features if helpful.

2) Choice between AdaBoost, XGBoost, or CatBoost

CatBoost — recommended if you have many categorical features and need fast, robust handling of categorical variables and fewer encoding steps. It also handles missing values and tends to be robust to default hyperparameters.

XGBoost — choose it if you need highly optimized performance, fine-grained control, or distributed training. Use when features are mostly numeric or you’re comfortable encoding categoricals safely.

AdaBoost — less preferred for tabular tasks with complex feature interactions and imbalance; simpler but usually outperformed by the gradient-boosting implementations.

(Overall: CatBoost → first choice for mixed categorical/numeric; XGBoost → second choice when you want max speed/ops control; AdaBoost for simple baselines.)

3) Hyperparameter tuning strategy

a. Validation strategy

Use Stratified K-Fold CV (e.g., 5 folds), or a time-based split with rolling windows if the data is time-ordered. Fit all preprocessing (imputation, encoding, resampling) inside each training fold to avoid leakage.

Use nested CV for unbiased performance estimates if model selection is critical.
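A quick check of what stratification does, using made-up labels with 10 positives out of 100: each validation fold should receive the same share of the minority class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))   # features do not affect how the folds are drawn

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
fold_pos_counts = []
for _, val_idx in skf.split(X, y):
    fold_sizes.append(len(val_idx))
    fold_pos_counts.append(int(y[val_idx].sum()))

print("validation fold sizes:", fold_sizes)
print("positives per fold:", fold_pos_counts)
```

With a plain KFold on rare positives, some folds could end up with very few or no defaults, which makes per-fold metrics unstable.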

b. Search approach

Randomized search / Bayesian optimization (Optuna) to explore broad ranges quickly.

Refined grid search around the best region or continue Bayesian for efficiency.

Use early stopping on a validation fold to avoid overfitting (early_stopping_rounds).

c. Parameters to tune (example ranges)

XGBoost / CatBoost common:

learning_rate (eta): 0.01 – 0.3

n_estimators / iterations: 100 – 2000 (use early stopping)

max_depth: 3 – 10

min_child_weight (XGBoost) / min_data_in_leaf (CatBoost): 1 – 50

subsample (row fraction per tree): 0.5 – 1.0

colsample_bytree (XGBoost) / rsm (CatBoost), the feature fraction: 0.4 – 1.0

Regularization: reg_alpha (L1) 0 – 1, reg_lambda (L2) 0 – 10

For imbalance: scale_pos_weight = (n_negative / n_positive) as starting point (tune around it)

CatBoost specific: border_count, l2_leaf_reg, bagging_temperature.

d. Scoring during tuning

Optimize for business-relevant metric (see next). Avoid optimizing raw accuracy when data is imbalanced.

e. Reproducibility

Fix random seed; log experiments (MLflow, Weights & Biases); save best model.

4) Evaluation metrics (and why)

Because the dataset is imbalanced, prefer metrics that reflect performance on the minority class and business cost:

Primary metrics

Precision-Recall AUC (PR-AUC) — better than ROC-AUC when positive class is rare; focuses on classifier’s performance for the positive (default) class.

Recall (Sensitivity) — if missing a default (false negative) is costly, maximize recall (possibly at controlled precision).

Precision@k / Top-k Recall / Lift — practical for actioning a top-k list of risky loans (useful for targeted interventions).

Cost-sensitive metric / Expected monetary loss — compute expected loss using business costs of FN and FP; optimize for profit or minimized expected loss.

Secondary metrics

ROC-AUC — general discrimination ability (still informative).

F1-score — balance between precision and recall (useful when both matter).

Confusion matrix — interpret absolute counts.

Calibration curve / Brier score — for reliability of probabilities (important if you’ll use probabilities for scoring/pricing).
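Several of the metrics above can be computed in a few lines with scikit-learn. The four labels and scores are a tiny made-up example, and `precision_at_k` is a hypothetical helper, since scikit-learn has no built-in with that name.

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

def precision_at_k(y_true, y_prob, k):
    """Share of true positives among the k highest-scored cases."""
    top_k = np.argsort(-y_prob)[:k]
    return y_true[top_k].mean()

pr_auc = average_precision_score(y_true, y_prob)   # summary of the PR curve
roc_auc = roc_auc_score(y_true, y_prob)
brier = brier_score_loss(y_true, y_prob)           # mean squared error of the probabilities
p_at_2 = precision_at_k(y_true, y_prob, k=2)

print("PR-AUC (average precision):", round(pr_auc, 4))
print("ROC-AUC:", roc_auc)
print("Brier score:", round(brier, 4))
print("Precision@2:", p_at_2)
```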

Model explainability & stability

SHAP values — to explain individual predictions and global feature importance (useful for compliance and stakeholder trust).

Stability over time / Backtesting — monitor drift and degradation.

5) How the business benefits from the model

Reduced financial loss: Identify high-risk borrowers early, reduce defaults via interventions (restructuring, collections), and avoid issuing loans to unsustainably risky applicants.

Better capital allocation & pricing: Use risk scores for risk-based pricing and capital provisioning—higher precision reduces unnecessary rejections and optimizes revenue.

Targeted collections: Prioritize collection efforts toward high-likelihood defaulters (higher ROI on collections).

Operational efficiency: Automate decisioning for low-risk cases, freeing manual underwriting for borderline/high-complexity cases.

Regulatory & auditability: Use explainability tools (SHAP, feature importance) to provide rationale for decisions and meet audit/regulatory requirements.

Actionable segments: Create segments (e.g., “high risk, high recovery probability”) to customize retention or recovery strategies.

6) Deployment & monitoring (short)

Threshold selection: Choose the probability threshold from the business cost matrix rather than using the default 0.5.

A/B test / Shadow deployment: Validate model impact on decision metrics and business KPIs before full rollout.

Monitoring: Track model performance (PR-AUC, recall), data drift, feature distributions, and business KPIs (default rate, loss). Retrain on a schedule or when drift is detected.

Governance: Keep model registry, versioned datasets, and retraining policy.
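For the threshold-selection step, a minimal cost-based sketch is shown below. The cost values (a missed default costing five times a false alarm), the toy scores, and the function name are assumptions; real costs would come from the business.

```python
import numpy as np

def best_threshold(y_true, y_prob, cost_fn, cost_fp):
    """Return (threshold, total_cost) minimizing cost_fn * FN + cost_fp * FP."""
    best_t, best_cost = None, np.inf
    for t in np.unique(y_prob):                 # candidate cut-offs: the observed scores
        y_hat = y_prob >= t
        fn = np.sum((y_true == 1) & ~y_hat)     # defaults that were approved
        fp = np.sum((y_true == 0) & y_hat)      # good loans that were flagged
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

t, cost = best_threshold(y_true, y_prob, cost_fn=5.0, cost_fp=1.0)
print("chosen threshold:", t, "total cost:", cost)
```

Because false negatives are priced higher here, the chosen cut-off lands below 0.5: the model flags more applicants in exchange for missing fewer defaults.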

Quick practical checklist (action items)

EDA & missingness report → create missing flags.

Feature engineering (transaction aggregates).

Choose CatBoost if many categoricals; else XGBoost.

Use stratified CV; optimize PR-AUC / cost metric via Bayesian search; enable early stopping.

Evaluate with PR-AUC, recall, precision@k and compute expected monetary impact.

Explain predictions with SHAP and test in a shadow environment.

Deploy with monitoring, threshold tuned to business costs.

In [26]:
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import (
    roc_auc_score, average_precision_score, classification_report, confusion_matrix
)

# Load data (expects a local loan_data.csv with a binary 'loan_default' column)
df = pd.read_csv('loan_data.csv')
target = 'loan_default'

# Train-test split first, so imputation statistics never see the test rows
X = df.drop(columns=[target])
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Identify features (taken from X, so the target is not included)
cat_features = X_train.select_dtypes(include='object').columns.tolist()
num_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Handle missing values: medians come from the training set only
medians = X_train[num_features].median()
X_train[num_features] = X_train[num_features].fillna(medians)
X_test[num_features] = X_test[num_features].fillna(medians)
# CatBoost does not accept NaN in categorical columns
for col in cat_features:
    X_train[col] = X_train[col].fillna('Unknown')
    X_test[col] = X_test[col].fillna('Unknown')

# Model initialization
model = CatBoostClassifier(verbose=0, class_weights=[1, 5])  # Assuming class 1 is minority

# Hyperparameter tuning
params = {
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'iterations': [100, 300, 500],
    'l2_leaf_reg': [1, 3, 5]
}

grid = RandomizedSearchCV(model, param_distributions=params, cv=3, scoring='roc_auc', n_iter=10, random_state=42)
grid.fit(X_train, y_train, cat_features=cat_features)

# Best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

# Evaluation
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))
print("PR-AUC (average precision):", average_precision_score(y_test, y_proba))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


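ModuleNotFoundError: No module named 'catboost'

Note: as with question 9, this cell needs the catboost package (pip install catboost). It also expects a loan_data.csv file with a loan_default column.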