# Boosting Techniques Assignment

# 1: What is Boosting in Machine Learning? Explain how it improves weak learners.

:- Boosting is an ensemble learning technique that combines many weak learners (models that perform only slightly better than random guessing) into a single strong learner.

How it improves weak learners (in short):

Models are trained sequentially, not independently.

Each new model focuses on the errors made by the models before it.

Misclassified data points are given higher weight (as in AdaBoost), or the new model is fit to the remaining error (as in Gradient Boosting).

The final prediction combines all models, often through weighted voting or a weighted sum.
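
The steps above can be sketched in a few lines. This is a minimal, illustrative AdaBoost-style loop and not the scikit-learn implementation; the synthetic dataset, the number of rounds, and the helper name `ensemble_predict` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (assumed data); labels mapped to {-1, +1} for the weighted vote
X, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree trained on the current sample weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error, clipped to keep the logarithm finite
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this learner

    # Raise weights of misclassified points, lower the rest, then renormalize
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def ensemble_predict(X_new):
    """Sign of the alpha-weighted sum of stump votes."""
    votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(votes)


first_acc = (stumps[0].predict(X) == y).mean()
ensemble_acc = (ensemble_predict(X) == y).mean()
print(f"First stump accuracy: {first_acc:.3f}")
print(f"Ensemble accuracy:    {ensemble_acc:.3f}")
```

Comparing the two printed accuracies shows how much the weighted vote adds over a single weak learner on the training data.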

# 2: What is the difference between AdaBoost and Gradient Boosting in terms of how models are trained?

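:- AdaBoost:
Trains models sequentially by re-weighting the training samples. After each round, misclassified samples receive larger weights, so the next model concentrates on them. Each model also receives a vote weight based on its weighted error.

Gradient Boosting:
Trains models sequentially by fitting each new model to the negative gradient of the loss function with respect to the current predictions (the pseudo-residuals). For squared-error loss, these are the ordinary residuals.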
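
To make the contrast concrete, here is a minimal sketch of gradient boosting for squared-error loss, where the negative gradient equals the plain residual. The synthetic sine data, the learning rate, and the tree depth are assumptions chosen for illustration, not values from the assignment.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean minimizes squared error)
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)
trees = []

for _ in range(n_rounds):
    # For the loss 0.5 * (y - F)^2 the negative gradient is the residual y - F
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Take a small step in the direction the new tree suggests
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}")
print(f"MSE after boosting:  {final_mse:.4f}")
```

Unlike the AdaBoost approach, no sample weights change here; only the target that each new tree is fit to changes from round to round.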

# 3: How does regularization help in XGBoost?

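:- Adds penalties for model complexity to the training objective, so a split is kept only if its gain outweighs the penalty.

Penalizes the number of leaves (gamma) and the size of the leaf weights (L1 via reg_alpha, L2 via reg_lambda); max_depth and min_child_weight limit tree growth directly.

Encourages simpler models that generalize better to unseen data.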
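
For reference, the regularized objective can be written as below. The notation is an assumption chosen here for illustration: $l$ is the training loss, $f_k$ the $k$-th tree, $T$ the number of leaves in a tree, and $w_j$ the weight of leaf $j$. The coefficients $\gamma$, $\lambda$ and $\alpha$ correspond to the library parameters `gamma`, `reg_lambda` and `reg_alpha`.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

Larger values of any of the three coefficients make each additional leaf or large leaf weight more expensive, which pushes the model toward smaller trees.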

# 4: Why is CatBoost considered efficient for handling categorical data?

:- CatBoost is efficient for handling categorical data because:

It natively handles categorical features without manual encoding.

Uses ordered target statistics: each row is encoded using only the targets of rows that precede it in a random permutation, which reduces target leakage.

Applies ordered boosting, which improves learning stability.

Avoids large one-hot encoded feature spaces.
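
The ordered encoding idea can be illustrated with a simplified sketch. This is not CatBoost's actual implementation (which, among other things, uses several permutations); the function name `ordered_target_encode`, its `prior` and `prior_weight` parameters, and the toy data are hypothetical.

```python
import numpy as np


def ordered_target_encode(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row using only the targets of rows that precede it in a random order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of earlier targets for this category;
        # the row's own target is excluded, which is what limits leakage
        encoded[idx] = (s + prior * prior_weight) / (c + prior_weight)
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded


# Hypothetical toy data: a categorical column and a binary target
cities = np.array(["A", "B", "A", "A", "B", "C"])
default = np.array([1, 0, 1, 0, 0, 1])
print(ordered_target_encode(cities, default))
```

A category seen for the first time (such as "C" above) falls back to the prior, because no earlier rows are available to estimate its target mean.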

# 5: What are some real-world applications where boosting techniques are preferred over bagging methods?

:- Real-world applications where boosting is preferred over bagging:

Fraud detection - focuses on rare, hard-to-classify cases

Credit scoring - improves prediction on borderline customers

Medical diagnosis - emphasizes misclassified patient cases

Ad click-through rate prediction - captures complex patterns

Customer churn prediction - improves accuracy on difficult users

# 6: Write a Python program to: ● Train an AdaBoost Classifier on the Breast Cancer dataset ● Print the model accuracy (Include your Python code and output in the code box below.)


In [2]:
# Train an AdaBoost Classifier on the Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

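# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train AdaBoost Classifier
model = AdaBoostClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions and print the accuracy
y_pred = model.predict(X_test)
print("Model Accuracy:", accuracy_score(y_test, y_pred))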


# 7: Write a Python program to: ● Train a Gradient Boosting Regressor on the California Housing dataset ● Evaluate performance using R-squared score (Include your Python code and output in the code box below.)

In [3]:
# Train a Gradient Boosting Regressor on the California Housing dataset
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Gradient Boosting Regressor
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model using R-squared score
r2 = r2_score(y_test, y_pred)
print("R-squared Score:", r2)


R-squared Score: 0.7756446042829697


# 8: Write a Python program to: ● Train an XGBoost Classifier on the Breast Cancer dataset ● Tune the learning rate using GridSearchCV ● Print the best parameters and accuracy (Include your Python code and output in the code box below.)

In [5]:
# Train an XGBoost Classifier with GridSearchCV on the Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define XGBoost model
xgb_model = XGBClassifier(
    use_label_encoder=False,  # ignored by recent XGBoost releases (see the warning in the output); safe to remove
    eval_metric='logloss',
    random_state=42
)

# Parameter grid for learning rate
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Fit model
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

# Predictions
y_pred = best_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Model Accuracy:", accuracy)


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)

Best Parameters: {'learning_rate': 0.2}
Model Accuracy: 0.956140350877193


# 9: Write a Python program to: ● Train a CatBoost Classifier ● Plot the confusion matrix using seaborn (Include your Python code and output in the code box below.)

In [7]:
# Train a CatBoost Classifier and plot confusion matrix
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from catboost import CatBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train CatBoost Classifier
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=0,
    random_state=42
)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix using seaborn
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix - CatBoost Classifier")
plt.show()


ModuleNotFoundError: No module named 'catboost'

# 10: You're working for a FinTech company trying to predict loan default using customer demographics and transaction behavior. The dataset is imbalanced, contains missing values, and has both numeric and categorical features. Describe your step-by-step data science pipeline using boosting techniques: ● Data preprocessing & handling missing/categorical values ● Choice between AdaBoost, XGBoost, or CatBoost ● Hyperparameter tuning strategy ● Evaluation metrics you'd choose and why ● How the business would benefit from your model (Include your Python code and output in the code box below.)

:- 1️⃣ Data Preprocessing

Handling missing values

Numeric features: impute using median (robust to outliers).

Categorical features: fill missing values with "Unknown" or let CatBoost handle them natively.
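
A minimal sketch of these two imputation rules with pandas is shown below. The DataFrame and its column names (`income`, `employment_type`) are hypothetical, and in a real pipeline the median would be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only
df = pd.DataFrame({
    "income": [42000.0, np.nan, 58000.0, 1_000_000.0],
    "employment_type": ["salaried", None, "self-employed", "salaried"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# The median is barely affected by the 1,000,000 outlier, unlike the mean
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
# Treat "missing" as its own category
df[cat_cols] = df[cat_cols].fillna("Unknown")
print(df)
```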

Handling categorical variables

Prefer CatBoost because it:

Natively handles categorical features

Uses ordered target encoding (avoids leakage)

No need for one-hot encoding → faster & cleaner pipeline.

Handling imbalance

Use:

class_weights

or scale_pos_weight (for XGBoost)

Focus on recall of defaulters.
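
As a rough illustration of how such weights are usually derived, the snippet below computes the negatives-to-positives ratio, a common heuristic for `scale_pos_weight`. The label array is hypothetical (5% defaulters), and the resulting value is a starting point to tune, not a fixed rule.

```python
import numpy as np

# Hypothetical labels: 1 = default, 0 = repaid (5% positive rate)
y_train = np.array([1] * 50 + [0] * 950)

n_pos = int((y_train == 1).sum())
n_neg = int((y_train == 0).sum())

# Common heuristic for XGBoost's scale_pos_weight: negatives / positives
scale_pos_weight = n_neg / n_pos

# The same ratio expressed as per-class weights, e.g. for a class_weights argument
class_weights = {0: 1.0, 1: scale_pos_weight}
print(scale_pos_weight, class_weights)
```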

2️⃣ Choice of Boosting Algorithm

Best choice: CatBoost

Why not AdaBoost?

Sensitive to noisy data

Weak with missing values

Why CatBoost over XGBoost?

Handles categorical features automatically

Supports built-in class weighting (auto_class_weights) for imbalanced data

Less preprocessing, lower leakage risk

3️⃣ Hyperparameter Tuning Strategy

Use GridSearchCV / RandomizedSearchCV

Tune:

learning_rate

depth

iterations

class_weights

Use Stratified K-Fold so that every fold keeps the original class ratio
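
The effect of stratification can be checked directly. In this small sketch the labels are hypothetical (10% positives) and the features are placeholders, since only the split is of interest.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 10% positives
y = np.array([1] * 20 + [0] * 180)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Positive rate inside each validation fold
fold_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
print(fold_rates)
```

Each validation fold ends up with the same share of positives as the full dataset, so no fold is left without defaulters to score against.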

4️⃣ Evaluation Metrics (Very Important for FinTech)

Accuracy ❌ (misleading for imbalance)

Use instead:

Recall (Default class) → catch risky customers

Precision → avoid rejecting good customers

F1-score → balance precision & recall

ROC-AUC → ranking quality
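
A small worked example of these metrics with scikit-learn follows. The labels, the predicted probabilities, and the 0.5 decision threshold are all hypothetical values picked for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical ground truth (1 = default) and predicted default probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.35, 0.8, 0.9, 0.95]

# Hard predictions at an assumed threshold of 0.5
y_pred = [int(p >= 0.5) for p in y_prob]

recall = recall_score(y_true, y_pred)        # share of defaulters that were caught
precision = precision_score(y_true, y_pred)  # share of flagged customers who really default
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)          # ranking quality, independent of the threshold

print(f"Recall: {recall:.3f}  Precision: {precision:.3f}  F1: {f1:.3f}  ROC-AUC: {auc:.3f}")
```

Recall, precision and F1 depend on the chosen threshold, while ROC-AUC is computed from the probabilities alone, which is why the two kinds of metric complement each other.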

👉 Primary metrics: Recall + ROC-AUC

5️⃣ Business Benefits

Early detection of high-risk borrowers

Reduced loan defaults & financial losses

Better credit decision automation

Improved regulatory compliance

Higher profitability & customer trust

In [8]:
# Loan Default Prediction using CatBoost

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score
from catboost import CatBoostClassifier

# Sample data loading (placeholder)
data = pd.read_csv("loan_data.csv")

X = data.drop("default", axis=1)
y = data["default"]

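# Identify categorical features and fill their missing values with "Unknown"
# (CatBoost does not accept NaN in categorical columns)
cat_features = X.select_dtypes(include=["object"]).columns.tolist()
X[cat_features] = X[cat_features].fillna("Unknown")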

# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# CatBoost model
model = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="AUC",
    auto_class_weights="Balanced",
    verbose=0,
    random_state=42
)

# Hyperparameter tuning
param_grid = {
    "depth": [4, 6, 8],
    "learning_rate": [0.03, 0.1],
    "iterations": [200, 300]
}

grid = GridSearchCV(
    model,
    param_grid,
    scoring="roc_auc",
    cv=5
)

grid.fit(X_train, y_train, cat_features=cat_features)

# Best model
best_model = grid.best_estimator_

# Predictions
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

# Evaluation
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))


ModuleNotFoundError: No module named 'catboost'