# Boosting Techniques

1. What is Boosting in Machine Learning? Explain how it improves weak
learners.
  - Boosting is an ensemble learning method that combines several "weak learners" to create a single "strong learner". A weak learner is a model that is only slightly better than random guessing (like a shallow decision tree).


  - Sequential Learning: Unlike bagging, boosting trains models sequentially.

  - Error Correction: Each new model attempts to correct the errors made by the previous models in the sequence.

  - Weight Adjustment: It improves weak learners by assigning higher weights to observations that were misclassified or had high residuals in earlier rounds, forcing the next model to focus more on those "difficult" cases.
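
The reweighting step can be sketched for a single round of discrete AdaBoost. This is a minimal illustration on a hypothetical six-point toy set with labels in {-1, +1}, not the scikit-learn implementation; the helper name `adaboost_round` is an assumption made for this example.

```python
import numpy as np

def adaboost_round(y_true, y_pred, weights):
    """One round of discrete AdaBoost reweighting for labels in {-1, +1}."""
    miss = y_true != y_pred
    # Weighted error of the current weak learner.
    err = np.sum(weights[miss]) / np.sum(weights)
    # Learner weight: larger when the weak learner is more accurate.
    alpha = 0.5 * np.log((1.0 - err) / err)
    # Misclassified points (y_true * y_pred == -1) are scaled up,
    # correctly classified points are scaled down.
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_w / new_w.sum()

y_true = np.array([1, 1, 1, -1, -1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1])   # the weak learner misses index 2
w0 = np.full(6, 1 / 6)                      # start with uniform weights

alpha, w1 = adaboost_round(y_true, y_pred, w0)
print(f"alpha = {alpha:.4f}")
print("updated weights:", np.round(w1, 3))
```

After normalization the single misclassified point carries weight 0.5 while each correctly classified point drops to 0.1, so the next weak learner is pushed toward the hard case.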
2. What is the difference between AdaBoost and Gradient Boosting in terms
of how models are trained?
  - Focus: AdaBoost focuses on misclassified points by increasing their weights. Gradient Boosting focuses on residuals (the difference between actual and predicted values).
  - Optimization: AdaBoost reduces the loss by reweighting the data points after each round. Gradient Boosting reduces the loss by fitting each new model to the negative gradient of a differentiable loss function (gradient descent in function space).
  - Model Building: AdaBoost usually uses "stumps" (trees with only one split). Gradient Boosting uses larger trees, though still relatively shallow.
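
The residual-fitting idea behind Gradient Boosting can be sketched with squared-error loss, where the negative gradient is simply the residual. This is a simplified illustration on synthetic data (the sine target, tree depth, and learning rate are assumptions chosen for the example), not how `GradientBoostingRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())       # initial constant model
initial_mse = np.mean((y - prediction) ** 2)

for _ in range(50):
    # For the loss 0.5 * (y - F)^2 the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Take a small step in the direction the new tree suggests.
    prediction += learning_rate * tree.predict(X)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

Each tree only has to explain what the current ensemble still gets wrong, which is why shallow trees are usually enough.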
3. How does regularization help in XGBoost?
  - XGBoost (Extreme Gradient Boosting) includes built-in L1 (Lasso) and L2 (Ridge) regularization.

  - Prevents Overfitting: By penalizing complex models, regularization ensures the model doesn't "memorize" noise in the training data.

  - Complexity Control: It penalizes the number of leaves in each tree (gamma) and the magnitude of the leaf weights (L1 via alpha, L2 via lambda), and tree depth can be capped as well, leading to better generalization on unseen data.
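
How these penalties act on a single tree can be sketched with the closed-form leaf value and split gain from XGBoost's second-order objective. This is a minimal sketch: the function names `leaf_weight` and `split_gain` are hypothetical, and the gradient/Hessian sums are made-up numbers, not values from a real dataset.

```python
import numpy as np

def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf value given the gradient sum G and Hessian sum H of its samples."""
    # L1 (alpha) soft-thresholds the gradient sum; L2 (lambda) inflates the denominator.
    G_shrunk = np.sign(G) * max(abs(G) - reg_alpha, 0.0)
    return -G_shrunk / (H + reg_lambda)

def split_gain(G_l, H_l, G_r, H_r, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split under L2 regularization; gamma is the per-leaf penalty."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(G_l, H_l) + score(G_r, H_r)
                  - score(G_l + G_r, H_l + H_r)) - gamma

w_plain = leaf_weight(G=-4.0, H=4.0, reg_lambda=0.0)
w_l2 = leaf_weight(G=-4.0, H=4.0, reg_lambda=4.0)
w_l1 = leaf_weight(G=-4.0, H=4.0, reg_lambda=0.0, reg_alpha=5.0)
gain_free = split_gain(-4.0, 4.0, 3.0, 3.0, reg_lambda=1.0, gamma=0.0)
gain_penalized = split_gain(-4.0, 4.0, 3.0, 3.0, reg_lambda=1.0, gamma=5.0)

print(w_plain, w_l2, w_l1)
print(gain_free, gain_penalized)
```

With these numbers, L2 halves the leaf value, a large enough L1 penalty zeroes it, and a gamma of 5 turns a positive gain negative, so that split would not be kept.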
4. Why is CatBoost considered efficient for handling categorical data?
  - CatBoost is designed specifically to handle categorical features without requiring manual preprocessing like One-Hot Encoding.

  - Ordered Boosting: It uses a permutation-based scheme in which the "target statistic" for each row is computed only from the rows that precede it in a random ordering, which avoids target leakage.

  - Automatic Encoding: It transforms categorical values into numerical features during training, which can reduce memory usage and training time compared to One-Hot Encoding of high-cardinality features.
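
The ordered target statistic can be sketched in a few lines. This is a simplified illustration and not CatBoost's actual implementation (which, among other things, uses several permutations); the function name `ordered_target_stat`, the smoothing weight `a`, and the toy data are assumptions.

```python
import numpy as np

def ordered_target_stat(categories, targets, prior, a=1.0, seed=0):
    """Encode each row using only rows that precede it in a random order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for idx in order:
        cat = categories[idx]
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # The row's own target is not used here, which avoids target leakage.
        encoded[idx] = (s + a * prior) / (n + a)
        # Only after encoding does the row contribute to the running statistics.
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded

cats = ["A", "B", "A", "A", "B", "C"]
y = np.array([1, 0, 1, 0, 1, 1])
enc = ordered_target_stat(cats, y, prior=y.mean())
print(np.round(enc, 3))
```

A category seen for the first time in the ordering falls back to the prior, so the single "C" row is encoded with the overall target mean rather than with its own label.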
5. What are some real-world applications where boosting techniques are
preferred over bagging methods?
  - Boosting is often preferred when the goal is high predictive accuracy and the dataset has a clear but complex structure.

  - Fraud Detection: Identifying rare fraudulent transactions among millions of legitimate ones.

  - Search Engine Ranking: Ranking pages based on relevance where small errors significantly impact user experience.

  - Click-Through Rate (CTR) Prediction: In digital advertising, where predicting the subtle probability of a user clicking is crucial.    
  


In [None]:
'''#6. Write a Python program to:
  - 1. Train an AdaBoost Classifier on the Breast Cancer dataset
  - 2. Print the model accuracy'''
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = AdaBoostClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, y_pred):.4f}")

In [None]:
'''7.Write a Python program to:
● Train a Gradient Boosting Regressor on the California Housing dataset
● Evaluate performance using R-squared score'''
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
print(f"Gradient Boosting R-squared: {r2_score(y_test, y_pred):.4f}")


In [None]:
'''8.Write a Python program to:
● Train an XGBoost Classifier on the Breast Cancer dataset
● Tune the learning rate using GridSearchCV
● Print the best parameters and accuracy'''
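import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
param_grid = {'learning_rate': [0.01, 0.1, 0.2, 0.3]}
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
# best_score_ is the mean cross-validated accuracy, not the test accuracy
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")
# The best estimator is refit on the full training set by default
y_pred = grid_search.best_estimator_.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")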

In [None]:
'''9.Write a Python program to:
● Train a CatBoost Classifier
● Plot the confusion matrix using seaborn'''
from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
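# Reuses the Breast Cancer train/test split from the previous cell
cat_model = CatBoostClassifier(iterations=100, verbose=0, random_seed=42)
cat_model.fit(X_train, y_train)
y_pred = cat_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("CatBoost Confusion Matrix")
plt.show()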

10. You're working for a FinTech company trying to predict loan default using
customer demographics and transaction behavior.
The dataset is imbalanced, contains missing values, and has both numeric and
categorical features.
Describe your step-by-step data science pipeline using boosting techniques:
  - 1. Data preprocessing & handling missing/categorical values
  - 2. Choice between AdaBoost, XGBoost, or CatBoost
  - 3. Hyperparameter tuning strategy
  - 4. Evaluation metrics you'd choose and why
  - 5. How the business would benefit from your model
  - Answer:
  - 1. Data Preprocessing & Handling Missing/Categorical Values

  - Missing Values: Use Mean/Median imputation for numerical features and Mode imputation or a "Missing" label for categorical features.


  - Categorical Encoding: Utilize Target Encoding or let the model handle it natively (if using CatBoost) to preserve the relationship between demographics and default risk.


  - Imbalance Handling: Apply SMOTE (Synthetic Minority Over-sampling Technique) to the training data only, or adjust the scale_pos_weight parameter (available in XGBoost and CatBoost) so that the model doesn't ignore the minority "default" class.
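
The imputation and class-weight steps above can be sketched with pandas. The toy DataFrame and its column names (`income`, `employment_type`, `default`) are hypothetical stand-ins for real loan data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy loan data; column names are illustrative only.
df = pd.DataFrame({
    "income": [52000.0, np.nan, 31000.0, 78000.0, 45000.0, np.nan],
    "employment_type": ["salaried", None, "self_employed", "salaried", None, "salaried"],
    "default": [0, 0, 1, 0, 0, 0],
})

# Numeric feature: median imputation is less sensitive to outliers than the mean.
df["income"] = df["income"].fillna(df["income"].median())
# Categorical feature: keep missingness as its own "Missing" level.
df["employment_type"] = df["employment_type"].fillna("Missing")

# Weight for the positive (default) class: number of negatives / number of positives.
n_pos = int((df["default"] == 1).sum())
n_neg = int((df["default"] == 0).sum())
scale_pos_weight = n_neg / n_pos

print(df)
print(f"scale_pos_weight = {scale_pos_weight:.1f}")
```

In a real pipeline the median and the class ratio would be computed on the training split only, so that no information from the test set leaks into preprocessing.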

  - 2. Choice of Model: CatBoost

  - Reasoning: While XGBoost and AdaBoost are powerful, CatBoost is preferred here because it handles categorical variables natively without needing manual One-Hot Encoding.


  - Robustness: Its ordered boosting scheme helps reduce overfitting, and it handles missing values in numeric features natively, which are common in transaction behavior datasets.

  - 3. Hyperparameter Tuning Strategy

  - Method: Use GridSearchCV or RandomizedSearchCV to find the optimal balance between model complexity and performance.


  - Parameters to Tune: Focus on learning_rate (to control step size), depth (to control tree complexity), and l2_leaf_reg (to apply regularization and prevent overfitting).

  - 4. Evaluation Metrics

  - Primary Metric: F1-Score or Precision-Recall AUC.

  - Justification: In loan defaults, Accuracy is misleading because most people don't default. We care about the trade-off between Precision (not rejecting good customers) and Recall (catching as many potential defaulters as possible).
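
A small numeric sketch shows why accuracy misleads on imbalanced data. The labels and scores below are hypothetical: 5 defaulters among 100 customers, a "naive" model that never predicts default, and a scoring model that ranks four of the five defaulters at the top.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

# Hypothetical labels: 5 defaulters followed by 95 non-defaulters.
y_true = np.array([1] * 5 + [0] * 95)

# A "model" that never predicts default.
y_naive = np.zeros(100, dtype=int)
naive_acc = accuracy_score(y_true, y_naive)
naive_f1 = f1_score(y_true, y_naive, zero_division=0)

# Hypothetical scores: four defaulters score high, one is missed.
scores = np.concatenate([[0.9, 0.8, 0.7, 0.6, 0.1], np.linspace(0.0, 0.5, 95)])
y_pred = (scores >= 0.55).astype(int)
model_f1 = f1_score(y_true, y_pred)
model_pr_auc = average_precision_score(y_true, scores)

print(f"Naive model: accuracy = {naive_acc:.2f}, F1 = {naive_f1:.2f}")
print(f"Scoring model: F1 = {model_f1:.4f}, PR AUC = {model_pr_auc:.4f}")
```

The naive model reaches 95% accuracy while catching no defaulters at all, which its F1 of 0 exposes immediately.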

  - 5. Business Benefits

  - Risk Mitigation: By identifying high-risk individuals early, the company reduces the "Non-Performing Assets" (NPA) ratio.


  - Automated Decisions: The model enables faster loan approvals for low-risk customers, improving user experience and operational efficiency.


In [None]:
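from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# X and y are assumed to hold the loan features and default labels (not defined in this notebook)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Placeholder column positions of the categorical features; adjust to the real dataset
cat_features_indices = [0, 1, 4]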
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',
    cat_features=cat_features_indices,
    verbose=100
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))