Question 1: What is Boosting in Machine Learning? Explain how it improves weak
learners.
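- Boosting is an ensemble learning method that combines multiple weak learners (models that perform only slightly better than random guessing) into a single, more accurate strong learner. The weak learners are trained sequentially: each new learner concentrates on the examples the current ensemble still gets wrong, and the final prediction is a weighted combination of all learners. Correcting errors step by step mainly reduces bias, which is how a set of weak models becomes a strong one.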

Question 2: What is the difference between AdaBoost and Gradient Boosting in terms
of how models are trained?
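- AdaBoost trains learners sequentially and reweights the training data after each round: points misclassified by the previous learner get larger weights, so the next weak learner focuses on them. Each learner's vote in the final prediction is weighted by its accuracy.
- Gradient Boosting also trains sequentially, but each new learner is fit to the residuals of the current model (the differences between the actual and predicted values, which for squared-error loss equal the negative gradient of the loss). Its prediction is then added to the ensemble, usually scaled by a learning rate.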
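
The difference can be made concrete with one round of each method on toy data. This is a minimal NumPy sketch, not the scikit-learn implementation: the labels, predictions, and the hand-built "stump" are made up for illustration, and the AdaBoost update shown is the classic binary form with labels in {-1, +1}.

```python
import numpy as np

# --- AdaBoost: reweight the samples after one round ---
# Labels and weak-learner predictions in {-1, +1}; sample 1 is misclassified.
y = np.array([1, 1, -1, -1, 1])
h = np.array([1, -1, -1, -1, 1])

w = np.full(len(y), 1 / len(y))           # start with uniform weights
err = np.sum(w[h != y])                   # weighted error rate (0.2 here)
alpha = 0.5 * np.log((1 - err) / err)     # this learner's vote weight
w_new = w * np.exp(-alpha * y * h)        # up-weight mistakes, down-weight hits
w_new /= w_new.sum()                      # renormalise so the weights sum to 1

# --- Gradient boosting: fit the next learner to the residuals ---
target = np.array([2.0, 4.0, 9.0])
F0 = np.full_like(target, target.mean())  # initial model: predict the mean
residuals = target - F0                   # negative gradient of squared error

# Hypothetical weak learner: a stump that separates the first two points from
# the third and predicts the mean residual on each side.
stump = np.array([residuals[:2].mean()] * 2 + [residuals[2:].mean()])

learning_rate = 0.5
F1 = F0 + learning_rate * stump           # additive update of the ensemble

print("AdaBoost weights:", w_new)
print("MSE before:", np.mean((target - F0) ** 2))
print("MSE after: ", np.mean((target - F1) ** 2))
```

After the AdaBoost round the misclassified point carries half of the total weight, while the gradient boosting update lowers the mean squared error without touching any sample weights.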

Question 3: How does regularization help in XGBoost?
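- Regularization in XGBoost helps prevent overfitting by adding penalty terms to the training objective: an L1 penalty (reg_alpha) and an L2 penalty (reg_lambda) on the leaf weights, plus a minimum loss reduction (gamma) that a split must achieve. These penalties discourage overly complex trees that fit noise in the training data and so promote better generalization to unseen data.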
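
The shrinking effect of the penalties can be illustrated with the closed-form leaf weight from the second-order formulation that XGBoost is based on, w* = -G / (H + lambda), with the L1 term applied as a soft threshold on G. The sketch below is a simplified illustration with made-up gradient sums, not XGBoost's actual code; the function name leaf_weight is hypothetical.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Regularized optimal weight of a single leaf.

    G and H are the sums of the first and second derivatives of the loss
    over the samples that fall into the leaf.
    """
    # L1 penalty: soft-threshold the gradient sum toward zero.
    if G > reg_alpha:
        G_shrunk = G - reg_alpha
    elif G < -reg_alpha:
        G_shrunk = G + reg_alpha
    else:
        G_shrunk = 0.0
    # L2 penalty: a larger lambda inflates the denominator and shrinks the weight.
    return -G_shrunk / (H + reg_lambda)


G, H = -10.0, 5.0  # made-up gradient statistics for illustration
print(leaf_weight(G, H, reg_lambda=0.0))                 # no regularization
print(leaf_weight(G, H, reg_lambda=5.0))                 # L2 only
print(leaf_weight(G, H, reg_lambda=5.0, reg_alpha=2.0))  # L2 and L1
```

Each added penalty pulls the leaf weight closer to zero, so a single tree can contribute less to the prediction and is less able to fit noise.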

Question 4: Why is CatBoost considered efficient for handling categorical data?
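- CatBoost is considered efficient for handling categorical data because it processes these features natively, without requiring manual preprocessing such as one-hot encoding or label encoding. Internally it converts categories to numbers with ordered target statistics: each row's encoding is computed only from rows that precede it in a random permutation of the data, which limits target leakage.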
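
The idea behind CatBoost's native handling, ordered target statistics, can be sketched in a few lines. This is a simplified illustration and not CatBoost's implementation: it uses the given row order instead of random permutations, and the function name and the prior and a smoothing parameters are assumptions made for the example.

```python
def ordered_target_stats(categories, targets, prior=0.5, a=1.0):
    """Encode each row using only the targets of EARLIER rows in the same category."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)   # sum of targets seen so far for this category
        n = counts.get(cat, 0)   # number of earlier rows with this category
        encoded.append((s + a * prior) / (n + a))  # smoothed running mean
        # Only now add the current row, so it never sees its own target.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
print(ordered_target_stats(cats, ys))  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

Because a row's own label never enters its encoding, the encoded feature cannot leak the target, which is the main risk of plain target encoding.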


Question 5: What are some real-world applications where boosting techniques are
preferred over bagging methods?
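- Boosting techniques are preferred over bagging methods where high accuracy on complex patterns is paramount and a simple model would underfit (high bias), because boosting reduces bias by correcting errors sequentially, whereas bagging mainly reduces variance. Typical examples are fraud detection, credit scoring and loan default prediction, customer churn prediction, search ranking, and click-through-rate prediction on structured (tabular) data.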

Question 6: Write a Python program to:
- Train an AdaBoost Classifier on the Breast Cancer dataset
- Print the model accuracy


In [1]:
import pandas as pd
import numpy as np

In [22]:
from sklearn.datasets import load_breast_cancer
bcancer = load_breast_cancer()
X1 = bcancer.data
y1 = bcancer.target

In [23]:
from sklearn.model_selection import train_test_split
# Note: train_size=0.3 trains on 30% of the rows and tests on the remaining 70%
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, train_size=0.3, random_state=1)

from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier()

abc.fit(X1_train,y1_train)

parameter,value
estimator,
n_estimators,50
learning_rate,1.0
algorithm,'deprecated'
random_state,


In [24]:
y1_pred = abc.predict(X1_test)

from sklearn.metrics import accuracy_score
accuracy_score(y1_test,y1_pred)

0.9298245614035088

Question 7: Write a Python program to:
- Train a Gradient Boosting Regressor on the California Housing dataset
- Evaluate performance using R-squared score

In [18]:
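# Load the housing data; the CSV is expected in the working directory
housing = pd.read_csv('california_housing_train.csv')
housing

# All columns except the last are features; the last column is the target
X2 = housing.iloc[:, :-1]
y2 = housing.iloc[:, -1]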


In [19]:
X2_train,X2_test,y2_train,y2_test = train_test_split(X2,y2,test_size=0.3, random_state=1)

from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor()

reg.fit(X2_train,y2_train)

parameter,value
loss,'squared_error'
learning_rate,0.1
n_estimators,100
subsample,1.0
criterion,'friedman_mse'
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_depth,3
min_impurity_decrease,0.0


In [20]:
y2_pred = reg.predict(X2_test)

from sklearn.metrics import r2_score

r2_score(y2_test,y2_pred)

0.7845662593642795

Question 8: Write a Python program to:
- Train an XGBoost Classifier on the Breast Cancer dataset
- Tune the learning rate using GridSearchCV
- Print the best parameters and accuracy


In [28]:
from xgboost import XGBClassifier

xg = XGBClassifier()

xg.fit(X1_train,y1_train)

parameter,value
objective,'binary:logistic'
base_score,
booster,
callbacks,
colsample_bylevel,
colsample_bynode,
colsample_bytree,
device,
early_stopping_rounds,
enable_categorical,False


In [29]:
y3_pred = xg.predict(X1_test)
accuracy_score(y1_test,y3_pred)

0.9298245614035088

In [30]:
params = {
    'n_estimators':[100,200,300,400,500],
    'learning_rate':[0.1,0.2,0.3,1,0.5],
    'booster':['gbtree','gblinear','dart']
}

from sklearn.model_selection import GridSearchCV

grid_mode = GridSearchCV(xg, param_grid=params,cv = 5, verbose=3)

grid_mode.fit(X1_train,y1_train)



Fitting 5 folds for each of 75 candidates, totalling 375 fits
[CV 1/5] END booster=gbtree, learning_rate=0.1, n_estimators=100;, score=0.941 total time=   0.0s
[CV 2/5] END booster=gbtree, learning_rate=0.1, n_estimators=100;, score=0.941 total time=   0.0s
[CV 3/5] END booster=gbtree, learning_rate=0.1, n_estimators=100;, score=1.000 total time=   0.0s
[CV 4/5] END booster=gbtree, learning_rate=0.1, n_estimators=100;, score=0.971 total time=   0.0s
[CV 5/5] END booster=gbtree, learning_rate=0.1, n_estimators=100;, score=1.000 total time=   0.0s
[CV 1/5] END booster=gbtree, learning_rate=0.1, n_estimators=200;, score=0.941 total time=   0.0s
[CV 2/5] END booster=gbtree, learning_rate=0.1, n_estimators=200;, score=0.941 total time=   0.0s
[CV 3/5] END booster=gbtree, learning_rate=0.1, n_estimators=200;, score=1.000 total time=   0.0s
[CV 4/5] END booster=gbtree, learning_rate=0.1, n_estimators=200;, score=0.971 total time=   0.0s
[CV 5/5] END booster=gbtree, learning_rate=0.1, n_estima

parameter,value
estimator,"XGBClassifier...ree=None, ...)"
param_grid,"{'booster': ['gbtree', 'gblinear', ...], 'learning_rate': [0.1, 0.2, ...], 'n_estimators': [100, 200, ...]}"
scoring,
n_jobs,
refit,True
cv,5
verbose,3
pre_dispatch,'2*n_jobs'
error_score,
return_train_score,False

parameter,value
objective,'binary:logistic'
base_score,
booster,'gbtree'
callbacks,
colsample_bylevel,
colsample_bynode,
colsample_bytree,
device,
early_stopping_rounds,
enable_categorical,False


In [34]:
best_params = grid_mode.best_params_

# best_score_ is the mean cross-validated accuracy of the best candidate, not test-set accuracy
best_score = grid_mode.best_score_

print(f"the best parameters are {best_params}, and best accuracy will be {best_score:.4}")

the best parameters are {'booster': 'gbtree', 'learning_rate': 0.1, 'n_estimators': 100}, and best accuracy will be 0.9706


Question 9: Write a Python program to:
- Train a CatBoost Classifier
- Plot the confusion matrix using seaborn


In [36]:
from catboost import CatBoostClassifier

cat_clf = CatBoostClassifier()

cat_clf.fit(X1_train,y1_train)

Learning rate set to 0.004834
0:	learn: 0.6858891	total: 147ms	remaining: 2m 26s
1:	learn: 0.6781648	total: 151ms	remaining: 1m 15s
2:	learn: 0.6705497	total: 156ms	remaining: 51.9s
3:	learn: 0.6641175	total: 161ms	remaining: 40.2s
4:	learn: 0.6567857	total: 167ms	remaining: 33.2s
5:	learn: 0.6509733	total: 172ms	remaining: 28.5s
6:	learn: 0.6436193	total: 177ms	remaining: 25s
7:	learn: 0.6368326	total: 181ms	remaining: 22.5s
8:	learn: 0.6291703	total: 186ms	remaining: 20.5s
9:	learn: 0.6228904	total: 190ms	remaining: 18.8s
10:	learn: 0.6173334	total: 195ms	remaining: 17.5s
11:	learn: 0.6111394	total: 200ms	remaining: 16.4s
12:	learn: 0.6047801	total: 204ms	remaining: 15.5s
13:	learn: 0.5985748	total: 209ms	remaining: 14.7s
14:	learn: 0.5920908	total: 215ms	remaining: 14.1s
15:	learn: 0.5853732	total: 220ms	remaining: 13.5s
16:	learn: 0.5793934	total: 227ms	remaining: 13.1s
17:	learn: 0.5751843	total: 236ms	remaining: 12.9s
18:	learn: 0.5700470	total: 240ms	remaining: 12.4s
19:	learn: 

<catboost.core.CatBoostClassifier at 0x2c201fd0050>

In [38]:
y4_pred = cat_clf.predict(X1_test)

accuracy_score(y1_test,y4_pred)

0.9373433583959899
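
The confusion-matrix plot that the question asks for could look like the sketch below. The helper name plot_confusion is hypothetical, and the demo labels are made-up stand-ins so the block runs on its own; in the notebook one would pass y1_test and y4_pred instead. The class names follow the Breast Cancer dataset, where 0 is malignant and 1 is benign.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix


def plot_confusion(y_true, y_pred, labels=("malignant", "benign")):
    """Draw the confusion matrix as an annotated seaborn heatmap."""
    cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
    ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                     xticklabels=labels, yticklabels=labels)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    ax.set_title("Confusion matrix")
    return cm, ax


# Stand-in labels for illustration only (not results from the model above).
demo_true = [0, 0, 1, 1, 1, 0]
demo_pred = [0, 1, 1, 1, 0, 0]
cm, ax = plot_confusion(demo_true, demo_pred)
plt.tight_layout()
# In the notebook: plot_confusion(y1_test, y4_pred) followed by plt.show()
```

The off-diagonal cells are the misclassifications, which a single accuracy value does not break down by class.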

Question 10: You're working for a FinTech company trying to predict loan default using
customer demographics and transaction behavior.
The dataset is imbalanced, contains missing values, and has both numeric and
categorical features.
Describe your step-by-step data science pipeline using boosting techniques:
- Data preprocessing & handling missing/categorical values
- Choice between AdaBoost, XGBoost, or CatBoost
- Hyperparameter tuning strategy
- Evaluation metrics you'd choose and why
- How the business would benefit from your model


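Pipeline: data cleaning → handle missing and categorical values → address class imbalance → train CatBoost (compare with XGBoost) → hyperparameter tuning (Optuna) → evaluate (ROC-AUC, PR-AUC, recall, F1) → deploy the model → track real-world performance.

- Data preprocessing: clean the data first (duplicates, inconsistent entries, obvious errors), then handle missing and categorical values. CatBoost accepts categorical columns directly and handles missing numeric values natively, so little manual encoding or imputation is needed.
- Class imbalance: give the default class a larger weight, or resample the training data, so the model does not simply predict the majority class.
- Choice of model: CatBoost as the main model because of the mix of numeric and categorical features, with XGBoost trained as a comparison.
- Hyperparameter tuning: search with Optuna over settings such as the learning rate, tree depth, number of trees, and regularization strength, scoring each trial with stratified cross-validation.
- Evaluation metrics: ROC-AUC and PR-AUC for ranking quality (PR-AUC is the more informative of the two when defaults are rare), plus recall and F1 on the default class, because accuracy is misleading on imbalanced data and a missed default is costly.
- Deployment and business benefit: deploy the model and track its real-world performance over time. More reliable default predictions let the lender reduce credit losses and make faster, more consistent approval decisions.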
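
The evaluation step of this pipeline can be sketched with scikit-learn's metric functions. The labels and scores below are made up for illustration (1 means default), the 0.5 threshold is an assumption that a real project would tune, and the class ratio is the kind of value commonly passed to a booster as a positive-class weight such as scale_pos_weight.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels (1 = default) and predicted default probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])

# Negative-to-positive ratio, a common starting point for class weighting.
pos_weight = (y_true == 0).sum() / (y_true == 1).sum()

# Threshold-free metrics work on the scores directly.
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

# Threshold-dependent metrics need hard predictions; 0.5 is an assumed cut-off.
y_pred = (y_score >= 0.5).astype(int)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"pos_weight={pos_weight:.1f} ROC-AUC={roc_auc:.3f} PR-AUC={pr_auc:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```

In this toy case the ROC-AUC looks strong while recall at the 0.5 threshold shows that half of the defaults are missed, which is why the threshold-dependent metrics are reported alongside the AUC values.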