Question 1: What is Boosting in Machine Learning? Explain how it improves weak learners.

=> Boosting is an ensemble learning technique that combines multiple weak learners (usually decision trees) to create a strong learner that performs better than any single model alone.

A weak learner is a model that performs only slightly better than random guessing (for example, 55–60% accuracy on a balanced binary problem). Boosting “boosts” the accuracy of such learners by training them sequentially, with each one focusing on the errors made by the previous ones, and then combining all of their predictions into a single weighted vote or sum.
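The sequential idea can be made concrete with a small sketch. The code below is a simplified AdaBoost-style loop written by hand, assuming scikit-learn and NumPy are available; the synthetic dataset, the number of rounds, and all variable names are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration); labels mapped to -1/+1.
X, y01 = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
y = 2 * y01 - 1

n_rounds = 25
weights = np.full(len(y), 1 / len(y))   # start with uniform sample weights
ensemble_score = np.zeros(len(y))       # running weighted vote of all stumps
train_accuracy = []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree (stump) fitted with the current sample weights.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this stump and its say (alpha) in the final vote.
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Misclassified samples gain weight, so the next stump focuses on them.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    ensemble_score += alpha * pred
    train_accuracy.append(np.mean(np.sign(ensemble_score) == y))

print(f"Training accuracy after 1 round:  {train_accuracy[0]:.3f}")
print(f"Training accuracy after {n_rounds} rounds: {train_accuracy[-1]:.3f}")
```

Comparing the two printed values shows how much the combined vote gains over the first weak learner alone on the training data.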

Question 2: What is the difference between AdaBoost and Gradient Boosting in terms of how models are trained?

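=> Both AdaBoost and Gradient Boosting are boosting algorithms: they combine weak learners (usually decision trees) sequentially to form a strong learner. They differ in how each new learner is trained. AdaBoost reweights the training samples after every round so that misclassified samples count more, and it weights each learner's vote by its accuracy. Gradient Boosting leaves the sample weights alone and instead fits each new learner to the negative gradient of a loss function (the residuals, for squared error), then adds it to the ensemble scaled by a learning rate.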
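One way to see the training difference: AdaBoost changes the sample weights between rounds, while Gradient Boosting changes the target each new learner is fitted to. The following is a minimal sketch of the Gradient Boosting side, assuming squared-error loss, NumPy, and scikit-learn; the toy sine data and the hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 100

# Start from a constant prediction (the mean minimizes squared error).
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    # Every sample keeps equal weight; only the regression target changes.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Note that no `sample_weight` is passed anywhere: the focus on hard cases comes entirely from the residuals being larger where the current ensemble is wrong.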

Question 3: How does regularization help in XGBoost?


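=> XGBoost (Extreme Gradient Boosting) is an implementation of gradient boosting that adds explicit regularization to the training objective. It penalizes the number of leaves in each tree (gamma) and the size of the leaf weights (L2 via lambda, L1 via alpha). This discourages overly complex trees, so the model generalizes better and is less prone to overfitting.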
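The effect of regularization can be illustrated with the closed-form expressions from the XGBoost objective: the optimal leaf weight is w* = -G / (H + lambda), and a split is kept only if its gain exceeds gamma. The sketch below uses plain Python; the helper names `leaf_weight` and `split_gain` and the gradient and Hessian sums are hypothetical values chosen for illustration, and the L1 (alpha) term is left out for brevity.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf weight w* = -G / (H + lambda) (hypothetical helper)."""
    return -grad_sum / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Gain of a candidate split under the regularized objective (hypothetical helper)."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    improvement = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    # gamma is the fixed cost of adding one more leaf to the tree.
    return improvement - gamma


# Made-up gradient/Hessian sums for the two children of one candidate split.
G_L, H_L, G_R, H_R = -4.0, 5.0, 3.0, 4.0

# A larger lambda shrinks the leaf weight toward zero.
for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:>4}: left leaf weight = {leaf_weight(G_L, H_L, lam):.4f}")

# A larger gamma can turn a worthwhile split into one that is pruned.
for gam in (0.0, 5.0):
    gain = split_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=gam)
    print(f"gamma={gam}: gain = {gain:.4f} -> {'keep' if gain > 0 else 'prune'}")
```

In the scikit-learn wrapper of XGBoost, the corresponding constructor arguments are `reg_lambda`, `reg_alpha`, and `gamma`.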

Question 4: Why is CatBoost considered efficient for handling categorical data?


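=> CatBoost (short for Categorical Boosting) is a gradient boosting library developed by Yandex that handles categorical features natively, whereas most other boosting libraries expect them to be encoded beforehand. It converts categories to numbers using ordered target statistics, which are computed only from rows that precede the current one in a random permutation. This lets it encode high-cardinality features without target leakage and without the dimensionality blow-up of one-hot encoding.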
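CatBoost's handling of categorical features is based on ordered target statistics. The sketch below is a simplified illustration of that idea, not CatBoost's actual implementation: the function name `ordered_target_stats`, the `prior_weight` parameter, and the toy data are all assumptions made for the example.

```python
import numpy as np


def ordered_target_stats(categories, targets, prior_weight=1.0, seed=0):
    """Encode each category using only rows seen earlier in a random order.

    Hypothetical helper illustrating the idea; the real library uses
    several permutations and further refinements.
    """
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    prior = targets.mean()                      # global fallback value
    order = np.random.default_rng(seed).permutation(len(categories))

    sums, counts = {}, {}
    encoded = np.empty(len(categories))
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # The row's own target is not used, which avoids target leakage.
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now does this row become visible to later rows.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded


colors = ["red", "blue", "red", "green", "blue", "red"]
clicked = [1, 0, 1, 0, 1, 0]
encoded = ordered_target_stats(colors, clicked)
for color, value in zip(colors, encoded):
    print(f"{color:>5} -> {value:.3f}")
```

A category that has not been seen yet (such as "green", which occurs only once) falls back to the global prior, so rare categories do not get extreme encodings.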

Question 5: What are some real-world applications where boosting techniques are preferred over bagging methods?


=> Both bagging (like Random Forest) and boosting (like AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost) are ensemble learning techniques, but they serve different purposes.

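Bagging reduces variance (good for high-variance models).
Boosting reduces bias (good for weak models that underfit).

Boosting is therefore often preferred on structured, tabular problems where predictive accuracy matters most, such as fraud detection, credit scoring, click-through-rate prediction, and search ranking.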
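The bias/variance distinction can be checked with a small experiment. The sketch below compares a single decision stump, bagged stumps, and boosted stumps on a synthetic concentric-circles dataset, assuming scikit-learn is installed; the dataset and all settings are illustrative assumptions. A stump underfits this data, so it is a case of high bias.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Two concentric circles: no single axis-aligned threshold separates them.
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "single stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    # Bagging averages many stumps trained on bootstrap samples.
    "bagged stumps": BaggingClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0
    ),
    # AdaBoost's default weak learner is also a depth-1 tree.
    "boosted stumps": AdaBoostClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name:>14}: test accuracy = {scores[name]:.3f}")
```

Because bagging mainly averages away variance, it has little to offer an underfitting stump, whereas boosting can build a more expressive model out of the same weak learners.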

Question 6: Write a Python program to:
● Train an AdaBoost Classifier on the Breast Cancer dataset
● Print the model accuracy

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset (binary classification)
data = load_breast_cancer()
X = data.data
y = data.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost with 100 boosting rounds
model = AdaBoostClassifier(n_estimators=100, learning_rate=0.8, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")


Model Accuracy: 96.49%


Question 7: Write a Python program to:
● Train a Gradient Boosting Regressor on the California Housing dataset
● Evaluate performance using R-squared score

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

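# Load the California Housing dataset (regression target: median house value)
data = fetch_california_housing()
X = data.data
y = data.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train gradient boosting with 200 shallow trees
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
print(f"R-squared Score: {r2:.4f}")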


R-squared Score: 0.8004


Question 8: Write a Python program to:
● Train an XGBoost Classifier on the Breast Cancer dataset
● Tune the learning rate using GridSearchCV
● Print the best parameters and accuracy


In [4]:
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# use_label_encoder is no longer used by recent XGBoost versions, so it is omitted
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)

# Candidate learning rates to compare
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}

# 5-fold cross-validated grid search on the training set
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1, verbose=1)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_ * 100:.2f}%")

# Evaluate the best model (refit on the full training set) on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {test_accuracy * 100:.2f}%")


Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best Parameters: {'learning_rate': 0.2}
Best Cross-Validation Accuracy: 96.70%
Test Set Accuracy: 95.61%


