In [None]:
##Boosting Techniques assignment##

In [None]:
#1. What is Boosting in Machine Learning?
Boosting is an ensemble learning technique that combines multiple weak learners (typically decision trees) to create a strong predictive model. 
It works sequentially, where each new model focuses on correcting the errors of the previous one, gradually improving accuracy.

In [None]:
#2. How does Boosting differ from Bagging?
Boosting builds models sequentially, with each new model focusing on the mistakes of the previous ones.
Bagging (e.g., Random Forest) trains models independently, often in parallel, on bootstrap samples and combines them by averaging (for regression) or voting (for classification).
Boosting primarily reduces bias, while bagging primarily reduces variance.
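
The contrast can be tried out on a small synthetic dataset. This is a minimal sketch, assuming scikit-learn is installed; the dataset sizes, the variable names (X_cmp, bagging, boosting) and the choice of AdaBoost as the boosting method are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, used only for this comparison
X_cmp, y_cmp = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_cmp, y_cmp, test_size=0.25, random_state=0)

# Bagging: 50 full-depth trees, each fit independently on a bootstrap sample
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_tr, y_tr)

# Boosting: up to 50 depth-1 stumps, each fit after re-weighting the previous errors
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
boosting.fit(X_tr, y_tr)

bagging_acc = bagging.score(X_te, y_te)
boosting_acc = boosting.score(X_te, y_te)
print("Bagging accuracy: ", bagging_acc)
print("Boosting accuracy:", boosting_acc)
```

Bagging starts from low-bias, high-variance learners (deep trees), while boosting starts from high-bias learners (stumps) and reduces their bias round by round. Which one scores higher depends on the data.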

In [None]:
#3. What is the key idea behind AdaBoost?
AdaBoost (Adaptive Boosting) assigns higher weights to misclassified samples, forcing subsequent models to focus on difficult cases.
After each round it re-weights the samples and gives each weak learner a vote that grows with its accuracy.


In [None]:
#4. Explain the working of AdaBoost with an example
1. Start with equal weights for all training samples.
2. Train a weak model (e.g., a shallow decision tree) on the weighted samples.
3. Increase the weights of misclassified samples so the next model focuses on them.
4. Repeat this process for multiple iterations.
5. The final prediction is a weighted combination of all models, where more accurate models get larger weights.
Example: Suppose we classify emails as spam/non-spam. If the first weak model misclassifies certain spam emails,
the next model gives those emails more importance, improving accuracy over iterations.
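
The five steps can be sketched from scratch with NumPy. This is an illustrative version of discrete AdaBoost, assuming labels coded as -1/+1 and threshold stumps as weak learners; the helper names (stump_predict, fit_stump, adaboost_fit, adaboost_predict) and the toy data are made up for this sketch and are not scikit-learn's implementation.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Predict +polarity where X[:, feature] <= threshold, else -polarity."""
    return np.where(X[:, feature] <= threshold, polarity, -polarity)

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (feature, threshold, polarity), err
    return best, best_err

def adaboost_fit(X, y, n_rounds=3):
    w = np.full(len(y), 1.0 / len(y))            # step 1: equal weights
    ensemble = []
    for _ in range(n_rounds):                     # step 4: repeat
        stump, err = fit_stump(X, y, w)           # step 2: weak model on weighted data
        err = np.clip(err, 1e-10, 1 - 1e-10)      # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)     # vote of this weak model
        pred = stump_predict(X, *stump)
        w = w * np.exp(-alpha * y * pred)         # step 3: up-weight mistakes
        w = w / w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(X, ensemble):
    # step 5: sign of the alpha-weighted vote
    score = sum(alpha * stump_predict(X, *stump) for alpha, stump in ensemble)
    return np.sign(score)

# Toy 1-D data that no single stump can classify perfectly
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

ensemble = adaboost_fit(X_toy, y_toy, n_rounds=3)
train_accuracy = np.mean(adaboost_predict(X_toy, ensemble) == y_toy)
print("Training accuracy after 3 rounds:", train_accuracy)
```

Each single stump misclassifies at least three of the ten points here, yet the weighted vote of three stumps separates them, which is the point of the re-weighting in step 3.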

In [None]:
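#5. What is Gradient Boosting, and how is it different from AdaBoost?
Gradient Boosting minimizes a differentiable loss function by fitting each new model to the negative gradient of the loss with respect to the current predictions (the pseudo-residuals), which amounts to gradient descent in function space. For squared error, the pseudo-residuals are the ordinary residuals.
AdaBoost re-weights misclassified samples, while Gradient Boosting leaves the sample weights alone and fits each new model to the gradient of the loss.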
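
For squared error, fitting the negative gradient means fitting the residuals, which can be sketched in a few lines of NumPy. This is a simplified illustration, assuming 1-D inputs and regression stumps as weak learners; the function names, the sine-wave data and the default values of n_rounds and learning_rate are assumptions made for this sketch.

```python
import numpy as np

def fit_regression_stump(x, residual):
    """Find the single threshold on 1-D x that best fits the residual (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:           # keep both sides non-empty
        left, right = residual[x <= threshold], residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]                                # (threshold, left_value, right_value)

def regression_stump_predict(x, stump):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost_fit(x, y, n_rounds=50, learning_rate=0.1):
    baseline = y.mean()                            # initial constant model
    pred = np.full(len(y), baseline)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred                        # negative gradient of 0.5 * (y - pred)**2
        stump = fit_regression_stump(x, residual)
        pred = pred + learning_rate * regression_stump_predict(x, stump)
        stumps.append(stump)
    return baseline, stumps

def gradient_boost_predict(x, baseline, stumps, learning_rate=0.1):
    pred = np.full(len(x), baseline)
    for stump in stumps:
        pred = pred + learning_rate * regression_stump_predict(x, stump)
    return pred

# Hypothetical demo data: a noiseless sine wave
x_demo = np.linspace(0, 6, 60)
y_demo = np.sin(x_demo)

baseline, stumps = gradient_boost_fit(x_demo, y_demo)
mse_start = np.mean((y_demo - baseline) ** 2)
mse_end = np.mean((y_demo - gradient_boost_predict(x_demo, baseline, stumps)) ** 2)
print(f"MSE with mean only: {mse_start:.4f}, after boosting: {mse_end:.4f}")
```

Unlike the AdaBoost procedure, no sample weights appear anywhere: the information about past mistakes is carried entirely by the residuals.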

In [None]:
#6. What is the loss function in Gradient Boosting?
The loss function depends on the problem type:

Regression: Mean Squared Error (MSE) or Mean Absolute Error (MAE).
Classification: Log loss (binary cross-entropy) for binary problems, or multinomial cross-entropy for multi-class problems.
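
What each new tree is actually fitted to is the negative gradient of the chosen loss. The sketch below writes these gradients out for the losses listed above, assuming binary labels in {0, 1} and a raw score on the log-odds scale for log loss; the function names are illustrative only.

```python
import numpy as np

def mse_negative_gradient(y, pred):
    # L = 0.5 * (y - pred)**2  ->  -dL/dpred = y - pred (the residual)
    return y - pred

def mae_negative_gradient(y, pred):
    # L = |y - pred|  ->  -dL/dpred = sign(y - pred)
    return np.sign(y - pred)

def log_loss_negative_gradient(y, raw_score):
    # L = -[y*log(p) + (1 - y)*log(1 - p)] with p = sigmoid(raw_score)
    # -dL/draw_score = y - p
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return y - p

y_true = np.array([3.0, 1.0])
current_pred = np.array([2.0, 2.0])
print(mse_negative_gradient(y_true, current_pred))   # residuals
print(mae_negative_gradient(y_true, current_pred))   # only their signs
print(log_loss_negative_gradient(np.array([1.0, 0.0]), np.array([0.0, 0.0])))
```

MAE only passes the sign of each error to the next tree, which is why it is less sensitive to outliers than MSE.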

In [None]:
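#7. How does XGBoost improve over traditional Gradient Boosting?
XGBoost (Extreme Gradient Boosting) enhances traditional Gradient Boosting by:

Regularization (L1 & L2) on leaf weights to prevent overfitting.
Tree Pruning: splits whose gain falls below a threshold (gamma) are pruned away.
Parallel Processing: split finding within each tree is parallelized, while the trees themselves are still built sequentially.
Handling missing values automatically by learning a default direction for them at each split.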

In [None]:
#8. What is the difference between XGBoost and CatBoost?
XGBoost: Optimized for speed and efficiency, handles missing values natively, but traditionally requires categorical features to be encoded beforehand (e.g., one-hot or label encoding).
CatBoost: Specifically designed for categorical data, uses ordered boosting to prevent target leakage and handles categorical features natively.

In [None]:
#9. What are some real-world applications of Boosting techniques?
Fraud Detection (e.g., credit card fraud).
Medical Diagnosis (e.g., disease prediction).
Search Engine Ranking (e.g., learning-to-rank models).
Recommendation Systems (e.g., product or movie recommendations).
Financial Market Prediction (e.g., stock price forecasting).

In [None]:
#10. How does regularization help in XGBoost?

Regularization prevents overfitting by adding penalties on the leaf weights of each tree to the objective:

L1 Regularization (alpha, Lasso-style): Shrinks leaf weights toward zero and can set some exactly to zero, promoting sparsity.
L2 Regularization (lambda, Ridge-style): Prevents extreme leaf weight values, making the model more stable.
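
In XGBoost these penalties act on the leaf weights of the trees. The effect can be sketched with the closed-form leaf weight that follows from a second-order approximation of the loss. The function below is an illustrative re-derivation, not XGBoost's code: leaf_weight is a made-up name, and grad_sum and hess_sum stand for the sums of first and second derivatives of the loss over the samples in one leaf.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf weight minimizing G*w + 0.5*(H + lambda)*w**2 + alpha*|w|."""
    # L1: soft-threshold the gradient sum, which can zero the leaf out entirely
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    # L2: a larger denominator pulls the weight toward zero
    return -shrunk / (hess_sum + reg_lambda)

print(leaf_weight(-4.0, 3.0, reg_lambda=0.0))                 # no regularization
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0))                 # L2 shrinks the weight
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=5.0))  # L1 sets it to zero
```

With the same gradient statistics, L2 scales the weight down smoothly, while a large enough L1 penalty removes the leaf's contribution altogether.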

In [None]:
#11. What are some hyperparameters to tune in Gradient Boosting models?
Learning rate (eta in XGBoost): Scales the contribution of each tree; smaller values usually need more trees.
Number of trees: More trees lower the training error but increase computation and, past a point, can overfit.
Max depth: Higher values increase complexity but may cause overfitting.
Subsample ratio: Fraction of the training data sampled for each tree.
Regularization parameters (lambda, alpha in XGBoost): Help prevent overfitting.
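
A small grid search shows how several of these hyperparameters can be tuned together. This is a minimal sketch using scikit-learn's GradientBoostingClassifier, which exposes learning_rate, n_estimators, max_depth and subsample but not the XGBoost-specific lambda and alpha; the grid values and the dataset are arbitrary assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data, used only for this search
X_hp, y_hp = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1],   # contribution of each tree
    "max_depth": [2, 3],            # complexity of each tree
    "subsample": [0.8, 1.0],        # fraction of rows sampled per tree
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_hp, y_hp)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Learning rate and number of trees interact strongly, so in practice they are usually tuned together rather than one at a time.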

In [None]:
#12. What is the concept of Feature Importance in Boosting?
Feature Importance indicates how much a feature contributes to predictions. In XGBoost it can be measured using:

Gain: Average improvement in the loss from splits that use the feature.
Cover: Number of samples (weighted by the second-order gradient) that pass through splits on the feature.
Frequency (called "weight" in XGBoost): How often a feature is used to split across all trees.
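
Two of these views can be approximated with scikit-learn alone. The sketch below compares the built-in impurity-based importance (a gain-like measure) with a hand-computed split frequency. It assumes scikit-learn's GradientBoostingClassifier rather than XGBoost, so the numbers are analogous to, not identical with, XGBoost's gain and frequency; the variable names and dataset are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical toy data: 6 features, 3 of them informative
X_fi, y_fi = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
gbc = GradientBoostingClassifier(n_estimators=30, max_depth=2, random_state=0).fit(X_fi, y_fi)

# Gain-like: normalized total impurity reduction attributed to each feature
gain_like = gbc.feature_importances_

# Frequency: how many split nodes use each feature across all trees
split_counts = np.zeros(X_fi.shape[1], dtype=int)
for tree in gbc.estimators_.ravel():
    used = tree.tree_.feature
    used = used[used >= 0]          # leaf nodes carry a negative placeholder index
    split_counts += np.bincount(used, minlength=X_fi.shape[1])

for i in range(X_fi.shape[1]):
    print(f"feature {i}: gain-like = {gain_like[i]:.3f}, splits = {split_counts[i]}")
```

The two rankings need not agree: a feature can be split on often with little improvement each time, so it helps to look at more than one measure.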

In [None]:
#13. Why is CatBoost efficient for categorical data?
CatBoost is efficient because:

It encodes categorical features internally using target statistics, avoiding manual one-hot encoding.
It uses Ordered Boosting, which reduces target leakage by computing each sample's statistics only from samples that precede it in a random permutation.
It supports GPU training, which speeds up training on large datasets.
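
The leakage-avoiding idea behind CatBoost's categorical encoding can be sketched in plain Python: each sample's category is encoded using only the targets of samples that come before it in a random permutation. This is a simplified illustration, not CatBoost's implementation, which uses several permutations and further refinements; the function name and the prior and prior_weight parameters are assumptions for this sketch.

```python
import random

def ordered_target_encoding(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each category value from the targets of earlier samples only."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)        # random processing order
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded[i] = (s + prior * prior_weight) / (n + prior_weight)
        # Only now does the sample's own target enter the statistics
        sums[cat] = s + targets[i]
        counts[cat] = n + 1
    return encoded

cities = ["paris", "rome", "paris", "paris", "rome", "paris"]
clicked = [1, 0, 1, 0, 0, 1]
encoded_cities = ordered_target_encoding(cities, clicked)
for city, value in zip(cities, encoded_cities):
    print(f"{city}: {value:.3f}")
```

Because a sample's own target never contributes to its own encoding, the encoded feature cannot simply memorize the label, which is the leakage that plain target encoding suffers from.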

In [None]:
##Practical Examples##

In [None]:
#14. Train an AdaBoost Classifier on a sample dataset and print model accuracy.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("AdaBoost Classifier Accuracy:", accuracy)


In [None]:
#15.Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE).
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.datasets import make_regression

# Generate a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=10, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Regressor
model = AdaBoostRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("AdaBoost Regressor MAE:", mae)

In [None]:
#16.Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train model
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Print feature importance
importances = model.feature_importances_
for feature, importance in zip(data.feature_names, importances):
    print(f"{feature}: {importance:.4f}")


In [None]:
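#17. Train a Gradient Boosting Regressor and evaluate using R-Squared Score.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Train model (reuses the regression split X_train, X_test, y_train, y_test created in #15)
regressor = GradientBoostingRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)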

# Predict and evaluate
y_pred = regressor.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("Gradient Boosting Regressor R-Squared Score:", r2)


In [None]:
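#18. Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting.
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Use a classification dataset: the X_train/y_train left over from #15 hold continuous regression targets
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(X_bc, y_bc, test_size=0.2, random_state=42)

# Train XGBoost Classifier (use_label_encoder is deprecated and no longer needed)
xgb_model = XGBClassifier(n_estimators=100, eval_metric="logloss", random_state=42)
xgb_model.fit(X_train_bc, y_train_bc)
xgb_pred = xgb_model.predict(X_test_bc)
xgb_accuracy = accuracy_score(y_test_bc, xgb_pred)

# Compare with Gradient Boosting on the same split
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train_bc, y_train_bc)
gb_pred = gb_model.predict(X_test_bc)
gb_accuracy = accuracy_score(y_test_bc, gb_pred)

print("XGBoost Classifier Accuracy:", xgb_accuracy)
print("Gradient Boosting Classifier Accuracy:", gb_accuracy)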


In [None]:
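#19. Train a CatBoost Classifier and evaluate using F1-Score.
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Use a classification dataset: the X_train/y_train left over from #15 hold continuous regression targets
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(X_bc, y_bc, test_size=0.2, random_state=42)

# Train CatBoost Classifier
cat_model = CatBoostClassifier(iterations=100, depth=6, learning_rate=0.1, verbose=False, random_state=42)
cat_model.fit(X_train_bc, y_train_bc)

# Predict and evaluate
y_pred_bc = cat_model.predict(X_test_bc)
f1 = f1_score(y_test_bc, y_pred_bc)
print("CatBoost Classifier F1-Score:", f1)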


In [None]:
#20.Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE).
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Train model
xgb_regressor = XGBRegressor(n_estimators=100, random_state=42)
xgb_regressor.fit(X_train, y_train)

# Predict and evaluate
y_pred = xgb_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("XGBoost Regressor MSE:", mse)


In [None]:
#21. Train an AdaBoost Classifier and visualize feature importance.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix, classification_report, log_loss
from xgboost import XGBClassifier, XGBRegressor, plot_importance
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification, make_regression

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and visualize AdaBoost feature importance
adaboost = AdaBoostClassifier(n_estimators=50, random_state=42)
adaboost.fit(X_train, y_train)
plt.bar(range(X.shape[1]), adaboost.feature_importances_)
plt.title("AdaBoost Feature Importance")
plt.show()




In [None]:
#22. Train a Gradient Boosting Regressor and plot learning curves.
# Train Gradient Boosting Regressor and plot learning curve
X_reg, y_reg = make_regression(n_samples=1000, n_features=20, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
gb_reg = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_reg.fit(X_train_reg, y_train_reg)
# train_score_[i] is the training loss after boosting stage i
plt.plot(gb_reg.train_score_, label="Training loss")
plt.xlabel("Boosting iteration")
plt.ylabel("Loss")
plt.legend()
plt.title("Gradient Boosting Learning Curve")
plt.show()

In [None]:
#23. Train an XGBoost Classifier and visualize feature importance.
# Train XGBoost Classifier and visualize feature importance
# (use_label_encoder is deprecated and no longer needed)
xgb = XGBClassifier(n_estimators=50, eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train)
plot_importance(xgb)
plt.show()

In [None]:
#24. Train a CatBoost Classifier and plot the confusion matrix.
# Train CatBoost Classifier and plot confusion matrix
catboost = CatBoostClassifier(n_estimators=50, verbose=0, random_state=42)
catboost.fit(X_train, y_train)
y_pred = catboost.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d")
plt.title("CatBoost Confusion Matrix")
plt.show()

In [None]:
#25.Train an AdaBoost Classifier with different numbers of estimators and compare accuracy.
# Train AdaBoost Classifier with different estimators and compare accuracy
n_estimators = [10, 50, 100, 200]
for n in n_estimators:
    clf = AdaBoostClassifier(n_estimators=n, random_state=42)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"AdaBoost with {n} estimators: Accuracy = {acc:.4f}")

In [None]:
#26. Train a Gradient Boosting Classifier and visualize the ROC curve.
# Train Gradient Boosting Classifier and visualize the ROC curve
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train, y_train)
y_scores = gb_clf.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.4f}")
plt.legend()
plt.title("Gradient Boosting ROC Curve")
plt.show()

In [None]:
#27. Train an XGBoost Regressor and tune the learning rate using GridSearchCV.
# Train XGBoost Regressor and tune learning rate using GridSearchCV
param_grid = {"learning_rate": [0.01, 0.1, 0.2, 0.3]}
grid = GridSearchCV(XGBRegressor(n_estimators=100, random_state=42), param_grid, cv=3)
grid.fit(X_train_reg, y_train_reg)
print(f"Best learning rate for XGBoost: {grid.best_params_}")


In [None]:
#28. Train a CatBoost Classifier on an imbalanced dataset and compare performance with class weighting.
# Train CatBoost Classifier on an imbalanced dataset and compare performance with class weighting
X_imb, y_imb = make_classification(n_samples=1000, weights=[0.9, 0.1], n_features=20, random_state=42)
X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(X_imb, y_imb, test_size=0.2, random_state=42)
# Baseline without class weights
catboost_plain = CatBoostClassifier(verbose=0, random_state=42)
catboost_plain.fit(X_train_imb, y_train_imb)
print("CatBoost Classification Report (Unweighted):")
print(classification_report(y_test_imb, catboost_plain.predict(X_test_imb)))
# Same model with the minority class weighted 10x
catboost_weighted = CatBoostClassifier(class_weights=[1, 10], verbose=0, random_state=42)
catboost_weighted.fit(X_train_imb, y_train_imb)
print("CatBoost Classification Report (Weighted):")
print(classification_report(y_test_imb, catboost_weighted.predict(X_test_imb)))

In [None]:
#29. Train an AdaBoost Classifier and analyze the effect of different learning rates.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define different learning rates to test
learning_rates = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0]

# Store results
accuracies = []

# Train AdaBoost models with different learning rates
for lr in learning_rates:
    model = AdaBoostClassifier(n_estimators=50, learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Learning Rate: {lr:.3f} - Accuracy: {acc:.4f}")

# Plot learning rate vs. accuracy
plt.figure(figsize=(8, 5))
plt.plot(learning_rates, accuracies, marker='o', linestyle='-')
plt.xlabel("Learning Rate")
plt.ylabel("Accuracy")
plt.title("Effect of Learning Rate on AdaBoost Accuracy")
plt.xscale("log")  # Log scale for better visualization
plt.grid(True)
plt.show()

In [None]:
#30. Train an XGBoost Classifier for multi-class classification and evaluate using log-loss.
# Train an XGBoost Classifier for multi-class classification and evaluate using log-loss
X_multi, y_multi = make_classification(n_samples=1000, n_classes=3, n_features=20, n_informative=15, random_state=42)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)
xgb_multi = XGBClassifier(n_estimators=50, objective="multi:softprob", eval_metric="mlogloss", random_state=42)
xgb_multi.fit(X_train_multi, y_train_multi)
y_pred_proba = xgb_multi.predict_proba(X_test_multi)
print(f"XGBoost Multi-class Log Loss: {log_loss(y_test_multi, y_pred_proba):.4f}")