
**Theoretical**

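Que 1: What is Boosting in Machine Learning?

Ans: Boosting is an ensemble technique that trains weak learners (models only slightly better than random guessing, such as shallow decision trees) sequentially, with each new learner concentrating on the errors of the ones before it. Their weighted combination forms a strong learner.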

Que 2: How does Boosting differ from Bagging?

Ans: Boosting trains models sequentially, each one correcting the errors of its predecessors, and mainly reduces bias. Bagging trains models independently (and therefore in parallel) on bootstrap samples of the data and averages or votes over their outputs, which mainly reduces variance.

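Que 3: What is the key idea behind AdaBoost?

Ans: The key idea of AdaBoost is to combine several weak classifiers into one strong classifier. After each round it reweights the training samples so that the next classifier concentrates on the samples misclassified so far, and it gives more accurate classifiers a larger vote in the final prediction.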

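Que 4: Explain the working of AdaBoost with an example.

Ans: AdaBoost starts by assigning equal weights to all training samples. After each weak learner is trained, the weights of misclassified samples are increased and those of correctly classified samples are decreased, and the weights are then renormalized. For example, under the classic discrete AdaBoost update with five samples weighted 0.2 each, if the first learner misclassifies one sample, that sample's weight rises to 0.5 and the other four drop to 0.125 each, so the next learner is pushed to classify it correctly.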
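
A minimal sketch of one reweighting round, assuming the classic discrete AdaBoost update with labels in {-1, +1}. The function name `adaboost_round` and the toy inputs (five equally weighted samples, one misclassified) are illustrative and do not come from any library.

```python
import math

def adaboost_round(weights, misclassified):
    """One discrete-AdaBoost reweighting step (illustrative helper).

    weights: current sample weights, summing to 1.
    misclassified: booleans, True where the weak learner was wrong.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the weak learner on the current distribution.
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # The learner's vote: larger when its weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight mistakes, down-weight correct samples, then renormalize.
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_round(weights, misclassified)

print(round(alpha, 3))                       # 0.693
print([round(w, 3) for w in new_weights])    # [0.125, 0.125, 0.5, 0.125, 0.125]
```

After the update, the misclassified sample carries half of the total weight, which is what forces the next weak learner to pay attention to it.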

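Que 5: What is Gradient Boosting, and how is it different from AdaBoost?

Ans: Gradient Boosting builds models sequentially, fitting each new model to the negative gradient of a differentiable loss function with respect to the current predictions (gradient descent in function space). AdaBoost instead reweights the training samples based on classification errors; it can be viewed as the special case of this scheme that uses exponential loss.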
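
The sequential fitting can be sketched in plain Python. This is a toy version, assuming squared-error loss (so the negative gradient is simply the residual), one-dimensional inputs, and single-split regression stumps as weak learners. The helpers `fit_stump` and `gradient_boost` and the data are made up for illustration.

```python
def fit_stump(x, targets):
    """Best single-threshold regression stump on 1-D inputs (squared error)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    """Additive model: constant baseline plus shrunken stumps."""
    base = sum(y) / len(y)
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

model = gradient_boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [model(xi) for xi in x])
print(boosted_mse < baseline_mse)  # True
```

Each round only has to explain what the previous rounds left unexplained, which is the contrast with AdaBoost's sample reweighting.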

Que 6: What is the loss function in Gradient Boosting?

Ans: Gradient Boosting is not tied to a single loss function. It works with any differentiable loss that measures the gap between actual and predicted values; common choices are squared error for regression and log loss for classification.
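
To make the role of differentiability concrete, here is a small sketch of these two losses and their negative gradients, which are the targets each new learner is fitted to. It assumes binary labels in {0, 1} and a raw score expressed in log-odds; the function names are illustrative.

```python
import math

def squared_error(y, pred):
    return 0.5 * (y - pred) ** 2

def squared_error_neg_grad(y, pred):
    # -d/dpred [0.5 * (y - pred)^2] = y - pred, i.e. the residual.
    return y - pred

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_log_loss(y, raw_score):
    # raw_score is a log-odds value; sigmoid turns it into a probability.
    p = sigmoid(raw_score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def log_loss_neg_grad(y, raw_score):
    # d/draw_score of the log loss is p - y, so the negative gradient is y - p.
    return y - sigmoid(raw_score)

print(squared_error_neg_grad(3.0, 2.5))  # 0.5
print(log_loss_neg_grad(1, 0.0))         # 0.5
```

In both cases the negative gradient has the form "actual minus predicted", which is why gradient boosting is often described as fitting residuals.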

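Que 7: How does XGBoost improve over traditional Gradient Boosting?

Ans: XGBoost adds L1 and L2 regularization to the training objective, uses second-order gradient information, parallelizes split finding within each tree, prunes splits whose gain is too small, and learns a default branch direction for missing values. In practice this usually makes it faster and often more accurate.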

Que 8: What is the difference between XGBoost and CatBoost?

Ans: XGBoost is efficient but traditionally expects categorical features to be encoded numerically beforehand (native categorical support is a more recent addition), while CatBoost handles categorical features natively and reduces overfitting through ordered boosting.

Que 9: What are some real-world applications of Boosting techniques?

Ans: Boosting is used in fraud detection, spam filtering, customer churn prediction, medical diagnosis, and ranking algorithms in search engines.

Que 10: How does regularization help in XGBoost?

Ans: Regularization in XGBoost controls model complexity and prevents overfitting by penalizing large leaf weights (L1 via `reg_alpha`, L2 via `reg_lambda`) and by requiring a minimum gain before a new leaf is added (`gamma`).
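
The effect of the L2 penalty and the per-leaf penalty can be sketched with the standard closed-form expressions for a tree's leaf weight and split gain. This is a simplified illustration that ignores the L1 term; `G` and `H` stand for the sums of first and second derivatives of the loss over the samples in a leaf, and the helper names are hypothetical.

```python
def leaf_weight(G, H, reg_lambda):
    # Optimal leaf value under an L2 penalty: larger reg_lambda shrinks it toward 0.
    return -G / (H + reg_lambda)

def leaf_score(G, H, reg_lambda):
    return G * G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma):
    # Improvement from splitting a node, minus the cost gamma of the extra leaf.
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return 0.5 * (children - parent) - gamma

print(leaf_weight(-4.0, 2.0, reg_lambda=0.0))  # 2.0
print(leaf_weight(-4.0, 2.0, reg_lambda=2.0))  # 1.0

print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=0.0))  # 8.0
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=2.0, gamma=1.0))  # 3.0
```

A split is only worth keeping when its gain is positive, so raising `reg_lambda` or `gamma` makes the tree more conservative.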

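Que 11: What are some hyperparameters to tune in Gradient Boosting models?

Ans: Key hyperparameters include the learning rate, number of estimators, maximum tree depth, subsample fraction, and loss function. The learning rate and the number of estimators trade off against each other: a smaller learning rate typically needs more estimators to reach the same fit.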
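
As a rough illustration, a search space over these hyperparameters might look like the dictionary below. The keys follow scikit-learn's `GradientBoostingClassifier` parameter names; the candidate values are arbitrary assumptions rather than recommendations. Expanding the grid shows how quickly the number of combinations grows.

```python
from itertools import product

# Illustrative search space; the values are placeholders, not tuned settings.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

# Expand the grid into one dict per candidate configuration.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

print(len(candidates))  # 16
print(candidates[0])    # {'learning_rate': 0.01, 'n_estimators': 100, 'max_depth': 2, 'subsample': 0.8}
```

A dictionary of this shape is what `GridSearchCV` accepts as its parameter grid, so every extra value multiplies the number of models to train.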

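Que 12: What is the concept of Feature Importance in Boosting?

Ans: Feature Importance indicates how much a feature contributes to the predictive power of the model. In tree-based boosting it is typically derived from how often a feature is used for splits or how much those splits improve the loss, and it helps with feature selection and interpretation.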
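
One common way to compute such a score is to add up the loss improvement (gain) of every split that uses a feature, then normalize so the scores sum to 1. The sketch below assumes the splits have already been extracted as `(feature, gain)` pairs; the feature names and gain values are invented for illustration.

```python
from collections import defaultdict

def gain_importance(splits):
    """Normalize the total split gain per feature so the scores sum to 1."""
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    total_gain = sum(totals.values())
    return {feature: gain / total_gain for feature, gain in totals.items()}

# Hypothetical splits collected from all trees in an ensemble.
splits = [("age", 6.0), ("income", 3.0), ("age", 2.0), ("city", 1.0)]
importance = gain_importance(splits)

for feature, score in sorted(importance.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature}: {score:.3f}")
```

Here `age` comes out on top because it is used twice and accounts for 8 of the 12 units of total gain.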

Que 13: Why is CatBoost efficient for categorical data?

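Ans: CatBoost handles categorical variables with ordered target statistics: each categorical value is replaced by a target-based statistic computed only from the samples that precede it in a random permutation of the data. Combined with ordered boosting, this limits target leakage and usually removes the need for manual one-hot encoding.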
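
A simplified sketch of the ordered, permutation-driven encoding idea: each row's category is encoded from the targets of earlier rows only, smoothed by a prior. It assumes the rows are already in a random permutation order. The function name and the `prior` and `prior_weight` parameters are illustrative, and CatBoost's actual implementation is more elaborate (for example, it uses several permutations).

```python
def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each category using only the targets of rows seen before it."""
    sums = {}
    counts = {}
    encoded = []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of earlier targets; with no history it equals the prior.
        encoded.append((seen_sum + prior * prior_weight) / (seen_count + prior_weight))
        # Only now does the current row's target become visible to later rows.
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded

categories = ["red", "blue", "red", "red", "blue"]
targets = [1, 0, 1, 0, 1]
encoded = ordered_target_stats(categories, targets)
print([round(value, 3) for value in encoded])  # [0.5, 0.5, 0.75, 0.833, 0.25]
```

Because a row never sees its own target, the encoded value cannot leak the label it is later used to predict.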

**Practical**

Que 14: Train an AdaBoost Classifier on a sample dataset and print model accuracy

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = AdaBoostClassifier(n_estimators=50)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))


Que 15: Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE)

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = AdaBoostRegressor(n_estimators=50)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))


Que 16: Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(n_estimators=100)
model.fit(X_train, y_train)

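importances = model.feature_importances_

# Print features from most to least important
ranked = sorted(zip(data.feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")

plt.barh(data.feature_names, importances)
plt.xlabel("Feature Importance")
plt.show()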


Que 17: Train a Gradient Boosting Regressor and evaluate using R-Squared Score

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = GradientBoostingRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("R2 Score:", r2_score(y_test, y_pred))


Que 18: Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting

In [None]:
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

xgb = XGBClassifier(eval_metric='logloss')
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)

print("XGBoost Accuracy:", accuracy_score(y_test, xgb_pred))
print("Gradient Boosting Accuracy:", accuracy_score(y_test, gb_pred))


Que 19: Train a CatBoost Classifier and evaluate using F1-Score

In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, n_features=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = CatBoostClassifier(verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("F1 Score:", f1_score(y_test, y_pred))


Que 20: Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE)

In [None]:
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = XGBRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))


Que 21: Train an AdaBoost Classifier and visualize feature importance

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, n_features=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = AdaBoostClassifier(n_estimators=100)
model.fit(X_train, y_train)

plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.xlabel("Feature Index")
plt.ylabel("Importance")
plt.title("AdaBoost Feature Importance")
plt.show()


Que 22: Train a Gradient Boosting Regressor and plot learning curves

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)
model = GradientBoostingRegressor()

train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5)

train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)

plt.plot(train_sizes, train_scores_mean, label="Training score")
plt.plot(train_sizes, test_scores_mean, label="Cross-validation score")
plt.xlabel("Training Size")
plt.ylabel("R² Score")
plt.title("Learning Curve")
plt.legend()
plt.show()


Que 23: Train an XGBoost Classifier and visualize feature importance

In [None]:
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, n_features=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)

plot_importance(model)
plt.title("XGBoost Feature Importance")
plt.show()


Que 24: Train a CatBoost Classifier and plot the confusion matrix

In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, n_features=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = CatBoostClassifier(verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()


Que 25: Train an AdaBoost Classifier with different numbers of estimators and compare accuracy

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

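X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Same data and seed for every run, so only n_estimators varies
for n in [10, 50, 100]:
    model = AdaBoostClassifier(n_estimators=n, random_state=42)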
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Accuracy with {n} estimators:", accuracy_score(y_test, y_pred))


Que 26: Train a Gradient Boosting Classifier and visualize the ROC curve

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, n_features=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = GradientBoostingClassifier()
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:,1]

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='ROC Curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()


Que 27: Train an XGBoost Regressor and tune the learning rate using GridSearchCV

In [None]:
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2]}
model = XGBRegressor()
grid = GridSearchCV(model, param_grid, cv=3)
grid.fit(X_train, y_train)

print("Best learning rate:", grid.best_params_['learning_rate'])


Que 28: Train a CatBoost Classifier on an imbalanced dataset and compare performance with class weighting

In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

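X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], n_classes=2, random_state=42)
# Stratify so the minority class keeps roughly its 10% share in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)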

# Without class weights
model1 = CatBoostClassifier(verbose=0)
model1.fit(X_train, y_train)
print("Without class weights:")
print(classification_report(y_test, model1.predict(X_test)))

# With class weights
model2 = CatBoostClassifier(class_weights=[1, 10], verbose=0)
model2.fit(X_train, y_train)
print("With class weights:")
print(classification_report(y_test, model2.predict(X_test)))


Que 29: Train an AdaBoost Classifier and analyze the effect of different learning rates

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

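X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Same data and seed for every run, so only the learning rate varies
for lr in [0.01, 0.1, 0.5, 1]:
    model = AdaBoostClassifier(learning_rate=lr, random_state=42)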
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Accuracy with learning rate {lr}:", accuracy_score(y_test, y_pred))


Que 30: Train an XGBoost Classifier for multi-class classification and evaluate using log-loss

In [None]:
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3, n_informative=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = XGBClassifier(objective='multi:softprob', num_class=3, eval_metric='mlogloss')
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)

print("Log Loss:", log_loss(y_test, y_pred_proba))
