## Boosting_Assignment1


Q1. What is boosting in machine learning?   
Q2. What are the advantages and limitations of using boosting techniques?   
Q3. Explain how boosting works.   
Q4. What are the different types of boosting algorithms?   
Q5. What are some common parameters in boosting algorithms?   
Q6. How do boosting algorithms combine weak learners to create a strong learner?   
Q7. Explain the concept of AdaBoost algorithm and its working.   
Q8. What is the loss function used in AdaBoost algorithm?     
Q9. How does the AdaBoost algorithm update the weights of misclassified samples?   
Q10. What is the effect of increasing the number of estimators in AdaBoost algorithm?   

### Q1. What is boosting in machine learning?

Boosting is a machine learning ensemble technique that combines multiple weak learners to create a strong learner. It works by sequentially training a series of weak learners, with each subsequent learner focusing more on the instances that were misclassified by the previous learners. This sequential learning process allows boosting to correct errors made by earlier models, ultimately leading to improved predictive performance. Popular boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.

### Q2. What are the advantages and limitations of using boosting techniques?


Advantages:
1. Improved Accuracy: Boosting algorithms often yield higher accuracy compared to individual weak learners by leveraging the strengths of multiple models.
2. Handling Complex Relationships: Boosting can effectively capture complex relationships between features and target variables, making it suitable for various types of datasets.
3. Built-in Regularization: Implementations such as Gradient Boosting and XGBoost expose regularization controls (shrinkage, subsampling, tree constraints, L1/L2 penalties) that help limit overfitting when tuned, improving generalization to unseen data.
4. Feature Importance: Boosting algorithms provide feature importance scores, allowing users to identify the most influential features in the predictive model.
5. Versatility: Boosting algorithms can be applied to a wide range of machine learning tasks, including classification, regression, and ranking.

Limitations:
1. Sensitivity to Noisy Data: Boosting algorithms can be sensitive to noisy data and outliers, leading to degraded performance if the dataset contains significant noise.
2. Computationally Intensive: Boosting algorithms often require more computational resources and training time compared to simpler models due to the sequential nature of training multiple weak learners.
3. Parameter Tuning: Boosting algorithms involve tuning various hyperparameters, such as learning rate, tree depth, and the number of estimators, which can be challenging and time-consuming.
4. Vulnerability to Overfitting: While boosting algorithms aim to reduce overfitting, they can still overfit the training data if not properly tuned or if the dataset is too small.
5. Interpretability: Boosting models are generally less interpretable compared to simpler models like decision trees, making it challenging to explain the reasoning behind predictions.

Overall, while boosting techniques offer significant advantages in terms of predictive performance and robustness, they also come with certain limitations that need to be considered when applying them to real-world problems.

### Q3. Explain how boosting works.

Boosting is an ensemble learning technique that combines multiple weak learners sequentially to create a strong predictive model. The primary idea behind boosting is to sequentially train weak learners, where each subsequent model corrects the errors made by the previous ones. Here's how boosting works step by step:

1. **Initialize Weights**: In the beginning, each data point in the training set is assigned an equal weight.

2. **Train Weak Learner**: A weak learner, often a simple decision tree with limited depth (a stump), is trained on the dataset. The weak learner's goal is to minimize the errors in predictions.

3. **Compute Error**: After training the weak learner, the algorithm computes its weighted error, i.e. the total weight of the data points it misclassified. This error also determines how much say the learner gets in the final model.

4. **Adjust Weights**: The weights of the misclassified data points are increased and the weights of the correctly classified points are decreased, making the hard cases more influential in the next iteration. This emphasizes the importance of correcting the errors made by the previous model.

5. **Train Next Weak Learner**: Another weak learner is trained on the same dataset, giving more weight to the previously misclassified data points. The new weak learner aims to correct the errors made by the previous model.

6. **Repeat**: Steps 3 to 5 are repeated iteratively for a predefined number of iterations or until the model achieves a desired level of performance.

7. **Combine Weak Learners**: Finally, all weak learners are combined to form a strong predictive model. The final prediction is typically obtained by taking a weighted average of the predictions from each weak learner, where the weights are determined based on the performance of each model.

By iteratively training weak learners and focusing on correcting the errors made by the previous models, boosting effectively improves the overall predictive performance of the model. Popular boosting algorithms include AdaBoost, Gradient Boosting, XGBoost, and LightGBM.
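
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes binary labels in {-1, +1}, a single feature, threshold stumps as weak learners, and the classic AdaBoost-style weighting. The helper names (`fit_stump`, `stump_predict`, `boost`, `predict`) and the toy data are made up for this example.

```python
import numpy as np

def stump_predict(x, thr, polarity):
    # Predict `polarity` on the left of the threshold and `-polarity` on the right.
    return np.where(x <= thr, polarity, -polarity)

def fit_stump(x, y, w):
    # Exhaustively search for the threshold/polarity with the lowest weighted error.
    best = (None, 1, np.inf)
    for thr in np.unique(x):
        for polarity in (1, -1):
            err = np.sum(w[stump_predict(x, thr, polarity) != y])
            if err < best[2]:
                best = (thr, polarity, err)
    return best

def boost(x, y, n_rounds=3):
    w = np.full(len(x), 1.0 / len(x))             # step 1: equal weights
    ensemble = []
    for _ in range(n_rounds):                      # step 6: repeat
        thr, pol, err = fit_stump(x, y, w)         # steps 2 and 5: train a weak learner
        err = np.clip(err, 1e-10, 1 - 1e-10)       # step 3: weighted error (clipped for safety)
        alpha = 0.5 * np.log((1 - err) / err)      # say of this learner in the final vote
        pred = stump_predict(x, thr, pol)
        w = w * np.exp(-alpha * y * pred)          # step 4: raise weights of mistakes
        w /= w.sum()                               # keep the weights summing to 1
        ensemble.append((alpha, thr, pol))
    return ensemble

def predict(ensemble, x):
    # Step 7: weighted vote of all weak learners.
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return np.sign(score)

# Toy data: no single threshold separates the classes.
x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

ensemble = boost(x, y, n_rounds=3)
print("training accuracy:", np.mean(predict(ensemble, x) == y))
```

On this toy set the best single stump misclassifies three of the ten points, while the three-stump ensemble classifies all of them correctly.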

### Q4. What are the different types of boosting algorithms?

There are several types of boosting algorithms, each with its own characteristics and approaches to improving model performance. Some of the most common types of boosting algorithms include:

1. **AdaBoost (Adaptive Boosting)**: AdaBoost is one of the earliest and most popular boosting algorithms. It works by iteratively training a series of weak learners, where each subsequent learner focuses more on the instances that were misclassified by the previous ones. AdaBoost assigns higher weights to misclassified instances and lower weights to correctly classified instances, allowing subsequent weak learners to prioritize the difficult-to-classify examples.

2. **Gradient Boosting**: Gradient Boosting builds a sequence of decision trees, with each tree attempting to correct the errors made by the previous ones. Rather than re-weighting instances as AdaBoost does, it directly minimizes a differentiable loss function (e.g., mean squared error for regression or log loss for classification). Each new tree is fitted to the negative gradient of the loss with respect to the current predictions (the residuals, in the squared-error case), which amounts to gradient descent in function space.

3. **XGBoost (Extreme Gradient Boosting)**: XGBoost is an optimized implementation of Gradient Boosting, known for its efficiency and performance. It introduces additional regularization techniques to prevent overfitting and improve generalization. XGBoost uses a more advanced optimization algorithm, incorporates tree pruning, and supports parallel processing, making it one of the most widely used boosting algorithms for both classification and regression tasks.

4. **LightGBM (Light Gradient Boosting Machine)**: LightGBM is another efficient and scalable implementation of Gradient Boosting, developed by Microsoft. It uses histogram-based split finding for faster training and lower memory usage, grows trees leaf-wise rather than level-wise, and introduces Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to reduce the number of rows and features examined. It can also handle categorical features natively. LightGBM is particularly well-suited for large-scale datasets and has gained popularity in various machine learning competitions and real-world applications.

5. **CatBoost (Categorical Boosting)**: CatBoost is a boosting algorithm developed by Yandex, designed to handle categorical features without extensive preprocessing. It encodes categorical variables with ordered target statistics, uses ordered boosting to reduce the target leakage such encodings can cause, and can build feature combinations automatically, making it suitable for a wide range of applications.

These are just a few examples of boosting algorithms, and there are many other variations and implementations available. Each algorithm has its own strengths and weaknesses, and the choice of algorithm often depends on the specific requirements of the problem at hand, the size and nature of the dataset, and computational resources available.

### Q5. What are some common parameters in boosting algorithms?

Boosting algorithms typically have a set of common parameters that influence the behavior and performance of the model. Some of the common parameters include:

1. **Number of Trees (n_estimators)**: This parameter specifies the number of weak learners (trees) to be used in the ensemble. Increasing the number of trees may lead to better performance, but it also increases computational cost and the risk of overfitting.

2. **Learning Rate (or Shrinkage)**: The learning rate controls the contribution of each tree to the final prediction. A lower learning rate requires more trees to achieve the same level of accuracy but may improve generalization. It helps to prevent overfitting by shrinking the contribution of each tree.

3. **Tree Depth (max_depth)**: This parameter specifies the maximum depth of each tree in the ensemble. Deeper trees can capture more complex patterns in the data but are more prone to overfitting. Limiting the tree depth helps control model complexity and improves generalization.

4. **Minimum Samples Split (min_samples_split)**: This parameter specifies the minimum number of samples required to split an internal node. It helps prevent the algorithm from creating nodes with too few samples, which can lead to overfitting.

5. **Minimum Samples Leaf (min_samples_leaf)**: This parameter specifies the minimum number of samples required to be at a leaf node. It helps control the size of the trees and prevents overfitting by requiring a certain number of samples in each leaf.

6. **Subsample (or Bagging Fraction)**: Subsample controls the fraction of samples used for training each tree. Setting a value less than 1.0 enables stochastic gradient boosting, where each tree is trained on a random subset of the training data. This can help improve generalization and reduce overfitting.

7. **Feature Subsampling (colsample_bytree)**: This parameter, named `colsample_bytree` in XGBoost and LightGBM, controls the fraction of features (columns) sampled for each tree. Scikit-learn's `max_features` plays a similar role but is applied at each split. Like row subsampling, it introduces randomness into the model and can reduce overfitting by limiting the features each tree can use.

8. **Regularization Parameters (reg_alpha, reg_lambda)**: In libraries such as XGBoost, these parameters control L1 and L2 regularization, respectively. They penalize large leaf weights (the values predicted at the leaves), which reduces model complexity and improves generalization.

These are just a few examples of common parameters in boosting algorithms. The optimal values for these parameters depend on the specific dataset and problem at hand and are often determined through experimentation and cross-validation.
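
The trade-off between learning rate and number of trees can be made concrete with a small back-of-the-envelope calculation. The sketch below rests on an idealised assumption that is not true of real trees: every weak learner fits the current residual exactly, so each round removes a fraction `learning_rate` of whatever error is left. The function name `rounds_needed` is hypothetical.

```python
import math

def rounds_needed(learning_rate, tolerance):
    # After m rounds the remaining residual fraction is (1 - learning_rate) ** m
    # (idealised: each learner fits the residual perfectly).
    # Solve (1 - learning_rate) ** m <= tolerance for the smallest integer m.
    return math.ceil(math.log(tolerance) / math.log(1.0 - learning_rate))

for lr in (0.5, 0.1, 0.01):
    print(f"learning_rate={lr}: {rounds_needed(lr, tolerance=0.01)} rounds")
# learning_rate=0.5: 7 rounds
# learning_rate=0.1: 44 rounds
# learning_rate=0.01: 459 rounds
```

Even in this simplified setting, dividing the learning rate by ten multiplies the number of rounds needed by roughly ten, which is why `n_estimators` and the learning rate are usually tuned together.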

### Q6. How do boosting algorithms combine weak learners to create a strong learner?

Boosting algorithms combine multiple weak learners to create a strong learner through an iterative process. Here's how it typically works:

1. **Sequential Training**: Boosting algorithms train a sequence of weak learners (usually decision trees) sequentially. Each tree is trained using the information from the previous trees in the ensemble.

2. **Focus on Errors**: During each iteration, the boosting algorithm assigns higher weights to the instances that were misclassified by the previous weak learners. This allows subsequent weak learners to focus more on the difficult-to-classify instances.

3. **Weighted Voting**: After each weak learner is trained, it contributes to the final prediction by assigning it a weight based on its performance. Weak learners that perform well on the training data are given higher weights, indicating their importance in the ensemble.

4. **Aggregate Predictions**: The predictions from all weak learners are combined using a weighted sum or a voting mechanism to produce the final prediction of the ensemble. Typically, each weak learner's prediction is weighted by its performance during training.

5. **Adaptive Learning**: Boosting algorithms adaptively adjust the weights of training instances at each iteration, focusing more on the instances that are difficult to classify correctly. This iterative process gradually improves the overall performance of the ensemble.

By iteratively combining weak learners and focusing on the errors made by previous learners, boosting algorithms can create a strong learner that achieves high accuracy on the training data and, when properly regularized, generalizes well to unseen data. Boosting mainly reduces bias relative to the individual weak learners, and with shrinkage or subsampling it can keep variance in check as well, which makes it effective for a wide range of classification and regression tasks.
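
To illustrate why the vote is weighted, here is a small sketch with made-up numbers: three weak learners predict a label in {-1, +1} for one instance, and each has a hypothetical weight reflecting its training performance. The function names are assumptions for this example.

```python
def weighted_vote(predictions, alphas):
    # Sum each learner's vote scaled by its weight, then take the sign.
    score = sum(a * h for a, h in zip(alphas, predictions))
    return 1 if score >= 0 else -1

def majority_vote(predictions):
    # Unweighted vote: every learner counts the same.
    return 1 if sum(predictions) >= 0 else -1

preds = [+1, -1, -1]        # votes of three weak learners for one instance
alphas = [1.2, 0.4, 0.5]    # hypothetical learner weights (higher = more accurate)

print(majority_vote(preds))          # -1: two of three learners say -1
print(weighted_vote(preds, alphas))  # +1: the most reliable learner outweighs the other two
```

The single accurate learner overrules the two weaker ones because its weight (1.2) exceeds their combined weight (0.9).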

### Q7. Explain the concept of AdaBoost algorithm and its working.

AdaBoost, short for Adaptive Boosting, is a popular boosting algorithm used for classification tasks. It works by combining multiple weak learners (typically decision trees) to create a strong learner. Here's how AdaBoost works:

1. **Initialization**: At the beginning, each training instance is assigned an equal weight. These weights are used to give more importance to instances that were misclassified by the previous weak learners.

2. **Sequential Training**: AdaBoost trains the weak learners (decision trees) one after another. In each iteration, the learner is fitted using the current instance weights, so it focuses more on the instances that were misclassified by the previous weak learners.

3. **Weighted Error**: After each weak learner is trained, it predicts the class labels for all instances in the training data. The algorithm then calculates the weighted error rate, which is the sum of the (normalized) weights of the misclassified instances.

4. **Classifier Weight**: Based on the weighted error rate, AdaBoost assigns a weight to each weak learner. Weak learners with lower error rates are given higher weights, indicating their importance in the final ensemble.

5. **Update Instance Weights**: AdaBoost updates the weights of the training instances. It increases the weights of misclassified instances, forcing the next weak learner to focus more on them.

6. **Repeat**: AdaBoost repeats steps 2 to 5 until a predefined number of weak learners is reached or until a desired level of accuracy is achieved. Each weak learner is trained on a re-weighted version of the training data, in which the weights of the previously misclassified instances have been increased.

7. **Final Prediction**: Once all weak learners are trained, AdaBoost combines their predictions using a weighted vote. The weights of the weak learners determine their influence on the final prediction, so learners with lower weighted error count for more.

By combining multiple weak learners in a sequential manner and focusing on the instances that are difficult to classify, AdaBoost creates a strong ensemble classifier that often outperforms individual weak learners. The final model is often accurate and, on clean data, fairly resistant to overfitting, although it remains sensitive to noisy labels and outliers. This makes AdaBoost a powerful algorithm for classification tasks.
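
Step 4 (the classifier weight) can be made concrete. In the classic binary formulation of AdaBoost, a learner with weighted error $\epsilon$ receives the weight $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$. Library implementations may scale this differently, so treat the sketch below as an illustration of the shape of the relationship. The function name `learner_weight` is hypothetical.

```python
import math

def learner_weight(weighted_error):
    # Classic binary AdaBoost: alpha = 0.5 * ln((1 - eps) / eps), valid for 0 < eps < 1.
    return 0.5 * math.log((1.0 - weighted_error) / weighted_error)

for eps in (0.1, 0.3, 0.5):
    print(f"error={eps}: alpha={learner_weight(eps):.4f}")
# error=0.1: alpha=1.0986
# error=0.3: alpha=0.4236
# error=0.5: alpha=0.0000
```

A learner that is no better than random guessing (error 0.5) gets zero weight, and the weight grows quickly as the error approaches zero.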

### Q8. What is the loss function used in AdaBoost algorithm?

In the AdaBoost (Adaptive Boosting) algorithm, the loss function used is the exponential loss.

The exponential loss function is defined as:

$$ L(y, f(x)) = e^{-y \cdot f(x)} $$

Where:
- $y$ is the true label of the instance ($y = +1$ for the positive class, $y = -1$ for the negative class).
- $f(x)$ is the real-valued score the ensemble assigns to the instance $x$, i.e. the weighted sum of the weak learners' predictions.

The exponential loss function assigns higher penalties to misclassifications, especially for instances that are misclassified with high confidence. It amplifies the impact of misclassifications, causing the boosting algorithm to focus more on correcting these errors in subsequent iterations.

The exponential loss function is particularly effective in AdaBoost because it allows the algorithm to assign higher weights to misclassified instances in each iteration, ensuring that subsequent weak learners focus more on correcting these mistakes. As a result, AdaBoost can progressively improve the model's performance and achieve higher accuracy.
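
A few evaluations show how sharply the exponential loss punishes confident mistakes. The scores below are made-up values for a single instance with true label $y = +1$, and the function name is an assumption for this example.

```python
import math

def exponential_loss(y, score):
    # y is the true label in {-1, +1}; score is the real-valued ensemble output f(x).
    return math.exp(-y * score)

for score in (2.0, 0.0, -2.0):
    print(f"f(x)={score:+.1f}: loss={exponential_loss(+1, score):.4f}")
# f(x)=+2.0: loss=0.1353   (confident and correct)
# f(x)=+0.0: loss=1.0000   (undecided)
# f(x)=-2.0: loss=7.3891   (confident and wrong)
```

The loss depends only on the margin $y \cdot f(x)$: it decays towards zero for large positive margins and grows exponentially for negative ones.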

### Q9. How does the AdaBoost algorithm update the weights of misclassified samples?

In the AdaBoost algorithm, the weights of misclassified samples are updated in each iteration to prioritize the correct classification of these instances in subsequent iterations. The process of updating weights involves the following steps:

1. **Initialize Weights**: Initially, all training samples are assigned equal weights. These weights determine the importance of each sample in the training process.

2. **Train Weak Learner**: In each iteration, a weak learner (e.g., decision stump) is trained on the training data. The weak learner aims to minimize the weighted error rate under the current sample weights.

3. **Calculate Weighted Error Rate**: After training the weak learner, its performance is evaluated on the training data. The weighted error rate is the sum of the current weights of the misclassified samples. From it AdaBoost derives the learner's own weight, which also sets the size of the weight update in the next step.

4. **Update Sample Weights**: The weights of the misclassified samples are updated to increase their importance in the subsequent iteration. The update formula for the weight of each sample depends on its misclassification status:
   - For correctly classified samples: Their weights are decreased to de-emphasize them in the next iteration.
   - For misclassified samples: Their weights are increased to prioritize them in the next iteration.

5. **Normalize Weights**: After updating the weights, they are normalized to ensure that they sum up to 1. This normalization step maintains the relative importance of the samples while preventing the weights from growing too large or too small.

6. **Repeat Iterations**: Steps 2-5 are repeated for a specified number of iterations or until a predefined stopping criterion is met (e.g., achieving a desired level of accuracy).

By updating the weights of misclassified samples in each iteration, AdaBoost focuses on difficult-to-classify instances, making subsequent weak learners pay more attention to these samples. This adaptive boosting of misclassified samples helps improve the overall performance of the ensemble model.
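
One round of this update can be traced numerically. The sketch assumes the classic binary AdaBoost rule, $w_i \leftarrow w_i \, e^{-\alpha \, y_i \, h(x_i)} / Z$ with $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ and $Z$ the normalizing constant. The labels and the stump output `pred` are made up for illustration.

```python
import numpy as np

y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])          # true labels
pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])    # hypothetical stump output: 3 mistakes
w = np.full(len(y), 1.0 / len(y))                          # step 1: equal weights

miss = pred != y
eps = w[miss].sum()                          # step 3: weighted error (0.3 here)
alpha = 0.5 * np.log((1 - eps) / eps)        # weight of this weak learner

w_new = w * np.exp(-alpha * y * pred)        # step 4: up-weight mistakes, down-weight hits
w_new /= w_new.sum()                         # step 5: normalize so the weights sum to 1

print("misclassified:", np.round(w_new[miss], 4))
print("correct:      ", np.round(w_new[~miss], 4))
print("total weight on mistakes:", round(w_new[miss].sum(), 4))
```

With this rule the misclassified samples end up carrying half of the total weight after normalization, so the learner that was just trained would perform no better than chance on the re-weighted data. This forces the next learner to find something new.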

### Q10. What is the effect of increasing the number of estimators in AdaBoost algorithm?

Increasing the number of estimators (weak learners) in the AdaBoost algorithm typically has the following effects:

1. **Improved Performance**: Generally, increasing the number of estimators leads to improved performance of the AdaBoost ensemble. With more weak learners, the ensemble can capture more complex patterns in the data, resulting in better generalization and lower bias.

2. **Reduced Bias**: As the number of estimators increases, the AdaBoost ensemble becomes more flexible and capable of fitting the training data more closely. This reduction in bias allows the model to capture intricate relationships between features and the target variable, leading to a better approximation of the true underlying function.

3. **Potential Overfitting**: While increasing the number of estimators can enhance model performance on the training data, there is a risk of overfitting, especially if the dataset is small or noisy. As the model becomes more complex, it may start to memorize noise or outliers in the training data, resulting in decreased generalization performance on unseen data.

4. **Slower Training**: Training a larger number of weak learners can significantly increase the computational time required to fit the AdaBoost ensemble. Each additional estimator adds computational overhead, as the algorithm iteratively updates sample weights and trains new weak learners in each iteration.

5. **Diminishing Returns**: There may be diminishing returns in terms of performance improvement with each additional estimator beyond a certain point. After reaching a sufficient number of estimators, further increasing their count may lead to marginal improvements or even degradation in performance due to overfitting or increased computational complexity.

In summary, while increasing the number of estimators in the AdaBoost algorithm can enhance performance and reduce bias, it is essential to monitor for signs of overfitting and consider the trade-offs between model complexity, computational resources, and generalization performance. Cross-validation and model evaluation on a separate validation set can help determine the optimal number of estimators for a given dataset.

-------------------------------------------------------------------

## Boosting_Assignment_2

Q1. What is Gradient Boosting Regression?  
Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a
simple regression problem as an example and train the model on a small dataset. Evaluate the model's
performance using metrics such as mean squared error and R-squared.   
Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to
optimise the performance of the model. Use grid search or random search to find the best
hyperparameters    
Q4. What is a weak learner in Gradient Boosting?   
Q5. What is the intuition behind the Gradient Boosting algorithm?   
Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?  
Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting
algorithm?  

### Q1. What is Gradient Boosting Regression?

Gradient Boosting Regression is a machine learning technique used for regression tasks, where the goal is to predict continuous numerical values. It is an ensemble learning method that builds a strong predictive model by combining multiple weak predictive models, typically decision trees, in a sequential manner. 

In Gradient Boosting Regression, the model is built in an iterative fashion. Initially, a simple model is created to make predictions on the training data. In subsequent iterations, additional models, called "weak learners," are trained to correct the errors made by the previous models. Each new model focuses on learning from the residuals (the differences between the actual target values and the predictions made by the previous models) in order to reduce the overall error of the ensemble.

The "gradient" in Gradient Boosting refers to minimizing the loss function by gradient descent in the space of predictions. In each iteration, the algorithm calculates the gradients of the loss function with respect to the predictions of the current ensemble, and then fits a new model to the negative gradients to update the predictions. For squared-error loss, the negative gradients are exactly the residuals. This process is repeated for a specified number of iterations or until a stopping criterion is met.

Overall, Gradient Boosting Regression is known for its high predictive accuracy, and with shrinkage, subsampling, and a sensible number of trees it can be kept from overfitting. It is widely used in various regression tasks, such as predicting house prices, stock prices, and customer churn rates.
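
The link between residuals and gradients can be checked numerically. The sketch below assumes the squared-error loss $L = \frac{1}{2}(y - F)^2$ and a toy target vector, and, purely for illustration, a weak learner that reproduces the residuals exactly. All variable names are made up for this example.

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # toy targets
F0 = np.full_like(y, y.mean())              # initial model: predict the mean (6.0)
residuals = y - F0                          # [-4, -2, 0, 2, 4]

def loss(F):
    # Per-sample squared-error loss.
    return 0.5 * (y - F) ** 2

# Central finite difference of the loss with respect to each prediction.
h = 1e-6
num_grad = (loss(F0 + h) - loss(F0 - h)) / (2 * h)
print(np.allclose(-num_grad, residuals, atol=1e-6))   # negative gradient == residuals

# One boosting step with learning rate 0.1 and an idealised learner that
# outputs the residuals exactly.
F1 = F0 + 0.1 * residuals
mse_before = np.mean((y - F0) ** 2)
mse_after = np.mean((y - F1) ** 2)
print(mse_before, mse_after)
```

Each such step shrinks every residual by the factor $(1 - 0.1)$, so the mean squared error drops from 8.0 to about 6.48 after one round.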

### Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a simple regression problem as an example and train the model on a small dataset. Evaluate the model's performance using metrics such as mean squared error and R-squared.

In [11]:
import numpy as np

class DecisionTreeRegressor:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    def fit(self, X, y):
        self.n_features_ = X.shape[1]
        self.tree_ = self._build_tree(X, y)

    def _build_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
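        # max_depth=None means "no depth limit"; comparing None with an int would raise.
        depth_reached = self.max_depth is not None and depth >= self.max_depth
        if depth_reached or n_samples <= 1: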
            return np.mean(y)

        feature_idxs = np.random.choice(n_features, min(n_features, int(np.sqrt(n_features))), replace=False)
        best_feature, best_threshold, best_gain = None, None, -np.inf

        for feature_idx in feature_idxs:
            thresholds = np.unique(X[:, feature_idx])
            for threshold in thresholds:
                left_indices = np.where(X[:, feature_idx] <= threshold)[0]
                right_indices = np.where(X[:, feature_idx] > threshold)[0]
                if len(left_indices) == 0 or len(right_indices) == 0:
                    continue
                gain = self._compute_gain(y, y[left_indices], y[right_indices])
                if gain > best_gain:
                    best_feature = feature_idx
                    best_threshold = threshold
                    best_gain = gain

        if best_gain == -np.inf:
            return np.mean(y)

        left_indices = np.where(X[:, best_feature] <= best_threshold)[0]
        right_indices = np.where(X[:, best_feature] > best_threshold)[0]
        left_subtree = self._build_tree(X[left_indices], y[left_indices], depth + 1)
        right_subtree = self._build_tree(X[right_indices], y[right_indices], depth + 1)

        return {'feature': best_feature, 'threshold': best_threshold,
                'left': left_subtree, 'right': right_subtree}

    def _compute_gain(self, parent, left_child, right_child):
        total_samples = len(parent)
        p_left = len(left_child) / total_samples
        p_right = len(right_child) / total_samples
        return self._mse(parent) - (p_left * self._mse(left_child) + p_right * self._mse(right_child))

    def _mse(self, y):
        return np.mean((y - np.mean(y))**2)

    def predict(self, X):
        return np.array([self._predict_tree(x, self.tree_) for x in X])

    def _predict_tree(self, x, tree):
        if isinstance(tree, dict):
            if x[tree['feature']] <= tree['threshold']:
                return self._predict_tree(x, tree['left'])
            else:
                return self._predict_tree(x, tree['right'])
        else:
            return tree

class GradientBoostingRegressor:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.models = []

    def fit(self, X, y):
        # Start from a constant model: the mean of the targets.
        self.base_prediction = np.mean(y)
        self.models = []  # reset so that calling fit() twice does not stack ensembles
        current_prediction = np.full(len(y), self.base_prediction, dtype=float)

        for _ in range(self.n_estimators):
            # For squared-error loss the negative gradient is the residual.
            residuals = y - current_prediction
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # Shrink each tree's contribution by the learning rate.
            current_prediction += self.learning_rate * tree.predict(X)
            self.models.append(tree)

    def predict(self, X):
        # Base prediction plus the shrunken contribution of every tree.
        predictions = np.full(len(X), self.base_prediction, dtype=float)
        for tree in self.models:
            predictions += self.learning_rate * tree.predict(X)
        return predictions

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def r2_score(y_true, y_pred):
    numerator = np.sum((y_true - y_pred)**2)
    denominator = np.sum((y_true - np.mean(y_true))**2)
    return 1 - (numerator / denominator)

# Example usage:
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])

X_test = np.array([[6], [7], [8]])
y_test = np.array([12, 14, 16])

gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gb_regressor.fit(X_train, y_train)

y_pred_train = gb_regressor.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)

print("Training MSE:", mse_train)
print("Training R2 score:", r2_train)

y_pred_test = gb_regressor.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

print("Test MSE:", mse_test)
print("Test R2 score:", r2_test)


Training MSE: 5.644063286794225e-09
Training R2 score: 0.9999999992944921
Test MSE: 18.66751664271918
Test R2 score: -6.000318741019692
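
The gap between the training and test scores is consistent with how tree ensembles behave outside the training range: a regression tree predicts a constant value per leaf, so it cannot extrapolate the linear trend beyond the largest training input. A quick arithmetic check supports this reading. It assumes the ensemble predicts roughly 10 (the largest training target) for every test point, which is an inference from the reported numbers rather than something printed above.

```python
y_test = [12, 14, 16]
constant_pred = 10.0   # assumption: predictions saturate near the largest training target

# Mean squared error of the constant prediction.
ss_res = sum((t - constant_pred) ** 2 for t in y_test)
mse = ss_res / len(y_test)

# R-squared compares against predicting the test mean.
mean_t = sum(y_test) / len(y_test)
ss_tot = sum((t - mean_t) ** 2 for t in y_test)
r2 = 1 - ss_res / ss_tot

print(round(mse, 4), r2)   # 18.6667 -6.0
```

These values match the reported test MSE (about 18.67) and R-squared (about -6.0) to within rounding, so the negative R-squared reflects extrapolation beyond the training data rather than a bug in the metric functions.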


In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing dataset
california_housing = fetch_california_housing()
data = california_housing.data
target = california_housing.target
feature_names = california_housing.feature_names

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

# Define numerical and categorical features.
# All California housing columns are numeric, so categorical_features stays empty.
numeric_features = []
categorical_features = []
for i, _name in enumerate(california_housing.feature_names):
    if np.issubdtype(data[:, i].dtype, np.number):
        numeric_features.append(i)
    else:
        categorical_features.append(i)

# Define preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Define parameter grids for grid search
param_grid_adaboost = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__learning_rate': [0.01, 0.1, 1.0],
    'regressor__loss': ['linear', 'square', 'exponential']
}

param_grid_gradientboost = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__learning_rate': [0.01, 0.1, 1.0],
    'regressor__max_depth': [3, 4, 5],
    'regressor__min_samples_split': [2, 5, 10]
}

param_grid_xgboost = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__learning_rate': [0.01, 0.1, 1.0],
    'regressor__max_depth': [3, 4, 5],
    'regressor__min_child_weight': [1, 3, 5]
}

# Define pipelines for each algorithm
ada_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', AdaBoostRegressor())
])

gb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor())
])

xgb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor())
])

# Perform grid search for AdaBoost
ada_grid_search = GridSearchCV(ada_pipeline, param_grid_adaboost, cv=5, scoring='neg_mean_squared_error')
ada_grid_search.fit(X_train, y_train)
print("AdaBoost Best Parameters:", ada_grid_search.best_params_)

# Perform grid search for Gradient Boosting
gb_grid_search = GridSearchCV(gb_pipeline, param_grid_gradientboost, cv=5, scoring='neg_mean_squared_error')
gb_grid_search.fit(X_train, y_train)
print("Gradient Boosting Best Parameters:", gb_grid_search.best_params_)

# Perform grid search for XGBoost
xgb_grid_search = GridSearchCV(xgb_pipeline, param_grid_xgboost, cv=5, scoring='neg_mean_squared_error')
xgb_grid_search.fit(X_train, y_train)
print("XGBoost Best Parameters:", xgb_grid_search.best_params_)

# Evaluate models on test data
ada_best = ada_grid_search.best_estimator_
gb_best = gb_grid_search.best_estimator_
xgb_best = xgb_grid_search.best_estimator_

y_pred_ada = ada_best.predict(X_test)
y_pred_gb = gb_best.predict(X_test)
y_pred_xgb = xgb_best.predict(X_test)

mse_ada = mean_squared_error(y_test, y_pred_ada)
r2_ada = r2_score(y_test, y_pred_ada)
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print("AdaBoost MSE:", mse_ada)
print("AdaBoost R-squared:", r2_ada)
print("Gradient Boosting MSE:", mse_gb)
print("Gradient Boosting R-squared:", r2_gb)
print("XGBoost MSE:", mse_xgb)
print("XGBoost R-squared:", r2_xgb)


### Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to optimise the performance of the model. Use grid search or random search to find the best hyperparameters.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing dataset
california_housing = fetch_california_housing()
data = california_housing.data
target = california_housing.target
feature_names = california_housing.feature_names

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

# Define preprocessing steps
scaler = StandardScaler()

# Define parameter grids for grid search
param_grid_adaboost = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__learning_rate': [0.01, 0.1, 1.0]
}

param_grid_gradientboost = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__learning_rate': [0.01, 0.1, 1.0],
    'regressor__max_depth': [3, 4, 5]
}

param_grid_xgboost = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__learning_rate': [0.01, 0.1, 1.0],
    'regressor__max_depth': [3, 4, 5]
}

# Define pipelines for each algorithm
# Fix random_state so the search results are reproducible across runs
ada_pipeline = Pipeline([
    ('scaler', scaler),
    ('regressor', AdaBoostRegressor(random_state=42))
])

gb_pipeline = Pipeline([
    ('scaler', scaler),
    ('regressor', GradientBoostingRegressor(random_state=42))
])

xgb_pipeline = Pipeline([
    ('scaler', scaler),
    ('regressor', XGBRegressor(random_state=42))
])

# Perform grid search for AdaBoost
ada_grid_search = GridSearchCV(ada_pipeline, param_grid_adaboost, cv=5, scoring='neg_mean_squared_error')
ada_grid_search.fit(X_train, y_train)
print("AdaBoost Best Parameters:", ada_grid_search.best_params_)

# Perform grid search for Gradient Boosting
gb_grid_search = GridSearchCV(gb_pipeline, param_grid_gradientboost, cv=5, scoring='neg_mean_squared_error')
gb_grid_search.fit(X_train, y_train)
print("Gradient Boosting Best Parameters:", gb_grid_search.best_params_)

# Perform grid search for XGBoost
xgb_grid_search = GridSearchCV(xgb_pipeline, param_grid_xgboost, cv=5, scoring='neg_mean_squared_error')
xgb_grid_search.fit(X_train, y_train)
print("XGBoost Best Parameters:", xgb_grid_search.best_params_)

# Evaluate models on test data
ada_best = ada_grid_search.best_estimator_
gb_best = gb_grid_search.best_estimator_
xgb_best = xgb_grid_search.best_estimator_

y_pred_ada = ada_best.predict(X_test)
y_pred_gb = gb_best.predict(X_test)
y_pred_xgb = xgb_best.predict(X_test)

mse_ada = mean_squared_error(y_test, y_pred_ada)
r2_ada = r2_score(y_test, y_pred_ada)
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print("AdaBoost MSE:", mse_ada)
print("AdaBoost R-squared:", r2_ada)
print("Gradient Boosting MSE:", mse_gb)
print("Gradient Boosting R-squared:", r2_gb)
print("XGBoost MSE:", mse_xgb)
print("XGBoost R-squared:", r2_xgb)


AdaBoost Best Parameters: {'regressor__learning_rate': 0.1, 'regressor__n_estimators': 50}


KeyboardInterrupt: 
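
The question also allows random search, which samples a fixed number of configurations instead of trying every grid combination and so tends to finish much sooner. The following is a minimal sketch rather than the notebook's own method: it uses a small synthetic dataset from `make_regression` as a stand-in so that it runs in seconds, and the sampling ranges are assumptions that roughly mirror the grids above.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration); swap in the
# California housing split from the cell above for a real experiment.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Distributions to sample from instead of fixed grid values
param_distributions = {
    'n_estimators': randint(50, 201),        # integers in [50, 200]
    'learning_rate': loguniform(0.01, 1.0),  # log-uniform in [0.01, 1.0]
    'max_depth': randint(3, 6),              # integers in [3, 5]
}

# n_iter controls how many configurations are tried in total
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions,
    n_iter=8,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
search.fit(X_train, y_train)

print("Random search best parameters:", search.best_params_)
print("Test R-squared:", search.best_estimator_.score(X_test, y_test))
```

With `n_iter=8` and `cv=3` this fits 24 models, compared with 135 for the 27-combination Gradient Boosting grid at `cv=5`. Passing `n_jobs=-1` to either search class runs the fits in parallel, which may also help on the full dataset.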

### Q4. What is a weak learner in Gradient Boosting?

In Gradient Boosting, a weak learner refers to a base model that is simple and performs slightly better than random guessing on a given problem. Weak learners are typically models with low predictive power, such as shallow decision trees or linear models.

The key characteristic of a weak learner is that it can only provide marginal improvements to the model's predictive performance. However, when combined in an ensemble with other weak learners using boosting, they can collectively create a strong predictive model.

Gradient Boosting works by sequentially adding weak learners to the ensemble, with each new learner focusing on the mistakes made by the previous ones. By iteratively fitting new weak learners to the residuals (the differences between the actual and predicted values) of the previous predictions, Gradient Boosting gradually reduces the error of the overall ensemble, ultimately resulting in a powerful predictive model.

Therefore, in Gradient Boosting, the weak learners serve as building blocks that, when combined intelligently, contribute to the creation of a strong predictive model capable of capturing complex relationships within the data.
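
To make the contrast concrete, the sketch below compares a single decision stump (a depth-1 tree) with a boosted ensemble of 200 such stumps. The synthetic dataset and all hyperparameter values are assumptions chosen for illustration, not settings from this assignment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One weak learner on its own: a decision stump with a single split
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_train, y_train)
stump_mse = mean_squared_error(y_test, stump.predict(X_test))

# The same kind of weak learner, combined 200 times by gradient boosting
boosted = GradientBoostingRegressor(
    n_estimators=200, max_depth=1, learning_rate=0.1, random_state=0
)
boosted.fit(X_train, y_train)
boosted_mse = mean_squared_error(y_test, boosted.predict(X_test))

print(f"Single stump test MSE:   {stump_mse:.1f}")
print(f"Boosted stumps test MSE: {boosted_mse:.1f}")
```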

### Q5. What is the intuition behind the Gradient Boosting algorithm?

The intuition behind the Gradient Boosting algorithm lies in the concept of improving the performance of a weak learner by iteratively adding new weak learners that focus on the mistakes made by the previous ones.

Here's a step-by-step intuition behind how Gradient Boosting works:

1. **Initialization**: The model starts from a simple initial prediction. In practice this is usually a constant, such as the mean of the target for squared-error loss, rather than a fitted tree.

2. **Sequential Training**: Gradient Boosting adds weak learners sequentially to the ensemble. Each new weak learner is trained to correct the errors (residuals) made by the combination of all the previous weak learners.

3. **Gradient Descent**: At each iteration, the algorithm calculates the gradient of the loss function with respect to the current ensemble's predictions. This gradient provides information on how to update the predictions to reduce the overall loss.

4. **Model Fitting**: The new weak learner is trained to predict the negative gradient (or pseudo-residuals) of the loss function with respect to the current predictions. Essentially, the new learner focuses on capturing the remaining patterns or errors in the data that the ensemble has not yet captured.

5. **Combining Predictions**: After training the new weak learner, its predictions are added to the ensemble by weighting them according to a learning rate parameter. This learning rate controls the contribution of each weak learner to the final prediction.

6. **Iterative Process**: Steps 3 to 5 are repeated iteratively for a predefined number of iterations or until a stopping criterion is met. With each iteration, the ensemble becomes increasingly adept at capturing the complex relationships in the data and reducing the overall error.

7. **Final Prediction**: The final prediction is obtained by aggregating the predictions of all weak learners in the ensemble, typically through a weighted sum.

By iteratively adding new weak learners that focus on the residuals of the previous predictions, Gradient Boosting gradually reduces the error of the ensemble, resulting in a strong predictive model capable of capturing complex relationships in the data.
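
The steps above can be sketched as a short hand-written loop for squared-error loss, where the negative gradient is simply the residual. This is a simplified illustration, not how any particular library implements it; the synthetic sine-wave data, `learning_rate`, `n_rounds`, and tree depth are all assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data: a noisy sine wave (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1  # assumed value
n_rounds = 50        # assumed value

# Step 1: start from a constant prediction (the mean minimises squared error)
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # Step 3: for L = 0.5 * (y - F)^2 the negative gradient is the residual
    residuals = y - prediction
    # Step 4: fit a shallow tree to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Step 5: add the tree's output, scaled by the learning rate
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Predicting on new data follows the same pattern: start from `y.mean()` and add `learning_rate * tree.predict(X_new)` for every tree in `trees`.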

### Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?

The Gradient Boosting algorithm builds an ensemble of weak learners sequentially through an iterative process. Here's a step-by-step explanation of how it constructs the ensemble:

1. **Initialization**: Gradient Boosting starts from a simple initial model, usually a constant such as the mean of the target for squared-error loss.

2. **Initial Prediction**: This initial model supplies the first set of predictions on the training data.

3. **Residual Calculation**: The algorithm calculates the residuals, which represent the difference between the actual target values and the predictions made by the current ensemble.

4. **Learning from Residuals**: A new weak learner is trained to predict these residuals or, more generally, the negative gradients of the loss function with respect to the current predictions. This weak learner focuses on capturing the remaining patterns or errors in the data that the ensemble has not yet captured.

5. **Updating Predictions**: The predictions of the new weak learner are added to the predictions of the current ensemble, typically by combining them with a learning rate parameter. This learning rate controls the contribution of each weak learner to the final prediction.

6. **Iterative Process**: Steps 3 to 5 are repeated iteratively for a predefined number of iterations or until a stopping criterion is met. With each iteration, a new weak learner is trained to predict the residuals of the current ensemble, and its predictions are added to the ensemble.

7. **Final Prediction**: The final prediction is obtained by aggregating the predictions of all weak learners in the ensemble, typically through a weighted sum.

Each pass through steps 3 to 5 adds one learner, so the finished ensemble is an additive model: the initial prediction plus the learning-rate-scaled output of every weak learner fitted along the way.
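
One way to watch the ensemble grow is scikit-learn's `staged_predict`, which yields the ensemble's prediction after each added tree. The sketch below is illustrative only; the synthetic dataset and hyperparameters are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=1)

model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=1
)
model.fit(X, y)

# One training-error value per boosting stage (after 1 tree, 2 trees, ...)
stage_mse = [
    mean_squared_error(y, y_stage) for y_stage in model.staged_predict(X)
]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees: training MSE = {stage_mse[n_trees - 1]:.2f}")
```

The same loop run on a held-out set, rather than the training data, is a common way to choose the number of trees.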

### Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting algorithm?

Constructing the mathematical intuition behind the Gradient Boosting algorithm involves several key steps:

1. **Understanding Gradient Descent**: Gradient Boosting is based on the concept of gradient descent, a widely used optimization technique. Begin by understanding how gradient descent works, particularly in the context of minimizing a loss function.

2. **Loss Functions**: Learn about different loss functions commonly used in regression and classification problems, such as mean squared error (MSE) for regression and log loss (cross-entropy) for classification.

3. **Gradient Descent for Regression**: Understand how gradient descent can be applied to minimize the loss function in regression problems. This involves computing the gradient (derivative) of the loss function with respect to the model parameters (coefficients) and updating those parameters in the direction that reduces the loss.

4. **Gradient Boosting for Regression**: Build intuition for how Gradient Boosting extends gradient descent for regression problems by sequentially adding weak learners (decision trees) to the ensemble. Each weak learner is trained to predict the residuals (negative gradients) of the loss function with respect to the current ensemble predictions.

5. **Boosting Algorithm**: Study the boosting algorithm step-by-step, focusing on how each weak learner is trained and added to the ensemble to reduce the error of the overall model.

6. **Learning Rate**: Understand the role of the learning rate parameter in controlling the contribution of each weak learner to the final prediction. Lower learning rates result in more conservative updates and smoother convergence.

7. **Regularization**: Learn about regularization techniques used in Gradient Boosting, such as shrinkage (learning rate) and tree-specific parameters (tree depth, number of leaves, etc.), to prevent overfitting and improve generalization performance.

8. **Gradient Boosting for Classification**: Extend the intuition to classification problems by considering how Gradient Boosting can be adapted to minimize classification loss functions (e.g., log loss) and make predictions for multiple classes.

By following these steps and gaining a solid understanding of the underlying principles, you can develop a comprehensive mathematical intuition for the Gradient Boosting algorithm.
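
For reference, the steps above can be condensed into the usual additive-model formulation. This is a simplified sketch that omits the per-stage line search found in some presentations. The notation is introduced here and is not taken from the text above: $L$ is the loss, $F_m$ the ensemble after $m$ stages, $h_m$ the weak learner fitted at stage $m$, and $\nu$ the learning rate.

$$
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
$$

$$
r_{im} = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}, \qquad i = 1, \dots, n
$$

$$
h_m = \text{weak learner fitted to } \{(x_i, r_{im})\}_{i=1}^{n}, \qquad F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
$$

For squared-error loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, the pseudo-residual reduces to the ordinary residual:

$$
r_{im} = y_i - F_{m-1}(x_i)
$$

This is why fitting each new learner to the residuals is a special case of fitting it to the negative gradient.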