In [1]:
#q7

In [None]:
Gradient Boosting is a machine learning algorithm that builds an ensemble of decision trees to make predictions. It adds trees to the ensemble one at a time, each fitted to correct the errors of the trees before it. The core idea is to minimize a loss function by gradient descent in function space: each new tree approximates the negative gradient of the loss with respect to the current predictions. The mathematical intuition can be built up in the following steps:

1. Define a loss function: Choose a loss function that measures how well the model fits the data. For regression, the most common choice is the mean squared error (MSE).

2. Initialize the model: Start with a constant prediction, such as the mean of the target variable, which is the constant that minimizes the MSE.

3. Calculate the residuals: Compute the residuals, the differences between the actual values of the target variable and the current predictions (actual minus predicted). For MSE these equal the negative gradient of the loss, up to a constant factor.

4. Train a decision tree on the residuals: Fit a decision tree to predict the residuals of the current model rather than the target variable itself.

5. Update the model: Add the output of the new tree, usually scaled by a learning rate, to the current predictions. The updated model is the sum of the previous model and the scaled output of the new tree.

6. Repeat steps 3 to 5: Continue until a stopping criterion is met, such as a maximum number of iterations or a minimum improvement in the loss function.

7. Make predictions: Use the trained model to make predictions on new data points.
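
The steps above can be written compactly for the squared-error case. This is a sketch under assumed notation that does not appear in the original: F_m is the model after m trees, h_m is the m-th tree, M is the number of trees, and ν is a learning rate. The factor 1/2 in the loss is a convenience so that the gradient carries no constant.

```latex
\begin{aligned}
L(y, F) &= \tfrac{1}{2}\,(y - F)^2 \\
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) = \frac{1}{n}\sum_{i=1}^{n} y_i \\
r_{i,m} &= -\left.\frac{\partial L(y_i, F)}{\partial F}\right|_{F = F_{m-1}(x_i)} = y_i - F_{m-1}(x_i) \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \bigl(r_{i,m} - h(x_i)\bigr)^2 \\
F_m(x) &= F_{m-1}(x) + \nu\, h_m(x), \qquad m = 1, \dots, M
\end{aligned}
```

The third line is the key identity: with squared error, the negative gradient at each data point is the ordinary residual, which is why fitting trees to residuals amounts to gradient descent on the loss.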

In [None]:
#q2

In [2]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

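class GradientBoostingRegressor:
    """Minimal gradient boosting for regression with squared-error loss."""

    def __init__(self, n_estimators=100, max_depth=3, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        self.trees = []
        self.mean = None

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.trees = []  # reset so repeated calls to fit do not accumulate trees
        # Initial prediction: the mean of the target.
        self.mean = np.mean(y)
        F = np.full(y.shape, self.mean)
        for _ in range(self.n_estimators):
            # Residuals are the negative gradient of the squared-error loss.
            residuals = y - F
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # Shrink each tree's contribution by the learning rate.
            F += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Start from the initial constant and add every tree's scaled output.
        F = np.full(X.shape[0], self.mean)
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        return F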



In [None]:
This implementation has three parts:

Initialization of the hyperparameters: the number of estimators, the maximum depth of the decision trees, and the learning rate.
A fit method that iteratively trains decision trees on the residuals and updates the predictions using the learning rate. The mean of the target variable is used as the initial prediction.
A predict method that starts from that mean and adds the scaled output of every trained tree to make predictions on new data points.

Note that the class has the same name as sklearn.ensemble.GradientBoostingRegressor, which a later cell imports; from that import onwards the name refers to the scikit-learn class.
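
To see the fit loop at work without any scikit-learn dependency, here is a minimal sketch that runs the same update on a toy one-dimensional problem and tracks the training MSE after each round. The helper `fit_stump`, the sine-shaped toy data, and the choice of 50 rounds are all assumptions made for illustration, not part of the implementation above.

```python
import numpy as np


def fit_stump(x, r):
    """Fit a one-split regression stump to residuals r on a 1-D feature x."""
    best = None
    for t in np.unique(x)[:-1]:  # candidate thresholds (both sides non-empty)
        left = x <= t
        left_value, right_value = r[left].mean(), r[~left].mean()
        sse = ((r[left] - left_value) ** 2).sum() + ((r[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left_value, right_value)
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)


# Toy data (hypothetical): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

learning_rate = 0.1
F = np.full_like(y, y.mean())            # initial prediction: the mean
losses = [np.mean((y - F) ** 2)]
for _ in range(50):
    stump = fit_stump(x, y - F)          # fit the weak learner to the residuals
    F += learning_rate * stump(x)        # shrink and add its predictions
    losses.append(np.mean((y - F) ** 2))

print(f"Training MSE after 0 rounds:  {losses[0]:.4f}")
print(f"Training MSE after 50 rounds: {losses[-1]:.4f}")
```

With squared error and leaf values set to the mean residual, each round can only lower (or leave unchanged) the training MSE, which is what the `losses` list records.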

In [3]:
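# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions (see the warning printed below).
from sklearn.datasets import load_boston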
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}")
print(f"R-squared: {r2:.2f}")


MSE: 6.10
R-squared: 0.91



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

In [None]:
#q3

In [4]:
# Note: load_boston requires scikit-learn < 1.2 (it was removed in 1.2).
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import GradientBoostingRegressor

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.01, 0.1, 1],
}

gb_model = GradientBoostingRegressor(random_state=42)
grid_search = GridSearchCV(gb_model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print(f"Best hyperparameters: {grid_search.best_params_}")

y_pred = grid_search.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}")
print(f"R-squared: {r2:.2f}")



Best hyperparameters: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100}
MSE: 9.40
R-squared: 0.86


In [None]:
In this example, we use GridSearchCV from the sklearn.model_selection module to search for the best hyperparameters of the gradient boosting model.
We define a dictionary of hyperparameters and candidate values, and GridSearchCV evaluates every combination: 3 x 3 x 3 = 27 settings here, each fitted 5 times, once per cross-validation fold.

We also specify the number of cross-validation folds (cv=5), the scoring metric (neg_mean_squared_error, the MSE negated so that higher is better), and the model (scikit-learn's GradientBoostingRegressor).
After the grid search completes, we print the best hyperparameters it found.

Finally, we evaluate on the test data using MSE and R-squared. No separate retraining step is needed: by default GridSearchCV refits the best configuration on the full training set, and grid_search.predict uses that refitted model. Because train_test_split is called without a random_state, this cell uses a different split from the previous one, so the two sets of scores are not directly comparable.

You can also use RandomizedSearchCV from sklearn.model_selection, which samples a fixed number of hyperparameter settings from specified ranges or distributions instead of trying every combination.
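
As a rough sketch of the randomized alternative, the example below samples 8 settings from distributions that cover roughly the same ranges as the grid. The synthetic data from make_regression, the distribution bounds, and the values n_iter=8 and cv=3 are assumptions chosen to keep the example small and runnable offline; they are not tuned recommendations.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data stands in for the housing data (an assumption for this sketch).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Distributions to sample from, instead of a fixed grid of values.
param_distributions = {
    "n_estimators": randint(50, 151),      # integers in [50, 150]
    "max_depth": randint(2, 5),            # integers in [2, 4]
    "learning_rate": uniform(0.01, 0.29),  # floats in [0.01, 0.30]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,                              # number of sampled settings
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = r2_score(y_test, search.predict(X_test))
print(f"Best hyperparameters: {best_params}")
print(f"Test R-squared: {test_r2:.2f}")
```

The cost is fixed by n_iter rather than by the size of the grid, which is the main reason to prefer a randomized search when there are many hyperparameters or continuous ranges.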

In [None]:
#q4

In [None]:
In Gradient Boosting, a weak learner is a simple model that performs only slightly better than a trivial baseline: random guessing in classification, or predicting a constant in regression.
In principle the weak learner can be any model that maps input features to an output, such as a linear model, but in practice it is almost always a shallow decision tree.

Weak learners are combined iteratively to form a strong learner that can accurately predict the target variable.
At each iteration, a new weak learner is trained on the residual errors of the current model, and its predictions are added to the current model's predictions to update the overall model.
This process continues until a predefined stopping criterion is met or a maximum number of iterations is reached.

Weak learners are used because they make training stable and cheap.
Each simple model has low variance and contributes only a small correction, so the ensemble builds up complex relationships between the input features and the target variable gradually, with less risk of overfitting than fitting a single complex model in one step.
Taking many small steps also makes the final model less sensitive to any single poor fit.
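
The gap between one weak learner and a boosted ensemble of them can be illustrated with a small comparison. This is a sketch on assumed synthetic data (an additive function of two features plus noise); the sample size, noise level, and hyperparameters are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a smooth additive target with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One weak learner: a tree with a single split (a "stump").
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_train, y_train)

# Many weak learners of the same kind, combined by boosting.
boosted = GradientBoostingRegressor(
    n_estimators=200, max_depth=1, learning_rate=0.1, random_state=0
).fit(X_train, y_train)

stump_r2 = stump.score(X_test, y_test)      # R-squared of the single stump
boosted_r2 = boosted.score(X_test, y_test)  # R-squared of the ensemble
print(f"Single stump R-squared:   {stump_r2:.2f}")
print(f"Boosted stumps R-squared: {boosted_r2:.2f}")
```

A single stump can only output two distinct values, so it captures little of the target; the boosted model is built from the same kind of stump but sums 200 of them.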

In [None]:
#q5

In [None]:
The intuition behind Gradient Boosting is to iteratively add weak learners to a model in order to improve its predictions.

The algorithm starts with an initial prediction for each data point: the average of the target variable for regression problems, or a constant derived from the class frequencies (such as the log-odds) for classification problems.
Then it trains a weak learner, typically a decision tree with a small number of nodes, to predict the difference between the true target values and the initial predictions. This difference is called the "residual error".

The weak learner is trained on the residual errors rather than the original target values, which lets it focus on the patterns in the data that the initial predictions do not capture.
The weak learner's predictions, usually scaled by a learning rate, are then added to the initial predictions to create an updated set of predictions.

This process is repeated iteratively, with each new weak learner trained on the residual errors of the current model.
Each new weak learner therefore focuses on the errors that the previous learners left uncorrected, so the overall prediction becomes more accurate with each round.

The term "gradient" in Gradient Boosting refers to the gradient of the loss function, such as mean squared error or cross-entropy, with respect to the current predictions. Each weak learner is fitted to the negative gradient, which gives the direction and magnitude of the correction needed at each data point. For mean squared error the negative gradient is exactly the residual (up to a constant factor); for other losses it is a "pseudo-residual", which is how the same procedure extends to losses other than squared error.
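
The link between residuals and gradients can be checked numerically. The sketch below estimates the negative gradient of two losses by finite differences; the function names, the sample values, and the use of absolute error as the second loss are assumptions for illustration.

```python
import numpy as np


def squared_loss(y, F):
    # The factor 1/2 makes the gradient exactly -(y - F).
    return 0.5 * (y - F) ** 2


def absolute_loss(y, F):
    return np.abs(y - F)


def negative_gradient(loss, y, F, eps=1e-6):
    """Central finite-difference estimate of -dL/dF at each data point."""
    return -(loss(y, F + eps) - loss(y, F - eps)) / (2 * eps)


y = np.array([3.0, -1.0, 2.5, 0.0])   # true targets (made-up values)
F = np.array([2.0, 0.5, 2.0, -1.0])   # current predictions (made-up values)

sq_pseudo = negative_gradient(squared_loss, y, F)
abs_pseudo = negative_gradient(absolute_loss, y, F)

print("residuals y - F:             ", y - F)
print("neg. gradient, squared loss: ", np.round(sq_pseudo, 4))
print("neg. gradient, absolute loss:", np.round(abs_pseudo, 4))
```

For squared error the negative gradient matches the residuals, while for absolute error it keeps only their sign, so the trees are fitted to a different target even though the boosting loop is unchanged.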

In [None]:
#q6

In [None]:
Gradient Boosting builds an ensemble of weak learners by adding them to the model one at a time. Each weak learner is trained on the residual errors of the current model, and its predictions are added to the current model's predictions to update the overall model.

The process of building the ensemble can be broken down into the following steps:

1. Initialize the model with an initial prediction for each data point. For regression problems, this can be the average of the target variable; for classification problems, it can be a constant derived from the class frequencies, such as the log-odds.

2. Calculate the residual errors between the true target values and the current predictions.

3. Train a weak learner, typically a decision tree with a small number of nodes, to predict the residual errors.

4. Add the weak learner's predictions, usually scaled by a learning rate, to the current predictions to update the overall model.

5. Repeat steps 2-4 until a predefined stopping criterion is met or a maximum number of iterations is reached.

The weak learners in Gradient Boosting are called "weak" because they are simple models that individually perform only slightly better than a trivial baseline. By combining many of them, however, Gradient Boosting creates a strong and flexible ensemble that can accurately predict the target variable.

Overfitting is controlled mainly by keeping each weak learner simple, shrinking its contribution with a learning rate, and limiting the number of iterations. In the stochastic variant of Gradient Boosting, each weak learner is additionally trained on a random subset of the data, which further reduces variance. Together these measures help the model generalize to new data.
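
Both the stopping criterion and training on random subsets are available as options in scikit-learn's GradientBoostingRegressor. The following is a minimal sketch on assumed synthetic data; the specific values (500 rounds at most, 80% subsampling, patience of 10 rounds) are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch).
X, y = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of boosting rounds
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,            # each tree is fitted on a random 80% of the rows
    validation_fraction=0.2,  # held out internally to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
model.fit(X_train, y_train)

n_trees = model.n_estimators_  # number of trees actually fitted
print(f"Trees actually fitted: {n_trees}")
print(f"Test R-squared: {model.score(X_test, y_test):.2f}")
```

When early stopping triggers, n_estimators_ is smaller than the n_estimators upper bound, which is the "predefined stopping criterion" from the steps above in concrete form.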