Q1. What is Gradient Boosting Regression?


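Gradient boosting is a popular boosting algorithm in machine learning, used for both classification and regression tasks. Gradient Boosting Regression applies it to continuous targets: simple models (usually shallow decision trees) are trained sequentially, each one fitted to the errors of the ensemble built so far, and their predictions are added together to form the final prediction.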

Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a
simple regression problem as an example and train the model on a small dataset. Evaluate the model's
performance using metrics such as mean squared error and R-squared.


In [2]:
import numpy as np

# Generate a sample dataset
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 * X.squeeze() + np.random.rand(100)

# Define the number of boosting rounds and the learning rate
n_estimators = 100
learning_rate = 0.1

# Initialize the predictions with the mean of the target values
initial_prediction = np.mean(y)
predictions = np.full(len(y), initial_prediction)

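# Gradient boosting loop. The boosting logic is written by hand;
# scikit-learn only supplies the decision tree used as the weak learner.
from sklearn.tree import DecisionTreeRegressor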
for _ in range(n_estimators):
    # Calculate residuals
    residuals = y - predictions

    # Fit a decision tree to the residuals
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)

    # Make predictions with the current tree
    tree_predictions = tree.predict(X)

    # Update the predictions with a fraction (learning rate) of the new predictions
    predictions += learning_rate * tree_predictions

# Calculate the mean squared error (in-sample: measured on the training data)
mse = np.mean((y - predictions) ** 2)

# Calculate R-squared
ss_total = np.sum((y - np.mean(y)) ** 2)
ss_residual = np.sum((y - predictions) ** 2)
r_squared = 1 - (ss_residual / ss_total)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared: {r_squared:.4f}")


Mean Squared Error: 0.0144
R-squared: 0.9628
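The metrics above are computed on the same data the model was trained on, and the weak learner comes from scikit-learn. The sketch below is one possible NumPy-only variant, not a reference implementation: it assumes a depth-1 tree (a "stump") as the weak learner and a simple 80/20 split, and the helper names `fit_stump` and `predict_stump` are made up for this illustration.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree to targets r on a 1-D feature x.

    Returns (threshold, left_value, right_value); threshold is None
    when no split is possible.
    """
    best_sse, best = np.inf, (None, r.mean(), r.mean())
    for t in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= t
        lv, rv = r[left].mean(), r[~left].mean()
        sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, lv, rv)
    return best

def predict_stump(stump, x):
    """Predict with a stump returned by fit_stump."""
    t, lv, rv = stump
    if t is None:
        return np.full(len(x), lv)
    return np.where(x <= t, lv, rv)

# Same kind of data as above: y = 2x + uniform noise (assumed 80/20 split)
rng = np.random.default_rng(0)
x = rng.random(100)
y = 2 * x + rng.random(100)
x_train, x_test = x[:80], x[80:]
y_train, y_test = y[:80], y[80:]

n_estimators, learning_rate = 100, 0.1
f0 = y_train.mean()                       # initial constant prediction
train_pred = np.full(len(y_train), f0)
stumps = []
for _ in range(n_estimators):
    residuals = y_train - train_pred      # errors of the current ensemble
    stump = fit_stump(x_train, residuals)
    train_pred += learning_rate * predict_stump(stump, x_train)
    stumps.append(stump)

# Apply the stored stumps to unseen data
test_pred = f0 + learning_rate * sum(predict_stump(s, x_test) for s in stumps)

test_mse = np.mean((y_test - test_pred) ** 2)
baseline_mse = np.mean((y_test - f0) ** 2)  # always predicting the training mean
test_r2 = 1 - np.sum((y_test - test_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print(f"Test MSE: {test_mse:.4f} (mean baseline: {baseline_mse:.4f})")
print(f"Test R-squared: {test_r2:.4f}")
```

Storing the fitted stumps is what makes prediction on new data possible; the loop in the cell above only keeps the running training predictions.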


Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to
optimise the performance of the model. Use grid search or random search to find the best
hyperparameters.


In [3]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Generate a sample dataset
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a range of hyperparameters to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}

# Create a GradientBoostingRegressor
gb_regressor = GradientBoostingRegressor(random_state=42)

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=gb_regressor, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Create a GradientBoostingRegressor with the best hyperparameters
best_gb_regressor = GradientBoostingRegressor(**best_params, random_state=42)

# Train the model on the full training set
best_gb_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_gb_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Best Hyperparameters:", best_params)
print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared: {r2:.4f}")


Best Hyperparameters: {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 50}
Mean Squared Error: 1.4696
R-squared: 0.9989
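The question also mentions random search. A minimal sketch with scikit-learn's `RandomizedSearchCV` follows; the candidate values, the extra `subsample` parameter, and `n_iter=10` are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Same synthetic dataset and split as in the grid search above
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate values to sample from (illustrative choices)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4, 5],
    "subsample": [0.6, 0.8, 1.0],
}

# Evaluate 10 randomly sampled combinations instead of the full grid
search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)

# refit=True by default, so best_estimator_ is already retrained on X_train
test_r2 = search.best_estimator_.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print(f"Test R-squared: {test_r2:.4f}")
```

Random search fits a fixed number of combinations regardless of how many values each parameter has, which keeps the cost predictable as the search space grows.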


Q4. What is a weak learner in Gradient Boosting?


In the context of Gradient Boosting, a "weak learner" is a simple base model that performs only slightly better than a trivial baseline: random guessing in classification, or always predicting the mean in regression. On its own, a weak learner has limited capacity to capture the underlying patterns in the data. In gradient boosting the weak learners are typically shallow decision trees, such as the depth-3 trees used in Q2.
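To make "weak" concrete, the sketch below compares three models on a synthetic sine curve (an assumed toy dataset): a constant mean prediction, a single depth-1 tree, and 200 boosted depth-1 trees. Errors are measured on the training data purely to illustrate capacity, not generalization.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy data: noisy sine curve (an assumption for this illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

def mse(pred):
    """Mean squared error against the training targets."""
    return float(np.mean((y - pred) ** 2))

# Trivial baseline: always predict the mean
baseline_mse = mse(np.full(len(y), y.mean()))

# One weak learner: a single split, so only two distinct output values
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
stump_mse = mse(stump.predict(X))

# Many weak learners combined by boosting
boosted = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=1, random_state=0
).fit(X, y)
boosted_mse = mse(boosted.predict(X))

print(f"Mean baseline MSE: {baseline_mse:.4f}")
print(f"Single stump MSE:  {stump_mse:.4f}")
print(f"Boosted stumps MSE: {boosted_mse:.4f}")
```

The single stump beats the baseline but cannot follow the curve; the boosted ensemble of the same stumps can.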

Q5. What is the intuition behind the Gradient Boosting algorithm?



The intuition behind Gradient Boosting is to build a strong predictive model by sequentially adding weak models, each of which corrects the errors made by the ensemble up to that point. Each new weak learner is fitted to the negative gradient of the loss with respect to the current predictions. For squared-error loss this is simply the residual y - ŷ, so the procedure amounts to gradient descent on the loss, carried out in the space of predictions.
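The link between "correcting errors" and "gradient" can be written out for squared-error loss. The notation here is an assumption for illustration: F_{m-1} is the ensemble after m-1 stages, h_m is the next weak learner, and α is the learning rate.

```latex
L\bigl(y, F(x)\bigr) = \tfrac{1}{2}\,\bigl(y - F(x)\bigr)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F(x)} = y - F(x)

F_m(x) = F_{m-1}(x) + \alpha\, h_m(x),
\qquad h_m \text{ fitted to the residuals } y - F_{m-1}(x)
```

Fitting a tree to the residuals is therefore one gradient-descent step on the loss. For other losses, the residual is replaced by the corresponding negative gradient.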

Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?


Gradient Boosting builds its ensemble stage by stage. It starts from a constant prediction (for squared-error loss, the mean of the target). At each stage it computes the residuals of the current ensemble (more generally, the negative gradient of the loss), fits a new weak learner to them, and adds that learner's output, scaled by the learning rate, to the running prediction. Examples that are still predicted poorly have large residuals and therefore the most influence on the next learner. Unlike AdaBoost, the learners are not combined by weighted voting based on each model's accuracy; their scaled outputs are simply summed.
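The stage-by-stage construction can be observed with `staged_predict`, which yields the ensemble's prediction after each added tree. This is a small sketch on an assumed toy dataset (a noisy sine curve); the hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Toy data: noisy sine curve (an assumption for this illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=0
)
model.fit(X, y)

# Training MSE of the partial ensemble after 1, 2, ..., 50 trees
stage_mse = [mean_squared_error(y, pred) for pred in model.staged_predict(X)]

for m in (1, 10, 50):
    print(f"trees={m:2d}  training MSE={stage_mse[m - 1]:.4f}")
```

With squared-error loss and no subsampling, each added tree can only lower the training error or leave it unchanged; the error on held-out data need not behave this way.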


Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting
algorithm?

We will use gradient boosting to predict the output by following these steps:
	First we create a base model, which is deliberately naive: whatever the input, its output is the average of the target feature y, denoted ŷ. Here ŷ = (50 + 70 + 80 + 100) / 4 = 75K.
	Next we calculate the residuals (errors) R_1 = y - ŷ by subtracting the output of the base model from each value of the target feature.
	Now we construct a decision tree with x_1 and x_2 as input features and R_1 as the target. Its predictions are denoted R_2. Because the tree is trained to reproduce R_1 from the same input features, its predictions will be close to R_1. The R_2 values shown in the table are assumed for illustration.

Exp (x_1)   Degree (x_2)   Salary (y)   R_1 = y - ŷ   R_2
2           B.E            50K          -25K          -23K
3           Masters        70K          -5K           -3K
5           Masters        80K          5K            3K
6           PhD            100K         25K           20K
Base model: ŷ = 75K

	Now we calculate the final prediction by adding the output of the base model and the predictions of all the decision trees (there is only one tree here). For the first record the base model outputs 75K and the first decision tree outputs -23K. Their sum is 52K, which is close to the true value of 50K. A single tree already gives an almost exact prediction, but in practice this is not desirable: the model is overfitting the training data, and it is likely to make large errors on new test data.
	To avoid this we use a learning rate α, which lies between 0 and 1 (0.01 in this example). Each tree's output is multiplied by α before it is added. For the first record this gives ŷ = 75 + 0.01 × (-23) = 74.77K. Doing the same for every record gives the updated ŷ column in the table below.
	Now we calculate the residuals again, by subtracting the updated prediction ŷ from the target y for each record. These residuals are R_3, as shown in the table below.

Exp (x_1)   Degree (x_2)   Salary (y)   R_1    R_2    ŷ = 75 + 0.01 × R_2   R_3 = y - ŷ
2           B.E            50K          -25K   -23K   74.77K                -24.77K
3           Masters        70K          -5K    -3K    74.97K                -4.97K
5           Masters        80K          5K     3K     75.03K                4.97K
6           PhD            100K         25K    20K    75.20K                24.80K
Base model: ŷ = 75K

	Now we construct another decision tree, using the residuals R_3 as the target.
	This whole sequential process is repeated again and again, and the resulting model is:
$$F(x) = \alpha_0 h_0(x) + \alpha_1 h_1(x) + \alpha_2 h_2(x) + \dots + \alpha_n h_n(x)$$
It can be simplified as:
$$F(x) = \sum_{i=0}^{n} \alpha_i h_i(x)$$
Where,
α_i is the learning rate applied to the i-th model
h_0(x) is the output of the base learner
h_i(x), for i ≥ 1, is the output of the i-th decision tree
Also note that α_0 is taken as 1 and the remaining α_i lie between 0 and 1; in practice a single value α is used for all trees. In scikit-learn's GradientBoostingRegressor the default learning_rate is 0.1.
