<img src='https://yastatic.net/s3/ml-handbook/admin/5_1_848ce5004c.png' width='1200px'>

<a href='https://academy.yandex.ru/handbook/ml/article/gradientnyj-busting'>image source</a>

# INTRO

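Gradient descent is an optimization method, and gradient boosting is an ensemble technique built on the same idea. Both improve the accuracy of machine learning models by reducing the prediction error step by step: gradient descent adjusts the parameters of a single model, while boosting adds new models that correct the errors of the previous ones.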

Below, I'll describe step by step how these methods work, using a simple example. For simplicity, I kept the mathematics to a minimum and greatly simplified the description of the algorithms.

If you're interested in learning more about these methods, I recommend studying the topic of gradient boosting in more depth, for example, by reading <a href='https://www.kaggle.com/code/kashnitsky/topic-10-gradient-boosting'>this Kaggle topic</a>.

In [None]:
# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tqdm import tqdm

# GRADIENT DESCENT

We are going to use a sample of 10 observations, each with 3 features (x1, x2, x3) and a target variable y. The target depends linearly on the features, plus a small amount of noise:

y = w0 + w1x1 + w2x2 + w3x3

The weights w0, w1, w2 and w3 are unknown to us; our task is to find them.

In [None]:
# Let's create a random dataset and represent it in a table for easier visualization.
X, y = make_regression(n_samples=10, n_features=3, noise=.01, random_state=42)
df = pd.DataFrame(X, columns=['x1', 'x2', 'x3'])
df['y_true'] = y
df.head()
# The value of target 'y_true' is linearly dependent on x1, x2 and x3.

### STEP 1
Ok, let's pick initial values for w0, w1, w2, and w3. They are often drawn at random; here we simply set all of them to 1. These values are only a starting point, and the optimization process will improve them.

In [None]:
w0, w1, w2, w3 = 1, 1, 1, 1

### STEP 2
To predict the target variable y, we plug the current values of w0, w1, w2 and w3, together with the values of x1, x2 and x3 of each observation, into the linear function:

y = w0 + w1x1 + w2x2 + w3x3.

In [None]:
df['y_pred'] = w0+w1*df['x1']+w2*df['x2']+w3*df['x3']
df.head()

### STEP 3
Let's see how well we did with the prediction task. To do this, we will use the mean squared error (MSE) loss function: the average of the squared differences between the actual and predicted target values. MSE is simply our choice for this example; other loss functions can be used as well.

In [None]:
mse = mean_squared_error(df['y_true'], df['y_pred'] )
mse
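
As a small illustration of what `mean_squared_error` computes, the same quantity can be worked out by hand. The numbers below are toy values chosen for this sketch, not rows of the dataset above.

```python
import numpy as np

# Toy values for illustration only (not taken from the dataset above).
y_true_demo = np.array([3.0, -1.0, 2.0])
y_pred_demo = np.array([2.5, 0.0, 2.0])

errors = y_true_demo - y_pred_demo   # per-observation differences: 0.5, -1.0, 0.0
mse_by_hand = np.mean(errors ** 2)   # (0.25 + 1.0 + 0.0) / 3
print(mse_by_hand)
```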

### STEP 4
It looks like we are far from perfect. Let's improve the accuracy of our prediction. To do this, we will use the gradient.

The gradient is a vector that points in the direction in which a function increases fastest. To find a minimum, we move in the opposite direction, along the antigradient.
We will move down the loss function step by step towards a local minimum. Jumping ahead, I will say that gradient descent usually only gets close to the minimum point rather than reaching it exactly.

<img src='https://raw.githubusercontent.com/betelgeus/study/master/images/descent.png' width='400px'>


Gradient descent requires finding the partial derivatives of the loss function with respect to each of the coefficients (or parameters) in the model. If you are not familiar with derivatives, they can be calculated using an <a href='https://www.derivative-calculator.net'>online calculator</a>. 

The gradient computation is essentially the process of finding the slope of the loss function at the current parameter values.
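
To make the idea of a slope concrete, here is a minimal sketch that estimates it numerically for a hypothetical one-parameter loss, `(w - 3)**2`. The function names and the loss itself are assumptions made for this illustration.

```python
def toy_loss(w):
    # Hypothetical one-parameter loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def numerical_slope(f, w, eps=1e-6):
    # Central difference: rise over run on a tiny interval around w.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

slope_left = numerical_slope(toy_loss, 1.0)   # negative: increasing w lowers the loss
slope_right = numerical_slope(toy_loss, 5.0)  # positive: decreasing w lowers the loss
slope_min = numerical_slope(toy_loss, 3.0)    # close to zero at the minimum
print(slope_left, slope_right, slope_min)
```

The sign of the slope tells us which way to move the parameter, which is exactly what the antigradient step uses.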

### FIND PARTIAL DERIVATIVES

**Linear regression function**
$$f(w, x_i) = w_0 + \sum_{k=1}^{3} w_k x_{ik}$$

$$f(w, x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3}$$

**Loss function**
$$\mathcal{L}(y, f(w, x)) = \frac{1}{N} \sum_{i=1}^{N}(y_i - f(w, x_i))^2 \rightarrow \min$$

$$\mathcal{L}(y, f(w, x)) = \frac{1}{N} \sum_{i=1}^{N}(y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3}))^2 \rightarrow \min$$

**Find partial derivatives for each weight: w0, w1, w2, w3**

$$\frac{\partial \mathcal{L}}{\partial w_0} = \frac{2}{N} \sum_{i=1}^{N} (w_0 + w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3} - y_i)$$

$$\frac{\partial \mathcal{L}}{\partial w_1} = \frac{2}{N} \sum_{i=1}^{N} x_{i1}(w_0 + w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3} - y_i)$$

$$\frac{\partial \mathcal{L}}{\partial w_2} = \frac{2}{N} \sum_{i=1}^{N} x_{i2}(w_0 + w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3} - y_i)$$

$$\frac{\partial \mathcal{L}}{\partial w_3} = \frac{2}{N} \sum_{i=1}^{N} x_{i3}(w_0 + w_1 x_{i1} + w_2 x_{i2} + w_3 x_{i3} - y_i)$$


In [None]:
w0_derivative = np.mean(2*(w0-df['y_true']+w1*df['x1']+w2*df['x2']+w3*df['x3']))
w1_derivative = np.mean(2*df['x1']*(w1*df['x1']-df['y_true']+w0+w2*df['x2']+w3*df['x3']))
w2_derivative = np.mean(2*df['x2']*(w2*df['x2']-df['y_true']+w0+w1*df['x1']+w3*df['x3']))
w3_derivative = np.mean(2*df['x3']*(w3*df['x3']-df['y_true']+w0+w1*df['x1']+w2*df['x2']))
print('w0_derivative:', w0_derivative)
print('w1_derivative:', w1_derivative)
print('w2_derivative:', w2_derivative)
print('w3_derivative:', w3_derivative)
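
The four derivatives above can also be written as one vectorized expression. The sketch below uses randomly generated demo data (an assumption of this example, not the notebook's `df`) and checks the analytic gradient against a finite-difference estimate; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))   # demo features, not the notebook's data
y_demo = rng.normal(size=10)        # demo targets
w_demo = np.ones(4)                 # [w0, w1, w2, w3]

def add_intercept(X):
    # Prepend a column of ones so that w0 is handled like any other weight.
    return np.column_stack([np.ones(len(X)), X])

def mse_loss(w, X, y):
    return np.mean((y - add_intercept(X) @ w) ** 2)

def mse_gradient(w, X, y):
    X1 = add_intercept(X)
    return 2 * X1.T @ (X1 @ w - y) / len(y)

analytic = mse_gradient(w_demo, X_demo, y_demo)

# Finite-difference estimate of each partial derivative.
eps = 1e-6
numeric = np.array([
    (mse_loss(w_demo + eps * e, X_demo, y_demo) - mse_loss(w_demo - eps * e, X_demo, y_demo)) / (2 * eps)
    for e in np.eye(4)
])
print(np.round(analytic, 4))
print(np.round(numeric, 4))
```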

### STEP 5
Looks good! Let's use the gradient to update the parameters.

This is done by subtracting a scaled version of the gradient from the current parameter values. The scale factor is called the learning rate, and it determines the size of the step we take towards the minimum. A small learning rate makes the updates slow but stable, while a learning rate that is too large can overshoot the minimum and lead to oscillation or even divergence.

In [None]:
learning_rate = 0.1

w0 = w0 - learning_rate * w0_derivative
w1 = w1 - learning_rate * w1_derivative
w2 = w2 - learning_rate * w2_derivative
w3 = w3 - learning_rate * w3_derivative

df['y_pred'] = w0+w1*df['x1']+w2*df['x2']+w3*df['x3']
mse = mean_squared_error(df['y_true'], df['y_pred'] )
mse
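
The effect of the learning rate described in Step 5 can be seen on a toy problem. This is a minimal sketch with a hypothetical one-parameter loss `w**2` (derivative `2*w`), unrelated to the dataset above.

```python
def descend(learning_rate, steps=20, w=1.0):
    # Toy loss f(w) = w**2 with derivative 2*w; the minimum is at w = 0.
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

w_small_step = descend(0.1)  # each step multiplies w by 0.8, so w shrinks towards 0
w_large_step = descend(1.1)  # each step multiplies w by -1.2, so w oscillates and grows
print(w_small_step, w_large_step)
```

With the small rate the parameter approaches the minimum; with the large rate every step overshoots it by more than the previous distance.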

### STEP 6
Wonderful! We have reduced the error. Let's repeat the previous steps until a stopping criterion is met, such as a maximum number of iterations.

In [None]:
n = 30
loss = []

for _ in range(n):
    df['y_pred'] = w0+w1*df['x1']+w2*df['x2']+w3*df['x3']
    loss.append(mean_squared_error(df['y_true'], df['y_pred']))
    
    w0_derivative = np.mean(2*(w0-df['y_true']+w1*df['x1']+w2*df['x2']+w3*df['x3']))
    w1_derivative = np.mean(2*df['x1']*(w1*df['x1']-df['y_true']+w0+w2*df['x2']+w3*df['x3']))
    w2_derivative = np.mean(2*df['x2']*(w2*df['x2']-df['y_true']+w0+w1*df['x1']+w3*df['x3']))
    w3_derivative = np.mean(2*df['x3']*(w3*df['x3']-df['y_true']+w0+w1*df['x1']+w2*df['x2']))

    w0 = w0 - learning_rate * w0_derivative
    w1 = w1 - learning_rate * w1_derivative
    w2 = w2 - learning_rate * w2_derivative
    w3 = w3 - learning_rate * w3_derivative

In [None]:
# We will plot the loss function values for each training iteration. 
# As we can see, the difference between the actual and predicted target value decreases gradually.
plt.plot(loss)
plt.title(f'MSE Loss: {loss[-1]}')
plt.show()
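
A maximum number of iterations is only one possible stopping criterion. As a hedged sketch, the loop can also stop once the loss no longer improves by more than a tolerance. The function name, the `tol` parameter and the demo data are assumptions of this example.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, max_iter=1000, tol=1e-8):
    X1 = np.column_stack([np.ones(len(X)), X])  # column of ones for the intercept w0
    w = np.ones(X1.shape[1])
    history = []
    for _ in range(max_iter):
        error = X1 @ w - y
        history.append(np.mean(error ** 2))
        # Stop early when the last step improved the loss by less than tol.
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
        w = w - learning_rate * 2 * X1.T @ error / len(y)
    return w, history

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
w_true = np.array([0.5, 2.0, -1.0, 3.0])        # demo weights [w0, w1, w2, w3]
y_demo = w_true[0] + X_demo @ w_true[1:]

w_hat, history = gradient_descent(X_demo, y_demo)
print(len(history), np.round(w_hat, 3))
```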

In [None]:
df.head()

In [None]:
print(w0, w1, w2, w3)

### Finally
Gradient descent is a crucial component of many machine learning algorithms, including linear regression, logistic regression, and neural networks. It works by iteratively adjusting the parameters to reduce the loss, moving them closer and closer to the values that minimize it.
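
For linear regression with MSE, the result of gradient descent can be compared with the exact least-squares solution. The sketch below does this on generated demo data (an assumption of this example) using `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, -3.0, 0.5]) + rng.normal(scale=0.01, size=200)

X1 = np.column_stack([np.ones(len(X_demo)), X_demo])   # intercept column for w0

# Exact least-squares weights.
w_exact, *_ = np.linalg.lstsq(X1, y_demo, rcond=None)

# The same weights found iteratively by gradient descent.
w_gd = np.ones(4)
for _ in range(500):
    w_gd -= 0.1 * 2 * X1.T @ (X1 @ w_gd - y_demo) / len(y_demo)

print(np.round(w_exact, 3))
print(np.round(w_gd, 3))
```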

# GRADIENT BOOSTING
Now we will use 1000 observations and split them into two samples: training and test. Some of the steps differ from the gradient descent example; I will describe them in as much detail as possible.

In [None]:
X, y = make_regression(n_samples=1000, n_features=3, noise=.01, random_state=42)

# Split the sample into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Represent data in a table for easier visualization.
train = pd.DataFrame(X_train, columns=['x1', 'x2', 'x3'])
train['y_true'] = y_train

### Step 1
Let's make a simple constant prediction of the target variable by taking the mean of the true target values. Then we will calculate the error.

In [None]:
train['y_pred'] = np.mean(train['y_true'])
train.head()

In [None]:
# Let's compute the loss of the constant prediction
mse = mean_squared_error(train['y_true'], train['y_pred'])
mse
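
Why start from the mean? Among all constant predictions, the mean gives the smallest MSE. A minimal sketch with toy numbers (assumed for this example) checks this by trying a grid of constants.

```python
import numpy as np

y_demo = np.array([1.0, 2.0, 6.0])         # toy target values, mean is 3.0
candidates = np.linspace(0, 6, 61)         # constants 0.0, 0.1, ..., 6.0
mse_per_candidate = [np.mean((y_demo - c) ** 2) for c in candidates]

best_constant = candidates[int(np.argmin(mse_per_candidate))]
print(best_constant, y_demo.mean())
```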

### Step 2
So, our task is to minimize the value of the loss function. To do this, we look at the derivative of the loss with respect to the prediction for each observation. For the squared error, the negative of this derivative is proportional to the residual, the difference between the actual and predicted target value, so we simply compute the residuals.

In [None]:
train['residuals'] = train['y_true'] - train['y_pred']
train.head()
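
The link between residuals and derivatives can be checked numerically. This sketch uses toy numbers and the per-observation loss `0.5 * (y - prediction)**2` (the factor 0.5 is an assumption chosen so that the negative derivative equals the residual exactly).

```python
import numpy as np

y_true_demo = np.array([10.0, -4.0, 7.0])          # toy targets
y_pred_demo = np.full(3, y_true_demo.mean())       # constant prediction

def half_squared_error(pred):
    # Loss of each observation separately.
    return 0.5 * (y_true_demo - pred) ** 2

# Derivative of each observation's loss with respect to its own prediction.
eps = 1e-6
numeric_grad = (half_squared_error(y_pred_demo + eps) - half_squared_error(y_pred_demo - eps)) / (2 * eps)

residuals_demo = y_true_demo - y_pred_demo
print(-numeric_grad)
print(residuals_demo)
```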

### Step 3
Let's create a simple decision tree. For clarity we use the smallest useful depth, max_depth=1, which means a single split.

In [None]:
tree = DecisionTreeRegressor(random_state=42, max_depth=1)

### Step 4
Train the tree to predict the residuals. DecisionTreeRegressor chooses a split and then, for each leaf, the predicted value that minimizes the squared error of the residuals that fall into that leaf.

This value is the mean of the residuals in the leaf. DecisionTreeRegressor will use it as the prediction for every observation in that leaf.

Because we use a tree with max_depth = 1, we will have only two leaves and therefore only two mean values.

In [None]:
tree.fit(train[['x1', 'x2', 'x3']], train['residuals'])
train['output (res_pred)'] = tree.predict(train[['x1', 'x2', 'x3']])
train.head()
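
The claim that each leaf predicts the mean residual can be verified directly. The sketch below fits a depth-1 tree on generated demo data (not the notebook's `train`) and uses `apply` to find which leaf each observation falls into.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=.01, random_state=0)
residuals_demo = y_demo - y_demo.mean()

stump = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X_demo, residuals_demo)
pred_demo = stump.predict(X_demo)
leaf_ids = stump.apply(X_demo)        # index of the leaf each observation ends up in

# Mean residual of the observations in each leaf.
leaf_means = {leaf: residuals_demo[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}
print(leaf_means)
print(np.unique(pred_demo))
```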

### Step 5
It's time for boosting. Let's improve our model's prediction. Just like last time, we'll train the model gradually, scaling each correction with the learning rate (we reuse learning_rate = 0.1 from the gradient descent section).

But now we use a decision tree instead of a linear function, and the tree predicts the residuals instead of the target value.

We improve the prediction by adding the predicted residual, multiplied by the learning rate, to the current prediction.

In [None]:
train['y_pred'] +=  learning_rate * train['output (res_pred)']
train.head()

In [None]:
mse = mean_squared_error(train['y_true'], train['y_pred'])
mse

### Step 6
Not bad. We were able to improve the prediction quality. Let's repeat.

In [None]:
n = 2000

# Lists for storing the values of the loss function and the trees.
# The trees will be needed later for predicting target values in the test sample.
loss = []
# Keep the tree fitted in Step 4: its correction is already part of train['y_pred'].
trees = [tree]

In [None]:
for _ in range(n):
    train['residuals'] = train['y_true'] - train['y_pred']
    tree = DecisionTreeRegressor(random_state=42, max_depth=1)
    tree.fit(train[['x1', 'x2', 'x3']], train['residuals'])
    train['output (res_pred)'] = tree.predict(train[['x1', 'x2', 'x3']])
    train['y_pred'] +=  learning_rate * train['output (res_pred)']
    loss.append(mean_squared_error(train['y_true'], train['y_pred']))
    trees.append(tree)

In [None]:
plt.plot(loss)
plt.title(f'MSE Loss {loss[-1]}')
plt.show()
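
The same procedure can be packaged into two small functions. This is a sketch under stated assumptions: the names `fit_boosting` and `predict_boosting` are hypothetical, and the demo data is generated here rather than taken from `train`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def fit_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=1):
    base = float(np.mean(y))                 # constant starting prediction
    pred = np.full(len(y), base)
    fitted = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        tree.fit(X, y - pred)                # fit the current residuals
        pred += learning_rate * tree.predict(X)
        fitted.append(tree)
    return base, fitted

def predict_boosting(X, base, fitted, learning_rate=0.1):
    pred = np.full(X.shape[0], base)
    for tree in fitted:
        pred += learning_rate * tree.predict(X)
    return pred

X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=.01, random_state=0)
base_demo, trees_demo = fit_boosting(X_demo, y_demo)

mse_start = np.mean((y_demo - base_demo) ** 2)
mse_end = np.mean((y_demo - predict_boosting(X_demo, base_demo, trees_demo)) ** 2)
print(mse_start, mse_end)
```

Note that prediction must use the same constant and the same learning rate as training, otherwise the corrections no longer add up.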

In [None]:
train.head()

### Step 7
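Let's apply the trained trees to data that the model has not seen. We start from the same constant, the mean of the training targets, and add each tree's scaled correction in turn.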

In [None]:
test = pd.DataFrame(X_test, columns=['x1', 'x2', 'x3'])
test['y_true'] = y_test
test['y_pred'] = y_train.mean()

In [None]:
loss = []

for tree in trees:
    test['output (res_pred)'] = tree.predict(test[['x1', 'x2', 'x3']])
    test['y_pred'] +=  learning_rate * test['output (res_pred)']
    loss.append(mean_squared_error(test['y_true'], test['y_pred']))

In [None]:
plt.plot(loss)
plt.title(f'MSE Loss {loss[-1]}')
plt.show()

In [None]:
test.head()
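
For comparison, scikit-learn ships a ready-made implementation, `GradientBoostingRegressor`. The sketch below fits it with settings similar to ours (depth-1 trees, learning rate 0.1); the number of trees and the comparison against a mean baseline are choices made for this example, and the results are not expected to match the hand-written loop exactly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=1000, n_features=3, noise=.01, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, max_depth=1, random_state=42)
gbr.fit(X_tr, y_tr)

# Baseline: always predict the mean of the training targets.
baseline_mse = mean_squared_error(y_te, np.full(len(y_te), y_tr.mean()))
gbr_mse = mean_squared_error(y_te, gbr.predict(X_te))
print(baseline_mse, gbr_mse)
```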

# LET'S PRACTICE
We will practice on the popular House Prices dataset. Our main goal is to understand how GBDT works, so we will not focus on EDA, data preprocessing and feature engineering.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
test_id = test['Id'].copy()
sample_sub = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv')
print(train.shape)
print(test.shape)

In [None]:
train_test = pd.concat([train, test])
train_test = train_test.reset_index(drop=True)
train_test.shape

In [None]:
for col_name in ['PoolQC', 'Alley', 'MiscFeature', 'FireplaceQu', 'Fence']:
    train_test[col_name] = train_test[col_name].fillna('None')

for col_name in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'GarageYrBlt',
                 'BsmtQual', 'BsmtCond', 'BsmtFinType1', 'BsmtFinType2', 'BsmtExposure']:
    train_test[col_name] = train_test[col_name].fillna('No')

for col_name in ['MasVnrArea', 'BsmtFinSF2', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath',
                 'BsmtUnfSF', 'BsmtFinSF1', 'GarageArea', 'LotFrontage']:
    train_test[col_name] = train_test[col_name].fillna(0)

for col_name, value in zip(['MSZoning', 'Utilities', 'Functional', 'Exterior1st',
                            'KitchenQual', 'Electrical', 'SaleType', 'Exterior2nd', 'MasVnrType', 'GarageCars'],
                           ['RL', 'AllPub', 'Typ', 'VinylSd', 'TA', 'SBrkr', 'WD', 'VinylSd', 'None', 2]):
    train_test[col_name] = train_test[col_name].fillna(value)

train_test.isna().sum()

In [None]:
numerical_features = train_test.select_dtypes(include = ['float64', 'int64']).columns.drop('SalePrice')
categorical_features = train_test.select_dtypes(include = 'object').columns
predictors = list(numerical_features) + list(categorical_features)
std = StandardScaler()
target = 'SalePrice'



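def encoder(df, categorical_cols, numerical_cols, target):
    """Label-encode the categorical columns, standardize the numerical ones, keep the target."""
    df = df.copy()  # do not modify the caller's DataFrame
    y = df[target]
    # Cast to str so that columns with mixed values can be label-encoded.
    df[categorical_cols] = df[categorical_cols].astype('str')
    df_lbl = df[categorical_cols].apply(LabelEncoder().fit_transform)
    num_cols_std_scaled = std.fit_transform(df[numerical_cols])
    # Reuse df's index so that the concatenation below aligns row by row.
    df_num = pd.DataFrame(num_cols_std_scaled, columns=numerical_cols, index=df.index)
    df = pd.concat([df_lbl, df_num], axis=1)
    df = pd.concat([df, y], axis=1)
    return df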


train_test = encoder(train_test, categorical_features, numerical_features, target)

train = train_test[train_test[target].notnull()].copy()
test = train_test[train_test[target].isna()].copy()
test = test.drop([target], axis=1)

print(train.shape)
print(test.shape)
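
To see what the two preprocessing steps inside `encoder` do, here is a minimal sketch on a toy table. The column names and values are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

toy = pd.DataFrame({
    'Quality': ['Gd', 'TA', 'Ex', 'TA'],       # made-up categorical column
    'Area': [100.0, 200.0, 300.0, 400.0],      # made-up numerical column
})

# LabelEncoder maps each category to an integer, in sorted order of the labels.
codes = LabelEncoder().fit_transform(toy['Quality'])

# StandardScaler shifts and rescales a column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy[['Area']]).ravel()

print(codes)
print(scaled.round(3))
```

Label encoding imposes an arbitrary order on the categories. Tree-based models can usually cope with that, which is one reason the notebook gets away with such simple preprocessing.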

### LET'S GO

In [None]:
train['Predictions'] = train[target].mean()
train[[target, 'Predictions']].head()

In [None]:
learning_rate = 0.01

In [None]:
train['Residuals'] = train[target] - train['Predictions']
train[[target, 'Predictions', 'Residuals']].head()

In [None]:
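# RMSE between the logarithms of the true and predicted prices.
# np.sqrt is used because newer scikit-learn releases no longer accept squared=False.
rmse = np.sqrt(mean_squared_error(np.log(train[target]), np.log(train['Predictions'])))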
rmse
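
The metric used here, RMSE between the logarithms of the true and predicted values, can be wrapped in a small helper. The function name `rmse_log` and the example prices are assumptions of this sketch.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    # Root mean squared error between log(y_true) and log(y_pred).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))

score_same = rmse_log([100000, 200000], [100000, 200000])    # perfect predictions
score_double = rmse_log([100000, 200000], [200000, 400000])  # every prediction twice too high
print(score_same, score_double)
```

Because of the logarithm, the score depends on the ratio between prediction and truth, so an error of the same percentage costs the same for cheap and expensive houses.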

In [None]:
tree = DecisionTreeRegressor(random_state=42, max_depth=4, min_samples_split=13, max_features=0.74)

In [None]:
tree.fit(train[predictors], train['Residuals'])
train['Output'] = tree.predict(train[predictors])
train[[target, 'Predictions', 'Residuals', 'Output']].head()

In [None]:
train['Predictions'] +=  learning_rate * train['Output']
train[[target, 'Predictions']].head()

In [None]:
rmse = np.sqrt(mean_squared_error(np.log(train[target]), np.log(train['Predictions'])))
rmse

In [None]:
n = 1000

loss_train = []
loss_val = []
# Keep the tree fitted above: its correction is already part of train['Predictions'].
trees = [tree]

In [None]:
# Split once, before the loop. Re-splitting the re-concatenated frame on every iteration
# would put different rows into the validation part each time and leak them into training.
train_sample, val_sample = train_test_split(train, train_size=0.8, random_state=42)
train_sample, val_sample = train_sample.copy(), val_sample.copy()

for _ in tqdm(range(n)):
    train_sample['Residuals'] = train_sample[target] - train_sample['Predictions']
    tree = DecisionTreeRegressor(random_state=42, max_depth=4, min_samples_split=13, max_features=0.74)
    tree.fit(train_sample[predictors], train_sample['Residuals'])
    train_sample['Output'] = tree.predict(train_sample[predictors])
    train_sample['Predictions'] += learning_rate * train_sample['Output']
    val_sample['Output'] = tree.predict(val_sample[predictors])
    val_sample['Predictions'] += learning_rate * val_sample['Output']
    # np.sqrt(MSE) replaces squared=False, which newer scikit-learn releases no longer accept.
    loss_train.append(np.sqrt(mean_squared_error(np.log(train_sample[target]), np.log(train_sample['Predictions']))))
    loss_val.append(np.sqrt(mean_squared_error(np.log(val_sample[target]), np.log(val_sample['Predictions']))))
    trees.append(tree)

# Put the two parts back together so that later cells can keep using `train`.
train = pd.concat([train_sample, val_sample])
loss = pd.DataFrame({'Train': loss_train, 'Val': loss_val})

In [None]:
loss.plot()
plt.show()
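
A validation curve like this one is often used to decide how many trees to keep. The sketch below shows the idea on made-up loss values (they are not results from this notebook): pick the iteration with the lowest validation loss.

```python
import pandas as pd

# Made-up loss curves for illustration only.
loss_demo = pd.DataFrame({
    'Train': [0.40, 0.30, 0.22, 0.17, 0.13],
    'Val':   [0.41, 0.33, 0.29, 0.30, 0.32],
})

best_iteration = int(loss_demo['Val'].idxmin())   # row with the lowest validation loss
n_trees_to_keep = best_iteration + 1              # iterations are counted from 0
print(best_iteration, n_trees_to_keep)
```

In this toy curve the training loss keeps falling while the validation loss starts to rise after the third iteration, which is the usual sign of overfitting.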

In [None]:
loss

In [None]:
rmse = np.sqrt(mean_squared_error(np.log(train[target]), np.log(train['Predictions'])))
rmse

In [None]:
train[[target, 'Predictions']].head(10)

In [None]:
test['Predictions'] = train[target].mean()
for tree in trees:
    test['Output'] = tree.predict(test[predictors])
    test['Predictions'] +=  learning_rate * test['Output']

In [None]:
# Build the submission with the same column names as the sample submission.
prediction = pd.DataFrame({
    sample_sub.columns[0]: test_id.to_numpy(),
    sample_sub.columns[1]: test['Predictions'].to_numpy(),
})
prediction.to_csv('submission.csv', index=False)

In [None]:
prediction.head()
