<a href="https://colab.research.google.com/github/muyeblog/implementAlgorithmFromScratch/blob/main/GradientBoostingInPythonFromScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Gradient Boosting in Python from Scratch

https://towardsdatascience.com/gradient-boosting-in-python-from-scratch-788d1cf1ca7

The aim of this article is to explain every bit of the popular and often **mysterious gradient boosting algorithm** using Python code and visualizations. Boosting is the key idea behind competition-winning algorithms such as CatBoost, AdaBoost, and XGBoost, so knowing what boosting is, what the gradient is, and how the two are linked in creating an algorithm is a must for any modern machine learning practitioner.


The implementation and animations of gradient boosting for regression in Python can be accessed in my repo: https://github.com/Eligijus112/gradient-boosting

![](https://miro.medium.com/max/1400/1*cQ69MSfNojO9i_asf432_A.jpeg)

The main picture of the article depicts the process of evolution and how, over a long period of time, the beak size of a bird species adapts to its surroundings: https://en.wikipedia.org/wiki/Darwin%27s_finches

Just as animals adapt in various ways to new facts in their habitats, machine learning algorithms adapt to the data environment we put them in. The main idea behind the gradient boosting algorithm is that its main engine is a simple, low-accuracy algorithm that learns from its own previous mistakes.

At every iteration, not only are the errors used to adjust the model, but the models from previous iterations are invoked as well. Thus, with every pass over the data, the gradient boosting model becomes more and more complex, because it adds more and more simple models together.

In a nutshell, the simplified equation of many gradient boosting algorithms is a recursion:

$F_m(x) = F_{m-1}(x) + \alpha G_m(x)$

The current value $F_m$ (think of it as the present) uses the past information $F_{m-1}$ and gets adjusted by new present evidence $G_m$ with a certain weight $\alpha$.
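To make the recursion concrete, here is a minimal sketch in plain Python. The names `boosted_prediction`, `f0`, `learners`, and `alpha` are illustrative assumptions rather than part of the original implementation, and the two lambdas are toy stand-ins for fitted models $G_m$.

```python
def boosted_prediction(x, f0, learners, alpha):
    """Evaluate F_M(x) by unrolling F_m(x) = F_{m-1}(x) + alpha * G_m(x)."""
    prediction = f0  # F_0: the starting guess (hypothetical value)
    for g in learners:  # each g plays the role of one G_m
        prediction = prediction + alpha * g(x)
    return prediction


# Two toy "corrections" standing in for fitted weak learners (assumptions)
learners = [lambda x: 2.0 * x, lambda x: -1.0]

result = boosted_prediction(x=3.0, f0=10.0, learners=learners, alpha=0.5)
print(result)  # 10 + 0.5 * 6 + 0.5 * (-1) = 12.5
```

Each pass through the loop is one step of the recursion: the running prediction is the past, and the scaled output of the next learner is the new evidence.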

In the article below, we will dive deeper into the nitty-gritty details of gradient boosting. I hope that after going through all the code and explanations, the reader will see that gradient boosting, while sounding intimidating, is not that complex.

Let us imagine we want a model that predicts how many miles per gallon a car can drive based on the car's weight. The data can be accessed from here:

https://archive.ics.uci.edu/ml/datasets/auto+mpg


The data with all features:

![](https://miro.medium.com/max/1400/1*uvc-O_fnX5pXCq0Mzpb-9A.png)

The relationship in question:

![](https://miro.medium.com/max/1400/1*oPekoKlRkgLdkT2b3aGGAw.png)
mpg ~ weight; Graph by author



There is a clear relationship: the heavier the car, the fewer miles per gallon it can go. Let us try to fit a base learner to the data and see how it performs.

Let us start building a gradient boosting machine learning algorithm to model this relationship.

The name of the regression gradient boosting algorithm contains three very broad machine learning terms:

- Regression
- Gradient
- Boosting

It is important to have at least an intuitive understanding of each standalone definition before trying to merge them together.


### Regression:
Regression in machine learning means finding a relationship $f$ between the average value of a continuous variable $Y$ and features $X$.

  $E[Y] = f(X)$

In regression, the variable we are trying to predict is continuous, that is, it can take an infinite number of values. Examples include human weight, the speed at which a person will run a race in the Olympics, a person's wage, etc.

We try to explain the $Y$ variable using features $X$. For example, we can say that a person's wage (the $Y$ variable) is determined by his or her experience in a workplace, academic achievements, number of certificates, etc. (the $X$ variables).
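As a small illustration of estimating $f$, the sketch below fits a straight line by ordinary least squares in plain Python. The wage and experience numbers are made up for this example, and `fit_line` is a hypothetical helper, not something from the original article.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up data: years of experience (X) and wage in thousands (Y)
experience = [1, 2, 3, 4, 5]
wage = [32, 35, 41, 44, 48]

slope, intercept = fit_line(experience, wage)
print(f"E[wage] is approximately {intercept:.1f} + {slope:.1f} * experience")
```

Here the fitted line plays the role of $f$: for a given level of experience, it returns the estimated average wage.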



### The gradients
The gradient of a function in mathematics is a vector in which each coordinate is the partial derivative of the function with respect to one of its arguments.

$\nabla f(x_1,...,x_p) = [\frac{\partial f}{\partial x_1},...,\frac{\partial f}{\partial x_p}]$

Gradient definition

The gradient is widely used in popular algorithms for finding the function argument that minimizes or maximizes the function. This is because, for a function of one argument, a negative gradient at a point $x$ means the function is decreasing at $x$, and a positive gradient means the function is increasing there. In several dimensions the same holds coordinate by coordinate, and the negative gradient points in the direction of steepest descent. This is why a lot of loss functions in machine learning try to be as simple as possible (the fewer arguments the better) and differentiable (so that the gradient can be found).
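The following sketch shows how the sign of the gradient can be used to find a minimum. It approximates the gradient with central differences and repeatedly steps against it. The function `loss`, the step size, and the number of iterations are arbitrary choices made for illustration.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    gradient = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        gradient.append((f(forward) - f(backward)) / (2 * h))
    return gradient


def loss(p):
    # A simple differentiable function with its minimum at (1, -2)
    return (p[0] - 1) ** 2 + (p[1] + 2) ** 2


point = [0.0, 0.0]
step_size = 0.1
for _ in range(200):
    grad = numerical_gradient(loss, point)
    # Move against the gradient, i.e. in the direction of steepest descent
    point = [p - step_size * g for p, g in zip(point, grad)]

print(point)  # close to [1.0, -2.0]
```

At the starting point the gradient is roughly `[-2, 4]`, so the first step increases the first coordinate and decreases the second, which is exactly the direction of the minimum.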



### Boosting
The term boosting in machine learning means a process of fitting a set of weak learners to the training data, where each weak learner iteratively learns from the previous learners' mistakes.

A weak learner is a machine learning algorithm that is quick to fit to data but has relatively poor performance in terms of accuracy, mean squared error, or other metrics.

![](https://miro.medium.com/max/1400/1*Gl4S98dlRJqMh4jR4eDOJg.png)

In the above schema, each ML model is a weak learner. Each error is calculated using past predictions. Each subsequent model tries to fit the previous errors using the original features (just take a leap of faith here and follow along with the article; I will explain this process in detail later).

The final prediction is a weighted sum of the outputs of all the fitted ML models (AdaBoost, for comparison, combines weak classifiers through a weighted majority vote):

![](https://miro.medium.com/max/1400/1*UNNud7AEYk63-wCSk95mMA.png)
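The schema can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation from the article: the weak learner is a one-split regression tree (a "stump"), every learner gets the same weight `alpha`, the errors are plain residuals, and the weight and mpg numbers are made up.

```python
def mean(values):
    return sum(values) / len(values)


def mse(y, predictions):
    return mean([(a - b) ** 2 for a, b in zip(y, predictions)])


def fit_stump(xs, targets):
    """Fit a one-split regression tree and return it as a prediction function."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, alpha):
    """Fit each new stump to the errors left by the previous ones."""
    base = mean(ys)  # the starting prediction
    learners = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        predictions = [p + alpha * stump(x) for p, x in zip(predictions, xs)]
    return base, learners


def predict(x, base, learners, alpha):
    # Final prediction: starting value plus the weighted sum of all learners
    return base + alpha * sum(g(x) for g in learners)


# Made-up data: car weight and miles per gallon
weights = [1800, 2100, 2500, 2900, 3300, 3800, 4300, 4700]
mpg = [36.0, 33.0, 28.0, 24.0, 20.0, 17.0, 14.0, 12.0]

alpha = 0.3
base, learners = boost(weights, mpg, n_rounds=20, alpha=alpha)
initial_mse = mse(mpg, [base] * len(mpg))
final_mse = mse(mpg, [predict(x, base, learners, alpha) for x in weights])
print(initial_mse, final_mse)
```

On the training data, the error after boosting is lower than the error of the constant starting prediction, even though each individual stump is a very crude model.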

**Regression gradient boosting** is an algorithm that combines all three of the above ideas into one machine learning method.

Before diving deeper into how the three ideas are interlinked, we need to select a weak base learner for the boosting part. A very popular choice is a *regression decision tree*. To learn more about regression decision trees, check out my article:

[Regression Tree in Python From Scratch](https://towardsdatascience.com/regression-tree-in-python-from-scratch-9b7b64c815e3)

A quick recap:
![](https://miro.medium.com/max/1278/1*e_e1-w8AofrHUt2WuJIzig.png)

Regression tree schema


Each node in a tree has the Y values and X features saved to it. Additionally, each node has:
- The residuals, obtained by subtracting the mean of Y from each of the Y values.
- The mean squared error of the residuals.
- The best splitting feature and the best split value for the further creation of nodes.
- All the initial hyperparameters.
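The quantities listed above can be computed by hand for a tiny node. The sketch below uses four made-up observations and a hypothetical `weighted_mse` helper. It searches the midpoints between neighbouring feature values for the split with the lowest weighted MSE.

```python
def node_mse(values):
    """MSE of a node: mean of the squared residuals around the node mean."""
    node_mean = sum(values) / len(values)
    return sum((v - node_mean) ** 2 for v in values) / len(values)


def weighted_mse(left, right):
    """MSE of the two children, weighted by their share of observations."""
    n_total = len(left) + len(right)
    return (node_mse(left) * len(left) / n_total
            + node_mse(right) * len(right) / n_total)


# Made-up node data: one feature x and the target y
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 12.0, 20.0, 22.0]

y_mean = sum(y) / len(y)              # prediction of the node: 16.0
residuals = [v - y_mean for v in y]   # [-6.0, -4.0, 4.0, 6.0]
root_mse = node_mse(y)                # 26.0

best_threshold, best_mse = None, root_mse
for i in range(len(x) - 1):
    threshold = (x[i] + x[i + 1]) / 2  # x is already sorted here
    left = [t for f, t in zip(x, y) if f <= threshold]
    right = [t for f, t in zip(x, y) if f > threshold]
    candidate = weighted_mse(left, right)
    if candidate < best_mse:
        best_threshold, best_mse = threshold, candidate

print(best_threshold, best_mse)  # 2.5 1.0
```

Splitting at 2.5 separates the two low targets from the two high ones, which reduces the MSE from 26.0 at the root to a weighted 1.0 in the children.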




## The Python implementation of a regression tree:

In [18]:
# Data wrangling

import pandas as pd
# Infinity constant
from math import inf


class Tree():
  """
  Class to fit a regression tree to the given data
  """
  def __init__(
      self,
      d:pd.DataFrame,
      y_var:str, # name of the label (target) column
      x_vars:list, # list of feature column names
      max_depth:int = 4,
      min_sample_leaf:int = 2
  ):
    """
    Class to create the regression tree object.

    Arguments
    ----------
    d: pd.DataFrame
      The dataframe to create the tree from
    y_var: str
      The name of the target column
    x_vars: list
      The names of the features to use in the tree
    max_depth: int
      The maximum depth of the tree
    min_sample_leaf: int
      The minimum number of observations in each of the subtrees after
      splitting
    """
    # Saving the names of y variable and x features
    self.y_var = y_var  # name of the target column
    self.features = x_vars  # names of the feature columns

    # Saving the node data to memory
    # (d is a pandas DataFrame; only the target and feature columns are kept)
    self.d = d[[y_var] + x_vars]  # wrap y_var in a list so it can be concatenated with the list x_vars

    # Saving the data to the node
    self.Y = d[y_var].values.tolist()

    # Saving the number of observations in the node.
    self.n = len(d)

    # Initializing the depth counter
    self.depth = 0

    # Saving the maximum depth of the tree
    self.max_depth = max_depth

    # Saving the minimum number of samples required in each child after splitting
    self.min_sample_leaf = min_sample_leaf

    # Calculating the mse
    self.get_y_mse()

    # Inferring the best split
    self.get_best_split()

    # Saving to memory the y mean (prediction of the node)
    self.get_y_mean()
  
  # `x:list` is a type hint for the argument; `-> float` is a type hint for the return value.
  # Type hints only document the expected types; Python does not enforce them at runtime.
  @staticmethod # static method: can be called on the class itself or on an instance
  def get_mean(x:list) -> float:
    """
    Calculates the mean over a list of float elements
    """
    # Initializing the sum counter
    _sum = 0

    # Inferring the length of the list
    _n = len(x)

    if _n == 0:
      return inf

    # Iterating through the values
    for _x in x:
      _sum += _x
    
    return _sum / _n
  
  def get_y_mean(self) -> None:
    """
    Saves the current node's mean
    """
    self.y_mean = self.get_mean(self.Y)
  
  def get_mse(self,x:list) -> float:
    """
    Calculates the mse of a given list by subtracting the mean,
    summing and dividing by n
    """
    # Inferring the length of the list
    _n = len(x)

    if _n == 0:
      return inf
    
    # Calculating the mean
    _mean = self.get_mean(x)

    # Getting the residuals
    residuals = [_x - _mean for _x in x]

    # Squaring the residuals
    residuals = [r**2 for r in residuals]

    # Summing the residuals
    _r_sum = 0
    for r in residuals:
      _r_sum += r
    
    # Returning the mean squared error
    return _r_sum / _n
  
  def get_y_mse(self) -> None:
    """
    Method to calculate the MSE of the current node
    """
    self.mse = self.get_mse(self.Y)
  
  def get_mse_weighted(self,y_left:list,y_right:list):
    """
    Calculates the weighted mse given two lists
    """
    # Calculates the length of both values
    _n_left = len(y_left)
    _n_right = len(y_right)
    _n_total = _n_left + _n_right

    # Calculating the mse of each side
    _mse_left = self.get_mse(y_left)
    _mse_right = self.get_mse(y_right)

    # Calculating the weighted mse
    return (_mse_left * _n_left / _n_total) + (_mse_right * _n_right / _n_total)
  
  def get_best_split(self):
    """
    Method to find the best split among the features

    The logic is to find the feature and the feature value which reduce
    the object mse the most
    """
    #  Setting initial values
    _best_mse = self.mse
    _best_feature = None
    _best_feature_value = None

    # Creating lists of categorical and numerical features
    _cat_features = [ft for ft in self.d.columns if self.d[ft].dtype == 'category']
    _num_features = list(set(self.features) - set(_cat_features))

    # Going through the categorical features
    for _cat_feature in _cat_features:
      # Inferring the levels of the categorical feature
      _levels = self.d[_cat_feature].unique()

      for _level in _levels:
        # Splitting the data into two parts: one that is equal to the categorical level
        # and one that is not
        _y_left = self.d.loc[self.d[_cat_feature] == _level, self.y_var].values
        _y_right = self.d.loc[self.d[_cat_feature] != _level, self.y_var].values

        if len(_y_left) >= self.min_sample_leaf and len(_y_right) >= self.min_sample_leaf:
          # Calculating the weighted mse
          _mse_w = self.get_mse_weighted(_y_left,_y_right)

          # Checking the values
          if _mse_w < _best_mse:
            _best_mse = _mse_w
            _best_feature = _cat_feature
            _best_feature_value = str(_level) # Casting to str so that later splitting can recognise a categorical split
          
    # Going through the numerical features
    for _num_feature in _num_features:
      # Getting the values
      _values = self.d[_num_feature].values

      # Getting the unique entries
      _values = list(set(_values))

      # Sorting the values
      _values.sort()


      # Getting the midpoints between consecutive unique values of the feature
      # and splitting the dataset by that value
      for i in range(len(_values) - 1):
        # Midpoint of two neighbouring values
        _left = _values[i]
        _right = _values[i + 1]
        _mean = (_left + _right)/2

        # Iterating over the values and calculating the mse
        _y_left = self.d.loc[self.d[_num_feature] <= _mean,self.y_var].values
        _y_right = self.d.loc[self.d[_num_feature] > _mean,self.y_var].values

        if len(_y_left) >= self.min_sample_leaf and len(_y_right) >= self.min_sample_leaf:
          # Getting the weighted mse
          _mse_w = self.get_mse_weighted(_y_left,_y_right)

          # Checking whether this split improves on the best mse found so far
          if _mse_w < _best_mse:
            _best_mse = _mse_w
            _best_feature = _num_feature
            _best_feature_value = _mean

    # Saving the best splits to object memory
    self.best_feature = _best_feature
    self.best_feature_value = _best_feature_value
  
  def fit(self):
    """
    The recursive method to fit a regression tree on the data provided
    """
    if self.depth < self.max_depth:
      # Splitting the data depending on the found best splits
      _best_feature = self.best_feature
      _best_feature_value = self.best_feature_value

      # Splitting the data for the creation of additional subtrees
      _d_left = pd.DataFrame()
      _d_right = pd.DataFrame()
      if isinstance(_best_feature_value,str): # a split value of type str marks a categorical feature
        _d_left = self.d[self.d[_best_feature] == _best_feature_value].copy()
        _d_right = self.d[self.d[_best_feature] != _best_feature_value].copy()
      else:
        _d_left = self.d[self.d[_best_feature] <= _best_feature_value].copy()
        _d_right = self.d[self.d[_best_feature] > _best_feature_value].copy()

      # Creating the tree instances
      _left_tree = Tree(
          d = _d_left,
          y_var = self.y_var,
          x_vars = self.features,
          min_sample_leaf = self.min_sample_leaf,
          max_depth = self.max_depth
      )
      _right_tree = Tree(
          d = _d_right,
          y_var = self.y_var,
          x_vars = self.features,
          min_sample_leaf = self.min_sample_leaf,
          max_depth = self.max_depth
      )

      # Setting the depths
      _left_tree.depth = self.depth + 1
      _right_tree.depth = self.depth + 1

      # Defining the rules for the left and right subtrees
      _left_symbol = '<='
      _right_symbol = '>'

      if isinstance(_best_feature_value,str):
        _left_symbol = '=='
        _right_symbol = '!='
      

      _rule_left = f"{_best_feature} {_left_symbol} {_best_feature_value}"
      _rule_right = f"{_best_feature} {_right_symbol} {_best_feature_value}"

      _left_tree.rule = _rule_left
      _right_tree.rule = _rule_right

      # Saving the pointers in memory
      self.left = _left_tree
      self.right = _right_tree

      # Continuing the recursive process
      self.left.fit()
      self.right.fit()
  
  def print_info(self,width=4):
    """
    Method to print the information about the tree
    """
    # Defining the number of spaces
    const = int(self.depth * width ** 1.5)
    spaces = "-" * const

    if self.depth == 0:
      print(f"Root (level {self.depth})")
    else:
      print(f"|{spaces} Split rule: {self.rule} (level {self.depth})")
    print(f"{' ' * const} | MSE of the node: {round(self.mse,2)}")
    print(f"{' ' * const} | Count of observations in node: {self.n}")
    print(f"{' ' * const} | Prediction of node: {round(self.y_mean,3)}")
  
  def print_tree(self):
    """
    Prints the whole tree from the current node to the bottom
    """
    self.print_info()

    if self.depth < self.max_depth:
      self.left.print_tree()
      self.right.print_tree()
  
  def predict(self,x:dict) -> float:
    """
    Returns the predicted Y value based on the X values
    
    Arguments
    ---------
    x: dict
      Dictionary of the structure:
      {
        "feature_name": value,
      }
    Returns
    -------
    The mean Y of the leaf node that x falls into
    """
    # Starting from the current node and walking down the tree
    _node = self
    while _node.depth < self.max_depth:

      # Extracting the best split feature and values
      _best_feature = _node.best_feature
      _best_feature_value = _node.best_feature_value

      # Checking if the feature is categorical or numerical
      if isinstance(_best_feature_value, str):
        if x[_best_feature] == _best_feature_value: # categorical feature
          _node = _node.left
        else:
          _node = _node.right
      else: # numerical feature
        if x[_best_feature] <= _best_feature_value:
          _node = _node.left
        else:
          _node = _node.right
    
    # Returning the prediction
    return _node.y_mean


In [2]:
import pandas as pd
d = pd.DataFrame
x_vars = ['name','num']
y_var = 'price'
type(x_vars)

list

In [3]:
# The names mean apple, pear, and strawberry
data = {'name':['苹果','梨','草莓'],
       'num':[3,2,5],
       'price':[10,9,8]}
df = pd.DataFrame(data)
print(df)

  name  num  price
0   苹果    3     10
1    梨    2      9
2   草莓    5      8


In [37]:
[y_var]+x_vars

['price', 'name', 'num']

In [49]:
# x_var:list,y_var:str
# z=x_vars + y_var # can only concatenate list (not "str") to list
# z = y_var + x_vars # can only concatenate str (not "list") to str
z  = [y_var] + x_vars
print(z)

['price', 'name', 'num']


In [44]:
df[['price', 'name', 'num']]

|   | price | name | num |
|---|-------|------|-----|
| 0 | 10 | 苹果 | 3 |
| 1 | 9 | 梨 | 2 |
| 2 | 8 | 草莓 | 5 |


In [51]:
df[y_var]

0    10
1     9
2     8
Name: price, dtype: int64

In [45]:
df[ [y_var] + x_vars ]

|   | price | name | num |
|---|-------|------|-----|
| 0 | 10 | 苹果 | 3 |
| 1 | 9 | 梨 | 2 |
| 2 | 8 | 草莓 | 5 |


In [50]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In [32]:
list(e.columns)

[0]

In [54]:
x = [1,2,3,4]
print(x)
x = [r**2 for r in x]
print(x)

[1, 2, 3, 4]
[1, 4, 9, 16]


In [4]:
df

|   | name | num | price |
|---|------|-----|-------|
| 0 | 苹果 | 3 | 10 |
| 1 | 梨 | 2 | 9 |
| 2 | 草莓 | 5 | 8 |


In [8]:
df[df['num']==3]

|   | name | num | price |
|---|------|-----|-------|
| 0 | 苹果 | 3 | 10 |


In [9]:
2**1.5

2.8284271247461903

In [13]:
import numpy as np
np.sqrt(2**3)

2.8284271247461903

In [15]:
len(' '* 3)

3

In [17]:
{' '}

{' '}