# Gradient Boosting Classifiers in Python with Scikit-Learn

Decision trees are usually used as the weak learners in gradient boosting. Gradient boosting models are becoming popular because of their effectiveness at classifying complex datasets.

# How Gradient Boosting Came to Be 

The idea behind "gradient boosting" is to take a weak hypothesis or weak learning algorithm and make a series of tweaks to it that improve the strength of the hypothesis/learner. This type of hypothesis boosting is based on the idea of Probably Approximately Correct (PAC) learning.

The PAC learning framework investigates machine learning problems to characterize how complex they are, and a similar approach is applied to hypothesis boosting.

In hypothesis boosting, you look at all the observations that the machine learning algorithm is trained on and keep only the observations that the learner classified successfully, stripping out the others. A new weak learner is then created and tested on the data that was poorly classified, and again only the examples that were classified successfully are kept.

For AdaBoost, many weak learners are created by initializing many decision trees that have only a single split, known as "stumps".

The instances/observations in the training set are weighted by the algorithm, and more weight is assigned to instances that are difficult to classify. More weak learners are added to the system sequentially, and they are focused on the most difficult training instances.

In AdaBoost, predictions are made by a weighted majority vote: each instance is assigned to the class that receives the most votes from the weak learners, with more accurate learners given a larger say.
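The AdaBoost behaviour described above can be tried out with scikit-learn's `AdaBoostClassifier`, whose default weak learner is a depth-1 decision tree (a stump). What follows is a minimal sketch on synthetic data from `make_classification`; the dataset and the variable names are assumptions made for illustration and are not part of the Titanic example used later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# With no estimator specified, each weak learner is a decision stump
ada_clf = AdaBoostClassifier(n_estimators=50, random_state=0)
ada_clf.fit(X_tr, y_tr)

# Collect the depth of every fitted weak learner to confirm they are stumps
stump_depths = {est.get_depth() for est in ada_clf.estimators_}
ada_accuracy = ada_clf.score(X_te, y_te)

print("Depths of the weak learners:", stump_depths)
print("Test accuracy: {0:.3f}".format(ada_accuracy))
```

Each fitted stump is available in `ada_clf.estimators_`, and the say each one gets in the final vote is stored in `ada_clf.estimator_weights_`.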

Gradient boosting classifiers combine the AdaBoost method with weighted minimization, after which the classifiers and weighted inputs are recalculated. The objective of a gradient boosting classifier is to minimize the loss, or the difference between the actual class value of the training example and the predicted class value. You don't need to understand the details of how the classifier's loss is reduced, but the process operates similarly to gradient descent.

In gradient boosting machines, every time a new weak learner is added to the model, the previous learners are frozen in place and left unchanged as the new learners are introduced. This is distinct from AdaBoost, where the weights of the training instances are adjusted each time a new learner is added.

The power of gradient boosting machines comes from the fact that they are not limited to binary classification: they can also be used on multiclass classification problems and even regression problems.
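As a rough illustration of that flexibility, the sketch below fits scikit-learn's `GradientBoostingClassifier` to the three-class iris dataset and `GradientBoostingRegressor` to a synthetic regression problem. The datasets, hyperparameters, and variable names are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Multiclass classification: the iris dataset has three classes
X_iris, y_iris = load_iris(return_X_y=True)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris)
multi_clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
multi_clf.fit(Xc_tr, yc_tr)
multi_accuracy = multi_clf.score(Xc_te, yc_te)

# Regression: a synthetic problem with a continuous target
X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=10.0,
                               random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=0)
gb_reg = GradientBoostingRegressor(n_estimators=100, random_state=0)
gb_reg.fit(Xr_tr, yr_tr)
reg_r2 = gb_reg.score(Xr_te, yr_te)  # score() returns R^2 for regressors

print("Classes seen by the classifier:", multi_clf.classes_)
print("Multiclass accuracy: {0:.3f}".format(multi_accuracy))
print("Regression R^2: {0:.3f}".format(reg_r2))
```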

# Theory Behind Gradient Boost

The Gradient Boosting Classifier depends on a loss function. A custom loss function can be used, and many standardized loss functions are supported by gradient boosting classifiers, but the loss function has to be differentiable. 

Classification algorithms frequently use logarithmic loss, while regression algorithms can use squared error. Gradient boosting systems don't need a new boosting algorithm derived for each loss function; any differentiable loss function can be plugged into the system.
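To make the differentiability requirement concrete, here is a small NumPy sketch of binary logarithmic loss and squared error together with their gradients with respect to the model's output. The function names are hypothetical helpers written for this illustration, not part of scikit-learn, and the log loss is expressed in terms of a raw (pre-sigmoid) score, which is one common convention.

```python
import numpy as np

def sigmoid(raw_score):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-raw_score))

def binary_log_loss(y_true, raw_score):
    """Logarithmic loss for a 0/1 label given the model's raw score."""
    p = sigmoid(raw_score)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def binary_log_loss_gradient(y_true, raw_score):
    """Derivative of the log loss with respect to the raw score."""
    return sigmoid(raw_score) - y_true

def squared_error(y_true, prediction):
    """Squared error, halved so the gradient has no factor of 2."""
    return 0.5 * (y_true - prediction) ** 2

def squared_error_gradient(y_true, prediction):
    """Derivative of the halved squared error with respect to the prediction."""
    return prediction - y_true

# Check the analytic log loss gradient against a finite-difference estimate
y, f, eps = 1.0, 0.3, 1e-6
numeric = (binary_log_loss(y, f + eps) - binary_log_loss(y, f - eps)) / (2 * eps)
analytic = binary_log_loss_gradient(y, f)
print("numeric gradient:  {0:.6f}".format(numeric))
print("analytic gradient: {0:.6f}".format(analytic))
```

In both cases the negative gradient points from the current prediction toward the true value, which is the quantity each new tree is trained to approximate.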

Gradient boosting systems have two other necessary parts: a weak learner and an additive component. Decision trees serve as the weak learners. Specifically, regression trees are used, and these regression trees output real values. Because the outputs are real values, the outputs of the regression trees can be added together as new learners join the model, correcting errors in the predictions.

The additive component of the gradient boosting model comes from the fact that trees are added to the model one at a time, and when this occurs the existing trees aren't modified; their values remain fixed.

A procedure similar to gradient descent is used to minimize the model's error. This is done by taking the calculated loss and performing a gradient descent step to reduce that loss. Afterward, the parameters of the new tree are adjusted to reduce the residual loss.

The new tree's output is then added to the output of the previous trees in the model. This process is repeated until a previously specified number of trees is reached or the loss is reduced below a certain threshold.
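The loop described above can be sketched from scratch for the simplest case, squared-error loss, where the negative gradient is just the residual between the target and the current prediction. This is a simplified illustration and not how scikit-learn implements it; `fit_boosted_trees` and `predict_boosted_trees` are hypothetical helper names, and the sine-wave data is made up for the demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit an additive model of regression trees for squared-error loss."""
    baseline = float(np.mean(y))              # initial constant prediction
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction            # negative gradient of the loss
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)                # new tree targets the remaining error
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)                    # earlier trees are never modified
    return baseline, trees

def predict_boosted_trees(X, baseline, trees, learning_rate=0.1):
    """Sum the baseline and the scaled outputs of all fitted trees."""
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction

# Made-up regression data: a noisy sine wave
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

baseline, trees = fit_boosted_trees(X_demo, y_demo)
mse_start = np.mean((y_demo - baseline) ** 2)
mse_end = np.mean((y_demo - predict_boosted_trees(X_demo, baseline, trees)) ** 2)
print("Training MSE with the baseline only: {0:.4f}".format(mse_start))
print("Training MSE after {0} trees: {1:.4f}".format(len(trees), mse_end))
```

This sketch stops after a fixed number of trees; a loss threshold could be checked inside the loop as an alternative stopping rule.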

The contribution of each tree is scaled by a learning rate. There's a trade-off between the learning rate and the number of trees needed, so you'll have to experiment to find the best values for each of the parameters, but learning rates below 0.1 or between 0.1 and 0.3 often work well.
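One way to explore this trade-off without refitting a model for every tree count is `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch below uses synthetic data from `make_classification`; the data, the candidate learning rates, and the variable names are assumptions for illustration, and the numbers it prints will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

best_n_trees = {}
for lr in [0.05, 0.1, 0.3]:
    clf = GradientBoostingClassifier(n_estimators=200, learning_rate=lr,
                                     max_depth=2, random_state=0)
    clf.fit(X_tr, y_tr)
    # Validation accuracy after 1, 2, ..., 200 trees
    val_scores = [accuracy_score(y_va, pred)
                  for pred in clf.staged_predict(X_va)]
    # Smallest number of trees that reaches the best validation accuracy
    best_n = val_scores.index(max(val_scores)) + 1
    best_n_trees[lr] = (best_n, val_scores[best_n - 1])
    print("Learning rate {0}: best validation accuracy {1:.3f} at {2} trees"
          .format(lr, val_scores[best_n - 1], best_n))
```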

# Libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier

# Read and Edit the Data

In [2]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

In [3]:
y_train = train_data["Survived"]
train_data.drop(labels="Survived", axis=1, inplace=True)

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
full_data = pd.concat([train_data, test_data])

In [5]:
drop_columns = ["Name", "Age", "SibSp", "Ticket", "Cabin", "Parch", "Embarked"]
full_data.drop(labels=drop_columns, axis=1, inplace=True)

In [6]:
full_data = pd.get_dummies(full_data, columns=["Sex"])
full_data.fillna(value=0.0, inplace=True)

In [7]:
n_train = len(y_train)  # number of rows that came from train.csv
X_train = full_data.values[:n_train]
X_test = full_data.values[n_train:]

In [8]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [9]:
state = 12 
test_size = 0.30 
 
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=test_size, random_state=state)

# Comparing Learning Rates

In [11]:
lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1]
for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=20, 
                                        learning_rate=learning_rate, 
                                        max_features=2, max_depth=2, 
                                        random_state=0)
    
    gb_clf.fit(X_train, y_train)
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))
    print("Accuracy score (validation): {0:.3f}\n".format(gb_clf.score(X_val, y_val)))

Learning rate:  0.05
Accuracy score (training): 0.801
Accuracy score (validation): 0.731

Learning rate:  0.075
Accuracy score (training): 0.814
Accuracy score (validation): 0.731

Learning rate:  0.1
Accuracy score (training): 0.812
Accuracy score (validation): 0.724

Learning rate:  0.25
Accuracy score (training): 0.835
Accuracy score (validation): 0.750

Learning rate:  0.5
Accuracy score (training): 0.864
Accuracy score (validation): 0.772

Learning rate:  0.75
Accuracy score (training): 0.875
Accuracy score (validation): 0.754

Learning rate:  1
Accuracy score (training): 0.875
Accuracy score (validation): 0.739



In [13]:
gb_clf2 = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5, 
                                     max_features=2, max_depth=2, random_state=0)
gb_clf2.fit(X_train, y_train)
predictions = gb_clf2.predict(X_val)
print("Confusion Matrix:")
print(confusion_matrix(y_val, predictions))
print("Classification Report")
print(classification_report(y_val, predictions))

Confusion Matrix:
[[142  19]
 [ 42  65]]
Classification Report
              precision    recall  f1-score   support

           0       0.77      0.88      0.82       161
           1       0.77      0.61      0.68       107

    accuracy                           0.77       268
   macro avg       0.77      0.74      0.75       268
weighted avg       0.77      0.77      0.77       268
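The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes the class 1 metrics and the overall accuracy from the four counts printed above; the variable names are chosen for this illustration.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes
cm = np.array([[142, 19],
               [42, 65]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of the predicted positives, how many were right
recall_1 = tp / (tp + fn)      # of the actual positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()

print("Precision (class 1): {0:.2f}".format(precision_1))  # 0.77
print("Recall (class 1):    {0:.2f}".format(recall_1))     # 0.61
print("F1 score (class 1):  {0:.2f}".format(f1_1))         # 0.68
print("Accuracy:            {0:.2f}".format(accuracy))     # 0.77
```

The recall of 0.61 for class 1 shows that 42 of the 107 actual positives in the validation set were missed, a detail the overall accuracy of 0.77 does not reveal on its own.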

