### Boosted Tree Classification Model
In this tutorial, we show how to train a boosted tree classification model and how to evaluate it afterwards.

### 1. Import libraries
First, we import xgboost, the library that provides the boosted tree functionality. The purpose of the other libraries is explained as we advance through the tutorial.

In [13]:
import xgboost
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

### 2. Load data
We use Pandas to load the preprocessed version of our dataset (london_clean.csv) and select only the columns we will use for training the tree. The unnecessary columns could have been removed in the preprocessing phase, but we chose to keep them in the file so that we can experiment with different combinations of inputs and outputs.

In [6]:
ds = pd.read_csv('preprocessing/data/london_clean.csv')
selected_cols = ["DateOfCall", "PropertyType", "NumPumpsAttending",
                 "PumpHoursRoundUp", "CostCat", "mean_temp"]
ds = ds.loc[:, selected_cols]

### 3. Split data
After loading the data, we start by splitting the dataset into the columns that will be used as input to train the model (X) and the column that holds the actual output (y). In other words, for a given combination of data about a call (e.g. date of call, property type, etc.) we want to predict its cost category.

Besides splitting the columns, we also split the rows to create two datasets: one for training (70% of the original rows) and one for testing (the remaining 30%). We use train_test_split from scikit-learn to do this for us.

In [7]:
X = ds.drop(["CostCat"], axis=1).values 
y = ds["CostCat"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=42
)
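To make the row split concrete, here is a minimal pure-Python sketch of a shuffled 70/30 split on row indices. The helper name `split_indices` is hypothetical, and this is not how scikit-learn implements `train_test_split` internally; it only illustrates the idea of shuffling with a fixed seed and cutting at 70%.

```python
import random


def split_indices(n_rows, train_fraction=0.7, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/test parts.

    Hypothetical helper for illustration only; not scikit-learn's implementation.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # fixed seed makes the split repeatable
    n_train = round(n_rows * train_fraction)   # number of rows kept for training
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Because the seed is fixed, calling the helper twice with the same arguments returns the same split, which is the role `random_state=42` plays in the cell above.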

### 4. Train model
We call xgboost.XGBClassifier and use the training dataset (X_train, y_train) to train the model. We set the hyperparameter objective="multi:softmax" because we want to do multiclass classification. We keep the other hyperparameters at their default values. Later, in step 6, we will try different hyperparameter combinations to find the best one.

In [11]:
model = xgboost.XGBClassifier(objective="multi:softmax", random_state=42)
model.fit(X_train, y_train)
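As a rough illustration of what a softmax objective means at prediction time: the model produces one raw score per class, the softmax function turns those scores into probabilities, and the predicted label is the class with the highest probability. The sketch below is plain Python with made-up scores for six classes; it is not XGBoost's internal code.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores for classes 0..5 (illustrative values only).
raw_scores = [2.1, 1.3, -0.5, 0.0, -1.2, 0.4]
probabilities = softmax(raw_scores)

# The predicted label is the index of the largest probability.
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_class)
```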

### 5. Make predictions
We use the trained model to make predictions on the test set. We then compare these predictions to the actual labels to check how accurate the model is. According to scikit-learn's classification_report, the accuracy is 0.77, with a macro-averaged f1-score of 0.78 and a weighted-average f1-score of 0.75.

In [9]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.90      0.82    210199
           1       0.76      0.48      0.59    127674
           2       1.00      1.00      1.00     31404
           3       0.70      0.91      0.79      4963
           4       0.72      0.42      0.53      4546
           5       0.91      0.96      0.93      7200

    accuracy                           0.77    385986
   macro avg       0.80      0.78      0.78    385986
weighted avg       0.77      0.77      0.75    385986
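The report lists accuracy and the two averaged f1-scores as separate rows, and they are computed differently. The sketch below recomputes the macro and weighted f1 averages from the per-class rows of the report. The inputs are the rounded two-decimal values printed above, so the results only approximately match the report's own averages.

```python
# Per-class f1-scores and supports copied from the report above (rounded values).
f1_scores = [0.82, 0.59, 1.00, 0.79, 0.53, 0.93]
supports = [210199, 127674, 31404, 4963, 4546, 7200]

total = sum(supports)

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its number of test rows.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / total

print(f"total support: {total}")
print(f"macro f1:      {macro_f1:.3f}")
print(f"weighted f1:   {weighted_f1:.3f}")
```

The weighted average is lower than the macro average because class 1, which has the second-largest support, has one of the lowest f1-scores.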



### 6. Hyperparameter tuning
Now we use GridSearchCV to search for the best combination of hyperparameters for training a boosted tree. We check the following hyperparameters: number of estimators (trees), max depth, booster and tree method.

After the search is done, we find that the best hyperparameters are: (...). We then proceed to train the model using these hyperparameters.

In [10]:
tree_params = {'n_estimators': [5, 7, 10],
               'max_depth': [5, 7, 10], 
               'booster': ['gbtree', 'gblinear', 'dart'],
               'tree_method': ['exact', 'approx', 'hist']}
model_best = GridSearchCV(xgboost.XGBClassifier(), tree_params, n_jobs=4, cv=5)


model_best = model_best.fit(X_train, y_train)

print("Best hyperparameters for boosted tree classification: ", model_best.best_params_)


KeyboardInterrupt: 
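One likely reason the search runs for a long time is the size of the grid. The sketch below counts, with the standard library only, how many candidate models and how many training runs this grid implies with `cv=5`. It only does the arithmetic; it does not run any training.

```python
from itertools import product

# Same grid as in the cell above.
tree_params = {
    "n_estimators": [5, 7, 10],
    "max_depth": [5, 7, 10],
    "booster": ["gbtree", "gblinear", "dart"],
    "tree_method": ["exact", "approx", "hist"],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate model.
candidates = [dict(zip(tree_params, values))
              for values in product(*tree_params.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(n_candidates, n_fits)
```

With GridSearchCV's default `refit=True`, one more fit on the whole training set follows the search. Since the test set holds 385,986 rows (30% of the data), each of these fits works on a large share of roughly 900,000 training rows, so trimming the grid is a reasonable first step if the search has to be interrupted.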

### 7. Comparison
Finally, we compute the accuracy of the best classifier according to GridSearchCV. The accuracy is (...). Compared to the previous model, which was trained using the default hyperparameters, the improvement was (...).

In [12]:
y_pred = model_best.predict(X_test)
print(classification_report(y_test, y_pred))

NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

### 8. Cross validation
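We use k-fold cross-validation on the full dataset (X, y) to check how stable the accuracy is across different splits. We run it with 5 and with 10 folds and report the mean and standard deviation of the accuracy over the folds. Note that model_best is a GridSearchCV object, so every fold reruns the whole hyperparameter search on that fold's training rows.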

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std

folds = [5, 10]
for i in folds:
    cross_val = KFold(n_splits=i, random_state=42, shuffle=True)
    scores = cross_val_score(model_best, X, y, scoring='accuracy', cv=cross_val, n_jobs=4)
    print("Testing with {} folds:".format(i))
    print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
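To show what the KFold splitter in the cell above provides, here is a minimal pure-Python sketch of shuffled k-fold index generation. The helper name `k_fold_indices` is hypothetical and this is not scikit-learn's implementation; it only illustrates that each row lands in exactly one test fold and in the training part of all the others.

```python
import random


def k_fold_indices(n_rows, n_splits, seed=42):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold cross-validation.

    Hypothetical helper for illustration; not scikit-learn's KFold implementation.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # The first (n_rows % n_splits) folds get one extra row so all rows are used.
    fold_sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_rows=10, n_splits=5))
for train, test in folds:
    print(sorted(test))
```

The model is trained once per fold on the `train` indices and scored on the `test` indices, and the per-fold scores are then summarized by their mean and standard deviation.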