# Gradient Boosting Model
In diesem Notebook beschreiben wir das Gradient Boosting Verfahren.

## Load Packages

In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

In [2]:
X_train = pd.read_csv('Xtrain_feature_sel.csv')
X_test = pd.read_csv('Xtest_feature_sel.csv')
y_train = pd.read_csv('ytrain_mod.csv')
y_test = pd.read_csv('ytest_mod.csv')
print("Shape of X Train: {}".format(X_train.shape))
print("Shape of X Test: {}".format(X_test.shape))
print("Shape of y Train: {}".format(y_train.shape))
print("Shape of y Test: {}".format(y_test.shape))

Shape of X Train: (8672, 19)
Shape of X Test: (2168, 19)
Shape of y Train: (8672, 1)
Shape of y Test: (2169, 1)


## Model Building

In [3]:
model = GradientBoostingClassifier()

Die Grundeinstellungsparameter sind:
- learning rate: 0.1
- n estimators: 100
- subsample: 1
- min samples split: 2
- min samples leaf: 1
- max depth: 3
- min impurity decrease: 0
- random state: None
- max features: None
- max learf nodes: None


## Cross Validation

In [4]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Accuracy: 0.558 (0.012)


### Plot accuracy

In [5]:
fig = px.scatter(x = range(1,len(n_scores)+1), y = n_scores)
fig.show()

## Fit the Model

In [6]:
# fit the model on the whole dataset
model = GradientBoostingClassifier()
model.fit(X_train, y_train)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().



## Prediction

In [7]:
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

## Evaluation

### Confusion Matrix

In [8]:
print("Confusion matrix of the training set: {}".format(metrics.confusion_matrix(y_train,y_pred_train)))
print("Confusion matrix of the test set: {}".format(metrics.confusion_matrix(y_test,y_pred_test)))

Confusion matrix of the training set: [[   2    1    3    8    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0]
 [   0   32    0   21    0    8    0    0    0    0    0    0    0    0
     0    0    0    0    0    0]
 [   0    1   17   34    0   15    0    0    0    0    0    0    0    0
     0    0    0    0    0    0]
 [   2    7    0  208    1   77    1    7    0    0    0    0    0    0
     0    0    0    0    0    0]
 [   0    1    2   43   41   57    1   12    0    0    0    0    0    0
     0    0    0    0    0    0]
 [   0    0    2   81    3  403    5   71    3    0    0    0    0    0
     0    0    0    0    0    0]
 [   0    1    1   14    0   84   55   99    0    2    0    0    0    0
     0    0    0    0    0    0]
 [   0    0    0    5    0   91    3  566    9   55    0    1    0    0
     0    0    0    0    0    0]
 [   0    0    0    0    0    5    2  135  110  129    1    1    0    0
     0    0    0    0    0    0]
 [   0    0    0 

ValueError: Found input variables with inconsistent numbers of samples: [2169, 2168]

### Accuracy Score

In [None]:
print("Accuracy Score for the training set: {}".format(metrics.accuracy_score(y_train, y_pred_train)))
print("Accuracy Score for the test set: {}".format(metrics.accuracy_score(y_test, y_pred_test)))

Accuracy Score for the training set: 0.8406303471216781
Accuracy Score for the test set: 0.8406343356506687


### Recall and Precision

In [None]:
print("Precision score for the training set: {}".format(metrics.precision_score(y_train,y_pred_train)))
print("Precision score for the test set: {}".format(metrics.precision_score(y_test, y_pred_test)))

Precision score for the training set: 0.8173325093605729
Precision score for the test set: 0.8210255559243279


In [None]:
print("Recall score for the training set: {}".format(metrics.recall_score(y_train,y_pred_train)))
print("Recall score for the test set: {}".format(metrics.recall_score(y_test,y_pred_test)))

Recall score for the training set: 0.7305218012866334
Recall score for the test set: 0.7359065893202439


### F1-score

In [None]:
print("F1 score from training set: {}".format(metrics.f1_score(y_train, y_pred_train)))
print("F1 score from test set: {}".format(metrics.f1_score(y_test,y_pred_test)))

F1 score from training set: 0.7714927856983548
F1 score from test set: 0.7761393050435329


### AUC Score

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train, y_pred_train)
print("AUC score for the training set: {}".format(metrics.auc(fpr, tpr)))

AUC score for the training set: 0.8176711667670027


In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test,y_pred_test)
print("AUC score for the test set: {}".format(metrics.auc(fpr,tpr)))

AUC score for the test set: 0.8197435588414281


## Variable Importance

In [None]:
feat_imp = pd.Series(model.feature_importances_).sort_values(ascending=False)
sorted_idx = np.argsort(feat_imp)
feat_imp_df = pd.DataFrame({'vars': X_train.columns[sorted_idx], 'feat_imp': model.feature_importances_})
feat_imp_df.head()

Unnamed: 0,vars,feat_imp
0,customer_type_Transient,0.110239
1,distribution_channel_TA/TO,0.000418
2,distribution_channel_Direct,0.003303
3,distribution_channel_Corporate,0.002491
4,reserved_room_type_A,0.001046


In [None]:
fig = px.bar(feat_imp_df.iloc[:10,], x= 'feat_imp', y='vars')
fig.update_yaxes(title_text='Variables')
fig.update_xaxes(title_text='Feature Importance')
fig.update_layout(yaxis = {'categoryorder':'total ascending'})
fig.show()