# Gradient Boosting Model
In this notebook we describe the gradient boosting method.

## Load Packages

In [2]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

In [3]:
X_train = pd.read_csv('Xtrain_feature_sel.csv')
X_test = pd.read_csv('Xtest_feature_sel.csv')
y_train = pd.read_csv('ytrain.csv')
y_test = pd.read_csv('ytest.csv')
print("Shape of X Train: {}".format(X_train.shape))
print("Shape of X Test: {}".format(X_test.shape))
print("Shape of y Train: {}".format(y_train.shape))
print("Shape of y Test: {}".format(y_test.shape))

Shape of X Train: (8672, 19)
Shape of X Test: (2168, 19)
Shape of y Train: (8672, 1)
Shape of y Test: (2168, 1)


## Model Building

In [4]:
model = GradientBoostingClassifier()

The default parameters are:
- learning rate: 0.1
- n estimators: 100
- subsample: 1.0
- min samples split: 2
- min samples leaf: 1
- max depth: 3
- min impurity decrease: 0.0
- random state: None
- max features: None
- max leaf nodes: None
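As a quick sanity check, the defaults listed above can be written out explicitly and compared against what the installed scikit-learn version reports. This is a minimal sketch; the names `explicit_model` and `differences` are illustrative and do not appear in the notebook.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Spell out the defaults from the list above instead of relying on them implicitly.
explicit_model = GradientBoostingClassifier(
    learning_rate=0.1,
    n_estimators=100,
    subsample=1.0,
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=3,
    min_impurity_decrease=0.0,
    random_state=None,
    max_features=None,
    max_leaf_nodes=None,
)

# Compare with a model built without arguments; any mismatch would mean
# the installed version uses a different default.
default_params = GradientBoostingClassifier().get_params()
differences = {
    name: (default_params[name], value)
    for name, value in explicit_model.get_params().items()
    if default_params[name] != value
}
print(differences)
```

An empty dictionary means the listed values match the library defaults. Setting `random_state` to a fixed integer instead of `None` would make repeated fits reproducible.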


## Cross Validation

In [5]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))



Accuracy: 0.558 (0.013)
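The reported mean and standard deviation are computed over 30 scores, because `n_splits=10` and `n_repeats=3` produce 10 x 3 train/test splits. The sketch below illustrates this on a small placeholder dataset (the arrays `X_demo` and `y_demo` are assumptions, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Placeholder data: 40 samples, two balanced classes.
X_demo = np.zeros((40, 2))
y_demo = np.repeat([0, 1], 20)

cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Total number of train/test splits the splitter will generate.
n_folds = cv_demo.get_n_splits(X_demo, y_demo)

# Size of each held-out fold; every sample is held out once per repeat.
test_sizes = [len(test_idx) for _, test_idx in cv_demo.split(X_demo, y_demo)]
print(n_folds, sum(test_sizes))
```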


### Plot accuracy

In [6]:
fig = px.scatter(x = range(1,len(n_scores)+1), y = n_scores)
fig.show()

## Fit the Model

In [7]:
# fit the model on the full training set
model = GradientBoostingClassifier()
model.fit(X_train, y_train)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
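The warning above appears because `y_train` is read from CSV as a single-column DataFrame of shape (n_samples, 1). A minimal sketch of the flattening the message suggests, using a small hypothetical stand-in for `ytrain.csv`:

```python
import pandas as pd

# Hypothetical stand-in for the single-column target file.
y_frame = pd.DataFrame({"target": [0, 1, 2, 1]})

# Convert to a NumPy array and flatten to shape (n_samples,).
y_flat = y_frame.to_numpy().ravel()
print(y_frame.shape, y_flat.shape)
```

In the notebook this would correspond to something like `model.fit(X_train, y_train.to_numpy().ravel())`, assuming the target file contains exactly one column.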



## Prediction

In [8]:
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

## Evaluation

### Confusion Matrix

In [9]:
print("Confusion matrix of the training set: {}".format(metrics.confusion_matrix(y_train,y_pred_train)))
print("Confusion matrix of the test set: {}".format(metrics.confusion_matrix(y_test,y_pred_test)))

Confusion matrix of the training set: [[   1    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0]
 [   0   10    1    0    0    0    2    0    0    0    0    0    0    0
     0    0    0    0    0    0    0]
 [   0    0   28    0    0    0   26    0    0    7    0    0    0    0
     0    0    0    0    0    0    0]
 [   0    0    0  565    1    0    3   53    0   90    1    0    0   13
     0    0    0    0    4    0    0]
 [   0    0    0    0 1050    0    0    4   48    1   68    0    0    0
    56    0    1    0    0   26    0]
 [   0    0    0    0    0   47    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0]
 [   0    0    6    8    0    0  217    0    0   70    0    0    0    0
     0    2    0    0    0    0    0]
 [   0    0    0   69    0    0    2  649    0    3   68    0    0   17
     0    0   25    0    0    0    0]
 [   0    0    0    0   76    0    0    1  852    0    0    3    0    0
    46    

### Accuracy Score

In [10]:
print("Accuracy Score for the training set: {}".format(metrics.accuracy_score(y_train, y_pred_train)))
print("Accuracy Score for the test set: {}".format(metrics.accuracy_score(y_test, y_pred_test)))

Accuracy Score for the training set: 0.6934963099630996
Accuracy Score for the test set: 0.5714944649446494


### Recall and Precision

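Note: `precision_score` and `recall_score` default to `average='binary'`, which raises `ValueError: Target is multiclass but average='binary'` on this multiclass target. The calls below therefore pass `average='weighted'`.
See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html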

In [11]:
print("Precision score for the training set: {}".format(metrics.precision_score(y_train, y_pred_train, average='weighted')))
print("Precision score for the test set: {}".format(metrics.precision_score(y_test, y_pred_test, average='weighted')))

Precision score for the training set: 0.7137954608664614
Precision score for the test set: 0.5497699375096385


In [12]:
print("Recall score for the training set: {}".format(metrics.recall_score(y_train,y_pred_train, average='weighted')))
print("Recall score for the test set: {}".format(metrics.recall_score(y_test,y_pred_test, average='weighted')))

Recall score for the training set: 0.6934963099630996
Recall score for the test set: 0.5714944649446494



Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
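The weighted recall values above are identical to the accuracy scores reported earlier (0.6935 for training, 0.5715 for test). This is expected: weighting each class's recall by its share of true samples reduces to the fraction of correctly classified samples. The sketch below checks this on toy labels that are purely illustrative.

```python
from collections import Counter

# Toy labels, not taken from the notebook's data.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

n = len(y_true)
support = Counter(y_true)  # number of true samples per class
correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # hits per class

# Per-class recall, then the support-weighted average.
recall = {k: correct[k] / support[k] for k in support}
weighted_recall = sum(support[k] / n * recall[k] for k in support)

# Plain accuracy: correct predictions over all samples.
accuracy = sum(correct.values()) / n
print(weighted_recall, accuracy)
```

Because of this identity, weighted recall adds no information beyond accuracy. Passing `average='macro'` or `average=None` to `recall_score` would expose per-class differences instead.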



### F1-score

In [18]:
print("F1 score from training set: {}".format(metrics.f1_score(y_train, y_pred_train, average='weighted')))
print("F1 score from test set: {}".format(metrics.f1_score(y_test,y_pred_test, average='weighted')))

F1 score from training set: 0.677132670764058
F1 score from test set: 0.5418338501744372


### AUC Score

In [20]:
fpr, tpr, thresholds = metrics.roc_curve(y_train, y_pred_train)
print("AUC score for the training set: {}".format(metrics.auc(fpr, tpr)))

ValueError: multiclass format is not supported

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test,y_pred_test)
print("AUC score for the test set: {}".format(metrics.auc(fpr,tpr)))

ValueError: multiclass format is not supported
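`roc_curve` only accepts binary targets, which explains both errors above. One possible alternative for a multiclass target is `roc_auc_score` with `multi_class='ovr'`, fed with class probabilities from `predict_proba` rather than hard predictions. The sketch below uses synthetic data from `make_classification`; all variable names are illustrative, and nothing here reproduces the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the real data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# AUC needs scores per class, so use predicted probabilities.
proba_te = clf.predict_proba(X_te)

# One-vs-rest AUC, averaged with class-frequency weights.
auc_te = roc_auc_score(
    y_te, proba_te, multi_class="ovr", average="weighted", labels=clf.classes_
)
print("Weighted one-vs-rest AUC:", round(auc_te, 3))
```

One caveat for this notebook's data: the confusion matrix shows some classes with very few samples, so a class may be missing from `y_test`. In that case the one-vs-rest AUC for that class is undefined and the call may fail or warn.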

## Variable Importance

In [None]:
feat_imp = pd.Series(model.feature_importances_).sort_values(ascending=False)
sorted_idx = np.argsort(feat_imp)
feat_imp_df = pd.DataFrame({'vars': X_train.columns[sorted_idx], 'feat_imp': model.feature_importances_})
feat_imp_df.head()

| | vars | feat_imp |
|---|---|---|
| 0 | Genres_Education | 0.832916 |
| 1 | Genres_Rare | 0.035576 |
| 2 | Genres_Entertainment | 0.059845 |
| 3 | Content Rating_Teen | 0.027285 |
| 4 | Content Rating_Everyone | 0.001948 |
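The `feat_imp` values in the table above are not in descending order. This suggests that the cell pairs a reordered list of column names with importances that are still in the model's original column order, so names and values may not belong together. A minimal sketch of one way to keep them aligned is to build the DataFrame first and sort afterwards. The feature names and importances below are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train.columns and model.feature_importances_.
feature_names = pd.Index(["feat_a", "feat_b", "feat_c", "feat_d"])
importances = np.array([0.10, 0.60, 0.05, 0.25])

# Pair each name with its own importance, then sort the rows together.
feat_imp_df = (
    pd.DataFrame({"vars": feature_names, "feat_imp": importances})
    .sort_values("feat_imp", ascending=False)
    .reset_index(drop=True)
)
print(feat_imp_df)
```

With the notebook's objects, the equivalent would be `pd.DataFrame({'vars': X_train.columns, 'feat_imp': model.feature_importances_}).sort_values('feat_imp', ascending=False)`, assuming the model was fitted on `X_train` with its columns in that order.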


In [None]:
fig = px.bar(feat_imp_df.iloc[:10,], x= 'feat_imp', y='vars')
fig.update_yaxes(title_text='Variables')
fig.update_xaxes(title_text='Feature Importance')
fig.update_layout(yaxis = {'categoryorder':'total ascending'})
fig.show()