# Gradient Boosting Decision Trees

In this notebook we implement the GBDT (Gradient Boosting Decision Trees) algorithm for nodule classification.


Import the libraries and data needed to implement the model:

In [None]:
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.model_selection import GridSearchCV
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('../data_cleaning/final_features.csv')

# define the target variable (y) and the feature matrix (X); no separate train/test split is made here, since evaluation uses cross-validation
y = data["malignancy"]
X = data.drop(columns=['malignancy', 'case_id'])

## Hyperparameters

Use GridSearchCV to find the best hyperparameters for training the model:

In [4]:
gbdt_model = GradientBoostingClassifier()
# candidate hyperparameter values
param_grid_gbdt = {
    'n_estimators': [100, 200],
    'max_depth': [3, 6, 10],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0]
}

# grid search with 10-fold cross-validation
grid_search_gbdt = GridSearchCV(estimator=gbdt_model, param_grid=param_grid_gbdt, cv=10, scoring='accuracy', verbose=2, n_jobs=-1)
grid_search_gbdt.fit(X, y) 
print(f"Best Hyperparameters: {grid_search_gbdt.best_params_}")

Fitting 10 folds for each of 24 candidates, totalling 240 fits
Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
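
Beyond the single best combination, it can be useful to see how close the other candidates came. The following is a minimal sketch of reading `cv_results_` from a fitted `GridSearchCV`. It runs on synthetic data from `make_classification` and a deliberately small grid so that it is self-contained; `X_demo`, `y_demo`, `demo_grid` and `demo_search` are illustrative names, not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the nodule features (assumption: the real X, y
# come from final_features.csv and are not available in this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Deliberately small grid and few trees so the sketch runs quickly
demo_grid = {"max_depth": [2, 3], "learning_rate": [0.1, 0.2]}
demo_search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid=demo_grid,
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate; sort by rank to compare them
results = pd.DataFrame(demo_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranking = results[columns].sort_values("rank_test_score")
print(ranking.to_string(index=False))
```

If several candidates score within one standard deviation of the best, the simpler one (fewer or shallower trees) is often a reasonable choice.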


## Model Evaluation

10-fold cross-validation to evaluate the model in terms of precision, F1, ROC AUC, and accuracy:

In [6]:
from sklearn.model_selection import cross_val_score
import numpy as np

# model built with the best hyperparameters found by the grid search
# (alternatively: model = grid_search_gbdt.best_estimator_)
model = GradientBoostingClassifier(learning_rate=0.1, max_depth=3, n_estimators=100, subsample=1.0)

scores = [0,0,0,0] 
# [0] -> precision
# [1] -> f1
# [2] -> roc_auc
# [3] -> accuracy

# evaluation using 10-fold cross-validation, one run per metric
scores[0] = (cross_val_score(model, X, y, cv=10, scoring = "precision"))
scores[1] = (cross_val_score(model, X, y, cv=10, scoring = "f1"))
scores[2] = (cross_val_score(model, X, y, cv=10, scoring = "roc_auc"))
scores[3] = (cross_val_score(model, X, y, cv=10, scoring = "accuracy"))


precision_scores_formatted = [f"{score:.2f}" for score in scores[0]]
print(f'Precision Scores: {precision_scores_formatted}')
F1_scores_formatted = [f"{score:.2f}" for score in scores[1]]
print(f'F1 Scores: {F1_scores_formatted}')
ROC_AUC_scores_formatted = [f"{score:.2f}" for score in scores[2]]
print(f'ROC_AUC Scores: {ROC_AUC_scores_formatted}')
Accuracy_scores_formatted = [f"{score:.2f}" for score in scores[3]]
print(f'Accuracy Scores: {Accuracy_scores_formatted}')

Precision Scores: ['0.89', '0.83', '0.84', '0.82', '0.87', '0.84', '0.82', '0.88', '0.89', '0.80']
F1 Scores: ['0.88', '0.83', '0.87', '0.86', '0.83', '0.84', '0.79', '0.79', '0.86', '0.83']
ROC_AUC Scores: ['0.96', '0.93', '0.95', '0.95', '0.94', '0.95', '0.93', '0.93', '0.94', '0.94']
Accuracy Scores: ['0.92', '0.86', '0.89', '0.88', '0.87', '0.88', '0.84', '0.85', '0.90', '0.86']
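
The four `cross_val_score` calls above refit the model once per metric. As a hedged alternative, scikit-learn's `cross_validate` accepts a list of scorers and computes all of them from a single set of fits. The sketch below uses synthetic data and a smaller model so that it runs on its own; `X_demo`, `y_demo` and `demo_model` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real features (assumption, for self-containment)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_model = GradientBoostingClassifier(n_estimators=20, random_state=0)

# One cross-validation run, several metrics evaluated on each fold
metrics = ["precision", "f1", "roc_auc", "accuracy"]
cv_out = cross_validate(demo_model, X_demo, y_demo, cv=5, scoring=metrics)

# Results are stored under keys of the form "test_<metric>"
for name in metrics:
    fold_scores = cv_out[f"test_{name}"]
    print(name, [f"{s:.2f}" for s in fold_scores])
```

Because every metric is computed on the same fitted models, the per-fold values are directly comparable across metrics.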


The per-fold score arrays above will be used in a statistical test (the Wilcoxon test) to determine whether the models differ significantly.

Mean value of each evaluation metric for the model:
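
As a rough sketch of what such a comparison could look like, the snippet below applies `scipy.stats.wilcoxon` to paired per-fold accuracies. The GBDT values are the rounded accuracies printed above. The second array is a purely hypothetical placeholder for another model, invented only to make the example runnable. The test is paired, so it assumes both models were scored on the same 10 folds.

```python
from scipy.stats import wilcoxon

# Rounded per-fold accuracies of the GBDT model, copied from the output above
gbdt_accuracy = [0.92, 0.86, 0.89, 0.88, 0.87, 0.88, 0.84, 0.85, 0.90, 0.86]

# HYPOTHETICAL placeholder scores for a second model (not real results)
other_model_accuracy = [0.90, 0.85, 0.86, 0.87, 0.83, 0.89, 0.80, 0.84, 0.84, 0.79]

# Wilcoxon signed-rank test on the paired fold-wise differences
result = wilcoxon(gbdt_accuracy, other_model_accuracy)
print(f"statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")

# A conventional (but arbitrary) significance threshold
alpha = 0.05
print("significant difference" if result.pvalue < alpha else "no significant difference")
```

With only 10 folds the test has limited power, so a non-significant result should not be read as evidence that two models perform identically.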

In [7]:
# "Mean" rather than "Average Precision", which is the name of a different metric
print(f'Mean Precision: {np.mean(scores[0]):.2f}')
print(f'Mean F1 Score: {np.mean(scores[1]):.2f}')
print(f'Mean ROC_AUC: {np.mean(scores[2]):.2f}')
print(f'Mean Accuracy: {np.mean(scores[3]):.2f}')

Mean Precision: 0.85
Mean F1 Score: 0.84
Mean ROC_AUC: 0.94
Mean Accuracy: 0.88
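
A mean on its own hides how much the score varies from fold to fold. The following is a small sketch that reports the spread alongside the mean, using the rounded per-fold accuracies printed earlier in this notebook. In the notebook itself the same calls could be applied to `scores[3]` directly; `accuracy_folds` is an illustrative name.

```python
import numpy as np

# Rounded per-fold accuracy scores copied from the output printed earlier
accuracy_folds = np.array([0.92, 0.86, 0.89, 0.88, 0.87, 0.88, 0.84, 0.85, 0.90, 0.86])

mean_acc = accuracy_folds.mean()
# ddof=1 gives the sample standard deviation across the 10 folds
std_acc = accuracy_folds.std(ddof=1)

print(f"Accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```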
