# Random Forest
In this notebook we evaluate the quality of the Random Forest algorithm for nodule classification.

In [1]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

data = pd.read_csv('../data_cleaning/final_features.csv')

y = data["malignancy"]
X = data.drop(columns=['malignancy','case_id'])

## Hyperparameters

We use GridSearchCV to find the best hyperparameters for training the model:

In [4]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

model = RandomForestClassifier()

# cross-validation used to test the hyperparameters
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=405)


print("Test over several ranges, goal: find the best combination")
# hyperparameters to test
param_grid = {
    'n_estimators': list(range(25, 201, 25)),
    'max_depth': list(range(5, 106, 10)),
    'min_samples_leaf': list(range(5, 26, 5)),
    'criterion': ["gini", "entropy", "log_loss"]
}

# grid search with cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='accuracy', verbose=2, n_jobs=-1)
grid_search.fit(X, y)

# result
print(f"Best Hyperparameters: {grid_search.best_params_}")

Test over several ranges, goal: find the best combination
Fitting 10 folds for each of 1320 candidates, totalling 13200 fits
Best Hyperparameters: {'criterion': 'log_loss', 'max_depth': 95, 'min_samples_leaf': 5, 'n_estimators': 25}
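
The candidate count in the log above follows from the grid itself: it is the product of the number of values tried per hyperparameter, and the number of fits is that product times the number of folds. A minimal sketch in plain Python that reproduces the arithmetic (the grid mirrors the one in the cell above; `n_folds`, `n_candidates` and `n_fits` are names introduced here for illustration):

```python
from math import prod

# Same value ranges as the grid searched above.
param_grid = {
    "n_estimators": list(range(25, 201, 25)),      # 8 values
    "max_depth": list(range(5, 106, 10)),          # 11 values
    "min_samples_leaf": list(range(5, 26, 5)),     # 5 values
    "criterion": ["gini", "entropy", "log_loss"],  # 3 values
}
n_folds = 10  # matches n_splits in the StratifiedKFold above

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 1320 13200
```

The scikit-learn documentation describes "entropy" and "log_loss" as the same Shannon information gain criterion, so these two values likely search the same split rule twice; dropping one of them would reduce the grid to 880 candidates.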


## Model Evaluation

10-fold cross-validation to evaluate our model in terms of precision, F1, ROC AUC and accuracy:

In [5]:
# tools for computing the model evaluation metrics
from sklearn.model_selection import cross_val_score
import numpy as np

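# model
# Note: criterion and max_depth differ from the grid-search output above ('log_loss', 95);
# scikit-learn documents 'entropy' and 'log_loss' as the same criterion.
model = RandomForestClassifier(criterion = 'entropy', max_depth = 105, min_samples_leaf = 5, n_estimators = 25)

# evaluation using 10-fold cross-validation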
# skf = StratifiedKFold(n_splits = 10)
scores = [0,0,0,0] 
# [0] -> precision
# [1] -> f1
# [2] -> roc_auc
# [3] -> accuracy

scores[0] = (cross_val_score(model, X, y, cv=10, scoring = "precision"))
scores[1] = (cross_val_score(model, X, y, cv=10, scoring = "f1"))
scores[2] = (cross_val_score(model, X, y, cv=10, scoring = "roc_auc"))
scores[3] = (cross_val_score(model, X, y, cv=10, scoring = "accuracy"))


precision_scores_formatted = [f"{score:.2f}" for score in scores[0]]
print(f'Precision Scores: {precision_scores_formatted}')
F1_scores_formatted = [f"{score:.2f}" for score in scores[1]]
print(f'F1 Scores: {F1_scores_formatted}')
ROC_AUC_scores_formatted = [f"{score:.2f}" for score in scores[2]]
print(f'ROC_AUC Scores: {ROC_AUC_scores_formatted}')
Accuracy_scores_formatted = [f"{score:.2f}" for score in scores[3]]
print(f'Accuracy Scores: {Accuracy_scores_formatted}')

Precision Scores: ['0.87', '0.78', '0.84', '0.83', '0.83', '0.83', '0.82', '0.88', '0.84', '0.75']
F1 Scores: ['0.84', '0.80', '0.84', '0.85', '0.79', '0.76', '0.75', '0.76', '0.79', '0.77']
ROC_AUC Scores: ['0.94', '0.92', '0.95', '0.95', '0.91', '0.92', '0.91', '0.92', '0.93', '0.92']
Accuracy Scores: ['0.87', '0.85', '0.88', '0.91', '0.85', '0.83', '0.83', '0.84', '0.87', '0.82']
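
The cell above calls `cross_val_score` once per metric, so the forest is refit for each metric, and because no `random_state` is set the four metrics are computed on different fitted forests. One possible alternative is `cross_validate`, which accepts a list of scorers and evaluates them all on the same fitted models. The sketch below assumes synthetic stand-in data from `make_classification` (`X_demo` and `y_demo` are hypothetical names, not the notebook's nodule features) and fixed seeds chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Fixed seeds so that the folds and the forests are reproducible between runs.
model = RandomForestClassifier(
    criterion="entropy", max_depth=105, min_samples_leaf=5,
    n_estimators=25, random_state=0,
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# A single cross-validation pass that scores every metric on the same fitted models.
metrics = ["precision", "f1", "roc_auc", "accuracy"]
results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metrics)

for name in metrics:
    formatted = [f"{score:.2f}" for score in results[f"test_{name}"]]
    print(f"{name}: {formatted}")
```

The returned dictionary holds one array per metric under the key `test_<metric>`, each with one entry per fold, which can be used in the same way as the `scores` list above.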


The per-fold score arrays above will be used in a statistical test (Wilcoxon test) to determine whether the models are statistically different.
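
A minimal sketch of such a comparison, assuming SciPy is available and that both models were scored on the same folds (the test is paired, so fold i of one model must correspond to fold i of the other). `rf_accuracy` is copied from the output above; `other_accuracy` holds made-up values for a hypothetical second model and is not a result from this project:

```python
from scipy.stats import wilcoxon

# Per-fold accuracy of the Random Forest, copied from the output above.
rf_accuracy = [0.87, 0.85, 0.88, 0.91, 0.85, 0.83, 0.83, 0.84, 0.87, 0.82]

# Hypothetical per-fold accuracy of a second model (made-up values, illustration only).
other_accuracy = [0.84, 0.86, 0.83, 0.87, 0.79, 0.85, 0.76, 0.76, 0.78, 0.72]

# Paired, two-sided Wilcoxon signed-rank test on the fold-wise differences.
result = wilcoxon(rf_accuracy, other_accuracy)
print(f"statistic={result.statistic:.1f}, p-value={result.pvalue:.4f}")
```

A small p-value would suggest that the fold-wise differences between the two models are unlikely under the null hypothesis of no difference; with only 10 folds the test has limited power, so the result is best read as indicative.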

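Mean value of each evaluation metric across the 10 folds (here "Average Precision" is the mean of the per-fold precision scores, not the average-precision metric):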

In [6]:
print(f'Average Precision: {np.mean(scores[0]):.2f}')
print(f'F1 Score: {np.mean(scores[1]):.2f}')
print(f'ROC_AUC: {np.mean(scores[2]):.2f}')
print(f'Accuracy Score: {np.mean(scores[3]):.2f}')

Average Precision: 0.83
F1 Score: 0.80
ROC_AUC: 0.93
Accuracy Score: 0.85
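
The means above hide how much each metric varies between folds. As a rough sketch, the spread can be reported alongside the mean using only the standard library. The values below are the rounded per-fold scores printed earlier, so the results may differ slightly in the last digit from those computed on the unrounded arrays; `fold_scores` and `summary` are names introduced here for illustration:

```python
from statistics import mean, stdev

# Rounded per-fold scores copied from the printed output earlier in the notebook.
fold_scores = {
    "precision": [0.87, 0.78, 0.84, 0.83, 0.83, 0.83, 0.82, 0.88, 0.84, 0.75],
    "f1": [0.84, 0.80, 0.84, 0.85, 0.79, 0.76, 0.75, 0.76, 0.79, 0.77],
    "roc_auc": [0.94, 0.92, 0.95, 0.95, 0.91, 0.92, 0.91, 0.92, 0.93, 0.92],
    "accuracy": [0.87, 0.85, 0.88, 0.91, 0.85, 0.83, 0.83, 0.84, 0.87, 0.82],
}

# Mean and sample standard deviation across the 10 folds for each metric.
summary = {name: (mean(values), stdev(values)) for name, values in fold_scores.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: {avg:.3f} +/- {spread:.3f}")
```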
