# Task 08 - Simulating an automatic training/retraining pipeline

Subtasks:
- Use all the classes created so far to run the training pipeline.
- Save the optimized models.
- Choose a final model based on the test set.
- Save the final model.

Definition of Done:
- The pipeline code is structured and runs end to end.
- A final model is trained and saved.

In [2]:
from src.data_access_handler import DataAccessHandler
from src.feature_selector import FeatureSelector
from src.model import Model, ModelSelector, ModelOptimizer
from src.utils import f1_score_micro

CV_SPLITS = 5
OPTIMIZATION_TRIALS = 10
RANDOM_STATE = 42
DATA_PATH = "./data/"
MODEL_PATH = "./models/"

In [9]:
print("Loading training data into memory...")
access_handler = DataAccessHandler(main_path=DATA_PATH)
df_train = access_handler.load(dataset_type="train")
print("Data loaded!\n")

target = 'fetal_health'
X,y = df_train.drop(columns=target),df_train[target].values.ravel()

print("Selecting best features for training...")
feature_selector = FeatureSelector()
feature_selector.select_best_features(X=X,y=y)
feature_selector.save_best_features(path = DATA_PATH)
features = feature_selector.get_selected_features
print("Best features selected!\n")

X = df_train[features]  # keep only the selected features; y is unchanged from above

print("Optimizing available ML models...")
model_optimizer = ModelOptimizer(random_state = RANDOM_STATE, optimization_trials = OPTIMIZATION_TRIALS, cv_splits = CV_SPLITS)
model_optimizer.optimize_all_models(X,y)
lr_best,rf_best,lgbm_best = model_optimizer.get_optimized_models
print("All models optimized!\n")

lr_model = Model(model = lr_best)
lr_model.save(path=MODEL_PATH,model_name="logistic_regression")

rf_model = Model(model = rf_best)
rf_model.save(path=MODEL_PATH,model_name="random_forest")

lgbm_model = Model(model = lgbm_best)
lgbm_model.save(path=MODEL_PATH,model_name="light_gbm")

del lr_best,rf_best,lgbm_best,lr_model,rf_model,lgbm_model
print("All optimized models saved!\n")

print("Loading saved models into memory...")
lr_model = Model()
lr_model.load(path=MODEL_PATH,model_name="logistic_regression")

rf_model = Model()
rf_model.load(path=MODEL_PATH,model_name="random_forest")

lgbm_model = Model()
lgbm_model.load(path=MODEL_PATH,model_name="light_gbm")
print("All optimized models loaded into memory!\n")


print("Loading test set for model selection...")
df_test = access_handler.load(dataset_type="test")
X,y = df_test[features],df_test[target].values.ravel()
print("Data loaded into memory!\n")

print("Choosing final model...")
model_selector = ModelSelector(models=[lr_model,rf_model,lgbm_model],
                            model_names=["logistic_regression","random_forest","light_gbm"])
model_selector.select_best_model(X,y)

print("Saving final model...")
model_selector.get_winner_model.save(path = MODEL_PATH,model_name = "winner_model")
print("Final model saved!")

del model_selector,df_test,access_handler

Loading training data into memory...
Data loaded!

Selecting best features for training...


  optimized_model = OptunaSearchCV(model_pipeline,
[I 2022-07-19 20:12:49,747] A new study created in memory with name: no-name-b8be0f1b-34f7-4bf3-aef6-9ef5b25ba883
[I 2022-07-19 20:12:49,768] Trial 0 finished with value: 0.7486666666666666 and parameters: {'lr__C': 0.00046025999465485514}. Best is trial 0 with value: 0.7486666666666666.
[I 2022-07-19 20:12:49,789] Trial 1 finished with value: 0.7506666666666667 and parameters: {'lr__C': 0.0006321850156886694}. Best is trial 1 with value: 0.7506666666666667.
[I 2022-07-19 20:12:49,829] Trial 2 finished with value: 0.834 and parameters: {'lr__C': 0.06796419128240948}. Best is trial 2 with value: 0.834.
[I 2022-07-19 20:12:49,855] Trial 3 finished with value: 0.8026666666666668 and parameters: {'lr__C': 0.013792757559148971}. Best is trial 2 with value: 0.834.
[I 2022-07-19 20:12:49,876] Trial 4 finished with value: 0.7513333333333333 and parameters: {'lr__C': 0.00

Selected features: ['baseline value', 'accelerations', 'uterine_contractions', 'prolongued_decelerations', 'abnormal_short_term_variability', 'mean_value_of_short_term_variability', 'percentage_of_time_with_abnormal_long_term_variability', 'mean_value_of_long_term_variability', 'histogram_width', 'histogram_min', 'histogram_mode', 'histogram_mean', 'histogram_median']
Best features selected!

Optimizing available ML models...
Finding best hyperparams for Logistic Regression...


[I 2022-07-19 20:12:49,946] Trial 6 finished with value: 0.7813333333333333 and parameters: {'lr__C': 0.00614369123919829}. Best is trial 5 with value: 0.836.
[I 2022-07-19 20:12:49,975] Trial 7 finished with value: 0.8233333333333333 and parameters: {'lr__C': 0.027150716171105076}. Best is trial 5 with value: 0.836.
[I 2022-07-19 20:12:49,995] Trial 8 finished with value: 0.7466666666666667 and parameters: {'lr__C': 0.00016209190420362666}. Best is trial 5 with value: 0.836.
[I 2022-07-19 20:12:50,016] Trial 9 finished with value: 0.7513333333333334 and parameters: {'lr__C': 0.0005982409991499023}. Best is trial 5 with value: 0.836.
  optimized_model = OptunaSearchCV(model_pipeline,
[I 2022-07-19 20:12:50,028] A new study created in memory with name: no-name-48db8728-a763-4451-902a-e7b9f1e0dcf2


Logistic Regression optimized!

Finding best hyperparams for Random Forest...


[I 2022-07-19 20:12:50,336] Trial 0 finished with value: 0.9306666666666666 and parameters: {'rf__n_estimators': 58, 'rf__max_depth': 95, 'rf__min_samples_split': 6, 'rf__min_samples_leaf': 3}. Best is trial 0 with value: 0.9306666666666666.
[I 2022-07-19 20:12:51,545] Trial 1 finished with value: 0.8546666666666667 and parameters: {'rf__n_estimators': 296, 'rf__max_depth': 14, 'rf__min_samples_split': 17, 'rf__min_samples_leaf': 34}. Best is trial 0 with value: 0.9306666666666666.
[I 2022-07-19 20:12:52,547] Trial 2 finished with value: 0.8720000000000001 and parameters: {'rf__n_estimators': 238, 'rf__max_depth': 88, 'rf__min_samples_split': 18, 'rf__min_samples_leaf': 26}. Best is trial 0 with value: 0.9306666666666666.
[I 2022-07-19 20:12:52,689] Trial 3 finished with value: 0.8539999999999999 and parameters: {'rf__n_estimators': 31, 'rf__max_depth': 8, 'rf__min_samples_split': 20, 'rf__min_samples_leaf': 35}. Best is trial 0 with valu

Random Forest optimized!

Finding best hyperparams for Light GBM...


[I 2022-07-19 20:13:00,778] Trial 0 finished with value: 0.908 and parameters: {'lgbm__n_estimators': 24, 'lgbm__max_depth': 53, 'lgbm__learning_rate': 0.06259960211800354, 'lgbm__num_leaves': 29, 'lgbm__subsample_for_bin': 27402}. Best is trial 0 with value: 0.908.
[I 2022-07-19 20:13:14,548] Trial 1 finished with value: 0.884 and parameters: {'lgbm__n_estimators': 236, 'lgbm__max_depth': 78, 'lgbm__learning_rate': 0.05416708338246849, 'lgbm__num_leaves': 11, 'lgbm__subsample_for_bin': 12}. Best is trial 0 with value: 0.908.
[I 2022-07-19 20:14:04,899] Trial 2 finished with value: 0.9006666666666666 and parameters: {'lgbm__n_estimators': 373, 'lgbm__max_depth': 32, 'lgbm__learning_rate': 0.0577009012507725, 'lgbm__num_leaves': 26, 'lgbm__subsample_for_bin': 14}. Best is trial 0 with value: 0.908.
[I 2022-07-19 20:14:29,984] Trial 3 finished with value: 0.9366666666666668 and parameters: {'lgbm__n_estimators': 101, 'lgbm__max_depth': 37, 

Light GBM optimized!
All models optimized!

All optimized models saved!

Loading saved models into memory...
All optimized models loaded into memory!

Loading test set for model selection...
Data loaded into memory!

Choosing final model...
logistic_regression final score: 0.7939
random_forest final score: 0.9361
light_gbm final score: 0.9537

Best model is light_gbm with f1-score-micro =  0.9537 for the test set.
Saving final model...
Final model saved!
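
The selection step logged above ranks the candidates by micro-averaged F1 on the test set. As a rough illustration of that idea, and not the actual `ModelSelector` implementation (which is not shown in this notebook), the sketch below computes micro F1 in plain Python and picks the highest-scoring candidate. The names `f1_micro` and `select_best`, and the toy predictions, are hypothetical. For single-label multiclass problems such as this one, micro F1 is equal to accuracy.

```python
def f1_micro(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def select_best(predictions_by_model, y_true):
    """Return (name, score) of the candidate with the highest micro F1."""
    scores = {name: f1_micro(y_true, y_pred)
              for name, y_pred in predictions_by_model.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]


# Toy labels and predictions, invented for illustration only.
y_test = [1, 1, 2, 3, 1, 2]
candidates = {
    "model_a": [1, 2, 2, 3, 1, 1],  # 4 of 6 correct
    "model_b": [1, 1, 2, 3, 1, 1],  # 5 of 6 correct
}
winner, score = select_best(candidates, y_test)
print(f"Best model is {winner} with f1-score-micro = {score:.4f}")
```

Scoring on predictions rather than on model objects keeps the sketch independent of any particular library; a real selector would call each model's `predict` on the test features first.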


# Conclusions:

- Because LightGBM takes longer to train, each model was optimized for only 10 trials.
- The training and model-selection pipeline built from the classes runs end to end, and each step is easy to follow.
- The selected model still matches the one from the experimental code of Tasks 06 and 07, because the same seed is used. Changing the seed may change the final result.
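
To make the point about the seed concrete: a seeded random generator always proposes the same sequence of hyperparameter values, so every run sees the same trials. The sketch below uses Python's `random` module with a log-uniform draw for a regularization strength. It is a stand-in for illustration, not Optuna's sampler, and `sample_trials` with its default bounds is a hypothetical helper.

```python
import math
import random


def sample_trials(seed, n_trials, low=1e-4, high=1e-1):
    """Draw n_trials values log-uniformly from [low, high] with a fixed seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    log_low, log_high = math.log(low), math.log(high)
    return [math.exp(rng.uniform(log_low, log_high)) for _ in range(n_trials)]


first_run = sample_trials(seed=42, n_trials=10)
second_run = sample_trials(seed=42, n_trials=10)
other_seed = sample_trials(seed=7, n_trials=10)

print(first_run == second_run)  # same seed, same candidate values
print(first_run == other_seed)  # a different seed explores other values
```

With only 10 trials per model, the candidate values drawn depend heavily on the seed, which is one reason a different seed could lead to a different winner.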

# Task 09 - Simulating an inference pipeline

Subtasks:
- Use the classes created so far to run the inference pipeline.
- Run inference on a sample of the test data.

Definition of Done:
- Inference runs end to end on a sample of the data.

In [3]:
from src.data_access_handler import DataAccessHandler
from src.feature_selector import FeatureSelector
from src.model import Model

DATA_PATH = "./data/"
MODEL_PATH = "./models/"

In [4]:
print("Loading training data into memory...")
access_handler = DataAccessHandler(main_path=DATA_PATH)
df = access_handler.load(dataset_type="test").iloc[:10] # simulate 10 samples for inference
print("Data loaded!\n")

feature_selector = FeatureSelector()
feature_selector.load_best_features(path = DATA_PATH)
features = feature_selector.get_selected_features
print("Best features selected!\n")

target = 'fetal_health'
X = df[features]

model = Model()
model.load(path=MODEL_PATH,model_name="winner_model")
print("Model loaded into memory!\n")  # report success only after the load has run

print("Prediction for provided sample:")
model.predict(X)

Loading training data into memory...
Data loaded!

Best features selected!

Model loaded into memory!

Prediction for provided sample:


array([1., 1., 1., 1., 1., 3., 1., 3., 2., 2.])

# Conclusions:

- The inference pipeline built from the classes runs end to end, and each step is easy to follow.
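
Inference only works if the model receives the same columns, in the same order, as at training time, which is why the feature list is saved during training and reloaded here. A minimal sketch of that round trip follows, assuming a JSON file. The storage format actually used by `FeatureSelector` is not shown in this notebook, and `save_features`, `load_features`, `select_columns` and the file name `best_features.json` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_features(path, features):
    """Persist the selected feature names, preserving their order."""
    Path(path).write_text(json.dumps(list(features)), encoding="utf-8")


def load_features(path):
    """Read the feature names back in the order they were saved."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def select_columns(rows, features):
    """Project each row (a dict) onto the saved features, in saved order."""
    return [[row[name] for name in features] for row in rows]


with tempfile.TemporaryDirectory() as tmp_dir:
    features_file = Path(tmp_dir) / "best_features.json"  # hypothetical name
    save_features(features_file, ["accelerations", "histogram_mean"])
    features = load_features(features_file)

# Toy rows invented for illustration; columns not in the list are ignored.
rows = [
    {"histogram_mean": 137.0, "accelerations": 0.006, "fetal_health": 1.0},
    {"histogram_mean": 107.0, "accelerations": 0.0, "fetal_health": 3.0},
]
X = select_columns(rows, features)
print(X)
```

Because the projection is driven by the saved list, the target column and any other extra columns in the incoming data never reach the model.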