
Initial implementation of the end-to-end autotrain module #1219

Merged: 47 commits, Jul 14, 2021
Changes from 41 commits

Commits
26a9663
first pass @ e2e autotrain
ANarayan Jun 25, 2021
ff115c8
first pass @ auto batch scaling
ANarayan Jun 29, 2021
d8543f4
add additional parameter for pbt scheduler and supports passing time …
ANarayan Jun 29, 2021
7419fc8
add default hyperparameter search space + tune_batch_size parameter
ANarayan Jun 29, 2021
aeba080
add comments and delete tune_config.py
ANarayan Jun 29, 2021
fa28c82
fix bug in assignment of pbt scheduler parameter
ANarayan Jun 29, 2021
ee130f7
fix bug to support pbt scheduler
ANarayan Jun 29, 2021
cc81e27
fix bug and cpu/gpu resource specification in config
ANarayan Jun 29, 2021
03ba456
fix pbt scheduler params and validation metric bug in config files
ANarayan Jun 29, 2021
1da9aae
add max_trials to auto tune function
ANarayan Jun 29, 2021
79e66e2
change search space encoding to only json encode lists which do not c…
ANarayan Jun 30, 2021
aa5e174
add function to support training for tune_batch_size and tune_learnin…
ANarayan Jun 30, 2021
efcca6e
change default scheduler to async_hyperband
ANarayan Jun 30, 2021
bfa0794
sort imports
ANarayan Jun 30, 2021
725f688
makes train an internal func. & adds output_dir param to auto_train
ANarayan Jun 30, 2021
32fa44b
minor naming changes
ANarayan Jul 1, 2021
a887f04
add a first pass @ an auto learning rate tuner
ANarayan Jul 1, 2021
df96d21
minor naming change
ANarayan Jul 1, 2021
cfda49f
replace GPUtil/psutil with ray cluster resources
ANarayan Jul 1, 2021
45a9af3
fix bugs in tune_learning_rate
ANarayan Jul 1, 2021
dc995b9
fix bugs in function imports
ANarayan Jul 1, 2021
eb116e0
add missing type to concat config
ANarayan Jul 1, 2021
336b17e
add support for dask df inputs and add return dict from auto_train api
ANarayan Jul 2, 2021
e783e60
only exclude text features if there are no available GPUs
ANarayan Jul 2, 2021
f93289a
add float to TrialResults dataclass to handle nans produced when auto…
ANarayan Jul 2, 2021
85e7b56
add support for auto keyword for batch_size and learning_rate
ANarayan Jul 2, 2021
5bb9312
add limit on tune batch size halving capacity
ANarayan Jul 2, 2021
01aa523
fix bug in tune batch size
ANarayan Jul 2, 2021
cb2b171
fixed bug in halving logic and added limit on batch_size bound
ANarayan Jul 6, 2021
1767247
add eager mode execution to tune_batch_size
ANarayan Jul 9, 2021
72cb96a
catch failed trials
ANarayan Jul 10, 2021
ba1d952
handles edge case where a trial never starts
ANarayan Jul 10, 2021
aa79c17
fix variable passing bug
ANarayan Jul 12, 2021
49a334e
format value error message
ANarayan Jul 12, 2021
59190b5
add constants BATCH_SIZE, LEARNING_RATE, AUTO
ANarayan Jul 13, 2021
5da3865
add more constants
ANarayan Jul 13, 2021
74715c9
add ray import exception
ANarayan Jul 13, 2021
d23d1de
add try/finally catch to ensure eager execution mode is properly reset
ANarayan Jul 13, 2021
1eeddd5
add ray import exception to utils.py
ANarayan Jul 13, 2021
9d663e8
remove accidental batch_size import
ANarayan Jul 13, 2021
cf922e6
add CONFIG to constants
ANarayan Jul 13, 2021
56170bb
minor change
ANarayan Jul 13, 2021
e08e758
add COMBINER to constants
ANarayan Jul 14, 2021
4926ff1
fix ray import exception and function signatures
ANarayan Jul 14, 2021
e16b3a7
remove unused import
ANarayan Jul 14, 2021
415acd2
change nan exception catch to warning
ANarayan Jul 14, 2021
faae740
Merge branch 'master' into automl
tgaddair Jul 14, 2021
73 changes: 49 additions & 24 deletions ludwig/api.py
@@ -36,7 +36,7 @@
 from ludwig.backend import Backend, initialize_backend
 from ludwig.callbacks import Callback
-from ludwig.constants import FULL, PREPROCESSING, TEST, TRAINING, VALIDATION
+from ludwig.constants import FULL, PREPROCESSING, TEST, TRAINING, VALIDATION, LEARNING_RATE, BATCH_SIZE, AUTO
 from ludwig.data.dataset.base import Dataset
 from ludwig.data.postprocessing import convert_predictions, postprocess
 from ludwig.data.preprocessing import (load_metadata,
@@ -347,12 +347,12 @@ def train(
         # if we are skipping all saving,
         # there is no need to create a directory that will remain empty
         should_create_output_directory = not (
-                skip_save_training_description and
-                skip_save_training_statistics and
-                skip_save_model and
-                skip_save_progress and
-                skip_save_log and
-                skip_save_processed_input
+            skip_save_training_description and
+            skip_save_training_statistics and
+            skip_save_model and
+            skip_save_progress and
+            skip_save_log and
+            skip_save_processed_input
         )

         output_url = output_directory
@@ -365,7 +365,8 @@
                 output_directory)

         if isinstance(training_set, Dataset) and training_set_metadata is not None:
-            preprocessed_data = (training_set, validation_set, test_set, training_set_metadata)
+            preprocessed_data = (
+                training_set, validation_set, test_set, training_set_metadata)
         else:
             # save description
             if self.backend.is_coordinator():
@@ -384,10 +385,12 @@
                 # print description
                 logger.info('Experiment name: {}'.format(experiment_name))
                 logger.info('Model name: {}'.format(model_name))
-                logger.info('Output directory: {}'.format(output_directory))
+                logger.info(
+                    'Output directory: {}'.format(output_directory))
                 logger.info('\n')
                 for key, value in description.items():
-                    logger.info('{}: {}'.format(key, pformat(value, indent=4)))
+                    logger.info('{}: {}'.format(
+                        key, pformat(value, indent=4)))
                 logger.info('\n')

             preprocessed_data = self.preprocess(
@@ -421,7 +424,8 @@
         if self.backend.is_coordinator():
             logger.info('Training set: {0}'.format(len(training_set)))
             if validation_set is not None:
-                logger.info('Validation set: {0}'.format(len(validation_set)))
+                logger.info('Validation set: {0}'.format(
+                    len(validation_set)))
             if test_set is not None:
                 logger.info('Test set: {0}'.format(len(test_set)))
             if not skip_save_model:
@@ -476,6 +480,26 @@
                 config_fp=self.config_fp,
             )

+            # auto tune batch size
+            if self.config[TRAINING][BATCH_SIZE] == AUTO:
+                # TODO (ASN): add support for substitute_with_max parameter
+                tuned_batch_size = trainer.tune_batch_size(
+                    self.config,
+                    training_set,
+                    random_seed=random_seed
+                )
+                self.config[TRAINING][BATCH_SIZE] = tuned_batch_size
+
+            # auto tune learning rate
+            if self.config[TRAINING][LEARNING_RATE] == AUTO:
+                new_learning_rate = trainer.tune_learning_rate(
+                    self.config,
+                    LudwigModel.create_model(self.config, random_seed),
+                    training_set,
+                    random_seed=random_seed
+                )
+                self.config[TRAINING][LEARNING_RATE] = new_learning_rate
+
             # train model
             if self.backend.is_coordinator():
                 print_boxed('TRAINING')
@@ -512,7 +536,8 @@
             # results of the model with highest validation test performance
             if self.backend.is_coordinator() and validation_set is not None:
                 epoch_best_vali_metric, best_vali_metric = best_function(
-                    enumerate(validation_field_result[validation_metric]),
+                    enumerate(
+                        validation_field_result[validation_metric]),
                     key=lambda pair: pair[1]
                 )
                 logger.info(
@@ -707,7 +732,7 @@ def predict(
         # if we are skipping all saving,
         # there is no need to create a directory that will remain empty
         should_create_exp_dir = not (
-                skip_save_unprocessed_output and skip_save_predictions
+            skip_save_unprocessed_output and skip_save_predictions
         )
         if should_create_exp_dir:
             makedirs(output_directory, exist_ok=True)
@@ -720,7 +745,7 @@
                 output_directory=output_directory,
                 backend=self.backend,
                 skip_save_unprocessed_output=skip_save_unprocessed_output
-                    or not self.backend.is_coordinator(),
+                or not self.backend.is_coordinator(),
             )
             converted_postproc_predictions = convert_predictions(
                 postproc_predictions,
@@ -859,9 +884,9 @@ def evaluate(
         # if we are skipping all saving,
         # there is no need to create a directory that will remain empty
         should_create_exp_dir = not (
-                skip_save_unprocessed_output and
-                skip_save_predictions and
-                skip_save_eval_stats
+            skip_save_unprocessed_output and
+            skip_save_predictions and
+            skip_save_eval_stats
         )
         if should_create_exp_dir:
             makedirs(output_directory, exist_ok=True)
@@ -875,16 +900,16 @@
                 output_directory=output_directory,
                 backend=self.backend,
                 skip_save_unprocessed_output=skip_save_unprocessed_output
-                    or not self.backend.is_coordinator(),
+                or not self.backend.is_coordinator(),
             )
         else:
             postproc_predictions = predictions  # = {}

         if self.backend.is_coordinator():
             should_save_predictions = (
-                    collect_predictions
-                    and postproc_predictions is not None
-                    and not skip_save_predictions
+                collect_predictions
+                and postproc_predictions is not None
+                and not skip_save_predictions
             )
             if should_save_predictions:
                 save_prediction_outputs(
@@ -1285,7 +1310,7 @@ def preprocess(
                 training_set_metadata) = preprocessed_data

         return proc_training_set, proc_validation_set, proc_test_set, \
-               training_set_metadata
+            training_set_metadata

     @staticmethod
     def load(
@@ -1347,7 +1372,7 @@ def load(
                     model_dir,
                     MODEL_HYPERPARAMETERS_FILE_NAME
                 )
-                ))
+            ))

         if backend_param is None and 'backend' in config:
             # Reset backend from config
@@ -1697,7 +1722,7 @@ def kfold_cross_validate(
     else:
         ValueError(
             "{} format is not supported for k_fold_cross_validate()"
-                .format(data_format)
+            .format(data_format)
         )

     kfold_cv_stats = {}
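
The hunk above wires the new AUTO keyword into train(): when training.batch_size or training.learning_rate is set to 'auto', the trainer tunes the value and writes it back into the config before fitting starts. A minimal sketch of a config exercising this (feature names and dataset path are hypothetical):

from ludwig.api import LudwigModel

config = {
    'input_features': [{'name': 'doc_text', 'type': 'text'}],    # hypothetical feature
    'output_features': [{'name': 'label', 'type': 'category'}],  # hypothetical feature
    'training': {
        'batch_size': 'auto',     # resolved by trainer.tune_batch_size
        'learning_rate': 'auto',  # resolved by trainer.tune_learning_rate
    },
}

model = LudwigModel(config)
model.train(dataset='my_dataset.csv')  # hypothetical path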
Empty file added ludwig/automl/__init__.py
Empty file.
111 changes: 111 additions & 0 deletions ludwig/automl/automl.py
@@ -0,0 +1,111 @@
"""
automl.py

Driver script which:

(1) Builds a base config by performing type inference and populating the
config w/ default combiner parameters, training parameters, and hyperopt
search space
(2) Tunes config based on resource constraints
(3) Runs hyperparameter optimization experiment
"""
import logging
import sys
from typing import Dict, Union

import numpy as np
import pandas as pd
from ludwig.automl.base_config import create_default_config
from ludwig.hyperopt.run import hyperopt

logger = logging.getLogger(__name__)


try:
import dask.dataframe as dd
import ray
except ImportError:
logger.error(
' ray is not installed. '
'In order to use auto_train please run '
'pip install ludwig[ray]'
)
sys.exit(-1)
Reviewer comment (Collaborator): I know we do this in a few other places in Ludwig, but for programmatic usage, we should probably avoid calling sys.exit in case the user doesn't want their notebook to crash. Maybe raise an exception?

Reply (Collaborator): Good point.
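
A sketch of the suggested alternative, which later commits in this PR appear to adopt ("add ray import exception"); the exact exception type and message wording are assumptions:

try:
    import dask.dataframe as dd
    import ray
except ImportError:
    # raise instead of sys.exit(-1) so programmatic callers can recover
    raise ImportError(
        'ray is not installed. '
        'In order to use auto_train please run '
        'pip install ludwig[ray]'
    )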


OUTPUT_DIR = "."


def model_select(default_configs):
"""
Performs model selection based on dataset.
Note: the current implementation returns tabnet by default; this will be
improved in subsequent iterations.
"""
return default_configs['tabnet']


def auto_train(
dataset: Union[str, pd.DataFrame, dd.core.DataFrame],
target: str,
time_limit_s: Union[int, float],
output_dir: str = OUTPUT_DIR,
config=None,
):
"""
Main auto train API that first builds configs for each model type
(e.g. concat, tabnet, transformer). Then selects model based on dataset
attributes. And finally runs a hyperparameter optimization experiment.

All batch and learning rate tuning is done @ training time.

# Inputs
:param dataset: (str) filepath to dataset.
:param target_name: (str) name of target feature
:param time_limit_s: (int, float) total time allocated to auto_train. acts
as the stopping parameter

# Returns
:return: (str) path to best trained model
"""
if config is None:
config = _create_auto_config(dataset, target, time_limit_s)
model_name = config['combiner']['type']
Reviewer comment (Collaborator): COMBINER and TYPE should be constants.

hyperopt_results = _train(config, dataset,
output_dir, model_name=model_name)
experiment_analysis = hyperopt_results.experiment_analysis
# catch edge case where metric_score is nan
# TODO (ASN): Decide how we want to proceed if at least one trial has
# completed
for trial in hyperopt_results.ordered_trials:
if np.isnan(trial.metric_score):
raise ValueError(
"There was an error running the experiment. "
"A trial failed to start. "
"Consider increasing the time budget for experiment. "
Reviewer comment (Collaborator) on lines +78 to +80: Are we sure failing to start is the only possible reason for a NaN?

Reply (Collaborator Author): @w4nderlust Not sure - let me investigate.

Reply (Collaborator Author): Another way around it is to just check if the training_stats and eval_stats are empty dicts.

)
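# (Sketch, per the review thread above; attribute names are assumptions):
# an alternative guard would also treat a trial with empty stats as failed:
#   if not trial.training_stats and not trial.eval_stats:
#       ...handle the failed trial...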

autotrain_results = {
'path_to_best_model': experiment_analysis.best_checkpoint,
'trial_id': "_".join(experiment_analysis.best_logdir.split("/")[-1].split("_")[1:])
}
return autotrain_results


def _create_auto_config(dataset, target, time_limit_s) -> dict:
Reviewer comment (Collaborator): Let's make this public by removing the underscore. But for create_default_config and model_select it may make sense to be private.

Reply (Collaborator Author): @tgaddair Totally agree with making create_default_config and model_select private. What's the reasoning for making create_auto_config public?

Reply (Collaborator): The idea would be if the user wants to inspect the auto config and modify it before training, e.g.:

config = create_auto_config()
config['training']['learning_rate'] = 1
auto_train(..., config=config)

Does that seem reasonable to you?

Reply (Collaborator Author): Right, this makes total sense!

default_configs = create_default_config(dataset, target, time_limit_s)
model_config = model_select(default_configs)
return model_config


def _train(
config: Dict,
dataset: Union[str, pd.DataFrame, dd.core.DataFrame],
output_dir: str,
model_name: str
):
hyperopt_results = hyperopt(
config,
dataset=dataset,
output_directory=output_dir,
model_name=model_name
)
return hyperopt_results
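
To close, a minimal usage sketch of the new API as defined above (dataset path and target column are hypothetical):

from ludwig.automl.automl import auto_train

results = auto_train(
    dataset='my_dataset.csv',  # hypothetical; a pandas or dask dataframe also works
    target='label',            # hypothetical target feature
    time_limit_s=3600,         # total time budget for the hyperopt experiment
)
print(results['path_to_best_model'])
print(results['trial_id'])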