Initial implementation of the end-to-end autotrain module #1219
Conversation
Awesome work! Left a few comments.
ludwig/automl/automl.py
Outdated
return default_configs['tabnet'], 'tabnet'

def auto_train(dataset: str, target: str, time_limit_s: Union[int, float]):
I would have output_dir be an optional param here:
..., output_dir: str = OUTPUT_DIR):
ludwig/automl/automl.py
Outdated
train(model_config, dataset, OUTPUT_DIR, model_name=model_name)

def train(
If this is an internal function, I would prefix it with an underscore so it isn't exported in the public API: _train.
ludwig/automl/base_config.py
Outdated
import pandas as pd
from ludwig.automl.utils import FieldInfo, get_avg_words, get_available_resources
from ludwig.utils.data_utils import load_yaml
import os
Nit for import ordering:
# Python standard library imports
import os
# third party imports
import pandas as pd
# ludwig imports
import ludwig
ludwig/automl/base_config.py
Outdated
return experiment_resources

def create_default_config(dataset: str, target_name: str = None, time_limit_s: Union[int, float] = None):
Sticking with Ludwig convention, we'll probably want to allow a union of string (path) and DataFrame (Pandas or Dask) for the dataset.
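Something like this signature, perhaps (a sketch under that assumption; the Dask type is left out here since dask is an optional dependency, as discussed further down):
from typing import Union
import pandas as pd

def create_default_config(
    dataset: Union[str, pd.DataFrame],  # path to data or an in-memory DataFrame
    target_name: str = None,
    time_limit_s: Union[int, float] = None,
):
    ...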
we may also want to have more than one target. it's fine if we make it work with only one for now though
ludwig/automl/base_config.py
Outdated
# Second pass to exclude fields that are too expensive given the constraints
for meta in metadata:
    if input_count > 2 and meta["config"]["type"] == "text":
        # By default, exclude text inputs when there are other candidate inputs
One thing we could do here is include text only if we have a GPU.
agreed -- this makes sense.
ludwig/models/trainer.py
Outdated
if high - low <= 1:
    break

# Restore original parameters to defaults
We can put this in a try-finally block to ensure we restore the params even on exception.
Not sure if we need a finally clause here since we restore the original parameters outside of the while loop.
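For reference, a minimal sketch of the try-finally pattern being suggested (run_with_restore and tuning_fn are illustrative names, not code from this PR):
def run_with_restore(trainer, tuning_fn):
    # save the attributes we temporarily override, then restore them in a
    # finally block so an exception inside tuning_fn cannot leak state
    saved = (trainer.epochs, trainer.skip_save_progress, trainer.skip_save_log)
    trainer.epochs = 3
    trainer.skip_save_progress = True
    trainer.skip_save_log = True
    try:
        return tuning_fn(trainer)
    finally:
        trainer.epochs, trainer.skip_save_progress, trainer.skip_save_log = saved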
ludwig/models/trainer.py
Outdated
skip_save_progress = self.skip_save_progress
skip_save_log = self.skip_save_log
# Set temporary values
self.epochs = 3
Is it necessary to train for 3 full epochs or just 3 batches?
…ontain floats or ints
ludwig/automl/utils.py
Outdated
avg_words: int = None

def get_avg_words(field: Series) -> int:
It seems to me that this computes a statistic about the length of the strings that make up a column in the data, is that correct? In that case, instead of calling it avg_words I would call it avg_str_len for clarity. Or this could be calculating the number of words in a text, assuming the text can be split by whitespace; in that case a better name would be avg_num_tokens.
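For clarity, a sketch of the two interpretations (the function names are only the suggestions above, not what the PR currently implements):
import pandas as pd

def avg_str_len(field: pd.Series) -> float:
    # mean number of characters per entry
    return field.astype(str).str.len().mean()

def avg_num_tokens(field: pd.Series) -> float:
    # mean number of whitespace-separated tokens per entry
    return field.astype(str).str.split().str.len().mean()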
ludwig/automl/base_config.py
Outdated
return fields, row_count

def get_input_and_output_features(
Suggest renaming to get_features_config: input and output are redundant, and what you are returning is a config.
ludwig/automl/base_config.py
Outdated
},
"excluded": should_exclude(field, row_count, target_name),
"mode": get_predicted_mode(field, target_name),
"missingValues": missing_value_percent,
Just to keep up with Python conventions, I would call it missing_values.
ludwig/automl/base_config.py
Outdated
return metadata

def get_predicted_type(
maybe infer_type?
ludwig/automl/base_config.py
Outdated
return False

def get_predicted_mode(field: FieldInfo, target_name: str = None) -> str:
maybe infer_mode?
… tune time > auto_train budget
Awesome stuff! Only a handful of minor comments.
target_name: str = None,
) -> Dict:
"""
Constructs FeildInfo objects for each feature in dataset. These objects
Nit: typo in "FieldInfo"
hyperopt:
  # goal: maximize
  parameters:
    training.learning_rate:
Is this still needed since we set batch_size and learning_rate to auto?
@tgaddair Reasoning behind the auto keyword is as follows. We want to tune the learning_rate and scale the batch_size only when those parameters are not part of the hyperparameter search space. The challenge with tuning the parameters during a hyperparameter optimization experiment is that the tuning algorithm has no knowledge that the value of the parameter it sampled has changed, and thus it learns an incorrect parameter search space. That being said, I think the default behavior should be to do a hyperparameter search over learning_rate and batch_size (because we learn a relationship between batch_size and lr) vs. tuning them independently. We can of course run experiments comparing the efficacy of tuning vs. search space optimization and change the default behavior.
I see, so in the current implementation the auto keyword will be overridden by the values selected from the search space.
yup!
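To make that precedence concrete, a hedged sketch (the function name and the fallback value are illustrative, not from this PR):
def resolve_learning_rate(config: dict, sampled_params: dict) -> float:
    # a value sampled by hyperopt takes precedence over the "auto" placeholder
    if "training.learning_rate" in sampled_params:
        return sampled_params["training.learning_rate"]
    lr = config["training"]["learning_rate"]
    if lr == "auto":
        return 0.001  # stand-in for whatever the heuristic LR tuner would pick
    return lr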
ludwig/models/trainer.py
Outdated
from ludwig.globals import TRAINING_CHECKPOINTS_DIR_PATH
from ludwig.globals import TRAINING_PROGRESS_TRACKER_FILE_NAME
from ludwig.globals import is_progressbar_disabled
#from ludwig.api import LudwigModel
Nit: remove commented line
ludwig/models/trainer.py
Outdated
self.skip_save_log = True

# Turn eager mode on
tf.config.experimental_run_functions_eagerly(True)
We should probably wrap this in a try-finally just in case an exception is raised and handled by the user. So something like:
tf.config.experimental_run_functions_eagerly(True)
try:
...
finally:
tf.config.experimental_run_functions_eagerly(False)
ludwig/automl/automl.py
Outdated
:return: (str) path to best trained model
"""

default_configs = create_default_config(dataset, target, time_limit_s)
I think it could be useful to separate out create_default_config in case the user wants to do something like: create default config, modify, then auto-train.
So perhaps we could do this:
def auto_train(..., config=None):
    if config is None:
        config = create_auto_config(dataset, target, time_limit_s)
This create_auto_config would then combine the default config creation with model selection. Maybe instead of returning the model name we can obtain this from the combiner name? That would also simplify the API for end users.
@tgaddair this makes sense. Just to clarify, we would expect that if the user wants to modify the config themselves, they would call create_default_config first before calling auto_train, right?
Yes, exactly.
ludwig/api.py
Outdated
)
self.config[TRAINING]['batch_size'] = tuned_batch_size

# auto tune learning rate
if self.config[TRAINING]["learning_rate"] == "auto":
AUTO, LEARNING_RATE and BATCH_SIZE can become constants
ludwig/automl/automl.py
Outdated
from typing import Dict, Union

import dask.dataframe as dd
dask may not be installed; it is only installed with the distributed package.
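A minimal sketch of the kind of import guard that could address this (the helper name is illustrative):
try:
    import dask.dataframe as dd
except ImportError:
    dd = None  # dask is an optional dependency, only needed for Dask DataFrames

def _is_dask_df(dataset) -> bool:
    # treat the input as a Dask DataFrame only if dask is actually available
    return dd is not None and isinstance(dataset, dd.DataFrame)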
"There was an error running the experiment. " | ||
"A trial failed to start. " | ||
"Consider increasing the time budget for experiment. " |
Are we sure failing to start is the only possible reason for a NaN?
@w4nderlust Not sure - let me investigate
Another way around it is to just check if the training_stats and eval_stats are empty dicts.
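A sketch of that check, assuming both values are plain dicts as described above (the helper name is illustrative):
def trial_produced_results(training_stats: dict, eval_stats: dict) -> bool:
    # an empty stats dict signals the trial produced nothing usable,
    # whether it failed to start or died mid-training
    return bool(training_stats) and bool(eval_stats)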
ludwig/automl/base_config.py
Outdated
target_name: str = None,
) -> dict:
) -> Dict: |
Here and in all other places where we are using typing.Dict: we should use dict if we don't specify the types of keys and values, and Dict[keytype, valuetype] otherwise, just for consistency.
More info: https://stackoverflow.com/questions/37087457/difference-between-defining-typing-dict-and-dict
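For example (illustrating the convention only, not code from the PR):
from typing import Dict

def normalize_counts(counts: dict) -> Dict[str, float]:
    # plain dict when key/value types are unspecified,
    # Dict[key_type, value_type] when they are
    total = sum(counts.values()) or 1
    return {str(k): v / total for k, v in counts.items()}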
ludwig/automl/base_config.py
Outdated
# Exclude text fields if no GPUs are available
if resources['gpu'] == 0:
    for meta in metadata:
        if input_count > 2 and meta["config"]["type"] == "text":
CONFIG, TYPE and TEXT can all be constants
random_seed: int = default_random_seed,
min_lr: float = 1e-8,
max_lr: float = 1.0,
total_training_steps: int = 100, |
this seems like a big number of steps
@w4nderlust totally agree. I pulled this default value from the PyTorch Lightning LR tuner. We can modify our implementation if you think something smaller is more appropriate.
We are spanning 8 orders of magnitude with the default parameters, so probably fewer than 10 steps per order of magnitude would be sufficient. Although let's keep it like this for now; if we figure out that it's too slow, 50 would probably be sufficient too.
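To make the arithmetic concrete, assuming the range test sweeps learning rates log-uniformly between min_lr and max_lr:
import numpy as np

min_lr, max_lr, total_training_steps = 1e-8, 1.0, 100
lrs = np.logspace(np.log10(min_lr), np.log10(max_lr), num=total_training_steps)
# 8 orders of magnitude over 100 steps ~= 12.5 learning rates per decade;
# dropping to 50 steps would still leave ~6 per decade
print(lrs[0], lrs[-1], len(lrs))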
ludwig/automl/automl.py
Outdated
return autotrain_results

def _create_auto_config(dataset, target, time_limit_s) -> dict:
Let's make this public by removing the underscore. But for create_default_config and model_select it may make sense to stay private.
@tgaddair Totally agree with making create_default_config and model_select private. What's the reasoning for making create_auto_config public?
The idea would be if the user wants to inspect the auto config and modify it before training, e.g.:
config = create_auto_config()
config['training']['learning_rate'] = 1
auto_train(..., config=config)
Does that seem reasonable to you?
Right, right. This makes total sense!
ludwig/automl/base_config.py
Outdated
    'gpu_resources_per_trial': 1
})
if cpu_count > 1:
    cpus_per_trial = int(cpu_count/gpu_count)
Maybe max(int(cpu_count / gpu_count), 1)
ludwig/automl/automl.py
Outdated
'In order to use auto_train please run '
'pip install ludwig[ray]'
)
sys.exit(-1) |
I know we do this in a few other places in Ludwig, but for programmatic usage we should probably avoid calling sys.exit in case the user doesn't want their notebook to crash. Maybe raise an exception?
Good point
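A sketch of what raising instead of exiting could look like (exception type and message wording are illustrative):
try:
    import ray  # noqa: F401
except ImportError as e:
    raise ImportError(
        'ray is not installed. '
        'In order to use auto_train please run '
        'pip install ludwig[ray]'
    ) from e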
ludwig/automl/automl.py
Outdated
hyperopt_results = _train(model_config, dataset,
if config is None:
    config = _create_auto_config(dataset, target, time_limit_s)
model_name = config['combiner']['type']
COMBINER and TYPE should be constants
LGTM! Can't wait to try it out.
This PR: