Initial implementation of the end-to-end autotrain module #1219

Merged: 47 commits merged on Jul 14, 2021

Conversation

@ANarayan (Collaborator) commented Jun 29, 2021

This PR:

  • adds the automl module
  • adds support for specifying a Ray Tune time budget
  • adds support for auto-tuning the batch size
  • adds support for auto-selecting the learning rate

@ANarayan marked this pull request as draft on June 29, 2021 04:06
@tgaddair (Collaborator) left a comment

Awesome work! Left a few comments.

return default_configs['tabnet'], 'tabnet'


def auto_train(dataset: str, target: str, time_limit_s: Union[int, float]):
Collaborator:

I would have output_dir be an optional param here:

..., output_dir: str = OUTPUT_DIR):

train(model_config, dataset, OUTPUT_DIR, model_name=model_name)


def train(
Collaborator:

If this is an internal function, I would prefix it with an underscore so it isn't exported in the public API: _train.

import pandas as pd
from ludwig.automl.utils import FieldInfo, get_avg_words, get_available_resources
from ludwig.utils.data_utils import load_yaml
import os
Collaborator:

Nit for import ordering:

# Python standard library imports
import os

# third party imports
import pandas as pd

# ludwig imports
import ludwig

return experiment_resources


def create_default_config(dataset: str, target_name: str = None, time_limit_s: Union[int, float] = None):
Collaborator:

Sticking with Ludwig convention, we'll probably want to allow a union of string (path) and DataFrame (Pandas or Dask) for the dataset.

Collaborator:

We may also want to have more than one target. It's fine if we make it work with only one for now, though.

# Second pass to exclude fields that are too expensive given the constraints
for meta in metadata:
if input_count > 2 and meta["config"]["type"] == "text":
# By default, exclude text inputs when there are other candidate inputs
Collaborator:

One thing we could do here is include text only if we have a GPU.

Collaborator Author:

agreed -- this makes sense.

ludwig/automl/utils.py (resolved)
if high - low <= 1:
break

# Restore original parameters to defaults
Collaborator:

We can put this in a try-finally block to ensure we restore the params even on exception.

Collaborator Author:

Not sure if we need a finally clause here since we restore the original parameters outside of the while loop.
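
For reference, a minimal sketch of the suggested try/finally pattern, using a hypothetical stand-in class with the same attributes (the tuning loop itself is elided):

class _BatchSizeTuner:
    def __init__(self):
        self.epochs = 100
        self.skip_save_progress = False
        self.skip_save_log = False

    def tune_batch_size(self):
        # Save original parameters
        epochs = self.epochs
        skip_save_progress = self.skip_save_progress
        skip_save_log = self.skip_save_log
        # Set temporary values
        self.epochs = 3
        self.skip_save_progress = True
        self.skip_save_log = True
        try:
            pass  # ... search over candidate batch sizes ...
        finally:
            # Restore original parameters even if the search raises
            self.epochs = epochs
            self.skip_save_progress = skip_save_progress
            self.skip_save_log = skip_save_log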

skip_save_progress = self.skip_save_progress
skip_save_log = self.skip_save_log
# Set temporary values
self.epochs = 3
Collaborator:

Is it necessary to train for 3 full epochs or just 3 batches?

ludwig/automl/defaults/concat_config.yaml (resolved)

avg_words: int = None


def get_avg_words(field: Series) -> int:
Collaborator:

It seems to me that this computes a statistic about the length of the strings that make up a column in the data, is that correct? In that case, instead of calling it avg_words I would call it avg_str_len for clarity.
Or this could be calculating the number of words in a text, assuming the text can be split by whitespace; in that case a better name would be avg_num_tokens.
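
To illustrate the distinction, a small sketch of the two statistics (the function names are only the suggested ones):

import pandas as pd
from pandas import Series

def avg_str_len(field: Series) -> float:
    # Average number of characters per value in the column
    return field.astype(str).str.len().mean()

def avg_num_tokens(field: Series) -> float:
    # Average number of whitespace-separated tokens per value
    return field.astype(str).str.split().str.len().mean()

col = pd.Series(["hello world", "a longer example sentence"])
print(avg_str_len(col))     # 18.0
print(avg_num_tokens(col))  # 3.0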

return fields, row_count


def get_input_and_output_features(
Collaborator:

Suggest renaming to get_features_config: input and output are redundant, and what you are returning is a config.

},
"excluded": should_exclude(field, row_count, target_name),
"mode": get_predicted_mode(field, target_name),
"missingValues": missing_value_percent,
Collaborator:

Just to keep with Python conventions, I would call it missing_values.

return metadata


def get_predicted_type(
Collaborator:

maybe infer_type?

return False


def get_predicted_mode(field: FieldInfo, target_name: str = None) -> str:
Collaborator:

maybe infer_mode?

@ANarayan marked this pull request as ready for review on July 12, 2021 16:53
@ANarayan changed the base branch from automl to master on July 12, 2021 16:53
@ANarayan changed the title from "first pass at end to end auto train module" to "[WIP] End-to-end autotrain module" on Jul 12, 2021
@tgaddair changed the title from "[WIP] End-to-end autotrain module" to "Initial implementation of the end-to-end autotrain module" on Jul 12, 2021
@tgaddair (Collaborator) left a comment

Awesome stuff! Only a handful of minor comments.

ludwig/automl/base_config.py (resolved)
ludwig/automl/automl.py (resolved)
target_name: str = None,
) -> Dict:
"""
Constructs FeildInfo objects for each feature in dataset. These objects
Collaborator:

Nit: typo in "FieldInfo"

hyperopt:
# goal: maximize
parameters:
training.learning_rate:
Collaborator:

Is this still needed since we set batch_size and learning_rate to auto?

Collaborator Author:

@tgaddair The reasoning behind the auto keyword is as follows. We want to tune the learning_rate and scale the batch_size only when those parameters are not part of the hyperparameter search space. The challenge with tuning the parameters during a hyperparameter optimization experiment is that the tuning algorithm has no knowledge that the value of the parameter it sampled has changed, and thus learns an incorrect parameter search space. That being said, I think the default behavior should be to do a hyperparameter search over learning_rate and batch_size (because we learn a relationship between batch_size and lr) vs. tuning them independently. We can of course run experiments comparing the efficacy of tuning vs. search-space optimization and change the default behavior.

Collaborator:

I see, so in the current implementation the auto keyword will be overridden by the values selected from the search space.

Collaborator Author:

yup!
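
For context, a sketch of what the default behavior described above could look like as a search space over both parameters, expressed as a Python dict for the hyperopt section (keys and ranges are illustrative and may not match the exact schema of the Ludwig/Ray Tune version in use):

# Illustrative hyperopt parameters searching learning_rate and batch_size jointly
hyperopt_parameters = {
    'training.learning_rate': {
        'space': 'loguniform',
        'lower': 1e-5,
        'upper': 1e-1,
    },
    'training.batch_size': {
        'space': 'choice',
        'categories': [64, 128, 256],
    },
}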

from ludwig.globals import TRAINING_CHECKPOINTS_DIR_PATH
from ludwig.globals import TRAINING_PROGRESS_TRACKER_FILE_NAME
from ludwig.globals import is_progressbar_disabled
#from ludwig.api import LudwigModel
Collaborator:

Nit: remove commented line

self.skip_save_log = True

# Turn eager mode on
tf.config.experimental_run_functions_eagerly(True)
Collaborator:

We should probably wrap this in a try-finally just in case an exception is raised and handled by the user. So something like:

tf.config.experimental_run_functions_eagerly(True)
try:
    ...
finally:
    tf.config.experimental_run_functions_eagerly(False)

ludwig/automl/automl.py (resolved)
:return: (str) path to best trained model
"""

default_configs = create_default_config(dataset, target, time_limit_s)
@tgaddair (Collaborator) commented Jul 12, 2021

I think it could be useful to separate out the create_default_config in case the user wants to do something like: create default config, modify, then auto-train.

So perhaps we could do this:

def auto_train(..., config=None):
    if config is None:
        config = create_auto_config(dataset, target, time_limit_s)

This create_auto_config would then combine the default config creation with model selection. Maybe instead of returning the model name we can obtain this from the combiner name? That would also simplify the API for end users.

Collaborator Author:

@tgaddair this makes sense. Just to clarify, we would expect that if the user wants to modify the config themselves, they would call create_default_config first before calling auto_train, right?

Collaborator:

Yes, exactly.

ludwig/api.py (outdated)
)
self.config[TRAINING]['batch_size'] = tuned_batch_size

# auto tune learning rate
if self.config[TRAINING]["learning_rate"] == "auto":
Collaborator:

AUTO, LEARNING_RATE and BATCH_SIZE can become constants
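
A sketch of the suggested constants (Ludwig keeps such strings in ludwig/constants.py; the exact names below are only suggestions):

# Suggested additions (names illustrative)
AUTO = 'auto'
LEARNING_RATE = 'learning_rate'
BATCH_SIZE = 'batch_size'

# The check above would then read, e.g.:
# if self.config[TRAINING][LEARNING_RATE] == AUTO: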

from typing import Dict, Union

import dask.dataframe as dd
Collaborator:

Dask may not be installed; it is installed only in the distributed package.

Comment on lines +67 to +69
"There was an error running the experiment. "
"A trial failed to start. "
"Consider increasing the time budget for experiment. "
Collaborator:

Are we sure failing to start is the only possible reason for a NaN?

Collaborator Author:

@w4nderlust Not sure - let me investigate

Collaborator Author:

Another way around it is to just check if the training_stats and eval_stats are empty dicts.
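
A sketch of the alternative check mentioned here (the function name and wrapping are hypothetical):

def check_trial_results(training_stats: dict, eval_stats: dict) -> None:
    # Empty stats dicts indicate the trial produced no results
    if not training_stats and not eval_stats:
        raise RuntimeError(
            'There was an error running the experiment: '
            'the trial produced no training or evaluation statistics. '
            'Consider increasing the time budget for the experiment.'
        )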

target_name: str = None,
) -> dict:
) -> Dict:
Collaborator:

Here and in all other places where we are using typing.Dict, we should use dict if we don't specify the types of keys and values, and Dict[keytype, valuetype] otherwise. Just for consistency.

More info: https://stackoverflow.com/questions/37087457/difference-between-defining-typing-dict-and-dict
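
A small sketch of the convention being proposed (function names are illustrative):

from typing import Any, Dict

def summarize(row: str) -> dict:
    # Key/value types left unspecified: plain dict is enough
    return {'raw': row}

def summarize_typed(row: str) -> Dict[str, Any]:
    # Key/value types specified: use Dict[key_type, value_type]
    return {'raw': row, 'length': len(row)}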

# Exclude text fields if no GPUs are available
if resources['gpu'] == 0:
for meta in metadata:
if input_count > 2 and meta["config"]["type"] == "text":
Collaborator:

CONFIG, TYPE and TEXT can all be constants

random_seed: int = default_random_seed,
min_lr: float = 1e-8,
max_lr: float = 1.0,
total_training_steps: int = 100,
Collaborator:

this seems like a big number of steps

Collaborator Author:

@w4nderlust Totally agree. I pulled this default value from the PyTorch Lightning LR tuner. We can modify our implementation if you think something smaller is more appropriate.

Collaborator:

We are spanning eight orders of magnitude with the default parameters; probably fewer than 10 steps per order of magnitude would be sufficient. Let's keep it like this for now though, and if we figure out that it's too slow, 50 would probably be sufficient too.
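
To make the arithmetic concrete, a log-spaced sweep over the default range looks roughly like this (a sketch only, not the tuner's actual implementation):

import numpy as np

min_lr, max_lr = 1e-8, 1.0
total_training_steps = 100

num_orders = np.log10(max_lr) - np.log10(min_lr)  # 8 orders of magnitude
print(total_training_steps / num_orders)          # 12.5 steps per order with the defaults

# One candidate learning rate per training step, evenly spaced on a log scale
candidate_lrs = np.logspace(np.log10(min_lr), np.log10(max_lr), num=total_training_steps)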

return autotrain_results


def _create_auto_config(dataset, target, time_limit_s) -> dict:
Collaborator:

Let's make this public by removing the underscore. But it may make sense for create_default_config and model_select to be private.

Collaborator Author:

@tgaddair Totally agree with making create_default_config and model_select private. What's the reasoning for making create_auto_config public?

Collaborator:

The idea would be if the user wants to inspect the auto config and modify it before training, e.g.:

config = create_auto_config()
config['training']['learning_rate'] = 1
auto_train(..., config=config)

Does that seem reasonable to you?

Collaborator Author:

Right, right. This makes total sense!

'gpu_resources_per_trial': 1
})
if cpu_count > 1:
cpus_per_trial = int(cpu_count/gpu_count)
Collaborator:

Maybe max(int(cpu_count / gpu_count), 1)

'In order to use auto_train please run '
'pip install ludwig[ray]'
)
sys.exit(-1)
Collaborator:

I know we do this in a few other places in Ludwig, but for programmatic usage we should probably avoid calling sys.exit, in case the user doesn't want their notebook to crash. Maybe raise an exception?

Collaborator:

Good point
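
A sketch of the suggested pattern (the exception type and message wording are illustrative):

try:
    import ray  # noqa: F401
except ImportError as e:
    raise ImportError(
        'ray is not installed. '
        'In order to use auto_train please run '
        'pip install ludwig[ray]'
    ) from e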

hyperopt_results = _train(model_config, dataset,
if config is None:
config = _create_auto_config(dataset, target, time_limit_s)
model_name = config['combiner']['type']
Collaborator:

COMBINER and TYPE should be constants

@tgaddair (Collaborator) left a comment

LGTM! Can't wait to try it out.

@tgaddair merged commit c3fffea into ludwig-ai:master on Jul 14, 2021