Initial implementation of the end-to-end autotrain module #1219

Merged: 47 commits merged on Jul 14, 2021

Conversation

@ANarayan (Collaborator) commented Jun 29, 2021

This PR:

  • adds the automl module
  • adds support for specifying a Ray Tune time budget
  • adds support for auto-tuning the batch size
  • adds support for auto-selecting the learning rate

@ANarayan marked this pull request as draft on June 29, 2021 04:06
@tgaddair (Collaborator) left a comment

Awesome work! Left a few comments.

return default_configs['tabnet'], 'tabnet'


def auto_train(dataset: str, target: str, time_limit_s: Union[int, float]):
Collaborator:

I would have output_dir be an optional param here:

..., output_dir: str = OUTPUT_DIR):

train(model_config, dataset, OUTPUT_DIR, model_name=model_name)


def train(
Collaborator:

If this is an internal function, I would prefix it with an underscore so it isn't exported in the public API: _train.

import pandas as pd
from ludwig.automl.utils import FieldInfo, get_avg_words, get_available_resources
from ludwig.utils.data_utils import load_yaml
import os
Collaborator:

Nit for import ordering:

# Python standard library imports
import os

# third party imports
import pandas as pd

# ludwig imports
import ludwig

return experiment_resources


def create_default_config(dataset: str, target_name: str = None, time_limit_s: Union[int, float] = None):
Collaborator:

Sticking with Ludwig convention, we'll probably want to allow a union of string (path) and DataFrame (Pandas or Dask) for the dataset.

Collaborator:

We may also want to have more than one target. It's fine if we make it work with only one for now, though.

# Second pass to exclude fields that are too expensive given the constraints
for meta in metadata:
if input_count > 2 and meta["config"]["type"] == "text":
# By default, exclude text inputs when there are other candidate inputs
Collaborator:

One thing we could do here is include text only if we have a GPU.

Collaborator Author:

agreed -- this makes sense.

ludwig/automl/utils.py (resolved)
if high - low <= 1:
break

# Restore original parameters to defaults
Collaborator:

We can put this in a try-finally block to ensure we restore the params even on exception.

Collaborator Author:

Not sure if we need a finally clause here since we restore the original parameters outside of the while loop.
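
For reference, a minimal sketch of the suggested try/finally pattern, using a hypothetical stand-in class with the same attributes (the tuning loop itself is elided):

class _BatchSizeTuner:
    def __init__(self):
        self.epochs = 100
        self.skip_save_progress = False
        self.skip_save_log = False

    def tune_batch_size(self):
        # Save original parameters
        epochs = self.epochs
        skip_save_progress = self.skip_save_progress
        skip_save_log = self.skip_save_log
        # Set temporary values
        self.epochs = 3
        self.skip_save_progress = True
        self.skip_save_log = True
        try:
            pass  # ... search over candidate batch sizes ...
        finally:
            # Restore original parameters even if the search raises
            self.epochs = epochs
            self.skip_save_progress = skip_save_progress
            self.skip_save_log = skip_save_log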

skip_save_progress = self.skip_save_progress
skip_save_log = self.skip_save_log
# Set temporary values
self.epochs = 3
Collaborator:

Is it necessary to train for 3 full epochs or just 3 batches?

ludwig/automl/defaults/concat_config.yaml (resolved)

avg_words: int = None


def get_avg_words(field: Series) -> int:
Collaborator:

It seems to me that this computes a statistic about the length of the strings that make up a column in the data, is that correct? In that case, instead of calling it avg_words I would call it avg_str_len for clarity.
Or this could be calculating the number of words in a text, assuming the text can be split by whitespace; in that case a better name would be avg_num_tokens.
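
To illustrate the distinction, a small sketch of the two statistics (the function names are only the suggested ones):

import pandas as pd
from pandas import Series

def avg_str_len(field: Series) -> float:
    # Average number of characters per value in the column
    return field.astype(str).str.len().mean()

def avg_num_tokens(field: Series) -> float:
    # Average number of whitespace-separated tokens per value
    return field.astype(str).str.split().str.len().mean()

col = pd.Series(["hello world", "a longer example sentence"])
print(avg_str_len(col))     # 18.0
print(avg_num_tokens(col))  # 3.0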

return fields, row_count


def get_input_and_output_features(
Collaborator:

Suggest renaming to get_features_config: input and output are redundant, and what you are returning is a config.

},
"excluded": should_exclude(field, row_count, target_name),
"mode": get_predicted_mode(field, target_name),
"missingValues": missing_value_percent,
Collaborator:

Just to keep with Python conventions, I would call it missing_values.

return metadata


def get_predicted_type(
Collaborator:

maybe infer_type?

return False


def get_predicted_mode(field: FieldInfo, target_name: str = None) -> str:
Collaborator:

maybe infer_mode?

@ANarayan marked this pull request as ready for review on July 12, 2021 16:53
@ANarayan changed the base branch from automl to master on July 12, 2021 16:53
@ANarayan changed the title from "first pass at end to end auto train module" to "[WIP] End-to-end autotrain module" on Jul 12, 2021
@tgaddair changed the title from "[WIP] End-to-end autotrain module" to "Initial implementation of the end-to-end autotrain module" on Jul 12, 2021
@tgaddair (Collaborator) left a comment

Awesome stuff! Only a handful of minor comments.

ludwig/automl/base_config.py (resolved)
ludwig/automl/automl.py (resolved)
target_name: str = None,
) -> Dict:
"""
Constructs FeildInfo objects for each feature in dataset. These objects
Collaborator:

Nit: typo in "FieldInfo"

hyperopt:
# goal: maximize
parameters:
training.learning_rate:
Collaborator:

Is this still needed since we set batch_size and learning_rate to auto?

Collaborator Author:

@tgaddair The reasoning behind the auto keyword is as follows. We want to tune the learning_rate and scale the batch_size only when those parameters are not part of the hyperparameter search space. The challenge with tuning the parameters during a hyperparameter optimization experiment is that the tuning algorithm has no knowledge that the value of the parameter it sampled has changed, and thus learns an incorrect parameter search space. That being said, I think the default behavior should be to do a hyperparameter search over learning_rate and batch_size (because we learn a relationship between batch_size and lr) vs. tuning them independently. We can of course run experiments comparing the efficacy of tuning vs. search-space optimization and change the default behavior.

Collaborator:

I see, so in the current implementation the auto keyword will be overridden by the values selected from the search space.

Collaborator Author:

yup!
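
For context, a sketch of what the default behavior described above could look like as a search space over both parameters, expressed as a Python dict for the hyperopt section (keys and ranges are illustrative and may not match the exact schema of the Ludwig/Ray Tune version in use):

# Illustrative hyperopt parameters searching learning_rate and batch_size jointly
hyperopt_parameters = {
    'training.learning_rate': {
        'space': 'loguniform',
        'lower': 1e-5,
        'upper': 1e-1,
    },
    'training.batch_size': {
        'space': 'choice',
        'categories': [64, 128, 256],
    },
}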

from ludwig.globals import TRAINING_CHECKPOINTS_DIR_PATH
from ludwig.globals import TRAINING_PROGRESS_TRACKER_FILE_NAME
from ludwig.globals import is_progressbar_disabled
#from ludwig.api import LudwigModel
Collaborator:

Nit: remove commented line

self.skip_save_log = True

# Turn eager mode on
tf.config.experimental_run_functions_eagerly(True)
Collaborator:

We should probably wrap this in a try-finally just in case an exception is raised and handled by the user. So something like:

tf.config.experimental_run_functions_eagerly(True)
try:
    ...
finally:
    tf.config.experimental_run_functions_eagerly(False)

ludwig/automl/automl.py (resolved)
:return: (str) path to best trained model
"""

default_configs = create_default_config(dataset, target, time_limit_s)
@tgaddair (Collaborator) commented Jul 12, 2021

I think it could be useful to separate out the create_default_config in case the user wants to do something like: create default config, modify, then auto-train.

So perhaps we could do this:

def auto_train(..., config=None):
    if config is None:
        config = create_auto_config(dataset, target, time_limit_s)

This create_auto_config would then combine the default config creation with model selection. Maybe instead of returning the model name we can obtain this from the combiner name? That would also simplify the API for end users.

Collaborator Author:

@tgaddair this makes sense. Just to clarify, we would expect that if the user wants to modify the config themselves, they would call create_default_config first before calling auto_train, right?

Collaborator:

Yes, exactly.

ludwig/api.py (outdated)
)
self.config[TRAINING]['batch_size'] = tuned_batch_size

# auto tune learning rate
if self.config[TRAINING]["learning_rate"] == "auto":
Collaborator:

AUTO, LEARNING_RATE and BATCH_SIZE can become constants
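
A sketch of the suggested constants (Ludwig keeps such strings in ludwig/constants.py; the exact names below are only suggestions):

# Suggested additions (names illustrative)
AUTO = 'auto'
LEARNING_RATE = 'learning_rate'
BATCH_SIZE = 'batch_size'

# The check above would then read, e.g.:
# if self.config[TRAINING][LEARNING_RATE] == AUTO: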

from typing import Dict, Union

import dask.dataframe as dd
Collaborator:

Dask may not be installed; it is installed only in the distributed package.

Comment on lines +67 to +69
"There was an error running the experiment. "
"A trial failed to start. "
"Consider increasing the time budget for experiment. "
Collaborator:

Are we sure failing to start is the only possible reason for a NaN?

Collaborator Author:

@w4nderlust Not sure - let me investigate

Collaborator Author:

Another way around it is to just check if the training_stats and eval_stats are empty dicts.
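
A sketch of the alternative check mentioned here (the function name and wrapping are hypothetical):

def check_trial_results(training_stats: dict, eval_stats: dict) -> None:
    # Empty stats dicts indicate the trial produced no results
    if not training_stats and not eval_stats:
        raise RuntimeError(
            'There was an error running the experiment: '
            'the trial produced no training or evaluation statistics. '
            'Consider increasing the time budget for the experiment.'
        )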

target_name: str = None,
) -> dict:
) -> Dict:
Collaborator:

Here and in all other places where we are using typing.Dict, we should use dict if we don't specify the types of keys and values, and Dict[keytype, valuetype] otherwise. Just for consistency.

More info: https://stackoverflow.com/questions/37087457/difference-between-defining-typing-dict-and-dict
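
A small sketch of the convention being proposed (function names are illustrative):

from typing import Any, Dict

def summarize(row: str) -> dict:
    # Key/value types left unspecified: plain dict is enough
    return {'raw': row}

def summarize_typed(row: str) -> Dict[str, Any]:
    # Key/value types specified: use Dict[key_type, value_type]
    return {'raw': row, 'length': len(row)}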

# Exclude text fields if no GPUs are available
if resources['gpu'] == 0:
for meta in metadata:
if input_count > 2 and meta["config"]["type"] == "text":
Collaborator:

CONFIG, TYPE and TEXT can all be constants

random_seed: int = default_random_seed,
min_lr: float = 1e-8,
max_lr: float = 1.0,
total_training_steps: int = 100,
Collaborator:

this seems like a big number of steps

Collaborator Author:

@w4nderlust Totally agree. I pulled this default value from the PyTorch Lightning LR tuner. We can modify our implementation if you think something smaller is more appropriate.

Collaborator:

We are spanning eight orders of magnitude with the default parameters; probably fewer than 10 steps per order of magnitude would be sufficient. Let's keep it like this for now though, and if we figure out that it's too slow, 50 would probably be sufficient too.
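
To make the arithmetic concrete, a log-spaced sweep over the default range looks roughly like this (a sketch only, not the tuner's actual implementation):

import numpy as np

min_lr, max_lr = 1e-8, 1.0
total_training_steps = 100

num_orders = np.log10(max_lr) - np.log10(min_lr)  # 8 orders of magnitude
print(total_training_steps / num_orders)          # 12.5 steps per order with the defaults

# One candidate learning rate per training step, evenly spaced on a log scale
candidate_lrs = np.logspace(np.log10(min_lr), np.log10(max_lr), num=total_training_steps)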

return autotrain_results


def _create_auto_config(dataset, target, time_limit_s) -> dict:
Collaborator:

Let's make this public by removing the underscore. But it may make sense for create_default_config and model_select to be private.

Collaborator Author:

@tgaddair Totally agree with making create_default_config and model_select private. What's the reasoning for making create_auto_config public?

Collaborator:

The idea would be if the user wants to inspect the auto config and modify it before training, e.g.:

config = create_auto_config()
config['training']['learning_rate'] = 1
auto_train(..., config=config)

Does that seem reasonable to you?

Collaborator Author:

Right, right. This makes total sense!

'gpu_resources_per_trial': 1
})
if cpu_count > 1:
cpus_per_trial = int(cpu_count/gpu_count)
Collaborator:

Maybe max(int(cpu_count / gpu_count), 1)

'In order to use auto_train please run '
'pip install ludwig[ray]'
)
sys.exit(-1)
Collaborator:

I know we do this in a few other places in Ludwig, but for programmatic usage we should probably avoid calling sys.exit, in case the user doesn't want their notebook to crash. Maybe raise an exception?

Collaborator:

Good point
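
A sketch of the suggested pattern (the exception type and message wording are illustrative):

try:
    import ray  # noqa: F401
except ImportError as e:
    raise ImportError(
        'ray is not installed. '
        'In order to use auto_train please run '
        'pip install ludwig[ray]'
    ) from e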

hyperopt_results = _train(model_config, dataset,
if config is None:
config = _create_auto_config(dataset, target, time_limit_s)
model_name = config['combiner']['type']
Collaborator:

COMBINER and TYPE should be constants

@tgaddair (Collaborator) left a comment

LGTM! Can't wait to try it out.

@tgaddair merged commit c3fffea into ludwig-ai:master on Jul 14, 2021