
wandb sweeps integration for hyperparameter optimization #1124

Open
kiristern opened this issue Apr 21, 2022 · 12 comments · May be fixed by #1249 or #1288
Labels
enhancement, priority:high, wandb


@kiristern

Motivation for the feature

Models take a long time to train, so integrating sweeps for model hyperparameter tuning will help us converge on the best model more quickly.

Description of the feature

  • A sweep is initialized with either a YAML file or a Python dictionary and trains the model with different hyperparameters, as specified in the YAML or dictionary (a minimal dictionary sketch is shown after this list).
  • Can run multiple 'agents' at a time to search more quickly.
  • Importance metrics and best runs are recorded and can be viewed on the wandb dashboard.
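For concreteness, a minimal sweep definition expressed as a Python dictionary could look like the sketch below (the metric and parameter names are illustrative, not taken from ivadomed):

import wandb

# Illustrative sweep definition; equivalent to the YAML form described in the wandb docs
sweep_configuration = {
    "method": "random",  # or "grid" / "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "batch_size": {"values": [8, 16, 32]},
        "initial_lr": {"distribution": "uniform", "min": 1e-5, "max": 0.1},
    },
}

# Registers the sweep and returns its ID; one or more agents can then be
# started to pick hyperparameter combinations from this sweep.
sweep_id = wandb.sweep(sweep_configuration, project="ivado-wandb-testing")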

Wondering if sweep parameters can/should be specified directly in the config.json, for example:

 "wandb": {
        "wandb_api_key": "",
        "project_name": "ivado-wandb-testing",
        "group_name": "temp",
        "run_name": "run-1",
        "log_grads_every": 100
        "sweeps_config" : "random" # sweep method (can also specify 'grid' or 'bayes'); leave "" if don't want to sweep ?
 ...
"training_parameters": {
        "batch_size": {
              # integers between 4 and 32
              # with evenly-distributed logarithms
              "distribution": "q_log_uniform_values",
              "q": 4,
              "min": 4,
              "max": 32
        },
        "training_time": {
            "num_epochs": {
               "values": [15, 25, 50],
             },
            "early_stopping_patience": 50,
            "early_stopping_epsilon": 0.001
        },
        "scheduler": {
            "initial_lr":  {
                  # a uniform distribution between 1e-5 and 0.1
                  'distribution': 'uniform',
                  'min': 1e-5,
                  'max': 0.1
        },
...

See sweep config for more details

Alternatives

Other hyperparameter optimization frameworks: Optuna and SigOpt (there are probably others, but these were some that were suggested in the lab meeting). However, I think wandb sweeps would integrate best, given that wandb is already set up.

@dyt811
Member

dyt811 commented Apr 22, 2022

An optional key under "wandb" sounds quite reasonable. I have not played with wandb's search, but in other frameworks like comet.ml, the hyperparameter search is a fairly small JSON snippet defining the search range/values, etc.

@dyt811 dyt811 added the deep learning and enhancement labels Apr 22, 2022
@naga-karthik naga-karthik self-assigned this Apr 24, 2022
@jcohenadad jcohenadad removed the deep learning label Dec 22, 2022
@jcohenadad
Member

@naga-karthik @kiristern is there any update on this issue? What is your current strategy for running wandb sweeps with ivadomed?

@jcohenadad
Member

jcohenadad commented Dec 22, 2022

I'm thinking, as a "short term" solution, we could maybe come up with a Python wrapper that would generate a config file on the fly (based on an input template, replacing the hyperparameter to be swept), and launch ivadomed inside the wrapper?

And this wrapper could eventually replace https://github.com/ivadomed/ivadomed/blob/master/ivadomed/scripts/automate_training.py, unless some people are using it?
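A minimal sketch of what such a wrapper could look like, assuming ivadomed is launched as ivadomed --train -c <config> and using a hypothetical template file and hypothetical config keys (only the learning rate is swept here):

import copy
import json
import subprocess

# Hypothetical wrapper: loop over values of one hyperparameter, write a config
# on the fly from a template, and launch ivadomed for each value.
with open("config_template.json") as f:
    template = json.load(f)

for lr in [1e-4, 1e-3, 1e-2]:
    cfg = copy.deepcopy(template)
    cfg["training_parameters"]["scheduler"]["initial_lr"] = lr
    cfg["path_output"] = f"results_lr_{lr}"
    config_path = f"config_lr_{lr}.json"
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=4)
    subprocess.run(["ivadomed", "--train", "-c", config_path], check=True)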

@jcohenadad jcohenadad linked a pull request Dec 22, 2022 that will close this issue
@naga-karthik
Member

Sorry, there has not been any update on this yet. I will take it up now.

I have only used sweeps from the CLI, where we run wandb.sweep with the hyperparameters to sweep in a YAML file and then separately run wandb.agent with the sweep ID that the first command returns. However, it seems like both of these commands could be run together, based on the documentation here and here. There is no need to provide a YAML file in that case; the standard key: value format could work.

Given this, it appears that launching ivadomed within a Python script, as you suggested, is a decent solution (and not just a short-term one). I will take a look at how this can be done.
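For reference, here is a hedged sketch of running both steps from a single Python script with a plain dictionary instead of a YAML file (train_fn is a placeholder for whatever ends up wrapping ivadomed's training):

import wandb

def train_fn():
    # wandb.init() inside an agent exposes the hyperparameters chosen by the
    # sweep controller for this run via wandb.config
    run = wandb.init()
    batch_size = run.config["batch_size"]
    # ... placeholder: launch the actual training with this batch_size ...

sweep_id = wandb.sweep(
    {
        "method": "random",
        "metric": {"name": "val_loss", "goal": "minimize"},
        "parameters": {"batch_size": {"values": [16, 32, 64]}},
    },
    project="ivado-wandb-testing",
)
wandb.agent(sweep_id, function=train_fn, count=5)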

@naga-karthik
Member

There is one issue that I realized just now. The way wandb sweeps work is that, based on the hyperparameter ranges you specify, the wandb agents create various combinations on their own to run different models (the YAML file also contains the path to main.py and the argparse arguments defining the hyperparameters that the agents choose by themselves). The results are shown on the Sweeps dashboard, which is different from the standard wandb dashboard used to visualize runs.

Now, with our solution, if we define the Python wrapper containing the hyperparameter ranges and we call ivadomed ourselves, then there is no point in calling wandb.sweep, because we are defining the hyperparameters ourselves (via the config file for ivadomed). In other words, the parameters we want to sweep over will simply appear under the project with some group and run name. This could be one solution, but it will not have the same advantages as a proper "Sweep" on the wandb dashboard.

@jcohenadad
Member

Right -- this was my understanding as well. But my idea was to let wandb sweep, retrieve the parameters at each 'sweep loop' (is that possible?), generate an ivadomed config file with the params for that iteration, and launch ivadomed.
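One way this might be wired up, assuming wandb.agent is given a Python function so that the parameters chosen for each iteration can be read from wandb.config; the template path, config keys, and ivadomed invocation below are assumptions, not a verified integration:

import json
import subprocess
import wandb

def run_ivadomed_once():
    # For each sweep iteration, the agent calls this function; wandb.init()
    # exposes the hyperparameters picked by the sweep controller
    run = wandb.init()
    with open("config_template.json") as f:
        cfg = json.load(f)
    cfg["training_parameters"]["batch_size"] = run.config["batch_size"]
    cfg["training_parameters"]["scheduler"]["initial_lr"] = run.config["initial_lr"]
    config_path = f"config_{run.id}.json"
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=4)
    subprocess.run(["ivadomed", "--train", "-c", config_path], check=True)

sweep_id = wandb.sweep(
    {
        "method": "bayes",
        "metric": {"name": "val_loss_total_avg", "goal": "minimize"},
        "parameters": {
            "batch_size": {"values": [16, 32]},
            "initial_lr": {"distribution": "uniform", "min": 1e-5, "max": 0.1},
        },
    },
    project="ivado-wandb-testing",
)
wandb.agent(sweep_id, function=run_ivadomed_once, count=10)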

@naga-karthik
Member

naga-karthik commented Dec 23, 2022

Aha!

retrieve the parameters at each 'sweep loop' (is that possible)

This is exactly where the problem is. Once we run wandb.sweep, the only thing it returns is an alphanumeric code that should subsequently be used to run wandb.agent. As a result, we don't "see" the parameters until after we have initialized the agent (after which they appear on the dashboard). It is precisely for this reason that I said we have to come up with hyperparameter combinations ourselves in order to feed them to ivadomed's config file (thereby defeating the purpose of wandb sweeps).

One workaround I could think of is as follows:

Do not think about wandb sweeps for the moment and essentially borrow the training process from ivadomed_automate_training, without doing it on multiple GPUs. One could always run multiple hyperparameter sweeps on different GPUs too. Because ivadomed_automate_training provides a way to combine various hyperparameters in different ways, we can use the config files resulting from that to run ivadomed (inside this wrapper). Since we already have wandb inside ivadomed's training.py, we will see all the runs on the dashboard. A direct comparison of all the various hyperparameters and the specific effects of each of them might be difficult to see (this is precisely what the Sweeps dashboard makes easy), but we will at least be able to run a basic sweep in the first place.

@jcohenadad
Member

Do not think about wandb sweeps for the moment and essentially borrow the training process from ivadomed_automate_training, without doing it on multiple GPUs. One could always run multiple hyperparameter sweeps on different GPUs too. Because ivadomed_automate_training provides a way to combine various hyperparameters in different ways, we can use the config files resulting from that to run ivadomed (inside this wrapper). Since we already have wandb inside ivadomed's training.py, we will see all the runs on the dashboard. A direct comparison of all the various hyperparameters and the specific effects of each of them might be difficult to see (this is precisely what the Sweeps dashboard makes easy), but we will at least be able to run a basic sweep in the first place.

Hmm, this is not great because the visualization offered by wandb sweeps is extremely useful. I'm still wondering if there is some modularity in wandb sweeps (not shown in the basic example from the website) that would allow us to use it with ivadomed. A bit more digging is necessary to make sure we are not missing a good opportunity here. Also tagging @kiristern @kanishk16 @dyt811 so they can help with the digging.

@naga-karthik
Member

naga-karthik commented Dec 23, 2022

@jcohenadad I looked into this a bit more, and it seems like my understanding of sweeps was incomplete. Here's a picture of how it works, which I pulled from one of their issues on GitHub here.
[Diagram from the linked wandb issue: the local agent/client syncs with the sweep controller on the wandb server, which returns a hyperparameter combination for each run.]

It appears that wandb sweep simply returns a sweep_id. Once we run wandb agent with the sweep_id, it internally syncs with the sweep controller, which returns different combinations of hyperparameters to the client, thereby making the agent run these hyperparameter combinations. So our initial hypothesis that looking more into wandb sweep would help will not be useful anymore. I am looking at whether we can somehow use the sweep controller directly, but then again, the user's job ends after running wandb agent (so I am not sure how exactly; more digging is needed here). Everything after that happens inside a loop, with the client and the sweep controller communicating internally.

@jcohenadad
Member

I'm not sure I agree with your analysis @naga-karthik. Looking at example code for PyTorch, the key wandb elements are already integrated in ivadomed's training API:

wandb.init(project=project_name, group=group_name, name=run_name, config=cfg, dir=path_output)

wandb.log({"learning_rate": lr})

So my guess is that we would "just" need to implement the wandb sweep functionality into the training API, unless I am missing something?

@naga-karthik
Member

naga-karthik commented Dec 23, 2022

TL;DR
I definitely need some more time to look into this (and also need to brainstorm with the team). I am running into errors very similar to this when I call the sweep and agent functions inside our training API.

What I actually did
After looking at your suggestion, I added the following lines inside our training API:

# wandb_tracking and project_name are already defined in ivadomed's training API
wandb_sweep_params = True
if wandb_tracking and wandb_sweep_params:
    # Hyperparameter ranges to sweep over (hard-coded for testing; these would
    # eventually be retrieved from the main config.json)
    sweep_configuration = {
        'method': 'random',
        'metric': {
            'goal': 'minimize',
            'name': 'val_loss_total_avg'},
        'parameters': {
            'batch_size': {
                'values': [16, 32, 64]
            },
            'num_epochs': {
                'values': [5, 10, 15]
            },
            'depth': {
                'values': [2, 3, 4]
            }
        }
    }
    # wandb.sweep only registers the sweep and returns its ID; wandb.agent is
    # then called without a `function` argument, so it falls back to spawning
    # a CLI command (see the log below)
    sweep_id = wandb.sweep(sweep_configuration, project=project_name)
    wandb.agent(sweep_id, count=5)

where wandb_sweep_params would be an additional key inside our main config.json file, and the values inside the sweep_configuration dictionary would be retrieved from our main config file as well (for testing purposes they currently are not). Now, whenever we call wandb.sweep, as I mentioned above, it only returns a sweep_id, which is useless until we run wandb.agent to use that sweep_id and initialize the run.

Now, wandb.agent does a weird thing. Because these functions are optimized for the CLI, it always runs this command: /usr/bin/env python --batch_size=32 --depth=3 --num_epochs=5 (note that the arguments are the ones I provided in the sweep_configuration dictionary). The whole log from wandb is:

2022-12-23 18:38:40,070 - wandb.wandb_agent - INFO - About to run command: /usr/bin/env python --batch_size=32 --depth=3 --num_epochs=5

This is a problem because we don't want CLI commands to be run from our training API. As suggested in the wandb docs, the best way to use Sweeps is to run these commands separately on the CLI or from a Jupyter notebook. The bottom line is that sweeps integration is not trivial because of the rigidity in how wandb itself provides this feature.

@jcohenadad
Member

jcohenadad commented Dec 27, 2022

I see. Have you tried using Sweep's API? Also, from wandb/wandb#2282 (comment), have you tried:

If running from the command line isn't an option, you could try setting the WANDB_START_METHOD=thread.

There might also be possibilities to use wandb's sweep via a local controller.
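For reference, a minimal way to try the suggested environment variable before starting the agent (untested in this context; sweep_id and train_fn stand in for whatever is defined elsewhere):

import os
import wandb

# Workaround suggested in wandb/wandb#2282: force the thread start method
# before launching the agent from inside another Python process
os.environ["WANDB_START_METHOD"] = "thread"
wandb.agent(sweep_id, function=train_fn, count=5)  # sweep_id / train_fn defined elsewhere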

Don't hesitate to also open an issue on wandb's repository, to explain what we would like to do. If there is a quick solution (or a 'no go'), it would save us a lot of time.
