
wandb sweeps integration for hyperparameter optimization #1124

Open
kiristern opened this issue Apr 21, 2022 · 12 comments · May be fixed by #1249 or #1288
Labels
enhancement, priority:high, wandb


@kiristern

Motivation for the feature

Models take a long time to train, so integrating sweeps for model hyperparameter tuning will help us converge on the best model more quickly.

Description of the feature

  • A sweep is initialized with either a YAML file or a Python dictionary and trains the model with different hyperparameters, as specified in the YAML or dictionary (a minimal dictionary sketch is shown after this list).
  • Can run multiple 'agents' at a time to search more quickly.
  • Importance metrics and best runs are recorded and can be viewed on the wandb dashboard.
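For concreteness, a minimal sweep definition expressed as a Python dictionary could look like the sketch below (the metric and parameter names are illustrative, not taken from ivadomed):

import wandb

# Illustrative sweep definition; equivalent to the YAML form described in the wandb docs
sweep_configuration = {
    "method": "random",  # or "grid" / "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "batch_size": {"values": [8, 16, 32]},
        "initial_lr": {"distribution": "uniform", "min": 1e-5, "max": 0.1},
    },
}

# Registers the sweep and returns its ID; one or more agents can then be
# started to pick hyperparameter combinations from this sweep.
sweep_id = wandb.sweep(sweep_configuration, project="ivado-wandb-testing")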

Wondering if sweep parameters can/should be specified directly in the config.json, for example:

 "wandb": {
        "wandb_api_key": "",
        "project_name": "ivado-wandb-testing",
        "group_name": "temp",
        "run_name": "run-1",
        "log_grads_every": 100
        "sweeps_config" : "random" # sweep method (can also specify 'grid' or 'bayes'); leave "" if don't want to sweep ?
 ...
"training_parameters": {
        "batch_size": {
              # integers between 4 and 32
              # with evenly-distributed logarithms
              "distribution": "q_log_uniform_values",
              "q": 4,
              "min": 4,
              "max": 32
        },
        "training_time": {
            "num_epochs": {
               "values": [15, 25, 50],
             },
            "early_stopping_patience": 50,
            "early_stopping_epsilon": 0.001
        },
        "scheduler": {
            "initial_lr":  {
                  # a uniform distribution between 1e-5 and 0.1
                  'distribution': 'uniform',
                  'min': 1e-5,
                  'max': 0.1
        },
...

See sweep config for more details

Alternatives

Other hyperparameter optimization frameworks: Optuna and SigOpt (there are probably others, but these were some that were suggested in the lab meeting). However, I think wandb sweeps would integrate best, given that wandb is already set up.

@dyt811
Member

dyt811 commented Apr 22, 2022

An optional key under "wandb" sounds quite reasonable. I have not played with wandb's search, but in other frameworks like comet.ml, the hyperparameter search is a fairly small JSON snippet defining the search range/values, etc.

@dyt811 dyt811 added the deep learning and enhancement labels Apr 22, 2022
@naga-karthik naga-karthik self-assigned this Apr 24, 2022
@jcohenadad jcohenadad removed the deep learning label Dec 22, 2022
@jcohenadad
Member

@naga-karthik @kiristern is there any update on this issue? What is your current strategy for running wandb sweeps with ivadomed?

@jcohenadad
Member

jcohenadad commented Dec 22, 2022

I'm thinking, as a "short term" solution, we could maybe come up with a Python wrapper that would generate a config file on the fly (based on an input template, replacing the hyperparameter to be swept), and launch ivadomed inside the wrapper?

And this wrapper could eventually replace https://github.com/ivadomed/ivadomed/blob/master/ivadomed/scripts/automate_training.py, unless some people are using it?
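A minimal sketch of what such a wrapper could look like, assuming ivadomed is launched as ivadomed --train -c <config> and using a hypothetical template file and hypothetical config keys (only the learning rate is swept here):

import copy
import json
import subprocess

# Hypothetical wrapper: loop over values of one hyperparameter, write a config
# on the fly from a template, and launch ivadomed for each value.
with open("config_template.json") as f:
    template = json.load(f)

for lr in [1e-4, 1e-3, 1e-2]:
    cfg = copy.deepcopy(template)
    cfg["training_parameters"]["scheduler"]["initial_lr"] = lr
    cfg["path_output"] = f"results_lr_{lr}"
    config_path = f"config_lr_{lr}.json"
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=4)
    subprocess.run(["ivadomed", "--train", "-c", config_path], check=True)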

@jcohenadad jcohenadad linked a pull request Dec 22, 2022 that will close this issue
@naga-karthik
Member

Sorry, there has not been any update on this yet. I will take it up now.

I have only used sweeps from the CLI, where we run wandb.sweep with the hyperparameters to sweep in a YAML file and then separately run wandb.agent with the sweep ID that the first command returns. However, it seems like both of these commands could be run together, based on the documentation here and here. There is no need to provide a YAML file in that case; the standard key: value format could work.

Given this, it appears that launching ivadomed within a Python script, as you suggested, is a decent solution (and not just a short-term one). I will take a look at how this can be done.
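For reference, here is a hedged sketch of running both steps from a single Python script with a plain dictionary instead of a YAML file (train_fn is a placeholder for whatever ends up wrapping ivadomed's training):

import wandb

def train_fn():
    # wandb.init() inside an agent exposes the hyperparameters chosen by the
    # sweep controller for this run via wandb.config
    run = wandb.init()
    batch_size = run.config["batch_size"]
    # ... placeholder: launch the actual training with this batch_size ...

sweep_id = wandb.sweep(
    {
        "method": "random",
        "metric": {"name": "val_loss", "goal": "minimize"},
        "parameters": {"batch_size": {"values": [16, 32, 64]}},
    },
    project="ivado-wandb-testing",
)
wandb.agent(sweep_id, function=train_fn, count=5)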

@naga-karthik
Member

There is one issue that I realized just now. The way wandb sweeps work is that, based on the hyperparameter ranges you specify, the wandb agents create various combinations on their own to run different models (the YAML file also contains the path to main.py and the argparse arguments defining the hyperparameters that the agents choose by themselves). The results are shown on the Sweeps dashboard, which is different from the standard wandb dashboard used to visualize runs.

Now, with our solution, if we define the Python wrapper containing the hyperparameter ranges and we call ivadomed ourselves, then there is no point in calling wandb.sweep, because we are defining the hyperparameters ourselves (via the config file for ivadomed). In other words, the parameters we want to sweep over will simply appear under the project with some group and run name. This could be one solution, but it will not have the same advantages as a proper "Sweep" on the wandb dashboard.

@jcohenadad
Member

Right -- this was my understanding as well. But my idea was to let wandb sweep, retrieve the parameters at each 'sweep loop' (is that possible?), generate an ivadomed config file with the params for that iteration, and launch ivadomed.
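One way this might be wired up, assuming wandb.agent is given a Python function so that the parameters chosen for each iteration can be read from wandb.config; the template path, config keys, and ivadomed invocation below are assumptions, not a verified integration:

import json
import subprocess
import wandb

def run_ivadomed_once():
    # For each sweep iteration, the agent calls this function; wandb.init()
    # exposes the hyperparameters picked by the sweep controller
    run = wandb.init()
    with open("config_template.json") as f:
        cfg = json.load(f)
    cfg["training_parameters"]["batch_size"] = run.config["batch_size"]
    cfg["training_parameters"]["scheduler"]["initial_lr"] = run.config["initial_lr"]
    config_path = f"config_{run.id}.json"
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=4)
    subprocess.run(["ivadomed", "--train", "-c", config_path], check=True)

sweep_id = wandb.sweep(
    {
        "method": "bayes",
        "metric": {"name": "val_loss_total_avg", "goal": "minimize"},
        "parameters": {
            "batch_size": {"values": [16, 32]},
            "initial_lr": {"distribution": "uniform", "min": 1e-5, "max": 0.1},
        },
    },
    project="ivado-wandb-testing",
)
wandb.agent(sweep_id, function=run_ivadomed_once, count=10)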

@naga-karthik
Member

naga-karthik commented Dec 23, 2022

Aha!

retrieve the parameters at each 'sweep loop' (is that possible)

This is exactly where the problem is. Once we run wandb.sweep, the only thing it returns is an alphanumeric code that should subsequently be used to run wandb.agent. As a result, we don't "see" the parameters until after we have initialized the agent (after which they appear on the dashboard). It is precisely for this reason that I said we have to come up with hyperparameter combinations ourselves in order to feed them to ivadomed's config file (thereby defeating the purpose of wandb sweeps).

One workaround I could think of is as follows:

Do not think about wandb sweeps for the moment and essentially borrow the training process from ivadomed_automate_training, without doing it on multiple GPUs. One could always run multiple hyperparameter sweeps on different GPUs too. Because ivadomed_automate_training provides a way to combine various hyperparameters in different ways, we can use the config files resulting from that to run ivadomed (inside this wrapper). Since we already have wandb inside ivadomed's training.py, we will see all the runs on the dashboard. A direct comparison of all the various hyperparameters and the specific effects of each of them might be difficult to see (this is precisely what the Sweeps dashboard makes easy), but we will at least be able to run a basic sweep in the first place.

@jcohenadad
Member

Do not think about wandb sweeps for the moment and essentially borrow the training process from ivadomed_automate_training, without doing it on multiple GPUs. One could always run multiple hyperparameter sweeps on different GPUs too. Because ivadomed_automate_training provides a way to combine various hyperparameters in different ways, we can use the config files resulting from that to run ivadomed (inside this wrapper). Since we already have wandb inside ivadomed's training.py, we will see all the runs on the dashboard. A direct comparison of all the various hyperparameters and the specific effects of each of them might be difficult to see (this is precisely what the Sweeps dashboard makes easy), but we will at least be able to run a basic sweep in the first place.

Hmm, this is not great because the visualization offered by wandb sweeps is extremely useful. I'm still wondering if there is some modularity in wandb sweeps (not shown in the basic example from the website) that would allow us to use it with ivadomed. A bit more digging is necessary to make sure we are not missing a good opportunity here. Also tagging @kiristern @kanishk16 @dyt811 so they can help with the digging.

@naga-karthik
Member

naga-karthik commented Dec 23, 2022

@jcohenadad I looked into this a bit more, and it seems like my understanding of sweeps was incomplete. Here's a picture of how it works, which I pulled from one of their issues on GitHub here.
[Diagram from the linked wandb issue: the local agent/client syncs with the sweep controller on the wandb server, which returns a hyperparameter combination for each run.]

It appears that wandb sweep simply returns a sweep_id. Once we run wandb agent with the sweep_id, it internally syncs with the sweep controller, which returns different combinations of hyperparameters to the client, thereby making the agent run these hyperparameter combinations. So our initial hypothesis that looking more into wandb sweep would help will not be useful anymore. I am looking at whether we can somehow use the sweep controller directly, but then again, the user's job ends after running wandb agent (so I am not sure how exactly; more digging is needed here). Everything after that happens inside a loop, with the client and the sweep controller communicating internally.

@jcohenadad
Member

I'm not sure I agree with your analysis @naga-karthik. Looking at example code for PyTorch, the key wandb elements are already integrated in ivadomed's training API:

wandb.init(project=project_name, group=group_name, name=run_name, config=cfg, dir=path_output)

wandb.log({"learning_rate": lr})

So my guess is that we would "just" need to implement the wandb sweep functionality into the training API, unless I am missing something?

@naga-karthik
Member

naga-karthik commented Dec 23, 2022

TL;DR
I definitely need some more time to look into this (and also need to brainstorm with the team). I am running into errors very similar to this when I call the sweep and agent functions inside our training API.

What I actually did
After looking at your suggestion, I added the following lines inside our training API:

# wandb_tracking and project_name are already defined in ivadomed's training API
wandb_sweep_params = True
if wandb_tracking and wandb_sweep_params:
    # Hyperparameter ranges to sweep over (hard-coded for testing; these would
    # eventually be retrieved from the main config.json)
    sweep_configuration = {
        'method': 'random',
        'metric': {
            'goal': 'minimize',
            'name': 'val_loss_total_avg'},
        'parameters': {
            'batch_size': {
                'values': [16, 32, 64]
            },
            'num_epochs': {
                'values': [5, 10, 15]
            },
            'depth': {
                'values': [2, 3, 4]
            }
        }
    }
    # wandb.sweep only registers the sweep and returns its ID; wandb.agent is
    # then called without a `function` argument, so it falls back to spawning
    # a CLI command (see the log below)
    sweep_id = wandb.sweep(sweep_configuration, project=project_name)
    wandb.agent(sweep_id, count=5)

where wandb_sweep_params would be an additional key inside our main config.json file, and the values inside the sweep_configuration dictionary would be retrieved from our main config file as well (for testing purposes they currently are not). Now, whenever we call wandb.sweep, as I mentioned above, it only returns a sweep_id, which is useless until we run wandb.agent to use that sweep_id and initialize the run.

Now, wandb.agent does a weird thing. Because these functions are optimized for the CLI, it always runs this command: /usr/bin/env python --batch_size=32 --depth=3 --num_epochs=5 (note that the arguments are the ones I provided in the sweep_configuration dictionary). The whole log from wandb is:

2022-12-23 18:38:40,070 - wandb.wandb_agent - INFO - About to run command: /usr/bin/env python --batch_size=32 --depth=3 --num_epochs=5

This is a problem because we don't want CLI commands to be run from our training API. As suggested in the wandb docs, the best way to use Sweeps is to run these commands separately on the CLI or from a Jupyter notebook. The bottom line is that sweeps integration is not trivial because of the rigidity in how wandb itself provides this feature.

@jcohenadad
Member

jcohenadad commented Dec 27, 2022

I see. Have you tried using Sweep's API? Also, from wandb/wandb#2282 (comment), have you tried:

If running from the command line isn't an option, you could try setting the WANDB_START_METHOD=thread.

There might also be possibilities to use wandb's sweep via a local controller.
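For reference, a minimal way to try the suggested environment variable before starting the agent (untested in this context; sweep_id and train_fn stand in for whatever is defined elsewhere):

import os
import wandb

# Workaround suggested in wandb/wandb#2282: force the thread start method
# before launching the agent from inside another Python process
os.environ["WANDB_START_METHOD"] = "thread"
wandb.agent(sweep_id, function=train_fn, count=5)  # sweep_id / train_fn defined elsewhere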

Don't hesitate to also open an issue on wandb's repository, to explain what we would like to do. If there is a quick solution (or a 'no go'), it would save us a lot of time.
