Is your feature request related to a problem? Please describe.
Hydra offers a way to set environment variables that are specific to each job - the hydra.job.env_set section, documented here. However, when I added this to the configuration for my multi-run hyperparameter search, the environment variable was not changed. I suspect the reason is that NeMo uses a custom launcher through which Hydra executes the job, and I could not find these environment variables being handled anywhere in it.
Describe the solution you'd like
I would like to be able to set job-specific environment variables in the way the Hydra documentation describes, i.e. by adding the hydra.job.env_set field to the Hydra config for multi-runs (screenshot below).
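Concretely, the kind of configuration I have in mind is roughly the following (a minimal sketch based on the env_set example in the Hydra docs; the port value is only illustrative and would need to differ per sweep job):

```yaml
# Sketch of the desired behaviour: env_set entries applied per job,
# also when NeMo's custom launcher runs the multi-run sweep.
hydra:
  job:
    env_set:
      MASTER_PORT: "29501"   # illustrative value; should vary between jobs
```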
Describe alternatives you've considered
The only working alternative I have found so far is to set the environment variables manually inside the training script (each job runs a training script). However, inside these scripts I don't have access to the job ID, so I had to infer it from the hyperparameter values. This is far from ideal, and a config-level solution would be much nicer.
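For completeness, a rough sketch of this workaround (the config keys such as cfg.model.optim.lr are just placeholders for whatever hyperparameters the sweep varies; this is only the hack I am currently using, not a NeMo or Hydra API):

```python
import hashlib
import os


def set_master_port_from_hparams(cfg) -> None:
    """Derive MASTER_PORT from this job's hyperparameter values, because the
    Hydra job id is not accessible inside the training script.
    The config keys below are placeholders for whatever the sweep varies."""
    signature = repr((cfg.model.optim.lr, cfg.model.hidden_size))  # hypothetical keys
    # Map the hyperparameter combination deterministically into a port range.
    # Collisions are still possible, which is why a config-level solution
    # (hydra.job.env_set) would be much cleaner.
    port = 20000 + int(hashlib.md5(signature.encode()).hexdigest(), 16) % 10000
    os.environ["MASTER_PORT"] = str(port)
```

I call this at the very top of the training script, before the trainer is built, so that torch.distributed picks up the port when the process group is initialized.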
Additional context
The main reason I ran into trouble with environment variables is a port conflict when training multiple models at once. Every process is initialized on the same master port, and consequently the following error occurs:
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53394 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:53394 (errno: 98 - Address already in use).
The way to avoid this error is to set the MASTER_PORT environment variable to a different value for each job. Without this option, running multiple training processes in parallel is impossible due to port collisions.
We don't use the process launcher unless you use the Hydra sweep config. Can you try removing that? If not, we'll have to see how to implement your request. Of course, if you have a solution, you're encouraged to send a PR.