Is your feature request related to a problem? Please describe.
Hydra offers a way to set environment variables that are specific to each job - the hydra.job.env_set section, documented here. However, when I added this to the configuration for my multi-run hyperparameter search, the environment variable was not changed. I suspect the reason is that NeMo uses a custom launcher through which Hydra executes the job, and I could not find these environment variables being handled anywhere in it.
Describe the solution you'd like
I would like to be able to set job-specific environment variables in the way the Hydra documentation describes, i.e. by adding the hydra.job.env_set field to the Hydra config for multi-runs (screenshot below).
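Concretely, the kind of configuration I have in mind is roughly the following (a minimal sketch based on the env_set example in the Hydra docs; the port value is only illustrative and would need to differ per sweep job):

```yaml
# Sketch of the desired behaviour: env_set entries applied per job,
# also when NeMo's custom launcher runs the multi-run sweep.
hydra:
  job:
    env_set:
      MASTER_PORT: "29501"   # illustrative value; should vary between jobs
```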
Describe alternatives you've considered
The only working alternative I have found so far is to set the environment variables manually inside the training script (each job runs a training script). However, inside these scripts I don't have access to the job ID, so I had to infer it from the hyperparameter values. This is far from ideal, and a config-level solution would be much nicer.
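For completeness, a rough sketch of this workaround (the config keys such as cfg.model.optim.lr are just placeholders for whatever hyperparameters the sweep varies; this is only the hack I am currently using, not a NeMo or Hydra API):

```python
import hashlib
import os


def set_master_port_from_hparams(cfg) -> None:
    """Derive MASTER_PORT from this job's hyperparameter values, because the
    Hydra job id is not accessible inside the training script.
    The config keys below are placeholders for whatever the sweep varies."""
    signature = repr((cfg.model.optim.lr, cfg.model.hidden_size))  # hypothetical keys
    # Map the hyperparameter combination deterministically into a port range.
    # Collisions are still possible, which is why a config-level solution
    # (hydra.job.env_set) would be much cleaner.
    port = 20000 + int(hashlib.md5(signature.encode()).hexdigest(), 16) % 10000
    os.environ["MASTER_PORT"] = str(port)
```

I call this at the very top of the training script, before the trainer is built, so that torch.distributed picks up the port when the process group is initialized.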
Additional context
The main reason I ran into trouble with environment variables is a port conflict when training multiple models at once. Every process is initialized on the same master port, and consequently the following error occurs:
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53394 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:53394 (errno: 98 - Address already in use).
The way to avoid this error is to set the MASTER_PORT environment variable to a different value for each job. Without this option, running multiple training processes in parallel is impossible due to port collisions.
We don't use the process launcher unless you use the Hydra sweep config. Can you try removing that? If not, we'll have to see how to implement your request. Of course, if you have a solution, you're encouraged to send a PR.