examples/classifytask: Horovod has not been initialized; use hvd.init() #84

Closed
justfouw opened this issue Dec 16, 2020 · 2 comments

@justfouw

I ran the example code with one new config item, trainer:distributed=True:
python run_pipeline.py classification/classify.yml

It fails with a Horovod error: ValueError: Horovod has not been initialized; use hvd.init().

Is Horovod not supported in this example?

@zhangjiajin
Member

zhangjiajin commented Dec 17, 2020

@justfouw

Hi, Vega does not support Horovod in the NAS phase.

In the NAS phase, dask.distributed is used to distribute different networks to different nodes for parallel search.

The fullytrain phase supports Horovod. In this case, you need to set distributed to True, delete the models_folder parameter, and set the model_desc_file parameter to fully train a specific network. See the following:

pipeline: [fullytrain]

fullytrain:
    pipe_step:
        type: FullyTrainPipeStep
        # models_folder: "{local_base_path}/output/nas/"

    trainer:
        type: Trainer
        epochs: 160
        distributed: True
        optimizer:
            type: SGD
            params:
                lr: 0.1
                momentum: 0.9
                weight_decay: 0.0001
        lr_scheduler:
            type: MultiStepLR
            params:
                milestones: [60, 120]
                gamma: 0.5
        hps_folder: "<hpo file folder path>"

    model:
        model_desc_file: "<model_n.json file path>"

    dataset:
        type: Cifar10
        common:
            data_path: /cache/datasets/cifar10/
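
The three changes described above can be checked before launching a run. The following is only a minimal sketch, assuming the YAML has already been loaded into a plain dict; check_fullytrain_config is a hypothetical helper written for illustration and is not part of Vega.

```python
def check_fullytrain_config(config):
    """Return a list of problems found in a Horovod fully-train config dict.

    Hypothetical helper (not a Vega API): it only mirrors the three
    settings named in the comment above.
    """
    problems = []
    # Horovod is only supported in the fullytrain phase, not in NAS.
    if config.get("pipeline") != ["fullytrain"]:
        problems.append("pipeline should contain only the fullytrain step")
    step = config.get("fullytrain", {})
    if step.get("trainer", {}).get("distributed") is not True:
        problems.append("trainer.distributed must be True")
    if "models_folder" in step.get("pipe_step", {}):
        problems.append("pipe_step.models_folder must be removed")
    if not step.get("model", {}).get("model_desc_file"):
        problems.append("model.model_desc_file must be set")
    return problems


# Dict form of the relevant parts of the YAML above (placeholders kept as-is).
example = {
    "pipeline": ["fullytrain"],
    "fullytrain": {
        "pipe_step": {"type": "FullyTrainPipeStep"},
        "trainer": {"type": "Trainer", "epochs": 160, "distributed": True},
        "model": {"model_desc_file": "<model_n.json file path>"},
    },
}

for problem in check_fullytrain_config(example):
    print(problem)
```

An empty result means the config follows the settings listed above; it does not verify that Horovod itself is installed or that the referenced files exist.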

@menglifenglin

When I tried to solve this problem for the esr_ea algorithm this way, I got the following error: "/root/.local/lib/python3.6/site-packages/vega/core/pipeline/horovod/run_cluster_horovod_train.sh: No such file or directory". Could you tell me how to deal with it?
