[Feature]: Add multi machine dist_train #232

Merged

YuanLiuuuuuu
Collaborator

Thanks for your contribution; we appreciate it a lot. Following the instructions below will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
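
The template leaves this section unfilled, so the following is only an illustrative sketch of what a multi-machine launch typically involves, not the script this PR adds. It assumes the launcher wraps `python -m torch.distributed.launch`, whose `--nnodes`, `--node_rank`, `--master_addr`, `--master_port`, and `--nproc_per_node` flags are standard PyTorch options. The function name `build_launch_command`, the `tools/train.py` path, the config path, and the `10.0.0.1` address are hypothetical placeholders.

```python
import shlex
from typing import List


def build_launch_command(
    config: str,
    gpus_per_node: int,
    nnodes: int = 1,
    node_rank: int = 0,
    master_addr: str = "127.0.0.1",
    master_port: int = 29500,
    train_script: str = "tools/train.py",  # assumed path, not confirmed by this PR
) -> List[str]:
    """Build the argv that one node would run in a multi-machine launch."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",                 # total number of machines
        f"--node_rank={node_rank}",           # index of this machine
        f"--master_addr={master_addr}",       # address of the rank-0 machine
        f"--master_port={master_port}",       # rendezvous port on the rank-0 machine
        f"--nproc_per_node={gpus_per_node}",  # one process per GPU on this machine
        train_script,
        config,
    ]


if __name__ == "__main__":
    # Two machines with 8 GPUs each; every machine runs the same command
    # except for its own node_rank.
    for rank in range(2):
        cmd = build_launch_command(
            "path/to/config.py", 8, nnodes=2, node_rank=rank, master_addr="10.0.0.1"
        )
        print(shlex.join(cmd))
```

The only value that differs between machines is `node_rank`; the master address and port must be identical everywhere so that all processes join the same process group.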

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, such as docstrings or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects, like MMDet or MMSeg.
  • CLA has been signed and all committers have signed the CLA in this PR.

@YuanLiuuuuuu YuanLiuuuuuu merged commit 7dac5bf into open-mmlab:dev_v0.8.0 Mar 11, 2022
@YuanLiuuuuuu YuanLiuuuuuu deleted the feature/multi_machine_dist branch March 15, 2022 01:56
@YuanLiuuuuuu YuanLiuuuuuu mentioned this pull request Mar 22, 2022
6 tasks
YuanLiuuuuuu added a commit that referenced this pull request Mar 31, 2022
* [Feature]: Add multi machine dist_train

* [Fix]: Change bash to sh

* [Fix]: Fix missing sh suffix

* [Refactor]: Change bash to sh
Jiahao000 added a commit that referenced this pull request Mar 31, 2022
* [Fix]: Fix mmcls upgrade bug (#235)

* [Feature]: Add multi machine dist_train (#232)

* [Feature]: Add multi machine dist_train

* [Fix]: Change bash to sh

* [Fix]: Fix missing sh suffix

* [Refactor]: Change bash to sh

* [Refactor] Add unit test (#234)

* [Refactor] add unit test

* update workflow

* update

* [Fix] fix lint

* update test

* refactor moco and densecl unit test

* fix lint

* add unit test

* update unit test

* remove modification

* [Feature]: Add MAE metafile (#238)

* [Feature]: Add MAE metafile

* [Fix]: Fix lint

* [Fix]: Change LARS to AdamW in the metafile of MAE

* [Fix] fix codecov bug (#241)

* [Fix] fix codecov bug

* update comment

* [Refactor] Using MMCls backbones (#233)

* [Refactor] using backbones from MMCls

* [Refactor] modify the unit test

* [Fix] modify default setting of out_indices

* [Docs] fix lint

* [Refactor] modify super init

* [Refactor] remove res_layer.py

* using mmcv PatchEmbed

* [Fix]: Fix outdated problem (#249)

* [Fix]: Fix outdated problem

* [Fix]: Update MoCov3 bibtex

* [Fix]: Use abs path in README

* [Fix]: Reformat MAE bibtex

* [Fix]: Reformat MoCov3 bibtex

* [Feature] Resume from the latest checkpoint automatically. (#245)

* [Feature] Resume from the latest checkpoint automatically.

* fix windows path problem

* fix lint

* add code reference

* [Docs] add docstring for ResNet and ResNeXt (#252)

* [Feature] support KNN benchmark (#243)

* [Feature] support KNN benchmark

* [Fix] add docstring and multi-machine testing

* [Fix] fix lint

* [Fix] change args format and check init_cfg

* [Docs] add benchmark tutorial

* [Docs] add benchmark results

* [Feature]: SimMIM supported (#239)

* [Feature]: SimMIM Pretrain

* [Feature]: Add mix precision and 16x128 config

* [Fix]: Fix config import bug

* [Fix]: Fix config bug

* [Feature]: Simim Finetune

* [Fix]: Log every 100

* [Fix]: Fix eval problem

* [Feature]: Add docstring for simmim

* [Refactor]: Merge layer wise lr decay to Default constructor

* [Fix]:Fix simmim evaluation bug

* [Fix]: Change model to be compatible to latest version of mmcls

* [Fix]: Fix lint

* [Fix]: Rewrite forward_train for classification cls

* [Feature]: Add UT

* [Fix]: Fix lint

* [Feature]: Add 32 gpus training for simmim ft

* [Fix]: Rename mmcls classifier wrapper

* [Fix]: Add docstring to SimMIMNeck

* [Feature]: Generate docstring for the forward function of simmim encoder

* [Fix]: Rewrite the class docstring for constructor

* [Fix]: Fix lint

* [Fix]: Fix UT

* [Fix]: Reformat config

* [Fix]: Add img resolution

* [Feature]: Add readme and metafile

* [Fix]: Fix typo in README.md

* [Fix]: Change BlackMaskGen to BlockwiseMaskGenerator

* [Fix]: Change the name of SwinForSimMIM

* [Fix]: Delete irrelevant files

* [Feature]: Create extra transformerfinetuneconstructor

* [Fix]: Fix lint

* [Fix]: Update SimMIM README

* [Fix]: Change SimMIMPretrainHead to SimMIMHead

* [Fix]: Fix the docstring of ft constructor

* [Fix]: Fix UT

* [Fix]: Recover deletion

Co-authored-by: Your <you@example.com>

* [Fix] add seed to distributed sampler (#250)

* [Fix] add seed to distributed sampler

* fix lint

* [Feature] Add ImageNet21k (#225)

* solve memory leak by limited implementation

* fix lint problem

Co-authored-by: liming <liming.ai@bytedance.com>

* [Refactor] change args format to '--a-b' (#253)

* [Refactor] change args format to `--a-b`

* modify tsne script

* modify 'sh' files

* modify getting_started.md

* modify getting_started.md

* [Fix] fix 'mkdir' error in prepare_voc07_cls.sh (#261)

* [Fix] fix positional parameter error (#260)

* [Fix] fix command errors in benchmarks tutorial (#263)

* [Docs] add brief installation steps in README.md (#265)

* [Docs] add colab tutorial (#247)

* [Docs] add colab tutorial

* fix lint

* modify the colab tutorial, using API to train the model

* modify the description

* remove #

* modify the command

* [Docs] translate 6_benchmarks.md into Chinese (#262)

* [Docs] translate 6_benchmarks.md into Chinese

* Update 6_benchmarks.md

change 基准 ("benchmark") to 基准评测 ("benchmark evaluation")

* Update 6_benchmarks.md

(1) Add Chinese translation of '1 folder for ImageNet nearest-neighbor classification task'
(2) 数据预准备 ("data pre-preparation") -> 数据准备 ("data preparation")

* [Docs] remove install scripts in README (#267)

* [Docs] Update version information in dev branch (#268)

* update version to v0.8.0

* fix lint

* [Fix]: Install the latest mmcls

* [Fix]: Add SimMIM in README

Co-authored-by: Yuan Liu <30762564+YuanLiuuuuuu@users.noreply.github.com>
Co-authored-by: Jiahao Xie <52497952+Jiahao000@users.noreply.github.com>
Co-authored-by: Your <you@example.com>
Co-authored-by: Ming Li <73068772+mitming@users.noreply.github.com>
Co-authored-by: liming <liming.ai@bytedance.com>
Co-authored-by: RenQin <45731309+soonera@users.noreply.github.com>
Co-authored-by: YuanLiuuuuuu <3463423099@qq.com>
biubiubiiu added a commit to biubiubiiu/derain-toolbox that referenced this pull request Apr 11, 2022
Jiahao000 pushed a commit that referenced this pull request Apr 27, 2022
* [Fix]: Fix mmcls upgrade bug (#235)

* [Feature]: Add multi machine dist_train (#232)

* [Feature]: Add multi machine dist_train

* [Fix]: Change bash to sh

* [Fix]: Fix missing sh suffix

* [Refactor]: Change bash to sh

* [Refactor] Add unit test (#234)

* [Refactor] add unit test

* update workflow

* update

* [Fix] fix lint

* update test

* refactor moco and densecl unit test

* fix lint

* add unit test

* update unit test

* remove modification

* [Feature]: Add MAE metafile (#238)

* [Feature]: Add MAE metafile

* [Fix]: Fix lint

* [Fix]: Change LARS to AdamW in the metafile of MAE

* Add barlowtwins

* Add unit test for barlowtwins

* Adjust training params

* add decorator to pass CI

* adjust params

* add barlowtwins configs

* revise LatentCrossCorrelationHead

* modify ut to save memory

* add metafile

* add barlowtwins results to model zoo

* add barlow twins to homepage

* fix batch size bug

* add algorithm readme

* add type hints

* reorganize the model zoo

* remove one config

* recover the config

* add missing docstring

* revise barlowtwins

* reorganize coco and voc benchmark

* add barlowtwins to index.rst

* revise docstring

Co-authored-by: Yuan Liu <30762564+YuanLiuuuuuu@users.noreply.github.com>
Co-authored-by: Yixiao Fang <36138628+fangyixiao18@users.noreply.github.com>
Co-authored-by: fangyixiao18 <fangyx18@hotmail.com>
fangyixiao18 added a commit that referenced this pull request Apr 29, 2022
* [Feature]: MAE pre-training with fp16 (#271)

* [Feature]: MAE pre-training with fp16

* [Fix]: Fix lint

* [Fix]: Fix SimMIM config link, and add SimMIM to model_zoo (#272)

* [Fix]: Fix link error

* [Fix]: Add SimMIM to model zoo

* [Fix]: Fix lint

* [Fix] fix 'no init_cfg' error for pre-trained model backbones (#256)

* [UT] add unit test for apis (#276)

* [UT] add unit test for apis

* ignore pytest log

* [Feature] Add extra dataloader settings in configs. (#264)

* [Feature] support to set validation samples per gpu independently

* set default to be cfg.data.samples_per_gpu

* modify the tools/test.py

* using 'train_dataloader', 'val_dataloader', 'test_dataloader' for specific settings

* test 'evaluation' branch

* [Fix]: Change imgs_per_gpu to samples_per_gpu MAE (#278)

* [Feature]: Add SimMIM 192 pt 224 ft (#280)

* [Feature]: Add SimMIM 192 pt 224 ft

* [Feature]: Add simmim 192 pt 224 ft to readme

* [Fix] fix key error bug when registering custom hooks (#273)

* [UT] remove pytorch1.5 test (#288)

* [Benchmark] rename linear probing config file names (#281)

* [Benchmark] rename linear probing config file names

* update config links

* Avoid GPU memory leak with prefetch dataloader (#277)

* [Feature] barlowtwins (#207)

* [Fix] fix --local-rank (#290)

* [UT] reduce memory usage while running unit test (#291)

* [Feature]: CAE Supported (#284)

* [Feature]: Add mc

* [Feature]: Add dataset of CAE

* [Feature]: Init version of CAE

* [Feature]: Add mc

* [Fix]: Change beta to (0.9, 0.999)

* [Fix]: New feature

* [Fix]: Decouple the qkv bias

* [Feature]: Decouple qkv bias in MultiheadAttention

* [Feature]: New mask generator

* [Fix]: Fix TransformEncoderLayer bug

* [Feature]: Add MAE CAE linear prob

* [Fix]: Fix config

* [Fix]: Delete redundant mc

* [Fix]: Add init value in mim cls vit

* [Fix]: Fix cae ft config

* [Fix]: Delete repeated init_values

* [Fix]: Change bs from 64 to 128 in CAE ft

* [Fix]: Add mc in cae pt

* [Fix]: Fix momentum update bug

* [Fix]: Add no weight_decay for gamma

* [Feature]: Add mc for cae pt

* [Fix]: Delete mc

* [Fix]: Delete redundant files

* [Fix]: Fix lint

* [Feature]: Add docstring to algo, backbone, neck and head

* [Fix]: Fix lint

* [Fix]: network

* [Feature]: Add docstrings for network blocks

* [Feature]: Add docstring to ToTensor

* [Feature]: Add docstring to transform

* [Fix]: Add type hint to BEiTMaskGenerator

* [Fix]: Fix lint

* [Fix]: Add copyright to dalle_e

* [Fix]: Fix BlockwiseMaskGenerator

* [Feature]: Add UT for CAE

* [Fix]: Fix dalle state_dict path not existed bug

* [Fix]: Delete file_client_args related code

* [Fix]: Remove redundant code

* [Refactor]: Add fp16 to the name of cae pre-train config

* [Refactor]: Use FFN from mmcv

* [Refactor]: Change network_blocks to transformer_blocks

* [Fix]: Fix mask generator name bug

* [Fix]: cae pre-train config bug

* [Fix]: Fix docstring grammar

* [Fix]: Fix mc related code

* [Fix]: Add object parent to transform

* [Fix]: Delete unnecessary modification

* [Fix]: Change blockwisemask generator to simmim mask generator

* [Refactor]: Change cae mae pretrain vit to cae mae vit

* [Refactor]: Change lamb to lambd

* [Fix]: Remove blank line

* [Fix]: Fix lint

* [Fix]: Fix UT

* [Fix]: Delete modification to swin

* [Fix]: Fix lint

* [Feature]: Add README and metafile

* [Feature]: Update index.rst

* [Fix]: Update model_zoo

* [Fix]: Change MAE to CAE in algorithm

* [Fix]: Change SimMIMMaskGenerator to CAEMaskGenerator

* [Fix]: Fix model zoo

* [Fix]: Change to dalle_encoder

* [Feature]: Add download link for dalle

* [Fix]: Fix lint

* [Fix]: Fix UT

* [Fix]: Update metafile

* [Fix]: Change b to base

* [Feature]: Add dalle download link in warning

* [Fix] add arxiv link in readme

Co-authored-by: Jiahao Xie <52497952+Jiahao000@users.noreply.github.com>

* [Enhance] update SimCLR models and results (#295)

* [Enhance] update simclr models and results

* [Fix] revise comments to indicate settings

* Update version (#296)

* [Feature]: Update to 0.9.0

* [Feature]: Add version constrain for mmcls

* [Fix]: Fix bug

* [Fix]: Fix version bug

* [Feature]: Update version in install.md

* update changelog

* update readme

* [Fix] fix uppercase

* [Fix] fix uppercase

* [Fix] fix uppercase

* update version dependency

* add cae to readme

Co-authored-by: fangyixiao18 <fangyx18@hotmail.com>
Co-authored-by: Jiahao Xie <52497952+Jiahao000@users.noreply.github.com>

Co-authored-by: Yixiao Fang <36138628+fangyixiao18@users.noreply.github.com>
Co-authored-by: Ming Li <73068772+mitming@users.noreply.github.com>
Co-authored-by: xcnick <xcnick0412@gmail.com>
Co-authored-by: fangyixiao18 <fangyx18@hotmail.com>
Co-authored-by: Jiahao Xie <52497952+Jiahao000@users.noreply.github.com>