[Feature] Upload checkpoints and logs to ceph #1375

Merged (103 commits) on Oct 24, 2021

zhouzaida (Member) commented Sep 26, 2021


Motivation

Note: The PR is based on #1330

This PR adds support for uploading checkpoints and logs to PetrelOSS and for loading checkpoints from PetrelOSS.

It includes four parts:

  • Refactor save_checkpoint (mmcv/runner/checkpoint.py) and CheckpointHook (mmcv/runner/hooks/checkpoint.py) so that they support uploading checkpoints to PetrelOSS.
  • Refactor load_from_ceph so that it supports loading checkpoints from PetrelOSS.
  • Refactor TextLoggerHook so that it supports uploading logs to PetrelOSS.
  • Refactor EvalHook so that it supports uploading the best checkpoint to PetrelOSS.

Modification

  • save_checkpoint (mmcv/runner/checkpoint.py)

    • Add an argument file_client_args, which is used to instantiate a FileClient
  • CheckpointHook (mmcv/runner/hooks/checkpoint.py)

    • Add an argument file_client_args, which is used to instantiate a FileClient
    • Refactor the method _save_checkpoint, which saves the new checkpoint and removes outdated checkpoints
  • load_from_ceph
    The original load_from_ceph only supports loading checkpoints from Ceph. Ceph paths use the s3:// prefix, which is the same prefix Petrel uses, so the function is refactored to support Petrel as well.

  • TextLoggerHook

    • Add three arguments out_dir, keep_local and file_client_args
    • Add after_run, which uploads the logs to Petrel once training has finished
  • EvalHook (mmcv/runner/hooks/evalhook.py)

    • Add two arguments out_dir and file_client_args
    • Update before_run which will infer the out_dir and file_client.
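
As a rough illustration of how out_dir and file_client_args could interact, the sketch below picks a storage backend from the path prefix unless a backend is given explicitly. The names PREFIX_TO_BACKEND, parse_prefix and infer_backend are hypothetical and are not mmcv's API; the sketch only assumes that s3:// maps to the Petrel backend and that everything else is a local path.

```python
# Hypothetical sketch: choose a storage backend from a path prefix.
# These names are illustrative only and are not part of mmcv.
PREFIX_TO_BACKEND = {'s3': 'petrel'}


def parse_prefix(path):
    """Return the scheme before '://', or None for a plain local path."""
    if '://' not in path:
        return None
    prefix, _ = path.split('://', 1)
    return prefix


def infer_backend(path, file_client_args=None):
    """Explicit file_client_args win; otherwise fall back to the prefix.

    Any prefix that is not registered is treated as a local path here,
    which is a simplification made for this sketch.
    """
    if file_client_args is not None:
        return file_client_args['backend']
    return PREFIX_TO_BACKEND.get(parse_prefix(path), 'disk')


print(infer_backend('s3://bucket_name/epoch_1.pth'))
print(infer_backend('work_dirs/epoch_1.pth'))
```

Letting an explicit argument override the inferred backend keeps the common case (a prefixed path) short while still allowing a non-default client to be configured.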

BC-breaking (Optional)

No

Use cases (Optional)

  1. Upload checkpoints to PetrelOSS

    checkpoint_config = dict(interval=1, out_dir='s3://path/of/your/directory')  # defined in default_runtime.py
    evaluation = dict(interval=1, metric='accuracy', out_dir='s3://path/of/your/directory')
  2. Save the best checkpoint to PetrelOSS

    evaluation = dict(interval=1, save_best='auto', out_dir='s3://bucket_name')
  3. Resume from a checkpoint on PetrelOSS

    resume_from = 's3://path/of/your/resumed_checkpoint.pth'
  4. Upload logs to PetrelOSS
    Only log_config, which is defined in default_runtime.py, needs to be modified.

    log_config = dict(
        interval=50,
        hooks=[
            dict(type='TextLoggerHook', out_dir='s3://path/of/your/directory/to/save/log'),
        ])
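
The log upload that happens at the end of training could look roughly like the sketch below. It is not the TextLoggerHook implementation: upload_logs is a hypothetical helper, a plain dict stands in for the remote bucket, and the choice of the .log and .log.json suffixes is an assumption made for the example.

```python
import os
import tempfile


def upload_logs(work_dir, remote, out_dir, keep_local=True):
    """Copy log files from work_dir into `remote` under out_dir.

    `remote` is a dict standing in for an object store (hypothetical).
    Local copies are deleted after upload when keep_local is False.
    """
    prefix = out_dir.rstrip('/')
    uploaded = []
    for name in sorted(os.listdir(work_dir)):
        # Assumed log suffixes for this sketch.
        if not name.endswith(('.log', '.log.json')):
            continue
        local_path = os.path.join(work_dir, name)
        with open(local_path, 'rb') as f:
            remote[prefix + '/' + name] = f.read()
        uploaded.append(name)
        if not keep_local:
            os.remove(local_path)
    return uploaded


with tempfile.TemporaryDirectory() as work_dir:
    with open(os.path.join(work_dir, 'run.log'), 'w') as f:
        f.write('epoch 1\n')
    remote = {}
    upload_logs(work_dir, remote, 's3://bucket_name', keep_local=False)
    print(sorted(remote))
    print(os.listdir(work_dir))
```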

Tested by downstream codebases

MMClassification

Training a model with resnet101_b16x8_cifar10

                                   Save to PetrelOSS   Save to Lustre
Training time (20 epochs, 1 GPU)   15min               14min
  • [Save to PetrelOSS] config for uploading checkpoints, best checkpoint, and logs to PetrelOSS

    # checkpoint saving
    checkpoint_config = dict(interval=1, max_keep_ckpts=5, out_dir='s3://bucket_name')
    # yapf:disable
    log_config = dict(
        interval=100,
        hooks=[
            dict(type='TextLoggerHook', out_dir='s3://bucket_name'),
            # dict(type='TensorboardLoggerHook')
        ])
    # yapf:enable
    evaluation = dict(interval=1, save_best='auto', out_dir='s3://bucket_name')
    • CheckpointHook works as expected
      This includes uploading checkpoints and deleting outdated checkpoints (max_keep_ckpts=5)
    • Resuming from checkpoints works as expected
    • TextLoggerHook works as expected
      This includes uploading logs and deleting the local logs after uploading (keep_local=False)
    • EvalHook works as expected
      This includes evaluating on the validation dataset and saving and updating the best checkpoint (save_best='auto')
  • [Save to Lustre] config for saving checkpoints, best checkpoint, and logs to Lustre

    # checkpoint saving
    checkpoint_config = dict(interval=1, max_keep_ckpts=5)
    # yapf:disable
    log_config = dict(
        interval=100,
        hooks=[
            dict(type='TextLoggerHook'),
            # dict(type='TensorboardLoggerHook')
        ])
    # yapf:enable
    evaluation = dict(interval=1, save_best='auto')
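
The deletion of outdated checkpoints exercised in the test above (max_keep_ckpts=5) can be sketched as follows. This is only an illustration of the idea, not the code in CheckpointHook: redundant_checkpoints is a hypothetical helper, and the epoch_{}.pth file name pattern is an assumption.

```python
def redundant_checkpoints(current, interval, max_keep_ckpts):
    """Return checkpoint indices older than the newest max_keep_ckpts.

    `current` is the epoch (or iteration) that was just saved, and a
    checkpoint is assumed to exist at every multiple of `interval`.
    """
    if max_keep_ckpts <= 0:
        return []  # keep everything
    oldest_kept = current - (max_keep_ckpts - 1) * interval
    return list(range(oldest_kept - interval, 0, -interval))


# After saving epoch 7 with interval=1 and max_keep_ckpts=5,
# epochs 3 to 7 are kept and the older ones are removed.
to_remove = ['epoch_{}.pth'.format(i) for i in redundant_checkpoints(7, 1, 5)]
print(to_remove)  # ['epoch_2.pth', 'epoch_1.pth']
```

With a remote backend, each name in the list would be joined with out_dir and removed through the file client rather than with a local file-system call.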

MMDetection

Training a model with retinanet_r50_fpn_1x_coco

Save to PetrelOSS Save to Lustre
Training time (8 epochs, 1 GPU) 2h20m
  • [Save to PetrelOSS] config for uploading checkpoints, best checkpoint, and logs to PetrelOSS

    # checkpoint saving
    checkpoint_config = dict(interval=1, max_keep_ckpts=2, out_dir='s3://bucket_name')
    # yapf:disable
    log_config = dict(
        interval=50,
        hooks=[
            dict(type='TextLoggerHook', out_dir='s3://bucket_name'),
            # dict(type='TensorboardLoggerHook')
        ])
    # yapf:enable
    evaluation = dict(interval=1, save_best='bbox', out_dir='s3://bucket_name')
    data = dict(
        samples_per_gpu=2,
        workers_per_gpu=2,
        train=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',  # reduce the training time
            img_prefix=data_root + 'train2017/',
            pipeline=train_pipeline),
        val=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',
            img_prefix=data_root + 'val2017/',
            pipeline=test_pipeline),
        test=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',
            img_prefix=data_root + 'val2017/',
            pipeline=test_pipeline))
  • [Save to Lustre] config for saving checkpoints, best checkpoint, and logs to Lustre

    # checkpoint saving
    checkpoint_config = dict(interval=1, max_keep_ckpts=5)
    # yapf:disable
    log_config = dict(
        interval=100,
        hooks=[
            dict(type='TextLoggerHook'),
            # dict(type='TensorboardLoggerHook')
        ])
    # yapf:enable
    evaluation = dict(interval=1, save_best='bbox')
    data = dict(
        samples_per_gpu=2,
        workers_per_gpu=2,
        train=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',  # reduce the training time
            img_prefix=data_root + 'train2017/',
            pipeline=train_pipeline),
        val=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',
            img_prefix=data_root + 'val2017/',
            pipeline=test_pipeline),
        test=dict(
            type=dataset_type,
            ann_file=data_root + 'annotations/instances_val2017.json',
            img_prefix=data_root + 'val2017/',
            pipeline=test_pipeline))
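
Saving and updating the best checkpoint, as tested in both downstream codebases, amounts to replacing the stored file whenever the monitored score improves. The sketch below is a simplified stand-in for that behaviour and not EvalHook itself: BestCheckpointTracker, the greater/less rule argument, and the best_epoch_{}.pth name are all assumptions, and a dict plays the role of the bucket.

```python
import operator


class BestCheckpointTracker:
    """Keep only the checkpoint with the best score seen so far."""

    def __init__(self, storage, rule='greater'):
        # `storage` is a dict standing in for a remote bucket (hypothetical).
        self.storage = storage
        self.is_better = operator.gt if rule == 'greater' else operator.lt
        self.best_score = None
        self.best_path = None

    def update(self, epoch, score, out_dir='s3://bucket_name'):
        """Store a new best checkpoint and return True if score improved."""
        if self.best_score is not None and not self.is_better(
                score, self.best_score):
            return False
        if self.best_path is not None:
            # Remove the previous best before writing the new one.
            self.storage.pop(self.best_path, None)
        self.best_score = score
        self.best_path = '{}/best_epoch_{}.pth'.format(out_dir, epoch)
        self.storage[self.best_path] = b'weights'
        return True


bucket = {}
tracker = BestCheckpointTracker(bucket)
for epoch, score in enumerate([0.61, 0.58, 0.70], start=1):
    tracker.update(epoch, score)
print(sorted(bucket))  # ['s3://bucket_name/best_epoch_3.pth']
```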

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects, like MMDet or MMCls.
  • CLA has been signed and all committers have signed the CLA in this PR.

codecov bot commented Sep 26, 2021

Codecov Report

Merging #1375 (81520af) into master (8cac7c2) will increase coverage by 0.14%.
The diff coverage is 77.22%.

❗ Current head 81520af differs from pull request most recent head 39a58a3. Consider uploading reports for the commit 39a58a3 to get more accurate results

@@            Coverage Diff             @@
##           master    #1375      +/-   ##
==========================================
+ Coverage   68.59%   68.74%   +0.14%     
==========================================
  Files         164      164              
  Lines       10891    11029     +138     
  Branches     1991     2018      +27     
==========================================
+ Hits         7471     7582     +111     
- Misses       3030     3051      +21     
- Partials      390      396       +6     
Flag        Coverage Δ
unittests   68.74% <77.22%> (+0.14%) ⬆️

Impacted Files                            Coverage Δ
mmcv/runner/hooks/logger/text.py          71.07% <31.81%> (-8.73%) ⬇️
mmcv/runner/checkpoint.py                 73.77% <41.17%> (-0.65%) ⬇️
mmcv/fileio/file_client.py                86.66% <85.29%> (-3.41%) ⬇️
mmcv/runner/hooks/checkpoint.py           84.12% <90.90%> (+24.86%) ⬆️
mmcv/fileio/handlers/base.py              53.33% <100.00%> (-25.24%) ⬇️
mmcv/fileio/handlers/pickle_handler.py    87.50% <100.00%> (-12.50%) ⬇️
mmcv/fileio/io.py                         76.56% <100.00%> (+6.56%) ⬆️
mmcv/fileio/parse.py                      100.00% <100.00%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5b5b47f...39a58a3.

@zhouzaida zhouzaida added the WIP label Oct 23, 2021
ZwwWayne (Collaborator) commented

Can be merged after resolving the final comments.

@ZwwWayne ZwwWayne removed the WIP label Oct 24, 2021
@ZwwWayne ZwwWayne merged commit 32e09f4 into open-mmlab:master Oct 24, 2021
zhouzaida added a commit that referenced this pull request Nov 3, 2021
* [Feature] Choose storage backend by the prefix of filepath

* refactor FileClient and add unittest

* support loading from different backends

* polish docstring

* fix unittet

* rename attribute str_like_obj to is_str_like_obj

* [Docs] Upload checkpoint to petrel oss

* add infer_client method

* Support uploading checkpoint to petrel oss

* add check_exist method

* refactor CheckpointHook

* support uploading logs to ceph

* rename var client to file_client

* polish docstring

* enhance load_from_ceph

* refactor load_from_ceph

* refactor TextLoggerHook

* change the meaning of out_dir argument

* fix test_checkpoint_hook.py

* add join_paths method

* remove join_paths and add _format_path

* enhance unittest

* refactor unittest

* add a unittest for EvalHook when file backend is petrel

* singleton pattern

* fix test_clientio.py

* deprecate CephBackend

* add warning in load_from_ceph

* fix type of out_suffix

* enhance docstring

* refactor unittest for petrel

* refactor unittest for disk backend

* update io.md

* add concat_paths method

* fix CI

* mock check_exist

* improve docstring

* improve docstring

* improve docstring

* improve docstring

* add isdir and copyfile for file backend

* delete copyfile and add get_local_path

* remove isdir method of petrel

* fix typo

* rename check_exists to exists

* refactor code and polish docstring

* fix windows ci

* add comment and polish docstring

* polish docstring

* polish docstring

* rename _path_mapping to _map_path

* polish docstring and fix typo

* refactor get_local_path

* add list_dir_or_file for FileClient

* add list_dir_or_file for PetrelBackend

* fix windows ci

* Add return docstring

* polish docstring

* fix typo

* fix typo

* fix typo

* fix error when mocking PetrelBackend

* deprecate the conversion from Path to str

* add docs for loading checkpoints with FileClient

* rename keep_log to keep_local

* refactor map_path

* add _ensure_methods to ensure methods have been implemented

* fix list_dir_or_file

* rename _ensure_method_implemented to has_method

* refactor

* polish information

* format information
@zhouzaida zhouzaida deleted the upload-ckpt-to-ceph branch November 4, 2021 07:17
zhouzaida added a commit that referenced this pull request Apr 16, 2022
* [Feature] Add roiaware pool3d ops from mmdet3d (#1382)

* add ops (roiaware pool3d) in mmdet3d

* refactor code

* fix typo

Co-authored-by: zhouzaida <zhouzaida@163.com>

* [Feature] Add iou3d op from mmdet3d (#1356)

* add ops (iou3d) in mmdet3d

* add unit test

* refactor code

* refactor code

* refactor code

* refactor code

* refactor code

Co-authored-by: zhouzaida <zhouzaida@163.com>

* [Fix] Update test data for test_iou3d (#1427)

* Update test data for test_iou3d

* delete blank lines

Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* [Feature] Add group points ops from mmdet3d (#1415)

* add op (group points) and its related ops (ball query and knn) in mmdet3d

* refactor code

* fix typo

* refactor code

* fix typo

* refactor code

* make input contiguous

Co-authored-by: zhouzaida <zhouzaida@163.com>

* add mmdet3d op (#1425)

Co-authored-by: zhouzaida <zhouzaida@163.com>

* [Feature] Loading objects from different backends and dumping objects to different backends (#1330)

* [Feature] Choose storage backend by the prefix of filepath

* refactor FileClient and add unittest

* support loading from different backends

* polish docstring

* fix unittet

* rename attribute str_like_obj to is_str_like_obj

* add infer_client method

* add check_exist method

* rename var client to file_client

* polish docstring

* add join_paths method

* remove join_paths and add _format_path

* enhance unittest

* refactor unittest

* singleton pattern

* fix test_clientio.py

* deprecate CephBackend

* enhance docstring

* refactor unittest for petrel

* refactor unittest for disk backend

* update io.md

* add concat_paths method

* improve docstring

* improve docstring

* add isdir and copyfile for file backend

* delete copyfile and add get_local_path

* remove isdir method of petrel

* fix typo

* add comment and polish docstring

* polish docstring

* rename _path_mapping to _map_path

* polish docstring and fix typo

* refactor get_local_path

* add list_dir_or_file for FileClient

* add list_dir_or_file for PetrelBackend

* fix windows ci

* Add return docstring

* polish docstring

* fix typo

* fix typo

* deprecate the conversion from Path to str

* add docs for loading checkpoints with FileClient

* refactor map_path

* add _ensure_methods to ensure methods have been implemented

* fix list_dir_or_file

* rename _ensure_method_implemented to has_method

* Add CI for pytorch 1.10 (#1431)

* [Feature] Upload checkpoints and logs to ceph (#1375)

* bump version to v1.3.16 (#1430)

* [Fix]: Update test data of test_tin_shift (#1426)

* Update test data of test_tin_shift

* Delete tmp.engine

* add pytest raises asserterror test

* raise valueerror, update test log

* add more comment

* Apply suggestions from code review

Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>


* fix the wrong function reference bug in BaseTransformerLayer when batch_first is True (#1418)

* [Docs] Add mmcv itself in the docs list (#1441)

* Add mmcv itself in the docs list

* modify link of docs

* [Improve] improve checkpoint loading log (#1446)

* [Feature] Support SigmoidFocalLoss with Cambricon MLU backend (#1346)

* [Feature] Support SigmoidFocalLoss with Cambricon MLU backend

* refactor MMCV_WITH_MLU macro define

* refactor NFU_ALIGN_SIZE, PAD_DOWN and split_pipeline_num

* delete extra fool proofing in cpp

* [Feature] Support SigmoidFocalLossBackward with Cambricon MLU backend

* fix macro definition in SigmoidFocalLoss

* refactor mlu files into clang-format

* refactor sigmoid focal loss test

* refactor Sigmoid Focal Loss file structure.

* fix python lint error

* fix import torch_mlu error type

* fix lint

* refactor clang format style to google

Co-authored-by: zhouzaida <zhouzaida@163.com>

* [Feature] Support RoiAlign With Cambricon MLU Backend (#1429)

* [Feature] Support NMS with cambricon MLU backend (#1467)

* [Feature] Support BBoxOverlaps with cambricon MLU backend (#1507)

* [Refactor] Format C++ code

* [Refactor] include common_mlu_helper in pytorch_mlu_helper and refactor build condition

* [Improve] Improve the performance of roialign, nms and focalloss with MLU backend (#1572)

* [Improve] Improve the performance of roialign with MLU backend

* replace CHECK_MLU with CHECK_MLU_INPUT

* [Improve] Improve the perf of nms and focallosssigmoid with MLU backend

* [Improve] Improve the performance of roialign with MLU backend (#1741)

* [Feature] Support tin_shift with cambricon MLU backend (#1696)

* [Feature] Support tin_shift with cambricon MLU backend

* [fix] Add the assertion of batch_size in tin_shift.py

* [fix] fix the param check of tin_shift in cambricon code

* [fix] Fix lint failure.

* [fix] Fix source file lint failure.

* Update mmcv/ops/tin_shift.py

[Refactor] Modify the code in mmcv/ops/tin_shift.py.

Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: budefei <budefei@cambricom.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* resolve conflicts and fix lint

* fix mmcv.utils.__init__

* fix mmcv.utils.__init__

* Fix lints and change FLAG

* fix setup and refine

* remove a redundant line

* remove an unnecessary 'f'

* fix compilation error

Co-authored-by: dingchang <hudingchang.vendor@sensetime.com>
Co-authored-by: zhouzaida <zhouzaida@163.com>
Co-authored-by: q.yao <yaoqian@sensetime.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: pc <luopeichao@sensetime.com>
Co-authored-by: Wenwei Zhang <40779233+ZwwWayne@users.noreply.github.com>
Co-authored-by: q.yao <streetyao@live.com>
Co-authored-by: Tong Gao <gaotongxiao@gmail.com>
Co-authored-by: Yuxin Liu <liuyuxin@cambricon.com>
Co-authored-by: zihanchang11 <92860914+zihanchang11@users.noreply.github.com>
Co-authored-by: shlrao <shenglong.rao@gmail.com>
Co-authored-by: zhouchenyang <zcy19950525@gmail.com>
Co-authored-by: Mrxiaofei <36697723+Mrxiaofei@users.noreply.github.com>
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: budefei <budefei@cambricom.com>