Best Ckpt Removing Raises Exception and stalls training #244

Closed
3 tasks done
serser opened this issue Nov 5, 2022 · 2 comments · Fixed by open-mmlab/mmengine#682

Comments

serser commented Nov 5, 2022

Prerequisite

🐞 Describe the bug

As the title says, removing the best checkpoint raises an exception, and training stalls at that point.

As a temporary fix, I changed the code at line 416 of mmengine/fileio/backends/local_backend.py to the following:

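        # Temporary workaround: log the problem instead of raising,
        # so that training can continue.
        try:
            if not self.exists(filepath):
                raise FileNotFoundError(f'filepath {filepath} does not exist')

            if self.isdir(filepath):
                raise IsADirectoryError('filepath should be a file')
        except (FileNotFoundError, IsADirectoryError):
            print("not exist!!!", filepath)

        try:
            os.remove(filepath)
        except OSError:
            # Most likely the file has already been removed elsewhere.
            print("already removed!!!", filepath)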
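
A narrower variant of the workaround above, as a minimal sketch: it skips the separate existence check and treats "file already gone" as success, while still letting other errors (for example, the path being a directory) propagate. `remove_if_exists` is a hypothetical helper written for illustration. It is not an MMEngine API, and it is not necessarily how the upstream fix works.

```python
import os


def remove_if_exists(filepath: str) -> bool:
    """Remove a file, tolerating the case where it has already been removed.

    Hypothetical helper for illustration; not part of MMEngine.
    Returns True if this call removed the file, False if it was already gone.
    """
    try:
        os.remove(filepath)
    except FileNotFoundError:
        # Something else removed the file first; nothing left to do.
        return False
    return True
```

Because there is no gap between a check and the removal, a second caller that loses the race simply gets `False` back instead of an exception.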

Environment

sys.platform: linux
Python: 3.8.13 (default, Oct 21 2022, 23:50:54) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.0, V11.0.221
GCC: gcc (GCC) 5.4.0
PyTorch: 1.7.1+cu110
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.8.2+cu110
OpenCV: 4.6.0
MMEngine: 0.3.0
MMCV: 2.0.0rc2
MMDetection: 3.0.0rc2
MMYOLO: 0.1.2+dc3377b

Additional information

I suspect the cause is that the best checkpoint is removed from multiple processes: one removal succeeds, and the others throw an exception because the file no longer exists. This needs review and a fix from the developers.
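
The suspected interleaving can be replayed in a single process. The sketch below is only an illustration of the hypothesis: it uses a temporary file in place of a checkpoint, the function name `simulate_double_remove` is made up, and it does not call any MMEngine code.

```python
import os
import tempfile


def simulate_double_remove() -> str:
    """Replay the interleaving suspected between two workers.

    Hypothetical illustration only. Returns the name of the exception
    raised by the second removal, or "no error" if none is raised.
    """
    fd, path = tempfile.mkstemp(suffix=".pth")
    os.close(fd)

    # Both "workers" check before either one deletes, so both see the file.
    worker_a_sees_file = os.path.exists(path)
    worker_b_sees_file = os.path.exists(path)
    assert worker_a_sees_file and worker_b_sees_file

    os.remove(path)  # worker A removes the checkpoint first
    try:
        os.remove(path)  # worker B removes a file that is already gone
    except OSError as err:
        return type(err).__name__
    return "no error"


print(simulate_double_remove())
```

If real training processes interleave this way, the existence check passes in each of them and only the later `os.remove` calls fail, which would match the "already removed" message in the workaround.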

@RangeKing
Collaborator

Hi @serser, it seems that it's an upstream issue. You could open an issue in MMEngine.

@RangeKing
Collaborator

Hi @serser, you can run mim install mmengine -U or pip install mmengine -U to install the latest mmengine (0.3.1), which fixes this problem.
