
Re-producing issue #5

Closed · youngwanLEE opened this issue May 17, 2021 · 12 comments

@youngwanLEE commented May 17, 2021

Hi,

To check reproducibility, I trained the coat_lite_mini model (reported 79.1/94.5) and got 78.85/94.42 using this command:

bash scripts/train.sh coat_lite_mini coat_lite_mini

with the default settings, i.e., a batch size of 256 on 8 GPUs (TITAN RTX).

Is such a small difference (79.1 vs. 78.9) negligible?

My environment:


sys.platform linux
Python 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
numpy 1.19.2
Compiler GCC 7.5
CUDA compiler CUDA 10.1
detectron2 arch flags 7.5
DETECTRON2_ENV_MODULE
PyTorch 1.7.0
PyTorch debug build True
GPU available True
GPU 0,1,2,3,4,5,6,7 TITAN RTX (arch=7.5)
CUDA_HOME /usr/local/cuda-10.1
Pillow 8.0.1
torchvision 0.8.0
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5
fvcore 0.1.2.post20201218
cv2 Not found


PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
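
For reference, this environment dump appears to be the output of detectron2's environment collector, which can be regenerated with:

```
python -m detectron2.utils.collect_env
```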
@xwjabc (Contributor) commented May 17, 2021

Hi @youngwanLEE, thank you for running this reproducibility check! I think a 0.1~0.2% gap is acceptable (our reported result is actually rounded from 79.090%), since we have seen similar variation in some of our own trials. So your results look reasonable.

@youngwanLEE (Author):

@xwjabc Thanks for your quick reply :)

@xwjabc (Contributor) commented May 17, 2021

> @xwjabc Thanks for your quick reply :)

@youngwanLEE Besides, we sometimes found that using 16 GPUs instead of 8 (with all other settings the same) can slightly improve performance (around 0.1%), but we have not validated this extensively. You may give it a try :)
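
For anyone who wants to try the 16-GPU setting across two 8-GPU nodes: assuming a DeiT-style distributed entry point (not confirmed against the repo), a launch might look like the sketch below. The main.py name and --model flag are assumptions; the torch.distributed.launch flags themselves are standard PyTorch 1.7.

```
# Hypothetical two-node, 16-GPU launch; run once per node with the
# appropriate --node_rank. The entry point (main.py) and --model flag
# are assumed; torch.distributed.launch and its flags are standard.
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 \
    --node_rank=0 --master_addr=<master-ip> --master_port=29500 \
    --use_env main.py --model coat_lite_mini
```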

@youngwanLEE (Author):

@xwjabc,
By the way, did you decrease the batch size from 256 to 128?
When I tried to train the coat_mini model with the default settings (batch size of 256) on an 8-GPU machine with V100s (32 GB), I ran into out-of-memory errors, so I had no choice but to reduce the batch size from 256 to 128.

@xwjabc (Contributor) commented May 17, 2021

@youngwanLEE Yes, we also reduce the per-GPU batch size and use more GPUs for the coat_mini model (since we use 24 GB GPUs such as TITAN RTX or RTX 3090 for training).
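
For anyone hitting the same out-of-memory error, a reduced-batch run might look like the sketch below, assuming scripts/train.sh forwards extra arguments to the underlying trainer and that a DeiT-style --batch-size flag exists (both are assumptions, not confirmed against the repo):

```
# Hypothetical: halve the per-GPU batch size. The flag name and argument
# forwarding are assumed; check scripts/train.sh for the real interface.
bash scripts/train.sh coat_mini coat_mini --batch-size 128
```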

@youngwanLEE (Author):

@xwjabc Hi,
I have another question.

How do you compute the computational cost (i.e., FLOPs)?

@xwjabc (Contributor) commented May 20, 2021

> @xwjabc Hi,
> I have another question.
>
> How do you compute the computational cost (i.e., FLOPs)?

Hi @youngwanLEE, for the arXiv paper we used a modified version of the FLOPs calculation script from mmcv, following PVT (the calculation for the attention part is modified accordingly). We will add the script to the repo soon!
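
To illustrate the kind of correction involved (a generic sketch, not the authors' modified script): module-hook counters such as mmcv's only see standard layers, so the two matrix products inside softmax attention (Q·Kᵀ and attn·V) have to be added by hand. A minimal sketch for standard, non-factorized attention over n tokens with embedding dimension d:

```python
# Generic sketch of per-block attention FLOPs; NOT the authors' modified
# mmcv script. Counts multiply-accumulates for standard softmax attention
# over n tokens with embedding dim d (the head count cancels out).
def softmax_attention_flops(n: int, d: int) -> int:
    flops = 3 * n * d * d   # Q, K, V linear projections
    flops += n * n * d      # attention scores: Q @ K^T (summed over heads)
    flops += n * n * d      # weighted sum: attn @ V
    flops += n * d * d      # output projection
    return flops

# Illustrative numbers only, not taken from a CoaT config:
# 196 tokens (a 14x14 feature map) with embedding dim 216.
print(f"{softmax_attention_flops(196, 216) / 1e6:.1f} MFLOPs per block")
```

CoaT's factorized attention is designed to scale linearly in the token count, which is why the attention part in particular needs a custom count rather than the stock one.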

@youngwanLEE (Author):

@xwjabc oh, good news!!
Thanks :)

@youngwanLEE (Author) commented May 24, 2021

@xwjabc Hi,

I want to share my reproduced result for CoaT-Mini: 81.494 / 95.568, which is higher than your reported number :)

Cool!!

My environment:

pytorch: 1.7
torchvision: 0.8.1
GPUs: RTX8000 x 8
batch-size-per-gpu: 256

Training time: 6 days, 15:51:55

@xwjabc (Contributor) commented May 24, 2021

@youngwanLEE Cool! Currently we are still exploring ways to improve the efficiency of CoaT models. We hope that we can obtain faster and better models in the end.

@youngwanLEE (Author):

> @xwjabc Hi,
> I have another question.
> How do you compute the computational cost (i.e., FLOPs)?
>
> Hi @youngwanLEE, for the arXiv paper we used a modified version of the FLOPs calculation script from mmcv, following PVT (the calculation for the attention part is modified accordingly). We will add the script to the repo soon!

Hi @xwjabc,

I'd like to know when the FLOPs calculation script will be released.

Thanks in advance :)

@xwjabc (Contributor) commented Jun 18, 2021

@youngwanLEE Sorry for the late reply! We were busy preparing the paper rebuttal. We will release the FLOPs calculation script soon, along with some larger models (CoaT Small, ~20M params, and CoaT-Lite Medium, ~40M params). Thanks!
