
Fix torch.randperm #5014

Merged
merged 2 commits into open-mmlab:master from ypwhs:patch-1 on Apr 26, 2021
Conversation

@ypwhs (Contributor) commented Apr 19, 2021

Environment

2021-04-19 16:53:29,111 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.5 (default, Sep  4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 
TorchVision: 0.9.1+cu111
OpenCV: 4.5.1
MMCV: 1.3.1
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.11.0+
------------------------------------------------------------

Error Log

2021-04-19 16:09:47,605 - mmdet - INFO - workflow: [('train', 1)], max: 32 epochs
2021-04-19 16:09:50,356 - mmdet - INFO - Epoch [1][1/1235]	lr: 1.000e-05, eta: 1 day, 6:07:50, time: 2.745, data_time: 2.107, memory: 571, loss_rpn_cls: 0.6955, loss_rpn_bbox: 1.6445, loss_cls: 1.7080, acc: 0.3906, loss_bbox: 0.0003, loss: 4.0482, grad_norm: 21.0380
2021-04-19 16:09:50,431 - mmdet - INFO - Epoch [1][2/1235]	lr: 8.805e-05, eta: 15:28:34, time: 0.075, data_time: 0.016, memory: 571, loss_rpn_cls: 0.6899, loss_rpn_bbox: 0.0366, loss_cls: 1.7156, acc: 0.0000, loss_bbox: 0.0324, loss: 2.4745, grad_norm: 18.7435
2021-04-19 16:09:50,552 - mmdet - INFO - Epoch [1][3/1235]	lr: 1.661e-04, eta: 10:45:42, time: 0.122, data_time: 0.054, memory: 648, loss_rpn_cls: 0.6897, loss_rpn_bbox: 0.0044, loss_cls: 1.6527, acc: 0.9766, loss_bbox: 0.0001, loss: 2.3469, grad_norm: 20.5114
2021-04-19 16:09:50,665 - mmdet - INFO - Epoch [1][4/1235]	lr: 2.441e-04, eta: 8:22:51, time: 0.113, data_time: 0.040, memory: 648, loss_rpn_cls: 0.6889, loss_rpn_bbox: 0.0045, loss_cls: 1.4771, acc: 25.5859, loss_bbox: 0.0000, loss: 2.1705, grad_norm: 18.6314
2021-04-19 16:09:50,740 - mmdet - INFO - Epoch [1][5/1235]	lr: 3.222e-04, eta: 6:52:04, time: 0.074, data_time: 0.016, memory: 648, loss_rpn_cls: 0.6883, loss_rpn_bbox: 0.4192, loss_cls: 1.2716, acc: 96.6797, loss_bbox: 0.0085, loss: 2.3877, grad_norm: 17.7080
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [41,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [42,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [43,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Reason

torch.randperm(gallery.numel(), device=gallery.device)

returns huge, out-of-range numbers like:

torch.randperm(gallery.numel(), device=gallery.device)
Out[4]: 
tensor([4473784420498503614, 3067222985952748201, 1730247622805280939,
         ..., 8905490153029995233, 2309086847014742982,
        7042600051962955657], device='cuda:0')
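Not part of the original report, but a quick sanity check for whether a given setup is affected is to verify that the CUDA output of torch.randperm is a valid permutation of 0..n-1 (requires a CUDA-capable GPU):

import torch

n = 1000
perm = torch.randperm(n, device='cuda')
# A valid permutation, once sorted, must equal 0..n-1. On affected setups
# (e.g. PyTorch 1.8.1 + CUDA 11.1 on an RTX 3090) this assertion fails
# because the tensor contains huge out-of-range values instead.
assert torch.equal(perm.sort().values, torch.arange(n, device='cuda'))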

Solution

Generate the random permutation on the CPU, then copy it to the GPU:

torch.randperm(gallery.numel()).to(device=gallery.device)
Out[5]: 
tensor([ 486686, 1097497,  599874,  ...,   77371,  606891,  348124],
       device='cuda:0')

Or wait for PyTorch to fix it.

For users: you could downgrade PyTorch to 1.7.1.
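As a minimal sketch (not part of the PR itself), the CPU workaround above can be wrapped in a small helper so the host-side generation and the device copy live in one place; the helper name is hypothetical:

import torch

def randperm_workaround(n, device=None):
    # Hypothetical helper: always draw the permutation on the CPU, where
    # torch.randperm behaves correctly, then move it to the target device.
    perm = torch.randperm(n)
    if device is not None:
        perm = perm.to(device)
    return perm

The extra host-to-device copy costs a little time, but it guarantees a valid permutation on the affected PyTorch 1.8.1 + CUDA 11.1 setups.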

Related Issues

#4734

#4824

pytorch/pytorch#30569

@CLAassistant commented Apr 19, 2021

CLA assistant check
All committers have signed the CLA.

@codecov bot commented Apr 19, 2021

Codecov Report

Merging #5014 (88dc758) into master (e02d559) will decrease coverage by 0.15%.
The diff coverage is 100.00%.

❗ Current head 88dc758 differs from the pull request's most recent head 0ddfd9d. Consider uploading reports for commit 0ddfd9d to get more accurate results.
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5014      +/-   ##
==========================================
- Coverage   65.65%   65.49%   -0.16%     
==========================================
  Files         257      257              
  Lines       20080    20080              
  Branches     3419     3419              
==========================================
- Hits        13183    13152      -31     
- Misses       6185     6211      +26     
- Partials      712      717       +5     
Flag Coverage Δ
unittests 65.49% <100.00%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
mmdet/core/bbox/samplers/random_sampler.py 75.00% <100.00%> (ø)
mmdet/models/roi_heads/mask_scoring_roi_head.py 55.35% <0.00%> (-32.15%) ⬇️
mmdet/models/dense_heads/rpn_test_mixin.py 77.41% <0.00%> (-6.46%) ⬇️
mmdet/models/roi_heads/mask_heads/maskiou_head.py 94.62% <0.00%> (-5.38%) ⬇️
mmdet/models/roi_heads/test_mixins.py 54.08% <0.00%> (-3.78%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e02d559...0ddfd9d.

@ZwwWayne (Collaborator) commented:

Hi @ypwhs,
Thanks for the PR. Would you like to sign the CLA so that we can merge this PR?

@@ -51,7 +51,7 @@ def random_choice(self, gallery, num):
             else:
                 device = 'cpu'
             gallery = torch.tensor(gallery, dtype=torch.long, device=device)
-        perm = torch.randperm(gallery.numel(), device=gallery.device)[:num]
+        perm = torch.randperm(gallery.numel()).to(device=gallery.device)[:num]
Collaborator:

Thank you for your contribution. Could you add some comments, such as the reason for the modification and the corresponding link?

@ypwhs (Contributor, Author):

This is just a simple fix; the underlying problem is likely in PyTorch and has not been fixed there yet. I tried this workaround and it works, so there is no specific link to provide. Thanks for your review.

Collaborator:

I understand, but it is best to document the reason for the modification in the code so that it stays easy to read in the future. Thank you!

@ZwwWayne (Collaborator) commented Apr 25, 2021:

We can simply add a line of comments referring to this PR and the related PyTorch issue, and note that this is a workaround. That way users can easily understand the code, and once PyTorch fixes the issue we can clean up the code.

@ypwhs (Contributor, Author):

Added comments and optimized the speed of copying.
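For reference, a sketch of what the updated line in mmdet/core/bbox/samplers/random_sampler.py could look like after this change (the comment wording is assumed, not quoted from the commit); slicing to num before the .to() call keeps the host-to-device copy small:

# Temporary workaround: generate the permutation on the CPU because
# torch.randperm on CUDA can return invalid values on some setups
# (see this PR and pytorch/pytorch#30569), then copy only the first
# `num` indices to the target device.
perm = torch.randperm(gallery.numel())[:num].to(device=gallery.device)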

@ZwwWayne ZwwWayne merged commit 50a8cee into open-mmlab:master Apr 26, 2021
@ypwhs ypwhs deleted the patch-1 branch April 26, 2021 09:33
K-H-Ismail added a commit to K-H-Ismail/Swin-Transformer-Object-Detection that referenced this pull request Aug 17, 2022
Temporary fix for torch.randperm, from: open-mmlab#5014
weiyx16 pushed a commit to SwinTransformer/Swin-Transformer-Object-Detection that referenced this pull request Aug 22, 2022
Temporary fix for torch.randperm, from: open-mmlab#5014