
Fix torch.randperm #5014

Merged
merged 2 commits into open-mmlab:master from ypwhs:patch-1 on Apr 26, 2021
Conversation

@ypwhs (Contributor) commented Apr 19, 2021

Environment

2021-04-19 16:53:29,111 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.5 (default, Sep  4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.8.1+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 
TorchVision: 0.9.1+cu111
OpenCV: 4.5.1
MMCV: 1.3.1
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.11.0+
------------------------------------------------------------

Error Log

2021-04-19 16:09:47,605 - mmdet - INFO - workflow: [('train', 1)], max: 32 epochs
2021-04-19 16:09:50,356 - mmdet - INFO - Epoch [1][1/1235]	lr: 1.000e-05, eta: 1 day, 6:07:50, time: 2.745, data_time: 2.107, memory: 571, loss_rpn_cls: 0.6955, loss_rpn_bbox: 1.6445, loss_cls: 1.7080, acc: 0.3906, loss_bbox: 0.0003, loss: 4.0482, grad_norm: 21.0380
2021-04-19 16:09:50,431 - mmdet - INFO - Epoch [1][2/1235]	lr: 8.805e-05, eta: 15:28:34, time: 0.075, data_time: 0.016, memory: 571, loss_rpn_cls: 0.6899, loss_rpn_bbox: 0.0366, loss_cls: 1.7156, acc: 0.0000, loss_bbox: 0.0324, loss: 2.4745, grad_norm: 18.7435
2021-04-19 16:09:50,552 - mmdet - INFO - Epoch [1][3/1235]	lr: 1.661e-04, eta: 10:45:42, time: 0.122, data_time: 0.054, memory: 648, loss_rpn_cls: 0.6897, loss_rpn_bbox: 0.0044, loss_cls: 1.6527, acc: 0.9766, loss_bbox: 0.0001, loss: 2.3469, grad_norm: 20.5114
2021-04-19 16:09:50,665 - mmdet - INFO - Epoch [1][4/1235]	lr: 2.441e-04, eta: 8:22:51, time: 0.113, data_time: 0.040, memory: 648, loss_rpn_cls: 0.6889, loss_rpn_bbox: 0.0045, loss_cls: 1.4771, acc: 25.5859, loss_bbox: 0.0000, loss: 2.1705, grad_norm: 18.6314
2021-04-19 16:09:50,740 - mmdet - INFO - Epoch [1][5/1235]	lr: 3.222e-04, eta: 6:52:04, time: 0.074, data_time: 0.016, memory: 648, loss_rpn_cls: 0.6883, loss_rpn_bbox: 0.4192, loss_cls: 1.2716, acc: 96.6797, loss_bbox: 0.0085, loss: 2.3877, grad_norm: 17.7080
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [41,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [42,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [43,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Reason

torch.randperm(gallery.numel(), device=gallery.device)

returns huge, out-of-range numbers like:

torch.randperm(gallery.numel(), device=gallery.device)
Out[4]: 
tensor([4473784420498503614, 3067222985952748201, 1730247622805280939,
         ..., 8905490153029995233, 2309086847014742982,
        7042600051962955657], device='cuda:0')
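Not part of the original report, but a quick sanity check for whether a given setup is affected is to verify that the CUDA output of torch.randperm is a valid permutation of 0..n-1 (requires a CUDA-capable GPU):

import torch

n = 1000
perm = torch.randperm(n, device='cuda')
# A valid permutation, once sorted, must equal 0..n-1. On affected setups
# (e.g. PyTorch 1.8.1 + CUDA 11.1 on an RTX 3090) this assertion fails
# because the tensor contains huge out-of-range values instead.
assert torch.equal(perm.sort().values, torch.arange(n, device='cuda'))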

Solution

Generate the random permutation on the CPU, then copy it to the GPU:

torch.randperm(gallery.numel()).to(device=gallery.device)
Out[5]: 
tensor([ 486686, 1097497,  599874,  ...,   77371,  606891,  348124],
       device='cuda:0')

Or wait for PyTorch to fix it.

For users: you could downgrade PyTorch to 1.7.1.
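As a minimal sketch (not part of the PR itself), the CPU workaround above can be wrapped in a small helper so the host-side generation and the device copy live in one place; the helper name is hypothetical:

import torch

def randperm_workaround(n, device=None):
    # Hypothetical helper: always draw the permutation on the CPU, where
    # torch.randperm behaves correctly, then move it to the target device.
    perm = torch.randperm(n)
    if device is not None:
        perm = perm.to(device)
    return perm

The extra host-to-device copy costs a little time, but it guarantees a valid permutation on the affected PyTorch 1.8.1 + CUDA 11.1 setups.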

Related Issues

#4734

#4824

pytorch/pytorch#30569

@CLAassistant commented Apr 19, 2021

CLA assistant check
All committers have signed the CLA.

@codecov bot commented Apr 19, 2021

Codecov Report

Merging #5014 (88dc758) into master (e02d559) will decrease coverage by 0.15%.
The diff coverage is 100.00%.

❗ Current head 88dc758 differs from the pull request's most recent head 0ddfd9d. Consider uploading reports for commit 0ddfd9d to get more accurate results.
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5014      +/-   ##
==========================================
- Coverage   65.65%   65.49%   -0.16%     
==========================================
  Files         257      257              
  Lines       20080    20080              
  Branches     3419     3419              
==========================================
- Hits        13183    13152      -31     
- Misses       6185     6211      +26     
- Partials      712      717       +5     
Flag Coverage Δ
unittests 65.49% <100.00%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
mmdet/core/bbox/samplers/random_sampler.py 75.00% <100.00%> (ø)
mmdet/models/roi_heads/mask_scoring_roi_head.py 55.35% <0.00%> (-32.15%) ⬇️
mmdet/models/dense_heads/rpn_test_mixin.py 77.41% <0.00%> (-6.46%) ⬇️
mmdet/models/roi_heads/mask_heads/maskiou_head.py 94.62% <0.00%> (-5.38%) ⬇️
mmdet/models/roi_heads/test_mixins.py 54.08% <0.00%> (-3.78%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e02d559...0ddfd9d.

@ZwwWayne (Collaborator) commented:

Hi @ypwhs,
Thanks for the PR. Would you like to sign the CLA so that we can merge this PR?

@@ -51,7 +51,7 @@ def random_choice(self, gallery, num):
             else:
                 device = 'cpu'
             gallery = torch.tensor(gallery, dtype=torch.long, device=device)
-        perm = torch.randperm(gallery.numel(), device=gallery.device)[:num]
+        perm = torch.randperm(gallery.numel()).to(device=gallery.device)[:num]
Collaborator:

Thank you for your contribution. Could you add some comments, such as the reason for the modification and the corresponding link?

@ypwhs (Contributor, Author):

This is just a simple fix; the underlying problem is likely in PyTorch and has not been fixed there yet. I tried this workaround and it works, so there is no specific link to provide. Thanks for your review.

Collaborator:

I understand, but it is best to document the reason for the modification in the code so that it stays easy to read in the future. Thank you!

@ZwwWayne (Collaborator) commented Apr 25, 2021:

We can simply add a line of comments referring to this PR and the related PyTorch issue, and note that this is a workaround. That way users can easily understand the code, and once PyTorch fixes the issue we can clean up the code.

@ypwhs (Contributor, Author):

Added comments and optimized the speed of copying.
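For reference, a sketch of what the updated line in mmdet/core/bbox/samplers/random_sampler.py could look like after this change (the comment wording is assumed, not quoted from the commit); slicing to num before the .to() call keeps the host-to-device copy small:

# Temporary workaround: generate the permutation on the CPU because
# torch.randperm on CUDA can return invalid values on some setups
# (see this PR and pytorch/pytorch#30569), then copy only the first
# `num` indices to the target device.
perm = torch.randperm(gallery.numel())[:num].to(device=gallery.device)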

@ZwwWayne ZwwWayne merged commit 50a8cee into open-mmlab:master Apr 26, 2021
@ypwhs ypwhs deleted the patch-1 branch April 26, 2021 09:33
K-H-Ismail added a commit to K-H-Ismail/Swin-Transformer-Object-Detection that referenced this pull request Aug 17, 2022
Temporary fix for torch.randperm, from: open-mmlab#5014
weiyx16 pushed a commit to SwinTransformer/Swin-Transformer-Object-Detection that referenced this pull request Aug 22, 2022
Temporary fix for torch.randperm, from: open-mmlab#5014