This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

DeviceRadixSort::SortPairs fails to sort array #64

Closed
daktfi opened this issue Nov 15, 2016 · 5 comments

@daktfi

daktfi commented Nov 15, 2016

When I try to sort an array of roughly 40M pairs or more, it simply does not sort them, without reporting any errors.
Device is: Device 0: GeForce GTX 950 (PTX version 520, SM520, 6 SMs, 904 free / 1995 total MB physmem, 105.760 GB/s @ 3305000 kHz mem clock, ECC off)
CUB version 1.5.5 (the latest at the moment).

A sample project to reproduce the problem is attached:
check_dev_radix.zip

When run with an increasingly larger array to sort, it eventually fails to sort it.
As I understand it, the critical size depends on the amount of free device memory. The problem is that no error is reported.

@dumerrill
Contributor

Hrm. Seems to work just fine for me. What OS, host, and CUDA compilers are you using? (You might consider checking the status result from the allocator.)
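
For instance, a minimal allocator-status check might look like the following sketch; plain cudaMalloc and 32-bit keys are assumed here, since the attached project may allocate differently:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Allocate num_items keys on the device, checking the allocator status
// instead of discarding it.
unsigned int *alloc_keys_checked(size_t num_items)
{
    unsigned int *d_keys = NULL;
    cudaError_t error = cudaMalloc((void **) &d_keys, num_items * sizeof(unsigned int));
    if (error != cudaSuccess)
    {
        fprintf(stderr, "cudaMalloc of %zu keys failed: %s\n",
                num_items, cudaGetErrorString(error));
        exit(1);
    }
    return d_keys;
}
```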

Compiling:

[dumerrill@dt06 removeme]$ nvcc -arch=sm_52 -std=c++11 -O3 main.cpp sort_cub.cu -I../.. -I.

For 100M items:

[dumerrill@dt06 removeme]$ ./a.out 100000000
Found 4 CUDA devices:
Device 0: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 1: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 2: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 3: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Largest mem available 22.2GiB @0
Smallest mem available 22.2GiB @0
Device 0 selected
set length 100000000

Array length 100001408 (381MiB)
Data:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort0:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort1:
7 91538659
37 4880726
51 72684560
95 9505224
95 83691578
181 85858716
224 77198143
227 52701079
315 30544587
367 77156907
Sort ok

With 40M:

[dumerrill@dt06 removeme]$ ./a.out 40000000
Found 4 CUDA devices:
Device 0: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 1: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 2: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 3: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Largest mem available 22.2GiB @0
Smallest mem available 22.2GiB @0
Device 0 selected
set length 40000000

Array length 40000896 (152MiB)
Data:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort0:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort1:
37 4880726
95 9505224
315 30544587
448 12936177
452 28788976
490 27337079
614 3012490
657 32356371
657 34183315
677 28274133
Sort ok


@daktfi
Author

daktfi commented Nov 15, 2016

I found the problem: it is necessary to check cudaPeekAtLastError()/cudaGetLastError() after the sort. It seems sorting requires an additional amount of video memory beyond the allocated buffers and temp_storage (roughly as much again as twice the size of the keys). Note the row with the device specs in the original post: there were only 904 MB of free memory.
I don't think it's a major bug, but it is still quite inconvenient. I think those extra allocations should be covered by temp_storage.

The setup is:
Kubuntu 16.04 fully updated, gcc 5.4, CUDA 8.0;
Core i7 (I don't think that matters, though), 32 GB RAM, GTX 950 with 2 GB of memory.

@dumerrill
Contributor

The implementation does peek at errors after each kernel launch, e.g.,
https://github.com/NVlabs/cub/blob/1.5.5/cub/device/dispatch/dispatch_radix_sort.cuh#L900.

However, as you mention, this doesn't capture all runtime errors: others only show up when the stream is synchronized with the host (e.g., during a malloc or memcpy). If you want improved CUB debugging, you can set the last, optional debug_synchronous parameter to true; the implementation will then synchronize the stream after each kernel invocation to catch CUDART errors that won't otherwise be reported. (Of course, this incurs the added runtime overhead of synchronizing the device with the host.)
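
As a minimal sketch of both points (checking the returned status, plus debug_synchronous), assuming 32-bit keys and values already resident in device memory:

```cpp
#include <cstdio>
#include <cub/cub.cuh>

// Sort num_items (key, value) pairs with verbose CUB error checking.
cudaError_t checked_sort(unsigned int *d_keys_in, unsigned int *d_keys_out,
                         unsigned int *d_vals_in, unsigned int *d_vals_out,
                         int num_items)
{
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    // First call: with d_temp_storage == NULL, CUB only reports the
    // temp storage size it needs for this problem size.
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
        d_keys_in, d_keys_out, d_vals_in, d_vals_out, num_items);

    cudaError_t error = cudaMalloc(&d_temp_storage, temp_storage_bytes);
    if (error != cudaSuccess) return error;

    // Second call: actually sort. The trailing 'true' is the optional
    // debug_synchronous flag, which synchronizes after every kernel so
    // CUDART errors surface here instead of at some later sync point.
    error = cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
        d_keys_in, d_keys_out, d_vals_in, d_vals_out, num_items,
        0, (int) sizeof(unsigned int) * 8, 0, true);
    if (error != cudaSuccess)
        fprintf(stderr, "SortPairs: %s\n", cudaGetErrorString(error));

    // Catch any error still deferred on the stream.
    cudaError_t sync_error = cudaDeviceSynchronize();
    if (error == cudaSuccess) error = sync_error;

    cudaFree(d_temp_storage);
    return error;
}
```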


@daktfi
Author

daktfi commented Nov 16, 2016

Thanks for the advice on debugging, that will be quite useful.
However, my point here is not about error reporting (that's my fault - I'm quite new to CUDA and missed a few points in the manual), but about memory consumption. If I have already allocated TWO buffers for keys, TWO buffers for values, and even some extra temporary storage, why does the sorting fail with an error like "can't allocate device memory"?
I don't care how much memory it needs to sort the data, I just want to be able to allocate that amount. And to do that, I have to know how big it is. Not a big deal, I repeat, but still a little annoying...
To be specific, I'm sorting rather large arrays (over 1B key-value pairs, hopefully both 64-bit). To do that, I split them into device-sortable blocks (and then merge them later, but that is completely irrelevant here)... To calculate the proper length for those smaller blocks, I need to know the exact memory consumption, and - oops! - I can't. :-)
After some runs of the attached test, I figured the proper ratio to be about three: to sort 1M pairs of 32-bit key + 32-bit value I need about 24 MB of memory, while the total combined size of the allocated buffers and temporary storage is noticeably smaller (I don't have a PC with a working CUDA setup right now to check exactly). Hope this number helps someone. :-)
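
For what it's worth, here is a sketch of that rule of thumb; the factor of three is only the empirical observation above, not a documented CUB bound, and max_sortable_pairs is a hypothetical helper name:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Estimate the largest block of (key, value) pairs that should sort
// safely on the current device, using the empirical ~3x ratio above.
size_t max_sortable_pairs(size_t key_bytes, size_t value_bytes)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    // Observed: ~24 MB consumed per 1M pairs of 32-bit key + 32-bit
    // value (8 MB of raw data), i.e. roughly 3x the raw pair size.
    return free_bytes / (3 * (key_bytes + value_bytes));
}
```

For the GTX 950 in the original post (904 MB free), this gives 904 MiB / 24 B ≈ 39.5M 32-bit pairs, which lines up with the failures starting at around 40M.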

@dumerrill
Contributor

why does the sorting fail with an error like "can't allocate device memory"?

The sorting won't fail with a memory allocation error. If that's the error you're getting from CUB, then the program had already failed, and CUB is simply returning a latent error from an earlier failed attempt to allocate memory that was never cleared.

I think those extra allocations must be covered under temp_storage.

CUB does no allocation whatsoever. Everything its sorting needs is bundled up in the temp storage, which you can allocate well in advance (even conservatively, using an upper bound on the problem size, if one is available). In general, CUDA device memory allocation is a stream-blocking, host-synchronizing event, and CUB doesn't want to impose that on an application right in the middle of what the application presumes to be an asynchronous stream computation.
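
One way to do that conservative up-front allocation, assuming 32-bit keys/values and a known upper bound max_items (the NULL-pointer size query is the standard two-phase CUB calling convention):

```cpp
#include <cub/cub.cuh>

// Query and allocate, once, enough temp storage for up to max_items
// pairs; any later sort of num_items <= max_items can reuse it.
void *alloc_sort_temp_storage(size_t &temp_storage_bytes, int max_items)
{
    void *d_temp_storage = NULL;
    temp_storage_bytes = 0;

    // With d_temp_storage == NULL, CUB only computes the required size;
    // the key/value pointers are not dereferenced in this mode.
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
        (unsigned int *) NULL, (unsigned int *) NULL,
        (unsigned int *) NULL, (unsigned int *) NULL, max_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    return d_temp_storage;
}
```

The returned block and the same temp_storage_bytes can then be passed to every subsequent SortPairs call, so no allocation happens mid-stream.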
