support AVX-512 instructions #24

Open
drossetti opened this issue Dec 6, 2017 · 4 comments

@drossetti
Member

No description provided.

@imaginary-person

In general, using an AVX-512-based memcpy is a bad idea.

Here is how gdr_copy_from_mapping performs with AVX-512. (In fact, its SSE4.1 version is faster than its AVX version, and the source code prefers it over the AVX version.)

gdr_copy_from_mapping num iters for each size: 100
Test                     Size(B)         Avg.Time(us)
DBG:  using AVX512 implementation of gdr_copy_from_bar
gdr_copy_from_mapping           1             0.9811
gdr_copy_from_mapping           2             1.2646
gdr_copy_from_mapping           4             1.2648
gdr_copy_from_mapping           8             1.2640
gdr_copy_from_mapping          16             1.8958
gdr_copy_from_mapping          32             3.1540
gdr_copy_from_mapping          64             0.6476
gdr_copy_from_mapping         128             1.2858
gdr_copy_from_mapping         256             2.5581
gdr_copy_from_mapping         512             5.0851
gdr_copy_from_mapping        1024            10.2162
gdr_copy_from_mapping        2048            24.0402
gdr_copy_from_mapping        4096            44.5810
gdr_copy_from_mapping        8192            81.9428
gdr_copy_from_mapping       16384           170.7200
gdr_copy_from_mapping       32768           341.2040
gdr_copy_from_mapping       65536           675.1082
gdr_copy_from_mapping      131072          1357.5815
gdr_copy_from_mapping      262144          2706.2129
gdr_copy_from_mapping      524288          5425.6831
gdr_copy_from_mapping     1048576         10837.6549
gdr_copy_from_mapping     2097152         21672.5916
gdr_copy_from_mapping     4194304         55437.2406
gdr_copy_from_mapping     8388608        110991.1427
gdr_copy_from_mapping    16777216        222043.6687
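
As a reading aid, the timings above can be converted into effective bandwidth. The following is a minimal sketch in Python that uses three rows copied from the output above; the helper name `bandwidth_mb_s` is made up for this illustration and is not part of gdrcopy.

```python
# Rows copied from the gdr_copy_from_mapping output above:
# (size in bytes, average time in microseconds).
FROM_MAPPING = [(64, 0.6476), (4096, 44.5810), (16777216, 222043.6687)]


def bandwidth_mb_s(size_bytes, avg_time_us):
    """Effective bandwidth in MB/s (1 MB = 10**6 bytes).

    Bytes per microsecond is numerically equal to MB per second.
    """
    return size_bytes / avg_time_us


for size, avg_time in FROM_MAPPING:
    print(f"{size:>9} B  {bandwidth_mb_s(size, avg_time):7.1f} MB/s")
```

For these rows the result stays below 100 MB/s, and the time roughly doubles each time the size doubles from 64 B upward, which suggests a fixed cost per 64-byte read rather than a cost that is amortized over larger copies.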

@drossetti
Member Author

Thank you for taking a look.
Which CPU, GPU, and PCIe topology did you test on?
Can you also report copy_to_mapping performance?

@imaginary-person

Thanks for your response!

CPU - Intel Xeon Silver 4114 (Skylake)
GPU - Tesla P100-PCIE-12GB
CUDA version - 11.4

Here are the gdr_copy_to_mapping numbers for AVX-512:

gdr_copy_to_mapping num iters for each size: 10000

Test                    Size(B)     Avg.Time(us)
gdr_copy_to_mapping           1           0.1250
gdr_copy_to_mapping           2           0.1245
gdr_copy_to_mapping           4           0.1245
gdr_copy_to_mapping           8           0.1222
gdr_copy_to_mapping          16           0.1263
gdr_copy_to_mapping          32           0.1252
gdr_copy_to_mapping          64           0.1280
gdr_copy_to_mapping         128           0.1376
gdr_copy_to_mapping         256           0.1439
gdr_copy_to_mapping         512           0.1550
gdr_copy_to_mapping        1024           0.1927
gdr_copy_to_mapping        2048           0.2631
gdr_copy_to_mapping        4096           0.4262
gdr_copy_to_mapping        8192           0.8239
gdr_copy_to_mapping       16384           1.6179
gdr_copy_to_mapping       32768           3.2132
gdr_copy_to_mapping       65536           6.4094
gdr_copy_to_mapping      131072          12.7935
gdr_copy_to_mapping      262144          25.5790
gdr_copy_to_mapping      524288          51.1738
gdr_copy_to_mapping     1048576         102.2248
gdr_copy_to_mapping     2097152         204.4293
gdr_copy_to_mapping     4194304         409.7942
gdr_copy_to_mapping     8388608         822.7885
gdr_copy_to_mapping    16777216        1683.7191
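
To make the gap between the two directions concrete, here is a small Python sketch that compares a few rows of the gdr_copy_to_mapping output with the matching rows of the earlier gdr_copy_from_mapping output. The numbers are copied from this thread; the function name `slowdown` is invented for the example. The two runs used different iteration counts (10000 vs. 100), so treat the ratios as rough.

```python
# size in bytes -> average time in microseconds, copied from the two
# benchmark outputs posted in this thread.
TO_MAPPING = {64: 0.1280, 4096: 0.4262, 16777216: 1683.7191}
FROM_MAPPING = {64: 0.6476, 4096: 44.5810, 16777216: 222043.6687}


def slowdown(size_bytes):
    """How many times longer a read from the mapping takes than a write of the same size."""
    return FROM_MAPPING[size_bytes] / TO_MAPPING[size_bytes]


for size in TO_MAPPING:
    # bytes/us equals MB/s; divide by 1000 to get GB/s (10**9 bytes per second).
    write_gb_s = size / TO_MAPPING[size] / 1000
    print(f"{size:>9} B  write {write_gb_s:6.2f} GB/s  read takes {slowdown(size):6.1f}x as long")
```

With these numbers, writes to the mapping reach roughly 10 GB/s at 16 MiB, while reads of the same size take more than a hundred times as long.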

I'm not sure about the PCIe topology, but here is the output of lspci -tv:

-+-[0000:d7]-+-05.0  Intel Corporation Device 2034
 |           +-05.2  Intel Corporation Sky Lake-E RAS Configuration Registers
 |           +-05.4  Intel Corporation Device 2036
 |           +-0e.0  Intel Corporation Device 2058
 |           +-0e.1  Intel Corporation Device 2059
 |           +-0f.0  Intel Corporation Device 2058
 |           +-0f.1  Intel Corporation Device 2059
 |           +-12.0  Intel Corporation Sky Lake-E M3KTI Registers
 |           +-12.1  Intel Corporation Sky Lake-E M3KTI Registers
 |           +-12.2  Intel Corporation Sky Lake-E M3KTI Registers
 |           +-15.0  Intel Corporation Sky Lake-E M2PCI Registers
 |           +-16.0  Intel Corporation Sky Lake-E M2PCI Registers
 |           \-16.4  Intel Corporation Sky Lake-E M2PCI Registers
 +-[0000:ae]-+-05.0  Intel Corporation Device 2034
 |           +-05.2  Intel Corporation Sky Lake-E RAS Configuration Registers
 |           +-05.4  Intel Corporation Device 2036
 |           +-08.0  Intel Corporation Device 2066
 |           +-09.0  Intel Corporation Device 2066
 |           +-0a.0  Intel Corporation Device 2040
 |           +-0a.1  Intel Corporation Device 2041
 |           +-0a.2  Intel Corporation Device 2042
 |           +-0a.3  Intel Corporation Device 2043
 |           +-0a.4  Intel Corporation Device 2044
 |           +-0a.5  Intel Corporation Device 2045
 |           +-0a.6  Intel Corporation Device 2046
 |           +-0a.7  Intel Corporation Device 2047
 |           +-0b.0  Intel Corporation Device 2048
 |           +-0b.1  Intel Corporation Device 2049
 |           +-0b.2  Intel Corporation Device 204a
 |           +-0b.3  Intel Corporation Device 204b
 |           +-0c.0  Intel Corporation Device 2040
 |           +-0c.1  Intel Corporation Device 2041
 |           +-0c.2  Intel Corporation Device 2042
 |           +-0c.3  Intel Corporation Device 2043
 |           +-0c.4  Intel Corporation Device 2044
 |           +-0c.5  Intel Corporation Device 2045
 |           +-0c.6  Intel Corporation Device 2046
 |           +-0c.7  Intel Corporation Device 2047
 |           +-0d.0  Intel Corporation Device 2048
 |           +-0d.1  Intel Corporation Device 2049
 |           +-0d.2  Intel Corporation Device 204a
 |           \-0d.3  Intel Corporation Device 204b
 +-[0000:85]-+-00.0-[86]----00.0  NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB]

@imaginary-person

imaginary-person commented Jul 16, 2021

One caveat: I probably could have compiled with the -mavx512vl flag to make up to 32 ymm registers available to both the AVX and AVX2 code paths, but I didn't. I wonder whether the loop unrolling in the source code should be tweaked if 32 registers are to be used instead of the default 16.
