Slow Random Initialization on AMD CPU. #19057
Comments
You are likely having threading problems.
As for random, use The old version in
Thanks for your suggestions, but using more threads does not help. The issue is probably caused by the system or some other configuration, such as LAPACK or OpenBLAS. As an alternative, I'd like to re-implement the NumPy code with PyTorch, which performs normally.
This is really strange. Ran your code on my machine and saw timings of 2s and 5s for the two main tasks. Have you tried using conda-provided Python and NumPy as a check that it isn't something to do with the particulars of a specific compiler? From your logs it looks like you are compiling your own version.
Yes, the initialization completes instantly on other CPUs (i.e., CPU is By the way, `cupy` runs very fast (the same initialization takes about 0.1 s), since it works on the GPU (an RTX 3090 here).
The idea is to use fewer threads, maybe try 1. The timings you report are so bad I'd almost suspect swap memory is in play.
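A rough sketch of both checks suggested above, in case it helps: capping the native thread pools at 1, and estimating how much memory the test arrays need. The environment variables are the usual ones read by OpenMP, OpenBLAS, and MKL; whether they influence the random or argsort timings in this report is an assumption, since they only limit BLAS/OpenMP thread pools. The size estimate uses `n = 10000` from the report and allocates nothing.

```python
import os

# Cap native thread pools. These must be set before NumPy is first imported,
# because the underlying libraries read them when they are loaded.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np

n = 10000  # matrix size from the report

# Memory needed for the float64 matrix A and for the integer index array
# returned by argsort, computed arithmetically (no allocation).
a_bytes = n * n * np.dtype(np.float64).itemsize
idx_bytes = n * n * np.dtype(np.intp).itemsize

print("A needs about %.0f MB" % (a_bytes / 1e6))
print("argsort result needs about %.0f MB" % (idx_bytes / 1e6))
print("combined: about %.1f GB" % ((a_bytes + idx_bytes) / 1e9))
```

If the combined figure is close to the free RAM on the machine, swapping becomes a plausible explanation for the timings.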
The
Here's the stripped-down version (using the preferred modern PRNG stuff that's significantly faster):
import numpy as np
import time
def print_time(string=''):
    res = time.strftime("%Y-%m-%d-%H:%M:%S", time.localtime())
    res = res + ' | ' + string
    print(res)
np.__config__.show()
rng = np.random.default_rng(0xd6992220051247b7be44a45e1580c507)
# --- Test 5
n = 10000
t = time.time()
A = rng.standard_normal([n, n])
td = time.time() - t
print_time("Random Initialize (%d,%d) matrix in %0.3f s" % (n, n, td))
# --- Test 6
t = time.time()
B = np.argsort(A, axis=-1)
td = time.time() - t
print_time("ArgSort (%d,%d) matrix in %0.3f s" % (n, n, td))
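For a quicker sanity check, here is a smaller sketch that times the legacy `RandomState` generator against the newer `default_rng` generator, plus the argsort step. The size `n = 2000`, the seed `0`, and the `timed` helper are my own choices for illustration, not from the original script; the timings depend on the machine, so no expected numbers are given.

```python
import time

import numpy as np


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0


n = 2000  # assumed smaller size so the check finishes quickly

legacy = np.random.RandomState(0)   # legacy Mersenne Twister interface
modern = np.random.default_rng(0)   # newer Generator interface

a_old, t_old = timed(legacy.standard_normal, (n, n))
a_new, t_new = timed(modern.standard_normal, (n, n))
idx, t_sort = timed(np.argsort, a_new)  # sorts along the last axis by default

print("legacy standard_normal: %0.3f s" % t_old)
print("modern standard_normal: %0.3f s" % t_new)
print("argsort:                %0.3f s" % t_sort)
```

If even this small case takes many seconds on the affected machine, the problem is unlikely to be memory pressure and more likely something in the build or the environment.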
The random and argsort operations on an AMD CPU are so slow that randomly initializing a 10000*10000 matrix takes 30 minutes!
For comparison, the same initialization on an Intel CPU takes about 5 s.
When using OpenBLAS for acceleration, the time is much lower (about 17 s), but that is still about 3x the Intel CPU's time. The code also often gets stuck somewhere during random initialization or argsort.
CPU: 128 AMD EPYC 7502 32-Core Processor
Numpy: 1.20.3
Reproducing code example:
Error message:
OpenBLAS
NumPy/Python version information:
Numpy: 1.20.3
Python: 3.8