Slow Random Initialization on AMD CPU. #19057

Open · baist opened this issue May 21, 2021 · 8 comments

baist commented May 21, 2021

The random and argsort operations on this AMD CPU are so slow that random initialization of a 10000x10000 matrix takes about 30 minutes! For comparison, the same initialization on an Intel CPU takes about 5 s.

When OpenBLAS is used for acceleration, the time cost is much lower, but it is still roughly 3x that of the Intel CPU (about 17 s). The code also often gets stuck somewhere while performing random initialization or argsort.

CPU: AMD EPYC 7502 32-Core Processor (128 logical CPUs)
Numpy: 1.20.3

Reproducing code example:

import numpy.random as npr
import numpy as np
import time


def print_time(string=''): 
    res = time.strftime("%Y-%m-%d-%H:%M:%S", time.localtime())
    res = res + ' | ' + string
    print(res)

np.random.seed(1)
np.__config__.show()

# --- Test 1
N = 1
n = 1000

A = npr.randn(n,n)
B = npr.randn(n,n)

t = time.time()
for i in range(N):
    C = np.dot(A, B)
td = time.time() - t
print_time("dotted two (%d,%d) matrices in %0.1f ms" % (n, n, 1e3*td/N))

# --- Test 2
N = 100
n = 4000

A = npr.randn(n)
B = npr.randn(n)

t = time.time()
for i in range(N):
    C = np.dot(A, B)
td = time.time() - t
print_time("dotted two (%d) vectors in %0.2f us" % (n, 1e6*td/N))

# --- Test 3
m,n = (2000,1000)

A = npr.randn(m,n)

t = time.time()
[U,s,V] = np.linalg.svd(A, full_matrices=False)
td = time.time() - t
print_time("SVD of (%d,%d) matrix in %0.3f s" % (m, n, td))

# --- Test 4
n = 1500
A = npr.randn(n,n)

t = time.time()
w, v = np.linalg.eig(A)
td = time.time() - t
print_time("Eigendecomp of (%d,%d) matrix in %0.3f s" % (n, n, td))




for i in range(10):
    print_time(str(i))
    # --- Test 5
    n = 10000
    t = time.time()
    A = npr.randn(n, n)
    td = time.time() - t
    print_time("Random Initialize (%d,%d) matrix in %0.3f s" % (n, n, td))


    # --- Test 6
    t = time.time()
    B = np.argsort(A, axis=-1)
    td = time.time() - t
    print_time("ArgSort (%d,%d) matrix in %0.3f s" % (n, n, td))

    print(B)

Error message:

blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/username/miniconda3/envs/py38/lib']
    include_dirs = ['/home/username/miniconda3/envs/py38/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/username/miniconda3/envs/py38/lib']
    include_dirs = ['/home/username/miniconda3/envs/py38/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/home/username/miniconda3/envs/py38/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/username/miniconda3/envs/py38/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/username/miniconda3/envs/py38/include']
dotted two (1000,1000) matrices in 246.3 ms
dotted two (4000) vectors in 22.61 us
SVD of (2000,1000) matrix in 20.998 s
Eigendecomp of (1500,1500) matrix in 128.044 s
Random Initialize (10000,10000) matrix in 1951.663 s
ArgSort (10000,10000) matrix in 1148.212 s

With OpenBLAS:

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
accelerate_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
    libraries = ['openblas', 'lapack']
    library_dirs = ['/usr/local/lib']
    language = f77
lapack_opt_info:
    libraries = ['openblas', 'lapack', 'openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('NO_ATLAS_INFO', 1)]
2021-05-21-15:22:23 | dotted two (1000,1000) matrices in 141.2 ms
2021-05-21-15:22:23 | dotted two (4000) vectors in 6.92 us
2021-05-21-15:22:24 | SVD of (2000,1000) matrix in 0.742 s
2021-05-21-15:22:27 | Eigendecomp of (1500,1500) matrix in 2.571 s
2021-05-21-15:22:27 | 0
2021-05-21-15:22:44 | Random Initialize (10000,10000) matrix in 17.062 s
2021-05-21-15:23:14 | ArgSort (10000,10000) matrix in 29.787 s
2021-05-21-15:23:14 | 1
2021-05-21-15:23:29 | Random Initialize (10000,10000) matrix in 15.164 s
2021-05-21-15:27:31 | ArgSort (10000,10000) matrix in 242.202 s

NumPy/Python version information:

Numpy: 1.20.3
Python: 3.8

bashtage (Contributor) commented May 21, 2021

You are likely having threading problems. export OPENBLAS_NUM_THREADS=8 or perhaps 16.
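
A minimal sketch of applying this suggestion from inside Python rather than the shell (the thread count of 8 is only an example; the variable must be set before NumPy is imported):

import os

# Cap OpenBLAS threading before NumPy (and its BLAS) is loaded;
# setting this after "import numpy" has no effect.
os.environ["OPENBLAS_NUM_THREADS"] = "8"

import numpy as np

A = np.random.randn(1000, 1000)
B = np.random.randn(1000, 1000)
C = A @ B  # BLAS-backed matmul now uses at most 8 threads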

bashtage (Contributor) commented May 21, 2021

As for random, use np.random.default_rng(SEED) to get a generator, then gen.standard_normal((n, n)). It takes about 1 second to generate a 10k by 10k matrix on a Ryzen 39?0.

The old code path behind np.random.randn uses a Box-Muller transform, which in turn relies on a polar transform. This is much slower than the modern method used by Generator.
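
A minimal sketch of the suggested switch, for comparison (the seed is arbitrary and absolute timings will vary by machine):

import time
import numpy as np

n = 10000

# Legacy global-state API (polar/Box-Muller normals).
t = time.time()
A_legacy = np.random.randn(n, n)
print("legacy randn:              %.3f s" % (time.time() - t))

# Modern Generator API (PCG64 bit generator with ziggurat normals).
rng = np.random.default_rng(12345)
t = time.time()
A_gen = rng.standard_normal((n, n))
print("Generator standard_normal: %.3f s" % (time.time() - t))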

baist (Author) commented May 21, 2021

You are likely having threading problems. export OPENBLAS_NUM_THREADS=8 or perhaps 16.

Thanks for your suggestions, but more threads do not help. The issue is probably caused by the system or some other part of the configuration, such as LAPACK or OpenBLAS. As an alternative, I'd like to re-implement the NumPy code with PyTorch, which performs normally.
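
A minimal sketch of that PyTorch fallback (assuming PyTorch is installed; CPU tensors, sizes only for illustration):

import torch

n = 10000
torch.manual_seed(1)

# Random normal initialization and argsort on the CPU with PyTorch,
# standing in for npr.randn and np.argsort in the original script.
A = torch.randn(n, n)
B = torch.argsort(A, dim=-1)

# Convert back to NumPy if the rest of the pipeline expects ndarrays.
A_np = A.numpy()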

bashtage (Contributor) commented May 21, 2021

This is really strange. Ran your code on my machine and saw timings of 2s and 5s for the two main tasks.

Have you tried using conda-provided Python and NumPy as a check that it isn't something to do with the particulars of a specific compiler? From your logs it looks like you are compiling your own version.

baist (Author) commented May 21, 2021

This is really strange. Ran your code on my machine and saw timings of 2s and 5s for the two main tasks.

Have you tried using conda-provided Python and NumPy as a check that it isn't something to do with the particulars of a specific compiler? From your logs it looks like you are compiling your own version.

Yes, the initialization completes almost instantly on other CPUs (e.g., an AMD EPYC 7551P or an Intel Xeon Silver 4114). Initially I used the Python and NumPy installed by Conda, but the time cost was still nearly 30 minutes for a 10k x 10k matrix. To address this, I installed OpenBLAS (and LAPACK) and compiled NumPy myself; the matrix operations did become faster, but the initialization did not (17 s to 200+ s).

The CPU is an AMD EPYC 7502 and the OS is Ubuntu 18.04 LTS. Maybe some files were lost during installation; I plan to reinstall the system and test again.

By the way, CuPy runs very fast (the same initialization takes about 0.1 s) since it runs on the GPU (an RTX 3090 here).

charris (Member) commented May 21, 2021

But more threads do not help.

The idea is to use fewer threads; maybe try 1. The timings you report are so bad that I'd almost suspect swap memory is in play.
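
A rough way to sanity-check the swap hypothesis on Linux (a sketch only; a 10000x10000 float64 matrix needs about 0.8 GB, and argsort allocates an equally large int64 result):

import os

n = 10000
needed = n * n * 8 * 2  # float64 matrix plus int64 argsort output, in bytes

# Estimate currently free physical RAM via sysconf; if it is below what the
# test needs, the run could be spilling into swap.
page = os.sysconf("SC_PAGE_SIZE")
avail = os.sysconf("SC_AVPHYS_PAGES") * page
print("needs ~%.1f GB, ~%.1f GB physical RAM free" % (needed / 1e9, avail / 1e9))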

rkern (Member) commented May 21, 2021

The OPENBLAS_NUM_THREADS stuff shouldn't be in play for the final two tests that are in question, as they only use randn() and argsort() and no linear algebra.

rkern (Member) commented May 21, 2021

Here's the stripped-down version (using the preferred modern PRNG stuff that's significantly faster):

import numpy as np
import time


def print_time(string=''): 
    res = time.strftime("%Y-%m-%d-%H:%M:%S", time.localtime())
    res = res + ' | ' + string
    print(res)

np.__config__.show()
rng = np.random.default_rng(0xd6992220051247b7be44a45e1580c507)

# --- Test 5
n = 10000
t = time.time()
A = rng.standard_normal([n, n])
td = time.time() - t
print_time("Random Initialize (%d,%d) matrix in %0.3f s" % (n, n, td))


# --- Test 6
t = time.time()
B = np.argsort(A, axis=-1)
td = time.time() - t
print_time("ArgSort (%d,%d) matrix in %0.3f s" % (n, n, td))
