# A need for contiguity-based axis-order optimization in tensordot #11940

The ordering of axes fed to `tensordot` can have a massive (order-of-magnitude) impact on its efficiency, based on the memory layout of the array(s) being summed, and moving `x`'s axis leads to a swap in which orderings are fast and which are slow (the timings are reproduced in the experiments comment below). As suggested by @eric-wieser, `tensordot` would benefit from axis ordering based on memory contiguity, to help guard against these massive slowdowns.
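One concrete shape such an optimization could take — a minimal sketch only, with `contiguity_ordered_axes` a hypothetical helper and not NumPy API — is to reorder the contracted axis pairs to follow the first operand's strides:

```python
import numpy as np

def contiguity_ordered_axes(a, b, axes):
    """Hypothetical helper: reorder the contracted axis pairs so that a's
    axes run from largest to smallest stride, keeping the transpose inside
    tensordot as close to a's memory layout as possible. The
    (axis_a, axis_b) pairing is preserved, so tensordot's result is
    unchanged. (b is unused here; it only supplies the paired axes.)"""
    axes_a, axes_b = axes
    order = sorted(range(len(axes_a)), key=lambda i: -a.strides[axes_a[i]])
    return (tuple(axes_a[i] for i in order), tuple(axes_b[i] for i in order))

x = np.random.rand(100, 100, 100)
xt = np.moveaxis(x, -1, 0)
# For xt (whose axis 0 has the smallest stride), the slow ordering
# ((0, 1, 2), (0, 1, 2)) is rewritten to the fast ((1, 2, 0), (1, 2, 0)):
print(contiguity_ordered_axes(xt, xt, ((0, 1, 2), (0, 1, 2))))
```

Permuting the axis pairs consistently leaves the contraction result unchanged, so the rewrite is always safe; the catch is that an order chosen for `a` alone may still leave `b`'s transposed view non-contiguous.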
## Comments

I've run into this problem in my projects, and found that the origin of the difference comes from the transpose + `reshape` step inside `tensordot`. If we have to generate a new matrix there, then it's very time-consuming. I think a general way to solve this problem is not easy. I'm happy to prepare a PR for this, but I think I should hear more advice first.
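For context, this is roughly what `np.tensordot` does for a full contraction — a simplified paraphrase of the all-axes path, not the exact NumPy source:

```python
import numpy as np

def tensordot_full(a, b, axes_a, axes_b):
    # Transpose each operand into the requested axis order, flatten to
    # 2-D, and hand off to dot. reshape silently copies whenever the
    # transposed view is not contiguous -- that copy is the suspect here.
    N = a.size
    at = a.transpose(axes_a).reshape(1, N)
    bt = b.transpose(axes_b).reshape(N, 1)
    return np.dot(at, bt).reshape(())
```

The `dot` call is identical either way; the copy hidden inside `reshape` is what the timings below isolate.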
Here are the results of some experiments. First, repeating the previous results:

```python
>>> import numpy as np
>>> x = np.random.rand(100, 100, 100)
>>> %timeit np.tensordot(x, x, axes=((0, 1, 2), (0, 1, 2)))
... 1000 loops, best of 3: 547 µs per loop
>>> %timeit np.tensordot(x, x, axes=((1, 2, 0), (1, 2, 0)))
... 100 loops, best of 3: 13.8 ms per loop
>>> xt = np.moveaxis(x, -1, 0)
>>> %timeit np.tensordot(xt, xt, axes=((0, 1, 2), (0, 1, 2)))
... 100 loops, best of 3: 17 ms per loop
>>> %timeit np.tensordot(xt, xt, axes=((1, 2, 0), (1, 2, 0)))
... 1000 loops, best of 3: 498 µs per loop
```

Everything goes "as expected".
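Until `tensordot` does this itself, a user-side workaround suggested by these timings (a sketch, not an official recipe) is to order the contracted axes by stride before calling it:

```python
import numpy as np

x = np.random.rand(100, 100, 100)
xt = np.moveaxis(x, -1, 0)           # xt.strides == (8, 80000, 800)
# Contract over the axes in decreasing-stride order, so the transpose
# inside tensordot stays close to xt's actual memory layout:
order = sorted(range(xt.ndim), key=lambda i: -xt.strides[i])
axes = (tuple(order), tuple(order))  # one order fits both operands: both are xt
assert axes == ((1, 2, 0), (1, 2, 0))
result = np.tensordot(xt, xt, axes=axes)  # the ~500 µs ordering, not the ~17 ms one
```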
Next, time the transpose + reshape and the dot separately for `x`:

```python
>>> %timeit x.transpose((0, 1, 2)).reshape(100 ** 3)
... The slowest run took 7.41 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.23 µs per loop
>>> x_ = x.transpose((0, 1, 2)).reshape(100 ** 3)
>>> %timeit x_.dot(x_)
... 1000 loops, best of 3: 340 µs per loop
>>> %timeit x.transpose((1, 2, 0)).reshape(100 ** 3)
... 100 loops, best of 3: 5.55 ms per loop
>>> x_ = x.transpose((1, 2, 0)).reshape(100 ** 3)
>>> %timeit x_.dot(x_)
... 1000 loops, best of 3: 344 µs per loop
```

I'm not sure what the "cache" thing means, but the conclusion is that the `dot` itself costs about the same (~340 µs) either way; the whole difference lies in the transpose + reshape, which takes microseconds when it can return a view and milliseconds when it has to copy. The same pattern holds for `xt`:
```python
>>> %timeit xt.transpose((0, 1, 2)).reshape(100 ** 3)
... 100 loops, best of 3: 7.85 ms per loop
>>> xt_ = xt.transpose((0, 1, 2)).reshape(100 ** 3)
>>> %timeit xt_.dot(xt_)
... 1000 loops, best of 3: 345 µs per loop
>>> %timeit xt.transpose((1, 2, 0)).reshape(100 ** 3)
... The slowest run took 8.00 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.28 µs per loop
>>> xt_ = xt.transpose((1, 2, 0)).reshape(100 ** 3)
>>> %timeit xt_.dot(xt_)
... 1000 loops, best of 3: 345 µs per loop
```
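The µs-scale reshapes above return views, while the ms-scale ones copy all 10⁶ elements — a quick check (an illustrative snippet, not from the original thread):

```python
import numpy as np

x = np.random.rand(100, 100, 100)
# An identity transpose keeps C-contiguity, so reshape can return a view:
v = x.transpose((0, 1, 2)).reshape(100 ** 3)
print(np.shares_memory(x, v))  # True  -> no data movement, hence ~1 µs
# A genuine transpose breaks contiguity, so reshape must copy:
c = x.transpose((1, 2, 0)).reshape(100 ** 3)
print(np.shares_memory(x, c))  # False -> 10**6-element copy, hence ~5-8 ms
```

The "intermediate result is being cached" warning is likely just IPython's timeit flagging the high run-to-run variance of such a cheap, view-returning operation.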