Special-case mm, mv and vv contractions in eager path #1122

Open
suranap opened this issue Feb 6, 2024 · 1 comment

Labels
category:improvement PR introduces an improvement and will be classified as such in release notes

suranap commented Feb 6, 2024

Software versions

Python : 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
Platform : Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Legion : v23.11.00.dev-54-g40f6061
Legate : 23.11.00.dev+54.g40f6061
Cunumeric : 23.11.00.dev+36.gb2912ed7
Numpy : 1.26.4
Scipy : 1.12.0
Numba : 0.59.0
CTK package : cuda-version-11.7-h67201e3_2 (conda-forge)
GPU driver : 535.54.03
GPU devices :
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

Expected Eager arrays to be closer to NumPy performance.

Observed behavior

--------------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                      Min                   Max                  Mean             StdDev                Median                IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_cunumeric_linear_defer         1.8493 (1.0)          2.2592 (1.0)          1.9831 (1.0)       0.1007 (1.0)          1.9785 (1.0)       0.1476 (1.0)          15;0  504.2546 (1.0)          46           1
test_cunumeric_linear_numpy       179.1552 (96.88)      180.6841 (79.98)      179.6776 (90.60)     0.6576 (6.53)       179.3563 (90.65)     1.1339 (7.68)          1;0    5.5655 (0.01)          6           1
test_linear_numpy                 198.9235 (107.57)     225.0926 (99.63)      211.5238 (106.66)   11.8600 (117.79)     216.5387 (109.45)   20.6418 (139.85)        3;0    4.7276 (0.01)          5           1
test_cunumeric_linear_eager     1,593.4713 (861.67)   1,598.5898 (707.58)   1,594.9393 (804.26)    2.0885 (20.74)    1,594.0862 (805.71)    1.9122 (12.95)         1;1    0.6270 (0.00)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

Example code or instructions

import cunumeric as np
import numpy as onp

# pip install pytest-benchmark 

def linear(x, weight, bias):
    out = x @ weight.T + bias
    return out


def test_linear_numpy(benchmark):
    """Use regular numpy, no cunumeric at all
    """
    x = onp.random.normal(size=(6144, 768))
    weight = onp.random.normal(size=(768,768))
    bias = onp.random.normal(size=(6144, 768))

    out = benchmark(linear, x, weight, bias)


def test_cunumeric_linear_numpy(benchmark):
    """Extract numpy arrays inside Eager arrays to get baseline numpy performance. 
    """
    orig = np.runtime.max_eager_volume 
    np.runtime.max_eager_volume = 1_000_000_000_000
    x_a = np.random.normal(size=(6144, 768))
    weight_a = np.random.normal(size=(768,768))
    bias_a = np.random.normal(size=(6144, 768))

    x = x_a._thunk.array
    weight = weight_a._thunk.array
    bias = bias_a._thunk.array

    out = benchmark(linear, x, weight, bias)
    np.runtime.max_eager_volume = orig
    
def test_cunumeric_linear_eager(benchmark):
    """Run large eager array to show overhead compared to numpy
    """
    orig = np.runtime.max_eager_volume 
    np.runtime.max_eager_volume = 1_000_000_000_000
    x = np.random.normal(size=(6144, 768))
    weight = np.random.normal(size=(768,768))
    bias = np.random.normal(size=(6144, 768))

    out = benchmark(linear, x, weight, bias)
    np.runtime.max_eager_volume = orig


def test_cunumeric_linear_defer(benchmark):
    """Performance of deferred array on GPU(s)
    """
    x = np.random.normal(size=(6144, 768))
    weight = np.random.normal(size=(768,768))
    bias = np.random.normal(size=(6144, 768))

    out = benchmark(linear, x, weight, bias)

Stack traceback or browser console output

No response

manopapad added the category:improvement label on Feb 14, 2024

manopapad (Contributor) commented:

Thank you for reporting this. The issue here is that all tensor-contraction-like operations go through the same interface (module.py:_contract); inside deferred.py:contract we then look for cases that allow us to take the "fast" paths for vector-vector, matrix-vector and matrix-matrix multiplication. The eager implementation of contract doesn't do this optimization and always falls back to np.einsum, which apparently runs much slower than np.matmul, even when the einsum expression corresponds to a matrix-matrix multiplication.
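
As a rough NumPy-only illustration of that einsum-vs-matmul gap (this timing snippet is a sketch using the same array shapes as the benchmark; best_of is a made-up helper, not part of cunumeric):

import time
import numpy as np

x = np.random.normal(size=(6144, 768))
w = np.random.normal(size=(768, 768))

def best_of(fn, repeats=5):
    # best wall-clock time over a few repeats
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

# einsum expression equivalent to x @ w.T, roughly what the generic fallback computes
print("einsum:", best_of(lambda: np.einsum("ij,kj->ik", x, w)))
# BLAS-backed path
print("matmul:", best_of(lambda: x @ w.T))

With its default settings np.einsum does not dispatch to the BLAS matmul kernel, which is consistent with the gap measured above.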

We should handle "fast path" operations like dot, matmul etc. directly in module.py and array.py. We still want to keep the "check for fast paths" in deferred.py:contract, so we still recognize cases where an einsum contraction description can be executed efficiently, but when we know in module.py or array.py that we're in a fast path, we should just emit the corresponding task directly.
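
A minimal sketch of what that special-casing could look like on the eager side (the function name and the way the contraction is described here are hypothetical, not the actual cunumeric internals):

import numpy as np

def eager_contract(expr, a, b):
    # Hypothetical eager-path contract: recognize the plain mm, mv and vv
    # contractions and dispatch to the BLAS-backed NumPy routines instead of
    # always falling back to the generic einsum implementation.
    if expr == "ij,jk->ik":       # matrix-matrix
        return np.matmul(a, b)
    if expr == "ij,j->i":         # matrix-vector
        return np.matmul(a, b)
    if expr == "i,i->":           # vector-vector (inner product)
        return np.dot(a, b)
    return np.einsum(expr, a, b)  # everything else: generic fallback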

manopapad changed the title from "Eager arrays 10x slower than Numpy" to "Special-case mm, mv and vv contractions in eager path" on Feb 14, 2024