Special-case mm, mv and vv contractions in eager path #1122

Open
suranap opened this issue Feb 6, 2024 · 1 comment

Labels
category:improvement PR introduces an improvement and will be classified as such in release notes

suranap commented Feb 6, 2024

Software versions

Python : 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
Platform : Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Legion : v23.11.00.dev-54-g40f6061
Legate : 23.11.00.dev+54.g40f6061
Cunumeric : 23.11.00.dev+36.gb2912ed7
Numpy : 1.26.4
Scipy : 1.12.0
Numba : 0.59.0
CTK package : cuda-version-11.7-h67201e3_2 (conda-forge)
GPU driver : 535.54.03
GPU devices :
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

Expected Eager arrays to be closer to NumPy performance.

Observed behavior

--------------------------------------------------------------------------------------------- benchmark: 4 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                      Min                   Max                  Mean             StdDev                Median                IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_cunumeric_linear_defer         1.8493 (1.0)          2.2592 (1.0)          1.9831 (1.0)       0.1007 (1.0)          1.9785 (1.0)       0.1476 (1.0)          15;0  504.2546 (1.0)          46           1
test_cunumeric_linear_numpy       179.1552 (96.88)      180.6841 (79.98)      179.6776 (90.60)     0.6576 (6.53)       179.3563 (90.65)     1.1339 (7.68)          1;0    5.5655 (0.01)          6           1
test_linear_numpy                 198.9235 (107.57)     225.0926 (99.63)      211.5238 (106.66)   11.8600 (117.79)     216.5387 (109.45)   20.6418 (139.85)        3;0    4.7276 (0.01)          5           1
test_cunumeric_linear_eager     1,593.4713 (861.67)   1,598.5898 (707.58)   1,594.9393 (804.26)    2.0885 (20.74)    1,594.0862 (805.71)    1.9122 (12.95)         1;1    0.6270 (0.00)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

Example code or instructions

import cunumeric as np
import numpy as onp

# pip install pytest-benchmark 

def linear(x, weight, bias):
    out = x @ weight.T + bias
    return out


def test_linear_numpy(benchmark):
    """Use regular numpy, no cunumeric at all
    """
    x = onp.random.normal(size=(6144, 768))
    weight = onp.random.normal(size=(768,768))
    bias = onp.random.normal(size=(6144, 768))

    out = benchmark(linear, x, weight, bias)


def test_cunumeric_linear_numpy(benchmark):
    """Extract numpy arrays inside Eager arrays to get baseline numpy performance. 
    """
    orig = np.runtime.max_eager_volume 
    np.runtime.max_eager_volume = 1_000_000_000_000
    x_a = np.random.normal(size=(6144, 768))
    weight_a = np.random.normal(size=(768,768))
    bias_a = np.random.normal(size=(6144, 768))

    x = x_a._thunk.array
    weight = weight_a._thunk.array
    bias = bias_a._thunk.array

    out = benchmark(linear, x, weight, bias)
    np.runtime.max_eager_volume = orig
    
def test_cunumeric_linear_eager(benchmark):
    """Run large eager array to show overhead compared to numpy
    """
    orig = np.runtime.max_eager_volume 
    np.runtime.max_eager_volume = 1_000_000_000_000
    x = np.random.normal(size=(6144, 768))
    weight = np.random.normal(size=(768,768))
    bias = np.random.normal(size=(6144, 768))

    out = benchmark(linear, x, weight, bias)
    np.runtime.max_eager_volume = orig


def test_cunumeric_linear_defer(benchmark):
    """Performance of deferred array on GPU(s)
    """
    x = np.random.normal(size=(6144, 768))
    weight = np.random.normal(size=(768,768))
    bias = np.random.normal(size=(6144, 768))

    out = benchmark(linear, x, weight, bias)

Stack traceback or browser console output

No response

manopapad added the category:improvement label on Feb 14, 2024

manopapad (Contributor) commented:

Thank you for reporting this. The issue here is that all tensor-contraction-like operations go through the same interface (module.py:_contract); inside deferred.py:contract we then look for cases that allow us to take the "fast" paths for vector-vector, matrix-vector and matrix-matrix multiplication. The eager implementation of contract doesn't do this optimization and always falls back to np.einsum, which apparently runs much slower than np.matmul, even when the einsum expression corresponds to a matrix-matrix multiplication.
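
As a rough NumPy-only illustration of that einsum-vs-matmul gap (this timing snippet is a sketch using the same array shapes as the benchmark; best_of is a made-up helper, not part of cunumeric):

import time
import numpy as np

x = np.random.normal(size=(6144, 768))
w = np.random.normal(size=(768, 768))

def best_of(fn, repeats=5):
    # best wall-clock time over a few repeats
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

# einsum expression equivalent to x @ w.T, roughly what the generic fallback computes
print("einsum:", best_of(lambda: np.einsum("ij,kj->ik", x, w)))
# BLAS-backed path
print("matmul:", best_of(lambda: x @ w.T))

With its default settings np.einsum does not dispatch to the BLAS matmul kernel, which is consistent with the gap measured above.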

We should handle "fast path" operations like dot, matmul etc. directly in module.py and array.py. We still want to keep the "check for fast paths" in deferred.py:contract, so we still recognize cases where an einsum contraction description can be executed efficiently, but when we know in module.py or array.py that we're in a fast path, we should just emit the corresponding task directly.
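
A minimal sketch of what that special-casing could look like on the eager side (the function name and the way the contraction is described here are hypothetical, not the actual cunumeric internals):

import numpy as np

def eager_contract(expr, a, b):
    # Hypothetical eager-path contract: recognize the plain mm, mv and vv
    # contractions and dispatch to the BLAS-backed NumPy routines instead of
    # always falling back to the generic einsum implementation.
    if expr == "ij,jk->ik":       # matrix-matrix
        return np.matmul(a, b)
    if expr == "ij,j->i":         # matrix-vector
        return np.matmul(a, b)
    if expr == "i,i->":           # vector-vector (inner product)
        return np.dot(a, b)
    return np.einsum(expr, a, b)  # everything else: generic fallback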

manopapad changed the title from "Eager arrays 10x slower than Numpy" to "Special-case mm, mv and vv contractions in eager path" on Feb 14, 2024