Skip to content

Feat: Adding Linear Algebra Dot operation support #116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 48 commits into
base: main
Choose a base branch
from

Conversation

SwayamInSync
Copy link
Collaborator

This PR contributes as follows:

  • Ship the dot method within package that supports following operations
    • vector-vector dot product
    • matrix-vector dot product
    • matrix-matrix multiplication
  • Optimized Linear Algebra ops supported by the QBLAS on x86-64 Linux and ARM machine. On windows it fallbacks to naive implementation due to QBLAS incompatibility with MSVC
  • Test Suite to validate dot products between inputs

Images below are the performance comparison

quadblas-x86-64-96-cores

Machine: x86-64 Linux with 96 cores

quadblas-ARM-8-core

Machine: MacOS-Silicon (ARM) with 8 cores

To compile without QBLAS set DISABLE_QUADBLAS as CFLAGS and CXXFLAGS

@SwayamInSync SwayamInSync requested a review from ngoldbaum July 11, 2025 08:19
@SwayamInSync
Copy link
Collaborator Author

Ahh forgot to update the general CI, or should we remove the quaddtype from there, given that build_wheels.yml has those same checks in its CI with more strict and on all platforms

@SwayamInSync SwayamInSync self-assigned this Jul 11, 2025
@juntyr
Copy link
Contributor

juntyr commented Jul 11, 2025

Would the new functionality only be accessed through the dot function or is there a way to call that automatically when using numpy dot, matmul, etc?

@ngoldbaum
Copy link
Member

Unfortunately no, not easily. We'd need to add a new dtype hook to the DType API in NumPy. Worth doing though! See numpy/numpy#28516 which adds a hook for sorting.

@SwayamInSync
Copy link
Collaborator Author

As per @seberg 's suggestion, this now implements numpy.matmul ufunc.

Some extra stuff includes:

  • Refactor of umath
  • Added a release_tracker.md to track the progress (pinned all the available numpy ufuncs and what we support right now)

@ngoldbaum please take a look and let me know if anything needs a fix

@ngoldbaum
Copy link
Member

This is a lot of code! I'll try to give this a once-over but I'm probably not going to have the bandwidth to go over this with a fine-toothed comb. @juntyr I'd appreciate it if you could also do a round of code review. Don't be afraid to ask questions if anything is confusing to you.

@SwayamInSync
Copy link
Collaborator Author

This is a lot of code! I'll try to give this a once-over but I'm probably not going to have the bandwidth to go over this with a fine-toothed comb. @juntyr I'd appreciate it if you could also do a round of code review. Don't be afraid to ask questions if anything is confusing to you.

Yeah the refactor of umath made it big (but those codes are already reviewed), here is what's new

  • umath/matmul.cpp & .h files
  • quadblas_interface.cpp and h files
  • tests/test_dot.py
  • GitHub CI files

@@ -53,11 +54,17 @@ source temp/bin/activate
# Install the package
pip install meson-python numpy pytest

export LDFLAGS="-Wl,-rpath,$SLEEF_DIR/lib"
export LDFLAGS="-Wl,-rpath,$SLEEF_DIR/lib -fopenmp -latomic -lpthread"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work on Windows?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I can add a section for windows build too in README

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if not, maybe add a comment with a different suggestion)

)

import multiprocessing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was used earlier when dot exposed as python package, I'll remove this from here

return NULL;
}

QuadBLAS::set_num_threads(num_threads);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also maybe check for a setting from threadpoolctl: https://github.com/joblib/threadpoolctl

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, yes so by default (during the importing of package) it uses all the available cores but I also expose a python function which directly allow users to change this to whatever they want at any time.

I can add the functionality to just pick similar to numpy using threadpoolctl

if (descr_in1->backend != BACKEND_SLEEF || descr_in2->backend != BACKEND_SLEEF) {
PyErr_SetString(PyExc_NotImplementedError,
"QBLAS-accelerated matmul only supports SLEEF backend. "
"Other backends are not supported with QBLAS.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could maybe add some text that people should open an issue if they want this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is QBLAS intrinsics, for now I implemented it using SLEEF based vectorized operations.
For longdouble I think OpenBLAS and this one I recently found https://github.com/nakatamaho/mplapack are better alternatives.
Although if demanded then we can move towards that direction

Will add the error message to open issue on QBLAS for this rather than here

QuadPrecDTypeObject *descr_out = (QuadPrecDTypeObject *)given_descrs[2];
if (descr_out->backend != target_backend) {
PyErr_SetString(PyExc_NotImplementedError,
"QBLAS-accelerated matmul only supports SLEEF backend for output.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this use exactly the same message as the other error above.

}

return 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't carefully review the strided loop implementations, if someone else wants to do that it would be nice!


# ================================================================================
# UTILITIES
# ================================================================================
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These helpers probably want to live in e.g. a test_utils.py file. There's also almost certainly a few spots you can use them in the other test file to justify moving them out of this file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it



if __name__ == "__main__":
pytest.main([__file__, "-v"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this bit at the end isn't necessary.

Sorry I didn't review the tests carefully - it's too much code for my poor brain...

@SwayamInSync
Copy link
Collaborator Author

Thanks @ngoldbaum will apply all the changes in next commit :) and so sorry this one got hectic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants