ENH: add support for BLIS to numpy.distutils #7294
Conversation
Needs a release note.
@rgommers Could you add the release note?
@charris done. Please don't merge yet though - @matthew-brett wanted to test this.
# [blis]
# libraries = blis
# library_dirs = /home/username/blis/lib
# include_dirs = /home/username/blis/include/blis
This is unusual: normally the include directory is just "include/", and subfolders are referenced via the #include line.
Hm, but that seems to be where BLIS puts its stuff. There should probably be a comment that this needs to be the folder containing cblas.h, which is prefix/blis by default.
Sure, done. Also added a few more notes on compiling BLIS itself.
Seems to work, but the import ends with:

You actually have to put
I also get undefined symbols starting with: I don't get these errors if I specify BLIS in the

Using the
Yeah, BLIS seems to export a Fortran-compatible BLAS interface by default, and only exports the CBLAS interface if specifically configured to do so.
@matthew-brett With your configuration in [blas], numpy will not use it; we need to have
It's of course a bit odd that for

On the other hand, for SciPy it makes little sense. The Fortran interface would be better.
We will need both, and all suitable BLAS libraries provide both. It's just not the default in BLIS, which can and probably should be changed. Maybe one could think about adding two BLAS sections to the config, one for Fortran BLAS and one for CBLAS, but the need hasn't really come up yet.
It's not hard to enable CBLAS in BLIS if that's what we need... @fgvanzee: any reason in particular why BLIS's CBLAS layer is disabled by default?
For reference, here are the BLIS config changes I had to make to get this working on 32-bit:
It works, with
but I'm of course not sure if those are all the changes needed (numpy doesn't exercise that much of BLAS). It's a bit fiddly...
As BLIS defaults to no CBLAS, maybe it is a good idea to add an explicit CBLAS test, like the blas_info class has.
But to encourage testing with BLIS, I'm also fine with merging now; we can sort out the details later.
Here I think you just want a bare
I think this is wrong -- you're saying "make a binary that uses SSE3 instructions, and also all the instruction sets available on the machine where I'm compiling this". For our purposes I think we want something more like
You're not using the sandybridge config, so I think you can discard these changes :-)
These should probably be set in

Also, the default internal integer size should probably be
In a way that can be selected at build time without patching headers, ideally.
@njsmith, CBLAS is disabled in most BLIS configurations by default because very few people I interact with need/use it. :)
@fgvanzee - would it be easy to enable CBLAS with a flag?

Where is the best place to ask about the default SSE2 and SSE3 templates? I think we're really close to being able to use BLIS by default, but at the moment the reference implementation is very often selected by the template-selection algorithm, and that is too slow compared to the alternatives. Is there any help we can offer in exchange?
@matthew-brett: from looking at the autodetection code, I think it's just poorly written. Two obvious problems jump out at me: (a) if the CPU supports AVX but the OS does not, then it falls back on reference instead of falling back on SSE3; (b) since it has a hard-coded table of every CPU model, it needs to be updated every time a new CPU type is released. So e.g. it doesn't know anything about the latest Skylake processors, and therefore they get the reference kernel instead of AVX2 like they should. Instead it should be checking directly for which instruction sets are supported using feature flags ("dunnington" = SSE3, "sandybridge" = AVX, "haswell" = AVX2). Also note that there are no SSE2-specific kernels, but the reference configuration can/should be built with

(@fgvanzee: for context, note that @matthew-brett has been experimenting with building multiple configurations of BLIS and then using CPU-detection code to decide which version to load at runtime.)
@njsmith: The auto-detection code may very well be poorly written, but I did not write it--Xianyi did, during his visit. Now, we were very grateful that he put at least something in place, as prior to his stint here at UT there was zero auto-detection. As for SSE2/3, you are correct: we only have SSE3 kernels, and they are incomplete (real domain only, I think). But I don't know to what extent those systems are properly detected, and based on your comments it sounds like there are gaps. Given that we no longer have the hardware those were developed on, it will be up to others to test that functionality and propose patches.

@matthew-brett: I am rearchitecting BLIS to facilitate various features which are not practical (feasible) presently, including runtime auto-detection of kernels/blocksizes. That will probably result in a redesign of the configure-time build system as well. So, these issues are on our radar, but they take time. But the goal is not to "fill the tree" of support for all hardware. We will need contributions from the community to fill the gaps once the new software architecture is in place.

Again, not sure what you mean by "templates". Maybe you mean configurations? Or kernels?
Sorry - by templates I mean 'configurations', as in the sub-directories in your config directory.

Do you think it is possible, with something like the current state of BLIS, to make a collection of built configurations from which we could select at run-time, to give us reasonably good performance on average across CPUs? If that was possible, then I think you would find a lot of developer interest coming your way from us, from Julia, R and Octave, by which I mean issues and patches.
@matthew-brett: The current BLIS software architecture does not allow one to change configuration information (kernels, blocksizes, etc.) at runtime. As I alluded to, I am working to facilitate this, but it requires a lot of groundwork. (Auto-detection is just one "application" of the general feature. Experts--those who know what they are doing--will be able to switch between custom kernels on-demand, if they wish. I tend to think of all of this under the umbrella of "runtime management" of what we currently think of as "the configuration." The CPUID/hardware-detection side of this is actually the least interesting part, to me, even if it is the most useful to end-users.)
@fgvanzee - sorry - I am not being clear. What we are experimenting with is building all the currently defined x86 configurations into separate libraries, so we have a directory structure something like this:

Then, at runtime, we check the CPU and load the matching library.
Sincere apologies if that came across as criticism -- I'm aware that you didn't write it, and even if you had I wouldn't have considered it a judgement on you. All my code can certainly be improved further too :-)
@fgvanzee - following up here. Let's say we have built each defined x86 configuration as a separate library, as I described above, and we have found some optimal way of choosing between these libraries using the CPU identification. Do you think we would be able to get reasonable performance across a range of CPUs? If not - what would it take for that to be possible?
@njsmith: No offense taken. I just wanted to make clear that it was one of the few parts of BLIS that I did not author, and that I have not even really looked at myself. I will probably try to clean it up eventually, but I don't know exactly when that will be. Furthermore, it will depend on how successfully I can learn the absurd intricacies of CPUID, and I will tell you right now, I may get to a point where it makes me want to jump off of a building. If you or anyone in your circle has expertise in that, and can describe how the CPUID register values should be detected for the modern family of Intel and AMD systems, that would be welcome.

@matthew-brett: Yes, that clarifies things a bit. First, I hope we can agree that your extra-BLIS solution will become unnecessary in the short- to medium-term, since that functionality is planned for implementation within BLIS. Now, to your question: I feel like I'm missing something. If your multi-shared-object approach uses reference and haswell, you will only get good performance on haswell/broadwell systems. If it uses reference, sandybridge, and haswell, you will only get good performance on sandy/ivybridge and haswell/broadwell, and so forth. (The reference configuration will always be slow, but it will always work.) I don't know what your definition of "reasonable range" of CPUs is.
@matthew-brett @tkelman @juliantaylor @fgvanzee: You know, it occurs to me that it would be very straightforward to create our own runtime-autoconfiguring build of BLIS right now, as a temporary measure while Field is rearchitecting the core to support more powerful forms of runtime configuration. This could be done without any hackery to BLIS itself. Just create a new BLIS configuration containing, at minimum:

bli_kernel.h

kernels/runtime-selected.c

The only thing we'd need to change about the core of BLIS is that we'd probably want to modify the existing x86-64 kernels to have more unique names (e.g. renaming the current kernels).

For bonus points:
Of course this would be slightly klugey, but rather minimally: it'd be simple, and would take advantage of the existing BLIS API configuration abstractions (and in particular, since all changes would be restricted to a new directory in
Anything that can be implemented around BLIS in C rather than Python would be much easier for Julia (and R, Octave, SciLua, anyone else) to test and use, so I like the sound of that.
For CPUID - the heavy lifting for https://github.com/matthew-brett/x86cpu is in the C code at https://github.com/matthew-brett/x86cpu/tree/master/src so would be easy enough to recycle into kernel selection. |
@matthew-brett: also, a lot of the complexity there goes away if we don't care about supporting compilers without inline asm (and BLIS very much requires inline asm).
Anywhere you care about MSVC runtime compatibility but don't need Fortran, you should probably already be using clang.
@njsmith: your interim solution sounds fine. Just be advised that you will probably need to rework your solution at least slightly (for example,

EDIT: you've also hit on yet another item on my to-do list, which is to clean up the kernels directory. For example, the
@fgvanzee: is your refactoring-in-progress available anywhere to peek at (e.g. on a personal branch)? And are there any kernels that you know that we should specifically steer clear of until you have time to clean things up more? (Or is there some way that someone external could help with the clean up?) |
@rgommers: sorry this turned into a general chat channel about numpy+BLIS interaction -- I kinda lost track of what's going on with the underlying PR. Should we just merge it?
Can be merged as is, or I can add the
@rgommers: I don't think the CBLAS check is too urgent, just because whether we have it or not, numpy+BLIS is going to remain an experts-only endeavour in several ways for the next while :-). It might well be a good idea to fix it up later, but we can do that as a separate PR?
sure |
Then let's do this |
ENH: add support for BLIS to numpy.distutils
I refactored the CPUID etc. code into:
@njsmith: Sorry for the delay. I became overwhelmed with communication (and still am), and so I just kind of ignored most everything on GitHub. Sometimes I feel like I'm autistic and I do strange things to cope.

Unfortunately, I don't typically use branches for interim work. There certainly hasn't been any reason to so far; I recently did a

@matthew-brett: Thanks for your efforts vis-a-vis the CPUID. That should be quite helpful. We may need to talk about licensing, though. There was a time when it was very important to our project that all of our code be "owned" by people employed by the university, and thus owned by UT. (I know, very much not in the spirit of OSS, but I don't call the shots.) Worst case, we would need to discuss what degree of rewrite I would need to perform to your code for it to qualify as "original" for the purposes of authorship (and thus having the new code qualify as being UT-authored).
Field - is there any license I can put on the code that would allow you to use the code (and still allow me to distribute and modify)? |
@matthew-brett: The answer is probably something very close to whatever is the least restrictive "license" possible. Public domain? Something narrower would probably work, something that granted copyright to UT. I'm not an IP expert, so I don't know exactly how this works. But we still have time to figure it out. I'm not quite ready to pivot to the CPU autodetection stuff yet. But we can definitely revisit this later. That will give me time to huddle with Robert and for us to determine whether I'm being overly conservative, and if not, determine who to contact within UT for further clarification. |
I'm happy to make it public domain if that would help - just let me know. |
@matthew-brett: Many thanks for your flexibility and willingness to contribute. I'll get back to you. |
@fgvanzee: No worries! I think we all understand very well about being overwhelmed by communication :-) I'll look forward to seeing what you've got when it's ready... |
Note: the status of BLIS and some experiments/benchmarks in using it are discussed in gh-5479.
Besides adding a blis_info class plus the corresponding changes to site.cfg.example, a few things that generate spurious logging (checks for empty paths) are cleaned up.

Edit: mention that #5479 is now closed and #7372 has a more extensive discussion of BLIS.