OpenMP Groupby any/all #53149
Conversation
I tried this recently-ish with libalgos.nancorr and couldn't get it working on Mac. Would be awesome if you can figure it out!
Assuming this link is true, it looks like the macOS-bundled llvm just doesn't support OpenMP: https://stackoverflow.com/a/60043467/621736 Meson should help us compared to setuptools - will rebase after that goes into main and see
```diff
@@ -22,7 +22,7 @@ py = py_mod.find_installation('python')
 py_dep = py.dependency()
 tempita = files('generate_pxi.py')
 versioneer = files('generate_version.py')
+openmp = dependency('openmp', required: false)
```
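For context on `required: false`: a hedged Meson sketch (illustrative names, not this PR's actual wiring) of how such an optional dependency is typically consumed, so the build can degrade gracefully when OpenMP is absent:

```meson
# hypothetical follow-up wiring, not from this PR
if openmp.found()
    message('building pandas extensions with OpenMP support')
else
    message('OpenMP not found; building serial extensions')
endif

# a not-found optional dependency can be attached conditionally
ext_deps = openmp.found() ? [py_dep, openmp] : [py_dep]
```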
Does this mean that I shouldn't expect this PR to affect perf on Mac?
If using the standard compiler I don't think so, but you may be able to achieve it if you use a conda-supplied compiler.

There are two ways to validate:

1. When installing with meson, add the `--config-settings=compile-args="--verbose"` flag, i.e. `python -m pip install -ve . --no-build-isolation --config-settings=compile-args="--verbose"`, and search for the `-fopenmp` flag being passed when linking the groupby lib.
2. Grep the groupby lib for any OpenMP symbols. On Linux this would be:

   ```sh
   objdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get
   ```

   with the standard meson build directory + py311. You will need to tweak this to your environment, and maybe swap out objdump (not sure how it works on macOS; a hedged sketch follows below).
Here's the sample output from option 2 on my computer:

```sh
$ objdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get
0000000000000000 F *UND* 0000000000000000 omp_get_thread_num@OMP_1.0
0000000000000000 F *UND* 0000000000000000 omp_get_num_threads@OMP_1.0
```

The `omp_get` symbols are of interest.
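On macOS the bundled objdump may not behave the same way; as a hedged sketch (assuming an Xcode/LLVM toolchain, with the build path and extension suffix adjusted to your platform), `nm` can perform the equivalent check:

```sh
# hypothetical macOS equivalent of the objdump check above;
# -g: external symbols only, -u: undefined symbols only
nm -gu build/cp311/pandas/_libs/groupby.cpython-311-darwin.so | grep omp_get
```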
I guess the second solution above is the only "guaranteed" way of knowing; for option 1 it is possible for the flag to be silently ignored
@h-vetinari wondering if you have any experience with distributing OpenMP-linked libraries via conda and pip. Not sure how the presence of this on the build system may affect downstream users.
```diff
@@ -621,7 +622,7 @@ def group_any_all(
     out[:] = 1 - flag_val

     with nogil:
-        for i in range(N):
+        for i in prange(N):
```
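For readers unfamiliar with `prange`, here is a minimal standalone sketch (illustrative, not code from this PR) of the pattern the diff relies on: each iteration writes only to its own output slot, so no state is shared across threads and Cython can hand the loop to OpenMP:

```cython
# cython: boundscheck=False, wraparound=False
from cython.parallel import prange

def scale(const double[::1] x, double[::1] out):
    cdef Py_ssize_t i
    with nogil:
        # prange distributes iterations across OpenMP threads; this is safe
        # because iteration i touches only x[i] and out[i]
        for i in prange(x.shape[0]):
            out[i] = x[i] * 2.0
```

Note that when the extension is compiled without OpenMP flags, `prange` degrades to an ordinary serial loop, which is consistent with the `required: false` dependency above.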
Any particular reason this went here rather than someplace else? My assumption is it was a pretty arbitrary choice made for a POC, but worth checking.
If this goes in, can I look forward to a bunch more PRs adding prange all over the place?
Unfortunately this is probably usable in fewer places than you would think. I don't know all the rules for what makes Cython/OpenMP happy, but I tried quite a few different algorithms and couldn't get them to work before trying this one, which I picked because it was relatively simple.
OpenMP generally doesn't work with an accumulator shared across threads (for good reason), and Cython seems really conservative when it comes to allowing accumulators within a prange block, even if the accumulation would seemingly happen all on one thread.
Quoting the Cython docs from the OP:

> You should also be aware that a lot of the choices Cython makes about how your code is parallelized are fairly fixed and if you want specific OpenMP behaviour that Cython doesn't provide by default you may be better writing it in C yourself.

I think it will take trial and error to really know.
To add some context to the above, one of the first algorithms I looked at was corr, since our implementation lags pretty badly relative to the NumPy implementation (granted, we handle NA values and they do not). The existing loop starts at `pandas/_libs/algos.pyx` line 365 (at 94e868a) with `with nogil:`.
Here is what I tried to change:

```cython
with nogil:
    for xi in prange(K):
        for yi in range(xi + 1):
            # Welford's method for the variance-calculation
            # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
            nobs = ssqdmx = ssqdmy = covxy = meanx = meany = 0
            for i in range(N):
                if mask[i, xi] and mask[i, yi]:
                    vx = mat[i, xi]
                    vy = mat[i, yi]
                    nobs += 1
                    dx = vx - meanx
                    dy = vy - meany
                    meanx += 1. / nobs * dx
                    meany += 1. / nobs * dy
                    ssqdmx += (vx - meanx) * dx
                    ssqdmy += (vy - meany) * dy
                    covxy += (vx - meanx) * dy

            if nobs < minpv:
                result[xi, yi] = result[yi, xi] = NaN
            else:
                divisor = (nobs - 1.0) if cov else sqrt(ssqdmx * ssqdmy)
                if divisor != 0:
                    result[xi, yi] = result[yi, xi] = covxy / divisor
                else:
                    result[xi, yi] = result[yi, xi] = NaN
```
And Cython gives this error when you try to generate it:

```
Error compiling Cython file:
------------------------------------------------------------
...
            for i in range(N):
                if mask[i, xi] and mask[i, yi]:
                    vx = mat[i, xi]
                    vy = mat[i, yi]
                    nobs += 1
                    dx = vx - meanx
                                 ^
------------------------------------------------------------

/home/willayd/clones/pandas/pandas/_libs/algos.pyx:377:34: Cannot read reduction variable in loop body
```
So I think all those values like `meanx`, `meany`, etc. that accumulate a value over a loop make this of pretty limited use, even if that accumulation should happen all on one thread. My guess is Cython is being ultra-conservative about this, so as we learn more about it we can definitely give feedback upstream and see what they say.
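To illustrate the constraint, here is a minimal hedged sketch (not from the PR): Cython treats a variable that is only updated in-place inside a `prange` body as an OpenMP reduction variable, which may be written but never read within the loop:

```cython
from cython.parallel import prange

def total(const double[::1] x):
    cdef:
        Py_ssize_t i
        double s = 0.0
    with nogil:
        for i in prange(x.shape[0]):
            # s is inferred as a reduction variable: each thread keeps a
            # private partial sum that OpenMP combines after the loop
            s += x[i]
            # reading s here, e.g. `tmp = s * 2`, would fail to compile with
            # "Cannot read reduction variable in loop body"
    return s
```

Welford's update both writes and then reads `meanx`/`meany` in the same iteration, which is exactly what this reduction model forbids.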
Makes sense, thanks for taking a look. I'll poke at that one if/when this goes in, as it definitely seems like it should parallelize nicely.
Do we really think OpenMP is a good idea?
This opens up a huge box.
Not sure. It definitely is the easiest way for us to parallelize things via Cython (where applicable), but I can't claim to have experience with using it otherwise. The packaging seems like the riskiest piece. Maybe we have to turn it off for wheels but enable it for conda environments where OpenMP can be ensured? Are there other huge downsides anyone is aware of?
In general, you don't want to parallelize at multiple levels of your code. E.g. a workflow that already uses up all cores via multiprocessing and then tries to use all cores within each process will suffer from context switches. Another case is a large shared server where you want to ensure you aren't over-utilizing resources.
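As a hedged illustration of how downstream users typically rein this in (not something this PR adds): the threadpoolctl package can cap the number of OpenMP threads a native extension uses, e.g. inside a multiprocessing worker:

```python
# hypothetical usage sketch; assumes threadpoolctl is installed and the
# extension was built against a supported OpenMP runtime
from threadpoolctl import threadpool_limits

def worker(df):
    # keep each process single-threaded at the OpenMP level to avoid
    # oversubscription when many workers run at once
    with threadpool_limits(limits=1, user_api="openmp"):
        return df.groupby("key").any()
```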
TL;DR: It works fine with conda, not fine with pip.

In more detail, conda-forge has all the required bits & pieces to make this work (and we ship OpenMP support in openblas, for example), though Windows has the wrinkle that the […]

As a concrete example, SciPy has not allowed a hard dependence on OpenMP for various reasons, primarily related to the limitations of PyPI's distribution model. @rgommers was & is heavily involved in this, and also did a write-up about this very subject in the context of the pypackaging-native initiative. From a (somewhat-)recently-added cross-reference on that SciPy issue, I see that a bunch of familiar faces already discussed this in the wider PyData context, including pandas.
FWIW numba seems to "select a threading layer if already available" instead of requiring one https://numba.pydata.org/numba-doc/latest/user/threading-layer.html#selecting-a-named-threading-layer
Yeah, numba specifically detects and disallows this when jitting, which might be hard to do on our side.
Wow, thanks, these are great resources. But I think the last link might be private? I get a 404 trying to access it.
My bad, sorry, I'm not used to being a member of non-public GH orgs 😅 (i.e. when GH shows me an xref, I wrongly assume from habit that everyone else sees it too; in this case, the write-up I linked is more recent and more complete anyway)
I am -1 on randomly adding parallelization like this.
This needs comprehensive policies, otherwise you end up with multiple layers of parallelism, which is almost always bad. The point is that this needs substantial testing and validation, and tbh we may want to punt completely at the pandas level on this.
I'm open to a "needs more thought" argument, but on a lot of fronts we've done a pretty good job with optimization so performance gains are going to have to come from parallelism |
Yea, sounds like applying this generally has some caveats. A conservative approach to start could be to add […]
+1 to @jreback's points. I'll note that @thomasjpfan put quite some thought into this topic and wrote up a design proposal for enabling parallelism across core PyData packages in a careful fashion: https://thomasjpfan.github.io/parallelism-python-libraries-design/ (also going to be a SciPy '23 talk, I believe).
Awesome links, thanks for sharing. So I see in the section https://thomasjpfan.github.io/parallelism-python-libraries-design/#proposal-2 that there are three things being proposed: […]

2 seems feasible, and 3 looks like the precursor PR is merged, so I'm guessing it's just a matter of time (I see Intel OpenMP wheels already, but likely not what the proposal refers to). For number 1 I'm not sure if that would be possible for pandas to tackle, especially if we just rely on the Cython-generated OpenMP code. Guessing that is something we may want Cython to tackle?
Let's close this - I don't plan on taking any more immediate action. I think this is a good reference for any future discussion of the topic.
Looking over the concerns from this PR, I can see the following path forward for pandas + OpenMP: […]

Alternative path forward

An alternative to OpenMP is to go with SciPy's approach, which was to develop their own pthreads-based solution. It has the benefit of being a standard and avoids some of the issues with OpenMP.
@thomasjpfan can you comment on the pros/cons of the SciPy solution, particularly for those of us with limited C++ experience?
For reference, SciPy has discussed the topic of adding OpenMP and decided on pthreads: scipy/scipy#10239

At a high level, the pros and cons unique to SciPy's pthreads solution:

Pros […]

Cons […]
Seems to give a small boost with our current test suite, though this could depend on the shape of the input data: […]
For the OpenMP link arg, Meson should have a much better facility for auto-detecting that, which we can leverage.
Cython dev docs have been updated with more info on parallelization:
https://github.com/cython/cython/blob/master/docs/src/tutorial/parallelization.rst