-
-
Notifications
You must be signed in to change notification settings - Fork 17.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenMP Groupby any/all #53149
OpenMP Groupby any/all #53149
Changes from all commits
ebb8602
df37cc1
f7d3336
c1beccd
35462e6
8c7217b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -3,6 +3,7 @@ from cython cimport ( | |||
Py_ssize_t, | ||||
floating, | ||||
) | ||||
from cython.parallel cimport prange | ||||
from libc.math cimport ( | ||||
NAN, | ||||
sqrt, | ||||
|
@@ -621,7 +622,7 @@ def group_any_all( | |||
out[:] = 1 - flag_val | ||||
|
||||
with nogil: | ||||
for i in range(N): | ||||
for i in prange(N): | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. any particular reason this went here rather than someplace else? my assumption is it was a pretty arbitrary choice made for a POC, but worth checking. if this goes in, can i look forward to a bunch more PRs adding prange all over the place? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unfortunately this is probably usable in less places than you would think. I don't know all the rules for what makes Cython/OpenMP happy but I tried quite a few different algorithms and couldn't get them to work, before trying this one because it was relatively simple OpenMP I don't think generally works with an accumulator shared across threads (for good reason), and Cython seems really conservative when it comes to allowing accumulators within a prange block, even if the accumulation would seemingly happen all on one thread. Quoting the Cython docs from the OP:
I think will take trial and error to really know There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To add some context to the above, one of the first algorithms I looked at was corr since our implementation lags pretty bad relative to the NumPy implementation (granted we handle NA values and they do not) Line 365 in 94e868a
Here is what I tried to change: with nogil:
for xi in prange(K):
for yi in range(xi + 1):
# Welford's method for the variance-calculation
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
nobs = ssqdmx = ssqdmy = covxy = meanx = meany = 0
for i in range(N):
if mask[i, xi] and mask[i, yi]:
vx = mat[i, xi]
vy = mat[i, yi]
nobs += 1
dx = vx - meanx
dy = vy - meany
meanx += 1. / nobs * dx
meany += 1. / nobs * dy
ssqdmx += (vx - meanx) * dx
ssqdmy += (vy - meany) * dy
covxy += (vx - meanx) * dy
if nobs < minpv:
result[xi, yi] = result[yi, xi] = NaN
else:
divisor = (nobs - 1.0) if cov else sqrt(ssqdmx * ssqdmy)
if divisor != 0:
result[xi, yi] = result[yi, xi] = covxy / divisor
else:
result[xi, yi] = result[yi, xi] = NaN And Cython gives this error when you try to generate it: Error compiling Cython file:
------------------------------------------------------------
...
for i in range(N):
if mask[i, xi] and mask[i, yi]:
vx = mat[i, xi]
vy = mat[i, yi]
nobs += 1
dx = vx - meanx
^
------------------------------------------------------------
/home/willayd/clones/pandas/pandas/_libs/algos.pyx:377:34: Cannot read reduction variable in loop body So I think all those values like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense, thanks for taking a look. I'll poke at that one if/when this goes in, as it definitely seems like it should parallelize nicely. |
||||
lab = labels[i] | ||||
if lab < 0: | ||||
continue | ||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean that i shouldn't expect this PR to affect perf on mac?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If using the standard compiler I don't think so, but you may be able to achieve it if you use a conda supplied installer.
There are two ways to validate:
--config-settings=compile-args="--verbose"
flag, i.e.python -m pip install -ve . --no-build-isolation --config-settings=compile-args="--verbose"
and search for the-fopenmp
flag being passed when linking the groupby libobjdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get
with the standard meson build directory + py311. You will need to tweak to your environment, and maybe tweak objdump (not sure how it works on macOS)Here's the sample output from 2 on my computer:
The
omp_get
symbols are of interestThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the second solution above is the only "guaranteed" way of knowing; for option 1 it is possible for the flag to be silently ignored