Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenMP Groupby any/all #53149

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion meson.build
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ py = py_mod.find_installation('python')
py_dep = py.dependency()
tempita = files('generate_pxi.py')
versioneer = files('generate_version.py')

openmp = dependency('openmp', required: false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean that i shouldn't expect this PR to affect perf on mac?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using the standard compiler I don't think so, but you may be able to achieve it if you use a conda supplied installer.

There are two ways to validate:

  1. When installing with meson add the --config-settings=compile-args="--verbose" flag, i.e. python -m pip install -ve . --no-build-isolation --config-settings=compile-args="--verbose" and search for the -fopenmp flag being passed when linking the groupby lib
  2. grep the groupby lib for any openmp symbols. On linux this would be objdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get with the standard meson build directory + py311. You will need to tweak to your environment, and maybe tweak objdump (not sure how it works on macOS)

Here's the sample output from 2 on my computer:

objdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get
0000000000000000       F *UND*	0000000000000000              omp_get_thread_num@OMP_1.0
0000000000000000       F *UND*	0000000000000000              omp_get_num_threads@OMP_1.0

The omp_get symbols are of interest

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the second solution above is the only "guaranteed" way of knowing; for option 1 it is possible for the flag to be silently ignored


add_project_arguments('-DNPY_NO_DEPRECATED_API=0', language : 'c')
add_project_arguments('-DNPY_NO_DEPRECATED_API=0', language : 'cpp')
Expand Down
3 changes: 2 additions & 1 deletion pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ from cython cimport (
Py_ssize_t,
floating,
)
from cython.parallel cimport prange
from libc.math cimport (
NAN,
sqrt,
Expand Down Expand Up @@ -621,7 +622,7 @@ def group_any_all(
out[:] = 1 - flag_val

with nogil:
for i in range(N):
for i in prange(N):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason this went here rather than someplace else? my assumption is it was a pretty arbitrary choice made for a POC, but worth checking.

if this goes in, can i look forward to a bunch more PRs adding prange all over the place?

Copy link
Member Author

@WillAyd WillAyd May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this is probably usable in less places than you would think. I don't know all the rules for what makes Cython/OpenMP happy but I tried quite a few different algorithms and couldn't get them to work, before trying this one because it was relatively simple

OpenMP I don't think generally works with an accumulator shared across threads (for good reason), and Cython seems really conservative when it comes to allowing accumulators within a prange block, even if the accumulation would seemingly happen all on one thread.

Quoting the Cython docs from the OP:

You should also be aware that a lot of the choices Cython makes about how your code is parallelized are fairly fixed and if you want specific OpenMP behaviour that Cython doesn't provide by default you may be better writing it in C yourself.

I think will take trial and error to really know

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add some context to the above, one of the first algorithms I looked at was corr since our implementation lags pretty bad relative to the NumPy implementation (granted we handle NA values and they do not)

with nogil:

Here is what I tried to change:

    with nogil:
        for xi in prange(K):
            for yi in range(xi + 1):
                # Welford's method for the variance-calculation
                # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
                nobs = ssqdmx = ssqdmy = covxy = meanx = meany = 0
                for i in range(N):
                    if mask[i, xi] and mask[i, yi]:
                        vx = mat[i, xi]
                        vy = mat[i, yi]
                        nobs += 1
                        dx = vx - meanx
                        dy = vy - meany
                        meanx += 1. / nobs * dx
                        meany += 1. / nobs * dy
                        ssqdmx += (vx - meanx) * dx
                        ssqdmy += (vy - meany) * dy
                        covxy += (vx - meanx) * dy

                if nobs < minpv:
                    result[xi, yi] = result[yi, xi] = NaN
                else:
                    divisor = (nobs - 1.0) if cov else sqrt(ssqdmx * ssqdmy)

                    if divisor != 0:
                        result[xi, yi] = result[yi, xi] = covxy / divisor
                    else:
                        result[xi, yi] = result[yi, xi] = NaN

And Cython gives this error when you try to generate it:

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
                  for i in range(N):
                      if mask[i, xi] and mask[i, yi]:
                          vx = mat[i, xi]
                          vy = mat[i, yi]
                          nobs += 1
                          dx = vx - meanx
                                   ^
  ------------------------------------------------------------

  /home/willayd/clones/pandas/pandas/_libs/algos.pyx:377:34: Cannot read reduction variable in loop body

So I think all those values like meanx, meany, etc.. that accumulate a value over a loop end up with pretty limited value, even if that accumulation should happen all on one thread. My guess is Cython is being ultra-conservative about this, so as we learn more about it can definitely give feedback upstream and see what they say

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for taking a look. I'll poke at that one if/when this goes in, as it definitely seems like it should parallelize nicely.

lab = labels[i]
if lab < 0:
continue
Expand Down
2 changes: 1 addition & 1 deletion pandas/_libs/meson.build
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,7 @@ libs_sources = {
'algos': {'sources': ['algos.pyx', _algos_common_helper, _algos_take_helper, _khash_primitive_helper],
'include_dirs': klib_include},
'arrays': {'sources': ['arrays.pyx']},
'groupby': {'sources': ['groupby.pyx']},
'groupby': {'sources': ['groupby.pyx'], 'deps': [openmp]},
'hashing': {'sources': ['hashing.pyx']},
'hashtable': {'sources': ['hashtable.pyx', _khash_primitive_helper, _hashtable_class_helper, _hashtable_func_helper],
'include_dirs': klib_include},
Expand Down
Loading