
OpenMP Groupby any/all #53149

Closed · wants to merge 6 commits

Conversation

@WillAyd (Member) commented May 8, 2023

Seems to give a small boost with our current benchmark suite, though this could depend on the shape of the input data:

       before           after         ratio
     [824fc830]       [f7d33366]
     <main>           <parallel-shift>
-      46.3±0.4μs       41.6±0.4μs     0.90  groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'direct', 5)
-     1.26±0.01ms      1.13±0.01ms     0.89  groupby.GroupByMethods.time_dtype_as_field('object', 'any', 'transformation', 5)
-     1.21±0.05ms      1.07±0.02ms     0.89  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation', 5)

For the OpenMP link arg, Meson should have a much better facility for auto-detection that we can leverage

Cython dev docs have been updated with more info on parallelization:

https://github.com/cython/cython/blob/master/docs/src/tutorial/parallelization.rst
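For reference, a minimal sketch of the pattern in play (illustrative only, not the pandas change itself): prange compiles to an OpenMP parallel for-loop when the extension is built with OpenMP, and an in-place update like total += 1 is inferred as a per-thread reduction:

    from cython.parallel import prange

    def count_positive(const double[:] values):
        # Cython emits '#pragma omp parallel for' for this loop when
        # compiled with OpenMP; 'total' becomes a sum reduction.
        cdef Py_ssize_t i, total = 0
        with nogil:
            for i in prange(values.shape[0]):
                if values[i] > 0:
                    total += 1
        return total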

@jbrockmendel (Member)

I tried this recently-ish with libalgos.nancorr and couldn't get it working on Mac. Would be awesome if you can figure it out!

@WillAyd (Member, Author) commented May 9, 2023

Assuming this link is accurate, it looks like the macOS-bundled llvm just doesn't support OpenMP:

https://stackoverflow.com/a/60043467/621736

Meson should help us here compared to setuptools; I'll rebase after that goes into main and see

@WillAyd marked this pull request as ready for review May 9, 2023 18:26
@@ -22,7 +22,7 @@ py = py_mod.find_installation('python')
 py_dep = py.dependency()
 tempita = files('generate_pxi.py')
 versioneer = files('generate_version.py')

+openmp = dependency('openmp', required: false)

Member:

does this mean that i shouldn't expect this PR to affect perf on mac?

Member Author:

If using the standard compiler I don't think so, but you may be able to achieve it with a conda-supplied toolchain.

There are two ways to validate:

  1. When installing with Meson, add the --config-settings=compile-args="--verbose" flag, i.e. python -m pip install -ve . --no-build-isolation --config-settings=compile-args="--verbose", and search for the -fopenmp flag being passed when linking the groupby lib.
  2. grep the groupby lib for any OpenMP symbols. On Linux this would be objdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get with the standard Meson build directory + py311. You will need to tweak this for your environment, and maybe swap out objdump (not sure how it works on macOS)

Here's the sample output from 2 on my computer:

objdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get
0000000000000000       F *UND*	0000000000000000              omp_get_thread_num@OMP_1.0
0000000000000000       F *UND*	0000000000000000              omp_get_num_threads@OMP_1.0

The omp_get symbols are of interest

Member Author:

I guess the second solution above is the only "guaranteed" way of knowing; for option 1 it is possible for the flag to be silently ignored

@WillAyd (Member, Author) commented May 9, 2023

@h-vetinari, wondering if you have any experience with distributing OpenMP-linked libraries via conda and pip? Not sure how the presence of this on the build system may affect downstream users

@@ -621,7 +622,7 @@ def group_any_all(
     out[:] = 1 - flag_val

     with nogil:
-        for i in range(N):
+        for i in prange(N):

Member:

any particular reason this went here rather than someplace else? my assumption is it was a pretty arbitrary choice made for a POC, but worth checking.

if this goes in, can i look forward to a bunch more PRs adding prange all over the place?

@WillAyd (Member, Author) commented May 9, 2023

Unfortunately this is probably usable in fewer places than you would think. I don't know all the rules for what makes Cython/OpenMP happy, but I tried quite a few different algorithms and couldn't get them to work before trying this one, which was relatively simple.

I don't think OpenMP generally works with an accumulator shared across threads (for good reason), and Cython seems really conservative about allowing accumulators within a prange block, even when the accumulation would seemingly happen all on one thread.

Quoting the Cython docs from the OP:

You should also be aware that a lot of the choices Cython makes about how your code is parallelized are fairly fixed and if you want specific OpenMP behaviour that Cython doesn't provide by default you may be better writing it in C yourself.

I think it will take trial and error to really know
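To illustrate the accumulator rule described above (a sketch under my reading of the Cython docs, not pandas code): an in-place operator inside prange turns the variable into a per-thread reduction, but the loop body may then never read it back.

    from cython.parallel import prange

    def total(const double[:] x):
        cdef Py_ssize_t i
        cdef double s = 0.0
        with nogil:
            for i in prange(x.shape[0]):
                s += x[i]           # allowed: inferred sum reduction
                # m = s / (i + 1)   # rejected: "Cannot read reduction
                #                   # variable in loop body"
        return s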

Member Author:

To add some context to the above: one of the first algorithms I looked at was corr, since our implementation lags pretty badly relative to the NumPy implementation (granted, we handle NA values and they do not)


Here is what I tried to change:

    with nogil:
        for xi in prange(K):
            for yi in range(xi + 1):
                # Welford's method for the variance-calculation
                # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
                nobs = ssqdmx = ssqdmy = covxy = meanx = meany = 0
                for i in range(N):
                    if mask[i, xi] and mask[i, yi]:
                        vx = mat[i, xi]
                        vy = mat[i, yi]
                        nobs += 1
                        dx = vx - meanx
                        dy = vy - meany
                        meanx += 1. / nobs * dx
                        meany += 1. / nobs * dy
                        ssqdmx += (vx - meanx) * dx
                        ssqdmy += (vy - meany) * dy
                        covxy += (vx - meanx) * dy

                if nobs < minpv:
                    result[xi, yi] = result[yi, xi] = NaN
                else:
                    divisor = (nobs - 1.0) if cov else sqrt(ssqdmx * ssqdmy)

                    if divisor != 0:
                        result[xi, yi] = result[yi, xi] = covxy / divisor
                    else:
                        result[xi, yi] = result[yi, xi] = NaN

And Cython gives this error when you try to generate it:

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
                  for i in range(N):
                      if mask[i, xi] and mask[i, yi]:
                          vx = mat[i, xi]
                          vy = mat[i, yi]
                          nobs += 1
                          dx = vx - meanx
                                   ^
  ------------------------------------------------------------

  /home/willayd/clones/pandas/pandas/_libs/algos.pyx:377:34: Cannot read reduction variable in loop body

So I think all those variables like meanx, meany, etc. that accumulate a value over a loop end up being of pretty limited use, even if that accumulation should happen all on one thread. My guess is Cython is being ultra-conservative about this, so as we learn more about it we can definitely give feedback upstream and see what they say
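One workaround that might sidestep the error (an untested sketch, with hypothetical names pair_cov/pairwise_cov): hoist the accumulating inner loop into a cdef nogil helper, so that meanx, meany, etc. are ordinary locals of the helper rather than reduction candidates in the prange body.

    from cython.parallel import prange

    cdef double pair_cov(const double[:, :] mat,
                         const unsigned char[:, :] mask,
                         Py_ssize_t xi, Py_ssize_t yi) noexcept nogil:
        # Welford accumulators are locals here, so prange never sees them.
        cdef Py_ssize_t i, nobs = 0
        cdef double vx, vy, dx, dy
        cdef double meanx = 0, meany = 0, covxy = 0
        for i in range(mat.shape[0]):
            if mask[i, xi] and mask[i, yi]:
                vx = mat[i, xi]
                vy = mat[i, yi]
                nobs += 1
                dx = vx - meanx
                dy = vy - meany
                meanx += dx / nobs
                meany += dy / nobs
                covxy += (vx - meanx) * dy
        return covxy

    def pairwise_cov(const double[:, :] mat,
                     const unsigned char[:, :] mask,
                     double[:, :] result):
        cdef Py_ssize_t xi, yi
        with nogil:
            for xi in prange(mat.shape[1]):
                for yi in range(xi + 1):
                    result[xi, yi] = result[yi, xi] = pair_cov(mat, mask, xi, yi)

The cost is a function call per (xi, yi) pair, though the C compiler can usually inline it.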

Member:

Makes sense, thanks for taking a look. I'll poke at that one if/when this goes in, as it definitely seems like it should parallelize nicely.

@jreback (Contributor) left a comment

do we really think openmp is a good idea?
this opens up a huge box

@WillAyd (Member, Author) commented May 9, 2023

Not sure. It definitely is the easiest way for us to parallelize things via Cython (where applicable), but I can't claim to have experience with using it otherwise.

The packaging seems like the riskiest piece. Maybe we have to turn it off for wheels but enable for conda environments where openmp can be ensured?

Are there other huge downsides anyone is aware of?

@rhshadrach (Member)

In general, you don't want to parallelize at multiple levels of your code. E.g. a workflow that uses up all cores via multiprocessing and then tries to use all cores within each process will suffer from context switches.

Another case is a large shared server where you want to ensure you aren't over-utilizing resources.
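For illustration, the mitigations callers commonly reach for today (this assumes the third-party threadpoolctl package, which pandas does not depend on):

    import os

    # Blunt instrument: must be set before the OpenMP runtime is first loaded.
    os.environ["OMP_NUM_THREADS"] = "1"

    import numpy as np
    from threadpoolctl import threadpool_limits

    a = np.ones((2000, 2000))
    # Temporarily cap native threadpools (OpenMP, BLAS) for a region of code.
    with threadpool_limits(limits=1):
        _ = a @ a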

@h-vetinari (Contributor) commented May 9, 2023

@h-vetinari, wondering if you have any experience with distributing OpenMP-linked libraries via conda and pip? Not sure how the presence of this on the build system may affect downstream users

TL;DR: It works fine with conda, not fine with pip.

In more detail, conda-forge has all the required bits & pieces to make this work (and we ship openmp-support in openblas, for example), though Windows has the wrinkle that the intel-openmp and llvm-openmp implementations are different and cannot be co-installed. We've been wanting to switch to llvm-openmp on Windows as well (see e.g. here), but that's been blocked on some Fortran infrastructure work that's still pending.

As a concrete example, scipy has not allowed a hard dependence on OpenMP for various reasons, primarily related to the limitations of PyPI's distribution model. @rgommers was & is heavily involved in this, and also did a write-up about this very subject in the context of the pypackaging-native initiative. From a (somewhat-)recently-added cross-reference on that scipy-issue, I see that a bunch of familiar faces already discussed this in the wider PyData context, including pandas.

@mroeschke (Member)

FWIW numba seems to "select a threading layer if already available" instead of requiring one https://numba.pydata.org/numba-doc/latest/user/threading-layer.html#selecting-a-named-threading-layer

you don't want to parallelize at multiple levels of your code

Yeah, numba specifically detects and disallows this when jitting, which might be hard to do on our side

@WillAyd (Member, Author) commented May 9, 2023

As a concrete example, scipy has not allowed a hard dependence on OpenMP for various reasons, primarily related to the limitations of PyPI's distribution model. @rgommers was & is heavily involved in this, and also did a write-up about this very subject in the context of the pypackaging-native initiative. From a (somewhat-)recently-added cross-reference on that scipy-issue, I see that a bunch of familiar faces already discussed this in the wider PyData context, including pandas.

Wow, thanks, these are great resources. But I think the last link might be private? I get a 404 trying to access it

@h-vetinari (Contributor) commented May 9, 2023

But I think the last link might be private? I get a 404 trying to access

My bad, sorry, I'm not used to being a member of non-public GH orgs 😅

(i.e. when GH shows me an xref, I wrongly assume from habit that everyone else sees it too; in this case, the write-up I linked is more recent and more complete anyway)

@jreback (Contributor) commented May 10, 2023

I am -1 on randomly adding parallelism that is

  • not externally controllable (via keyword and env variable)
  • requires non-standard things (openmp is standard but not generally available)
  • defaults to anything but False

this needs comprehensive policies, otherwise you end up with multiple layers of parallelism, which is almost always bad

point is that this needs substantial testing and validation and tbh we may want to punt completely at the pandas level on this

@jbrockmendel (Member)

I'm open to a "needs more thought" argument, but on a lot of fronts we've done a pretty good job with optimization so performance gains are going to have to come from parallelism

@WillAyd (Member, Author) commented May 10, 2023

Yeah, sounds like applying this generally has some caveats. A conservative approach to start could be to add -Duse-omp=ON as a Meson option, but set the default to false. That would give people building from source an easy way to opt in, if they don't plan on using pandas in a multiprocessing environment
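Something like the following Meson sketch (the use-omp name is just the suggestion above, nothing final; a disabled feature option skips the dependency lookup entirely):

    # meson.options (hypothetical):
    option('use-omp', type: 'feature', value: 'disabled',
           description: 'Build Cython extensions with OpenMP support')

    # meson.build:
    openmp = dependency('openmp', required: get_option('use-omp'))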

@rgommers (Contributor)

+1 to @jreback's points. I'll note that @thomasjpfan put quite some thought into this topic and wrote up a design proposal for enabling parallelism across core PyData packages in a careful fashion: https://thomasjpfan.github.io/parallelism-python-libraries-design/ (also going to be a SciPy'23 talk I believe).

@WillAyd (Member, Author) commented May 10, 2023

Awesome links, thanks for sharing. I see in the section https://thomasjpfan.github.io/parallelism-python-libraries-design/#proposal-2 that there are three things being proposed:

  1. Detect combination of libgomp + fork call
  2. Detect when llvm openmp + intel openmp are loaded together
  3. Use an OpenMP PyPI wheel

2 seems feasible (see the sketch below), and for 3 it looks like the precursor PR is merged, so I'm guessing it's just a matter of time (I see Intel OpenMP wheels already, but likely not what the proposal refers to).

For number 1 I'm not sure if that would be possible for pandas to tackle, especially if we just rely on the Cython-generated OpenMP code. Guessing that is something we may want Cython to tackle?
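For item 2, a rough sketch of what detection could look like (Linux-only and hypothetical; it just scans the process's loaded shared objects for more than one OpenMP runtime):

    import warnings

    def loaded_openmp_runtimes():
        # /proc/self/maps lists every shared object mapped into this process.
        found = set()
        with open("/proc/self/maps") as f:
            for line in f:
                for runtime in ("libgomp", "libiomp", "libomp"):
                    if runtime in line:
                        found.add(runtime)
        return found

    if len(loaded_openmp_runtimes()) > 1:
        warnings.warn("multiple OpenMP runtimes loaded in this process")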

@WillAyd (Member, Author) commented May 16, 2023

Let's close this - I don't plan on taking any more immediate action. I think this is a good reference for any future discussion of the topic

@WillAyd closed this May 16, 2023
@thomasjpfan (Contributor) commented May 17, 2023

Looking over the concerns from this PR, I can see the following path forward for pandas + OpenMP:

  1. Let the default be single threaded.
  2. Pass a keyword argument from the Python layer all the way down to Cython's prange (see the sketch after this list). This makes it harder to accidentally engage in nested parallelism in pandas code.
  3. Have OpenMP be configurable with a compile time flag. This allows Pyodide to build with OpenMP off and conda to build with it on. As noted in OpenMP Groupby any/all #53149 (comment), conda has a better story around packaging OpenMP.
  4. Be ready for issues. Even without nested parallelism, there will be performance issues when there are multiple threadpools. Pandas will need to have some documentation warning users about configuring threads >= 2.
  5. If pandas wants OpenMP support in the near-term, be prepared to ship a PyPI wheel with an OpenMP binary.
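A sketch of item 2 (hypothetical signature, not a proposed API): the thread count flows from a Python-level keyword down to prange, defaulting to single-threaded.

    from cython.parallel import prange

    def group_reduce(const double[:] values, int num_threads=1):
        # num_threads=1 keeps the default single-threaded; callers
        # opt in to parallelism explicitly.
        cdef Py_ssize_t i
        cdef double total = 0.0
        with nogil:
            for i in prange(values.shape[0], num_threads=num_threads):
                total += values[i]
        return total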

Alternative path forward

An alternative to OpenMP is to go with SciPy's approach: they developed their own pthreads-based solution. pthreads has the benefit of being a POSIX standard, and this avoids some of the issues with OpenMP.

@jbrockmendel (Member)

@thomasjpfan can you comment on the pros/cons of the scipy solution, particularly for those of us with limited c++ experience

@thomasjpfan (Contributor)

For reference, SciPy has discussed the topic of adding OpenMP and decided on pthreads: scipy/scipy#10239. At a high level, the pros and cons unique to SciPy's pthreads solution:

Pros

  • Portable: pthreads is a POSIX standard
  • None of the OpenMP headaches when libraries load different versions of OpenMP
  • Easier to build. (No need to vendor OpenMP for wheels)

Cons
