
OpenMP Groupby any/all #53149

Closed · wants to merge 6 commits

Conversation

@WillAyd (Member) commented May 8, 2023

Seems to give a small boost with our current benchmark suite, though this could depend on the shape of the input data:

       before           after         ratio
     [824fc830]       [f7d33366]
     <main>           <parallel-shift>
-      46.3±0.4μs       41.6±0.4μs     0.90  groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'direct', 5)
-     1.26±0.01ms      1.13±0.01ms     0.89  groupby.GroupByMethods.time_dtype_as_field('object', 'any', 'transformation', 5)
-     1.21±0.05ms      1.07±0.02ms     0.89  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation', 5)

For the OpenMP link arg, Meson should have a much better facility for auto-detection that we can leverage

Cython dev docs have been updated with more info on parallelization:

https://github.com/cython/cython/blob/master/docs/src/tutorial/parallelization.rst
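For reference, a minimal sketch of the pattern in play (illustrative only, not the pandas change itself): prange compiles to an OpenMP parallel for-loop when the extension is built with OpenMP, and an in-place update like total += 1 is inferred as a per-thread reduction:

    from cython.parallel import prange

    def count_positive(const double[:] values):
        # Cython emits '#pragma omp parallel for' for this loop when
        # compiled with OpenMP; 'total' becomes a sum reduction.
        cdef Py_ssize_t i, total = 0
        with nogil:
            for i in prange(values.shape[0]):
                if values[i] > 0:
                    total += 1
        return total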

@jbrockmendel (Member)

I tried this recently-ish with libalgos.nancorr and couldn't get it working on Mac. Would be awesome if you can figure it out!

@WillAyd (Member, Author) commented May 9, 2023

Assuming this link is accurate, it looks like the macOS-bundled llvm just doesn't support OpenMP:

https://stackoverflow.com/a/60043467/621736

Meson should help us here compared to setuptools; I'll rebase after that goes into main and see

@WillAyd marked this pull request as ready for review May 9, 2023 18:26
@@ -22,7 +22,7 @@ py = py_mod.find_installation('python')
 py_dep = py.dependency()
 tempita = files('generate_pxi.py')
 versioneer = files('generate_version.py')

+openmp = dependency('openmp', required: false)

Member:

does this mean that i shouldn't expect this PR to affect perf on mac?

Member Author:

If using the standard compiler I don't think so, but you may be able to achieve it with a conda-supplied toolchain.

There are two ways to validate:

  1. When installing with Meson, add the --config-settings=compile-args="--verbose" flag, i.e. python -m pip install -ve . --no-build-isolation --config-settings=compile-args="--verbose", and search for the -fopenmp flag being passed when linking the groupby lib.
  2. grep the groupby lib for any OpenMP symbols. On Linux this would be objdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get with the standard Meson build directory + py311. You will need to tweak this for your environment, and maybe swap out objdump (not sure how it works on macOS)

Here's the sample output from 2 on my computer:

objdump --syms build/cp311/pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so | grep omp_get
0000000000000000       F *UND*	0000000000000000              omp_get_thread_num@OMP_1.0
0000000000000000       F *UND*	0000000000000000              omp_get_num_threads@OMP_1.0

The omp_get symbols are of interest

Member Author:

I guess the second solution above is the only "guaranteed" way of knowing; for option 1 it is possible for the flag to be silently ignored

@WillAyd (Member, Author) commented May 9, 2023

@h-vetinari, wondering if you have any experience with distributing OpenMP-linked libraries via conda and pip? Not sure how the presence of this on the build system may affect downstream users

@@ -621,7 +622,7 @@ def group_any_all(
     out[:] = 1 - flag_val

     with nogil:
-        for i in range(N):
+        for i in prange(N):

Member:

any particular reason this went here rather than someplace else? my assumption is it was a pretty arbitrary choice made for a POC, but worth checking.

if this goes in, can i look forward to a bunch more PRs adding prange all over the place?

@WillAyd (Member, Author) commented May 9, 2023

Unfortunately this is probably usable in fewer places than you would think. I don't know all the rules for what makes Cython/OpenMP happy, but I tried quite a few different algorithms and couldn't get them to work before trying this one, which was relatively simple.

I don't think OpenMP generally works with an accumulator shared across threads (for good reason), and Cython seems really conservative about allowing accumulators within a prange block, even when the accumulation would seemingly happen all on one thread.

Quoting the Cython docs from the OP:

You should also be aware that a lot of the choices Cython makes about how your code is parallelized are fairly fixed and if you want specific OpenMP behaviour that Cython doesn't provide by default you may be better writing it in C yourself.

I think it will take trial and error to really know
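To illustrate the accumulator rule described above (a sketch under my reading of the Cython docs, not pandas code): an in-place operator inside prange turns the variable into a per-thread reduction, but the loop body may then never read it back.

    from cython.parallel import prange

    def total(const double[:] x):
        cdef Py_ssize_t i
        cdef double s = 0.0
        with nogil:
            for i in prange(x.shape[0]):
                s += x[i]           # allowed: inferred sum reduction
                # m = s / (i + 1)   # rejected: "Cannot read reduction
                #                   # variable in loop body"
        return s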

Member Author:

To add some context to the above: one of the first algorithms I looked at was corr, since our implementation lags pretty badly relative to the NumPy implementation (granted, we handle NA values and they do not)


Here is what I tried to change:

    with nogil:
        for xi in prange(K):
            for yi in range(xi + 1):
                # Welford's method for the variance-calculation
                # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
                nobs = ssqdmx = ssqdmy = covxy = meanx = meany = 0
                for i in range(N):
                    if mask[i, xi] and mask[i, yi]:
                        vx = mat[i, xi]
                        vy = mat[i, yi]
                        nobs += 1
                        dx = vx - meanx
                        dy = vy - meany
                        meanx += 1. / nobs * dx
                        meany += 1. / nobs * dy
                        ssqdmx += (vx - meanx) * dx
                        ssqdmy += (vy - meany) * dy
                        covxy += (vx - meanx) * dy

                if nobs < minpv:
                    result[xi, yi] = result[yi, xi] = NaN
                else:
                    divisor = (nobs - 1.0) if cov else sqrt(ssqdmx * ssqdmy)

                    if divisor != 0:
                        result[xi, yi] = result[yi, xi] = covxy / divisor
                    else:
                        result[xi, yi] = result[yi, xi] = NaN

And Cython gives this error when you try to generate it:

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
                  for i in range(N):
                      if mask[i, xi] and mask[i, yi]:
                          vx = mat[i, xi]
                          vy = mat[i, yi]
                          nobs += 1
                          dx = vx - meanx
                                   ^
  ------------------------------------------------------------

  /home/willayd/clones/pandas/pandas/_libs/algos.pyx:377:34: Cannot read reduction variable in loop body

So I think all those variables like meanx, meany, etc. that accumulate a value over a loop end up being of pretty limited use, even if that accumulation should happen all on one thread. My guess is Cython is being ultra-conservative about this, so as we learn more about it we can definitely give feedback upstream and see what they say
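One workaround that might sidestep the error (an untested sketch, with hypothetical names pair_cov/pairwise_cov): hoist the accumulating inner loop into a cdef nogil helper, so that meanx, meany, etc. are ordinary locals of the helper rather than reduction candidates in the prange body.

    from cython.parallel import prange

    cdef double pair_cov(const double[:, :] mat,
                         const unsigned char[:, :] mask,
                         Py_ssize_t xi, Py_ssize_t yi) noexcept nogil:
        # Welford accumulators are locals here, so prange never sees them.
        cdef Py_ssize_t i, nobs = 0
        cdef double vx, vy, dx, dy
        cdef double meanx = 0, meany = 0, covxy = 0
        for i in range(mat.shape[0]):
            if mask[i, xi] and mask[i, yi]:
                vx = mat[i, xi]
                vy = mat[i, yi]
                nobs += 1
                dx = vx - meanx
                dy = vy - meany
                meanx += dx / nobs
                meany += dy / nobs
                covxy += (vx - meanx) * dy
        return covxy

    def pairwise_cov(const double[:, :] mat,
                     const unsigned char[:, :] mask,
                     double[:, :] result):
        cdef Py_ssize_t xi, yi
        with nogil:
            for xi in prange(mat.shape[1]):
                for yi in range(xi + 1):
                    result[xi, yi] = result[yi, xi] = pair_cov(mat, mask, xi, yi)

The cost is a function call per (xi, yi) pair, though the C compiler can usually inline it.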

Member:

Makes sense, thanks for taking a look. I'll poke at that one if/when this goes in, as it definitely seems like it should parallelize nicely.

@jreback (Contributor) left a comment

do we really think openmp is a good idea?
this opens up a huge box

@WillAyd (Member, Author) commented May 9, 2023

Not sure. It definitely is the easiest way for us to parallelize things via Cython (where applicable), but I can't claim to have experience with using it otherwise.

The packaging seems like the riskiest piece. Maybe we have to turn it off for wheels but enable for conda environments where openmp can be ensured?

Are there other huge downsides anyone is aware of?

@rhshadrach (Member)

In general, you don't want to parallelize at multiple levels of your code. E.g. a workflow that uses up all cores via multiprocessing and then tries to use all cores within each process will suffer from context switches.

Another case is a large shared server where you want to ensure you aren't over-utilizing resources.
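For illustration, the mitigations callers commonly reach for today (this assumes the third-party threadpoolctl package, which pandas does not depend on):

    import os

    # Blunt instrument: must be set before the OpenMP runtime is first loaded.
    os.environ["OMP_NUM_THREADS"] = "1"

    import numpy as np
    from threadpoolctl import threadpool_limits

    a = np.ones((2000, 2000))
    # Temporarily cap native threadpools (OpenMP, BLAS) for a region of code.
    with threadpool_limits(limits=1):
        _ = a @ a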

@h-vetinari (Contributor) commented May 9, 2023

@h-vetinari, wondering if you have any experience with distributing OpenMP-linked libraries via conda and pip? Not sure how the presence of this on the build system may affect downstream users

TL;DR: It works fine with conda, not fine with pip.

In more detail, conda-forge has all the required bits & pieces to make this work (and we ship openmp-support in openblas, for example), though Windows has the wrinkle that the intel-openmp and llvm-openmp implementations are different and cannot be co-installed. We've been wanting to switch to llvm-openmp on Windows as well (see e.g. here), but that's been blocked on some Fortran infrastructure work that's still pending.

As a concrete example, scipy has not allowed a hard dependence on OpenMP for various reasons, primarily related to the limitations of PyPI's distribution model. @rgommers was & is heavily involved in this, and also did a write-up about this very subject in the context of the pypackaging-native initiative. From a (somewhat-)recently-added cross-reference on that scipy-issue, I see that a bunch of familiar faces already discussed this in the wider PyData context, including pandas.

@mroeschke (Member)

FWIW numba seems to "select a threading layer if already available" instead of requiring one https://numba.pydata.org/numba-doc/latest/user/threading-layer.html#selecting-a-named-threading-layer

you don't want to parallelize at multiple levels of your code

Yeah, numba specifically detects and disallows this when jitting, which might be hard to do on our side

@WillAyd (Member, Author) commented May 9, 2023

As a concrete example, scipy has not allowed a hard dependence on OpenMP for various reasons, primarily related to the limitations of PyPI's distribution model. @rgommers was & is heavily involved in this, and also did a write-up about this very subject in the context of the pypackaging-native initiative. From a (somewhat-)recently-added cross-reference on that scipy-issue, I see that a bunch of familiar faces already discussed this in the wider PyData context, including pandas.

Wow, thanks, these are great resources. But I think the last link might be private? I get a 404 trying to access it

@h-vetinari (Contributor) commented May 9, 2023

But I think the last link might be private? I get a 404 trying to access

My bad, sorry, I'm not used to being a member of non-public GH orgs 😅

(i.e. when GH shows me an xref, I wrongly assume from habit that everyone else sees it too; in this case, the write-up I linked is more recent and more complete anyway)

@jreback (Contributor) commented May 10, 2023

I am -1 on randomly adding parallelism that is

  • not externally controllable (via keyword and env variable)
  • requires non-standard things (openmp is standard but not generally available)
  • defaults to anything but False

this needs comprehensive policies, otherwise you end up with multiple layers of parallelism, which is almost always bad

point is that this needs substantial testing and validation and tbh we may want to punt completely at the pandas level on this

@jbrockmendel (Member)

I'm open to a "needs more thought" argument, but on a lot of fronts we've done a pretty good job with optimization so performance gains are going to have to come from parallelism

@WillAyd (Member, Author) commented May 10, 2023

Yeah, sounds like applying this generally has some caveats. A conservative approach to start could be to add -Duse-omp=ON as a Meson option, but set the default to false. That would give people building from source an easy way to opt in, if they don't plan on using pandas in a multiprocessing environment
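Something like the following Meson sketch (the use-omp name is just the suggestion above, nothing final; a disabled feature option skips the dependency lookup entirely):

    # meson.options (hypothetical):
    option('use-omp', type: 'feature', value: 'disabled',
           description: 'Build Cython extensions with OpenMP support')

    # meson.build:
    openmp = dependency('openmp', required: get_option('use-omp'))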

@rgommers (Contributor)

+1 to @jreback's points. I'll note that @thomasjpfan put quite some thought into this topic and wrote up a design proposal for enabling parallelism across core PyData packages in a careful fashion: https://thomasjpfan.github.io/parallelism-python-libraries-design/ (also going to be a SciPy'23 talk I believe).

@WillAyd (Member, Author) commented May 10, 2023

Awesome links, thanks for sharing. I see in the section https://thomasjpfan.github.io/parallelism-python-libraries-design/#proposal-2 that there are three things being proposed:

  1. Detect combination of libgomp + fork call
  2. Detect when llvm openmp + intel openmp are loaded together
  3. Use an OpenMP PyPI wheel

2 seems feasible (see the sketch below), and for 3 it looks like the precursor PR is merged, so I'm guessing it's just a matter of time (I see Intel OpenMP wheels already, but likely not what the proposal refers to).

For number 1 I'm not sure if that would be possible for pandas to tackle, especially if we just rely on the Cython-generated OpenMP code. Guessing that is something we may want Cython to tackle?
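For item 2, a rough sketch of what detection could look like (Linux-only and hypothetical; it just scans the process's loaded shared objects for more than one OpenMP runtime):

    import warnings

    def loaded_openmp_runtimes():
        # /proc/self/maps lists every shared object mapped into this process.
        found = set()
        with open("/proc/self/maps") as f:
            for line in f:
                for runtime in ("libgomp", "libiomp", "libomp"):
                    if runtime in line:
                        found.add(runtime)
        return found

    if len(loaded_openmp_runtimes()) > 1:
        warnings.warn("multiple OpenMP runtimes loaded in this process")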

@WillAyd (Member, Author) commented May 16, 2023

Let's close this - I don't plan on taking any more immediate action. I think this is a good reference for any future discussion of the topic

@WillAyd closed this May 16, 2023
@thomasjpfan (Contributor) commented May 17, 2023

Looking over the concerns from this PR, I can see the following path forward for pandas + OpenMP:

  1. Let the default be single threaded.
  2. Pass a keyword argument from the Python layer all the way down to Cython's prange (see the sketch after this list). This makes it harder to accidentally engage in nested parallelism in pandas code.
  3. Have OpenMP be configurable with a compile time flag. This allows Pyodide to build with OpenMP off and conda to build with it on. As noted in OpenMP Groupby any/all #53149 (comment), conda has a better story around packaging OpenMP.
  4. Be ready for issues. Even without nested parallelism, there will be performance issues when there are multiple threadpools. Pandas will need to have some documentation warning users about configuring threads >= 2.
  5. If pandas wants OpenMP support in the near-term, be prepared to ship a PyPI wheel with an OpenMP binary.
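A sketch of item 2 (hypothetical signature, not a proposed API): the thread count flows from a Python-level keyword down to prange, defaulting to single-threaded.

    from cython.parallel import prange

    def group_reduce(const double[:] values, int num_threads=1):
        # num_threads=1 keeps the default single-threaded; callers
        # opt in to parallelism explicitly.
        cdef Py_ssize_t i
        cdef double total = 0.0
        with nogil:
            for i in prange(values.shape[0], num_threads=num_threads):
                total += values[i]
        return total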

Alternative path forward

An alternative to OpenMP is to go with SciPy's approach: they developed their own pthreads-based solution. pthreads has the benefit of being a POSIX standard, and this avoids some of the issues with OpenMP.

@jbrockmendel (Member)

@thomasjpfan can you comment on the pros/cons of the scipy solution, particularly for those of us with limited c++ experience

@thomasjpfan (Contributor)

For reference, SciPy has discussed the topic of adding OpenMP and decided on pthreads: scipy/scipy#10239. At a high level, the pros and cons unique to SciPy's pthreads solution:

Pros

  • Portable: pthreads is a POSIX standard
  • None of the OpenMP headaches when libraries load different versions of OpenMP
  • Easier to build. (No need to vendor OpenMP for wheels)

Cons
