
PERF: releasing the GIL, #8882 #10199

Merged
merged 2 commits into from
Jun 30, 2015
Conversation

@jreback (Contributor) commented May 22, 2015

closes #8882
closes #10139

This is now implemented for all groupbys, including the factorization part (which is actually a big part of the time).

In [2]: N = 1000000

In [3]: ngroups = 1000

In [4]: np.random.seed(1234)

In [5]: df = DataFrame({'key' : np.random.randint(0,ngroups,size=N),
   ...:                 'data' : np.random.randn(N) })
def f():
    df.groupby('key')['data'].sum()

# run consecutively
def g2():
    for i in range(2):
        f()
def g4():
    for i in range(4):
        f()
def g8():
    for i in range(8):
        f()

# run in parallel
@test_parallel(num_threads=2)
def pg2():
    f()

@test_parallel(num_threads=4)
def pg4():
    f()

@test_parallel(num_threads=8)
def pg8():
    f()
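The `test_parallel` decorator above comes from pandas' benchmarking utilities. A minimal stdlib-only sketch of what it does (an assumption for illustration, not pandas' actual implementation) is: run the wrapped function once per thread, concurrently, so GIL-releasing code can show real speedups.

```python
import threading

def test_parallel(num_threads=2):
    """Decorator: run the wrapped function once in each of `num_threads`
    threads, starting them together and waiting for all to finish."""
    def wrapper(fn):
        def inner(*args, **kwargs):
            threads = [
                threading.Thread(target=fn, args=args, kwargs=kwargs)
                for _ in range(num_threads)
            ]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
        return inner
    return wrapper
```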

So we're seeing a nice scaling curve as additional cores are used (compared to some of my previous posts).

In [19]: df
Out[19]: 
   after  before   speedup
2   27.3    52.1  1.908425
4   44.8   105.0  2.343750
8   83.2   213.0  2.560096

@jreback jreback added the Performance Memory or execution speed performance label May 22, 2015
@jreback jreback added this to the 0.17.0 milestone May 22, 2015
@mrocklin (Contributor):
What was the deal with NA and nan? These were Python objects?

@jreback (Contributor, Author) commented May 22, 2015

Apparently, doing something like

cdef ndarray[double_t] foo

foo = np.empty(10, dtype='float64')

foo[0] = np.nan

actually assigns a Python object (and does inference and such), so you MUST hold the GIL; but casting to a double works, as it is then a C-level assignment:

cdef double nan = <double> np.nan
with nogil:
    foo[0] = nan

@ahojnnes:
You can reuse NPY_NAN cross-platform, e.g., https://github.com/scikit-image/scikit-image/blob/master/skimage/feature/_texture.pyx#L11

@jreback (Contributor, Author) commented May 22, 2015

@ahojnnes thanks, I changed it.

@jreback (Contributor, Author) commented May 28, 2015

I updated this. It works for all groupbys now, and it gives a general 5-10% speedup on most operations (even single-threaded), as there is slightly less generated code (under nogil Cython pretty much generates straight C code).

We still need to hold the GIL for the .resize operation, but this happens in fairly limited circumstances (resizing is a function of the number of uniques, for example), so the cost is pretty low.

There IS a solution where I could avoid numpy resize entirely, but it's quite a bit more C code and IMHO not really worth it.
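The cost argument above (resizes scale with the number of uniques, and doubling makes them rare) can be sketched in plain Python/NumPy. This is an illustration of the amortized-doubling pattern, not pandas' actual code; the function name is hypothetical.

```python
import numpy as np

def append_with_doubling(buf, count, value):
    """Store `value` at logical index `count`, doubling `buf` when full.

    With doubling, only O(log n) resizes happen for n appends, so briefly
    re-acquiring the GIL around each resize stays cheap overall.
    Returns the (possibly reallocated) buffer.
    """
    if count == len(buf):
        buf = np.resize(buf, 2 * len(buf))  # the only step that needs the GIL
    buf[count] = value
    return buf
```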

@mrocklin (Contributor):
Cool. I'll try to run a few different groupby operations with dask.dataframe sometime early next week and report back. I'm curious which operations needed the GIL released: was there a core set of functionality, or was it a large volume of trivial work?

@jreback (Contributor, Author) commented May 29, 2015

TL;DR:

There are lots of 'boilerplate' changes in generated.pyx, which contains basically all of the algorithms for the groupby ops themselves (e.g. sum, mean). These were straightforward insertions of with nogil; I just needed to avoid typing the object ones (which for the most part was easy, though a bit tricky on some non-groupby routines, e.g. see #10213).

However, the big speedups came once I fixed the hashtable.pyx factorizer (e.g. the get_labels routine). The trivial solution actually produces a very odd perf issue (see here), which is a bug / documentation issue.

In effect, if you use a nogil function that happens to have a with gil block WITHIN it, the function is way slower than if you inline the exact same code.

I 'fixed' this issue by adding some additional C structures to hold data (basically I changed the low-level implementation); these can be passed around easily, though the downside is that they are a little 'messy' in the code department.

Note to @shoyer: I didn't actually bypass the realloc of memory (e.g. .resize) of numpy arrays, though I do have a solution for that (basically manage a memory block myself, then turn it into a numpy array at the very end, in to_array; easy enough, but I was running into some compile issues, so I dropped it in the end).
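The bypass described here (manage a memory block yourself, then wrap it as a numpy array only at the end) might look roughly like the following ctypes sketch. The class and its details are assumptions for illustration, not the actual implementation; `to_array` mirrors the name mentioned above.

```python
import ctypes
import numpy as np

class GrowableBuffer:
    """Grow a raw C buffer of doubles ourselves (no numpy realloc, so
    nothing here would require re-acquiring the GIL in C), and only
    materialize an ndarray at the very end."""

    def __init__(self, capacity=8):
        self._buf = (ctypes.c_double * capacity)()
        self._n = 0

    def append(self, value):
        if self._n == len(self._buf):
            # manual doubling realloc: allocate a bigger block and copy
            new = (ctypes.c_double * (2 * len(self._buf)))()
            new[: self._n] = self._buf[:]
            self._buf = new
        self._buf[self._n] = value
        self._n += 1

    def to_array(self):
        # wrap/copy into a numpy array only once, at the end
        return np.frombuffer(self._buf, count=self._n).copy()
```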

@jreback force-pushed the gil branch 5 times, most recently from b6eea4b to 4e2acd1 on June 3, 2015
@jreback force-pushed the gil branch 2 times, most recently from d1a08d7 to bb78fef on June 12, 2015
@mrocklin (Contributor):
I think that this is awesome and am afraid of it going stale. Is there something blocking it from being merged into master?

@shoyer (Member) commented Jun 17, 2015

@mrocklin We were waiting to get the 0.16.2 release out first, which happened last Friday. At this point I don't think there are any blockers.

@jreback (Contributor, Author) commented Jun 17, 2015

Nothing blocking. I am going to update and see if I can hit some of #10213.

I have to do a bit of odd code generation because, for example, you want to simultaneously support nogil (for numeric dtypes) and the GIL (for object dtype), but nogil is a context manager (and NOT a function, which has some odd perf characteristics). IOW, you have to worry about formatting, which is a bit annoying.
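The formatting problem (``with nogil:`` opens a block, so a templated body must be re-indented only when the block is used) can be shown with a toy renderer. This is a hypothetical sketch, not pandas' actual code generator:

```python
def render(body, use_nogil):
    """Emit `body` wrapped in a `with nogil:` block (re-indented), or
    emit it unchanged for object dtypes that must keep the GIL."""
    if not use_nogil:
        return body
    indented = "\n".join("    " + line for line in body.splitlines())
    return "with nogil:\n" + indented
```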

@jreback (Contributor, Author) commented Jun 17, 2015

Actually, I do need to add a doc note to the whatsnew (and maybe a small mention in the enhancingperf section as well).

@jreback jreback self-assigned this Jun 17, 2015
@mrocklin (Contributor):
Hrm, yes, unfortunate that you can't manipulate the lock directly

@mrocklin (Contributor):
Thought I'd caress this PR with a gentle ping.

ping

@jreback (Contributor, Author) commented Jun 26, 2015

is that a euphemism for harassment?

@mrocklin (Contributor):
Kind and loving harassment, yes.

@jreback (Contributor, Author) commented Jun 26, 2015

Any commentary @jorisvandenbossche @shoyer @cpcloud @TomAugspurger?

Going to have a followup for more GIL stuff. It's actually tricky making sure it's doing (and you are measuring) exactly the right stuff.


We are releasing the global-interpreter-lock (GIL) on some cython operations.
This will allow other threads to run simultaneously during computation, potentially allowing performance improvements
from multi-threading. Notably ``groupby`` and some indexing operations are a benefit to this. (:issue:`8882`)
Review comment (Member):
i'd say "benefit from this"

# compile time specialization of the fused types
# as the cross-product is generated, but we cannot assign float->int
# the types that don't pass are pruned
if (vector_data is Int64VectorData and sixty_four_bit_scalar is int64_t) or (
@jreback (Author) commented on this hunk:
@cpcloud I got this to work after puzzling over what Cython does with multiple fused types. It creates the cross-product, but you generally need only certain specializations; it handles this beautifully as a compile-time prune, so there is no perf hit.
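A condensed sketch of the pattern being discussed, with hypothetical type and function names following the hunk above (`append_data` stands in for the real per-type helper):

```cython
ctypedef fused vector_data:
    Int64VectorData
    Float64VectorData

ctypedef fused sixty_four_bit_scalar:
    int64_t
    float64_t

cdef void append(vector_data *d, sixty_four_bit_scalar v) nogil:
    # Cython generates the full cross-product of specializations, but the
    # mismatched pairs (e.g. Int64VectorData with float64_t) fail this
    # compile-time check and are pruned, so there is no runtime cost.
    if ((vector_data is Int64VectorData and sixty_four_bit_scalar is int64_t) or
            (vector_data is Float64VectorData and sixty_four_bit_scalar is float64_t)):
        append_data(d, v)
```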

jreback added a commit that referenced this pull request Jun 30, 2015
@jreback jreback merged commit 16a44ad into pandas-dev:master Jun 30, 2015
@sinhrks (Member) commented Jun 30, 2015

Great job!!

@mrocklin (Contributor):
Woohoo!

Pulls from master, checks benchmarks.

@jorisvandenbossche (Member):
@mrocklin Can we already look forward to a new awesome blogpost? :-)

@mrocklin (Contributor):
Best I have to offer at the moment are the slowly growing dask.dataframe docs

@jorisvandenbossche (Member):
I should definitely check it out once I find some time!

@scari (Contributor) commented Jun 30, 2015

Wow! Great job! 👍🏻

jreback pushed a commit that referenced this pull request Jul 15, 2016
- [x] tests added / passed
- [x] passes ``git diff upstream/master | flake8 --diff``

Rebased version of #10229, which was [actually not](https://github.com//pull/10229#issuecomment-131470116) fixed by #10199. Nothing particularly relevant; just wanted to delete this branch locally and noticed it still applies: you'll judge what to do with it.

Author: Pietro Battiston <me@pietrobattiston.it>

Closes #13594 from toobaz/fix_checkunique and squashes the following commits:

a63bd12 [Pietro Battiston] CLN: Initialization coincides with mapping, hence with uniqueness check
Labels
Performance Memory or execution speed performance
Successfully merging this pull request may close these issues.

Release the GIL in Cython code
8 participants