New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: High time consumption by PyUFunc_GetPyValues in ufunc_object.c #3686

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
4 participants
@arinkverma
Contributor

arinkverma commented Sep 4, 2013

For every single operation calls, numpy has to extract value of buffersize, errormask and name to pack and build error object. These two functions, _extract_pyvals and PyUFunc_GetPyValues together use >12% of time. it take useless time because all this time is spent on to look up entries in a python dict, extract them, and convert them into C level data. Not once but doing that again and again on every operation.

Here, I stash value to thread using TLS api and also avoiding building of errorObj again for same consecutive function.

Improvement

Before

In [1]: import numpy as np

In [2]: x = np.asarray(1)

In [3]: timeit x+x
1000000 loops, best of 3: 1.67 us per loop

In [4]: x = np.asarray([1])

In [5]: timeit x+x
1000000 loops, best of 3: 1.76 us per loop

After

In [1]: import numpy as np

In [2]: x = np.asarray(1)

In [3]: timeit x+x
1000000 loops, best of 3: 1.56 us per loop

In [4]: x = np.asarray([1])

In [5]: timeit x+x
1000000 loops, best of 3: 1.69 us per loop
@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Sep 6, 2013

Member

Looks like a lot of tests are failing because error flags are not being set properly. I'd be suspicious of caching the status word, or maybe something needs to be initialized?

Member

charris commented Sep 6, 2013

Looks like a lot of tests are failing because error flags are not being set properly. I'd be suspicious of caching the status word, or maybe something needs to be initialized?

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Sep 6, 2013

Member

Lots of style nitpicks, but lets figure the errors out first. What exactly are you caching, how do you know things don't change while things are cached, and how do you make the cache thread safe?

Member

charris commented Sep 6, 2013

Lots of style nitpicks, but lets figure the errors out first. What exactly are you caching, how do you know things don't change while things are cached, and how do you make the cache thread safe?

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 6, 2013

Contributor

@charris I want to cache *errobj = Py_BuildValue("NO", PyBytes_FromString(name), retval);. This alone cost more than 6% of time. When things get change ufunc_geterr and ufunc_seterr get called, so I have to update caching then.

I am using Thread Local Storage (TLS) API , from pythread.h over which PyList is build.

Contributor

arinkverma commented Sep 6, 2013

@charris I want to cache *errobj = Py_BuildValue("NO", PyBytes_FromString(name), retval);. This alone cost more than 6% of time. When things get change ufunc_geterr and ufunc_seterr get called, so I have to update caching then.

I am using Thread Local Storage (TLS) API , from pythread.h over which PyList is build.

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 6, 2013

Contributor

now, Its showing only error for python 2.6

FAIL: Ticket #955
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.6/lib/python2.6/site-packages/numpy/core/tests/test_regression.py", line 1175, in test_errobj_reference_leak
    assert_(n_before >= n_after, (n_before, n_after))
  File "/home/travis/virtualenv/python2.6/lib/python2.6/site-packages/numpy/testing/utils.py", line 44, in assert_
    raise AssertionError(msg)
AssertionError: (1546921, 1546922)

Any clue, why only python 2.6..showing this error?

Contributor

arinkverma commented Sep 6, 2013

now, Its showing only error for python 2.6

FAIL: Ticket #955
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.6/lib/python2.6/site-packages/numpy/core/tests/test_regression.py", line 1175, in test_errobj_reference_leak
    assert_(n_before >= n_after, (n_before, n_after))
  File "/home/travis/virtualenv/python2.6/lib/python2.6/site-packages/numpy/testing/utils.py", line 44, in assert_
    raise AssertionError(msg)
AssertionError: (1546921, 1546922)

Any clue, why only python 2.6..showing this error?

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Sep 6, 2013

Member

But you have static global storage, how does that work?

Member

charris commented Sep 6, 2013

But you have static global storage, how does that work?

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Sep 6, 2013

Member

I have no idea what could have changed between 2.6 and 2.7. Lets ping the travisbot and see how repeatable it is.

Member

charris commented Sep 6, 2013

I have no idea what could have changed between 2.6 and 2.7. Lets ping the travisbot and see how repeatable it is.

@charris charris closed this Sep 6, 2013

@charris charris reopened this Sep 6, 2013

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Sep 6, 2013

Member

Same thing, looks pretty repeatable.

Member

charris commented Sep 6, 2013

Same thing, looks pretty repeatable.

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Sep 6, 2013

Member

There seem to have been a number of threading fixes/changes in 2.7 but I can't say that any looked suspicious at first glance.

Member

charris commented Sep 6, 2013

There seem to have been a number of threading fixes/changes in 2.7 but I can't say that any looked suspicious at first glance.

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 6, 2013

Contributor

is this faster than just replacing Py_BuildValue with PyTuple_Pack?

PyObject * str = PyBytes_FromString(name);
*errobj = PyTuple_Pack(2, str, retval);
Py_DECREF(str);
Contributor

juliantaylor commented Sep 6, 2013

is this faster than just replacing Py_BuildValue with PyTuple_Pack?

PyObject * str = PyBytes_FromString(name);
*errobj = PyTuple_Pack(2, str, retval);
Py_DECREF(str);
@njsmith

This comment has been minimized.

Show comment
Hide comment
@njsmith

njsmith Sep 6, 2013

Member

I don't understand why the DIRTY key exists. Can't we just do the error checking up front when the errobj is set?

I also don't understand all the non-TLS global variables. Those can't possibly be safe, right?

The TLS key should be initialized somehow at module load time; the current system of doing them on demand is technically not thread-safe. (Probably the cleanest way would be to define a little function that does this, and then call that function from the umath module initialization function.)

Member

njsmith commented Sep 6, 2013

I don't understand why the DIRTY key exists. Can't we just do the error checking up front when the errobj is set?

I also don't understand all the non-TLS global variables. Those can't possibly be safe, right?

The TLS key should be initialized somehow at module load time; the current system of doing them on demand is technically not thread-safe. (Probably the cleanest way would be to define a little function that does this, and then call that function from the umath module initialization function.)

@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 6, 2013

Contributor

I did some quick tests and replacing Py_BuildValue with a direct tuple creation is double as fast. Maybe this is already enough to make further optimizations not worthwhile.

But if further optimizations are needed I would go for these first:

  • numpy seems to control what we put into the errobj dict, so we can could replace the PyInt_FromLong with faster non errorchecking variants
  • possibly the ufunc could store its name as a PyString saving the conversion in the extract function, or numpy places an appropriate object into the errobj dict to begin with.
Contributor

juliantaylor commented Sep 6, 2013

I did some quick tests and replacing Py_BuildValue with a direct tuple creation is double as fast. Maybe this is already enough to make further optimizations not worthwhile.

But if further optimizations are needed I would go for these first:

  • numpy seems to control what we put into the errobj dict, so we can could replace the PyInt_FromLong with faster non errorchecking variants
  • possibly the ufunc could store its name as a PyString saving the conversion in the extract function, or numpy places an appropriate object into the errobj dict to begin with.
@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 8, 2013

Contributor

Now from 12%, time of PyUFunc_GetPyValues drop to 5%.

After

In [1]: import numpy as np

In [2]: x = np.asarray(1)

In [3]: timeit x+x
1000000 loops, best of 3: 1.65 us per loop

In [4]: timeit x+x
1000000 loops, best of 3: 1.64 us per loop

Before

In [1]: import numpy as np

In [2]: x = np.asarray(1)

In [3]: timeit x+x
1000000 loops, best of 3: 1.71 us per loop

In [4]: timeit x+x
1000000 loops, best of 3: 1.71 us per loop

Contributor

arinkverma commented Sep 8, 2013

Now from 12%, time of PyUFunc_GetPyValues drop to 5%.

After

In [1]: import numpy as np

In [2]: x = np.asarray(1)

In [3]: timeit x+x
1000000 loops, best of 3: 1.65 us per loop

In [4]: timeit x+x
1000000 loops, best of 3: 1.64 us per loop

Before

In [1]: import numpy as np

In [2]: x = np.asarray(1)

In [3]: timeit x+x
1000000 loops, best of 3: 1.71 us per loop

In [4]: timeit x+x
1000000 loops, best of 3: 1.71 us per loop

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 8, 2013

Contributor

Strange, when I run test at local, it pass all

arink@arink-pc:~/workspace/gsoc/testing$ python ../numpy/tools/test-installed-numpy.py --mode=full
Running unit tests for numpy
NumPy version 1.9.0.dev-7ae0206
NumPy is installed in /home/arink/.local/lib/python2.7/site-packages/numpy
Python version 2.7.4 (default, Apr 19 2013, 18:32:33) [GCC 4.7.3]
nose version 1.3.0
..................S.................................................................................................................................................................................................................................................................................................................................S..........................................................................................................................................S................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K...............................................................................................................................................................................................SSS.........................................................................................................................................................................................................................................................................................................................................................K...........................................................................................................K......................K.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................S...................................................................................................................................................................................................................................................................SS..SS.....................................................................................................SSSSSSSS....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K....................................................
----------------------------------------------------------------------
Ran 5281 tests in 97.180s
OK (KNOWNFAIL=5, SKIP=19)

How can I run extact travis test on my local machine.

Contributor

arinkverma commented Sep 8, 2013

Strange, when I run test at local, it pass all

arink@arink-pc:~/workspace/gsoc/testing$ python ../numpy/tools/test-installed-numpy.py --mode=full
Running unit tests for numpy
NumPy version 1.9.0.dev-7ae0206
NumPy is installed in /home/arink/.local/lib/python2.7/site-packages/numpy
Python version 2.7.4 (default, Apr 19 2013, 18:32:33) [GCC 4.7.3]
nose version 1.3.0
..................S.................................................................................................................................................................................................................................................................................................................................S..........................................................................................................................................S................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K...............................................................................................................................................................................................SSS.........................................................................................................................................................................................................................................................................................................................................................K...........................................................................................................K......................K.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................S...................................................................................................................................................................................................................................................................SS..SS.....................................................................................................SSSSSSSS....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................K....................................................
----------------------------------------------------------------------
Ran 5281 tests in 97.180s
OK (KNOWNFAIL=5, SKIP=19)

How can I run extact travis test on my local machine.

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 8, 2013

Contributor

it crashes on my machine on import too.
I'm still not convinced this is better than my suggestion which is much simpler. (thread) global state is always dodgy and should be avoided if possible.

Contributor

juliantaylor commented Sep 8, 2013

it crashes on my machine on import too.
I'm still not convinced this is better than my suggestion which is much simpler. (thread) global state is always dodgy and should be avoided if possible.

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 8, 2013

Contributor

@juliantaylor TLS seems complicated and risky to work with but I was trying to make it. According to call-graph, caching buildvalue drop time consumption to 4%. Whereas with PyTuple_Pack I got 3% of improvement in _extract_pyvals. It would be nice if we can optimized it with both direction.

http://www.arinkverma.in/2013/09/scope-for-improvement-in-extractpyvals.html

Contributor

arinkverma commented Sep 8, 2013

@juliantaylor TLS seems complicated and risky to work with but I was trying to make it. According to call-graph, caching buildvalue drop time consumption to 4%. Whereas with PyTuple_Pack I got 3% of improvement in _extract_pyvals. It would be nice if we can optimized it with both direction.

http://www.arinkverma.in/2013/09/scope-for-improvement-in-extractpyvals.html

@njsmith

This comment has been minimized.

Show comment
Hide comment
@njsmith

njsmith Sep 9, 2013

Member

@jtaylor: I'm not sure if the patch as it currently stands does this, but
the original observation -- which still seems reasonable to me -- was that
if you look at PyUFunc_GetPyValues/_extract_pyvals, there's a surprising
amount of futzing around (finding the per-thread dict, doing a dict lookup
(and allocating a tuple if no entry is found), and then unpacking the
entries in that tuple into C objects, and then allocating a new errobj
tuple, ...), of which essentially none of which actually has to happen at
ufunc call time. So the suggestion was to just calculate the appropriate C
values up front and stash them as C values in a TLS key, and then the
per-ufunc happy path would just be something like

_ufunc_settings_struct * settings = get_tls_value();
bufsize = settings->bufsize;
errmask = settings->errmask;
errobj = settings->errobj;
Py_INCREF(errobj);

No error handling, allocations, etc. needed. Does that make sense as a
sensible way to do things?

I guess looking more closely now we'd need a little more reorganization
than I implied above, because at some point someone kluged things together
so that 'errobj' is no longer just the user-level error object, but instead
a twople of (current ufunc name as Python string, user-level error object).
The result is that we have to allocate multiple Python objects (the tuple
and the ufunc name) on every call, and almost always throw them away
unused; instead we should just pass the user-level error object as is, and
pass the ufunc name as a char*. This would be necessary to fully eliminate
error-handling allocations from the fast path as per the above.

On Mon, Sep 9, 2013 at 12:06 AM, Arink Verma notifications@github.comwrote:

@juliantaylor https://github.com/juliantaylor TLS seems complicated and
risky to work with but I was trying to make it. According to call-graph,
caching buildvalue drop time consumption to 4%. Whereas with PyTuple_Pack I
got 3% of improvement in _extract_pyvals. It would be nice if we can
optimized it with both direction.

http://www.arinkverma.in/2013/09/scope-for-improvement-in-extractpyvals.html


Reply to this email directly or view it on GitHubhttps://github.com/numpy/numpy/pull/3686#issuecomment-24031676
.

Member

njsmith commented Sep 9, 2013

@jtaylor: I'm not sure if the patch as it currently stands does this, but
the original observation -- which still seems reasonable to me -- was that
if you look at PyUFunc_GetPyValues/_extract_pyvals, there's a surprising
amount of futzing around (finding the per-thread dict, doing a dict lookup
(and allocating a tuple if no entry is found), and then unpacking the
entries in that tuple into C objects, and then allocating a new errobj
tuple, ...), of which essentially none of which actually has to happen at
ufunc call time. So the suggestion was to just calculate the appropriate C
values up front and stash them as C values in a TLS key, and then the
per-ufunc happy path would just be something like

_ufunc_settings_struct * settings = get_tls_value();
bufsize = settings->bufsize;
errmask = settings->errmask;
errobj = settings->errobj;
Py_INCREF(errobj);

No error handling, allocations, etc. needed. Does that make sense as a
sensible way to do things?

I guess looking more closely now we'd need a little more reorganization
than I implied above, because at some point someone kluged things together
so that 'errobj' is no longer just the user-level error object, but instead
a twople of (current ufunc name as Python string, user-level error object).
The result is that we have to allocate multiple Python objects (the tuple
and the ufunc name) on every call, and almost always throw them away
unused; instead we should just pass the user-level error object as is, and
pass the ufunc name as a char*. This would be necessary to fully eliminate
error-handling allocations from the fast path as per the above.

On Mon, Sep 9, 2013 at 12:06 AM, Arink Verma notifications@github.comwrote:

@juliantaylor https://github.com/juliantaylor TLS seems complicated and
risky to work with but I was trying to make it. According to call-graph,
caching buildvalue drop time consumption to 4%. Whereas with PyTuple_Pack I
got 3% of improvement in _extract_pyvals. It would be nice if we can
optimized it with both direction.

http://www.arinkverma.in/2013/09/scope-for-improvement-in-extractpyvals.html


Reply to this email directly or view it on GitHubhttps://github.com/numpy/numpy/pull/3686#issuecomment-24031676
.

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 11, 2013

Contributor

Instead of Py_buildvalue, I have used PyErrObject as

typedef struct {
    PyObject_HEAD
    char *name;
    PyObject* retval;
} PyErrObject;

Time consumption of _extract_pyvals drop to 2% from 9.3%. See link for call-graph
http://www.arinkverma.in/2013/09/improvement-in-extractpyvals.html

Contributor

arinkverma commented Sep 11, 2013

Instead of Py_buildvalue, I have used PyErrObject as

typedef struct {
    PyObject_HEAD
    char *name;
    PyObject* retval;
} PyErrObject;

Time consumption of _extract_pyvals drop to 2% from 9.3%. See link for call-graph
http://www.arinkverma.in/2013/09/improvement-in-extractpyvals.html

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 11, 2013

Contributor

can the creation of the errobject be deferred to when an error actually occurs?

Contributor

juliantaylor commented Sep 11, 2013

can the creation of the errobject be deferred to when an error actually occurs?

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 11, 2013

Contributor

seems to work, this kludgy patch results in a good improvement:
https://gist.github.com/juliantaylor/6521016
would need to be cleaned up, generalized (could be used in generalizedfunction and the reduction ufuncts too) + the api break on checkfperrr removed.

Contributor

juliantaylor commented Sep 11, 2013

seems to work, this kludgy patch results in a good improvement:
https://gist.github.com/juliantaylor/6521016
would need to be cleaned up, generalized (could be used in generalizedfunction and the reduction ufuncts too) + the api break on checkfperrr removed.

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 11, 2013

Contributor

Buffersize seems to unrelated to errobj creation. How abt using pyufunc_GetPyValues for buffersize, and for rest make different function pyufunc_GetPyErrValues. Also at many places code is using buffersize so, its place be unchanged and only errvalues should be deferred to when an error actually occurs.

Contributor

arinkverma commented Sep 11, 2013

Buffersize seems to unrelated to errobj creation. How abt using pyufunc_GetPyValues for buffersize, and for rest make different function pyufunc_GetPyErrValues. Also at many places code is using buffersize so, its place be unchanged and only errvalues should be deferred to when an error actually occurs.

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 11, 2013

Contributor

yes buffersize must still be handled, I overlooked that.

shouldn't the buffer only be used by the iterator? if so one could move the retrieving of the buffersize into the nditer construction so it only queries the thread local storage if necessary. Though this might be premature optimization.

Contributor

juliantaylor commented Sep 11, 2013

yes buffersize must still be handled, I overlooked that.

shouldn't the buffer only be used by the iterator? if so one could move the retrieving of the buffersize into the nditer construction so it only queries the thread local storage if necessary. Though this might be premature optimization.

@njsmith

This comment has been minimized.

Show comment
Hide comment
@njsmith

njsmith Sep 11, 2013

Member

Accessing TLS can be extremely cheap. On Linux and OSX, on recent (3.x)
versions, Python's TLS functions just wrap the native pthread_key_create
and friends, which at least on Linux I know are extremely fast. And on
win32 even 2.7 uses the native TLS support, which I assume is also
reasonably optimized. (On older Pythons using pthreads, there's some
horrible by-hand implementation of TLS that involves taking and releasing a
mutex, so that could be more expensive, I don't know.) So I'm not sure
whether it's worth spending a lot of time on avoiding touching TLS entirely.

In principle it's nice to know the error mask before entering the ufunc
loop too, because if it's empty as a further optimization we could skip the
error checking entirely. (The most expensive part is clearing the FP error
flags, which happens before the loop.)

On Wed, Sep 11, 2013 at 3:48 PM, Julian Taylor notifications@github.comwrote:

yes buffersize must still be handled, I overlooked that.

shouldn't the buffer only be used by the iterator? if so one could move
the retrieving of the buffersize into the nditer construction so it only
queries the thread local storage if necessary. Though this might be
premature optimization.


Reply to this email directly or view it on GitHubhttps://github.com/numpy/numpy/pull/3686#issuecomment-24246335
.

Member

njsmith commented Sep 11, 2013

Accessing TLS can be extremely cheap. On Linux and OSX, on recent (3.x)
versions, Python's TLS functions just wrap the native pthread_key_create
and friends, which at least on Linux I know are extremely fast. And on
win32 even 2.7 uses the native TLS support, which I assume is also
reasonably optimized. (On older Pythons using pthreads, there's some
horrible by-hand implementation of TLS that involves taking and releasing a
mutex, so that could be more expensive, I don't know.) So I'm not sure
whether it's worth spending a lot of time on avoiding touching TLS entirely.

In principle it's nice to know the error mask before entering the ufunc
loop too, because if it's empty as a further optimization we could skip the
error checking entirely. (The most expensive part is clearing the FP error
flags, which happens before the loop.)

On Wed, Sep 11, 2013 at 3:48 PM, Julian Taylor notifications@github.comwrote:

yes buffersize must still be handled, I overlooked that.

shouldn't the buffer only be used by the iterator? if so one could move
the retrieving of the buffersize into the nditer construction so it only
queries the thread local storage if necessary. Though this might be
premature optimization.


Reply to this email directly or view it on GitHubhttps://github.com/numpy/numpy/pull/3686#issuecomment-24246335
.

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 11, 2013

Contributor

@juliantaylor Patch has given significant improvement. Now, pyufunc_GetPyValues cost 3.4% of time. Call-graph http://goo.gl/xnNGYl

@njsmith
I tired TLS pythread PyThread_get_key_value api. After profiling it turned out to be as expensive as pydict. I am using python 2.7 on 32 linux.
So we can avoid ,clearing the FP error flags, in case no error occurs last time.

Contributor

arinkverma commented Sep 11, 2013

@juliantaylor Patch has given significant improvement. Now, pyufunc_GetPyValues cost 3.4% of time. Call-graph http://goo.gl/xnNGYl

@njsmith
I tired TLS pythread PyThread_get_key_value api. After profiling it turned out to be as expensive as pydict. I am using python 2.7 on 32 linux.
So we can avoid ,clearing the FP error flags, in case no error occurs last time.

@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 13, 2013

Contributor

Now, it building error object only when error occurs. And also checking errormask before actually check for error.
Its turn out to be great improvement in speed.

Contributor

arinkverma commented Sep 13, 2013

Now, it building error object only when error occurs. And also checking errormask before actually check for error.
Its turn out to be great improvement in speed.

@njsmith

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@njsmith

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@njsmith

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 13, 2013

Contributor

great, it indeed speeds things up nicely.

Contributor

juliantaylor commented Sep 13, 2013

great, it indeed speeds things up nicely.

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 14, 2013

Contributor

make_error_object(PyObject* extobj, char * ufunc_name, int errormask, int *first_error) created to reduce code duplication.

Contributor

arinkverma commented Sep 14, 2013

make_error_object(PyObject* extobj, char * ufunc_name, int errormask, int *first_error) created to reduce code duplication.

@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 15, 2013

Contributor

What else we can do here, to get this in?

Contributor

arinkverma commented Sep 15, 2013

What else we can do here, to get this in?

@charris

This comment has been minimized.

Show comment
Hide comment
@charris

charris Sep 16, 2013

Member

@juliantaylor @njsmith Is this ready? It is getting close to the end of gsoc and Arink needs to show code.

Member

charris commented Sep 16, 2013

@juliantaylor @njsmith Is this ready? It is getting close to the end of gsoc and Arink needs to show code.

@njsmith

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@@ -1715,6 +1721,36 @@ static int get_ufunc_arguments(PyUFuncObject *ufunc,
}
}
static int
_check_ufunc_err(int errmask, PyObject *extobj, char* ufunc_name, int *first) {

This comment has been minimized.

@arinkverma

arinkverma Sep 16, 2013

Contributor

@njsmith As now errobj build only if error occurred, so is it necessary to, pass ufunc_name as separate arguments ?
Now, *errobj = Py_BuildValue("NO", PyBytes_FromString(name), retval); doesn't seems to create much delay.

As suggested by @juliantaylor the flow is

extract extobj => call _check_ufunc_err => if (!okay) { build extract pyvals and build errobj }
@arinkverma

arinkverma Sep 16, 2013

Contributor

@njsmith As now errobj build only if error occurred, so is it necessary to, pass ufunc_name as separate arguments ?
Now, *errobj = Py_BuildValue("NO", PyBytes_FromString(name), retval); doesn't seems to create much delay.

As suggested by @juliantaylor the flow is

extract extobj => call _check_ufunc_err => if (!okay) { build extract pyvals and build errobj }
@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 19, 2013

Contributor

@juliantaylor I would like to have your opinion here.

Contributor

arinkverma commented Sep 19, 2013

@juliantaylor I would like to have your opinion here.

@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
@juliantaylor

View changes

Show outdated Hide outdated numpy/core/src/umath/ufunc_object.c
Py_XDECREF(errobj);
return -1;
}
}

This comment has been minimized.

@juliantaylor

juliantaylor Sep 19, 2013

Contributor

I think this part of the code would be cleaner if it where:

if (extobj == NULL)
   extobj = Get_ExtObjFromGlobal(...)
_extract_pyvals(extobj);

but I think this can be done in another refactoring PR, all this cleaning up gets a bit in the way of the improvement we are trying to add.

@juliantaylor

juliantaylor Sep 19, 2013

Contributor

I think this part of the code would be cleaner if it where:

if (extobj == NULL)
   extobj = Get_ExtObjFromGlobal(...)
_extract_pyvals(extobj);

but I think this can be done in another refactoring PR, all this cleaning up gets a bit in the way of the improvement we are trying to add.

This comment has been minimized.

@arinkverma

arinkverma Sep 21, 2013

Contributor

I am fine with another PR

@arinkverma

arinkverma Sep 21, 2013

Contributor

I am fine with another PR

}
}
return PyUFunc_handlefperr(errmask, errobj, fperr, first);

This comment has been minimized.

@juliantaylor

juliantaylor Sep 19, 2013

Contributor

seems like errobj not decref'd anymore here.

@juliantaylor

juliantaylor Sep 19, 2013

Contributor

seems like errobj not decref'd anymore here.

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 19, 2013

Contributor

+1 from to me if the three comments above are fixed.
the code is still convoluted and needs refactoring, but this could be done in a separate PR.

Contributor

juliantaylor commented Sep 19, 2013

+1 from to me if the three comments above are fixed.
the code is still convoluted and needs refactoring, but this could be done in a separate PR.

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 19, 2013

Contributor

the commits could be squashed, I don't think there is much value in keeping the old iterations of this in the history.

Contributor

juliantaylor commented Sep 19, 2013

the commits could be squashed, I don't think there is much value in keeping the old iterations of this in the history.

@njsmith

This comment has been minimized.

Show comment
Hide comment
@njsmith

njsmith Sep 19, 2013

Member

Yeah, on a second look, I think the full cleanup is:

  • add a get_global_extobj function like you said, and rip out all calls to PyUFunc_GetPyValues
  • make _extract_pyvals unconditionally return: buffersize, errmask, and the raw third element (not constructing a tuple to put it in or anything like that, just PyList_GET_ITEM(extobj, 2) and done).
  • pass this third element down to _check_ufunc_err in place of where the PR is currently passing extobj (and where we used to pass errobj)
  • rip out all calls to PyUFunc_handlefperr and replace with a function that takes this third element. Or just inline the logic into _check_ufunc_err.

There's some question of what to do with the error checking that _extract_pyvals currently does (requiring that the third element be callable etc.). Not sure this is actually needed at all (as Guido says, "don't check if it's callable, just call it!"), but if it is needed it could be moved into _check_ufunc_err or thereabouts.

Always a tough call on whether to require cleanups when merging. I'm happier with this PR than I was, but (a) we're going to have to clean this up sooner or later, (b) doing it now is a lot cheaper than doing it later, after we've all forgotten all these details and need to work them out from scratch again, but (c) if we just merge we lose our biggest stick for making this actually happen :-). So I'm ambivalent. How much do we believe someone actually will clean this up as a second PR? @juliantaylor, are you volunteering? ;-)

Member

njsmith commented Sep 19, 2013

Yeah, on a second look, I think the full cleanup is:

  • add a get_global_extobj function like you said, and rip out all calls to PyUFunc_GetPyValues
  • make _extract_pyvals unconditionally return: buffersize, errmask, and the raw third element (not constructing a tuple to put it in or anything like that, just PyList_GET_ITEM(extobj, 2) and done).
  • pass this third element down to _check_ufunc_err in place of where the PR is currently passing extobj (and where we used to pass errobj)
  • rip out all calls to PyUFunc_handlefperr and replace with a function that takes this third element. Or just inline the logic into _check_ufunc_err.

There's some question of what to do with the error checking that _extract_pyvals currently does (requiring that the third element be callable etc.). Not sure this is actually needed at all (as Guido says, "don't check if it's callable, just call it!"), but if it is needed it could be moved into _check_ufunc_err or thereabouts.

Always a tough call on whether to require cleanups when merging. I'm happier with this PR than I was, but (a) we're going to have to clean this up sooner or later, (b) doing it now is a lot cheaper than doing it later, after we've all forgotten all these details and need to work them out from scratch again, but (c) if we just merge we lose our biggest stick for making this actually happen :-). So I'm ambivalent. How much do we believe someone actually will clean this up as a second PR? @juliantaylor, are you volunteering? ;-)

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 19, 2013

Contributor

its true that if its not done now it may never be done, but on the other hand we also shouldn't expect new contributers to clean up our old code before adding new useful stuff.

I can try and see what can be done to clean up this file this weekend (if no new 1.8 issues get in the way).
unless of course arinkverma wants to do another cleanup PR? :)

Contributor

juliantaylor commented Sep 19, 2013

its true that if its not done now it may never be done, but on the other hand we also shouldn't expect new contributers to clean up our old code before adding new useful stuff.

I can try and see what can be done to clean up this file this weekend (if no new 1.8 issues get in the way).
unless of course arinkverma wants to do another cleanup PR? :)

ENH: High time consumption by PyUFunc_GetPyValues in ufunc_object.c
For every single operation calls, numpy has to extract value of buffersize, errormask
 and name to pack and build error object. These two functions, _extract_pyvals and
PyUFunc_GetPyValues together use >12% of time.
@arinkverma

This comment has been minimized.

Show comment
Hide comment
@arinkverma

arinkverma Sep 21, 2013

Contributor

@juliantaylor okay! I can make another PR for clean-up but after GSOC. As right now I am busy completing documentation about project.

Contributor

arinkverma commented Sep 21, 2013

@juliantaylor okay! I can make another PR for clean-up but after GSOC. As right now I am busy completing documentation about project.

if (extobj == NULL) {
if (PyUFunc_GetPyValues(ufunc_name,
&buffersize, &errormask, &errobj) < 0) {
if (PyUFunc_GetPyValues(NULL,

This comment has been minimized.

@juliantaylor

juliantaylor Sep 29, 2013

Contributor

looks like these NULL names are never handled, so they likely lead to a crash in _extract_pyvals
if I'm right the case (extobj != NULL) is not covered by tests at all which is bad.

@juliantaylor

juliantaylor Sep 29, 2013

Contributor

looks like these NULL names are never handled, so they likely lead to a crash in _extract_pyvals
if I'm right the case (extobj != NULL) is not covered by tests at all which is bad.

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Sep 29, 2013

Contributor

I started some refactoring on https://github.com/juliantaylor/numpy/tree/pyufunc_refactor
see the commit log

let me know if you want to integrate that into your branch or if I we should merge that branch (which still contains your commit)

Contributor

juliantaylor commented Sep 29, 2013

I started some refactoring on https://github.com/juliantaylor/numpy/tree/pyufunc_refactor
see the commit log

let me know if you want to integrate that into your branch or if I we should merge that branch (which still contains your commit)

@juliantaylor

This comment has been minimized.

Show comment
Hide comment
@juliantaylor

juliantaylor Oct 18, 2013

Contributor

closing as changes are in #3946

Contributor

juliantaylor commented Oct 18, 2013

closing as changes are in #3946

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment