Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Improved performance of PyArray_FromAny for sequences of array-like #13399

Merged
merged 1 commit into from May 26, 2019

Conversation

superbobry
Copy link
Contributor

Prior to this commit

np.array([array_like])

would recursively copy each element of array_like. This is due to the fact that
setArrayFromSequence only special-cased lists of NumPy arrays, any other
object was treated as a sequence even if it supported buffer or __array*__
interfaces. See tensorflow/tensorflow#27692 for details.

The commit generalizes the special-case in setArrayFromSequence to any
array-like, i.e. a buffer or an object with __array__, __array_interface__
__array_struct__.

@mattip
Copy link
Member

mattip commented Apr 26, 2019

Running

python -mtimeit -s "import numpy as np; b=memoryview(np.arange(1000))" \
       "a = np.array([b] * 100)"

shows a marked improvement. Before

100 loops, best of 3: 2.6 msec per loop

After

10000 loops, best of 3: 71.9 usec per loop

return NULL;
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reviewers: the original code had a check clause at this point if (tmp != Py_NotImplemented){ } which has been folded into the logic below instead

@mattip
Copy link
Member

mattip commented Apr 26, 2019

LGTM. Maybe could add a benchmark to benchmarks/benchmarks/bench_core.Core.time*.

@eric-wieser
Copy link
Member

I worry that this needs to handle broken ctypes buffer formats in the same way we do elsewhere. Will try to look at the diff later

@eric-wieser eric-wieser self-requested a review April 27, 2019 00:47
@superbobry
Copy link
Contributor Author

superbobry commented Apr 27, 2019

I was also thinking of addressing #8562 but I hesitated because buffer interface allows for ~free shape queries whereas __array__ potentially allocates a new array which seems wasteful.

@mattip I've added a benchmark, could you have another look please?

@mattip
Copy link
Member

mattip commented Apr 27, 2019

I was also thinking of addressing #8562

Great, but that should be a follow-on PR, not part of this one, unless it is a one-line fix

@mattip
Copy link
Member

mattip commented Apr 30, 2019

@eric-wieser since this PR reuses existing code, could we push off possible broken ctypes PEP 3118 handling to a different issue/PR? This seems like a clear win for more common memoryview and __array__ use cases

@superbobry superbobry force-pushed the np.array-list-of-array-like branch from 1b57708 to 9f8826a Compare May 1, 2019 11:38
@eric-wieser
Copy link
Member

@mattip: At a cursory glance, I was worried this introduced a regression in the following case:

class PaddedStruct(ctypes.Structure):
    _fields_ = [('a', ctypes.c_uint8), ('b', ctypes.c_uint16)]

data = (PaddedStruct * 4)()
np.array(data)  # ok
np.array(list(data))
# ValueError: invalid literal for int() with base 10: b'\x00\x00\x00\x00'

Testing seems to show that the error is produced both before and after this patch, so there's no need to worry about it here.

@superbobry superbobry force-pushed the np.array-list-of-array-like branch from 9f8826a to 2363084 Compare May 12, 2019 20:26
Prior to this commit

  np.array([array_like])

would recursively copy each element of array_like. This is due to the
fact that setArrayFromSequence only special-cased lists of NumPy arrays,
any other object was treated as a sequence even if it supported buffer
or __array*__ interfaces. See tensorflow/tensorflow#27692 for details.

The commit generalizes the special-case in setArrayFromSequence to any
array-like, i.e. a buffer or an object with __array__, __array_interface__
__array_struct__.
@mattip
Copy link
Member

mattip commented May 26, 2019

I created issue #13628 about np.array(list(ctypes-struct)) which does not need the padding to fail.

This looks ready to merge, any more comments?

@mattip mattip merged commit 95aacf5 into numpy:master May 26, 2019
@mattip
Copy link
Member

mattip commented May 26, 2019

Thanks @superbobry

@mattip
Copy link
Member

mattip commented May 28, 2019

This seems to have broken pandas, where import numpy as np, pandas as pd; np.array([pd.DataFrame()]) used to work and now raises ValueError: could not broadcast input array from shape (0,0) into shape (0)

seberg pushed a commit that referenced this pull request May 30, 2019
Fixup of gh-13399 to avoid a regression with pandas. The actual output after this is the same as
before, but incorrect with respect to the output shape. However, it seems a bit larger and an issue
to look at after the 1.17 release.

* BUG: test, fix regression for array([pandas.DataFrame()])

* BUG: refactor to make sure tmp is DECREFfed
seberg added a commit to seberg/numpy that referenced this pull request Jul 24, 2019
…rray-like"

This reverts commit 71fc59d
which was part of numpygh-13399 and the followup commit
ba53a63 from numpygh-13663.

The issue is that we mix Sequence protocol and `__array__` usage
when coercing arrays. It seems easiest to revert this for the 1.17.x release
and make a larger breaking change in the 1.18.x release cycle.
Since this only occurs for stranger array-likes. This revert is limited
to the 1.17.x branch for now.
charris pushed a commit to charris/numpy that referenced this pull request Nov 26, 2019
…rray-like"

This reverts commit 71fc59d
which was part of numpygh-13399 and the followup commit
ba53a63 from numpygh-13663.

The issue is that we mix Sequence protocol and `__array__` usage
when coercing arrays. It seems easiest to revert this for the 1.17.x release
and make a larger breaking change in the 1.18.x release cycle.
Since this only occurs for stranger array-likes. This revert is limited
to the 1.17.x branch for now.
charris pushed a commit to charris/numpy that referenced this pull request Dec 2, 2019
…rray-like"

This reverts commit 71fc59d
which was part of numpygh-13399 and the followup commit
ba53a63 from numpygh-13663.

The issue is that we mix Sequence protocol and `__array__` usage when
coercing arrays. It seems easiest to revert this for the 1.18.x release
and make a larger breaking change in the 1.19.x release cycle.  Since
this only occurs for stranger array-likes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants