Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept constant memoryviews in HashTable.lookup #21688

Merged
merged 13 commits into from
Jul 7, 2018

Conversation

xhochy
Copy link
Contributor

@xhochy xhochy commented Jun 30, 2018

xref #10070
xref #12813

building an ExtensionArray for nullable ints using a pyarrow.Array, the underlying memoryview is marked as constant. The HashTable Cython code uses the memoryview in the lookup function but in a read-only fashion. Without the const specifier Cython raises an error that the memoryview is read-only.

Not sure how to test this in a small scope (i.e. without using pyarrow) or if this small change is worth mentioning in the changelog.

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@xhochy
Copy link
Contributor Author

xhochy commented Jun 30, 2018

The Travis failure seems quite silent. Is there a way to extract more information?

@TomAugspurger
Copy link
Contributor

I restarted the one. I haven't seen that error before.

@TomAugspurger TomAugspurger added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jun 30, 2018
@codecov
Copy link

codecov bot commented Jun 30, 2018

Codecov Report

Merging #21688 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21688      +/-   ##
==========================================
- Coverage   91.95%   91.95%   -0.01%     
==========================================
  Files         160      160              
  Lines       49837    49820      -17     
==========================================
- Hits        45830    45812      -18     
- Misses       4007     4008       +1
Flag Coverage Δ
#multiple 90.33% <ø> (-0.01%) ⬇️
#single 42.07% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.01% <0%> (-0.28%) ⬇️
pandas/core/arrays/datetimelike.py 94.38% <0%> (-0.25%) ⬇️
pandas/core/indexes/period.py 93.35% <0%> (-0.23%) ⬇️
pandas/core/indexes/multi.py 94.96% <0%> (-0.09%) ⬇️
pandas/core/groupby/groupby.py 92.65% <0%> (-0.05%) ⬇️
pandas/core/indexes/timedeltas.py 91% <0%> (-0.03%) ⬇️
pandas/core/frame.py 97.19% <0%> (-0.01%) ⬇️
pandas/core/strings.py 98.63% <0%> (ø) ⬆️
pandas/core/arrays/datetimes.py 100% <0%> (ø) ⬆️
pandas/util/testing.py 85.33% <0%> (ø) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e84cf2a...a4cb629. Read the comment docs.

@TomAugspurger
Copy link
Contributor

Must have been a one-off failure.

cc @chris-b1 if you have any thoughts on this?

@@ -379,7 +379,7 @@ cdef class {{name}}HashTable(HashTable):
self.table.vals[k] = i

@cython.boundscheck(False)
def lookup(self, {{dtype}}_t[:] values):
def lookup(self, const {{dtype}}_t[:] values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have multiple lookup routines for different drapes - is there a reason to not change them all?

why just lookup? we have many routines which couldnin theory benefit from this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make a pass over hashtable_class_helper.pxi.in and add const specifiers where applicable.

@chris-b1
Copy link
Contributor

Now that cython supports const memoryviews we probably should go ahead and use them everywhere possible, #18825 example of working around this.

Looks like would bump our required version up to 0.28 but shouldn't really be a problem since we distribute c files on PyPI?
https://github.com/cython/cython/blob/master/CHANGES.rst#028-2018-03-13

@@ -343,6 +343,8 @@ Other
^^^^^

- :meth: `~pandas.io.formats.style.Styler.background_gradient` now takes a ``text_color_threshold`` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range. (:issue:`21258`)
- Building pandas for development now requires ``cython >= 0.28``
- Require at least 0.28 version of cython to support read-only memoryviews (:issue:`21688`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referenced the PR as issue here. Is that sufficient or would you like to have separate issues for each line for documentation purposes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we really need to bump cython here?

Copy link
Member

@jschendel jschendel Jul 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we need to bump Cython, it looks like there are a few things missing in order to fully bump it: pin some CI builds to the minimum Cython version, update version number in install.rst.

See the changes in #18623 for where I bumped to 0.24; should show all the places that need updating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading through the changelog of Cython, we need to bump to 0.28 to support the const specifier for memoryviews. Looking a bit through the Travis setup, we only pin Cython for the 2.7 builds and these are the ones that failed. (sadly with a weird error message).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to bump to 0.28 to support the const specifier for memoryviews

is this an actual change in cython?

I agree its nice to change this, but cython 0.28 is pretty new.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move the building pandas to a new sub-section (like we did previously when we bumped deps, as going to do this for more deps this cycle).

also is 0.28 min, or 0.28.2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.28.0 works but we should build with 0.28.2 due to the mentioned problems in sklearn. I would at least bump the Cython version on Travis & CricleCI.

Should we then allow 0.28.0 or require 0.28.2 so that we don't get the bug reports w.r.t. problems that are fixes in .2?

@@ -343,6 +343,8 @@ Other
^^^^^

- :meth: `~pandas.io.formats.style.Styler.background_gradient` now takes a ``text_color_threshold`` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range. (:issue:`21258`)
- Building pandas for development now requires ``cython >= 0.28``
- Require at least 0.28 version of cython to support read-only memoryviews (:issue:`21688`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we really need to bump cython here?

@jreback jreback added the Dependencies Required and optional dependencies label Jul 2, 2018
@jorisvandenbossche
Copy link
Member

is this an actual change in cython?

Yes, it is a new cython feature that it can now work on read-only memoryviews (see the third bullet point here: https://github.com/cython/cython/blob/master/CHANGES.rst#028-2018-03-13). This is also a feature much wanted by sklearn: scikit-learn/scikit-learn#10624

As @chris-b1 said, if we bump the cython version requirement, we should try to simplify our code base and use memoryviews in more places (#10070 as another example where two version of a function are added, one with memoryviews and other with ndarray interface to support read-only arrays)

IMO, ideally, if we bump the version, it would be good to test for a few internal cases that the const memoryviews are actually working (since this is new functionality in cython, there might be still some rough edges. In sklearn they found a bug in const memoryviews that was fixed in 0.28.2).
(but since I am not going to do that work, no strong opinion on doing this before actually bumping the version requirement)

@TomAugspurger
Copy link
Contributor

Looks like would bump our required version up to 0.28 but shouldn't really be a problem since we distribute c files on PyPI?

That's correct.

@xhochy
Copy link
Contributor Author

xhochy commented Jul 3, 2018

Happy to implement the necessary changes when we bump the Cython version but it would be nice to know whether this is the preferred way forward?

@jreback
Copy link
Contributor

jreback commented Jul 3, 2018

yeah am ok with bumping the cython version i guess for 0.24

since this is not user facing really

we should also again bump numpy min but that’s a different issue

@xhochy xhochy force-pushed the accept-constant-memoryview-lookup branch from 6d38e08 to 36626d8 Compare July 7, 2018 13:17
@@ -343,6 +343,8 @@ Other
^^^^^

- :meth: `~pandas.io.formats.style.Styler.background_gradient` now takes a ``text_color_threshold`` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range. (:issue:`21258`)
- Building pandas for development now requires ``cython >= 0.28``
- Require at least 0.28 version of cython to support read-only memoryviews (:issue:`21688`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move the building pandas to a new sub-section (like we did previously when we bumped deps, as going to do this for more deps this cycle).

also is 0.28 min, or 0.28.2?

@@ -1079,13 +1079,17 @@ class TestHashTable(object):

def test_lookup_nan(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a fixture and use this for False/True

@jreback
Copy link
Contributor

jreback commented Jul 7, 2018

after we merge this can try to remove the other read-only memory fixes that are in take routines
xrref #10070 and #12813

@xhochy xhochy force-pushed the accept-constant-memoryview-lookup branch from 36626d8 to 20c181f Compare July 7, 2018 15:04
@jreback jreback added this to the 0.24.0 milestone Jul 7, 2018
@jreback
Copy link
Contributor

jreback commented Jul 7, 2018

pushed some small doc changes. ping on green for merge.

@jreback jreback merged commit dcbf8b5 into pandas-dev:master Jul 7, 2018
@jreback
Copy link
Contributor

jreback commented Jul 7, 2018

thanks @xhochy nice patch!

@xhochy xhochy deleted the accept-constant-memoryview-lookup branch July 8, 2018 18:21
gfyoung added a commit to forking-repos/pandas that referenced this pull request Sep 10, 2018
gfyoung added a commit that referenced this pull request Sep 11, 2018
* CI: Bump NumPy to 1.9.3

Backport of gh-22499.

* BLD: Fix openpyxl to 2.5.5

Backport of gh-22601.

* CI: Resolve timeout issue on Travis

Backported from gh-22429.

* CI: Migrate to CircleCI 2.0

Backport of gh-21814.

* Upgrade Cython to >=0.28.2

Backported from gh-21688.

* TST: Patch locale handling

Backported from gh-21739.
Backport of gh-22213.
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dependencies Required and optional dependencies
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants