PERF/COMPAT: define platform int to np.intp #13972

Closed
wants to merge 4 commits into
from

Conversation

Projects
None yet
4 participants
Contributor

chris-b1 commented Aug 12, 2016 edited

AFAIK this only affects 64 bit python on Windows.

numpy wants an np.intp (i8 on Windows) as a indexer for take, but pandas defines a "platform int" as a np.int_ (i4 on Windows). This hits performance twice, because we often start with i8, cast to i4, then numpy will cast back to i8 in its take.

This is an alternative to #13924 - there I explored replacing ndarray.take with our take_nd fully, but this approach solves the perf problem without much pain or risking new segfaults.

I'd still need to adjust a bunch of tests here to pass on Windows. This is an API change for "advanced" methods like get_indexer, but I don't think anything is necessary beyond a doc note?

ASV:

    before     after       ratio
  [4a80521 ] [3ced5d5 ]
     3.49ms     2.86ms      0.82  indexing.series_take_dtindex.time_series_take_dtindex
     3.28ms     3.03ms      0.92  indexing.series_take_intindex.time_series_take_intindex
    15.16ms    14.57ms      0.96  join_merge.join_dataframe_index_single_key_bigger.time_join_dataframe_index_single_key_bigger
    19.81ms    16.62ms      0.84  join_merge.join_dataframe_index_single_key_bigger_sort.time_join_dataframe_index_single_key_bigger_sort
    14.05ms    13.63ms      0.97  join_merge.join_dataframe_index_single_key_small.time_join_dataframe_index_single_key_small
     4.83ms     4.64ms      0.96  join_merge.join_dataframe_integer_2key.time_join_dataframe_integer_2key
     1.80ms     1.63ms      0.90  join_merge.join_dataframe_integer_key.time_join_dataframe_integer_key
   633.31us   570.37us      0.90  join_merge.join_non_unique_equal.time_join_non_unique_equal
      2.39s      1.36s      0.57  join_merge.left_outer_join_index.time_left_outer_join_index
@chris-b1 chris-b1 PERF/COMPAT: define platform int to np.intp
3ced5d5

codecov-io commented Aug 12, 2016 edited

Current coverage is 85.29% (diff: 100%)

Merging #13972 into master will increase coverage by 0.02%

@@             master     #13972   diff @@
==========================================
  Files           139        139          
  Lines         50219      50245    +26   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42819      42854    +35   
+ Misses         7400       7391     -9   
  Partials          0          0          

Powered by Codecov. Last update 4a80521...322b11a

@jreback jreback commented on the diff Aug 12, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -767,6 +767,40 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan`
- Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
- Bug in single row slicing on multi-type ``SparseDataFrame``s, types were previously forced to float (:issue:`13917`)
+Indexer dtype Changes
+^^^^^^^^^^^^^^^^^^^^^
@jreback

jreback Aug 12, 2016

Contributor

add the issue ref here.

@jreback jreback commented on an outdated diff Aug 12, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -767,6 +767,40 @@ Note that the limitation is applied to ``fill_value`` which default is ``np.nan`
- Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
- Bug in single row slicing on multi-type ``SparseDataFrame``s, types were previously forced to float (:issue:`13917`)
+Indexer dtype Changes
+^^^^^^^^^^^^^^^^^^^^^
+
+.. note::
+
+ This change only affects 64 bit python running on Windows, and only affects relatively advanced
+ indexing operations
+
+Methods such as ``Index.get_indexer`` that return an indexer array coerce that array to a "platform int", so that it can be
+directly used in 3rd party library operations like ``numpy.take``. Previously, a platform int was defined as ``np.int_``
+which corresponds to a C integer - but the correct type, and what is being used now, is ``np.intp``, which corresponds
@jreback

jreback Aug 12, 2016

Contributor

which corresponds to a C integer, but

@jreback jreback commented on an outdated diff Aug 12, 2016

doc/source/whatsnew/v0.19.0.txt
+
+Previous behaviour:
+
+.. code-block:: ipython
+
+ In [1]: i = pd.Index(['a', 'b', 'c'])
+
+ In [2]: i.get_indexer(['b', 'b', 'c']).dtype
+ Out[2]: dtype('int32')
+
+New behaviour:
+
+.. ipython :: python
+
+ i = pd.Index(['a', 'b', 'c'])
+ i.get_indexer(['b', 'b', 'c']).dtype
@jreback

jreback Aug 12, 2016

Contributor

this runs on linux, so maybe just use another code-block (from windows), though it is the same

Contributor

jreback commented Aug 12, 2016

yes this looks reasonable. go ahead an make ready on windows. Also add an asv as above.

@chris-b1 chris-b1 adjust test for platform independence
84f38b2
Contributor

jreback commented Aug 13, 2016

pls update the docs as indicated. otherwise lgtm.

Contributor

chris-b1 commented Aug 14, 2016

What else were you looking for in the docs? I did add an issue ref in the first paragraph.

Contributor

jreback commented Aug 15, 2016

ok this looks fine. can you run this on 32-bit linux as well and see that it comes up clean (IIRC you tests on windows so that should be good).

jreback added this to the 0.19.0 milestone Aug 15, 2016

@chris-b1 chris-b1 adjust for 32bit
fc80938
Contributor

chris-b1 commented Aug 15, 2016

The changes I just pushed were to get this passing on 32 bit windows. I don't have anything handy to test 32 bit Linux, I can probably set up a VM, although I think it should generally act the same 32 bit Windows (np.int_ == np.intp == np.int32)

Contributor

jreback commented Aug 15, 2016

ok I'll run on 32 bit tomorrow

I just use a macosx / vm

Contributor

jreback commented Aug 16, 2016

This currently fails on 32-bit linux (before your PR). I recall had to do something on windows to get this to pass. Something wrong somewhere. Your PR fixes this! yeah!

======================================================================
FAIL: test_alignment_non_pandas (pandas.tests.frame.test_operators.TestDataFrameOperators)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jreback/pandas/pandas/tests/frame/test_operators.py", line 1210, in test_alignment_non_pandas
    Series([1, 2, 3], index=df.index))
  File "/home/jreback/pandas/pandas/util/testing.py", line 1154, in assert_series_equal
    assert_attr_equal('dtype', left, right)
  File "/home/jreback/pandas/pandas/util/testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "/home/jreback/pandas/pandas/util/testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  int32
[right]: int64

So as-is this passes on linux32 (and fixes the above).

lmk when good to go.

@chris-b1 chris-b1 lint fixup
322b11a
Contributor

chris-b1 commented Aug 16, 2016

Cool - fixed the lint issue so this should be good to go.

chris-b1 changed the title from PERF/COMPAT: define platform int to np.intp (WIP) to PERF/COMPAT: define platform int to np.intp Aug 16, 2016

jreback closed this in 0780443 Aug 17, 2016

Contributor

jreback commented Aug 17, 2016

thanks!

chris-b1 deleted the chris-b1:platform-int branch Aug 17, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment