Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use align_method in comp_method_FRAME #22880

Merged
merged 15 commits into from Oct 13, 2018

Conversation

Projects
None yet
6 participants
@jbrockmendel
Copy link
Member

commented Sep 28, 2018

Closes #20090

update Since #23000 was merged, some of the discussion is out of date. The bottom line remains unchanged: This PR makes DataFrame comparison ops behave like DataFrame arithmetic ops currently do. Also fixes some bugs e.g. #20090
end update

This is a much nicer alternative to the implementation in #22751. The problem is that two tests still fail with this implementation. We need to pin down the design spec more explicitly regardless.

core.ops has three functions for defining DataFrame ops: _arith_method_FRAME, _flex_comp_method_FRAME, _comp_method_FRAME. The first two both call _align_method_FRAME, with _comp_method_FRAME being the outlier. This PR just adds that alignment call.

The two tests that currently fail:

df = DataFrame(np.arange(6).reshape((3, 2)))
b_r = np.atleast_2d([2, 2])
l = (2, 2, 2)
expected = DataFrame([[False, False], [False, True], [True, True]])
result = df > l   # <-- raises ValueError under this PR because it has the wrong shape
assert_frame_equal(result, expected)

result = df > b_r   # <-- raises ValueError under this PR because it has the wrong shape
assert_frame_equal(result, expected)
df = pd.DataFrame(np.arange(6).reshape((3, 2)))
with pytest.raises(ValueError):  # <-- doesnt raise
           df == (2, 2)

I understand why the behavior tested by the first test makes sense, but don't see the logic behind having df == (2, 2) raise (maybe #4576 holds the answer, will look at that more closely)

@pep8speaks

This comment has been minimized.

Copy link

commented Sep 28, 2018

Hello @jbrockmendel! Thanks for submitting the PR.

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Sep 28, 2018

Related: #22355.

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Sep 29, 2018

@jorisvandenbossche this is the "pretty" alternative to #22751, needs attention despite not being green. Any thoughts on desired behavior?

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 3, 2018

@pandas-dev/pandas-core this now contains a non-trivial API change, should make sure there is consensus before moving forward.

TL;DR: broadcasting/alignment on DataFrame comparison ops behaves differently from arithmetic ops (#22355). Both behave differently from numpy convention (#22686). This PR changes the comparison behavior to match the arithmetic behavior, with the numpy part saved for the next round.

Demonstration (based on tests.frame.test_operators test_boolean_comparison in master):

df = pd.DataFrame(np.arange(6).reshape((3, 2)))
b = np.array([2, 2])
b_r = np.atleast_2d([2, 2])
b_c = b_r.T
l = (2, 2, 2)
tup = tuple(l)

cmp_expected = pd.DataFrame([[False, False], [False, True], [True, True]])
arith_expected = pd.DataFrame([[2, 3], [4, 5], [6, 7]])

# For the 1-D array `b`, comparison and arithmetic broadcast the same way
result = df > b
tm.assert_frame_equal(result, cmp_expected)
result = df.values > b
tm.assert_numpy_array_equal(result, cmp_expected.values)
result = df + b
tm.assert_frame_equal(result, arith_expected)
result = df.values + b
tm.assert_numpy_array_equal(result, arith_expected.values)

# In the 2-D case, the broadcasting behavior is different
result = df > b_r
tm.assert_frame_equal(result, cmp_expected)
result = df.values > b_r
tm.assert_numpy_array_equal(result, cmp_expected.values)

result = df.values + b_r
tm.assert_numpy_array_equal(result, arith_expected.values)

with pytest.raises(ValueError):  # we DONT want this to raise
    # ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (1, 2)
    result = df + b_r


# ditto for tuple/list
result = df > tup  # doesn't match numpy or arithmetic behavior
tm.assert_frame_equal(result, cmp_expected)

with pytest.raises(ValueError):
     # numpy behavior
     result = df.values > tup

with pytest.raises(ValueError):
     # numpy behavior
    result = df.values + tup

with pytest.raises(ValueError):
    # ValueError: Unable to coerce to Series, length must be 2: given 3
    result = df + tup
@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Oct 4, 2018

Just to make sure I understand, the changes would be

  • DataFrame + 2-d ndarray with matching trailing dimension: Previously raised, now broadcasts
  • DataFrame + non-ndarray iterable with matching first dimension: Previously broadcast, now raises

Did I miss any?

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 4, 2018

DataFrame + 2-d ndarray with matching trailing dimension: Previously raised, now broadcasts

np.ndarray with shape (1, df.shape[1]): previously broadcasts, now raises Broadcasts following #23000

DataFrame + non-ndarray iterable with matching first dimension: Previously broadcast, now raises

list or tuple with len == df.shape[0] != df.shape[1]: previously broadcasts, now raises <-- AFAICT this behavior is not currently tested

list or tuple with len == df.shape[1]: previously raises, now broadcasts

* In a separate PR I would like to change the behavior when operating with ndarray with shape (nrows, 1) or (1, ncols) so that both of these broadcast identically to numpy (ATM the (nrows, 1) case raises for both arithmetic and comparison ops). This change would be for both arithmetic and comparison ops (whereas this PR just changes comparison ops to behave like arithmetic ops)

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 4, 2018

Update: In a different branch I edited the alignment code so that 2d arrays of (nrows, 1) or (1, ncols) both broadcast for arithmetic ops. This did not cause any test failures. So AFAICT this particular behavior may be unintentional.

update^2 #23000 has now implemented this behavior for arithmetic ops

@codecov

This comment has been minimized.

Copy link

commented Oct 4, 2018

Codecov Report

Merging #22880 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22880      +/-   ##
==========================================
+ Coverage   92.21%   92.24%   +0.03%     
==========================================
  Files         169      169              
  Lines       50884    50814      -70     
==========================================
- Hits        46922    46874      -48     
+ Misses       3962     3940      -22
Flag Coverage Δ
#multiple 90.66% <100%> (+0.03%) ⬆️
#single 42.3% <100%> (+0.05%) ⬆️
Impacted Files Coverage Δ
pandas/core/internals/managers.py 96.64% <ø> (-0.02%) ⬇️
pandas/core/internals/blocks.py 94.62% <ø> (+1.14%) ⬆️
pandas/core/ops.py 97.19% <100%> (ø) ⬆️
pandas/core/frame.py 97.11% <100%> (-0.01%) ⬇️
pandas/io/formats/csvs.py 98.21% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4b67ca...826f2c7. Read the comment docs.

jbrockmendel added some commits Oct 5, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 6, 2018

you refernced #4576 which is indeed behavior we patched in the past, but doesn't necessary make sense for a DataFrame, while for a homogenous ndarray does a bit more.

Have to look at your changes in some detail..

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 6, 2018

Have to look at your changes in some detail..

Sure. The succinct version is my favorite: "Make DataFrame comparison ops behave like DataFrame arithmetic ops". That change is enabled entirely by the 1 line added in core.ops.

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 7, 2018

Extra benefit: After we get rid of Block.eval, it looks like _try_coerce_args can be simplified.


result = df > b_r
result = df > b_r # broadcasts like ndarray (GH#23000)

This comment has been minimized.

Copy link
@jbrockmendel

jbrockmendel Oct 8, 2018

Author Member

Changed following #23000 (previously in this PR this comparison raised ValueError)

This comment has been minimized.

Copy link
@jreback

jreback Oct 9, 2018

Contributor

can you put comments on the previous line


result = df == b_r
result = df == b_r # broadcasts like ndarray (GH#23000)

This comment has been minimized.

Copy link
@jbrockmendel

jbrockmendel Oct 8, 2018

Author Member

Changed following #23000 (previously in this PR this comparison raised ValueError)

This comment has been minimized.

Copy link
@jreback

jreback Oct 9, 2018

Contributor

comments on previous

@jbrockmendel jbrockmendel changed the title WIP: Use align_method in comp_method_FRAME Use align_method in comp_method_FRAME Oct 8, 2018

@TomAugspurger
Copy link
Contributor

left a comment

There's a merge conflict with master now.

operations has been changed to match the arithmetic operations in these cases.
(:issue:`22880`)

The affected cases are: operating against a 2-dimensional ``np.ndarray`` with

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Oct 8, 2018

Contributor

Slight preference for formatting these cases as a list.

This comment has been minimized.

Copy link
@jbrockmendel

jbrockmendel Oct 8, 2018

Author Member

Just pushed addressing your comments. This one I had to guess what you had in mind. LMK if I got it wrong.

In [3]: arr = np.arange(6).reshape(3, 2)
In [4]: df = pd.DataFrame(arr)

In [5]: df == arr[[0], :] # comparison used to broadcast where arithmetic would raise

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Oct 8, 2018

Contributor

"used to" -> "previously" to avoid ambiguity with "used to".

...
ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (1, 2)

In [7]: df == (1, 2) # length matches number of columns; comparison used to raise where arithmetic would broadcast

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Oct 8, 2018

Contributor

These code blocks are put in <pre> divs in the docs, so long lines will force horizontal scrolling / look extra bad on mobile. Better to continue the input with a

In [7]:  ...
   ...: # comment ...
   ...: # maybe multiline

*Current Behavior*:

.. ipython:: python

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Oct 8, 2018

Contributor

Will nee d an :okexcept here.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Oct 8, 2018

Contributor

Can also remove the In [] prefixes. And hopefully the block continues running when there's an exception in an okexcept.

@jreback
Copy link
Contributor

left a comment

big issue I have here is what is the performance degredation. .eval dispatched to numexpr, I think that this is now handled in dispatch_to_* but not really sure. We may need some asvs to test the broadcasting behaviors (e.g. df + 2-d ndarray)

The affected cases are:
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column
- a list or tuple with the same length matching the number of rows in the :class:`DataFrame`
- a list or tuple with thlength matching the number of columns in the :class:`DataFrame`.

This comment has been minimized.

Copy link
@jreback

jreback Oct 9, 2018

Contributor

the length

df = pd.DataFrame(arr)
df == arr[[0], :] # raises just like the next arithmetic operation
df + arr[[0], :]
df == (1, 2) # broadcasts just like the next arithmetic operation

This comment has been minimized.

Copy link
@jreback

jreback Oct 9, 2018

Contributor

would prefer that the comments precede the lines, also use multiple ipython blocks to make the rendering more clear (if needed).


result = df > b_r
result = df > b_r # broadcasts like ndarray (GH#23000)

This comment has been minimized.

Copy link
@jreback

jreback Oct 9, 2018

Contributor

can you put comments on the previous line


result = df == b_r
result = df == b_r # broadcasts like ndarray (GH#23000)

This comment has been minimized.

Copy link
@jreback

jreback Oct 9, 2018

Contributor

comments on previous

@jreback jreback added the Reshaping label Oct 9, 2018

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 9, 2018

@jbrockmendel did you do a check of the performance to be sure there is no implication?

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 9, 2018

did you do a check of the performance to be sure there is no implication?

There does appear to be a performance hit in the relevant case, but a) there is room for subsequent optimization and b) the bugs this avoids are much more important than the performance impact.

In [3]: arr = np.arange(600).reshape(30, 20)
In [4]: df = pd.DataFrame(arr)
In [5] %timeit arr == arr[[0], :]    # calibrate based on numpy timings
10000 loops, best of 3: 82.9 µs per loop   <-- master
10000 loops, best of 3: 84 µs per loop      <-- PR

In [6]: ser = pd.Series(arr[0, :])
In [7]: l = list(ser)

In [8]: %timeit df == arr[[0], :]
100 loops, best of 3: 4.42 ms per loop   # <-- master
10 loops, best of 3: 87.5 ms per loop    # <-- PR

In [9]: %timeit df == ser
10 loops, best of 3: 83 ms per loop       <-- master
10 loops, best of 3: 86.9 ms per loop    <-- PR

In [10]: %timeit df == l
ValueError: Invalid broadcasting comparison       <-- master
10 loops, best of 3: 82.4 ms per loop    <-- PR

In [11]: df[1] = df[1].astype('f8')
    ...: df[3] = df[3].astype('uint8')
    ...: df[5] = df[5].view('m8[ns]')
    ...: df[7] = df[7].view('M8[ns]')
    ...: df[9] = df[9].astype('f4')
    ...:
In [12]: %timeit df == arr[[0], :]
ValueError: cannot broadcast shape [(30, 15)] with block values [(1, 20)]   <-- master, BUG
TypeError: cannot compare a TimedeltaIndex with type int64   <-- PR, BUG (in TimedeltaIndex.__eq__; just opened #23063)

In [13]: df[5] = df[5].view('i8').astype('i4')
In [14]: %timeit df == arr[[0], :]
ValueError: cannot broadcast shape [(30, 15)] with block values [(1, 20)]  <-- master, BUG
10 loops, best of 3: 86.7 ms per loop  <-- PR

In [14]: %timeit df == arr[0, :]
ValueError: Invalid broadcasting comparison [array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])] with block values  <-- master
10 loops, best of 3: 80.9 ms per loop  <-- PR

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 12, 2018

gentle ping. If this (or #23031) goes through the follow-ups will keep me busy in places orthogonal to PeriodArray constructors

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 12, 2018

Quick general feedback: for the cases that worked before but will raise now, I think we should go through a deprecation cycle.

(looking now more in detail on your explanation of the changes above)

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 12, 2018

In [8]: %timeit df == arr[[0], :]
100 loops, best of 3: 4.42 ms per loop # <-- master
10 loops, best of 3: 87.5 ms per loop # <-- PR

How do you see this performance difference? I though this would now raise an error on this PR?


So if I understand correctly, some cases that now work, will raise after this PR, but you plan to let them work again in a later PR?
That makes it a bit difficult to assess what the overall changes for users would be (eg if a deprecation warning would be needed, see my previous post)

(it's not possible to for example first align arithmetic ops with numpy, and then comparison ops with arithmeric/numpy, so you don't have this back and forth?)


Something else to consider: for Index we actually deliberately disabled comparing with a length 1 array. See also the recently opened issue #23078 where we were planning to actually raise instead of broadcast

The affected cases are:
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column
- a list or tuple with length matching the number of rows in the :class:`DataFrame`
- a list or tuple with length matching the number of columns in the :class:`DataFrame`.

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Oct 12, 2018

Member

this list is not valid rst formatted: you don't need to indentation, and need an empty line before the list

(:issue:`22880`)

The affected cases are:
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Oct 12, 2018

Member

Can you indicate here what changes for those cases?

This comment has been minimized.

Copy link
@jbrockmendel

jbrockmendel Oct 12, 2018

Author Member

Will update

This comment has been minimized.

Copy link
@jbrockmendel

jbrockmendel Oct 12, 2018

Author Member

Updated; I think other whatsnew comments addressed. Note that Previous Behavior blocks refer to 0.23, not pre-this-PR

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 12, 2018

How do you see this performance difference?

If you drill down into this, it is driven by poor performance of cmp(DataFrame, Series), which I think can be improved on a subsequent pass. In the interim I see this as a uncommon corner case and the performance hit as being less important than internal consistency and bugfixes (just noticed this clsoes #20090)

I though this would now raise an error on this PR?

That was the case when this PR was opened, but since then #22355 changed the behavior of op(DataFrame, ndarray[ndim=2]) to match numpy broadcasting behavior.

(it's not possible to for example first align arithmetic ops with numpy, and then comparison ops with arithmeric/numpy, so you don't have this back and forth?)

If I understand correctly, this is similar to what I have in mind for optimizing the DataFrame vs Series op.

So if I understand correctly, some cases that now work, will raise after this PR, but you plan to let them work again in a later PR?

I think the only case that falls into this category is a list/tuple (not ndarray or Series) with length(obj) == len(df) != len(df.columns). Those comparisons currently "work" and this PR causes them to raise (matching arithmetic ops behavior).

I would be open to subsequently changing the behavior of both arithmetic+comparisons to do something like:

if len(obj) == len(df.columns):
    [behave like arithmetic ops currently do]
elif len(obj) == len(df):
    [behave like comparison ops currently do]

but only if this is made consistent across arithmetic/comparison ops.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 12, 2018

I though this would now raise an error on this PR?

That was the case when this PR was opened, but since then #22355 changed the behavior of op(DataFrame, ndarray[ndim=2]) to match numpy broadcasting behavior.

But then the whatsnew addition you have in this PR is not up to date anymore? As that still mentions that this case will now raise an error

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 12, 2018

But then the whatsnew addition you have in this PR is not up to date anymore

Will update.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 12, 2018

Can you please give a new summary of what this PR does / changes etc, as I am rather lost now. The comment above about "non-trivial API change" is also not relevant any more? (and update the top post)

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 12, 2018

Can you please give a new summary of what this PR does / changes etc, as I am rather lost now.

Updated above.

Bottom line: This PR makes DataFrame comparison ops broadcasting consistent with DataFrame arithmetic ops.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 12, 2018

update Since #22355 was merged,

That's an open issue, not a merged PR?

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 12, 2018

Updated, #23000

@jreback
Copy link
Contributor

left a comment

lgtm. some minor comments. ping on green.

df = pd.DataFrame(arr)

.. ipython:: python
# comparison and arithmetic both broadcast

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

I'd prefer you make the comments actual sentences here, which is pretty easy when you already have the blocks.

@@ -48,15 +48,19 @@ def test_mixed_comparison(self):
assert result.all().all()

def test_df_boolean_comparison_error(self):
# GH 4576
# GH#4576, GH#22880
# boolean comparisons with a tuple/list give unexpected results

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

can you change this comment


result = df > tup
assert_frame_equal(result, expected)
with pytest.raises(ValueError):

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

can you assert the error messages in these

@jreback jreback added this to the 0.24.0 milestone Oct 12, 2018

@jbrockmendel

This comment has been minimized.

Copy link
Member Author

commented Oct 13, 2018

ping

@gfyoung gfyoung merged commit e96c691 into pandas-dev:master Oct 13, 2018

6 checks passed

ci/circleci: py27_compat Your tests passed on CircleCI!
Details
ci/circleci: py35_ascii Your tests passed on CircleCI!
Details
ci/circleci: py36_locale Your tests passed on CircleCI!
Details
ci/circleci: py36_locale_slow Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20181013.1 succeeded
Details
@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 13, 2018

@gfyoung sorry, but Brock and me were still actively discussing this. @jreback I think it was clear that was the case, so please don't say "ping on green" in such a case

jorisvandenbossche added a commit that referenced this pull request Oct 13, 2018

jorisvandenbossche added a commit that referenced this pull request Oct 13, 2018

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 13, 2018

OK, so I simply reverted the merge. That seemed the easiest way to continue the discussion. It's a bit annoying, but @jbrockmendel you will need to open a new PR since this one cannot be re-opened ..

brute4s99 added a commit to brute4s99/pandas that referenced this pull request Nov 19, 2018

brute4s99 added a commit to brute4s99/pandas that referenced this pull request Nov 19, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.