Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use align_method in comp_method_FRAME #22880

Merged
merged 15 commits into from
Oct 13, 2018
Merged

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Sep 28, 2018

Closes #20090

update Since #23000 was merged, some of the discussion is out of date. The bottom line remains unchanged: This PR makes DataFrame comparison ops behave like DataFrame arithmetic ops currently do. Also fixes some bugs e.g. #20090
end update

This is a much nicer alternative to the implementation in #22751. The problem is that two tests still fail with this implementation. We need to pin down the design spec more explicitly regardless.

core.ops has three functions for defining DataFrame ops: _arith_method_FRAME, _flex_comp_method_FRAME, _comp_method_FRAME. The first two both call _align_method_FRAME, with _comp_method_FRAME being the outlier. This PR just adds that alignment call.

The two tests that currently fail:

df = DataFrame(np.arange(6).reshape((3, 2)))
b_r = np.atleast_2d([2, 2])
l = (2, 2, 2)
expected = DataFrame([[False, False], [False, True], [True, True]])
result = df > l   # <-- raises ValueError under this PR because it has the wrong shape
assert_frame_equal(result, expected)

result = df > b_r   # <-- raises ValueError under this PR because it has the wrong shape
assert_frame_equal(result, expected)
df = pd.DataFrame(np.arange(6).reshape((3, 2)))
with pytest.raises(ValueError):  # <-- doesnt raise
           df == (2, 2)

I understand why the behavior tested by the first test makes sense, but don't see the logic behind having df == (2, 2) raise (maybe #4576 holds the answer, will look at that more closely)

@pep8speaks
Copy link

Hello @jbrockmendel! Thanks for submitting the PR.

@jbrockmendel
Copy link
Member Author

Related: #22355.

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche this is the "pretty" alternative to #22751, needs attention despite not being green. Any thoughts on desired behavior?

@jbrockmendel
Copy link
Member Author

@pandas-dev/pandas-core this now contains a non-trivial API change, should make sure there is consensus before moving forward.

TL;DR: broadcasting/alignment on DataFrame comparison ops behaves differently from arithmetic ops (#22355). Both behave differently from numpy convention (#22686). This PR changes the comparison behavior to match the arithmetic behavior, with the numpy part saved for the next round.

Demonstration (based on tests.frame.test_operators test_boolean_comparison in master):

df = pd.DataFrame(np.arange(6).reshape((3, 2)))
b = np.array([2, 2])
b_r = np.atleast_2d([2, 2])
b_c = b_r.T
l = (2, 2, 2)
tup = tuple(l)

cmp_expected = pd.DataFrame([[False, False], [False, True], [True, True]])
arith_expected = pd.DataFrame([[2, 3], [4, 5], [6, 7]])

# For the 1-D array `b`, comparison and arithmetic broadcast the same way
result = df > b
tm.assert_frame_equal(result, cmp_expected)
result = df.values > b
tm.assert_numpy_array_equal(result, cmp_expected.values)
result = df + b
tm.assert_frame_equal(result, arith_expected)
result = df.values + b
tm.assert_numpy_array_equal(result, arith_expected.values)

# In the 2-D case, the broadcasting behavior is different
result = df > b_r
tm.assert_frame_equal(result, cmp_expected)
result = df.values > b_r
tm.assert_numpy_array_equal(result, cmp_expected.values)

result = df.values + b_r
tm.assert_numpy_array_equal(result, arith_expected.values)

with pytest.raises(ValueError):  # we DONT want this to raise
    # ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (1, 2)
    result = df + b_r


# ditto for tuple/list
result = df > tup  # doesn't match numpy or arithmetic behavior
tm.assert_frame_equal(result, cmp_expected)

with pytest.raises(ValueError):
     # numpy behavior
     result = df.values > tup

with pytest.raises(ValueError):
     # numpy behavior
    result = df.values + tup

with pytest.raises(ValueError):
    # ValueError: Unable to coerce to Series, length must be 2: given 3
    result = df + tup

@TomAugspurger
Copy link
Contributor

Just to make sure I understand, the changes would be

  • DataFrame + 2-d ndarray with matching trailing dimension: Previously raised, now broadcasts
  • DataFrame + non-ndarray iterable with matching first dimension: Previously broadcast, now raises

Did I miss any?

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Oct 4, 2018

DataFrame + 2-d ndarray with matching trailing dimension: Previously raised, now broadcasts

np.ndarray with shape (1, df.shape[1]): previously broadcasts, now raises Broadcasts following #23000

DataFrame + non-ndarray iterable with matching first dimension: Previously broadcast, now raises

list or tuple with len == df.shape[0] != df.shape[1]: previously broadcasts, now raises <-- AFAICT this behavior is not currently tested

list or tuple with len == df.shape[1]: previously raises, now broadcasts

* In a separate PR I would like to change the behavior when operating with ndarray with shape (nrows, 1) or (1, ncols) so that both of these broadcast identically to numpy (ATM the (nrows, 1) case raises for both arithmetic and comparison ops). This change would be for both arithmetic and comparison ops (whereas this PR just changes comparison ops to behave like arithmetic ops)

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Oct 4, 2018

Update: In a different branch I edited the alignment code so that 2d arrays of (nrows, 1) or (1, ncols) both broadcast for arithmetic ops. This did not cause any test failures. So AFAICT this particular behavior may be unintentional.

update^2 #23000 has now implemented this behavior for arithmetic ops

@codecov
Copy link

codecov bot commented Oct 4, 2018

Codecov Report

Merging #22880 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22880      +/-   ##
==========================================
+ Coverage   92.21%   92.24%   +0.03%     
==========================================
  Files         169      169              
  Lines       50884    50814      -70     
==========================================
- Hits        46922    46874      -48     
+ Misses       3962     3940      -22
Flag Coverage Δ
#multiple 90.66% <100%> (+0.03%) ⬆️
#single 42.3% <100%> (+0.05%) ⬆️
Impacted Files Coverage Δ
pandas/core/internals/managers.py 96.64% <ø> (-0.02%) ⬇️
pandas/core/internals/blocks.py 94.62% <ø> (+1.14%) ⬆️
pandas/core/ops.py 97.19% <100%> (ø) ⬆️
pandas/core/frame.py 97.11% <100%> (-0.01%) ⬇️
pandas/io/formats/csvs.py 98.21% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4b67ca...826f2c7. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Oct 6, 2018

you refernced #4576 which is indeed behavior we patched in the past, but doesn't necessary make sense for a DataFrame, while for a homogenous ndarray does a bit more.

Have to look at your changes in some detail..

@jbrockmendel
Copy link
Member Author

Have to look at your changes in some detail..

Sure. The succinct version is my favorite: "Make DataFrame comparison ops behave like DataFrame arithmetic ops". That change is enabled entirely by the 1 line added in core.ops.

@jbrockmendel
Copy link
Member Author

Extra benefit: After we get rid of Block.eval, it looks like _try_coerce_args can be simplified.


result = df > b_r
result = df > b_r # broadcasts like ndarray (GH#23000)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed following #23000 (previously in this PR this comparison raised ValueError)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put comments on the previous line


result = df == b_r
result = df == b_r # broadcasts like ndarray (GH#23000)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed following #23000 (previously in this PR this comparison raised ValueError)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments on previous

@jbrockmendel jbrockmendel changed the title WIP: Use align_method in comp_method_FRAME Use align_method in comp_method_FRAME Oct 8, 2018
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a merge conflict with master now.

operations has been changed to match the arithmetic operations in these cases.
(:issue:`22880`)

The affected cases are: operating against a 2-dimensional ``np.ndarray`` with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference for formatting these cases as a list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed addressing your comments. This one I had to guess what you had in mind. LMK if I got it wrong.

In [3]: arr = np.arange(6).reshape(3, 2)
In [4]: df = pd.DataFrame(arr)

In [5]: df == arr[[0], :] # comparison used to broadcast where arithmetic would raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"used to" -> "previously" to avoid ambiguity with "used to".

...
ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (1, 2)

In [7]: df == (1, 2) # length matches number of columns; comparison used to raise where arithmetic would broadcast
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These code blocks are put in <pre> divs in the docs, so long lines will force horizontal scrolling / look extra bad on mobile. Better to continue the input with a

In [7]:  ...
   ...: # comment ...
   ...: # maybe multiline


*Current Behavior*:

.. ipython:: python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will nee d an :okexcept here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also remove the In [] prefixes. And hopefully the block continues running when there's an exception in an okexcept.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big issue I have here is what is the performance degredation. .eval dispatched to numexpr, I think that this is now handled in dispatch_to_* but not really sure. We may need some asvs to test the broadcasting behaviors (e.g. df + 2-d ndarray)

The affected cases are:
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column
- a list or tuple with the same length matching the number of rows in the :class:`DataFrame`
- a list or tuple with thlength matching the number of columns in the :class:`DataFrame`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the length

df = pd.DataFrame(arr)
df == arr[[0], :] # raises just like the next arithmetic operation
df + arr[[0], :]
df == (1, 2) # broadcasts just like the next arithmetic operation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer that the comments precede the lines, also use multiple ipython blocks to make the rendering more clear (if needed).


result = df > b_r
result = df > b_r # broadcasts like ndarray (GH#23000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put comments on the previous line


result = df == b_r
result = df == b_r # broadcasts like ndarray (GH#23000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments on previous

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Oct 9, 2018
@jorisvandenbossche
Copy link
Member

@jbrockmendel did you do a check of the performance to be sure there is no implication?

@jbrockmendel
Copy link
Member Author

did you do a check of the performance to be sure there is no implication?

There does appear to be a performance hit in the relevant case, but a) there is room for subsequent optimization and b) the bugs this avoids are much more important than the performance impact.

In [3]: arr = np.arange(600).reshape(30, 20)
In [4]: df = pd.DataFrame(arr)
In [5] %timeit arr == arr[[0], :]    # calibrate based on numpy timings
10000 loops, best of 3: 82.9 µs per loop   <-- master
10000 loops, best of 3: 84 µs per loop      <-- PR

In [6]: ser = pd.Series(arr[0, :])
In [7]: l = list(ser)

In [8]: %timeit df == arr[[0], :]
100 loops, best of 3: 4.42 ms per loop   # <-- master
10 loops, best of 3: 87.5 ms per loop    # <-- PR

In [9]: %timeit df == ser
10 loops, best of 3: 83 ms per loop       <-- master
10 loops, best of 3: 86.9 ms per loop    <-- PR

In [10]: %timeit df == l
ValueError: Invalid broadcasting comparison       <-- master
10 loops, best of 3: 82.4 ms per loop    <-- PR

In [11]: df[1] = df[1].astype('f8')
    ...: df[3] = df[3].astype('uint8')
    ...: df[5] = df[5].view('m8[ns]')
    ...: df[7] = df[7].view('M8[ns]')
    ...: df[9] = df[9].astype('f4')
    ...:
In [12]: %timeit df == arr[[0], :]
ValueError: cannot broadcast shape [(30, 15)] with block values [(1, 20)]   <-- master, BUG
TypeError: cannot compare a TimedeltaIndex with type int64   <-- PR, BUG (in TimedeltaIndex.__eq__; just opened #23063)

In [13]: df[5] = df[5].view('i8').astype('i4')
In [14]: %timeit df == arr[[0], :]
ValueError: cannot broadcast shape [(30, 15)] with block values [(1, 20)]  <-- master, BUG
10 loops, best of 3: 86.7 ms per loop  <-- PR

In [14]: %timeit df == arr[0, :]
ValueError: Invalid broadcasting comparison [array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])] with block values  <-- master
10 loops, best of 3: 80.9 ms per loop  <-- PR

@jbrockmendel
Copy link
Member Author

gentle ping. If this (or #23031) goes through the follow-ups will keep me busy in places orthogonal to PeriodArray constructors

@jorisvandenbossche
Copy link
Member

Quick general feedback: for the cases that worked before but will raise now, I think we should go through a deprecation cycle.

(looking now more in detail on your explanation of the changes above)

@jorisvandenbossche
Copy link
Member

In [8]: %timeit df == arr[[0], :]
100 loops, best of 3: 4.42 ms per loop # <-- master
10 loops, best of 3: 87.5 ms per loop # <-- PR

How do you see this performance difference? I though this would now raise an error on this PR?


So if I understand correctly, some cases that now work, will raise after this PR, but you plan to let them work again in a later PR?
That makes it a bit difficult to assess what the overall changes for users would be (eg if a deprecation warning would be needed, see my previous post)

(it's not possible to for example first align arithmetic ops with numpy, and then comparison ops with arithmeric/numpy, so you don't have this back and forth?)


Something else to consider: for Index we actually deliberately disabled comparing with a length 1 array. See also the recently opened issue #23078 where we were planning to actually raise instead of broadcast

The affected cases are:
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column
- a list or tuple with length matching the number of rows in the :class:`DataFrame`
- a list or tuple with length matching the number of columns in the :class:`DataFrame`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this list is not valid rst formatted: you don't need to indentation, and need an empty line before the list

(:issue:`22880`)

The affected cases are:
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you indicate here what changes for those cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated; I think other whatsnew comments addressed. Note that Previous Behavior blocks refer to 0.23, not pre-this-PR

@jbrockmendel
Copy link
Member Author

How do you see this performance difference?

If you drill down into this, it is driven by poor performance of cmp(DataFrame, Series), which I think can be improved on a subsequent pass. In the interim I see this as a uncommon corner case and the performance hit as being less important than internal consistency and bugfixes (just noticed this clsoes #20090)

I though this would now raise an error on this PR?

That was the case when this PR was opened, but since then #22355 changed the behavior of op(DataFrame, ndarray[ndim=2]) to match numpy broadcasting behavior.

(it's not possible to for example first align arithmetic ops with numpy, and then comparison ops with arithmeric/numpy, so you don't have this back and forth?)

If I understand correctly, this is similar to what I have in mind for optimizing the DataFrame vs Series op.

So if I understand correctly, some cases that now work, will raise after this PR, but you plan to let them work again in a later PR?

I think the only case that falls into this category is a list/tuple (not ndarray or Series) with length(obj) == len(df) != len(df.columns). Those comparisons currently "work" and this PR causes them to raise (matching arithmetic ops behavior).

I would be open to subsequently changing the behavior of both arithmetic+comparisons to do something like:

if len(obj) == len(df.columns):
    [behave like arithmetic ops currently do]
elif len(obj) == len(df):
    [behave like comparison ops currently do]

but only if this is made consistent across arithmetic/comparison ops.

@jorisvandenbossche
Copy link
Member

I though this would now raise an error on this PR?

That was the case when this PR was opened, but since then #22355 changed the behavior of op(DataFrame, ndarray[ndim=2]) to match numpy broadcasting behavior.

But then the whatsnew addition you have in this PR is not up to date anymore? As that still mentions that this case will now raise an error

@jbrockmendel
Copy link
Member Author

But then the whatsnew addition you have in this PR is not up to date anymore

Will update.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 12, 2018

Can you please give a new summary of what this PR does / changes etc, as I am rather lost now. The comment above about "non-trivial API change" is also not relevant any more? (and update the top post)

@jbrockmendel
Copy link
Member Author

Can you please give a new summary of what this PR does / changes etc, as I am rather lost now.

Updated above.

Bottom line: This PR makes DataFrame comparison ops broadcasting consistent with DataFrame arithmetic ops.

@jorisvandenbossche
Copy link
Member

update Since #22355 was merged,

That's an open issue, not a merged PR?

@jbrockmendel
Copy link
Member Author

Updated, #23000

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. some minor comments. ping on green.

df = pd.DataFrame(arr)

.. ipython:: python
# comparison and arithmetic both broadcast
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer you make the comments actual sentences here, which is pretty easy when you already have the blocks.

@@ -48,15 +48,19 @@ def test_mixed_comparison(self):
assert result.all().all()

def test_df_boolean_comparison_error(self):
# GH 4576
# GH#4576, GH#22880
# boolean comparisons with a tuple/list give unexpected results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change this comment


result = df > tup
assert_frame_equal(result, expected)
with pytest.raises(ValueError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you assert the error messages in these

@jreback jreback added this to the 0.24.0 milestone Oct 12, 2018
@jbrockmendel
Copy link
Member Author

ping

@gfyoung gfyoung merged commit e96c691 into pandas-dev:master Oct 13, 2018
@jorisvandenbossche
Copy link
Member

@gfyoung sorry, but Brock and me were still actively discussing this. @jreback I think it was clear that was the case, so please don't say "ping on green" in such a case

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 13, 2018

OK, so I simply reverted the merge. That seemed the easiest way to continue the discussion. It's a bit annoying, but @jbrockmendel you will need to open a new PR since this one cannot be re-opened ..

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
@jbrockmendel jbrockmendel deleted the const branch April 5, 2020 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mixed-dtype dataframe comparison with array raises incorrectly
6 participants