Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: df[col] = arr should not overwrite data in df[col] #35417

Closed
wants to merge 150 commits into from

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Jul 26, 2020

cc @jorisvandenbossche this still fails 7 tests locally and there's one more (commented in-line) test that looks fishy. Extra eyeballs would be welcome.

xref #35271, #35266

@pep8speaks
Copy link

pep8speaks commented Jul 26, 2020

Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-29 22:43:33 UTC

@TomAugspurger
Copy link
Contributor

What's the summary of the behavior change from 1.0.5? DataFrame.__setitem__[array] will not mutate the existing array inplace? What dtypes does this affect?

@TomAugspurger
Copy link
Contributor

@jbrockmendel I attempted a whatsnew in e600237, if you could take a look.

@TomAugspurger TomAugspurger added this to the 1.1 milestone Jul 27, 2020
@TomAugspurger TomAugspurger added the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 27, 2020
@TomAugspurger
Copy link
Contributor

Going through the failing tests

  • test_fancy_getitem_slice_mixed: OK
  • TestDataFrameIndexing.test_iloc_row: OK
  • TestDataFrameIndexing.test_iloc_col: OK
  • TestiLoc2.test_identity_slice_returns_new_object: ... Probably OK
  • TestiLoc2.test_iloc_setitem_categorical_updates_inplace: Probably OK
  • TestMerge.test_merge_nocopy: Probably OK

Looking into test_apply_function_with_indexing some more now.

@TomAugspurger
Copy link
Contributor

One question on desired behavior: df.loc[:, "A"] = value should mutate the array inplace, right? On this branch, that is lost:

In [2]: df = pd.DataFrame({"A": [1, 2, 3]})

In [3]: df2 = df.iloc[:]

In [4]: df._mgr.blocks[0].values is df2._mgr.blocks[0].values
Out[4]: True

In [5]: df.loc[:, "A"] = 0

In [6]: df2
Out[6]:
   A
0  1
1  2
2  3

@jbrockmendel
Copy link
Member Author

One question on desired behavior: df.loc[:, "A"] = value should mutate the array inplace, right? On this branch, that is lost:

Agreed. I think thats the behavior with the FIXME comment in test_block_internals.

@jbrockmendel
Copy link
Member Author

test_fancy_getitem_slice_mixed: OK

The relevant part of this test reads:

        sliced = float_frame.iloc[:, -3:]

        msg = r"\nA value is trying to be set on a copy of a slice from a DataFrame"
        with pytest.raises(com.SettingWithCopyError, match=msg):
            sliced["C"] = 4.0

        assert (float_frame["C"] == 4).all()

I think with the new behavior, the last assertion is now incorrect, as we expect the setitem to create a new array. My question is: do we still need the SettingWithCopyError?

@TomAugspurger
Copy link
Contributor

Sorry, but "OK" I meant OK with updating the tests for the new behavior.

My question is: do we still need the SettingWithCopyError?

I was wondering this as well. If the warning is about whether or not float_frame is updated then it seems like it can be removed. But perhaps the test should be updated to use sliced.loc[:, "C"] = 4.0, in which case the warning is still valid?

@jbrockmendel
Copy link
Member Author

Down to two failing tests:

FAILED pandas/tests/groupby/test_apply_mutate.py::test_apply_function_with_indexing - AssertionError: Series are different
FAILED pandas/tests/reshape/merge/test_merge.py::TestMerge::test_merge_nocopy - AssertionError: assert False

test_apply_function_with_indexing i've tracked down into libreduction.apply_frame_axis0

@simonjayhawkins
Copy link
Member

@jbrockmendel 1.2 whatsnew now merged. also you can add back the release note that was deleted in #35271?

@jorisvandenbossche
Copy link
Member

I am not comfortable with merging this right before the RC.

(and sorry for my slow response here (I have been quite absent the last weeks for several reasons), which is part of why it's delayed until right before the RC ...)


That said, in addition I am also not yet convinced that this PR is worth the 1) subtle breaking changes and 2) performance implications.

Does that summarize the perf concern accurately?

I think so, yes. The main point is that a setitem operation, which before this PR could be fully inplace, can now trigger a copy of the (almost) full DataFrame (triggered directly or a potential delayed until a next consolidation). This additional data copy has an inherent cost. And I don't think it can be avoided while preserving the actual original intent of this PR (i.e. not overriding existing data).

But it's not only the performance implications, it are also the subtle API changes. And of course it's always a trade-off. In many case a slight performance degradation and some behaviour changes can be worth it for the benefit of the change. I am just not convinced that the benefits of this PR are big enough.

@jbrockmendel
Copy link
Member Author

So we're roughly where we've always been: I think fixing this inconsistency is a bugfix, Joris thinks its an API change.

@simonjayhawkins
Copy link
Member

So we're roughly where we've always been: I think fixing this inconsistency is a bugfix, Joris thinks its an API change.

there's a dev meeting tomorrow where this can be discussed.

@jreback jreback modified the milestones: 1.3, 1.4, 2.0 Jun 9, 2021
@@ -693,8 +694,23 @@ def _ensure_listlike_indexer(self, key, axis=None, value=None):
# GH#38148
keys = self.obj.columns.union(key, sort=False)

# Try to get the right dtype when we do this reindex.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't infer_dtype_from do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that infers a dtype, but we need to pass a scalar fill_value to reindex_axis

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this edit as no longer necessary. i think at this point all the controversial bits are gone and we're down to just the bugfix.

@jbrockmendel
Copy link
Member Author

closing in favor of #45352

@jbrockmendel jbrockmendel deleted the 33457 branch January 26, 2022 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: setting column with setitem should not modify existing array inplace
7 participants