Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.__setitem__ sometimes operating inplace #43406

Merged
merged 11 commits into from Nov 1, 2021

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

The fixed bug is the currently-xfailed test test_setitem_same_dtype_not_inplace (see #39510, which has its own subsection in the 1.3.0 release notes). Pretty much all the other altered tests are changed because those tests rely on the incorrect behavior.

The main non-test change is the addition of an inplace check in BlockManager.iset. We want inplace to be False when this is reached via DataFrame.__setitem__ and True when reached via loc/iloc.__setitem__

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a user note of when this is changed, its not obvious at all.

pandas/core/series.py Show resolved Hide resolved
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves ArrayManager labels Sep 5, 2021
@jreback
Copy link
Contributor

jreback commented Sep 5, 2021

cc @jorisvandenbossche

@jbrockmendel
Copy link
Member Author

cc @phofl

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche @phofl gentle ping

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lgtm in general. I think we should add tests to cover this:

@pytest.mark.parametrize("indexer", ["a", ["a"], [True, False]])
@pytest.mark.parametrize(
    "value, set_value",
    [
        (1, 5),
        (1.0, 5.0),
        (Timestamp("2020-12-31"), Timestamp("2021-12-31")),
        ("a", "b"),
    ],
)
def test_setitem_not_operating_inplace(value, set_value, indexer):
    df = DataFrame({"a": value}, index=[0, 1])
    expected = df.copy()
    view = df[:]
    df[indexer] = set_value
    tm.assert_frame_equal(view, expected)

It looks like as the boolean indexer case is still operating inplace

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche gentle ping

@jbrockmendel
Copy link
Member Author

@jreback gentle ping, doesn't sound like the others are weighing in

@phofl
Copy link
Member

phofl commented Oct 5, 2021

I think this looks good. Have to figure out the boolean case in the future, but otherwise is good

@jreback
Copy link
Contributor

jreback commented Oct 10, 2021

let me look

@jbrockmendel
Copy link
Member Author

id like to get this in before tackling #43996

@jbrockmendel
Copy link
Member Author

gentle ping; has a bearing on #40380

@jreback jreback added this to the 1.4 milestone Nov 1, 2021
@jreback jreback merged commit 03dd698 into pandas-dev:master Nov 1, 2021
@jreback
Copy link
Contributor

jreback commented Nov 1, 2021

cool, ideally can simplify the caching checks at some point (e.g. the maybe_update_cacher is getting more complicated).

@jbrockmendel jbrockmendel deleted the bug-set_inplace branch November 1, 2021 17:35
@SimeonStoykovQC
Copy link

df.assign uses DataFrame.__setitem__:

pandas/pandas/core/frame.py

Lines 4841 to 4845 in 965ceca

data = self.copy(deep=None)
for k, v in kwargs.items():
data[k] = com.apply_if_callable(v, data)
return data

We want inplace to be False when this is reached via DataFrame.__setitem__ and True when reached via loc/iloc.__setitem__

Did this change intend to modify df.assign's behavior? Or do you think assign should be using the in-place mode again through loc/iloc?

@jonashaag
Copy link
Contributor

jonashaag commented May 31, 2023

@SimeonStoykovQC no, assign should not modify the existing data so not be in-place. I guess an in-place modification on a copy would be fine though

@jonashaag
Copy link
Contributor

@SimeonStoykovQC I misunderstood what you meant. You meant to use [...] instead of .loc[:, ...] on the copy, right? That makes sense if .loc is faster (which I don't know)

@phofl
Copy link
Member

phofl commented Jun 1, 2023

No, the assign change is intended.

@jonashaag
Copy link
Contributor

It caused a 10x performance regression in some of our code. We will try to come up with a reproducer.

@SimeonStoykovQC
Copy link

No, the assign change is intended.

In the meantime, could you elaborate? What's wrong with in-place allocations if df.assign is creating a deep copy (prior version 2.0)?

@phofl
Copy link
Member

phofl commented Jun 1, 2023

Yeah I know, but this was fixed for 2.0, we had a nasty performance bottleneck in 1.5 unfortunately. Everything is back to normal with 2.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ArrayManager Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants