Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: setting column with setitem should not modify existing array inplace #35266

Conversation

jorisvandenbossche
Copy link
Member

Closes #33457

This changes ExtensionBlock.setitem to how it was before #32831: updating the Block itself inplace, but not the array.

@jbrockmendel I know you won't like that this update the Block inplace (it's also not exactly what you suggested at #33457 (comment)). But this was the smallest change I could find to fix the issue.

The alternative is probably changing BlockManager.iset to not use Block.set in case of an existing ExtensionBlock, but BlockManager.iset is quite complex ..

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version labels Jul 13, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Jul 13, 2020
@@ -705,6 +705,15 @@ def test_iloc_setitem_categorical_updates_inplace(self):
expected = pd.Categorical(["C", "B", "A"])
tm.assert_categorical_equal(cat, expected)

# __setitem__ under the other hand does not work in-place
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you we split this test into 2 tests

# overwrite column with new array
df["EA"] = pd.array([1, 2, 3], dtype="Int64")
assert original_arr is not df.EA.array
tm.assert_extension_array_equal(original_arr, pd.array([1, 2, None], dtype="Int64"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do
expected =
result =

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jul 13, 2020

The alternative is probably changing BlockManager.iset to not use Block.set in case of an existing ExtensionBlock, but BlockManager.iset is quite complex ..

@jbrockmendel quickly tried this, and so changing BlockManager.iset to create a new Block is not actually that complex. However, both iloc and __setitem__ use this under the hood, so this then doesn't preserve the change for iloc you did in #32831. So not directly sure how to solve that.

diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py
index c82670106..859e2b149 100644
--- a/pandas/core/internals/managers.py
+++ b/pandas/core/internals/managers.py
@@ -1055,11 +1055,16 @@ class BlockManager(PandasObject):
         unfit_mgr_locs = []
         unfit_val_locs = []
         removed_blknos = []
+        blocks = list(self.blocks)
         for blkno, val_locs in libinternals.get_blkno_placements(blknos, group=True):
-            blk = self.blocks[blkno]
+            blk = blocks[blkno]
             blk_locs = blklocs[val_locs.indexer]
             if blk.should_store(value):
-                blk.set(blk_locs, value_getitem(val_locs))
+                if blk.is_extension:
+                    new_blk = blk.make_block_same_class(value)
+                    blocks[blkno] = new_blk
+                else:
+                    blk.set(blk_locs, value_getitem(val_locs))
             else:
                 unfit_mgr_locs.append(blk.mgr_locs.as_array[blk_locs])
                 unfit_val_locs.append(val_locs)
@@ -1081,8 +1086,10 @@ class BlockManager(PandasObject):
             new_blknos[~is_deleted] = np.arange(self.nblocks - len(removed_blknos))
             self._blknos = new_blknos[self._blknos]
             self.blocks = tuple(
-                blk for i, blk in enumerate(self.blocks) if i not in set(removed_blknos)
+                blk for i, blk in enumerate(blocks) if i not in set(removed_blknos)
             )
+        else:
+            self.blocks = tuple(blocks)

@@ -1589,7 +1589,7 @@ def should_store(self, value: ArrayLike) -> bool:

def set(self, locs, values):
assert locs.tolist() == [0]
self.values[:] = values
self.values = values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt this going to break views?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be, but that would mean it was broken in 1.0 as well, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the [:] was added after 1.0, then I guess so, yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the [:] (i.e. behavior in master) in a released 1.0.x?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is referenced in the issue and the top post: #32831 -> this is only in master

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel @jreback I also opened an alternative PR to consider: #35271, which is a more targeted revert of the original #32831 bug fix (simpler, but also not preserving the original fix that caused the regression).

@jorisvandenbossche
Copy link
Member Author

Given there are clearly some side effects here (failing tests related to replace), @jreback @jbrockmendel can you take a look at the above-mentioned alternative PR (a more targetted revert): #35254

@jbrockmendel
Copy link
Member

I'll take fresh look at this. As of now this is the only item on my todo list before 1.1. LMK if there's anything else that needs my immediate attention.

@TomAugspurger
Copy link
Contributor

Closing this. I don't think it'll be necessary after #32831 and #35417.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: setting column with setitem should not modify existing array inplace
4 participants