Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix issue with inserting duplicate columns in a dataframe (GH14291) #14384

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.19.1.txt
Original file line number Diff line number Diff line change
Expand Up @@ -45,3 +45,4 @@ Bug Fixes

- Bug in ``pd.concat`` where names of the ``keys`` were not propagated to the resulting ``MultiIndex`` (:issue:`14252`)
- Bug in ``MultiIndex.set_levels`` where illegal level values were still set after raising an error (:issue:`13754`)
- Bug in ``DataFrame.insert`` where multiple calls with duplicate columns can fail (:issue:`14291`)
18 changes: 12 additions & 6 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -2487,7 +2487,7 @@ def _set_item(self, key, value):

# check if we are modifying a copy
# try to set first as we want an invalid
# value exeption to occur first
# value exception to occur first
if len(self):
self._check_setitem_copy()

Expand All @@ -2506,7 +2506,7 @@ def insert(self, loc, column, value, allow_duplicates=False):
value : int, Series, or array-like
"""
self._ensure_valid_index(value)
value = self._sanitize_column(column, value)
value = self._sanitize_column(column, value, broadcast=False)
self._data.insert(loc, column, value,
allow_duplicates=allow_duplicates)

Expand Down Expand Up @@ -2590,9 +2590,15 @@ def assign(self, **kwargs):

return data

def _sanitize_column(self, key, value):
# Need to make sure new columns (which go into the BlockManager as new
# blocks) are always copied
def _sanitize_column(self, key, value, broadcast=True):
"""
Ensures new columns (which go into the BlockManager as new blocks) are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a Parameters section

always copied.

The "broadcast" parameter indicates whether all columns with the given
key should be returned. The default behavior is desirable when
calling this method prior to modifying existing values in a DataFrame.
"""

def reindexer(value):
# reindex if necessary
Expand Down Expand Up @@ -2665,7 +2671,7 @@ def reindexer(value):
return value

# broadcast across multiple columns if necessary
if key in self.columns and value.ndim == 1:
if broadcast and key in self.columns and value.ndim == 1:
if (not self.columns.is_unique or
isinstance(self.columns, MultiIndex)):
existing_piece = self[key]
Expand Down
7 changes: 6 additions & 1 deletion pandas/sparse/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -302,7 +302,12 @@ def fillna(self, value=None, method=None, axis=0, inplace=False,
# ----------------------------------------------------------------------
# Support different internal representation of SparseDataFrame

def _sanitize_column(self, key, value):
def _sanitize_column(self, key, value, broadcast=True):
"""
The "broadcast" parameter was added to match the method signature of
DataFrame._sanitize_column. However, this method does not make use of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here (you can use a shared_doc to avoid repeating things)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these files import shared_doc from pandas.core.generic, but pandas.core.generic doesn't have it's own definition for _sanitize_column. I can't find any other instances of someone adding to the shared_doc without immediately using the docstring, but I can do that if that's what you want. For now, I just updated the docs in both places.

broadcasting.
"""
sp_maker = lambda x, index=None: SparseArray(
x, index=index, fill_value=self._default_fill_value,
kind=self._default_kind)
Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/frame/test_mutate_columns.py
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,15 @@ def test_insert(self):
exp = DataFrame(data={'X': ['x', 'y', 'z']}, index=['A', 'B', 'C'])
assert_frame_equal(df, exp)

# GH 14291
df = DataFrame()
df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)
df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a different value each time so its clear that its not copying

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is more appropriate in test_nonunique_indexes file.

df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)
exp = DataFrame([['a', 'a', 'a'], ['b', 'b', 'b'],
['c', 'c', 'c']], columns=['A', 'A', 'A'])
assert_frame_equal(df, exp)

def test_delitem(self):
del self.frame['A']
self.assertNotIn('A', self.frame)
Expand Down