
BUG: Fix issue with inserting duplicate columns in a dataframe (GH14291) #14384

Closed
Conversation

paul-mannino (Contributor)
I've been sitting on this simple fix because it seems a little kludgy. SparseDataFrame doesn't do anything parallel to broadcasting in its _sanitize_column method... should it?

@@ -1563,3 +1563,4 @@ Bug Fixes
- ``PeridIndex`` can now accept ``list`` and ``array`` which contains ``pd.NaT`` (:issue:`13430`)
- Bug in ``df.groupby`` where ``.median()`` returns arbitrary values if grouped dataframe contains empty bins (:issue:`13629`)
- Bug in ``Index.copy()`` where ``name`` parameter was ignored (:issue:`14302`)
- Bug in ``DataFrame.insert`` where multiple calls with duplicate columns can fail (:issue:`14291`)
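The behavior described in this whatsnew entry can be demonstrated with a minimal sketch (an explicit index is used here so the inserted values match the frame's length; this example is illustrative, not taken from the PR):

```python
import pandas as pd

# GH 14291: a second insert of an already-present label with
# allow_duplicates=True used to fail; after the fix both columns coexist.
df = pd.DataFrame(index=range(3))
df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)
df.insert(0, 'A', ['d', 'e', 'f'], allow_duplicates=True)
print(df.columns.tolist())
```

Note that `insert` at position 0 places the newest column first, so the frame ends up with two distinct 'A' columns.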
Contributor

Should be in 0.19.1.

@@ -2590,7 +2590,7 @@ def assign(self, **kwargs):

return data

def _sanitize_column(self, key, value):
def _sanitize_column(self, key, value, broadcast=False):
Contributor

I think it would be better to do the opposite here -- keep the default behavior of _sanitize_column the same and override in the case of insert where you specify the column. The insert case where you don't want to broadcast the value is the special one. The more common case is:

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2,3,4]})
df = pd.concat([df1, df1, df2], axis=1)
df.index = [1, 2, 3]
df.loc[1, 'a'] = 3

You would override the defaults less often if you kept the default as True and only overrode it in the special case of insert.
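The broadcasting path being discussed can be sketched as follows (plain pandas, none of this PR's changes assumed): assigning a scalar through a duplicated column label fans out to every matching column.

```python
import pandas as pd

# Duplicate column labels arise naturally, e.g. from concat
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2, 3, 4]})
df = pd.concat([df1, df1, df2], axis=1)   # columns: ['a', 'a', 'b']
df.index = [1, 2, 3]

# Scalar assignment through .loc broadcasts into both 'a' columns
df.loc[1, 'a'] = 3
print(df.loc[1, 'a'].tolist())
```

This is the "more common case" the reviewer wants to keep as the default, which is why `broadcast=True` is the suggested default value.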

paul-mannino (Contributor, Author)

Thanks for the feedback.

@@ -302,7 +302,7 @@ def fillna(self, value=None, method=None, axis=0, inplace=False,
# ----------------------------------------------------------------------
# Support different internal representation of SparseDataFrame

def _sanitize_column(self, key, value):
def _sanitize_column(self, key, value, broadcast=False):
Contributor

Looks like this option is never used?

paul-mannino (Contributor, Author), Oct 10, 2016

I added this in because insert can call _sanitize_column for DataFrames and SparseDataFrames. If the signatures of the _sanitize_column functions don't match, an error gets thrown. Do you know of a cleaner workaround?

Member

I think it is OK to leave it in here, but please add a comment explaining this

How does an insert with duplicates currently work for SparseDataFrames? Does it have the same bug?

Contributor

So this was surprising to me:

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2,3,4]})
df = pd.concat([df1, df1, df2], axis=1).to_sparse()
df.index = [1, 2, 3]
df.loc[1, 'a'] = 3

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-e95308cf58e6> in <module>()
      3 df = pd.concat([df1, df1, df2], axis=1).to_sparse()
      4 df.index = [1, 2, 3]
----> 5 df.loc[1, 'a'] = 3

/Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in __setitem__(self, key, value)
    138             key = com._apply_if_callable(key, self.obj)
    139         indexer = self._get_setitem_indexer(key)
--> 140         self._setitem_with_indexer(indexer, value)
    141 
    142     def _has_valid_type(self, k, axis):

/Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in _setitem_with_indexer(self, indexer, value)
    545                 # scalar
    546                 for item in labels:
--> 547                     setter(item, value)
    548 
    549         else:

/Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in setter(item, v)
    453 
    454             def setter(item, v):
--> 455                 s = self.obj[item]
    456                 pi = plane_indexer[0] if lplane_indexer == 1 else plane_indexer
    457 

/Users/bkandel/code/forkz/pandas/pandas/sparse/frame.pyc in __getitem__(self, key)
    345             return self._getitem_array(key)
    346         else:
--> 347             return self._get_item_cache(key)
    348 
    349     @Appender(DataFrame.get_value.__doc__, indents=0)

/Users/bkandel/code/forkz/pandas/pandas/core/generic.py in _get_item_cache(self, item)
   1384         if res is None:
   1385             values = self._data.get(item)
-> 1386             res = self._box_item_values(item, values)
   1387             cache[item] = res
   1388             res._set_as_cached(item, self)

/Users/bkandel/code/forkz/pandas/pandas/core/frame.py in _box_item_values(self, key, values)
   2391         items = self.columns[self.columns.get_loc(key)]
   2392         if values.ndim == 2:
-> 2393             return self._constructor(values.T, columns=items, index=self.index)
   2394         else:
   2395             return self._box_col_values(values, items)

AttributeError: 'BlockManager' object has no attribute 'T'

Looks like the sparse version doesn't deal with the broadcasting issue at all.

Member

@bkandel Can you open a separate issue for this?

paul-mannino (Contributor, Author), Oct 10, 2016

I checked the following, which worked as expected:

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2,3,4]})
df = pd.concat([df1, df1, df2], axis=1).to_sparse()
df['a'] = [1,2,5]

_sanitize_column doesn't do broadcasting, but the SparseDataFrame setter seems to handle this case properly.

When playing around with SparseDataFrames, I did hit the same error in _box_item_values, but I was going to open a separate issue for that when I got home. Couldn't really figure out why that if/else block was there to begin with, even after looking back at the commit history.

Contributor

@jorisvandenbossche sure. I think fixing the behavior of the sparse data frame here is out of scope for this PR.

@jorisvandenbossche jorisvandenbossche added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 10, 2016
@jorisvandenbossche (Member)

@paul-mannino Already looking good, apart from the other comments from @bkandel:

  • Travis is failing due to some lint error: pandas/tests/frame/test_mutate_columns.py:171:80: E501 line too long (101 > 79 characters)
  • Can you add a docstring to _sanitize_column where you document the new broadcast parameter?


codecov-io commented Oct 11, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14384 into master will not change coverage

@@             master     #14384   diff @@
==========================================
  Files           140        140          
  Lines         50634      50634          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43173      43173          
  Misses         7461       7461          
  Partials          0          0          

Last update d98e982...a00f0fe

Contributor

bkandel commented Oct 13, 2016

New changes addressed all of my comments. Thanks!

# blocks) are always copied
def _sanitize_column(self, key, value, broadcast=True):
"""
Ensures new columns (which go into the BlockManager as new blocks) are
Contributor

Please add a Parameters section.

def _sanitize_column(self, key, value, broadcast=True):
"""
The "broadcast" parameter was added to match the method signature of
DataFrame._sanitize_column. However, this method does not make use of
Contributor

same here (you can use a shared_doc to avoid repeating things)

paul-mannino (Contributor, Author)

Both of these files import shared_doc from pandas.core.generic, but pandas.core.generic doesn't have its own definition for _sanitize_column. I can't find any other instances of someone adding to shared_doc without immediately using the docstring, but I can do that if that's what you want. For now, I just updated the docs in both places.

# GH 14291
df = DataFrame()
df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)
df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)
Contributor

Use a different value each time so it's clear that it's not copying.
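Following that suggestion, a possible rewrite of the test with distinct values (the expected layout reflects my reading of insert-at-position-0 semantics, and an explicit index is added so the inserted values match the frame's length):

```python
import pandas as pd
import pandas.testing as tm

# GH 14291: distinct values prove the second insert lands in its own
# column rather than being copied from or overwriting the first one.
df = pd.DataFrame(index=range(3))
df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)
df.insert(0, 'A', ['d', 'e', 'f'], allow_duplicates=True)

# insert at position 0 places the newest column first
expected = pd.DataFrame([['d', 'a'], ['e', 'b'], ['f', 'c']],
                        columns=['A', 'A'])
tm.assert_frame_equal(df, expected)
```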

Contributor

This test is more appropriate in the test_nonunique_indexes file.

@jorisvandenbossche (Member)

@paul-mannino Something is going wrong with the PRs that were open before we transferred to another repo. Can you therefore close this one and open a new PR (just from the same branch)?

Successfully merging this pull request may close these issues.

BUG: DataFrame.insert with allow_duplicates=True fails when already duplicates present