API: DataFrame setitem: setting columns with a DataFrame RHS doesn't align column names? #46974

Open
jorisvandenbossche opened this issue May 9, 2022 · 2 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@jorisvandenbossche
Member

The case being considered here is when setting multiple columns into a DataFrame (using __setitem__, df[[..]] = ..), using a DataFrame right-hand-side value. So a simple, unambiguous example is:

>>> df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
>>> df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=['a', 'b'])
>>> df1[['a', 'b']] = df2
>>> df1
   a   b
0  0   2
1  4   6
2  8  10

However, we are setting the multiple columns column-by-column in order, ignoring potential misaligned column names:

>>> df1[['a', 'b']] = df2[['b', 'a']]
>>> df1
    a  b
0   2  0
1   6  4
2  10  8

I think this is "expected" behaviour, meaning that it seems to be intentional and long-standing. I personally find it surprising, though, especially because using loc instead of plain setitem, i.e. df1.loc[:, ['a', 'b']] = df2[['b', 'a']], does align the column names:

>>> df1.loc[:, ['a', 'b']] = df2[['b', 'a']]
>>> df1
   a   b
0  0   2
1  4   6
2  8  10
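
A minimal sketch of the two behaviours side by side, starting from fresh frames each time so the results do not depend on earlier assignments. The helper `make_frames` and the variable names are illustrative only, and the outcome reflects the behaviour described in this issue, so it may differ in other pandas versions.

```python
import numpy as np
import pandas as pd


def make_frames():
    """Build the two example frames from this issue (illustrative helper)."""
    left = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
    right = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=["a", "b"])
    return left, right


# Plain setitem: RHS columns are consumed in order, names are ignored.
left, right = make_frames()
left[["a", "b"]] = right[["b", "a"]]
via_setitem = left

# .loc: RHS columns are aligned to the target column names first.
left, right = make_frames()
left.loc[:, ["a", "b"]] = right[["b", "a"]]
via_loc = left

print(via_setitem)
print(via_loc)
```

With plain setitem, column "a" receives the values of the RHS column named "b"; with `.loc`, column "a" receives the values of the RHS column named "a".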

I didn't directly find an issue about this, only a PR that touched the code that handles this but in case of duplicate columns (#39403), and a comment at https://github.com/pandas-dev/pandas/pull/39341/files#r563895152 about column names being irrelevant for setitem (cc @phofl @jbrockmendel)

But because we ignore alignment of column names and then do the setting by name (and not by position):

pandas/pandas/core/frame.py

Lines 3747 to 3750 in dd6869f

if isinstance(value, DataFrame):
    check_key_length(self.columns, key, value)
    for k1, k2 in zip(key, value.columns):
        self[k1] = value[k2]

you get inconsistent results with duplicate column names.

For example, in this case the second column of df2 is set to both "b" columns of df1:

>>> df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'b'])
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'c'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   2
1   6   8   8
2  12  14  14
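
To see where this result comes from, the loop quoted above can be unrolled by hand. This is only a sketch that mirrors the `zip(key, value.columns)` loop from the snippet; it skips the `check_key_length` call and is not the actual pandas implementation.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "b"])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=["b", "a", "c"])
key = ["a", "b"]

# zip stops after len(key) pairs, so only ("a", "b") and ("b", "a") are
# visited; the third column of df2 ("c") is never used.
pairs = list(zip(key, df2.columns))
for k1, k2 in pairs:
    # Setting by *name*: df1["b"] = <Series> writes to every column
    # labelled "b", which is why both "b" columns end up identical.
    df1[k1] = df2[k2]

print(pairs)
print(df1)
```

The second iteration assigns `df2["a"]` to the label "b", and because that label occurs twice in df1, the same values land in both columns.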

On the other hand, if df2 also has duplicate column names but in a different order, you get either an error or a "working" result, depending on the exact order:

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'b'])
>>> df1[['a', 'b']] = df2
...
ValueError: Columns must be same length as key

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'a'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16

And if the column name order matches exactly, the columns are set "correctly" as well:

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['a', 'b', 'b'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16

So in general, in those examples, the column names do matter.


General questions:

  • Are we OK with __setitem__ (df[key] = value) with a dataframe value ignoring the value's column names (i.e. not aligning key and value.columns)? And are we OK with this being different from .loc[]?
  • If we keep the current behaviour, should we set those columns by position instead of by column name, so that you don't get such inconsistent results for duplicate column names either?
    (But how do we change this? It's a breaking change. Maybe we should deprecate/disallow such setitem with duplicate column names?)
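
One way to picture the "set by position" option from the second question is the sketch below. The function `set_columns_by_position` is hypothetical (it is not a pandas API and not a proposed implementation); it resolves the key to integer positions once and then writes each RHS column to exactly one target position.

```python
import numpy as np
import pandas as pd


def set_columns_by_position(frame, key, value):
    """Hypothetical helper: assign value's columns to the positions that
    `key` selects in `frame`, ignoring value's column names entirely."""
    # Expand each key label to every matching position, in key order.
    positions = [
        i for k in key for i, name in enumerate(frame.columns) if name == k
    ]
    if len(positions) != value.shape[1]:
        raise ValueError("Columns must be same length as key")
    for j, pos in enumerate(positions):
        # to_numpy() drops the index, so only row order matters here.
        frame.iloc[:, pos] = value.iloc[:, j].to_numpy()


df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "b"])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=["b", "a", "c"])
set_columns_by_position(df1, ["a", "b"], df2)
print(df1)
```

Under this assumption the result no longer depends on the column names of the right-hand side: the three selected positions receive the three RHS columns in order, whatever they are called. Note that this sketch also ignores row-index alignment, which the current setitem does not.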
@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels May 9, 2022
@jbrockmendel
Member

> maybe we should deprecate/disallow such setitem with duplicate column names?

yah, it sounds like the "correct" behavior is ambiguous/non-obvious, so seems like a good candidate to deprecate/disallow.
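
For illustration, disallowing the ambiguous case could look roughly like the following user-level guard. The wrapper `setitem_rejecting_duplicates` and its error message are made up for this sketch; an actual deprecation inside pandas would presumably live in `__setitem__` and emit a warning first.

```python
import numpy as np
import pandas as pd


def setitem_rejecting_duplicates(frame, key, value):
    """Hypothetical guard: refuse frame[key] = value when duplicate column
    names are involved on either side, otherwise defer to plain setitem."""
    targeted = frame.columns[frame.columns.isin(key)]
    if not targeted.is_unique or not value.columns.is_unique:
        raise ValueError(
            "setting columns from a DataFrame is ambiguous with duplicate "
            "column names"
        )
    frame[key] = value


# Unique names on both sides: behaves like plain setitem (positional).
unique = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
source = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=["b", "a"])
setitem_rejecting_duplicates(unique, ["a", "b"], source)

# Duplicate target names: rejected instead of silently broadcasting.
dup = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "b"])
wide = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=["b", "a", "c"])
try:
    setitem_rejecting_duplicates(dup, ["a", "b"], wide)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```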

@phofl
Member

phofl commented Jul 7, 2022

I agree with @jbrockmendel, the difference in behavior is expected but the inconsistency should be avoided.

3 participants