BUG: loc __setitem__ has incorrect behavior when assigned a DataFrame and new columns and duplicated columns are added. #58317

Open
3 tasks done
sfc-gh-vbudati opened this issue Apr 18, 2024 · 0 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], columns=["D", "B", "C", "A"]
)

item = pd.DataFrame(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]],
    columns=["A", "B", "C", "X"],
    index=[
        3,  # 3 does not exist in the row key, so it will be skipped
        2,
        1,
    ],
)

df.loc[[True, False, True], ["B", "E", "B"]] = item

Issue Description

Since pandas 2.2.0, loc __setitem__ misbehaves when a DataFrame is assigned to another DataFrame and the column key both adds new columns and repeats existing ones. The bug only triggers when the column key follows the pattern [existing column(s), non-existent column(s), duplicated existing column(s)]. In the example provided, "B" exists but "E" does not. The following loc __setitem__ operations reproduce it as well:

df.loc[[True, False, True], ["B", "E", 1, "B"]] = item
df.loc[[True, False, True], ["B", "E", 1, "B", "C", "X", "C", 2, "C"]] = item

In some cases the resulting DataFrame cannot be printed at all; trying to print it raises the error I ran into below:

>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], columns=["D", "B", "C", "A"]
... )
>>> df
   D  B  C   A
0  1  2  3   4
1  4  5  6   7
2  7  8  9  10

>>> item = pd.DataFrame(
...     [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]],
...     columns=["A", "B", "C", "X"],
...     index=[
...         3,  # 3 does not exist in the row key, so it will be skipped
...         2,
...         1,
...     ],
... )
>>> item
   A  B  C   X
3  1  2  3   4
2  4  5  6   7
1  7  8  9  10

>>> df.loc[[True, False, True], ["B", "E", "B"]] = item
>>> df

# ERROR!
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/frame.py", line 1203, in __repr__
    return self.to_string(**repr_params)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/frame.py", line 1383, in to_string
    return fmt.DataFrameRenderer(formatter).to_string(
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 962, in to_string
    string = string_formatter.to_string()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/string.py", line 29, in to_string
    text = self._get_string_representation()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/string.py", line 44, in _get_string_representation
    strcols = self._get_strcols()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/string.py", line 35, in _get_strcols
    strcols = self.fmt.get_strcols()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 476, in get_strcols
    strcols = self._get_strcols_without_index()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 740, in _get_strcols_without_index
    fmt_values = self.format_col(i)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 754, in format_col
    return format_array(
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1161, in format_array
    return fmt_obj.get_result()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1194, in get_result
    fmt_values = self._format_strings()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1250, in _format_strings
    & np.all(notna(vals), axis=tuple(range(1, len(vals.shape))))
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/dtypes/missing.py", line 457, in notna
    res = isna(obj)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/dtypes/missing.py", line 178, in isna
    return _isna(obj)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/dtypes/missing.py", line 207, in _isna
    return _isna_array(obj, inf_as_na=inf_as_na)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/dtypes/missing.py", line 300, in _isna_array
    result = np.isnan(values)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
"""

Since I cannot print the new df directly, I inspected it row by row via iloc __getitem__:

>>> df.iloc[0]
D      1
B    NaN
B    NaN
C      4
A     []
E    NaN
1    NaN
Name: 0, dtype: object

>>> df.iloc[1]
D      4
B    5.0
B    6.0
C      7
A     []
E    NaN
1    NaN
Name: 1, dtype: object

>>> df.iloc[2]
D      7
B    5.0
B    5.0
C     10
A     []
E    NaN
1    NaN
Name: 2, dtype: object
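
For anyone else trying to inspect a frame whose repr fails like this, here is a minimal sketch of a column-by-column fallback. The helper names `summarize_columns` and `describe_frame` are hypothetical, not pandas API, and it is an assumption that positional column access still works on the affected frame; if the internal blocks are inconsistent, this may fail too.

```python
import pandas as pd


def summarize_columns(frame):
    """Describe each column by position, so duplicated labels stay distinguishable."""
    lines = []
    for pos, name in enumerate(frame.columns):
        col = frame.iloc[:, pos]  # positional access avoids label ambiguity
        lines.append(f"{pos}: {name!r} dtype={col.dtype} values={col.tolist()}")
    return "\n".join(lines)


def describe_frame(frame):
    """Return the normal repr, falling back to a per-column summary on TypeError."""
    try:
        return repr(frame)
    except TypeError:
        return summarize_columns(frame)


# Usage on a healthy frame with a duplicated column label
healthy = pd.DataFrame([[1, 2], [3, 4]], columns=["B", "B"])
print(summarize_columns(healthy))
```

The per-column dtypes are the interesting part here: an unexpected object dtype or an empty value list would point at the column that was corrupted.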

The expected result is:

   D    B    B  C   A    E 
0  1  NaN  NaN  3   4  NaN
1  4  5.0  5.0  6   7  NaN 
2  7  5.0  5.0  9  10  NaN 

Notice column A: originally it had values [4, 7, 10] (as seen in the expected result). In the faulty result, all values in A are replaced by [].

Expected Behavior

# Expected behavior, from pandas versions 2.1.x and below, the result would be:
   D    B    B  C   A    E 
0  1  NaN  NaN  3   4  NaN 
1  4  5.0  5.0  6   7  NaN 
2  7  5.0  5.0  9  10  NaN 

# however, pandas versions 2.2.0+ error out.
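
To make the expected values for "B" concrete, here is a small sketch that derives them with explicit index alignment, without going through the DataFrame loc __setitem__ path that is affected. The variable names are illustrative only, and this covers just the "B" values, not the full expected frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], columns=["D", "B", "C", "A"]
)
item = pd.DataFrame(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]],
    columns=["A", "B", "C", "X"],
    index=[3, 2, 1],
)

row_mask = np.array([True, False, True])
selected_rows = df.index[row_mask]  # row labels 0 and 2

# Align item["B"] to the selected labels: label 0 is missing from item, so it
# becomes NaN; label 2 picks up item.loc[2, "B"] == 5. Label 3 is never used.
aligned_b = item["B"].reindex(selected_rows)

# Row 1 is not selected, so it keeps its original value (5, upcast to float).
expected_b = df["B"].astype("float64")
expected_b.loc[selected_rows] = aligned_b

print(expected_b.tolist())  # [nan, 5.0, 5.0]
```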

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c1
python : 3.9.18.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:49 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.2.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3.1
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@sfc-gh-vbudati sfc-gh-vbudati added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 18, 2024
@sfc-gh-vbudati sfc-gh-vbudati changed the title BUG: loc __setitem__ has incorrect behavior when assigned a DataFrame and new columns are added. BUG: loc __setitem__ has incorrect behavior when assigned a DataFrame and new columns and duplicated columns are added. Apr 18, 2024