DataFrame shift along columns not respecting position when dtypes are mixed #29417

pirsquared · 2019-11-05T18:07:25Z

Setup

import pandas as pd
df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

df
df.shift(axis=1)

Outputs

   A    B  C    D  E  F
0  1  3.0  X  5.0  7  W
1  2  4.0  Y  6.0  8  Z

    A   B    C    D    E  F
0 NaN NaN  NaN  3.0  1.0  X
1 NaN NaN  NaN  4.0  2.0  Y

Problem description

See the associated Stack Overflow question.

The shift places a column's values into the next column that shares the same dtype. The expectation is that the column's values are placed into the adjacent column.

Expected Output

dtypes = df.dtypes.shift(fill_value=object)
df_shifted = df.astype(object).shift(1, axis=1).astype(dtypes)

df_shifted

     A  B    C  D    E  F
0  NaN  1  3.0  X  5.0  7
1  NaN  2  4.0  Y  6.0  8

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1062.4.1.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : 0.4.0
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: 0.7.4
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.11.1
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.5
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


gfyoung · 2019-11-06T20:39:09Z

Nice catch! Investigation and PR are welcome!

haichuan0424 · 2019-11-09T16:19:14Z

Can I give this issue a try?

gfyoung · 2019-11-09T22:48:46Z

@haichuan0424 : Yes! You are more than welcome to!

haichuan0424 · 2019-11-12T22:05:58Z

@gfyoung
I tried adding the following statement inside the shift function in frame.py:
self = self.astype(object)

The result becomes:
     A  B  C  D  E  F
0  NaN  1  3  X  5  7
1  NaN  2  4  Y  6  8

It seems to downcast floating-point numbers to integers when the fractional part is zero.
However, if the floating-point numbers have non-zero fractional parts, it works fine.
For example, if the DataFrame is changed to
df = pd.DataFrame(dict(
    A=[1, 2], B=[3.1, 4.2], C=['X', 'Y'],
    D=[5.4, 6.6], E=[7, 8], F=['W', 'Z']
))
df
   A    B  C    D  E  F
0  1  3.1  X  5.4  7  W
1  2  4.2  Y  6.6  8  Z
The result becomes:
     A  B    C  D    E  F
0  NaN  1  3.1  X  5.4  7
1  NaN  2  4.2  Y  6.6  8
which is as expected.

I have also tried booleans and it works as well.

Do you think this is worth submitting a pull request? Or does a float with a zero fractional part have to be preserved as a float?
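One way to tell whether the trailing zero is really lost or only hidden in the printed frame is to inspect the Python types of the stored values instead of the repr. A minimal sketch, assuming the cast to object is done outside `shift` (not inside frame.py as described above) and that the printed `3` may simply be how floats in object columns are displayed:

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

# Cast to object so every column shares one dtype, then shift by one column.
shifted = df.astype(object).shift(axis=1)

# Look at the type of the first stored value in each column rather than
# at the printed frame; column C now holds what used to be in column B.
value_types = {col: type(shifted[col].iloc[0]).__name__ for col in shifted.columns}
print(value_types)
```

If column C still reports `float`, the values themselves were not converted, and casting back to the shifted dtypes afterwards should restore the usual float display.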

gfyoung · 2019-11-12T22:13:21Z

Try making the change and see if tests pass for you. If they do, add a test for this issue, and submit the PR for further review.

haichuan0424 · 2019-11-14T21:37:45Z

After some investigation, I found that when a mixed-dtype DataFrame is used, it is broken down into several blocks, one per dtype. Below is an example:

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))
df
   A    B  C    D  E  F
0  1  3.0  X  5.0  7  W
1  2  4.0  Y  6.0  8  Z

Internally, this DataFrame is split into three blocks, each recording the column positions it covers as a slice, and each block is shifted separately.

BlockManager
Items: Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(1, 5, 2), 2 x 2, dtype: float64
IntBlock: slice(0, 8, 4), 2 x 2, dtype: int64
ObjectBlock: slice(2, 8, 3), 2 x 2, dtype: object

FloatBlock:
[[3. 4.]
 [5. 6.]] 

after shift:
[[nan nan]
 [ 3.  4.]]

IntBlock:
[[1 2]
 [7 8]]

after shift:
[[nan nan]
 [ 1.  2.]]

ObjectBlock:
[['X' 'Y']
 ['W' 'Z']]

after shift:
[[nan nan]
 ['X' 'Y']]

After shifting each block, the result is reassembled using the original index slices. This is why the result is:

    A   B    C    D    E  F
0 NaN NaN  NaN  3.0  1.0  X
1 NaN NaN  NaN  4.0  2.0  Y

This is more complicated than I originally thought. We probably need to cast everything to object so the data is treated as one block and shifted as a whole, so that only the first column becomes NaN. The result then needs to be cast back to the dtypes it is supposed to have so we don't lose precision.
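The per-block behaviour described above can be imitated without touching pandas internals. The sketch below is only an illustration: `simulate_blockwise_shift` is a hypothetical helper, not the BlockManager code, it assumes unique column names, and it does not model the upcast of the int block to float.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))


def simulate_blockwise_shift(frame):
    """Hypothetical helper: shift by one column *within* each dtype group."""
    # Group column positions by dtype, e.g. int64 -> [0, 4].
    groups = {}
    for pos, dtype in enumerate(frame.dtypes):
        groups.setdefault(str(dtype), []).append(pos)

    shifted = {}
    for positions in groups.values():
        previous = None
        for pos in positions:
            name = frame.columns[pos]
            if previous is None:
                # The first column of a group has no same-dtype column to its left.
                shifted[name] = [np.nan] * len(frame)
            else:
                # Later columns receive the previous same-dtype column's values.
                shifted[name] = frame.iloc[:, previous].tolist()
            previous = pos

    # Put the columns back in their original order.
    return pd.DataFrame(shifted, index=frame.index)[list(frame.columns)]


print(simulate_blockwise_shift(df))
```

Because each dtype group is shifted on its own, A's values land in E, B's in D, and C's in F, which is the pattern reported in this issue.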

biddwan09 · 2019-11-19T15:20:29Z

Hi, can I work on this issue?

mproszewska · 2020-03-27T19:25:37Z

Shifts are done on NDFrames, and NDFrame executes the shift on each of its blocks (as @haichuan0424 described), hence the unintuitive results. I'm not sure if this is a bug, but IMO this could be solved by having an alternative implementation for DataFrame.shift.
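As a rough idea of what a position-based alternative could look like from user code, here is a minimal sketch. `shift_columns_by_position` is a hypothetical name, not a pandas API and not necessarily how the eventual fix works. It assumes unique column names and fills vacated columns with NaN in an object column, as in the expected output above.

```python
import numpy as np
import pandas as pd


def shift_columns_by_position(frame, periods=1):
    """Hypothetical helper: move each column's values `periods` columns over."""
    ncols = frame.shape[1]
    data = {}
    for pos, name in enumerate(frame.columns):
        src = pos - periods  # position of the column whose values land here
        if 0 <= src < ncols:
            # Copy the source column's values; its dtype travels with them.
            data[name] = frame.iloc[:, src].to_numpy()
        else:
            # Nothing shifts into this column, so fill it with NaN.
            data[name] = np.full(len(frame), np.nan, dtype=object)
    return pd.DataFrame(data, index=frame.index, columns=frame.columns)


df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))
print(shift_columns_by_position(df))
```

Working column by column sidesteps the grouping by dtype entirely, at the cost of a Python-level loop over the columns.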

jbrockmendel · 2020-09-26T22:58:29Z

closed by #35578

gfyoung added the Algos, Bug, and DataFrame labels Nov 6, 2019

jbrockmendel added the Multi-Block label Sep 21, 2020

jbrockmendel closed this as completed Sep 26, 2020
