DataFrame shift along columns not respecting position when dtypes are mixed #29417

Closed
pirsquared opened this issue Nov 5, 2019 · 9 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug DataFrame DataFrame data structure Multi-Block Issues caused by the presence of multiple Blocks

@pirsquared

pirsquared commented Nov 5, 2019

Setup

import pandas as pd
df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

df
df.shift(axis=1)

Outputs

   A    B  C    D  E  F
0  1  3.0  X  5.0  7  W
1  2  4.0  Y  6.0  8  Z

    A   B    C    D    E  F
0 NaN NaN  NaN  3.0  1.0  X
1 NaN NaN  NaN  4.0  2.0  Y

Problem description

See the associated Stack Overflow question.

The shift places a column's values into the next column that shares the same dtype. The expectation is that the column's values are placed into the adjacent column.
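To make "shares the same dtype" concrete, here is a minimal sketch that groups the column names of the frame above by dtype. The `groups` dictionary is only an illustration and not a pandas API, and the dtype names printed may differ between platforms and pandas versions.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

# Group column names by dtype (illustrative helper, not a pandas API).
groups = {}
for name, dtype in df.dtypes.items():
    groups.setdefault(str(dtype), []).append(name)

for dtype_name, names in groups.items():
    print(dtype_name, names)
```

With the behaviour reported here, a value in `A` lands in `E`, the next column in its own group, rather than in the adjacent column `B`.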

Expected Output

dtypes = df.dtypes.shift(fill_value=object)
df_shifted = df.astype(object).shift(1, axis=1).astype(dtypes)

df_shifted

     A  B    C  D    E  F
0  NaN  1  3.0  X  5.0  7
1  NaN  2  4.0  Y  6.0  8

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1062.4.1.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : 0.4.0
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: 0.7.4
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.11.1
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.5
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@gfyoung gfyoung added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug DataFrame DataFrame data structure labels Nov 6, 2019
@gfyoung
Member

gfyoung commented Nov 6, 2019

Nice catch! Investigation and PR are welcome!

@haichuan0424

Can I give this issue a try?

@gfyoung
Member

gfyoung commented Nov 9, 2019

@haichuan0424 : Yes! You are more than welcome to!

@haichuan0424

@gfyoung
I tried adding the following statement inside the shift function in frame.py:
self = self.astype(object)

The result becomes:
     A  B  C  D  E  F
0  NaN  1  3  X  5  7
1  NaN  2  4  Y  6  8

It seems to downcast a floating-point number to an integer when its fractional part is zero.
However, when the floating-point numbers have non-zero fractional parts, it works fine.
For example, if the DataFrame is changed to

df = pd.DataFrame(dict(
    A=[1, 2], B=[3.1, 4.2], C=['X', 'Y'],
    D=[5.4, 6.6], E=[7, 8], F=['W', 'Z']
))
df

   A    B  C    D  E  F
0  1  3.1  X  5.4  7  W
1  2  4.2  Y  6.6  8  Z

the result becomes:

     A  B    C  D    E  F
0  NaN  1  3.1  X  5.4  7
1  NaN  2  4.2  Y  6.6  8

which is as expected.

I have also tried booleans and it works as well.

Do you think this is worth submitting a pull request? Or does a float with a zero fractional part have to be preserved as a float?
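One way to tell whether the values were really converted, or are only printed without the trailing `.0`, is to inspect the type of a shifted element. The sketch below is an assumption-laden check on the frame from the issue: it calls the public `astype` and `shift` methods rather than patching frame.py, and the result may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

# Cast to object so the frame is a single block, then shift along columns.
shifted = df.astype(object).shift(1, axis=1)

# Column C now holds the values that were originally in column B.
value = shifted.loc[0, "C"]
print(type(value).__name__, repr(value))
```

If the printed type is still a float, the difference is in how object columns are rendered rather than in the stored values.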

@gfyoung
Member

gfyoung commented Nov 12, 2019

Try making the change and see if tests pass for you. If they do, add a test for this issue, and submit the PR for further review.

@haichuan0424

haichuan0424 commented Nov 14, 2019

After some investigation, I found that a mixed-dtype DataFrame is broken down internally into several blocks, one per dtype. Below is an example:

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))
df
   A    B  C    D  E  F
0  1  3.0  X  5.0  7  W
1  2  4.0  Y  6.0  8  Z

The system breaks this DataFrame into three blocks, each recording the positions of its columns, and shifts each block separately.

BlockManager
Items: Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(1, 5, 2), 2 x 2, dtype: float64
IntBlock: slice(0, 8, 4), 2 x 2, dtype: int64
ObjectBlock: slice(2, 8, 3), 2 x 2, dtype: object

FloatBlock:
[[3. 4.]
 [5. 6.]] 

after shift:
[[nan nan]
 [ 3.  4.]]

IntBlock:
[[1 2]
 [7 8]]

after shift:
[[nan nan]
 [ 1.  2.]]

ObjectBlock:
[['X' 'Y']
 ['W' 'Z']]

after shift:
[[nan nan]
 ['X' 'Y']]

After shifting each block, it reassembles the frame using each block's original column positions. This is why the result is:

    A   B    C    D    E  F
0 NaN NaN  NaN  3.0  1.0  X
1 NaN NaN  NaN  4.0  2.0  Y

This is more complicated than I originally thought. We probably need to cast everything to object so that the frame is treated as a single block and the shift turns only the first column into NaN. The result then needs to be cast back to the dtypes it is supposed to have, so we don't lose precision.
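The block-wise behaviour described above can be imitated without pandas. This is a simplified sketch using plain lists; `blocks` and `shift_block` are hypothetical stand-ins for the internal structures, `None` plays the role of NaN, and the int-to-float upcast that pandas performs is not modelled.

```python
columns = ["A", "B", "C", "D", "E", "F"]

# Each hypothetical block: (column positions in the frame, values stored one row per column).
blocks = [
    ([1, 3], [[3.0, 4.0], [5.0, 6.0]]),   # float block: B, D
    ([0, 4], [[1, 2], [7, 8]]),           # int block: A, E
    ([2, 5], [["X", "Y"], ["W", "Z"]]),   # object block: C, F
]

def shift_block(values, periods=1, fill=None):
    """Shift a block's rows (the frame's columns) forward by `periods` (>= 0)."""
    n = len(values)
    width = len(values[0])
    pad = [[fill] * width for _ in range(min(periods, n))]
    return pad + values[: max(n - periods, 0)]

# Shift every block on its own, then write columns back to their original positions.
result = {}
for positions, values in blocks:
    for pos, col in zip(positions, shift_block(values)):
        result[columns[pos]] = col

for name in columns:
    print(name, result[name])
```

Because each block is shifted in isolation, `A`, `B` and `C` all end up empty, and `D`, `E` and `F` receive the values of `B`, `A` and `C` respectively, which matches the table above.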

@biddwan09
Contributor

Hi, can I work on this issue?

@mproszewska
Contributor

Shifts are done on NDFrame, and NDFrame executes the shift on each block (as @haichuan0424 described), hence the non-intuitive results. I'm not sure whether this is a bug, but IMO it could be solved by having an alternative implementation for DataFrame.shift.
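As an illustration of what a position-based alternative might look like from user code, here is a minimal sketch. `shift_columns_by_position` is a hypothetical helper name, it assumes unique column labels, and it is not necessarily how the eventual fix was implemented.

```python
import numpy as np
import pandas as pd

def shift_columns_by_position(df, periods=1):
    """Return a frame whose column i holds the values of column i - periods."""
    n_cols = df.shape[1]
    shifted = {}
    for i, name in enumerate(df.columns):
        src = i - periods
        if 0 <= src < n_cols:
            # Take the source column by position, ignoring its dtype.
            shifted[name] = df.iloc[:, src].to_numpy()
        else:
            # No source column at that position: fill with NaN.
            shifted[name] = np.full(len(df), np.nan)
    return pd.DataFrame(shifted, index=df.index)

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))
result = shift_columns_by_position(df)
print(result)
```

Each column is copied individually, so every destination column takes on the dtype of its source column and no cast through object is needed.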

@jbrockmendel jbrockmendel added the Multi-Block Issues caused by the presence of multiple Blocks label Sep 21, 2020
@jbrockmendel
Member

closed by #35578
