
Pivot / unstack on large data frame does not work int32 overflow #26314

Open
MarkiesFredje opened this issue May 8, 2019 · 34 comments

Comments


@MarkiesFredje MarkiesFredje commented May 8, 2019

Code Sample, a copy-pastable example if possible

```python
predictor_purchases_p = predictor_purchases.groupby(["ag", "artikelnr"])["som"].max().unstack().fillna(0)
```

or

```python
predictor_purchases_p = predictor_purchases.pivot(index="ag", columns="artikelnr", values="som")
```

Problem description

I'm working on rather large data (>100 GB in memory) on a beefy server (3 TB RAM).
When refactoring my code from pandas 0.21 to the latest version, the pivot / unstack now raises an exception:

Unstacked DataFrame is too big, causing int32 overflow

I was able to eliminate the problem by changing reshape.py, modifying line 121 from dtype np.int32 to dtype np.int64:

```python
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
```
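A minimal sketch of why the guard fires, assuming only NumPy integer semantics: with dtype=np.int32 the product silently wraps once it exceeds 2**31 - 1 and comes out negative, which is exactly what the check in reshape.py detects.

```python
import numpy as np

# Row/column counts of the order discussed later in this thread.
num_rows, num_columns = 2584251, 4539

np.multiply(num_rows, num_columns, dtype=np.int32)  # negative: wrapped past 2**31 - 1
np.multiply(num_rows, num_columns, dtype=np.int64)  # 11729915289, the true cell count
```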

Expected Output

Not being limited by int32 dimensions when reshaping a data frame.
This feels like a restriction that should not be there.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.0
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jreback jreback commented May 8, 2019

Try on master; this was recently patched.

@MarkiesFredje MarkiesFredje commented May 8, 2019

Corporate server, so long story short, I can only work with conda releases.
I'm not able to pull anything off GitHub and make a build.

@jreback jreback commented May 8, 2019

https://github.com/pandas-dev/pandas/pull/23512/files

If you are trying to do this, approach your problem differently.
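One way to "approach it differently" is sketched below; this is an untested workaround, not an official pandas recipe. The idea is to unstack in chunks over the column-level keys, so each partial result stays under the int32 cell limit, then concatenate:

```python
import pandas as pd

def unstack_in_chunks(series, chunk_size=500):
    """Unstack a Series with a 2-level MultiIndex, chunking over the last
    level so each partial unstack stays below the int32 cell limit."""
    keys = series.index.get_level_values(-1).unique()
    pieces = []
    for start in range(0, len(keys), chunk_size):
        mask = series.index.get_level_values(-1).isin(keys[start:start + chunk_size])
        subset = series[mask]
        # Drop unused level values so unstack() does not materialize
        # columns for keys outside this chunk.
        subset.index = subset.index.remove_unused_levels()
        pieces.append(subset.unstack())
    return pd.concat(pieces, axis=1)

# For the original example, hypothetically:
# predictor_purchases_p = unstack_in_chunks(
#     predictor_purchases.groupby(["ag", "artikelnr"])["som"].max()
# ).fillna(0)
```

pd.concat outer-joins the pieces on the row index, so rows missing from a chunk come back as NaN (hence the fillna(0) afterwards).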

@MarkiesFredje MarkiesFredje commented May 8, 2019

The point I'm trying to raise is: why is the number of cells limited to the max value of np.int32?

```python
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
```

This creates constraints when working with large data frames.
Basically, I'm proposing to change this to np.int64.

@jreback jreback commented May 8, 2019

so you have 2B columns?

@MarkiesFredje MarkiesFredje commented May 8, 2019

My current dataset has:

RangeIndex: 2584251 entries
Columns: 4539 entries

num_cells = 2584251 * 4539 = 11,729,915,289 cells

So I have 11.7B cells.

Putting an int32 constraint on the number of cells is way too small for my datasets.
I'm quite sure this causes problems for other users.
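For scale, a quick check against the limit (plain arithmetic, nothing pandas-specific):

```python
import numpy as np

np.iinfo(np.int32).max  # 2147483647
2584251 * 4539          # 11729915289, roughly 5.5x the int32 maximum
```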

@MarkiesFredje MarkiesFredje commented May 8, 2019

By the way, in pandas 0.21 I could execute these unstacks without problems; size was not an issue.
Upgrading to pandas 0.24.2 removes this ability.

@Rblivingstone Rblivingstone commented Jun 27, 2019

Yeah, I'm bumping into this as well: I'm trying to make 2.87 billion cells in my unstack. I saw in issue #20601 that throwing a more informative error was judged to make more sense than increasing from int32 to int64, and I kind of agree with that assessment. It would be nice to have an option in unstack that would let you tell it you expect a very large number of cells and switch to an int64 index.

I'm not sure how difficult this would be, or if it would be worth it to satisfy people with long lists of product-user pairs like me.

@jreback jreback added this to the Contributions Welcome milestone Jun 28, 2019
@jreback jreback commented Jun 28, 2019

@Rblivingstone certainly would take a PR to allow int64 here; it’s not tested and that’s why we raise

@Code4SAFrankie Code4SAFrankie commented Jul 24, 2019

I'm getting this error at 6 GB of memory use when pivoting the large MovieLens ratings.csv. So I agree that in this day and age we need int64.

@Code4SAFrankie Code4SAFrankie commented Jul 24, 2019

Turns out I get the same error even after changing the reshape.py line to `num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)`, although the error definitely looks like it occurs there.

@buddingPop buddingPop commented Aug 1, 2019

I get "ValueError: negative dimensions are not allowed" after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64).

Any chance we have a different workaround? I only have 6000 columns...

@meganson meganson commented Aug 12, 2019

I have the same issue. My data uses 9.9 GB of memory.

```python
df.pivot_table(index='uno', columns=['programid'], values='avg_time')
```

ValueError: Unstacked DataFrame is too big, causing int32 overflow

pandas 0.25.0

@TomAugspurger TomAugspurger commented Aug 12, 2019

@buddingPop @Code4SAFrankie @meganson are any of you interested in working on this?

We also need a reproducible example, if anyone can provide that.

@subhrajitb subhrajitb commented Aug 17, 2019

You can download the ratings.csv file from
https://www.kaggle.com/grouplens/movielens-20m-dataset

Then create a pivot table as below to reproduce the problem:

```python
pivotTable = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
```

@TomAugspurger TomAugspurger commented Aug 19, 2019

We won't be able to include downloading that dataset in a unit test. I assume this can be reproduced with the right style of random data? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@vengertsev vengertsev commented Aug 26, 2019

Same issue here.

```python
df.set_index(['Element', 'usix']).unstack().reset_index(col_level=1)
```

ValueError: Unstacked DataFrame is too big, causing int32 overflow
pandas: 0.23.4

@TomAugspurger Let me try to generate something similar with random data.

@vengertsev vengertsev commented Aug 26, 2019

@TomAugspurger

This is a very naive test, but it seems to reproduce the error for me.

```python
import random
import pandas as pd

row_cnt = 4000000
c1_unique_val_cnt = 1500000
c2_unique_val_cnt = 1600

c1_set = [i for i in range(c1_unique_val_cnt)]
c1 = [random.choice(c1_set) for i in range(row_cnt)]
c2_set = [i for i in range(c2_unique_val_cnt)]
c2 = [random.choice(c2_set) for i in range(row_cnt)]

df_test = pd.DataFrame({'c1': c1, 'c2': c2})
t = df_test.set_index(['c1', 'c2']).unstack()
```

Produces an error: "ValueError: Unstacked DataFrame is too big, causing int32 overflow"

@TomAugspurger TomAugspurger commented Aug 26, 2019

Thanks @vengertsev. Here's a similar example that's a bit faster, since it vectorizes the column generation:

```python
In [62]: df = pd.DataFrame(np.random.randint(low=0, high=1500000, size=(4000000, 2)), columns=['a', 'b'])

In [63]: df.set_index(['a', 'b']).unstack()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-2ee2ef4b1279> in <module>
----> 1 df.set_index(['a', 'b']).unstack()

~/sandbox/pandas/pandas/core/frame.py in unstack(self, level, fill_value)
   6311         from pandas.core.reshape.reshape import unstack
   6312
-> 6313         return unstack(self, level, fill_value)
   6314
   6315     _shared_docs[

~/sandbox/pandas/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    408     if isinstance(obj, DataFrame):
    409         if isinstance(obj.index, MultiIndex):
--> 410             return _unstack_frame(obj, level, fill_value=fill_value)
    411         else:
    412             return obj.T.stack(dropna=False)

~/sandbox/pandas/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    438             value_columns=obj.columns,
    439             fill_value=fill_value,
--> 440             constructor=obj._constructor,
    441         )
    442         return unstacker.get_result()

~/sandbox/pandas/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    135
    136         if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 137             raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
    138
    139         self._make_sorted_values_labels()

ValueError: Unstacked DataFrame is too big, causing int32 overflow
```
@vengertsev vengertsev commented Aug 26, 2019

@TomAugspurger, Nice, thanks!

@TomAugspurger TomAugspurger commented Aug 26, 2019

Anyone interested in working on this now that we have a reproducible example?

@meganson meganson commented Aug 28, 2019

@TomAugspurger


```python
ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 ratings = pd.read_csv('/home/ml/rating_stg.csv').groupby(['uno', 'program_title'])['view_percent'].mean().unstack()
      2 ratings.head()

~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
   3299         """
   3300         from pandas.core.reshape.reshape import unstack
-> 3301         return unstack(self, level, fill_value)
   3302
   3303     # ----------------------------------------------------------------------

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    394     unstacker = _Unstacker(obj.values, obj.index, level=level,
    395                            fill_value=fill_value,
--> 396                            constructor=obj._constructor_expanddim)
    397     return unstacker.get_result()
    398

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    122
    123         if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 124             raise ValueError('Unstacked DataFrame is too big, '
    125                              'causing int32 overflow')
    126

ValueError: Unstacked DataFrame is too big, causing int32 overflow
```

@imanekho imanekho commented Sep 18, 2019

Hello,

I am facing the same issue trying to create a pivot table; can someone help?

Thank you

@TomAugspurger TomAugspurger commented Sep 18, 2019

@imanekho imanekho commented Sep 18, 2019

Is this the only solution, do you think? Maybe there's an alternative?
I am working on my final project and I have a large dataset; it's too late to change it.
Thank you

@ericbrown ericbrown commented Sep 18, 2019

> Is this the only solution, do you think? Maybe there's an alternative? I am working on my final project and I have a large dataset; it's too late to change it.

You could attempt to downgrade your pandas to 0.21 (see here: #26314 (comment)). I ran into a similar problem recently, downgraded, and it worked on one of my datasets. It isn't the best solution, but it might get you moving.
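(For conda-only environments like the original poster's, the equivalent would presumably be `conda install pandas=0.21`, though such an old build may not resolve cleanly against newer dependency stacks.)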

@imanekho imanekho commented Sep 18, 2019

thank you

@imanekho imanekho commented Sep 18, 2019

Can I have an idea of how you downgraded your version? I tried to install 0.21, but I get a lot of errors on dependencies with other libraries.
Thank you

@ericbrown ericbrown commented Sep 18, 2019

> Can I have an idea of how you downgraded your version? I tried to install 0.21, but I get a lot of errors on dependencies with other libraries.

```
pip install pandas==0.21
```

@imanekho

This comment has been minimized.

Copy link

@imanekho imanekho commented Sep 18, 2019

This is exactly the command I used.

@atriantafybbc atriantafybbc commented Oct 22, 2019

> This is exactly the command I used.

Try with Python 3.6.

@dumbledad dumbledad commented Nov 11, 2019

Any news on this? I'm failing to roll back to 0.21. My next course of action is to rewrite the vectorised function as a loop, but I'd rather not.

@jimhavrilla

This comment has been minimized.

Copy link

@jimhavrilla jimhavrilla commented Jan 23, 2020

I got this error in version 0.25.3; any news on it?

@TomAugspurger TomAugspurger commented Jan 24, 2020

Still open, still looking for a volunteer to fix it.
