Pivot / unstack on large data frame does not work int32 overflow #26314

MarkiesFredje · 2019-05-08T09:39:49Z

Code Sample, a copy-pastable example if possible

predictor_purchases_p = predictor_purchases.groupby(["ag", "artikelnr"])["som"].max().unstack().fillna(0)

or

predictor_purchases_p = predictor_purchases.pivot(index="ag", columns="artikelnr", values="som")

Problem description

I'm working on rather large data (>100GB in memory) on a beefy server (3TB ram)
When refactoring my code from pandas 0.21 to latest version, the pivot / unstack now returns an exception.

Unstacked DataFrame is too big, causing int32 overflow

I was able to eliminate the problem by changing the reshape.py:
modify line 121 from dtype np.int32 to dtype np.int64:
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)

Expected Output

Not being limited by int32 dims when reshaping a data frame.
This feels like a restriction which should not be there.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.0
pandas_gbq: None
pandas_datareader: None
gcsfs: None


jreback · 2019-05-08T10:14:17Z

try on master this was recently patched

MarkiesFredje · 2019-05-08T11:10:11Z

Corporate server, so long story short, I can only work with conda releases.
I'm not able to pull anything off github and make a build.

jreback · 2019-05-08T11:25:23Z

https://github.com/pandas-dev/pandas/pull/23512/files

if you are trying to do this, approach your problem differently

MarkiesFredje · 2019-05-08T12:30:49Z

The point I'm trying to raise is: why is the number of cells limited to max value of np.int32?
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

This creates constraints when working with large data frames.
Basically, I'm proposing to change this to np.int64.

jreback · 2019-05-08T13:18:39Z

so you have 2B columns?

MarkiesFredje · 2019-05-08T13:55:34Z

My current dataset has
RangeIndex: 2584251 entries
Columns: 4539 entries

num_cells = 2584251 * 4539 = 11.729.915.289 cells

So, I have 11.7 B cells

Putting an int32 constraint on the number of cells is way too small for my datasets.
I'm quite sure this causes problems for other users.
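To make the arithmetic above concrete, here is a minimal sketch (not the pandas code itself) that multiplies the two dimensions quoted in this comment in exact Python integers and in int32. The one-element arrays are only an assumption chosen so the wrap-around happens silently; the variable names are illustrative.

```python
import numpy as np

# Dimensions quoted in the comment above.
num_rows, num_columns = 2_584_251, 4_539

# Python integers have arbitrary precision, so this is the true cell count.
exact = num_rows * num_columns

# The same product in int32 wraps around modulo 2**32.
wrapped = int(
    np.multiply(
        np.array([num_rows], dtype=np.int32),
        np.array([num_columns], dtype=np.int32),
    )[0]
)

int32_max = np.iinfo(np.int32).max  # 2**31 - 1

print("exact cells :", exact)
print("int32 max   :", int32_max)
print("too big?    :", exact > int32_max)
print("int32 result:", wrapped)
```

The wrapped value comes out negative here, which is the condition the `num_cells <= 0` check in reshape.py reacts to.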

MarkiesFredje · 2019-05-08T14:14:29Z

Btw, in pandas 0.21 I could execute these unstacks without problems. Size was not an issue.
Upgrading to pandas 0.24.2 removes this ability.

Rblivingstone · 2019-06-27T16:47:20Z

Yeah, I'm bumping into this as well. I'm trying to make 2.87 billion cells in my unstack. I saw issue #20601, where the conclusion was basically that throwing a more informative error made more sense than increasing from int32 to int64. I kind of agree with that assessment. It would be nice to have an option in unstack that would let you tell it that you expect a very large number of cells, so it switches to an int64 index.

I'm not sure how difficult this would be or if it would be worth it to satisfy people with long lists of product-user pairs like me.

jreback · 2019-06-28T10:51:00Z

@Rblivingstone certainly would take a PR to allow int64 here; it’s not tested and that’s why we raise

Code4SAFrankie · 2019-07-24T14:17:39Z

I'm getting this error on 6Gb of memory use with pivoting the movielens large ratings.csv. So I agree in this day and age we need int64.

Code4SAFrankie · 2019-07-24T14:23:14Z

Turns out I get the same error even after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64), although the error definitely looks like it occurs there.

buddingPop · 2019-08-01T13:54:24Z

I get "ValueError: negative dimensions are not allowed" after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64).

Any chance we have a different workaround? I only have 6000 columns...

meganson · 2019-08-12T01:51:27Z

I have the same issue. My data uses 9.9 GB of memory.

df.pivot_table(index='uno', columns=['programid'], values='avg_time')

ValueError: Unstacked DataFrame is too big, causing int32 overflow

pandas 0.25.0

TomAugspurger · 2019-08-12T18:36:50Z

@buddingPop @Code4SAFrankie @meganson are any of you interested in working on this?

We also need a reproducible example, if anyone can provide that.

subhrajitb · 2019-08-17T07:26:45Z

You can download the ratings.csv file from
https://www.kaggle.com/grouplens/movielens-20m-dataset

Then create a pivottable as below to reproduce the problem:
pivotTable = ratings.pivot_table(index=['userId'],columns=['movieId'],values='rating')

TomAugspurger · 2019-08-19T16:52:45Z

We won't be able to include downloading that dataset in a unit test. I assume this can be reproduced with the right style of random data? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

vengertsev · 2019-08-26T14:48:30Z

Same issue here.
df.set_index(['Element', 'usix']).unstack().reset_index(col_level=1)
ValueError: Unstacked DataFrame is too big, causing int32 overflow
pandas: 0.23.4

@TomAugspurger Let me try to generate similar data with random data.

vengertsev · 2019-08-26T15:45:26Z

@TomAugspurger

This is a very naive test, but it seems to reproduce the error for me.
`import random
import string
import pandas as pd

row_cnt = 4000000
c1_unique_val_cnt = 1500000
c2_unique_val_cnt = 1600

c1_set = [ i for i in range(c1_unique_val_cnt)]
c1 = [ random.choice(c1_set) for i in range(row_cnt)]
c2_set = [ i for i in range(c2_unique_val_cnt)]
c2 = [ random.choice(c2_set) for i in range(row_cnt)]

df_test = pd.DataFrame({'c1':c1, 'c2': c2 })
t = df_test.set_index(['c1', 'c2']).unstack()`

Produces an error: "ValueError: Unstacked DataFrame is too big, causing int32 overflow"

TomAugspurger · 2019-08-26T15:51:07Z

Thanks @vengertsev. A similar example that's a bit faster, since it vectorizes the column generation

In [62]: df = pd.DataFrame(np.random.randint(low=0, high=1500000, size=(4000000, 2)), columns=['a', 'b'])

In [63]: df.set_index(['a', 'b']).unstack()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-2ee2ef4b1279> in <module>
----> 1 df.set_index(['a', 'b']).unstack()

~/sandbox/pandas/pandas/core/frame.py in unstack(self, level, fill_value)
   6311         from pandas.core.reshape.reshape import unstack
   6312
-> 6313         return unstack(self, level, fill_value)
   6314
   6315     _shared_docs[

~/sandbox/pandas/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    408     if isinstance(obj, DataFrame):
    409         if isinstance(obj.index, MultiIndex):
--> 410             return _unstack_frame(obj, level, fill_value=fill_value)
    411         else:
    412             return obj.T.stack(dropna=False)

~/sandbox/pandas/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    438             value_columns=obj.columns,
    439             fill_value=fill_value,
--> 440             constructor=obj._constructor,
    441         )
    442         return unstacker.get_result()

~/sandbox/pandas/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    135
    136         if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 137             raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
    138
    139         self._make_sorted_values_labels()

ValueError: Unstacked DataFrame is too big, causing int32 overflow
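For anyone who wants to know in advance whether a frame will hit this error, a rough pre-check is to multiply the number of distinct values in the two key columns. This is only a sketch: `estimated_unstack_cells` is a hypothetical helper, not a pandas function, and it approximates the internal `num_rows * num_columns` product rather than reproducing it exactly (for example, `nunique` ignores missing keys).

```python
import numpy as np
import pandas as pd

INT32_MAX = np.iinfo(np.int32).max  # 2**31 - 1


def estimated_unstack_cells(df: pd.DataFrame, index_col: str, column_col: str) -> int:
    """Rough cell count of df.set_index([index_col, column_col]).unstack()."""
    n_rows = df[index_col].nunique()
    n_cols = df[column_col].nunique()
    # Python ints have arbitrary precision, so this product cannot overflow.
    return int(n_rows) * int(n_cols)


# Tiny stand-in for the 4,000,000-row frame in the example above.
df = pd.DataFrame({"a": [0, 0, 1, 2], "b": [5, 6, 5, 7]})
cells = estimated_unstack_cells(df, "a", "b")
print(cells, cells > INT32_MAX)
```

If the estimate is above `INT32_MAX`, the unstack is a candidate for this error on the affected pandas versions.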

vengertsev · 2019-08-26T15:56:38Z

@TomAugspurger, Nice, thanks!

TomAugspurger · 2019-08-26T15:59:53Z

Anyone interested in working on this now that we have a reproducible example?

meganson · 2019-08-28T06:42:27Z

@TomAugspurger

ValueError Traceback (most recent call last)
in
----> 1 ratings = pd.read_csv('/home/ml/rating_stg.csv').groupby(['uno', 'program_title'])['view_percent'].mean().unstack()
2 ratings.head()

~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
3299 """
3300 from pandas.core.reshape.reshape import unstack
-> 3301 return unstack(self, level, fill_value)
3302
3303 # ----------------------------------------------------------------------

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
394 unstacker = _Unstacker(obj.values, obj.index, level=level,
395 fill_value=fill_value,
--> 396 constructor=obj._constructor_expanddim)
397 return unstacker.get_result()
398

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
122
123 if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 124 raise ValueError('Unstacked DataFrame is too big, '
125 'causing int32 overflow')
126

ValueError: Unstacked DataFrame is too big, causing int32 overflow

imanekho · 2019-09-18T12:56:53Z

Hello,

I am facing the same issue trying to create a pivot table. Can someone help?

Thank you

TomAugspurger · 2019-09-18T13:04:11Z

We need a PR to fix it in pandas.


imanekho · 2019-09-18T13:16:35Z

Is this the only solution, do you think? Is there any alternative?
I am working on my final project and I have a large dataset; it is too late to change it.
Thank you

ericbrown · 2019-09-18T13:21:14Z

> this is the only solution you think? any alternative solution may be
> i am working on my final project and i have a large dataset too late to change it
> thank you

You could attempt to downgrade your pandas to 0.21 (See here #26314 (comment)). I ran into a similar problem recently and downgraded and it worked on one of my datasets. It isn't the best solution but it might get you moving.

imanekho · 2019-09-18T13:23:06Z

thank you

imanekho · 2019-09-18T17:13:03Z

this is exactly the command i used

atriantafybbc · 2019-10-22T11:43:58Z

> this is exactly the command i used

try with Python 3.6

dumbledad · 2019-11-11T12:15:05Z

Any news on this, I'm failing to roll back to 0.21. My next course of action is to rewrite vectorised function as a loop but I'd rather not.

jimhavrilla · 2020-01-23T23:42:03Z

I got this error in ver. 0.25.3, any news on it?

TomAugspurger · 2020-01-24T01:47:01Z

Still open, still looking for a volunteer to fix it.

pavany666666 · 2020-06-04T15:53:07Z

Any update on this?
Does downgrading the Pandas help??

jreback · 2020-06-04T18:08:21Z

pandas is an all volunteer project; PRs would be accepted from the community for this or any other issues

KaonToPion · 2020-06-10T14:04:21Z

I have been stuck with this some days and finally I have fixed it by changing int32 to int64. Would it be all right to send a pull request with it?

TomAugspurger · 2020-06-10T14:51:47Z

@KaonToPion please do.

pavany666666 · 2020-06-11T14:32:46Z

@KaonToPion Please let us know once you checkin... I would like to have this fix ASAP... Thanks a lot!

KaonToPion · 2020-06-15T13:44:07Z

I am sorry, I am having issues installing the pandas development environment, which is why it's taking so long

TomAugspurger · 2020-06-15T13:50:43Z

https://pandas.pydata.org/docs/development/contributing.html, in case it helps.

storopoli · 2020-08-07T19:35:22Z

@KaonToPion also waiting on this fix. If you could give it another try, that would be great. Thank you!

ghuls · 2021-02-15T19:58:37Z

The current code to check for int32 overflow is wrong anyway, as it will not catch all overflows (it only catches those whose wrapped result is zero or negative):

pandas/pandas/core/reshape/reshape.py

Lines 115 to 119 in a9cacd9

    
# GH20601: This forces an overflow if the number of cells is too high.
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
if num_rows > 0 and num_columns > 0 and num_cells <= 0:
    raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")

Correct check:

# GH20601: Catch int32 overflow if the number of cells is too high. 
if num_rows > 0 and num_columns > 0 and num_rows * num_columns > 2**31 - 1: 
    raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")

Examples:

In [1]: import numpy as np

In [2]: def test_overflow(num_rows, num_columns):
     ...:     num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
     ...:     if num_rows > 0 and num_columns > 0 and num_cells <= 0:
     ...:         print("Unstacked DataFrame is too big, causing int32 overflow (np.multiply)")
     ...:     if num_rows > 0 and num_columns > 0 and num_rows * num_columns > 2**31 - 1:
     ...:         print("Unstacked DataFrame is too big, causing int32 overflow (python)")
     ...: 
In [3]: test_overflow(400_000, 1_000)

In [4]: test_overflow(4_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (np.multiply)
Unstacked DataFrame is too big, causing int32 overflow (python)

In [5]: test_overflow(40_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (python)

In [6]: test_overflow(400_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (python)
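The reason the last two calls slip past the `np.multiply` check can be worked out by hand: a product that overflows int32 is reduced modulo 2**32, and the result may well land back in the positive range. A small sketch of that arithmetic in plain Python follows; `wrap_int32` is a hypothetical helper that emulates two's-complement wrap-around and is not part of NumPy or pandas.

```python
def wrap_int32(n: int) -> int:
    """Emulate what an int32 would hold after overflowing to the exact value n."""
    return (n + 2**31) % 2**32 - 2**31


INT32_MAX = 2**31 - 1

# Same row counts as in the examples above, always with 1_000 columns.
for num_rows in (400_000, 4_000_000, 40_000_000, 400_000_000):
    exact = num_rows * 1_000
    wrapped = wrap_int32(exact)
    print(
        f"rows={num_rows:>11,}  exact={exact:>15,}  wrapped={wrapped:>14,}  "
        f"overflowed={exact > INT32_MAX}  caught_by_sign_check={wrapped <= 0}"
    )
```

Only the 4,000,000-row case wraps to a negative number; the two larger cases wrap to positive values, so a sign-based check cannot see them.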

kamal1262 · 2021-02-21T10:50:36Z

Some suggestions were to downgrade to pandas==0.21, which is not really a feasible solution!

I faced the same issue and needed an urgent fix for the unexpected int32 overflow. One of our recommendation models was running in production, and at some point the user base grew to more than 7 million records with around 21k items.

So, to solve the issue, I chunked the dataset as mentioned by @igorkf, created the pivot table for each chunk using unstack, and appended the results gradually.

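import pandas as pd
from tqdm import tqdm

chunk_size = 50000
# Chunk start offsets, plus the frame length so the final partial chunk is kept
chunks = list(range(0, df.shape[0], chunk_size)) + [df.shape[0]]

for i in range(len(chunks) - 1):
    print(chunks[i], chunks[i + 1] - 1)
0 49999
50000 99999
100000 149999
150000 199999
200000 249999
.........................

pieces = []

# Note: if one user's rows are split across two chunks, that user ends up
# with one row per chunk in the result.
for i in tqdm(range(len(chunks) - 1)):
    # iloc excludes the stop position, so subtracting 1 would drop a row per chunk
    chunk_df = df.iloc[chunks[i]:chunks[i + 1]]
    interactions = (
        chunk_df.groupby([user_col, item_col])[rating_col]
        .sum()
        .unstack()
        .reset_index()
        .fillna(0)
        .set_index(user_col)
    )
    print(interactions.shape)
    pieces.append(interactions)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# Chunks can contain different items, so fill the gaps left by the concat.
pivot_df = pd.concat(pieces, sort=False).fillna(0)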
And then I had to make a sparse matrix as input to the LightFM recommendation model (which runs a matrix-factorization algorithm). You can use this for any use case where unstacking is required. The following code converts the result to a sparse matrix:

from scipy import sparse
import numpy as np
sparse_matrix = sparse.csr_matrix(pivot_df.to_numpy())

NB: Pandas has pivot_table function which can be used for unstacking if your data is small. For my case, pivot_table was really slow
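If the sparse matrix is the end goal, one possible alternative (not something proposed earlier in this thread) is to skip the dense pivot entirely and build the matrix from the long-format rows. The sketch below assumes the same kind of user/item/rating columns as above; `long_to_sparse` and the column names in the demo are hypothetical. Duplicate (user, item) pairs are summed when the COO matrix is converted to CSR, which mirrors the `groupby(...).sum()` step.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def long_to_sparse(df, user_col, item_col, rating_col):
    """Build a users x items CSR matrix without unstacking."""
    # factorize maps each distinct key to an integer position 0..n-1
    user_codes, users = pd.factorize(df[user_col], sort=True)
    item_codes, items = pd.factorize(df[item_col], sort=True)
    matrix = sparse.coo_matrix(
        (df[rating_col].to_numpy(dtype=np.float64), (user_codes, item_codes)),
        shape=(len(users), len(items)),
    ).tocsr()  # the conversion sums duplicate (user, item) entries
    return matrix, users, items


# Small illustrative frame; user 1 rated item 10 twice.
demo = pd.DataFrame(
    {"user": [1, 1, 2, 2], "item": [10, 10, 20, 10], "rating": [1.0, 2.0, 3.0, 4.0]}
)
matrix, users, items = long_to_sparse(demo, "user", "item", "rating")
print(matrix.toarray())
print(list(users), list(items))
```

The returned `users` and `items` give the row and column labels, so positions in the matrix can be mapped back to the original identifiers.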

lucashadin · 2021-06-14T20:08:38Z

Did this ever get patched? :)

pavany666666 · 2021-06-14T22:09:42Z

Nope, I don't think so. I moved to alternate ways of doing it.

lucashadin · 2021-06-15T09:34:09Z

> Nope, I don't think so. I moved to alternate ways of doing it.

How did you solve it? Did you use the chunking method that kamal1262 mentioned? Or another package? Any advice would be great :)

gfyoung added Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 8, 2019

jreback added this to the Contributions Welcome milestone Jun 28, 2019

jreback mentioned this issue Apr 21, 2020

BUG: Wrong expression np.multiply to validate overflow exception in pandas.core.reshape.reshape.py#L118 #33694

Closed

3 tasks

simonjayhawkins mentioned this issue Apr 21, 2020

fix bug of overflow validaion in 'reshape' #33697

Closed

5 tasks

KaonToPion mentioned this issue Jun 16, 2020

BUG fix _Unstacker int32 limit in dataframe sizes (pandas-dev#26314) #34827

Closed

5 tasks

simonjayhawkins mentioned this issue Sep 15, 2020

CI: Add stale PR action #36336

Merged

mroeschke added Error Reporting Incorrect or improved errors from pandas Bug and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Jul 2, 2021

martins0n mentioned this issue Nov 19, 2021

[BUG] M4 monthly and TSDataset.to_dataset tinkoff-ai/etna#304

Closed

1 task

mroeschke mentioned this issue Dec 28, 2021

BUG: Unstack/pivot raising ValueError on large result #45084

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Dec 28, 2021

jreback closed this as completed in #45084 Dec 28, 2021
