Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pivot / unstack on large data frame does not work int32 overflow #26314

Closed
MarkiesFredje opened this issue May 8, 2019 · 47 comments · Fixed by #45084
Closed

Pivot / unstack on large data frame does not work int32 overflow #26314

MarkiesFredje opened this issue May 8, 2019 · 47 comments · Fixed by #45084
Labels
Bug Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@MarkiesFredje
Copy link

Code Sample, a copy-pastable example if possible

predictor_purchases_p = predictor_purchases.groupby(["ag", "artikelnr"])["som"].max().unstack().fillna(0)

or

predictor_purchases_p = predictor_purchases.pivot(index="ag", columns="artikelnr", value="som")

Problem description

I'm working on rather large data (>100GB in memory) on a beefy server (3TB ram)
When refactoring my code from pandas 0.21 to latest version, the pivot / unstack now returns an exception.

Unstacked DataFrame is too big, causing int32 overflow

I was able to eliminate the problem by changing the reshape.py:
modify line 121from dtype np.int32 to dtype np.int64:
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)

Expected Output

Not being limited by int32 dims when reshaping a data frame.
This feels like a restriction which should not be there.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.0
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jreback
Copy link
Contributor

jreback commented May 8, 2019

try on master this was recently patched

@MarkiesFredje
Copy link
Author

Corporate server, so long story short, I can only work with conda releases.
I'm not able to pull anything off github and make a build.

@jreback
Copy link
Contributor

jreback commented May 8, 2019

https://github.com/pandas-dev/pandas/pull/23512/files

if you are trying to do this approach your problem differently

@MarkiesFredje
Copy link
Author

The point I'm trying to raise is: why is the number of cells limited to max value of np.int32?
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

This creates constraints when working with large data frames.
Basically, I'm proposing to change this this to np.int64.

@jreback
Copy link
Contributor

jreback commented May 8, 2019

so you have 2B columns?

@MarkiesFredje
Copy link
Author

My current dataset has
RangeIndex: 2584251 entries
Columns: 4539 entries

num_cells = 2584251 * 4539 = 11.729.915.289 cells

So, I have 11.7 B cells

Putting a int32 contstraint on number of cells is way to small for my datasets.
I'm quite sure this causes problems for other users.

@MarkiesFredje
Copy link
Author

Btw, in pandas 0.21 I could execute these unstacks without problems. Size was not an issue.
Upgrading to pandas 0.24.2 removes this ability.

@gfyoung gfyoung added Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 8, 2019
@Rblivingstone
Copy link

Yeah, I'm bumping into this as well. I'm trying to make 2.87 billion cells in my unstack. I saw issue #20601 and that basically throwing a more informative error made more sense than increasing to from int32 to int64. I kind of agree with that assessment. It would be nice to have an option in unstack that would let you tell it that you expect to have a very large number of cells and switch to int64 index.

I'm not sure how difficult this would be or if it would be worth it to satisfy people with long lists of product-user pairs like me.

@jreback jreback added this to the Contributions Welcome milestone Jun 28, 2019
@jreback
Copy link
Contributor

jreback commented Jun 28, 2019

@Rblivingstone certainly would take a PR to allow int64 here; it’s not tested and that’s why we raise

@Code4SAFrankie
Copy link

I'm getting this error on 6Gb of memory use with pivoting the movielens large ratings.csv. So I agree in this day and age we need int64.

@Code4SAFrankie
Copy link

Turns out I get the same error even after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64), although the error definitely looks like it occurs there.

@buddingPop
Copy link

I get "ValueError: negative dimensions are not allowed" after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64).

Any chance we have a different workaround? I only have 6000 columns...

@meganson
Copy link

meganson commented Aug 12, 2019

I have same issue. I use 9.9 BG memory data.

df.pivot_table(index='uno', columns=['programid'], values='avg_time')

ValueError: Unstacked DataFrame is too big, causing int32 overflow

pandas 0.25.0

@TomAugspurger
Copy link
Contributor

@buddingPop @Code4SAFrankie @meganson are any of you interested in working on this?

We also need a reproducible example, if anyone can provide that.

@subhrajitb
Copy link

You can download the ratings.csv file from
https://www.kaggle.com/grouplens/movielens-20m-dataset

Then create a pivottable as below to reproduce the problem:
pivotTable = ratings.pivot_table(index=['userId'],columns=['movieId'],values='rating')

@TomAugspurger
Copy link
Contributor

We won't be able to include downloading that dataset in a unit test. I assume this can be reproduced with the right style of random data? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@vengertsev
Copy link

Same issue here.
df.set_index(['Element', 'usix']).unstack().reset_index(col_level=1)
ValueError: Unstacked DataFrame is too big, causing int32 overflow
pandas: 0.23.4

@TomAugspurger Let me try to generate similar data with random data.

@vengertsev
Copy link

vengertsev commented Aug 26, 2019

@TomAugspurger

This is a very naive test, but seems to re-produced the error for me.
`import random
import string
import pandas as pd

row_cnt = 4000000
c1_unique_val_cnt = 1500000
c2_unique_val_cnt = 1600

c1_set = [ i for i in range(c1_unique_val_cnt)]
c1 = [ random.choice(c1_set) for i in range(row_cnt)]
c2_set = [ i for i in range(c2_unique_val_cnt)]
c2 = [ random.choice(c2_set) for i in range(row_cnt)]

df_test = pd.DataFrame({'c1':c1, 'c2': c2 })
t = df_test.set_index(['c1', 'c2']).unstack()`

Produces an error: "ValueError: Unstacked DataFrame is too big, causing int32 overflow"

@TomAugspurger
Copy link
Contributor

Thanks @vengertsev. A similar example that's a bit faster, since it vectorizes the column generation

In [62]: df = pd.DataFrame(np.random.randint(low=0, high=1500000, size=(4000000, 2)), columns=['a', 'b'])

In [63]: df.set_index(['a', 'b']).unstack()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-2ee2ef4b1279> in <module>
----> 1 df.set_index(['a', 'b']).unstack()

~/sandbox/pandas/pandas/core/frame.py in unstack(self, level, fill_value)
   6311         from pandas.core.reshape.reshape import unstack
   6312
-> 6313         return unstack(self, level, fill_value)
   6314
   6315     _shared_docs[

~/sandbox/pandas/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    408     if isinstance(obj, DataFrame):
    409         if isinstance(obj.index, MultiIndex):
--> 410             return _unstack_frame(obj, level, fill_value=fill_value)
    411         else:
    412             return obj.T.stack(dropna=False)

~/sandbox/pandas/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    438             value_columns=obj.columns,
    439             fill_value=fill_value,
--> 440             constructor=obj._constructor,
    441         )
    442         return unstacker.get_result()

~/sandbox/pandas/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    135
    136         if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 137             raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
    138
    139         self._make_sorted_values_labels()

ValueError: Unstacked DataFrame is too big, causing int32 overflow

@vengertsev
Copy link

@TomAugspurger, Nice, thanks!

@TomAugspurger
Copy link
Contributor

Anyone interested in working on this now that we have a reproducible example?

@meganson
Copy link

meganson commented Aug 28, 2019

@TomAugspurger


ValueError Traceback (most recent call last)
in
----> 1 ratings = pd.read_csv('/home/ml/rating_stg.csv').groupby(['uno', 'program_title'])['view_percent'].mean().unstack()
2 ratings.head()

~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
3299 """
3300 from pandas.core.reshape.reshape import unstack
-> 3301 return unstack(self, level, fill_value)
3302
3303 # ----------------------------------------------------------------------

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
394 unstacker = _Unstacker(obj.values, obj.index, level=level,
395 fill_value=fill_value,
--> 396 constructor=obj._constructor_expanddim)
397 return unstacker.get_result()
398

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in init(self, values, index, level, value_columns, fill_value, constructor)
122
123 if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 124 raise ValueError('Unstacked DataFrame is too big, '
125 'causing int32 overflow')
126

ValueError: Unstacked DataFrame is too big, causing int32 overflow

@imanekho
Copy link

Hello,

i am facing the same issue trying to create a pivot table , if someone can help !

Thank you

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 18, 2019 via email

@imanekho
Copy link

this is the only solution you think? any alternative solution may be
i am working on my final project and i have a large dataset to late to change it
thank yiu

@ericbrown
Copy link

this is the only solution you think? any alternative solution may be
i am working on my final project and i have a large dataset to late to change it
thank yiu

You could attempt to downgrade your pandas to 0.21 (See here #26314 (comment)). I ran into a similar problem recently and downgraded and it worked on one of my datasets. It isn't the best solution but it might get you moving.

@imanekho
Copy link

thank you

@imanekho
Copy link

this is exactly the command i used

@atriantafybbc
Copy link

this is exactly the command i used

try with Python 3.6

@dumbledad
Copy link

Any news on this, I'm failing to roll back to 0.21. My next course of action is to rewrite vectorised function as a loop but I'd rather not.

@jimhavrilla
Copy link

I got this error in ver. 0.25.3, any news on it?

@TomAugspurger
Copy link
Contributor

Still open, still looking for a volunteer to fix it.

@pavany666666
Copy link

Any update on this?
Does downgrading the Pandas help??

@jreback
Copy link
Contributor

jreback commented Jun 4, 2020

pandas is an all volunteer project; PRs would be accepted from the community for this or any other issues

@KaonToPion
Copy link

I have been stuck with this some days and finally I have fixed it by changing int32 to int64. Would it be all right to send a pull request with it?

@TomAugspurger
Copy link
Contributor

@KaonToPion please do.

@pavany666666
Copy link

@KaonToPion Please let us know once you checkin... I would like to have this fix ASAP... Thanks a lot!

@KaonToPion
Copy link

I am sorry, I am having issues with installing the pandas development enviroment that's why it's taking long

@TomAugspurger
Copy link
Contributor

@storopoli
Copy link

@KaonToPion also waiting on this fix. If you could please take another effort a try that would be great. Thank you!

@ghuls
Copy link

ghuls commented Feb 15, 2021

The current code to check for int32 overflow is wrong anyway as it will not catch all overflows (only when the result is still negative):

# GH20601: This forces an overflow if the number of cells is too high.
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
if num_rows > 0 and num_columns > 0 and num_cells <= 0:
raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")

Correct check:

# GH20601: Catch int32 overflow if the number of cells is too high. 
if num_rows > 0 and num_columns > 0 and num_rows * num_columns > 2**31 - 1: 
    raise ValueError("Unstacked DataFrame is too big, causing int32 overflow") 

Examples:

In [1]: import numpy as np

In [2]: def test_overflow(num_rows, num_columns):
     ...:     num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
     ...:     if num_rows > 0 and num_columns > 0 and num_cells <= 0:
     ...:         print("Unstacked DataFrame is too big, causing int32 overflow (np.multiply)")
     ...:     if num_rows > 0 and num_columns > 0 and num_rows * num_columns > 2**31 - 1:
     ...:         print("Unstacked DataFrame is too big, causing int32 overflow (python)")
     ...: 
In [3]: test_overflow(400_000, 1_000)

In [4]: test_overflow(4_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (np.multiply)
Unstacked DataFrame is too big, causing int32 overflow (python)

In [5]: test_overflow(40_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (python)

In [6]: test_overflow(400_000_000, 1_000)
Unstacked DataFrame is too big, causing int32 overflow (python)

@kamal1262
Copy link

kamal1262 commented Feb 21, 2021

Some suggestions were to downgrade to pandas==0.21 which not really a feasible solution!

I faced the same issue and needed to have an urgent fix for the unexpected int32 overflow. One of our recommendation model was running in production and at some point number of users base increased to more than 7 million records with around 21k items.

So, to solve the issue I chunked the dataset as mentioned @igorkf, create the pivot table using unstack and append it gradually.

import pandas as pd 
from tqdm import tqdm

chunk_size = 50000
chunks = [x for x in range(0, df.shape[0], chunk_size)]

for i in range(0, len(chunks) - 1):
    print(chunks[i], chunks[i + 1] - 1)
0 49999
50000 99999
100000 149999
150000 199999
200000 249990
.........................



pivot_df = pd.DataFrame()

for i in tqdm(range(0, len(chunks) - 1)):
    chunk_df = df.iloc[ chunks[i]:chunks[i + 1] - 1]
    interactions = (
    chunk_df.groupby([user_col, item_col])[rating_col]
    .sum()
    .unstack()
    .reset_index()
    .fillna(0)
    .set_index(user_col)
    )
    print (interactions.shape)
    pivot_df = pivot_df.append(interactions, sort=False) 
And then I have to make a sparse matrix as input to lightFM recommendation model (run matrix-factorization algorithm). You can use it for any use case where unstacking is required. Using the following code, converted to sparse matrix-

from scipy import sparse
import numpy as np
sparse_matrix = sparse.csr_matrix(df_new.to_numpy())

NB: Pandas has pivot_table function which can be used for unstacking if your data is small. For my case, pivot_table was really slow

@lucashadin
Copy link

Did this ever get patched? :)

@pavany666666
Copy link

pavany666666 commented Jun 14, 2021 via email

@lucashadin
Copy link

nope. I don't think so.I moved to alternate ways of doing it. On Monday, 14 June, 2021, 04:08:57 pm GMT-4, lucashadin @.***> wrote: Did this ever get patched? :) — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

How did you solve it? Did you use the chunking method that kamal1262 mentioned? Or another package? Any advice would be great :)

@mroeschke mroeschke added Error Reporting Incorrect or improved errors from pandas Bug and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Jul 2, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Dec 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet