
setting with enlargement fails for large DataFrames #10692

Closed
pekaalto opened this issue Jul 28, 2015 · 7 comments · Fixed by #11049
Labels
Bug · Indexing (Related to indexing on series/frames, not to indexes themselves) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Milestone
Next Major Release
Comments

@pekaalto

Setting with enlargement seems to fail for DataFrames with 10**6 or more rows.
10**6 seems to be the exact threshold for me: that length and anything bigger fails, anything smaller works.

Example:

import pandas as pd

# works
X = pd.DataFrame(dict(x=range(10**6 - 1)))
X.loc[len(X)] = 42

# doesn't work: raises IndexError: index out of bounds
Y = pd.DataFrame(dict(y=range(10**6)))
Y.loc[len(Y)] = 42

pd.show_versions() returns:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fi_FI

pandas: 0.16.2
nose: 1.3.6
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Jul 28, 2015

This is the same issue as in #10645.

The cases for len > 1M take a different code path and something is amiss there.

You know that you are copying the frame on every enlargement, right? This is extremely inefficient.
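A rough illustration of the cost jreback is pointing at (names here are illustrative, not from the thread): each enlargement assignment reallocates and copies the whole frame, so appending n rows one at a time is O(n²) in copied data, while building the frame once copies each row only once.

```python
import pandas as pd

# Row-by-row enlargement: every .loc assignment on a new label
# copies all existing rows before adding the new one.
df = pd.DataFrame({"x": [0]})
for i in range(1, 5):
    df.loc[len(df)] = i  # whole-frame copy each iteration

# Collecting values first and constructing the frame once
# copies each row a single time.
values = list(range(5))
df2 = pd.DataFrame({"x": values})

assert df["x"].tolist() == df2["x"].tolist()
```

Both frames end up identical; only the amount of copying differs, which is why the loop version degrades badly on large frames.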

@jreback added the Bug, Indexing, and Reshaping labels on Jul 28, 2015
@jreback added this to the Next Major Release milestone on Jul 28, 2015
@johne13

johne13 commented Jul 28, 2015

@jreback What is the recommended way to do this? This exact approach is shown in the docs and doesn't seem to be discouraged there:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#setting-with-enlargement

@jreback
Contributor

jreback commented Jul 28, 2015

what are you trying to do exactly?

@johne13

johne13 commented Jul 28, 2015

I'm not trying to do anything! Or maybe you are talking to the OP? I was actually wondering the same thing, as I would generally use append() for this sort of task.

But FWIW, these questions come up on Stack Overflow with some regularity, and anyone who finds "setting with enlargement" in the documentation will see it suggested as the way (or at least one of the ways) to do this. In this case, what the OP did is pretty much identical to the last example in the "setting with enlargement" docs.

@jreback
Contributor

jreback commented Jul 28, 2015

@johne13 sorry, was on my phone.

So enlargement is the equivalent of df.append(Series(..., name=key)), which by definition creates a copy. A doc note/warning would be nice here.
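The equivalence jreback describes can be sketched as follows. Note this sketch uses pd.concat rather than the df.append of that era, since DataFrame.append was removed in pandas 2.0; the frame and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

# Setting with enlargement on a new label...
a = df.copy()
a.loc[2] = 42

# ...behaves like concatenating a one-row frame carrying that label
# (the modern spelling of df.append(Series(..., name=key))).
row = pd.Series({"x": 42}, name=2)
b = pd.concat([df, row.to_frame().T])

assert list(a["x"]) == list(b["x"]) == [1, 2, 42]
assert list(a.index) == list(b.index) == [0, 1, 2]
```

In both cases the original data is copied into a newly allocated block, which is the cost being discussed in this thread.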

@pekaalto
Author

Actually, I didn't know that the df is copied on every enlargement.
But yeah, a warning in the docs would probably be nice to avoid misunderstanding.


About "what are you trying to do exactly?":

I just have a huge DataFrame to which I append information as it's returned from functions etc. I'll probably have to do some redesigning. I guess the way to go is either to preallocate the rows in the main DataFrame, or to collect the "stuff to be appended" in a smaller list/DataFrame first and append it all at the end.
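The collect-then-build approach described above can be sketched like this (make_row is a hypothetical stand-in for the functions returning data in the original use case):

```python
import pandas as pd

# Hypothetical producer standing in for "information returned
# from functions etc." in the comment above.
def make_row(i):
    return {"x": i, "y": i * i}

# Accumulate cheap Python dicts in a list inside the loop,
# then build the DataFrame once at the end.
rows = [make_row(i) for i in range(5)]
result = pd.DataFrame(rows)

assert result["y"].tolist() == [0, 1, 4, 9, 16]
```

Appending to a Python list is amortized O(1), so the total cost is linear instead of quadratic in the number of rows.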

@pkch

pkch commented Jun 19, 2016

@jreback commented on Jul 28, 2015

So enlargement is equivalent of df.append(Series(..., name=key)). This creates by definition a copy. A doc note/warning would be nice here.

Jeff, I guess you didn't mean it's a "copy" of the original object in the sense of creating a brand new, unrelated object. If you just meant that a lot of data had to be copied under the hood, then I understand completely.

Still, I'd guess it's quite different from append in that it manages to add a row in place. (I didn't even know that was possible...)

df = pd.DataFrame({'a':[1]})
df1 = df
df1.loc[10] = 100
assert df is df1
assert len(df) == 2

Honestly, given the performance impact, I'm truly at a loss as to why "Setting with Enlargement" was added to the DataFrame API.
