setting with enlargement fails for large DataFrames #10692

Closed
pekaalto opened this Issue Jul 28, 2015 · 7 comments


Setting with enlargement seems to fail for DataFrames with 10**6 rows or more.
10**6 seems to be the exact threshold for me: that length and anything bigger fails; anything smaller works.

Example:

import pandas as pd

# works
X = pd.DataFrame(dict(x=range(10**6 - 1)))
X.loc[len(X)] = 42

# doesn't work
Y = pd.DataFrame(dict(y=range(10**6)))
Y.loc[len(Y)] = 42
# raises IndexError: index out of bounds

pd.show_versions() returns:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fi_FI

pandas: 0.16.2
nose: 1.3.6
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
Contributor

jreback commented Jul 28, 2015

This is the same issue as #10645.

The cases for len > 1M take a different code path, and something is amiss there.

You know that you are copying the frame on every enlargement, right? This is extremely inefficient.

jreback added this to the Next Major Release milestone Jul 28, 2015

johne13 commented Jul 28, 2015

@jreback What is the recommended way to do this? This exact way is mentioned in the docs and doesn't seem to be discouraged there:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#setting-with-enlargement

Contributor

jreback commented Jul 28, 2015

what are you trying to do exactly?

johne13 commented Jul 28, 2015

I'm not trying to do anything! Or maybe you are talking to the OP? I was actually wondering the same thing, as I would generally use append() for this.

But FWIW, these questions come up on Stack Overflow with some regularity, and for anyone who finds "setting with enlargement" in the documentation, it is presented as the way to do it (or one of the ways, anyway). And in this case what the OP did was pretty much identical to the last example in the "setting with enlargement" doc.

Contributor

jreback commented Jul 28, 2015

@johne13 sorry, was on my phone.

So enlargement is equivalent to df.append(Series(..., name=key)), which by definition creates a copy. A doc note/warning would be nice here.
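A hedged illustration of that equivalence (my sketch, not from the thread; DataFrame.append was removed in pandas 2.0, so pd.concat stands in for it here):

```python
import pandas as pd

# Sketch: setting with enlargement adds one row, producing the same
# result as concatenating a one-row frame. Either way a new, larger
# block is allocated, so every enlargement copies the existing data.
df = pd.DataFrame({"x": [1, 2, 3]})
df.loc[len(df)] = 42                      # setting with enlargement

via_concat = pd.concat([pd.DataFrame({"x": [1, 2, 3]}),
                        pd.DataFrame({"x": [42]}, index=[3])])

assert df["x"].tolist() == via_concat["x"].tolist() == [1, 2, 3, 42]
```

This is also why enlarging in a loop is quadratic: each assignment re-copies everything written so far.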

pekaalto commented Jul 28, 2015

Actually I didn't know that the df is copied with every enlargement anyway.
But yeah, some warning in the docs would probably be nice to avoid misunderstanding.


About "what are you trying to do exactly?":

I just have a huge DataFrame where I append some information when it's returned from functions etc. I probably have to do some redesigning. I guess the way to go is either to preallocate the rows in the main DataFrame, or to collect the "stuff to be appended" in some smaller list/df first and then append it all at the end.
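The collect-then-build redesign described above can be sketched like this (the compute_row helper is hypothetical, standing in for the functions whose results get appended):

```python
import pandas as pd

# Hypothetical compute_row stands in for the functions whose results
# get appended; collect the rows in a cheap Python list, then build
# the DataFrame in one shot instead of enlarging it row by row.
def compute_row(i):
    return {"x": i, "y": i * 2}

rows = [compute_row(i) for i in range(5)]   # O(1) amortized list appends
result = pd.DataFrame(rows)                 # one allocation at the end

assert result["y"].tolist() == [0, 2, 4, 6, 8]
```

With this pattern the existing data is never re-copied per row, so total cost stays linear in the number of rows.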

jreback modified the milestone: 0.17.0, Next Major Release Sep 10, 2015

jreback closed this in #11049 Sep 10, 2015

pkch commented Jun 19, 2016 (edited)

@jreback commented on Jul 28, 2015

So enlargement is equivalent of df.append(Series(..., name=key)). This creates by definition a copy. A doc note/warning would be nice here.

Jeff, I guess you didn't mean it's a "copy" of the original object in the sense of creating a brand new, unrelated object. If you just meant that a lot of data has to be copied under the hood, then I understand completely.

Still, I'd guess it's quite different from append in that it manages to add the row in place (I didn't even know that was possible...):

df = pd.DataFrame({'a':[1]})
df1 = df
df1.loc[10] = 100
assert df is df1
assert len(df) == 2

Honestly, given the performance impact, I'm truly at a loss as to why "Setting with Enlargement" was added to the DataFrame API.
