Description
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd

frames_difindex = []
frames_sameindex = []
for i in range(1000):
    # Create a single-row frame where roughly half the elements are null
    a = np.random.rand(600) - 0.5
    a[a < 0] = np.nan
    # All rows here share the same column index (0..599)
    f = pd.DataFrame({str(i): a}).T
    frames_sameindex.append(f)
    # Rows here will have differing column indices (NaN columns dropped)
    f = f.dropna(axis=1)
    frames_difindex.append(f)

%%timeit
_ = pd.concat(frames_sameindex)
# Output:
# 10 loops, best of 3: 50.6 ms per loop

%%timeit
_ = pd.concat(frames_difindex)
# Output:
# 1 loop, best of 3: 24.7 s per loop
Problem description
I've been trying to concatenate about 50k wide frames (1 or 2 rows by up to 600 columns each). I noticed that when the columns are fairly sparse, it takes about 17 hours to complete, a huge difference from the nearly immediate result when the column indices are all the same.
I boiled it down to the example above, which shows a roughly 500x slowdown for a set of just 1000 dataframes.
I've been trying to find a better way of doing this but haven't had any luck, so I'm reporting it here.
Even https://tomaugspurger.github.io/modern-4-performance.html suggests that concat() should be the way to go.
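For reference, here is a minimal sketch of a possible workaround, assuming the union of all columns can be computed up front (all_columns below is just a name for that hypothetical union): reindex each frame to a shared column index before concatenating, so concat sees identical column indices as in the fast case above. I haven't verified that this helps on the full 50k-frame set.

# Possible workaround (sketch): align every frame to one shared column index
# before calling concat, so every frame presents an identical column index.
# `all_columns` is a hypothetical precomputed union of every frame's columns.
all_columns = pd.Index(sorted(set().union(*(f.columns for f in frames_difindex))))
aligned = [f.reindex(columns=all_columns) for f in frames_difindex]
result = pd.concat(aligned)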
Output of pd.show_versions()
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.14.0
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: 0.3.0
pandas_datareader: None