Concat slow when frames indexed separately #19958

@jeremywhelchel

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

frames_difindex = []
frames_sameindex = []
for i in range(1000):
  # Create a single-row frame where roughly half the elements are null
  a = np.random.rand(600) - 0.5
  a[a<0] = np.nan
  
  # All rows here have the same column index
  f = pd.DataFrame({str(i): a}).T
  frames_sameindex.append(f)
  
  # Rows here will have separate column indices
  f = f.dropna(axis=1)
  frames_difindex.append(f)

%%timeit
_ = pd.concat(frames_sameindex)
# Output:
# 10 loops, best of 3: 50.6 ms per loop

%%timeit
_ = pd.concat(frames_difindex)
# Output:
# 1 loop, best of 3: 24.7 s per loop

Problem description

I've been trying to concat about 50k wide frames (1 or 2 rows by up to 600 columns each). When the columns are fairly sparse, so that each frame ends up with a different column index, the concat takes about 17 hours to complete. That's a huge difference from the nearly immediate result when the indices are all the same.

I boiled it down to the example above, which shows a roughly 500x difference (24.7 s versus 50.6 ms) for a set of just 1000 DataFrames.

I've been looking for a faster way to do this, but haven't had any luck, so I'm reporting it here.
Even https://tomaugspurger.github.io/modern-4-performance.html suggests that concat() should be the way to go.
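
One workaround that may be worth trying is to give every frame the same column index before concatenating, so that the call resembles the fast `frames_sameindex` case. The sketch below is only an assumption about what might help: it is not benchmarked here and is not confirmed as a fix for the slowdown reported above. `concat_aligned` is a hypothetical helper name, not a pandas function, and the frames are a scaled-down version of the ones in the code sample.

```python
import numpy as np
import pandas as pd


def concat_aligned(frames):
    """Concatenate frames after reindexing each to the union of all columns.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Collect every column label that appears in any frame, in sorted order.
    all_columns = sorted(set().union(*(f.columns for f in frames)))
    # Reindex each frame to the shared columns; missing entries become NaN.
    aligned = [f.reindex(columns=all_columns) for f in frames]
    # Every frame now carries an identical column index.
    return pd.concat(aligned)


# Scaled-down version of the sparse frames from the code sample.
rng = np.random.RandomState(0)
frames = []
for i in range(20):
    a = rng.rand(30) - 0.5
    a[a < 0] = np.nan
    frames.append(pd.DataFrame({str(i): a}).T.dropna(axis=1))

result = concat_aligned(frames)
```

The result should hold the same data as `pd.concat(frames)`, with columns in sorted order. Whether it is actually faster would need to be timed on the real data.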

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.96+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.14.0
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: 0.3.0
pandas_datareader: None

Labels: Performance, Reshaping
