PERF: homogeneous concat #52685

jbrockmendel · 2023-04-15T19:04:29Z

closes PERF: concat slow, manual concat through reindexing enhances performance #50652
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Using the num_rows=100, num_cols=50_000, num_dfs=7 case from #50652, I'm getting 54.15s on main vs 1.21s on this branch.

topper-123 · 2023-04-16T06:04:56Z

I ran this through the code in #50652 and it looks good, see below.

But this only works for float64, so it won't show a difference in, for example, the ASV benchmark join_merge.ConcatDataFrames, because that particular perf test uses float32.

Can this not already be generalized to all numpy floats and ints? What happens if you concatenate e.g. int8 and float32?

EDIT: OK, I see that this depends on take_2d_axis0_{dtype1}_{dtype2}, and there is no int8 -> float32 variant specifically, and probably other conversions are missing too. But if we (1) make all the take_2d_axis0_{dtype1}_{dtype2} functions and (2) find the common dtype of the concatenated dataframe, then it should be easy? Though you mention "This will be simpler once JoinUnit.is_na behavior is deprecated", so maybe you have some follow-up that will make that unnecessary?

Testcase 1
NUM_ROWS: 100, NUM_COLS: 1000, NUM_DFS: 3
Pandas: 0.01
Manual: 0.01
True
...
Testcase 7
NUM_ROWS: 100, NUM_COLS: 10000, NUM_DFS: 7
Pandas: 0.14
Manual: 0.19
True
Testcase 8
NUM_ROWS: 100, NUM_COLS: 10000, NUM_DFS: 9
Pandas: 0.23
Manual: 0.30
True
...
Testcase 20
NUM_ROWS: 200, NUM_COLS: 10000, NUM_DFS: 9
Pandas: 0.40
Manual: 0.52
True
Testcase 21
NUM_ROWS: 200, NUM_COLS: 50000, NUM_DFS: 3
Pandas: 0.40
Manual: 0.43
True
Testcase 22
NUM_ROWS: 200, NUM_COLS: 50000, NUM_DFS: 5
Pandas: 0.91
Manual: 1.98
True
Testcase 23
NUM_ROWS: 200, NUM_COLS: 50000, NUM_DFS: 7
Pandas: 2.34
Manual: 7.30
True
Testcase 24
NUM_ROWS: 200, NUM_COLS: 50000, NUM_DFS: 9
Pandas: 10.25
Manual: 13.73
True
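For readers without #50652 at hand, the comparison behind the "Pandas" / "Manual" / "True" lines can be sketched roughly as follows. This is a reconstruction under assumptions, not the script from the issue: the helper names `make_frames` and `manual_concat`, the 80% column overlap, and the small sizes are all illustrative.

```python
import time

import numpy as np
import pandas as pd


def make_frames(num_rows, num_cols, num_dfs, seed=0):
    """Build float64 frames whose columns are random subsets of one label pool."""
    rng = np.random.default_rng(seed)
    frames = []
    for _ in range(num_dfs):
        # Hypothetical choice: each frame carries 80% of the available labels.
        cols = rng.choice(num_cols, size=int(num_cols * 0.8), replace=False)
        data = rng.random((num_rows, len(cols)))
        frames.append(pd.DataFrame(data, columns=cols))
    return frames


def manual_concat(frames):
    """Reindex every frame to the union of columns, then stack the raw arrays."""
    columns = frames[0].columns
    for frame in frames[1:]:
        columns = columns.union(frame.columns, sort=False)
    values = np.concatenate([f.reindex(columns=columns).to_numpy() for f in frames])
    return pd.DataFrame(values, columns=columns)


frames = make_frames(num_rows=100, num_cols=1_000, num_dfs=3)

start = time.perf_counter()
result_pandas = pd.concat(frames, ignore_index=True)
print(f"Pandas: {time.perf_counter() - start:.2f}")

start = time.perf_counter()
result_manual = manual_concat(frames)
print(f"Manual: {time.perf_counter() - start:.2f}")

# Compare after sorting columns so the check does not depend on column order.
print(result_pandas.sort_index(axis=1).equals(result_manual.sort_index(axis=1)))
```

Raising `num_cols` and `num_dfs` toward the values in the test cases above should make any gap between the two timings more visible.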

jbrockmendel · 2023-04-16T17:31:03Z

It would be easy to extend this to float32; others would take more effort.

topper-123 · 2023-04-16T17:55:48Z

What if we use pd.core.array_algos.take._take_2d_axis0_dict to match up the different dtypes to the desired common dtype?

jbrockmendel · 2023-04-16T19:23:28Z

The trouble is determining what the desired common dtype is.

topper-123 · 2023-04-16T20:08:47Z

Yes, I can see it now: if the columns are not all the same, it gets very complicated for ints.

I think this is good, but IMO we should definitely also get this performance boost for float32.
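To give a feel for why picking a common dtype is harder for ints than for floats, here is a small sketch using NumPy's own promotion rules via `np.result_type`. It only shows what NumPy does for pairs of dtypes; the thread does not say that pandas' concat would choose the same result in every case, and missing columns (which need a NaN-capable dtype) complicate things further.

```python
import numpy as np

pairs = [
    (np.float32, np.float64),  # -> float64
    (np.int8, np.float32),     # -> float32 (int8 fits exactly in float32)
    (np.int32, np.float32),    # -> float64 (int32 does not fit exactly in float32)
    (np.int64, np.uint64),     # -> float64 (no integer dtype covers both ranges)
]

for left, right in pairs:
    common = np.result_type(left, right)
    print(f"{np.dtype(left).name} + {np.dtype(right).name} -> {common.name}")
```

So the common dtype depends on the exact combination of integer widths and signedness, whereas for floats it is simply the widest one present.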

mroeschke · 2023-04-17T16:51:09Z

Also would be good to have a whatsnew note

topper-123 · 2023-04-17T18:31:01Z

doc/source/whatsnew/v2.1.0.rst

@@ -91,6 +91,7 @@ Other enhancements
 - Improved error message when creating a DataFrame with empty data (0 rows), no index and an incorrect number of columns. (:issue:`52084`)
 - Let :meth:`DataFrame.to_feather` accept a non-default :class:`Index` and non-string column names (:issue:`51787`)
 - Performance improvement in :func:`read_csv` (:issue:`52632`) with ``engine="c"``
+- Performance improvement in :func:`concat` with homogeneous dtypes (:issue:`52685`)


homogeneous dtypes -> homogeneous float dtypes.

topper-123 · 2023-04-17T18:33:20Z

pandas/core/internals/concat.py

@@ -200,6 +202,21 @@ def concatenate_managers(
    if concat_axis == 0:
        return _concat_managers_axis0(mgrs_indexers, axes, copy)

+    if len(mgrs_indexers) > 0 and mgrs_indexers[0][0].nblocks > 0:
+        first_dtype = mgrs_indexers[0][0].blocks[0].dtype


Any chance to do an upcast if we have one/some float64 and one/some float32? That would generalize this fastpath to cover all floats and not make a distinction between float32/float64.

jbrockmendel · 2023-04-17

That would change behavior in some cases.
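One way to probe the kind of behavior that a blanket float32 -> float64 upcast could affect is to concatenate a float32 frame and a float64 frame with disjoint columns and inspect the per-column dtypes. This is only a guess at an illustrative case, not necessarily the one the reply had in mind; the frames `left` and `right` are made up for the example.

```python
import numpy as np
import pandas as pd

# Two single-dtype frames with different float widths and no shared columns.
left = pd.DataFrame({"a": np.array([1.0, 2.0], dtype=np.float32)})
right = pd.DataFrame({"b": np.array([3.0, 4.0], dtype=np.float64)})

result = pd.concat([left, right], ignore_index=True)

# Each column is NaN-filled where its source frame had no rows.
print(result.dtypes)
```

If column `a` is reported as float32 here, a fastpath that upcast everything to a single float64 block would widen it, which would be a user-visible change.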

jbrockmendel · 2023-04-19T01:13:32Z

whatsnew added, float32 handled, + green

mroeschke · 2023-04-19T15:47:39Z

Thanks @jbrockmendel

topper-123 · 2023-04-26T11:52:15Z

Hey, this PR caused a rather big slowdown on C-ordered ndarrays; see also the discussion in #52786:

>>> frame_c = pd.DataFrame(np.zeros((10000, 200), dtype=np.float32, order="C"))
>>> %timeit pd.concat([frame_c] * 20, axis=0, ignore_index=False)
45.1 ms ± 166 µs per loop  # after this PR
13.2 ms ± 126 µs per loop  # before this PR
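Outside IPython, the same measurement can be approximated with the standard `timeit` module. The sketch below also times an F-ordered frame for comparison; that second case and the repeat counts are additions for illustration and are not part of the original report.

```python
import timeit

import numpy as np
import pandas as pd

frame_c = pd.DataFrame(np.zeros((10_000, 200), dtype=np.float32, order="C"))
frame_f = pd.DataFrame(np.zeros((10_000, 200), dtype=np.float32, order="F"))

for name, frame in [("C", frame_c), ("F", frame_f)]:
    frames = [frame] * 20
    # Take the best of a few single runs to reduce noise.
    best = min(
        timeit.repeat(
            lambda: pd.concat(frames, axis=0, ignore_index=False),
            number=1,
            repeat=5,
        )
    )
    print(f"order={name}: {best * 1e3:.1f} ms")

result = pd.concat([frame_c] * 20, axis=0, ignore_index=False)
```

Running this on builds before and after the PR would be needed to reproduce the comparison quoted above; absolute numbers will vary by machine.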

jbrockmendel · 2023-04-26T15:56:37Z

I’m out of town this week, will take a look next week


jbrockmendel · 2023-06-20T21:26:38Z

@phofl I have what I think should address this, but since I can't reproduce the slowdown, I also can't check whether it works. Can you try adding the following at the top of _concat_homogeneous_fastpath:

    if all(not indexers for _, indexers in mgrs_indexers):
        # https://github.com/pandas-dev/pandas/pull/52685#issuecomment-1523287739
        arrs = [mgr.blocks[0].values.T for mgr, _ in mgrs_indexers]
        arr = np.concatenate(arrs).T
        bp = libinternals.BlockPlacement(slice(shape[0]))
        nb = new_block_2d(arr, bp)
        return nb

phofl · 2023-06-20T21:28:49Z

Yep that works! Thx for looking into it.

PERF: homogeneous concat


f4929f5

topper-123 added Performance Reshaping labels Apr 16, 2023

topper-123 added this to the 2.1 milestone Apr 16, 2023

Merge branch 'main' into perf-join_unit-2


1d27d18

Handle float32, whatsnew

7422d83

topper-123 reviewed Apr 17, 2023


more specific whatsnew

888b5dc

mroeschke approved these changes Apr 19, 2023


mroeschke merged commit 4fef063 into pandas-dev:main Apr 19, 2023

jbrockmendel deleted the perf-join_unit-2 branch April 19, 2023 16:25

topper-123 mentioned this pull request Apr 26, 2023

CLN: some cleanups in Series.apply & related #52786

Merged
