PERF: DataFrame.transpose for pyarrow-backed #54224

Merged: 6 commits, Jul 24, 2023

Member

lukemanley commented Jul 22, 2023

Similar to #52836, but for pyarrow-backed frames.

import pandas as pd
import pyarrow as pa
import numpy as np

values = np.random.randn(1000, 1000)
df = pd.DataFrame(values, dtype=pd.ArrowDtype(pa.float64()))

%timeit df.T

# 346 ms ± 67.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    -> main
# 42.5 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) -> PR

Renamed the masked implementation from homogenous -> homogeneous. Both may be valid(?), but homogeneous is used elsewhere.
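
For readers unfamiliar with the idea, a homogeneous transpose can avoid building a generic 2D object array when every column shares one type. The thread does not spell out the algorithm, so the following is only a minimal stdlib sketch of one possible approach (concatenate the columns, then gather each row with a stride). The helper name `transpose_homogeneous` and the use of `array.array` are illustrative assumptions, not pandas or pyarrow APIs.

```python
from array import array


def transpose_homogeneous(columns):
    """Transpose equal-length columns that share one typecode.

    Hypothetical helper for illustration only; not part of pandas.
    """
    ncols = len(columns)
    if ncols == 0:
        return []
    nrows = len(columns[0])
    typecode = columns[0].typecode

    # Concatenate all columns into one flat, typed buffer (column-major).
    flat = array(typecode)
    for col in columns:
        if len(col) != nrows or col.typecode != typecode:
            raise ValueError("columns must share length and typecode")
        flat.extend(col)

    # Element (row r, column c) lives at flat[c * nrows + r], so a strided
    # slice starting at r collects row r across every column.
    return [flat[r::nrows] for r in range(nrows)]


cols = [array("d", [1.0, 2.0, 3.0]), array("d", [4.0, 5.0, 6.0])]
transposed = transpose_homogeneous(cols)
```

Because the values stay in a typed buffer throughout, the result keeps the input's element type, and transposing twice returns the original columns.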

lukemanley added the Performance (memory or execution speed performance) and Arrow (pyarrow functionality) labels Jul 22, 2023
lukemanley added this to the 2.1 milestone Jul 22, 2023
if isinstance(self._mgr, ArrayManager):
    arrays = self._mgr.arrays
else:
    arrays = list(self._iter_column_arrays())
Member

Can't this be done unconditionally?

Member Author

Do you mean use _iter_column_arrays in both cases? The masked array implementation just above this uses the condition as well, presumably for performance:

import pandas as pd

pd.options.mode.data_manager = "array"

df = pd.DataFrame(columns=range(100000), dtype="int64[pyarrow]")

%timeit list(df._iter_column_arrays())
# 57.9 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df._mgr.arrays
# 110 ns ± 8.79 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Member

Fair enough, thanks for taking a look

Member Author

On second thought, I moved the condition into DataFrame._iter_column_arrays to avoid repeating it elsewhere.
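
The refactor described here, pushing the fast-path check into the iterator so callers no longer repeat it, can be sketched roughly as follows. This is a simplified stdlib illustration under stated assumptions: `ArrayStore`, `BlockStore`, and `Frame` are hypothetical stand-ins, not the pandas internals.

```python
class ArrayStore:
    """Hypothetical store that already holds one list per column."""

    def __init__(self, arrays):
        self.arrays = arrays


class BlockStore:
    """Hypothetical store that keeps the data as a list of rows."""

    def __init__(self, rows):
        self.rows = rows


class Frame:
    """Hypothetical frame wrapping either kind of store."""

    def __init__(self, store):
        self._store = store

    def iter_column_arrays(self):
        if isinstance(self._store, ArrayStore):
            # Fast path: the columns already exist, so hand them out as-is.
            yield from self._store.arrays
        else:
            # Slow path: each column has to be assembled from the rows.
            rows = self._store.rows
            ncols = len(rows[0]) if rows else 0
            for i in range(ncols):
                yield [row[i] for row in rows]


stored = [[1, 2], [3, 4]]
from_arrays = list(Frame(ArrayStore(stored)).iter_column_arrays())
from_blocks = list(Frame(BlockStore([[1, 3], [2, 4]])).iter_column_arrays())
```

With the check living inside the iterator, every caller can simply write `list(frame.iter_column_arrays())` and still get the stored columns without per-column work when they are available.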

mroeschke merged commit f218fa3 into pandas-dev:main Jul 24, 2023
36 checks passed
mroeschke
Member

Thanks @lukemanley

lukemanley deleted the perf-arrow-transpose branch July 25, 2023 00:15