
PERF: improve performance of infer_dtype #51054

Merged: 4 commits into pandas-dev:main on Feb 9, 2023

Conversation

@topper-123 (Contributor) commented Jan 29, 2023

Improves performance of api.types.infer_dtype:

>>> import numpy as np
>>> import pandas as pd
>>> from pandas.api.types import infer_dtype
>>>
>>> base_arr = np.arange(10_000_000)
>>> x = np.array(base_arr)
>>> %timeit infer_dtype(x)
596 ns ± 128 ns per loop  # main
146 ns ± 8.73 ns per loop  # this PR
>>> x = pd.Series(base_arr)
>>> %timeit infer_dtype(x)
6.83 µs ± 33.7 ns per loop  # main
583 ns ± 5.4 ns per loop  # this PR
>>> x = pd.Series(base_arr, dtype="Int32")
>>> %timeit infer_dtype(x)
3.92 µs ± 9.9 ns per loop  # main
816 ns ± 4.75 ns per loop  # this PR
>>> x = pd.array(base_arr)
>>> %timeit infer_dtype(x)
310 ns ± 0.704 ns per loop  # main
582 ns ± 4.73 ns per loop  # this PR, a bit slower
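For context on what is being benchmarked: `infer_dtype` returns a string naming the inferred type of the values. A quick sketch of typical outputs (assuming a reasonably recent pandas):

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# infer_dtype returns a string label describing the values' type.
print(infer_dtype(np.array([1, 2, 3])))    # "integer"
print(infer_dtype(pd.Series([1.0, 2.0])))  # "floating"
print(infer_dtype(["a", "b"]))             # "string"
print(infer_dtype([1, "a"]))               # "mixed-integer"
```

The fast cases in the benchmarks above are exactly those where the answer can be read off the input's dtype without scanning the 10-million-element array.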

@@ -1475,11 +1482,17 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
bint seen_val = False
flatiter it

if not util.is_array(value):
if isinstance(value, (ABCSeries, ABCExtensionArray, ABCPandasArray)):
Member: Any way to do this without relying on the ABCFoos?

@topper-123 (Author):

I agree it's not ideal, but it was the best solution I could find that keeps the performance:

  • Importing Series, Index etc. into the global namespace is not possible because of circular import issues.
  • Importing Series, Index etc. inside infer_dtype is possible, but costs about 300 ns per imported class on every call, which adds up to something very noticeable.
  • I thought of checking class names as a guard before importing inside the function, but that causes problems with subclasses.

The advantage of importing ABCFoo into the global namespace is that it's a one-time import, so there is no per-call cost and it's a nice speed boost. But I also see the problem of importing from Python-land.

IMO the speedup is worth this, but I can also see it's not ideal, so no biggie if it doesn't get accepted.
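The reason the ABCFoo classes dodge the circular import is that their `isinstance` check matches a class-level tag rather than the concrete class. A minimal sketch of the pattern (names here are illustrative, not pandas' actual implementation):

```python
class _ABCMeta(type):
    """Metaclass whose isinstance() check matches a `_typ` tag instead of
    the concrete class, so the concrete class never needs to be imported."""
    def __instancecheck__(cls, obj):
        return getattr(type(obj), "_typ", None) in cls._allowed

class ABCSeriesLike(metaclass=_ABCMeta):
    _allowed = {"series"}

# Elsewhere (in a module we never import here), the "real" class
# only needs to carry the matching tag:
class Series:
    _typ = "series"

assert isinstance(Series(), ABCSeriesLike)
assert not isinstance(object(), ABCSeriesLike)
```

The check is a plain attribute lookup plus a set membership test, so it stays cheap even on a hot path.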

Member: Definitely not suggesting using the non-ABC versions. My thought is that, without adding this here, we could do the `_try_infer_map(value.dtype)` check in the `elif hasattr(value, "dtype")` block before the "Unwrap Series/Index" bit.
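The suggested ordering can be sketched like this (the map and function below are illustrative stand-ins for pandas internals such as `_try_infer_map`, not the real implementation):

```python
import numpy as np

# Hypothetical simplified dtype -> inferred-type map; the real map covers
# many more dtypes and lives in pandas/_libs/lib.pyx.
_DTYPE_NAME_MAP = {
    np.dtype("int64"): "integer",
    np.dtype("float64"): "floating",
    np.dtype("bool"): "boolean",
}

def infer_dtype_sketch(value):
    # Fast path: answer from the dtype alone, before unwrapping a
    # Series/Index into its underlying ndarray.
    dtype = getattr(value, "dtype", None)
    if dtype in _DTYPE_NAME_MAP:
        return _DTYPE_NAME_MAP[dtype]
    # Slow path (elided): unwrap to values and inspect element by element.
    raise NotImplementedError("fallback element inspection")

print(infer_dtype_sketch(np.arange(3, dtype="int64")))
```

The point of the ordering is that the dict lookup is O(1) in the dtype, while unwrapping and element inspection scale with the data.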

@mroeschke added the "Performance" (memory or execution speed performance) and "Dtype Conversions" (unexpected or buggy dtype conversions) labels on Jan 30, 2023
@topper-123 (Author):

I've made a new version without ABCs.

"u": "integer",
"f": "floating",
"c": "complex",
}[self.dtype.kind]
@topper-123 (Author):
This is no longer needed for a speed-up.

@@ -1484,23 +1484,17 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

if util.is_array(value):
values = value
- elif hasattr(value, "inferred_type") and skipna is False:
+ elif hasattr(type(value), "inferred_type") and skipna is False:
@topper-123 (Author):

The issue in the old version was that using `hasattr` on a `cache_readonly` is slow: looking the attribute up on the instance actually evaluates the property. By checking on the type instead we get a good speed-up.
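The difference can be demonstrated with the standard library's `functools.cached_property`, which behaves like pandas' `cache_readonly` in this respect: `hasattr` on an instance runs the property, while `hasattr` on the type only finds the descriptor.

```python
from functools import cached_property

class Index:
    computed = 0  # counts how many times the property body runs

    @cached_property
    def inferred_type(self):
        # In real pandas this would be expensive inference over the data.
        Index.computed += 1
        return "integer"

idx = Index()
# Checking on the type just finds the descriptor: no computation.
assert hasattr(type(idx), "inferred_type")
assert Index.computed == 0
# Checking on the instance invokes the property and pays its full cost.
assert hasattr(idx, "inferred_type")
assert Index.computed == 1
```

Since `infer_dtype` only needs to know whether the attribute exists, the type-level check gives the same answer without triggering the computation.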

Member: nice!

@topper-123 (Author) commented Feb 5, 2023

Rerunning the benchmarks gives these results:

>>> import numpy as np
>>> import pandas as pd
>>> from pandas.api.types import infer_dtype
>>>
>>> base_arr = np.arange(10_000_000)
>>> x = np.array(base_arr)
>>> %timeit infer_dtype(x)
596 ns ± 128 ns per loop  # main
146 ns ± 8.73 ns per loop  # this PR, v1
140 ns ± 0.0592 ns per loop  # this PR, v2
>>> x = pd.Series(base_arr)
>>> %timeit infer_dtype(x)
6.83 µs ± 33.7 ns per loop  # main
583 ns ± 5.4 ns per loop  # this PR, v1
725 ns ± 5.24 ns per loop  # this PR, v2
>>> x = pd.Series(base_arr, dtype="Int32")
>>> %timeit infer_dtype(x)
3.92 µs ± 9.9 ns per loop  # main
816 ns ± 4.75 ns per loop  # this PR, v1
765 ns ± 4.81 ns per loop  # this PR, v2
>>> x = pd.array(base_arr)
>>> %timeit infer_dtype(x)
310 ns ± 0.704 ns per loop  # main
582 ns ± 4.73 ns per loop  # this PR, v1
426 ns ± 2.96 ns per loop  # this PR, v2
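Outside IPython, the `%timeit` runs above can be reproduced with the standard-library `timeit` module; a small sketch (array size and iteration count reduced for a quick run):

```python
import timeit

import numpy as np
from pandas.api.types import infer_dtype

x = np.arange(10_000)  # smaller than the PR's 10_000_000 for a quick demo

n = 100_000
total = timeit.timeit(lambda: infer_dtype(x), number=n)
print(f"{total / n * 1e9:.0f} ns per call")
```

Because the ndarray path reads the answer off the dtype, the per-call time should be roughly independent of the array's length.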

@jbrockmendel (Member) left a comment:

LGTM

@topper-123 (Author):

Gentle ping.

@mroeschke mroeschke added this to the 2.0 milestone Feb 9, 2023
@mroeschke mroeschke merged commit 12faa2e into pandas-dev:main Feb 9, 2023
@mroeschke (Member):

Thanks @topper-123

@topper-123 topper-123 deleted the infer_dtype_performance branch February 9, 2023 21:48