
PERF: loadtxt default float converter degrades performance for niche case #19541

Closed
rossbar opened this issue Jul 22, 2021 · 9 comments
Labels
33 - Question, component: numpy.lib

Comments

@rossbar
Contributor

rossbar commented Jul 22, 2021

loadtxt uses a default converter for floating-point values in order to support hexadecimal representations. This default converter has a significant performance impact when reading files containing many floating-point numbers, for example:

import numpy as np

# An example CSV file containing many floats
a = np.random.rand(100_000, 5)
np.savetxt("test.csv", a, delimiter=",")
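For context, the default converter in question looks roughly like this — a sketch of the relevant logic, not the exact NumPy source:

```python
def floatconv(x):
    """Sketch of loadtxt's default float converter: the '0x' check
    enables hex-float strings, at a per-token cost for the common
    decimal case."""
    x = x.lower()
    if "0x" in x:
        # Hexadecimal float literal, e.g. "0x1.8p1" == 1.5 * 2**1
        return float.fromhex(x)
    return float(x)

print(floatconv("1.5"))       # decimal path -> 1.5
print(floatconv("0x1.8p1"))   # hex path -> 3.0
```

The `lower()` call and the substring scan run for every token, which is where the overhead comes from.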

Current default

In [1]: %timeit a = np.loadtxt("test.csv", delimiter=",")
515 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Replace floatconv with float on L753

In [1]: %timeit a = np.loadtxt("test.csv", delimiter=",")
440 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note also that passing an explicit converter without the hex-float check actually slows things down further rather than speeding them up, e.g.:

In [3]: converters = {i: float for i in range(5)}
In [4]: %timeit a = np.loadtxt("test.csv", delimiter=",", converters=converters)
570 ms ± 6.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I mostly wanted to open this issue to draw attention to the tradeoff between performance and convenience for the default float converter. It seems to me that decimal representations of floating point values are much more common in data files than hex representations, so it could be argued that the current default results in a significant performance reduction to support a niche case. Is there appetite for changing the behavior to drop support for hex floats (by default --- the conversion can always be done via the converters kwarg) to improve the parsing performance for decimal floats?
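If hex support were dropped from the default, opting back in via the converters kwarg could look roughly like this (a sketch with a hypothetical `hexfloat` helper, not an endorsed API change):

```python
import io
import numpy as np

def hexfloat(s):
    """Hypothetical opt-in converter: accept both hex and decimal literals."""
    if isinstance(s, bytes):      # some loadtxt versions pass bytes tokens
        s = s.decode()
    s = s.lower()
    return float.fromhex(s) if "0x" in s else float(s)

data = io.StringIO("0x1.8p1,2.5\n0x1p3,1.0\n")
a = np.loadtxt(data, delimiter=",", converters={0: hexfloat, 1: float})
print(a)
```

Only the columns that actually contain hex literals pay the extra cost, instead of every float in the file.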

NumPy/Python version information:

1.22.0.dev0+546.gb797a0cb9 3.9.5 (default, May 24 2021, 12:50:35)
[GCC 11.1.0]

@rossbar rossbar added the 33 - Question and component: numpy.lib labels Jul 22, 2021
@bashtage
Contributor

bashtage commented Jul 23, 2021

You should use pandas if you care about CSV I/O.

%timeit pd.read_csv("foo.csv").to_numpy()
88.8 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I get 700ms for the current NumPy implementation.

@rossbar
Contributor Author

rossbar commented Jul 23, 2021

Generally I agree for new code, but there is plenty of existing code that uses numpy for I/O from text files. This issue is related to npreadtext, which aims to improve the performance of file I/O in numpy. Currently npreadtext is on par with or slightly faster than pd.read_csv for parsing floats at full precision, without supporting hex floats by default. Adopting a more performance-minded default in NumPy would speed up existing code that uses np.loadtxt, and would help npreadtext maintain performance while staying completely compatible with the current np.loadtxt behavior, with the longer-term goal of getting it into numpy itself.

@bashtage
Contributor

IME/O the most helpful thing for users would be to make them aware, in the docstrings, that there are more performant solutions than NumPy's file I/O. There appear to be much better serialization solutions available for both binary and text formats.

Don't get me wrong -- I regularly use the I/O functions in NumPy when a simple solution is needed and performance is not a key concern. Performance tuning in this area doesn't seem like a comparative advantage for NumPy.

@bashtage
Contributor

bashtage commented Jul 23, 2021

Do you get much of a bump from avoiding the `lower()` and switching the `in` check to `x.startswith("0x") or x.startswith("0X")`? The `lower()` will always create a new string, which might be one of the key costs of the check.

Edit: This won't work if there is leading whitespace or a `-` sign.
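The edge cases are easy to demonstrate: `float.fromhex` tolerates surrounding whitespace and a sign, but a plain prefix check sees neither:

```python
# float.fromhex accepts whitespace and signs around the literal...
print(float.fromhex(" -0x1.8p1 "))   # -3.0

# ...but these valid hex tokens all fail a naive prefix check
for tok in (" 0x1p1", "-0x1p1", "+0X1p1"):
    print(repr(tok), tok.startswith(("0x", "0X")))   # all False
```

So a startswith-based fast path would silently misroute signed or padded hex literals to the plain float() parser, where they raise ValueError.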

@charris
Member

charris commented Jul 23, 2021

We should steal the code from Pandas. That has been a long term intent.

@rossbar
Contributor Author

rossbar commented Jul 23, 2021

Do you get much of a bump from avoiding the `lower()` and switching the `in` check to `x.startswith("0x") or x.startswith("0X")`? The `lower()` will always create a new string, which might be one of the key costs of the check.

Not that I see; the (admittedly rudimentary) %timeit benchmarks are slightly worse for the various versions I tried. For example, with `if x.startswith("0x") or x.startswith("0X"):`

%timeit a = np.loadtxt("test.csv", delimiter=",")
527 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@rossbar
Contributor Author

rossbar commented Jul 23, 2021

We should steal the code from Pandas. That has been a long term intent.

I believe @WarrenWeckesser looked into this originally before starting npreadtext, though I don't remember what the conclusions were specifically.

@rossbar
Contributor Author

rossbar commented Jul 23, 2021

You should use pandas if you care about CSV io.

Also, FWIW, a significant amount of pandas' speedup when parsing floating-point values comes from a precision tradeoff made by default. That default makes a lot of sense, but for an apples-to-apples comparison at full precision you should use float_precision="round_trip", e.g.:

In [1]: b = np.loadtxt("test.csv", delimiter=",")
503 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [2]: c = pd.read_csv("test.csv", delimiter=",")
70.5 ms ± 845 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: d = pd.read_csv("test.csv", delimiter=",", float_precision="round_trip")
196 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This isn't super relevant to the discussion above, but it was something that had confused me before so I thought I'd share for any others who are interested in comparing/benchmarking text readers.
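For anyone benchmarking: savetxt's default fmt="%.18e" writes more than the 17 significant digits needed to represent a float64 exactly, so loadtxt recovers the values bit-for-bit. That is the full-precision baseline that float_precision="round_trip" is being compared against:

```python
import io
import numpy as np

a = np.random.rand(100, 5)
buf = io.StringIO()
np.savetxt(buf, a, delimiter=",")   # default fmt="%.18e" round-trips float64
buf.seek(0)
b = np.loadtxt(buf, delimiter=",")
print(np.array_equal(a, b))   # True: exact, bit-for-bit equality
```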

@seberg
Member

seberg commented Feb 8, 2022

Closing: the current plan (which could be debated) is to make the float parsing stricter from now on. Users will have to specify a custom converter if they want these things.

@seberg seberg closed this as completed Feb 8, 2022