
PERF: loadtxt default float converter degrades performance for niche case #19541

Closed
rossbar opened this issue Jul 22, 2021 · 9 comments
Labels
33 - Question, component: numpy.lib

Comments

@rossbar
Contributor

rossbar commented Jul 22, 2021

loadtxt uses a default converter for floating-point values in order to support hexadecimal representations. This default converter has a significant performance impact when reading files containing many floating-point numbers, for example:

import numpy as np

# An example CSV file containing many floats
a = np.random.rand(100_000, 5)
np.savetxt("test.csv", a, delimiter=",")
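For context, the default converter in question looks roughly like this — a sketch of the relevant logic, not the exact NumPy source:

```python
def floatconv(x):
    """Sketch of loadtxt's default float converter: the '0x' check
    enables hex-float strings, at a per-token cost for the common
    decimal case."""
    x = x.lower()
    if "0x" in x:
        # Hexadecimal float literal, e.g. "0x1.8p1" == 1.5 * 2**1
        return float.fromhex(x)
    return float(x)

print(floatconv("1.5"))       # decimal path -> 1.5
print(floatconv("0x1.8p1"))   # hex path -> 3.0
```

The `lower()` call and the substring scan run for every token, which is where the overhead comes from.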

Current default

In [1]: %timeit a = np.loadtxt("test.csv", delimiter=",")
515 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Replace floatconv with float on L753

In [1]: %timeit a = np.loadtxt("test.csv", delimiter=",")
440 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note also that passing an explicit converter without the hex-float check actually slows things down further rather than speeding them up, e.g.:

In [3]: converters = {i: float for i in range(5)}
In [4]: %timeit a = np.loadtxt("test.csv", delimiter=",", converters=converters)
570 ms ± 6.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I mostly wanted to open this issue to draw attention to the tradeoff between performance and convenience for the default float converter. It seems to me that decimal representations of floating point values are much more common in data files than hex representations, so it could be argued that the current default results in a significant performance reduction to support a niche case. Is there appetite for changing the behavior to drop support for hex floats (by default --- the conversion can always be done via the converters kwarg) to improve the parsing performance for decimal floats?
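If hex support were dropped from the default, opting back in via the converters kwarg could look roughly like this (a sketch with a hypothetical `hexfloat` helper, not an endorsed API change):

```python
import io
import numpy as np

def hexfloat(s):
    """Hypothetical opt-in converter: accept both hex and decimal literals."""
    if isinstance(s, bytes):      # some loadtxt versions pass bytes tokens
        s = s.decode()
    s = s.lower()
    return float.fromhex(s) if "0x" in s else float(s)

data = io.StringIO("0x1.8p1,2.5\n0x1p3,1.0\n")
a = np.loadtxt(data, delimiter=",", converters={0: hexfloat, 1: float})
print(a)
```

Only the columns that actually contain hex literals pay the extra cost, instead of every float in the file.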

NumPy/Python version information:

1.22.0.dev0+546.gb797a0cb9 3.9.5 (default, May 24 2021, 12:50:35)
[GCC 11.1.0]

@rossbar rossbar added the 33 - Question and component: numpy.lib labels Jul 22, 2021
@bashtage
Contributor

bashtage commented Jul 23, 2021

You should use pandas if you care about CSV I/O.

%timeit pd.read_csv("foo.csv").to_numpy()
88.8 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I get 700ms for the current NumPy implementation.

@rossbar
Contributor Author

rossbar commented Jul 23, 2021

Generally I agree for new code, but there is plenty of existing code that uses numpy for I/O from text files. This issue is related to npreadtext, which aims to improve the performance of file I/O in numpy. Currently npreadtext is on par with or slightly faster than pd.read_csv for parsing floats at full precision, without supporting hex floats by default. Adopting a more performance-minded default in NumPy would speed up existing code that uses np.loadtxt, and would help npreadtext maintain performance while staying completely compatible with the current np.loadtxt behavior, with the longer-term goal of getting it into numpy itself.

@bashtage
Contributor

IME/O the most helpful thing for users would be to make them aware, in the docstrings, that there are more performant solutions than NumPy's file I/O. There appear to be much better serialization solutions available for both binary and text formats.

Don't get me wrong -- I regularly use the I/O functions in NumPy when a simple solution is needed and performance is not a key concern. Performance tuning in this area doesn't seem like a comparative advantage for NumPy.

@bashtage
Contributor

bashtage commented Jul 23, 2021

Do you get much of a bump from avoiding the `lower()` and switching the `in` check to `x.startswith("0x") or x.startswith("0X")`? The `lower()` will always create a new string, which might be one of the key costs of the check.

Edit: This won't work if there is leading whitespace or a `-` sign.
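The edge cases are easy to demonstrate: `float.fromhex` tolerates surrounding whitespace and a sign, but a plain prefix check sees neither:

```python
# float.fromhex accepts whitespace and signs around the literal...
print(float.fromhex(" -0x1.8p1 "))   # -3.0

# ...but these valid hex tokens all fail a naive prefix check
for tok in (" 0x1p1", "-0x1p1", "+0X1p1"):
    print(repr(tok), tok.startswith(("0x", "0X")))   # all False
```

So a startswith-based fast path would silently misroute signed or padded hex literals to the plain float() parser, where they raise ValueError.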

@charris
Member

charris commented Jul 23, 2021

We should steal the code from Pandas. That has been a long term intent.

@rossbar
Contributor Author

rossbar commented Jul 23, 2021

Do you get much of a bump from avoiding the `lower()` and switching the `in` check to `x.startswith("0x") or x.startswith("0X")`? The `lower()` will always create a new string, which might be one of the key costs of the check.

Not that I see; the (admittedly rudimentary) %timeit benchmarks are slightly worse for the various versions I tried. For example, with `if x.startswith("0x") or x.startswith("0X"):`

%timeit a = np.loadtxt("test.csv", delimiter=",")
527 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@rossbar
Contributor Author

rossbar commented Jul 23, 2021

We should steal the code from Pandas. That has been a long term intent.

I believe @WarrenWeckesser looked into this originally before starting npreadtext, though I don't remember what the conclusions were specifically.

@rossbar
Contributor Author

rossbar commented Jul 23, 2021

You should use pandas if you care about CSV io.

Also, FWIW, a significant amount of pandas' speedup when parsing floating-point values comes from a precision tradeoff made by default. That default makes a lot of sense, but for an apples-to-apples comparison at full precision you should use float_precision="round_trip", e.g.:

In [1]: b = np.loadtxt("test.csv", delimiter=",")
503 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [2]: c = pd.read_csv("test.csv", delimiter=",")
70.5 ms ± 845 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: d = pd.read_csv("test.csv", delimiter=",", float_precision="round_trip")
196 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This isn't super relevant to the discussion above, but it was something that had confused me before so I thought I'd share for any others who are interested in comparing/benchmarking text readers.
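For anyone benchmarking: savetxt's default fmt="%.18e" writes more than the 17 significant digits needed to represent a float64 exactly, so loadtxt recovers the values bit-for-bit. That is the full-precision baseline that float_precision="round_trip" is being compared against:

```python
import io
import numpy as np

a = np.random.rand(100, 5)
buf = io.StringIO()
np.savetxt(buf, a, delimiter=",")   # default fmt="%.18e" round-trips float64
buf.seek(0)
b = np.loadtxt(buf, delimiter=",")
print(np.array_equal(a, b))   # True: exact, bit-for-bit equality
```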

@seberg
Member

seberg commented Feb 8, 2022

Closing: the current plan (which could be debated) is to make the float parsing stricter from now on. Users will have to specify a custom converter if they want these things.

@seberg seberg closed this as completed Feb 8, 2022