Add third party dtoa library #272

jmcnamara · 2020-01-26T21:15:17Z

I have started to look at using a third party library to do dtoa (double to string) formatting. Currently this is on the dtoa branch.

This is in order to avoid locale issues with double sprintf() formatting (for example getting "1,234" instead of the "1.234" required by Excel. For more background and a discussion of other workarounds see #64

The code is working on the dtoa branch with Mac/Linux/Window but I'm still testing it. If you would like to test it then please let me know how you get on.

Added the Milo Yip DTOA library (emyg_dtoa) is uses to avoid issues where the standard sprintf() dtoa function changes output based on locale settings. It is also 40-50% faster than the standard dtoa for raw numeric data. If you wish to not use this third party library you can compile libxlsxwriter without it by passing `USE_STANDARD_DOUBLE=1` to make. The USE_DOUBLE_FUNCTION build variable is no longer used. Imported source from Source from https://github.com/miloyip/dtoa-benchmark Feature request #272

utelle · 2020-01-27T12:00:06Z

Great to see that you started to work on this feature. 😄

To make the emyg_dtoa code work with older C compilers, too, I had to make some further adjustments (see emyg_dtoa.c). Maybe you could consider to adopt those modifications.

jmcnamara · 2020-01-27T12:51:29Z

Great to see that you started to work on this feature.

I'm still testing it out. I need to add some double specific tests as well to see if there are any issues.

To make the emyg_dtoa code work with older C compilers, too,

Do you mean including a version of stdint.h? Are there other changes as well?

I don't plan to include a copy of stdint.h so people with Windows compilers older than VS 2010 will either need to figure out a way of including it or use the standard dtoa formatting. I think that this be a small subset of potential users.

utelle · 2020-01-27T13:52:06Z

I'm still testing it out. I need to add some double specific tests as well to see if there are any issues.

Sure, I understand that very well.

Do you mean including a version of stdint.h? Are there other changes as well?

Yes, at several places I had to move variable definitions to the top of a code block, because older compilers (not compatible with C99) complain otherwise.

I don't plan to include a copy of stdint.h so people with Windows compilers older than VS 2010 ...

Fair comment. However, quite a few people are still using VS 2010, even though it is a rather dated compiler, and VS 2010 for example does not support C99 (as far as I know, no VS compiler version does fully support C99). So, if you want to support VS 2010 you need to address its incompatibilities (like not having stdint.h) somehow.

Added the Milo Yip DTOA library (emyg_dtoa) is uses to avoid issues where the standard sprintf() dtoa function changes output based on locale settings. It is also 40-50% faster than the standard dtoa for raw numeric data. If you wish to not use this third party library you can compile libxlsxwriter without it by passing `USE_STANDARD_DOUBLE=1` to make. The USE_DOUBLE_FUNCTION build variable is no longer used. Imported source from Source from https://github.com/miloyip/dtoa-benchmark Feature request #272

znakeeye · 2020-04-20T11:51:15Z

Why choose an outdated less performant version of dtoa?!

Since 2018, most C++ compilers switched to the brand new Ryu algorithm (paper can be found here). And here's a post from the author of grisu himself, citing this new algorithm. The author of Ryu, Ulf Adams, has an implementation in C here at github.

Visual Studio incorporated this algorithm back in 2017. Release notes here.

We implemented the shortest round-trip decimal overloads of floating-point to_chars() in C++17's charconv header. For scientific notation, it is approximately 10x as fast as sprintf_s() "%.8e" for floats, and 30x as fast as sprintf_s() "%.16e" for doubles. This uses Ulf Adams' new algorithm, Ryu.

utelle · 2020-04-20T14:48:40Z

I mentioned Ryu already in January in one of my comments to issue 64.

It is true that Visual C++ 2017 adopted Ryu for to_chars. However, this function is only accessible from C++17 code. And libxlsxwriter is written in C.

Of course, it would be possible to replace the emyg_dtoa code by Ulf Adams' ryu C code. However, I found out that the conversion results differ, although only slightly. While emyg_dtoa tries to produce the shortest possible string, ryu always appends the exponent. Examples:

double x1 = 1.23456789012345678;
/* emyg_dtoa(x1) => "1.2345678901234568" */
/* ryu(x1)       => "1.2345678901234567E0" */
double x2 = 0.61728394506172835;
/* emyg_dtoa(x2) => "0.6172839450617284" */
/* ryu(x2)       => "6.172839450617283E-1" */

The example shows another effect: obviously the last significant digit is (at least sometimes) rounded differently by the two algorithms.

Most likely both differences don't matter much in practice.

znakeeye · 2020-04-20T16:00:26Z

Interesting. And Excel seems to format the latter as 0,617283945061728.

utelle · 2020-04-20T19:13:48Z

In the GUI Excel displays at most 15 significant digits. However, internally (that is, for the representation in the file itself) up to 17 significant digits are used.

In the GUI the decimal separator depends on the user's locale (or the user's settings). In the file always a point is used as the decimal separator.

znakeeye · 2020-04-20T19:16:25Z

(I used a custom cell decimal format with 30 decimals. That didn't change the UI.)

utelle · 2020-04-20T19:49:21Z

Such a custom cell format allows you to display a value like 1.23456789E-22 as 0.000000000000000000000123456789. However, it does not affect the number of significant digits. If you enter more than 15 significant digits in the Excel UI, the exceeding digits will be cut off.

utelle · 2020-04-21T16:25:14Z

In the meantime I added Ryu as an alternative to emyg to my local libxlsxwriter repository, compiled it, and ran the functional tests. It works as expected, and Excel opens the resulting xlsx files flawlessly.

Resulting file sizes are slightly bigger, on average 0.2 percent - that is, neglectable. There was no measurable effect on runtime, but that was to be expected for the small test cases. IMHO replacing emyg by Ryu will have an effect only, if the generated Excel file contains many floating point values.

jmcnamara · 2020-04-22T10:06:19Z

@utelle Can you push the RYU alternative to a branch of your repo so that I test it.

znakeeye · 2020-04-22T10:16:36Z

In the meantime I added Ryu as an alternative to emyg to my local libxlsxwriter repository, compiled it, and ran the functional tests. It works as expected, and Excel opens the resulting xlsx files flawlessly.

Resulting file sizes are slightly bigger, on average 0.2 percent - that is, neglectable. There was no measurable effect on runtime, but that was to be expected for the small test cases. IMHO replacing emyg by Ryu will have an effect only, if the generated Excel file contains many floating point values.

How many? 😛

Will it have an impact on, let's say, 200 000 values?

utelle · 2020-04-22T10:21:45Z

Can you push the RYU alternative to a branch of your repo so that I test it.

@jmcnamara I'll do that later today.

Since RYU adds an exponent field to all floating point numbers, the resulting Excel files differ from the ones that are used for comparison in the test cases. That is, all tests fail in that respect. Nevertheless, Excel can successfully open all generated files and from the user's perspective they are identical.

utelle · 2020-04-22T10:31:07Z

How many? 😛

Good question. I have not done any performance tests yet.

Will it have an impact on, let's say, 200 000 values?

Probably yes, but my guess is that it will be smaller than you may expect. In respect to speed the emyg_dtoa algorithm , grisu, is already much better than sprintf.

utelle · 2020-04-22T13:02:47Z

@jmcnamara I created branch ryu_test in my repository. On invoking make specify either USE_DTOA_EMYG or USE_DTOA_RYU in addition to USE_DOUBLE_FUNCTION to select which double formatting should be used, EMYG or RYU.

Added the optional Milo Yip DTOA library (emyg_dtoa) to avoid issues where the standard sprintf() dtoa function changes output based on locale settings. It is also 40-50% faster than the standard dtoa for raw numeric data. If you wish to use this third party library you can compile libxlsxwriter with it by passing `USE_DTOA_LIBRARY=1` to make. The USE_DOUBLE_FUNCTION build variable is no longer used. Imported source from https://github.com/miloyip/dtoa-benchmark Feature request #272

jmcnamara · 2021-07-11T13:03:49Z

@utelle I've dusted off this work again with e EMYG library on the dtoa branch and rebased it to main. Can you try it when/if you get a chance and let me know if you encounter any issues.

It is an option compilation so you will need to pass "USE_DTOA_LIBRARY=1" to make or "-DUSE_DTOA_LIBRARY=ON" to cmake.

If there are no issues I'll merge it into main and put it in the next release.

utelle · 2021-07-11T15:34:55Z

I tested the dtoa branch. Unfortunately, the dtoa implementation is flawed. In line 429 and line 437 of emyg_dtoa.c you added an explicit plus sign for the exponent to the output (not present in the original implementation). This leads to invalid output in case of a negative exponent. For example, setting a cell to 1.2e-17 results in 1.2E+-17 in the resulting Excel file ... and Excel chokes on such a file and can't repair it.

Lines 429 and 437 should be removed and the array index has to be adjusted in line 430 to

		WriteExponent(kk - 1, &buffer[2]);

resp line 438 to

		WriteExponent(kk - 1, &buffer[0 + length + 2]);

If you want an explicit plus sign in the exponent, you should modify function WriteExponent to do so.

jmcnamara · 2021-07-11T15:53:11Z

Thanks. I'll fix that.

Added the optional Milo Yip DTOA library (emyg_dtoa) to avoid issues where the standard sprintf() dtoa function changes output based on locale settings. It is also 40-50% faster than the standard dtoa for raw numeric data. If you wish to use this third party library you can compile libxlsxwriter with it by passing `USE_DTOA_LIBRARY=1` to make. The USE_DOUBLE_FUNCTION build variable is no longer used. Imported source from https://github.com/miloyip/dtoa-benchmark Feature request #272

jmcnamara · 2021-07-11T17:06:31Z

@utelle I've pushed a fix, with a test case. I used a force push so you will need to get the latest code from the branch again.

utelle · 2021-07-11T17:26:29Z

The new version works now as expected.

Added the optional Milo Yip DTOA library (emyg_dtoa) to avoid issues where the standard sprintf() dtoa function changes output based on locale settings. It is also 40-50% faster than the standard dtoa for raw numeric data. If you wish to use this third party library you can compile libxlsxwriter with it by passing `USE_DTOA_LIBRARY=1` to make. The USE_DOUBLE_FUNCTION build variable is no longer used. Imported source from https://github.com/miloyip/dtoa-benchmark Feature request #272

jmcnamara · 2021-07-12T22:37:49Z

This has been merged to main and released in libxlsxwriter version 1.1.1.

jmcnamara self-assigned this Jan 28, 2020

jmcnamara added the in progress label Jan 28, 2020

jmcnamara mentioned this issue Jul 11, 2021

Corrupted Excel files due to locale dependent conversion of floating point values #64

Closed

jmcnamara closed this as completed Jul 12, 2021

birktj mentioned this issue Sep 20, 2022

Invalid xlsx files generated with some locales informationsea/xlsxwriter-rs#35

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add third party dtoa library #272

Add third party dtoa library #272

jmcnamara commented Jan 26, 2020 •

edited

Loading

utelle commented Jan 27, 2020

jmcnamara commented Jan 27, 2020

utelle commented Jan 27, 2020 •

edited

Loading

znakeeye commented Apr 20, 2020

utelle commented Apr 20, 2020

znakeeye commented Apr 20, 2020

utelle commented Apr 20, 2020

znakeeye commented Apr 20, 2020

utelle commented Apr 20, 2020

utelle commented Apr 21, 2020

jmcnamara commented Apr 22, 2020

znakeeye commented Apr 22, 2020

utelle commented Apr 22, 2020

utelle commented Apr 22, 2020

utelle commented Apr 22, 2020

jmcnamara commented Jul 11, 2021

utelle commented Jul 11, 2021

jmcnamara commented Jul 11, 2021

jmcnamara commented Jul 11, 2021

utelle commented Jul 11, 2021

jmcnamara commented Jul 12, 2021

Add third party dtoa library #272

Add third party dtoa library #272

Comments

jmcnamara commented Jan 26, 2020 • edited Loading

utelle commented Jan 27, 2020

jmcnamara commented Jan 27, 2020

utelle commented Jan 27, 2020 • edited Loading

znakeeye commented Apr 20, 2020

utelle commented Apr 20, 2020

znakeeye commented Apr 20, 2020

utelle commented Apr 20, 2020

znakeeye commented Apr 20, 2020

utelle commented Apr 20, 2020

utelle commented Apr 21, 2020

jmcnamara commented Apr 22, 2020

znakeeye commented Apr 22, 2020

utelle commented Apr 22, 2020

utelle commented Apr 22, 2020

utelle commented Apr 22, 2020

jmcnamara commented Jul 11, 2021

utelle commented Jul 11, 2021

jmcnamara commented Jul 11, 2021

jmcnamara commented Jul 11, 2021

utelle commented Jul 11, 2021

jmcnamara commented Jul 12, 2021

jmcnamara commented Jan 26, 2020 •

edited

Loading

utelle commented Jan 27, 2020 •

edited

Loading