Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add third party dtoa library #272

Closed
jmcnamara opened this issue Jan 26, 2020 · 21 comments
Closed

Add third party dtoa library #272

jmcnamara opened this issue Jan 26, 2020 · 21 comments
Assignees

Comments

@jmcnamara
Copy link
Owner

jmcnamara commented Jan 26, 2020

I have started to look at using a third party library to do dtoa (double to string) formatting. Currently this is on the dtoa branch.

This is in order to avoid locale issues with double sprintf() formatting (for example getting "1,234" instead of the "1.234" required by Excel. For more background and a discussion of other workarounds see #64

The code is working on the dtoa branch with Mac/Linux/Window but I'm still testing it. If you would like to test it then please let me know how you get on.

jmcnamara added a commit that referenced this issue Jan 26, 2020
Added the Milo Yip DTOA library (emyg_dtoa) is uses to avoid
issues where the standard sprintf() dtoa function changes output
based on locale settings. It is also 40-50% faster than the
standard dtoa for raw numeric data.

If you wish to not use this third party library you can compile
libxlsxwriter without it by passing `USE_STANDARD_DOUBLE=1` to
make. The USE_DOUBLE_FUNCTION build variable is no longer used.

Imported source from Source from https://github.com/miloyip/dtoa-benchmark

Feature request #272
jmcnamara added a commit that referenced this issue Jan 26, 2020
Added the Milo Yip DTOA library (emyg_dtoa) is uses to avoid
issues where the standard sprintf() dtoa function changes output
based on locale settings. It is also 40-50% faster than the
standard dtoa for raw numeric data.

If you wish to not use this third party library you can compile
libxlsxwriter without it by passing `USE_STANDARD_DOUBLE=1` to
make. The USE_DOUBLE_FUNCTION build variable is no longer used.

Imported source from Source from https://github.com/miloyip/dtoa-benchmark

Feature request #272
jmcnamara added a commit that referenced this issue Jan 26, 2020
Added the Milo Yip DTOA library (emyg_dtoa) is uses to avoid
issues where the standard sprintf() dtoa function changes output
based on locale settings. It is also 40-50% faster than the
standard dtoa for raw numeric data.

If you wish to not use this third party library you can compile
libxlsxwriter without it by passing `USE_STANDARD_DOUBLE=1` to
make. The USE_DOUBLE_FUNCTION build variable is no longer used.

Imported source from Source from https://github.com/miloyip/dtoa-benchmark

Feature request #272
jmcnamara added a commit that referenced this issue Jan 26, 2020
Added the Milo Yip DTOA library (emyg_dtoa) is uses to avoid
issues where the standard sprintf() dtoa function changes output
based on locale settings. It is also 40-50% faster than the
standard dtoa for raw numeric data.

If you wish to not use this third party library you can compile
libxlsxwriter without it by passing `USE_STANDARD_DOUBLE=1` to
make. The USE_DOUBLE_FUNCTION build variable is no longer used.

Imported source from Source from https://github.com/miloyip/dtoa-benchmark

Feature request #272
@utelle
Copy link

utelle commented Jan 27, 2020

Great to see that you started to work on this feature. 😄

To make the emyg_dtoa code work with older C compilers, too, I had to make some further adjustments (see emyg_dtoa.c). Maybe you could consider to adopt those modifications.

@jmcnamara
Copy link
Owner Author

Great to see that you started to work on this feature.

I'm still testing it out. I need to add some double specific tests as well to see if there are any issues.

To make the emyg_dtoa code work with older C compilers, too,

Do you mean including a version of stdint.h? Are there other changes as well?

I don't plan to include a copy of stdint.h so people with Windows compilers older than VS 2010 will either need to figure out a way of including it or use the standard dtoa formatting. I think that this be a small subset of potential users.

@utelle
Copy link

utelle commented Jan 27, 2020

I'm still testing it out. I need to add some double specific tests as well to see if there are any issues.

Sure, I understand that very well.

Do you mean including a version of stdint.h? Are there other changes as well?

Yes, at several places I had to move variable definitions to the top of a code block, because older compilers (not compatible with C99) complain otherwise.

I don't plan to include a copy of stdint.h so people with Windows compilers older than VS 2010 ...

Fair comment. However, quite a few people are still using VS 2010, even though it is a rather dated compiler, and VS 2010 for example does not support C99 (as far as I know, no VS compiler version does fully support C99). So, if you want to support VS 2010 you need to address its incompatibilities (like not having stdint.h) somehow.

@jmcnamara jmcnamara self-assigned this Jan 28, 2020
jmcnamara added a commit that referenced this issue Feb 15, 2020
Added the Milo Yip DTOA library (emyg_dtoa) is uses to avoid
issues where the standard sprintf() dtoa function changes output
based on locale settings. It is also 40-50% faster than the
standard dtoa for raw numeric data.

If you wish to not use this third party library you can compile
libxlsxwriter without it by passing `USE_STANDARD_DOUBLE=1` to
make. The USE_DOUBLE_FUNCTION build variable is no longer used.

Imported source from Source from https://github.com/miloyip/dtoa-benchmark

Feature request #272
@znakeeye
Copy link

Why choose an outdated less performant version of dtoa?!

Since 2018, most C++ compilers switched to the brand new Ryu algorithm (paper can be found here). And here's a post from the author of grisu himself, citing this new algorithm. The author of Ryu, Ulf Adams, has an implementation in C here at github.

Visual Studio incorporated this algorithm back in 2017. Release notes here.

We implemented the shortest round-trip decimal overloads of floating-point to_chars() in C++17's charconv header. For scientific notation, it is approximately 10x as fast as sprintf_s() "%.8e" for floats, and 30x as fast as sprintf_s() "%.16e" for doubles. This uses Ulf Adams' new algorithm, Ryu.

@utelle
Copy link

utelle commented Apr 20, 2020

I mentioned Ryu already in January in one of my comments to issue 64.

It is true that Visual C++ 2017 adopted Ryu for to_chars. However, this function is only accessible from C++17 code. And libxlsxwriter is written in C.

Of course, it would be possible to replace the emyg_dtoa code by Ulf Adams' ryu C code. However, I found out that the conversion results differ, although only slightly. While emyg_dtoa tries to produce the shortest possible string, ryu always appends the exponent. Examples:

double x1 = 1.23456789012345678;
/* emyg_dtoa(x1) => "1.2345678901234568" */
/* ryu(x1)       => "1.2345678901234567E0" */
double x2 = 0.61728394506172835;
/* emyg_dtoa(x2) => "0.6172839450617284" */
/* ryu(x2)       => "6.172839450617283E-1" */

The example shows another effect: obviously the last significant digit is (at least sometimes) rounded differently by the two algorithms.

Most likely both differences don't matter much in practice.

@znakeeye
Copy link

Interesting. And Excel seems to format the latter as 0,617283945061728.

@utelle
Copy link

utelle commented Apr 20, 2020

In the GUI Excel displays at most 15 significant digits. However, internally (that is, for the representation in the file itself) up to 17 significant digits are used.

In the GUI the decimal separator depends on the user's locale (or the user's settings). In the file always a point is used as the decimal separator.

@znakeeye
Copy link

(I used a custom cell decimal format with 30 decimals. That didn't change the UI.)

@utelle
Copy link

utelle commented Apr 20, 2020

Such a custom cell format allows you to display a value like 1.23456789E-22 as 0.000000000000000000000123456789. However, it does not affect the number of significant digits. If you enter more than 15 significant digits in the Excel UI, the exceeding digits will be cut off.

@utelle
Copy link

utelle commented Apr 21, 2020

In the meantime I added Ryu as an alternative to emyg to my local libxlsxwriter repository, compiled it, and ran the functional tests. It works as expected, and Excel opens the resulting xlsx files flawlessly.

Resulting file sizes are slightly bigger, on average 0.2 percent - that is, neglectable. There was no measurable effect on runtime, but that was to be expected for the small test cases. IMHO replacing emyg by Ryu will have an effect only, if the generated Excel file contains many floating point values.

@jmcnamara
Copy link
Owner Author

@utelle Can you push the RYU alternative to a branch of your repo so that I test it.

@znakeeye
Copy link

In the meantime I added Ryu as an alternative to emyg to my local libxlsxwriter repository, compiled it, and ran the functional tests. It works as expected, and Excel opens the resulting xlsx files flawlessly.

Resulting file sizes are slightly bigger, on average 0.2 percent - that is, neglectable. There was no measurable effect on runtime, but that was to be expected for the small test cases. IMHO replacing emyg by Ryu will have an effect only, if the generated Excel file contains many floating point values.

How many? 😛

Will it have an impact on, let's say, 200 000 values?

@utelle
Copy link

utelle commented Apr 22, 2020

Can you push the RYU alternative to a branch of your repo so that I test it.

@jmcnamara I'll do that later today.

Since RYU adds an exponent field to all floating point numbers, the resulting Excel files differ from the ones that are used for comparison in the test cases. That is, all tests fail in that respect. Nevertheless, Excel can successfully open all generated files and from the user's perspective they are identical.

@utelle
Copy link

utelle commented Apr 22, 2020

How many? 😛

Good question. I have not done any performance tests yet.

Will it have an impact on, let's say, 200 000 values?

Probably yes, but my guess is that it will be smaller than you may expect. In respect to speed the emyg_dtoa algorithm , grisu, is already much better than sprintf.

@utelle
Copy link

utelle commented Apr 22, 2020

@jmcnamara I created branch ryu_test in my repository. On invoking make specify either USE_DTOA_EMYG or USE_DTOA_RYU in addition to USE_DOUBLE_FUNCTION to select which double formatting should be used, EMYG or RYU.

jmcnamara added a commit that referenced this issue Jul 11, 2021
Added the optional Milo Yip DTOA library (emyg_dtoa) to avoid
issues where the standard sprintf() dtoa function changes output
based on locale settings. It is also 40-50% faster than the
standard dtoa for raw numeric data.

If you wish to use this third party library you can compile
libxlsxwriter with it by passing `USE_DTOA_LIBRARY=1` to
make. The USE_DOUBLE_FUNCTION build variable is no longer used.

Imported source from https://github.com/miloyip/dtoa-benchmark

Feature request #272
@jmcnamara
Copy link
Owner Author

@utelle I've dusted off this work again with e EMYG library on the dtoa branch and rebased it to main. Can you try it when/if you get a chance and let me know if you encounter any issues.

It is an option compilation so you will need to pass "USE_DTOA_LIBRARY=1" to make or "-DUSE_DTOA_LIBRARY=ON" to cmake.

If there are no issues I'll merge it into main and put it in the next release.

@utelle
Copy link

utelle commented Jul 11, 2021

I tested the dtoa branch. Unfortunately, the dtoa implementation is flawed. In line 429 and line 437 of emyg_dtoa.c you added an explicit plus sign for the exponent to the output (not present in the original implementation). This leads to invalid output in case of a negative exponent. For example, setting a cell to 1.2e-17 results in 1.2E+-17 in the resulting Excel file ... and Excel chokes on such a file and can't repair it.

Lines 429 and 437 should be removed and the array index has to be adjusted in line 430 to

		WriteExponent(kk - 1, &buffer[2]);

resp line 438 to

		WriteExponent(kk - 1, &buffer[0 + length + 2]);

If you want an explicit plus sign in the exponent, you should modify function WriteExponent to do so.

@jmcnamara
Copy link
Owner Author

Thanks. I'll fix that.

jmcnamara added a commit that referenced this issue Jul 11, 2021
Added the optional Milo Yip DTOA library (emyg_dtoa) to avoid
issues where the standard sprintf() dtoa function changes output
based on locale settings. It is also 40-50% faster than the
standard dtoa for raw numeric data.

If you wish to use this third party library you can compile
libxlsxwriter with it by passing `USE_DTOA_LIBRARY=1` to
make. The USE_DOUBLE_FUNCTION build variable is no longer used.

Imported source from https://github.com/miloyip/dtoa-benchmark

Feature request #272
@jmcnamara
Copy link
Owner Author

@utelle I've pushed a fix, with a test case. I used a force push so you will need to get the latest code from the branch again.

@utelle
Copy link

utelle commented Jul 11, 2021

The new version works now as expected.

jmcnamara added a commit that referenced this issue Jul 12, 2021
Added the optional Milo Yip DTOA library (emyg_dtoa) to avoid
issues where the standard sprintf() dtoa function changes output
based on locale settings. It is also 40-50% faster than the
standard dtoa for raw numeric data.

If you wish to use this third party library you can compile
libxlsxwriter with it by passing `USE_DTOA_LIBRARY=1` to
make. The USE_DOUBLE_FUNCTION build variable is no longer used.

Imported source from https://github.com/miloyip/dtoa-benchmark

Feature request #272
@jmcnamara
Copy link
Owner Author

This has been merged to main and released in libxlsxwriter version 1.1.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants