Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement floatDec/doubleDec #115

Closed
wants to merge 10 commits into from
Closed

Conversation

@winterland1989
Copy link

@winterland1989 winterland1989 commented Feb 14, 2017

Sorry for the delay, the floating format problem is definitely not as simple as i thought, but i finally made it forward somehow, in short this patch:

  • Implement doubleDec/floatDec using grisu3 with dragon4 fallback with reference here.

  • Implement floatDec using dragon4 with floatToDigits in GHC.Float.

Benchmarks are uploaded, please give a review any time you want.

@winterland1989
Copy link
Author

@winterland1989 winterland1989 commented Feb 14, 2017

I have made the benchmark running again. Grisu3 give ~3x boost to floatDec/doubleDec , which is reasonable according to this.

Run on my MacBook Pro 2015(2.7GHz i5):

...
benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap floatDec (10000)
time                 6.393 ms   (6.283 ms .. 6.506 ms)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 6.473 ms   (6.411 ms .. 6.544 ms)
std dev              196.3 μs   (158.5 μs .. 256.4 μs)

benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap show float (10000)
time                 17.74 ms   (17.55 ms .. 17.95 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 17.83 ms   (17.63 ms .. 18.32 ms)
std dev              740.7 μs   (304.3 μs .. 1.398 ms)

benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap doubleDec (10000)
time                 10.99 ms   (10.90 ms .. 11.10 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 11.20 ms   (11.07 ms .. 11.40 ms)
std dev              415.4 μs   (176.1 μs .. 645.1 μs)

benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap show double (10000)
time                 30.69 ms   (30.27 ms .. 31.14 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 30.70 ms   (30.55 ms .. 30.87 ms)
std dev              352.9 μs   (285.3 μs .. 450.5 μs)


...

@dcoutts
Copy link
Contributor

@dcoutts dcoutts commented Feb 16, 2017

This looks very interesting. I'll have to review it in detail, and I invite anyone else to do so too.

@winterland1989
Copy link
Author

@winterland1989 winterland1989 commented Feb 16, 2017

Yes, please ; )

@neongreen
Copy link

@neongreen neongreen commented Feb 16, 2017

How does it compare to double-conversion?

@winterland1989
Copy link
Author

@winterland1989 winterland1989 commented Feb 16, 2017

The C/C++ part should be similar since they're basically same. But it's quite hard to make a comparison because this patch use primitive builder to fill the buffer, while double-conversion directly fill the buffer in C. I'll add double-conversion to the builder benchmark locally and post the result.

Update: It seems double-conversion is doing a better job here. i think we should find the reason.

benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap double conversion
time                 2.210 ms   (2.204 ms .. 2.216 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.217 ms   (2.211 ms .. 2.227 ms)
std dev              25.88 μs   (19.04 μs .. 38.26 μs)

benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap doubleDec (10000)
time                 7.766 ms   (7.732 ms .. 7.814 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 7.770 ms   (7.745 ms .. 7.797 ms)
std dev              74.86 μs   (57.71 μs .. 101.1 μs)

benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap show double (10000)
time                 21.65 ms   (21.31 ms .. 21.89 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 21.60 ms   (21.48 ms .. 21.72 ms)
std dev              279.3 μs   (233.4 μs .. 369.3 μs)

@winterland1989
Copy link
Author

@winterland1989 winterland1989 commented Feb 17, 2017

Further benchmark indeed show that the time spend on C part is comparable to google's double conversion, much time is spent on FFI peeking, itoa, builder filling etc. If we want to go faster while keep backward-compatible format, we have to fill the buffer in C, and port doFmt to C to get exactly same formatting logic(which is not very fun).

Note that unlike this patch, double-conversion won't give you the same result as show(which is required by bytestring's test), for example:

Prelude Data.Double.Conversion.ByteString> toShortest 0.0012
"0.0012"
Prelude Data.Double.Conversion.ByteString> show 0.0012
"1.2e-3"

So for now, this patch is the best i can do. If you have any other ideas please tell me, i'm happy to give a try.

@dcoutts
Copy link
Contributor

@dcoutts dcoutts commented Feb 17, 2017

@winterland1989 there's no problem in principle with writing directly into the buffer in C, the only thing to worry about is that it's a client allocate buffer: so while you can ensure there's a fixed amount of space in advance, you cannot reallocate it half way through. Is that OK? Can we determine in advance the maximum size and then return the number of bytes we wrote? If so then no problem. There's a specific class of builder primitives like that (the prim ones that are not the fixed-size ones).

Indeed, the main thing I noticed when I first skimmed the code is that it's not just writing to the buffer on the C side, which I would have assumed is the way to go.

@winterland1989
Copy link
Author

@winterland1989 winterland1989 commented Feb 17, 2017

Can we determine in advance the maximum size and then return the number of bytes we wrote?

Yes, Bos use 26, which works fine.

What stops me from doing so is the compatibility with GHC's show format. What i'm doing in this patch is perfectly explained by aseipp in this post's comment:

Second, the Haskell code and the C code are doing different things, which is why it's all a bit verbose. The Haskell code has to both use Grisu3 and also fall back to Dragon4 (AKA the existing floatToDigits function, written in Haskell), because Grisu3 is not complete: it may detect that its output is not "the shortest possible" for some small percent of inputs, and thus it will fail. Furthermore, despite the somewhat opaque naming -- these functions are NOT functions of type Float -> String -- they are effectively implementations of the floatsToDigits function which only gives you the list of digits and exponent as a pair of ([Int],Int). That means you still have to actually convert this to a string in the proper representation.

The code inside the patch, aside from adding Grisu, is effectively just a copy of the source code from Float's Show instance inside GHC.Float, here, but instead using the Grisu/Dragon combination, and specialized to Builder from ByteString. That's why the patch replaces floatDec and doubleDec which were previously just implemented as string7 . show (because show for Float and Double in fact was literally defined as (show x = formatRealFloat ...), and it replaces them with a copy of our already well-used formatting code.

@knupfer
Copy link

@knupfer knupfer commented Nov 3, 2017

I'd really like this to be merged. This would remove my libs dependence on double-conversion.

int16_t b_exp, d_exp;
} power;

static const power pow_cache[] =
Copy link
Contributor

@bgamari bgamari Nov 10, 2017

A comment here is needed. How is this array indexed? How are the elements computed?

Copy link
Author

@winterland1989 winterland1989 Feb 23, 2018

Argh, I forget to submit this review comment in time, sorry for the delay. To see how this array works, please check section 4 Cache Powers from the paper in the code comment. This array is a cache table for converting between 2-based power into 10-based, it store data in power struct:

typedef struct power
{
    uint64_t fract;
    int16_t b_exp, d_exp;
} power;

The data from the table satisfy this equation: fract * power(2, b_exp) ~= power(10, d_exp) . It can be calculated in such manner:

  • Decide step value and range for d_exp which satisfy the condition from the paper(Section 5). In this patch the original author choose following (which exceeded the requirement from the paper):
define MIN_CACHED_EXP -348
define CACHED_EXP_STEP 8
  • For each d_exp from [MIN_CACHED_EXP, MIN_CACHED_EXP + CACHED_EXP_STEP, -MIN_CACHED_EXP], convert power(10, d_exp) to IEEE representation, record its fract and b_exp part.

Now we get the table.

Copy link
Member

@sjakobi sjakobi Dec 19, 2019

If I understand @bgamari correctly, he wanted your comment to be in the source code. (I agree that that would be useful).

@bgamari
Copy link
Contributor

@bgamari bgamari commented Feb 18, 2018

I agree that it would be nice to have this merged. However, it's hard to review without proper commenting.

@knupfer
Copy link

@knupfer knupfer commented Apr 29, 2018

@winterland1989
Copy link
Author

@winterland1989 winterland1989 commented May 1, 2018

I added notes on table generating, is there any problems?

@alexbiehl
Copy link
Member

@alexbiehl alexbiehl commented May 17, 2018

+1 for faster floating point encoding!

@fumieval
Copy link
Contributor

@fumieval fumieval commented Apr 2, 2020

I was trying out Grisu3 for decimal encoding on mason, my alternative bytestring builder library. It seems to work perfectly as far as I can see, and is ~20x faster than show! I'd like this change to be merged

@Bodigrim
Copy link
Contributor

@Bodigrim Bodigrim commented May 10, 2020

I wonder how this compares against Ryu algorithm. @Lumaere would you possibly take a look?

@sergv could you possibly weigh in and review?

@la-wu
Copy link
Contributor

@la-wu la-wu commented May 11, 2020

Ulf Adams did an analysis of Ryu vs Grisu3 in his paper. Regarding performance

Grisu3 has several outliers, not shown in the graphs, which correspond to cases where it has to fall back to a slower arbitrary-precision implementation. The C implementation
of Ryu has better performance than Grisu3 and is also less volatile. On average, Ryu is roughly three times faster than Grisu for 32-bit and 64-bit floating point numbers

My implementation is in pure Haskell put does use malloc and directly writes bits for performance.

Ryu on this benchmark

[ bgroup "fold"
    [ benchB "foldMap f2s       " floatData $ foldMap (BB.string7 . f2s)
    , benchB "foldMap show float" floatData $ foldMap (BB.string7 . show)
    ]
]

on my machine (i7-8700k):

ryu> benchmarks
Running 1 benchmarks...
Benchmark ryu-bench: RUNNING...
benchmarking fold/foldMap f2s        (10000)
time                 1.783 ms   (1.759 ms .. 1.807 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 1.785 ms   (1.775 ms .. 1.802 ms)
std dev              44.22 μs   (28.67 μs .. 69.66 μs)
variance introduced by outliers: 13% (moderately inflated)

benchmarking fold/foldMap show float (10000)
time                 8.932 ms   (8.841 ms .. 9.003 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 8.862 ms   (8.825 ms .. 8.899 ms)
std dev              103.9 μs   (87.95 μs .. 123.6 μs)

This branch:

benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap floatDec (10000)
time                 3.662 ms   (3.623 ms .. 3.710 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 3.646 ms   (3.632 ms .. 3.665 ms)
std dev              53.30 μs   (38.45 μs .. 71.58 μs)

benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap show float (10000)
time                 9.931 ms   (9.793 ms .. 10.08 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 9.776 ms   (9.717 ms .. 9.857 ms)
std dev              186.8 μs   (132.5 μs .. 260.9 μs)

My double implementation is not complete (missing fixed-point) but the exponential format is

benchmarking fold/foldMap d2s         (10000)
time                 1.908 ms   (1.899 ms .. 1.917 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.904 ms   (1.901 ms .. 1.908 ms)
std dev              11.77 μs   (9.075 μs .. 15.81 μs)

@Bodigrim
Copy link
Contributor

@Bodigrim Bodigrim commented May 11, 2020

@knupfer @alexbiehl @fumieval is anyone up to review this PR?

@Lumaere would it be possible to port you implementation of Ryu, avoiding extra dependencies?

@la-wu
Copy link
Contributor

@la-wu la-wu commented May 12, 2020

Sure, I could take a look at bringing my implementation into Bytestring. However, for performance, using the C implementation and adding Haskell bindings would ultimately be faster. The C implementation is also more feature complete and has better support for different platforms, word sizes, etc.

@la-wu
Copy link
Contributor

@la-wu la-wu commented May 17, 2020

Here is the PR with the Ryu implementation: #222.

With a slightly less naive benchmark of the Haskell implementation, the performance difference is closer.

benchB "foldMap f2s       " floatData $ foldMap (BB.byteString . f2sGeneral)

results in

benchmarking fold/foldMap f2s        (10000)
time                 1.355 ms   (1.348 ms .. 1.363 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.357 ms   (1.350 ms .. 1.365 ms)
std dev              24.90 μs   (19.70 μs .. 30.80 μs)

But the difference was significant enough that I opted to use the C implementation.

@fumieval
Copy link
Contributor

@fumieval fumieval commented Sep 11, 2020

@nikita-volkov pointed out that this algorithm produces a different result from Show Double: fumieval/mason#5

As discussed in the issue, it's an improvement, but it might break some tests in the ecosystem.

@winterland1989
Copy link
Author

@winterland1989 winterland1989 commented Sep 11, 2020

@nikita-volkov pointed out that this algorithm produces a different result from Show Double: fumieval/mason#5

As discussed in the issue, it's an improvement, but it might break some tests in the ecosystem.

It's up to you to implement the logic here to match base's show. I definitely added the logic to match grisu's output with show's, otherwise, the tests won't pass.

@fumieval
Copy link
Contributor

@fumieval fumieval commented Sep 11, 2020

Ah I see, great.

@neongreen
Copy link

@neongreen neongreen commented Dec 24, 2020

Do we need this given that there is a Ryu pull-request available? #222

My impression is that Ryu is better — am I wrong?

@Bodigrim
Copy link
Contributor

@Bodigrim Bodigrim commented Dec 24, 2020

@neongreen yes, potentially we could abandon this PR, if would be enough to proceed with #222 alone. But if #222 is stalled, finishing this one would be also a decent step forward.

I'd appreciate if someone steps in to shepherd this area of improvements.

@fumieval
Copy link
Contributor

@fumieval fumieval commented Dec 30, 2020

I think #222 is good to go; just some benchmarks are not compiling due to a few missing lines in bytestring.cabal. I asked the author to fix them but we could as well just merge it and fix them in a succeeding commit.

Who's responsible for merging PRs?

@Bodigrim
Copy link
Contributor

@Bodigrim Bodigrim commented Dec 30, 2020

As I mentioned in #222, I am sorry, but I cannot review or maintain such amount of C code. Neither CLC can be made responsible to provide maintainers with C skills in future. I am not particularly happy even with existing cbits/itoa.c, but at least it is very straightforward.

I will not particularly oppose, if other maintainers (CC @sjakobi @hsyl20) are willing to sign off and agree a sustainable way of future support.

Besides this, both PRs lack a comprehensive test suite with a coverage report.

@Bodigrim
Copy link
Contributor

@Bodigrim Bodigrim commented Feb 26, 2021

Closing, superseded by #365.

@Bodigrim Bodigrim closed this Feb 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

10 participants