Reimplement floatDec/doubleDec #115
I have got the benchmarks running again. Grisu3 gives a ~3x boost. Run on my MacBook Pro 2015 (2.7GHz i5): ...
benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap floatDec (10000)
time 6.393 ms (6.283 ms .. 6.506 ms)
0.998 R² (0.996 R² .. 0.999 R²)
mean 6.473 ms (6.411 ms .. 6.544 ms)
std dev 196.3 μs (158.5 μs .. 256.4 μs)
benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap show float (10000)
time 17.74 ms (17.55 ms .. 17.95 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 17.83 ms (17.63 ms .. 18.32 ms)
std dev 740.7 μs (304.3 μs .. 1.398 ms)
benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap doubleDec (10000)
time 10.99 ms (10.90 ms .. 11.10 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 11.20 ms (11.07 ms .. 11.40 ms)
std dev 415.4 μs (176.1 μs .. 645.1 μs)
benchmarking Data.ByteString.Builder/Non-bounded encodings/foldMap show double (10000)
time 30.69 ms (30.27 ms .. 31.14 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 30.70 ms (30.55 ms .. 30.87 ms)
std dev 352.9 μs (285.3 μs .. 450.5 μs)
... |
This looks very interesting. I'll have to review it in detail, and I invite anyone else to do so too. |
Yes, please ; ) |
How does it compare to double-conversion? |
The C/C++ part should be similar, since they're basically the same. But it's quite hard to make a comparison, because this patch uses the primitive builder to fill the buffer, while double-conversion fills the buffer directly in C. I'll add double-conversion to the builder benchmark locally and post the result. Update: It seems double-conversion is doing a better job here. I think we should find the reason.
Further benchmarks indeed show that the time spent in the C part is comparable to Google's double-conversion; much of the time is spent on FFI peeking, itoa, builder filling, etc. If we want to go faster while keeping a backward-compatible format, we have to fill the buffer in C, and port
Note that unlike this patch,
Prelude Data.Double.Conversion.ByteString> toShortest 0.0012
"0.0012"
Prelude Data.Double.Conversion.ByteString> show 0.0012
"1.2e-3"
So for now, this patch is the best I can do. If you have any other ideas, please tell me; I'm happy to give them a try.
@winterland1989 there's no problem in principle with writing directly into the buffer in C. The only thing to worry about is that it's a client-allocated buffer: while you can ensure there's a fixed amount of space in advance, you cannot reallocate it halfway through. Is that OK? Can we determine the maximum size in advance and then return the number of bytes we wrote? If so, then no problem. There's a specific class of builder primitives like that (the prim ones that are not the fixed-size ones). Indeed, the main thing I noticed when I first skimmed the code is that it's not just writing to the buffer on the C side, which I would have assumed is the way to go.
Yes, Bos uses 26, which works fine. What stops me from doing so is the compatibility with GHC's
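The contract discussed above (the caller reserves a fixed maximum, the callee returns the number of bytes it wrote) might look roughly like this on the C side. This is a minimal sketch, not the patch's code: `write_double` and `MAX_DOUBLE_CHARS` are hypothetical names, and `snprintf` with `%.17g` is only a stand-in for the Grisu3 digit generation, so the output is neither shortest nor `show`-compatible. The bound of 26 is the figure quoted in the thread.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed upper bound on the encoded length, as quoted in the thread. */
#define MAX_DOUBLE_CHARS 26

/* Hypothetical helper: write the decimal form of x into buf, which the
 * caller must have allocated with at least MAX_DOUBLE_CHARS bytes.
 * Returns the number of characters written (excluding the trailing NUL),
 * or 0 if formatting failed or would not fit. Never reallocates. */
size_t write_double(double x, char *buf)
{
    /* Stand-in for the real digit generator; "%.17g" needs at most
     * 24 characters plus the NUL for any finite double. */
    int n = snprintf(buf, MAX_DOUBLE_CHARS, "%.17g", x);
    if (n < 0 || n >= MAX_DOUBLE_CHARS)
        return 0;
    return (size_t)n;
}
```

On the Haskell side, the returned count would be what lets a bounded-size builder primitive advance its write pointer by exactly the bytes used.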
I'd really like this to be merged. This would remove my lib's dependence on double-conversion.
    int16_t b_exp, d_exp;
} power;

static const power pow_cache[] =
A comment here is needed. How is this array indexed? How are the elements computed?
Argh, I forgot to submit this review comment in time, sorry for the delay. To see how this array works, please check Section 4, "Cache Powers", of the paper referenced in the code comment. This array is a cache table for converting between base-2 and base-10 powers. It stores its data in the power struct:
typedef struct power
{
    uint64_t fract;
    int16_t b_exp, d_exp;
} power;
The data in the table satisfies this equation: fract * power(2, b_exp) ~= power(10, d_exp). It can be calculated in the following manner:
- Decide the step value and range for d_exp which satisfy the condition from the paper (Section 5). In this patch the original author chose the following (which exceeds the requirement from the paper):
#define MIN_CACHED_EXP -348
#define CACHED_EXP_STEP 8
- For each d_exp from [MIN_CACHED_EXP, MIN_CACHED_EXP + CACHED_EXP_STEP, -MIN_CACHED_EXP], convert power(10, d_exp) to its IEEE representation and record its fract and b_exp parts.
Now we get the table.
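To make the relation fract * power(2, b_exp) ~= power(10, d_exp) concrete, here is a minimal sketch that computes an exact entry for small exponents and the table position of an exponent. The helpers `small_power` and `cache_index` are hypothetical and not part of the patch. The sketch only handles 0 <= d_exp <= 19, where 10^d_exp fits in 64 bits and the relation holds exactly; generating the full range needs higher-precision arithmetic.

```c
#include <stdint.h>

typedef struct power
{
    uint64_t fract;
    int16_t b_exp, d_exp;
} power;

#define MIN_CACHED_EXP  (-348)
#define CACHED_EXP_STEP 8

/* Hypothetical helper: position in the table of a cached decimal
 * exponent, assuming entries start at MIN_CACHED_EXP and advance by
 * CACHED_EXP_STEP. */
int cache_index(int d_exp)
{
    return (d_exp - MIN_CACHED_EXP) / CACHED_EXP_STEP;
}

/* Hypothetical helper: exact entry for 0 <= d_exp <= 19. The fraction
 * is normalized so that its top bit is set, and b_exp compensates for
 * the left shift. */
power small_power(int d_exp)
{
    uint64_t f = 1;
    int shift = 0;
    power p;

    for (int i = 0; i < d_exp; i++)
        f *= 10;                      /* f = 10^d_exp, no overflow up to 19 */
    while ((f & (UINT64_C(1) << 63)) == 0) {
        f <<= 1;                      /* normalize: move the top bit to bit 63 */
        shift++;
    }
    p.fract = f;
    p.b_exp = (int16_t)(-shift);      /* fract * 2^b_exp == 10^d_exp */
    p.d_exp = (int16_t)d_exp;
    return p;
}
```

For example, d_exp = 4 gives fract = 0x9c40000000000000 and b_exp = -50, since 10000 * 2^50 * 2^-50 = 10^4, and under the stated assumptions that entry sits at index 44.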
If I understand @bgamari correctly, he wanted your comment to be in the source code. (I agree that that would be useful).
I agree that it would be nice to have this merged. However, it's hard to review without proper commenting. |
ping @winterland1989 |
I added notes on table generation. Are there any problems?
+1 for faster floating point encoding! |
I was trying out Grisu3 for decimal encoding on mason, my alternative bytestring builder library. It seems to work perfectly as far as I can see, and is ~20x faster than |
I wonder how this compares against the Ryu algorithm. @Lumaere would you possibly take a look? @sergv could you possibly weigh in and review?
Ulf Adams did an analysis of Ryu vs Grisu3 in his paper. Regarding performance
My implementation is in pure Haskell but does use malloc and directly writes bits for performance. Ryu on this benchmark
on my machine (i7-8700k):
This branch:
My double implementation is not complete (missing fixed-point) but the exponential format is
|
@knupfer @alexbiehl @fumieval is anyone up to review this PR? @Lumaere would it be possible to port your implementation of Ryu, avoiding extra dependencies?
Sure, I could take a look at bringing my implementation into Bytestring. However, for performance, using the C implementation and adding Haskell bindings would ultimately be faster. The C implementation is also more feature complete and has better support for different platforms, word sizes, etc. |
Here is the PR with the Ryu implementation: #222. With a slightly less naive benchmark of the Haskell implementation, the performance difference is closer.
results in
But the difference was significant enough that I opted to use the C implementation. |
@nikita-volkov pointed out that this algorithm produces a different result from As discussed in the issue, it's an improvement, but it might break some tests in the ecosystem. |
It's up to you to implement the logic here to match base's show. I definitely added the logic to match grisu's output with show's, otherwise, the tests won't pass. |
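The layout logic needed to match `show` can be sketched as follows. This is an illustration under stated assumptions, not the patch's code: `show_layout` is a hypothetical name, the input is assumed to be the shortest decimal digits plus an exponent `e` such that the value equals 0.d1d2... * 10^e (the convention of `floatToDigits`), and the rule assumed for `show` is that it switches to scientific notation when `e < 0` or `e > 7`. Sign, NaN and infinity handling are left to the caller.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: lay out decimal digits the way Haskell's `show`
 * is assumed to for a non-negative Double. `digits` holds ASCII digits
 * with value 0.d1d2... * 10^e. Returns the number of characters written
 * (excluding the NUL), or 0 if the result does not fit in `cap` bytes. */
size_t show_layout(const char *digits, int e, char *out, size_t cap)
{
    size_t n = strlen(digits);
    int w;

    if (e < 0 || e > 7) {
        /* Scientific: one digit, '.', remaining digits (or "0"), exponent e-1. */
        const char *rest = n > 1 ? digits + 1 : "0";
        w = snprintf(out, cap, "%c.%se%d", digits[0], rest, e - 1);
    } else if (e == 0) {
        /* Value below 1 in fixed notation: all digits follow "0.". */
        w = snprintf(out, cap, "0.%s", digits);
    } else {
        /* Fixed: the first e digits (zero-padded) form the integer part. */
        char ip[8];
        for (int i = 0; i < e; i++)
            ip[i] = (size_t)i < n ? digits[i] : '0';
        ip[e] = '\0';
        const char *frac = n > (size_t)e ? digits + e : "0";
        w = snprintf(out, cap, "%s.%s", ip, frac);
    }
    if (w < 0 || (size_t)w >= cap)
        return 0;
    return (size_t)w;
}
```

With digits "12" and e = -2 (that is, 0.0012), this yields "1.2e-3", the `show` form quoted earlier in the thread, rather than the "0.0012" returned by `toShortest`.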
Ah I see, great. |
Do we need this given that there is a Ryu pull-request available? #222 My impression is that Ryu is better — am I wrong? |
@neongreen yes, potentially we could abandon this PR, if it would be enough to proceed with #222 alone. But if #222 is stalled, finishing this one would also be a decent step forward. I'd appreciate it if someone stepped in to shepherd this area of improvements.
I think #222 is good to go; just some benchmarks are not compiling due to a few missing lines in bytestring.cabal. I asked the author to fix them but we could as well just merge it and fix them in a succeeding commit. Who's responsible for merging PRs? |
As I mentioned in #222, I am sorry, but I cannot review or maintain such an amount of C code. Nor can the CLC be made responsible for providing maintainers with C skills in the future. I am not particularly happy even with existing I will not particularly oppose if other maintainers (CC @sjakobi @hsyl20) are willing to sign off and agree on a sustainable way of future support. Besides this, both PRs lack a comprehensive test suite with a coverage report.
Closing, superseded by #365. |
Sorry for the delay. The floating-point format problem is definitely not as simple as I thought, but I finally made some progress. In short, this patch:
- Implement doubleDec/floatDec using grisu3 with dragon4 fallback with reference here.
- Implement floatDec using dragon4 with floatToDigits in GHC.Float.
Benchmarks are uploaded, please give a review any time you want.