
JSON does allow better than IEEE 754 numbers #143

Closed
nicowilliams opened this issue Jun 5, 2013 · 20 comments

@nicowilliams
Contributor

The issue was raised at the IETF JSON WG this week: does JSON permit arbitrary precision numbers? The syntax certainly does. Neither RFC4627 nor ECMAScript 6 (draft) preclude it, and many implementations use whatever the native language run-time, OS, and/or machine architecture provide -- often not limited to IEEE 754 64-bit real numbers.

jq currently only supports IEEE 754 64-bit numbers. It could probably do better: also 64-bit signed integers using C int64_t (perhaps with much care regarding overflows, or perhaps not). It could even use bignum libraries. It's all probably too much, but I'm filing (and closing) this issue just to record this.

@stedolan
Contributor

stedolan commented Jun 5, 2013

Some points to consider:

  • The JSON spec doesn't distinguish integers from floating-point at the moment. The spec is horribly vague on semantics, but since the grammar just describes "numbers" it seems that it might be against the spirit of the thing to consider 42 and 42.0 unequal. So, if jq were to support more than doubles, we'd need to know when to implicitly convert between representations. Should 1/2 * 1e20 give a double or an integer? If a calculation on doubles produces a number bigger than 1e20 or so, then IEEE754 will represent it as an integer. Should it be converted to bigint? What if 10 is then added to such a number? Should it increase? (it won't if it's a double).
  • <rant> Speaking of semantics, the JSON spec should really define equality of two JSON documents. In particular, there are a number of transformations that you can do to a JSON document without changing its meaning (reordering object fields, adding/removing whitespace outside of strings, and changing unicode escapes to characters and back inside strings). These all seem "safe" according to a vague reading of the spec, but it's not nailed down anywhere. Having them in the spec would prevent people doing this nonsense and calling the result valid JSON. </rant>
  • Finite IEEE754 doubles are, as far as I know, the only set of numbers supported by all JSON implementations. The advantage of JSON over, say, XML, is that all implementations speak the same language: there are no optional features or weird corner cases that only some implementations support (e.g. processing instructions, fetching entities from remote DTDs, etc). Bignums would be useful, but there are advantages to knowing your document is safe to transfer though any other JSON system, so it might be worthwhile sticking with doubles.
  • I'm not sure I see any use at all for 64-bit integers. Arbitrary-sized bigints might well be useful, but I see no reason to want int64.
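The conversion questions in the first point are easy to demonstrate. Here is a small sketch in Python, whose float type is an IEEE 754 64-bit double:

```python
# Integer-valued doubles past 2^53 sit on a widening grid, so adding
# a small integer can be silently lost in rounding.
big = 1e20
assert big + 10 == big                            # the +10 disappears

# Below 2^53 every integer is exact; at 2^53 the gaps begin.
assert float(2 ** 53 + 1) == float(2 ** 53)       # rounds back down
assert float(2 ** 53 - 1) == float(2 ** 53) - 1   # still exact just below
```

Any bigint/double hybrid has to decide, at exactly this boundary, which representation a result like big + 10 should take.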

@nicowilliams
Contributor Author

Sounds like you want to subscribe to json@ietf.org (https://www.ietf.org/mailman/listinfo/json) :)

Also, no, not all JSON implementations support IEEE 754 doubles. Heimdal's, for example, only supports C ints. Others on the json@ietf.org list have made the same point: implementations generally use whatever representation their environment most natively provides. An interoperable subset would be as hard to nail down as water. The ship has sailed and all that.

As for 64-bit integers, they have a significant advantage over doubles: exactness for every integer that fits in 64 bits. It's quite common on the SQLite3 list, for example, to see recommendations that monetary values be expressed as a whole number of cents or fractions of cents to avoid problems with IEEE 754 doubles. (Of course, one then has to express the shift factor somewhere, such as in a schema, and be prepared to change it when dealing with rapidly-inflating currencies, so 64-bit integers aren't exactly great either, but they are better than doubles for many uses.)

@stedolan
Contributor

stedolan commented Jun 5, 2013

Thanks for the link, I'll join that mailing list.

I wasn't aware of Heimdal's implementation. Looking at the code, it seems to be unable to parse "[42.0]", so I'm not sure it's right to call it a "JSON parser". The printer does seem to output valid JSON, but all of the values it outputs are representable exactly as doubles (by virtue of being 32-bit ints).

Doubles can do perfectly exact integer arithmetic on all integers from -2^53 to +2^53. This range is enough to encode the world GDP in US$ cents, or a Unix timestamp in microseconds. When you step outside their range (e.g. with a rapidly-inflating currency), doubles lose information at the 16th significant figure, while 64-bit ints lose information at the first. Still not seeing the uses where 64-bit ints win :)
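Both claims check out in a quick sketch (Python, whose float is an IEEE 754 double; the timestamp value below is made up for illustration):

```python
# A Unix timestamp in microseconds fits comfortably inside a double's
# exact-integer range of +/- 2^53.
ts_us = 1_700_000_000 * 1_000_000        # hypothetical timestamp, ~1.7e15
assert ts_us < 2 ** 53
assert int(float(ts_us)) == ts_us        # survives the round trip exactly

# Past 2^53, a double drops information around the 16th significant digit:
assert int(float(10 ** 17 + 1)) == 10 ** 17
```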

@stedolan
Contributor

stedolan commented Jun 5, 2013

Of course, one then has to express the shift factor somewhere, such as in a schema, and then one has to be prepared to change that [...]

Why not express it in the top 11 bits? :-)

@nicowilliams
Contributor Author

On Wed, Jun 5, 2013 at 9:47 AM, Stephen Dolan notifications@github.com wrote:

Thanks for the link, I'll join that mailing list.

Good! Review the archives (it's a new list; there's not a lot there yet). And the WG charter and work items:

http://datatracker.ietf.org/wg/json/charter/
http://datatracker.ietf.org/wg/json/

The charter is always subject to change via the normal IETF process, if there's anything amiss there.

I wasn't aware of Heimdal's implementation. Looking at the code, it seems to be unable to parse "[42.0]", so I'm not sure it's right to call it a "JSON parser". The printer does seem to output valid JSON, but all of the values it outputs are representable exactly as doubles (by virtue of being 32-bit ints).

Indeed, it only parses integers. And the parser is not online (it's recursive), and it has some other issues. But it illustrates the problem in that it's close enough.

Doubles can do perfectly exact integer arithmetic on all integers from -2^53 to +2^53. This range is enough to encode the world GDP in US$ cents, or a Unix timestamp in microseconds. When you step outside their range (e.g. with a rapidly-inflating currency), doubles lose information at the 16th significant figure, while 64-bit ints lose information at the first. Still not seeing the uses where 64-bit ints win :)

Think of totaling up all liabilities in USD. I hear that altogether U.S. public and private sector debt, including un- and underfunded pensions, social programs, and so on, totals up to near $100e12, which is getting uncomfortably close to 53 bits and blows past it if you want cent or fractional-cent precision. Now do the same for Yen.

Some people dealing with financial and math applications routinely deal with numbers that large and want no errors. 53-bit integers won't cut it.
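The back-of-the-envelope figures here do put such totals past a double's exact range (a Python check; the $100e12 figure is the estimate quoted above, not verified data):

```python
# $100e12 expressed as a whole number of cents:
total_cents = 100 * 10 ** 12 * 100    # 1e16 cents
assert 2 ** 53 == 9007199254740992    # a double's exact-integer limit, ~9.0e15
assert total_cents > 2 ** 53          # past what a double can count exactly
assert total_cents < 2 ** 63          # but still well inside a signed int64
```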

@stedolan
Contributor

stedolan commented Jun 6, 2013

Right, but if we're working with $100e12 in fractions of a Yen, we can hit 2^63 (depending on how small "fractions" are). And if we go past the limit in doubles, we keep doing integer arithmetic in units of 2, 4, 8, ... fractions of a cent. If we go past the limit in 64-bit ints, we get nonsense.

I don't think there are financial reasons to care about more than 16 significant digits. I don't believe there is financial data that has more than about 10. There's lots of mathy reasons to want large integers, but they aren't interesting if 2^64 is an upper limit. Crypto's probably the main use case, but JSON representations of that stuff tend to be base64 encoded binary data rather than numbers anyway.

I see the point of having high-precision numbers in JSON. I'd argue that the advantages of trusting that your data can pass through any other JSON-compatible system unaffected outweigh the pain of having to encode really big numbers as something other than JSON numbers, but I definitely see the counter-argument.

I don't see the point of having 64-bit integers, though. There seems to be no real use case where they're better than doubles - they have a larger arbitrary limit but much much worse behavior when you hit it. I still can't think of any application where you need to do numeric calculations, your numbers might be bigger than 2^53, you can guarantee that they will be less than 2^63, and you need to represent all integers less than 2^63 and no floating-point numbers.

@nicowilliams
Contributor Author

Right, but if we're working with $100e12 in fractions of a Yen, we can hit 2^63 (depending on how small "fractions" are). And if we go past the limit in doubles, we keep doing integer arithmetic in units of 2, 4, 8, ... fractions of a cent. If we go past the limit in 64-bit ints, we get nonsense.

I don't think there are financial reasons to care about more than 16 significant digits. I don't believe there is financial data that has more than about 10. There's lots of mathy reasons to want large integers, but they aren't interesting if 2^64 is an upper limit. Crypto's probably the main use case, but JSON representations of that stuff tend to be base64 encoded binary data rather than numbers anyway.

I won't try to convince you. I'll stop at pointing out that this comes up frequently on the SQLite3 list, and the advice is almost always to use integers (which are signed 64-bit in SQLite3) for monetary values.

I see the point of having high-precision numbers in JSON. I'd argue that the advantages of trusting that your data can pass through any other JSON-compatible system unaffected outweigh the pain of having to encode really big numbers as something other than JSON numbers, but I definitely see the counter-argument.

Particularly for a filter.

For a C libjv I might not care: chances are I'm NOT using bignums. But for a filter I can see people caring about maximizing interop with many possible encoders/parsers.

I don't see the point of having 64-bit integers, though. There seems to be no real use case where they're better than doubles - they have a larger arbitrary limit but much much worse behavior when you hit it. I still can't think of any application where you need to do numeric calculations, your numbers might be bigger than 2^53, you can guarantee that they will be less than 2^63, and you need to represent all integers less than 2^63 and no floating-point numbers.

If you'd add bignums, I agree. If not, it's just 10 more bits of precise integer-ness, but it's not that 10 more bits is Earth-shaking, it's the precision. In any case, if you're willing to do anything here I think it'd be best to add arbitrary-precision reals rather than 64-bit ints -- much more bang for your buck.

@stedolan
Contributor

stedolan commented Jun 7, 2013

Yeah, this is definitely getting into a long silly argument.

Re SQLite: you should definitely represent monetary values as integers, in cents or in hundredths of a cent, etc. That way you're doing arithmetic in the same units as everyone else, since 0.01 is not exactly representable in binary. My point was just that IEEE 754 doubles are a pretty good integer format: if you're storing integer numeric data, I think 53-bit ints with sensible overflow behaviour are usually a better choice than 63-bit ints with catastrophic overflow behaviour.
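The overflow contrast can be sketched in Python, using ctypes.c_int64 to emulate wrapping 64-bit machine arithmetic (in C itself, signed overflow is undefined behaviour, which is arguably even worse):

```python
import ctypes

big = 2 ** 62 + 12345

# Past 2^53 a double rounds, losing only low-order digits:
as_double = float(big)
assert abs(as_double - big) <= 1024           # error bounded by one ulp

# A 64-bit signed int that overflows wraps around to nonsense:
wrapped = ctypes.c_int64(big * 2 + 1).value   # exceeds 2^63 - 1
assert wrapped < 0                            # catastrophic: the sign flips
```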

Also, agreed: arbitrary precision reals (maybe implemented as rationals given by a pair of bigints?) would be useful, much more so than 64-bit ints.

@nicowilliams
Contributor Author

Yeah, this is definitely getting into a long silly argument.

I wasn't arguing for anything in particular. Re: 64-bit ints I was
only being a devil's advocate.

Also, agreed: arbitrary precision reals (maybe implemented as rationals given by a pair of bigints?) would be useful, much more so than 64-bit ints.

Agreed. I think arbitrary-precision reals are probably easier to represent (a string of digits, with a decimal point somewhere [possibly], and possibly a leading sign, the whole thing possibly in BCD). But if you could more easily use a bignum library, then representing reals as ratios makes sense.

@dfkoh

dfkoh commented Jul 15, 2013

I don't see the point of having 64-bit integers, though. There seems to be no real use case where they're better than doubles

tl;dr - rounding 64-bit ints is often unacceptable behavior for people who use JSON but not Javascript.

So I just came across jq, and I've been playing around with it a bit. I'm working with logs that have been serialized with JSON, so I often find myself writing short scripts to convert my data from one form to another for various data-processing tasks. That's what I was considering using jq for, and it seems like it would work great except (you guessed it) it doesn't handle 64-bit ints.

Naturally, my data has some 64-bit integers (unsigned, even). For the most part, they're used as ids, but I don't exactly have the option of changing them to strings since there's already years of code that expects these things to be ints when they come out of the serialization, and actually uses the int-ness of them for useful purposes (e.g. sharding based on the high order bits, A/B testing based on the low order bits). Javascript, as you can probably imagine, isn't part of my data processing toolchain, so this doesn't really cause a problem for me normally.

My point here isn't that 64-bit ints are necessarily better than doubles or strings in any important way, but that there's a lot of languages that support them, and as a result a lot of software that uses them and also uses JSON as a serialization format for data that includes them. People who have to deal with 64-bit ints may not be able to use jq, or it may create some subtle bugs for them if they do.

Since there's probably a decent overlap between people who use JSON but not Javascript, and people who would find jq useful, you might want to reconsider adding 64-bit int support. If I find myself with some unexpected free time I might dig into your code and submit a pull request, but for now I'll content myself with leaving this comment and sticking with my hand-rolled Python.

@shimaore

shimaore commented Nov 5, 2013

tl;dr jq shouldn't attempt to interpret numbers when it doesn't need to; and since it only manipulates rational numbers, it should use big-rational-numbers internally when doing so.

@stedolan I just discovered jq (great idea!) and saw it break in less than thirty seconds :(

$ echo 52525252525252525252 | jq '.'
52525252525252530000

Simply put, this isn't what I expect from a Unix filter.

My expectation was that jq would store numbers as-is (i.e. as the original format, as long as it is valid JSON). In other words, if the input says 0.5e20 then output 0.5e20.

tonumber should be a no-op for strings as long as they contain a valid number (for whatever JSON syntax defines as number). If the input of tonumber is "0.5e20" then the output should be 0.5e20.

In all the above cases there's no need to try to put some semantics on the input content. Only syntax is involved (for validation).

Should 1/2 * 1e20 give a double or an integer?

For what it's worth, it is an integer

$ calc -e '1/2*1e20'
50000000000000000000

but that's not really important. 42, 42.0, 42.00000000000, 4.2e1, 0.042e3 are all valid JSON representations of the same, exact number.

More generally speaking, the numbers that can be expressed in JSON / ECMA-404 all have the semantics of "(some integer) times (ten to the power of some integer)", because that's all the syntax allows. [The syntax allows for things like 0.5 or 0.3333333333333 and that's "(5) times (ten to the power of -1)" and "(3333333333333) times (ten to the power of -13)" respectively.] These are all exact rational numbers. [Strictly speaking the syntax does allow for an infinite number of digits but that won't happen in this universe.]

If the integer in "ten-to-the-power-of some integer" is greater than or equal to zero, then the overall number is what is called an integer, but in terms of arithmetic operations inside jq they shouldn't be special.
Because really, the only time numbers need to be semantically interpreted inside jq is during arithmetic operations; in which case I'd expect jq to use some form of "arbitrary precision" (rational numbers) arithmetic to obtain exact results, and print the result out based on some command-line parameter (to limit the number of digits being displayed for fractions such as 1/3 etc.), in a way similar to the config("display") parameter of calc.
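The "integer times a power of ten" reading can be made concrete with Python's Decimal and Fraction, which parse the literal without ever passing through a double (a sketch of the idea, not jq code):

```python
from decimal import Decimal
from fractions import Fraction

# Every finite JSON number literal denotes an exact rational value.
for text in ("42", "42.0", "4.2e1", "0.042e3", "0.5e20"):
    exact = Fraction(Decimal(text))
    print(text, "->", exact)

# All spellings of 42 agree, and 0.5e20 is an exact integer:
assert Fraction(Decimal("42")) == Fraction(Decimal("4.2e1"))
assert Fraction(Decimal("0.5e20")) == 50_000_000_000_000_000_000
```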

I'm happy to look into integrating calc's (or another library's bignum implementation) into jq if that helps.

@stedolan
Contributor

stedolan commented Nov 7, 2013

jq does deal with more than rational numbers at the moment: there's a sqrt function.

Arbitrary-precision integers would be nice. Arbitrary-precision rationals would also be nice but can lead to unexpected performance (even a small number can take an arbitrary amount of memory). Still, they're probably a good idea.

I'm not sure I like preserving the exact string syntax of an input number. That would mean that we could have $a == $b, but ($a | tostring) != ($b | tostring). I like my functions to preserve equivalence.

Also, while we're being pedantic, the JSON syntax doesn't allow infinite digit strings, since the * in [0-9]* is the Kleene star :)
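The equivalence worry can be stated concretely (a Python sketch of the hypothetical literal-preserving behaviour, not jq code):

```python
import json

a, b = "42", "42.0"

# The two literals parse to equal values...
assert json.loads(a) == json.loads(b)

# ...but their spellings differ, so a tostring that preserved input
# syntax would map equal values to unequal strings, breaking
# x == y  =>  tostring(x) == tostring(y).
assert a != b
```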

@shimaore

shimaore commented Nov 7, 2013

jq does deal with more than rational numbers at the moment: there's a sqrt function.

Oops, missed that. It deserves to be documented on http://stedolan.github.io/jq/manual/

[...] while we're being pedantic [...] :)

Wasn't trying to be, sorry if I came across that way. :)

I like my functions to preserve equivalence.

Good point. tostring-as-no-op really is poor man's arbitrary precision when not doing arithmetic.

There's probably sense in letting the user choose which semantics they want to use ("as-is or IEEE754", "IEEE754 always", "as-is or arbitrary-precision", "arbitrary-precision always").

Well, time for me to stop talking and instead try and write some code.

@dfkoh

dfkoh commented Nov 12, 2013

I actually started using jq again (it's so useful!) and ended up adding a patch that leaves numbers as-is until they have some operation act on them, then it goes back to doubles. It's on my forked copy: https://github.com/airfrog/jq

I agree that it's not a great solution since it has some frustrating edge cases. I might go back and do it right with arbitrary precision integers, but for now this is working.

Has anyone found a bigint/bignum library that would work well for jq? I poked around a bit but I haven't really researched it, and ideally I don't want to add any dependencies to the library.

@tischwa

tischwa commented Nov 20, 2013

jq is really a great tool / filter!

But I do agree with airfrog that jq should not touch the representation of a number unless it has to (e.g. because a computation was done). If a number is simply fed through, its representation should not change at all.

My use case is processing huge logs with POSIX timestamps in microseconds. They can be represented by doubles, so there is no problem. But still (with current HEAD):

% echo 1384975447132984 | jq '.'
1384975447132984
% echo 1384975447000000 | jq '.'
1.384975447e+15

So my integer is suddenly converted to a floating-point representation, and the Unix filters down the pipe suddenly have to deal with float numbers too.

So I think the right way would be, not to change the representation of a number, if the number is just piped through.

@MFornander

First of all, awesome tool, thank you. Also...

Wow, I just lost an evening trying to debug my JsonWriter class when it was jq that was truncating my 64-bit IDs (63 bits, actually). If it had converted them to scientific notation I would have noticed, but not passing a clean int64 through makes this a much less useful pipeline tool :(

echo "5093397704957986680" | jq '.'
5093397704957987000
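The truncation reported here is the double round-trip at work; a quick Python check (Python's float is the same IEEE 754 64-bit double):

```python
# A 63-bit ID does not survive a round trip through a double: near
# 5e18 the representable doubles are 1024 apart, so the low digits
# are rounded away.
id_value = 5093397704957986680
assert int(float(id_value)) != id_value
# The relative error is tiny, but for an identifier any change is fatal.
```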

@rsrsps

rsrsps commented Jan 7, 2015

I'm not sure I see any use at all for 64-bit integers. Arbitrary-sized bigints might well be useful, but I see no reason to want int64.

Concrete values that only make sense as a number like Unix epoch nanotime already exceed the resolution available in the implementations that use floats for large integers. This is genuinely annoying as the Javascript implementations quietly truncate the values.

(just adding)

@tomwhoiscontrary

I just tripped myself up using jq with nanosecond-scale unix timestamps:

$ echo '{"timestamp":1499816949527975237}' | jq .timestamp
1499816949527975200

I appreciate that there are some real headaches around using different number implementations according to the value. But there are also headaches around silently changing the values in the data.

If support for numbers not exactly representable as 64-bit floats isn't on the cards, how about an error exit rather than approximation?

@lambda-fairy

Just got bitten by this today.

The API we use at work associates each entity with a 64-bit integer ID. I started to question the output of jq when some of these IDs were showing up as equal.

@leonid-s-usov
Contributor

Please try this branch #1752
