enhancement request: avoid displaying floating point approximations to integers as integers #529

Closed
pkoppstein opened this issue Aug 4, 2014 · 17 comments

@pkoppstein
Contributor

The following enhancement request is related to but quite distinct from #369.

Currently, jq nicely transitions from integers to floats in a noble attempt to mitigate the absence of BigInt support, but by displaying approximations to integers as though they were integers, things can be needlessly confusing at best and misleading at worst.

Consider:

$ jq -n '9999999999999995 + 1'
9999999999999996                         # nice

$ jq -n '9999999999999996 + 1'
9999999999999996                         # a bit confusing - can jq do integer arithmetic correctly? (**)

$ jq -n '(9999999999999996 * 10) / 10'
9999999999999996                         # well, it can still do arithmetic

$ jq -n '9999999999999996 + 3'
1e+16                                    # and sometimes with clarity!

It is easy to understand that at some point we must lose precision, so the problem here is simply that at (**), jq is misleadingly (incorrectly?) displaying a floating point value (an approximation to 9999999999999997) as an exact integer (9999999999999996).
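The session above can be reproduced outside jq. What follows is a minimal Python sketch, assuming jq holds every number as a C double and that Python's `float` is the same IEEE 754 double; the helper `show` is hypothetical and only strips the trailing `.0` that Python prints.

```python
def show(x: float) -> str:
    """Hypothetical helper: repr() without Python's trailing '.0'."""
    s = repr(x)
    return s[:-2] if s.endswith(".0") else s


a = float("9999999999999995")  # odd and above 2**53, so the literal itself is rounded
b = float("9999999999999996")  # even, exactly representable as a double

print(show(a))            # 9999999999999996
print(show(a + 1))        # 9999999999999996
print(show(b + 1))        # 9999999999999996 (the exact sum ...997 is not representable)
print(show(b * 10 / 10))  # 9999999999999996
print(show(b + 3))        # 1e+16
```

Under these assumptions the first "nice" result looks like a coincidence: ...995 is already rounded to ...996 when the literal is read, and adding 1 rounds back to ...996.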

@nicowilliams
Contributor

This is a duplicate of many earlier issues about the limitations of IEEE754.

@nicowilliams
Contributor

Dup of #218 and others.

@pkoppstein
Contributor Author

@nicowilliams wrote:

> This is a duplicate of many earlier issues about the limitations of IEEE754.

The problem that I was trying to pinpoint has nothing to do with the limitations of IEEE754, and everything to do with the details in the implementation of jvp_dtoa_fmt. I am just asking that certain floating point numbers be printed with the "e" notation to make it clear that they are floating point.

I am not sure how best to do this in a portable way, but one possibility would be to check the proposed output before printing it and then adjusting the representation appropriately. Essentially, using the notation trial(x) to mean the trial string representation of x, the adjustment would be:

if trial(x) has no e or decimal point and if trial(x+1) is the same as trial(x) then use the e notation.
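The rule above could be sketched as follows. This is a Python illustration under stated assumptions, not a patch to `jvp_dtoa_fmt`: `trial` and `display` are hypothetical names, `trial` approximates jq's output by taking Python's shortest round-trip `repr` and dropping a trailing `.0`, and the 17-digit `e` format is just one possible choice of "e notation".

```python
def trial(x: float) -> str:
    """Hypothetical stand-in for the trial string representation of x."""
    s = repr(x)
    return s[:-2] if s.endswith(".0") else s


def display(x: float) -> str:
    """Apply the proposed adjustment: flag integer-looking output that hides rounding."""
    t = trial(x)
    looks_like_integer = "e" not in t and "." not in t
    if looks_like_integer and trial(x + 1.0) == t:
        # x + 1 is indistinguishable from x, so make the float nature visible.
        return format(x, ".16e")
    return t


print(display(42.0))                # 42
print(display(9999999999999996.0))  # printed in e notation
print(display(9007199254740994.0))  # 9007199254740994
```

The last line shows a limitation of this particular sketch: for some values above 2^53, `x + 1` rounds up to the next double rather than back to `x`, so the test does not flag them.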

@nicowilliams
Contributor

@pkoppstein IEEE754 can represent integers exactly in the range -2^52..2^52. 9999999999999995 is larger than 2^53, therefore well out of the range of IEEE754 exact integer range, therefore also well out of jq's.
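The range claim can be checked with a short Python sketch, assuming Python's `float` is an IEEE 754 double; `is_exact` is a hypothetical helper. On that assumption the contiguous exact-integer range appears to extend to 2^53, one power of two beyond the bound quoted above, and the example value lies beyond either bound.

```python
def is_exact(n: int) -> bool:
    """True if the integer n survives a round trip through a double."""
    return int(float(n)) == n


print(is_exact(2**52 + 1))         # True
print(is_exact(2**53))             # True
print(is_exact(2**53 + 1))         # False
print(9999999999999995 > 2**53)    # True
print(is_exact(9999999999999995))  # False
```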

@nicowilliams
Contributor

All numbers in jq are "floating point" in that they are C doubles (generally IEEE 754). There is no canonical way to print numbers in JSON. This has been the subject of much debate. Check the IETF JSON WG list archives :( (or don't: you'll find it likely to suck up a lot of your time).

@pkoppstein
Contributor Author

@nicowilliams wrote:

> IEEE754 can represent integers exactly in the range -2^52..2^52.

Great! If my simplistic algorithm offends, why not print anything outside that range with an e (so long as jq has no BigInt support)? That is easily defensible and fits in with the line of reasoning that you have previously used in discussing these number representation issues. Most important, it would be far better than the current situation, which invites confusion and disappointment (at best).

That is, this change could be implemented without waiting for the larger issues to be resolved.
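A range-based rule of this kind might look like the Python sketch below. It is only an illustration: `fmt` and `LIMIT` are hypothetical names, the bound is the one quoted in this thread, and the 17-digit `e` format is an arbitrary choice that happens to round-trip.

```python
LIMIT = 2.0 ** 52  # bound quoted in this thread (an assumption of this sketch)


def fmt(x: float) -> str:
    """Print in-range integers plainly and out-of-range integers with an exponent."""
    if x.is_integer():
        if abs(x) <= LIMIT:
            return str(int(x))
        return format(x, ".16e")
    # Non-integers, infinities and NaN are left to the default formatting.
    return repr(x)


print(fmt(1e9))                 # 1000000000
print(fmt(9999999999999996.0))  # printed in e notation
print(fmt(0.5))                 # 0.5
```

Unlike a test based on `x + 1`, this rule depends only on the magnitude of the value, so every integer beyond the bound is treated the same way.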

@nicowilliams
Contributor

Look at past issues about this, and at various mailing lists. I don't think we can pick a way to format numbers that everyone will be happy with.

@pkoppstein
Contributor Author

@nicowilliams wrote:

> Look at past issues

The motivation for writing this ER was based on my review of the issues and other documentation! Specifically, I was able to disentangle some of the issues, and it seemed to me (as it still does) that one way to address one class of issues would be to make a (very small) change in jvp_dtoa_fmt.

As I've said, I cannot see any sound reason NOT to make such a change -- it's a Pareto improvement!

@nicowilliams
Contributor

I'm not clear as to what your proposed change is. Do you have a PR I could review?

@nicowilliams
Contributor

If you meant that any values outside the -2^52..2^52 range should be output in scientific notation, I think that'd be fine, but you couldn't rely on that to indicate that the number is too large: you'd still have to parse it.

Now what about integers in that range? When should they be output in scientific notation, and when shouldn't they? E.g., 1e9: 1000000000, or 1e9? IF we never print those in scientific notation and always print integers outside the -2^52..2^52 range in scientific notation then that might help one detect data loss at a glance, which I think is what you want, and you're right, it'd be easy to make this happen (I think).

BUT, there are users who want input form preserved for numbers that are passed through to output unchanged. (There's a couple of issues about that.) I'm certain that we can't make everyone happy as long as we stick to IEEE754, and yet IEEE754 is the industry standard -- switching to bignums won't necessarily help in all cases.

@nicowilliams
Contributor

I'm not naysaying, BTW.

@pkoppstein
Contributor Author

@nicowilliams wrote:

> If you meant that any values outside the -2^52..2^52 range should be output in scientific notation, I think that'd be fine

Excellent!!! I think that that would completely resolve this particular "issue", while avoiding the ruffling of any feathers.

> but you couldn't rely on that to indicate that the number is too large: you'd still have to parse it.

The goal here is to have a defensible Pareto improvement, not to resolve all the issues related to numbers.

> Now what about integers in that range? When should they be output in scientific notation, and when shouldn't they? E.g., 1e9: 1000000000, or 1e9?

That is an interesting question, but it's totally separable from the issue here, which primarily concerns integers outside that range. Perhaps it would be better to open a different "incident report" to avoid muddying the waters here, but since you ask, let me offer two different answers from two slightly different perspectives:

  1. From the perspective of jq's current approach to dealing with numbers in general, it doesn't matter, in the sense that jq doesn't track whether it read what I'll call a "JSON integer" ([0-9]+) or something else.

  2. Within the sandbox of jq's current approach to numbers in general, the current behavior (which I take to be preferring the "JSON integer" representation) is fine. It is defensible and useful.

> IF we never print those in scientific notation and always print integers outside the -2^52..2^52 range in scientific notation then that might help one detect data loss at a glance, which I think is what you want, and you're right, it'd be easy to make this happen (I think).

That goal is worthy, but it's not necessary to achieve it in order to resolve this particular "issue" (#529).

> BUT, there are users who want input form preserved for numbers that are passed through to output unchanged.

Understood.

> (There's a couple of issues about that.)

Are there any issues besides implementation and backward compatibility issues?

> I'm certain that we can't make everyone happy as long as we stick to IEEE754, and yet IEEE754 is the industry standard -- switching to bignums won't necessarily help in all cases.

Yes, I completely agree. Switching to BigInt will raise some interesting issues too.

@nicowilliams
Contributor

@pkoppstein I didn't commit to making any changes; I'm tempted to leave it all as-is until we get to bignum support. Please look through the list of issues.

@nicowilliams
Contributor

There are some options supported internally already, but none that do what you'd like:

    mode:
            0 ==> shortest string that yields d when read in
                    and rounded to nearest.
            1 ==> like 0, but with Steele & White stopping rule;
                    e.g. with IEEE P754 arithmetic , mode 0 gives
                    1e23 whereas mode 1 gives 9.999999999999999e22.
            2 ==> max(1,ndigits) significant digits.  This gives a
                    return value similar to that of ecvt, except
                    that trailing zeros are suppressed.
            3 ==> through ndigits past the decimal point.  This
                    gives a return value similar to that from fcvt,
                    except that trailing zeros are suppressed, and
                    ndigits can be negative.
            4,5 ==> similar to 2 and 3, respectively, but (in
                    round-nearest mode) with the tests of mode 0 to
                    possibly return a shorter string that rounds to d.
                    With IEEE arithmetic and compilation with
                    -DHonor_FLT_ROUNDS, modes 4 and 5 behave the same
                    as modes 2 and 3 when FLT_ROUNDS != 1.
            6-9 ==> Debugging modes similar to mode - 4:  don't try
                    fast floating-point estimate (if applicable).

            Values of mode other than 0-9 are treated as mode 0.

Looks like understanding the dtoa code is in order.
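As a rough analogy for the difference between the mode 0 and mode 2 descriptions, here is a Python sketch. It does not call jq's dtoa code; it assumes only that Python's `repr` for floats yields the shortest string that reads back as the same double, and that `format(x, ".17g")` yields 17 significant digits with trailing zeros suppressed.

```python
x = 1e23

shortest = repr(x)           # shortest string that round-trips, as in the mode 0 description
fixed17 = format(x, ".17g")  # fixed count of significant digits, as in the mode 2 description

print(shortest)  # 1e+23
print(fixed17)

# Both strings read back as the same double; they differ only in presentation.
print(float(shortest) == float(fixed17))  # True
```

Neither style marks a value as "an integer that lost precision", which is consistent with the remark that none of the existing options does what is being asked for.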

@nicowilliams
Contributor

@pkoppstein You're quite right BTW, that if we're to encourage people to use recurse for repetition then we should tell them that TCO is available. Conversely, if we were to have an alias/copy of recurse called repeat, we'd have to warn them that depending on the closure passed in, TCO may not be available. I'd rather avoid having to say anything about it, but we can't, can we.

@nicowilliams
Contributor

Incidentally, the same is true of while's update argument, and any possible repeat/1 (something like def repeat(f): def r: f | r; r;).

@nicowilliams
Contributor

Docs update pushed. See 2159f9f.
