Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Float encoding #2

Open
paulsnar opened this issue Jul 16, 2022 · 11 comments
Open

Discussion: Float encoding #2

paulsnar opened this issue Jul 16, 2022 · 11 comments

Comments

@paulsnar
Copy link

paulsnar commented Jul 16, 2022

At the risk of causing bikeshedding, I wanted to raise a point of discussion regarding how floats are encoded -- given that integers are encoded in Base62 and therefore take up less space than their decimal form, it seems a bit odd that floats are left as-is.

My first idea was that they could also be encoded in a similar compacted form, such as base36, which JavaScript can do natively with .toString(36), but that isn't particularly great if the float's magnitude is not close to 0 because the expontent e+n notation can't be used unambiguously, and parsing them back from such a format probably isn't easy (definitely not in JavaScript.)

I'd love to hear more about the rationale behind this decision (if any)! All in all, for the goals that it sets out to accomplish, JCOF is really neat.

(Edit, this did end up causing bikeshedding, especially in the other issues, sorry for that.)

@mortie
Copy link
Owner

mortie commented Jul 16, 2022

So, having a way to do base62 encoding of floats has struck my mind, but I've disregarded it since I wanted to use JavaScript's float generator and parser. I didn't consider the fact that JavaScript lets you convert floats to strings with different bases.

I think the base 10 syntax has to stay. That doesn't necessarily mean we can't add a base 36 (or base 62 or anything else) in addition, but here are the reasons why we can't replace the base 10 syntax with a base 36 syntax:

  • Currently, the characters 0 through 9 and - indicate that the coming value is a float. This means we can encode floats without a leading character. Some of the lower-case characters are used as a leading character to indicate other values; i for positive base64 integer, n for null, b for true boolean, s for string reference. Now, we could replace al those characters with upper case characters so that the characters 0 through 9, - and a through z indicate float, and maybe that would be worth doing, except:
  • Base 36 turns out to be larger in a lot of cases. Numbers aren't always randomly picked from the real number line; often, our datasets contain numbers which look okay in base 10. For example, one test dataset contains a lot of numbers rounded to 2 base-10 decimal places. Moving from base 10 to base 36 means encoding 1.58 as 1.kvoha2voha instead, or encoding 2.2 as 2.7777777778.
  • And then there's the fact that I like that JCOF is human-writable. Removing a way to represent numbers as base 10 would make it much harder for a human to write a JCOF document.

I implemented a naïve change which just replaces the current value.toString() with value.toString(36) to see what kind of difference it makes. Here it is, with jcof being the old base 10 floats and jcff being the new base 36 floats:

tiny.json:
  JSON: 299 bytes
  jcof: 134 bytes (0.448x)
  jcff: 134 bytes (0.448x)
circuitsim.json:
  JSON: 8315 bytes
  jcof: 2093 bytes (0.252x)
  jcff: 2090 bytes (0.251x)
pokemon.json:
  JSON: 219635 bytes
  jcof: 39650 bytes (0.181x)
  jcff: 39643 bytes (0.180x)
pokedex.json:
  JSON: 56812 bytes
  jcof: 23132 bytes (0.407x)
  jcff: 25861 bytes (0.455x)
madrid.json:
  JSON: 37960 bytes
  jcof: 11923 bytes (0.314x)
  jcff: 11618 bytes (0.306x)
meteorites.json:
  JSON: 244920 bytes
  jcof: 87028 bytes (0.355x)
  jcff: 95370 bytes (0.389x)
comets.json:
  JSON: 51949 bytes
  jcof: 37480 bytes (0.721x)
  jcff: 37471 bytes (0.721x)

Now, there is a use-case for a more efficient float encoding. Storing the state of a physics-based game, for example, would have a lot of entities with position and velocity vectors which contain values that are essentially picked arbitrarily from the real number line without regard for any base. And base 36 or 62 floats would certainly be an advantage there. I wouldn't be against adding a new float syntax, maybe one which works like the base 62 integer syntax today; f followed by a base 36 or 62 float for positive floats, F for a negative float. My concern there would be: All languages probably have a base 10 float parser, but some languages may not have a base 36 float parser. I think I want to avoid making people hand-write float parsers and generators, since parsing and generating floats efficiently in a way which can round-trip all values without loss seems like one of those classic hard problems. I should do some research in this space.

@JobLeonard
Copy link

JobLeonard commented Jul 16, 2022

I was wondering about the same thing. Here's one idea I had, maybe it would work for your usecases? If it clashes with some design nuance I missed in my brief exposure to the library feel free to disregarded.

Since we already have the possibility of exponential notation via toExponential, e.g. (0.000123).toExponential() becomes 1.23e-4, why not think of that notation as a triplet of values (1, 23, and 4) that each can be converted to Base62? If we start with an fprefix to indicate we'll have a floating point value we can drop thee, instead using +or-as a separator. In our example that would result inf1.n-4`.

But we can do better: we can get rid of the . between 1 and 23, as long as we add the change in powers of ten to the coefficient to the exponent to compensate. So in this case, changing 1.23 means multiplying by 100, so 10⁻⁴ has to become 10⁻⁶ to compensate.

So 0.000123 becomes 1.23e-4 becomes 123-4 becomes 123-6, which becomes f1Z-6

I have a working but totally unsafe implementation here:

https://observablehq.com/d/e3010313d892b36e

(unsafe as in "it does not test and bail for invalid string input and probably can result in infinite loops as a result". But you already have a B62 parser yourself so that's the hard part handled)

Number.prototype.toExponential() is so old that even IE5 supported it, so I don't think you have to worry about it not being available. Edit: oh wait, I missed the nuance of whether or not other languages than JavaScript support this. Yeah, I hadn't considered that, no opinions there.

@JobLeonard
Copy link

Hahaha, dammit, all that work for a test implementation and I just realized that there is much simpler approach that seems to work a lot better:

image

@paulsnar
Copy link
Author

@mortie, thanks for the detailed response! Your reasoning makes a lot of sense.

Removing a way to represent numbers as base 10 would make it much harder for a human to write a JCOF document.

Well, this doesn't entirely hold for base 62 integers either, but given that in JS there's no difference between integers and floats, I suppose at the moment the float notation works as a makeshift solution for that :)

I respect that any change should be justified by both a use-case and data, and that the present notation is good enough for what JCOF sets out to achieve -- though as you said, there are some cases where a lot of floats might need to be passed around, and there's not currently a dataset among benchmarks that could represent that case, so perhaps one could inform design decisions.


@JobLeonard, I really like your proposals! If I might riff on them a bit:

Reusing the existing base 62 encoder/decoder makes a lot of sense. I'd've tried it too; the base 36 was meant as an illustrative example, and doing either of your proposals fits much better along with rest of JCOF.

For the first proposal, an equivalent of Number.prototype.toExponential could probably be put together with a bit of logarithms, so it's unlikely to be a problem in other languages; though the algorithm might require some time to figure out, afterwards it can be ported easily and basically only requires the logarithm function and exponentiation, both of which are likely to be supported. It'd also be possible to try to calculate it in a binary exponent form (so using 2 as the exponentiation base), and that might allow for very slightly faster parsing if you have access to the IEEE 754 binary representation, but that might be going in a bit too deep.

I like the sublimity of the second one, given that it's essentially just a compressed form of what JCOF does at the moment :)

Either way, all of these solutions (both the current decimal scheme, naïve base 36, as well as both proposed by Job) suffer to some extent from the base-incompatibility problem if trying to encode numbers like 1/3 without losing accuracy. Of course, then a way to reduce this is to decide a cutoff point of accuracy that'd be acceptable -- after all, 2.2 isn't exactly that in the float representation too, just that stringification/parsing algorithms skip all the insignificant zeroes and don't display decimals after the 17th-or-so position that would arise from the binary value.

That said, just truncating after some n digits can be done just as well with the current decimal encoding, and the gains attained by the rest of these are fairly minimal (or so it seems to me, not having benchmark data of my own), so perhaps the added complexity becomes not really worth it by that point...

@mortie
Copy link
Owner

mortie commented Jul 16, 2022

A proper float to string algorithm doesn't just choose some number of sufficient digits; the string "2.2" represents exactly the one floating point number which is the closest to being 2.2, and a float parser will return that exact floating point value when parsing "2.2". And the expression 1/3 returns a floating point value; (1/3).toString() returns the shortest string representation where (1/3) is the closest float, so that the float parser returns the exact same float.

My problem is that I only know about these challenges, not how they're solved, or even how difficult they are to solve.

FWIW, I tried out the algorithms in https://observablehq.com/d/e3010313d892b36e, and a lot of numbers aren't round-tripped correctly. For example, the base 10 float 1.1344e-319 stringifies to "f2WY-5d", which parses to 1.12094e-319, and 9.399999999999983 stringifies to "fH3egP4fQc-f" which parses to 9.399999999999984.

Here's a snippet which can be used to test all the floats:

// Return the next representable double from value towards direction
// From https://stackoverflow.com/a/27661477
function nextNearest(value) {
  if (!isFinite(value))
    return value;
  
  var buffer = new ArrayBuffer(8);
  var f64 = new Float64Array(buffer);
  var u32 = new Uint32Array(buffer);
  
  f64[0] = value;
  
  if (value === 0) {
    u32[0] = 1;
    u32[1] = 0;
  } else if ((value > 0) && (value < Infinity) || (value < 0) && (value > Infinity)) {
    if (u32[0]++ === 0xFFFFFFFF)
      u32[1]++;
  } else {
    if (u32[0]-- === 0)
      u32[1]--;
  }
  
  return f64[0];
}

let f = 0;
while (f != Infinity) {
  let str = stringifyF62(f);
  let parsed = parseF62(str);
  if (parsed == f) {
    console.log("OK:", f, "from string", str);
  } else {
    console.log("ERR: expected", f, "got", parsed, "from string", str);
  }

  f = nextNearest(f);
}

Replace with stringifyF62(f) with f.toString() and parseF62(str) with parseFloat(str), and you can see that the JavaScript float stringify and parse functions round-trip all numbers you try perfectly.

@JobLeonard
Copy link

@mortie yeah, after thinking a bit more about the "v1" approach that checks out - I do a multiplication which can lead to rounding errors.

The second version that first uses toString(10) and then converts the two separate strings to b62 integers should roundtrip correctly for all values, since it is a lossless transformation of two integers. I'll update the code to use that approach instead. Gimme a few minutes.

@JobLeonard
Copy link

Ok so first, thanks for that test function, that's been a major help in debugging!

There's one major flaw in the implementation though: it allocates new TypedArrays every call 😱. ArrayBuffer allocation is just about the slowest thing you can do on JS right now (TypedArrays are only fast in usage after allocation). So do yourself a favor and hoist it out for faster testing.

Anyway, that took longer than a few minutes because I bumped into two edge-cases:

  1. For very small or large numbers .toString(10) returns exponential notation, so we need support for that after all
  2. leading zeroes on the fractional part, e.g 1.00456789.

The first point wasn't too much of a pain, since it was basically just adapting the ideas from before but in a way that guarantees that we'll never have longer strings in the end.

The second point took more time. If I naively convert this the leading zeroes get dropped, resulting in 1.456789 after the roundtrip. Bad news.

To keep the output compact my solution was to introduce a second separator to indicate leading zeros: ,. In that case the next character encodes the number of leading zeroes in b62 (so up to 62). So:

  • 1.456789 converts to "f1.1UPz"
  • 1.0456789 converts to "f1,11UPz"
  • 1.00456789 converts to "f1,21UPz"

I don't think it's possible for any double to be converted to a number with a fraction with 62 leading zeroes, and vice versa a handcrafted number string with 62 leading zeroes would not convert to a double with the same value. In that case this I would consider this safe (I'm still letting the tests run in the background to verify this, it's taking forever even with that TypedArray hoisting).

Oh, and I also added a simple check to see if the converted string is longer than a plain .toString(10) conversion. In that case the latter is returned. Note that this does require that on the parsing side we don't just check of the first numbers are /[0-9-]/, we also need to check for optional exponential notation suffixes, so the strings 1e100, 1e+100, 1e-100 must all be recognized. The benefit would be that it's easier to manually insert numbers, which is also a goal, right?

@JobLeonard
Copy link

(updated code can be found at the same link as before: https://observablehq.com/d/e3010313d892b36e)

@JobLeonard
Copy link

Ok, dammit, found another roundtrip bug:

Error: Expected 0.10000000000000005, got 0.10000000000000003 from string "f.JNBBGsgRe"

The cause is that 0.10000000000000005 is a valid number, but 10000000000000005 converts to 10000000000000004.

I think that's easy enough to fix though, just have to parse the separate integer, fraction and exponent strings as BigInt values.

@JobLeonard
Copy link

JobLeonard commented Jul 16, 2022

Ok so this is as much as I'll write today - I already got nerdsniped on my free Saturday for too long now, although I had fun doing it :).

I'm sure there is still an edge-case I missed, but I think it pretty much works now.

Updated code still at https://observablehq.com/d/e3010313d892b36e.

To summarize, the notation logic is as follows:

  • if the string starts with 0 to 9 and - it is assumed to be a "plain" number string
  • otherwise, i and I are integer prefixes followed by a base62 number
  • f and F are floating point values that may have a fractional and exponential component (so ./, and e in number string notation
  • . is a separator for an integer and a fractional part, both base62 and the latter without leading zeros
  • , is a separator for an integer and a fractional part, both base62 and the latter with leading zeros, which are encoded as the first b62 digit of the fraction
  • + or - is a separator that indicates an exponential part.

Essentially, f{integer}[.,]{fraction}[+-]{exponent} is a reversible transformation of the standard way JavaScript may convert numbers to base-10 strings.

Currently stringifyN62 should always returns the shortest possible number string, and parseN62 assumes the numerical substring has already successfully been extracted from the full string.

@JobLeonard
Copy link

... actually, given how I currently coded it we could replace i, I, f and F prefixes with a universal n and N prefix, since the optional ., ,, + and - separators can be used to indicate if we're dealing with a pure integer or with a floating point number.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants