
Support for custom float precision via unquoted strings #1421

Closed
nathanieltagg opened this issue Jan 9, 2019 · 14 comments
Labels
state: please discuss · state: stale

Comments

@nathanieltagg

I have an application that uses lots of floats (on the order of 10,000 in a single JSON document transmitted over HTTP). I need to keep them small, but also at the desired accuracy. This means choosing the precision of each float according to the role it plays in the document.

For example:
{
  "accurate_value": 12.3456789,
  "good_to_three_significant_figures": 12.3,
  "rounded_to_a_millimeter": 12.236
}

(Right now I use my own JSON library, but it's serialize-only, no parsing or manipulation.)

In order to replicate this work, I need exactly one simple feature: a way to add an unquoted string as a basic_json value:

json j;
j["accurate_value"] = json::unquoted_string("12.3456789");
j["good_to_three_significant_figures"] = json::unquoted_string("12.3");
j["rounded_to_a_millimeter"] = json::unquoted_string("12.236");

would yield the above JSON when serialized. It's fine if the unquoted string can only be manipulated like a string.

Any syntax is OK.

Then it would be trivial for me to provide (and contribute!) the wrapper functions that produce these values. I put a fair amount of work into making sure they always do the right thing: maximum accuracy for minimum character count.

This seems like a trivial change: simply duplicate the existing string type and give it a slightly modified serializer.

It may even exist already under the hood - is there any way of dropping unquoted text in as a value?

@nickaein
Contributor

nickaein commented Jan 9, 2019

It seems the enclosing of strings in double quotes during serialization is hard-coded here, so it would not be easy to support such a feature with minimal tweaks. I might be wrong though.

As a hack, if you are OK with outputting all strings without quotation marks, you can comment out the two o->write_character('\"'); lines at the location linked above. With that change, the following code:

#include "json.hpp"
#include <iostream>

int main()
{
    nlohmann::json j;
    j["accurate_value"] = "12.3456789";
    j["rounded_to_a_centimeter"] = "12.3";
    j["rounded_to_a_millimeter"] = "12.236";
    j["an_actual_string"] = "I'm left unquoted, sad :(";

    std::cout << j.dump(4) << std::endl;
}

outputs:

{
    "accurate_value": 12.3456789,
    "an_actual_string": I'm left unquoted, sad :(,
    "rounded_to_a_centimeter": 12.3,
    "rounded_to_a_millimeter": 12.236
}

Note that if any key has an actual (non-number) string value, the output is invalid JSON. But it will be fine if your data is strictly limited to float values.

@nathanieltagg
Author

No, this would unquote everything, which I don't want; I specifically want a variant with unquoted strings.

I wasted some time on some hacking today: https://github.com/nathanieltagg/json

This is a bit hacky and not really clean for a pull request, but does what I want:
o["unquoted"] = json(json::unquoted,"123.45");
double a = 123.456;
o["default"]= a;
o["fixed_1"] = json(json::fixed,a,1);
o["fixed_3"] = json(json::fixed,a,3);
o["sigfig2"] = json(json::sigfig,a,2);
o["sigfig3"] = json(json::sigfig,a,3);

yields:

{
"unquoted": 123.45,
"default": 123.456,
"fixed_1": 123.5,
"fixed_3": 123.456,
"sigfig2": 123,
"sigfig3": 123
}

@nathanieltagg
Author

nathanieltagg commented Jan 9, 2019

I discovered several things while doing this. The nlohmann code already has extensive machinery for decimal conversion; I underestimated how thoughtful it is, although it blindly applies the same algorithm to every numeric value at runtime. But it has some details that are poor for my application:

First, (double)0 is rendered as "0.0". I can see the point of this, but if the JSON is parsed by JavaScript, there is no distinction between integer and floating-point numbers, so the ".0" is wasted space. Similarly, 10.0 is rendered as "10.0" instead of "10".

Second, 1.234e9 is rendered as 1234000000.0, when an exact match is achieved in less space with 1.234e+09, which is equally valid.

Third, nlohmann's code emits null for NaNs. For my application, "nan" is preferred, since this matches how JavaScript will interpret the value:
isNaN(null) yields false
but
isNaN("nan") yields true
and similarly with isFinite().
Since JavaScript is my target reader, this looks a lot better.

My code makes fugly use of sprintf calls, but it addresses these issues.

@nathanieltagg
Author

OK, I've tidied up the code so it could be merged.

https://github.com/nathanieltagg/json

I think the syntax of my unquoted_string call is explicit enough that it won't confuse anyone or make the code dangerous, but it still lets us get our hands dirty.

Another use case for this: you have a large block of JSON that comes from another source, and you want to wrap it in an object with a header and ship it. This feature lets you do that without parsing the source.

@nlohmann
Owner

nlohmann commented Jan 9, 2019

There have been frequent requests for control over the serialization of floating-point numbers (i.e., how many decimal places to dump). Is this issue about that (this is how I understood #1421 (comment)), or do you "just" want to provide some values "as is" to be stored in the JSON value without deeper inspection?

About the remarks in #1421 (comment):

  • We would like to roundtrip as faithfully as possible. Storing any "zero" as 0 would not roundtrip values like 0.0 or -0.0. The same goes for 10.0 vs. 10.
  • The goal of the library is not to find the shortest sequence of ASCII characters to represent each number. We rather rely on std::snprintf in combination with std::numeric_limits<number_float_t>::max_digits10:
        // get number of digits for a float -> text -> float round-trip
        static constexpr auto d = std::numeric_limits<number_float_t>::max_digits10;

        // the actual conversion
        std::ptrdiff_t len = (std::snprintf)(number_buffer.data(), number_buffer.size(), "%.*g", d, x);
  • The handling of NaN is difficult. The spec says:

Numeric values that cannot be represented in the grammar below (such as Infinity and NaN) are not permitted.

Maybe (?) this could be also something the user would like to configure...

@nlohmann added the "state: please discuss" label on Jan 9, 2019
@nickaein
Contributor

@nathanieltagg This is completely unrelated and you have probably already considered it, but I am wondering what the reason is not to use a more compact serialization format in your use case. As you know, JSON has a lot of overhead, which other formats like protobuf can eliminate without sacrificing extensibility. There are also formats like BSON (binary JSON) that a JSON document can be directly converted to and from. These usually have lower overhead in terms of size and serialization/deserialization speed.

One case that makes switching hard is when there is an existing system we want to interact with that only accepts JSON, but even then there is sometimes the possibility of implementing a wrapper to handle the conversion when exchanging data.

@nlohmann I believe it would be good to have more control and customization points for serialization.
How about something like Newtonsoft JSON's JsonSerializerSettings class, which encapsulates such settings?

Having an API to specify such configuration makes it easier to extend the customizability of the serializer and expose more settings in the future.

Back to the custom floating-point issue brought up in this thread: having a setting for floating-point precision is nice, but I doubt it can be helpful in this case, since it acts as a global formatter for all floating-point values. It is difficult to come up with a generic design that provides enough context to the formatter to decide the precision (e.g. providing key names to the custom formatter would help in @nathanieltagg's case, but that's not always enough).

Therefore, I believe the unquoted_string proposal by @nathanieltagg is more realistic and doable for this use case. It could be done with custom formatters similar to Newtonsoft JSON's.

@nlohmann
Owner

@nickaein I am unfamiliar with Newtonsoft JSON - are the serializer settings you linked to global for the whole serialization?

@nathanieltagg
Author

It's of course possible to expand the basic_json union to include a more complex object, like:
json(sigfigs(3), 1.2345)
The value_t would have to be a struct holding the printout specification (fractional accuracy or fixed-point, space-saving or significant figures, plus the number of decimals allowed) that would then be consumed by the serializer. However, this is a lot of work. I'd love it, but I don't need it.

I think my unquoted-string extension is a good one in any case. It allows payloads to be inserted without parsing, as well as the custom-output.

Thanks for thinking about it!
Nathaniel

@nathanieltagg
Author

@nickaein I have actually looked at BSON, but it's got four issues:

  • encoders require 3rd party libraries (and I have a MASSIVE toolchain problem)
  • decoders in the browser are built in javascript and therefore not as fast as JSON native
  • The transferred data is not human-inspectable (valuable when debugging)
  • It only results in a 10% savings in document size and after gzipping, the JSON file is actually about 20% SMALLER. (Tested on my typical ~1 MB JSON document.)

This last bit surprised me, but it's true. Strings (and therefore object keys) don't get smaller. Small integers are only 1-2 bytes in JSON, whereas in BSON they need a type-identifier byte plus four bytes of storage for an int32. The JSON version gzips down well (because it's ASCII) but the BSON version doesn't so much. It's possible to reformat your data to take advantage of BSON's strengths, but that's more work than simply optimizing the JSON output a little.

Even a high-accuracy double is 8 bytes in BSON, while in JSON it's at most about 24 bytes. So:
"value":-2.2250738585072020E-308,
is 31 bytes, compared to about 15 bytes in BSON (if I've counted correctly). That's only a factor-of-2 savings, for a very contrived example!

@nickaein
Contributor

@nlohmann Yes, AFAIK the settings in Newtonsoft Json are global throughout the serialization/deserialization process, so they cannot help much with this issue. Maybe a hack would be to use a JsonConverter to output different precisions based on the key name.

I'm more interested in the customisability of serializer/deserializer. Having such an API makes it possible to easily cover cases like #1422.

@nickaein
Contributor

@nathanieltagg

Interesting! The reason I suggested BSON was that it is more compatible with JSON than something like protobuf. How about protobuf/flatbuffers? They seem to have promising results. For instance, in these benchmarks by Google, flatbuffers beat the other methods in almost all measured aspects. I'm not very familiar with protobuf, but I can share my experience with flatbuffers:

encoders require 3rd party libraries (and I have a MASSIVE toolchain problem)

Luckily, the flatbuffers library is simply a few header files with C++11 as its only dependency. We have managed to use it on very limited hardware, such as microcontrollers (e.g. the LPC1768).

decoders in the browser are built in javascript and therefore not as fast as JSON native

That is a good point! While flatbuffer messages are very close to raw structs in terms of encoding/decoding overhead in native environments, I don't know the trade-offs of using them in a JavaScript engine. But based on the benchmark numbers, I believe it will be at least close to JSON.

It only results in a 10% savings in document size and after gzipping, the JSON file is actually about 20% SMALLER. (Tested on my typical ~1 MB JSON document.)

The benchmarks in the above link show that flatbuffers perform better in terms of size, even with compression. However, as you have discussed, this is highly dependent on the data distribution and can vary. One thing that helps significantly is flatbuffers' support for more compact data types (e.g. integer types of various sizes). Also, replacing doubles with floats can help hugely if float precision is enough for your use case.

The transferred data is not human-inspectable (valuable when debugging)

I agree. And this is not limited to just humans: JSON has become a ubiquitous data format that can be processed on any system. That is a strong reason to stay with JSON unless there is a very compelling reason not to.

You could also share a typical JSON document (or a dummy one that closely represents the actual data). I might play around with it (e.g. with flatbuffers) and do some tests in my free time.

@nlohmann
Owner

Of the binary formats supported by this library, I would always recommend CBOR, as it usually has the smallest serializations.

@stale

stale bot commented Feb 11, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@leonshaw

leonshaw commented Dec 2, 2022

I think representing numbers as strings is helpful in many cases, such as handling big integers, customizing float serialization, and keeping track of the exact JSON input.

See also json.Number in Go: https://pkg.go.dev/encoding/json#Number
