Integer conversion to unsigned #178

Closed
Furcube opened this issue Jan 13, 2016 · 22 comments

@Furcube

Furcube commented Jan 13, 2016

It is possible to assign a negative JSON value to an unsigned variable.

This is a possible patch:

diff --git a/src/json.hpp b/src/json.hpp
index edda8ed..a8c85e5 100644
--- a/src/json.hpp
+++ b/src/json.hpp
@@ -2316,7 +2316,12 @@ class basic_json
         {
             case value_t::number_integer:
             {
-                return static_cast<T>(m_value.number_integer);
+                auto res = m_value.number_integer;
+                if(std::is_unsigned<T>::value && res < 0) {
+                    throw std::domain_error("can't assign negative value to unsigned type");
+                } else {
+                    return static_cast<T>(res);
+                }
             }

             case value_t::number_float:
@gregmarr
Contributor

It is possible that the value was unsigned in the JSON and was converted to a negative value for storage in the json object. It would have to be a really big unsigned number, with the default storage type being int64_t, but it's still possible. The json object would need to store additional data to know whether the value was originally negative or not.

@Furcube
Author

Furcube commented Jan 13, 2016

It would be better to check integer size and signedness in the parser. Successfully parsed JSON should contain valid data.

I forgot that NumberIntegerType can be changed, and can even be unsigned, so the condition for the conversion check should be like this:

if(std::is_unsigned<T>::value && std::is_signed<NumberIntegerType>::value && res < 0)
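For illustration, a minimal sketch of how that guard could look in a get_impl()-style accessor (the function and template parameter names here are hypothetical, not the library's actual API):

#include <stdexcept>
#include <type_traits>

template<typename T, typename NumberIntegerType>
T checked_integer_get(NumberIntegerType res)
{
    // Reject genuinely negative stored values when the target type is
    // unsigned; if NumberIntegerType is itself unsigned, res < 0 can
    // never be true and the branch is dead code.
    if (std::is_unsigned<T>::value &&
        std::is_signed<NumberIntegerType>::value && res < 0)
    {
        throw std::domain_error("can't assign negative value to unsigned type");
    }
    return static_cast<T>(res);
}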

Another possibility is to remove the static_cast and let the user handle conversions, but I'm not sure that this wouldn't break the API.

Anyway, a silent incorrect conversion is not a good thing.

@gregmarr
Contributor

I think this also has the possibility to break existing programs, though at runtime rather than compile time. Currently, if someone populates a json object with an unsigned integer value larger than the max size for NumberIntegerType, it will be encoded in the JSON string as signed. Parsing that string and extracting the value as unsigned will currently produce the same result as the original input. This change would make it throw an exception instead.

@Furcube
Author

Furcube commented Jan 13, 2016

If it really ignores integer overflow and does not differentiate between signed and unsigned, this behavior should at least be noted in the documentation.

For now I removed the static_cast to get compiler errors and made a validating wrapper, but considering the above, it is possible that the value would in fact be undefined as a result of overflow or signedness conversion.

A simple example: some third-party app can send my app JSON with an integer that needs 128 bits. I expect deserialization of this value to fail, not to yield an invalid value.
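One way a parser could reject such out-of-range literals at parse time, as suggested above (a sketch only, not the library's actual parse path; the function name is hypothetical):

#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <stdexcept>

uint64_t parse_u64_or_throw(const char* s)
{
    errno = 0;
    char* end = nullptr;
    const unsigned long long v = std::strtoull(s, &end, 10);
    // strtoull sets errno to ERANGE when the literal exceeds ULLONG_MAX;
    // end == s means no digits were consumed at all.
    if (errno == ERANGE || end == s)
    {
        throw std::invalid_argument("integer literal out of range for uint64_t");
    }
    return static_cast<uint64_t>(v);
}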

@twelsby
Contributor

twelsby commented Jan 15, 2016

@gregmarr, I might be missing something, but why would a large unsigned value in the JSON be stored as a negative number in the object?

Wouldn't this (in parser::parse_internal() at 7880) cause it to be stored as a NumberFloatType:

// check if conversion loses precision
const auto int_val = static_cast<number_integer_t>(float_val);
if (approx(float_val, static_cast<long double>(int_val)))
{
    // we would not lose precision -> return int
    result.m_type = value_t::number_integer;
    result.m_value = int_val;
}
else
{
    // we would lose precision -> return float
    result.m_type = value_t::number_float;
    result.m_value = static_cast<number_float_t>(float_val);
}
break;

My understanding, and I might be wrong, is that the parser casts from a NumberFloatType to a NumberIntegerType and back again and checks whether there is a significant difference. If the number is too large for the NumberIntegerType, won't it therefore fail the loss-of-precision test and just be stored as a NumberFloatType, with the sign preserved?

@Furcube, what are you trying to do exactly? Normally it is perfectly legal to assign a negative value to an unsigned variable. For example:

unsigned int val = -1;
std::cout << std::hex << val;

compiles without error, runs without exception, and simply produces FFFFFFFF, i.e. the unsigned representation of -1. This is because the compiler does an implicit conversion (modulo 2^32 for a 32-bit unsigned int).

I might be misunderstanding, but if you are trying to do something like this (where val is a negative number in the object, say -1):

unsigned int val = doc["val"].get<unsigned int>();

Then I would expect it to work, because internally the library has a function of the following form (except inside a class, and returning the actual value with a static_cast rather than a literal):

template<typename T>
T get() { return -1; }

If called in the form unsigned int val = get<unsigned int>(); it is instantiated as unsigned int get() { return -1; }, which effectively becomes unsigned int val = -1;. Again, I would not expect this to throw an exception.

I apologize in advance if I am way off with this. I haven't been using the library long so I might be misunderstanding how it works.

@gregmarr
Contributor

@twelsby I wasn't referring to parsing, I was referring to creating a json object to be formatted as a JSON string:

basic_json(const CompatibleNumberIntegerType val) noexcept
    : m_type(value_t::number_integer),
      m_value(static_cast<number_integer_t>(val))
{}

The static cast causes the large positive number, such as uint64_t(-1), to be stored and then output as a negative number.
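A short demonstration of that round trip (a sketch assuming the library's single-header json.hpp as it existed before this fix):

#include <cstdint>
#include <iostream>
#include "json.hpp"

using json = nlohmann::json;

int main()
{
    // uint64_t(-1) is 18446744073709551615; the constructor above
    // static_casts it to int64_t, so the object stores -1.
    json j = uint64_t(-1);
    std::cout << j.dump() << std::endl;   // prints -1
}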

@Furcube
Author

Furcube commented Jan 15, 2016

In the beginning a negative number was used as a test for error handling.

I use uint64_t for generated IDs, and I've noticed that the logged id differs from the id in the JSON. Additionally it will make it harder to interface with a Java app, where Jackson can use BigInteger for large values.

As for overflow: it is not so bad, as it can be range-checked outside of the library - large numbers will go into NumberFloatType (double), which allows a range check.

For example:

#include <cstdint>
#include <iostream>
#include <limits>
#include <sstream>
#include "json.hpp"

using json = nlohmann::json;

int main()
{
    std::stringstream jstr;

    uint64_t expected = std::numeric_limits<uint64_t>::max();

    jstr << "{\"max\":" << expected << "}";
    json j = json::parse(jstr.str());
    uint64_t u_max = j.at("max");
    double d_max = j.at("max");

    std::cout << "expected: " << expected << std::endl;
    std::cout << "json: " << jstr.str() << std::endl;
    std::cout << "dump: " << j.dump() << std::endl;
    std::cout << "uint64_t max: " << u_max << std::endl;
    std::cout << "double max: " << d_max << std::endl;
    std::cout << "is more than max(int64_t)? "
              << (d_max > std::numeric_limits<int64_t>::max()) << std::endl;
}

with gcc 5.3 will produce

expected: 18446744073709551615
json: {"max":18446744073709551615}
dump: {"max":1.84467440737096e+19}
uint64_t max: 0
double max: 1.84467e+19
is more than max(int64_t)? 1

I can't get the uint64_t back from it (I don't know why), but the range check is working.

@gregmarr
Contributor

std::numeric_limits<uint64_t>::max() cannot be stored in a double without loss of accuracy; it has too many significant digits.
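A double has a 53-bit significand, so integers above 2^53 can no longer all be represented exactly; a quick way to see this:

#include <cstdint>
#include <iostream>

int main()
{
    const uint64_t exact   = (1ULL << 53);      // 2^53 is representable
    const uint64_t inexact = (1ULL << 53) + 1;  // 2^53 + 1 is not

    // The round trip through double preserves 2^53 but collapses
    // 2^53 + 1 onto its even neighbor 2^53.
    std::cout << static_cast<uint64_t>(static_cast<double>(exact))   << std::endl;
    std::cout << static_cast<uint64_t>(static_cast<double>(inexact)) << std::endl;
}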

@twelsby
Contributor

twelsby commented Jan 16, 2016

@gregmarr, @Furcube seems to be talking about parsing, or at least that is what his example does. However, I think the behavior you raise is certainly worthy of its own issue.

The loss of accuracy you also mention (which will also occur during parsing, because all numbers are initially parsed as doubles, so anything larger than 2^53 loses accuracy) may or may not be a problem, as this behavior is in compliance with RFC 7159, which allows implementations to set limits on the range and precision of accepted numbers and recommends supporting double-precision numbers for interoperability. It would be nice to go all the way to 2^64 - 1, but it is not a necessity.

@Furcube, when I run your example, it doesn't execute the code path related to your fix but instead passes through the next switch case, because m_type is set to value_t::number_float by parser::parse().

With both VS2015 and gcc 4.9.2 compiling for x64, the subsequent static_cast is actually performed by a cvttsd2si instruction. When this instruction is applied to an input that exceeds the maximum that can be stored in a signed 64-bit integer, it evaluates to 0x8000000000000000. On VS2015 this is what is assigned to u_max and what gets printed out by your example, i.e. the decimal equivalent of 0x8000000000000000. For reasons unknown, gcc always clears the top bit of the result (by XORing with 0x8000000000000000) and so will set u_max to 0 (the result you are getting).

The minimal example that replicates this behavior is:

uint64_t x = std::numeric_limits<uint64_t>::max();
double y = static_cast<double>(x);
uint64_t z = static_cast<uint64_t>(y);
std::cout << x << std::endl << y << std::endl << z << std::endl;

Here x is the original value, y is what gets stored and z is what you extract. The output with gcc is:

18446744073709551615
1.84467e+19
0

The moral of the story is that if you parse a number larger than 2^53, it will be stored as a double with reduced accuracy, and if you then try to extract it as a uint64_t you will get compiler-dependent behavior (formally, the out-of-range conversion is undefined), which will certainly not be the actual value stored. As this is a consequence of storing large integer values as doubles, the only way I can see to fix it is to not store such values as doubles to begin with. This requires modification of the parse code (and obviously also the object creation code at the other end of the chain).

Incidentally, @nlohmann, I notice there have been issues with Visual Studio in the past, but the current version compiles without error or warning in VS2015. C++11 (and C99) support seems to be greatly improved.

@nlohmann
Owner

@twelsby Which version of MSVC did you try? Currently MSVC 19.0.23506.0 fails to compile the unit tests, see https://ci.appveyor.com/project/nlohmann/json.

@twelsby
Contributor

twelsby commented Jan 16, 2016

Here is one possible solution, admittedly a not-very-elegant brute-force approach:

#include <iostream>
#include <cfloat>
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Modify the existing value_t enum
enum class value_t : uint8_t
{
    number_integer,
    number_unsigned,               // Add this
    number_float
} m_type;

// Modify the exiting json_value union
union json_value {
    double  number_float;
    uint64_t number_unsigned;      // Add this
    int64_t number_integer;
} m_value;

const char * m_start;

// Rewrite lexer::get_number()
void get_number()
{
    char * end;

    // Parse it as an integer
    if(*m_start != '-')  {
        // Unsigned
        m_value.number_unsigned = strtoull(m_start,&end,0);
        m_type = value_t::number_unsigned;
    }
    else  {
        // Signed
        m_value.number_integer = strtoll(m_start,&end,0);
        m_type = value_t::number_integer;
    }

    // Parse it as a double
    double test = strtold(m_start,&end);
    double int_part;
    double frac_part = std::modf(test, &int_part);

    // Test if the double or integer is a better representation
    if(frac_part != 0 ||
            (m_type == value_t::number_unsigned && int_part != m_value.number_unsigned) ||
            (m_type == value_t::number_integer && int_part != m_value.number_integer)) {
        m_value.number_float = test;
        m_type = value_t::number_float;
    }
}

template<typename T>
T get_impl(T*)
{
    switch (m_type)
    {
        case value_t::number_integer:
        {
            return static_cast<T>(m_value.number_integer);
        }

        // Add from here ------------------------
        case value_t::number_unsigned:
        {
            return static_cast<T>(m_value.number_unsigned);
        }
        // to here-------------------------------

        case value_t::number_float:
        {
            return static_cast<T>(m_value.number_float);
        }

        default:
        {
            return T{};   // non-numeric types are not handled in this sketch
        }
    }
}

int main()
{
    const char * str[] = {
            "18446744073709551615",
            "9223372036854775807",
            "-9223372036854775808",
            "-123.456e13",
            "123.456E-25",
            "123.456e2"
    };

    for (int i=0; i < 6; i++)  {
        m_start = str[i];
        get_number();

        double dout = get_impl(static_cast<double*>(nullptr));
        uint64_t uout = get_impl(static_cast<uint64_t*>(nullptr));
        int64_t iout = get_impl(static_cast<int64_t*>(nullptr));

        std::cout << "In:" << str[i];
        std::cout << " Double:" << dout;
        std::cout << " Unsigned:" << uout;
        std::cout << " Signed:" << iout << std::endl;   
    }
    return 0;
}

This produces the output

In:18446744073709551615 Double:1.84467e+19 Unsigned:18446744073709551615 Signed:-1
In:9223372036854775807 Double:9.22337e+18 Unsigned:9223372036854775807 Signed:9223372036854775807
In:-9223372036854775808 Double:-9.22337e+18 Unsigned:9223372036854775808 Signed:-9223372036854775808
In:-123.456e13 Double:-1.23456e+15 Unsigned:18445509513709551616 Signed:-1234560000000000
In:123.456E-25 Double:1.23456e-23 Unsigned:0 Signed:0
In:123.456e2 Double:12345.6 Unsigned:12345 Signed:12345

This is the output I would expect, i.e. for:

  1. The input is 2^64-1, so when extracted as a double it loses accuracy, but as a uint64_t it is exact. As an int64_t it gives the wrong but expected result.
  2. The input is 2^63-1, so when extracted as a double it loses accuracy, but as both a uint64_t and an int64_t it is accurately represented.
  3. The input is -2^63, which is exactly representable as a double (the printed value is merely truncated to six significant digits); as a uint64_t it loses its sign, but as an int64_t it is accurately represented.
  4. The input is a negative floating-point number that can be accurately represented as a double and as an int64_t, but the uint64_t result is wrong (though expected).
  5. The input is between 0 and 1, so it can only be represented as a double. The integer representations are truncated to 0.
  6. The input is a floating-point number that is accurately represented as a double but truncated at the decimal point for the integer types.

The downside of this approach is that it requires parsing the number twice, once as an integer and once as a double. The upside is that it is no longer necessary to do the precision-loss check, as this is effectively handled by the above code.

I haven't looked at the object creation side but I would think that having the number_unsigned type probably makes that trivial to solve - although perhaps I am just being optimistic.

@twelsby
Contributor

twelsby commented Jan 16, 2016

@nlohmann, yeah, that one. I didn't build the unit tests, however, and wasn't trying to see if everything worked. Disregard then.

@nlohmann
Owner

Extending the enum value_t with the element number_unsigned would also mean that it could not be stored in a uint8_t, as it would contain 9 elements. Besides that, your sketched changes could be feasible.

@gregmarr
Contributor

It's not a bitfield, so adding another value to the enum doesn't affect the required size.
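A quick way to convince oneself (the enumerator names are modeled loosely on the library's value_t; the exact set is illustrative):

#include <cstdint>

enum class value_t : uint8_t
{
    null, object, array, string, boolean,
    number_integer,
    number_unsigned,   // a 9th enumerator still fits: uint8_t holds 0..255
    number_float,
    discarded
};

static_assert(sizeof(value_t) == 1, "one byte regardless of enumerator count");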

@twelsby
Contributor

twelsby commented Jan 16, 2016

I'll try it in the actual code and see how it goes. Although if anyone has a better idea jump in.

@twelsby
Contributor

twelsby commented Jan 17, 2016

Well, it works well in testing, so I have submitted a pull request. With the existing unit tests the only errors related to type verification, which was expected; changing those to accept the new type makes them pass. I added extra (equivalent) unit tests for the new type (very tedious) and they also pass, as do tests that check the behavior in the specific cases discussed here.

@nlohmann
Owner

@gregmarr, I understand that enum class value_t : uint8_t defines uint8_t as the underlying type, so having more than 8 values would be a problem. And as the class has a member of type value_t, why would the required size of an object not depend on the number of different enum values?

(Sorry for the stupid question...)

@twelsby
Contributor

twelsby commented Jan 17, 2016

@nlohmann I tend to agree with @gregmarr. Unless something funny is going on, a uint8_t-based enum should be able to store 256 possible values, assuming there is no need to combine them at once (e.g. OR them together). If that were the case you would need to manually assign them values, as the compiler will otherwise just number them from 0 in the order in which they are listed.

@nlohmann
Owner

Oh this is so embarrassing... I totally forgot the "8" in uint8_t stands for bits... 😳

@Furcube
Author

Furcube commented Jan 17, 2016

uint8_t is 8 bits; enums are not bit-based, they just go 0, 1, 2, 3, ...
Anyway, uint8_t will be padded in memory to at least 32 bits, so there's no change in memory consumption if you do not specify #pragma pack. And packed structures can be bad for some architectures that don't support unaligned memory access.
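A small sketch of that padding effect, with a layout loosely modeled on basic_json's tag-plus-union members (the names are illustrative, not the library's):

#include <cstdint>
#include <iostream>

enum class value_t : uint8_t { number_integer, number_unsigned, number_float };

struct node
{
    value_t m_type;                                       // 1 byte
    union { int64_t i; uint64_t u; double d; } m_value;   // 8 bytes, 8-byte aligned
};

int main()
{
    // Typically prints 16: the 1-byte tag is followed by 7 bytes of
    // padding so the union stays 8-byte aligned.
    std::cout << sizeof(node) << std::endl;
}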

@nlohmann
Owner

FYI: I'll try to close as many issues as possible before the weekend and publish version 1.0.1 then, before I focus on the 2.0.0 release (I would really like to follow Semantic Versioning and treat the change in the public enum value_t as API-breaking).

@nlohmann nlohmann added this to the Release 2.0.0 milestone Jan 20, 2016
twelsby pushed a commit to twelsby/json that referenced this issue Jan 24, 2016
nlohmann added a commit that referenced this issue Jan 26, 2016
Issue #178 - Extending support to full uint64_t/int64_t range and unsigned type (updated)
@nlohmann
Owner

Fixed by merging #193. Thanks so much!!!
