Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

JSON parser has issues with small f64 vales using scientific notation #1426

Closed
DeliciousHair opened this issue Mar 5, 2023 · 4 comments
Closed
Labels
no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@DeliciousHair
Copy link

I am cross referencing my original polars issue pola-rs/polars#7312 as some digging around trying to find the source has led to finding this.

In https://github.com/jorgecarleitao/arrow2/blob/main/src/io/json/read/infer_schema.rs#L112-L117 it would seem that if the source json contains (small) floating point values like 2e-06, then this will cause problems as it attempts to set the dtype as Int64 and failure occurs. This problem only seems to occur for values of the form 2e-06 as opposed to 2.0e-06 as the presence of the decimal point keeps things legit.

@popsu
Copy link

popsu commented Mar 5, 2023

Seems that the issue is possibly in the underlying crate used json-deserializer. Which actually has a test case that looks weird, it's basically asserting 1e-10 should be Number::Integer(...):
https://github.com/jorgecarleitao/json-deserializer/blob/003cada719a3123ada049c11099c10d006c1e9df/tests/it/main.rs#L198-L211

So either

  • json-deserializer should change it to be parsed to Number::Float OR
  • arrow2 should check for the "exponent" whether it starts with zero. So change the function in
    fn infer_number(n: &Number) -> DataType {
    match n {
    Number::Float(..) => DataType::Float64,
    Number::Integer(..) => DataType::Int64,
    }
    }

    to something like this:
fn infer_number(n: &Number) -> DataType {
    match n {
        Number::Float(..) => DataType::Float64,
        Number::Integer(_, b) => {
            if !b.is_empty() && b[0] == b'-' {
                DataType::Float64
            } else {
                DataType::Int64
            }
    }
}

@DeliciousHair
Copy link
Author

DeliciousHair commented Mar 5, 2023

Will the sign check be enough though? That is, an IEEE 754 floating point is valid to 10^308 or something, which is larger than a 64 bit integer.

@DeliciousHair
Copy link
Author

Actually, fixing this in json-deserializer would be the way to go IMHO as the above pattern, while effective, is very counterintuitive as 2e-03 would not be considered an integer in any system I'm aware of, while 1e234 makes perfect sense as an integer. Whether or not it actually fits in the consumers dtype is a problem for them to worry about.

I've worked out a fix in json-deserializer that fixes this issue and the solution actually propagates to fix the original polars issue as well, so I'll create a PR in that repo shortly.

@DeliciousHair
Copy link
Author

fixed via jorgecarleitao/json-deserializer#20

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Mar 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants