Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Enable read_csv interpret 'Infinity' as floating point value #10065 #28181

Merged
merged 5 commits into from
Sep 2, 2019

Conversation

githeap
Copy link
Contributor

@githeap githeap commented Aug 27, 2019

@TomAugspurger
Copy link
Contributor

We should be careful about changing these. The last time we changed things we got a decent about of pushback about unintended behavior changes.

@WillAyd
Copy link
Member

WillAyd commented Aug 27, 2019

Yea I'd also be hesitant to do something like this. Code looks nice but given this isn't really standardized I would hate to start interpreting things that would really just be the string "Infinity" as a float value

@githeap
Copy link
Contributor Author

githeap commented Aug 28, 2019

@WillAyd, when you say

this isn't really standardized

Do you mean there is no specification (e.g. IEEE754) for this? De-facto, it is quite common to treat "Infinity" as infinite value

  • Python does it
assert float("Infinity") == float("inf")

The same thing why this pull-request #13274 was accepted.

  • Numpy has similar function numpy.genfromtxt that interprets "Infinity" as inf
In [1]: import numpy as np                                                                                            
In [2]: from io import StringIO                                                                                       
In [3]: data =''' 
   ...: Infinity 
   ...: +Infinity 
   ...: -Infinity'''                                                                                                  
In [4]: np.genfromtxt(StringIO(data))                                                                                 
Out[4]: array([ inf,  inf, -inf])
  • Fortran writes ∞ like that in formatted output (tested with GNU Fortran and Intel Fortran).
    This might be the reason for this issue.

  • Java does the same https://onlinegdb.com/H12ipZwHB

public class Main{
     public static void main(String []args){
        System.out.println(Double.POSITIVE_INFINITY);
        System.out.println(Double.NEGATIVE_INFINITY);
     }
}

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Double.html#toString(double)
If m is infinity, it is represented by the characters "Infinity"; thus, positive infinity produces the result "Infinity" and negative infinity produces the result "-Infinity".

I would hate to start interpreting things that would really just be the string "Infinity" as a float value

Ok, let it be string by default, as it is now. But what if I expect floating point values?

pd.read_csv(fname, dtype=float)

This throws a TypeError and ValueError. Could we make it work in this case?

I realize, my fix might affect performance. Maybe, there is a better solution. In #10065 (comment) @jreback suggested

to short-circuit with a strncasecmp (to only compare first n characters), then to compare versus a hash table of allowed values.

Should hash table be a local variable in _try_double function? Is it more expensive to allocate/deallocate it for each function call ? It would be nice, if someone could explain this solution in more detail.

Finally, if pandas team decides to mark this issue Won't fix, maybe it can be specified in documentation, that read_csv doesn't interpret Infinity as ∞.

@WillAyd
Copy link
Member

WillAyd commented Aug 28, 2019

Hmm OK - thanks for the clarifications. Can you run the benchmarks in asv_bench/benchmarks/io/csv.py and see if this has an impact?

@WillAyd WillAyd added the IO CSV read_csv, to_csv label Aug 28, 2019
@githeap
Copy link
Contributor Author

githeap commented Aug 28, 2019

Sorry, I missed one file. Now it passes tests. I ran performance testing. (Before fixing formatting)

asv continuous -E virtualenv --bench ^io master fix-#10065

In red color there is only this

     before           after         ratio
     [041b6b18]       [059808a0]
     <master>         <fix-#10065>
+      22.2±0.2ms         25.9±3ms     1.16  io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
+      35.0±0.3ms         39.2±2ms     1.12  io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('10000')

Is this caused by my fix or some other commit?
Full logs:
results.zip
asv_full_output.txt

So, basically, my fix is the same as what @gfyoung proposed in #13274 (closed in commit da5fc17) only for Infinity

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me

@WillAyd
Copy link
Member

WillAyd commented Aug 30, 2019

So 15% increase on that one test - do we think that manifests itself typically from an end user perspective in any way? If not lgtm as well, just want to be careful of that

doc/source/whatsnew/v0.25.2.rst Outdated Show resolved Hide resolved
@jreback jreback added this to the 1.0 milestone Sep 2, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd WillAyd merged commit 15eb9ca into pandas-dev:master Sep 2, 2019
@WillAyd
Copy link
Member

WillAyd commented Sep 2, 2019

Thanks @githeap - great change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable read_csv to interpret "Infinity", "+Infinity" and "-Infinity" as floating point values
5 participants