ENH: Enable read_csv interpret 'Infinity' as floating point value #10065 #28181

githeap · 2019-08-27T18:53:33Z

closes Enable read_csv to interpret "Infinity", "+Infinity" and "-Infinity" as floating point values #10065
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

TomAugspurger · 2019-08-27T19:17:47Z

We should be careful about changing these. The last time we changed things we got a decent amount of pushback about unintended behavior changes.

WillAyd · 2019-08-27T22:12:45Z

Yeah, I'd also be hesitant to do something like this. The code looks nice, but given this isn't really standardized, I would hate to start interpreting things that would really just be the string "Infinity" as a float value.

githeap · 2019-08-28T09:40:00Z

@WillAyd, when you say

this isn't really standardized

Do you mean there is no specification (e.g. IEEE 754) for this? De facto, it is quite common to treat "Infinity" as an infinite value.

Python does it

assert float("Infinity") == float("inf")
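As a quick check of the Python claim, here is a minimal sketch (standard library only) of which spellings `float()` accepts. The token lists and the helper name `parses_as_infinity` are illustrative assumptions, not taken from the pandas test suite.

```python
import math

# Illustrative tokens (an assumption, not the pandas test data).
accepted = ["Infinity", "+Infinity", "-Infinity", "inf", "-INF"]
rejected = ["Infinit", "Infinityy", "∞"]


def parses_as_infinity(text):
    """Return True if float() turns `text` into +inf or -inf."""
    try:
        return math.isinf(float(text))
    except ValueError:
        # float() rejects anything that is not a valid float literal.
        return False


for token in accepted + rejected:
    print(repr(token), parses_as_infinity(token))
```

Python matches these spellings case-insensitively and allows an optional sign, but it rejects truncated or extended forms.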

This is the same reason pull request #13274 was accepted.

NumPy has a similar function, numpy.genfromtxt, that interprets "Infinity" as inf:

In [1]: import numpy as np
In [2]: from io import StringIO
In [3]: data = '''
   ...: Infinity
   ...: +Infinity
   ...: -Infinity'''
In [4]: np.genfromtxt(StringIO(data))
Out[4]: array([ inf,  inf, -inf])

Fortran writes ∞ like that in formatted output (tested with GNU Fortran and Intel Fortran).
This might be the reason for this issue.
Java does the same https://onlinegdb.com/H12ipZwHB

public class Main{
     public static void main(String []args){
        System.out.println(Double.POSITIVE_INFINITY);
        System.out.println(Double.NEGATIVE_INFINITY);
     }
}

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Double.html#toString(double)
If m is infinity, it is represented by the characters "Infinity"; thus, positive infinity produces the result "Infinity" and negative infinity produces the result "-Infinity".

I would hate to start interpreting things that would really just be the string "Infinity" as a float value

Ok, let it be string by default, as it is now. But what if I expect floating point values?

pd.read_csv(fname, dtype=float)

This raises a TypeError and a ValueError. Could we make it work in this case?

I realize my fix might affect performance. Maybe there is a better solution. In #10065 (comment), @jreback suggested

to short-circuit with a strncasecmp (to only compare first n characters), then to compare versus a hash table of allowed values.

Should the hash table be a local variable in the _try_double function? Is it more expensive to allocate/deallocate it on each function call? It would be nice if someone could explain this solution in more detail.
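To make the suggestion concrete, here is a rough Python sketch of the idea: a cheap case-insensitive prefix check (standing in for `strncasecmp`), followed by a lookup in a table of allowed spellings. This is not the pandas parser code; the names `_INF_SPELLINGS` and `parse_infinity` are hypothetical. The table is built once at module level, which is one way to avoid allocating it on every call.

```python
# Built once at import time, so there is no per-call allocation cost.
# (Hypothetical table; the spellings pandas accepts may differ.)
_INF_SPELLINGS = frozenset(
    {"inf", "+inf", "-inf", "infinity", "+infinity", "-infinity"}
)


def parse_infinity(token):
    """Return +inf or -inf if `token` spells infinity, else None."""
    s = token.strip()
    # Skip an optional sign before the prefix check.
    body = s[1:] if s and s[0] in "+-" else s
    # Short-circuit: compare only the first three characters,
    # case-insensitively, similar in spirit to strncasecmp(body, "inf", 3).
    if body[:3].lower() != "inf":
        return None
    # Full check against the table of allowed values.
    if s.lower() not in _INF_SPELLINGS:
        return None
    return float("-inf") if s[0] == "-" else float("inf")
```

Most tokens in a numeric column fail the three-character check immediately, so the set lookup only runs for the rare candidates. In C, the equivalent of the module-level constant would be a static table rather than a local built inside the function.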

Finally, if the pandas team decides to mark this issue Won't fix, maybe the documentation could state that read_csv doesn't interpret Infinity as ∞.

WillAyd · 2019-08-28T15:10:20Z

Hmm OK - thanks for the clarifications. Can you run the benchmarks in asv_bench/benchmarks/io/csv.py and see if this has an impact?

githeap · 2019-08-28T20:41:23Z

Sorry, I missed one file. The tests pass now. I ran the benchmarks (before fixing formatting):

asv continuous -E virtualenv --bench ^io master fix-#10065

The only results highlighted in red are these:

     before           after         ratio
     [041b6b18]       [059808a0]
     <master>         <fix-#10065>
+      22.2±0.2ms         25.9±3ms     1.16  io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
+      35.0±0.3ms         39.2±2ms     1.12  io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('10000')

Is this caused by my fix or some other commit?
Full logs:
results.zip
asv_full_output.txt
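One rough way to judge whether these two regressions are signal or noise is to compare each slowdown with the spread asv reports. The sketch below recomputes the ratios from the rounded figures in the table above. Treating the "±" value as a noise estimate is an assumption about how to read asv's output, and the helper name `summarize` is hypothetical.

```python
# (before_ms, after_ms, after_spread_ms), copied from the rounded table.
rows = {
    "time_check_datetimes('0.0')": (22.2, 25.9, 3.0),
    "time_check_datetimes('10000')": (35.0, 39.2, 2.0),
}


def summarize(before, after, spread):
    """Return (ratio, delta_ms, delta expressed in units of the spread)."""
    delta = after - before
    return after / before, delta, delta / spread


summary = {name: summarize(*values) for name, values in rows.items()}
for name, (ratio, delta, n_spreads) in summary.items():
    print(f"{name}: ratio={ratio:.3f}, delta={delta:.1f} ms, "
          f"{n_spreads:.1f}x the reported spread")
```

Because the inputs are rounded, the recomputed ratios can differ from asv's ratio column in the last digit. A slowdown of only one or two times the reported spread is weak evidence on its own, so rerunning the benchmark would be a reasonable next step.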

So, basically, my fix is the same as what @gfyoung proposed in #13274 (closed in commit da5fc17), only for Infinity.

gfyoung

These changes look good to me

WillAyd · 2019-08-30T15:06:28Z

So 15% increase on that one test - do we think that manifests itself typically from an end user perspective in any way? If not lgtm as well, just want to be careful of that

doc/source/whatsnew/v0.25.2.rst

jreback

lgtm. @WillAyd @TomAugspurger

WillAyd · 2019-09-02T23:52:21Z

Thanks @githeap - great change

ENH: Enable read_csv interpret 'Infinity' as floating point value (is…

28ac8e4

…sue pandas-dev#10065)

githeap mentioned this pull request Aug 27, 2019

bug in Pandas when reading files with 'Infinity' erwanp/qtplaskin#6

Closed

4 tasks

WillAyd added the IO CSV read_csv, to_csv label Aug 28, 2019

githeap added 2 commits August 28, 2019 21:03

Add missing files

059808a

Fix formatting

2592575

gfyoung approved these changes Aug 29, 2019

WillAyd requested changes Aug 30, 2019

githeap added 2 commits August 30, 2019 22:52

Move whatsnew to v1.0.0

215477f

Clarify test case

a77f708

jreback added this to the 1.0 milestone Sep 2, 2019

jreback approved these changes Sep 2, 2019

WillAyd approved these changes Sep 2, 2019

WillAyd merged commit 15eb9ca into pandas-dev:master Sep 2, 2019

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: Enable read_csv interpret 'Infinity' as floating point value pan…

b767996

…das-dev#10065 (pandas-dev#28181)

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: Enable read_csv interpret 'Infinity' as floating point value pan…

b0cbf0a

…das-dev#10065 (pandas-dev#28181)
