BUG: read_csv fails with uint64 #14983

Closed
gfyoung opened this Issue Dec 25, 2016 · 2 comments

gfyoung commented Dec 25, 2016

master at aba7d2:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a\n' + str(2**63)
>>>
>>> read_csv(StringIO(data), engine='c').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes
>>>
>>> read_csv(StringIO(data), engine='python').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
a    1 non-null object
dtypes: object(1)
memory usage: 88.0+ bytes

We should be able to handle uint64, and tests like this one here should not be enforcing buggy behavior.

The buggy behavior for the C engine traces to here, where we attempt to cast according to the order defined here. Note for starters that uint64 is not in that list. The try-except there exists because of the OverflowError raised for int64, after which we immediately convert to an object array of strings. At first, I thought inserting uint64 into the list would be enough, but that can cause bad casting in the other direction, i.e. negative numbers get converted to their uint64 equivalents.
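To illustrate the "bad casting in the other direction" concern, here is a minimal sketch using only the standard library. It is not the parser code; `reinterpret_as_uint64` is a hypothetical helper that mimics what happens when the bits of a signed 64-bit integer are read back as unsigned.

```python
import struct


def reinterpret_as_uint64(value):
    """Hypothetical helper: read the 8 bytes of an int64 as a uint64."""
    # Pack as a signed 64-bit integer, then unpack the same bytes as unsigned.
    return struct.unpack("<Q", struct.pack("<q", value))[0]


# Non-negative values survive unchanged.
print(reinterpret_as_uint64(5))   # 5

# Negative values wrap around to very large unsigned values.
print(reinterpret_as_uint64(-1))  # 18446744073709551615, i.e. 2**64 - 1
```

A column holding both `-1` and `2**63` therefore has no lossless 64-bit integer representation, which is why blindly adding uint64 to the cast order is not safe.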

The buggy behavior for the Python engine traces to here, where we attempt to infer the dtype here. However, as I pointed out in #14982, this function fails with uint64 because of a similar (and nonsensical) try-except for OverflowError in int64.

The questions that I posed in #14982 are also relevant here, since the behavior should be consistent across both engines and also performant. Patching the Python engine probably requires fixing #14982 first, and patching the C engine probably requires adding new functions to parser.pyx and tokenizer.c to parse uint64. However, in light of the questions that I posed in #14982, I'm not really sure what is best.

jreback commented Dec 26, 2016

I think the response in #14982 answers this. The key idea is to make sure this is performant, though.

@jreback jreback added this to the Next Major Release milestone Dec 26, 2016

jreback commented Dec 26, 2016

Also, in this issue we can make sure that passing dtype='uint64' works properly (e.g. explicit user casting).
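As a rough check of the explicit-casting case, the sketch below assumes a pandas version that already includes the fix for this issue (the milestone suggests 0.20.0 or later) and uses `io.StringIO` rather than `pandas.compat.StringIO`. The column name and data simply mirror the report above.

```python
from io import StringIO

import pandas as pd

# Same data as in the report: one value just above the int64 maximum.
data = "a\n" + str(2**63)

# Ask for uint64 explicitly instead of relying on dtype inference.
df = pd.read_csv(StringIO(data), dtype="uint64")

print(df["a"].dtype)         # expected to be uint64 once the fix is in place
print(int(df["a"].iloc[0]))  # expected to round-trip as 2**63
```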

gfyoung added a commit to gfyoung/pandas that referenced this issue Dec 28, 2016

BUG: Convert uint64 in maybe_convert_numeric
Add handling for uint64 elements in an array
with the following behavior specifications:

1) If uint64 and NaN are both detected, the
original input will be returned if coerce_numeric
is False. Otherwise, an Exception is raised.

2) If uint64 and negative numbers are both
detected, the original input will be returned if
coerce_numeric is False. Otherwise, an
Exception is raised.

Closes gh-14982.
Partial fix for gh-14983.
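The two rules in this commit message can be sketched in plain Python. This is only an illustration of the stated behavior, not the code in `maybe_convert_numeric`: the function name `resolve_integer_column`, the string return values, and the choice of `ValueError` as the exception are all assumptions, and the sketch assumes every integer fits in 64 bits.

```python
import math

INT64_MAX = 2**63 - 1


def resolve_integer_column(values, coerce_numeric=False):
    """Hypothetical helper: choose a target dtype for ints mixed with NaN.

    Returns 'int64', 'uint64', 'float64', or 'object' (meaning the
    original input would be returned unconverted).
    """
    has_nan = any(isinstance(v, float) and math.isnan(v) for v in values)
    ints = [v for v in values if isinstance(v, int)]
    needs_uint64 = any(v > INT64_MAX for v in ints)
    has_negative = any(v < 0 for v in ints)

    # Rules 1 and 2: uint64 cannot coexist with NaN or negative numbers.
    if needs_uint64 and (has_nan or has_negative):
        if coerce_numeric:
            raise ValueError("uint64 values mixed with NaN or negatives")
        return "object"
    if needs_uint64:
        return "uint64"
    # NaN without uint64 falls back to float; otherwise int64 suffices.
    return "float64" if has_nan else "int64"


print(resolve_integer_column([1, 2**63]))             # uint64
print(resolve_integer_column([-1, 2**63]))            # object
print(resolve_integer_column([float("nan"), 2**63]))  # object
```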

jreback added a commit that referenced this issue Dec 30, 2016

BUG: Convert uint64 in maybe_convert_numeric
Add handling for `uint64` elements in an array with the following
behavior specifications:

1) If `uint64` and `NaN` are both detected, the original input will be
returned if `coerce_numeric` is `False`. Otherwise, an `Exception` is
raised.

2) If `uint64` and negative numbers are both detected, the original
input will be returned if `coerce_numeric` is `False`. Otherwise, an
`Exception` is raised.

Closes #14982. Partial fix for #14983.

Author: gfyoung <gfyoung17@gmail.com>

Closes #15005 from gfyoung/maybe-convert-numeric-uint64 and squashes the following commits:

c3bd28a [gfyoung] BUG: Convert uint64 in maybe_convert_numeric

gfyoung added a commit to gfyoung/pandas that referenced this issue Dec 31, 2016

BUG: Parse uint64 in read_csv
Adds behavior to allow for parsing of
uint64 data in read_csv. Also ensures
that they are properly handled along
with NaN and negative values.

Closes gh-14983.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Jan 2, 2017

@jreback jreback closed this in #15020 Jan 2, 2017

jreback added a commit that referenced this issue Jan 2, 2017

BUG: Parse uint64 in read_csv (#15020)
Adds behavior to allow for parsing of
uint64 data in read_csv. Also ensures
that they are properly handled along
with NaN and negative values.

Closes gh-14983.