BUG: Various issues with maybe_convert_objects #2845

Closed
stephenwlin opened this issue Feb 11, 2013 · 11 comments · Fixed by #2846

Comments

@stephenwlin (Contributor) commented Feb 11, 2013

Found some issues with lib.maybe_convert_objects (an internal library function, but the same problems occur when calling, for example, DataFrame({'a': [...]}))

(Also, mostly unrelated to the below, I noticed the inference loop continues for no reason after seeing an object, when it can safely break because the output is a foregone conclusion; I added the breaks to make it slightly more efficient.)
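The early-break idea can be illustrated with a minimal pure-Python sketch; the real routine lives in Cython in pandas' lib.pyx, so the function and names here are illustrative, not the actual implementation:

```python
# Illustrative sketch of the early-break optimization: once an element
# is seen that rules out any clean conversion, stop scanning instead of
# walking the rest of the array.
def infer_kind(values):
    seen_int = seen_float = False
    for v in values:
        if isinstance(v, bool):
            return "object"      # bools mixed with numerics stay object
        elif isinstance(v, int):
            seen_int = True
        elif isinstance(v, float):
            seen_float = True
        else:
            # The result is a foregone conclusion: break out early.
            return "object"
    if seen_float:
        return "float"
    if seen_int:
        return "int"
    return "object"
```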

Fails on a Python long integer outside the np.int64 range:
In [4]: arr = np.asarray([2**63], dtype=np.object_)

In [5]: p.lib.maybe_convert_objects(arr)
(exception)

It should probably leave the integer unchanged as a long integer object.
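Using numpy alone one can see why the overflow happens, and what the suggested fallback looks like (the "keep it as an object" behavior is the issue's proposal, not current behavior):

```python
import numpy as np

# 2**63 is one past the largest representable int64, so any cast to
# int64 must fail or wrap around.
assert np.iinfo(np.int64).max == 2**63 - 1

# An object array simply keeps the Python long/int untouched, which is
# the fallback this issue suggests maybe_convert_objects should use.
arr = np.asarray([2**63], dtype=np.object_)
assert arr.dtype == np.object_
assert arr[0] == 2**63
```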

Datetimes and complexes lost completely when mixed with booleans:

It's not even doing a cast: it's reading uninitialized memory as boolean

In [6]: arr = np.asarray([dt.datetime(2005, 1, 1), True])

In [7]: p.lib.maybe_convert_objects(arr, convert_datetime=1)
Out[7]: array([ True,  True], dtype=bool)

In [8]: arr = np.asarray([1.0+2.0j, True], dtype=np.object_)

In [9]: p.lib.maybe_convert_objects(arr)
Out[9]: array([ True,  True], dtype=bool)

None converts to np.nan with floats, but not with complexes:
In [10]: arr = np.asarray([1.0, None], dtype=np.object_)

In [11]: p.lib.maybe_convert_objects(arr)
Out[11]: array([  1.,  nan])

In [12]: arr = np.asarray([1.0+2.0j, None], dtype=np.object_)

In [13]: p.lib.maybe_convert_objects(arr)
Out[13]: array([(1+2j), None], dtype=object)
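For floats, None becomes np.nan; the analogous complex value exists too, so the same conversion could plausibly apply. A hand-rolled sketch of that missing conversion (an assumption about the desired behavior, not what pandas does here):

```python
import numpy as np

raw = [1.0 + 2.0j, None]
# Map None to a complex NaN by hand; complex(np.nan) is (nan+0j).
converted = np.array([complex(np.nan) if v is None else v for v in raw],
                     dtype=np.complex128)
assert converted.dtype == np.complex128
assert np.isnan(converted[1].real)
assert converted[0] == 1.0 + 2.0j
```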

safe option preserves integer types with np.nan but not None:

The comment for safe says "don't cast int to float, etc." but that does not hold when the array contains None.

In [14]: arr = np.asarray([1, 2.0, np.nan], dtype=np.object_)

In [15]: p.lib.maybe_convert_objects(arr, safe=1)
Out[15]: array([1, 2.0, nan], dtype=object)

In [16]: arr = np.asarray([1, 2.0, None], dtype=np.object_)

In [17]: p.lib.maybe_convert_objects(arr, safe=1)
Out[17]: array([  1.,   2.,  nan])

Not entirely sure if this last one is intended behavior or not; let me know if there is any feedback.
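One reason the two cases diverge: np.nan is itself a Python float, while None is not, so only the None case forces a genuine int-to-float cast. A small numpy-only demonstration of that distinction:

```python
import numpy as np

# np.nan is a float, so [1, 2.0, nan] already contains floats and keeping
# the object array under safe=1 loses nothing. None is not a float, so
# converting it to nan forces the ints into float64, contradicting the
# documented "don't cast int to float" intent of safe.
assert isinstance(np.nan, float)
assert not isinstance(None, float)

arr = np.asarray([1, 2.0, None], dtype=np.object_)
assert arr.dtype == np.object_   # before conversion, the int survives intact
assert type(arr[0]) is int
```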

stephenwlin added a commit to stephenwlin/pandas that referenced this issue Feb 11, 2013
@stephenwlin (Contributor, Author) commented Feb 11, 2013

Please provide any feedback on commit 0491971 if you can; I will create a PR (with appropriate tests for the new behavior) if there are no objections to the changes. All tests pass (at least on Linux 32-bit).

@jreback (Contributor) commented Feb 11, 2013

I would agree with all of these, except for the mixed datetimes-bools... admittedly that is weird, but it's essentially a mixed case, so it should just stay object (and not try to guess); in any event the user could always coerce this case via df.convert_objects(convert_dates='coerce')

The last one (casting with None) is an issue, I think; maybe search and see if you can find it?

Essentially, I think the philosophy (someone correct me if this is not right!) is that a homogeneous array should be typed correctly (and not object), while a mixed one should be left alone (be it as float or object or whatever). The user can then coerce explicitly via astype or convert_objects if needed (which as of 0.11 can coerce dates and numeric-like strings). In addition, unless dtypes are specified, numpy defaults are used (meaning int64 and float64, regardless of architecture)
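The explicit-coercion half of that philosophy can be shown with numpy alone (df.convert_objects is the pandas-era API named above; this sketch uses astype on a plain array instead):

```python
import numpy as np

# A homogeneous object array of numeric-like strings coerces cleanly
# and explicitly via astype, rather than by silent inference.
strs = np.array(['1', '2', '3'], dtype=np.object_)
ints = strs.astype(np.int64)
assert ints.dtype == np.int64
assert list(ints) == [1, 2, 3]

# Float literals land on the numpy default float64 regardless of platform.
assert np.array([1.0]).dtype == np.float64
```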

@stephenwlin (Contributor, Author) commented Feb 11, 2013

Found another one:

Booleans and datetimes lost when mixed with None and numerics

Also, uninitialized memory is being read as float

In [18]: arr = np.asarray([2.0, 1, True, None], dtype=np.object_)

In [19]: p.lib.maybe_convert_objects(arr)
Out[19]: array([  2.00000000e+000,   1.00000000e+000,   1.66954563e-256,  nan])

In [20]: arr = np.asarray([2.0, 1, dt.datetime(2006,1,1), None], dtype=np.object_)

In [21]: p.lib.maybe_convert_objects(arr, convert_datetime=1)
Out[21]: array([  2.00000000e+000,   1.00000000e+000,   1.57847506e-260,  nan])

@stephenwlin (Contributor, Author) commented Feb 11, 2013

@jreback what do you mean w.r.t. the mixed datetimes-bools? Right now it's taking the bools (and reading uninitialized memory by doing so); I've fixed it to keep the original objects instead

stephenwlin added a commit to stephenwlin/pandas that referenced this issue Feb 11, 2013
@jreback (Contributor) commented Feb 11, 2013

No... I mean your last 2 examples are inherently mixed (e.g. floats and bools)... so the result should be the same as the input, i.e. it should not change from object type

Only if ALL are that type (or the special cases of None or nan, which are converted for floats (and datetimes), but that is done in com._possibly_convert_datetimes)

Ahh... just reread your comments... yes, if it actually tries to do something on a mixed type, that is wrong

@stephenwlin (Contributor, Author) commented Feb 11, 2013

Ok, unless I'm confused, I think we agree then? In all the mixed cases presented, my new code just preserves the existing objects.

@jreback (Contributor) commented Feb 11, 2013

yes!

Essentially maybe_convert_objects should ONLY convert if it can do so cleanly. (I have sort of moved the coerce-type options to com._possibly_convert_objects, which, now that I think about it, should have a convert_bool option as well.)
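The "only convert if clean" rule can be sketched in a few lines of pure Python (illustrative only, not pandas internals): return a typed array only when every element fits one type, otherwise hand back the object array unchanged.

```python
import numpy as np

def convert_if_clean(values):
    # Exact type checks deliberately exclude bool (a subclass of int),
    # so bool-mixed inputs fall through to object.
    if values and all(type(v) is float for v in values):
        return np.array(values, dtype=np.float64)
    if values and all(type(v) is int for v in values):
        return np.array(values, dtype=np.int64)
    return np.array(values, dtype=np.object_)  # mixed: leave as object

assert convert_if_clean([1.0, 2.0]).dtype == np.float64
assert convert_if_clean([1, 2]).dtype == np.int64
assert convert_if_clean([1, True]).dtype == np.object_
```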

@jreback (Contributor) commented Feb 11, 2013

somewhat related is #2798

Series([None,None]) is currently object, I think, but maybe it should be coerced to the equivalent of Series([np.nan,np.nan])?

@stephenwlin (Contributor, Author) commented Feb 11, 2013

Actually I noticed that and made a change to address it locally, but it broke a test so I reverted it. The test was an HTML serialization/deserialization round-trip: basically, the change made "nan" show up in the output where a blank string was expected before. I suspect it affects a lot of other serialization/deserialization too, but that just isn't tested thoroughly in the test suite.

I don't know what the proper thing to do here is: if you had a bunch of string objects, for example, except that some of them were None, it would be odd for the Nones to become np.nan. Best would probably be a blank string, but that would require maybe_convert_objects to do a lot of extra deduction that will always be incomplete (since objects can be anything). In lieu of that, the current behavior (basically, only converting None to np.nan when there is at least one numeric and no non-numerics in the input) might be the best that can be done.

EDIT: I guess you could change the behavior so that the output is presumed numeric unless given something non-numeric and non-None, so Series([None,None]) would be floating with np.nans but Series(['',None,None]) would be objects with Nones. But that's kind of odd too.
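The heuristic described above (convert None to np.nan only when at least one numeric and no non-numerics are present) can be sketched as follows; the function name and structure are illustrative, not the pandas implementation:

```python
import numpy as np

def maybe_floatify(values):
    others = [v for v in values if v is not None]
    # Require at least one numeric element and no non-numerics;
    # bool is excluded since it is a subclass of int.
    if others and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in others):
        return np.array([np.nan if v is None else v for v in values],
                        dtype=np.float64)
    return np.array(values, dtype=np.object_)

assert maybe_floatify([1.0, None]).dtype == np.float64
assert maybe_floatify(['', None, None]).dtype == np.object_
assert maybe_floatify([None, None]).dtype == np.object_  # no numerics seen
```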

@jreback (Contributor) commented Feb 11, 2013

Yep... not really sure on these edge cases...

What we should definitely do, though:

if it is all numeric, None -> np.nan
Nones are converted to NaT (as are np.nans) in _possibly_convert_datetimes; not 100% sure Nones are handled there, though

bool is a harder case... I don't think None makes sense, but the concept of missing values does... so it's tricky.

Essentially these other cases are now left as object...

maybe_convert_objects shouldn't be too much more complicated, I think... we can do more handling at a higher level (in _possibly_convert_objects), as it can have more options on what to do
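Why bool is the hard case: numpy's fixed-width bool dtype has no slot for a missing value, so there is no bool analogue of np.nan or NaT. A quick demonstration of the constraint:

```python
import numpy as np

# Bools mixed with None have to stay object: the fixed-width bool dtype
# cannot represent a missing value.
arr = np.asarray([True, False, None], dtype=np.object_)
assert arr.dtype == np.object_

# Casting to bool would silently turn None into False: information loss.
assert bool(None) is False
```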

stephenwlin added a commit to stephenwlin/pandas that referenced this issue Feb 11, 2013
jreback added a commit that referenced this issue Feb 14, 2013
Merge pull request #2846 from stephenwlin/fix-maybe-convert-objects
BUG: Various issues with maybe_convert_objects (GH #2845)
thanks!
@jreback (Contributor) commented Feb 15, 2013

closed via #2846
