BUG: Various issues with maybe_convert_objects #2845

Closed
stephenwlin opened this issue Feb 11, 2013 · 11 comments · Fixed by #2846

Comments

@stephenwlin (Contributor) commented Feb 11, 2013

Found some issues with lib.maybe_convert_objects (an internal library function, but the same problems occur when calling, for example, DataFrame({'a': [...]}))

(Also, mostly unrelated to the below, I noticed the inference loop continues for no reason after seeing an object, when it can safely break because the output is a foregone conclusion; I added the breaks to make it slightly more efficient.)
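The early-break idea can be illustrated with a minimal pure-Python sketch; the real routine lives in Cython in pandas' lib.pyx, so the function and names here are illustrative, not the actual implementation:

```python
# Illustrative sketch of the early-break optimization: once an element
# is seen that rules out any clean conversion, stop scanning instead of
# walking the rest of the array.
def infer_kind(values):
    seen_int = seen_float = False
    for v in values:
        if isinstance(v, bool):
            return "object"      # bools mixed with numerics stay object
        elif isinstance(v, int):
            seen_int = True
        elif isinstance(v, float):
            seen_float = True
        else:
            # The result is a foregone conclusion: break out early.
            return "object"
    if seen_float:
        return "float"
    if seen_int:
        return "int"
    return "object"
```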

Fails on a Python long integer outside the np.int64 range:
In [4]: arr = np.asarray([2**63], dtype=np.object_)

In [5]: p.lib.maybe_convert_objects(arr)
(exception)

It should probably leave the integer unchanged as a long integer object.
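Using numpy alone one can see why the overflow happens, and what the suggested fallback looks like (the "keep it as an object" behavior is the issue's proposal, not current behavior):

```python
import numpy as np

# 2**63 is one past the largest representable int64, so any cast to
# int64 must fail or wrap around.
assert np.iinfo(np.int64).max == 2**63 - 1

# An object array simply keeps the Python long/int untouched, which is
# the fallback this issue suggests maybe_convert_objects should use.
arr = np.asarray([2**63], dtype=np.object_)
assert arr.dtype == np.object_
assert arr[0] == 2**63
```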

Datetimes and complexes lost completely when mixed with booleans:

It's not even doing a cast: it's reading uninitialized memory as boolean

In [6]: arr = np.asarray([dt.datetime(2005, 1, 1), True])

In [7]: p.lib.maybe_convert_objects(arr, convert_datetime=1)
Out[7]: array([ True,  True], dtype=bool)

In [8]: arr = np.asarray([1.0+2.0j, True], dtype=np.object_)

In [9]: p.lib.maybe_convert_objects(arr)
Out[9]: array([ True,  True], dtype=bool)

None converts to np.nan with floats, but not with complexes:
In [10]: arr = np.asarray([1.0, None], dtype=np.object_)

In [11]: p.lib.maybe_convert_objects(arr)
Out[11]: array([  1.,  nan])

In [12]: arr = np.asarray([1.0+2.0j, None], dtype=np.object_)

In [13]: p.lib.maybe_convert_objects(arr)
Out[13]: array([(1+2j), None], dtype=object)
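For floats, None becomes np.nan; the analogous complex value exists too, so the same conversion could plausibly apply. A hand-rolled sketch of that missing conversion (an assumption about the desired behavior, not what pandas does here):

```python
import numpy as np

raw = [1.0 + 2.0j, None]
# Map None to a complex NaN by hand; complex(np.nan) is (nan+0j).
converted = np.array([complex(np.nan) if v is None else v for v in raw],
                     dtype=np.complex128)
assert converted.dtype == np.complex128
assert np.isnan(converted[1].real)
assert converted[0] == 1.0 + 2.0j
```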

safe option preserves integer types with np.nan but not None:

The comment for safe says "don't cast int to float, etc." but that does not hold when the array contains None.

In [14]: arr = np.asarray([1, 2.0, np.nan], dtype=np.object_)

In [15]: p.lib.maybe_convert_objects(arr, safe=1)
Out[15]: array([1, 2.0, nan], dtype=object)

In [16]: arr = np.asarray([1, 2.0, None], dtype=np.object_)

In [17]: p.lib.maybe_convert_objects(arr, safe=1)
Out[17]: array([  1.,   2.,  nan])

Not entirely sure if this last one is intended behavior or not; let me know if there is any feedback.
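One reason the two cases diverge: np.nan is itself a Python float, while None is not, so only the None case forces a genuine int-to-float cast. A small numpy-only demonstration of that distinction:

```python
import numpy as np

# np.nan is a float, so [1, 2.0, nan] already contains floats and keeping
# the object array under safe=1 loses nothing. None is not a float, so
# converting it to nan forces the ints into float64, contradicting the
# documented "don't cast int to float" intent of safe.
assert isinstance(np.nan, float)
assert not isinstance(None, float)

arr = np.asarray([1, 2.0, None], dtype=np.object_)
assert arr.dtype == np.object_   # before conversion, the int survives intact
assert type(arr[0]) is int
```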

stephenwlin added a commit to stephenwlin/pandas that referenced this issue Feb 11, 2013
@stephenwlin (Contributor, Author) commented Feb 11, 2013

Please provide any feedback on commit 0491971 if you can; I will create a PR (with appropriate tests for the new behavior) if there are no objections to the changes. All tests pass (at least on Linux 32-bit).

@jreback (Contributor) commented Feb 11, 2013

I would agree with all of these, except for the mixed datetimes-bools... admittedly that is weird, but it's essentially a mixed case, so it should just stay object (and not try to guess); in any event the user could always coerce this case via df.convert_objects(convert_dates='coerce')

The last one (casting with None) is an issue, I think; maybe search and see if you can find it?

Essentially, I think the philosophy (someone correct me if this is not right!) is that a homogeneous array should be typed correctly (and not object), while a mixed one should be left alone (be it as float or object or whatever). The user can then coerce explicitly via astype or convert_objects if needed (which as of 0.11 can coerce dates and numeric-like strings). In addition, unless dtypes are specified, numpy defaults are used (meaning int64 and float64, regardless of architecture)
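The explicit-coercion half of that philosophy can be shown with numpy alone (df.convert_objects is the pandas-era API named above; this sketch uses astype on a plain array instead):

```python
import numpy as np

# A homogeneous object array of numeric-like strings coerces cleanly
# and explicitly via astype, rather than by silent inference.
strs = np.array(['1', '2', '3'], dtype=np.object_)
ints = strs.astype(np.int64)
assert ints.dtype == np.int64
assert list(ints) == [1, 2, 3]

# Float literals land on the numpy default float64 regardless of platform.
assert np.array([1.0]).dtype == np.float64
```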

@stephenwlin (Contributor, Author) commented Feb 11, 2013

Found another one:

Booleans and datetimes lost when mixed with None and numerics

Also, uninitialized memory is being read as float

In [18]: arr = np.asarray([2.0, 1, True, None], dtype=np.object_)

In [19]: p.lib.maybe_convert_objects(arr)
Out[19]: array([  2.00000000e+000,   1.00000000e+000,   1.66954563e-256,  nan])

In [20]: arr = np.asarray([2.0, 1, dt.datetime(2006,1,1), None], dtype=np.object_)

In [21]: p.lib.maybe_convert_objects(arr, convert_datetime=1)
Out[21]: array([  2.00000000e+000,   1.00000000e+000,   1.57847506e-260,  nan])

@stephenwlin (Contributor, Author) commented Feb 11, 2013

@jreback what do you mean w.r.t. the mixed datetimes-bools? Right now it's taking the bools (and reading uninitialized memory by doing so); I've fixed it to keep the original objects instead

stephenwlin added a commit to stephenwlin/pandas that referenced this issue Feb 11, 2013
@jreback (Contributor) commented Feb 11, 2013

No... I mean your last 2 examples are inherently mixed (e.g. floats and bools)... so the result should be the same as the input, i.e. it should not change from object type

Only if ALL are that type (or the special cases of None or nan, which are converted for floats (and datetimes), but that is done in com._possibly_convert_datetimes)

Ahh... just reread your comments... yes, if it actually tries to do something on a mixed type, that is wrong

@stephenwlin (Contributor, Author) commented Feb 11, 2013

Ok, unless I'm confused, I think we agree then? In all the mixed cases presented, my new code just preserves the existing objects.

@jreback (Contributor) commented Feb 11, 2013

yes!

Essentially maybe_convert_objects should ONLY convert if it can do so cleanly. (I have sort of moved the coerce-type options to com._possibly_convert_objects, which, now that I think about it, should have a convert_bool option as well.)
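The "only convert if clean" rule can be sketched in a few lines of pure Python (illustrative only, not pandas internals): return a typed array only when every element fits one type, otherwise hand back the object array unchanged.

```python
import numpy as np

def convert_if_clean(values):
    # Exact type checks deliberately exclude bool (a subclass of int),
    # so bool-mixed inputs fall through to object.
    if values and all(type(v) is float for v in values):
        return np.array(values, dtype=np.float64)
    if values and all(type(v) is int for v in values):
        return np.array(values, dtype=np.int64)
    return np.array(values, dtype=np.object_)  # mixed: leave as object

assert convert_if_clean([1.0, 2.0]).dtype == np.float64
assert convert_if_clean([1, 2]).dtype == np.int64
assert convert_if_clean([1, True]).dtype == np.object_
```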

@jreback (Contributor) commented Feb 11, 2013

somewhat related is #2798

Series([None,None]) is currently object, I think, but maybe it should be coerced to the equivalent of Series([np.nan,np.nan])?

@stephenwlin (Contributor, Author) commented Feb 11, 2013

Actually I noticed that and made a change to address it locally, but it broke a test so I reverted it. The test was an HTML serialization/deserialization round-trip: basically, the change made "nan" show up in the output where a blank string was expected before. I suspect it affects a lot of other serialization/deserialization too, but that just isn't tested thoroughly in the test suite.

I don't know what the proper thing to do here is: if you had a bunch of string objects, for example, except that some of them were None, it would be odd for the Nones to become np.nan. Best would probably be a blank string, but that would require maybe_convert_objects to do a lot of extra deduction that will always be incomplete (since objects can be anything). In lieu of that, the current behavior (basically, only converting None to np.nan when there is at least one numeric and no non-numerics in the input) might be the best that can be done.

EDIT: I guess you could change the behavior so that the output is presumed numeric unless given something non-numeric and non-None, so Series([None,None]) would be floating with np.nans but Series(['',None,None]) would be objects with Nones. But that's kind of odd too.
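The heuristic described above (convert None to np.nan only when at least one numeric and no non-numerics are present) can be sketched as follows; the function name and structure are illustrative, not the pandas implementation:

```python
import numpy as np

def maybe_floatify(values):
    others = [v for v in values if v is not None]
    # Require at least one numeric element and no non-numerics;
    # bool is excluded since it is a subclass of int.
    if others and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in others):
        return np.array([np.nan if v is None else v for v in values],
                        dtype=np.float64)
    return np.array(values, dtype=np.object_)

assert maybe_floatify([1.0, None]).dtype == np.float64
assert maybe_floatify(['', None, None]).dtype == np.object_
assert maybe_floatify([None, None]).dtype == np.object_  # no numerics seen
```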

@jreback (Contributor) commented Feb 11, 2013

Yep... not really sure on these edge cases...

What we should definitely do, though:

if it is all numeric, None -> np.nan
Nones are converted to NaT (as are np.nans) in _possibly_convert_datetimes; not 100% sure Nones are handled there, though

bool is a harder case... I don't think None makes sense, but the concept of missing values does... so it's tricky.

Essentially these other cases are now left as object...

maybe_convert_objects shouldn't be too much more complicated, I think... we can do more handling at a higher level (in _possibly_convert_objects), as it can have more options on what to do
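Why bool is the hard case: numpy's fixed-width bool dtype has no slot for a missing value, so there is no bool analogue of np.nan or NaT. A quick demonstration of the constraint:

```python
import numpy as np

# Bools mixed with None have to stay object: the fixed-width bool dtype
# cannot represent a missing value.
arr = np.asarray([True, False, None], dtype=np.object_)
assert arr.dtype == np.object_

# Casting to bool would silently turn None into False: information loss.
assert bool(None) is False
```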

stephenwlin added a commit to stephenwlin/pandas that referenced this issue Feb 11, 2013
jreback added a commit that referenced this issue Feb 14, 2013
Merge pull request #2846 from stephenwlin/fix-maybe-convert-objects
BUG: Various issues with maybe_convert_objects (GH #2845)
thanks!
@jreback (Contributor) commented Feb 15, 2013

closed via #2846
