
API: .convert_objects is deprecated, do we want a .convert to replace? #11221

Closed
jreback opened this issue Oct 2, 2015 · 46 comments

Comments

@jreback (Contributor) commented Oct 2, 2015

xref #11173

or IMHO simply replace it by use of pd.to_datetime, pd.to_timedelta, pd.to_numeric.

Having an auto-guesser is OK, but when you try to forcefully coerce, things can easily go awry.
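For concreteness, the explicit 1-d converters behave like this (a minimal sketch; the sample data is made up):

```python
import pandas as pd

# explicit, per-Series conversion: unparseable values become NaN/NaT
num = pd.to_numeric(pd.Series(["1", "2", "bad"]), errors="coerce")
print(num.tolist())           # 'bad' -> NaN, dtype float64

dates = pd.to_datetime(pd.Series(["2015-10-02", "not a date"]), errors="coerce")
print(dates.isna().tolist())  # the unparseable entry -> NaT
```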

@jreback added this to the Next Major Release milestone Oct 2, 2015

@jreback (Contributor, Author) commented Oct 2, 2015

cc @bashtage @jorisvandenbossche @shoyer @TomAugspurger @sinhrks

@bashtage (Contributor) commented Oct 2, 2015

There is already _convert which could be promoted.

On Fri, Oct 2, 2015, 10:16 AM Jeff Reback wrote:

cc @bashtage @jorisvandenbossche @shoyer @TomAugspurger @sinhrks

@bashtage (Contributor) commented Oct 2, 2015

The advantage of a well-designed convert is that it works on DataFrames. All of the to_* functions are only for 1-d types.

@jreback (Contributor, Author) commented Oct 2, 2015

@bashtage oh I agree.

The problem is with coercion: you basically have to avoid partially auto-coercing things, and so leave ambiguous cases up to the user (via a 1-d use of the pd.to_* functions). But assuming we do that, then yes, you could make it work.

@bashtage (Contributor) commented Oct 2, 2015

I was just thinking of the case where I imported data that should be numeric into a DF, but it has some mixed characters, and I want just numbers or NaNs. This type of conversion is what I ultimately wanted when I started looking at convert_objects, and I was surprised that asking to coerce a column of all strings didn't coerce it to NaN.

@jreback (Contributor, Author) commented Oct 2, 2015

but the problem is that a mixed boolean/NaN column is ambiguous (so maybe we just need to 'handle' that)

@jorisvandenbossche (Member) commented Oct 2, 2015

Some comments/observations:

  • I actually like convert_objects more than convert, because it more clearly says what it does: try to convert object-dtyped columns to a builtin dtype (convert is rather general).

  • If we decide that we want something like the current convert_objects functionality, I don't really see a reason to deprecate convert_objects for a new convert. I think it should be technically possible to deprecate the old keywords (and not the function) in favor of new keywords (actually the original approach in the reverted PR).

  • I think the functionality of convert_objects is useful (as already stated above: that you can do something like to_datetime/to_numeric/.. on dataframes). Using the to_.. functions on each series separately will always be the preferable solution for robust code, but as long as convert_objects is very clearly defined (now there are some strange inconsistencies), I think it is useful to have this. It would be very nice if this could just be implemented in terms of the to_.. methods.
    A bit simplified in pseudo code:

    def convert_objects(self, numeric=False, datetime=False, timedelta=False, coerce=False):
        for col in self.columns:
            if numeric:
                self[col] = pd.to_numeric(self[col], coerce=coerce)
            elif datetime:
                self[col] = pd.to_datetime(self[col], coerce=coerce)
            elif timedelta:
                self[col] = pd.to_timedelta(self[col], coerce=coerce)
    
  • But, the main problem with this is: the reason convert_objects is useful now, is precisely because it has an extra 'rule' that the to_.. methods don't have: only convert the column if there is at least one value that can be converted.
    This is the reason that something like this works:

    In [2]: df = pd.DataFrame({'int_str':['1', '2'], 'real_str':['a', 'b']})
    
    In [3]: df.convert_objects(convert_numeric=True)
    Out[3]:
       int_str real_str
    0        1        a
    1        2        b
    
    In [4]: df.convert_objects(convert_numeric=True).dtypes
    Out[4]:
    int_str      int64
    real_str    object
    dtype: object
    

    and does not give:

    Out[3]:
       int_str   real_str
    0        1        NaN
    1        2        NaN
    

    which would not be really useful (although maybe more predictable). The fact that it is not always coerced to NaNs was considered a bug, for which @bashtage did a PR (and for to_numeric, it is also logical that it returns NaNs). But this made convert_objects less useful (so it was reverted in the end).
    So I think that in this case, we will have to deviate from the to_.. behaviour.
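The 'only convert a column if at least one value is convertible' rule described above can be sketched with the public API alone (soft_convert_numeric is a hypothetical helper name, not a pandas function):

```python
import pandas as pd

def soft_convert_numeric(df):
    # coerce an object column to numeric only when at least one value
    # actually converts, so fully non-numeric columns are left alone
    # instead of becoming all-NaN
    out = df.copy()
    for col in out.columns:
        if out[col].dtype != object:
            continue
        converted = pd.to_numeric(out[col], errors="coerce")
        if converted.notna().any():
            out[col] = converted
    return out

df = pd.DataFrame({"int_str": ["1", "2"], "real_str": ["a", "b"]})
result = soft_convert_numeric(df)
print(result.dtypes)  # int_str -> int64, real_str stays object
```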

@jorisvandenbossche (Member) commented Oct 2, 2015

Maybe this could be an extra parameter to convert/convert_objects: whether or not to coerce non-convertible columns to NaN (meaning: columns for which not a single element is convertible, and which would thus become all-NaN columns). @bashtage then you could have the behaviour you want, but the method can still be used for dataframes where not all columns should be considered numeric.

@jreback (Contributor, Author) commented Oct 2, 2015

ok so the question is: should we un-deprecate convert_objects then?

I actually think convert is a much better name and we certainly could add the options you describe to make it more useful

@bashtage (Contributor) commented Oct 2, 2015

convert_objects just seems like a bad API feature since it has this path dependence where it

tries to convert to type a
tries to convert to type b if a fails, but not if a succeeds
tries to convert to type c if a and b fail, but not if either succeeds

A better design would only convert a single type, which removes any ambiguity if some data is ever convertible to more than one type. The to_* functions sort of get there, with the caveat that they operate column by column.

@hayd (Contributor) commented Nov 20, 2015

Long live convert_objects!

@jreback (Contributor, Author) commented Nov 20, 2015

maybe what we need in the docs are some examples showing:

df.apply(pd.to_numeric) and such (which effectively and more safely replaces .convert_objects)

@usagliaschi commented Jul 14, 2016

Hi all,

I currently use convert_objects in a lot of my code and I think this functionality is very useful when importing datasets that may differ every day in terms of column composition. Is it really necessary to deprecate it, or is there a chance to keep it alive?

Many thanks,
Umberto

@jreback (Contributor, Author) commented Jul 14, 2016

.convert_objects was inherently ambiguous and was deprecated multiple versions ago. See the docs here for how to explicitly do object conversion.

@bashtage (Contributor) commented Jul 14, 2016

I agree with @jreback - convert_objects was full of magic and had difficult-to-guess behavior that was inconsistent across different conversion targets (e.g. values were not forced to NaN if none of them were numbers, even when told to coerce).

A well-designed guesser with clear, simple rules and no option to coerce could be useful, but it isn't hard to write your own with your favorite set of rules.

@BKJackson commented Sep 10, 2016

FYI, the convert-all (errors='coerce') and ignore (errors='ignore') options in .to_numeric are a problem for data files containing both columns of strings that you want to keep and columns of strings that are actually numbers expressed in scientific notation (e.g., 6.2e+15), which require 'coerce' to convert from strings to float64.

The (deprecated) convert.py file has a handy soft-convert function that checks whether a forced conversion produces all NaNs (such as a string column that you want to keep) and then declines to convert the whole column.

A fourth error option, such as 'soft-coerce', would catch scientific-notation numbers while not forcing all strings to NaNs.

At the moment, my work around is:

    for col in df.columns:
        # coerce each column, but keep the original if nothing was convertible
        converted = pd.to_numeric(df[col], errors='coerce')
        df[col] = converted if not pd.isnull(converted).all() else df[col]

@abalter commented Sep 26, 2016

The great thing about convert_objects over the various to_* methods is that you don't need to know the datatypes in advance. As @usagliaschi said, you may have heterogeneous data coming in and want a single function to handle it. This is exactly my current situation.

Is there any replacement for a function that will match this functionality, in particular infer dates/datetimes?

@chris-b1 (Contributor) commented Mar 21, 2017

xref #15757 (comment)

I think it would be worth exposing whatever the new soft-convert api in 0.20 is (I haven't looked at it in detail), referencing it in the convert_objects deprecation message, then deferring removal of convert_objects to the next version, if possible.

I say this because I know there are people (for example, me) who have ignored the convert_objects deprecation message in a couple of cases, in particular when working with data where you don't necessarily know the columns. Real instance:

df = pd.read_html(source)[0]  # poorly formatted table, everything inferred to object
                              # exact columns can vary

df.columns = df.loc[0, :]
df = df.drop(0).dropna()

df = df.convert_objects()

Looking at this again, I realize df.apply(lambda x: pd.to_numeric(x, errors='ignore')) would also work fine in this case, but that wasn't immediately obvious, and I'm not sure we've done enough handholding (for lack of a better term) to help people transition.
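A version of that apply-based transition that doesn't rely on errors='ignore' (an option pandas later deprecated) could look like this; to_numeric_ignore is a hypothetical helper name:

```python
import pandas as pd

def to_numeric_ignore(col):
    # per-column equivalent of errors='ignore': return the column
    # unchanged when any value fails to parse as a number
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

df = pd.DataFrame({"number_str": ["1", "2"], "label": ["a", "b"]})
out = df.apply(to_numeric_ignore)
print(out.dtypes)  # number_str -> int64, label stays object
```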

@jreback (Contributor, Author) commented Mar 21, 2017

IF we decide to expose a 'soft convert objects', would we want this called .convert_objects()? Or a different name, maybe .convert()? (e.g. instead of removing the deprecation, we simply change it - which is probably more of a back-compat break).

@jreback (Contributor, Author) commented Mar 27, 2017

xref #15550

so I think a resolution to this could be:

  • adding .to_* to Series (#15550)
  • adding .to_* to DataFrame
  • adding a soft option

then easy enough to do:

df.to_numeric(errors='soft')

if you really, really want to actually convert things à la the original .convert_objects():

df.to_datetime(errors='soft').to_timedelta(errors='soft').to_numeric(errors='soft')

And I suppose we could offer a convenience feature for this:

  • df.to_converted()
  • df.convert() (maybe too generic)
  • df.convert_objects() (resurrect)
  • df.to_just_figure_this_out()

@bashtage (Contributor) commented Mar 27, 2017

I think the most useful soft-conversion function would have either the ability to order the to_* rules, e.g. numeric-date-time or time-date-numeric, since there is occasionally data that could be interpreted as multiple types (at least this was the case in convert_objects), or alternatively the ability to select only a subset of the filters, such as only considering numeric-date.

I agree extending the to_* to correctly operate on DataFrames would be useful.

@chris-b1 (Contributor) commented Mar 28, 2017

Thanks @jreback - I like adding to_... to the DataFrame api, although maybe it's worth splitting out use cases. Consider this ill-formed frame:

df = pd.DataFrame({'num_objects': [1, 2, 3], 'num_str': ['1', '2', '3']}, dtype=object)

df
Out[2]: 
  num_objects num_str
0           1       1
1           2       2
2           3       3

df.dtypes
Out[3]: 
num_objects    object
num_str        object
dtype: object

The default behavior of convert_objects is to only reinterpret the python ints into a proper int dtype, not cast the strings. This is the behavior that I'd really miss if convert_objects were killed, and I suspect others might too.

df.convert_objects().dtypes
Out[4]: 
num_objects     int64
num_str        object
dtype: object

In [5]: df.apply(pd.to_numeric).dtypes
Out[5]: 
num_objects    int64
num_str        int64
dtype: object

So is it worth adding a convert_pyobjects (...not in love with that name) for just this case?
infer_python_types
convert_python_types
??
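For reference, the infer_objects method that eventually came out of this discussion handles exactly this case with no options:

```python
import pandas as pd

# soft unboxing: python ints stored in an object column are recovered
# as int64, while the string column is left untouched
df = pd.DataFrame({"num_objects": [1, 2, 3], "num_str": ["1", "2", "3"]},
                  dtype=object)
inferred = df.infer_objects()
print(inferred.dtypes)  # num_objects -> int64, num_str stays object
```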

@bashtage (Contributor) commented Mar 28, 2017

The to_* functions are pretty precise and do what you tell them, even to non-object dtypes. For example:

import pandas as pd
import datetime as dt
t = pd.Series([dt.datetime.now(), dt.datetime.now()])

pd.to_numeric(t)
Out[7]: 
0    1490739351272159000
1    1490739351272159000
dtype: int64

I would assume that a successor to convert_objects would only convert object dtype and would not behave like this.

@jorisvandenbossche (Member) commented Mar 28, 2017

The reason that I don't like adding the .to_ functions as methods on a DataFrame (or at least not as the solution in this discussion) is because IMO you typically do not want to apply this to all columns, and/or not in the same way (and if you want that, you can easily do the apply approach as you can now).
E.g. with DataFrame.to_datetime, I would expect that it does this for all columns, which means converting both numerical columns and string columns. I don't think this is typically what you want.

So for me, one of the reasons to have a convert_objects method (regardless of the exact behavioral details) is that it would only try to convert actual object-dtyped columns.
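That 'object columns only' behaviour is easy to emulate per-frame with select_dtypes (a sketch; the frame here is illustrative):

```python
import pandas as pd

# parse dates, but only in object-dtyped columns,
# leaving the numeric column untouched
df = pd.DataFrame({"value": [1.5, 2.5],
                   "when": ["2014-01-01", "2015-06-15"]})
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.to_datetime(df[col], errors="coerce")
print(df.dtypes)  # value stays float64, when -> datetime64[ns]
```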

@jreback (Contributor, Author) commented Mar 28, 2017

ok, if we resurrect this, it would be with an all-new signature. This is the current one:

In [1]: DataFrame.convert_objects?
Signature: DataFrame.convert_objects(self, convert_dates=True, convert_numeric=False, convert_timedeltas=True, copy=True)
Docstring:
Deprecated.

Attempt to infer better dtype for object columns

Parameters
----------
convert_dates : boolean, default True
    If True, convert to date where possible. If 'coerce', force
    conversion, with unconvertible values becoming NaT.
convert_numeric : boolean, default False
    If True, attempt to coerce to numbers (including strings), with
    unconvertible values becoming NaN.
convert_timedeltas : boolean, default True
    If True, convert to timedelta where possible. If 'coerce', force
    conversion, with unconvertible values becoming NaT.
copy : boolean, default True
    If True, return a copy even if no copy is necessary (e.g. no
    conversion was done). Note: This is meant for internal use, and
    should not be confused with inplace.

IIRC @jorisvandenbossche suggested (with a mod):

DataFrame.convert_object(self, datetime=True, timedelta=True, numeric=False, copy=True)

Though if everything is changed, then maybe we should just rename this (note the .convert_object).

@chris-b1 (Contributor) commented Jul 7, 2017

Sorry, I'm just getting back to this. Here's a proposal of how I think this could work, open to suggestions on any piece.

0.20.1 - leave convert_objects but update the deprecation message with the new methods I'll go through
0.20.2 - remove convert_objects

First, for conversions that are simply unboxing of python objects, add a new method infer_objects with no options. This essentially re-applies our ctor inference on any object columns: if a column can be losslessly unboxed to a native type, do it; otherwise leave it unchanged. Useful in munging scenarios where the original inference fails. Example:

df = pd.DataFrame({'a': ['a', 1, 2, 3],
                   'b': ['b', 2.0, 3.0, 4.1],
                   'c': ['c', datetime.datetime(2016, 1, 1), datetime.datetime(2016, 1, 2), 
                         datetime.datetime(2016, 1, 3)]})

df = df.iloc[1:]

In [194]: df
Out[194]: 
   a    b                    c
1  1    2  2016-01-01 00:00:00
2  2    3  2016-01-02 00:00:00
3  3  4.1  2016-01-03 00:00:00

In [195]: df.dtypes
Out[195]: 
a    object
b    object
c    object
dtype: object

# exactly what convert_objects does in this scenario today!
In [196]: df.infer_objects().dtypes
Out[196]: 
a             int64
b           float64
c    datetime64[ns]
dtype: object

Second, for all other conversions, add to_numeric, to_datetime, and to_timedelta to the DataFrame API, with the following signature. They'd basically work as they do today, but with some convenient column-selection options. Not sure on the defaults here; starting with the most 'convenient':

"""
DataFrame.to_...(self, errors='ignore', object_only=True, include=None, exclude=None)
Parameters
------------
errors: {'ignore', 'coerce', 'raise'}
   error mode passed to `pd.to_....`
object_only: boolean
    if True, only apply inference to object typed columns

include / exclude: column selection
"""

Example frame, with what is needed today:

df1 = pd.DataFrame({
    'date': pd.date_range('2014-01-01', periods=3),
    'date_unconverted': ['2014-01', '2015-01', '2016-01'],
    'number': [1, 2, 3],
    'number_unconverted': ['1', '2', '3']})


In [198]: df1
Out[198]: 
        date date_unconverted  number number_unconverted
0 2014-01-01          2014-01       1                  1
1 2014-01-02          2015-01       2                  2
2 2014-01-03          2016-01       3                  3

In [199]: df1.dtypes
Out[199]: 
date                  datetime64[ns]
date_unconverted              object
number                         int64
number_unconverted            object
dtype: object


In [202]: df1.convert_objects(convert_numeric=True, convert_dates='coerce').dtypes
C:\Users\chris.bartak\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.
Out[202]: 
date                  datetime64[ns]
date_unconverted      datetime64[ns]
number                         int64
number_unconverted             int64
dtype: object

With the new api:

In [202]: df1.to_numeric().to_datetime().dtypes
Out[202]: 
date                  datetime64[ns]
date_unconverted      datetime64[ns]
number                         int64
number_unconverted             int64
dtype: object

@chris-b1 (Contributor) commented Jul 7, 2017

And to be honest, I don't personally care much about the second API; my pushback over deprecating convert_objects was entirely based on the lack of something like infer_objects.

@bashtage (Contributor) commented Jul 7, 2017

I would second infer_objects() as long as the rules were crystal clear and the implementation matched the description. Another important use case is when one ends up with a transposed DF with all object columns, where something like df = df.T.infer_objects() would recover the proper dtypes.

I think functions like to_numeric, etc. shouldn't be methods on a DataFrame and should instead just be standalone. I don't think they are used frequently enough to justify polluting the to_* list.

@chris-b1 (Contributor) commented Jul 7, 2017

Cool, yeah, the more I think about it the less I think adding to_... to the DataFrame api is a good idea. In terms of infer_objects, the impl would basically be as follows - based on maybe_convert_objects, which has generally unsurprising (in my opinion) behavior:

In [250]: import datetime; import numpy as np

In [251]: from pandas._libs.lib import maybe_convert_objects

In [252]: converter = lambda x: maybe_convert_objects(np.asarray(x, dtype='O'), convert_datetime=True, convert_timedelta=True)

In [253]: converter([1,2,3])
Out[253]: array([1, 2, 3], dtype=int64)

In [255]: converter([1,2,'3'])
Out[255]: array([1, 2, '3'], dtype=object)

In [256]: converter([datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 2)])
Out[256]: array(['2015-01-01T00:00:00.000000000', '2015-01-02T00:00:00.000000000'], dtype='datetime64[ns]')

In [257]: converter([datetime.datetime(2015, 1, 1), 'a'])
Out[257]: array([datetime.datetime(2015, 1, 1, 0, 0), 'a'], dtype=object)

In [258]: converter([datetime.datetime(2015, 1, 1), 1])
Out[258]: array([datetime.datetime(2015, 1, 1, 0, 0), 1], dtype=object)

In [259]: converter([datetime.timedelta(seconds=1), datetime.timedelta(seconds=1)])
Out[259]: array([1000000000, 1000000000], dtype='timedelta64[ns]')

In [260]: converter([datetime.timedelta(seconds=1), 1])
Out[260]: array([datetime.timedelta(0, 1), 1], dtype=object)

@jreback (Contributor, Author) commented Jul 7, 2017

yes, maybe_convert_objects is a soft conversion; it will only convert if all the elements are strictly convertible
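The same all-or-nothing rule is visible through the public Series.infer_objects (a minimal sketch):

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype=object)
print(s.infer_objects().dtype)      # int64: every element converts

mixed = pd.Series([1, 2, "3"], dtype=object)
print(mixed.infer_objects().dtype)  # object: '3' is not strictly an int
```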

@jreback (Contributor, Author) commented Jul 7, 2017

I could be on board with a very simple .infer_objects() in that case. It wouldn't accept any arguments, I think?

@jreback (Contributor, Author) commented Jul 7, 2017

we could add the new function and change the message on the convert_objects deprecation to point to .infer_objects() and .to_* for 0.21, then remove it in 1.0

@jreback modified the milestones: 0.21.0, Next Major Release Jul 7, 2017

@gfyoung (Member) commented Jul 13, 2017

@jreback : Judging from this conversation, it seems that removal of convert_objects will not be happening in 0.21. Would it be best to close #15757 and let a fresh PR take its place for the implementation of infer_objects (which, BTW, seems like a good idea)?

@gfyoung (Member) commented Jul 13, 2017

IIUC, to what extent is infer_objects just a port of convert_objects to being a method of DataFrame (or just NDFrame in general)?

@bashtage (Contributor) commented Jul 13, 2017

convert_objects has its own logic and has options. infer_objects should use the default inference, as if the DataFrame were being constructed fresh (but only on object columns).

@gfyoung (Member) commented Jul 13, 2017

Ah right, so do you mean then that infer_objects is convert_objects with the defaults passed in (more or less, with maybe some tweaks specifically for DataFrame)?

@jreback (Contributor, Author) commented Jul 13, 2017

infer_objects should have no options and simply do soft conversion (it would basically just call maybe_convert_objects with the default options)

@gfyoung (Member) commented Jul 13, 2017

Ah, okay, that makes sense. I was just trying to understand and collate the comments made in this discussion in my mind.

@chris-b1 mentioned this issue Jul 13, 2017
@chris-b1 (Contributor) commented Jul 13, 2017

fyi, opened #16915 for infer_objects if anyone is interested - in particular if you have edge test cases in mind
