
API: .convert_objects is deprecated, do we want a .convert to replace? #11221

Closed
jreback opened this Issue Oct 2, 2015 · 46 comments

Contributor

jreback commented Oct 2, 2015

xref #11173

or IMHO simply replace it with pd.to_datetime, pd.to_timedelta, pd.to_numeric.

Having an auto-guesser is OK, but when you try to forcefully coerce, things can easily go awry.

@jreback jreback added this to the Next Major Release milestone Oct 2, 2015

Contributor

jreback commented Oct 2, 2015

cc @bashtage @jorisvandenbossche @shoyer @TomAugspurger @sinhrks

Contributor

bashtage commented Oct 2, 2015

There is already _convert which could be promoted.



Contributor

bashtage commented Oct 2, 2015

The advantage of a well designed convert is that it works on DataFrames. All of to_* are only for 1-d types.


Contributor

jreback commented Oct 2, 2015

@bashtage oh I agree.

The problem is with coerce: you basically have to avoid partially auto-coercing things, and so leave ambiguous cases up to the user (via a 1-d use of the pd.to_*). But assuming we do that, then yes, you could make it work.


Contributor

bashtage commented Oct 2, 2015

I was just thinking of the case where I imported data that should be numeric into a DF, but it had some mixed characters, and I wanted just numbers or NaNs. This type of conversion is what I ultimately wanted when I started looking at convert_objects, and I was surprised that asking to coerce a column of all strings didn't coerce it to NaN.


Contributor

jreback commented Oct 2, 2015

but the problem is that a mixed boolean/nan column is ambiguous (so maybe we just need to 'handle' that)


Member

jorisvandenbossche commented Oct 2, 2015

Some comments/observations:

  • I actually like convert_objects more than convert, because it more clearly says what it does: try to convert object-dtyped columns to a builtin dtype (convert is rather general).

  • If we decide that we want something like the current convert_objects functionality, I don't really see a reason to deprecate convert_objects in favor of a new convert. I think it should be technically possible to deprecate the old keywords (and not the function) in favor of new keywords (actually the original approach in the reverted PR).

  • I think the functionality of convert_objects is useful (as already stated above: you can do something like to_datetime/to_numeric/.. on dataframes). Using the to_.. functions on each series separately will always be the preferable solution for robust code, but as long as convert_objects is very clearly defined (right now there are some strange inconsistencies), I think it is useful to have this. It would be very nice if this could just be implemented in terms of the to_.. methods.
    A bit simplified in pseudo code:

    def convert_objects(self, numeric=False, datetime=False, timedelta=False, coerce=False):
        for col in self.columns:
            if numeric:
                self[col] = pd.to_numeric(self[col], coerce=coerce)
            elif datetime:
                self[col] = pd.to_datetime(self[col], coerce=coerce)
            elif timedelta:
                self[col] = pd.to_timedelta(self[col], coerce=coerce)
    
  • But the main problem with this is: the reason convert_objects is useful now is precisely because it has an extra 'rule' that the to_.. methods don't have: only convert the column if there is at least one value that can be converted.
    This is the reason that something like this works:

    In [2]: df = pd.DataFrame({'int_str':['1', '2'], 'real_str':['a', 'b']})
    
    In [3]: df.convert_objects(convert_numeric=True)
    Out[3]:
       int_str real_str
    0        1        a
    1        2        b
    
    In [4]: df.convert_objects(convert_numeric=True).dtypes
    Out[4]:
    int_str      int64
    real_str    object
    dtype: object
    

    and does not give:

    Out[3]:
       int_str   real_str
    0        1        NaN
    1        2        NaN
    

    which would not be really useful (although maybe more predictable). The fact that it is not always coerced to NaN was considered a bug, for which @bashtage did a PR (and for to_numeric, it is indeed logical that it returns NaNs). But this made convert_objects less useful, so it was reverted in the end.
    So I think that in this case, we will have to deviate from the to_.. behaviour.


Member

jorisvandenbossche commented Oct 2, 2015

Maybe this could be an extra parameter to convert/convert_objects: whether or not to coerce non-convertible columns to NaN (meaning: columns for which not a single element is convertible, so that coercion would produce an all-NaN column). @bashtage then you could have the behaviour you want, but the method could still be used on dataframes where not all columns should be considered numeric.
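
For illustration, a rough per-column sketch of that rule (soft_to_numeric is a hypothetical helper name):

import pandas as pd

def soft_to_numeric(col):
    # decline the conversion when it would turn a column that is not
    # already all-NaN into an all-NaN column
    converted = pd.to_numeric(col, errors='coerce')
    if converted.isnull().all() and not col.isnull().all():
        return col
    return converted

With the df from the example above, df.apply(soft_to_numeric) would convert int_str to int64 while leaving real_str as object.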


Contributor

jreback commented Oct 2, 2015

OK, so the question is: should we un-deprecate convert_objects then?

I actually think convert is a much better name, and we certainly could add the options you describe to make it more useful.


Contributor

bashtage commented Oct 2, 2015

convert_objects just seems like a bad API feature, since it has this path dependence where it:

tries to convert to type a
tries to convert to type b if a fails, but not if a succeeds
tries to convert to type c if a and b fail, but not if either succeeds

A better design would only convert a single type, which removes any ambiguity if some data is ever convertible to more than one type. The to_* functions sort of get there, with the caveat that they operate column by column.
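
For instance, the same strings can legitimately convert to more than one type, so the outcome depends on which conversion is tried first (illustrative session):

In [1]: s = pd.Series(['20150101', '20150102'])

In [2]: pd.to_numeric(s)
Out[2]: 
0    20150101
1    20150102
dtype: int64

In [3]: pd.to_datetime(s)
Out[3]: 
0   2015-01-01
1   2015-01-02
dtype: datetime64[ns]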


Contributor

hayd commented Nov 20, 2015

Long live convert_objects!


Contributor

jreback commented Nov 20, 2015

maybe what we need in the docs are some examples showing:

df.apply(pd.to_numeric) and the like, which effectively (and more safely) replace .convert_objects
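
For instance, a doc example might look something like this (a sketch; errors='ignore' leaves columns that can't be converted untouched):

import pandas as pd

df = pd.DataFrame({'num_str': ['1', '2'], 'txt': ['a', 'b']})

# num_str becomes int64, txt stays object
df = df.apply(lambda col: pd.to_numeric(col, errors='ignore'))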


usagliaschi commented Jul 14, 2016

Hi all,

I currently use convert_objects in a lot of my code, and I think this functionality is very useful when importing datasets whose column composition may differ from day to day. Is it really necessary to deprecate it, or is there a chance to keep it alive?

Many thanks,
Umberto


Contributor

jreback commented Jul 14, 2016

.convert_objects was inherently ambiguous, and it was deprecated multiple versions ago. See the docs here for how to do object conversion explicitly.


Contributor

bashtage commented Jul 14, 2016

I agree with @jreback - convert_objects was full of magic and had difficult-to-guess behavior that was inconsistent across conversion targets (e.g. values were not forced to numbers when none of them were numeric, even if told to coerce).

A well-designed guesser with clear, simple rules and no option to coerce could be useful, but it isn't hard to write your own with your favorite set of rules.
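
For example, a minimal home-grown guesser might look like this (guess_convert is a hypothetical helper; the rule set and its order are a personal choice):

import pandas as pd

def guess_convert(col):
    # try numeric first, then datetime; never coerce, and leave the
    # column untouched if nothing converts cleanly
    for conv in (pd.to_numeric, pd.to_datetime):
        try:
            return conv(col)
        except (ValueError, TypeError):
            pass
    return col

df = pd.DataFrame({'n': ['1', '2'], 'd': ['2016-01-01', '2016-01-02']})
df = df.apply(guess_convert)  # n -> int64, d -> datetime64[ns]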


BKJackson commented Sep 10, 2016

FYI, the convert-all (errors='coerce') and ignore (errors='ignore') options in .to_numeric are a problem for data files that contain both columns of strings you want to keep and columns of strings that are actually numbers expressed in scientific notation (e.g. 6.2e+15), which require 'coerce' to convert from strings to float64.

The (deprecated) convert.py file has a handy soft-convert function that checks whether a forced conversion produces all NaNs (such as for a string column you want to keep) and then declines to convert that column.

A fourth errors option, such as 'soft-coerce', would catch scientific-notation numbers while not forcing all strings to NaN.

At the moment, my workaround is:

    for col in df.columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        df[col] = converted if not pd.isnull(converted).all() else df[col]

abalter commented Sep 26, 2016

The great thing about convert_objects over the various to_* methods is that you don't need to know the datatypes in advance. As @usagliaschi said, you may have heterogeneous data coming in and want a single function to handle it. This is exactly my current situation.

Is there any replacement function that matches this functionality, in particular inferring dates/datetimes?


Contributor

chris-b1 commented Mar 21, 2017

xref #15757 (comment)

I think it would be worth exposing whatever the new soft-convert API is in 0.20 (I haven't looked at it in detail), referencing it in the convert_objects deprecation message, and then deferring the convert_objects removal to the next version, if possible.

I say this because I know there are people (for example, me) who have ignored the convert_objects deprecation message in a couple of cases, in particular when working with data where you don't necessarily know the columns. Real instance:

df = pd.read_html(source)[0]  # poorly formatted table, everything inferred to object
                              # exact columns can vary

df.columns = df.loc[0, :]
df = df.drop(0).dropna()

df = df.convert_objects()

Looking at this again, I realize df.apply(lambda x: pd.to_numeric(x, errors='ignore')) would also work fine in this case, but that wasn't immediately obvious, and I'm not sure we've done enough handholding (for lack of a better term) to help people transition.


Contributor

jreback commented Mar 21, 2017

IF we decide to expose a 'soft convert objects', would we want it called .convert_objects()? Or a different name, maybe .convert()? (i.e. instead of removing the deprecation, we simply change it - which is probably more breaking of back-compat).

@jsexauer jsexauer referenced this issue Mar 25, 2017

Open

DEPR: deprecations from prior versions #6581

Contributor

jreback commented Mar 27, 2017

xref #15550

so I think a resolution to this could be:

  • adding .to_* to Series (#15550)
  • adding .to_* to DataFrame
  • adding a soft option

then it's easy enough to do:

df.to_numeric(errors='soft')

if you really, really want to actually convert things à la the original .convert_objects():

df.to_datetime(errors='soft').to_timedelta(errors='soft').to_numeric(errors='soft')

And I suppose we could offer a convenience method for this:

  • df.to_converted()
  • df.convert() (maybe too generic)
  • df.convert_objects() (resurrect)
  • df.to_just_figure_this_out()

Contributor

bashtage commented Mar 27, 2017

I think the most useful soft-conversion function would have the ability to order the to_* rules, e.g. numeric-date-time or time-date-numeric, since there are occasionally data that could be interpreted as multiple types. At least this was the case in convert_objects. Alternatively, one could select only a subset of the rules, such as considering only numeric-date.

I agree extending the to_* to correctly operate on DataFrames would be useful.


Contributor

chris-b1 commented Mar 28, 2017

Thanks @jreback - I like adding to_... to the DataFrame API, although maybe it's worth splitting out use cases. Consider this ill-formed frame:

df = pd.DataFrame({'num_objects': [1, 2, 3], 'num_str': ['1', '2', '3']}, dtype=object)

df
Out[2]: 
  num_objects num_str
0           1       1
1           2       2
2           3       3

df.dtypes
Out[3]: 
num_objects    object
num_str        object
dtype: object

The default behavior of convert_objects is to only reinterpret the python ints into a proper int dtype, not to cast the strings. This is the behavior I'd really miss in killing convert_objects, and I suspect others might too.

df.convert_objects().dtypes
Out[4]: 
num_objects     int64
num_str        object
dtype: object

In [5]: df.apply(pd.to_numeric).dtypes
Out[5]: 
num_objects    int64
num_str        int64
dtype: object

So is it worth adding a convert_pyobjects (...not in love with that name) for just this case?
infer_python_types
convert_python_types
??


Contributor

jreback commented Mar 28, 2017

I think it's easy enough to add a 'soft' option to errors to do exactly this.


Contributor

chris-b1 commented Mar 28, 2017

Would pd.Series(['1', '2', '3']).to_numeric(errors='soft') cast?

Contributor

jreback commented Mar 28, 2017

soft would just return [3] (as would coerce).

The difference is [4] (the Series with 'foo' in it). I think soft would return [5] and coerce would return [4]:

In [3]: pd.to_numeric(pd.Series(['1', '2', '3']), errors='coerce')
Out[3]: 
0    1
1    2
2    3
dtype: int64

In [4]: pd.to_numeric(pd.Series(['1', '2', 'foo']), errors='coerce')
Out[4]: 
0    1.0
1    2.0
2    NaN
dtype: float64

In [5]: pd.to_numeric(pd.Series(['1', '2', 'foo']), errors='ignore')
Out[5]: 
0      1
1      2
2    foo
dtype: object

Contributor

chris-b1 commented Mar 28, 2017

Thanks for the examples.

I still think "only losslessly convert python objects into proper dtypes" might be a better as separate operation from to_numeric? There wouldn't be any way to produce Out[4] from my example above?


Contributor

bashtage commented Mar 28, 2017

I don't think that "losslessly convert python objects into proper dtypes" is generally well defined. There are certainly some objects that don't have a lossless native representation (e.g. str->float).

This ambiguity that was just described is precisely the challenge in writing a useful, correct and precise converter.

Should the set of conversion options and the rules that will be used be described prior to implementing them? I think they should be, or the code will default to being the reference set of rules (which was one of the problems with convert_objects).


Contributor

chris-b1 commented Mar 28, 2017

To be clear, what I mean by losslessly converting is doing exactly what pd.Series([<python objects>]) would do - converting to a numpy dtype if possible, otherwise leaving it as object.


Contributor

bashtage commented Mar 28, 2017

I think the point of convert_objects and any successor is to go strictly beyond what the IO tools will automatically do. IOW, some coercion of some objects some of the time is essential. The old convert_objects would, for example, coerce mixed strings and numbers to numbers and nulls. Tools like read_csv intentionally don't do this, since it is fairly arbitrary.


Contributor

bashtage commented Mar 28, 2017

The to_* functions are pretty precise and do what you tell them, even to non-objects. For example:

import pandas as pd
import datetime as dt
t = pd.Series([dt.datetime.now(), dt.datetime.now()])

pd.to_numeric(t)
Out[7]: 
0    1490739351272159000
1    1490739351272159000
dtype: int64

I would assume that a successor to convert_objects would only convert object dtype and would not behave like this.


Member

jorisvandenbossche commented Mar 28, 2017

The reason I don't like adding the to_* functions as methods on DataFrame (or at least not as the solution in this discussion) is that IMO you typically do not want to apply them to all columns and/or not in the same way (and if you do want that, you can easily use the apply approach, as you can now).
E.g. with DataFrame.to_datetime, I would expect it to apply to all columns, which means converting both numerical columns and string columns. I don't think that is typically what you want.

So for me, one of the reasons to have a convert_objects method (regardless of the exact behavioral details) is that it would only try to convert actual object-dtyped columns.
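
That restriction is easy to express with the existing tools, e.g. (a sketch):

import pandas as pd

df = pd.DataFrame({'already_float': [1.5, 2.5], 'num_str': ['1', '2']})

# only touch the object-dtyped columns; real dtypes are left alone
obj_cols = df.columns[df.dtypes == object]
df[obj_cols] = df[obj_cols].apply(lambda col: pd.to_numeric(col, errors='ignore'))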


Contributor

jreback commented Mar 28, 2017

OK, if we resurrect this, it will be with an all-new signature. This is the current one:

In [1]: DataFrame.convert_objects?
Signature: DataFrame.convert_objects(self, convert_dates=True, convert_numeric=False, convert_timedeltas=True, copy=True)
Docstring:
Deprecated.

Attempt to infer better dtype for object columns

Parameters
----------
convert_dates : boolean, default True
    If True, convert to date where possible. If 'coerce', force
    conversion, with unconvertible values becoming NaT.
convert_numeric : boolean, default False
    If True, attempt to coerce to numbers (including strings), with
    unconvertible values becoming NaN.
convert_timedeltas : boolean, default True
    If True, convert to timedelta where possible. If 'coerce', force
    conversion, with unconvertible values becoming NaT.
copy : boolean, default True
    If True, return a copy even if no copy is necessary (e.g. no
    conversion was done). Note: This is meant for internal use, and
    should not be confused with inplace.

IIRC @jorisvandenbossche suggested (with a mod):

DataFrame.convert_object(self, datetime=True, timedelta=True, numeric=False, copy=True)

Though if everything is changed, then maybe we should just rename it. (Note the .convert_object.)

Contributor

chris-b1 commented Jul 7, 2017

Sorry I'm just getting back to this; here's a proposal for how I think this could work, open to suggestions on any piece.

0.20.1 - leave convert_objects but update the deprecation message with the new methods I'll go through below
0.20.2 - remove convert_objects

First, for conversions that are simply unboxing of python objects, add a new method infer_objects with no options. This essentially re-applies our constructor inference to any object columns: if a column can be losslessly unboxed to a native type, do it; otherwise leave it unchanged. Useful in munging scenarios where the original inference fails. Example:

df = pd.DataFrame({'a': ['a', 1, 2, 3],
                   'b': ['b', 2.0, 3.0, 4.1],
                   'c': ['c', datetime.datetime(2016, 1, 1), datetime.datetime(2016, 1, 2), 
                         datetime.datetime(2016, 1, 3)]})

df = df.iloc[1:]

In [194]: df
Out[194]: 
   a    b                    c
1  1    2  2016-01-01 00:00:00
2  2    3  2016-01-02 00:00:00
3  3  4.1  2016-01-03 00:00:00

In [195]: df.dtypes
Out[195]: 
a    object
b    object
c    object
dtype: object

# exactly what convert_objects does in this scenario today!
In [196]: df.infer_objects().dtypes
Out[196]: 
a             int64
b           float64
c    datetime64[ns]
dtype: object

Second, for all other conversions, add to_numeric, to_datetime, and to_timedelta to the DataFrame API, with the following signature. They'd basically work as they do today, but with some convenient column-selection options. Not sure on the defaults here; starting with the most 'convenient':

"""
DataFrame.to_...(self, errors='ignore', object_only=True, include=None, exclude=None)
Parameters
------------
errors: {'ignore', 'coerce', 'raise'}
   error mode passed to `pd.to_....`
object_only: boolean
    if True, only apply inference to object typed columns

include / exclude: column selection
"""

Example frame, with what is needed today:

df1 = pd.DataFrame({
    'date': pd.date_range('2014-01-01', periods=3),
    'date_unconverted': ['2014-01', '2015-01', '2016-01'],
    'number': [1, 2, 3],
    'number_unconverted': ['1', '2', '3']})


In [198]: df1
Out[198]: 
        date date_unconverted  number number_unconverted
0 2014-01-01          2014-01       1                  1
1 2014-01-02          2015-01       2                  2
2 2014-01-03          2016-01       3                  3

In [199]: df1.dtypes
Out[199]: 
date                  datetime64[ns]
date_unconverted              object
number                         int64
number_unconverted            object
dtype: object


In [202]: df1.convert_objects(convert_numeric=True, convert_dates='coerce').dtypes
C:\Users\chris.bartak\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.
Out[202]: 
date                  datetime64[ns]
date_unconverted      datetime64[ns]
number                         int64
number_unconverted             int64
dtype: object

With the new api:

In [202]: df1.to_numeric().to_datetime()
Out[202]: 
date                  datetime64[ns]
date_unconverted      datetime64[ns]
number                         int64
number_unconverted             int64
dtype: object

Contributor

chris-b1 commented Jul 7, 2017

And to be honest, I don't personally care much about the second API; my pushback on deprecating convert_objects was entirely based on the lack of something like infer_objects.


Contributor

bashtage commented Jul 7, 2017

I would second infer_objects(), as long as the rules were crystal clear and the implementation matched the description. Another important use case is when one ends up with a transposed DF with all object columns; something like df = df.T.infer_objects() would then produce properly typed columns again.

I think functions like to_numeric, etc. shouldn't be methods on a DataFrame and instead should just be standalone. I don't think they are used frequently enough to justify polluting the to_* list.
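
To illustrate the transposed-frame case (assuming the proposed infer_objects method exists):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
back = df.T.T  # the transpose round trip leaves every column object-dtyped

back.dtypes                  # a: object, b: object
back.infer_objects().dtypes  # a: int64,  b: object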

Contributor

chris-b1 commented Jul 7, 2017

Cool, yeah, the more I think about it, the less I think adding to_... to the DataFrame API is a good idea. In terms of infer_objects, the impl would basically be as follows - based on maybe_convert_objects, which has generally unsurprising (in my opinion) behavior:

In [251]: from pandas._libs.lib import maybe_convert_objects

In [252]: converter = lambda x: maybe_convert_objects(np.asarray(x, dtype='O'), convert_datetime=True, convert_timedelta=True)

In [253]: converter([1,2,3])
Out[253]: array([1, 2, 3], dtype=int64)

In [254]: converter([1,2,3])
Out[254]: array([1, 2, 3], dtype=int64)

In [255]: converter([1,2,'3'])
Out[255]: array([1, 2, '3'], dtype=object)

In [256]: converter([datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 2)])
Out[256]: array(['2015-01-01T00:00:00.000000000', '2015-01-02T00:00:00.000000000'], dtype='datetime64[ns]')

In [257]: converter([datetime.datetime(2015, 1, 1), 'a'])
Out[257]: array([datetime.datetime(2015, 1, 1, 0, 0), 'a'], dtype=object)

In [258]: converter([datetime.datetime(2015, 1, 1), 1])
Out[258]: array([datetime.datetime(2015, 1, 1, 0, 0), 1], dtype=object)

In [259]: converter([datetime.timedelta(seconds=1), datetime.timedelta(seconds=1)])
Out[259]: array([1000000000, 1000000000], dtype='timedelta64[ns]')

In [260]: converter([datetime.timedelta(seconds=1), 1])
Out[260]: array([datetime.timedelta(0, 1), 1], dtype=object)

Contributor

jreback commented Jul 7, 2017

yes, maybe_convert_objects is a soft conversion: it will only convert if all of the values are strictly convertible


Contributor

jreback commented Jul 7, 2017

I could be on board with a very simple .infer_objects() in that case. It wouldn't accept any arguments, I think?


Contributor

jreback commented Jul 7, 2017

We could add the new function and change the message on the convert_objects deprecation to point to .infer_objects() and .to_* for 0.21, then remove it in 1.0.

@jreback jreback modified the milestones: 0.21.0, Next Major Release Jul 7, 2017


Member

gfyoung commented Jul 13, 2017

@jreback: Judging from this conversation, it seems that removal of convert_objects will not be happening in 0.21. Would it be best to close #15757 and let a fresh PR take its place for the implementation of infer_objects (which, BTW, seems like a good idea)?


Member

gfyoung commented Jul 13, 2017

IIUC, to what extent is infer_objects just a port of convert_objects to being a method of DataFrame (or just NDFrame in general)?


Contributor

bashtage commented Jul 13, 2017

convert_objects has its own logic and has options. infer_objects should use the default inference, as if constructing the DataFrame (but only on object columns).


Member

gfyoung commented Jul 13, 2017

Ah right, so do you mean that infer_objects is convert_objects with the defaults passed in (more or less, maybe some tweaked specifically for DataFrame)?


Contributor

jreback commented Jul 13, 2017

infer_objects should have no options and simply do soft conversion (it would basically just call maybe_convert_objects with the default options).
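
Roughly, a sketch (as a hypothetical free function, built on the same internal helper shown above):

import numpy as np
import pandas as pd
from pandas._libs.lib import maybe_convert_objects

def infer_objects(df):
    # soft-convert each object column using the default inference rules;
    # anything not strictly convertible stays object
    out = df.copy()
    for col in out.columns[out.dtypes == object]:
        out[col] = maybe_convert_objects(
            np.asarray(out[col], dtype='O'),
            convert_datetime=True, convert_timedelta=True)
    return out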


Member

gfyoung commented Jul 13, 2017

Ah, okay, that makes sense. I was just trying to understand and collate the comments made in this discussion in my mind.

@chris-b1 chris-b1 referenced this issue Jul 13, 2017

Merged

API: add infer_objects for soft conversions #16915


Contributor

chris-b1 commented Jul 13, 2017

FYI, I opened #16915 for infer_objects if anyone is interested - in particular if you have edge test cases in mind.

@chris-b1 chris-b1 closed this in #16915 Jul 18, 2017
