New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/DEPR: Deprecate inplace parameter #16529

Open
gfyoung opened this Issue May 29, 2017 · 19 comments

Comments

Projects
None yet
6 participants
@gfyoung
Copy link
Member

gfyoung commented May 29, 2017

The parameter inplace=False should be deprecated across the board in preparation for pandas 2, which will not support that input (we will always return a copy). That would give people time to stop using it.

Thoughts?

Methods using inplace:

Deprecation non controvertial (a copy will be made anyway, and inplace=True does not add value):

  • (Series/DataFrame).drop
  • (Series/DataFrame).drop_duplicates
  • (Series/DataFrame).dropna
  • DataFrame.set_index (with drop=False wouldn't change the data, but that doesn't seem the main use case)
  • DataFrame.query
  • DataFrame.eval

Not sure:

  • (Series/DataFrame).sort_values
  • (Series/DataFrame).sort_index

Should be able to not copy memory (under discussion on what to do):

  • (Series/DataFrame).clip
  • (Series/DataFrame).where
  • (Series/DataFrame).fillna
  • (Series/DataFrame).rename_axis
  • (Series/DataFrame).reset_index
  • (Series/DataFrame).replace
  • (Series/DataFrame).set_axis
  • (Series/DataFrame).mask
  • (Series/DataFrame).interpolate
  • DataFrame.rename
  • Index.rename
  • Index.set_names
  • MultiIndex.set_levels
  • MultiIndex.set_labels
  • pandas.core.resample.Resampler.interpolate

Special cases:

  • pandas.eval (with inplace=False the value is not returned but set to an argument target)

Will be fully deprecated, so doesn't matter:

  • (Series/DataFrame/Panel).clip_lower
  • (Series/DataFrame/Panel).clip_upper
  • pandas.Panel.clip
  • pandas.Panel.drop
  • pandas.Panel.dropna
  • pandas.Panel.sort_index

@gfyoung gfyoung changed the title API/DEPR: Deprecate inplace operations API/DEPR: Deprecate inplace parameter May 29, 2017

@jreback jreback modified the milestones: 0.21.0, 1.0 May 30, 2017

@drafter250

This comment has been minimized.

Copy link

drafter250 commented Jul 26, 2017

I thought The main usage of this feature was to mitigate memory usage in case of large DataFrames?

@gfyoung

This comment has been minimized.

Copy link
Member

gfyoung commented Jul 26, 2017

I thought The main usage of this feature was to mitigate memory usage in case of large DataFrames?

I think @jreback might know more about the initial usage, but generally from previous discussion, this parameter is misused and is more prone to introducing bugs.

@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Aug 22, 2017

inplace does not generally do anything inplace
but makes a copy and reassigns the pointer

mutating code is much harder to debug (not to mentiin more complicated to support actual inplace ops)

so except for inplace indexing generally these operations are simply more verbose and serve just to provide corner cases for testing

@datapythonista

This comment has been minimized.

Copy link
Member

datapythonista commented Dec 3, 2018

@pandas-dev/pandas-core I think this was agreed during the sprint at Scipy, but I'm not sure if it was discussed when to deprecate the inplace parameters. Is it something we want to do before 1.0?

Personally I think the sooner, the better, since the decision is made.

@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Dec 3, 2018

this is going to cause major downstream complaints, but I agree this should be sooner rather than later.

Would accept a FutureWarning for this for 0.24.0

@datapythonista datapythonista referenced this issue Dec 3, 2018

Closed

WIP/DEPR: Deprecate inplace=True #24063

1 of 4 tasks complete
@TomAugspurger

This comment has been minimized.

Copy link
Contributor

TomAugspurger commented Dec 3, 2018

A few questions:

  1. There are some places (.where comes to mind, fillna? Maybe others?) where the operation can truly be done inplace. I don't think those should be deprecated.
  2. Following 1. is it possible to detect when we're doing an "inplace" operation on a copy? Could we only warn in those cases?
@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Dec 3, 2018

There are some places (.where comes to mind, fillna? Maybe others?) where the operation can truly be done inplace. I don't think those should be deprecated.

this is a very limited number, and though possible to detect (e.g. a Series of a single dtype, or a DataFrame of a single dtype), IMHO is not worth leaving this option. just adding code complexity w/o any real perf benefit. So -1 on doing this.

@datapythonista

This comment has been minimized.

Copy link
Member

datapythonista commented Dec 3, 2018

In my opinion, ideally we should always do the operation in place if possible, but still return a new object.

So, df = df.fillna(0) wouldn't copy memory (assuming no type change), and if the user does want to modify a copy, they should do df2 = df.copy().fillna(0).

Not sure if in practice can be complex to implement in some cases. But if that makes sense, we should deprecate all inplace arguments.

@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Dec 3, 2018

So, df = df.fillna(0) wouldn't copy memory (assuming no type change), and if the user does want to modify a copy, they should do df2 = df.copy().fillna(0).

this is too complex. You either do it inplace or you don't. The keyword controls it. If we remove the keyword then it should never modify the original.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

TomAugspurger commented Dec 3, 2018

just adding code complexity w/o any real perf benefit.

It depends on the application, but not having to copy can be a pretty big win, right? :)

Still, I agree that this is tough (impossible) to detect ahead of time. Is it feasible to detect it after the fact, and raise a PerformanceWarning in those cases (possibly elevating the an error later, if that makes sense)? In a simplified BlockManager world (one block per column) we could maybe keep a weakref from the column to the actual data (ndarray or the EA's data).

def fillna(self, **kwargs):
    # in BlockManager.fillna
    ref_to_data = self._get_ref_to_data()
    result = self.apply('fillna', **kwargs)
    new_data = self._get_ref_to_data()
    if ref_to_data != new_data and inplace:
         warnings.warn(PerformanceWarning)

If so, I would prefer to go that route, rather than having users change code.

@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Dec 3, 2018

I am really -1 on this in any users code. So while this may have to be an extended deprecation cycle I think its worth it.

@datapythonista

This comment has been minimized.

Copy link
Member

datapythonista commented Dec 3, 2018

This is the list of methods functions in the public API with a parameter named inplace.

EDITED: Moved the lists to the description

Those are not really in place, and should be deprecated:

  • pandas.eval
  • pandas.Series.clip
  • pandas.Series.clip_lower
  • pandas.Series.clip_upper
  • pandas.Series.drop
  • pandas.Series.drop_duplicates
  • pandas.Series.rename_axis
  • pandas.Series.reset_index
  • pandas.Series.set_axis
  • pandas.Series.mask
  • pandas.Series.dropna
  • pandas.Series.interpolate
  • pandas.Series.sort_values
  • pandas.Series.sort_index
  • pandas.Series.replace
  • pandas.DataFrame.mask
  • pandas.DataFrame.query
  • pandas.DataFrame.clip
  • pandas.DataFrame.clip_lower
  • pandas.DataFrame.clip_upper
  • pandas.DataFrame.eval
  • pandas.DataFrame.drop
  • pandas.DataFrame.drop_duplicates
  • pandas.DataFrame.rename
  • pandas.DataFrame.rename_axis
  • pandas.DataFrame.reset_index
  • pandas.DataFrame.set_axis
  • pandas.DataFrame.set_index
  • pandas.DataFrame.dropna
  • pandas.DataFrame.replace
  • pandas.DataFrame.interpolate
  • pandas.DataFrame.sort_values
  • pandas.DataFrame.sort_index
  • pandas.Index.rename
  • pandas.Index.set_names
  • pandas.MultiIndex.set_levels
  • pandas.MultiIndex.set_labels
  • pandas.core.resample.Resampler.interpolate

Can be really in place (under discussion):

  • pandas.Series.where
  • pandas.Series.fillna
  • pandas.DataFrame.where
  • pandas.DataFrame.fillna

Removed with Panel:

  • pandas.Panel.clip
  • pandas.Panel.clip_lower
  • pandas.Panel.clip_upper
  • pandas.Panel.drop
  • pandas.Panel.dropna
  • pandas.Panel.sort_index

If that sounds good, I'll create a PR to raise FutureWarning for the ones that we all agree should be deprecated (the first list). Then we can follow up with the rest.

Let me know if there are more from the first list that you want to postpone to the second phase.

@jorisvandenbossche may be you also want to take a look here.

@h-vetinari

This comment has been minimized.

Copy link
Contributor

h-vetinari commented Dec 3, 2018

@datapythonista @jreback
Another method that should be part of this discussion is .update, which is AFAICT the only method that's inplace by default, and does not even have an inplace-kwarg. Adding that has met resistance in #22286, but one could deprecate the whole method and replace it with df.coalesce: #22812

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

TomAugspurger commented Dec 3, 2018

@datapythonista how'd you define that list of methods that are not really inplace? I haven't looked closely, but things like Series.rename_axis should be doable without copying data. I believe clip / clip_lower as well.


I am really -1 on this in any users code.

So that argument is independent of whether or not any operation can be inplace, and we should discuss that. i.e. that it's the "opinion of pandas" that inplace is an anti-pattern to be avoided at all time.

Personally, I'm not sure about that.

@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Dec 3, 2018

my point is this adds a lot of complexity
sure it’s opionated but so what

complexity is killing the ability of most folks to make modifications

this simplifies the model a great deal

@datapythonista

This comment has been minimized.

Copy link
Member

datapythonista commented Dec 3, 2018

I just started the list of the actual inplace methods with the ones you said. There are some I can guess they should be able to be in place, like replace, but thought you or Jeff would know from the top of your head, so didn't want to guess.

Will move the lists to the description, and try to get it closer to reality, feel free to edit afterwards.

@datapythonista

This comment has been minimized.

Copy link
Member

datapythonista commented Dec 3, 2018

I updated the description with what I think it's non-controversial, and the rest. Let me know if there is anything that seems to be in the wrong place.

@datapythonista

This comment has been minimized.

Copy link
Member

datapythonista commented Dec 31, 2018

Is the list of the methods I defined as non-controversial, non-controversial enough so we can deprecate them already (they are listed in the description)?

I assume the rest needs some more discussion?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

TomAugspurger commented Jan 2, 2019

Sorry, haven't had a chance to look. Dumping some thoughts below.


IMO, inplace is currently overloaded for two uses: shorter code
from not having to repeat the names of identifiers, and (possibly) avoiding
a data copy.

  1. Avoid typing the name of the identifier twice. We write
my_df_with_a_really_long_name.dropna(inplace=True)

versus

my_df_with_a_really_long_name = my_df_with_a_really_long_name.dropna()

In this case, inplace serves a role similar to inplace operators on
mutable objects (e.g. int).

my_long_id = 5
my_long_id += 1

This inplace operation doesn't mutate the object (you can't mutate an int).

  1. (Possibly) avoiding a copy. Some methods, like DataFrame.where, may
    avoid a memory copy by mutating data in place.
In [42]: arr = np.arange(12, dtype='float').reshape(4, 3)

In [43]: df = pd.DataFrame(arr, copy=False)

In [44]: np.shares_memory(df._data.blocks[0].values, arr)
Out[44]: True

In [45]: df.where(df > 5, inplace=True)

In [46]: arr
Out[46]:
array([[nan, nan, nan],
       [nan, nan, nan],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.]])

In [47]: np.shares_memory(df._data.blocks[0].values, arr)
Out[47]: True

Personally, I don't really care about use-case 1. I'd recommend people chain
methods.

I care a great deal about use-case 2. I'd love it if pandas could, optionally,
do zero-copy operations. But this is a hard problem.


So if I were designing the API today, I would use the keyword copy=True, rather
than inplace=False. Operations would never mutate data in place by default, but
would provide that option when it's feasible. I likely wouldn't include the
inplace keyword, and would instead instruct people to use method chaining.

But given the current situation, I might prefer having both inplace and copy.
Then we can leave inplace for "update the object associated with this name",
regardless of whether data is copied, and copy for "mutate data in place, and
return a new dataframe that's a view on the same data." But I'm less certain about
how to move forward than I am about what I would do if starting from scratch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment