ENH: Pandas DataFrame.append and Series.append methods should get an inplace kwarg #14796

Closed
dragonator4 opened this issue Dec 4, 2016 · 6 comments
Labels: API Design; Performance (memory or execution speed performance); Reshaping (concat, merge/join, stack/unstack, explode)

Comments

@dragonator4

Problem description

Currently, appending to a DataFrame requires reassigning the result:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
df = df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')))

append is a DataFrame or Series method, and as such should be able to modify the DataFrame or Series in place. If in-place modification is not required, one may use concat or set the inplace kwarg to False. It would avoid an explicit assignment operation, which is quite slow in Python, as we all know. Further, it would make the behavior similar to that of Python lists, and avoid questions such as these: 1, 2...

Additionally, at present append is a full subset of concat, and as such it need not exist at all. Given the vast number of functions for appending one DataFrame or Series to another in Pandas, it makes sense that each has its merits and demerits. Gaining an inplace kwarg would clearly distinguish append from concat, and simplify code.
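
For reference, a minimal sketch of the overlap described above, using only pd.concat (the variable names are illustrative, not from the issue). Concatenating the two frames yields the same stacked rows that df.append(other) returns, and the original frame is left untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))
other = pd.DataFrame(np.random.rand(5, 3), columns=list("abc"))

# pd.concat returns a new object; df itself is not modified.
combined = pd.concat([df, other])

print(combined.shape)  # (10, 3)
print(df.shape)        # (5, 3)
```

The result keeps the original row labels (0-4 twice); passing ignore_index=True renumbers them.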

I understand that this issue was raised in #2801 a long time ago. However, the conversation there drifted from the simplification offered by the inplace kwarg to performance enhancement. I (and many like me) am looking for ease of use, and not so much for performance. Also, we expect the data to fit in memory (which is a limitation even with the current version of append).

Expected Code

df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')), inplace=True)
@shoyer
Member

shoyer commented Dec 4, 2016

I am opposed to this for the exact reasons discussed in #2801: it would mislead users who might expect a performance benefit.

@jreback
Contributor

jreback commented Dec 4, 2016

Virtually all pandas methods return a new object, the exception being the indexing operations. Using inplace is not idiomatic, is quite unreadable, and is not (more) performant at all.

Closing, though if someone thinks that we should add a signature like

(...., inplace=False), and then raise a TypeError if inplace=True to give a nice error message, then we can reopen for that purpose.

In [2]: df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
   ...: df.append(pd.DataFrame(np.random.rand(5,3), columns=list('abc')), inplace=True)
TypeError: append() got an unexpected keyword argument 'inplace'
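
The signature suggested above could look roughly like the following sketch. The helper append_rows is hypothetical and not part of pandas; it only illustrates accepting inplace=False and rejecting inplace=True with a more helpful message than the generic "unexpected keyword argument" error:

```python
import pandas as pd


def append_rows(frame, other, inplace=False):
    """Hypothetical helper (not a pandas API): append-like concat."""
    if inplace:
        # Fail loudly with a hint instead of silently copying.
        raise TypeError(
            "append cannot operate in place; assign the result instead, "
            "e.g. df = pd.concat([df, other])"
        )
    return pd.concat([frame, other])


df = pd.DataFrame({"a": [1, 2]})
other = pd.DataFrame({"a": [3]})

result = append_rows(df, other)

try:
    append_rows(df, other, inplace=True)
except TypeError as exc:
    message = str(exc)
```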

@jreback jreback closed this as completed Dec 4, 2016
@jreback jreback added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 4, 2016
@jreback jreback added this to the No action milestone Dec 4, 2016
@remidebette

In the case of a namedtuple that contains a Series object, the inplace approach would be a nice feature to have.
This is not related to performance in any way; it is a way to expose data to users.

Indeed, namedtuple objects are designed to let a library expose data to a user while only allowing them to modify it in place.
Overwriting an attribute of a namedtuple intentionally raises AttributeError: can't set attribute, so that the user cannot rebind what your library exposes. Mutable attributes, however, can still be modified.

Consider the following dummy code:

from collections import namedtuple
from pandas import Series

# ----- Library part ------
sample_schema = {
    "name": str,
    "some_info": str,
    "content": Series
}

my_data_type = namedtuple("MyDataType", sample_schema.keys())

exposed_data = my_data_type(
    name="Library data",
    some_info="Modify the content as you want",
    content=Series({"a": 0})
)


# ----- User code part ------
series_to_be_appended = Series({"b": 0})

# This is forbidden
exposed_data.content = exposed_data.content.append(series_to_be_appended)

# This would be allowed but is not implemented in Series
exposed_data.content.append(series_to_be_appended, inplace=True)

The name and some_info attributes are strings and therefore immutable. A user would not (easily) be able to affect them. But here the content can be modified as long as it is not set to a new object altogether.

I would think inplace methods are nice to have on any mutable object in general.
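
As a hedged aside, the scenario above can be sketched with existing pandas behavior: rebinding the namedtuple attribute fails, while setting a new label through .loc (setting with enlargement) grows the same Series object. The trimmed-down MyDataType and the variable names below are illustrative only:

```python
from collections import namedtuple

import pandas as pd

MyDataType = namedtuple("MyDataType", ["name", "content"])
exposed_data = MyDataType(name="Library data", content=pd.Series({"a": 0}))
original = exposed_data.content

# Rebinding the attribute is rejected by namedtuple.
try:
    exposed_data.content = pd.concat(
        [exposed_data.content, pd.Series({"b": 0})]
    )
except AttributeError:
    rebinding_allowed = False
else:
    rebinding_allowed = True

# Setting a label that does not exist yet enlarges the existing Series,
# so the object held by the namedtuple gains the new entry.
exposed_data.content.loc["b"] = 0

print(rebinding_allowed)                 # False
print(exposed_data.content is original)  # True
```

This only covers adding individual labels; it is not a general in-place append of another Series.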

@rtruxal

rtruxal commented Mar 13, 2019

So the consensus among the maintainers is that it would be too confusing to have an append() method which actually appends?

I'd suggest removing the method from DataFrame entirely, or potentially renaming it. Someone familiar with pandas might find it confusing, but the opposite is currently true for those of us without your level of experience.

@paulstapor

paulstapor commented Oct 26, 2020

Agreeing here.
I never understood why Pandas has an API with its own logic rather than sharing that of Python itself. One can get used to the fact that most pandas methods return new objects rather than modifying the object they are called on, although it's counter-intuitive. (Pandas' standard behavior is imho counter-intuitive for everyone who uses more Python than Pandas, which should be most of the user base.) And one can get used to the fact that most Pandas methods behave as a user would expect when passed inplace=True as an argument.

I can still live with that. But not adding the possibility to specify inplace for append() and just defaulting it to False, which effectively keeps the method as it is for all who want that but greatly helps those who need it, is something I cannot follow. Sorry.

@aitikgupta

Adding a use case:

  1. I have a lot of CSV files with few entries in each, many of which have additional columns.
  2. I want a combined dataframe that includes the additional columns. (I landed right on the pandas.DataFrame.append() docs.)

Columns in other that are not in the caller are added as new columns.

  3. The line above reassured me that I had landed in the right place.

combined_dataframe = pd.DataFrame()
for dataframe in list_of_dataframes_read_from_csvs:
    combined_dataframe.append(dataframe, inplace=True)

  4. This raised an error. I checked the docs, found no inplace for append(), and that led me to this issue.
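
For this use case, a minimal sketch with pd.concat, assuming the frames are already loaded (small in-memory DataFrames stand in for the CSV files here): collect the frames in a list and concatenate once, which also brings in the extra columns.

```python
import pandas as pd

# Stand-ins for frames read from CSV files; the second has an extra column.
frames = [
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    pd.DataFrame({"a": [5], "b": [6], "c": [7]}),
]

# One concat call over the whole list. Columns are unioned, and rows
# from frames lacking a column get NaN there.
combined = pd.concat(frames, ignore_index=True, sort=False)

print(list(combined.columns))  # ['a', 'b', 'c']
print(len(combined))           # 3
```

Concatenating once at the end also avoids copying the growing frame on every loop iteration.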

7 participants