Deprecate Series / DataFrame.append #35407

Closed
TomAugspurger opened this issue Jul 24, 2020 · 103 comments · Fixed by #44539

@TomAugspurger
Contributor

I think that we should deprecate Series.append and DataFrame.append. They draw an analogy to list.append, but it's a poor analogy, since the behavior isn't (and can't be) in place: the data for the index and values needs to be copied to create the result.

These are also apparently popular methods. DataFrame.append is around the 10th most visited page in our API docs.

Unless I'm mistaken, users are always better off building up a list of values and passing them to the constructor, or building up a list of NDFrames followed by a single concat.
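
For illustration, a minimal sketch of both patterns (the data here is made up):

import pandas as pd

# Pattern 1: accumulate plain records in a list, construct once.
rows = []
for i in range(3):
    rows.append({"a": i, "b": i ** 2})  # cheap list.append, no copying
df = pd.DataFrame(rows)

# Pattern 2: accumulate DataFrames in a list, concatenate once.
frames = [pd.DataFrame({"a": [i], "b": [i ** 2]}) for i in range(3)]
df = pd.concat(frames, ignore_index=True)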

@TomAugspurger added the "Deprecate" (functionality to remove in pandas) and "Needs Discussion" (requires discussion from core team before further action) labels Jul 24, 2020
@jreback
Contributor

jreback commented Jul 24, 2020

+1 from me (though I will usually be +1 on deprecating things generally)

yeah, agree here - these are a footgun

@erfannariman
Member

+1. It's better to have one method, pandas.concat; it's also more flexible, since it takes a list of DataFrames and offers the option to concatenate over axis 0 or axis 1.
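
For instance, a minimal sketch of that flexibility (toy data):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})

stacked = pd.concat([df1, df2], axis=0)       # rows stacked; columns are unioned
side_by_side = pd.concat([df1, df2], axis=1)  # columns placed next to each other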

@shoyer
Member

shoyer commented Jul 25, 2020

Strong +1 from me!

Just look at all the (bad) answers to this StackOverflow question:
https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe

@jreback
Contributor

jreback commented Jul 25, 2020

we should also deprecate expansion indexing (which is an implicit append)

@AlexKirko
Member

AlexKirko commented Jul 27, 2020

+1 from me
There is really no reason to have this when we have concat available, especially because, IIRC, append works by calling concat, and I don't think append abstracts away enough to be worth keeping.

@achapkowski

How do you expand a DataFrame by a single row without having to create a whole DataFrame, then?

@TomAugspurger
Contributor Author

I'd recommend thinking about why you need to expand by a single row. Can those updates be batched before adding to the DataFrame?

If you know the label you want to set it at, then you can use .loc[key] = ... to expand the index without creating an intermediate object. Otherwise you'll need to create a DataFrame and use concat.
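
For example, a minimal sketch of label-based expansion (the labels here are made up):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Assigning to a not-yet-existing label expands the index,
# without building an intermediate one-row DataFrame.
df.loc[2] = [5, 6]          # new row labelled 2
df.loc["total"] = df.sum()  # non-integer labels work too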

@darindillon

Disagree. Appending a single row is useful functionality and very common. Yes, we understand it's inefficient; but as TomAugspurger himself said, this is the 10th most commonly referenced page in the docs, so clearly lots of people have the use case of adding a single row to the end. We can tell ourselves we're removing the method to "encourage good design", but people still want this functionality, so they'll just use the workaround of creating a new single-row DataFrame and concat'ing. That just requires the user to write even more code to get the exact same performance hit - so how have we made anyone's life better?

@taylor-schneider

Not being able to add rows to a data structure makes no sense. It's one thing not to add the inplace argument, but deprecating the feature is nuts.

@achapkowski

@TomAugspurger using df.loc[] requires me to know the length of the dataframe and to write code like this:

df.loc[len(df)] = <new row>

This feels like overly complex syntax for an API that makes data operations simple. Internally, df.append or series.append could just do what is shown above, without dirtying up the user interface.

Why not take a page from lists? Their append method is quick because it pre-allocates slots in advance. Modify the internals post DataFrame/Series creation to keep, say, 1000 empty hidden rows slotted and ready to hold new information. If/when the slots are filled, the DF/Series would expand outside the view of the user.

@TomAugspurger
Contributor Author

loc requires you to know the label you want to insert it at, not the length.

Why not take a page from lists? Their append method is quick because it pre-allocates slots in advance.

You could perhaps suggest that to NumPy. I don't think it would work in practice given the NumPy data model.

@achapkowski

Is NumPy deprecating its append method? If not, why deprecate it here?

NumPy doc: https://numpy.org/doc/stable/reference/generated/numpy.append.html

@MarcoGorelli
Member

MarcoGorelli commented Aug 24, 2021

Shall we make this happen and get a deprecation warning in for 1.4 so these can be removed in 2.0? If there are no objections, I'll make a PR later (or anyone following along can - that's probably the fastest way to move the conversation forward).

@achapkowski

@MarcoGorelli my question still stands: why is this being done?

@darindillon

darindillon commented Aug 24, 2021

Yes, why are we doing this? It seems like we're removing a VERY popular feature (the 10th most visited help page, according to the OP) just because that feature is slow. But if we remove it, people will still want the functionality, so they'll just end up implementing it manually anyway - so how are we improving anything by removing this?

@jreback
Contributor

jreback commented Aug 24, 2021

there is a ton of discussion, pls read it in full

this has long been planned, as inplace operations make the code base inordinately complex and offer very little benefit

@achapkowski

@jreback I don't see tons of discussion in this issue; please point me to the discussion so that I might be better informed. What I see is a community asking you not to do this.

@MarcoGorelli
Member

There's a long discussion here on deprecating inplace: #16529

But if we remove the feature, people will still want this functionality so they'll just end up implementing it manually anyway, so how are we improving anything by removing this?

I'd argue that this is still an improvement, because then it would be clearer to users that this is a slow operation - with the status quo, people are likely to think it's analogous to list.append.

What's your use-case for append? What does it do that you can't do with 1-2 lines of code calling concat? If you want to make a case for keeping it, please show a minimal example where having append is a significant improvement.

@neinkeinkaffee
Contributor

take

@jreback
Contributor

jreback commented Apr 6, 2022

@behrenhoff

By the way: how does concat improve this code?

import pandas as pd
from glob import glob

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = total_df.append(df).drop_duplicates()

Yes, it is easy to replace:

total_df = pd.DataFrame()
for file in glob("*.csv"):
    print(f"reading {file}")
    df = pd.read_csv(file)
    total_df = pd.concat([total_df, df]).drop_duplicates()

this is exactly the reason append is super problematic
we have an entire doc note, that I guess no one reads, explaining that you are doing an exponential copy here (no kidding you run out of RAM)

so you have proved the point of why append is a terrible idea - it's not about readability, but about how easy it is to fall into traps that are non-obvious at first glance

@wumpus

wumpus commented Apr 7, 2022

If only there were a well-known algorithm that was not an exponential copy.

@behrenhoff
Contributor

behrenhoff commented Apr 7, 2022

this is exactly the reason append is super problematic
we have an entire doc note, that I guess no one reads, explaining that you are doing an exponential copy here (no kidding you run out of RAM)

You did not read, or did not understand, what I was saying. The version with append is the one that WORKS; the one with concat at the end runs into memory issues (because the small drop_duplicates in the loop is what fixes the problem, and it cannot be moved out of the loop).

And yes, you can be smarter - for example ((file1 + file2).drop_dups + (file3 + file4).drop_dups).drop_dups or similar, where + can be concat or append, it doesn't matter. I was just proving the point that the suggested way - "collect all DFs in a list and concat them all at the end" - does not always work.

@MarcoGorelli
Member

Thanks @behrenhoff, that's a nice example - though can't you still batch the concats? Say, read 10 files at a time, concat them, drop duplicates, repeat...
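
Something like this sketch, say (file names and the batch size are hypothetical):

import pandas as pd
from glob import glob

files = sorted(glob("*.csv"))
batch_size = 10
total_df = pd.DataFrame()

for i in range(0, len(files), batch_size):
    batch = [pd.read_csv(f) for f in files[i:i + batch_size]]
    # One concat per batch bounds the number of copies, and dropping
    # duplicates per batch caps memory, as in the original loop.
    total_df = pd.concat([total_df, *batch]).drop_duplicates()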


This seems like a perfect summary of the issue anyway:

it's not about readability but easy to fall into traps that are non obvious at first glance


At some point we should lock the issue - this is taking a lot of attention away from a lot of people, there have been off-topic comments, no compelling use-case for keeping DataFrame.append, and strong agreement among pandas devs (especially those who have been around the longest).

@behrenhoff
Contributor

behrenhoff commented Apr 7, 2022

Say, read 10 files at a time, concat them, drop duplicates, repeat...

Yes, that would work. So would a million other solutions. In practice, I could even exploit more about the date ordering inside the files (all files here have a rather long overlapping history, but newer files can overwrite (fix) data in older files, so it is of course a drop_duplicates with a subset and keep='last'). My point is: this is a non-issue, because the operation is done once every 6 months or so; the daily operation just adds exactly one file. There is no point in optimizing this further as long as it works. That is the whole point I was trying to make. You force people to optimize / change code where the old code just works and there is no need to modify it. And the real gains in this example are not in append vs concat but in exploiting knowledge of the input files and reading them in a different order or in groups.

Note that I am not saying this is a use case that can only be done with append. I am saying that removing a common feature imposes unnecessary work on many people, and that you don't get performance gains for free just by replacing append with concat (you need to do more).

Anyway, end of discussion for me. I already did the work and got rid of all my appends.

I just fear that many people will not upgrade if their code breaks. You are also making it harder for new users. "Append" is a good and common English word; "concat" is not - at least I can't find it in a dictionary (there is "concatenate", but that is a word far fewer people know - this might not be a problem for native English speakers, though). I would always search for "append", not for "concat", if I didn't know the proper function name.

@PolarNick239

Hi, here is a minimal reproducer that was totally broken:

Before:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": 3}, index=[0])
for rowIndex, row in b.iterrows():
    print(a.append(row))
# Output:
#    A    B
#0  1  2.0
#0  3  NaN

After:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": 3}, index=[0])
for rowIndex, row in b.iterrows():
    print(pd.concat([a, row]))
# Output:
#     A    B    0
#0  1.0  2.0  NaN
#A  NaN  NaN  3.0

Also, please note that if you add a deprecation warning to such a popular method - one that is used widely and called many times per second - the message will be spammed a lot, leading to much bigger overhead than the allocations and memory copying themselves. So it is beneficial to print such a message only on the first call.
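
For what it's worth, the standard library can already de-duplicate the message on the user's side - a sketch, assuming the deprecation is emitted as a FutureWarning:

import warnings

# "once" prints only the first occurrence of each matching warning,
# regardless of how many call sites trigger it.
warnings.filterwarnings("once", category=FutureWarning)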

@phofl
Member

phofl commented Apr 8, 2022

What are you trying to do? It would be way more efficient to call

pd.concat([a, b], ignore_index=True)

Edit: Or was it on purpose to put A into the index instead of as a column?

@PolarNick239

I know, this is just an illustration. I was iterating over rows and, if a row was OK, adding it to another table. I believe there are much better ways via masking and concatenation taking such masks into account, but I wanted to keep the code as simple as possible.

@phofl
Member

phofl commented Apr 8, 2022

Thanks for your response. It is important for us to see use cases that cannot be done more efficiently another way. You are right - checking data can be done way more efficiently via masking and then concatenating the result.

@PolarNick239

How can I concat such a row to another table a (with a superset of the row's column names) in that case?

@MarcoGorelli
Member

MarcoGorelli commented Apr 8, 2022

with

pd.concat([a, row.to_frame().T], ignore_index=True)

@phofl
Member

phofl commented Apr 8, 2022

You can simply do:

a = pd.DataFrame({"A": 1, "B": 2}, index=[0])
b = pd.DataFrame({"A": [3, 4]})

result = pd.concat([a, b.loc[b["A"] > 3]], ignore_index=True)

Just change the "greater than 3" condition to one that suits your needs. This avoids iterating over the rows. If you have to iterate for some reason, you can use the example from @MarcoGorelli.

@PolarNick239

Not all conditions and not all logic can be expressed readably in such a single-line expression.

For people who, like me, just want to get rid of the warnings:

import pandas as pd

def pandas_append(df, row, ignore_index=False):
    # DataFrame: concatenate directly.
    if isinstance(row, pd.DataFrame):
        result = pd.concat([df, row], ignore_index=ignore_index)
    # Series: turn it into a one-row DataFrame first.
    elif isinstance(row, pd.Series):
        result = pd.concat([df, row.to_frame().T], ignore_index=ignore_index)
    # dict: build a one-row DataFrame aligned to df's columns.
    elif isinstance(row, dict):
        result = pd.concat(
            [df, pd.DataFrame(row, index=[0], columns=df.columns)],
            ignore_index=ignore_index,
        )
    else:
        raise RuntimeError(f"pandas_append: unsupported row type - {type(row)}")
    return result
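
Usage would look something like this (toy data):

df = pd.DataFrame({"A": [1], "B": [2]})
df = pandas_append(df, {"A": 3, "B": 4}, ignore_index=True)
df = pandas_append(df, pd.Series({"A": 5, "B": 6}), ignore_index=True)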

@wstomv

wstomv commented Apr 21, 2022

Here is a use case for DataFrame.append that I think makes sense, and for which it took me way too long to figure out how to replace it with pandas.concat. (Do note that I am not a seasoned pandas user.)

I have a data frame with numeric values, such as

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

and I append a single row with all the column sums

totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)

Simple enough.
Here are the values of df, totals, and df_append

>>> df
   A  B
0  1  2
1  3  4

>>> totals
A    4
B    6
Name: totals, dtype: int64

>>> df_append
        A  B
0       1  2
1       3  4
totals  4  6

Now, using pd.concat naively:

df_concat_bad = pd.concat([df, totals])

which produces

>>> df_concat_bad
     A    B    0
0  1.0  2.0  NaN
1  3.0  4.0  NaN
A  NaN  NaN  4.0
B  NaN  NaN  6.0

Apparently, with df.append the Series object gets interpreted as a row, but with pd.concat it gets interpreted as a column.
You cannot fix this with something like axis=1, because that would add the totals as a column.

Fortunately, in a comment above, the implementation of DataFrame.append is quoted, and from this one can glean the solution:

df_concat_good = pd.concat([df, totals.to_frame().T])

which yields the desired

>>> df_concat_good
        A  B
0       1  2
1       3  4
totals  4  6

I think users need to be aware of such subtleties. I also posted this on StackOverflow.

@MarcoGorelli
Member

This was brought up in #35407 (comment), and in some other comments in this thread, and would/should be part of the transition docs (see #46825).

@javiertognarelli

Worst idea I've seen. Why complicate something so easy? I think it's better to have more options/ways to do something than just one strict way. DataFrame.append() was a very easy way for newbies to add data to a DataFrame.

@etale-cohomology

"[...] around the 10th most visited page in our API docs" and they go ahead and deprecate it.

@mcclaassen

mcclaassen commented Mar 6, 2023

This seems to be decided, but in the future I would argue against doing these sorts of things to improve users' code (and requesting proof of why they can't use pd.concat when they disagree). If it improves maintainability, or makes things easier for devs, go for it. But if something is popular and not "correct", let people do what they want to do. The only valid point I've seen here is for removing the 'inplace' argument; everything else resembles nannying.

@MarcoGorelli
Member

Thanks all for your comments

This is becoming draining - some comments are off-topic, no new arguments are being presented, and some are not particularly respectful.

Locking for now, then - if anyone has any new arguments and wants to make them in a respectful manner, there are no objections to opening a new issue.

It's understandable that some people are unhappy with this decision and have to rewrite some code, but for newbies, getting them to write their code in a better way to begin with will be better for them in the long run.

If the docs on how to use concat are unclear, pull requests are welcome

pandas-dev locked as resolved and limited conversation to collaborators Mar 7, 2023