
[WIP] API/CLN: Refactor DataFrame.append #22915

Closed
wants to merge 31 commits into from

Conversation


@lowerthansound lowerthansound commented Oct 1, 2018

This is a WIP pull request; there are still a bunch of things to do. I'm posting it here so you can give me early feedback or point me in a direction if something is not looking good.

The PR will be split into multiple smaller ones, as suggested.

How it is

The DataFrame.append method is pretty confusing as it is right now. There are multiple code paths depending on the input type (dict/Series/DataFrame or a list of these), and the results are undefined in lots of edge cases.

Objective

This PR aims to make the code a bit cleaner and to define the behavior of append more precisely. For example, the input type should not affect the result (i.e. whether other is a DataFrame, Series, etc.).

Also, I wanted to make future changes to the behavior of concat easier. append delegates its heavy lifting to concat and, the way it was before, it was hard to understand what was expected from concat. Now it's a bit easier.
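Conceptually, append is a thin wrapper over concat. A minimal sketch of the equivalence (illustrative only; this uses pd.concat directly, which is where the heavy lifting happens):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
other = pd.DataFrame({"a": [3]})

# df.append(other, ignore_index=True) was essentially this concat call:
result = pd.concat([df, other], ignore_index=True)
```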

Behavior changes

(there might be some things I forgot)

  • Accepted inputs for other are now exactly dict, Series, DataFrame, or a list of any of those. Empty lists are also accepted. Wrong input types raise a TypeError.

  • Equivalent inputs now produce exactly the same result, regardless of the input type.

  • Empty columns and empty indexes will not be used when calculating the result dtypes. This is something I still need to think about; any early suggestions will help.

  • DataFrames with any type of columns index can now be appended. Previously, IntervalIndex, PeriodIndex and MultiIndex columns raised errors.

  • sort=False will never sort the resulting columns. append used to sort them in some cases.

  • sort=True will raise a TypeError when the resulting columns cannot be sorted (it may be better to just issue a RuntimeWarning).

  • An InvalidIndexError will be raised if there are duplicates in the columns indexes and the indexes are neither the same object (identity) nor share the same values (will open an issue to discuss this).
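The input validation described in the first bullet could look something like the following sketch (the helper name and messages are hypothetical, not the PR's actual code):

```python
import pandas as pd

# Types accepted for `other` under the proposed behavior.
_ACCEPTED = (dict, pd.Series, pd.DataFrame)

def validate_other(other):
    """Return `other` normalized to a list, raising TypeError otherwise."""
    if isinstance(other, list):
        for obj in other:
            if not isinstance(obj, _ACCEPTED):
                raise TypeError(f"cannot append object of type {type(obj)!r}")
        return other  # empty lists are accepted as-is
    if not isinstance(other, _ACCEPTED):
        raise TypeError(f"cannot append object of type {type(other)!r}")
    return [other]
```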

Task list (not in order)

  • Implement sort=None + regression tests
  • Behavior when sort is not possible (warn (?))
  • Move towards deprecating sort=None
  • Behavior for empty indexes
  • Some test cases
  • Check/Improve performance
  • Beautiful PR
  • Write Docs
  • What's new
  • Related issues

araraonline added 28 commits September 14, 2018 11:44
The test fixed another thing, but got broken by this PR because its
expected result was based, indirectly, on the bad upcasting of dtypes.
The sorting behaviour of append was pretty messy (still is, but to a
lesser extent). When passed a dict or series, the append function would
reindex both self and other, and then send these to the concat
function. When passed a list of series or dicts, the list would be
converted to a dframe and, if its columns were contained in self, we
would reshape it to match those of self.

Other types would be sent directly to concat, though. The concat
function, by itself, has pretty consistent sorting when sort=True,
but doesn't seem so consistent when sort=False or sort=None (I
haven't dug much deeper, but pandas/core/indexes/api.py:_union_indexes
sorts some arrays even if the sort argument is False).

So, to solve this mess, I tried the simplest approach: leave
concat untouched and focus only on append. Do the concatenating as
usual and, if the sort parameter is False, reindex the resulting
dframe to match what we would expect from an unsorted axis.

Also, I needed to do some juggling because of the sort=None that is
being deprecated. When 'other' is a dict or series and sort=None, we can
set sort=False, and it will work like before. If 'other' is something
else, we just avoid doing the reindex in the end.

I haven't checked all cases, so I'm not sure if this works exactly the
same as before. But hopefully we can provide more tests to pinpoint
the sorting behaviour of append better, and this will be enough for now.
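The "concat as usual, then reindex back to an unsorted axis" idea can be sketched roughly like this (a simplified illustration, not the PR's actual code):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2]], columns=["b", "a"])
df2 = pd.DataFrame([[3, 4]], columns=["a", "c"])

# Unsorted union of the column indexes: df1's columns first,
# then df2's columns that are not already present.
unsorted_cols = df1.columns.union(df2.columns, sort=False)

# Concatenate as usual, then reindex to restore the unsorted column order.
result = pd.concat([df1, df2], ignore_index=True, sort=True)
result = result.reindex(columns=unsorted_cols)
```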
There was a test that checked whether an exception was raised when
appending to a series whose index could not be joined together. This
test checked exclusively for TypeError, and this was what the old code
produced during a dframe reindex at pandas/core/frame.py:6198.

With the new code, however, the reindex is only being called at the end,
and so, the sorts of errors that occur tend to vary. For example, we got
AttributeError, ValueError and TypeError.

While this is not perfect, the situation didn't change much; it will
only break for someone who was catching errors like this with an `except
TypeError` clause. That said, the error message was cryptic before
and still is.

A better solution would involve informing the user of the reason such an
error occurred, while keeping the TypeError to stay
backward-compatible.
Lots of tests have been created.

Tests scope:

- Input types
- Bad coercion to float (difficult to imagine, but I feel it was tested fully)

- Rows index
 - Index name
 - Dtypes (+coercing)

- Columns index
 - Index name
 - Dtypes (+ coercing)
 - Check duplicates
 - Sorting behavior \*

Some tests not yet implemented:

- Checks for duplicates in the rows index (linked with verify_integrity)
- Ignoring rows index (ignore_index) --> Remember to use a modified assert_index_equal (or assert_frame_equal)
- Behavior related to ignore_index (e.g. raising when unnamed Series is passed as arg)

\*: The sorting tests forget about sort=None completely
Because concat does not have a clear reindexing behavior on the columns, I've implemented this behavior separately on DataFrame.append.

Actually, the reindexing is implemented at pandas/core/indexes/api.py and is called from DataFrame.append. The main catch here is that any column types are allowed to be concatenated and, when sorting is not possible, an error is raised.
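The "sort is not possible" case comes from mixed-type indexes, which cannot be ordered in Python 3. A small sketch of why (`can_sort` is a hypothetical helper, not the PR's code):

```python
import pandas as pd

# A mixed-type columns index cannot be ordered in Python 3,
# so attempting to sort it raises a TypeError.
mixed = pd.Index([1, "a", 2.5], dtype="object")

def can_sort(index):
    """Return True if the index values admit a total ordering."""
    try:
        sorted(index)
        return True
    except TypeError:
        return False
```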

Another thing that was added was xfail marks on the tests. These represent parts of the code that haven't been implemented yet or fixes that are better left to another PR.

The behavior for sort=None isn't totally sorted out yet.
The resulting index dtype was being inferred (not always?) from the
indexes' values. This caused a subtle problem where we inferred in a
place where the user didn't want us to.

For example:

>>> df1.columns
Index([0, 1, 2], dtype='object')
>>> df1.append(df1, sort=False).columns
Int64Index([0, 1, 2], dtype='int64')

This commit solves this problem, but it raises a question for empty
indexes: should they be considered when calculating the final dtype or
not? My intuition says that the user usually doesn't know about the
index type, especially if it's empty, so we may avoid friction by
ignoring empty indexes in the calculation.

The same argument may be used for column dtypes after an append where
the resulting DataFrame has exactly one row.
TODO: This shall be reversed later, or be made a bit more strict. My
best choice is: ignore the index when it is empty with dtype object;
consider it when it is empty with another dtype.

May interact somewhat with the float64 result of reindex.
Will be better made in a future version.
When there were duplicates on the columns index, sort was allowed, and
duplicates were allowed if the indexes had the same values (as found by
idx.tolist()).

Now, considering that pandas doesn't allow sorting an index when there
are duplicate values (DataFrame.reindex fails), and that searching for
the same values is counter-productive and prone to fail depending on
the index types, the behavior was modified to this:

- When sort=True and there are duplicates in at least one index, an
  error is raised and append stops.
- Dframes with duplicate indexes are only considered joinable when
  the indexes share the same identity (that is, they are the same object,
  comparable with `idx1 is idx2`)
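The distinction between value equality and identity can be shown in two lines (illustrative only):

```python
import pandas as pd

dup1 = pd.Index(["a", "a", "b"])
dup2 = pd.Index(["a", "a", "b"])

# Equal values, but distinct objects: under the identity rule, these
# two duplicate-laden indexes would not be considered joinable.
same_values = dup1.equals(dup2)   # value equality
same_object = dup1 is dup2        # identity
```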

Some other improvements to the code have also been made, and I believe it
is better in general.
Handle columns index duplicates
@pep8speaks

pep8speaks commented Oct 1, 2018

Hello @araraonline! Thanks for updating the PR.

Line 294:49: E128 continuation line under-indented for visual indent
Line 299:39: E128 continuation line under-indented for visual indent

Line 79:6: E121 continuation line under-indented for hanging indent
Line 84:6: E121 continuation line under-indented for hanging indent
Line 866:17: E127 continuation line over-indented for visual indent
Line 867:32: E721 do not compare types, use 'isinstance()'
Line 991:80: E501 line too long (80 > 79 characters)

Comment last updated on October 02, 2018 at 01:18 Hours UTC

@lowerthansound lowerthansound changed the title [WIP] ENH: Refactor code for DataFrame.append [WIP] API/ENH: Refactor code for DataFrame.append Oct 1, 2018
@lowerthansound lowerthansound changed the title [WIP] API/ENH: Refactor code for DataFrame.append [WIP] ENH/API: Refactor DataFrame.append Oct 1, 2018
@lowerthansound lowerthansound changed the title [WIP] ENH/API: Refactor DataFrame.append [WIP] CLN/API: Refactor DataFrame.append Oct 1, 2018
@lowerthansound lowerthansound changed the title [WIP] CLN/API: Refactor DataFrame.append [WIP] API/CLN: Refactor DataFrame.append Oct 1, 2018
araraonline added 3 commits October 1, 2018 01:13
Also trying to use Index without arguments
This is reflected in the columns index of DataFrame.append: it will
ignore empty indexes (of dtype object)!

Some tests are not passing, but this is due to columns dtypes, not
indexes.
Implement sort=None behavior and regression tests

The default value for sort (None) had a complex behavior, and this commit aimed to
reproduce it.

When the default value changes, it will be wise to remove what was added in this commit,
along with some preparation code that was already present in `append` (calculating
`_obj_type` and `_item_type`).
@TomAugspurger
Contributor

I haven't had a chance to look yet, but FYI this will be very hard to review in its current state.

If possible, I'd recommend identifying small, self-contained changes that can be split off into smaller pull requests. If you're changing anything substantially, it's best to open an issue to discuss API changes and design, before investing too much time in the changes.

@lowerthansound
Author

Oh thanks. I first thought of rebasing this into multiple small commits that would be explanatory, but splitting into PRs looks better. I will also keep the API changes in mind.

@lowerthansound
Author

lowerthansound commented Oct 10, 2018

@TomAugspurger @jreback

Which way would be easier to review if I'm thinking of making various progressive changes to the code? Some thoughts:

  1. PR everything together explaining changes through commits
  2. Work locally on branches that depend on one another, send PRs one at a time
  3. Work locally on branches that depend on one another, send all PRs at once

@TomAugspurger
Contributor

Probably 2 or 3, but I haven't had time to go through #22957 in detail.

Your timing here is just a bit too early :) We're in the middle of a largish refactor to our data types.

After that, we'll have one more for Datetimes and Timedeltas, and then the dust will settle a bit. There's a decent chance that sparse and period are done by next week, and datetimes / Timedeltas a little after that (a week or two).

I suspect that all those changes will fix the issues you saw with Period, Datetime, Timedelta, and Interval indexes.

So maybe see if there are any fixes you can make for MultiIndex that are relatively standalone?

@lowerthansound
Author

Hmmm... Sure! Didn't notice there was so much going on right now.

Think I will wait a little bit or focus on the MultiIndex as suggested

@jreback
Contributor

jreback commented Nov 23, 2018

closing as stale. if you'd like to continue, pls ping.

@jreback jreback closed this Nov 23, 2018