Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Preserve key order when passing list of dicts to DataFrame on py 3.6+ #27309

Merged
merged 53 commits into from Jul 17, 2019

Conversation

@pilkibun
Copy link
Contributor

commented Jul 9, 2019

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Related to #25915, #26587, #24859, #26113, #10056 (orderedicts), #11181/#11416 (list of namedtuple). #25911.

Update: #13304/#13309 Was merged three years ago so let's just make list of dicts act like list of OrderedDict (as pointed out by @jorisvandenbossche).

Actual

In [63]: data= [
    ...: {'name': 'Joe', 'state': 'NY', 'age': 18},
    ...: {'name': 'Jane', 'state': 'KY', 'age': 19}
    ...: ]
    ...: pd.DataFrame(data)
Out[63]: 
   age  name state
0   18   Joe    NY
1   19  Jane    KY

Expected

In [64]: pd.DataFrame(data)
Out[64]: 
   name state  age
0   Joe    NY   18
1  Jane    KY   19
Four years ago, #10056 asked for the implied order of columns in a list of `OrderedDict` to be preserved by the `DataFrame` constructor. @thatneat [commented](https://github.com//issues/10056#issuecomment-509383829) yesterday that with 3.7 guaranteed dict order, this should extend to dict-like in general. I think users have a reasonable expectation for this to work, and therefore that pandas should support it. @jreback [voted](https://github.com//issues/10056#issuecomment-98812435) +0 on adding this (four years ago).

namedtuple has the convenient property of homogeneous keys and key-order which a list of dicts
doesn't have, dicts are allowed to omit keys, and the key order also may change from dict to dict.
Given that, I settled on a reasonable compromise that matches user expectations in practice:

  1. Only look at the first dict in the list.
  2. Only guarantee the column order of the keys which actually appear in it.
  3. Clarification the order among columns not included in the first dict is undefined, except that they will appear after all the columns that do.
  4. Added Changes apply to Python3.6+ only

In practice, I think the only case users actually care about is sensible behavior when passing a list of dicts which is homogeneous in terms of key and key-order, which this PR provides.

@pilkibun pilkibun changed the title ENH: Support new case of implied column ordering in Dataframe() ENH: Preserve implied column ordering when passing list of dict to DataFrame Jul 9, 2019

@thatneat

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2019

Thanks for following through on this!

If I can ask for a small amount of scope creep, though - it seems like all other things being equal, the ordering of columns added "later on" should still match the input if possible. This would be analogous to the behavior of dict.update():

In [1]: d={5:4, 3:2}                                                                                                 

In [2]: list(d.keys())                                                                                               
Out[2]: [5, 3]

In [3]: d.update({3:3, 5:5, 4:4, 1:1})

In [4]: list(d.keys())
Out[4]: [5, 3, 4, 1]

That said, this is already a big improvement and I agree with you that you're covering the most common case. If implementing the above turns out to be too tricky for now, maybe you could just remove the part of the test case that explicitly shows that "added later" "XXX" and "YYY" columns are sorted - I wouldn't want to imply that that's a desired behavior and close down the option of implementing this later. [EDIT: I misread the test, it's great 😄 ]

@thatneat

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2019

I think you've misread the test.

yep, I totally did. Never mind! :shipit:

pilkibun added some commits Jul 9, 2019

@pep8speaks

This comment has been minimized.

Copy link

commented Jul 9, 2019

Hello @pilkibun! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-16 00:14:13 UTC

pilkibun added some commits Jul 9, 2019

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jul 10, 2019

So this has come up a few times in other incarnations but I don't really see the point of this PR. If the data is truly ordered then representing as a list of dicts on the way in is a rather inefficient way of storing the data. If it's not ordered picking the first row or just iterating keys in a 2D plane seems rather arbitrary.

There are a lot of edge cases and nuances that make behavior undefined so I generally don't see a value add

@thatneat

This comment has been minimized.

Copy link
Contributor

commented Jul 10, 2019

I don't really see the point of this PR

To supply a data point here:

My teams have run in to this issue fairly frequently and have been consistently surprised by/ had to work around DataFrame column ordering behaviors. A common case is in tests where you want it to be extremely easy to construct a DataFrame fixture with a particular shape, but turns out to be more complex because we have to explicitly pass columns or use less-straightforward data types than dict. It's not a huge cost to deal with it, but I can attest that this is a fairly common source of friction and this PR will help.

@WillAyd What are the edge cases and nuances you're thinking of?
Does my suggestion of using the maintaining-insertion-order behavior dict and dict.update() help? It may not always be ordering the user wants, but at least it is consistent and easy to explain.

You may find this python-dev conversation useful. Python core developers recently (late 2017) decided that they had all of the edge cases nailed down enough to commit to insertion ordering for dicts. Now that they've figured it out, maybe it's a natural course for pandas to follow along?

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jul 10, 2019

The topic of Python insertion order into a single dictionary is not entirely relevant when talking about a list of dictionaries. I'm not sure why we would assume the first row is really indicative of anything in all of the below cases (should be non-trivial to think of more)

>>> [{'z': 1}, {'c': 1, 'b': 1', 'a': 1]}]
>>> [{'a': 1, 'b': 1, 'c': 1}, {'c': 1, 'b': 1', 'a': 1]}]
>>> [{'a': 1, 'b': 1, 'c': 1}, {'z': 1, 'y': 1', 'x': 1]}]
>>> [{'x': 1, 'y': 1, 'z': 1}, {'a': 1, 'y': 1', 'z': 1]}]
@jreback
Copy link
Contributor

left a comment

I am -0 on this change because its hard to know the 'rule' for a user and its pretty arbitrary that its the first row's keys (which is of course data dependent).

it is much simpler to simply pass columns= if you actually do care of about the ordering or .reindex() anytime you actually want a specific ordering.

I am not sure there is a 'good' soln here.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jul 10, 2019

About the "other incarnations" that @WillAyd mentioned that recently came up, see #26587, #24859, #26113 and some others (I though we had a PR doing exactly the same change as this recently, but can't directly find it)

Personally, I would be fine with a change in the line of this PR. But for the me, the main reason is that I think that we should make the treatment of "lists of records" more consistent in general. Whether it are lists of series, lists of dicts or lists of namedtuples, those cases should be broadly equivalent.

  • Lists of Series currently preserves column order based on the index. And it seems they try to preserve the order of the first as much as possible, appending new observed ones.
  • Lists of namedtuples only takes the order (and length!) of the first, which leads to some broken cases (eg tests a list of namedtuples with different fields -> they get silently written to wrong columns)
  • Lists of dicts -> the keys are sorted.

So I don't think we should try to think of a new "set or rules" for dicts (as spelled out in the top post).
But I I would personally very much welcome a change trying to make this handling of "lists of record-like objects" consistent.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jul 10, 2019

Opened a separate issue for the unsafe namedtuple behaviour: #27329

@thatneat

This comment has been minimized.

Copy link
Contributor

commented Jul 10, 2019

it is much simpler to simply pass columns= if you actually do care of about the ordering or .reindex() anytime you actually want a specific ordering.

As a user, I beg to differ. I regularly see cases (see my previous comments) that would be more straightforward and intuitive if you didn't have to pass columns or reindex, as judged by seeing my team members try without columns or reindex first, and then be disappointed when they realize they have to.

Completely agree with @jorisvandenbossche above re: focusing on consistency and not reinventing the wheel. It sounds like Lists of Series already follows the maintain-insertion-order (MIO) rule that I advocated above. Landing on something consistent would be really helpful to me and my colleagues.

MIO seems like as good a rule as any. The data dependence of ordering is an odd side-effect, but it seems worth it for consistency.

@thatneat

This comment has been minimized.

Copy link
Contributor

commented Jul 10, 2019

I'd love to know if PRs moving things more towards MIO-based unified handling of "lists of record-like objects" would be considered in general. If that direction were agreed upon, this PR seems like one of the more important steps but there would be more that I would like to contribute.

@pilkibun pilkibun changed the title ENH: Preserve implied column ordering when passing list of dict to DataFrame ENH: Use key order for column ordering when passing list of homogeneous dicts to DataFrame Jul 11, 2019

@pilkibun pilkibun changed the title ENH: Use key order for column ordering when passing list of homogeneous dicts to DataFrame ENH: Preserve key order when passing list of homogeneous dicts to DataFrame Jul 11, 2019

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jul 11, 2019

Just to be clear, is the definition of record "an ordered set of key-value pairs"?

I don't have an exact definition, but something like that yes. I think Series, dicts and namedtuples are all similar enough (and record-like) to expect them to be handled similarly.

Lists of dicts -> the keys are sorted

did you mean sorted (as in lexically) or ordered (what we've been discussing)?

I was describing the current behaviour (which is sorted lexically, not random as you said in the top post)

Let's be precise about the ordering guarantee pandas provides. I'm working under the assumption that It's simply

If you pass a like-ordered like-keyed list of records, pandas will preserve the key ordering

That's it. We don't need to argue about other, undefined behavior.
That rule already holds for namedtuples, and this PR makes it true also for dicts.

That's the basic "rule" for Series, but not the full rule. For Series, other behaviour is not undefined (we don't want undefined behaviour, at least it should be deterministically sorted as is now).
I would need to look more closely at the actual code, but I think the logic for Series is basically doing a "union" of the indices (without sorting, so idx1.union(idx2, sort=False) for two indexes, but then of course using another routine that can handle multiple).

"There is no good solution" as @jreback said, to making any further guarantees on ordering. Because they'll have to be arbitrary in some sense.

There is a perfectly fine solution, it is what Series already does, and the same logic is followed by the Index.union method.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jul 11, 2019

There is a perfectly fine solution, it is what Series already does, and the same logic is followed by the Index.union method.

BTW, we already use this exact behaviour when passing OrderedDicts:

In [3]: records = [OrderedDict([('c', 1), ('a', 2)]), OrderedDict([('b', 3), ('a', 4)])]                                                                      

In [4]: pd.DataFrame(records)                                                                                                                                 
Out[4]: 
     c  a    b
0  1.0  2  NaN
1  NaN  4  3.0

So basically what I want to say: we already have the defined behaviour and code for Series and OrderdDict, so I don't see any reason to not do this for normal dicts as well since they are now ordered as well.
(and I would then separately also fix the namedtuple case, as that feels buggy, but let's discuss that separately)

Given the above, I think this PR can be simplified a lot. I think it is basically making the check for OrderedDict to also allow normal dicts:

if columns is None:
gen = (list(x.keys()) for x in data)
sort = not any(isinstance(d, OrderedDict) for d in data)
columns = lib.fast_unique_multiple_list_gen(gen, sort=sort)

CI
@jreback
Copy link
Contributor

left a comment

looks ok, ping . on green.

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
pandas/core/internals/construction.py Show resolved Hide resolved

@jreback jreback added this to the 0.25.0 milestone Jul 12, 2019

pilkibun added some commits Jul 12, 2019

@pilkibun

This comment was marked as outdated.

Copy link
Contributor Author

commented Jul 12, 2019

Addressed all comments.

Last minute qualm: do you think some users will miss the ability to have sorted keys? right now they can choose between OrderedDict for insertion-order and dict for sorted. That can be useful too.

Update:
sort_index(axis=1) covers the sorted keys case

pilkibun added some commits Jul 12, 2019

@jorisvandenbossche
Copy link
Member

left a comment

Last tiny remark, should be good otherwise (sorry for the many back and forth rounds)

pandas/core/internals/construction.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved

pilkibun added some commits Jul 12, 2019

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
pandas/core/internals/construction.py Outdated Show resolved Hide resolved
pandas/core/internals/construction.py Show resolved Hide resolved
@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jul 15, 2019

@jreback this PR has seen enough back and forth on small details, I think we can just merge as is.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2019

@jreback this PR has seen enough back and forth on small details, I think we can just merge as is.

I disagree

jorisvandenbossche and others added some commits Jul 15, 2019

@pilkibun

This comment has been minimized.

Copy link
Contributor Author

commented Jul 15, 2019

Thanks @jorisvandenbossche for the fixes. I pushed a few more of @jreback's comments.

For the record, the overhead of doing full column discovery is not negligible. I profile it at about 50% slower compared to passing columns explicitly. I think that's acceptable (it has been for OrderedDict), but it should be clear moving forward with this for dicts, and in the future for namedtuples too.

In [4]: import pandas as pd
   ...: 
   ...: from typing import NamedTuple
   ...: from collections import namedtuple
   ...: import gc 
   ...: 
   ...: Foo=namedtuple("Foo","a,b,c,d")
   ...: 
   ...: d=dict(a=1,b=2,c=3,d=4)        
   ...: data1=[d]*100000
   ...: %timeit pd.DataFrame(data1, columns=['a','b','c','d']) 
   ...: %timeit pd.DataFrame(data1)
96 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
132 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jul 15, 2019

For the record, the overhead of doing full column discovery is not negligible.

Yes, but note that this has not changed. Also on master, specifying the column names is faster than doing the discovery. The only change is that the union of the dict keys keep the order instead of doing a sort, and that has basically no performance implication.

@jreback jreback merged commit f1684a1 into pandas-dev:master Jul 17, 2019

15 checks passed

codecov/patch 100% of diff hit (target 50%)
Details
codecov/project 92.91% (target 82%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20190716.1 succeeded
Details
pandas-dev.pandas (Checks) Checks succeeded
Details
pandas-dev.pandas (Docs) Docs succeeded
Details
pandas-dev.pandas (Linux py35_compat) Linux py35_compat succeeded
Details
pandas-dev.pandas (Linux py36_32bit) Linux py36_32bit succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow) Linux py36_locale_slow succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow_old_np) Linux py36_locale_slow_old_np succeeded
Details
pandas-dev.pandas (Linux py37_locale) Linux py37_locale succeeded
Details
pandas-dev.pandas (Linux py37_np_dev) Linux py37_np_dev succeeded
Details
pandas-dev.pandas (Windows py36_np15) Windows py36_np15 succeeded
Details
pandas-dev.pandas (Windows py37_np141) Windows py37_np141 succeeded
Details
pandas-dev.pandas (macOS py35_macos) macOS py35_macos succeeded
Details

@pilkibun pilkibun deleted the pilkibun:10056 branch Jul 17, 2019

@pilkibun

This comment has been minimized.

Copy link
Contributor Author

commented Jul 17, 2019

@jorisvandenbossche,thanks for saving this PR after a false-start and for your patience in reviewing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.