Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly sort heterogeneous types in DictColumns #583

Merged
merged 1 commit into from
Mar 30, 2016
Merged

Conversation

philippjfr
Copy link
Member

As suggested in #580 this replaces the lexicographical sorting implemented on ArrayColumns and DictColumns with more useful general sorting as is already in place on the DFColumns and NdColumns interfaces. In the case of ArrayColumns we can simply create a record array view into the actual array, which we can apply the sorting to. For DictColumns we need to actually construct a record array but this shouldn't be any more expensive than the existing implementation, which already creates an intermediate array to compute the sort order on.

This should be ready to merge once the tests pass.

@philippjfr philippjfr force-pushed the columns_sorting branch 3 times, most recently from af6e461 to 3045f8f Compare March 29, 2016 22:51
@@ -557,6 +557,19 @@ def array(cls, columns, dimensions):
return Element.array(columns, dimensions)

@classmethod
def recarray(cls, columns, dimensions):
if dimensions is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still seems to be an odd method to introduce just for sorting purposes, especially as it is only needed to figure out a sort order (via argsort on the returned record array).

Generally I was hoping to avoid record arrays entirely and wouldn't want them to spread elsewhere. That said, the implementation here is fine and right now this method isn't being used in a way that is likely to cause problems later. I don't yet understand why introducing record arrays is simpler but I assume you wouldn't introduce them without a good reason!

Edit: I saw what you wrote in the issue (record arrays being the easiest way to get the desired sorting behaviour) so I guess this is the best we can do.

@jlstevens
Copy link
Contributor

I've made a comment but now I'm wondering what was wrong with the originally intended (lexicographical) sorting behavior...

@philippjfr philippjfr changed the title Correctly implemented numeric sort for ArrayColumns and DictColumns Correctly sort heterogeneous types in DictColumns Mar 29, 2016
@jlstevens
Copy link
Contributor

In Lancet, I use lexical sorting without record arrays. The default sort in Python is stable so cycles of sorting do the job. Here is the relevant code in Lancet:

        for (key, ascending) in sort_cycles:
            specs = sorted(specs, key=lambda s: s.get(key, None),
                           reverse=(not ascending))

Here specs is a list of dictionaries and you want to sort these dictionaries according to the values in a certain key order.

Not saying it is an efficient approach but it is simple, works for heterogenous types and doesn't use record arrays.

@philippjfr
Copy link
Member Author

Okay, so the problem wasn't lexicographical sorting but the fact that it only works on homogeneous arrays. I think the record array approach is still the way to go but I'm happy to turn the recarray method into a function that specifically computes the lexsorted indices.

@jlstevens
Copy link
Contributor

Ok, that sounds reasonable. I would make it a utility function (i.e in core.util) and maybe call it lexsorted_indices?

After that, I'm happy to merge once the tests pass.

@jlstevens jlstevens merged commit fd2b16e into master Mar 30, 2016
@philippjfr philippjfr deleted the columns_sorting branch April 1, 2016 14:24
@philippjfr philippjfr added this to the v1.5.0 milestone Apr 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants