-
-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly sort heterogeneous types in DictColumns #583
Conversation
af6e461
to
3045f8f
Compare
@@ -557,6 +557,19 @@ def array(cls, columns, dimensions): | |||
return Element.array(columns, dimensions) | |||
|
|||
@classmethod | |||
def recarray(cls, columns, dimensions): | |||
if dimensions is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still seems to be an odd method to introduce just for sorting purposes, especially as it is only needed to figure out a sort order (via argsort
on the returned record array).
Generally I was hoping to avoid record arrays entirely and wouldn't want them to spread elsewhere. That said, the implementation here is fine and right now this method isn't being used in a way that is likely to cause problems later. I don't yet understand why introducing record arrays is simpler but I assume you wouldn't introduce them without a good reason!
Edit: I saw what you wrote in the issue (record arrays being the easiest way to get the desired sorting behaviour) so I guess this is the best we can do.
I've made a comment but now I'm wondering what was wrong with the originally intended (lexicographical) sorting behavior... |
In Lancet, I use lexical sorting without record arrays. The default sort in Python is stable so cycles of sorting do the job. Here is the relevant code in Lancet: for (key, ascending) in sort_cycles:
specs = sorted(specs, key=lambda s: s.get(key, None),
reverse=(not ascending)) Here Not saying it is an efficient approach but it is simple, works for heterogenous types and doesn't use record arrays. |
Okay, so the problem wasn't lexicographical sorting but the fact that it only works on homogeneous arrays. I think the record array approach is still the way to go but I'm happy to turn the recarray method into a function that specifically computes the lexsorted indices. |
Ok, that sounds reasonable. I would make it a utility function (i.e in After that, I'm happy to merge once the tests pass. |
3045f8f
to
536b0f6
Compare
As suggested in #580 this replaces the lexicographical sorting implemented on ArrayColumns and DictColumns with more useful general sorting as is already in place on the DFColumns and NdColumns interfaces. In the case of ArrayColumns we can simply create a record array view into the actual array, which we can apply the sorting to. For DictColumns we need to actually construct a record array but this shouldn't be any more expensive than the existing implementation, which already creates an intermediate array to compute the sort order on.
This should be ready to merge once the tests pass.