Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorted columns and indices #2

Closed
jorenverspeurt opened this issue Oct 15, 2019 · 5 comments
Closed

Sorted columns and indices #2

jorenverspeurt opened this issue Oct 15, 2019 · 5 comments

Comments

@jorenverspeurt
Copy link

jorenverspeurt commented Oct 15, 2019

Hi! I love your tool, it makes a great complement to xsv for some of the csv processing I do, but there's one thing that kind of messes up my workflow: the indices and columns are in a seemingly-random order in the output. I assume that this order is based on the hash values of the column and index names?
Ideally the sorting would be a command line option, but if that's too complicated, maybe it's worth it to make the sorted output the default? I do this in the version I have installed locally by replacing the HashSets you use for the columns and indices in aggregation.rs with BTreeSets, for which the iterators return the items in (ascending) sorted order. This may be a bit simplistic and might reduce performance a bit, but in my experience for ~1GB csv files it doesn't seem to change much of the performance characteristics.
What do you think?

@maxblee
Copy link
Owner

maxblee commented Nov 13, 2019

Hi! Thanks for using it and thanks for the feedback.

I definitely think this is worth addressing. My instinct is that making the sorted output the default on the columns makes sense, especially since tools like xsv sort don't address the header sort order. Plus, columns in pivot tables tend to be categorical/have relatively few values, so performance shouldn't be much of a practical issue.

In terms of sorting indices, I'm curious about what the benefit is to having this tool handle sorting, rather than piping the output into something like xsv sort -s 1. I'm not against adding sorting the indices into the tool as a command-line option, but I would like to have a bit of a better sense of the reasoning behind doing so before I go ahead and design it.

@jorenverspeurt
Copy link
Author

For the index sorting you're right, having options for things that are already well-covered by xsv doesn't make much sense. On the other hand, what I'd expect as default behavior, as a regular user, would either be keeping the order in which the index values appear in the original, or sorting them. So from that perspective it may be worth it to provide a flag to switch between those two?
That last point can apply to the columns as well, except that in that case as you mention it makes more sense to provide more options, as xsv doesn't provide an easy way of sorting there.
So I guess those are the main motivations, user expectations and convenience on the one hand, and functions not available in other common csv processing tools like xsv on the other.
In terms of changes I see it this way: changing the behavior in this case can easily be achieved by changing the set implementation, to std::collections::BTreeSet for sorted values if you'd like to avoid adding a dependency and to something like indexmap::set::IndexSet or linked_hash_set::LinkedHashSet for values in the order they appear in, or sorted, if you don't mind adding one of those crates as a dependency. Whichever you pick is also going to be reversible, I checked. The linked_hash_set documentation mentions IndexSet saying that it's faster if you don't care about removal performance (and we don't), so that may be the best option.

@maxblee
Copy link
Owner

maxblee commented Nov 23, 2019

I think adding indexmap::set::IndexSet makes sense. In terms of changes, I think it makes sense to default the behavior to sorting the columns and maintaining the original order of the indexes. And then adding flags to present the columns in the original order and in reversed sorted order. Does that sound reasonable? I'm amenable to also adding flags to sort the indexes if you want, though I do think they should appear in the original order, rather than in sorted order, by default.

@jorenverspeurt
Copy link
Author

Sounds good! Agreed on the defaults. And I don't think adding the option to sort the indexes as well adds that much extra value, but you could add it for consistency.

@maxblee
Copy link
Owner

maxblee commented Apr 6, 2020

Hi. Thanks again for your suggestion. I wanted to refactor the existing code, but I finally found some time to do that; the current version has support for sorting rows and columns.

@maxblee maxblee closed this as completed Apr 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants