Sorted columns and indices #2

jorenverspeurt · 2019-10-15T09:51:03Z

Hi! I love your tool, it makes a great complement to xsv for some of the csv processing I do, but there's one thing that kind of messes up my workflow: the indices and columns are in a seemingly-random order in the output. I assume that this order is based on the hash values of the column and index names?
Ideally the sorting would be a command line option, but if that's too complicated, maybe it's worth it to make the sorted output the default? I do this in the version I have installed locally by replacing the HashSets you use for the columns and indices in aggregation.rs with BTreeSets, for which the iterators return the items in (ascending) sorted order. This may be a bit simplistic and might reduce performance a bit, but in my experience for ~1GB csv files it doesn't seem to change much of the performance characteristics.
What do you think?

The text was updated successfully, but these errors were encountered:

maxblee · 2019-11-13T03:16:10Z

Hi! Thanks for using it and thanks for the feedback.

I definitely think this is worth addressing. My instinct is that making the sorted output the default on the columns makes sense, especially since tools like xsv sort don't address the header sort order. Plus, columns in pivot tables tend to be categorical/have relatively few values, so performance shouldn't be much of a practical issue.

In terms of sorting indices, I'm curious about what the benefit is to having this tool handle sorting, rather than piping the output into something like xsv sort -s 1. I'm not against adding sorting the indices into the tool as a command-line option, but I would like to have a bit of a better sense of the reasoning behind doing so before I go ahead and design it.

jorenverspeurt · 2019-11-20T12:41:02Z

For the index sorting you're right, having options for things that are already well-covered by xsv doesn't make much sense. On the other hand, what I'd expect as default behavior, as a regular user, would either be keeping the order in which the index values appear in the original, or sorting them. So from that perspective it may be worth it to provide a flag to switch between those two?
That last point can apply to the columns as well, except that in that case as you mention it makes more sense to provide more options, as xsv doesn't provide an easy way of sorting there.
So I guess those are the main motivations, user expectations and convenience on the one hand, and functions not available in other common csv processing tools like xsv on the other.
In terms of changes I see it this way: changing the behavior in this case can easily be achieved by changing the set implementation, to std::collections::BTreeSet for sorted values if you'd like to avoid adding a dependency and to something like indexmap::set::IndexSet or linked_hash_set::LinkedHashSet for values in the order they appear in, or sorted, if you don't mind adding one of those crates as a dependency. Whichever you pick is also going to be reversible, I checked. The linked_hash_set documentation mentions IndexSet saying that it's faster if you don't care about removal performance (and we don't), so that may be the best option.

maxblee · 2019-11-23T07:47:09Z

I think adding indexmap::set::IndexSet makes sense. In terms of changes, I think it makes sense to default the behavior to sorting the columns and maintaining the original order of the indexes. And then adding flags to present the columns in the original order and in reversed sorted order. Does that sound reasonable? I'm amenable to also adding flags to sort the indexes if you want, though I do think they should appear in the original order, rather than in sorted order, by default.

jorenverspeurt · 2019-11-28T16:56:06Z

Sounds good! Agreed on the defaults. And I don't think adding the option to sort the indexes as well adds that much extra value, but you could add it for consistency.

maxblee · 2020-04-06T15:26:01Z

Hi. Thanks again for your suggestion. I wanted to refactor the existing code, but I finally found some time to do that; the current version has support for sorting rows and columns.

maxblee closed this as completed Apr 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sorted columns and indices #2

Sorted columns and indices #2

jorenverspeurt commented Oct 15, 2019 •

edited

maxblee commented Nov 13, 2019

jorenverspeurt commented Nov 20, 2019

maxblee commented Nov 23, 2019

jorenverspeurt commented Nov 28, 2019

maxblee commented Apr 6, 2020

Sorted columns and indices #2

Sorted columns and indices #2

Comments

jorenverspeurt commented Oct 15, 2019 • edited

maxblee commented Nov 13, 2019

jorenverspeurt commented Nov 20, 2019

maxblee commented Nov 23, 2019

jorenverspeurt commented Nov 28, 2019

maxblee commented Apr 6, 2020

jorenverspeurt commented Oct 15, 2019 •

edited