Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

from-ssv logic updated #845

Merged
merged 4 commits into from Oct 19, 2019

Conversation

thomasheartman
Copy link

@thomasheartman thomasheartman commented Oct 16, 2019

Hey! So a pull request with no prior discussion may be bad form, but I was tinkering around with this problem I had and managed to solve it, and due to how we're seemingly in very different time zones, I just wanted to get it out there without too much delay. Hope you don't mind.

What and why

Anyway: I'd like to change the table creation algorithm for from-ssv. Previously, the table would be created based only on an entries index in a row as decided by the separators in the row. This meant that two values, though vertically misaligned, could still belong together:

col1                col2
a        b

So in the above snippet, assuming a separator of two spaces, a belongs to col1, and b to col2. This makes a lot of sense to me, but it has one pretty big hole: what about empty values?

Given this table:

col1   col2   col3
a             c

It seems quite intuitive that c would go in col3 and that col2 just doesn't have a value in this case, but with the old (current) algorithm, c would belong to col2 and col3 would have an empty value.

With the new algorithm, the first table would give col1 the value of a b and col2 and empty string. The second table would pair col1 with a, col3 with c and col2 with an empty string again.

Thus, given that the only real world examples I have seen of space-separated values being used are output from kubectl (Kubernetes) or oc (OpenShift), and that these output values sometimes do have holes in them, I think this approach, basing an entry's column on its horizontal alignment, is more correct and intuitive.


Discussion on this is very welcome, and please do not feel any pressure to merge this if you disagree with the sentiment. I should have opened an issue for clarification but got a bit to absorbed by trying to solve the problem, so let this act as both a potential solution and a general discussion for the issue.
Cheers!

Thomas Hartmann added 2 commits October 17, 2019 00:17
New tests are added to test for additional cases that might be trickier to
handle with the new logic.

Old tests are updated where their expectations are no longer expected to hold true.
For instance: previously, lines would be treated separately, allowing any index
offset between columns on different rows, as long as they had the same row index
as decided by a separator. When this is no longer the case, some things need to
be adjusted.
The table parsing/creation logic has changed from treating every line the same
to processing each line in context of the column header's placement. Previously,
lines on separate rows would go towards the same column as long as they were the
same index based on separator alone. Now, each item's index is based on vertical
alignment to the column header.

This may seem brittle, but it solves the problem of some tables operating with
empty cells that would cause remaining values to be paired with the wrong
column.

Based on kubernetes output (get pods, events), the new method has shown to have
much greater success rates for parsing.
@gitpod-io
Copy link

gitpod-io bot commented Oct 16, 2019

@sophiajt
Copy link
Member

I think you're right it's going to often be columns. We have a way now to split on just the whitespace, and this gives you something closer to what you would want: being able to understand the columns more visually.

Do you think this is going to be the major way people use it? Is it worth having two modes or just this one?

@thomasheartman
Copy link
Author

Hmm, that's a good question. I don't really have a clear idea about how people use text in this format outside of the kubernetes ecosystem, but I think that it's generally going to be presented in a columnar fashion.

If you want to store text in a file format that uses a separator, you're more than likely going to use one of the more traditional formats such as csv or tsv. This makes it easier for humans to scan and for machines to work with. Having a separator be a single character instead of a combination of characters--especially when they're all spaces, which is a character that might appear frequently anyway--makes it clearer where the separation occurs, and easier to parse.

I think the main use case for space-separated tables like this is for output from command line interfaces and the like where the source of the data can't guarantee tab width, and where aligning values is more important than having a clear separator between them. This strikes me as being chiefly concerned with readability, and as such, very likely to align data with column headers.

Side note: I was playing around with it a bit and noticed that when piping it through from-ssv and into to-yaml (or any other format) still renders columns with empty values. This is expected, as I register them as empty spaces, if I instead didn't register them, would the keys be absent from the yaml output, or would they still appear? I'll have a look.

Previously it would split the last column on the first separator value found
between the start of the column and the end of the row. Changing this to using
everything from the start of the column to the end of the string makes it behave
more similarly to the other columns, making it less surprising.
@thomasheartman
Copy link
Author

Update on not registering values if they're empty: indeed, they don't show up in the JSON output, which is good. However, because the TaggedDict uses an IndexMap as its backing data structure, this may result in the columns being out of order if they're not all present in the first row of data.

For instance, in some of my tests, I saw columns that should have been roughly in the middle get moved to the end because they're only populated sometimes. I think this would be more confusing to users (and to me), than some values happening to return empty strings when converted to json or yaml (most of which could be filtered out relatively easily. I've been looking at the API, but haven't found an easy way to guarantee the order when values might be empty, so I think this might be a worthwhile trade-off in the meantime.

@sophiajt
Copy link
Member

Yeah that sounds good. I think we still need to work out exactly what happens in tables where the columns are different, both visually and in terms of how they serialize, but for now I think we can land this and keep iterating.

@sophiajt sophiajt merged commit 84a9899 into nushell:master Oct 19, 2019
@thomasheartman
Copy link
Author

Yeah, I agree. I'm happy with where it's at right now, but wouldn't mind expanding on it (or for anyone else to) if there are other suggestions.

You sound like you have some thoughts, so I'd be very interested to hear them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants