`from-ssv` logic updated #845

thomasheartman · 2019-10-16T22:47:27Z

Hey! So a pull request with no prior discussion may be bad form, but I was tinkering around with this problem I had and managed to solve it, and due to how we're seemingly in very different time zones, I just wanted to get it out there without too much delay. Hope you don't mind.

What and why

Anyway: I'd like to change the table creation algorithm for from-ssv. Previously, the table would be created based only on an entries index in a row as decided by the separators in the row. This meant that two values, though vertically misaligned, could still belong together:

col1                col2
a        b

So in the above snippet, assuming a separator of two spaces, a belongs to col1, and b to col2. This makes a lot of sense to me, but it has one pretty big hole: what about empty values?

Given this table:

col1   col2   col3
a             c

It seems quite intuitive that c would go in col3 and that col2 just doesn't have a value in this case, but with the old (current) algorithm, c would belong to col2 and col3 would have an empty value.

With the new algorithm, the first table would give col1 the value of a b and col2 and empty string. The second table would pair col1 with a, col3 with c and col2 with an empty string again.

Thus, given that the only real world examples I have seen of space-separated values being used are output from kubectl (Kubernetes) or oc (OpenShift), and that these output values sometimes do have holes in them, I think this approach, basing an entry's column on its horizontal alignment, is more correct and intuitive.

Discussion on this is very welcome, and please do not feel any pressure to merge this if you disagree with the sentiment. I should have opened an issue for clarification but got a bit to absorbed by trying to solve the problem, so let this act as both a potential solution and a general discussion for the issue.
Cheers!

New tests are added to test for additional cases that might be trickier to handle with the new logic. Old tests are updated where their expectations are no longer expected to hold true. For instance: previously, lines would be treated separately, allowing any index offset between columns on different rows, as long as they had the same row index as decided by a separator. When this is no longer the case, some things need to be adjusted.

The table parsing/creation logic has changed from treating every line the same to processing each line in context of the column header's placement. Previously, lines on separate rows would go towards the same column as long as they were the same index based on separator alone. Now, each item's index is based on vertical alignment to the column header. This may seem brittle, but it solves the problem of some tables operating with empty cells that would cause remaining values to be paired with the wrong column. Based on kubernetes output (get pods, events), the new method has shown to have much greater success rates for parsing.

gitpod-io · 2019-10-16T22:47:31Z

sophiajt · 2019-10-17T06:49:11Z

I think you're right it's going to often be columns. We have a way now to split on just the whitespace, and this gives you something closer to what you would want: being able to understand the columns more visually.

Do you think this is going to be the major way people use it? Is it worth having two modes or just this one?

thomasheartman · 2019-10-17T07:05:22Z

Hmm, that's a good question. I don't really have a clear idea about how people use text in this format outside of the kubernetes ecosystem, but I think that it's generally going to be presented in a columnar fashion.

If you want to store text in a file format that uses a separator, you're more than likely going to use one of the more traditional formats such as csv or tsv. This makes it easier for humans to scan and for machines to work with. Having a separator be a single character instead of a combination of characters--especially when they're all spaces, which is a character that might appear frequently anyway--makes it clearer where the separation occurs, and easier to parse.

I think the main use case for space-separated tables like this is for output from command line interfaces and the like where the source of the data can't guarantee tab width, and where aligning values is more important than having a clear separator between them. This strikes me as being chiefly concerned with readability, and as such, very likely to align data with column headers.

Side note: I was playing around with it a bit and noticed that when piping it through from-ssv and into to-yaml (or any other format) still renders columns with empty values. This is expected, as I register them as empty spaces, if I instead didn't register them, would the keys be absent from the yaml output, or would they still appear? I'll have a look.

Previously it would split the last column on the first separator value found between the start of the column and the end of the row. Changing this to using everything from the start of the column to the end of the string makes it behave more similarly to the other columns, making it less surprising.

thomasheartman · 2019-10-17T07:42:51Z

Update on not registering values if they're empty: indeed, they don't show up in the JSON output, which is good. However, because the TaggedDict uses an IndexMap as its backing data structure, this may result in the columns being out of order if they're not all present in the first row of data.

For instance, in some of my tests, I saw columns that should have been roughly in the middle get moved to the end because they're only populated sometimes. I think this would be more confusing to users (and to me), than some values happening to return empty strings when converted to json or yaml (most of which could be filtered out relatively easily. I've been looking at the API, but haven't found an easy way to guarantee the order when values might be empty, so I think this might be a worthwhile trade-off in the meantime.

sophiajt · 2019-10-19T18:26:00Z

Yeah that sounds good. I think we still need to work out exactly what happens in tables where the columns are different, both visually and in terms of how they serialize, but for now I think we can land this and keep iterating.

thomasheartman · 2019-10-20T20:39:57Z

Yeah, I agree. I'm happy with where it's at right now, but wouldn't mind expanding on it (or for anyone else to) if there are other suggestions.

You sound like you have some thoughts, so I'd be very interested to hear them.

Thomas Hartmann added 2 commits October 17, 2019 00:17

Formats file.

f214053

sophiajt merged commit 84a9899 into nushell:master Oct 19, 2019

thomasheartman mentioned this pull request Oct 28, 2019

Using --headerless on --from-csv removes the first row #875

Closed

elferherrera pushed a commit to elferherrera/nushell that referenced this pull request Feb 7, 2022

Flatten records. Not thoroughly tested though (nushell#845)

f4c0538

kubouch pushed a commit that referenced this pull request Feb 7, 2022

Flatten records. Not thoroughly tested though (#845)

d1ef4d1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`from-ssv` logic updated #845

`from-ssv` logic updated #845

thomasheartman commented Oct 16, 2019 •

edited

gitpod-io bot commented Oct 16, 2019 •

edited

sophiajt commented Oct 17, 2019

thomasheartman commented Oct 17, 2019

thomasheartman commented Oct 17, 2019

sophiajt commented Oct 19, 2019

thomasheartman commented Oct 20, 2019

from-ssv logic updated #845

from-ssv logic updated #845

Conversation

thomasheartman commented Oct 16, 2019 • edited

What and why

gitpod-io bot commented Oct 16, 2019 • edited

sophiajt commented Oct 17, 2019

thomasheartman commented Oct 17, 2019

thomasheartman commented Oct 17, 2019

sophiajt commented Oct 19, 2019

thomasheartman commented Oct 20, 2019

`from-ssv` logic updated #845

`from-ssv` logic updated #845

thomasheartman commented Oct 16, 2019 •

edited

gitpod-io bot commented Oct 16, 2019 •

edited