Auto-detect tabular data from command outputs #1608

Closed
waldyrious opened this issue Apr 19, 2020 · 17 comments
Labels
delight this feature would delight users enhancement New feature or request Stale used for marking issues and prs as stale

Comments

@waldyrious
Contributor

waldyrious commented Apr 19, 2020

This is likely something that has been discussed already, and parts of it may even be already implemented, but since I couldn't locate a discussion in the issue tracker, I thought I'd open an issue myself.

I understand that the current strategy of Nushell is to provide commands that output structured data, and offer a way for other commands to do so as well so that Nushell can integrate them within rich pipelines. I am not sure how much of this is expected to remain explicit, though — the 0.12 release post mentions inference of data types:

We’ve been hard at work at improving how we read in unstructured data. In this release, you’ll see the beginning of type inference as data is read in.

...but beyond individual data points, I suppose the long term ambition is to be able to also infer the structure of unstructured data, right? So I suppose that would look like the ability to automatically detect tab-separated output, or space-aligned values, or even interpret common types of ASCII or Unicode tables (such as those built with box-drawing characters), and seamlessly convert them to structured data Nushell can work with.
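To make the idea of inferring structure a little more concrete, here is a minimal Python sketch of the kind of heuristic such detection might start from. It is not Nushell's implementation; the function name `guess_table_format`, the set of box-drawing characters, and the order of the checks are all assumptions made for illustration.

```python
import re

# Assumed subset of box-drawing characters; real tables may use others.
BOX_CHARS = set("─│┬┼┴├┤┌┐└┘")


def guess_table_format(text):
    """Return a rough guess at the tabular layout of `text` (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    # Box-drawing characters on any line suggest a Unicode table.
    if any(BOX_CHARS & set(ln) for ln in lines):
        return "box-drawing"
    # Every line framed by pipes suggests a Markdown-style table.
    if all(ln.strip().startswith("|") and ln.strip().endswith("|") for ln in lines):
        return "pipe-delimited"
    # The same non-zero number of tabs on every line suggests TSV.
    tab_counts = {ln.count("\t") for ln in lines}
    if len(tab_counts) == 1 and tab_counts != {0}:
        return "tab-separated"
    # A run of two or more spaces between words on every line
    # suggests space-aligned columns.
    if len(lines) > 1 and all(re.search(r"\S {2,}\S", ln) for ln in lines):
        return "space-aligned"
    return "unknown"
```

A heuristic like this has to read several lines before it can answer, and it will misclassify some inputs, so it is best seen as an illustration of the problem rather than a solution to it.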

I know that this is already possible when opening structured files that Nushell knows how to recognize, as described in #1018:

You can use open some_excel_file.xlsx and it will do the import automatically. from-xlsx is the command it will call for you (rather than having to do it manually)

So to be clear, I'm talking about inferring structure from the stdout of regular commands, like this one, or this one, or #619, or #443.

Is this part of the roadmap? Is this described or discussed elsewhere in more detail? To me it sounds like a crucial piece to allow Nushell to take off and integrate with existing tools; manually implementing structure-friendly versions of commands, as was done with ls, ps, etc., doesn't seem as scalable.

@thegedge thegedge added delight this feature would delight users enhancement New feature or request labels Apr 19, 2020
@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale used for marking issues and prs as stale label Jun 24, 2021
@waldyrious
Contributor Author

I'd understand if maintainers close this issue based on an explicit decision not to implement it, but I wouldn't want it to be closed just for being old.

@github-actions github-actions bot removed the Stale used for marking issues and prs as stale label Jul 7, 2021
@github-actions

github-actions bot commented Oct 6, 2021

This issue is being marked stale because it has been open for 90 days without activity. If you feel that this is in error, please comment below and we will keep it marked as active.

@github-actions github-actions bot added the Stale used for marking issues and prs as stale label Oct 6, 2021
@waldyrious
Contributor Author

Same as my previous comment.

@github-actions github-actions bot removed the Stale used for marking issues and prs as stale label Oct 7, 2021
@Jacobious52
Contributor

With 0.40 released we now have detect columns. This could be a good starting point for automatically detecting tabular data. It would be awesome not to need command | detect columns every time, or to have to make aliases for certain commands.

@waldyrious
Contributor Author

For reference, here's the output of git-fame (with non-tabular parts removed) as it is currently interpreted by detect columns:

Original output:

> git-fame | head -20 | tail -n +5
| Author                         |   loc |   coms |   fils |  distribution   |
|:-------------------------------|------:|-------:|-------:|:----------------|
| Ben Balter                     | 17610 |    243 |     90 | 59.7/14.2/19.0  |
| Mike Linksvayer                |  3232 |    690 |     84 | 11.0/40.3/17.7  |
| Jason Long                     |  1534 |     24 |     10 | 5.2/ 1.4/ 2.1   |
| Haacked                        |  1166 |     64 |     17 | 4.0/ 3.7/ 3.6   |
| Alexis Tyler                   |   930 |      1 |     14 | 3.2/ 0.1/ 3.0   |
| Laurent Joubert                |   572 |      2 |      1 | 1.9/ 0.1/ 0.2   |
| martinsievers                  |   405 |      8 |      1 | 1.4/ 0.5/ 0.2   |
| Andreas Renberg (IQAndreas)    |   347 |      6 |      3 | 1.2/ 0.4/ 0.6   |
| emidiostani                    |   323 |      4 |      1 | 1.1/ 0.2/ 0.2   |
| XhmikosR                       |   250 |     63 |     21 | 0.8/ 3.7/ 4.4   |
| Matthew Buckett                |   229 |      2 |      1 | 0.8/ 0.1/ 0.2   |
| Waldir Pimenta                 |   225 |     26 |     40 | 0.8/ 1.5/ 8.4   |
| TheInterestingSoul             |   225 |     12 |      1 | 0.8/ 0.7/ 0.2   |
| Benjamin J. Balter             |   187 |     15 |     23 | 0.6/ 0.9/ 4.9   |

Current detect columns behavior:

> git-fame | head -20 | tail -n +5 | detect columns
────┬────────────────────────────────┬────────────────────────────────┬────────────────────────────────┬────────────────────────────────┬────────────────────────────────┬────────────────────────────────
 #  │               |                │             Author             │              loc               │              coms              │              fils              │          distribution          
────┼────────────────────────────────┼────────────────────────────────┼────────────────────────────────┼────────────────────────────────┼────────────────────────────────┼────────────────────────────────
  0 │ |:---------------------------- │ |:---------------------------- │ |:---------------------------- │ |:---------------------------- │ |:---------------------------- │ |:---------------------------- 
    │ ---|------:|-------:|-------:| │ ---|------:|-------:|-------:| │ ---|------:|-------:|-------:| │ ---|------:|-------:|-------:| │ ---|------:|-------:|-------:| │ ---|------:|-------:|-------:| 
    │ :----------------|             │ :----------------|             │ :----------------|             │ :----------------|             │ :----------------|             │ :----------------|             
  1 │ |                              │ Balter                         │ 17610                          │ 243                            │ 90                             │ 59.7/14.2/19.0                 
  2 │ |                              │ Linksvayer                     │ 3232                           │ 690                            │ 84                             │ 11.0/40.3/17.7                 
  3 │ |                              │ Long                           │ 1534                           │ 24                             │ 10                             │ 2.1                            
  4 │ |                              │ Haacked                        │ 1166                           │ 64                             │ 17                             │ 3.6                            
  5 │ |                              │ Alexis                         │ 930                            │ 1                              │ 14                             │ 3.0                            
  6 │ |                              │ Laurent                        │ 572                            │ 2                              │ 1                              │ 0.2                            
  7 │ |                              │ martinsievers                  │ 405                            │ 8                              │ 1                              │ 0.2                            
  8 │ |                              │ Andreas                        │ 347                            │ 6                              │ 3                              │ 0.6                            
  9 │ |                              │ emidiostani                    │ 323                            │ 4                              │ 1                              │ 0.2                            
 10 │ |                              │ XhmikosR                       │ 250                            │ 63                             │ 21                             │ 4.4                            
 11 │ |                              │ Matthew                        │ 229                            │ 2                              │ 1                              │ 0.2                            
 12 │ |                              │ Waldir                         │ 225                            │ 26                             │ 40                             │ 8.4                            
 13 │ |                              │ TheInterestingSoul             │ 225                            │ 12                             │ 1                              │ 0.2                            
 14 │ |                              │ Benjamin                       │ 187                            │ 15                             │ 23                             │ 4.9                            
────┴────────────────────────────────┴────────────────────────────────┴────────────────────────────────┴────────────────────────────────┴────────────────────────────────┴────────────────────────────────

Some ways this could be improved:

The desired behavior (using detect columns, or perhaps a from-text-table command as proposed in #1183) would look like this:

> git-fame | head -20 | tail -n +5 | detect columns
────┬─────────────────────────────┬───────┬──────┬──────┬────────────────
 #  │             Author          │  loc  │ coms │ fils │  distribution  
────┼─────────────────────────────┼───────┼──────┼──────┼────────────────
  0 │ Ben Balter                  │ 17610 │  243 │   90 │ 59.7/14.2/19.0 
  1 │ Mike Linksvayer             │  3232 │  690 │   84 │ 11.0/40.3/17.7 
  2 │ Jason Long                  │  1534 │   24 │   10 │ 5.2/ 1.4/ 2.1  
  3 │ Haacked                     │  1166 │   64 │   17 │ 4.0/ 3.7/ 3.6  
  4 │ Alexis Tyler                │   930 │    1 │   14 │ 3.2/ 0.1/ 3.0  
  5 │ Laurent Joubert             │   572 │    2 │    1 │ 1.9/ 0.1/ 0.2  
  6 │ martinsievers               │   405 │    8 │    1 │ 1.4/ 0.5/ 0.2  
  7 │ Andreas Renberg (IQAndreas) │   347 │    6 │    3 │ 1.2/ 0.4/ 0.6  
  8 │ emidiostani                 │   323 │    4 │    1 │ 1.1/ 0.2/ 0.2  
  9 │ XhmikosR                    │   250 │   63 │   21 │ 0.8/ 3.7/ 4.4  
 10 │ Matthew Buckett             │   229 │    2 │    1 │ 0.8/ 0.1/ 0.2  
 11 │ Waldir Pimenta              │   225 │   26 │   40 │ 0.8/ 1.5/ 8.4  
 12 │ TheInterestingSoul          │   225 │   12 │    1 │ 0.8/ 0.7/ 0.2  
 13 │ Benjamin J. Balter          │   187 │   15 │   23 │ 0.6/ 0.9/ 4.9  
────┴─────────────────────────────┴───────┴──────┴──────┴────────────────

Once this behavior is available with commands that are manually added to the pipeline, it should be just a matter of automatically detecting the format in pipelines, like open works today for loading files (i.e. removing the need to pipe a file's contents into the corresponding from <format> command).
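As a rough illustration of the desired behavior, the following Python sketch parses a pipe-delimited Markdown table like the git-fame output into records. It is not how detect columns works; `parse_markdown_table` is a hypothetical helper, and the sketch assumes that cells never contain an escaped pipe.

```python
import re


def parse_markdown_table(text):
    """Parse a pipe-delimited Markdown table into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip anything that is not a table row
        # Drop the outer pipes, then split on the inner ones.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # The delimiter row holds only dashes with optional alignment colons.
    if body and all(re.fullmatch(r":?-+:?", cell) for cell in body[0]):
        body = body[1:]
    return [dict(zip(header, row)) for row in body]


sample = """\
| Author     |   loc |  distribution   |
|:-----------|------:|:----------------|
| Ben Balter | 17610 | 59.7/14.2/19.0  |
| Jason Long |  1534 | 5.2/ 1.4/ 2.1   |
"""

for record in parse_markdown_table(sample):
    print(record)
```

Splitting on the pipe rather than on whitespace is what keeps multi-word author names and values such as 5.2/ 1.4/ 2.1 in a single cell.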

@fdncred
Collaborator

fdncred commented Feb 18, 2022

you can start to approach wrapping this in a nushell table with the following command
open foo.txt | str find-replace '[─│┬┼┴]' '' -a | detect columns -s 3
I just saved the table to foo.txt to play with it iteratively.

@petrisch

petrisch commented Jul 4, 2022

Leaving my 2 cents here as well, since I was running into this.
I have my own tool written for internal use, which uses prettytable-rs to generate a nice table output in the form:

+---------------------------------------------+--------------------+
| Deutsch                                     | Englisch           |
+---------------------------------------------+--------------------+
|  Haus                                       |  house             |
+---------------------------------------------+--------------------+
|  Maus                                       |  mouse             |
+---------------------------------------------+--------------------+

I tried with str replace '[-+]' but that gives me blank lines and later blank rows, which are hard to get rid of.
Maybe I will give myself a CSV output option on my tool, but it would be really neat if Nushell could read this directly as a from prettytable-string sort of thing. I can see this is not an easy thing to get.
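For a table in this bordered style, one way around the blank-row problem is to drop the border lines entirely instead of replacing their characters. The Python sketch below shows that approach; `parse_bordered_table` is a hypothetical helper, not an existing Nushell or prettytable-rs feature, and it assumes the first content row is the header.

```python
def parse_bordered_table(text):
    """Parse a +---+ bordered ASCII table into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Border lines start with "+"; skipping them avoids the blank rows
        # that a character-level replace leaves behind.
        if not line.startswith("|"):
            continue
        inner = line[1:-1] if line.endswith("|") else line[1:]
        rows.append([cell.strip() for cell in inner.split("|")])
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


sample = """\
+----------+----------+
| Deutsch  | Englisch |
+----------+----------+
|  Haus    |  house   |
+----------+----------+
|  Maus    |  mouse   |
+----------+----------+
"""

for record in parse_bordered_table(sample):
    print(record)
```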

@github-actions github-actions bot added the Stale used for marking issues and prs as stale label Jan 17, 2023
@sophiajt
Contributor

sophiajt commented Aug 9, 2023

I doubt we would support auto-detection as it would be a heavy burden to pay for processing the data that's being streamed between commands.

That said, we have commands like detect columns now which can help you turn text output from an external command into structured data.

Closing this issue, but please open specific issues against the detect columns command if you find examples it could do better.

@sophiajt sophiajt closed this as completed Aug 9, 2023
@waldyrious
Contributor Author

waldyrious commented Aug 9, 2023

we have commands like detect columns now which can help you turn text output from an external command into structured data.

Closing this issue, but please open specific issues against the detect columns command if you find examples it could do better.

I already did so above. I would prefer this issue to be kept open and re-scoped to be about better behavior of detect columns than having to repeat the comments above in a new thread, which would lose context, reactions, subscriptions, etc.

@g-yziquel

g-yziquel commented Sep 29, 2023

@waldyrious I concur. Such a use case for tabular data coming from the wild is indeed a serious one and would warrant an ongoing discussion. I am hitting all of the above issues, personally. If support for that kind of autodetection is too heavy a burden for the core nu shell team, it should be a separate plugin project. But no matter what, this feature needs to be supported somehow.

@amtoine
Member

amtoine commented Sep 29, 2023

@waldyrious
i would be fine reopening this issue if you can

  • change the title
  • update the description with examples of input / output pairs

😉 😇

@g-yziquel

@amtoine What's the problem with the title?

@amtoine
Member

amtoine commented Sep 29, 2023

What's the problem with the title?

if this becomes about re-scop[ing] to be about better behavior of detect columns, the current title suggests that there is nothing to detect tabular data from another raw command, which is a bit confusing to me because we have detect columns already 😕

however, make detect columns better would be clearer i think 😋

@g-yziquel

g-yziquel commented Sep 29, 2023

@amtoine Well, it seems unclear whether this is or is not about detect columns. As long as people seem not to agree as to whether this should be implemented in nushell or out of nushell in some plugin, it seems not quite right to state that this is about detect columns.

detect columns is a bit too brittle for my CSV use case. It mangles data into the wrong columns. Tools in the Python csvkit package do a better job, but not a perfect one, and output data in the form of a Markdown table that gets consumed incorrectly by detect columns.

In my opinion, the question being debated here is the scope the nushell developers want to have for detect columns and what should be developed in some outside plugin to handle various kinds of data tables, which are too varied, IMO, to be handled only by detect columns without some kind of plugin.

So, I'd suggest the title: Clarify the scope of nushell support of detect columns for tabular data on the Wild Wide Web.

If that question is settled, we'll then be able to know what should be implemented in detect columns and what should be experimentally supported outside of detect columns. It's an ecosystem question, IMO.

Proper support for markdown tables in detect columns (without full markdown support) would unblock many aspects of this issue. So I'd advise, as the original poster suggested, to go for markdown table support in detect columns without full markdown support.

@amtoine
Member

amtoine commented Sep 29, 2023

makes sense 👍

the title you put forward Clarify the scope of nushell support of detect columns for tabular data on the Wild Wide Web. sounds sensible as a first step 👍

detect columns / a plugin is almost an implementation detail, i agree 😋

@waldyrious
Contributor Author

Upon revisiting the entire thread, I suppose it might make sense to let it stand as a record of both the explicit request, and rationale, for Nushell to automatically detect tabular data outputted by non-nushell CLI programs (explained in the opening comment), i.e. without manually piping the data to detect columns or another similar command; and of the project's decision against it:

I doubt we would support auto-detection as it would be a heavy burden to pay for processing the data that's being streamed between commands.

Therefore, even though I suggested earlier that I would be open to rescoping the issue, I now believe it would be a disservice to the project's records to hide that explicit discussion on auto-detection. My stance is now that a separate issue might be the best approach here, especially because it can make the question under discussion more explicit and specific, as @g-yziquel argues for above.

8 participants