Integrate CSV linter #3493

wesley-dean-flexion · 2024-04-18T14:37:28Z

tl;dr: I would like to add csvclean (from csvkit) as a linter. I'm happy to do the work if people think this is a good idea.

Is your feature request related to a problem? Please describe.

One of my repos includes CSV files and they can be sub-optimal. Just as we have linters (and reformatters) for JSON, XML, and YAML, I would like to add a CSV linter.

Describe the solution you'd like

There exists a package, csvkit, that includes a tool to lint and clean up CSV files:

https://csvkit.readthedocs.io/en/latest/scripts/csvclean.html

I would like to add CSV_CSVCLEAN (the name isn't consequential to me; I just picked it because of YAML_YAMLLINT) that would lint the list of files. When run with APPLY_FIXES set, it would not pass the --dry-run flag to csvclean; when run without APPLY_FIXES set, it would pass --dry-run.

Running csvclean on a CSV file results in two files being created, per the documentation:

Outputs [basename]_out.csv and [basename]_err.csv, the former containing all valid rows and the latter containing all error rows along with line numbers and descriptions

I noticed that running csvclean on a known messy file (i.e., one that produces errors because it is not entirely valid) will NOT set $? to a non-zero value, but it will generate [basename]_err.csv, so something like this might be helpful:

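csvcleanwrapper() {
  # Add --dry-run unless APPLY_FIXES is set; left unquoted so an empty result adds no argument.
  # Succeed only if csvclean succeeds and no error file was written.
  csvclean "$@" $([ -z "$APPLY_FIXES" ] && echo "--dry-run") -- - && [ ! -e stdin_err.csv ]
}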

(So $? is set, $@ carries the additional arguments, $APPLY_FIXES is set to something when we want to fix stuff, etc. Point: a little "syntactic sugar" could be helpful in making it work the way we want.)
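For comparison, here is a minimal Python sketch of the same idea: derive an exit status from the presence of the error file rather than from csvclean's own exit code. The function names error_file_for and lint_exit_code are hypothetical, and the sketch assumes [basename]_err.csv is written next to the input file; neither is confirmed by the csvkit documentation quoted above.

```python
import os


def error_file_for(csv_path: str) -> str:
    """Return the [basename]_err.csv path for a given input file.

    Assumption: the error file is written alongside the input file.
    """
    base, _ext = os.path.splitext(csv_path)
    return f"{base}_err.csv"


def lint_exit_code(csv_path: str, tool_exit_code: int) -> int:
    """Combine the tool's exit code with the presence of an error file.

    A non-zero exit code from the tool is passed through unchanged.
    Otherwise the result is 1 if an error file exists and 0 if not.
    """
    if tool_exit_code != 0:
        return tool_exit_code
    return 1 if os.path.exists(error_file_for(csv_path)) else 0
```

A caller would run csvclean first (for example with subprocess.run) and pass its return code in as tool_exit_code.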

Describe alternatives you've considered

I haven't thought through very many alternatives. I did look through prettier to see if it could clean up CSV like it can for YAML and such; however, it does not appear to have that functionality. If it's there and I missed it, cool, there's that much less work that needs to be done.

Additional context

There are a few images on Docker Hub that provide csvkit, but they're largely several years old. For what it's worth, csvkit regularly provides revisions, the most recent of which (latest / v1.5.0) was released on 28 March, 2024. (point: existing images are behind the current release). I can put together a pipeline to watch the csvkit repo for new releases and package / publish updated images.

I'm happy to do the work to implement this and submit a PR assuming folks are cool with the idea.

CSV has a bunch of limitations, and JSON, TOML, XML, or YAML (etc.) may be a better match to represent data. That's a given and I don't dispute it. Unfortunately, it's not my call how the data are represented, but I do have responsibilities to make sure the pipeline from developer to production detects (and notifies me of) as much noise as possible.

nvuillam · 2024-04-22T17:31:41Z

Hi @wesley-dean-flexion :)

That seems to be a good idea, you have my go-ahead to start implementing :)

About the complexity to call csvkit, you might need to create a python class to handle it :)

wesley-dean-flexion · 2024-04-22T18:26:46Z

(apologies.. I muscle-memoried cvsclean instead of csvclean when creating the branch... ugh...)

Started with csv-clean which is not yet ready for anything. A few questions:

linter_name is the command to run to do the linting; in this case, it would be csvclean because the name of the executable is csvclean .. right?
cli_lint_extra_args can be used to pass --dry-run while cli_lint_fix_remove_args can also be set to --dry-run so that in non-fix mode, it'll pass --dry-run while fix mode will not pass --dry-run... right?
we can use Python regex mechanics (e.g., (?i) to make a regex case-insensitive, [[:space:]]+ for 1 or more white spaces, etc.).. right?
csvclean gives differently-formatted output depending on if it's run with --dry-run or not:

$ csvclean --dry-run acronyms.csv
Line 289: Expected 4 columns, found 5 columns
Line 1196: Expected 4 columns, found 6 columns
Line 1241: Expected 4 columns, found 8 columns
Line 1242: Expected 4 columns, found 2 columns
Line 1307: Expected 4 columns, found 3 columns

### note: NO acronyms_err.csv generated here

$ csvclean acronyms.csv
5 errors logged to acronyms_err.csv

### note: acronyms_err.csv and acronyms_out.csv ARE generated here

So, this is where a wrapper, living in the linters directory, would come in... right? The Python class would check whether we're running in fix mode, apply the --dry-run flag as needed, grab the correct output, make sure the original file is what's pushed when in fix mode, etc... right?
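As a rough sketch of what such a wrapper might do, the snippet below builds the command line for either mode and parses the two output formats shown above. It is standalone and does not use MegaLinter's actual Linter class. The names build_command, parse_dry_run and parse_fix_summary are hypothetical, and the patterns are inferred only from the sample output in this comment, so they may not match other csvclean versions. One detail relevant to the regex question: Python's re module accepts inline flags such as (?i) at the start of a pattern, but it does not support POSIX bracket classes like [[:space:]], so \s is used instead.

```python
import re

# Patterns inferred from the sample output above (an assumption, not a spec).
DRY_RUN_LINE = re.compile(
    r"(?i)^line\s+(\d+):\s+expected\s+(\d+)\s+columns,\s+found\s+(\d+)\s+columns\s*$",
    re.MULTILINE,
)
FIX_SUMMARY = re.compile(r"(?i)^(\d+)\s+errors?\s+logged\s+to\s+(\S+)\s*$", re.MULTILINE)


def build_command(csv_path, apply_fixes):
    """Return the argument list: --dry-run is passed only when not fixing."""
    cmd = ["csvclean"]
    if not apply_fixes:
        cmd.append("--dry-run")
    cmd.append(csv_path)
    return cmd


def parse_dry_run(output):
    """Return (line_number, expected_columns, found_columns) per reported error."""
    return [tuple(int(g) for g in m.groups()) for m in DRY_RUN_LINE.finditer(output)]


def parse_fix_summary(output):
    """Return (error_count, error_file) from fix-mode output, or (0, None)."""
    m = FIX_SUMMARY.search(output)
    if m is None:
        return 0, None
    return int(m.group(1)), m.group(2)
```

In lint mode the error count would be len(parse_dry_run(stdout)); in fix mode it would come from parse_fix_summary(stdout).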

wesley-dean-flexion · 2024-04-29T17:20:03Z

I'm working with @jpmckinney on some interface changes (wireservice/csvkit#1239) that ought to simplify this integration. As a result, when v2.0.0 comes out, a lot of what I wrote before will no longer matter.

Additionally, I submitted wireservice/csvkit#1240 to containerize the tool and publish official images that could be used instead of building the tool via pip during the MegaLinter build process. Hopefully this will simplify the build and isolate MegaLinter from any build problems, interface refactoring, etc..

github-actions · 2024-05-30T00:52:26Z

This issue has been automatically marked as stale because it has not had recent activity.
It will be closed in 14 days if no further activity occurs.
Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

wesley-dean-flexion · 2024-05-30T13:37:29Z

I'm waiting on a PR approval from the csvkit folks so I can move forward with this.

github-actions · 2024-06-30T00:58:53Z

This issue has been automatically marked as stale because it has not had recent activity.
It will be closed in 14 days if no further activity occurs.
Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

wesley-dean-flexion · 2024-07-01T19:52:53Z

see the aforementioned

github-actions · 2024-08-01T01:01:15Z

This issue has been automatically marked as stale because it has not had recent activity.
It will be closed in 14 days if no further activity occurs.
Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

github-actions · 2024-09-05T00:59:47Z

This issue has been automatically marked as stale because it has not had recent activity.
It will be closed in 14 days if no further activity occurs.
Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

wesley-dean-flexion · 2024-09-16T15:36:12Z

the PR (wireservice/csvkit#1240) was approved. I just need to do the thing.

wesley-dean-flexion · 2024-10-01T19:26:27Z

please don't ding me, stalebot... ☹️

github-actions · 2024-11-01T01:09:19Z

This issue has been automatically marked as stale because it has not had recent activity.
It will be closed in 14 days if no further activity occurs.
Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

nvuillam · 2024-11-01T01:29:52Z

@wesley-dean-flexion what's the status 🥳

wesley-dean-flexion added the enhancement New feature or request label Apr 18, 2024

wesley-dean-flexion mentioned this issue Apr 22, 2024

Integrating with MegaLinter wireservice/csvkit#1239

Closed

wesley-dean-flexion mentioned this issue Apr 30, 2024

Add repolint linter #3530

Closed

