Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV Renderer may guess a TSV file is comma delimited even when tab delimited #16558

Closed
1 of 2 tasks
richmahn opened this issue Jul 27, 2021 · 17 comments · Fixed by #17459
Closed
1 of 2 tasks

CSV Renderer may guess a TSV file is comma delimited even when tab delimited #16558

richmahn opened this issue Jul 27, 2021 · 17 comments · Fixed by #17459
Assignees

Comments

@richmahn
Copy link
Contributor

https://try.gitea.io/richmahn/test/src/branch/master/is_tsv_but_uses_commas.tsv

See the raw....it actually has 8 rows, yet because the "intro" row has many commas in its text, only rows that have NO commas (because the header has no commas), thus only one field, get shown.

Description

If a file is a .tsv file, which, while isn't a standard itself and thus uses CSV but with tab as the delimiter, but has more commas than tabs, the function here thinks the delimiter is comma, and thus all rows that have more fields than the header (which has no commas thus 1 field) get removed from the rendering.

I suggest, which I have done in my fork of Gitea, make a separate markup renderer like the "csv" one but call it "tsv" and always use \t as the delimiter due to the filename. Either that, or make sure the function that guesses the delimiter gets the filename of the file being rendered so it can determine from that.

Screenshots

Here is the blame for my .tsv file on try.gitea.io which shows it is tab delimited, with the 2nd row having LOTS of texts with commas:

image

Yet the table rendering removes many of the rows:

image

@richmahn richmahn self-assigned this Jul 27, 2021
@lunny
Copy link
Member

lunny commented Jul 27, 2021

So I think that's a bug of the function guesses the delimiter.

@silverwind
Copy link
Member

silverwind commented Jul 27, 2021

make a separate markup renderer like the "csv" one but call it "tsv" and always use \t as the delimiter due to the filename

I would not make a separate renderer but simply skip over guessDelimiter in case the extension provides a strong indication of the delimiter in use. That way, we can extend with more similar extensions later.

One such extension may be .psv for Pipe Separated Values (|).

@lunny
Copy link
Member

lunny commented Jul 27, 2021

Hm, We should allow external renderer has their own parameters in app.ini.

@richmahn
Copy link
Contributor Author

Yes, we need to have the filename available so that it can be use in determining which delimiter to use as well.

@richmahn
Copy link
Contributor Author

make a separate markup renderer like the "csv" one but call it "tsv" and always use \t as the delimiter due to the filename

I would not make a separate renderer but simply skip over guessDelimiter in case the extension provides a strong indication of the delimiter in use. That way, we can extend with more similar extensions later.

One such extension may be .psv for Pipe Separated Values (|).

Yes, but the problem is currently there is no extension passed to the renderer, as it is invoked by the list of extensions it renders. I offered that as a solution though, if we can get it passed in.

@richmahn richmahn changed the title CSV Renderer thinks a TSV file comma delimited since many commas in field CSV Renderer thinks a TSV file is comma delimited when many commas are in text Jul 27, 2021
@richmahn richmahn changed the title CSV Renderer thinks a TSV file is comma delimited when many commas are in text CSV Renderer guesses a TSV file is comma delimited when many commas are in text Jul 27, 2021
@richmahn richmahn changed the title CSV Renderer guesses a TSV file is comma delimited when many commas are in text CSV Renderer may guess a TSV file is comma delimited even when tab delimited Jul 27, 2021
@richmahn
Copy link
Contributor Author

richmahn commented Jul 27, 2021

Hm, We should allow external renderer has their own parameters in app.ini.

@lunny What kind of parameters? You mean what delimiter to use if it is a tsv, a csv, or a psv? Still csv is actually the only standard, which can be any character delimited (some say csv means character-separated-values), so one repo might use it different than another. Yet tsv and psv are csv files yet kind of hint at the delimiter in their extension.

@richmahn
Copy link
Contributor Author

richmahn commented Jul 27, 2021

From a coworker:

The history of this puzzles me a little @rich Mahn. If I was writing code to autodetect TSV vs CSV, I'd look at the first/header line. And that line only has the column headers and tabs and zero commas. Why would it need to look further? How many lines was the old code scanning?

I do wonder if the first row, whether a header or not, is enough to go by, rather than counting all tabs and commas and other delimiters ALL the content as it currently does. You'd think that would be using more tabs than commas, or commas than tabs especially if it is a real header. But even if not, seems like still the best way to score the characters.

Edit: I see now it scores based on the first 10 lines, yet as my example has, my 2nd row actually has a bunch of commas in the text. Headers wouldn't normally do that.

@richmahn
Copy link
Contributor Author

richmahn commented Jul 27, 2021

@lunny @silverwind

I suggest two things:

  1. get the file extension passed to the Renderer so it can go by that if it isn't just .csv

  2. When it is .csv, to guess the delimiter, we could do it from the first 10k or 10 rows, but why not then quickly see if the top scoring one matches with all 10 rows with the same number of fields, if not, the 2nd, 3rd, or 4th (assuming score > 0) and use the one that works, and if none do, return the top scoring one. If all have no score, then return , (comma). (how well it matches each row can be part of the score)

Thoughts?

@lunny
Copy link
Member

lunny commented Jul 28, 2021

I agree both. We can read ten lines to detect the delimiter. But one we need notice that some column have been quoted with double quote and contains delimiter characters in the double quote. i.e.

a,b,"c,b,d,",e

It should be 4 columns but not 7. If we just simple count single quote, that will be wrong.

@richmahn
Copy link
Contributor Author

@lunny yeah, I thought of that and well aware of that. I assume to actually use the encoding/cvs module with those 10 lines and the delimiter in question and see what it returns. We'd still be making a best guess, as some rows might have the wrong number of fields, but if it has a higher accuracy than other delimiters, will use that delimiter. That is why we can kind of do a score with the matching of the delimiter as well.

@richmahn
Copy link
Contributor Author

So just to be clear, making a fix for this that should also work with the CSV diff feature which has the same bug.

@zeripath
Copy link
Contributor

Why don't we provide a mechanism for choosing whether to guess the delimiter or not.

@richmahn
Copy link
Contributor Author

richmahn commented Aug 5, 2021

@zeripath So some sort of user interface the repo file view page? Yet would still have to guess something when first showing the file.

@richmahn
Copy link
Contributor Author

richmahn commented Aug 5, 2021

@zeripath Also, the DIFF view for .csv files has this bug, and thus you'd have to have a selection for each file diff.

@silverwind
Copy link
Member

Why don't we provide a mechanism for choosing whether to guess the delimiter or not.

I wouldn't. This is a kind of thing that can be handled automatically and should not need configuration for the most common types (csv,tsv,psv).

#16558 (comment) sounds good to me.

@zeripath
Copy link
Contributor

zeripath commented Aug 5, 2021

Every heuristic mechanism will get things wrong for some pathological case.

@richmahn
Copy link
Contributor Author

I still need to look at this. Will see if I can work on this. Found another bug with CSV diffs in that if the diff is after 32 lines, it fails to show the CSV table.

richmahn added a commit to richmahn/gitea that referenced this issue Oct 27, 2021
richmahn added a commit to richmahn/gitea that referenced this issue Oct 27, 2021
lunny pushed a commit that referenced this issue Oct 30, 2021
* Fixes #16558 CSV delimiter determiner

* Fixes #16558 - properly determine CSV delmiiter

* Moves quoteString to a new function

* Adds big test with lots of commas for tab delimited csv

* Adds comments

* Shortens the text of the test

* Removes single quotes from regexp as only double quotes need to be searched

* Fixes spelling

* Fixes check of length as it probalby will only be 1e4, not greater

* Makes sample size a const, properly removes truncated line

* Makes sample size a const, properly removes truncated line

* Fixes comment

* Fixes comment

* tests for FormatError() function

* Adds logic to find the limiter before or after a quoted value

* Simplifies regex

* Error tests

* Error tests

* Update modules/csv/csv.go

Co-authored-by: delvh <dev.lh@web.de>

* Update modules/csv/csv.go

Co-authored-by: delvh <dev.lh@web.de>

* Adds comments

* Update modules/csv/csv.go

Co-authored-by: delvh <dev.lh@web.de>

Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
Co-authored-by: zeripath <art27@cantab.net>
Co-authored-by: delvh <dev.lh@web.de>
Chianina pushed a commit to Chianina/gitea that referenced this issue Mar 28, 2022
* Fixes go-gitea#16558 CSV delimiter determiner

* Fixes go-gitea#16558 - properly determine CSV delmiiter

* Moves quoteString to a new function

* Adds big test with lots of commas for tab delimited csv

* Adds comments

* Shortens the text of the test

* Removes single quotes from regexp as only double quotes need to be searched

* Fixes spelling

* Fixes check of length as it probalby will only be 1e4, not greater

* Makes sample size a const, properly removes truncated line

* Makes sample size a const, properly removes truncated line

* Fixes comment

* Fixes comment

* tests for FormatError() function

* Adds logic to find the limiter before or after a quoted value

* Simplifies regex

* Error tests

* Error tests

* Update modules/csv/csv.go

Co-authored-by: delvh <dev.lh@web.de>

* Update modules/csv/csv.go

Co-authored-by: delvh <dev.lh@web.de>

* Adds comments

* Update modules/csv/csv.go

Co-authored-by: delvh <dev.lh@web.de>

Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
Co-authored-by: zeripath <art27@cantab.net>
Co-authored-by: delvh <dev.lh@web.de>
@go-gitea go-gitea locked and limited conversation to collaborators Apr 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants