Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly determine CSV delimiter #17459

Merged
merged 32 commits into from Oct 30, 2021
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
fc8dd28
Fixes #16558 CSV delimiter determiner
richmahn Oct 25, 2021
6fe4f9f
Merge remote-tracking branch 'origin/main' into richmahn-16558-guess-…
richmahn Oct 25, 2021
c99d94e
Merge remote-tracking branch 'origin/main' into richmahn-16558-guess-…
richmahn Oct 26, 2021
58e98c5
Merge remote-tracking branch 'origin/main' into richmahn-16558-guess-…
richmahn Oct 27, 2021
56f0995
Fixes #16558 - properly determine CSV delmiiter
richmahn Oct 27, 2021
7da56ba
Moves quoteString to a new function
richmahn Oct 27, 2021
5b3a8d9
Adds big test with lots of commas for tab delimited csv
richmahn Oct 28, 2021
b0a6290
Adds comments
richmahn Oct 28, 2021
79ee659
Shortens the text of the test
richmahn Oct 28, 2021
f508559
Removes single quotes from regexp as only double quotes need to be se…
richmahn Oct 28, 2021
e0f2862
Fixes spelling
richmahn Oct 28, 2021
ab0583c
Merge branch 'main' into richmahn-16558-guess-delimiter
wxiaoguang Oct 28, 2021
e466ab7
Fixes check of length as it probalby will only be 1e4, not greater
richmahn Oct 28, 2021
a54e6ac
Merge branch 'richmahn-16558-guess-delimiter' of github.com:richmahn/…
richmahn Oct 28, 2021
62d10e3
Makes sample size a const, properly removes truncated line
richmahn Oct 28, 2021
0a2f06d
Makes sample size a const, properly removes truncated line
richmahn Oct 28, 2021
85eda5f
Fixes comment
richmahn Oct 28, 2021
fc02c05
Fixes comment
richmahn Oct 28, 2021
baf37d3
tests for FormatError() function
richmahn Oct 28, 2021
32a9c36
Adds logic to find the limiter before or after a quoted value
richmahn Oct 28, 2021
e8ac38a
Merge remote-tracking branch 'origin/main' into richmahn-16558-guess-…
richmahn Oct 28, 2021
feabe0e
Simplifies regex
richmahn Oct 28, 2021
f808d73
Merge branch 'main' into richmahn-16558-guess-delimiter
zeripath Oct 29, 2021
0084114
Error tests
richmahn Oct 29, 2021
967da1d
Merge branch 'richmahn-16558-guess-delimiter' of github.com:richmahn/…
richmahn Oct 29, 2021
683def1
Error tests
richmahn Oct 29, 2021
95ede28
Update modules/csv/csv.go
richmahn Oct 30, 2021
b938db4
Update modules/csv/csv.go
richmahn Oct 30, 2021
772095d
Adds comments
richmahn Oct 30, 2021
8a54f7a
Merge branch 'main' into richmahn-16558-guess-delimiter
wxiaoguang Oct 30, 2021
059ae3d
Update modules/csv/csv.go
zeripath Oct 30, 2021
88a2642
Merge branch 'main' into richmahn-16558-guess-delimiter
wxiaoguang Oct 30, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
110 changes: 66 additions & 44 deletions modules/csv/csv.go
Expand Up @@ -7,16 +7,18 @@ package csv
import (
"bytes"
stdcsv "encoding/csv"
"errors"
"io"
"path/filepath"
"regexp"
"strings"

"code.gitea.io/gitea/modules/markup"
"code.gitea.io/gitea/modules/translation"
"code.gitea.io/gitea/modules/util"
)

var quoteRegexp = regexp.MustCompile(`["'][\s\S]+?["']`)
const maxLines = 10
const guessSampleSize = 1e4 // 10k
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we count file size in decimal or binary units?
Because in binary, this number should be a bit larger (10192, or 2^13, if I'm not mistaken).
I know that oftentimes, compilers prefer powers of two as those can be optimized better. I guess that should be the case here as well?

Copy link
Contributor Author

@richmahn richmahn Oct 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't something I added, as it was always a sample size of 10,000. So yeah, it is using 'k' here in the comments that were there before I worked on this as decimal. It's just an arbitrary sample size. I made it a constant so I can check if we may have a truncated line when getting 10 lines from that sample text.

See: https://github.com/go-gitea/gitea/blob/main/modules/csv/csv.go#L34

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just an arbitrary sample size

Yes. And I'm arguing to set it to a value that the compiler will be able to optimize better.

One small correction, though: Apparently, I thought that 2^(11+1) would be 5096, instead of 4096.
So, what I am asking for is to at best use either 1<<13 (8192) or 1<<14 (16384) bytes.
Which do you prefer?
Or do you prefer the 10000, which is slightly easier to read, but will be worse in runtime performance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter whatever we choose, 1e4, 16384, 8192, all work, you won't feel any difference.

We are not writing an OS or a memory allocation library, we won't benefit from the page-aligned memory size.

We do not need to struggle on such trivial details.


// CreateReader creates a csv.Reader with the given delimiter.
func CreateReader(input io.Reader, delimiter rune) *stdcsv.Reader {
Expand All @@ -30,70 +32,90 @@ func CreateReader(input io.Reader, delimiter rune) *stdcsv.Reader {
return rd
}

// CreateReaderAndGuessDelimiter tries to guess the field delimiter from the content and creates a csv.Reader.
// Reads at most 10k bytes.
func CreateReaderAndGuessDelimiter(rd io.Reader) (*stdcsv.Reader, error) {
var data = make([]byte, 1e4)
// CreateReaderAndDetermineDelimiter tries to guess the field delimiter from the content and creates a csv.Reader.
// Reads at most guessSampleSize bytes.
func CreateReaderAndDetermineDelimiter(ctx *markup.RenderContext, rd io.Reader) (*stdcsv.Reader, error) {
var data = make([]byte, guessSampleSize)
size, err := util.ReadAtMost(rd, data)
if err != nil {
return nil, err
}

return CreateReader(
io.MultiReader(bytes.NewReader(data[:size]), rd),
guessDelimiter(data[:size]),
determineDelimiter(ctx, data[:size]),
), nil
}

// guessDelimiter scores the input CSV data against delimiters, and returns the best match.
func guessDelimiter(data []byte) rune {
maxLines := 10
text := quoteRegexp.ReplaceAllLiteralString(string(data), "")
lines := strings.SplitN(text, "\n", maxLines+1)
lines = lines[:util.Min(maxLines, len(lines))]

delimiters := []rune{',', ';', '\t', '|', '@'}
bestDelim := delimiters[0]
bestScore := 0.0
for _, delim := range delimiters {
score := scoreDelimiter(lines, delim)
if score > bestScore {
bestScore = score
bestDelim = delim
}
// determineDelimiter takes a RenderContext and if it isn't nil and the Filename has an extension that specifies the delimiter,
// it is used as the delimiter. Otherwise we call guessDelimiter with the data passed
func determineDelimiter(ctx *markup.RenderContext, data []byte) rune {
extension := ".csv"
if ctx != nil {
extension = strings.ToLower(filepath.Ext(ctx.Filename))
}

return bestDelim
var delimiter rune
switch extension {
case ".tsv":
delimiter = '\t'
case ".psv":
delimiter = '|'
default:
delimiter = guessDelimiter(data)
}

return delimiter
}

// scoreDelimiter uses a count & regularity metric to evaluate a delimiter against lines of CSV.
func scoreDelimiter(lines []string, delim rune) float64 {
countTotal := 0
countLineMax := 0
linesNotEqual := 0
// quoteRegexp follows the RFC-4180 CSV standard for when double-quotes are used to enclose fields, then a double-quote appearing inside a
// field must be escaped by preceding it with another double quote. https://www.ietf.org/rfc/rfc4180.txt
// This finds all quoted strings that have escaped quotes.
var quoteRegexp = regexp.MustCompile(`"(?:[^"\\]|\\.)*"`)

for _, line := range lines {
if len(line) == 0 {
continue
}
// removeQuotedStrings uses the quoteRegexp to remove all quoted strings so that we can realiably have each row on one line
zeripath marked this conversation as resolved.
Show resolved Hide resolved
// (quoted strings often have new lines within the string)
func removeQuotedString(text string) string {
return quoteRegexp.ReplaceAllLiteralString(text, "")
}

countLine := strings.Count(line, string(delim))
countTotal += countLine
if countLine != countLineMax {
if countLineMax != 0 {
linesNotEqual++
}
countLineMax = util.Max(countLine, countLineMax)
}
// guessDelimiter takes up to maxLines of the CSV text, iterates through the possible delimiters, and sees if the CSV Reader reads it without throwing any errors.
// If more than one delmiiter passes, the delimiter that results in the most columns is returned.
richmahn marked this conversation as resolved.
Show resolved Hide resolved
func guessDelimiter(data []byte) rune {
// Removes quoted values so we don't have columns with new lines in them
text := removeQuotedString(string(data))
zeripath marked this conversation as resolved.
Show resolved Hide resolved

// Make the text just be maxLines or less without truncated lines
richmahn marked this conversation as resolved.
Show resolved Hide resolved
lines := strings.SplitN(text, "\n", maxLines+1) // Will contain at least one line, and if there are more than MaxLines, the last item holds the rest of the lines
if len(lines) > maxLines {
// If the length of lines is > maxLines we know we have the max number of lines, trim it to maxLines
lines = lines[:maxLines]
} else if len(lines) > 1 && len(data) >= guessSampleSize {
// Even with data >= guessSampleSize, we don't have maxLines + 1 (no extra lines, must have really long lines)
// thus the last line is probably have a truncated line. Drop the last line if len(lines) > 1
lines = lines[:len(lines)-1]
}

return float64(countTotal) * (1 - float64(linesNotEqual)/float64(len(lines)))
// Put lines back together as a string
text = strings.Join(lines, "\n")

delimiters := []rune{',', '\t', ';', '|', '@'}
validDelim := delimiters[0]
validDelimColCount := 0
for _, delim := range delimiters {
csvReader := stdcsv.NewReader(strings.NewReader(text))
csvReader.Comma = delim
if rows, err := csvReader.ReadAll(); err == nil && len(rows) > 0 && len(rows[0]) > validDelimColCount {
validDelim = delim
validDelimColCount = len(rows[0])
}
}
return validDelim
}

// FormatError converts csv errors into readable messages.
func FormatError(err error, locale translation.Locale) (string, error) {
var perr *stdcsv.ParseError
if errors.As(err, &perr) {
if perr, ok := err.(*stdcsv.ParseError); ok {
zeripath marked this conversation as resolved.
Show resolved Hide resolved
if perr.Err == stdcsv.ErrFieldCount {
return locale.Tr("repo.error.csv.invalid_field_count", perr.Line), nil
}
Expand Down