fix(parsers.csv): Allow null-delimiters #12247

Merged on Dec 12, 2022 (12 commits)

Meceron
Contributor

@Meceron Meceron commented Nov 16, 2022

https://github.com/influxdata/telegraf/issues/12243

resolves #12243

Fixes null delimiters in CSV by changing the selected bytes before the data is passed to the parsing pipeline.

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Nov 16, 2022
@powersj
Contributor

powersj commented Nov 16, 2022

Hi,

Thanks for putting up this PR. It appears that this is essentially a "find and replace" before doing any parsing. Instead of having this hard-code the replacement, can we instead rename csv_force_replace_delimiter to csv_delimiter_replace and use the csv_delimiter as the replacement?

Essentially if I had:

csv_delimiter = ","
csv_delimiter_replace = "\u0000"

It would replace those null bytes with commas.
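
A minimal Go sketch of the find-and-replace behaviour proposed here. It assumes that `csv_delimiter_replace` (only a proposed option name in this comment, not necessarily what was merged) holds the byte sequence to swap for `csv_delimiter`. The helper name `replaceDelimiter` is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceDelimiter swaps every occurrence of delimiterReplace in data for
// delimiter before any CSV parsing happens. An empty delimiterReplace
// leaves the data untouched. (Hypothetical helper, for illustration only.)
func replaceDelimiter(data []byte, delimiter, delimiterReplace string) []byte {
	if delimiterReplace == "" {
		return data
	}
	return bytes.ReplaceAll(data, []byte(delimiterReplace), []byte(delimiter))
}

func main() {
	in := []byte("3.4\u000070\u0000test_name")
	// Null bytes become commas, so a standard CSV reader can take over.
	fmt.Printf("%s\n", replaceDelimiter(in, ",", "\u0000")) // 3.4,70,test_name
}
```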

What do you think about that? Before you go off and make changes, I would like to hear from @srebhan as well.

@srebhan
Contributor

srebhan commented Nov 16, 2022

@Meceron thanks for your PR! Instead of forcing the user to set another option, how about the following (in pseudo-code):

  1. Check if delimiter is valid, i.e. copy the validDelim function of encoding/csv
  2. If invalid:
    2.1 replace all commas (,) with \ufffd, the Unicode replacement character, which should not appear in the text
    2.2 replace all instances of the user-specified delimiter with a comma
    2.3 change CSV delimiter to comma
    2.4 set a flag that we replaced data e.g. p.delimiterReplaced = true
  3. do the csv parsing
  4. If p.delimiterReplaced == true replace \ufffd in all resulting fields back to comma

What do you think?
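
The steps above can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the code that was merged: `parseWithDelimiter` and the local `replaced` flag are hypothetical names, and `validDelim` is a copy of the unexported check in `encoding/csv`, as the pseudo-code suggests.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
	"unicode/utf8"
)

const (
	replacement = "\ufffd" // Unicode replacement character
	comma       = ","
)

// validDelim mirrors the unexported check in encoding/csv.
func validDelim(r rune) bool {
	return r != 0 && r != '"' && r != '\r' && r != '\n' &&
		utf8.ValidRune(r) && r != utf8.RuneError
}

// parseWithDelimiter is a hypothetical helper following steps 1 to 4.
func parseWithDelimiter(data []byte, delim string) ([][]string, error) {
	if delim == "" {
		return nil, fmt.Errorf("empty delimiter")
	}
	r, _ := utf8.DecodeRuneInString(delim)
	replaced := false
	if !validDelim(r) { // steps 1 and 2
		data = bytes.ReplaceAll(data, []byte(comma), []byte(replacement)) // 2.1
		data = bytes.ReplaceAll(data, []byte(delim), []byte(comma))       // 2.2
		r = ','                                                           // 2.3
		replaced = true                                                   // 2.4
	}

	reader := csv.NewReader(bytes.NewReader(data)) // step 3
	reader.Comma = r
	records, err := reader.ReadAll()
	if err != nil {
		return nil, err
	}

	if replaced { // step 4: restore the original commas
		for i := range records {
			for j := range records[i] {
				records[i][j] = strings.ReplaceAll(records[i][j], replacement, comma)
			}
		}
	}
	return records, nil
}

func main() {
	records, err := parseWithDelimiter([]byte("3.4\u000070\u0000a,b\n"), "\u0000")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", records) // [["3.4" "70" "a,b"]]
}
```

One limitation of this sketch: input that already contains \ufffd would come back with commas in those positions, which is why the pseudo-code relies on that character not appearing in the text.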

@Meceron
Contributor Author

Meceron commented Nov 17, 2022

Thanks for the advice! I've changed the code.

@powersj powersj self-assigned this Nov 17, 2022
Contributor

@powersj powersj left a comment

Thanks for the quick turnaround! I have put up some comments that clarify the docs and rename the bool to invalidDelimiter.

Take a look and let us know!

plugins/parsers/csv/README.md (1 outdated review comment, resolved)
plugins/parsers/csv/parser.go (7 outdated review comments, resolved)
@srebhan
Contributor

srebhan commented Dec 8, 2022

@Meceron any news on this PR?

Contributor

@Hipska Hipska left a comment

Whoops, I didn't finish my review.

Comment on lines 27 to 29
var replacementByte = "\ufffd"
var commaByte = "\u002C"

Contributor

Maybe const instead of var and directly a byte array so no conversion is needed on usage?

err := p.Init()
require.NoError(t, err)

testCSV := strings.Join([]string{"3.4", "70", "test_name"}, "\u0000")
Contributor

Suggested change
- testCSV := strings.Join([]string{"3.4", "70", "test_name"}, "\u0000")
+ testCSV := "3.4\u000070\u0000test_name"

Contributor

I prefer the existing version, as it makes it very clear what is going on, versus having to count characters to determine where a Unicode escape ends.

@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Dec 9, 2022
@Meceron
Contributor Author

Meceron commented Dec 9, 2022

Sorry, I was on sick leave :(. As I see, you all have done everything, thank you!

Contributor

@srebhan srebhan left a comment

I only have two small comments @Meceron. Can you please address those and consider the other comment from @Hipska?

plugins/parsers/csv/parser.go (1 outdated review comment, resolved)
plugins/parsers/csv/parser.go (1 review comment, resolved)
powersj and others added 2 commits December 12, 2022 08:28
Co-authored-by: Sven Rebhan <36194019+srebhan@users.noreply.github.com>
Co-authored-by: Sven Rebhan <36194019+srebhan@users.noreply.github.com>
@powersj
Contributor

powersj commented Dec 12, 2022

I wanted to try to land this for today's release, so I accepted the changes.

@Hipska
Contributor

Hipska commented Dec 12, 2022

@powersj my comments are still open and un-addressed..

@srebhan
Contributor

srebhan commented Dec 12, 2022

@Meceron can you please at least make those two strings const?

@Hipska
Contributor

Hipska commented Dec 12, 2022

@srebhan This is what @powersj already did in a315d23.

@powersj
Contributor

powersj commented Dec 12, 2022

> @Meceron can you please at least make those two strings const?

This was done already. I accepted your change that did this.

edit: what Hipska said :)

@srebhan
Contributor

srebhan commented Dec 12, 2022

Oh, I didn't see a commit here... Sorry, I should have taken a better look at the code... :-/

Contributor

@srebhan srebhan left a comment

Thanks for the nice improvement @Meceron! LGTM.

@srebhan srebhan added plugin/parser 1. Request for new parser plugins 2. Issues/PRs that are related to parser plugins area/csv csv parser/serialiser related labels Dec 12, 2022
@powersj
Contributor

powersj commented Dec 12, 2022

Merging, as the integration test failure is a known issue that is already fixed on master.

@srebhan srebhan changed the title fix: fixing null delimiters in csv, change the selected bytes before … fix(parsers.csv): Allow null-delimiters Dec 12, 2022
@powersj powersj merged commit e264721 into influxdata:master Dec 12, 2022
Development

Successfully merging this pull request may close these issues.

Null delimiter cannot be used
4 participants