This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

lightning/mydump: support mydumper.csv.terminator #1297

Merged
3 commits merged into pingcap:master on Jul 5, 2021

Conversation

kennytm
Collaborator

@kennytm kennytm commented Jun 29, 2021

What problem does this PR solve?

Allow users to specify a custom CSV line terminator (corresponding to the LINES TERMINATED BY option of LOAD DATA).

What is changed and how it works?

  • Added a config mydumper.csv.terminator. When it is not empty, this will be used as the line separator instead of [\r\n].
  • The parser has been refactored to a more tokenizer-based approach, removing the ad hoc peekBytes calls that were scattered through it.
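
To illustrate the terminator behavior described above, here is a minimal sketch in Go. It is not the Lightning parser: `splitRecords` is a hypothetical helper, it ignores quoting and escaping entirely, and it assumes that an empty terminator means "either \r or \n ends a line" while a non-empty terminator must match exactly.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRecords is a hypothetical helper (not part of Lightning).
// With a non-empty terminator, only that exact string ends a line.
// With an empty terminator, any '\r' or '\n' ends a line.
// Quoting and escaping are deliberately ignored in this sketch.
func splitRecords(data, terminator string) []string {
	if data == "" {
		return nil
	}
	if terminator == "" {
		// Default mode: treat \r and \n as line breaks and skip empty lines.
		return strings.FieldsFunc(data, func(r rune) bool {
			return r == '\r' || r == '\n'
		})
	}
	// Custom mode: drop one trailing terminator so the last record
	// does not produce an empty extra line, then split on the terminator.
	data = strings.TrimSuffix(data, terminator)
	return strings.Split(data, terminator)
}

func main() {
	// Default line endings.
	fmt.Printf("%q\n", splitRecords("a,b\r\nc,d\n", ""))
	// Custom terminator "=>": the embedded \n is now ordinary data.
	fmt.Printf("%q\n", splitRecords("a,b\nx=>c,d=>", "=>"))
}
```

With a custom terminator, a bare \n inside the data no longer ends the line, which is the main behavioral difference from the default mode.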

Check List

Tests

  • Unit test

Code changes

Side effects

Related changes

  • Need to update the documentation

Release note

  • Lightning now supports custom line endings other than \r/\n in CSV files.

Member

@overvenus overvenus left a comment


LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 LGTM1 label Jul 2, 2021
@@ -535,13 +536,20 @@ func (cfg *Config) Adjust(ctx context.Context) error {
return errors.New("invalid config: `mydumper.csv.separator` and `mydumper.csv.delimiter` must not be prefix of each other")
}

if len(csv.Terminator) > 0 && cfg.Mydumper.StrictFormat {
Collaborator


We should also return an error if both Terminator and TrimLastSep are set

Collaborator Author


@glorv Terminator and StrictFormat are currently mutually exclusive because the latter is hard-coded to use [\r\n] for now. There's no problem using TrimLastSep and Terminator together.
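
The rule discussed in this thread can be sketched as a small standalone check. The types `csvOptions` and `loaderOptions`, the function `validate`, and the error text are all hypothetical stand-ins for illustration; only the condition itself (a non-empty terminator together with strict format is rejected, while TrimLastSep is allowed) comes from the diff and the reply above.

```go
package main

import (
	"errors"
	"fmt"
)

// csvOptions and loaderOptions are hypothetical stand-ins for the
// relevant parts of the Lightning config; they are not the real types.
type csvOptions struct {
	Terminator  string
	TrimLastSep bool
}

type loaderOptions struct {
	StrictFormat bool
	CSV          csvOptions
}

// validate rejects a custom terminator combined with strict format,
// since strict format is described as hard-coded to [\r\n] for now.
// The error message is an assumption, not the one used in the PR.
func validate(o loaderOptions) error {
	if len(o.CSV.Terminator) > 0 && o.StrictFormat {
		return errors.New("invalid config: a custom terminator cannot be used with strict format")
	}
	// TrimLastSep together with Terminator is intentionally allowed.
	return nil
}

func main() {
	rejected := loaderOptions{StrictFormat: true, CSV: csvOptions{Terminator: "|\n"}}
	allowed := loaderOptions{CSV: csvOptions{Terminator: "|\n", TrimLastSep: true}}
	fmt.Println(validate(rejected) != nil) // true
	fmt.Println(validate(allowed) == nil)  // true
}
```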

Collaborator


But this will trim the last two separators. Is there any scenario that actually needs this? BTW, do we eventually want to replace TrimLastSep with Terminator, or are they just two different features?

Collaborator Author


The original purpose of TrimLastSep was to support importing TPC-H generated data in the form

a|b|c|
d|e|f|

which can be handled with terminator = "|\n" instead.

i don't know if there are any other use cases 😅
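
As a rough illustration of the TPC-H case above, the following Go sketch parses the same two rows with separator "|" under two terminators. `parseRows` is a hypothetical helper that ignores quoting and is not Lightning's implementation; it only shows why "|\n" as the terminator removes the trailing empty field that TrimLastSep was meant to drop.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRows is a hypothetical helper (not Lightning's parser): it splits
// the input into lines on the terminator, then splits each line into
// fields on the separator. Quoting and escaping are ignored.
func parseRows(data, separator, terminator string) [][]string {
	data = strings.TrimSuffix(data, terminator)
	if data == "" {
		return nil
	}
	var rows [][]string
	for _, line := range strings.Split(data, terminator) {
		rows = append(rows, strings.Split(line, separator))
	}
	return rows
}

func main() {
	data := "a|b|c|\nd|e|f|\n"

	// Terminator "\n": every row ends with an extra empty field.
	fmt.Printf("%q\n", parseRows(data, "|", "\n"))

	// Terminator "|\n": the trailing separator is consumed as part of
	// the line ending, so each row has exactly three fields.
	fmt.Printf("%q\n", parseRows(data, "|", "|\n"))
}
```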

pkg/lightning/mydump/csv_parser.go (resolved)
@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • glorv
  • overvenus

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added status/LGT2 LGTM2 and removed status/LGT1 LGTM1 labels Jul 2, 2021
@kennytm
Collaborator Author

kennytm commented Jul 5, 2021

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 1d8dba0

@ti-chi-bot ti-chi-bot merged commit e6e79c0 into pingcap:master Jul 5, 2021
@kennytm kennytm deleted the csv-terminator branch July 14, 2021 07:34

4 participants