Skip to content
This repository has been archived by the owner on Aug 21, 2023. It is now read-only.

Write directly to S3 #155

Merged
merged 12 commits into from Sep 28, 2020
Merged

Write directly to S3 #155

merged 12 commits into from Sep 28, 2020

Conversation

tirsen
Copy link
Contributor

@tirsen tirsen commented Sep 21, 2020

What problem does this PR solve?

#8 Support S3

What is changed and how it works?

Switch to using VFS which supports local files, S3 and GCS. For now this only supports S3 but it should be easy to add support to e.g. GCS.

You simply prefix the output directory with "s3://bucket" e.g.:

dumpling -o s3://mybycket/dir

For now all config is passed in using the standard AWS environment variables.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
    Will test by running dumpling against our staging databases.

Side effects

  • Introduces a new dependency on github.com/c2fo/vfs

Related changes

  • Need to update the documentation

Release note

  • support writing directly to S3

@CLAassistant
Copy link

CLAassistant commented Sep 21, 2020

CLA assistant check
All committers have signed the CLA.

@kennytm
Copy link
Collaborator

kennytm commented Sep 22, 2020

Hi @tirsen, thanks for the contribution! However, we're more interested in including S3 support through the github.com/pingcap/br/pkg/storage shared by all migration tools, which we can also get GCS support for free. Could you see if it's possible to rewrite this using the ExternalStorage interface?

@tirsen
Copy link
Contributor Author

tirsen commented Sep 22, 2020

It's not trivial. Afaict there is no Writer implementation which means I would have to implement one that buffers and then sends a chunk at a time using the Uploader. All doable of course but, yeah, not trivial. :-)

I could implement a Writer on top of ExternalStorage or is there one already?

@tirsen
Copy link
Contributor Author

tirsen commented Sep 22, 2020

Dumpling already does some level of "buffering" itself for example here but the buffer size is hard coded to be 1mb which is not necessarily optimal for S3. It needs to be configurable. It would be better to implement buffering at the github.com/pingcap/br/pkg/storage layer so that the configuration option for chunk size is consistent across clients.

@tirsen
Copy link
Contributor Author

tirsen commented Sep 22, 2020

Actually vfs isn't a good choice for this use case anyway, it buffers the entire content of a file in memory and sends all of it on close: https://github.com/C2FO/vfs/blob/master/backend/s3/file.go#L203

Or does dumpling split up tables into many smaller files? How large does individual files become if we dump say a 300gb large database?

@kennytm
Copy link
Collaborator

kennytm commented Sep 22, 2020

@tirsen

I could implement a Writer on top of ExternalStorage or is there one already?

Yes you could implement a Writer on top of ExternalStorage

Dumpling already does some level of "buffering" itself for example here but the buffer size is hard coded to be 1mb which is not necessarily optimal for S3. It needs to be configurable. It would be better to implement buffering at the github.com/pingcap/br/pkg/storage layer so that the configuration option for chunk size is consistent across clients.

Yes!

Or does dumpling split up tables into many smaller files? How large does individual files become if we dump say a 300gb large database?

If you supply -F or -r, dumpling do split up the table into smaller files according to file size or row count.

But if none of the argument is provided then yes Dumpling will happily create a 300 GB file.

@tirsen
Copy link
Contributor Author

tirsen commented Sep 22, 2020

@kennytm Here is the Writer implementation on top of Uploader: pingcap/br#524

@tirsen
Copy link
Contributor Author

tirsen commented Sep 25, 2020

Ok I think this is good to go now

@lichunzhu
Copy link
Contributor

Nice job! We will take a look soon.

tests/s3/run.sh Outdated Show resolved Hide resolved
v4/export/config.go Outdated Show resolved Hide resolved
v4/export/config.go Outdated Show resolved Hide resolved
v4/export/config.go Outdated Show resolved Hide resolved
v4/export/config.go Outdated Show resolved Hide resolved
v4/export/config.go Outdated Show resolved Hide resolved
@tirsen
Copy link
Contributor Author

tirsen commented Sep 25, 2020

Ok I think all comments resolved now. PHAL

Copy link
Contributor

@lichunzhu lichunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a configuration region to set region for s3 configurations? When I'm testing upload to S3, setting AWS_REGION doesn't work. If config.BackendOptions.S3.Region is set it will work correctly.
Rest LGTM, @kennytm PTAL

@kennytm
Copy link
Collaborator

kennytm commented Sep 27, 2020

@lichunzhu using the URL s3://bucket/path?region=us-west-2 should work.

@lichunzhu
Copy link
Contributor

@lichunzhu using the URL s3://bucket/path?region=us-west-2 should work.

Yes it works. So we just ask users to set region in the OutputDir URL?

tests/s3/run.sh Outdated Show resolved Hide resolved
v4/export/config.go Outdated Show resolved Hide resolved
v4/export/config.go Outdated Show resolved Hide resolved
v4/export/writer_util.go Outdated Show resolved Hide resolved
tirsen and others added 2 commits September 28, 2020 15:41
Co-authored-by: kennytm <kennytm@gmail.com>
@tirsen
Copy link
Contributor Author

tirsen commented Sep 28, 2020

CLI flags are now exposed so you can also set the region that way. Is AWS_REGION a standard env var? Should we support that in the underlying br library?

@kennytm
Copy link
Collaborator

kennytm commented Sep 28, 2020

Is AWS_REGION a standard env var?

This env var should be used by default, but unfortunately we have overridden the config making the env var useless.

I think we should fix this later on BR.

Copy link
Collaborator

@kennytm kennytm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM, thanks 😄

v4/export/config.go Outdated Show resolved Hide resolved
Co-authored-by: kennytm <kennytm@gmail.com>
@kennytm kennytm added the status/LGT1 One reviewer approved (LGTM1) label Sep 28, 2020
@kennytm kennytm merged commit 90ed407 into pingcap:master Sep 28, 2020
tisonkun pushed a commit to tisonkun/dumpling that referenced this pull request Oct 20, 2021
* Write directly to S3

* Use aws CLI from docker instead

* Re-implement on top of github.com/pingcap/br/pkg/storage

* Bump br dependency

* Move initialization of ExternalStorage to adjustConfig

* Use failpoint

* No need to munge the path

* Remove unused

Co-authored-by: kennytm <kennytm@gmail.com>

* Storage flags

* Remove TODO

Co-authored-by: kennytm <kennytm@gmail.com>

Co-authored-by: Chunzhu Li <lichunzhu@stu.xjtu.edu.cn>
Co-authored-by: kennytm <kennytm@gmail.com>
tisonkun pushed a commit to tisonkun/dumpling that referenced this pull request Oct 20, 2021
* Write directly to S3

* Use aws CLI from docker instead

* Re-implement on top of github.com/pingcap/br/pkg/storage

* Bump br dependency

* Move initialization of ExternalStorage to adjustConfig

* Use failpoint

* No need to munge the path

* Remove unused

Co-authored-by: kennytm <kennytm@gmail.com>

* Storage flags

* Remove TODO

Co-authored-by: kennytm <kennytm@gmail.com>

Co-authored-by: Chunzhu Li <lichunzhu@stu.xjtu.edu.cn>
Co-authored-by: kennytm <kennytm@gmail.com>
tisonkun pushed a commit to tisonkun/dumpling that referenced this pull request Oct 20, 2021
* Write directly to S3

* Use aws CLI from docker instead

* Re-implement on top of github.com/pingcap/br/pkg/storage

* Bump br dependency

* Move initialization of ExternalStorage to adjustConfig

* Use failpoint

* No need to munge the path

* Remove unused

Co-authored-by: kennytm <kennytm@gmail.com>

* Storage flags

* Remove TODO

Co-authored-by: kennytm <kennytm@gmail.com>

Co-authored-by: Chunzhu Li <lichunzhu@stu.xjtu.edu.cn>
Co-authored-by: kennytm <kennytm@gmail.com>
tisonkun pushed a commit to tisonkun/dumpling that referenced this pull request Oct 20, 2021
* Write directly to S3

* Use aws CLI from docker instead

* Re-implement on top of github.com/pingcap/br/pkg/storage

* Bump br dependency

* Move initialization of ExternalStorage to adjustConfig

* Use failpoint

* No need to munge the path

* Remove unused

Co-authored-by: kennytm <kennytm@gmail.com>

* Storage flags

* Remove TODO

Co-authored-by: kennytm <kennytm@gmail.com>

Co-authored-by: Chunzhu Li <lichunzhu@stu.xjtu.edu.cn>
Co-authored-by: kennytm <kennytm@gmail.com>
tisonkun pushed a commit to tisonkun/tidb that referenced this pull request Oct 20, 2021
* Write directly to S3

* Use aws CLI from docker instead

* Re-implement on top of github.com/pingcap/br/pkg/storage

* Bump br dependency

* Move initialization of ExternalStorage to adjustConfig

* Use failpoint

* No need to munge the path

* Remove unused

Co-authored-by: kennytm <kennytm@gmail.com>

* Storage flags

* Remove TODO

Co-authored-by: kennytm <kennytm@gmail.com>

Co-authored-by: Chunzhu Li <lichunzhu@stu.xjtu.edu.cn>
Co-authored-by: kennytm <kennytm@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
status/LGT1 One reviewer approved (LGTM1)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants