What is the idiom for multiple S3 connections in parallel tasks #79

Closed
6Hhcy opened this issue Aug 19, 2019 · 2 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

6Hhcy commented Aug 19, 2019

Description

I have a pipeline that takes CSV files from an S3 bucket, processes them, and writes HDF5 files to a new S3 bucket. The processing is computationally intensive, so I want to run it in parallel on a large EC2 instance. Do I need to manage some sort of connection pool manually? Will the threads interfere with each other? I am writing my own Dataset for the task, based on the example in CSVS3Dataset.

Many thanks

Contributor

Flid commented Aug 20, 2019

Hi @6Hhcy
It's a good question!

Kedro uses s3fs, which uses the boto library to access S3. Boto is indeed not thread-safe (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html?highlight=multithreading#multithreading-multiprocessing), but only if you try to reuse the same Session object.
All Kedro S3 datasets maintain separate instances of S3FileSystem, which means separate boto sessions, so it's safe.
It's probably not great for performance: if you work with hundreds of S3 datasets in parallel, or thousands of small S3 datasets sequentially, the pipeline might take a long time to run and could even fail with connection errors. With a few dozen of them you are totally safe.
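
For a custom Dataset, the same "one instance per worker" idea can be sketched roughly as below. This is only an illustration of the pattern, not Kedro's actual implementation: `FakeFileSystem`, `get_filesystem`, `process` and `run` are hypothetical names, and `FakeFileSystem` is a stand-in so the snippet runs without AWS credentials. In real code the factory would presumably be something like `s3fs.S3FileSystem`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeFileSystem:
    """Hypothetical stand-in for a connection object such as s3fs.S3FileSystem()."""

    def __init__(self):
        # Remember which thread created this instance.
        self.owner = threading.get_ident()


# Each thread sees its own attributes on this object.
_local = threading.local()


def get_filesystem(factory=FakeFileSystem):
    """Return the calling thread's own filesystem, creating it on first use."""
    fs = getattr(_local, "fs", None)
    if fs is None:
        fs = factory()
        _local.fs = fs
    return fs


def process(key):
    """Handle one input key using the thread-local filesystem."""
    fs = get_filesystem()
    # Real code would read the CSV at `key`, transform it, and write HDF5 here.
    return key, fs.owner == threading.get_ident()


def run(keys, max_workers=4):
    """Process all keys in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, keys))
```

Calling `run(["a.csv", "b.csv"])` returns one `(key, flag)` pair per input, where the flag records that the worker used an instance created in its own thread. Because instances are cached per thread, the number of open connections is bounded by `max_workers` rather than by the number of files.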

Author

6Hhcy commented Aug 20, 2019

Good to know. Thank you.

@6Hhcy 6Hhcy closed this as completed Aug 20, 2019