What is the idiom for multiple S3 connections in parallel tasks #79

6Hhcy · 2019-08-19T23:14:58Z

Description

I have a pipeline which takes CSV files in an S3 bucket, processes them and outputs HDF5 files to a new S3 bucket. These are computationally intensive processes so I wish to run them on a large EC2 instance. Do I need to manually manage some sort of connection pool? Will the threads interfere with each other? I am writing my own Dataset specifically for the task based on the example in CSVS3Dataset.

Many thanks

Flid · 2019-08-20T08:21:14Z

Hi @6Hhcy
It's a good question!

Kedro uses s3fs, which uses the boto library to access S3. Boto is indeed not thread-safe (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html?highlight=multithreading#multithreading-multiprocessing), but only if you reuse the same Session object across threads.
All Kedro S3 datasets maintain separate instances of S3FileSystem, which means separate boto sessions, so running them in parallel is safe.
This is probably not great for performance: if you work with hundreds of S3 datasets in parallel, or thousands of small S3 datasets sequentially, the pipeline might run for a long time and even fail on connection errors. With a few dozen of them you are totally safe.
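
If you end up managing this yourself in a custom dataset, the same idea (one filesystem object per thread, never shared) can be sketched roughly as below. This is only an illustration and not Kedro's implementation: `FakeFileSystem`, `get_filesystem`, `convert` and `run_all` are hypothetical names, and `FakeFileSystem` is a stand-in so the snippet runs without AWS credentials. In real code you would pass a factory that builds your own S3 client object, for example an `s3fs.S3FileSystem`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeFileSystem:
    """Hypothetical stand-in for a per-thread S3 client object."""

    def __init__(self):
        # Remember which thread created this instance.
        self.owner = threading.get_ident()

    def process(self, key):
        # A real implementation would read the CSV and write the HDF5 here.
        return key.upper()


# Each thread sees its own attributes on this object.
_local = threading.local()


def get_filesystem(factory=FakeFileSystem):
    """Return the calling thread's own instance, creating it on first use."""
    fs = getattr(_local, "fs", None)
    if fs is None:
        fs = factory()
        _local.fs = fs
    return fs


def convert(key):
    fs = get_filesystem()
    # The instance is only ever used by the thread that created it.
    assert fs.owner == threading.get_ident()
    return fs.process(key)


def run_all(keys, max_workers=4):
    # max_workers caps how many instances (and connections) exist at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert, keys))


results = run_all(["a.csv", "b.csv", "c.csv"])
```

Capping `max_workers` is one simple way to stay in the "few dozen" range mentioned above rather than opening hundreds of connections at once.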

6Hhcy · 2019-08-20T08:44:15Z

Good to know. Thank you.

6Hhcy added the "Issue: Bug Report 🐞" label (Bug that needs to be fixed) Aug 19, 2019

6Hhcy closed this as completed Aug 20, 2019
