I have a pipeline which takes CSV files in an S3 bucket, processes them and outputs HDF5 files to a new S3 bucket. These are computationally intensive processes so I wish to run them on a large EC2 instance. Do I need to manually manage some sort of connection pool? Will the threads interfere with each other? I am writing my own Dataset specifically for the task based on the example in CSVS3Dataset.
Many thanks
Kedro uses s3fs, which in turn uses the boto library to access S3. Boto is indeed not thread-safe (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html?highlight=multithreading#multithreading-multiprocessing), but only if you reuse the same Session object across threads.
Each Kedro S3 dataset maintains its own instance of S3FileSystem, and therefore its own boto session, so running them in parallel is safe.
This is probably not great for performance. If you work with hundreds of S3 datasets in parallel, or thousands of small S3 datasets sequentially, the pipeline might run for a long time and even fail with connection errors, but you are perfectly safe with a few dozen of them.
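For a custom dataset, the same pattern can be followed: create the session inside each dataset instance rather than sharing one. Below is a minimal sketch of that idea, not Kedro's actual implementation. StubSession and OwnSessionDataSet are hypothetical names invented for illustration, and the stub stands in for a real s3fs.S3FileSystem or boto3 Session so the example runs without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor


class StubSession:
    """Hypothetical stand-in for a session that must not be shared across threads."""

    def __init__(self):
        self.calls = 0

    def read(self, path):
        # A real session would fetch the object from S3 here.
        self.calls += 1
        return f"contents of {path}"


class OwnSessionDataSet:
    """Hypothetical dataset that owns its session instead of sharing a global one."""

    def __init__(self, filepath, session_factory=StubSession):
        self._filepath = filepath
        # One session per dataset instance, created at construction time.
        self._session = session_factory()

    def load(self):
        return self._session.read(self._filepath)


# Example bucket and file names are made up.
paths = [f"s3://example-bucket/input_{i}.csv" for i in range(4)]
datasets = [OwnSessionDataSet(p) for p in paths]

# Each worker thread touches only the session owned by the dataset it loads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda ds: ds.load(), datasets))

distinct_sessions = {id(ds._session) for ds in datasets}
```

In real code the session_factory argument would be replaced by whatever constructs the S3 filesystem or session. The trade-off is the one noted above: every dataset opens its own connections, which is safe but can become slow with very many datasets.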