This repository was archived by the owner on Jul 27, 2022. It is now read-only.
One idea I was exploring in datafusion-contrib/datafusion-objectstore-s3#54 was implementing the `AsyncWrite` trait as an abstraction over multi-part upload. Does that seem like an agreeable addition to this crate?
Multi-part uploads are helpful when uploading large files. For example, you can write Parquet files one row group at a time, uploading each row group's data as a part (though more likely there is some buffering in between to get good part sizes). This is the approach taken in the Arrow C++ S3 FileSystem. In fact, we could even upload parts in parallel for better throughput in some scenarios (something AWS recommends).
It seems that GCS supports this through its S3-compatible API (docs), and Azure Blob Storage has a notion of "block blobs" that might be applicable (docs).
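To make the buffering idea concrete, here is a rough sketch of a writer that accumulates bytes and emits a part whenever a target part size is reached. It is a synchronous stand-in built on `std::io::Write` rather than `AsyncWrite`, so it runs with the standard library alone. The `PartSink` trait, `MultipartWriter`, `MemorySink`, and `upload_in_parts` are all hypothetical names invented for illustration; none of them is an existing API of this crate or of any store SDK.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a store's multi-part upload API (an assumption,
/// not a real trait from this crate).
trait PartSink {
    fn put_part(&mut self, part_number: usize, data: Vec<u8>) -> io::Result<()>;
    fn complete(&mut self) -> io::Result<()>;
}

/// Buffers incoming writes and emits one part per `part_size` bytes.
struct MultipartWriter<S: PartSink> {
    sink: S,
    buf: Vec<u8>,
    part_size: usize,
    next_part: usize,
}

impl<S: PartSink> MultipartWriter<S> {
    fn new(sink: S, part_size: usize) -> Self {
        assert!(part_size > 0, "part_size must be non-zero");
        Self { sink, buf: Vec::with_capacity(part_size), part_size, next_part: 1 }
    }

    /// Sends any leftover bytes as a final, possibly short, part and then
    /// completes the upload, returning the sink.
    fn finish(mut self) -> io::Result<S> {
        if !self.buf.is_empty() {
            let data = std::mem::take(&mut self.buf);
            self.sink.put_part(self.next_part, data)?;
            self.next_part += 1;
        }
        self.sink.complete()?;
        Ok(self.sink)
    }
}

impl<S: PartSink> Write for MultipartWriter<S> {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        self.buf.extend_from_slice(data);
        // Emit as many full-size parts as the buffer currently holds.
        while self.buf.len() >= self.part_size {
            let rest = self.buf.split_off(self.part_size);
            let part = std::mem::replace(&mut self.buf, rest);
            self.sink.put_part(self.next_part, part)?;
            self.next_part += 1;
        }
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        // Partial parts are only sent by `finish`, so there is nothing to do here.
        Ok(())
    }
}

/// In-memory sink that records the parts it receives, for demonstration only.
#[derive(Default)]
struct MemorySink {
    parts: Vec<(usize, Vec<u8>)>,
    completed: bool,
}

impl PartSink for MemorySink {
    fn put_part(&mut self, part_number: usize, data: Vec<u8>) -> io::Result<()> {
        self.parts.push((part_number, data));
        Ok(())
    }

    fn complete(&mut self) -> io::Result<()> {
        self.completed = true;
        Ok(())
    }
}

/// Writes `data` in `chunk`-sized pieces (think: one row group at a time)
/// and returns the sink so the resulting parts can be inspected.
fn upload_in_parts(data: &[u8], chunk: usize, part_size: usize) -> io::Result<MemorySink> {
    let mut writer = MultipartWriter::new(MemorySink::default(), part_size);
    for piece in data.chunks(chunk) {
        writer.write_all(piece)?;
    }
    writer.finish()
}
```

Calling `upload_in_parts(&bytes, 4, 10)` feeds the writer 4 bytes at a time and yields parts of 10 bytes each, plus a shorter final part. An `AsyncWrite` version would presumably follow the same shape, with the part uploads becoming futures that could be driven concurrently. A real S3 sink would also have to respect the service's minimum part size of 5 MiB for every part except the last.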