Is your feature request related to a problem? Please describe.
The driver offers read support for Arrow datatypes. While it is feasible to bulk load Arrow data by converting it to an iterator over tuples, if mssql-python accepted Arrow input and serialized it directly over TDS, there could be a significant performance boost for Arrow workloads.
Transforming data in place in memory and serializing directly to TDS can yield a significant performance improvement. A past experiment https://github.com/microsoft/mssql-rs/blob/dev/saurabh/arrow-benchmarks/mssql-arrow/BENCHMARK_REPORT.md shows that serializing from Arrow buffers to TDS without materializing the Rust types can be significantly faster than first materializing to Rust datatypes.
Describe the solution you'd like
Offer an API which may look like the bulkcopy API, but for Arrow.
In line with the other Arrow APIs in the driver: bulkcopy_arrow(table, source, ...) where source can be
- pyarrow.Table
- pyarrow.RecordBatch
- pyarrow.RecordBatchReader
- any object that exposes __arrow_c_stream__ (the Arrow PyCapsule interface)
- an iterable of pyarrow.RecordBatch (all batches must share the same schema)
Describe alternatives you've considered
Arrow -> Iterator conversion -> BulkCopy
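For reference, the alternative amounts to the sketch below: flattening Arrow's columnar data into an iterator of Python row tuples and feeding that to the existing bulkcopy path. Plain lists stand in for Arrow columns here; with real pyarrow this would be something like zipping the columns' `to_pylist()` output. Every cell becomes a Python object, which is the per-row materialization overhead the proposed bulkcopy_arrow would avoid.

```python
# Sketch of the alternative considered: columnar data -> iterator of row
# tuples -> existing bulkcopy API. Each cell is materialized as a Python
# object before serialization.

def columns_to_row_iter(columns):
    """Yield one tuple per row from column-oriented data."""
    return zip(*columns)

ids = [1, 2, 3]
names = ["a", "b", "c"]
rows = list(columns_to_row_iter([ids, names]))
# rows == [(1, "a"), (2, "b"), (3, "c")]
```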