Native Support for Arrow Bulk Copy #570

@saurabh500

Description

Is your feature request related to a problem? Please describe.

The driver offers read support for Arrow datatypes. While it is feasible to bulk-load Arrow data by converting it to an iterator over tuples, if mssql-python accepted Arrow input and serialized it directly over TDS, there could be a significant performance boost for Arrow workloads.
Transforming data in place in memory and serializing it directly to TDS can yield a significant performance improvement. A past experiment, https://github.com/microsoft/mssql-rs/blob/dev/saurabh/arrow-benchmarks/mssql-arrow/BENCHMARK_REPORT.md, showed that serializing from Arrow buffers to TDS can be significantly faster than first materializing the data as Rust datatypes.

Describe the solution you'd like

Offer an API that mirrors the existing bulkcopy API but accepts Arrow input.

In line with the driver's other Arrow APIs: bulkcopy_arrow(table, source, ...), where source can be

  • pyarrow.Table
  • pyarrow.RecordBatch
  • pyarrow.RecordBatchReader
  • any object that exposes __arrow_c_stream__ (Arrow PyCapsule
    interface)
  • an iterable of pyarrow.RecordBatch (all batches must share
    the same schema)
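A minimal sketch of how such an API might normalize the accepted source types, dispatching by duck typing over the Arrow PyCapsule protocol methods. The function name `classify_arrow_source` and the stand-in classes are illustrative assumptions, not part of mssql-python or pyarrow; the sketch runs without pyarrow installed:

```python
def classify_arrow_source(source):
    """Classify a proposed bulkcopy_arrow() source by duck typing.

    Hypothetical helper for illustration only.
    """
    # Tables, RecordBatchReaders, and any third-party object that
    # implements the Arrow PyCapsule stream interface.
    if hasattr(source, "__arrow_c_stream__"):
        return "c-stream"
    # A single record batch can expose the array-level interface.
    if hasattr(source, "__arrow_c_array__"):
        return "c-array"
    # Fallback: an iterable of RecordBatch objects sharing one schema.
    if hasattr(source, "__iter__"):
        return "batch-iterable"
    raise TypeError(f"unsupported Arrow source: {type(source).__name__}")


# Stand-ins for pyarrow objects so the sketch is self-contained.
class FakeTable:
    def __arrow_c_stream__(self, requested_schema=None):
        return object()  # a real impl returns an ArrowArrayStream capsule


class FakeBatch:
    def __arrow_c_array__(self, requested_schema=None):
        return object(), object()  # a real impl returns schema/array capsules


print(classify_arrow_source(FakeTable()))    # -> c-stream
print(classify_arrow_source(FakeBatch()))    # -> c-array
print(classify_arrow_source([FakeBatch()]))  # -> batch-iterable
```

Dispatching on `__arrow_c_stream__` first means the API would accept any producer library (polars, duckdb, nanoarrow) without importing it, which is the point of the PyCapsule interface.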

Describe alternatives you've considered

Arrow -> Iterator conversion -> BulkCopy
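The alternative above pivots columnar Arrow data into row tuples before feeding the existing bulkcopy path. A sketch of that pivot step using plain lists (for a real table, pyarrow's per-column conversion would produce these lists); the helper name is hypothetical, and the point is that every value gets materialized as a Python object, which is exactly the overhead the native path would avoid:

```python
def columns_to_rows(columns):
    """Pivot columnar data into an iterator-friendly list of row
    tuples, as an Arrow -> iterator conversion would. Materializes
    every cell as a Python object."""
    return list(zip(*columns))


cols = [[1, 2, 3], ["a", "b", "c"]]
rows = columns_to_rows(cols)
print(rows)  # -> [(1, 'a'), (2, 'b'), (3, 'c')]
```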

Metadata

Assignees

No one assigned

    Labels

    triage needed (for new issues, not triaged yet)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
