This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Support to read/write from/to ODBC #849

Merged
jorgecarleitao merged 15 commits into main from odbc on Mar 5, 2022

Conversation

jorgecarleitao (Owner) commented Feb 17, 2022

This PR adds support for reading from, and writing to, an ODBC driver.

I anticipate this to be one of the most efficient ways of loading data into arrow, as odbc-api offers an API to load data in a columnar format, whereby most buffers are copied back-to-back into arrow (even when nulls are present). Variable-length values and validity need a small O(N) deserialization step, so it is not as fast as Arrow IPC (but likely much faster than parquet).
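
A rough sketch of the intended read flow (illustrative only: the odbc-api calls are that crate's documented entry points, while the connection string, query, and the arrow2-side conversion step are placeholders):

use odbc_api::Environment;

fn read_sketch() -> Result<(), Box<dyn std::error::Error>> {
    // connect through the system's ODBC driver manager (connection string is a placeholder)
    let environment = Environment::new()?;
    let connection =
        environment.connect_with_connection_string("Driver={SQLite3};Database=example.db")?;
    // execute a query; `None` means the statement returned no result set
    if let Some(cursor) = connection.execute("SELECT * FROM my_table", ())? {
        // bind a columnar buffer to `cursor`, fetch batches, and deserialize each
        // batch into arrow arrays (the step this PR implements in `io::odbc::read`)
        let _ = cursor;
    }
    Ok(())
}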

jorgecarleitao added the feature label on Feb 17, 2022
codecov bot commented Feb 17, 2022

Codecov Report

Merging #849 (d971e40) into main (972c6f8) will decrease coverage by 0.94%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##             main     #849      +/-   ##
==========================================
- Coverage   71.50%   70.56%   -0.95%     
==========================================
  Files         337      343       +6     
  Lines       18445    18693     +248     
==========================================
+ Hits        13190    13191       +1     
- Misses       5255     5502     +247     
Impacted Files Coverage Δ
src/io/odbc/mod.rs 0.00% <0.00%> (ø)
src/io/odbc/read/deserialize.rs 0.00% <0.00%> (ø)
src/io/odbc/read/schema.rs 0.00% <0.00%> (ø)
src/io/odbc/write/mod.rs 0.00% <0.00%> (ø)
src/io/odbc/write/schema.rs 0.00% <0.00%> (ø)
src/io/odbc/write/serialize.rs 0.00% <0.00%> (ø)
src/compute/arithmetics/time.rs 26.60% <0.00%> (+0.91%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 972c6f8...d971e40.

ghuls (Contributor) commented Feb 18, 2022

This might also be something to consider in the future once it is more mature: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/

jorgecarleitao changed the title from "WIP: Support to read from ODBC" to "Support to read from ODBC" on Feb 18, 2022
jorgecarleitao (Owner, Author) commented

One design question that I have no answer to: arrow uses variable-sized utf8 and binary, but ODBC requires a fixed length.

Any ideas?

ritchie46 (Collaborator) commented

> One design question that I have no answer to: arrow uses variable-sized utf8 and binary, but ODBC requires a fixed length.
>
> Any ideas?

How does that work with differently sized strings? Does it take the max length? Or pointers to the heap?

jorgecarleitao (Owner, Author) commented Feb 18, 2022

It has the following representation: given a maximum length N and a batch size of n, we have:

values: vec![u8; N * n]
lengths: vec![isize; n]

where length == -1 for null entries. A (valid) value is retrieved at &values[index * N..index * N + lengths[index]].
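
In code, the lookup just described is roughly this (a minimal sketch, not this PR's implementation; `slot_width` is the N above):

fn get<'a>(values: &'a [u8], lengths: &[isize], slot_width: usize, index: usize) -> Option<&'a [u8]> {
    let len = lengths[index];
    if len < 0 {
        // length == -1 marks a null entry
        return None;
    }
    let start = index * slot_width;
    Some(&values[start..start + len as usize])
}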

ritchie46 (Collaborator) commented

> It has the following representation: given a maximum length N and a batch size of n, we have:
>
> values: vec![u8; N * n]
> lengths: vec![isize; n]
>
> where length == -1 for null entries. A (valid) value is retrieved at &values[index * N..index * N + lengths[index]].

So a slot in values is padded with max_len - current_len bytes?

jorgecarleitao (Owner, Author) commented Feb 18, 2022

Exactly, some bytes are unused:

["hello world", "hi", None, "bye bye", "comfortably numb"]

is represented as

values
 <-----N=16----->
[hello world     ][hi              ][                ][bye bye         ][comfortably numb]
lengths: [11, 2, -1, 7, 16]

(in rows)
 <-----N=16----->
[hello world     ] 11
[hi              ] 2
[                ] -1
[bye bye         ] 7
[comfortably numb] 16
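
The O(N) deserialization mentioned in the description is then roughly the following (a sketch, not this PR's code): copy only the used bytes, building Arrow-style offsets and validity along the way.

fn to_variable_length(
    values: &[u8],
    lengths: &[isize],
    slot_width: usize,
) -> (Vec<u8>, Vec<i32>, Vec<bool>) {
    let mut out_values = Vec::new();
    let mut offsets = vec![0i32];
    let mut validity = Vec::with_capacity(lengths.len());
    for (i, &len) in lengths.iter().enumerate() {
        if len >= 0 {
            let start = i * slot_width;
            out_values.extend_from_slice(&values[start..start + len as usize]);
            validity.push(true);
        } else {
            // null entry: no bytes are copied, so the offset simply repeats
            validity.push(false);
        }
        offsets.push(out_values.len() as i32);
    }
    (out_values, offsets, validity)
}

For the example above this yields offsets [0, 11, 13, 13, 20, 36] and validity [true, true, false, true, true], with out_values holding the 36 used bytes back-to-back.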

ritchie46 (Collaborator) commented

Right, and this is per batch? If so, I would choose a relatively low size so that outliers don't blow up memory.

Dandandan marked this pull request as ready for review on February 18, 2022, 19:56
Dandandan marked this pull request as draft on February 18, 2022, 19:57
jorgecarleitao changed the title from "Support to read from ODBC" to "Support to read/write from/to ODBC" on Feb 19, 2022
jorgecarleitao (Owner, Author) commented

This is almost done, pending pacman82/odbc-api#174

ritchie46 (Collaborator) commented Mar 3, 2022

Just curious: are the ODBC drivers compiled via cargo, or is this something that should be set up on the client side? I know that setting up ODBC with TurboDBC was non-trivial.

Edit:

Note to self: click around a bit before asking questions:

> To build this library you need to link against the odbc library of your system's ODBC driver manager. It should be automatically detected by the build. On Windows systems it is preinstalled. On Linux and OS-X unix-odbc must be installed. To create a Connection to a data source, its ODBC driver must also be installed.

I do wonder how this would go with precompiled binaries, and whether we can dynamically link here.

pacman82 (Collaborator) commented Mar 3, 2022

> I know that setting up ODBC with TurboDBC was non-trivial

Usually the complexity comes from building turbodbc with the right Arrow / Python / C++ ABI. This is because turbodbc depends on the Python ABI, the C++ ABI, and the Arrow version, for calling into C++ code and creating pyarrow arrays.

Following the issue tracker of turbodbc, I feel like well over 90% of the issues are build-chain related. This was my motivation to create arrow-odbc-py to begin with.

This crate does not link against the Python interpreter directly. It does not rely on ABI compatibility with the Arrow C++ interface. It does not even use C++ (and therefore does not depend on its ABI). My hope is that it should be way easier to set up.

Setting up the ODBC driver manager itself is not required on Windows, and is usually trivial on macOS or Linux, at least if you have sudo privileges. A brew install or apt-get install usually does the trick.

Installing the driver for a specific data source is an additional step. Yet this is also the point of ODBC: we don't want to ship every driver ever, and we could not ship the ones which will be written in the future. The effort it takes to install a specific driver varies hugely between vendors. Usually the system package manager works fine, though.

  • Do you remember which specific parts of the setup gave you trouble?
  • Would you have the same trouble using https://github.com/pacman82/arrow-odbc-py? Currently I find it likely that this crate will require roughly the same setup.

> I do wonder how this would go with precompiled binaries, and whether we can dynamically link here.

a) Redeploying unix_odbc might be a licensing issue.
b) I experimented with shipping unix_odbc dynamic libraries directly in a wheel, yet this makes setting up drivers much harder, as it might not pick up drivers installed on the system right away.

jorgecarleitao marked this pull request as ready for review on March 5, 2022, 18:46
jorgecarleitao merged commit c999595 into main on Mar 5, 2022
jorgecarleitao deleted the odbc branch on March 5, 2022, 19:43