Use Arrow as in-memory data format #7210

cydrain · 2021-08-20T10:27:56Z

What would you like to be added:

Use Arrow as in-memory data format

Why is this needed:

There are too many data conversions between column-based and row-based formats.
We can use the Arrow format as the internal data format to avoid these frequent conversions.

https://wiki.lfaidata.foundation/display/MIL/MEP+13+--+Support+Apache+Arrow+As+In-Memory+Data+Format
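
To make the conversion cost concrete, here is a minimal sketch of pivoting between a row-based and a column-based layout. It is illustrative only: the field names (`id`, `vector`) and both helper functions are hypothetical and are not taken from the Milvus code base. The point is that each direction makes a full copy of the data, which is the overhead a single shared in-memory format would avoid.

```python
from typing import Any

Row = dict[str, Any]
Columns = dict[str, list[Any]]


def rows_to_columns(rows: list[Row]) -> Columns:
    """Pivot row-based records into per-field lists (one full copy)."""
    if not rows:
        return {}
    # Assumes every row carries the same fields as the first row.
    return {name: [row[name] for row in rows] for name in rows[0]}


def columns_to_rows(columns: Columns) -> list[Row]:
    """Pivot per-field lists back into row-based records (another full copy)."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Hypothetical sample data: two entities with an id and a small vector.
rows = [
    {"id": 1, "vector": [0.1, 0.2]},
    {"id": 2, "vector": [0.3, 0.4]},
]
columns = rows_to_columns(rows)
round_trip = columns_to_rows(columns)
```

Every hop between components that disagree on layout pays for one of these pivots, so the cost grows with the number of hops rather than with the data alone.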

xiaofan-luan · 2021-08-24T08:43:22Z

Let's change only the data format in Pulsar for now. Let's leave the rest of the formats to a later version, after 2.0.

cydrain · 2021-08-26T02:49:56Z

> Let's change only the data format in Pulsar for now. Let's leave the rest of the formats to a later version, after 2.0.

@xiaofan-luan
Currently, the binlog files written to Minio use the Parquet format wrapped in our own data structure, and each RecordBatch (segment) is saved as several files, split by column.
My suggestion is:

1. Use one binlog file per RecordBatch (segment).
2. Use the pure Parquet file format, which can be written directly from Arrow data.

xiaofan-luan · 2021-08-26T08:59:00Z

1) I would prefer to keep multiple binlog files; otherwise, querying on disk will cost too many I/Os.
2) Can we add a delegator that reads from binlog into Arrow, rather than changing the file format directly? Changing the data format on Minio will cause incompatibility. How are we going to handle that?
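
A rough sketch of what such a delegator could look like, assuming a reader interface over the existing per-column binlog files. All names here (`BinlogReader`, `InMemoryBinlogReader`, `ColumnarDelegator`) are hypothetical and only illustrate the idea; they are not existing Milvus types, and the plain dict of lists stands in for an Arrow batch.

```python
from typing import Any, Iterable, Protocol


class BinlogReader(Protocol):
    """Hypothetical reader over the existing binlog files (one per field)."""

    def field_names(self) -> list[str]: ...

    def read_field(self, name: str) -> list[Any]: ...


class InMemoryBinlogReader:
    """Stand-in for a real reader, used only to keep the sketch runnable."""

    def __init__(self, files: dict[str, list[Any]]) -> None:
        self._files = files

    def field_names(self) -> list[str]:
        return list(self._files)

    def read_field(self, name: str) -> list[Any]:
        return list(self._files[name])


class ColumnarDelegator:
    """Loads requested fields into a columnar batch; the stored format is untouched."""

    def __init__(self, reader: BinlogReader) -> None:
        self._reader = reader

    def load(self, fields: Iterable[str]) -> dict[str, list[Any]]:
        known = set(self._reader.field_names())
        batch: dict[str, list[Any]] = {}
        for name in fields:
            if name not in known:
                raise KeyError(f"unknown field: {name}")
            batch[name] = self._reader.read_field(name)
        return batch


reader = InMemoryBinlogReader({"id": [1, 2], "vector": [[0.1], [0.2]]})
batch = ColumnarDelegator(reader).load(["id"])
```

With this shape, only the delegator needs to know how binlog files are laid out, so the files already stored on Minio stay readable.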

cydrain · 2021-08-26T10:04:25Z

> 1) I would prefer to keep multiple binlog files; otherwise, querying on disk will cost too many I/Os.
> 2) Can we add a delegator that reads from binlog into Arrow, rather than changing the file format directly? Changing the data format on Minio will cause incompatibility. How are we going to handle that?

@xiaofan-luan
1) You're right. Since binlog files are stored in Minio rather than on local disk, saving them by column can save a lot of network I/O.
2) It's OK to keep the current binlog file format unchanged for compatibility.
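
As a back-of-the-envelope illustration of point 1, the sketch below compares how many bytes a query touching a single field would fetch under each layout. The column sizes are made-up numbers, and the single-file case assumes the whole object is downloaded; a reader that issues ranged reads against a columnar file could narrow the gap.

```python
def bytes_fetched(
    column_sizes: dict[str, int],
    needed: set[str],
    per_column_files: bool,
) -> int:
    """Estimate bytes transferred to read the `needed` fields of one segment."""
    if per_column_files:
        # One object per field: only the requested objects are downloaded.
        return sum(size for name, size in column_sizes.items() if name in needed)
    # Assumption: one object per segment, fetched in full.
    return sum(column_sizes.values())


# Hypothetical segment: two 8 MB scalar columns and one 512 MB vector column.
segment = {"id": 8_000_000, "timestamp": 8_000_000, "vector": 512_000_000}

per_column = bytes_fetched(segment, {"id"}, per_column_files=True)
single_file = bytes_fetched(segment, {"id"}, per_column_files=False)
```

Under these assumptions, filtering on `id` alone transfers 8 MB with per-column files versus 528 MB with one file per segment.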

cydrain · 2021-09-08T02:18:55Z

After a thorough investigation, we think Arrow is not suitable for Milvus.
I have described the detailed reasons in the MEP document:
https://wiki.lfaidata.foundation/display/MIL/MEP+13+--+Support+Apache+Arrow+As+In-Memory+Data+Format

cydrain · 2021-09-08T02:19:12Z

/close

sre-ci-robot · 2021-09-08T02:19:15Z

@cydrain: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cydrain added the kind/enhancement label Aug 20, 2021

cydrain self-assigned this Aug 20, 2021

yhmo added this to the 2.0-Backlog milestone Aug 20, 2021

cydrain changed the title ~~Use Arrow to optimize data flow graph~~ Use Arrow as in-memory data format Aug 25, 2021

congqixia mentioned this issue Aug 27, 2021

Memory consumed totally 3-4 times as data size #7056

Closed

sre-ci-robot closed this as completed Sep 8, 2021
