Use Arrow as in-memory data format #7210

Closed
cydrain opened this issue Aug 20, 2021 · 7 comments

@cydrain
Contributor

cydrain commented Aug 20, 2021

What would you like to be added:

Use Arrow as in-memory data format

Why is this needed:

There are too many data conversions between column-based and row-based formats.
We can use the Arrow format as the internal data format to avoid these frequent conversions.

https://wiki.lfaidata.foundation/display/MIL/MEP+13+--+Support+Apache+Arrow+As+In-Memory+Data+Format
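
To illustrate the kind of conversion being discussed, here is a minimal Python sketch. It is not Milvus code: the field names `id` and `vector` and both helper functions are hypothetical. It pivots row-based records into a column-based layout and back; each pivot copies every value, which is the cost a single shared in-memory format would aim to avoid.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Columns = Dict[str, List[Any]]


def rows_to_columns(rows: List[Row]) -> Columns:
    """Pivot a list of row dicts into one list per field (copies every value)."""
    columns: Columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns: Columns) -> List[Row]:
    """Pivot per-field lists back into row dicts (copies every value again)."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Hypothetical sample records: a scalar field and a vector field.
rows = [
    {"id": 1, "vector": [0.1, 0.2]},
    {"id": 2, "vector": [0.3, 0.4]},
]
columns = rows_to_columns(rows)
round_trip = columns_to_rows(columns)
```

The sketch assumes every row carries the same fields in the same order; a real pipeline would also have to handle schemas and typed buffers.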

@cydrain cydrain added the kind/enhancement Issues or changes related to enhancement label Aug 20, 2021
@cydrain cydrain self-assigned this Aug 20, 2021
@yhmo yhmo added this to the 2.0-Backlog milestone Aug 20, 2021
@xiaofan-luan
Contributor

Let's only change the data format in Pulsar. Let's leave the rest of the formats to a later version after 2.0.

@cydrain cydrain changed the title Use Arrow to optimize data flow graph Use Arrow as in-memory data format Aug 25, 2021
@cydrain
Contributor Author

cydrain commented Aug 26, 2021

Let's only change the data format in Pulsar. Let's leave the rest of the formats to a later version after 2.0.

@xiaofan-luan
Currently, the binlog files written to MinIO use the Parquet format wrapped in our own data structure, and each RecordBatch (segment) is saved as several files, split by column.
My suggestion is:

  1. one binlog file for one RecordBatch (segment)
  2. use pure Parquet file format, which can be written from arrow data directly

@xiaofan-luan
Contributor

  1. I would prefer to keep multiple binlog files; otherwise, querying on disk will cost too many IOs.
  2. Can we make a delegator that reads from binlog into Arrow rather than changing the file format directly? Changing the data format on MinIO will cause incompatibility. How are we going to handle this?

@cydrain
Contributor Author

cydrain commented Aug 26, 2021

  1. I would prefer to keep multiple binlog files; otherwise, querying on disk will cost too many IOs.
  2. Can we make a delegator that reads from binlog into Arrow rather than changing the file format directly? Changing the data format on MinIO will cause incompatibility. How are we going to handle this?

@xiaofan-luan
1. You're right. Since binlog files are stored in MinIO rather than on local disk, saving them by column saves a lot of network IO.
2. It's OK to keep the current binlog file format unchanged for compatibility.
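
A rough way to see the network IO argument is the sketch below. It is only an illustration under stated assumptions: the `store` dict is a hypothetical in-memory stand-in for an object store, the key layout and column sizes are made up, and it assumes whole objects are fetched rather than byte ranges.

```python
from typing import Dict, List

# Hypothetical stand-in for an object store: object key -> stored bytes.
# Key names and sizes are assumptions made for this sketch only.
store: Dict[str, bytes] = {
    "segment-1/id": b"\x00" * 80,             # small scalar column
    "segment-1/vector": b"\x00" * 4096,       # large vector column
    "segment-1/all": b"\x00" * (80 + 4096),   # same data as one combined file
}


def bytes_fetched(keys: List[str]) -> int:
    """Return the total bytes transferred when fetching the given whole objects."""
    return sum(len(store[key]) for key in keys)


# A query that only needs the `id` column.
per_column = bytes_fetched(["segment-1/id"])    # one file per column
single_file = bytes_fetched(["segment-1/all"])  # one file per segment
```

With one file per column, the query transfers only the column it needs; with one file per segment it transfers everything unless the reader can issue ranged reads.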

@cydrain
Contributor Author

cydrain commented Sep 8, 2021

After a thorough investigation, we think Arrow is not suitable for Milvus.
I described the detailed reasons in the MEP document:
https://wiki.lfaidata.foundation/display/MIL/MEP+13+--+Support+Apache+Arrow+As+In-Memory+Data+Format

@cydrain
Contributor Author

cydrain commented Sep 8, 2021

/close

@sre-ci-robot
Contributor

@cydrain: Closing this issue.

In response to this:

/close

