Support Arrow dictionary serialization:
- Put the serialized schema and dictionary together in system memory, due to the limited APIs exported by the Arrow serializer;
- Remove the duplicate schema from serialized records in video memory.
Showing 6 changed files with 293 additions and 108 deletions.
baf2945
@asuhan this seems like something that could be improved. Would it be possible for you to open a JIRA describing the use case and how you would ideally like the Arrow IPC API to work?
@wesm I didn't find any suitable API on RecordBatchStreamWriter, so I used ipc::WriteRecordBatch to serialize each record batch separately. There could be a minor refactoring here.
If you are writing record batches separately from the schema, that is the right way right now. The function / class documentation in arrow/ipc/writer.h and reader.h could be improved to make the intended usage clearer.
I left a note on https://issues.apache.org/jira/browse/ARROW-1226 so we can improve the documentation about this.
Yes, it's working, but the API that returns a FileBlock is kind of low-level, even if it's a dummy one. As a next step we want to support distributed results on multi-GPU; I will return an IPC buffer list to each MapD client.
OK, if you could describe a higher-level API that would work better for your use case, I would be happy to implement it. Let me know.
Actually this is minor, but a wrapper for it in RecordBatchStreamWriter would be nicer for serializing just a record batch, as well as the schema or perhaps separate dictionaries, so the caller can simply decide where to put them, separately or together, without needing any knowledge of the EOS marker or the padding in between on the deserializer side.
In 0.5.0 there is a new arrow::ipc::MessageReader abstract interface (https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.h#L74), so you can read from an arbitrary stream of messages (which need not be contiguous in an InputStream). It might be useful to have something similar when writing a sequence of record batches, or at least some APIs to write the stream components in a less monolithic way.
It would be great to make every section self-descriptive and easy to deserialize arbitrarily, so that every time I deserialize one or more sections from a buffer, I get object pointers with meta-information indicating whether they are schemata, dictionaries, or record batches (using, say, dynamic_cast), or maybe their memory types.
Agreed -- there is now a ReadRecordBatch that takes a Message instance: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.h#L154
The Message is a new type that indicates the type of IPC message and contains buffers with the metadata and message body: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata.h#L143

When writing a stream it's a bit trickier, since it may be more efficient to call OutputStream::Write for the metadata and then for the buffers and padding, but there could be a less efficient form where instances of Message are created in memory and then written to an output stream with Message::SerializeTo: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata.h#L190
Thanks for pointing out everything I need. I'm also figuring out how to create an OutputStream on a preallocated buffer, like CPU shared memory, instead of a mmap'd file.
For this you want FixedSizeBufferWriter: https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/memory.h#L88