Support Arrow dictionary serialization:
- Put the serialized schema and dictionary together in system memory,
due to the limited APIs exported by the Arrow serializer;
- Remove the duplicate schema from serialized records in video memory.
m1mc authored and asuhan committed Jun 29, 2017
1 parent 1fab4c2 commit baf2945
Showing 6 changed files with 293 additions and 108 deletions.
7 changes: 5 additions & 2 deletions QueryEngine/ResultSet.h
@@ -31,6 +31,7 @@
#include "../Chunk/Chunk.h"

#ifdef ENABLE_ARROW_CONVERTER
#include "arrow/ipc/metadata.h"
#include "arrow/table.h"
// Arrow defines macro UNUSED conflict w/ that in jni_md.h
#ifdef UNUSED
@@ -206,7 +207,8 @@ struct OneIntegerColumnRow {

#ifdef ENABLE_ARROW_CONVERTER
struct ArrowResult {
std::shared_ptr<arrow::Buffer> schema;
std::vector<char> sm_handle;
int64_t sm_size;
std::vector<char> df_handle;
int64_t df_size;
};
@@ -447,7 +449,8 @@ class ResultSet {
int getGpuCount() const;

#ifdef ENABLE_ARROW_CONVERTER
arrow::RecordBatch convertToArrow(const std::vector<std::string>& col_names) const;
arrow::RecordBatch convertToArrow(const std::vector<std::string>& col_names, arrow::ipc::DictionaryMemo& memo) const;
std::shared_ptr<const std::vector<std::string>> getDictionary(const int dict_id) const;
std::pair<std::vector<std::shared_ptr<arrow::Array>>, size_t> getArrowColumns(
const std::vector<std::shared_ptr<arrow::Field>>& fields) const;

12 comments on commit baf2945

@wesm (Contributor) commented on baf2945 Jul 17, 2017

> Put serialized schema and dictionary together on system memory
> due to limited APIs exported by Arrow serializer;

@asuhan this seems like something that could be improved. Would it be possible for you to open a JIRA describing the use case and how you would ideally like the Arrow IPC API to work?

@m1mc (Contributor, Author) commented on baf2945 Jul 17, 2017

@wesm I didn't find a suitable API on RecordBatchStreamWriter, so I used ipc::WriteRecordBatch to serialize each record batch separately. This could be a minor refactoring later.

@wesm (Contributor) commented on baf2945 Jul 17, 2017

If you are writing record batches separately from the schema, that is the right way to do it right now. The function and class documentation in arrow/ipc/writer.h and reader.h could be improved to make the intended usage clearer.

@wesm (Contributor) commented on baf2945 Jul 17, 2017

I left a note on https://issues.apache.org/jira/browse/ARROW-1226 so we can improve the documentation about this

@m1mc (Contributor, Author) commented on baf2945 Jul 17, 2017

Yes, it's working, but the API that returns a FileBlock is somewhat low-level, even if the block is a dummy. As a next step we want to support distributed results on multi-GPU, where I will return a list of IPC buffers to each MapD client.

@wesm (Contributor) commented on baf2945 Jul 17, 2017

OK. If you could describe a higher-level API that would work better for your use case, I would be happy to implement it; let me know.

@m1mc (Contributor, Author) commented on baf2945 Jul 17, 2017

Actually this is minor, but a wrapper in RecordBatchStreamWriter would look better for serializing just a record batch, as well as the schema or perhaps individual dictionaries. The caller could then decide whether to put them separately or together, without needing to know anything about EOS markers or padding in between for deserializers.

@wesm (Contributor) commented on baf2945 Jul 17, 2017

In 0.5.0 there is a new arrow::ipc::MessageReader abstract interface (https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.h#L74), so you can read from an arbitrary stream of messages (which need not be contiguous in an InputStream). It might be useful to have something similar when writing a sequence of record batches, or at least some APIs that write the stream components in a less monolithic way.

@m1mc (Contributor, Author) commented on baf2945 Jul 17, 2017

That'd be great; it would make every section self-descriptive and easy to deserialize arbitrarily. Then, every time I deserialize one or more sections in a buffer, I could get back object pointers with meta-info indicating whether they are schemata, dictionaries, or records (using, say, dynamic_cast), and maybe the memory type as well.

@wesm (Contributor) commented on baf2945 Jul 17, 2017

Agreed -- there is now a ReadRecordBatch that takes a Message instance https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.h#L154

The Message is a new type that indicates the type of IPC message and contains buffers with the metadata and message body: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata.h#L143

When writing a stream it's a bit trickier, since it may be more efficient to call OutputStream::Write for the metadata and then again for the buffers and padding. But there could be a less efficient form where instances of Message are created in memory, and then you write the message to an output stream with Message::SerializeTo https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata.h#L190

@m1mc (Contributor, Author) commented on baf2945 Jul 17, 2017

Thanks for pointing out everything I need. I'm also figuring out how to create an OutputStream on a preallocated buffer, like CPU shared memory, instead of a mmap'd file.
