Rewrite of Avro storage functions for Pig (with Trevni support)

Fast Avro Storage

I got frustrated with the version of AvroStorage bundled with Apache Pig (in Piggybank), so I decided to write my own.

Why did you bother?

The AvroStorage code is very complicated. It does a lot of unnecessary copying. It doesn't support the latest version of Avro (so it doesn't support Snappy compression). All of these things are bad.

What did you do differently?

I decided on a different approach. In Pig, Tuple is an interface. I realized that I could just wrap Avro objects (GenericData objects) in another object that implements the Tuple interface. That helped reduce the amount of copying required.
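
To make the wrapping idea concrete, here is a minimal sketch. It is not the code in this repository and it does not use the real Pig or Avro classes: MiniTuple and MiniRecord are hypothetical, cut-down stand-ins for Pig's Tuple interface and an Avro generic record, so the example compiles with nothing but the JDK. The point is only that the wrapper reads fields from the underlying record on demand instead of copying them into a new tuple up front.

```java
import java.util.ArrayList;
import java.util.List;

final class TupleWrapperSketch {

    /** Hypothetical, cut-down stand-in for Pig's Tuple interface. */
    interface MiniTuple {
        int size();
        Object get(int fieldNum);
        List<Object> getAll();
    }

    /** Hypothetical stand-in for an Avro generic record: values stored by field position. */
    static final class MiniRecord {
        private final Object[] values;

        MiniRecord(Object... values) {
            this.values = values;
        }

        Object get(int position) {
            return values[position];
        }

        void put(int position, Object value) {
            values[position] = value;
        }

        int fieldCount() {
            return values.length;
        }
    }

    /** Wrapper: implements the tuple interface by delegating to the record, with no copy on construction. */
    static final class RecordTuple implements MiniTuple {
        private final MiniRecord record;

        RecordTuple(MiniRecord record) {
            this.record = record;
        }

        @Override
        public int size() {
            return record.fieldCount();
        }

        @Override
        public Object get(int fieldNum) {
            // Field access goes straight to the wrapped record.
            return record.get(fieldNum);
        }

        @Override
        public List<Object> getAll() {
            // A copy is made only when a caller explicitly asks for all fields.
            List<Object> all = new ArrayList<>(size());
            for (int i = 0; i < size(); i++) {
                all.add(get(i));
            }
            return all;
        }
    }

    public static void main(String[] args) {
        MiniRecord record = new MiniRecord("alice", 42);
        MiniTuple tuple = new RecordTuple(record);
        System.out.println(tuple.size() + " fields, first = " + tuple.get(0));

        // The wrapper holds a reference, not a copy, so it sees later changes to the record.
        record.put(1, 43);
        System.out.println(tuple.get(1)); // prints 43
    }
}
```

The real Tuple interface has many more methods than the three shown here, and a real wrapper would also have to convert Avro values into the types Pig expects; the sketch leaves all of that out.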

I used the latest version of Avro (1.7.2) as a starting point, and rewrote AvroStorage from scratch.

I also wrote a function to load and store data in Trevni format from Pig. Trevni is Doug Cutting's new format for column-oriented stores; it's designed to accept Avro objects and return Avro objects.

Why didn't you contribute this to Apache?

I will contribute this to Apache when I have worked out some bugs, done some performance testing and tuning, and added unit tests.

In the meantime, feel free to try this out. It's alpha-quality software; I can make no promises of any kind about it, other than that I wrote it. Hope you find it useful!

-- Joe
