Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Rewrite of Avro storage functions for Pig (with Trevni support)
Java Shell
branch: master

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.
src
README.md
pom.xml

README.md

Fast Avro Storage

I got frustrated with the version of AvroStorage bundled with Apache Pig (in Piggybank), so I decided to write my own.

Why did you bother?

The AvroStorage code is very complicated. It does a lot of unnecesary copying. It doesn't support the latest version of Avro (so it doesn't support Snappy compression). All of these things are bad.

What did you do differently?

I decided on a different approach. In Pig, Tuples are implemented as an Interface. I realized that I could just wrap Avro objects (GenericData objects) into another object that implemented the Tuple interface. That helped reduce the amount of copying required.

I used the latest version of Avro (1.7.2) as as starting point, and rewrote AvroStorage from scratch.

I also wrote a function to load and store data in Trevni format from Pig. Trevni is Doug Cutting's new format for column-oriented stores; it's designed to accept Avro objects and return Avro objects.

Why didn't you contribute this to Apache?

I will contribute this to Apache when I have worked out some bugs, done some performance testing and tuning, and added unit tests.

In the mean time, feel free to try this out. It's alpha quality software; I can make no promises about it of any kind, other than that I wrote it. Hope you find it useful!

-- Joe

Something went wrong with that request. Please try again.