Hadoop MapReduce tool to convert Avro data files to Parquet format.
Latest commit 11f24f9 May 22, 2013 @laserson Added readme
src/main/java/com/cloudera/science/avro2parquet Initial commit May 22, 2013
.gitignore Initial commit May 22, 2013
README.md Added readme May 22, 2013
pom.xml Initial commit May 22, 2013


avro2parquet

Hadoop MapReduce program to convert Avro data files to Parquet format.

Installation

git clone https://github.com/laserson/avro2parquet.git
cd avro2parquet
mvn clean package

This will generate the jar files in the target/ directory.

Usage

This tool works on Avro object container files, which is the standard Avro data file format. Each input record reaches the job as an Avro GenericRecord key paired with a NullWritable value.

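The tool is currently hardcoded to output Snappy-compressed Parquet. It is a plain MapReduce job that uses Hadoop's Tool interface, so generic options (for example, -D property=value) can be passed ahead of the tool's own arguments.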

The general form of the command is:

hadoop jar <avro2parquet jar file> \
com.cloudera.science.avro2parquet.Avro2Parquet \
<generic Hadoop options, e.g. -D property=value> \
hdfs:///path/to/avro/schema.avsc \
hdfs:///path/to/avro/data \
hdfs:///output/path

For example:

hadoop jar avro2parquet-0.1.0-jar-with-dependencies.jar \
com.cloudera.science.avro2parquet.Avro2Parquet \
-D mapred.child.java.opts=-Xmx1024M \
hdfs:///user/lasersou/schemas/data.avsc \
hdfs:///user/lasersou/data \
hdfs:///user/lasersou/output
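
To make the argument layout above concrete, here is a minimal, standard-library-only sketch of how such a command line splits into generic `-D` options and the three positional paths (schema, input, output). This is not the tool's actual code and not Hadoop's own option parser: the class name `Avro2ParquetArgs` and its methods are hypothetical, and only `-D key=value` is handled, whereas Hadoop recognises further generic options.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical illustration of the command-line layout described in the README.
 * Simplified on purpose: only "-D key=value" is treated as a generic option.
 */
public class Avro2ParquetArgs {
    /** Properties collected from -D options, in the order given. */
    public final Map<String, String> properties = new LinkedHashMap<>();
    /** Remaining arguments: schema path, input path, output path. */
    public final List<String> positional = new ArrayList<>();

    public static Avro2ParquetArgs parse(String[] args) {
        Avro2ParquetArgs parsed = new Avro2ParquetArgs();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-D") && i + 1 < args.length) {
                // Form used in the README example: "-D key=value"
                parsed.addProperty(args[++i]);
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                // Compact form: "-Dkey=value"
                parsed.addProperty(arg.substring(2));
            } else {
                parsed.positional.add(arg);
            }
        }
        if (parsed.positional.size() != 3) {
            throw new IllegalArgumentException(
                "expected <schema.avsc> <avro input> <output path>, got "
                    + parsed.positional.size() + " positional argument(s)");
        }
        return parsed;
    }

    private void addProperty(String keyValue) {
        // Split on the first '=' only, so values may themselves contain '='.
        int eq = keyValue.indexOf('=');
        if (eq < 0) {
            properties.put(keyValue, ""); // assumption: a bare key maps to an empty value
        } else {
            properties.put(keyValue.substring(0, eq), keyValue.substring(eq + 1));
        }
    }

    public String schemaPath() { return positional.get(0); }
    public String inputPath()  { return positional.get(1); }
    public String outputPath() { return positional.get(2); }

    public static void main(String[] args) {
        Avro2ParquetArgs parsed = parse(args);
        System.out.println("properties = " + parsed.properties);
        System.out.println("schema     = " + parsed.schemaPath());
        System.out.println("input      = " + parsed.inputPath());
        System.out.println("output     = " + parsed.outputPath());
    }
}
```

Passing the arguments from the example command (everything after the class name) would yield one property, `mapred.child.java.opts`, and the three HDFS paths in order. The order of the positional arguments matters: schema first, then input data, then output path.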