Hadoop MapReduce tool to convert Avro data files to Parquet format.



Hadoop MapReduce program to convert Avro data files to Parquet format.


To build:

git clone https://github.com/laserson/avro2parquet.git
cd avro2parquet
mvn clean package

This will generate the jar files in the target/ directory.


This tool works on Avro container files (the standard Avro data file format). The job reads each Avro GenericRecord object as the key, with a NullWritable as the value.
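
As a quick sanity check before running the job, you can confirm that an input file really is an Avro container file: per the Avro specification, such files begin with the four bytes 'O', 'b', 'j', 0x01. The following is a minimal sketch that assumes a local copy of the file. The class and method names are hypothetical and are not part of avro2parquet.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of avro2parquet.
class AvroMagicCheck {
    // Magic bytes that open every Avro object container file: 'O', 'b', 'j', 1.
    static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the given bytes start with the Avro magic.
    static boolean hasAvroMagic(byte[] header) {
        return header.length >= AVRO_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, AVRO_MAGIC.length), AVRO_MAGIC);
    }

    // Reads the first four bytes of a local file and checks them.
    static boolean isAvroContainerFile(Path file) throws IOException {
        byte[] header = new byte[AVRO_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the magic
                }
                read += n;
            }
        }
        return read == header.length && hasAvroMagic(header);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AvroMagicCheck file1.avro file2.avro ...
        for (String arg : args) {
            System.out.println(arg + ": " + isAvroContainerFile(Paths.get(arg)));
        }
    }
}
```

This only inspects the header; it does not validate the embedded schema or the data blocks.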

The tool is currently hardcoded to write Snappy-compressed Parquet. It is a plain MapReduce job implemented with Hadoop's Tool interface, so generic Hadoop options can be passed on the command line.
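
To spot-check the output, note that a Parquet file begins and ends with the four ASCII bytes "PAR1". The sketch below assumes an output file has been copied to the local filesystem. The class and method names are hypothetical, and the check covers only the magic bytes, not the footer or the Snappy compression.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of avro2parquet.
class ParquetMagicCheck {
    // Magic bytes found at both the start and the end of a Parquet file.
    static final byte[] PARQUET_MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Returns true if both the leading and trailing four bytes equal "PAR1".
    static boolean hasParquetMagic(byte[] head, byte[] tail) {
        return Arrays.equals(head, PARQUET_MAGIC) && Arrays.equals(tail, PARQUET_MAGIC);
    }

    // Reads the first and last four bytes of a local file and checks them.
    static boolean isParquetFile(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            if (length < 2L * PARQUET_MAGIC.length) {
                return false; // too short to hold both magic markers
            }
            byte[] head = new byte[PARQUET_MAGIC.length];
            byte[] tail = new byte[PARQUET_MAGIC.length];
            file.seek(0);
            file.readFully(head);
            file.seek(length - PARQUET_MAGIC.length);
            file.readFully(tail);
            return hasParquetMagic(head, tail);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ParquetMagicCheck part-file-1 part-file-2 ...
        for (String arg : args) {
            System.out.println(arg + ": " + isParquetFile(arg));
        }
    }
}
```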

The command looks like this:

hadoop jar <avro2parquet jar file> \
com.cloudera.science.avro2parquet.Avro2Parquet \
<any generic options to the JVM> \
hdfs:///path/to/avro/schema.avsc \
hdfs:///path/to/avro/data \

For example:

hadoop jar avro2parquet-0.1.0-jar-with-dependencies.jar \
com.cloudera.science.avro2parquet.Avro2Parquet \
-D mapred.child.java.opts=-Xmx1024M \
hdfs:///user/laserson/schemas/data.avsc \
hdfs:///user/laserson/data \