A set of Hadoop input/output formats for use with Hadoop Streaming (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html)

Included are:
- Input formats that read Avro or Parquet files, convert the records to text or JSON, and feed them to a streaming MapReduce job as input.
- Output formats that take the text or JSON output of a streaming MapReduce job and store it as Avro or Parquet.
- An output format that writes to many files based on a record prefix and can be combined with the output formats above.
Heavily based on code from the ASF (http://www.apache.org/) and Twitter (http://twitter.com)

Credits to:
Vladimir Klimontovich <vklimontovich@getintent.com>
Evgeny Terentiev <terentev@gmail.com>
Build with 'mvn package' and add the resulting jar to HADOOP_CLASSPATH on all your Hadoop nodes (and on the box that launches the job), or ship it with -libjars.
Usage examples (assuming iow-hadoop-streaming-1.0.jar is present in HADOOP_CLASSPATH):
Reading Avro file:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.output.fileoutputformat.compress=true -D stream.reduce.output=text \
-inputformat net.iponweb.hadoop.streaming.avro.AvroAsJsonInputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
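With AvroAsJsonInputFormat, the mapper receives one JSON-serialized record per line on stdin. A minimal mapper sketch (the "id" field name is an assumption for illustration; substitute a field from your own schema):

```shell
# Hypothetical streaming mapper: reads one JSON record per line (as
# produced by AvroAsJsonInputFormat) and emits key<TAB>record pairs.
# The "id" field is a placeholder -- use a field from your schema.
cat <<'EOF' > mapper.sh
#!/bin/sh
while IFS= read -r line; do
  # Crude extraction of the "id" value; real jobs may prefer jq or python
  key=$(printf '%s' "$line" | sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\{0,1\}\([^,"}]*\).*/\1/p')
  printf '%s\t%s\n' "$key" "$line"
done
EOF
chmod +x mapper.sh
```

For example, `echo '{"id": 42, "name": "x"}' | ./mapper.sh` emits the record keyed by 42, ready for the shuffle.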
Reading Parquet file:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D parquet.read.support.class=net.iponweb.hadoop.streaming.parquet.GroupReadSupport \
-D mapreduce.output.fileoutputformat.compress=true -D stream.reduce.output=text \
-inputformat net.iponweb.hadoop.streaming.parquet.ParquetAsJsonInputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
Writing Avro file:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D iow.streaming.output.schema=my_schema.avsc -files my_schema.avsc \
-D mapreduce.output.fileoutputformat.compress=true -D stream.reduce.output=text \
-outputformat net.iponweb.hadoop.streaming.avro.AvroAsJsonOutputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
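The schema shipped with -files is an ordinary Avro schema file. A hypothetical example (record and field names are placeholders, not part of this project):

```shell
# Hypothetical Avro schema for the writing example; "Event", "id" and
# "name" are placeholders. Ship the file with -files and point
# iow.streaming.output.schema at it.
cat <<'EOF' > my_schema.avsc
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
EOF
```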
Writing Parquet file:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D iow.streaming.output.schema=my_schema.pqs -files my_schema.pqs \
-D mapreduce.output.fileoutputformat.compress=true -D stream.reduce.output=text \
-D parquet.compression=gzip -D parquet.enable.dictionary=true \
-D parquet.write.support.class=net.iponweb.hadoop.streaming.parquet.GroupWriteSupport \
-outputformat net.iponweb.hadoop.streaming.parquet.ParquetAsJsonOutputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
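For Parquet, the schema file uses Parquet's message-type notation rather than Avro JSON; a hypothetical example (the assumption here is that it is shipped and referenced the same way as the Avro schema above, and all names are placeholders):

```shell
# Hypothetical Parquet schema in message-type notation for the writing
# example above; "Event", "id" and "name" are placeholders.
cat <<'EOF' > my_schema.pqs
message Event {
  required int64 id;
  optional binary name (UTF8);
}
EOF
```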
Of course, you can combine any inputformat with any outputformat; just make sure that your output records correspond to the chosen format (text/json) and schema.
Using ByKeyOutputFormat is a bit tricky. First, every file generated by a reducer must have a unique name. The file name is, effectively, the first field of your reducer output (the field separator is map.output.key.field.separator). The easiest way to achieve this is to append a slash and the reducer ID, so you end up with something like:

/job_output_dir/output_type_0/0.avro
/job_output_dir/output_type_0/1.avro
...
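Hadoop streaming exports job configuration to the script's environment with dots replaced by underscores, so the reducer ID is available as mapreduce_task_partition. A sketch of a reducer that builds unique per-reducer file names this way (the "output_type_0" prefix is a placeholder):

```shell
# Hypothetical reducer that prefixes each record with a per-reducer file
# name. Hadoop streaming exposes job properties as environment variables
# with '.' replaced by '_', so mapreduce_task_partition holds this
# reducer's ID ("output_type_0" is a placeholder path).
cat <<'EOF' > reducer.sh
#!/bin/sh
part=${mapreduce_task_partition:-0}
while IFS= read -r line; do
  printf 'output_type_0/%s\t%s\n' "$part" "$line"
done
EOF
chmod +x reducer.sh
```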
If your output has a single schema, this is enough. If different schemas are required, they should be named before a colon in the first field:

schema0:output_type_0/0

The schema files must be present in the job's working directory (shipped with -files), and iow.streaming.schema.use.prefix must be set to true (-D iow.streaming.schema.use.prefix=true). The schema name may be 'default', in which case the schema from iow.streaming.output.schema is used. The actual underlying format is specified in iow.streaming.bykeyoutputformat; the default is text. Supported values are:
text, sequence, avrotext, avrojson, parquettext, parquetjson
Example:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files pq_schema0,pq_schema1 -D mapreduce.output.fileoutputformat.compress=true \
-D stream.reduce.output=text -D parquet.compression=gzip -D parquet.enable.dictionary=true \
-D parquet.write.support.class=net.iponweb.hadoop.streaming.parquet.GroupWriteSupport \
-D iow.streaming.bykeyoutputformat=parquetjson -D iow.streaming.schema.use.prefix=true \
-outputformat net.iponweb.hadoop.streaming.ByKeyOutputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
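A reducer for such a job might route each record to one of the schemas shipped above by emitting a schema-prefixed first field. In this sketch, routing on a "type" field is purely an assumption for illustration, as are the "clicks"/"other" paths:

```shell
# Hypothetical reducer emitting schema-prefixed file keys for the
# pq_schema0 and pq_schema1 files shipped with -files; routing on a
# "type" field and the output paths are assumptions for illustration.
cat <<'EOF' > schema_reducer.sh
#!/bin/sh
part=${mapreduce_task_partition:-0}
while IFS= read -r line; do
  case "$line" in
    *'"type": "click"'*) printf 'pq_schema0:clicks/%s\t%s\n' "$part" "$line" ;;
    *)                   printf 'pq_schema1:other/%s\t%s\n' "$part" "$line" ;;
  esac
done
EOF
chmod +x schema_reducer.sh
```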
Nikita Makeev, IPONWEB. Please send questions and feedback by e-mail: whale2.box@gmail.com