A set of Hadoop input/output formats for use with Hadoop Streaming (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html)

Included are:
- Input formats that read Avro or Parquet files, convert the records to text or JSON, and feed them to a streaming MapReduce job as input.
- Output formats that take the text or JSON output of a streaming MapReduce job and store it as Avro or Parquet.
- An output format that writes to many files based on a record prefix and can be combined with the output formats above.
Heavily based on code from the ASF (http://www.apache.org/) and Twitter (http://twitter.com)

Credits to:
Vladimir Klimontovich <vklimontovich@getintent.com>
Evgeny Terentiev <terentev@gmail.com>
Build with 'mvn package' and add the resulting jar to HADOOP_CLASSPATH on all your Hadoop nodes (and on the box that launches the job), or ship it with -libjars.
Usage examples (assuming iow-hadoop-streaming-1.0.jar is present in HADOOP_CLASSPATH):
Reading Avro file:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.output.fileoutputformat.compress=true -D stream.reduce.output=text \
-inputformat net.iponweb.hadoop.streaming.avro.AvroAsJsonInputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
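With AvroAsJsonInputFormat, the mapper receives one JSON-serialized record per line on stdin. A minimal mapper sketch (the "id" field name is an assumption for illustration; substitute a field from your own schema):

```shell
# Hypothetical streaming mapper: reads one JSON record per line (as
# produced by AvroAsJsonInputFormat) and emits key<TAB>record pairs.
# The "id" field is a placeholder -- use a field from your schema.
cat <<'EOF' > mapper.sh
#!/bin/sh
while IFS= read -r line; do
  # Crude extraction of the "id" value; real jobs may prefer jq or python
  key=$(printf '%s' "$line" | sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\{0,1\}\([^,"}]*\).*/\1/p')
  printf '%s\t%s\n' "$key" "$line"
done
EOF
chmod +x mapper.sh
```

For example, `echo '{"id": 42, "name": "x"}' | ./mapper.sh` emits the record keyed by 42, ready for the shuffle.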
Reading Parquet file:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D parquet.read.support.class=net.iponweb.hadoop.streaming.parquet.GroupReadSupport \
-D mapreduce.output.fileoutputformat.compress=true -D stream.reduce.output=text \
-inputformat net.iponweb.hadoop.streaming.parquet.ParquetAsJsonInputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
Writing Avro file:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D iow.streaming.output.schema=my_schema.avsc -files my_schema.avsc \
-D mapreduce.output.fileoutputformat.compress=true -D stream.reduce.output=text \
-outputformat net.iponweb.hadoop.streaming.avro.AvroAsJsonOutputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
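The schema shipped with -files is an ordinary Avro schema file. A hypothetical example (record and field names are placeholders, not part of this project):

```shell
# Hypothetical Avro schema for the writing example; "Event", "id" and
# "name" are placeholders. Ship the file with -files and point
# iow.streaming.output.schema at it.
cat <<'EOF' > my_schema.avsc
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
EOF
```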
Writing Parquet file:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D iow.streaming.output.schema=my_schema.pqs -files my_schema.pqs \
-D mapreduce.output.fileoutputformat.compress=true -D stream.reduce.output=text \
-D parquet.compression=gzip -D parquet.enable.dictionary=true \
-D parquet.write.support.class=net.iponweb.hadoop.streaming.parquet.GroupWriteSupport \
-outputformat net.iponweb.hadoop.streaming.parquet.ParquetAsJsonOutputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
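For Parquet, the schema file uses Parquet's message-type notation rather than Avro JSON; a hypothetical example (the assumption here is that it is shipped and referenced the same way as the Avro schema above, and all names are placeholders):

```shell
# Hypothetical Parquet schema in message-type notation for the writing
# example above; "Event", "id" and "name" are placeholders.
cat <<'EOF' > my_schema.pqs
message Event {
  required int64 id;
  optional binary name (UTF8);
}
EOF
```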
Of course, you can combine any inputformat with any outputformat; just make sure that your output records correspond to the chosen format (text/json) and schema.
Using ByKeyOutputFormat is a bit tricky. First, every file generated by a reducer must have a unique name. The file name is, effectively, the first field of your reducer output (the field separator is map.output.key.field.separator). The easiest way to achieve this is to append a slash and the reducer ID, so you end up with something like:

/job_output_dir/output_type_0/0.avro
/job_output_dir/output_type_0/1.avro
...
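Hadoop streaming exports job configuration to the script's environment with dots replaced by underscores, so the reducer ID is available as mapreduce_task_partition. A sketch of a reducer that builds unique per-reducer file names this way (the "output_type_0" prefix is a placeholder):

```shell
# Hypothetical reducer that prefixes each record with a per-reducer file
# name. Hadoop streaming exposes job properties as environment variables
# with '.' replaced by '_', so mapreduce_task_partition holds this
# reducer's ID ("output_type_0" is a placeholder path).
cat <<'EOF' > reducer.sh
#!/bin/sh
part=${mapreduce_task_partition:-0}
while IFS= read -r line; do
  printf 'output_type_0/%s\t%s\n' "$part" "$line"
done
EOF
chmod +x reducer.sh
```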
If your output has a single schema, this is enough. If different schemas are required, they should be named before a colon in the first field:

schema0:output_type_0/0

The schema files must be present in the job's working directory (shipped with -files), and iow.streaming.schema.use.prefix must be set to true (-D iow.streaming.schema.use.prefix=true). The schema name may be 'default', in which case the schema from iow.streaming.output.schema is used. The actual underlying format is specified in iow.streaming.bykeyoutputformat; the default is text. Supported values are:
text, sequence, avrotext, avrojson, parquettext, parquetjson
Example:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files pq_schema0,pq_schema1 -D mapreduce.output.fileoutputformat.compress=true \
-D stream.reduce.output=text -D parquet.compression=gzip -D parquet.enable.dictionary=true \
-D parquet.write.support.class=net.iponweb.hadoop.streaming.parquet.GroupWriteSupport \
-D iow.streaming.bykeyoutputformat=parquetjson -D iow.streaming.schema.use.prefix=true \
-outputformat net.iponweb.hadoop.streaming.ByKeyOutputFormat \
-input my_input_dir -output my_output_dir -mapper my_mapper -reducer my_reducer
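A reducer for such a job might route each record to one of the schemas shipped above by emitting a schema-prefixed first field. In this sketch, routing on a "type" field is purely an assumption for illustration, as are the "clicks"/"other" paths:

```shell
# Hypothetical reducer emitting schema-prefixed file keys for the
# pq_schema0 and pq_schema1 files shipped with -files; routing on a
# "type" field and the output paths are assumptions for illustration.
cat <<'EOF' > schema_reducer.sh
#!/bin/sh
part=${mapreduce_task_partition:-0}
while IFS= read -r line; do
  case "$line" in
    *'"type": "click"'*) printf 'pq_schema0:clicks/%s\t%s\n' "$part" "$line" ;;
    *)                   printf 'pq_schema1:other/%s\t%s\n' "$part" "$line" ;;
  esac
done
EOF
chmod +x schema_reducer.sh
```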
Nikita Makeev, IPONWEB. Please send questions and feedback by e-mail: whale2.box@gmail.com