GitHub - mganta/avrostream: An example to read and write avro files using map reduce

README

This example shows how to read and write Avro files in a MapReduce program.

It can be used as an identity mapper and identity reducer on Avro files.

It can also merge many small Avro files into fewer, larger ones: each reducer writes one output file, so the number of reducers you specify sets the number of output files.

It can also sort the Avro records by the fields that are enabled for sorting in the schema (that is, fields without "order": "ignore").
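To make the sorting rule concrete, here is a minimal sketch of how record ordering behaves under the Avro specification: fields are compared in schema order, fields marked "ignore" are skipped, and "descending" fields reverse the comparison. This is not the repository's code. The class `SortOrderSketch`, its `Field` type, and the use of plain string maps as records are all assumptions made for illustration; real Avro compares typed binary values, so results can differ for non-string fields.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of Avro-style record ordering (hypothetical, not the repository's code). */
final class SortOrderSketch {

    /** The three values the Avro specification allows for a field's "order" attribute. */
    enum Order { ASCENDING, DESCENDING, IGNORE }

    /** Hypothetical stand-in for one schema field: its name and declared order. */
    static final class Field {
        final String name;
        final Order order;

        Field(String name, Order order) {
            this.name = name;
            this.order = order;
        }
    }

    /** Compares records field by field in schema order, skipping IGNORE fields. */
    static Comparator<Map<String, String>> comparatorFor(List<Field> schema) {
        return (left, right) -> {
            for (Field field : schema) {
                if (field.order == Order.IGNORE) {
                    continue; // ignored fields never influence the result
                }
                int cmp = left.get(field.name).compareTo(right.get(field.name));
                if (cmp != 0) {
                    return field.order == Order.DESCENDING ? -cmp : cmp;
                }
            }
            return 0; // equal on every field that participates in sorting
        };
    }

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Map<String, String>> sort(List<Map<String, String>> records, List<Field> schema) {
        List<Map<String, String>> copy = new ArrayList<>(records);
        copy.sort(comparatorFor(schema));
        return copy;
    }
}
```

With a schema of `email` (ascending) and `name` (ignore), two records that share an email compare as equal regardless of their names, so their relative order in the output is not determined by `name`.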

It assumes the first field in the Avro record is the key for record partitioning. You can change this by changing the field passed to the write method in the mapper.
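The effect of keying on the first field can be sketched as follows. The arithmetic matches Hadoop's default `HashPartitioner` (a non-negative hash taken modulo the reducer count). Whether this job uses the default partitioner is an assumption, and the class `FirstFieldPartitionSketch` and the string-array records are hypothetical stand-ins, so the actual reducer assignment in the job can differ because the real key type has its own `hashCode`.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of hash partitioning on a record's first field (hypothetical, not the repository's code). */
final class FirstFieldPartitionSketch {

    /** Non-negative hash modulo the reducer count, as in Hadoop's default HashPartitioner. */
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Groups records into one bucket per reducer, using field 0 of each record as the key. */
    static List<List<String[]>> assign(List<String[]> records, int numReducers) {
        List<List<String[]>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String[] record : records) {
            // Records with the same first field always land in the same bucket.
            buckets.get(partitionFor(record[0], numReducers)).add(record);
        }
        return buckets;
    }
}
```

Because every record with the same first-field value goes to the same reducer, each output file holds complete key groups. Picking a different field as the key changes only which index feeds `partitionFor` in this sketch.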

Here are the steps:

Prepare the HDFS folders

 hadoop fs -mkdir /user/tom/input

 hadoop fs -mkdir /user/tom/config

 hadoop fs -put schema.avsc /user/tom/config

 hadoop fs -put table2.avro /user/tom/input

Run the job

 hadoop jar avrostream-0.0.1-SNAPSHOT-job.jar -Dmapred.reduce.tasks=2 /user/tom/input /user/tom/output /user/tom/config/schema.avsc

 hadoop fs -ls /user/tom/output

Optionally, check the data via Hive

 a. Create the Hive table:

	hive -f create_table.sql

 b. At the Hive prompt, describe the table:

	describe mytable;

 c. Run a select:

	select email, name, zipcode, others['salary'] from mytable;

