Scalding with Avro

This project shows how to model your data with Avro and then process it with Scalding. It also illustrates an error that occurs as soon as nested Avro structures are written to disk.

Tweets

Like so many examples, this one models Twitter tweets. Unlike most tutorials, however, it models tweets close to the original structure documented by Twitter. That means not a simple flat model, but one with nested and nullable properties of various types.
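To make "nested and nullable" concrete, here is a minimal sketch in plain Scala. It is not the project's actual Avro schema: the type names, field names, and the helper `hashtagsOf` are assumptions chosen for illustration. In Avro, a nullable field is typically declared as a union of null and the value type; `Option` stands in for that here.

```scala
// Hypothetical, simplified tweet model. All names are illustrative assumptions,
// not the schema used in this project.
final case class Coordinates(longitude: Double, latitude: Double)
final case class User(id: Long, screenName: String, location: Option[String])
final case class Hashtag(text: String)
final case class Entities(hashtags: List[Hashtag])

final case class Tweet(
  id: Long,
  text: String,
  createdAt: String,                // e.g. "2014-03-01T12:00:00Z"
  user: User,                       // nested record
  coordinates: Option[Coordinates], // nullable nested record
  entities: Option[Entities]        // nullable nested record holding a list
)

object TweetModel {
  // Reads a nullable, nested field through Option, so a tweet without
  // entities yields an empty list instead of a null dereference.
  def hashtagsOf(tweet: Tweet): List[String] =
    tweet.entities.toList.flatMap(_.hashtags).map(_.text.toLowerCase)
}
```

A flat model would not need the `Option` handling in `hashtagsOf`; with nested nullable records, every access path has to account for a missing intermediate value.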

Jobs

CountTweetsJob - counts all tweets per day.

TrendingTagsJob - takes all hashtags, counts them per day, and keeps the top 20.
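The logic of both jobs can be sketched with plain Scala collections. This is not the Scalding code from this project, only an in-memory illustration of the grouping and ranking involved. `TaggedTweet`, `JobSketch`, and the tie-breaking rule (alphabetical by tag) are assumptions.

```scala
// Hypothetical minimal input: the day a tweet was posted and its hashtags.
final case class TaggedTweet(day: String, hashtags: List[String])

object JobSketch {
  // Idea behind CountTweetsJob: number of tweets per day.
  def countTweetsPerDay(tweets: Seq[TaggedTweet]): Map[String, Int] =
    tweets.groupBy(_.day).map { case (day, ts) => day -> ts.size }

  // Idea behind TrendingTagsJob: per day, count each hashtag and keep
  // the `limit` most frequent ones (20 by default).
  def trendingTags(tweets: Seq[TaggedTweet], limit: Int = 20): Map[String, List[(String, Int)]] =
    tweets.groupBy(_.day).map { case (day, ts) =>
      val counts = ts
        .flatMap(_.hashtags)
        .groupBy(identity)
        .map { case (tag, occurrences) => tag -> occurrences.size }
      // Sort by descending count, then by tag name so ties are deterministic.
      day -> counts.toList.sortBy { case (tag, n) => (-n, tag) }.take(limit)
    }
}
```

On a cluster the same shape applies: group by day, aggregate, then take a bounded top-N per group so that no single key has to be fully sorted.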

Running the Project

Note: This project currently uses Maven to reproduce my client's original environment as closely as possible.

Run mvn clean compile scala:cc for continuous compilation during interactive development (or just import the project into IntelliJ). Run mvn clean compile scala:cctest to also run the tests while developing.

To run the job on Hadoop, package everything via mvn clean package. The fat JAR will be placed into the target folder and can be submitted to Hadoop.

hadoop jar target/avro-example-1.0.0-SNAPSHOT.jar com.mariussoutier.avroexample.jobs.<JobName> --input ... --output ... --hdfs

In time, the project will be moved to sbt.

Test Data

The project also contains ScalaCheck generators. To create 100 tweets in /tmp/tweets, run mvn scala:run -DmainClass="com.mariussoutier.avroexample.test.MakeTweets".
