Mortar Project with examples for several different public data sets and data types/formats


Welcome to Mortar!

Mortar is a platform-as-a-service for Hadoop. With Mortar, you can run jobs on Hadoop using Apache Pig and Python without any special training.

Getting Started

To get started, follow the Mortar Example Tutorial.

Help

For lots more help and tutorials on running Mortar, check out the Mortar Help site.

Examples

airline_travel: CSV data from the Bureau of Transportation Statistics

The airline_travel pigscript takes data from the Bureau of Transportation Statistics and uses it to find out how airlines perform when we normalize for the airports they fly from and to.
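As a rough illustration of what "normalizing for airports" can mean, the Python sketch below compares each flight's delay to the average delay at its origin and destination airports, then averages that difference per airline. This is only one possible normalization and is not taken from the pigscript; the field names (carrier, origin, dest, delay) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def normalized_airline_delays(flights):
    """Average delay per carrier relative to an airport baseline.

    Assumption: each flight is a dict with hypothetical keys
    'carrier', 'origin', 'dest', and 'delay' (in minutes).
    """
    by_origin = defaultdict(list)
    by_dest = defaultdict(list)
    for flight in flights:
        by_origin[flight["origin"]].append(flight["delay"])
        by_dest[flight["dest"]].append(flight["delay"])

    # Average delay seen at each airport, across all carriers.
    origin_avg = {airport: mean(d) for airport, d in by_origin.items()}
    dest_avg = {airport: mean(d) for airport, d in by_dest.items()}

    # How much better or worse each flight did than its airports' baseline.
    excess = defaultdict(list)
    for flight in flights:
        baseline = (origin_avg[flight["origin"]] + dest_avg[flight["dest"]]) / 2
        excess[flight["carrier"]].append(flight["delay"] - baseline)

    return {carrier: mean(values) for carrier, values in excess.items()}
```

A negative value means the airline tends to beat the typical delay at the airports it serves; a positive value means it tends to do worse.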

coffee_tweets: JSON data from Twitter

The coffee_tweets pigscript answers the question "Which US state contains the highest concentration of coffee snobs?" It analyzes and aggregates Twitter data from the twitter-gardenhose, looking for telltale signs of coffee snobbery in tweets.

common_crawl_trending_topics: Dataset of technology news webpages taken from the Common Crawl

The common_crawl_trending_topics pigscript finds single-word trending topics by month from a corpus of technology news webpages (techcrunch, gigaom, and allthingsd). It does this by calculating the frequency of each word in each month, finding the "frequency velocity" from month to month, and selecting the words with the highest frequency velocity in each month.
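The steps above can be sketched in plain Python. The pigscript's exact definition of "frequency velocity" is not given here, so this sketch assumes it is the change in a word's relative frequency from one month to the next; the function names and input shape are hypothetical.

```python
from collections import Counter


def monthly_frequencies(docs_by_month):
    """Map month -> {word: relative frequency of the word in that month}."""
    freqs = {}
    for month, docs in docs_by_month.items():
        counts = Counter(word for doc in docs for word in doc.lower().split())
        total = sum(counts.values())
        freqs[month] = {word: count / total for word, count in counts.items()}
    return freqs


def top_trending(docs_by_month, top_n=3):
    """Return month -> words with the highest frequency velocity.

    Assumption: velocity = frequency this month - frequency last month.
    Months must sort chronologically as strings (e.g. '2012-01').
    """
    freqs = monthly_frequencies(docs_by_month)
    months = sorted(freqs)
    trending = {}
    for prev, curr in zip(months, months[1:]):
        velocity = {
            word: freq - freqs[prev].get(word, 0.0)
            for word, freq in freqs[curr].items()
        }
        # Highest velocity first; ties broken alphabetically.
        ranked = sorted(velocity, key=lambda w: (-velocity[w], w))
        trending[curr] = ranked[:top_n]
    return trending
```

The first month has no predecessor, so it gets no trending words in this sketch.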

excite: Search log data from excite! search engine

The excite pigscript shows an example of loading search engine logs from the excite! search engine and joining them to a users table. This is a common pattern for web log analysis.
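For readers new to the pattern, here is a minimal Python sketch of the same join. It assumes tab-separated log lines of the form user_id, timestamp, query and a users table keyed by user_id; both layouts are assumptions for illustration, not the pigscript's actual schema.

```python
def join_logs_to_users(log_lines, users):
    """Inner-join search log lines to user profiles on user_id.

    Assumptions: each log line is 'user_id<TAB>timestamp<TAB>query', and
    `users` is a dict mapping user_id -> dict of profile fields.
    """
    joined = []
    for line in log_lines:
        user_id, timestamp, query = line.rstrip("\n").split("\t", 2)
        profile = users.get(user_id)
        if profile is None:
            # Inner join: drop log entries with no matching user.
            continue
        joined.append(
            {"user_id": user_id, "timestamp": timestamp, "query": query, **profile}
        )
    return joined
```

In Pig the same idea is expressed with a JOIN on the shared key, which lets Hadoop distribute the work across the cluster.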

millionsong: Million song dataset

Two pigscripts explore the publicly-available Million Song Dataset.

The first, top_density_songs, finds the songs with the most beats per second in the Million Song Dataset. Code to REALLY FAST music!

The second, hottest_song_of_the_decade, figures out which song is the hottest for each decade of data in the Million Song Dataset.
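The "hottest per decade" logic is a group-by followed by a max. A minimal Python sketch, assuming each song is a dict with hypothetical 'title', 'year', and 'hotness' keys (the pigscript's own field names may differ):

```python
def hottest_by_decade(songs):
    """Return decade -> the song with the highest hotness in that decade.

    Assumption: songs with a missing or zero year are skipped.
    """
    best = {}
    for song in songs:
        if not song["year"]:
            continue
        decade = (song["year"] // 10) * 10  # e.g. 1987 -> 1980
        if decade not in best or song["hotness"] > best[decade]["hotness"]:
            best[decade] = song
    return best
```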

nasa_logs: Apache logs from NASA

The nasa_logs pigscript is an example of parsing Apache logs to find the most-served resources by date. It takes a sample of two months' worth of logs from NASA Kennedy Space Center's web server in 1995 and finds, for each date, the number of requests served, the number of bytes served, and the top 10 resources served (images are filtered out, since most of the requests are just for icons). It accepts a parameter, ORDERING, set to either 'num_requests' to rank resources by the number of requests served or 'num_bytes' to rank them by the number of bytes served.
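The following Python sketch mirrors that description for lines in the Apache common log format. The regular expression, the list of image extensions, and the choice to drop images before any counting are assumptions made for illustration; the pigscript may handle these details differently.

```python
import re
from collections import defaultdict

# Common log format: host ident user [date:time zone] "method resource proto" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<date>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<resource>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)$'
)
# Assumed set of image extensions to filter out.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".xbm")


def summarize(lines, ordering="num_requests", top_n=10):
    """Return date -> request count, byte count, and top resources."""
    if ordering not in ("num_requests", "num_bytes"):
        raise ValueError("ordering must be 'num_requests' or 'num_bytes'")

    per_date = defaultdict(
        lambda: defaultdict(lambda: {"num_requests": 0, "num_bytes": 0})
    )
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not parse
        resource = match["resource"]
        if resource.lower().endswith(IMAGE_EXTENSIONS):
            continue  # assumption: images are dropped before counting
        stats = per_date[match["date"]][resource]
        stats["num_requests"] += 1
        # A byte count of '-' means no body was sent.
        stats["num_bytes"] += 0 if match["bytes"] == "-" else int(match["bytes"])

    summary = {}
    for date, resources in per_date.items():
        ranked = sorted(resources, key=lambda r: (-resources[r][ordering], r))
        summary[date] = {
            "num_requests": sum(s["num_requests"] for s in resources.values()),
            "num_bytes": sum(s["num_bytes"] for s in resources.values()),
            "top_resources": ranked[:top_n],
        }
    return summary
```

Calling summarize(lines, ordering="num_bytes") ranks resources by bytes served instead of by request count.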

twitter_sentiment: JSON data from Twitter

The twitter_sentiment pigscript finds which words are most likely to appear in tweets expressing a "positive sentiment" and which words are most likely to appear in tweets expressing a "negative sentiment". It calculates these likelihoods as the frequency of a word in the corpus of positive/negative tweets divided by the frequency of that word in the corpus of all processed tweets. The words that cause tweets to be classified as positive/negative in the first place (e.g. "awesome", "disappointing") are excluded from the associations, so you can see what caused the sentiments instead of the sentiments themselves. The tweets are taken from the twitter-gardenhose.
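The ratio described above can be sketched in a few lines of Python. The seed word sets below contain only the two examples from the text and stand in for whatever word lists the pigscript really uses; the whitespace tokenization and function names are likewise assumptions.

```python
from collections import Counter

# Placeholder seed words; the real lists are assumed to be much longer.
POSITIVE_SEEDS = {"awesome"}
NEGATIVE_SEEDS = {"disappointing"}


def word_frequencies(tweets):
    """Map word -> relative frequency across the given tweets."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def sentiment_associations(tweets, seeds, excluded):
    """Score words by (frequency in seed-matching tweets) / (frequency overall).

    `seeds` selects the sentiment corpus; `excluded` words (typically all
    seed words) are left out of the result.
    """
    overall = word_frequencies(tweets)
    matching = [t for t in tweets if seeds & set(t.lower().split())]
    in_sentiment = word_frequencies(matching)
    return {
        word: freq / overall[word]
        for word, freq in in_sentiment.items()
        if word not in excluded
    }
```

A score above 1 means the word is over-represented in tweets with that sentiment relative to all tweets.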

Advanced Examples

Twitter Pagerank

A separate Mortar project, twitter-pagerank, shows how to embed Pig in a Jython controlscript. This project runs PageRank, an iterative algorithm, on a subset of the Twitter follower graph. The result is a list of who influential people on Twitter tend to follow. A tutorial on the Mortar help site walks through the twitter-pagerank project.
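To show why PageRank needs repeated iteration steps, here is a small in-memory Python sketch of the standard power-iteration form. It is not the twitter-pagerank project's code; the damping factor of 0.85, the fixed iteration count, and the handling of accounts that follow no one are conventional choices assumed for illustration.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Compute PageRank over a follower graph.

    `follows` maps each user to the list of users they follow, so rank
    flows from a follower to the accounts they follow.
    """
    nodes = set(follows) | {v for targets in follows.values() for v in targets}
    ranks = {node: 1.0 / len(nodes) for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = follows.get(node, [])
            if targets:
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a user who follows no one spreads rank evenly.
                share = damping * ranks[node] / len(nodes)
                for target in nodes:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks
```

Each pass depends on the ranks from the previous pass, which is why the project drives Pig from a controlscript that can loop.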