A Seriously Fun guide to Big Data Analytics in Practice
code
data @ 3506d0e
images
notes
.dexy
.gitignore
.gitmodules
.gitscribe
README.asciidoc
Rakefile
aa01-about.asciidoc
aa02-TODO.asciidoc
aa03-hello_reviewers.asciidoc
ba00-first_exploration.asciidoc
ba00-intro_to_wukong.asciidoc
ba01-chimpanzee_and_elephant.asciidoc
ba01-explanation.asciidoc
ba02-simple_stream.asciidoc
ba03-herding_cats.asciidoc
ba04-toolset.asciidoc
ba05-overview_of_datasets.asciidoc
ba06-semi_structured_data-a.asciidoc
ba06-semi_structured_data-b-wikipedia_other.asciidoc
ba06-semi_structured_data-c-patterns.asciidoc
ba06-semi_structured_data-c-wikipedia_corpus.asciidoc
ba06-semi_structured_data-d-airline_flights.asciidoc
ba06-semi_structured_data-daily_weather.asciidoc
ba06-semi_structured_data-f-truth_and_error.asciidoc
ba06-semi_structured_data-z-other_strategies.asciidoc
ba07-data_formats.asciidoc
book.asciidoc
fu-00-gotchas.asciidoc
fu00-overview_of_problems.asciidoc
fu01-sampling.asciidoc
fu02-server_logs.asciidoc
fu02-statistics-distribution_of_weather_measurements.asciidoc
fu02-statistics.asciidoc
fu03-advanced_pig.asciidoc
fu04-processing_text.asciidoc
fu05-an_elephants_eye_view_of_the_world.asciidoc
fu05-geographic_data.asciidoc
fu06-processing_graphs-pagerank.asciidoc
fu06-processing_graphs.asciidoc
fu07-time_series_data.asciidoc
fu08-hadoop_api.asciidoc
fu08-pig_udfs.asciidoc
fu09-hadoop_internals.asciidoc
fu10-tuning-hadoop_settings.asciidoc
fu10-tuning-pathologies.asciidoc
fu10-tuning.asciidoc
mu02-why_hadoop.asciidoc
mu03-how_to_think.asciidoc
mu04-data_modeling.asciidoc
mu05-cloud-vs-static.asciidoc
mu05-rules_of_scaling.asciidoc
mu06-best_practices_and_pedantic_points_of_style.asciidoc
mu07-tao_te_chimp.asciidoc
pr01-datasets.asciidoc
pr04-wikipedia_dbpedia.asciidoc
pr05-airline_flights.asciidoc
pr06-access_logs.asciidoc
pr08-data_formats-arc.asciidoc
tr00-back_cover.asciidoc
tr01-authors.asciidoc
tr02-LICENSE.asciidoc
tr03-colophon.asciidoc
working.asciidoc
xx01-simple_machine_learning.asciidoc
xx02-hbase_and_databases.asciidoc
xx03-flume_and_stream_processing.asciidoc
zz00-other_datasets_on_the_web.asciidoc
zz01-notes_for_chimpmark.asciidoc

README.asciidoc

Big Data for Chimps: A Seriously Fun guide to Hadoop and Terabyte-scale data processing

Building the book

To build the book, use https://github.com/mrflip/dexy_scribe

Outline

File organization:

    aaNN-name.asciidoc  -- preface material: outline, TODO, about
    baNN-name.asciidoc  -- basic chapters: the mechanics of working with data at scale
    fuNN-name.asciidoc  -- advanced fu: algorithms and methods for specific data challenges
    muNN-name.asciidoc  -- musings: how to think at scale and other omphaloskepses. (Later, these will be interleaved with the basic and algorithm sections.)
    prNN-name.asciidoc  -- programmer appendices: datasets, code, etc
    trNN-name.asciidoc  -- follow-on material

    xxNN-name.asciidoc  -- chopping block: these may not make the final draft of the book.
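
The naming scheme above can be checked mechanically. The following is a minimal sketch, not part of the book's build tooling: the `classify` function, the `SECTIONS` table, and the regular expression are illustrative assumptions, with the prefix meanings copied from the list above. Files that do not follow the scheme (for example `book.asciidoc`) come back as `None`.

```python
import re

# Prefix meanings copied from the "File organization" list.
SECTIONS = {
    "aa": "preface material",
    "ba": "basic chapters",
    "fu": "advanced fu",
    "mu": "musings",
    "pr": "programmer appendices",
    "tr": "follow-on material",
    "xx": "chopping block",
}

# Two-letter prefix, two-digit number, hyphen, name, .asciidoc extension.
PATTERN = re.compile(r"^([a-z]{2})(\d{2})-(.+)\.asciidoc$")


def classify(filename):
    """Return (section, number, name), or None if the file is outside the scheme."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    prefix, number, name = match.groups()
    section = SECTIONS.get(prefix)
    if section is None:
        return None
    return section, int(number), name


if __name__ == "__main__":
    for path in ["ba02-simple_stream.asciidoc", "book.asciidoc"]:
        print(path, "->", classify(path))
```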


	book.asciidoc

	aa01-about.asciidoc
	aa02-TODO.asciidoc

	ba01-chimps_and_elephants.asciidoc
	ba02-simple_stream.asciidoc

	ba04-toolset.asciidoc
	ba05-semi_structured_data-airline_flights.asciidoc
	ba05-semi_structured_data-wikipedia_corpus.asciidoc
	ba05-semi_structured_data-wikipedia_other.asciidoc
	ba05-semi_structured_data.asciidoc
	ba06-herding_cats.asciidoc
	ba06-overview_of_datasets.asciidoc

	fu06-overview_of_problems.asciidoc
	fu06-statistics.asciidoc
	fu06-tuning.asciidoc
	fu07-data_formats.asciidoc
	fu07-time_series_data.asciidoc
	fu08-advanced_pig.asciidoc
	fu11-processing_text.asciidoc
	fu12-geographic_data.asciidoc
	fu14-processing_graphs.asciidoc
	fu19-pig_udfs.asciidoc
	ff01-authors.asciidoc
	mu02-why_hadoop.asciidoc
	mu03-how_to_think.asciidoc
	mu04-data_modeling.asciidoc
	mu05-rules_of_scaling.asciidoc
	mu06-best_practices_and_pedantic_points_of_style.asciidoc
	pr01-datasets.asciidoc
	pr03-airline_flights.asciidoc
	xx01-simple_machine_learning.asciidoc
	xx02-hbase_and_databases.asciidoc
	xx03-flume_and_stream_processing.asciidoc
	xx16-operations.asciidoc

Mechanics of Working with Data at Scale

  1. Chimpanzee and Elephant Save Christmas (see chimpanzee_and_elephant)

    • stream of disordered records

    • group/sort records by their label

    • process each group of records

  2. Heraclitus and the Stream

    • Simple disordered stream (map-only) in Wukong

    • Simple ordered-group transform (map+reduce) in Wukong

  3. Musing: Why Hadoop Works

    • the locality problem

    • the Hadoop haiku

    • robots are inexpensive, programmers are not

  4. Herding `cat`s: the mechanics of wrangling massive data

    • getting data within Hadoop’s reach

    • launching jobs

    • seeing the data

    • seeing the logs

    • clicking to

    • simple debugging

    • wu-lign

  5. Data Formats

  6. Semi-structured Data

    • Wikipedia

    • Datasets:

    • Full-text of Articles (wikipedia_articles) — TSV

    • Wikipedia Page properties (wikipedia_pageinfos) — TSV

    • Wikipedia Pagelinks (wikipedia_links) — TSV

    • Pageview Counts (wikipedia_pageviews) — TSV

    • (Page Properties from DBpedia) (wikipedia_dbpedia) — TSV

    • Munging:

    • parse_raw_articles (xml splitter, xml parser)

    • figure out splitter

    • make it one line per file (by `&#XX;`-encoding the newlines)

    • keep any interesting metadata

    • parse_raw_links (sql dump)

    • parse_pageinfos (sql dump)

    • parse_raw_pageviews (simple tsv load)

    • prepare_articles

    • add minimal metadata

    • prepare_links

    • minimal metadata; label category pages, redirect, etc

    • adjacency list? labelled low-id-first edge list

    • prepare_pages

    • calculate degree (in, out, symmetric) & other simple stats, add to page metadata table.

    • Airline Flights and Flight Delays

    • Datasets:

    • Airline Flights with delay information (airline_flights/flights)

    • Airlines (airline_flights/airlines)

    • Airports (airline_flights/airports)

    • Airplanes (airline_flights/airplanes)

    • Munging:

    • parse_raw_wikipedia_identifiers

    • parse_raw_openflights_airports

    • parse_raw_dataexpo_airports

    • prepare_timezone_mapping

    • parse_dataexpo_flights

    • reconcile_airports

    • timezoneize_flights

    • Global Weather

    • Datasets

    • Daily observations (weather/daily_observations)

    • Hourly observations (weather/hourly_observations) (we’ll only use one of daily vs hourly)

    • Weather stations (weather/weather_stations)

    • Munging:

    • Logs

    • World Cup (weblogs/worldcup_apachelogs)

    • Star Wars Kid (weblogs/starwarskid_apachelogs)

  • Logs

    • figure out Apache log parser in Pig

  • page links

    • X prepare

  7. Statistics

    • sum, average, standard deviation, etc (airline_flights)

    • medians and percentiles

    • construct a histogram

    • normalize data by mapping to percentile

    • normalize data by mapping to Z-score

  8. Advanced Pig

    • map-side join

    • merge join

    • skew joins

    • Performance and efficiency

  9. Processing Text

    • grep’ing for simple matches

    • tokenize text

    • simple document analysis

    • minhash clustering

  10. Geo Data

    • quadkeys and grid coordinate system

    • skkkkkkkkk — map wikipedia

    • k-means clustering to produce readable summaries

    • partial quad keys for "area" data

    • Voronoi cells to do "nearby"-ness

    • Scripts:

    • calculate_voronoi_cells — use weather station locations to calculate voronoi polygons

    • voronoi_grid_assignment — cells that have a piece of border, or the largest grid cell that has no border on it

    • a

    • Using polymaps to see results

  11. Processing Graphs

    • subuniverse extraction

    • PageRank

    • identify strong links

    • clustering coefficient

  12. Black-Box Machine Learning

    • Simple Naive Bayes classification

    • Document clustering

  13. Flume and Stream Processing

    • sources, sinks and decorators

    • deploying a Wukong script as a decorator

    • parse the Twitter stream API feed

  14. Time Series

    • windowing

    • simple anomaly detection

    • rolling statistics

  15. Pig UDFs

    • Basic UDF

    • why algebraic is awesome and how to be algebraic

    • Wonderdog: a LoadFunc / StoreFunc for Elasticsearch

  16. Installing and Operating a Cluster

  17. Tuning

  18. HBase and Databases

  19. How to Scale Dirty and its Influence on People

    • How to think at scale

    • Pedantic Points of Style

    • Best Practices
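
The outline's first chapter lists three steps: a stream of disordered records, grouping/sorting records by their label, and processing each group. As a rough illustration of that pattern, here is a single-process sketch in plain Python. It is not Wukong, Pig, or Hadoop code, and every name in it (`map_records`, `group_by_label`, `reduce_groups`, `run`) is a hypothetical helper invented for this example.

```python
from itertools import groupby
from operator import itemgetter


def map_records(records, mapper):
    """Step 1: turn a disordered stream of records into (label, value) pairs."""
    for record in records:
        yield from mapper(record)


def group_by_label(pairs):
    """Step 2: sort pairs by label, then group neighbours that share a label."""
    ordered = sorted(pairs, key=itemgetter(0))
    for label, group in groupby(ordered, key=itemgetter(0)):
        yield label, [value for _, value in group]


def reduce_groups(groups, reducer):
    """Step 3: process each group of records independently of the others."""
    for label, values in groups:
        yield label, reducer(label, values)


def run(records, mapper, reducer):
    """Chain the three steps and collect the results in a dict."""
    pairs = map_records(records, mapper)
    return dict(reduce_groups(group_by_label(pairs), reducer))


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return ((word, 1) for word in line.split())


def count_reducer(label, values):
    """Sum the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["chimp elephant", "elephant", "chimp chimp"]
    print(run(lines, word_mapper, count_reducer))
```

Sorting before grouping matters here: `itertools.groupby` only merges adjacent items, which mirrors the outline's "group/sort records by their label" step. A real Hadoop job distributes each step across machines; this sketch only shows the shape of the data flow.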
