
Moved coprolites into the 'supplementary' directory; moving towards a one-file-per-chapter-or-so world
1 parent 479f2da commit efb3f6c46b169652901522721f0171893e0d804a Philip (flip) Kromer committed
Showing with 1,101 additions and 1,660 deletions.
  1. +500 −0 00-preface.asciidoc
  2. +0 −500 00a-about.asciidoc
  3. +187 −0 02-simple_transform.asciidoc
  4. +0 −88 02a-c_and_e_start_a_business.asciidoc
  5. +0 −1 02a-simple_transform-intro.asciidoc
  6. +0 −74 02b-running_a_hadoop_job.asciidoc
  7. +0 −38 02c-end_of_simple_transform.asciidoc
  8. +2 −0 03-transform_pivot.asciidoc
  9. +0 −2 03a-c_and_e_save_xmas.asciidoc
  10. +0 −2 03a-intro.asciidoc
  11. +1 −2 07-intro_to_storm+trident.asciidoc
  12. +33 −0 09-statistics.asciidoc
  13. +0 −33 09a-statistics-intro.asciidoc
  14. +45 −0 18-java_api.asciidoc
  15. +0 −46 18a-hadoop_api.asciidoc
  16. +0 −11 19b-pig_udfs.asciidoc
  17. +241 −0 21-hadoop_internals.asciidoc
  18. +0 −231 21d-hadoop_internals-tuning.asciidoc.md
  19. +25 −0 25-appendix.asciidoc
  20. +0 −15 25a-authors.asciidoc
  21. +0 −8 25b-colophon.asciidoc
  22. +4 −0 25c-references.asciidoc
  23. 0 22b-scripts.asciidoc → 25d-overview_of_scripts.asciidoc
  24. +50 −0 25f-glossary.asciidoc
  25. +0 −163 25j-appendix-tools_landscape.asciidoc
  26. +3 −47 25g-back_cover.asciidoc → 26-back_cover.asciidoc
  27. +0 −46 88-storm-glossary.asciidoc
  28. +0 −225 book-full.asciidoc
  29. +0 −121 book-reviewers.asciidoc
  30. +0 −7 book.asciidoc
  31. 0 { → cheatsheets}/24-cheatsheets.asciidoc
  32. 0 { → cheatsheets}/24a-unix_cheatsheet.asciidoc
  33. 0 { → cheatsheets}/24b-regular_expression_cheatsheet.asciidoc
  34. 0 { → cheatsheets}/24c-pig_cheatsheet.asciidoc
  35. 0 { → cheatsheets}/24d-hadoop_tunables_cheatsheet.asciidoc
  36. 0 { → cheatsheets}/25i-asciidoc_cheatsheet_and_style_guide.asciidoc
  37. 0 { → supplementary}/00b-intro-more_outlines.asciidoc
  38. 0 { → supplementary}/09a-summarizing.asciidoc
  39. 0 { → supplementary}/09b-sampling.asciidoc
  40. 0 { → supplementary}/09c-distribution_of_weather_measurements.asciidoc
  41. 0 { → supplementary}/09e-exercises.asciidoc
  42. +10 −0 { → supplementary}/14c-data_modeling.asciidoc
  43. 0 { → supplementary}/15-graphs.asciidoc
  44. 0 { → supplementary}/15a-representing_graphs.asciidoc
  45. 0 { → supplementary}/15b-community_extractions.asciidoc
  46. 0 { → supplementary}/15c-pagerank.asciidoc
  47. 0 { → supplementary}/15d-label_propagation.asciidoc
  48. 0 { → supplementary}/16a-simple_machine_learning.asciidoc
  49. 0 16d-misc.asciidoc → supplementary/16d-notes_on_ml_algorithms.asciidoc
  50. 0 { → supplementary}/17-best_practices.asciidoc
  51. 0 { → supplementary}/17a-why_hadoop.asciidoc
  52. 0 { → supplementary}/17b-how_to_think.asciidoc
  53. 0 { → supplementary}/17d-cloud-vs-static.asciidoc
  54. 0 { → supplementary}/17e-rules_of_scaling.asciidoc
  55. 0 { → supplementary}/17f-best_practices_and_pedantic_points_of_style.asciidoc
  56. 0 { → supplementary}/17g-tao_te_chimp.asciidoc
  57. 0 { → supplementary}/20b-even_yet_still_more_hbase.asciidoc
  58. 0 { → supplementary}/21b-hadoop_internals-logs.asciidoc
  59. 0 { → supplementary}/22d-use_method_checklist.asciidoc
  60. 0 { → supplementary}/23-datasets_and_scripts.asciidoc
  61. 0 { → supplementary}/23a-overview_of_datasets.asciidoc
  62. 0 { → supplementary}/23c-datasets.asciidoc
  63. 0 { → supplementary}/23c-wikipedia_dbpedia.asciidoc
  64. 0 { → supplementary}/23d-airline_flights.asciidoc
  65. 0 { → supplementary}/23e-access_logs.asciidoc
  66. 0 { → supplementary}/23f-data_formats-arc.asciidoc
  67. 0 { → supplementary}/23g-other_datasets_on_the_web.asciidoc
  68. 0 { → supplementary}/23h-notes_for_chimpmark.asciidoc
  69. 0 { → supplementary}/25c-acquiring_a_hadoop_cluster.asciidoc
  70. 0 { → supplementary}/25h-TODO.asciidoc
  71. 0 { → supplementary}/88-storm-junk_ignore.asciidoc
  72. 0 { → supplementary}/88-storm-lifecycle_of_a_record.asciidoc
500 00-preface.asciidoc
@@ -1,3 +1,503 @@
+// :author: Philip (flip) Kromer
+// :doctype: book
+// :toc:
+// :icons:
+// :lang: en
+// :encoding: utf-8
+
[[preface]]
== Preface
+
+image::images/front_cover.jpg[Front Cover]
+
+=== Mission Statement ===
+
+Big Data for Chimps will:
+
+1. Explain a practical, actionable view of big data, centered on tested best practices, and give readers street-fighting smarts with Hadoop.
+2. Give readers a useful conceptual idea of big data: in its simplest form, big data is a small cluster of well-tagged information that sits upon a central pivot and can be manipulated through various shifts and rotations to deliver insights ("Insight is data in context"). Key to understanding big data is scalability: infinite amounts of data can rest upon infinite pivot points. (Flip - is that accurate, or would you say there's just one central pivot, like a Rubik's Cube?)
+3. Finally, contain examples with real data and real problems that bring the concepts and their business applications to life.
+
+=== Hello, Early Releasers ===
+
+Hello and thanks, courageous and farsighted early released-to'er! We want to make sure the book delivers value to you now, and rewards your early confidence by becoming a book you're proud to own.
+
+==== Our Questions for You ====
+
+* The rule of thumb I'm using on introductory material is "If it's well-covered on the internet, leave it out". It's annoying when tech books give a topic the bus-tour-of-London ("On your window to the left is the outside of the British Museum!") treatment, but you should never find yourself completely stranded. Please let me know if that's the case.
+* Analogies: We'll be accompanied on part of our journey by Chimpanzee and Elephant, whose adventures are surprisingly relevant to understanding the internals of Hadoop. I don't want to waste your time laboriously remapping those adventures back to the problem at hand, but I definitely don't want to get too cute with the analogy. Again, please let me know if I err on either side.
+
+==== Chimpanzee and Elephant ====
+
+Starting with Chapter 2, you'll meet the zealous members of the Chimpanzee and Elephant Computing Company. Elephants have prodigious memories and move large heavy volumes with ease. They'll give you a physical analogue for using relationships to assemble data into context, and help you understand what's easy and what's hard in moving around massive amounts of data. Chimpanzees are clever but can only think about one thing at a time. They'll show you how to write simple transformations with a single concern and how to analyze petabytes of data with no more than megabytes of working space.
+
+Together, they'll equip you with a physical metaphor for how to work with data at scale.
+
+
+==== What's Covered in This Book? ====
+
+///Revise each chapter summary into paragraph form, as you've done for Chapter 1. This can stay in the final book. Amy////
+
+1. *First Exploration*:
+
+Objective: Show you something you couldn't do without Hadoop, something you couldn't do any other way. Your mind should be blown, and when you're slogging through the data munging chapter you should think back to this and remember why you started this mess in the first place.
+
+A walkthrough of a problem you'd use Hadoop to solve, showing the workflow and thought process. Hadoop asks you to write code poems that compose what we'll call _transforms_ (process records independently) and _pivots_ (restructure data).
+
+2. *Simple Transform*:
+
+Chimpanzee and Elephant are hired to translate the works of Shakespeare into every language; you'll take over the task of translating text to Pig Latin. This is an "embarrassingly parallel" problem, so we can learn the mechanics of launching a job and gain a coarse understanding of HDFS without having to think too hard (a bare-bones mapper sketch follows this list).
+ - Chimpanzee and Elephant start a business
+ - Pig Latin translation
+ - Your first job: test at commandline
+ - Run it on cluster
+ - Input Splits
+ - Why Hadoop I: Simple Parallelism
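+
+To make the "embarrassingly parallel" claim concrete, here is a minimal sketch (not the book's Wukong code, just an illustration) of a Pig Latin translator written as a bare Hadoop Streaming mapper: it reads lines on STDIN and writes translated lines to STDOUT, so Hadoop can run as many copies of it as there are input splits, each working on its own slice of the text.
+
+[source,ruby]
+----
+#!/usr/bin/env ruby
+# Hypothetical Pig Latin mapper for Hadoop Streaming: one input line in, one translated line out.
+# A Streaming mapper is just a program that reads STDIN and writes STDOUT, which is why this
+# problem parallelizes so easily: each mapper handles its own input split independently.
+
+def pig_latin(word)
+  return word unless word =~ /\A[A-Za-z]+\z/        # pass punctuation and numbers through untouched
+  onset, rest = word.match(/\A([^aeiouAEIOU]*)(.*)\z/).captures
+  onset.empty? ? "#{word}way" : "#{rest}#{onset.downcase}ay"
+end
+
+STDIN.each_line do |line|
+  puts line.chomp.split(/(\W+)/).map { |token| pig_latin(token) }.join
+end
+----
+
+You can test the same script at the commandline before it ever touches a cluster, e.g. `cat some_shakespeare.txt | ./pig_latin_mapper.rb` (file and script names here are hypothetical), and it behaves identically when Hadoop runs it over the full corpus.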
+
+3. *Transform-Pivot Job*:
+
+C&E help SantaCorp optimize the Christmas toymaking process, demonstrating the essential problem of data locality (the central challenge of Big Data). We'll follow along with a job requiring map and reduce, and learn a bit more about Wukong (a Ruby-language framework for Hadoop).
+ - Locality: the central challenge of distributed computing
+ - The Hadoop Haiku
+
+Note: we will almost certainly re-arrange the content in chapters 1-3.
+They will be centered around the following 3 explorations
+(in pig and wukong)
+- Pig Latin exploration (map only job)
+- Count UFO visits by month
+ - might visit the jobtracker
+- Counting Wikipedia pageviews by hour (or whatever; a count-by-key reducer sketch follows this note)
+ - should be same as UFO exploration, but:
+ - will actually require Hadoop
+ - also do a total sort at the end
+ - will visit the jobtracker
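+
+As a rough illustration of what the count-by-month and count-by-hour explorations above boil down to, here is a hypothetical count-by-key reducer for Hadoop Streaming in Ruby (paired with a mapper that emits `key<TAB>1` lines). Streaming hands the reducer its input sorted by key, so counting a group is just noticing when the key changes; the names are ours, not the book's.
+
+[source,ruby]
+----
+#!/usr/bin/env ruby
+# Hypothetical count-by-key reducer for Hadoop Streaming. Input arrives as tab-separated
+# "key<TAB>value" lines, sorted by key, so we only need to track the current group.
+
+current_key, count = nil, 0
+
+STDIN.each_line do |line|
+  key, _value = line.chomp.split("\t", 2)
+  if key != current_key
+    puts [current_key, count].join("\t") unless current_key.nil?
+    current_key, count = key, 0
+  end
+  count += 1
+end
+puts [current_key, count].join("\t") unless current_key.nil?   # don't forget the last group
+----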
+
+By this point in the book you should:
+- Have your mind blown
+- See some compelling enough data and a compelling enough question, and a wukong job that answers that question using only a mapper
+- See some compelling enough data and a compelling enough question which requires a map and reduce job, written in both pig and wukong
+- believe the mapreduce story, i.e. you know, in general, the high-level conceptual mechanics of a mapreduce job
+
+You should have:
+- whimsical & concrete explanations of mapreduce, what’s happening as a job is born and run, and HDFS
+
+You shouldn’t have:
+- a lot of words in pig
+- a good feel for how to deftly use wukong yet
+
+5. *The Hadoop Toolset and Other Practical Matters*
+ - toolset overview
+- It's a necessarily polyglot sport
+- Pig is a language that excels at describing dataflows
+- We think you are doing it wrong if you are not using:
+- a declarative orchestration language, plus a high-level scripting language for the dirty stuff (e.g. parsing, contacting external APIs, etc.)
+- UDFs (without saying UDFs) are for accessing a Java-native library (e.g. geospatial libraries), for when you really care about performance, for gifting Pig with a new ability, custom loaders, etc.
+- There are a lot of tools and they all have merits: Hive, Pig, Cascading, Scalding, Wukong, MrJob, R, Julia (with your eyes open), Crunch. There aren't others that we would recommend for production use, although we see enough momentum from Impala and Spark that you can adopt them with confidence that they will mature.
+ - launching and debugging jobs
+ - overview of Wukong
+ - overview of Pig
+
+6. *Fundamental Data Operations in Hadoop*
+
+Here's the stuff you'd like to be able to do with data, in Wukong and in Pig:
+ - Foreach/filter operations (messing around inside a record)
+ - reading data (brief practical directions on the level of “this is what you type in”)
+ - pig schema
+ - wukong model
+ - loading TSV
+- loading generic JSON
+- storing JSON
+- loading schematized JSON
+- loading parquet or Trevni
+- (Reference the section on working with compressed files; call back to the points about splitability and performance/size tradeoffs)
+- TSV, JSON, not XML; Protobufs, Thrift, Avro; Trevni, Parquet; Sequence Files; HAR
+- compression: gz, bz2, snappy, LZO
+ - subsetting your data
+- limit
+- filter
+- sample
+- using a hash digest function to take a signature
+- top k and reservoir sampling (a reservoir-sampling sketch follows this list)
+- refer to subuniverse which is probably elsewhere
+- group
+- join
+- ??cogroup?? (does this go with group? Does it go anywhere?)
+- sort and friends: cross, cube
+- total sort
+- partitioner
+- basic UDFs
+- ?using ruby or python within a pig dataflow?
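+
+Since "top k and reservoir sampling" is easy to say and easy to get subtly wrong, here is a minimal reservoir-sampling sketch (Algorithm R) in plain Ruby, the kind of helper you might drop into a mapper or a UDF; the function name and interface are illustrative, not from the book's code.
+
+[source,ruby]
+----
+# Keep a fair sample of k records from a stream you only get to see once (Algorithm R).
+# Every record ends up in the reservoir with probability k/n, no matter how long the stream is.
+def reservoir_sample(enum, k, rng: Random.new)
+  reservoir = []
+  enum.each_with_index do |record, idx|
+    if idx < k
+      reservoir << record
+    else
+      j = rng.rand(idx + 1)            # uniform integer in 0..idx
+      reservoir[j] = record if j < k   # replace an old pick with probability k/(idx+1)
+    end
+  end
+  reservoir
+end
+
+# e.g. reservoir_sample(STDIN.each_line, 1_000) keeps a fair 1,000-line sample of any input.
+----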
+
+7. *Filesystem Mojo and `cat` herding*
+
+ - dumping, listing, moving and manipulating files on the HDFS and local filesystems
+ - total sort
+ - transformations from the commandline (grep, cut, wc, etc)
+ - pivots from the commandline (head, sort, etc)
+ - commandline workflow tips
+ - advanced hadoop filesystem (chmod, setrep, fsck)
+
+8. *Intro to Storm+Trident*
+
+9. *Statistics*:
+
+ - (this is first deep experience with Storm+Trident)
+ - Summarizing: Averages, Percentiles, and Normalization
+ - running / windowed stream summaries
+ - make a "SummarizingTap" trident operation that collects {Sum Count Min Max Avg Stddev SomeExampleValuesReservoirSampled} (fill in the details of what exactly this means)
+ - also, maybe: Median+Deciles, Histogram
+ - understand the flow of data going on in preparing such an aggregate, by either making sure the mechanics of working with Trident don't overwhelm that or by retracing the story of records in an aggregation
+ - you need a group operation -> everything in a group goes to exactly one executor, on exactly one machine; the aggregator hits everything in the group
+ - combiner-aggregators (in particular) do some aggregation beforehand and send an intermediate aggregation to the executor that hosts the group operation
+ - by default, always use persistent aggregate until we find out why you wouldn’t
+
+ - (BUBBLE) highlight the corresponding map/reduce dataflow and illuminate the connection
+ - (BUBBLE) Median / calculation of quantiles at large enough scale that doing so is hard
+ - (in next chapter we can do histogram)
+ - Use a sketching algorithm to get an approximate but distributed answer to a holistic aggregation problem eg most frequent elements
+ - Rolling timeseries averages
+ - Sampling responsibly: it's harder and more important than you think
+ - consistent sampling using hashing (sketched below, after this outline)
+ - don’t use an RNG
+ - appreciate that external data sources may have changed
+ - reservoir sampling
+ - connectivity sampling (BUBBLE)
+ - subuniverse sampling (LOC?)
+ - Statistical aggregates and the danger of large numbers
+ - numerical stability
+ - overflow/underflow
+ - working with distributions at scale
+ - your intuition is often incomplete
+ - with trillions of things, one-in-a-billion events happen thousands of times
+
+ - weather temperature histogram in streaming fashion
+
+ - approximate distinct counts (using HyperLogLog)
+ - approximate percentiles (based on quantile digest)
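+
+The "consistent sampling using hashing" bullet deserves a concrete picture: instead of flipping an RNG coin per record, hash a stable key and keep the record when the hash falls below a threshold, so every run, and every person who re-pulls the data, selects exactly the same subset. A hypothetical sketch:
+
+[source,ruby]
+----
+require 'digest/md5'
+
+SAMPLE_SPACE = 2**32  # we use the first 32 bits of the digest as a bucket number
+
+# Keep roughly `rate` of all records, chosen by hashing a stable key rather than by an RNG.
+# Sampling on, say, a visitor id keeps *all* records for the chosen visitors, which is what
+# makes joins against the sample possible later.
+def keep_in_sample?(key, rate)
+  bucket = Digest::MD5.hexdigest(key.to_s)[0, 8].to_i(16)
+  bucket < rate * SAMPLE_SPACE
+end
+
+# e.g. keep_in_sample?(visitor_id, 0.01) keeps one visitor in a hundred, consistently.
+----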
+
+10. *Time Series and Event Log Processing*:
+ - Parsing logs and using regular expressions with Hadoop
+ - logline model
+ - regexp to match lines, highlighting this as a parser pattern (a hypothetical sketch follows this list)
+ - reinforce the source blob -> source model -> domain model practice
+ - Histograms and time series of pageviews using Hadoop
+ -
+ - sessionizing
+ - flow chart throughout site?
+ - "n-views": pages viewed in sequence
+ - ?? Audience metrics:
+ - make sure that this serves the later chapter with the live recommender engine (lambda architecture)
+ - Geolocate visitors based on IP with Hadoop
+ - use World Cup data?
+ - demonstrate using lookup table,
+ - explain it as a range query
+ - use a mapper-only (replicated) join -- explain why using that (small with big) but don't explain what it's doing (will be covered later)
+ - (Ab)Using Hadoop to stress-test your web server
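+
+The "logline model" and "regexp to match lines" items above are easier to talk about with a concrete, if hypothetical, sketch: one regular expression that matches an Apache-style combined log line plus a tiny model object to hold the named fields. The field names are ours; the point is the parser pattern and the source blob -> source model discipline.
+
+[source,ruby]
+----
+# Hypothetical logline parser: a regexp with named captures plus a plain model object.
+LOGLINE_RE = %r{\A
+  (?<ip>\S+)\s+\S+\s+\S+\s+                  # client ip, identd, user
+  \[(?<timestamp>[^\]]+)\]\s+                # [10/Oct/2000:13:55:36 -0700]
+  "(?<method>\S+)\s+(?<path>\S+)[^"]*"\s+    # "GET /some/path HTTP/1.0"
+  (?<status>\d{3})\s+(?<bytes>\S+)           # status code and response size
+}x
+
+Logline = Struct.new(:ip, :timestamp, :method, :path, :status, :bytes)
+
+def parse_logline(line)
+  m = LOGLINE_RE.match(line) or return nil   # bad records come back as nil; count them!
+  Logline.new(m[:ip], m[:timestamp], m[:method], m[:path], m[:status].to_i, m[:bytes])
+end
+----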
+
+Exercise: what predicts the team a country will root for next? In particular: if say Mexico knocks out Greece, do Greeks root for, or against, Mexico in general?
+
+11. *Geographic Data*:
+- Spatial join of points with points (find all UFO sightings near airports)
+  - map points to a grid cell in the mapper
+  - truncate at a certain zoom level (explain how to choose the zoom level)
+  - must send points to reducers for their own grid key and also its neighbors (9 total squares)
+  - perhaps be clever about not having to use all 9 quad-grid neighbors: partition on a grid size more fine-grained than your original one and use that to send points only to the pertinent grid-cell reducers
+  - perhaps generate the four points that are x away from you and use their quad cells
+  - in the reducer, do point-by-point comparisons
+  - *maybe* a secondary sort???
+- Geospatial data model, i.e. the terms and fields that you use in, e.g., GeoJSON
+  - we choose X; we want the focus to be on data science, not on GIS
+  - still have to explain ‘feature’, ‘region’, ‘latitude’, ‘longitude’, etc.
+- Decomposing a map into a quad-cell mapping at a constant zoom level (a quadkey sketch follows this outline)
+  - mapper input: `<name of region, GeoJSON region boundary>`
+  - Goal 1: have a mapping from region -> quad cells it covers
+  - Goal 2: have a mapping from quad key to the partial GeoJSON objects on it
+  - mapper output: `[thing, quadkey]` and `[quadkey, list of region ids, hash of region ids to GeoJSON region boundaries]`
+- Spatial join of points with regions, e.g. what congressional district are you in?
+  - in the mapper for points, emit the truncated quad key and the rest of the quad key; just stream the regions through (with the result from the prior exploration)
+  - a reducer has a quadcell, all points that lie within that quadcell, and all regions (truncated) that lie on that quadcell; do a brute-force search for the regions the points lie on
+- Nearness query
+  - suppose the set of items you want to find nearness to is not huge
+  - produce the Voronoi diagrams
+- Decomposing a map into a quad-cell mapping at multiple zoom levels
+  - in particular, use Voronoi regions to show multi-scale decomposition
+  - re-do the spatial join with Voronoi cells in multi-scale fashion (fill in details later)
+  - framing the problem (NYC vs Pacific Ocean)
+  - discuss how, given a global set of features, to decompose into a multi-scale grid representation
+- Other mechanics of working with geo data
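+
+To ground the quad-cell decomposition above, here is a small sketch (not from the book's code) of the standard web-map quadkey computation in Ruby: project a point to tile coordinates at a zoom level, then interleave the x and y bits into a string of digits 0-3. Points that share a quadkey prefix share a grid cell, which is what lets the mapper route nearby points to the same reducer.
+
+[source,ruby]
+----
+# Map a (longitude, latitude) point to its quadkey at the given zoom level, using the
+# common web-map tile scheme (Web Mercator projection).
+def quadkey(longitude, latitude, zoom)
+  tiles   = 2**zoom
+  tile_x  = ((longitude + 180.0) / 360.0 * tiles).floor
+  lat_rad = latitude * Math::PI / 180.0
+  tile_y  = ((1.0 - Math.log(Math.tan(lat_rad) + 1.0 / Math.cos(lat_rad)) / Math::PI) / 2.0 * tiles).floor
+  zoom.downto(1).map do |z|
+    mask  = 1 << (z - 1)
+    digit = 0
+    digit += 1 if (tile_x & mask) != 0
+    digit += 2 if (tile_y & mask) != 0
+    digit
+  end.join
+end
+
+# quadkey(-97.74, 30.27, 7) # => "0231301" (a 7-character cell around Austin, TX)
+----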
+
+12. *Conceptual Model for Data Analysis*
+
+See bottom
+13. *Data Munging (Semi-Structured Data)*:
+
+Welcome to chapter 13. This is some f'real shit, yo.
+
+Wiki pageviews - String encoding and other bullshit
+Airport data - Reconciling two *mostly* agreeing datasets
+Something that has errors (SW Kid) - dealing with bad records
+Weather Data - Parsing a packed flat file
+ - bear witness: explain that you DID have to temporarily become an amateur meteorologist, and had to write code to work with that many fields.
+- when your schema is so complicated, it needs to be automated, too.
+- join hell, when your keys change over time
+
+Data formats
+ - airing of grievances on XML
+ - airing of grievances on CSV
+ - don’t quote, escape
+ - the only 3 formats you should use, and when to use them
+
+- Just do a data munging project from beginning to end that wasn’t too horrible
+- Talk about the specific strategies and tactics
+ - source blob to source domain object, source domain object to business object; e.g. you want your initial extraction into a model that closely mirrors the source domain data format, mainly because you do not want to mix your extraction logic and your business logic (extraction logic will pollute the business-object code). Also, you will end up building the wrong model for the business object, i.e. it will look like the source domain.
+
+Airport data - chief challenge is reconciling data sets, dealing with conflicting errors
+
+The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
+ - Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
+ - Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.
+ - US Commercial Airline Flights: every commercial airline flight since 1987
+ - Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
+ - "Star Wars Kid" weblogs: large collection of apache webserver logs from a popular internet site (Andy Baio's waxy.org).
+
+13. *Machine Learning without Grad School*: We'll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays using
+ - Naive Bayes
+ - Logistic Regression
+ - Random Forest (using Mahout)
+ We'll equip you with a picture of how they work, but won't go into the math of how or why. We will show you how to choose a method, and how to cheat to win.
+14. *Full Application: Regional Flavor*
+
+15. *Hadoop Native Java API*
+ - don't
+
+16. *Advanced Pig*
+ - Specialized joins that can dramatically speed up (or make feasible) your data transformations
+ - why algebraic UDFs are awesome and how to be algebraic
+ - Custom Loaders
+ - Performance efficiency and tunables
+ - using a filter after a cogroup will get pushed up by Pig, sez Jacob
+
+
+17. *Data Modeling for HBase-style Database*
+
+17. *Hadoop Internals*
+
+ - What happens when a job is launched
+ - A shallow dive into the HDFS
+
+===== HDFS
+
+Lifecycle of a File:
+
+* What happens as the Namenode and Datanode collaborate to create a new file.
+* How that file is replicated to and acknowledged by other Datanodes.
+* What happens when a Datanode goes down or the cluster is rebalanced.
+* Briefly, the S3 DFS facade // (TODO: check if HFS?).
+
+===== Hadoop Job Execution
+
+* Lifecycle of a job at the client level including figuring out where all the source data is; figuring out how to split it; sending the code to the JobTracker, then tracking it to completion.
+* How the JobTracker and TaskTracker cooperate to run your job, including: the distinction between Job, Task and Attempt; how each TaskTracker obtains its Attempts and dispatches progress and metrics back to the JobTracker; how Attempts are scheduled, including what happens when an Attempt fails; speculative execution; ________; Split.
+* How the TaskTracker child and the Datanode cooperate to execute an Attempt, including: what a child process is, making clear the distinction between TaskTracker and child process.
+* Briefly, how the Hadoop Streaming child process works.
+
+==== Skeleton: Map-Reduce Internals
+
+* How the mapper and Datanode handle record splitting and how and when the partial records are dispatched.
+* The mapper sort buffer and spilling to disk (maybe here or maybe later, the I/O.record.percent).
+* Briefly note that data is not sent from mapper-to-reducer using HDFS and so you should pay attention to where you put the Map-Reduce scratch space and how stupid it is about handling an overflow volume.
+* Briefly that combiners are a thing.
+* Briefly how records are partitioned to reducers and that custom partitioners are a thing (a partitioner sketch follows this list).
+* How the Reducer accepts and tracks its mapper outputs.
+* Details of the merge/sort (shuffle and sort), including the relevant buffers and flush policies and why it can skip the last merge phase.
+* (NOTE: Secondary sort and so forth will have been described earlier.)
+* Delivery of output data to the HDFS and commit whether from mapper or reducer.
+* Highlight the fragmentation problem with map-only jobs.
+* Where memory is used, in particular, mapper-sort buffers, both kinds of reducer-merge buffers, application internal buffers.
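+
+For the partitioning bullet above, a sketch of what the default behavior boils down to (an illustration in Ruby, not Hadoop's Java source): the reducer index is a pure function of the key, so every record with the same key lands on the same reducer, and a custom partitioner is simply a different pure function.
+
+[source,ruby]
+----
+# Default-style partitioning: hash the key, mask off the sign bit, mod by the reducer count.
+def default_partition(key, num_reducers)
+  (key.hash & 0x7fffffff) % num_reducers
+end
+
+# A hypothetical custom partitioner: route weather records by station id (the part of the
+# key before the ':') so that each reducer sees whole stations, not arbitrary slices of them.
+def partition_by_station(record_key, num_reducers)
+  station_id = record_key.split(':').first
+  (station_id.hash & 0x7fffffff) % num_reducers
+end
+----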
+
+18. *Hadoop Tuning*
+ - Tuning for the Wise and Lazy
+ - Tuning for the Brave and Foolish
+ - The USE Method for understanding performance and diagnosing problems
+
+19. *Storm+Trident Internals*
+
+* Understand the lifecycle of a Storm tuple, including spout, tupletree and acking.
+* (Optional but not essential) Understand the details of its reliability mechanism and how tuples are acked.
+* Understand the lifecycle of partitions within a Trident batch and thus, the context behind partition operations such as Apply or PartitionPersist.
+* Understand Trident’s transactional mechanism, in the case of a PartitionPersist.
+* Understand how Aggregators, Statemap and the Persistence methods combine to give you _exactly once_ processing with transactional guarantees. Specifically, what an OpaqueValue record will look like in the database and why.
+* Understand how the master batch coordinator and spout coordinator for the Kafka spout in particular work together to uniquely and efficiently process all records in a Kafka topic.
+* One specific: how Kafka partitions relate to Trident partitions.
+
+20. *Storm+Trident Tuning*
+
+23. *Overview of Datasets and Scripts*
+ - Datasets
+ - Wikipedia (corpus, pagelinks, pageviews, dbpedia, geolocations)
+ - Airline Flights
+ - UFO Sightings
+ - Global Hourly Weather
+ - Waxy.org "Star Wars Kid" Weblogs
+ - Scripts
+
+24. *Cheatsheets*:
+ - Regular Expressions
+ - Sizes of the Universe
+ - Hadoop Tuning & Configuration Variables
+
+
+Chopping block
+
+1. Interlude I: *Organizing Data*:
+ - How to design your data models
+ - How to serialize their contents (orig, scratch, prod)
+ - How to organize your scripts and your data
+
+2. *Graph Processing*:
+ - Graph Representations
+ - Community Extraction: Use the page-to-page links in Wikipedia to identify similar documents
+ - Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages
+
+3. *Text Processing*: We'll show how to combine powerful existing libraries with hadoop to do effective text handling and Natural Language Processing:
+ - Indexing documents
+ - Tokenizing documents using Lucene
+ - Pointwise Mutual Information
+ - K-means Clustering
+
+4. Interlude II: *Best Practices and Pedantic Points of style*
+ - Pedantic Points of Style
+ - Best Practices
+ - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
+ - Why Hadoop
+ - robots are cheap, people are important
+
+
+17. Interlude II: *Best Practices and Pedantic Points of style*
+ - Pedantic Points of Style
+ - Best Practices
+ - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
+ - Why Hadoop
+ - robots are cheap, people are important
+
+14. Interlude I: *Organizing Data*:
+ - How to design your data models
+ - How to serialize their contents (orig, scratch, prod)
+ - How to organize your scripts and your data
+
+
+==== Hadoop ====
+
+In Doug Cutting's words, Hadoop is the "kernel of the big-data operating system". It's the dominant batch-processing solution, has both commercial enterprise support and a huge open source community, runs on every platform and cloud, and there are no signs any of that will change in the near term.
+
+The code in this book will run unmodified on your laptop computer and on an industrial-strength Hadoop cluster. (Of course you will need to use a reduced data set for the laptop). You do need a Hadoop installation of some sort -- Appendix (TODO: ref) describes your options, including instructions for running hadoop on a multi-machine cluster in the public cloud -- for a few dollars a day you can analyze terabyte-scale datasets.
+
+==== A Note on Ruby and Wukong ====
+
+We've chosen Ruby for two reasons. First, it's one of several high-level languages (along with Python, Scala, R and others) that have both excellent Hadoop frameworks and widespread support. More importantly, Ruby is a very readable language -- the closest thing to practical pseudocode we know. The code samples provided should map cleanly to those high-level languages, and the approach we recommend is available in any language.
+
+In particular, we've chosen the Ruby-language Wukong framework. We're the principal authors, but it's open-source and widely used. It's also the only framework I'm aware of that runs on both Hadoop and Storm+Trident.
+
+
+
+==== Helpful Reading ====
+
+* Hadoop: The Definitive Guide by Tom White is a must-have. Don't try to absorb it whole -- the most powerful parts of Hadoop are its simplest parts -- but you'll refer to it often as your applications reach production.
+* Hadoop Operations by Eric Sammer -- hopefully you can hand this to someone else, but the person who runs your Hadoop cluster will eventually need this guide to configuring and hardening a large production cluster.
+* "Big Data: principles and best practices of scalable realtime data systems" by Nathan Marz
+* ...
+
+
+==== What This Book Does Not Cover ====
+
+We are not currently planning to cover Hive. The Pig scripts will translate naturally for folks who are already familiar with Hive. There will be a brief section explaining why you might choose Hive over Pig, and why I chose Pig over Hive. If there's popular pressure I may add a "translation guide".
+
+This book picks up where the internet leaves off -- apart from cheatsheets at the end of the book, I'm not going to spend any real time on information well-covered by basic tutorials and core documentation. Other things we do not plan to include:
+
+* Installing or maintaining Hadoop
+* We will cover how to design an HBase schema, but not how to use HBase as a _database_
+* Other map-reduce-like platforms (disco, spark, etc), or other frameworks (MrJob, Scalding, Cascading)
+* At a few points we'll use Mahout, R, D3.js and Unix text utils (cut/wc/etc), but only as tools for an immediate purpose. I can't justify going deep into any of them; there are whole O'Reilly books on each.
+
+==== Feedback ====
+
+* The http://github.com/infochimps-labs/big_data_for_chimps[source code for the book] -- all the prose, images, the whole works -- is on github at `http://github.com/infochimps-labs/big_data_for_chimps`.
+* Contact us! If you have questions, comments or complaints, the http://github.com/infochimps-labs/big_data_for_chimps/issues[issue tracker] is the best forum for sharing those. If you'd like something more direct, please email meghan@oreilly.com (the ever-patient editor) and flip@infochimps.com (your eager author). Please include both of us.
+
+OK! On to the book. Or, on to the introductory parts of the book and then the book.
+
+[[about]]
+=== About ===
+
+[[about_coverage]]
+==== What this book covers ====
+
+'Big Data for Chimps' shows you how to solve important hard problems using simple, fun, elegant tools.
+
+Geographic analysis is an important hard problem. To understand a disease outbreak in Europe, you need to see the data from Zurich in the context of Paris, Milan, Frankfurt and Munich; but to understand the situation in Munich requires context from Zurich, Prague and Vienna; and so on. How do you understand the part when you can't hold the whole world in your hand?
+
+Finding patterns in massive event streams is an important hard problem. Most of the time, there aren't earthquakes -- but the patterns that will let you predict one in advance lie within the data from those quiet periods. How do you compare the trillions of subsequences in billions of events, each to each other, to find the very few that matter? Once you have those patterns, how do you react to them in real-time?
+
+We've chosen case studies anyone can understand that generalize to problems like those and the problems you're looking to solve. Our goal is to equip you with:
+
+* How to think at scale: a deep understanding of how to break a problem into efficient data transformations, and of how data must flow through the cluster to effect those transformations.
+* Detailed example programs applying Hadoop to interesting problems in context
+* Advice and best practices for efficient software development
+
+All of the examples use real data, and describe patterns found in many problem domains:
+
+* Statistical summaries
+* Identifying patterns and groups in the data
+* Searching, filtering and herding records in bulk
+* Advanced queries against spatial or time-series data sets
+
+The emphasis on simplicity and fun should make this book especially appealing to beginners, but this is not an approach you'll outgrow. We've found it's the most powerful and valuable approach for creative analytics. One of our maxims is "Robots are cheap, Humans are important": write readable, scalable code now and find out later whether you want a smaller cluster. The code you see is adapted from programs we write at Infochimps to solve enterprise-scale business problems, and these simple high-level transformations (most of the book) plus the occasional Java extension (chapter XXX) meet our needs.
+
+Many of the chapters have exercises included. If you're a beginning user, I highly recommend you work out at least one exercise from each chapter. Deep learning will come less from having the book in front of you as you _read_ it than from having the book next to you while you *write* code inspired by it. There are sample solutions and result datasets on the book's website.
+
+Feel free to hop around among chapters; the application chapters don't have large dependencies on earlier chapters.
+
+
+[[about_is_for]]
+==== Who This Book Is For ====
+
+We'd like for you to be familiar with at least one programming language, but it doesn't have to be Ruby. Familiarity with SQL will help a bit, but isn't essential.
+
+Most importantly, you should have an actual project in mind that requires a big data toolkit to solve -- a problem that requires scaling out across multiple machines. If you don't already have a project in mind but really want to learn about the big data toolkit, take a quick browse through the exercises. At least a few of them should have you jumping up and down with excitement to learn this stuff.
+
+[[about_is_not_for]]
+==== Who This Book Is Not For ====
+
+This is not "Hadoop the Definitive Guide" (that's been written, and well); this is more like "Hadoop: a Highly Opinionated Guide". The only coverage of how to use the bare Hadoop API is to say "In most cases, don't". We recommend storing your data in one of several highly space-inefficient formats and in many other ways encourage you to willingly trade a small performance hit for a large increase in programmer joy. The book has a relentless emphasis on writing *scalable* code, but no content on writing *performant* code beyond the advice that the best path to a 2x speedup is to launch twice as many machines.
+
+That is because for almost everyone, the cost of the cluster is far less than the opportunity cost of the data scientists using it. If you have not just big data but huge data -- let's say somewhere north of 100 terabytes -- then you will need to make different tradeoffs for jobs that you expect to run repeatedly in production.
+
+The book does have some content on machine learning with Hadoop, on provisioning and deploying Hadoop, and on a few important settings. But it does not cover advanced algorithms, operations or tuning in any real depth.
+
+[[about_how_written]]
+==== How this book is being written ====
+
+I plan to push chapters to the publicly-viewable http://github.com/infochimps-labs/big_data_for_chimps['Hadoop for Chimps' git repo] as they are written, and to post them periodically to the http://blog.infochimps.com[Infochimps blog] after minor cleanup.
+
+We really mean it about the git social-coding thing -- please https://github.com/blog/622-inline-commit-notes[comment] on the text, http://github.com/infochimps-labs/big_data_for_chimps/issues[file issues] and send pull requests. However! We might not use your feedback, no matter how dazzlingly cogent it is; and while we are soliciting comments from readers, we are not seeking content from collaborators.
+
+
+==== How to Contact Us ====
+
+Please address comments and questions concerning this book to the publisher:
+
+O'Reilly Media, Inc.
+1005 Gravenstein Highway North
+Sebastopol, CA 95472
+(707) 829-0515 (international or local)
+
+To comment on or ask technical questions about this book, send email to bookquestions@oreilly.com
+
+To reach the authors:
+
+Flip Kromer is @mrflip on Twitter
+
+For comments or questions on the material, file a github issue at http://github.com/infochimps-labs/big_data_for_chimps/issues
500 00a-about.asciidoc
@@ -1,500 +0,0 @@
-// :author: Philip (flip) Kromer
-// :doctype: book
-// :toc:
-// :icons:
-// :lang: en
-// :encoding: utf-8
-
-image::images/front_cover.jpg[Front Cover]
-
-=== Mission Statement ===
-
-Big Data for Chimps will:
-
-1. Explain a practical, actionable view of big data, centered on tested best practices as well as give readers street fighting smarts with Hadoop
-2. Readers will also come away with a useful, conceptual idea of big data; big data in its simplest form is a small cluster of well-tagged information that sits upon a central pivot, and can be manipulated through various shifts and rotations with the purpose of delivering insights ("Insight is data in context"). Key to understanding big data is scalability: infinite amounts of data can rest upon infinite pivot points (Flip - is that accurate or would you say there's just one central pivot - like a Rubic's cube?)
-3. Finally, the book will contain examples with real data and real problems that will bring the concepts and applications for business to life.
-
-=== Hello, Early Releasers ===
-
-Hello and thanks, courageous and farsighted early released-to'er! We want to make sure the book delivers value to you now, and rewards your early confidence by becoming a book you're proud to own.
-
-==== Our Questions for You ====
-
-* The rule of thumb I'm using on introductory material is "If it's well-covered on the internet, leave it out". It's annoying when tech books give a topic the bus-tour-of-London ("On your window to the left is the outside of the British Museum!") treatment, but you should never find yourself completely stranded. Please let me know if that's the case.
-* Analogies: We'll be accompanied on part of our journey by Chimpanzee and Elephant, whose adventures are surprisingly relevant to understanding the internals of Hadoop. I don't want to waste your time laboriously remapping those adventures back to the problem at hand, but I definitely don't want to get too cute with the analogy. Again, please let me know if I err on either side.
-
-==== Chimpanzee and Elephant
-
-Starting with Chapter 2, you'll meet the zealous members of the Chimpanzee and Elephant Computing Company. Elephants have prodgious memories and move large heavy volumes with ease. They'll give you a physical analogue for using relationships to assemble data into context, and help you understand what's easy and what's hard in moving around massive amounts of data. Chimpanzees are clever but can only think about one thing at a time. They'll show you how to write simple transformations with a single concern and how to analyze a petabytes data with no more than megabytes of working space.
-
-Together, they'll equip you with a physical metaphor for how to work with data at scale.
-
-
-==== What's Covered in This Book? ====
-
-///Revise each chapter summary into paragraph form, as you've done for Chapter 1. This can stay in the final book. Amy////
-
-1. *First Exploration*:
-
-Objective: Show you a thing you couldn’t do without hadoop, you couldn’t do it any other way. Your mind should be blown and when you’re slogging through the data munging chapter you should think back to this and remember why you started this mess in the first place.
-
-A walkthrough of problem you'd use Hadoop to solve, showing the workflow and thought process. Hadoop asks you to write code poems that compose what we'll call _transforms_ (process records independently) and _pivots_ (restructure data).
-
-2. *Simple Transform*:
-
-Chimpanzee and Elephant are hired to translate the works of Shakespeare to every language; you'll take over the task of translating text to Pig Latin. This is an "embarrassingly parallel" problem, so we can learn the mechanics of launching a job and a coarse understanding of the HDFS without having to think too hard.
- - Chimpanzee and Elephant start a business
- - Pig Latin translation
- - Your first job: test at commandline
- - Run it on cluster
- - Input Splits
- - Why Hadoop I: Simple Parallelism
-
-3. *Transform-Pivot Job*:
-
-C&E help SantaCorp optimize the Christmas toymaking process, demonstrating the essential problem of data locality (the central challenge of Big Data). We'll follow along with a job requiring map and reduce, and learn a bit more about Wukong (a Ruby-language framework for Hadoop).
- - Locality: the central challenge of distributed computing
- - The Hadoop Haiku
-
-Note: we will almost certainly re-arrange the content in chapters 1-3.
-They will be centered around the following 3 explorations
-(in pig and wukong)
-- Pig Latin exploration (map only job)
-- Count UFO visits by month
- - might visit the jobtracker
-- Counting Wikipedia pageviews by hour (or whatever)
- - should be same as UFO exploration, but:
- - will actually require Hadoop
- - also do a total sort at the end
- - will visit the jobtracker
-
-By this point in the book you should:
-- Have your mind blown
-- See some compelling enough data and a compelling enough question, and a wukong job that answers that job by using only a mapper
-- see some compelling enough data and a compelling enough question, which requires a map and reduce job, written in both pig and wukong
-- believe the mapreduce story, i.e. you know, in general, the high-level conceptual mechanics of a mapreduce job
-
-You should have:
-- whimsical & concrete explanations of mapreduce, what’s happening as a job is born and run, and HDFS
-
-You shouldn’t have:
-a lot of words in pig
-a good feel for how to deftly use wukong yet
-
-5. *The Hadoop Toolset and Other Practical Matters*
- - toolset overview
-- It’s a necessarily polyglot sport
-- Pig is a language that excels at describing
-- we think you are doing it wrong if you are not using :
-- a declarative orchestration language, a high-level scripting language for the dirty stuff (e.g. parsing, contacting external apis, etc..)
-- udfs (without saying udfs) are for accessing a java-native library, e.g. geospacial libraries, when you really care about performance, to gift pig with a new ability, custom loaders, etc…
-- there are a lot of tools, they all have merits: Hive, Pig, Cascading, Scalding, Wukong, MrJob, R, Julia (with your eyes open), Crunch. There aren’t others that we would recommend for production use, although we see enough momentum from impala and spark that you can adopt them with confidence that they will mature.
- - launching and debugging jobs
- - overview of Wukong
- - overview of Pig
-
-6. Fundamental Data Operations in Hadoop*
-
-here’s the stuff you’d like to be able to do with data, in wukong and in pig
- - Foreach/filter operations (messing around inside a record)
- - reading data (brief practical directions on the level of “this is what you type in”)
- - pig schema
- - wukong model
- - loading TSV
-- loading generic JSON
-- storing JSON
-- loading schematized JSON
-- loading parquet or Trevni
-- (Reference the section on working with compressed files; call back to the points about splitability and performance/size tradeoffs)
-- TSV, JSON, not XML; Protobufs, Thrift, Avro; Trevni, Parquet; Sequence Files; HAR
-- compression: gz, bz2, snappy, LZO
- - subsetting your data
-- limit
-- filter
-- sample
-- using a hash digest function to take a signature
-- top k and reservoir sampling
-- refer to subuniverse which is probably elsewhere
-- group
-- join
-- ??cogroup?? (does this go with group? Does it go anywhere?)
-- sort, etc.. : cross cube
-- total sort
-- partitioner
-- basic UDFs
-- ?using ruby or python within a pig dataflow?
-
-7. *Filesystem Mojo and `cat` herding*
-
- - dumping, listing, moving and manipulating files on the HDFS and local filesystems
- - total sort
- - transformations from the commandline (grep, cut, wc, etc)
- - pivots from the commandline (head, sort, etc)
- - commandline workflow tips
- - advanced hadoop filesystem (chmod, setrep, fsck)
-
-8. *Intro to Storm+Trident*
-
-9. *Statistics*:
-
- - (this is first deep experience with Storm+Trident)
- - Summarizing: Averages, Percentiles, and Normalization
- - running / windowed stream summaries
- - make a "SummarizingTap" trident operation that collects {Sum Count Min Max Avg Stddev SomeExampleValuesReservoirSampled} (fill in the details of what exactly this means)
- - also, maybe: Median+Deciles, Histogram
- - understand the flow of data going on in preparing such an aggregate, by either making sure the mechanics of working with Trident don't overwhelm that or by retracing the story of records in an aggregation
- - you need a group operation -> means everything in group goes to exactly one executor, exactly one machine, aggregator hits everything in a group
-- combiner-aggregators (in particular), do some aggregation beforehand, and send an intermediate aggregation to the executor that hosts the group operation
- - by default, always use persistent aggregate until we find out why you wouldn’t
-
- - (BUBBLE) highlight the corresponding map/reduce dataflow and illuminate the connection
- - (BUBBLE) Median / calculation of quantiles at large enough scale that doing so is hard
- - (in next chapter we can do histogram)
- - Use a sketching algorithm to get an approximate but distributed answer to a holistic aggregation problem eg most frequent elements
- - Rolling timeseries averages
- - Sampling responsibly: it's harder and more important than you think
- - consistent sampling using hashing
- - don’t use an RNG
- - appreciate that external data sources may have changed
- - reservoir sampling
- - connectivity sampling (BUBBLE)
- - subuniverse sampling (LOC?)
- - Statistical aggregates and the danger of large numbers
- - numerical stability
- - overflow/underflow
- - working with distributions at scale
- - your intuition is often incomplete
- - with trillions of things, 1 in billion chance things happen thousands of times
-
- - weather temperature histogram in streaming fashion
-
-approximate distinct counts (using HyperLogLog)
-approximate percentiles (based on quantile digest)
-
-10. *Time Series and Event Log Processing*:
- - Parsing logs and using regular expressions with Hadoop
- - logline model
- - regexp to match lines, highlighting this as a parser pattern
- - reinforce the source blob -> source model -> domain model practice
- - Histograms and time series of pageviews using Hadoop
- -
- - sessionizing
- - flow chart throughout site?
- - "n-views": pages viewed in sequence
- - ?? Audience metrics:
- - make sure that this serves the later chapter with the live recommender engine (lambda architecture)
- - Geolocate visitors based on IP with Hadoop
- - use World Cup data?
- - demonstrate using lookup table,
- - explain it as a range query
- - use a mapper-only (replicated) join -- explain why using that (small with big) but don't explain what it's doing (will be covered later)
- - (Ab)Using Hadoop to stress-test your web server
-
-Exercise: what predicts the team a country will root for next? In particular: if say Mexico knocks out Greece, do Greeks root for, or against, Mexico in general?
-
-11. *Geographic Data*:
-Spatial join (find all UFO sightings near Airports) of points with points
-map points to grid cell in the mapper
-truncate at a certain zoom level (explain how to choose zoom level)
-must send points to reducers for own grid key and also neighbors (9 total squares).
-Perhaps, be clever about not having to use all 9 quad grid neighbors by partitioning on a grid size more fine-grained than your original one and then use that to send points only the pertinent grid cell reducers
-Perhaps generate the four points that are x away from you and use their quad cells.
-In the reducer, do point-by-point comparisons
-*Maybe* a secondary sort???
-Geospacial data model, i.e. the terms and fields that you use in, e.g. GeoJSON
-We choose X, we want the focus to be on data science not on GIS
-Still have to explain ‘feature’, ‘region’, ‘latitude’, ‘longitude’, etc…
-Decomposing a map into quad-cell mapping at constant zoom level
-mapper input:
-[name of region, GeoJSON region boundary]
-Goal 1: have a mapping from region -> quad cells it covers
-Goal 2: have a mapping from quad key to partial GeoJSON objects on it
-mapper output:
-[thing, quadkey]
-[quadkey, list of region ids, hash of region ids to GeoJSON region boundaries]
-Spacial join of points with regions, e.g. what congressional district are you in?
-in mapper for points emit truncated quad key, the rest of the quad key, just stream the regions through (with result from prior exploration)
-a reducer has quadcell, all points that lie within that quadcell, and all regions (truncated) that lie on that quadcell. Do a brute force search for the regions that the points lie on
-Nearness query
-suppose the set of items you want to find nearness to is not huge
-produce the voronoi diagrams
-Decomposing a map into quad-cell mapping at multiple zoom levels
-in particular, use voronoi regions to make show multi-scale decomposition
-Re-do spacial join with Voronoi cells in multi-scale fashion (fill in details later)
-Framing the problem (NYC vs Pacific Ocean)
-Discuss how, given a global set of features, to decompose into a multi-scale grid representation
-Other mechanics of working with geo data
-
-12. *Conceptual Model for Data Analysis*
-
-See bottom
-13. *Data Munging (Semi-Structured Data)*:
-
-Welcome to chapter to 13. This is some f'real shit, yo.
-
-Wiki pageviews - String encoding and other bullshit
-Airport data -Reconciling to *mostly* agreeing datasets
-Something that has errors (SW Kid) - dealing with bad records
-Weather Data - Parsing a flat pack file
- - bear witness, explain that you DID have to temporarily become an ameteur meteorologist, and had to write code to work with that many fields.
-- when your schema is so complicated, it needs to be automated, too.
-- join hell, when your keys change over time
-
-Data formats
- - airing of grievances on XML
- - airing of grievances on CSV
- - don’t quote, escape
- - the only 3 formats you should use, and when to use them
-
-- Just do a data munging project from beginning to end that wasn’t too horrible
-- Talk about the specific strategies and tactics
- - source blob to source domain object, source domain object to business object. e.g. you want your initial extraction into a model mirrors closely the source domain data format. Mainly because you do not want mix your extraction logic and business logic (extraction logic will pollute business objects code). Also, will end up building the wrong model for the business object, i.e. it will look like the source domain.
-
-Airport data - chief challenge is reconciling data sets, dealing with conflicting errors
-
-The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
- - Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
- - Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.
- - US Commercial Airline Flights: every commercial airline flight since 1987
- - Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
- - "Star Wars Kid" weblogs: large collection of apache webserver logs from a popular internet site (Andy Baio's waxy.org).
-
-13. *Machine Learning without Grad School*: We'll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays using
- - Naive Bayes
- - Logistic Regression
- - Random Forest (using Mahout)
- We'll equip you with a picture of how they work, but won't go into the math of how or why. We will show you how to choose a method, and how to cheat to win.
-14. *Full Application: Regional Flavor*
-
-15. *Hadoop Native Java API*
- - don't
-
-16. *Advanced Pig*
- - Specialized joins that can dramatically speed up (or make feasible) your data transformations
- - why algebraic UDFs are awesome and how to be algebraic
- - Custom Loaders
- - Performance efficiency and tunables
- - using a filter after a cogroup will get pushed up by Pig, sez Jacob
-
-
-17. *Data Modeling for HBase-style Database*
-
-17. *Hadoop Internals*
-
- - What happens when a job is launched
- - A shallow dive into the HDFS
-
-===== HDFS
-
-Lifecycle of a File:
-
-* What happens as the Namenode and Datanode collaborate to create a new file.
-* How that file is replicated to acknowledged by other Datanodes.
-* What happens when a Datanode goes down or the cluster is rebalanced.
-* Briefly, the S3 DFS facade // (TODO: check if HFS?).
-
-===== Hadoop Job Execution
-
-* Lifecycle of a job at the client level including figuring out where all the source data is; figuring out how to split it; sending the code to the JobTracker, then tracking it to completion.
-* How the JobTracker and TaskTracker cooperate to run your job, including: The distinction between Job, Task and Attempt., how each TaskTracker obtains its Attempts, and dispatches progress and metrics back to the JobTracker, how Attempts are scheduled, including what happens when an Attempt fails and speculative execution, ________, Split.
-* How TaskTracker child and Datanode cooperate to execute an Attempt, including; what a child process is, making clear the distinction between TaskTracker and child process.
-* Briefly, how the Hadoop Streaming child process works.
-
-==== Skeleton: Map-Reduce Internals
-
-* How the mapper and Datanode handle record splitting and how and when the partial records are dispatched.
-* The mapper sort buffer and spilling to disk (maybe here or maybe later, the I/O.record.percent).
-* Briefly note that data is not sent from mapper-to-reducer using HDFS and so you should pay attention to where you put the Map-Reduce scratch space and how stupid it is about handling an overflow volume.
-* Briefly that combiners are a thing.
-* Briefly how records are partitioned to reducers and that custom partitioners are a thing.
-* How the Reducer accepts and tracks its mapper outputs.
-* Details of the merge/sort (shuffle and sort), including the relevant buffers and flush policies and why it can skip the last merge phase.
-* (NOTE: Secondary sort and so forth will have been described earlier.)
-* Delivery of output data to the HDFS and commit whether from mapper or reducer.
-* Highlight the fragmentation problem with map-only jobs.
-* Where memory is used, in particular, mapper-sort buffers, both kinds of reducer-merge buffers, application internal buffers.
-
-18. *Hadoop Tuning*
- - Tuning for the Wise and Lazy
- - Tuning for the Brave and Foolish
- - The USE Method for understanding performance and diagnosing problems
-
-19. *Storm+Trident Internals*
-
-* Understand the lifecycle of a Storm tuple, including spout, tupletree and acking.
-* (Optional but not essential) Understand the details of its reliability mechanism and how tuples are acked.
-* Understand the lifecycle of partitions within a Trident batch and thus, the context behind partition operations such as Apply or PartitionPersist.
-* Understand Trident’s transactional mechanism, in the case of a PartitionPersist.
-* Understand how Aggregators, Statemap and the Persistence methods combine to give you _exactly once_ processing with transactional guarantees. Specifically, what an OpaqueValue record will look like in the database and why.
-* Understand how the master batch coordinator and spout coordinator for the Kafka spout in particular work together to uniquely and efficiently process all records in a Kafka topic.
-* One specific: how Kafka partitions relate to Trident partitions.
-
-20. *Storm+Trident Tuning*
-
-23. *Overview of Datasets and Scripts*
- - Datasets
- - Wikipedia (corpus, pagelinks, pageviews, dbpedia, geolocations)
- - Airline Flights
- - UFO Sightings
- - Global Hourly Weather
- - Waxy.org "Star Wars Kid" Weblogs
- - Scripts
-
-24. *Cheatsheets*:
- - Regular Expressions
- - Sizes of the Universe
- - Hadoop Tuning & Configuration Variables
-
-
-Chopping block
-
-1. Interlude I: *Organizing Data*:
- - How to design your data models
- - How to serialize their contents (orig, scratch, prod)
- - How to organize your scripts and your data
-
-2. *Graph Processing*:
- - Graph Representations
- - Community Extraction: Use the page-to-page links in Wikipedia to identify similar documents
- - Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages
-
-3. *Text Processing*: We'll show how to combine powerful existing libraries with hadoop to do effective text handling and Natural Language Processing:
- - Indexing documents
- - Tokenizing documents using Lucene
- - Pointwise Mutual Information
- - K-means Clustering
-
-4. Interlude II: *Best Practices and Pedantic Points of style*
- - Pedantic Points of Style
- - Best Practices
- - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
- - Why Hadoop
- - robots are cheap, people are important
-
-
-
-==== Hadoop ====
-
-In Doug Cutting's words, Hadoop is the "kernel of the big-data operating system". It's the dominant batch-processing solution, has both commercial enterprise support and a huge open source community, runs on every platform and cloud, and there are no signs any of that will change in the near term.
-
-The code in this book will run unmodified on your laptop computer and on an industrial-strength Hadoop cluster. (Of course you will need to use a reduced data set for the laptop). You do need a Hadoop installation of some sort -- Appendix (TODO: ref) describes your options, including instructions for running hadoop on a multi-machine cluster in the public cloud -- for a few dollars a day you can analyze terabyte-scale datasets.
-
-==== A Note on Ruby and Wukong ====
-
-We've chosen Ruby for two reasons. First, it's one of several high-level languages (along with Python, Scala, R and others) that have both excellent Hadoop frameworks and widespread support. Second, and more importantly, Ruby is a very readable language -- the closest thing to practical pseudocode we know. The code samples provided should map cleanly to those other high-level languages, and the approach we recommend is available in any language.
-
-In particular, we've chosen the Ruby-language Wukong framework. We're the principal authors, but it's open-source and widely used. It's also the only framework we're aware of that runs on both Hadoop and Storm+Trident.
-
-
-
-==== Helpful Reading ====
-
-* Hadoop: The Definitive Guide by Tom White is a must-have. Don't try to absorb it whole -- the most powerful parts of Hadoop are its simplest parts -- but you'll refer to it often as your applications reach production.
-* Hadoop Operations by Eric Sammer -- hopefully you can hand this to someone else, but the person who runs your Hadoop cluster will eventually need this guide to configuring and hardening a large production cluster.
-* "Big Data: principles and best practices of scalable realtime data systems" by Nathan Marz
-* ...
-
-
-==== What This Book Does Not Cover ====
-
-We are not currently planning to cover Hive. The Pig scripts will translate naturally for folks who are already familiar with Hive. There will be a brief section explaining why you might choose Hive over Pig, and why I chose Pig over Hive. If there's popular pressure I may add a "translation guide".
-
-This book picks up where the internet leaves off -- apart from cheatsheets at the end of the book, I'm not going to spend any real time on information well-covered by basic tutorials and core documentation. Other things we do not plan to include:
-
-* Installing or maintaining Hadoop
-* We will cover how to design an HBase schema, but not how to use HBase as a _database_
-* Other map-reduce-like platforms (Disco, Spark, etc.), or other frameworks (MrJob, Scalding, Cascading)
-* At a few points we'll use Mahout, R, D3.js and Unix text utils (cut/wc/etc), but only as tools for an immediate purpose. I can't justify going deep into any of them; there are whole O'Reilly books on each.
-
-==== Feedback ====
-
-* The http://github.com/infochimps-labs/big_data_for_chimps[source code for the book] -- all the prose, images, the whole works -- is on github at `http://github.com/infochimps-labs/big_data_for_chimps`.
-* Contact us! If you have questions, comments or complaints, the http://github.com/infochimps-labs/big_data_for_chimps/issues[issue tracker] is the best forum for sharing those. If you'd like something more direct, please email meghan@oreilly.com (the ever-patient editor) and flip@infochimps.com (your eager author). Please include both of us.
-
-OK! On to the book. Or, on to the introductory parts of the book and then the book.
-
-[[about]]
-=== About ===
-
-[[about_coverage]]
-==== What this book covers ====
-
-'Big Data for Chimps' shows you how to solve important hard problems using simple, fun, elegant tools.
-
-Geographic analysis is an important hard problem. To understand a disease outbreak in Europe, you need to see the data from Zurich in the context of Paris, Milan, Frankfurt and Munich; but to understand the situation in Munich requires context from Zurich, Prague and Vienna; and so on. How do you understand the part when you can't hold the whole world in your hand?
-
-Finding patterns in massive event streams is an important hard problem. Most of the time, there aren't earthquakes -- but the patterns that will let you predict one in advance lie within the data from those quiet periods. How do you compare the trillions of subsequences in billions of events, each to each other, to find the very few that matter? Once you have those patterns, how do you react to them in real-time?
-
-We've chosen case studies anyone can understand that generalize to problems like those and the problems you're looking to solve. Our goal is to equip you with:
-
-* How to think at scale -- a deep understanding of how to break a problem into efficient data transformations, and of how data must flow through the cluster to effect those transformations.
-* Detailed example programs applying Hadoop to interesting problems in context
-* Advice and best practices for efficient software development
-
-All of the examples use real data, and describe patterns found in many problem domains:
-
-* Statistical Summaries
-* Identifying patterns and groups in the data
-* Searching, filtering and herding records in bulk
-* Advanced queries against spatial or time-series data sets.
-
-The emphasis on simplicity and fun should make this book especially appealing to beginners, but this is not an approach you'll outgrow. We've found it's the most powerful and valuable approach for creative analytics. One of our maxims is "Robots are cheap, Humans are important": write readable, scalable code now and find out later whether you want a smaller cluster. The code you see is adapted from programs we write at Infochimps to solve enterprise-scale business problems, and these simple high-level transformations (most of the book) plus the occasional Java extension (chapter XXX) meet our needs.
-
-Many of the chapters have exercises included. If you're a beginning user, I highly recommend you work out at least one exercise from each chapter. Deep learning will come less from having the book in front of you as you _read_ it than from having the book next to you while you *write* code inspired by it. There are sample solutions and result datasets on the book's website.
-
-Feel free to hop around among chapters; the application chapters don't have large dependencies on earlier chapters.
-
-
-[[about_is_for]]
-==== Who This Book Is For ====
-
-We'd like for you to be familiar with at least one programming language, but it doesn't have to be Ruby. Familiarity with SQL will help a bit, but isn't essential.
-
-Most importantly, you should have an actual project in mind that requires a big data toolkit to solve -- a problem that requires scaling out across multiple machines. If you don't already have a project in mind but really want to learn about the big data toolkit, take a quick browse through the exercises. At least a few of them should have you jumping up and down with excitement to learn this stuff.
-
-[[about_is_not_for]]
-==== Who This Book Is Not For ====
-
-This is not "Hadoop the Definitive Guide" (that's been written, and well); this is more like "Hadoop: a Highly Opinionated Guide". The only coverage of how to use the bare Hadoop API is to say "In most cases, don't". We recommend storing your data in one of several highly space-inefficient formats and in many other ways encourage you to willingly trade a small performance hit for a large increase in programmer joy. The book has a relentless emphasis on writing *scalable* code, but no content on writing *performant* code beyond the advice that the best path to a 2x speedup is to launch twice as many machines.
-
-That is because for almost everyone, the cost of the cluster is far less than the opportunity cost of the data scientists using it. If you have not just big data but huge data -- let's say somewhere north of 100 terabytes -- then you will need to make different tradeoffs for jobs that you expect to run repeatedly in production.
-
-The book does have some content on machine learning with Hadoop, on provisioning and deploying Hadoop, and on a few important settings. But it does not cover advanced algorithms, operations or tuning in any real depth.
-
-[[about_how_written]]
-==== How this book is being written ====
-
-I plan to push chapters to the publicly-viewable http://github.com/infochimps-labs/big_data_for_chimps['Hadoop for Chimps' git repo] as they are written, and to post them periodically to the http://blog.infochimps.com[Infochimps blog] after minor cleanup.
-
-We really mean it about the git social-coding thing -- please https://github.com/blog/622-inline-commit-notes[comment] on the text, http://github.com/infochimps-labs/big_data_for_chimps/issues[file issues] and send pull requests. However! We might not use your feedback, no matter how dazzlingly cogent it is; and while we are soliciting comments from readers, we are not seeking content from collaborators.
-
-
-==== How to Contact Us ====
-
-Please address comments and questions concerning this book to the publisher:
-
-O'Reilly Media, Inc.
-1005 Gravenstein Highway North
-Sebastopol, CA 95472
-(707) 829-0515 (international or local)
-
-To comment or ask technical questions about this book, send email to bookquestions@oreilly.com
-
-To reach the authors:
-
-Flip Kromer is @mrflip on Twitter
-
-For comments or questions on the material, file a github issue at http://github.com/infochimps-labs/big_data_for_chimps/issues
187 02-simple_transform.asciidoc
@@ -1,2 +1,189 @@
[[simple_transform]]
== Simple Transform
+
+//// Use the start of this chapter, before you get into the Chimpanzee and Elephant business, to write about your insights and conclusions about the simple transform aspect of working with big data. Prime the reader's mind. Make it fascinating -- this first, beginning section of each chapter is an ideal space for making big data fascinating and showing how magical data manipulation can be. Then, go on to write your introductory text for this chapter. Such as, "...in this chapter, we'll study an "embarrassingly parallel" problem in order to learn the mechanics of launching a job. It will also provide a coarse understanding of..." Amy////
+
+=== Chimpanzee and Elephant Start a Business ===
+
+Chimpanzees love nothing more than sitting at keyboards processing and generating text. Elephants have a prodigious ability to store and recall information, and will carry huge amounts of cargo with great determination. The chimpanzees and the elephants realized there was a real business opportunity from combining their strengths, and so they formed the Chimpanzee and Elephant Data Shipping Corporation.
+
+They were soon hired by a publishing firm to translate the works of Shakespeare into every language.
+In the system they set up, each chimpanzee sits at a typewriter doing exactly one thing well: read a set of passages, and type out the corresponding text in a new language. Each elephant has a pile of books, which she breaks up into "blocks" (a consecutive bundle of pages, tied up with string).
+
+=== A Simple Streamer ===
+
+We're hardly as clever as one of these multilingual chimpanzees, but even we can translate text into Pig Latin. For the unfamiliar, you turn standard English into Pig Latin as follows:
+
+* If the word begins with a consonant-sounding letter or letters, move them to the end of the word adding "ay": "happy" becomes "appy-hay", "chimp" becomes "imp-chay" and "yes" becomes "es-yay".
+* In words that begin with a vowel, just append the syllable "way": "another" becomes "another-way", "elephant" becomes "elephant-way".
+
+<<pig_latin_translator>> is a program to do that translation. It's written in Wukong, a simple library to rapidly develop big data analyses. Like the chimpanzees, it is single-concern: there's nothing in there about loading files, parallelism, network sockets or anything else. Yet you can run it over a text file from the commandline or run it over petabytes on a cluster (should you somehow have a petabyte crying out for pig-latinizing).
+
+
+[[pig_latin_translator]]
+.Pig Latin translator, actual version
+----
+ CONSONANTS   = "bcdfghjklmnpqrstvwxyz"  # 'y' acts as a consonant here: "yes" becomes "es-yay"
+ UPPERCASE_RE = /[A-Z]/
+ PIG_LATIN_RE = %r{
+ \b # word boundary
+ ([#{CONSONANTS}]*) # all initial consonants
+ ([\w\']+) # remaining wordlike characters
+ }xi
+
+ each_line do |line|
+ latinized = line.gsub(PIG_LATIN_RE) do
+ head, tail = [$1, $2]
+ head = 'w' if head.blank?
+ tail.capitalize! if head =~ UPPERCASE_RE
+ "#{tail}-#{head.downcase}ay"
+ end
+ yield(latinized)
+ end
+----
+
+[[pig_latin_translator_pseudocode]]
+.Pig Latin translator, pseudocode
+----
+ for each line,
+ recognize each word in the line and change it as follows:
+ separate the head consonants (if any) from the tail of the word
+ if there were no initial consonants, use 'w' as the head
+ give the tail the same capitalization as the word
+     change the word to "{tail}-{head}ay"
+ end
+ emit the latinized version of the line
+ end
+----
+
+.Ruby helper
+****
+* The first few lines define "regular expressions" selecting the initial characters (if any) to move. Writing their names in ALL CAPS makes them constants.
+* Wukong calls the `each_line do ... end` block with each line; the `|line|` part puts it in the `line` variable.
+* The `gsub` ("globally substitute") statement calls its `do ... end` block with each matched word, and replaces that word with the value of the block's final expression.
+* `yield(latinized)` hands off the `latinized` string for Wukong to output
+****
+
+To test the program on the commandline, run
+
+ wu-local examples/text/pig_latin.rb data/magi.txt -
+
+The last line of its output should look like
+
+ Everywhere-way ey-thay are-way isest-way. Ey-thay are-way e-thay agi-may.
+
+So that's what it looks like when a `cat` is feeding the program data; let's see how it works when an elephant is setting the pace.
+
+
+=== Running a Hadoop Job ===
+
+_Note: this assumes you have a working Hadoop cluster, however large or small._
+
+As you've surely guessed, Hadoop is organized very much like the Chimpanzee & Elephant team. Let's dive in and see it in action.
+
+First, copy the data onto the cluster:
+
+ hadoop fs -mkdir ./data
+ hadoop fs -put wukong_example_data/text ./data/
+
+These commands understand `./data/text` to be a path on the HDFS, not your local disk; the dot `.` is treated as your HDFS home directory (use it as you would `~` in Unix). The `-put` command takes a list of local paths and copies them to the HDFS, treating its final argument as an HDFS path and all the preceding paths as local.
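+
+You can confirm that the files landed where you expect with a directory listing:
+
+ hadoop fs -ls ./data/text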
+
+First, let's test on the same tiny little file we used at the commandline. Make sure to notice how much _longer_ it takes this elephant to squash a flea than it took to run without hadoop.
+//// I'd delete the 'squash a flea' here in order to be direct and simple, not cute. Amy////
+
+ wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi
+
+//// I suggest adding something about what the reader can expect to see on screen before moving on. Amy////
+After outputting a bunch of happy robot-ese to your screen, the job should appear on the jobtracker window within a few seconds. The whole job should complete in far less time than it took to set it up. You can compare its output to the earlier one by running
+
+ hadoop fs -cat ./output/latinized_magi/\*
+
+Now let's run it on the full Shakespeare corpus. Even this is hardly enough data to make Hadoop break a sweat, but it does show off the power of distributed computing.
+
+ wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi
+
+//// I suggest rounding out the exercise with something along the lines of a supportive "You should now see...on screen" for the reader. Amy////
+
+=== Brief Anatomy of a Hadoop Job ===
+
+We'll go into much more detail in (TODO: ref), but here are the essentials of what you just performed.
+
+==== Copying files to the HDFS ====
+
+When you ran the `hadoop fs -mkdir` command, the Namenode (Nanette's Hadoop counterpart) simply made a notation in its directory: no data was stored. If you're familiar with the term, think of the namenode as a 'File Allocation Table (FAT)' for the HDFS.
+
+When you run `hadoop fs -put ...`, the putter process does the following for each file:
+
+1. Contacts the namenode to create the file. This also just makes a note of the file; the namenode doesn't ever have actual data pass through it.
+2. Instead, the putter process asks the namenode to allocate a new data block. The namenode designates a set of datanodes (typically three), along with a permanently-unique block ID.
+3. The putter process transfers the file over the network to the first data node in the set; that datanode transfers its contents to the next one, and so forth. The putter doesn't consider its job done until a full set of replicas have acknowledged successful receipt.
+4. As soon as each HDFS block fills, even if it is mid-record, it is closed; steps 2 and 3 are repeated for the next block.
+
+==== Running on the cluster ====
+
+Now let's look at what happens when you run your job.
+
+(TODO: verify this is true in detail. @esammer?)
+
+* _Runner_: The program you launch sends the job and its assets (code files, etc) to the jobtracker. The jobtracker hands a `job_id` back (something like `job_201204010203_0002` -- the datetime the jobtracker started and the count of jobs launched so far); you'll use this to monitor and if necessary kill the job.
+* _Jobtracker_: As tasktrackers "heartbeat" in, the jobtracker hands them a set of 'tasks' -- the code to run and the data segment to process (the "split", typically an HDFS block).
+* _Tasktracker_: each tasktracker launches a set of 'mapper child processes', each one an 'attempt' of the tasks it received. (TODO verify:) It periodically reassures the jobtracker with progress and in-app metrics.
+* _Jobtracker_: the Jobtracker continually updates the job progress and app metrics. As each tasktracker reports a complete attempt, it receives a new one from the jobtracker.
+* _Tasktracker_: after some progress, the tasktrackers also fire off a set of reducer attempts, similar to the mapper step.
+* _Runner_: stays alive, reporting progress, for the full duration of the job. As soon as the job_id is delivered, though, the Hadoop job itself doesn't depend on the runner -- even if you stop the process or disconnect your terminal the job will continue to run.
+
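+For example, using the sample `job_id` above, you can check on the job or kill it from any machine with cluster access:
+
+ hadoop job -status job_201204010203_0002
+ hadoop job -kill   job_201204010203_0002
+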
+[WARNING]
+===============================
+Please keep in mind that the tasktracker does _not_ run your code directly -- it forks a separate process in a separate JVM with its own memory demands. The tasktracker rarely needs more than a few hundred megabytes of heap, and you should not see it consuming significant I/O or CPU.
+===============================
+
+=== Chimpanzee and Elephant: Splits ===
+
+I've danced around a minor but important detail that the workers take care of. For the Chimpanzees, books are chopped up into set numbers of pages -- but the chimps translate _sentences_, not pages, and a page block boundary might happen mid-sentence.
+//// Provide a real world analogous example here to help readers correlate this story to their world and data analysis needs, "...This example is similar to..." Amy////
+
+The Hadoop equivalent, of course, is that a data record may cross an HDFS block boundary. (In fact, you can force map-reduce splits to happen anywhere in the file, but the default and typically most-efficient choice is to split at HDFS blocks.)
+
+A mapper will skip the first record of a split if it's partial and carry on from there. Since there are many records in each split, that's no big deal. When it gets to the end of the split, the task doesn't stop processing until it completes the current record -- the framework makes the overhanging data seamlessly appear.
+//// Again, here, correlate this example to a real world scenario; "...so if you were translating x, this means that..." Amy////
+
+In practice, Hadoop users only need to worry about record splitting when writing a custom `InputFormat` or when practicing advanced magick. You'll see lots of reference to it though -- it's a crucial subject for those inside the framework, but for regular users the story I just told is more than enough detail.
+
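+Here is a rough sketch of that record-skipping logic in plain Ruby -- a conceptual model only, not Hadoop's actual `LineRecordReader`; the method name and boundary arithmetic are ours:
+
+----
+ # Conceptual sketch: read the newline-delimited records "belonging to" one split.
+ def read_split_records(file, split_start, split_end)
+   file.seek(split_start)
+   # Skip the first (possibly partial) line: it belongs to the previous split's
+   # reader, which reads one record past its own end to compensate.
+   file.gets unless split_start == 0
+   # Keep reading records that start inside this split; the final gets may
+   # run past split_end to finish the overhanging record.
+   while file.pos <= split_end && (line = file.gets)
+     yield line
+   end
+ end
+
+ # Usage: File.open('huge.txt') { |f| read_split_records(f, 0, 64 * 2**20) { |line| puts line } }
+----
+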
+=== Exercises ===
+
+==== Exercise 1.1: Running time ====
+
+It's important to build your intuition about what makes a program fast or slow.
+
+Write the following scripts:
+
+* *null.rb* -- emits nothing.
+* *identity.rb* -- emits every line exactly as it was read in.
+
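+If it helps to see the shape of these, here is one possible sketch, following the same `each_line ... yield` style as the Pig Latin translator above (the sample solutions on the book's site may differ in detail):
+
+----
+ # identity.rb -- emits every line exactly as it was read in
+ each_line do |line|
+   yield(line)
+ end
+
+ # null.rb -- reads every line but emits nothing
+ each_line do |line|
+   # no yield: nothing is emitted
+ end
+----
+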
+Let's run the *reverse.rb* and *pig_latin.rb* scripts from this chapter, and the *null.rb* and *identity.rb* scripts from exercise 1.1, against the 30 Million Wikipedia Abstracts dataset.
+
+First, though, write down an educated guess for how much longer each script will take than the `null.rb` script takes (use the table below). So, if you think the `reverse.rb` script will be 10% slower, write '10%'; if you think it will be 10% faster, write '- 10%'.
+
+Next, run each script three times, mixing up the order. Write down
+
+* the total time of each run
+* the average of those times
+* the actual percentage difference in run time between each script and the null.rb script
+
+ script | est % incr | run 1 | run 2 | run 3 | avg run time | actual % incr |
+ null: | | | | | | |
+ identity: | | | | | | |
+ reverse: | | | | | | |
+ pig_latin: | | | | | | |
+
+Most people are surprised by the result.
+
+==== Exercise 1.2: A Petabyte-scalable `wc` command ====
+
+Create a script, `wc.rb`, that emits the length of each line, the count of bytes it occupies, and the number of words it contains.
+
+Notes:
+
+* The `String` methods `chomp`, `length`, `bytesize`, `split` are useful here.
+* Do not include the end-of-line characters (`\n` or `\r`) in your count.
+* As a reminder -- for English text the byte count and length are typically similar, but the funny characters in a string like "Iñtërnâtiônàlizætiøn" require more than one byte each. The character count says how many distinct 'letters' the string contains, regardless of how it's stored in the computer. The byte count describes how much space a string occupies, and depends on arcane details of how strings are stored.
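+
+A sketch of one possible `wc.rb`, using those `String` methods (the tab-separated output layout is our assumption, not prescribed by the exercise):
+
+----
+ each_line do |line|
+   text = line.chomp                  # drop the end-of-line characters
+   yield [text.length, text.bytesize, text.split.size].join("\t")
+ end
+----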
88 02a-c_and_e_start_a_business.asciidoc
@@ -1,88 +0,0 @@
-=== Chimpanzee and Elephant Start a Business ===
-
-Chimpanzees love nothing more than sitting at keyboards processing and generating text. Elephants have a prodigious ability to store and recall information, and will carry huge amounts of cargo with great determination. The chimpanzees and the elephants realized there was a real business opportunity from combining their strengths, and so they formed the Chimpanzee and Elephant Data Shipping Corporation.
-
-They were soon hired by a publishing firm to translate the works of Shakespeare into every language.
-In the system they set up, each chimpanzee sits at a typewriter doing exactly one thing well: read a set of passages, and type out the corresponding text in a new language. Each elephant has a pile of books, which she breaks up into "blocks" (a consecutive bundle of pages, tied up with string).
-
-=== A Simple Streamer ===
-
-We're hardly as clever as one of these multilingual chimpanzees, but even we can translate text into Pig Latin. For the unfamiliar, you turn standard English into Pig Latin as follows:
-
-* If the word begins with a consonant-sounding letter or letters, move them to the end of the word adding "ay": "happy" becomes "appy-hay", "chimp" becomes "imp-chay" and "yes" becomes "es-yay".
-* In words that begin with a vowel, just append the syllable "way": "another" becomes "another-way", "elephant" becomes "elephant-way".
-
-<<pig_latin_translator>> is a program to do that translation. It's written in Wukong, a simple library to rapidly develop big data analyses. Like the chimpanzees, it is single-concern: there's nothing in there about loading files, parallelism, network sockets or anything else. Yet you can run it over a text file from the commandline or run it over petabytes on a cluster (should you somehow have a petabyte crying out for pig-latinizing).
-
-
-[[pig_latin_translator]]
-.Pig Latin translator, actual version
-----
- CONSONANTS = "bcdfghjklmnpqrstvwxz"
- UPPERCASE_RE = /[A-Z]/
- PIG_LATIN_RE = %r{
- \b # word boundary
- ([#{CONSONANTS}]*) # all initial consonants
- ([\w\']+) # remaining wordlike characters
- }xi
-
- each_line do |line|
- latinized = line.gsub(PIG_LATIN_RE) do
- head, tail = [$1, $2]
- head = 'w' if head.blank?
- tail.capitalize! if head =~ UPPERCASE_RE
- "#{tail}-#{head.downcase}ay"
- end
- yield(latinized)
- end
-----
-
-[[pig_latin_translator]]
-.Pig Latin translator, pseudocode
-----
- for each line,
- recognize each word in the line and change it as follows:
- separate the head consonants (if any) from the tail of the word
- if there were no initial consonants, use 'w' as the head
- give the tail the same capitalization as the word
- change the word to "{tail}-#{head}ay"
- end
- emit the latinized version of the line
- end
-----
-
-.Ruby helper
-****
-* The first few lines define "regular expressions" selecting the initial characters (if any) to move. Writing their names in ALL CAPS makes them be constants.
-* Wukong calls the `each_line do ... end` block with each line; the `|line|` part puts it in the `line` variable.
-* the `gsub` ("globally substitute") statement calls its `do ... end` block with each matched word, and replaces that word with the last line of the block.
-* `yield(latinized)` hands off the `latinized` string for wukong to output
-****
-
-//// Is the Ruby helper necessary? It may be, but wanted to query it as I don't see it appearing often as a feature. Amy////
-
-To test the program on the commandline, run
-
- wu-local examples/text/pig_latin.rb data/magi.txt -
-
-The last line of its output should look like
-
- Everywhere-way ey-thay are-way isest-way. Ey-thay are-way e-thay agi-may.
-
-So that's what it looks like when a `cat` is feeding the program data; let's see how it works when an elephant is setting the pace.
-
-=== Chimpanzee and Elephant: A Day at Work ===
-
-Each day, the chimpanzee's foreman, a gruff silverback named J.T., hands each chimp the day's translation manual and a passage to translate as they clock in. Throughout the day, he also coordinates assigning each block of pages to chimps as they signal the need for a fresh assignment.
-
-Some passages are harder than others, so it's important that any elephant can deliver page blocks to any chimpanzee -- otherwise you'd have some chimps goofing off while others are stuck translating _King Lear_ into Kinyarwanda. On the other hand, sending page blocks around arbitrarily will clog the hallways and exhaust the elephants.
-
-The elephants' chief librarian, Nanette, employs several tricks to avoid this congestion.
-
-Since each chimpanzee typically shares a cubicle with an elephant, it's most convenient to hand a new page block across the desk rather then carry it down the hall. J.T. assigns tasks accordingly, using a manifest of page blocks he requests from Nanette. Together, they're able to make most tasks be "local".
-
-//// This would be a good spot to consider offering a couple additional, real-world analogies to aid your reader in making the leap to how this Chimp and Elephant story correlates to the needs for data manipulation in the readers' world - that is, relate it to science, epidemiology, finance, or what have you, some area of real-world application. Amy////
-
-Second, the page blocks of each play are distributed all around the office, not stored in one book together. One elephant might have pages from Act I of _Hamlet_, Act II of _The Tempest_, and the first four scenes of _Troilus and Cressida_ footnote:[Does that sound complicated? It is -- Nanette is able to keep track of all those blocks, but if she calls in sick, nobody can get anything done. You do NOT want Nanette to call in sick.]. Also, there are multiple 'replicas' (typically three) of each book collectively on hand. So even if a chimp falls behind, JT can depend that some other colleague will have a cubicle-local replica. (There's another benefit to having multiple copies: it ensures there's always a copy available. If one elephant is absent for the day, leaving her desk locked, Nanette will direct someone to make a xerox copy from either of the two other replicas.)
-
-Nanette and J.T. exercise a bunch more savvy optimizations (like handing out the longest passages first, or having folks who finish early pitch in so everyone can go home at the same time, and more). There's no better demonstration of power through simplicity.
1 02a-simple_transform-intro.asciidoc
@@ -1 +0,0 @@
-//// Use the start of this chapter, before you get into the Chimpanzee and Elephant business, to write about your insights and conclusions about the simple transform aspect of working with big data. Prime the reader's mind. Make it fascinating -- this first, beginning section of each chapter is an ideal space for making big data fascinating and showing how magical data manipulation can be. Then, go on to write your introductory text for this chapter. Such as, "...in this chapter, we'll study an "embarrassingly parallel" problem in order to learn the mechanics of launching a job. It will also provide a course understanding of..." Amy////
74 02b-running_a_hadoop_job.asciidoc
@@ -1,74 +0,0 @@
-
-=== Running a Hadoop Job ===
-
-_Note: this assumes you have a working Hadoop cluster, however large or small._
-
-As you've surely guessed, Hadoop is organized very much like the Chimpanzee & Elephant team. Let's dive in and see it in action.
-
-First, copy the data onto the cluster:
-
- hadoop fs -mkdir ./data
- hadoop fs -put wukong_example_data/text ./data/
-
-These commands understand `./data/text` to be a path on the HDFS, not your local disk; the dot `.` is treated as your HDFS home directory (use it as you would `~` in Unix.). The `wu-put` command, which takes a list of local paths and copies them to the HDFS, treats its final argument as an HDFS path by default, and all the preceding paths as being local.
-
-First, let's test on the same tiny little file we used at the commandline. Make sure to notice how much _longer_ it takes this elephant to squash a flea than it took to run without hadoop.
-//// I'd delete the 'squash a flea' here in order to be direct and simple, not cute. Amy////
-
- wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi
-
-//// I suggest adding something about what the reader can expect to see on screen before moving on. Amy////
-After outputting a bunch of happy robot-ese to your screen, the job should appear on the jobtracker window within a few seconds. The whole job should complete in far less time than it took to set it up. You can compare its output to the earlier by running
-
- hadoop fs -cat ./output/latinized_magi/\*
-
-Now let's run it on the full Shakespeare corpus. Even this is hardly enough data to make Hadoop break a sweat, but it does show off the power of distributed computing.
-
- wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi
-
-//// I suggest rounding out the exercise with something along the lines of a supportive "You should now see...on screen" for the reader. Amy////
-
-=== Brief Anatomy of a Hadoop Job ===
-
-We'll go into much more detail in (TODO: ref), but here are the essentials of what you just performed.
-
-==== Copying files to the HDFS ====
-
-When you ran the `hadoop fs -mkdir` command, the Namenode (Nanette's Hadoop counterpart) simply made a notation in its directory: no data was stored. If you're familiar with the term, think of the namenode as a 'File Allocation Table (FAT)' for the HDFS.
-
-When you run `hadoop fs -put ...`, the putter process does the following for each file:
-
-1. Contacts the namenode to create the file. This also just makes a note of the file; the namenode doesn't ever have actual data pass through it.
-2. Instead, the putter process asks the namenode to allocate a new data block. The namenode designates a set of datanodes (typically three), along with a permanently-unique block ID.
-3. The putter process transfers the file over the network to the first data node in the set; that datanode transfers its contents to the next one, and so forth. The putter doesn't consider its job done until a full set of replicas have acknowledged successful receipt.
-4. As soon as each HDFS block fills, even if it is mid-record, it is closed; steps 2 and 3 are repeated for the next block.
-
-==== Running on the cluster ====
-
-Now let's look at what happens when you run your job.
-
-(TODO: verify this is true in detail. @esammer?)
-
-* _Runner_: The program you launch sends the job and its assets (code files, etc) to the jobtracker. The jobtracker hands a `job_id` back (something like `job_201204010203_0002` -- the datetime the jobtracker started and the count of jobs launched so far); you'll use this to monitor and if necessary kill the job.
-* _Jobtracker_: As tasktrackers "heartbeat" in, the jobtracker hands them a set of 'task's -- the code to run and the data segment to process (the "split", typically an HDFS block).
-* _Tasktracker_: each tasktracker launches a set of 'mapper child processes', each one an 'attempt' of the tasks it received. (TODO verify:) It periodically reassures the jobtracker with progress and in-app metrics.
-* _Jobtracker_: the Jobtracker continually updates the job progress and app metrics. As each tasktracker reports a complete attempt, it receives a new one from the jobtracker.
-* _Tasktracker_: after some progress, the tasktrackers also fire off a set of reducer attempts, similar to the mapper step.
-* _Runner_: stays alive, reporting progress, for the full duration of the job. As soon as the job_id is delivered, though, the Hadoop job itself doesn't depend on the runner -- even if you stop the process or disconnect your terminal the job will continue to run.
-
-[WARNING]
-===============================
-Please keep in mind that the tasktracker does _not_ run your code directly -- it forks a separate process in a separate JVM with its own memory demands. The tasktracker rarely needs more than a few hundred megabytes of heap, and you should not see it consuming significant I/O or CPU.
-===============================
-
-=== Chimpanzee and Elephant: Splits ===
-
-I've danced around a minor but important detail that the workers take care of. For the Chimpanzees, books are chopped up into set numbers of pages -- but the chimps translate _sentences_, not pages, and a page block boundary might happen mid-sentence.
-//// Provide a real world analogous example here to help readers correlate this story to their world and data analysis needs, "...This example is similar to..." Amy////
-
-The Hadoop equivalent of course is that a data record may cross and HDFS block boundary. (In fact, you can force map-reduce splits to happen anywhere in the file, but the default and typically most-efficient choice is to split at HDFS blocks.)
-
-A mapper will skip the first record of a split if it's partial and carry on from there. Since there are many records in each split, that's no big deal. When it gets to the end of the split, the task doesn't stop processing until is completes the current record -- the framework makes the overhanging data seamlessley appear.
-//// Again, here, correlate this example to a real world scenario; "...so if you were translating x, this means that..." Amy////
-
-In practice, Hadoop users only need to worry about record splitting when writing a custom `InputFormat` or when practicing advanced magick. You'll see lots of reference to it though -- it's a crucial subject for those inside the framework, but for regular users the story I just told is more than enough detail.
38 02c-end_of_simple_transform.asciidoc
@@ -1,38 +0,0 @@
-=== Exercises ===
-
-==== Exercise 1.1: Running time ====
-
-It's important to build your intuition about what makes a program fast or slow.
-
-Write the following scripts:
-
-* *null.rb* -- emits nothing.
-* *identity.rb* -- emits every line exactly as it was read in.
-
-Let's run the *reverse.rb* and *piglatin.rb* scripts from this chapter, and the *null.rb* and *identity.rb* scripts from exercise 1.1, against the 30 Million Wikipedia Abstracts dataset.
-
-First, though, write down an educated guess for how much longer each script will take than the `null.rb` script takes (use the table below). So, if you think the `reverse.rb` script will be 10% slower, write '10%'; if you think it will be 10% faster, write '- 10%'.
-
-Next, run each script three times, mixing up the order. Write down
-
-* the total time of each run
-* the average of those times
-* the actual percentage difference in run time between each script and the null.rb script
-
- script | est % incr | run 1 | run 2 | run 3 | avg run time | actual % incr |
- null: | | | | | | |
- identity: | | | | | | |
- reverse: | | | | | | |
- pig_latin: | | | | | | |
-
-Most people are surprised by the result.
-
-==== Exercise 1.2: A Petabyte-scalable `wc` command ====
-
-Create a script, `wc.rb`, that emit the length of each line, the count of bytes it occupies, and the number of words it contains.
-
-Notes:
-
-* The `String` methods `chomp`, `length`, `bytesize`, `split` are useful here.
-* Do not include the end-of-line characters (`\n` or `\r`) in your count.
-* As a reminder -- for English text the byte count and length are typically similar, but the funny characters in a string like "Iñtërnâtiônàlizætiøn" require more than one byte each. The character count says how many distinct 'letters' the string contains, regardless of how it's stored in the computer. The byte count describes how much space a string occupies, and depends on arcane details of how strings are stored.
2 03-transform_pivot.asciidoc
@@ -1,2 +1,4 @@
[[transform_pivot]]
== Chimpanzee and Elephant Save Christmas ==
+
+//// Say more about conceptualizing big data and tie in what readers have already learned (up to this chapter, the simple transform and first exploration material) and weave that in to help them have 'ah ha' moments and really grasp this material. Then, write your "...in this chapter..." Finally, a good discussion of "locality" would go well anchored here. And then you can dive into the Elves. Amy////
2 03a-c_and_e_save_xmas.asciidoc
@@ -74,8 +74,6 @@ The chimps places their toyforms into these file folders as fast as their dexter
image::images/chimps_and_elves/bchm_0206.png[toyforms go off in batches]
-
-
=== Hadoop vs Traditional Databases ===
Fundamentally, the storage engine at the heart of a traditional relational database does two things: it holds all the records, and it maintains a set of indexes for lookups and other operations. To retrieve a record, it must consult the appropriate index to find the location of the record, then load it from the disk. This is very fast for record-by-record retrieval, but becomes cripplingly inefficient for general high-throughput access. If the records are stored by location and arrival time (as the mailbags were on the Bag Tree), then
2 03a-intro.asciidoc
@@ -1,2 +0,0 @@
-
-//// Say more about conceptualizing big data and tie in what readers have already learned (up to this chapter, the simple transform and first exploration material) and weave that in to help them have 'ah ha' moments and really grasp this material. Then, write your "...in this chapter..." Finally, a good discussion of "locality" would go well anchored here. And then you can dive into the Elves. Amy////
3 07-intro_to_storm+trident.asciidoc
@@ -1,7 +1,6 @@
== Intro to Storm+Trident
-
-==== Intro: Storm+Trident Fundamentals
+=== Intro: Storm+Trident Fundamentals
At this point, you have good familiarity with Hadoop’s batch processing power and the powerful inquiries it unlocks, both beyond and as a counterpart to the traditional database approach. Stream analytics is a third mode of data analysis and, it is becoming clear, one that is just as essential and transformative as massive-scale batch processing has been.
33 09-statistics.asciidoc
@@ -1,2 +1,35 @@
[[statistics]]
== Statistics
+
+=== Skeleton: Statistics
+
+Data is worthless. Actually, it's worse than worthless. It costs you money to gather, store, manage, replicate and analyze. What you really want is insight -- a relevant summary of the essential patterns in that data -- produced using relationships to analyze data in context.
+
+Statistical summaries are the purest form of this activity, and will be used repeatedly in the book to come, so now that you see how Hadoop is used it's a good place to focus.
+
+Some statistical measures let you summarize the whole from summaries of the parts: I can count all the votes in the state by summing the votes from each county, and the votes in each county by summing the votes at each polling station. Those types of aggregations -- average/standard deviation, correlation, and so forth -- are naturally scalable, but just having billions of objects introduces some practical problems you need to avoid. We'll also use them to introduce Pig, a high-level language for SQL-like queries on large datasets.
+
+Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That's especially unfortunate because so much data at large scale has a long-tail, not a normal (Gaussian), distribution -- the median is a far more robust indicator of the "typical" value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.)
+
+But you don't always need an exact value -- you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to:
+
+* Holistic vs algebraic aggregations
+* Underflow and the "Law of Huge Numbers"
+* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
+* Count-min sketch for most frequent elements
+* Approx histogram
+
+- Counting
+ - total burgers sold - total commits, repos,
+- counting a running and/or smoothed average
+- standard deviation
+- sampling
+ - uniform
+ - top k
+  - reservoir
+  - ?rolling top k/reservoir sampling?
+- algebraic vs holistic aggregate
+- use countmin sketch to turn a holistic aggregate into an algebraic aggregate
+- quantile or histogram
+- numeric stability
+
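+As a concrete taste of the "algebraic vs holistic" distinction, here is a sketch (plain Ruby, standalone -- not tied to any of the book's scripts) of Welford's streaming method for mean and standard deviation. It is single-pass and numerically stable, and because each partial result is just a `(count, mean, m2)` triple, mappers can compute pieces independently and a reducer can merge them -- the hallmark of an algebraic aggregate:
+
+----
+ # Welford's online algorithm: running count, mean and variance in one pass.
+ class RunningStats
+   attr_reader :count, :mean
+
+   def initialize
+     @count = 0
+     @mean  = 0.0
+     @m2    = 0.0   # sum of squared deviations from the running mean
+   end
+
+   def add(x)
+     @count += 1
+     delta   = x - @mean
+     @mean  += delta / @count
+     @m2    += delta * (x - @mean)
+     self
+   end
+
+   def variance
+     @count < 2 ? 0.0 : @m2 / (@count - 1)
+   end
+
+   def stddev
+     Math.sqrt(variance)
+   end
+ end
+
+ stats = RunningStats.new
+ [3, 1, 4, 1, 5, 9, 2, 6].each { |x| stats.add(x) }
+ puts [stats.count, stats.mean.round(3), stats.stddev.round(3)].join("\t")
+----
+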
33 09a-statistics-intro.asciidoc
@@ -1,33 +0,0 @@
-
-==== Skeleton: Statistics
-
-Data is worthless. Actually, it's worse than worthless. It costs you money to gather, store, manage, replicate and analyze. What you really want is insight -- a relevant summary of the essential patterns in that data -- produced using relationships to analyze data in context.
-
-Statistical summaries are the purest form of this activity, and will be used repeatedly in the book to come, so now that you see how Hadoop is used it's a good place to focus.
-
-Some statistical measures let you summarize the whole from summaries of the parts: I can count all the votes in the state by summing the votes from each county, and the votes in each county by summing the votes at each polling station. Those types of aggregations -- average/standard deviation, correlation, and so forth -- are naturally scalable, but just having billions of objects introduces some practical problems you need to avoid. We'll also use them to introduce Pig, a high-level language for SQL-like queries on large datasets.
-
-Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That's especially unfortunate because so much data at large scale has a long-tail, not normal (Gaussian) distribution -- the median is far more robust indicator of the "typical" value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.)
-
-But you don't always need an exact value -- you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to
-
-* Holistic vs algebraic aggregations
-* Underflow and the "Law of Huge Numbers"
-* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
-* Count-min sketch for most frequent elements
-* Approx histogram
-
-- Counting
- - total burgers sold - total commits, repos,
-- counting a running and or smoothed average
-- standard deviation
-- sampling
- - uniform
- - top k
- - reservior
- - ?rolling topk/reservior sampling?
-- algebraic vs holistic aggregate
-- use countmin sketch to turn a holistic aggregate into an algebraic aggregate
-- quantile or histogram
-- numeric stability
-
45 18-java_api.asciidoc
@@ -1,3 +1,48 @@
[[java_api]]
== Java API
+=== When to use the Hadoop Java API ===
+
+Don't.
+
+=== How to use the Hadoop Java API ===
+
+The Java API provides direct access, but it requires you to write the entirety of your program, boilerplate and all.
+
+Decompose your problem into small, isolable transformations. Implement each as a Pig Load/StoreFunc or UDF (User-Defined Function) making calls to the Hadoop API.footnote:[It doesn't have to be Pig -- Hive, Cascading, Crunch and other high-level frameworks abstract out the boring stuff while still making it easy to write custom components.]
+
+=== The Skeleton of a Hadoop Java API program ===
+
+I'll trust that for very good reasons -- to interface with an outside system, performance, a powerful library with a Java-only API -- Java is the best choice to implement
+
+ [ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
+
+When this happens around the office, we sing this little dirgefootnote:[If the novel lasts all week, someone will tell this joke and then we will walk carefully to the bar: "The church, it is close by -- but the way is cold and icy. The bar, it is far away -- but we shall walk carefully."]:
+
+ Life, sometimes, is Russian Novel. Is having unhappy marriage and much snow and little vodka.
+ But when Russian Novel it is short, then quickly we finish and again is Sweet Valley High.
+
+What we *don't* do is write a pure Hadoop-API Java program. In practice, those look like this:
+
+ HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT PARSING PARAMS
+ HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT READING FILES
+ COBBLED-TOGETHER CODE TO DESERIALIZE THE FILE, HANDLE SPLITS, ETC
+
+ [ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
+
+ COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S FLATTEN COMMAND DOES
+ COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S CROSS COMMAND DOES
+ A SIMPLE COMBINER COPIED FROM TOM WHITE'S BOOK
+ 1000 LINES OF CODE TO DO WHAT RUBY COULD IN THREE LINES OF CODE
+ HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT WRITING FILES
+ UGLY BUT NECESSARY CODE TO GLUE THIS TO THE REST OF THE ECOSYSTEM
+
+The surrounding code is ugly and boring; it will take more time, produce more bugs, and carry a higher maintenance burden than the important stuff. More importantly, the high-level framework provides an implementation far better than it's worth your time to recreate.footnote:[... and when the harsh reality of a production dataset reveals that your data has an unforeseen and crippling "stuck reducer" problem, you're facing a fundamental re-think of your program's design rather than a one-line change from `JOIN` to `SKEW JOIN`. See the chapter on Advanced Pig.]
+
+Instead, we write
+
+ A SKELETON FOR A PIG UDF DEFINITION
+ [ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
+
+ A PIG SCRIPT
+
46 18a-hadoop_api.asciidoc
@@ -1,46 +0,0 @@
-== The Hadoop Java API ==
-
-=== When to use the Hadoop Java API ===
-
-Don't.
-
-=== How to use the Hadoop Java API ===
-
-The Java API provides direct access but requires you to write the entirety of your program from boilerplate
-
-Decompose your problem into small isolable transformations. Implement each as a Pig Load/StoreFunc or UDF (User-Defined Function) making calls to the Hadoop API.[^1]
-
-=== The Skeleton of a Hadoop Java API program ===
-
-I'll trust that for very good reasons -- to interface with an outside system, performance, a powerful library with a Java-only API -- Java is the best choice to implement
-
- [ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
-
-When this happens around the office, we sing this little dirge[^2]:
-
- Life, sometimes, is Russian Novel. Is having unhappy marriage and much snow and little vodka.
- But when Russian Novel it is short, then quickly we finish and again is Sweet Valley High.
-
-What we *don't* do is write a pure Hadoop-API Java program. In practice, those look like this:
-
- HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT PARSING PARAMS
- HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT READING FILES
- COBBLED-TOGETHER CODE TO DESERIALIZE THE FILE, HANDLE SPLITS, ETC
-
- [ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
-
- COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S FLATTEN COMMAND DOES
- COBBLED-TOGETHER CODE THAT KINDA DOES WHAT PIG'S CROSS COMMAND DOES
- A SIMPLE COMBINER COPIED FROM TOM WHITE'S BOOK
- 1000 LINES OF CODE TO DO WHAT RUBY COULD IN THREE LINES OF CODE
- HORRIBLE BOILERPLATE TO DO A CRAPPY BUT SERVICABLE JOB AT WRITING FILES
- UGLY BUT NECESSARY CODE TO GLUE THIS TO THE REST OF THE ECOSYSTEM
-
-The surrounding code is ugly and boring; it will take more time, produce more bugs, and carry a higher maintenance burden than the important stuff. More importantly, the high-level framework provides an implementation far better than it's worth your time to recreate.[^3]
-
-Instead, we write
-
- A SKELETON FOR A PIG UDF DEFINITION
- [ Your breathtakingly elegant, stunningly performant solution to a novel problem ]
-
- A PIG SCRIPT
11 19b-pig_udfs.asciidoc
@@ -4,17 +4,6 @@
placeholder
-''''
-
-[^1] Doesn't have to be Pig -- Hive, Cascading, and Crunch and other high-level frameworks abstract out the boring stuff while still making it easy to write custom components.
-
-[^2] If the novel lasts all week, someone will tell this joke and then we will walk carefully to the bar.
-
- The church, it is close by -- but the way is cold and icy.
- The bar, it is far away -- but we shall walk carefully.
-
-[^3] ... and when the harsh reality of a production dataset reveals that your data has an unforseen and crippling "stuck reducer" problem, you're facing a fundamental re-think of your program's design rather than a one-line change from `JOIN` to `SKEW JOIN`. See the chapter on Advanced Pig.
-
=== Algebraic UDFs let Pig go fast ===
241 21-hadoop_internals.asciidoc
@@ -1,3 +1,244 @@
[[hadoop_internals]]
== Hadoop Internals
+One of the wonderful and terrible things about Hadoop (or anything else at Big Data scale) is that there are very few boundary cases for performance optimization. If your dataflow has the wrong shape, it is typically so catastrophically inefficient as to be unworkable. Otherwise, Hadoop’s scalability makes the price of simply throwing more hardware at the problem competitive with investing to optimize it, especially for exploratory analytics. That’s why the repeated recommendation of this book is: Get the algorithm right, get the contextability right and size your cluster to your job.
+
+When you begin to productionize the results of all your clever work, it becomes valuable to look for these 30-percent, 10-percent improvements. Debug loop time is also important, though, so it is useful to get a feel for when and why early optimization is justified.
+
+The first part of this chapter discusses Tuning for the Wise and Lazy -- showing you how to answer the question, “Is my job slow and should I do anything about it?” Next, we will discuss Tuning for the Brave and Foolish, diving into the details of Hadoop’s maddeningly numerous, often twitchy, configuration knobs. There are low-level setting changes that can dramatically improve runtime; we will show you how to recognize them and how to determine what per-job overrides to apply. Lastly, we will discuss a formal approach for diagnosing the health and performance of your Hadoop cluster and your Hadoop jobs to help you confidently and efficiently identify the source of deeper flaws or bottlenecks.
+
+=== Chimpanzee and Elephant: A Day at Work ===
+
+Each day, the chimpanzee's foreman, a gruff silverback named J.T., hands each chimp the day's translation manual and a passage to translate as they clock in. Throughout the day, he also coordinates assigning each block of pages to chimps as they signal the need for a fresh assignment.
+
+Some passages are harder than others, so it's important that any elephant can deliver page blocks to any chimpanzee -- otherwise you'd have some chimps goofing off while others are stuck translating _King Lear_ into Kinyarwanda. On the other hand, sending page blocks around arbitrarily will clog the hallways and exhaust the elephants.
+
+The elephants' chief librarian, Nanette, employs several tricks to avoid this congestion.
+
+Since each chimpanzee typically shares a cubicle with an elephant, it's most convenient to hand a new page block across the desk rather than carry it down the hall. J.T. assigns tasks accordingly, using a manifest of page blocks he requests from Nanette. Together, they're able to make most tasks be "local".
+
+Second, the page blocks of each play are distributed all around the office, not stored in one book together. One elephant might have pages from Act I of _Hamlet_, Act II of _The Tempest_, and the first four scenes of _Troilus and Cressida_ footnote:[Does that sound complicated? It is -- Nanette is able to keep track of all those blocks, but if she calls in sick, nobody can get anything done. You do NOT want Nanette to call in sick.]. Also, there are multiple 'replicas' (typically three) of each book collectively on hand. So even if a chimp falls behind, JT can depend that some other colleague will have a cubicle-local replica. (There's another benefit to having multiple copies: it ensures there's always a copy available. If one elephant is absent for the day, leaving her desk locked, Nanette will direct someone to make a xerox copy from either of the two other replicas.)
+
+Nanette and J.T. exercise a bunch more savvy optimizations (like handing out the longest passages first, or having folks who finish early pitch in so everyone can go home at the same time, and more). There's no better demonstration of power through simplicity.
+
+=== Tuning For The Wise and Lazy
+
+Before tuning your job, it is important to first understand whether it is slow and if so, where the bottleneck arises. Generally speaking, your Hadoop job will either be CPU-bound within your code, memory-bound within your code, bound by excessive Reducer data (causing too many merge passes) or throughput-bound within the framework or underlying system. (In the rare cases it is not one of those, you would next apply the more comprehensive “USE method” (TODO: REF) described later in this chapter.) But for data-intensive jobs, the fundamental upper bound on Hadoop’s performance is its effective throughput from disk. You cannot make your job process data faster than it can load it.
+
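+To put a rough number on that: suppose (purely for illustration) each worker has four data disks that sustain roughly 100 MB/s apiece. A ten-worker cluster can then stream on the order of 4 GB/s in aggregate, so a 1 TB input costs at least about 250 seconds of wall-clock time just to read, no matter how clever your code is. Numbers like these are the baseline every later measurement gets compared against.
+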
+Here is the conceptual model we will use. Apart from necessary overhead, job coordination and so forth, a Hadoop job:
+
+* Streams data to the Mapper, either locally from disk or remotely over the network;
+* Runs that data through your code;
+* Spills the midstream data to disk one or more times;
+* Applies combiners, if any;
+* Merge/Sorts the spills and sends the over the network to the Reducer.
+
+The Reducer:
+
+* Writes each Mapper’s output to disk;
+* Performs some number of Merge/Sort passes;
+* Reads the data from disk;
+* Runs that data through your code;
+* Writes its output to the DataNode, which writes that data once to disk and twice through the network;
+* Receives two other Reducers’ output from the network and writes those to disk.
+
+Hadoop is, of course, pipelined; to every extent possible, those steps are happening at the same time. What we will do, then, is layer in these stages one by one, at each point validating that your job is as fast as your disk until we hit the stage where it is not.
+
+==== Fixed Overhead
+
+The first thing we want is a job that does nothing; this will help us understand the fixed overhead costs. Actually, what we will run is a job that does almost nothing; it is useful to know that your test really did run.
+
+----
+(TODO: Disable combining splits)
+Load 10_tiny_files
+Filter most of it out
+Store to disk
+
+(TODO: Restrict to 50 Mapper slots or rework)
+Load 10000_tiny_files
+Filter most of it out
+Store to disk
+----
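+
+In Pig, the first of those two jobs might look something like the sketch below; the paths, schema and filter condition are placeholders, `pig.splitCombination` is the property we believe controls split combining in recent Pig versions, and the second job is simply the same script pointed at the 10,000-file directory:
+
+----
+SET pig.splitCombination 'false';   -- assumed knob: keep each tiny file as its own split
+
+tiny_lines = LOAD '/tmp/tuning/10_tiny_files/*' AS (line:chararray);
+a_sliver   = FILTER tiny_lines BY line MATCHES 'zzz.*';   -- filters nearly every record out
+STORE a_sliver INTO '/tmp/tuning/fixed_overhead_10';
+----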
+(TODO: is there a way to limit the number of Reduce slots in Pig? Otherwise, revisit the below.)
+
+In (TODO: REF), there is a performance comparison worksheet that you should copy and fill in as we go along. It lists the performance figures for several reference clusters on both cloud and dedicated environments for each of the tests we will perform. If your figures do not compare well with the appropriate reference cluster, it is probably worthwhile adjusting the overall configuration. Assuming your results are acceptable, you can tape the worksheet to your wall and use it to baseline all the jobs you will write in the future. The rest of the chapter will assume that your cluster is large enough to warrant tuning but not grossly larger than the largest reference cluster.
+
+If you run the Pig script above (TODO: REF), Hadoop will execute two jobs: one with 10 Mappers and no Reducers and another with 10,000 Mappers and no Reducers. From the Hadoop Job Tracker page for your job, click on the link showing the number of Map tasks to see the full task listing. All 10 tasks for the first job should have started at the same time and uniformly finished a few seconds after that. Back on the main screen, you should see that the total job completion time was more or less identical to that of the slowest Map task.
+
+The second job ran its 10,000 Map tasks through a purposefully restricted 50 Mapper slots -- so each Mapper slot will have processed around 200 files. If you click through to the Task listing, the first wave of tasks should start simultaneously and all of them should run in the same amount of time that the earlier test did.
+
+(TODO: show how to find out if one node is way slower)
+
+Even in this trivial case, there is more variance in launch and runtimes than you might first suspect (if you don't see it here, you definitely will in the next example -- but for continuity, we will discuss it here). If that splay -- the delay between the bulk of tasks finishing and the final task finishing -- is larger than the runtime of a typical task, it may indicate a problem; as long as it is only a few seconds, don't sweat it. If you are interested in a minor but worth-it tweak, set `mapred.job.reuse.jvm.num.tasks` to `-1`, which causes each task slot to reuse the same child-process JVM across its attempts, eliminating the brief but noticeable JVM startup time. If you are writing your own native Java code, you might know a reason to stick with the default (disabled) setting, but JVM reuse is harmless for any well-behaved program.
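+
+If you drive your jobs from Pig, that override can ride along in the script itself:
+
+----
+SET mapred.job.reuse.jvm.num.tasks '-1';   -- -1 means "reuse the child JVM indefinitely"
+----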
+
+On the Job screen, you should see that the second job took roughly 200 times as long as the first, and not much more than 200 times a typical task’s runtime; if not, you may be putting pressure on the Job Tracker. Rerun your job and watch the Job Tracker’s heap size; you would like the Job Tracker heap to spend most of its life below, say, 50 percent of its limit -- any significant excursion toward 100 percent will unnecessarily impede cluster performance. The 1 GB out-of-the-box setting is fairly small; for production use we recommend at least 3 GB of heap on a dedicated machine with at least 7 GB total RAM.
+
+If the Job coordination overhead is unacceptable but the Job Tracker heap is not to blame, a whole host of other factors might be involved; apply the USE method, described (TODO: REF).
+
+=== Mapper Input
+
+Now that we’ve done almost nothing, let’s do almost something -- read in a large amount of data, writing just enough to disk to know that we really were there.
+
+----
+Load 100 GB from disk
+Filter all but 100 MB
+Store it to disk
+----
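+
+As a Pig sketch (the path and the filter regex are placeholders; any cheap, highly selective filter over the full 100 GB will do):
+
+----
+events = LOAD '/data/gold/github_archive' AS (json:chararray);
+a_few  = FILTER events BY json MATCHES '.*"type":"GollumEvent".*';   -- a rare event type, so only a sliver survives
+STORE a_few INTO '/tmp/tuning/mapper_input_test';
+----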
+
+Run that job on the 100-GB GitHub archive dataset. (TODO: Check that it will do speculative execution.) Once the job completes, you will see as many successful Map tasks as there were HDFS blocks in the input; if you are running a 128-MB block size, this will be about (TODO: How many blocks are there?).
+
+Again, each Map task should complete in a uniform amount of time and the job as a whole should take about ‘length_of_Map_task * (number_of_Map_tasks / number_of_Mapper_slots)’. The Map phase does not end until every Mapper task has completed and, as we saw in the previous example, even in typical cases, there is some amount of splay in runtimes.
+
+(TODO: Move some of JT and Nanette’s optimizations forward to this chapter). Like the chimpanzees at quitting time, the Map phase cannot finish until all Mapper tasks have completed.
+
+You will probably notice a half-dozen or so killed attempts as well. The ‘TODO: name of speculative execution setting’, which we recommend enabling, causes Hadoop to opportunistically launch a few duplicate attempts for the last few tasks in a job. The faster job cycle time justifies the small amount of duplicate work.
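+
+(We believe the stock Hadoop 1.x knobs are `mapred.map.tasks.speculative.execution` and `mapred.reduce.tasks.speculative.execution`; from a Pig script, the per-job override looks like this.)
+
+----
+SET mapred.map.tasks.speculative.execution    'true';
+SET mapred.reduce.tasks.speculative.execution 'true';
+----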
+
+Check that there are few non-local Map tasks -- Hadoop tries to assign Map attempts (TODO: check tasks versus attempts) to run on a machine whose DataNode holds that input block, thus avoiding a trip across the network (or in the chimpanzees’ case, down the hallway). It is not that costly, but if you are seeing a large number of non-local tasks on a lightly-loaded cluster, dig deeper.
+
+Dividing the size of an HDFS block by the average runtime of a full-block Map task gives you the Mapper’s data rate. In this case, since we did almost nothing and wrote almost nothing, that value is your cluster’s effective top speed. This has two implications: First, you cannot expect a data-intensive job to run faster than its top speed. Second, there should be apparent reasons for any job that runs much slower than its top speed. Tuning Hadoop is basically about making sure no other part of the system is slower than the fundamental limit at which it can stream from disk.
+
+While setting up your cluster, it might be worth baselining Hadoop’s top speed against the effective speed of your disk and your network. Follow the instructions for the ‘scripts/baseline_performance’ script (TODO: write script) from the example code above. It uses a few dependable user-level processes to measure the effective data rate to disk (‘DD’ and ‘CP’) and the effective network rate (‘NC’ and ‘SCP’). (We have purposely used user-level processes to account for system overhead; if you want to validate that as well, use a benchmark like Bonnie++ (TODO: link)). If you are on dedicated hardware, the network throughput should be comfortably larger than the disk throughput. If you are on cloud machines, this, unfortunately, might not hold, but it should not be atrociously lower.
+
+If the effective top speed you measured above is not within (TODO: figure out healthy percent) percent of those baseline figures, dig deeper; otherwise, record each of these numbers on your performance comparison chart.
+
+If you're setting up your cluster, take the time to generate enough additional data to keep your cluster fully saturated for 20 or more minutes and then ensure that each machine processed about the same amount of data. There is a lot more variance in effective performance among machines than you might expect, especially in a public cloud environment, and this exercise can also catch a machine with faulty hardware or setup. It is a crude but effective benchmark; if you're investing heavily in a cluster, consider running a comprehensive benchmarking suite on all the nodes -- the chapter on Stupid Hadoop Tricks shows how (TODO: ref).
+
+=== The Many Small Files Problem
+
+One of the most pernicious ways to impair a Hadoop cluster’s performance is the “many-small-files” problem. With a 128-MB block size (which we will assume for the following discussion), a 128-MB file takes one block (obviously), a 1-byte file takes one block and a 128-MB+1 byte file takes two blocks, one of them full, the other with one solitary byte.
+
+Storing 10 GB of data in, say, 100 files is harmless -- the average block occupancy is a mostly-full 100 MB. Storing that same 10 GB in, say, 10,000 files is, however, harmful in several ways. At the heart of the Namenode is a table that lists every file and block. As you would expect, the memory usage of that table roughly corresponds to the number of files plus the number of blocks, so the many-small-files example uses about 100 times as much memory as warranted. Engage in that bad habit often enough and you will start putting serious pressure on the Namenode heap and lose your job shortly thereafter. What is more, the many-small-files version will require 10,000 Map tasks, causing memory pressure on the Job Tracker and a job whose runtime is dominated by task overhead. Lastly, there is the simple fact that working with 10,000 things is more annoying than working with 100 -- it takes up space in DataNode heartbeats, client requests, your terminal screen and your head.
+
+This situation is easier to arrive at than you might expect; in fact, you just did so. The 100-GB job you just ran most likely used about 800 Map tasks yet output only a few MB of data. Any time your Mapper output is significantly smaller than its input -- for example, when you apply a highly-restrictive filter to a large input -- your output files will have poor occupancy.
+
+A sneakier version of this is a slightly “expansive” Mapper-Only job. A job whose Mappers turned a 128-MB block into, say, 150 MB of output data would reduce the block occupancy by nearly half and require nearly double the Mapper slots in the following jobs. Done once, that is merely annoying but in a workflow that iterates or has many stages, the cascading dilution could become dangerous.
+
+You can audit your HDFS to see if this is an issue using the ‘hadoop fsck [directory]’ command. Running that command against the directory holding the GitHub data should show 100 GB of data in about 800 blocks. Running it against your last job’s output should show only a few MB of data in an equivalent number of blocks.
+
+You can always distill a set of files by doing ‘group_by’ with a small number of Reducers using the record itself as a key. Pig and Hive both have settings to mitigate the many-small-files problem. In Pig, the (TODO: find name of option) setting will feed multiple small files to the same Mapper; in Hive (TODO: look up what to do in Hive). In both cases, we recommend modifying your configuration to make that the default and disable it on a per-job basis when warranted.
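+
+For the distill-by-grouping trick, a Pig sketch (relation names and paths are made up; the `PARALLEL` clause is what actually limits the number of output files):
+
+----
+shards    = LOAD '/tmp/tuning/mapper_input_test' AS (line:chararray);
+by_record = GROUP shards BY line PARALLEL 3;            -- three Reducers, so three output files
+repacked  = FOREACH by_record GENERATE FLATTEN(shards);
+STORE repacked INTO '/tmp/tuning/mapper_input_repacked';
+----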
+
+=== Midstream Data
+
+Now let’s start to understand the performance of a proper Map/Reduce job. Run the following script, again, against the 100 GB GitHub data.
+
+----
+Parallel 50
+Disable optimizations for pushing up filters and for Combiners
+Load 100 GB of data
+Group by record itself
+Filter out almost everything
+Store data
+----
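+
+A Pig rendering of that pseudocode might look like the following -- `pig.exec.nocombiner` is the property we believe disables Combiners, and the filter-pushing rule is switched off on the command line (the flag and rule names are assumptions to check against your Pig version):
+
+----
+-- invoke as: pig -optimizer_off PushUpFilter midstream_test.pig
+SET pig.exec.nocombiner 'true';    -- keep Pig from combining on the Map side
+SET default_parallel 50;
+
+events    = LOAD '/data/gold/github_archive' AS (json:chararray);
+by_record = GROUP events BY json;                         -- ships all 100 GB to the Reducers
+rare      = FILTER by_record BY COUNT(events) > 10000L;   -- drops nearly every group
+STORE rare INTO '/tmp/tuning/midstream_test';
+----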
+
+The purpose of that job is to send 100 GB of data at full speed through the Mappers and midstream processing stages but to do almost nothing in the Reducers and write almost nothing to disk. To keep Pig from “helpfully” economizing the amount of midstream data, you will notice in the script we disabled some of its optimizations. The number of Map tasks and their runtime should be effectively the same as in the previous example, and all the sanity checks we’ve given so far should continue to apply. The overall runtime of the Map phase should only be slightly longer (TODO: how much is slightly?) than in the previous Map-only example, depending on how well your network is able to outpace your disk.
+
+It is an excellent idea to get into the habit of predicting the record counts and data sizes in and out of both Mapper and Reducer based on what you believe Hadoop will be doing to each record and then comparing to what you see on the Job Tracker screen. In this case, you will see identical record counts for Mapper input, Mapper output and Reducer input and nearly identical data sizes for HDFS bytes read, Mapper output, Mapper file bytes written and Reducer input. The reason for the small discrepancies is that, for the file system metrics, Hadoop is recording everything that is read or written, including logged files and so forth.
+
+Midway or so through the job -- well before the finish of the Map phase -- you should see the Reducer tasks start up; their eagerness can be adjusted using the (TODO: name of setting) setting. By starting them early, the Reducers are able to begin merge/sorting the various Map task outputs in parallel with the Map phase. If you err low on this setting, you will disappoint your coworkers by consuming Reducer slots that sit largely idle early on, but that is better than starting them too late, which will sabotage parallelism.
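+
+(On Hadoop 1.x we believe that knob is `mapred.reduce.slowstart.completed.maps`, the fraction of Map tasks that must finish before Reducers launch; the value here is only an illustration.)
+
+----
+SET mapred.reduce.slowstart.completed.maps '0.70';   -- launch Reducers once 70 percent of Map tasks are done
+----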
+
+Visit the Reducer tasks listing. Each Reducer task should have taken a uniform amount of time, very much longer than the length of the Map tasks. Open a few of those tasks in separate browser tabs and look at their counters; each should have roughly the same input record count and data size. It is annoying that this information is buried as deeply as it is because it is probably the single most important indicator of a flawed job; we will discuss it in detail a bit later on.
+
+==== Spills
+
+First, though, let’s finish understanding the data’s detailed journey from Mapper to Reducer. As a Map task outputs records, Hadoop sorts them in the fixed-size io.sort buffer. Hadoop files records into the buffer in partitioned, sorted order as it goes. When that buffer fills up (or the attempt completes), Hadoop begins writing to a new empty io.sort buffer and, in parallel, “spills” that buffer to disk. As the Map task concludes, Hadoop merge/sorts these spills (if there were more than one) and sends the sorted chunks to each Reducer for further merge/sorting.
+
+The Job Tracker screen shows the number of Mapper spills. If the number of spills equals the number of Map tasks, all is good -- the Mapper output is checkpointed to disk before being dispatched to the Reducer. If the size of your Map output data is large, having multiple spills is the natural outcome of using memory efficiently; that data was going to be merge/sorted anyway, so it is a sound idea to do it on the Map side where you are confident it will have a uniform size distribution.
+
+(TODO: do combiners show as multiple spills?)
+
+What you hate to see, though, are Map tasks with two or three spills. As soon as you have more than one spill, the data has to be initially flushed to disk as output, then read back in full and written again in full for at least one merge/sort pass. Even the first extra spill can cause roughly a 30-percent increase in Map task runtime.
+
+There are two frequent causes of unnecessary spills. First is the obvious one: Mapper output that slightly outgrows the io.sort buffer size. We recommend sizing the io.sort buffer to comfortably accommodate Map task output slightly larger than your typical HDFS block size -- the next section (TODO: REF) shows you how to calculate it. In the significant majority of jobs that involve a Reducer, the Mapper output is the same size or nearly the same size as its input: direct JOINs or GROUPs, perhaps preceded by a projection or filter, or carrying a few additional derived fields. If you see many of your Map tasks tripping slightly over that limit, it is probably worth requesting a larger io.sort buffer specifically for your job.
+
+There is also a sillier, more disappointing way to cause unnecessary spills: The io.sort buffer holds both the records it will later spill to disk and an index to maintain the sorted order. An unfortunate early design decision set a fixed size on both of those, with fairly confusing control knobs. The ‘iosortrecordpercent’ (TODO: check name of setting) setting gives the size of that index as a fraction of the sort buffer. Hadoop spills to disk when either the fraction devoted to records or the fraction devoted to the index becomes full. If your output is long and skinny -- cumulatively not much more than an HDFS block, but with a typical record size smaller than, say, 100 bytes -- you will end up spilling multiple small chunks to disk when you could have easily afforded to increase the size of the bookkeeping buffer.
+
+There are lots of ways to cause long, skinny output, but set a special trigger in your mind for cases where you have long, skinny input, where you turn an adjacency-listed graph into an edge-listed graph, or where you otherwise FLATTEN bags of records on the Mapper side. In each of these cases, the later section (TODO: REF) will show you how to calculate the right settings.
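+
+The per-job overrides look something like the following; the property names are the stock Hadoop 1.x ones, and the values are illustrations to size against your own record lengths rather than recommendations:
+
+----
+-- room for Map output a bit larger than one 128-MB block
+SET io.sort.mb '200';
+SET io.sort.spill.percent '0.99';     -- spill only when the buffer is nearly full
+-- each buffered record costs roughly 16 bytes of index, so for ~100-byte records
+-- the index needs about 16/(16+100), or ~0.14, of the buffer:
+SET io.sort.record.percent '0.15';
+----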
+
+(TODO: either here or later, talk about the surprising cases where you fill up MapRed scratch space or FS.S3.buffer.dir and the rest of the considerations about where to put this).
+
+
+==== Combiners
+
+It is a frequent case that the Reducer output is smaller than its input (and kind of annoying that the word “Reducer” was chosen, since it frequently is not smaller). “Algebraic” aggregations such as COUNT, AVG and many others can implement part of the Reducer operation on the Map side, greatly lessening the amount of data sent to the Reducer.
+
+Pig and Hive are written to use Combiners whenever generically appropriate. Applying a Combiner requires extra passes over your data on the Map side, though, and so in some cases can cost much more time than it saves.
+
+If you run a DISTINCT operation over a data set with 50-percent duplicates, the Combiner is easily justified, since many duplicate pairs will be eliminated early. If, however, only a tiny fraction of records are duplicated, only a vanishingly tiny fraction of those will occur on the same Mapper, so you will have spent disk and CPU without reducing the data size.
+
+Whenever your Job Tracker output shows that Combiners are being applied, check that the Reducer input data is, in fact, diminished. (TODO: check which numbers show this) If Pig or Hive have guessed badly, disable the (TODO: name of setting) setting in Pig or the (TODO: name of setting) setting in Hive.
+
+==== Reducer Merge (aka Shuffle and Sort)
+
+We are now ready to dig into the stage with the most significant impact on job performance, the merge/sort performed by the Reducer before processing. In almost all the rest of the cases discussed here, an inefficient choice causes only a marginal impact on runtime. Bring down too much data on your Reducers, however, and you will find that, two hours into the execution of what you thought was a one-hour job, a handful of Reducers indicate they have four hours left to run.
+
+First, let’s understand what is going on and describe healthy execution; then, we will discuss various ways it can go wrong and how to address them.
+
+As you just saw, data arrives from the Mappers pre-sorted. The Reducer pulls those sorted chunks over the network into its own sort buffer. Once a threshold (controlled by the (TODO: name of setting) setting) of data has been received, the Reducer commissions a new sort buffer and separately spills the data to disk, merge/sorting the Mapper chunks as it goes. (TODO: check that this first merge/sort happens on spill)
+
+Enough of these spills later (controlled by the (TODO: setting) setting), the Reducer begins merge/sorting the spills into a larger combined chunk. All of this activity is happening in parallel, so by the time the last Map task output is received, the typical healthy situation is to have a modest number of large sorted chunks and one small-ish chunk holding the dregs of the final spill. Once the number of chunks is below the (TODO: look up name of setting) threshold, the merge/sort is complete -- it does not need to fully merge the data into a single file on disk. Instead, it opens an input stream onto each of those final chunks, consuming them in sort order.
+
+Notice that the Reducer flushes the last spill of received Map data to disk, then immediately starts reconsuming it. If the memory needs of your Reducer are modest, you can instruct Hadoop to use the sort buffer directly in the final merge, eliminating the cost and delay of that final spill. It is a nice marginal improvement when it works, but if you are wrong about how modest your Reducer’s memory needs are, the negative consequences are high; and if your Reducers have to perform multiple merge/sort passes, the benefits are insignificant.
+
+For a well-tested job heading to production that requires one or fewer merge/sort passes, you may judiciously (TODO: describe how to adjust this).
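+
+(The settings below are the Hadoop 1.x knobs we believe govern the Reducer-side merge, including the in-memory final merge just described; treat the values as illustrations, not defaults to copy.)
+
+----
+SET mapred.job.shuffle.input.buffer.percent '0.70';   -- share of Reducer heap that holds incoming Map output
+SET mapred.job.shuffle.merge.percent '0.66';          -- start merging spills once the shuffle buffer is this full
+SET io.sort.factor '25';                              -- number of chunks combined in one merge pass
+SET mapred.job.reduce.input.buffer.percent '0.70';    -- let the final merge feed the Reducer from memory
+----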
+
+(TODO: discuss buffer sizes here or in Brave and Foolish section)
+(TODO: there is another setting that I’m forgetting here - what is it?)
+
+Once your job has concluded, you can find the number of merge/sort passes by consulting the Reduce tasks counters (TODO: DL screenshot and explanation). During the job, however, the only good mechanism is to examine the Reducer logs directly. At some reasonable time after the Reducer has started, you will see it initiate spills to disk (TODO: tell what the log line looks like). At some later point, it will begin merge/sorting those spills (TODO: tell what the log line looks like).
+
+The CPU burden of a merge/sort is vanishingly small against the dominating cost of reading then writing the data to disk. If, for example, your job only triggered one merge/sort pass halfway through receiving its data, the cost of the merge/sort is effectively one and a half times the base cost of writing that data at top speed to disk: all of the data was spilled once, and half of it was rewritten as merged output. Comparing the total size of data received by the Reducer to the merge/sort settings will let you estimate the expected number of merge/sort passes; that number, along with the “top speed” figure you collected above, will, in turn, allow you to estimate how long the Reduce should take. Much of this action happens in parallel, but keep in mind that it is happening alongside your Mappers’ mapping, spilling and everything else going on on the machine.
+
+A healthy, data-intensive job will have Mappers with nearly top speed throughput, the expected number of merge/sort passes and the merge/sort should conclude shortly after the last Map input is received. (TODO: tell what the log line looks like). In general, if the amount of data each Reducer receives is less than a factor of two to three times its share of machine RAM, (TODO: should I supply a higher-fidelity thing to compare against?) all those conditions should hold. Otherwise, consult the USE method (TODO: REF).
+
+If the merge/sort phase is killing your job’s performance, it is most likely because either all of your Reducers are receiving more data than they can accommodate or because some of your Reducers are receiving far more than their fair share. We will take the uniform distribution case first.
+
+The best fix to apply is to send less data to your Reducers. The chapters on writing Map/Reduce jobs (TODO: REF or whatever we are calling Chapter 5) and the chapter on advanced Pig (TODO: REF or whatever we are calling that now) both have generic recommendations for how to send around less data and throughout the book, we have described powerful methods in a domain-specific context which might translate to your problem.
+
+If you cannot lessen the data burden, well, the laws of physics and economics must be obeyed. The cost of a merge/sort is ‘O(N log N)’. In a healthy job, however, most of the merge/sort has been paid down by the time the final merge pass begins, so up to that limit, your Hadoop job should run in ‘O(N)’ time governed by its top speed.
+
+The cost of excessive merge passes, however, accrues directly to the total runtime of the job. Even though there are other costs that increase with the number of machines, the benefits of avoiding excessive merge passes are massive. A cloud environment makes it particularly easy to arbitrage the laws of physics against the laws of economics -- it costs the same to run 60 machines for two hours as it does to run ten machines for 12 hours. As long as your runtime stays roughly linear with the increased number of machines, you should always size your cluster to your job, not the other way around; the thresholding behavior of excessive merge passes makes it exceptionally valuable to do so. This is why we feel exploratory data analytics is far more efficiently done in an elastic cloud environment, even given the quite significant performance hit you take. Any physical cluster is simultaneously too large and too small: you are overpaying for your cluster overnight while your data scientists sleep, and you are overpaying your data scientists to hold roller-chair sword fights while their undersized cluster runs. Our rough rule of thumb is to have not more than 2-3 times as much total Reducer data as you have total child heap size on all the Reducer machines you'll use.
+
+(TODO: complete)
+
+==== Skewed Data and Stuck Reducers
+
+
+
+(TODO: complete)
+
+==== Reducer Processing
+
+(TODO: complete)
+
+==== Commit and Replication
+
+(TODO: complete)
+
+
+=== Top-line Performance/Sanity Checks
+
+* The first wave of Mappers should start simultaneously.
+* In general, all a job’s full block Map attempts should take roughly the same amount of time.
+* The full map phase should take around ‘average_Map_task_time*(number_of_Map_tasks/number_of_Mapper_slots+1)’
+* Very few non-local Map tasks.
+* Number of spills equals number of Map tasks (unless there are Combiners).
+* If there are Combiners, the Reducer input data should be much less than the Mapper output data (TODO: check this).
+* Record counts and data sizes for Mapper input, Reducer input and Reducer output should correspond to your conception of what the job is doing.
+* Map tasks are full speed (data rate matches your measured baseline)
+* Most Map tasks process a full block of data.
+* Processing stage of Reduce attempts should be full speed.
+* Not too many Merge passes in Reducers.
+* Shuffle and sort time is explained by the number of Merge passes.
+* Commit phase should be brief.
+* Total job runtime is not much more than the combined Map phase and Reduce phase runtimes.
+* Reducers generally process the same amount of data.
+* Most Reducers process at least enough data to be worth it.
+
+// ____________________________________
+
+=== Performance Comparison Worksheet
+
+(TODO: DL Make a table comparing performance baseline figures on AWS and fixed hardware. reference clusters.)
+
+
View
231 21d-hadoop_internals-tuning.asciidoc.md
@@ -1,231 +0,0 @@
-18. *Hadoop Tuning*
- - Tuning for the Wise and Lazy
- - Tuning for the Brave and Foolish
- - The USE Method for understanding performance and diagnosing problems
-
-One of the wonderful and terrible things about Hadoop (or anything else at Big Data scale) is that there are very few boundary cases for performance optimization. If your dataflow has the wrong shape, it is typically so catastrophically inefficient as to be unworkable. Otherwise, Hadoop’s scalability makes the price of simply throwing more hardware at the problem competitive with investing to optimize it, especially for exploratory analytics. That’s why the repeated recommendation of this book is: Get the algorithm right, get the contextability right and size your cluster to your job.
-
-When you begin to productionize the results of all your clever work, it becomes valuable to look for these 30-percent, 10-percent improvements. Debug loop time is also important, though, so it is useful to get a feel for when and why early optimization is justified.
-
-The first part of this chapter discusses Tuning for the Wise and Lazy -- showing you how to answer the question, “Is my job slow and should I do anything about it?” Next, we will discuss Tuning for the Brave and Foolish, diving into the details of Hadoop’s maddeningly numerous, often twitchy, configuration knobs. There are low-level setting changes that can dramatically improve runtime; we will show you how to recognize them and how to determine what per-job overrides to apply. Lastly, we will discuss a formal approach for diagnosing the health and performance of your Hadoop cluster and your Hadoop jobs to help you confidently and efficiently identify the source of deeper flaws or bottlenecks.
-
-=== Tuning For The Wise and Lazy
-
-Before tuning your job, it is important to first understand whether it is slow and if so, where the bottleneck arises. Generally speaking, your Hadoop job will either be CPU-bound within your code, memory-bound within your code, bound by excessive Reducer data (causing too many merge passes) or throughput-bound within the framework or underlying system. (In the rare cases it is not one of those, you would next apply the more comprehensive “USE method” (TODO: REF) described later in this chapter.) But for data-intensive jobs, the fundamental upper bound on Hadoop’s performance is its effective throughput from disk. You cannot make your job process data faster than it can load it.
-
-Here is the conceptual model we will use: Apart from necessary overhead, job coordination and so forth, a Hadoop job must:
-
-* Streams data to the Mapper, either locally from disk or remotely over the network;
-* Runs that data through your code;
-* Spills the midstream data to disk one or more times;
-* Applies combiners, if any;
-* Merge/Sorts the spills and sends the over the network to the Reducer.
-
-The Reducer:
-
-* Writes each Mapper’s output to disk;
-* Performs some number of Merge/Sort passes;
-* Reads the data from disk;
-* Runs that data through your code;
-* Writes its output to the DataNode, which writes that data once to disk and twice through the network;
-* Receives two other Reducers’ output from the network and writes those to disk.
-
-Hadoop is, of course, pipelined; to every extent possible, those steps are happening at the same time. What we will do, then, is layer in these stages one by one, at each point validating that your job is as fast as your disk until we hit the stage where it is not.
-
-==== Fixed Overhead
-
-The first thing we want is a job that does nothing; this will help us understand the fixed overhead costs. Actually, what we will run is a job that does almost nothing; it is useful to know that your test really did run.
-
-----
-(TODO: Disable combining splits)
-Load 10_tiny_files
-Filter most of it out
-Store to disk
-
-(TODO: Restrict to 50 Mapper slots or rework)
-Load 10000_tiny_files
-Filter most of it out
-Store to disk
-----
-(TODO: is there a way to limit the number of Reduce slots in Pig? Otherwise, revisit the below.)
-
-In (TODO: REF), there is a performance comparison worksheet that you should copy and fill in as we go along. It lists the performance figures for several reference clusters on both cloud and dedicated environments for each of the tests we will perform. If your figures do not compare well with the appropriate reference cluster, it is probably worthwhile adjusting the overall configuration. Assuming your results are acceptable, you can tape the worksheet to your wall and use it to baseline all the jobs you will write in the future. The rest of the chapter will assume that your cluster is large enough to warrant tuning but not grossly larger than the largest reference cluster.
-
-If you run the Pig script above (TODO: REF), Hadoop will execute two jobs: one with 10 Mappers and no Reducers and another with 10,000 Mappers and no Reducers. From the Hadoop Job Tracker page for your job, click on the link showing the number of Map tasks to see the full task listing. All 10 tasks for the first job should have started at the same time and uniformly finished a few seconds after that. Back on the main screen, you should see that the total job completion time was more or less identical to that of the slowest Map task.
-
-The second job ran its 10,000 Map tasks through a purposefully restricted 50 Mapper slots -- so each Mapper slot will have processed around 200 files. If you click through to the Task listing, the first wave of tasks should start simultaneously and all of them should run in the same amount of time that the earlier test did.
-
-(TODO: show how to find out if one node is way slower)
-
-Even in this trivial case, there is more variance in launch and runtimes than you might first suspect (if you don't, you definitely will in the next -- but for continuity, we will discuss it here). If that splay -- the delay between the bulk of jobs finishing and the final job finishing -- is larger than the runtime of a typical task, however, it may indicate a problem, but as long as it is only a few seconds, don’t sweat it. If you are interested in a minor but worth-it tweak, adjust the `mapred.job.reuse.jvm.num.tasks` setting to ‘-1’, which causes each Mapper to use the same child process JVM for its attempts, eliminating the brief but noticeable JVM startup time. If you are writing your own native Java code, you might know a reason to go with the default (disabled) setting but it is harmless for any well-behaved program.
-
-On the Job screen, you should see that the total runtime for the job was about 200 times slower for the second job than the first and not much more than 200 times the typical task’s runtime; if not, you may be putting pressure on the Job Tracker. Rerun your job and watch the Job Tracker’s heap size; you would like the Job Tracker heap to spend most of its life below, say 50-percent, so if you see it making any significant excursions toward 100-percent, that would unnecessarily impede cluster performance. The 1 GB out-of-the-box setting is fairly small; for production use we recommend at least 3 GB of heap on a dedicated machine with at least 7 GB total RAM.
-
-If the Job coordination overhead is unacceptable but the Job Tracker heap is not to blame, a whole host of other factors might be involved; apply the USE method, described (TODO: REF).
-
-=== Mapper Input
-
-Now that we’ve done almost nothing, let’s do almost something -- read in a large amount of data, writing just enough to disk to know that we really were there.
-
-----
-Load 100 GB from disk
-Filter all but 100 MB
-Store it to disk
-----
-
-Run that job on the 100-GB GitHub archive dataset. (TODO: Check that it will do speculative execution.) Once the job completes, you will see as many successful Map tasks as there were HDFS blocks in the input; if you are running a 128-MB block size, this will be about (TODO: How many blocks are there?).
-
-Again, each Map task should complete in a uniform amount of time and the job as a whole should take about ‘length_of_Map_task*number_of_Map_tasks=number_of_Mapper_slots’. The Map phase does not end until every Mapper task has completed and, as we saw in the previous example, even in typical cases, there is some amount of splay in runtimes.
-
-(TODO: Move some of JT and Nanette’s optimizations forward to this chapter). Like the chimpanzees at quitting time, the Map phase cannot finish until all Mapper tasks have completed.
-
-You will probably notice a half-dozen or so killed attempts as well. The ‘TODO: name of speculative execution setting’, which we recommend enabling, causes Hadoop to opportunistically launch a few duplicate attempts for the last few tasks in a job. The faster job cycle time justifies the small amount of duplicate work.
-
-Check that there are few non-local Map tasks -- Hadoop tries to assign Map attempts (TODO: check tasks versus attempts) to run on a machine whose DataNode holds that input block, thus avoiding a trip across the network (or in the chimpanzees’ case, down the hallway). It is not that costly, but if you are seeing a large number of non-local tasks on a lightly-loaded cluster, dig deeper.
-
-Dividing the average runtime by a full block of Map task by the size of an HDFS block gives you the Mapper’s data rate. In this case, since we did almost nothing and wrote almost nothing, that value is your cluster’s effective top speed. This has two implications: First, you cannot expect a data-intensive job to run faster than its top speed. Second, there should be apparent reasons for any job that runs much slower than its top speed. Tuning Hadoop is basically about making sure no other part of the system is slower than the fundamental limit at which it can stream from disk.
-
-While setting up your cluster, it might be worth baselining Hadoop’s top speed against the effective speed of your disk and your network. Follow the instructions for the ‘scripts/baseline_performance’ script (TODO: write script) from the example code above. It uses a few dependable user-level processes to measure the effective data rate to disk (‘DD’ and ‘CP’) and the effective network rate (‘NC’ and ‘SCP’). (We have purposely used user-level processes to account for system overhead; if you want to validate that as well, use a benchmark like Bonnie++ (TODO: link)). If you are dedicated hardware, the network throughput should be comfortably larger than the disk throughput. If you are on cloud machines, this, unfortunately, might not hold but it should not be atrociously lower.
-
-If the effective top speed you measured above is not within (TODO: figure out healthy percent) percent, dig deeper; otherwise, record each of these numbers on your performance comparison chart.
-
-If you're setting up your cluster, take the time to generate enough additional data to keep your cluster fully saturated for 20 or more minutes and then ensure that each machine processed about the same amount of data. There is a lot more variance in effective performance among machines than you might expect, especially in a public cloud environment; it can also catch a machine with faulty hardware or setup. This is a crude but effective benchmark, but if you're investing heavily in a cluster consider running a comprehensive benchmarking suite on all the nodes -- the chapter on Stupid Hadoop Tricks shows how (TODO ref)
-
-=== The Many Small Files Problem
-
-One of the most pernicious ways to impair a Hadoop cluster’s performance is the “many-small-files” problem. With a 128-MB block size (which we will assume for the following discussion), a 128-MB file takes one block (obviously), a 1-byte file takes one block and a 128-MB+1 byte file takes two blocks, one of them full, the other with one solitary byte.
-
-Storing 10 GB of data in, say, 100 files is harmless -- the average block occupancy is a mostly-full 100 MB. Storing that same 10GB in say 10,000 files is, however, harmful in several ways. At the heart of the Namenode is a table that lists every file and block. As you would expect, the memory usage of that table roughly corresponds to the number of files plus the number of blocks, so the many-small-files example uses about 100 times as much memory as warranted. Engage in that bad habit often enough and you will start putting serious pressure on the Namenode heap and lose your job shortly thereafter. What is more, the many-small-files version will require 10,000 Map tasks, causing memory pressure on the Job Tracker and a job whose runtime is dominated by task overhead. Lastly, there is the simple fact that working with 10,000 things is more annoying than working with 100 -- it takes up space in datanode heartbeats, client requests, your terminal screen and your head.
-
-Causing this situation is easier to arrive at than you might expect; in fact, you just did so. The 100-GB job you just ran most likely used 800 Mapper slots yet output only a few MB of data. Any time your mapper output is significantly smaller than its input -- for example, when you apply a highly-restrictive filter to a large input -- your output files will have poor occupancy.
-
-A sneakier version of this is a slightly “expansive” Mapper-Only job. A job whose Mappers turned a 128-MB block into, say, 150 MB of output data would reduce the block occupancy by nearly half and require nearly double the Mapper slots in the following jobs. Done once, that is merely annoying but in a workflow that iterates or has many stages, the cascading dilution could become dangerous.
-
-You can audit your HDFS to see if this is an issue using the ‘hadoop fsck [directory]’ command. Running that command against the directory holding the GitHub data should show 100 GB of data in about 800 blocks. Running it against your last job’s output should show only a few MB of data in an equivalent number of blocks.
-
-You can always distill a set of files by doing ‘group_by’ with a small number of Reducers using the record itself as a key. Pig and Hive both have settings to mitigate the many-small-files problem. In Pig, the (TODO: find name of option) setting will feed multiple small files to the same Mapper; in Hive (TODO: look up what to do in Hive). In both cases, we recommend modifying your configuration to make that the default and disable it on a per-job basis when warranted.
-
-=== Midstream Data
-
-Now let’s start to understand the performance of a proper Map/Reduce job. Run the following script, again, against the 100 GB GitHub data.
-
-----
-Parallel 50
-Disable optimizations for pushing up filters and for Combiners
-Load 100 GB of data
-Group by record itself
-Filter out almost everything
-Store data
-----
-
-The purpose of that job is to send 100 GB of data at full speed through the Mappers and midstream processing stages but to do almost nothing in the Reducers and write almost nothing to disk. To keep Pig from “helpfully” economizing the amount of midstream data, you will notice in the script we disabled some of its optimizations. The number of Map tasks and their runtime should be effectively the same as in the previous example, and all the sanity checks we’ve given so far should continue to apply. The overall runtime of the Map phase should only be slightly longer (TODO: how much is slightly?) than in the previous Map-only example, depending on how well your network is able to outpace your disk.
-
-It is an excellent idea to get into the habit of predicting the record counts and data sizes in and out of both Mapper and Reducer based on what you believe Hadoop will be doing to each record and then comparing to what you see on the Job Tracker screen. In this case, you will see identical record counts for Mapper input, Mapper output and Reducer input and nearly identical data sizes for HDFS bytes read, Mapper output, Mapper file bytes written and Reducer input. The reason for the small discrepancies is that, for the file system metrics, Hadoop is recording everything that is read or written, including logged files and so forth.
-
-Midway or so through the job -- well before the finish of the Map phase -- you should see the Reducer tasks start up; their eagerness can be adjusted using the (TODO: name of setting) setting. By starting them early, the Reducers are able to begin merge/sorting the various Map task outputs in parallel with the Map phase. If you err low on this setting, you will disappoint your coworkers by consuming Reducer slots with lots of idle time early but that is better than starting them too late, which will sabotage parallels.
-
-Visit the Reducer tasks listing. Each Reducer task should have taken a uniform amount of time, very much longer than the length of the Map tasks. Open a few of those tasks in separate browser tabs and look at their counters; each should have roughly the same input record count and data size. It is annoying that this information is buried as deeply as it is because it is probably the single most important indicator of a flawed job; we will discuss it in detail a bit later on.
-
-==== Spills
-
-First, though, let’s finish understanding the data’s detailed journey from Mapper to Reducer. As a Map task outputs records, Hadoop sorts them in the fixed-size io.sort buffer. Hadoop files records into the buffer in partitioned, sorted order as it goes. When that buffer fills up (or the attempt completes), Hadoop begins writing to a new empty io.sort buffer and, in parallel, “spills” that buffer to disk. As the Map task concludes, Hadoop merge/sorts these spills (if there were more than one) and sends the sorted chunks to each Reducer for further merge/sorting.
-
-The Job Tracker screen shows the number of Mapper spills. If the number of spills equals the number of Map tasks, all is good -- the Mapper output is checkpointed to disk before being dispatched to the Reducer. If the size of your Map output data is large, having multiple spills is the natural outcome of using memory efficiently; that data was going to be merge/sorted anyway, so it is a sound idea to do it on the Map side where you are confident it will have a uniform size distribution.
-
-(TODO: do combiners show as multiple spills?)
-
-What you hate to see, though, are Map tasks with two or three spills. As soon as you have more than one spill, the data has to be initially flushed to disk as output, then read back in full and written again in full for at least one merge/sort pass. Even the first extra spill can cause roughly a 30-percent increase in Map task runtime.
-
-There are two frequent causes of unnecessary spills. First is the obvious one: Mapper output size that slightly outgrows the io.sort buffer size. We recommend sizing the io.sort buffer to comfortably accommodate Map task output slightly larger than your typical HDFS block size -- the next section (TODO: REF) shows you how to calculate. In the significant majority of jobs that involve a Reducer, the Mapper output is the same or nearly the same size -- JOINs or GROUPs that are direct, are preceded by a projection or filter or have a few additional derived fields. If you see many of your Map tasks tripping slightly over that limit, it is probably worth requesting a larger io.sort buffer specifically for your job.
-
-There is also a disappointingly sillier way to cause unnecessary spills: The io.sort buffer holds both the records it will later spill to disk and an index to maintain the sorted order. An unfortunate early design decision set a fixed size on both of those with fairly confusing control knobs. The ‘iosortrecordpercent’ (TODO: check name of setting) setting gives the size of that index as a fraction of the sort buffer. Hadoop spills to disk when either the fraction devoted to records or the fraction devoted to the index becomes full. If your output is long and skinny, cumulatively not much more than an HDFS block but with a typical record size smaller than, say, 100 bytes, you will end up spilling multiple small chunks to disk when you could have easily afforded to increase the size of the bookkeeping buffer.
-
-There are lots of ways to cause long, skinny output but set a special triggers in your mind for cases where you have long, skinny input; turn an adjacency-listed graph into an edge-listed graph or otherwise FLATTEN bags of records on the Mapper side. In each of these cases, the later section (TODO: REF) will show you how to calculate it.
-
-(TODO: either here or later, talk about the surprising cases where you fill up MapRed scratch space or FS.S3.buffer.dir and the rest of the considerations about where to put this).
-
-
-==== Combiners
-
-It is a frequent case that the Reducer output is smaller than its input (and kind of annoying that the word “Reducer” was chosen, since it also frequently is not smaller). “Algebraic” aggregations such as COUNT, AVG and so forth, and many others can implement part of the Reducer operation on the Map side, greatly lessening the amount of data sent to the Reducer.
-
-Pig and Hive are written to use Combiners whenever generically appropriate. Applying a Combiner requires extra passes over your data on the Map side and so, in some cases, can themselves cost much more time than they save.
-
-If you ran a distinct operation over a data set with 50-percent duplicates, the Combiner is easily justified since many duplicate pairs will be eliminated early. If, however, only a tiny fraction of records are duplicated, only a disappearingly-tiny fraction will occur on the same Mapper, so you will have spent disk and CPU without reducing the data size.
-
-Whenever your Job Tracker output shows that Combiners are being applied, check that the Reducer input data is, in fact, diminished. (TODO: check which numbers show this) If Pig or Hive have guessed badly, disable the (TODO: name of setting) setting in Pig or the (TODO: name of setting) setting in Hive.
-
-==== Reducer Merge (aka Shuffle and Sort)
-
-We are now ready to dig into the stage with the most significant impact on job performance, the merge/sort performed by the Reducer before processing. In almost all the rest of the cases discussed here, an inefficient choice causes only a marginal impact on runtime. Bring down too much data on your Reducers, however, and you will find that, two hours into the execution of what you thought was a one-hour job, a handful of Reducers indicate they have four hours left to run.
-
-First, let’s understand what is going on and describe healthy execution; then, we will discuss various ways it can go wrong and how to address them.
-
-As you just saw, data arrives from the Mappers pre-sorted. The Reducer reads them from memory into its own sort buffers. Once a threshold (controlled by the (TODO: name of setting) setting) of data has been received, the Reducer commissions a new sort buffer and separately spills the data to disk, merge/sorting the Mapper chunks as it goes. (TODO: check that this first merge/sort happens on spill)
-
-Enough of these spills later (controlled by the (TODO: setting) setting), the Reducer begins merge/sorting the spills into a larger combined chunk. All of this activity is happening in parallel, so by the time the last Map task output is received, the typical healthy situation is to have a modest number of large sorted chunks and one small-ish chunk holding the dregs of the final spill. Once the number of chunks is below the (TODO: look up name of setting) threshold, the merge/sort is complete -- it does not need to fully merge the data into a single file onto disk. Instead, it opens an input stream onto each of those final chunks, consuming them in sort order.
-
-Notice that the Reducer flushes the last spill of received Map data to disk, then immediately starts reconsuming it. If the memory needs of your Reducer are modest, you can instruct Hadoop to use the sort buffer directly in the final merge, eliminating the cost and delay of that final spill. It is a nice marginal improvement when it works but if you are wrong about how modest your Reducer’s memory needs are, the negative consequences are high and if your Reducers have to perform multiple merge/sort passes, the benefits are insignificant.
-
-For a well-tested job heading to production that requires one or fewer merge/sort passes, you may judiciously (TODO: describe how to adjust this).
-
-(TODO: discuss buffer sizes here or in Brave and Foolish section)
-(TODO: there is another setting that I’m forgetting here - what is it?)
-
-Once your job has concluded, you can find the number of merge/sort passes by consulting the Reduce tasks counters (TODO: DL screenshot and explanation). During the job, however, the only good mechanism is to examine the Reducer logs directly. At some reasonable time after the Reducer has started, you will see it initiate spills to disk (TODO: tell what the log line looks like). At some later point, it will begin merge/sorting those spills (TODO: tell what the log line looks like).
-
-The CPU burden of a merge/sort is disappearingly small against the dominating cost of reading then writing the data to disk. If, for example, your job only triggered one merge/sort pass halfway through receiving its data, the cost of the merge/sort is effectively one and a half times the base cost of writing that data at top speed to disk: all of the data was spilled once, half of it was rewritten as merged output. Comparing the total size of data received by the Reducer to the merge/sort settings will let you estimate the expected number of merge/sort passes; that number, along with the “top speed” figure you collected above, will, in turn, allow you to estimate how long the Reduce should take. Much of this action happens in parallel but it happens in parallel with your Mapper’s mapping, spilling and everything else that is happening on the machine.
-
-A healthy, data-intensive job will have Mappers with nearly top speed throughput, the expected number of merge/sort passes and the merge/sort should conclude shortly after the last Map input is received. (TODO: tell what the log line looks like). In general, if the amount of data each Reducer receives is less than a factor of two to three times its share of machine RAM, (TODO: should I supply a higher-fidelity thing to compare against?) all those conditions should hold. Otherwise, consult the USE method (TODO: REF).
-
-If the merge/sort phase is killing your job’s performance, it is most likely because either all of your Reducers are receiving more data than they can accommodate or because some of your Reducers are receiving far more than their fair share. We will take the uniform distribution case first.
-
-The best fix to apply is to send less data to your Reducers. The chapters on writing Map/Reduce jobs (TODO: REF or whatever we are calling Chapter 5) and the chapter on advanced Pig (TODO: REF or whatever we are calling that now) both have generic recommendations for how to send around less data and throughout the book, we have described powerful methods in a domain-specific context which might translate to your problem.
-
-If you cannot lessen the data burden, well, the laws of physics and economics must be obeyed. The cost of a merge/sort is ‘O(N LOG N)’. In a healthy job, however, most of the merge/sort has been paid down by the time the final merge pass begins, so up to that limit, your Hadoop job should run in ‘O(N)’ time governed by its top speed.
-
-The cost of excessive merge passes, however, accrues directly to the total runtime of the job. Even though there are other costs that increase with the number of machines, the benefits of avoiding excessive merge passes are massive. A cloud environment makes it particularly easy to arbitrage the laws of physics against the laws of economics -- it costs the same to run 60 machines for two hours as it does to run ten machines for 12 hours, as long as your runtime stays roughly linear with the increased number of machines, you should always size your cluster to your job, not the other way around. The thresholding behavior of excessive reduces makes it exceptionally valuable to do so. This is why we feel exploratory data analytics is far more efficiently done in an elastic cloud environment, even given the quite significant performance hit you take. Any physical cluster is too large and also too small; you are overpaying for your cluster overnight while your data scientists sleep and you are overpaying your data scientists to hold roller chair sword fights while their undersized cluster runs. Our rough rule of thumb is to have not more than 2-3 times as much total reducer data as you have total child heap size on all the reducer machines you'll use.
-
-(TODO: complete)
-
-==== Skewed Data and Stuck Reducers
-
-
-
-(TODO: complete)
-
-==== Reducer Processing
-
-(TODO: complete)
-
-==== Commit and Replication
-
-(TODO: complete)
-
-
-=== Top-line Performance/Sanity Checks
-
-* The first wave of Mappers should start simultaneously.
-* In general, all a job’s full block Map attempts should take roughly the same amount of time.
-* The full map phase should take around ‘average_Map_task_time*(number_of_Map_tasks/number_of_Mapper_slots+1)’
-* Very few non-local Map tasks.
-* Number of spills equals number of Map tasks (unless there are Combiners).
-* If there are Combiners, the Reducer input data should be much less than the Mapper output data (TODO: check this).
-* Record counts and data sizes for Mapper input, Reducer input and Reducer output should correspond to your conception of what the job is doing.
-* Map tasks run at full speed (data rate matches your measured baseline).
-* Most Map tasks process a full block of data.
-* Processing stage of Reduce attempts should be full speed.
-* Not too many Merge passes in Reducers.
-* Shuffle and sort time is explained by the number of Merge passes.
-* Commit phase should be brief.
-* Total job runtime is not much more than the combined Map phase and Reduce phase runtimes.
-* Reducers generally process the same amount of data.
-* Most Reducers process at least enough data to be worth the overhead of launching them.
-
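-As a worked example of the Map-phase duration check above (every number here is an invented placeholder for your job's real figures):
-
-[source,python]
-----
-# Worked example of the expected Map-phase duration check; all inputs
-# are invented placeholders for your job's real numbers.
-average_map_task_time_s = 45       # seconds per full-block Map attempt
-number_of_map_tasks     = 1200
-number_of_mapper_slots  = 160
-
-waves = number_of_map_tasks / number_of_mapper_slots + 1    # +1 wave of slack for stragglers
-expected_map_phase_s = average_map_task_time_s * waves
-print(f"expect the Map phase to take roughly {expected_map_phase_s / 60:.0f} minutes")
-----
-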
-// ____________________________________
-
-=== Performance Comparison Worksheet
-
-(TODO: DL: make a table comparing performance baseline figures on AWS and fixed-hardware reference clusters.)
-
25 25-appendix.asciidoc
@@ -1,3 +1,28 @@
[[appendix]]
== Appendix
+=== Authors ===
+
+Philip (flip) Kromer is cofounder of Infochimps, a big data platform that makes acquiring, storing and analyzing massive data streams transformatively easier. Infochimps became part of Computer Sciences Corporation in 2013, and their big data platform now serves customers such as Cisco, HGST and Infomart. He enjoys bowling, Scrabble, working on old cars or new wood, and rooting for the Red Sox.
+
+* Graduate School, Dept. of Physics - University of Texas at Austin, 2001-2007
+* Bachelor of Arts, Computer Science - Cornell University, Ithaca NY, 1992-1996
+
+* Cofounder of Infochimps, now Head of Technology and Architecture at Infochimps, a CSC Company.
+* Core committer for Storm, a framework for scalable stream processing and analytics
+* Core committer for Ironfan, a framework for provisioning complex distributed systems in the cloud or data center
+* Original author and core committer of Wukong, the leading Ruby library for Hadoop
+* Contributed chapter to _The Definitive Guide to Hadoop_ by Tom White
+
+Dieterich Lawson is a recent graduate of Stanford University.
+
+TODO DL: biography
+
+=== Colophon ===
+
+Writing a book with O'Reilly is magic in so many ways, but none more so than their Atlas authoring platform. Rather than the XML hellscape that most publishers require, Atlas allows us to write simple text documents with readable markup, do a `git push`, and see a holy-smokes-that-looks-like-a-real-book PDF file moments later.
+
+* Emacs, because a text editor that isn't its own operating system might as well be edlin.
+* Visuwords.com give good word when brain not able think word good.
+* Posse East Bar and Epoch Coffee House in Austin provided us just the right amount of noise and focus for cranking out content.
+* http://dexy.it is a visionary tool for writing documentation. With it, we are able to directly use the runnable example scripts as the code samples for the book.
15 25a-authors.asciidoc
@@ -1,15 +0,0 @@
-=== Author ===
-
-Philip (flip) Kromer is the founder and CTO at Infochimps.com, a big data platform that makes acquiring, storing and analyzing massive data streams transformatively easier. I enjoy Bowling, Scrabble, working on old cars or new wood, and rooting for the Red Sox.
-
-Graduate School, Dept. of Physics - University of Texas at Austin, 2001-2007
-Bachelor of Arts, Computer Science - Cornell University, Ithaca NY, 1992-1996
-
-* Core committer for Wukong, the leading ruby library for Hadoop
-* Core committer for Ironfan, a framework for provisioning complex distributed systems in the cloud or data center.
-* Wrote the most widely-used cookbook for deploying hadoop clusters using Chef
-* Contributed chapter to _The Definitive Guide to Hadoop_ by Tom White
-
-
-
-
8 25b-colophon.asciidoc
@@ -1,8 +0,0 @@
-=== A sort of colophon ===
-
-the http://github.com/schacon/git-scribe[git-scribe toolchain] was very useful creating this book. Instructions on how to install the tool and use it for things like editing this book, submitting errata and providing translations can be found at that site.
-
-* Posse East Bar
-* Visuwords.com
-* Emacs
-* Epoch Coffee House
4 25c-references.asciidoc
@@ -27,3 +27,7 @@
* https://github.com/yohasebe/wp2txt[wp2txt], by http://yohasebe.com[Yoichiro Hasebe]
+
+
+
+The http://github.com/schacon/git-scribe[git-scribe toolchain] was very useful in creating this book. Instructions on how to install the tool and use it for things like editing this book, submitting errata, and providing translations can be found at that site.
0 22b-scripts.asciidoc → 25d-overview_of_scripts.asciidoc
File renamed without changes.
50 25f-glossary.asciidoc
@@ -23,3 +23,53 @@ Questions: