
Moved coprolites into the 'supplementary' directory; moving towards a one-file-per-chapter-or-so world
Philip (flip) Kromer committed Dec 25, 2013
1 parent 479f2da commit efb3f6c46b169652901522721f0171893e0d804a
Showing with 1,101 additions and 1,660 deletions.
  1. +500 −0 00-preface.asciidoc
  2. +0 −500 00a-about.asciidoc
  3. +187 −0 02-simple_transform.asciidoc
  4. +0 −88 02a-c_and_e_start_a_business.asciidoc
  5. +0 −1 02a-simple_transform-intro.asciidoc
  6. +0 −74 02b-running_a_hadoop_job.asciidoc
  7. +0 −38 02c-end_of_simple_transform.asciidoc
  8. +2 −0 03-transform_pivot.asciidoc
  9. +0 −2 03a-c_and_e_save_xmas.asciidoc
  10. +0 −2 03a-intro.asciidoc
  11. +1 −2 07-intro_to_storm+trident.asciidoc
  12. +33 −0 09-statistics.asciidoc
  13. +0 −33 09a-statistics-intro.asciidoc
  14. +45 −0 18-java_api.asciidoc
  15. +0 −46 18a-hadoop_api.asciidoc
  16. +0 −11 19b-pig_udfs.asciidoc
  17. +241 −0 21-hadoop_internals.asciidoc
  18. +0 −231 21d-hadoop_internals-tuning.asciidoc.md
  19. +25 −0 25-appendix.asciidoc
  20. +0 −15 25a-authors.asciidoc
  21. +0 −8 25b-colophon.asciidoc
  22. +4 −0 25c-references.asciidoc
  23. 0 22b-scripts.asciidoc → 25d-overview_of_scripts.asciidoc
  24. +50 −0 25f-glossary.asciidoc
  25. +0 −163 25j-appendix-tools_landscape.asciidoc
  26. +3 −47 25g-back_cover.asciidoc → 26-back_cover.asciidoc
  27. +0 −46 88-storm-glossary.asciidoc
  28. +0 −225 book-full.asciidoc
  29. +0 −121 book-reviewers.asciidoc
  30. +0 −7 book.asciidoc
  31. 0 { → cheatsheets}/24-cheatsheets.asciidoc
  32. 0 { → cheatsheets}/24a-unix_cheatsheet.asciidoc
  33. 0 { → cheatsheets}/24b-regular_expression_cheatsheet.asciidoc
  34. 0 { → cheatsheets}/24c-pig_cheatsheet.asciidoc
  35. 0 { → cheatsheets}/24d-hadoop_tunables_cheatsheet.asciidoc
  36. 0 { → cheatsheets}/25i-asciidoc_cheatsheet_and_style_guide.asciidoc
  37. 0 { → supplementary}/00b-intro-more_outlines.asciidoc
  38. 0 { → supplementary}/09a-summarizing.asciidoc
  39. 0 { → supplementary}/09b-sampling.asciidoc
  40. 0 { → supplementary}/09c-distribution_of_weather_measurements.asciidoc
  41. 0 { → supplementary}/09e-exercises.asciidoc
  42. +10 −0 { → supplementary}/14c-data_modeling.asciidoc
  43. 0 { → supplementary}/15-graphs.asciidoc
  44. 0 { → supplementary}/15a-representing_graphs.asciidoc
  45. 0 { → supplementary}/15b-community_extractions.asciidoc
  46. 0 { → supplementary}/15c-pagerank.asciidoc
  47. 0 { → supplementary}/15d-label_propagation.asciidoc
  48. 0 { → supplementary}/16a-simple_machine_learning.asciidoc
  49. 0 16d-misc.asciidoc → supplementary/16d-notes_on_ml_algorithms.asciidoc
  50. 0 { → supplementary}/17-best_practices.asciidoc
  51. 0 { → supplementary}/17a-why_hadoop.asciidoc
  52. 0 { → supplementary}/17b-how_to_think.asciidoc
  53. 0 { → supplementary}/17d-cloud-vs-static.asciidoc
  54. 0 { → supplementary}/17e-rules_of_scaling.asciidoc
  55. 0 { → supplementary}/17f-best_practices_and_pedantic_points_of_style.asciidoc
  56. 0 { → supplementary}/17g-tao_te_chimp.asciidoc
  57. 0 { → supplementary}/20b-even_yet_still_more_hbase.asciidoc
  58. 0 { → supplementary}/21b-hadoop_internals-logs.asciidoc
  59. 0 { → supplementary}/22d-use_method_checklist.asciidoc
  60. 0 { → supplementary}/23-datasets_and_scripts.asciidoc
  61. 0 { → supplementary}/23a-overview_of_datasets.asciidoc
  62. 0 { → supplementary}/23c-datasets.asciidoc
  63. 0 { → supplementary}/23c-wikipedia_dbpedia.asciidoc
  64. 0 { → supplementary}/23d-airline_flights.asciidoc
  65. 0 { → supplementary}/23e-access_logs.asciidoc
  66. 0 { → supplementary}/23f-data_formats-arc.asciidoc
  67. 0 { → supplementary}/23g-other_datasets_on_the_web.asciidoc
  68. 0 { → supplementary}/23h-notes_for_chimpmark.asciidoc
  69. 0 { → supplementary}/25c-acquiring_a_hadoop_cluster.asciidoc
  70. 0 { → supplementary}/25h-TODO.asciidoc
  71. 0 { → supplementary}/88-storm-junk_ignore.asciidoc
  72. 0 { → supplementary}/88-storm-lifecycle_of_a_record.asciidoc
(The large diffs for 00-preface.asciidoc and 00a-about.asciidoc are not rendered.)

02-simple_transform.asciidoc
@@ -1,2 +1,189 @@
[[simple_transform]]
== Simple Transform
+
+//// Use the start of this chapter, before you get into the Chimpanzee and Elephant business, to write about your insights and conclusions about the simple transform aspect of working with big data. Prime the reader's mind. Make it fascinating -- this first, beginning section of each chapter is an ideal space for making big data fascinating and showing how magical data manipulation can be. Then, go on to write your introductory text for this chapter. Such as, "...in this chapter, we'll study an "embarrassingly parallel" problem in order to learn the mechanics of launching a job. It will also provide a coarse understanding of..." Amy////
+
+=== Chimpanzee and Elephant Start a Business ===
+
+Chimpanzees love nothing more than sitting at keyboards processing and generating text. Elephants have a prodigious ability to store and recall information, and will carry huge amounts of cargo with great determination. The chimpanzees and the elephants realized there was a real business opportunity from combining their strengths, and so they formed the Chimpanzee and Elephant Data Shipping Corporation.
+
+They were soon hired by a publishing firm to translate the works of Shakespeare into every language.
+In the system they set up, each chimpanzee sits at a typewriter doing exactly one thing well: read a set of passages, and type out the corresponding text in a new language. Each elephant has a pile of books, which she breaks up into "blocks" (a consecutive bundle of pages, tied up with string).
+
+=== A Simple Streamer ===
+
+We're hardly as clever as one of these multilingual chimpanzees, but even we can translate text into Pig Latin. For the unfamiliar, you turn standard English into Pig Latin as follows:
+
+* If the word begins with a consonant-sounding letter or letters, move them to the end of the word adding "ay": "happy" becomes "appy-hay", "chimp" becomes "imp-chay" and "yes" becomes "es-yay".
+* In words that begin with a vowel, just append the syllable "way": "another" becomes "another-way", "elephant" becomes "elephant-way".
+
+<<pig_latin_translator>> is a program to do that translation. It's written in Wukong, a simple library to rapidly develop big data analyses. Like the chimpanzees, it is single-concern: there's nothing in there about loading files, parallelism, network sockets or anything else. Yet you can run it over a text file from the commandline or run it over petabytes on a cluster (should you somehow have a petabyte crying out for pig-latinizing).
+
+
+[[pig_latin_translator]]
+.Pig Latin translator, actual version
+----
+ CONSONANTS   = "bcdfghjklmnpqrstvwxz"
+ UPPERCASE_RE = /[A-Z]/
+ PIG_LATIN_RE = %r{
+   \b                    # word boundary
+   ([#{CONSONANTS}]*)    # all initial consonants
+   ([\w\']+)             # remaining wordlike characters
+ }xi
+
+ each_line do |line|
+   latinized = line.gsub(PIG_LATIN_RE) do
+     head, tail = [$1, $2]
+     head = 'w' if head.blank?
+     tail.capitalize! if head =~ UPPERCASE_RE
+     "#{tail}-#{head.downcase}ay"
+   end
+   yield(latinized)
+ end
+----
+
+[[pig_latin_translator_pseudocode]]
+.Pig Latin translator, pseudocode
+----
+ for each line,
+   recognize each word in the line and change it as follows:
+     separate the head consonants (if any) from the tail of the word
+     if there were no initial consonants, use 'w' as the head
+     give the tail the same capitalization as the word
+     change the word to "{tail}-{head}ay"
+   end
+   emit the latinized version of the line
+ end
+----
+
+.Ruby helper
+****
+* The first few lines define "regular expressions" that select the initial characters (if any) to move. Writing their names in ALL CAPS makes them Ruby constants.
+* Wukong calls the `each_line do ... end` block with each line; the `|line|` part puts it in the `line` variable.
+* The `gsub` ("globally substitute") method calls its `do ... end` block with each matched word, and replaces that word with the value the block returns.
+* `yield(latinized)` hands off the `latinized` string for Wukong to output.
+****
+
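+If `gsub` with a block is new to you, here is the mechanic in isolation (an illustrative one-liner, not part of the translator):
+
+----
+ "hello world".gsub(/\w+/) { |word| word.upcase }    # => "HELLO WORLD"
+----
+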
+To test the program on the commandline, run
+
+ wu-local examples/text/pig_latin.rb data/magi.txt -
+
+The last line of its output should look like
+
+ Everywhere-way ey-thay are-way isest-way. Ey-thay are-way e-thay agi-may.
+
+So that's what it looks like when a `cat` is feeding the program data; let's see how it works when an elephant is setting the pace.
+
+
+=== Running a Hadoop Job ===
+
+_Note: this assumes you have a working Hadoop cluster, however large or small._
+
+As you've surely guessed, Hadoop is organized very much like the Chimpanzee & Elephant team. Let's dive in and see it in action.
+
+First, copy the data onto the cluster:
+
+ hadoop fs -mkdir ./data
+ hadoop fs -put wukong_example_data/text ./data/
+
+These commands understand `./data/text` to be a path on the HDFS, not your local disk; the dot `.` is treated as your HDFS home directory, typically `/user/yourusername` (use it as you would `~` in Unix). The `hadoop fs -put` command takes a list of local paths and copies them to the HDFS: it treats its final argument as the HDFS destination and all the preceding paths as local.
+
+Now, let's test on the same tiny little file we used at the commandline. Make sure to notice how much _longer_ it takes this elephant to squash a flea than it took to run without Hadoop.
+//// I'd delete the 'squash a flea' here in order to be direct and simple, not cute. Amy////
+
+ wukong launch examples/text/pig_latin.rb ./data/text/magi.txt ./output/latinized_magi
+
+//// I suggest adding something about what the reader can expect to see on screen before moving on. Amy////
+After outputting a bunch of happy robot-ese to your screen, the job should appear in the jobtracker window within a few seconds. The whole job should complete in far less time than it took to set it up. You can compare its output to the earlier run by running
+
+ hadoop fs -cat ./output/latinized_magi/\*
+
+Now let's run it on the full Shakespeare corpus. Even this is hardly enough data to make Hadoop break a sweat, but it does show off the power of distributed computing.
+
+ wukong launch examples/text/pig_latin.rb ./data/text ./output/latinized_text
+
+//// I suggest rounding out the exercise with something along the lines of a supportive "You should now see...on screen" for the reader. Amy////
+
+=== Brief Anatomy of a Hadoop Job ===
+
+We'll go into much more detail in (TODO: ref), but here are the essentials of what you just performed.
+
+==== Copying files to the HDFS ====
+
+When you ran the `hadoop fs -mkdir` command, the Namenode (Nanette's Hadoop counterpart) simply made a notation in its directory: no data was stored. If you're familiar with the term, think of the namenode as a 'File Allocation Table (FAT)' for the HDFS.
+
+When you run `hadoop fs -put ...`, the putter process does the following for each file:
+
+1. Contacts the namenode to create the file. This also just makes a note of the file; the namenode doesn't ever have actual data pass through it.
+2. Instead, the putter process asks the namenode to allocate a new data block. The namenode designates a set of datanodes (typically three), along with a permanently-unique block ID.
+3. The putter process transfers the file over the network to the first data node in the set; that datanode transfers its contents to the next one, and so forth. The putter doesn't consider its job done until a full set of replicas have acknowledged successful receipt.
+4. As soon as each HDFS block fills, even if it is mid-record, it is closed; steps 2 and 3 are repeated for the next block.
+
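+To make the sequence concrete, here is a rough sketch of that putter loop in Ruby. The names (`namenode`, `allocate_block`, `write_block`, and so on) are illustrative stand-ins, not Hadoop's actual API -- the point is the shape of the conversation, not the code.
+
+----
+ # Illustrative sketch only: how a client-side 'putter' streams a file into
+ # the HDFS one block at a time, through a pipeline of datanodes.
+ def put_file(namenode, local_path, hdfs_path, block_size)
+   namenode.create_file(hdfs_path)                  # metadata only; no data passes through the namenode
+   File.open(local_path, 'rb') do |file|
+     until file.eof?
+       block_id, datanodes = namenode.allocate_block(hdfs_path)   # typically three datanodes
+       data = file.read(block_size)                 # blocks close when full, even mid-record
+       datanodes.first.write_block(block_id, data, rest_of_pipeline: datanodes.drop(1))
+       datanodes.each { |dn| dn.await_ack(block_id) }              # done only when every replica confirms
+     end
+   end
+ end
+----
+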
+==== Running on the cluster ====
+
+Now let's look at what happens when you run your job.
+
+(TODO: verify this is true in detail. @esammer?)
+
+* _Runner_: The program you launch sends the job and its assets (code files, etc) to the jobtracker. The jobtracker hands a `job_id` back (something like `job_201204010203_0002` -- the datetime the jobtracker started and the count of jobs launched so far); you'll use this to monitor and if necessary kill the job.
+* _Jobtracker_: As tasktrackers "heartbeat" in, the jobtracker hands them a set of 'tasks' -- the code to run and the data segment to process (the "split", typically an HDFS block).
+* _Tasktracker_: Each tasktracker launches a set of 'mapper child processes', each one an 'attempt' at one of the tasks it received. (TODO verify:) It periodically reassures the jobtracker with progress and in-app metrics.
+* _Jobtracker_: The jobtracker continually updates the job progress and app metrics. As each tasktracker reports a complete attempt, it receives a new one from the jobtracker.
+* _Tasktracker_: After some progress, the tasktrackers also fire off a set of reducer attempts, similar to the mapper step.
+* _Runner_: stays alive, reporting progress, for the full duration of the job. As soon as the job_id is delivered, though, the Hadoop job itself doesn't depend on the runner -- even if you stop the process or disconnect your terminal the job will continue to run.
+
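+The back-and-forth above boils down to a simple assignment loop. The sketch below is conceptual only -- it compresses the jobtracker's bookkeeping into a few lines and the method names are made up -- but it captures the rhythm of heartbeat, task assignment, and progress report.
+
+----
+ # Conceptual sketch of the jobtracker's side of the conversation; not real Hadoop code.
+ until job.complete?
+   heartbeat = tasktrackers.next_heartbeat              # tasktrackers check in periodically
+   job.record_progress(heartbeat.finished_attempts)     # update job progress and app metrics
+   if heartbeat.has_free_slots?
+     task = job.next_pending_task                       # a split to map or, later on, a reduce
+     heartbeat.tasktracker.assign(task) if task
+   end
+ end
+----
+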
+[WARNING]
+===============================
+Please keep in mind that the tasktracker does _not_ run your code directly -- it forks a separate process in a separate JVM with its own memory demands. The tasktracker rarely needs more than a few hundred megabytes of heap, and you should not see it consuming significant I/O or CPU.
+===============================
+
+=== Chimpanzee and Elephant: Splits ===
+
+I've danced around a minor but important detail that the workers take care of. For the Chimpanzees, books are chopped up into set numbers of pages -- but the chimps translate _sentences_, not pages, and a page block boundary might happen mid-sentence.
+//// Provide a real world analogous example here to help readers correlate this story to their world and data analysis needs, "...This example is similar to..." Amy////
+
+The Hadoop equivalent, of course, is that a data record may cross an HDFS block boundary. (In fact, you can force map-reduce splits to happen anywhere in the file, but the default and typically most-efficient choice is to split at HDFS block boundaries.)
+
+A mapper will skip the first record of a split if it's partial and carry on from there. Since there are many records in each split, that's no big deal. When it gets to the end of the split, the task doesn't stop processing until it completes the current record -- the framework makes the overhanging data seamlessly appear.
+//// Again, here, correlate this example to a real world scenario; "...so if you were translating x, this means that..." Amy////
+
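+If you're curious what "skip the partial first record, read past the end of the split" looks like, here is a toy sketch for newline-delimited records. It is not Hadoop's actual reader code, just the idea in miniature.
+
+----
+ # Toy illustration of split handling for newline-delimited records.
+ def each_record(file, split_start, split_end)
+   file.seek(split_start)
+   file.gets unless split_start == 0     # skip the (likely partial) first line; the previous split owns it
+   while file.pos <= split_end && (line = file.gets)
+     yield line                          # the last record may overhang the split boundary; read it anyway
+   end
+ end
+----
+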
+In practice, Hadoop users only need to worry about record splitting when writing a custom `InputFormat` or when practicing advanced magick. You'll see lots of references to it, though -- it's a crucial subject for those inside the framework, but for regular users the story I just told is more than enough detail.
+
+=== Exercises ===
+
+==== Exercise 1.1: Running time ====
+
+It's important to build your intuition about what makes a program fast or slow.
+
+Write the following scripts (a minimal sketch of each appears just after this list):
+
+* *null.rb* -- emits nothing.
+* *identity.rb* -- emits every line exactly as it was read in.
+
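+In the Wukong style used earlier in the chapter, these are nearly one-liners. Here is one possible sketch (name and arrange the files however you like):
+
+----
+ # null.rb -- reads every line, emits nothing
+ each_line { |line| }
+
+ # identity.rb -- emits every line exactly as it was read in
+ each_line { |line| yield line }
+----
+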
+Let's run the *reverse.rb* and *pig_latin.rb* scripts from this chapter, and the *null.rb* and *identity.rb* scripts you just wrote, against the 30 Million Wikipedia Abstracts dataset.
+
+First, though, write down an educated guess for how much longer each script will take than the `null.rb` script (use the table below). So, if you think the `reverse.rb` script will be 10% slower, write '10%'; if you think it will be 10% faster, write '-10%'.
+
+Next, run each script three times, mixing up the order. Write down
+
+* the total time of each run
+* the average of those times
+* the actual percentage difference in run time between each script and the null.rb script
+
+ script     | est % incr | run 1 | run 2 | run 3 | avg run time | actual % incr |
+ null:      |            |       |       |       |              |               |
+ identity:  |            |       |       |       |              |               |
+ reverse:   |            |       |       |       |              |               |
+ pig_latin: |            |       |       |       |              |               |
+
+Most people are surprised by the result.
+
+==== Exercise 1.2: A Petabyte-scalable `wc` command ====
+
+Create a script, `wc.rb`, that emits, for each line of input, its length in characters, the number of bytes it occupies, and the number of words it contains.
+
+Notes:
+
+* The `String` methods `chomp`, `length`, `bytesize`, `split` are useful here.
+* Do not include the end-of-line characters (`\n` or `\r`) in your count.
+* As a reminder -- for English text the byte count and length are typically similar, but the funny characters in a string like "Iñtërnâtiônàlizætiøn" require more than one byte each. The character count says how many distinct 'letters' the string contains, regardless of how it's stored in the computer. The byte count describes how much space a string occupies, and depends on arcane details of how strings are stored.
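+
+If you want to check your work against something, here is one minimal sketch in the same style as `pig_latin.rb` (yours may reasonably differ):
+
+----
+ # wc.rb -- emit, for each input line: character length, byte count, word count
+ each_line do |line|
+   stripped = line.chomp                  # don't count the end-of-line characters
+   yield [stripped.length, stripped.bytesize, stripped.split.size].join("\t")
+ end
+----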

02a-c_and_e_start_a_business.asciidoc
@@ -1,88 +0,0 @@
-=== Chimpanzee and Elephant Start a Business ===
-
-Chimpanzees love nothing more than sitting at keyboards processing and generating text. Elephants have a prodigious ability to store and recall information, and will carry huge amounts of cargo with great determination. The chimpanzees and the elephants realized there was a real business opportunity from combining their strengths, and so they formed the Chimpanzee and Elephant Data Shipping Corporation.
-
-They were soon hired by a publishing firm to translate the works of Shakespeare into every language.
-In the system they set up, each chimpanzee sits at a typewriter doing exactly one thing well: read a set of passages, and type out the corresponding text in a new language. Each elephant has a pile of books, which she breaks up into "blocks" (a consecutive bundle of pages, tied up with string).
-
-=== A Simple Streamer ===
-
-We're hardly as clever as one of these multilingual chimpanzees, but even we can translate text into Pig Latin. For the unfamiliar, you turn standard English into Pig Latin as follows:
-
-* If the word begins with a consonant-sounding letter or letters, move them to the end of the word adding "ay": "happy" becomes "appy-hay", "chimp" becomes "imp-chay" and "yes" becomes "es-yay".
-* In words that begin with a vowel, just append the syllable "way": "another" becomes "another-way", "elephant" becomes "elephant-way".
-
-<<pig_latin_translator>> is a program to do that translation. It's written in Wukong, a simple library to rapidly develop big data analyses. Like the chimpanzees, it is single-concern: there's nothing in there about loading files, parallelism, network sockets or anything else. Yet you can run it over a text file from the commandline or run it over petabytes on a cluster (should you somehow have a petabyte crying out for pig-latinizing).
-
-
-[[pig_latin_translator]]
-.Pig Latin translator, actual version
-----
- CONSONANTS = "bcdfghjklmnpqrstvwxz"
- UPPERCASE_RE = /[A-Z]/
- PIG_LATIN_RE = %r{
- \b # word boundary
- ([#{CONSONANTS}]*) # all initial consonants
- ([\w\']+) # remaining wordlike characters
- }xi
-
- each_line do |line|
- latinized = line.gsub(PIG_LATIN_RE) do
- head, tail = [$1, $2]
- head = 'w' if head.blank?
- tail.capitalize! if head =~ UPPERCASE_RE
- "#{tail}-#{head.downcase}ay"
- end
- yield(latinized)
- end
-----
-
-[[pig_latin_translator]]
-.Pig Latin translator, pseudocode
-----
- for each line,
- recognize each word in the line and change it as follows:
- separate the head consonants (if any) from the tail of the word
- if there were no initial consonants, use 'w' as the head
- give the tail the same capitalization as the word
- change the word to "{tail}-#{head}ay"
- end
- emit the latinized version of the line
- end
-----
-
-.Ruby helper
-****
-* The first few lines define "regular expressions" selecting the initial characters (if any) to move. Writing their names in ALL CAPS makes them be constants.
-* Wukong calls the `each_line do ... end` block with each line; the `|line|` part puts it in the `line` variable.
-* the `gsub` ("globally substitute") statement calls its `do ... end` block with each matched word, and replaces that word with the last line of the block.
-* `yield(latinized)` hands off the `latinized` string for wukong to output
-****
-
-//// Is the Ruby helper necessary? It may be, but wanted to query it as I don't see it appearing often as a feature. Amy////
-
-To test the program on the commandline, run
-
- wu-local examples/text/pig_latin.rb data/magi.txt -
-
-The last line of its output should look like
-
- Everywhere-way ey-thay are-way isest-way. Ey-thay are-way e-thay agi-may.
-
-So that's what it looks like when a `cat` is feeding the program data; let's see how it works when an elephant is setting the pace.
-
-=== Chimpanzee and Elephant: A Day at Work ===
-
-Each day, the chimpanzee's foreman, a gruff silverback named J.T., hands each chimp the day's translation manual and a passage to translate as they clock in. Throughout the day, he also coordinates assigning each block of pages to chimps as they signal the need for a fresh assignment.
-
-Some passages are harder than others, so it's important that any elephant can deliver page blocks to any chimpanzee -- otherwise you'd have some chimps goofing off while others are stuck translating _King Lear_ into Kinyarwanda. On the other hand, sending page blocks around arbitrarily will clog the hallways and exhaust the elephants.
-
-The elephants' chief librarian, Nanette, employs several tricks to avoid this congestion.
-
-Since each chimpanzee typically shares a cubicle with an elephant, it's most convenient to hand a new page block across the desk rather then carry it down the hall. J.T. assigns tasks accordingly, using a manifest of page blocks he requests from Nanette. Together, they're able to make most tasks be "local".
-
-//// This would be a good spot to consider offering a couple additional, real-world analogies to aid your reader in making the leap to how this Chimp and Elephant story correlates to the needs for data manipulation in the readers' world - that is, relate it to science, epidemiology, finance, or what have you, some area of real-world application. Amy////
-
-Second, the page blocks of each play are distributed all around the office, not stored in one book together. One elephant might have pages from Act I of _Hamlet_, Act II of _The Tempest_, and the first four scenes of _Troilus and Cressida_ footnote:[Does that sound complicated? It is -- Nanette is able to keep track of all those blocks, but if she calls in sick, nobody can get anything done. You do NOT want Nanette to call in sick.]. Also, there are multiple 'replicas' (typically three) of each book collectively on hand. So even if a chimp falls behind, JT can depend that some other colleague will have a cubicle-local replica. (There's another benefit to having multiple copies: it ensures there's always a copy available. If one elephant is absent for the day, leaving her desk locked, Nanette will direct someone to make a xerox copy from either of the two other replicas.)
-
-Nanette and J.T. exercise a bunch more savvy optimizations (like handing out the longest passages first, or having folks who finish early pitch in so everyone can go home at the same time, and more). There's no better demonstration of power through simplicity.
