
Rename to chapter order

commit 561b91bb375e8e62cf8f6af8bb68c6e8d25f285c (1 parent: 3b1798b). Philip (flip) Kromer committed Feb 17, 2013.
Showing with 209 additions and 241 deletions.
  1. +3 −0 00-preface.asciidoc
  2. +23 −22 00a-about.asciidoc
  3. +4 −2 00b-topics.asciidoc
  4. +2 −0 01-first_exploration.asciidoc
  5. +2 −0 014-organizing_data.asciidoc
  6. +2 −0 02-simple_transform.asciidoc
  7. +2 −0 03-transform_pivot.asciidoc
  8. 0 03a-locality-pivot.asciidoc → 03a-locality.asciidoc
  9. 0 03b-locality-saving_christmas.asciidoc → 03b-saving_christmas.asciidoc
  10. 0 03c-locality-simple_reshape.asciidoc → 03c-simple_reshape.asciidoc
  11. 0 03d-locality-efficient_santa.asciidoc → 03d-efficient_santa.asciidoc
  12. 0 03f-locality-partition_and_sort_keys.asciidoc → 03f-partition_and_sort_keys.asciidoc
  13. +2 −0 04-geographic_flavor.asciidoc
  14. +2 −0 05-toolset.asciidoc
  15. 0 08a-tools.asciidoc → 05a-tools.asciidoc
  16. 0 05b-launching_and_debugging.asciidoc
  17. 0 08b-tools-intro_to_wukong.asciidoc → 05c-intro_to_wukong.asciidoc
  18. 0 08c-tools-intro_to_pig.asciidoc → 05d-intro_to_pig.asciidoc
  19. +2 −0 06-filesystem_mojo.asciidoc
  20. +2 −0 07-server_logs.asciidoc
  21. 0 05a-server_logs.asciidoc → 07a-server_logs.asciidoc
  22. +2 −0 08-text_processing.asciidoc
  23. 0 05a-processing_text.asciidoc → 08a-processing_text.asciidoc
  24. +2 −0 09-statistics.asciidoc
  25. 0 11b-statistics.asciidoc → 09a-summarizing.asciidoc
  26. 0 11f-sampling.asciidoc → 09b-sampling.asciidoc
  27. 0 ...distribution_of_weather_measurements.asciidoc → 09c-distribution_of_weather_measurements.asciidoc
  28. 0 11e-statistics-exercises.asciidoc → 09e-exercises.asciidoc
  29. +2 −0 10-time_series.asciidoc
  30. 0 14a-time_series_data.asciidoc → 10a-time_series_data.asciidoc
  31. +2 −0 11-geographic.asciidoc
  32. 0 09a-geographic_data.asciidoc → 11a-spatial_join.asciidoc
  33. 0 09f-an_elephants_eye_view_of_the_world.asciidoc → 11f-an_elephants_eye_view_of_the_world.asciidoc
  34. +2 −0 12-cat_herding.asciidoc
  35. 0 08e-herding_cats.asciidoc → 12a-herding_cats.asciidoc
  36. +2 −0 13-data_munging.asciidoc
  37. 0 06b-semi_structured_data-wikipedia_other.asciidoc → 13a-wikipedia_other.asciidoc
  38. 0 06c-semi_structured_data-wikipedia_corpus.asciidoc → 13c-wikipedia_corpus.asciidoc
  39. 0 06d-semi_structured_data-patterns.asciidoc → 13d-patterns.asciidoc
  40. 0 06e-semi_structured_data-airline_flights.asciidoc → 13e-airline_flights.asciidoc
  41. 0 06f-semi_structured_data-daily_weather.asciidoc → 13f-daily_weather.asciidoc
  42. 0 06g-semi_structured_data-truth_and_error.asciidoc → 13g-truth_and_error.asciidoc
  43. 0 06h-semi_structured_data-other_strategies.asciidoc → 13h-other_strategies.asciidoc
  44. 0 07a-data_formats.asciidoc → 14a-data_formats.asciidoc
  45. 0 24c-data_modeling.asciidoc → 14c-data_modeling.asciidoc
  46. +2 −0 15-graphs.asciidoc
  47. 0 12a-processing_graphs.asciidoc → 15a-representing_graphs.asciidoc
  48. 0 12c-processing_graphs-community.asciidoc → 15b-community_extractions.asciidoc
  49. 0 12c-processing_graphs-pagerank.asciidoc → 15c-pagerank.asciidoc
  50. +2 −0 16-machine_learning.asciidoc
  51. 0 15a-simple_machine_learning.asciidoc → 16a-simple_machine_learning.asciidoc
  52. 0 22d-misc.asciidoc → 16d-misc.asciidoc
  53. +2 −0 17-best_practices.asciidoc
  54. 0 24a-why_hadoop.asciidoc → 17a-why_hadoop.asciidoc
  55. 0 24b-how_to_think.asciidoc → 17b-how_to_think.asciidoc
  56. 0 24d-cloud-vs-static.asciidoc → 17d-cloud-vs-static.asciidoc
  57. 0 24e-rules_of_scaling.asciidoc → 17e-rules_of_scaling.asciidoc
  58. 0 ..._and_pedantic_points_of_style.asciidoc → 17f-best_practices_and_pedantic_points_of_style.asciidoc
  59. 0 24g-tao_te_chimp.asciidoc → 17g-tao_te_chimp.asciidoc
  60. +2 −0 18-java_api.asciidoc
  61. 0 19a-hadoop_api.asciidoc → 18a-hadoop_api.asciidoc
  62. +2 −0 19-advanced_pig.asciidoc
  63. 0 20a-advanced_pig.asciidoc → 19a-advanced_pig.asciidoc
  64. 0 20b-pig_udfs.asciidoc → 19b-pig_udfs.asciidoc
  65. +2 −0 20-hbase_data_modeling.asciidoc
  66. 0 21b-hbase_and_databases.asciidoc → 20b-hbase_and_databases.asciidoc
  67. +2 −0 21-hadoop_internals.asciidoc
  68. 0 16a-hadoop_internals.asciidoc → 21a-hadoop_internals.asciidoc
  69. 0 16b-hadoop_internals-logs.asciidoc → 21b-hadoop_internals-logs.asciidoc
  70. +2 −0 22-hadoop_tuning.asciidoc
  71. 0 17a-tuning-wise_and_lazy.asciidoc → 22a-tuning-wise_and_lazy.asciidoc
  72. 0 22-datasets_and_scripts.asciidoc → 22b-scripts.asciidoc
  73. 0 17b-tuning-pathology.asciidoc → 22b-tuning-pathology.asciidoc
  74. 0 17c-tuning-brave_and_foolish.asciidoc → 22c-tuning-brave_and_foolish.asciidoc
  75. 0 17d-use_method_checklist.asciidoc → 22d-use_method_checklist.asciidoc
  76. +2 −0 23-datasets_and_scripts.asciidoc
  77. 0 08d-overview_of_datasets.asciidoc → 23a-overview_of_datasets.asciidoc
  78. 0 25a-datasets.asciidoc → 23c-datasets.asciidoc
  79. 0 25c-wikipedia_dbpedia.asciidoc → 23c-wikipedia_dbpedia.asciidoc
  80. 0 25d-airline_flights.asciidoc → 23d-airline_flights.asciidoc
  81. 0 25e-access_logs.asciidoc → 23e-access_logs.asciidoc
  82. 0 25f-data_formats-arc.asciidoc → 23f-data_formats-arc.asciidoc
  83. 0 25g-other_datasets_on_the_web.asciidoc → 23g-other_datasets_on_the_web.asciidoc
  84. 0 25h-notes_for_chimpmark.asciidoc → 23h-notes_for_chimpmark.asciidoc
  85. +2 −0 24-cheatsheets.asciidoc
  86. +76 −0 24a-unix_cheatsheet.asciidoc
  87. +0 −91 23a-cheatsheets.asciidoc → 24b-regular_expression_cheatsheet.asciidoc
  88. +13 −0 24c-pig_cheatsheet.asciidoc
  89. +3 −0 24d-hadoop_tunables_cheatsheet.asciidoc
  90. +2 −0 25-appendix.asciidoc
  91. 0 30a-authors.asciidoc → 25a-authors.asciidoc
  92. 0 30b-colophon.asciidoc → 25b-colophon.asciidoc
  93. 0 25b-acquiring_a_hadoop_cluster.asciidoc → 25c-acquiring_a_hadoop_cluster.asciidoc
  94. 0 30c-references.asciidoc → 25c-references.asciidoc
  95. 0 30f-glossary.asciidoc → 25f-glossary.asciidoc
  96. 0 30g-back_cover.asciidoc → 25g-back_cover.asciidoc
  97. 0 30h-TODO.asciidoc → 25h-TODO.asciidoc
  98. 0 30i-asciidoc_cheatsheet_and_style_guide.asciidoc → 25i-asciidoc_cheatsheet_and_style_guide.asciidoc
  99. +0 −61 MANIFEST-early_release.asciidoc
  100. +0 −65 MANIFEST-tech_review.asciidoc
  101. +37 −0 foo.rb
@@ -0,0 +1,3 @@
+[[preface]]
+== Preface
+
@@ -1,5 +1,3 @@
-== Preface
-
// :author: Philip (flip) Kromer
// :doctype: book
// :toc:
@@ -41,7 +39,7 @@ This is the plan. We'll roll material out over the next few months. Should we fi
5. *The Hadoop Toolset*
- toolset overview
- - launching jobs
+ - launching and debugging jobs
- overview of wukong
- overview of pig
@@ -60,53 +58,54 @@ This is the plan. We'll roll material out over the next few months. Should we fi
- Pointwise Mutual Information
- K-means Clustering
-9. Interlude I: *Data Models, Data Formats, Data Management*:
- - How to design your data models
- - How to serialize their contents (orig, scratch, prod)
- - How to organize your scripts and your data
-
-10. *Statistics*:
- - Averages, Percentiles, and Normalization
+9. *Statistics*:
+ - Summarizing: Averages, Percentiles, and Normalization
- Sampling responsibly: it's harder and more important than you think
- Statistical aggregates and the danger of large numbers
-11. *Time Series*
+10. *Time Series*
-12. *Geographic Data*:
+11. *Geographic Data*:
- Spatial join (find all UFO sightings near Airports)
-
-13. *`cat` herding*
+12. *`cat` herding*
- total sort
- transformations from the commandline (grep, cut, wc, etc)
- pivots from the commandline (head, sort, etc)
- commandline workflow tips
- advanced hadoop filesystem (chmod, setrep, fsck)
-14. *Data Munging (Semi-Structured Data)*: The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
+13. *Data Munging (Semi-Structured Data)*: The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
- Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
- Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.
- US Commercial Airline Flights: every commercial airline flight since 1987
- Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
- "Star Wars Kid" weblogs: large collection of apache webserver logs from a popular internet site (Andy Baio's waxy.org).
-15. Interlude II: *Best Practices and Pedantic Points of style*
- - Pedantic Points of Style
- - Best Practices
- - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
- - Why Hadoop
- - robots are cheap, people are important
+14. Interlude I: *Organizing Data*:
+ - How to design your data models
+ - How to serialize their contents (orig, scratch, prod)
+ - How to organize your scripts and your data
-16. *Graph Processing*:
+15. *Graph Processing*:
+ - Graph Representations
- Community Extraction: Use the page-to-page links in Wikipedia to identify similar documents
- Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages
-17. *Machine Learning without Grad School*: We'll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays using
+16. *Machine Learning without Grad School*: We'll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays using
- Naive Bayes
- Logistic Regression
- Random Forest (using Mahout)
We'll equip you with a picture of how they work, but won't go into the math of how or why. We will show you how to choose a method, and how to cheat to win.
+17. Interlude II: *Best Practices and Pedantic Points of Style*
+ - Pedantic Points of Style
+ - Best Practices
+ - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
+ - Why Hadoop
+ - robots are cheap, people are important
+
PRACTICAL
@@ -147,6 +146,8 @@ APPENDIX
- Sizes of the Universe
- Hadoop Tuning & Configuration Variables
+25. *Appendix*:
+
==== Not Contents ====
I'm not currently planning to cover Hive -- I believe the Pig scripts will translate naturally for folks who are already familiar with it. There will be a brief section explaining why you might choose Hive over Pig, and why I chose Pig over Hive. If there's popular pressure I may add a "translation guide".
@@ -29,8 +29,8 @@
- (visualize)
INTERMEDIATE
-
-5. *The Hadoop Toolset*
+
+5. *The Toolset*
- toolset overview
- pig vs hive vs impala
- hbase & elasticsearch (not accumulo or cassandra)
@@ -226,3 +226,5 @@ APPENDIX
- Regular Expressions
- Sizes of the Universe
- Hadoop Tuning & Configuration Variables
+
+25. *Appendix*
@@ -0,0 +1,3 @@
+[[first_exploration]]
+== First Exploration
+
@@ -0,0 +1,3 @@
+[[data_management]]
+== Data Management
+
@@ -0,0 +1,3 @@
+[[simple_transform]]
+== Simple Transform
+
@@ -0,0 +1,3 @@
+[[transform_pivot]]
+== Transform Pivot
+
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[geographic_flavor]]
+== Geographic Flavor
+
@@ -0,0 +1,3 @@
+[[toolset]]
+== Toolset
+
File renamed without changes.
No changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[filesystem_mojo]]
+== Filesystem Mojo
+
@@ -0,0 +1,3 @@
+[[server_logs]]
+== Server Logs
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[text_processing]]
+== Text Processing
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[statistics]]
+== Statistics
+
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[time_series]]
+== Time Series
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[geographic]]
+== Geographic
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[cat_herding]]
+== Cat Herding
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[data_munging]]
+== Data Munging
+
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[graphs]]
+== Graphs
+
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[machine_learning]]
+== Machine Learning
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[best_practices]]
+== Best Practices
+
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[java_api]]
+== Java API
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[advanced_pig]]
+== Advanced Pig
+
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[hbase_data_modeling]]
+== HBase Data Modeling
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[hadoop_internals]]
+== Hadoop Internals
+
File renamed without changes.
@@ -0,0 +1,3 @@
+[[hadoop_tuning]]
+== Hadoop Tuning
+
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[datasets_and_scripts]]
+== Datasets and Scripts
+
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,3 @@
+[[cheatsheets]]
+== Cheatsheets
+
@@ -0,0 +1,76 @@
+== Cheatsheets ==
+
+=== Terminal Commands ===
+
+[[hadoop_filesystem_commands]]
+.Hadoop Filesystem Commands
+[options="header"]
+|=======
+| action | command
+| |
+| list files | `hadoop fs -ls`
+| list files' disk usage | `hadoop fs -du`
+| total HDFS usage/available | visit namenode console
+| |
+| |
+| copy local -> HDFS | `hadoop fs -put ${FILE} ${DIR}`
+| copy HDFS -> local | `hadoop fs -get ${FILE}`
+| copy HDFS -> remote HDFS | `hadoop distcp ${SRC} ${DEST}`
+| |
+| make a directory | `hadoop fs -mkdir ${DIR}`
+| move/rename | `hadoop fs -mv ${FILE} ${DEST}`
+| dump file to console | `hadoop fs -cat ${FILE} \| cut -c 1-10000 \| head -n 10000`
+| |
+| |
+| remove a file | `hadoop fs -rm ${FILE}`
+| remove a directory tree | `hadoop fs -rmr ${DIR}`
+| remove a file, skipping Trash | `hadoop fs -rm -skipTrash ${FILE}`
+| empty the trash NOW | `hadoop fs -expunge`
+| |
+| health check of HDFS | `hadoop fsck /`
+| report block usage of files | `hadoop fsck ${DIR} -files -blocks`
+| |
+| decommission nodes | add them to the excludes file, then `hadoop dfsadmin -refreshNodes`
+| |
+| |
+| list running jobs | `hadoop job -list`
+| kill a job | `hadoop job -kill ${JOB_ID}`
+| kill a task attempt | `hadoop job -kill-task ${TASK_ATTEMPT_ID}`
+| |
+| |
+| CPU usage by process | `htop`, or `top` if that's not installed
+| Disk activity | `iostat -x 5`
+| Network activity | `ifstat 5`
+| |
+| match lines against a regexp | `grep -e '[regexp]'`
+| first/last lines of a file | `head`, `tail`
+| count lines, words and characters | `wc`
+| count adjacent duplicate lines | `uniq -c`
+| numeric sort on field 2 | `sort -n -k2`
+| tuning | csshX, htop, dstat, ulimit
+| |
+| also useful: | cat, echo, true, false, yes, tee, time, watch
+| dos-to-unix line endings | `ruby -ne 'puts $_.gsub(/\r\n?/, "\n")'`
+| |
+| |
+|=======
+
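+A worked sketch of a few of the commands above, assuming a hypothetical `/data/gold` directory on HDFS (the paths and file names are invented for illustration):
+
+----
+hadoop fs -mkdir /data/gold                    # make a directory
+hadoop fs -put pageviews.tsv /data/gold/       # copy local -> HDFS
+hadoop fs -du /data/gold                       # per-file disk usage
+hadoop fs -cat /data/gold/pageviews.tsv | head -n 10   # peek at the first lines
+----
+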
+[[commandline_tricks]]
+.UNIX commandline tricks
+[options="header"]
+|=======
+| action | command | Flags
+| Sort data | `sort` | reverse the sort: `-r`; sort numerically: `-n`; sort on a field: `-t [delimiter] -k [index]`
+| Sort large amount of data | `sort --parallel=4 -S 500M` | use four cores and a 500 megabyte sort buffer
+| Cut delimited field | `cut -f 1,3-7 -d ','` | emit comma-separated fields one and three through seven
+| Cut range of characters | `cut -c 1,3-7` | emit characters one and three through seven
+| Split on spaces | `\| ruby -ne 'puts $_.split(/\s+/).join("\t")'` | split on continuous runs of whitespace, re-emit as tab-separated
+| Distinct fields | `\| sort \| uniq` | only dupes: `-d`
+| Quickie histogram | `\| sort \| uniq -c` | TODO: check the rendering for backslash
+| Per-process usage | `htop` | may need to be installed
+| Running system usage | `dstat -drnycmf -t 5` | 5-second rolling system stats. You likely will have to http://dag.wieers.com/home-made/dstat/[install dstat] yourself. If that's not an option, use `iostat -x 5 & sleep 3 ; ifstat 5` for an interleaved 5-second running average.
+|=======
+
+For example: `cat * | cut -c 1-4 | sort | uniq -c` cuts the first four characters of each line, then counts the occurrences of each distinct four-character prefix.
+
+Not all commands are available on all platforms; OSX users should use Homebrew, Windows users should use Cygwin.
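+
+As a fuller sketch of the "quickie histogram" pattern, here is a pipeline that tallies HTTP status codes from an apache-style access log (the file name is hypothetical):
+
+----
+# Field 9 of a combined-format log line is the status code.
+# sort brings duplicate values together; uniq -c counts each run;
+# the final sort -rn puts the most frequent codes first.
+cut -d' ' -f9 access.log | sort | uniq -c | sort -rn
+----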
@@ -1,79 +1,3 @@
-== Cheatsheets ==
-
-=== Terminal Commands ===
-
-[[hadoop_filesystem_commands]]
-.Hadoop Filesystem Commands
-[options="header"]
-|=======
-| action | command
-| |
-| list files | `hadoop fs -ls`
-| list files' disk usage | `hadoop fs -du`
-| total HDFS usage/available | visit namenode console
-| |
-| |
-| copy local -> HDFS |
-| copy HDFS -> local |
-| copy HDFS -> remote HDFS |
-| |
-| make a directory | `hadoop fs -mkdir ${DIR}`
-| move/rename | `hadoop fs -mv ${FILE}`
-| dump file to console | `hadoop fs -cat ${FILE} \| cut -c 10000 \| head -n 10000`
-| |
-| |
-| remove a file |
-| remove a directory tree |
-| remove a file, skipping Trash |
-| empty the trash NOW |
-| |
-| health check of HDFS |
-| report block usage of files |
-| |
-| decommission nodes |
-| |
-| |
-| list running jobs |
-| kill a job |
-| kill a task attempt |
-| |
-| |
-| CPU usage by process | `htop`, or `top` if that's not installed
-| Disk activity |
-| Network activity |
-| |
-| | `grep -e '[regexp]'`
-| | `head`, `tail`
-| | `wc`
-| | `uniq -c`
-| | `sort -n -k2`
-| tuning | csshX, htop, dstat, ulimit
-|
-| also useful: | cat, echo, true, false, yes, tee, time, watch, time
-| dos-to-unix line endings | `ruby -ne 'puts $_.gsub(/\r\n?/, "\n")'`
-| |
-| |
-|======
-
-[[commandline_tricks]]
-.UNIX commandline tricks
-[options="header"]
-|=======
-| action | command | Flags
-| Sort data | `sort` | reverse the sort: `-r`; sort numerically: `-n`; sort on a field: `-t [delimiter] -k [index]`
-| Sort large amount of data | `sort --parallel=4 -S 500M` | use four cores and a 500 megabyte sort buffer
-| Cut delimited field | `cut -f 1,3-7 -d ','` | emit comma-separated fields one and three through seven
-| Cut range of characters | `cut -c 1,3-7` | emit characters one and three through seven
-| Split on spaces | `| ruby -ne 'puts $_.split(/\\s+/).join("\t")'` | split on continuous runs of whitespace, re-emit as tab-separated
-| Distinct fields | `| sort | uniq` | only dupes: `-d`
-| Quickie histogram | `| sort | uniq -c` | TODO: check the rendering for backslash
-| Per-process usage | `htop` | Installed
-| Running system usage | `dstat -drnycmf -t 5` | 5-second rolling system stats. You likely will have to http://dag.wieers.com/home-made/dstat/[install dstat] yourself. If that's not an option, use `iostat -x 5 & sleep 3 ; ifstat 5` for an interleaved 5-second running average.
-|======
-
-For example: `cat * | cut -c 1-4 | sort | uniq -c` cuts the first 4-character
-
-Not all commands available on all platforms; OSX users should use Homebrew, Windows users should use Cygwin.
=== Regular Expressions ===
@@ -253,18 +177,3 @@ Ascii table:
"~"
"\x7F" \c
"\x80" \c
-
-
-=== Pig Operators ===
-
-[[pig_cheatsheet]]
-.Pig Operator Cheatsheet
-[options="header"]
-|=======
-| action | operator
-| |
-| | JOIN
-| | FILTER
-| |
-|=======
-
@@ -0,0 +1,13 @@
+=== Pig Operators ===
+
+[[pig_cheatsheet]]
+.Pig Operator Cheatsheet
+[options="header"]
+|=======
+| action | operator
+| |
+| join tables on a key | JOIN
+| keep only rows matching a condition | FILTER
+| |
+|=======
+
@@ -0,0 +1,3 @@
+=== Hadoop Tunables Cheatsheet ===
+
+
View
@@ -0,0 +1,3 @@
+[[appendix]]
+== Appendix
+
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.