Commit: Rename to chapter order

Philip (flip) Kromer committed Feb 17, 2013
1 parent 3b1798b commit 561b91b
Showing 101 changed files with 209 additions and 241 deletions.
3 changes: 3 additions & 0 deletions 00-preface.asciidoc
@@ -0,0 +1,3 @@
[[preface]]
== Preface

45 changes: 23 additions & 22 deletions 00a-about.asciidoc
@@ -1,5 +1,3 @@
-== Preface
-
 // :author: Philip (flip) Kromer
 // :doctype: book
 // :toc:
@@ -41,7 +39,7 @@ This is the plan. We'll roll material out over the next few months. Should we fi
 5. *The Hadoop Toolset*
    - toolset overview
-   - launching jobs
+   - launching and debugging jobs
    - overview of wukong
    - overview of pig

@@ -60,53 +58,54 @@ This is the plan. We'll roll material out over the next few months. Should we fi
    - Pointwise Mutual Information
    - K-means Clustering

-9. Interlude I: *Data Models, Data Formats, Data Management*:
-   - How to design your data models
-   - How to serialize their contents (orig, scratch, prod)
-   - How to organize your scripts and your data
-
-10. *Statistics*:
-   - Averages, Percentiles, and Normalization
+9. *Statistics*:
+   - Summarizing: Averages, Percentiles, and Normalization
    - Sampling responsibly: it's harder and more important than you think
    - Statistical aggregates and the danger of large numbers

-11. *Time Series*
+10. *Time Series*

-12. *Geographic Data*:
+11. *Geographic Data*:
    - Spatial join (find all UFO sightings near Airports)
    -

-13. *`cat` herding*
+12. *`cat` herding*
    - total sort
    - transformations from the commandline (grep, cut, wc, etc)
    - pivots from the commandline (head, sort, etc)
    - commandline workflow tips
    - advanced hadoop filesystem (chmod, setrep, fsck)

-14. *Data Munging (Semi-Structured Data)*: The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
+13. *Data Munging (Semi-Structured Data)*: The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
    - Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
    - Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.
    - US Commercial Airline Flights: every commercial airline flight since 1987
    - Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
    - "Star Wars Kid" weblogs: large collection of apache webserver logs from a popular internet site (Andy Baio's waxy.org).

-15. Interlude II: *Best Practices and Pedantic Points of style*
-   - Pedantic Points of Style
-   - Best Practices
-   - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
-   - Why Hadoop
-   - robots are cheap, people are important
+14. Interlude I: *Organizing Data*:
+   - How to design your data models
+   - How to serialize their contents (orig, scratch, prod)
+   - How to organize your scripts and your data

-16. *Graph Processing*:
+15. *Graph Processing*:
+   - Graph Representations
    - Community Extraction: Use the page-to-page links in Wikipedia to identify similar documents
    - Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages

-17. *Machine Learning without Grad School*: We'll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays using
+16. *Machine Learning without Grad School*: We'll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays using
    - Naive Bayes
    - Logistic Regression
    - Random Forest (using Mahout)
    We'll equip you with a picture of how they work, but won't go into the math of how or why. We will show you how to choose a method, and how to cheat to win.

+17. Interlude II: *Best Practices and Pedantic Points of style*
+   - Pedantic Points of Style
+   - Best Practices
+   - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
+   - Why Hadoop
+   - robots are cheap, people are important

 PRACTICAL

@@ -147,6 +146,8 @@ APPENDIX
    - Sizes of the Universe
    - Hadoop Tuning & Configuration Variables

+25. *Appendix*:

 ==== Not Contents ====

 I'm not currently planning to cover Hive -- I believe the pig scripts will translate naturally for folks who are already familiar with it. There will be a brief section explaining why you might choose it over Pig, and why I chose it over Hive. If there's popular pressure I may add a "translation guide".
6 changes: 4 additions & 2 deletions 00b-topics.asciidoc
@@ -29,8 +29,8 @@
    - (visualize)
 INTERMEDIATE

-5. *The Hadoop Toolset*
+5. *The Toolset*
    - toolset overview
    - pig vs hive vs impala
    - hbase & elasticsearch (not accumulo or cassandra)
@@ -226,3 +226,5 @@ APPENDIX
    - Regular Expressions
    - Sizes of the Universe
    - Hadoop Tuning & Configuration Variables
+
+25. *Appendix*
2 changes: 2 additions & 0 deletions 01-first_exploration.asciidoc
@@ -0,0 +1,2 @@
[[first_exploration]]
== First Exploration

2 changes: 2 additions & 0 deletions 014-organizing_data.asciidoc
@@ -0,0 +1,2 @@
[[data_management]]
== Data Management

2 changes: 2 additions & 0 deletions 02-simple_transform.asciidoc
@@ -0,0 +1,2 @@
[[simple_transform]]
== Simple Transform

2 changes: 2 additions & 0 deletions 03-transform_pivot.asciidoc
@@ -0,0 +1,2 @@
[[transform_pivot]]
== Transform Pivot

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 04-geographic_flavor.asciidoc
@@ -0,0 +1,2 @@
[[geographic_flavor]]
== Geographic Flavor

2 changes: 2 additions & 0 deletions 05-toolset.asciidoc
@@ -0,0 +1,2 @@
[[toolset]]
== Toolset

File renamed without changes.
Empty file.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 06-filesystem_mojo.asciidoc
@@ -0,0 +1,2 @@
[[filesystem_mojo]]
== Filesystem Mojo

2 changes: 2 additions & 0 deletions 07-server_logs.asciidoc
@@ -0,0 +1,2 @@
[[server_logs]]
== Server Logs

File renamed without changes.
2 changes: 2 additions & 0 deletions 08-text_processing.asciidoc
@@ -0,0 +1,2 @@
[[text_processing]]
== Text Processing

File renamed without changes.
2 changes: 2 additions & 0 deletions 09-statistics.asciidoc
@@ -0,0 +1,2 @@
[[statistics]]
== Statistics

File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 10-time_series.asciidoc
@@ -0,0 +1,2 @@
[[time_series]]
== Time Series

File renamed without changes.
2 changes: 2 additions & 0 deletions 11-geographic.asciidoc
@@ -0,0 +1,2 @@
[[geographic]]
== Geographic

File renamed without changes.
2 changes: 2 additions & 0 deletions 12-cat_herding.asciidoc
@@ -0,0 +1,2 @@
[[cat_herding]]
== Cat Herding

File renamed without changes.
2 changes: 2 additions & 0 deletions 13-data_munging.asciidoc
@@ -0,0 +1,2 @@
[[data_munging]]
== Data Munging

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 15-graphs.asciidoc
@@ -0,0 +1,2 @@
[[graphs]]
== Graphs

File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 16-machine_learning.asciidoc
@@ -0,0 +1,2 @@
[[machine_learning]]
== Machine Learning

File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 17-best_practices.asciidoc
@@ -0,0 +1,2 @@
[[best_practices]]
== Best Practices

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 18-java_api.asciidoc
@@ -0,0 +1,2 @@
[[java_api]]
== Java Api

File renamed without changes.
2 changes: 2 additions & 0 deletions 19-advanced_pig.asciidoc
@@ -0,0 +1,2 @@
[[advanced_pig]]
== Advanced Pig

File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 20-hbase_data_modeling.asciidoc
@@ -0,0 +1,2 @@
[[hbase_data_modeling]]
== Hbase Data Modeling

File renamed without changes.
2 changes: 2 additions & 0 deletions 21-hadoop_internals.asciidoc
@@ -0,0 +1,2 @@
[[hadoop_internals]]
== Hadoop Internals

File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 22-hadoop_tuning.asciidoc
@@ -0,0 +1,2 @@
[[hadoop_tuning]]
== Hadoop Tuning

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 23-datasets_and_scripts.asciidoc
@@ -0,0 +1,2 @@
[[datasets_and_scripts]]
== Datasets And Scripts

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions 24-cheatsheets.asciidoc
@@ -0,0 +1,2 @@
[[cheatsheets]]
== Cheatsheets

76 changes: 76 additions & 0 deletions 24a-unix_cheatsheet.asciidoc
@@ -0,0 +1,76 @@
== Cheatsheets ==

=== Terminal Commands ===

[[hadoop_filesystem_commands]]
.Hadoop Filesystem Commands
[options="header"]
|=======
| action | command
| |
| list files | `hadoop fs -ls`
| list files' disk usage | `hadoop fs -du`
| total HDFS usage/available | visit namenode console
| |
| |
| copy local -> HDFS |
| copy HDFS -> local |
| copy HDFS -> remote HDFS |
| |
| make a directory | `hadoop fs -mkdir ${DIR}`
| move/rename | `hadoop fs -mv ${FILE}`
| dump file to console | `hadoop fs -cat ${FILE} \| cut -c 1-10000 \| head -n 10000`
| |
| |
| remove a file |
| remove a directory tree |
| remove a file, skipping Trash |
| empty the trash NOW |
| |
| health check of HDFS |
| report block usage of files |
| |
| decommission nodes |
| |
| |
| list running jobs |
| kill a job |
| kill a task attempt |
| |
| |
| CPU usage by process | `htop`, or `top` if that's not installed
| Disk activity |
| Network activity |
| |
| | `grep -e '[regexp]'`
| | `head`, `tail`
| | `wc`
| | `uniq -c`
| | `sort -n -k2`
| tuning | csshX, htop, dstat, ulimit
| |
| also useful: | cat, echo, true, false, yes, tee, time, watch
| dos-to-unix line endings | `ruby -ne 'puts $_.gsub(/\r\n?/, "\n")'`
| |
| |
|=======
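
Several rows above are still blank. Until they are filled in, here is an unofficial sketch of the missing commands, assuming the Hadoop 1.x-era CLI; the paths and job/attempt IDs are made-up placeholders, so check everything against your distribution's docs before leaning on it:

[source,bash]
----
hadoop fs -put local_file.tsv /data/              # copy local -> HDFS
hadoop fs -get /data/part-00000 ./                # copy HDFS -> local
hadoop distcp hdfs://nn1/data hdfs://nn2/data     # copy HDFS -> remote HDFS

hadoop fs -rm  /data/scratch/foo.tsv              # remove a file
hadoop fs -rmr /data/scratch                      # remove a directory tree
hadoop fs -rm -skipTrash /data/scratch/foo.tsv    # remove a file, skipping Trash
hadoop fs -expunge                                # empty the trash NOW

hadoop fsck /                                     # health check of HDFS
hadoop fsck /data -files -blocks                  # report block usage of files

hadoop job -list                                  # list running jobs
hadoop job -kill job_201302170001_0001            # kill a job (hypothetical ID)
hadoop job -kill-task attempt_201302170001_0001_m_000000_0   # kill a task attempt
----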

[[commandline_tricks]]
.UNIX commandline tricks
[options="header"]
|=======
| action | command | Flags
| Sort data | `sort` | reverse the sort: `-r`; sort numerically: `-n`; sort on a field: `-t [delimiter] -k [index]`
| Sort large amount of data | `sort --parallel=4 -S 500M` | use four cores and a 500 megabyte sort buffer
| Cut delimited field | `cut -f 1,3-7 -d ','` | emit comma-separated fields one and three through seven
| Cut range of characters | `cut -c 1,3-7` | emit characters one and three through seven
| Split on spaces | `\| ruby -ne 'puts $_.split(/\\s+/).join("\t")'` | split on continuous runs of whitespace, re-emit as tab-separated
| Distinct fields | `\| sort \| uniq` | only dupes: `-d`
| Quickie histogram | `\| sort \| uniq -c` | TODO: check the rendering for backslash
| Per-process usage | `htop` | if installed
| Running system usage | `dstat -drnycmf -t 5` | 5-second rolling system stats. You likely will have to http://dag.wieers.com/home-made/dstat/[install dstat] yourself. If that's not an option, use `iostat -x 5 & sleep 3 ; ifstat 5` for an interleaved 5-second running average.
|======

For example: `cat * | cut -c 1-4 | sort | uniq -c` cuts the first four characters of each line and counts how many times each distinct four-character prefix appears.

Not all commands are available on all platforms; OSX users should use Homebrew, Windows users should use Cygwin.
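
To see how these tricks chain together, here is a sketch (not from the book's text; `article_text.tsv` is a hypothetical input file) of a crude word-count histogram built entirely from the table above:

[source,bash]
----
# one word per line, then count distinct words and show the 25 most common
cat article_text.tsv |
  ruby -ne 'puts $_.split(/\s+/)' |
  sort | uniq -c | sort -rn | head -n 25
----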
@@ -1,79 +1,3 @@
(deletes the "== Cheatsheets ==" / "=== Terminal Commands ===" section: the same 76 lines shown above as additions to 24a-unix_cheatsheet.asciidoc)


=== Regular Expressions ===


@@ -253,18 +177,3 @@ Ascii table:
 "~"
 "\x7F" \c
 "\x80" \c


(deletes the "=== Pig Operators ===" cheatsheet: the same lines shown below as additions to 24c-pig_cheatsheet.asciidoc)

13 changes: 13 additions & 0 deletions 24c-pig_cheatsheet.asciidoc
@@ -0,0 +1,13 @@
=== Pig Operators ===

[[pig_cheatsheet]]
.Pig Operator Cheatsheet
[options="header"]
|=======
| action | operator
| |
| | JOIN
| | FILTER
| |
|=======
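
While the table is a stub, here is roughly what the two operators it names look like in use; the relation and field names (`articles`, `pageviews`, `page_id`, `namespace`) are invented for illustration:

[source,pig]
----
-- keep only main-namespace articles
articles_main   = FILTER articles BY namespace == 0;
-- attach hourly pageview counts to each surviving article
arts_with_views = JOIN articles_main BY page_id, pageviews BY page_id;
----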

3 changes: 3 additions & 0 deletions 24d-hadoop_tunables_cheatsheet.asciidoc
@@ -0,0 +1,3 @@
=== Hadoop Tunables Cheatsheet
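
Nothing here yet; as a placeholder, a few of the Hadoop 1.x property names the Hadoop Tuning chapter will presumably cover (values shown are illustrative, not recommendations):

[source,xml]
----
<!-- mapred-site.xml (illustrative values only) -->
<property><name>mapred.reduce.tasks</name>    <value>12</value></property>
<property><name>mapred.child.java.opts</name> <value>-Xmx1024m</value></property>
<property><name>io.sort.mb</name>             <value>200</value></property>
<property><name>io.sort.factor</name>         <value>25</value></property>
<!-- hdfs-site.xml -->
<property><name>dfs.block.size</name>         <value>134217728</value></property>
<property><name>dfs.replication</name>        <value>3</value></property>
----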


2 changes: 2 additions & 0 deletions 25-appendix.asciidoc
@@ -0,0 +1,2 @@
[[appendix]]
== Appendix

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.