
Fixing section titles and arrangement

1 parent fb20d99 · commit 62f9c95923002bcc3079e687a4786d43adb56324 · Philip (flip) Kromer committed
2 07g-end_of_event_streams.asciidoc
@@ -1,5 +1,5 @@
-=== References
+=== Refs ===
* http://www.botsvsbrowsers.com/[Database of Robot User Agent strings]
2 08a-processing_text.asciidoc
@@ -36,7 +36,7 @@ that quantifies the probability
* http://cogcomp.cs.illinois.edu/page/software_view/4
* http://opennlp.apache.org/
-=== References ===
+=== Refs ===
* http://thedatachef.blogspot.com/2011/04/tf-idf-with-apache-pig.html
8 09a-summarizing.asciidoc
@@ -1,6 +1,5 @@
-== Statistics ==
- simplify and summarize patterns in data: even simple counts and frequencies require some craft at large scale; measures that require any global context, like the median, become fiendish.
+simplify and summarize patterns in data: even simple counts and frequencies require some craft at large scale; measures that require any global context, like the median, become fiendish.
Describe long-tail and normal distribution
@@ -490,13 +489,10 @@ I know that cities of the world lie between 1 and 8 billion. If I want to know m
=== Sampling ===
-
==== Random numbers + Hadoop considered harmful ====
Don't generate a random number as a sampling or sort key in a map job. The problem is that map tasks can be restarted -- because of speculative execution, a failed machine, etc. -- and with random records, each of those runs will dispatch differently. It also makes life hard in general when your jobs aren't predictable run-to-run. You want to make friends with a couple of records early in the source, and keep track of their passage through the full data flow. Similar to the best practice of using intrinsic vs synthetic keys, it's always better to use intrinsic metadata -- truth should flow from the edge inward.
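One way to keep a sample reproducible across task restarts is to make the keep/drop decision a pure function of an intrinsic field. Here is a minimal sketch as a Python streaming mapper; the 10% rate and the assumption that the first TSV field identifies the record are purely illustrative:

[source,python]
----
#!/usr/bin/env python
# Deterministic sample: the keep/drop decision depends only on the record's
# intrinsic key, so a restarted map task emits exactly the same records.
import sys
import hashlib

SAMPLE_RATE = 0.10   # illustrative rate

def keep(record_key):
    digest = hashlib.md5(record_key.encode('utf-8')).hexdigest()
    return (int(digest, 16) % 10000) < SAMPLE_RATE * 10000

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    record_key = fields[0]   # assumes the first field identifies the record
    if keep(record_key):
        sys.stdout.write(line)
----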
-
-
-=== References ===
+=== Refs ===
* http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html[What Every Computer Scientist Should Know About Floating-Point Arithmetic]
5 10a-time_series_data.asciidoc
@@ -1,8 +1,3 @@
-== Time Series ==
-
-=== sessionizing ===
-
-
=== anomaly detection ===
2 11g-end_of_geographic.asciidoc
@@ -63,7 +63,7 @@ Example: merge all cells whose contents lie within 10% of each other
-=== References ===
+=== Refs ===
* http://kartoweb.itc.nl/geometrics/Introduction/introduction.html -- an excellent overview of projections, reference surfaces and other fundamentals of geospatial analysis.
* http://msdn.microsoft.com/en-us/library/bb259689.aspx
2 12a-herding_cats.asciidoc
@@ -1,5 +1,3 @@
-== Herding `cat`s ==
-
=== Moving things to and fro ===
TIMINGS
7 14a-data_formats.asciidoc
@@ -1,9 +1,4 @@
-== Data Formats and Schemata ==
-
-
-there's only three data serialization formats you should use: TSV, JSON, or Avro.
-
-TESTING
+There are only three data serialization formats you should use: TSV, JSON, or Avro.
=== Good Format 1: TSV (It's simple) ===
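As a sketch of how little machinery TSV needs (the field names below are hypothetical, not from the book's datasets):

[source,python]
----
# A TSV record needs nothing beyond split and join, which is exactly why it
# streams so cheaply through Hadoop. Field names here are hypothetical.
fields = ['visitor_id', 'visited_at', 'http_path']
record = {'visitor_id': '1234', 'visited_at': '2012-06-08T10:00:00Z', 'http_path': '/index.html'}

line   = '\t'.join(record[f] for f in fields)        # serialize
parsed = dict(zip(fields, line.split('\t')))         # parse it back
assert parsed == record
----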
4 14c-data_modeling.asciidoc
@@ -1,6 +1,4 @@
-== Data Modeling
-
-=== Datasets ===
+=== Data Modeling ===
There are six and a half shapes of record-oriented datasets:
3 08f-gotchas.asciidoc → 14d-gotchas.asciidoc
@@ -1,4 +1,4 @@
-== Gotchas ==
+=== Gotchas ===
* Hadoop commandline tools use the trash to delete items; programs using the hadoop API do not. <remark>verify Pig uses the trash</remark>
@@ -12,7 +12,6 @@
* Don't hit external services -- see <<server_logs_ddos>>.
-
== Tips ==
* Remove files from the trash immediately with `hadoop fs -rm -skipTrash /users/(you)/.Trash`
2 15a-representing_graphs.asciidoc
@@ -1,5 +1,3 @@
-== Processing Graphs ==
-
Assumptions:
* V fits in ram
2 16a-simple_machine_learning.asciidoc
@@ -1,4 +1,4 @@
-== Black-Box Machine Learning ==
+=== Black-Box Machine Learning ===
Most machine-learning discussions begin with an amuse-bouche about infinite-dimensional vector spaces or multinomial distributions over simplices as a way of easing into the really hard stuff.
4 19a-advanced_pig.asciidoc
@@ -1,5 +1,3 @@
-== Advanced Pig ==
-
=== Advanced Join Fu ===
Pig has three special-purpose join strategies: the "map-side" (aka 'fragment replicate') join
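The idea behind the map-side (fragment replicate) join is that the small table rides along to every mapper and the join happens entirely in the map phase. A rough sketch of that pattern in Python streaming terms -- not Pig's implementation, and the file name and field positions are assumptions:

[source,python]
----
#!/usr/bin/env python
# Map-side join pattern: hold the small table in memory in every mapper and
# join the big table's records against it as they stream past; no reduce needed.
import sys

small = {}
with open('countries.tsv') as f:        # hypothetical small table, shipped to each task
    for line in f:
        code, name = line.rstrip('\n').split('\t')
        small[code] = name

for line in sys.stdin:                   # the big table streams through on stdin
    fields = line.rstrip('\n').split('\t')
    code = fields[1]                     # assumed position of the join key
    print('\t'.join(fields + [small.get(code, '')]))
----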
@@ -171,6 +169,6 @@ TBD
TBD
-=== References ===
+=== Refs ===
* http://pig.apache.org/docs/r0.10.0/perf.html#replicated-joins:[map-side join]
2 20a-hbase_schema.asciidoc
@@ -518,7 +518,7 @@ NOTE notation -- HBase makes heavy use of composite keys (several values combine
HBase is a database for storing "billions of rows and millions of columns"
-=== References ===
+=== Refs ===
* I've drawn heavily on the wisdom of http://hbase.apache.org/book.html[HBase Book]
19 21a-hadoop_internals.asciidoc
@@ -1,6 +1,6 @@
-== Hadoop Execution in Detail ==
+=== Hadoop Execution in Detail ===
-=== Launch ===
+==== Launch ====
When you launch a job (with `pig`, `wukong run`, or `hadoop jar`), it starts a local process that
@@ -24,7 +24,7 @@ Run `tail -f /tmp/my_job-*.log` to keep watching the job's progress.
The job draws its default configuration from the _launch_ machine's config file. Make sure those defaults don't conflict with appropriate values for the workers that will actually execute the job! One great way to screw this up is to launch a job from your dev machine, go to dinner, and come back to find it using one reducer and a tiny heap size. Another is to start your job from a master that is provisioned differently from the workers.
======
-=== Split ===
+==== Split ====
Input files are split and assigned to mappers.
@@ -45,9 +45,9 @@ Exercises:
* Create a 2GB file having a 128MB block size on the HDFS. Run `wu-stream cat cat --min_split_mb=1900` on it. How many map tasks will launch? What will the "non-local" cell on the jobtracker report? Try it out for 1900, and also for values of 128, 120, 130, 900 and 1100.
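A rough way to reason about that exercise, using the simplified rule that the split size is the larger of the block size and the requested minimum (a sketch, not the exact InputFormat logic):

[source,python]
----
import math

def num_map_tasks(file_mb, block_mb, min_split_mb=1):
    # Simplified rule: the split size is the larger of the block size and the
    # requested minimum split size.
    split_mb = max(block_mb, min_split_mb)
    return int(math.ceil(file_mb / float(split_mb)))

for min_split in (1, 120, 128, 130, 900, 1100, 1900):
    print(min_split, num_map_tasks(2048, 128, min_split))
# --min_split_mb=1900 yields just two map tasks; the 128 MB default yields sixteen.
----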
-=== Mappers ===
+==== Mappers ====
-==== Hadoop Streaming (Wukong, MrJob, etc) ====
+===== Hadoop Streaming (Wukong, MrJob, etc) =====
If it's a Hadoop "streaming" job (Wukong, MrJob, etc), the child process is a Java jar that itself hosts your script file:
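Whatever hosts it, the contract the script must honor is small: read records from stdin and emit tab-separated key/value lines on stdout. A minimal, purely illustrative Python sketch of such a mapper (an assumed word-count script, not the book's example):

[source,python]
----
#!/usr/bin/env python
# Minimal streaming mapper: records arrive on stdin, and everything before the
# first tab on each stdout line is treated as the key.
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))
----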
@@ -70,14 +70,13 @@ Once the maps start, it's normal for them to seemingly sit at 0% progress for a
- the mapper sortbuf data threshold
- the mapper sortbuf total threshold
-==== Speculative Execution ====
+===== Speculative Execution =====
For exploratory work, it's worth
+==== Choosing a file size ====
-=== Choosing a file size ===
-
-==== Jobs with Map and Reduce ====
+===== Jobs with Map and Reduce =====
For jobs that have a reducer, the total size of the output dataset divided by the number of reducers implies the size of your output files footnote:[Large variance in counts of reduce keys not only drives up reducer run times, it causes variance in output sizes; but that's just insult added to injury. Worry about that before you worry about the target file size.].
Of course, if your working dataset is less than a few hundred MB, this doesn't matter.
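As a sketch of that arithmetic (the 1 GB target and 200 GB output below are made-up numbers, not recommendations):

[source,python]
----
import math

def reducers_for(total_output_gb, target_file_mb=1024):
    # Aim each reducer's single output file at roughly the target size.
    return max(1, int(math.ceil(total_output_gb * 1024.0 / target_file_mb)))

print(reducers_for(200))   # ~200 GB of output at ~1 GB per file wants about 200 reducers
----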
@@ -99,7 +98,7 @@ Even if you don't find any of those compelling enough to hang your hat on, I'll
If your dataset is
-==== Mapper-only jobs ====
+===== Mapper-only jobs =====
There's a tradeoff:
6 22a-tuning-wise_and_lazy.asciidoc
@@ -1,4 +1,4 @@
-== Hadoop Tuning for the wise and lazy
+=== Hadoop Tuning for the wise and lazy ===
There are enough knobs and twiddles on a Hadoop installation to fully stock the cockpit of a 747. Many of them interact surprisingly, and many settings improve some types of jobs while impeding others. This chapter will help you determine
@@ -217,7 +217,7 @@ an m2.2xlarge:
In pre-2.0 Hadoop (the version most commonly found at time of writing in 2012), there's a hard limit of 2 GB in the buffers used for merge sorting of mapper outputs footnote:[it's even worse than that, actually; see `mapred.job.shuffle.input.buffer.percent` in the tuning-for-the-foolish chapter.]. You want to make good use of those buffers, but
-== Hadoop Tuning for the foolish and brave
+=== Hadoop Tuning for the foolish and brave ===
=== Measuring your system: theoretical limits ===
@@ -269,7 +269,7 @@ That is, however, from intra-cluster traffic. By contrast, flume connections are
see `-Dpig.exec.nocombiner=true` if using combiners badly. (You'll want to use this for a rollup job).
-== Tuning pt 2 ==
+=== Tuning pt 2 ===
* Lots of files:
- Namenode and 2NN heap size
1 22b-scripts.asciidoc
@@ -1,4 +1,3 @@
-
=== Explorations and Scripts
* Wikipedia
4 22c-tuning-brave_and_foolish.asciidoc
@@ -1,4 +1,4 @@
-== Hadoop System configurations ==
+=== Hadoop System configurations ===
Here first are some general themes to keep in mind:
@@ -8,8 +8,6 @@ The default settings are those that satisfy in some mixture the constituencies o
* If you're going to run two master nodes, you're a bit better off running one master as (namenode only) and the other master as (jobtracker, 2NN, balancer) -- the 2NN should be distinctly less utilized than the namenode. This isn't a big deal, as I assume your master nodes never really break a sweat even during heavy usage.
-
-
**Memory**
Here's a plausible configuration for a 16-GB physical machine with 8 cores:
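One way to sanity-check any such configuration is to confirm the heaps add up to comfortably less than physical RAM. A rough check in Python, with placeholder slot counts and heap sizes that are assumptions rather than the book's recommendation:

[source,python]
----
# Rough sanity check: task heaps plus daemon heaps should leave headroom for
# the OS page cache. All numbers below are placeholders, not recommendations.
map_slots, reduce_slots  = 6, 2          # assumed split of the 8 cores
map_heap_mb, red_heap_mb = 1024, 2048    # assumed child JVM heap sizes
daemon_heap_mb           = 2 * 1024      # datanode + tasktracker, assumed 1 GB each

committed_mb = map_slots * map_heap_mb + reduce_slots * red_heap_mb + daemon_heap_mb
print('%d MB of 16384 MB committed' % committed_mb)
----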
2 22d-use_method_checklist.asciidoc
@@ -1,5 +1,3 @@
-== Hadoop Metrics ==
-
[[use_method]]
=== The USE Method applied to Hadoop ===
2 23a-overview_of_datasets.asciidoc
@@ -1,5 +1,3 @@
-[[overview_of_datasets]]
-== Overview of Datasets ==
The examples in this book use the "Chimpmark" datasets: a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schema. They are the same datasets as used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:
2 24a-unix_cheatsheet.asciidoc
@@ -1,5 +1,3 @@
-== Cheatsheets ==
-
=== Terminal Commands ===
[[hadoop_filesystem_commands]]
2 25a-authors.asciidoc
@@ -1,5 +1,3 @@
-== Book Metadata ==
-
=== Author ===
Philip (flip) Kromer is the founder and CTO at Infochimps.com, a big data platform that makes acquiring, storing and analyzing massive data streams transformatively easier. He enjoys bowling, Scrabble, working on old cars or new wood, and rooting for the Red Sox.
2 25b-colophon.asciidoc
@@ -1,4 +1,4 @@
-== A sort of colophon
+=== A sort of colophon ===
The http://github.com/schacon/git-scribe[git-scribe toolchain] was very useful in creating this book. Instructions on how to install the tool and use it for things like editing this book, submitting errata and providing translations can be found at that site.
