Organized storm+trident sections
Philip (flip) Kromer committed Jan 10, 2014
1 parent ae3161f commit 251280c
Showing 11 changed files with 76 additions and 21 deletions.
8 changes: 4 additions & 4 deletions 11-geographic.asciidoc
@@ -1,7 +1,7 @@
[[geographic]]
== Geographic Data Processing

=== Weather Near You: Turning A Point Distribution Into Regions of Influence
=== Turning Points of Measurements Into Regions of Influence

Frequently, geospatial data is, for practical reasons, sampled at discrete points but should be understood to represent measurements at all points in space. For example, the measurements in the NCDC datasets are gathered at locations chosen for convenience or value -- in some cases, neighboring stations are separated by blocks, in other cases by hundreds of miles. It is useful to be able to reconstruct the underlying spatial distribution from point-sample measurements.

@@ -17,7 +17,7 @@ This effectively precomputes the “nearest x” problem: For any point in ques

It also presents a solution to the spatial sampling problem by assigning the measurements taken at each sample location to its Voronoi region. You can use these piece-wise regions directly or follow up with some sort of spatial smoothing as your application requires. Let’s dive in and see how to do this in practice.

==== (TODO: section name)
==== Finding Nearby Objects

Let’s use the GeoNames dataset to create a “nearest <whatever> to you” application, one that, given a visitor’s geolocation, will return the closest hospital, school, restaurant and so forth. We will do so by effectively pre-calculating all potential queries; this could be considered overkill for the number of geofeatures within the GeoNames dataset but we want to illustrate an approach that will scale to the number of cell towers, gas stations or anything else.
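
To make "pre-calculating all potential queries" concrete, here is a minimal sketch -- ours, not GeoNames' or Hadoop's -- of the standard web-Mercator quad-key math that turns a visitor's longitude/latitude into the key of the tile holding the precomputed answers. The class and method names are illustrative.

[source,java]
----
// Sketch only: standard quad-key math; class and method names are illustrative.
public class QuadKey {
  /** Longitude/latitude in degrees to a quad key string at the given zoom level. */
  public static String forPoint(double lng, double lat, int zoom) {
    int n = 1 << zoom;                                  // tiles per axis at this zoom
    double latRad = Math.toRadians(lat);
    int tileX = (int) Math.floor((lng + 180.0) / 360.0 * n);
    int tileY = (int) Math.floor(
        (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * n);
    StringBuilder key = new StringBuilder();            // (edge-of-map clamping omitted)
    for (int z = zoom - 1; z >= 0; z--) {
      int digit = 2 * ((tileY >> z) & 1) + ((tileX >> z) & 1);  // interleave y and x bits
      key.append(digit);
    }
    return key.toString();
  }
}
----

A lookup for a visitor is then a single fetch of whatever record was stored under `QuadKey.forPoint(lng, lat, zoom)`.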

@@ -61,7 +61,7 @@ To cover the entire globe at zoom level 13 requires 67 million records, each cov
If the preceding considerations leave you with a range of acceptable zoom levels, choose one in the middle. If they do not, you will need to use the multiscale decomposition approach (TODO: REF) described later in this chapter.
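
If you want to tabulate that trade-off before committing to a zoom level, the arithmetic is simple: each additional zoom level multiplies the record count by four and halves the tile edge. A throwaway sketch (ours, for illustration):

[source,java]
----
// Sketch: rough sizing of a global quad-tile grid at each candidate zoom level.
public class QuadTileSizing {
  static final double EQUATOR_KM = 40_075.0;      // Earth's circumference at the equator

  public static void main(String[] args) {
    for (int zoom = 10; zoom <= 16; zoom++) {
      long tiles = 1L << (2 * zoom);              // 4^zoom tiles cover the globe
      double edgeKm = EQUATOR_KM / (1 << zoom);   // tile edge at the equator
      System.out.printf("zoom %2d: %,12d tiles, ~%.1f km per side%n", zoom, tiles, edgeKm);
    }
  }
}
----

At zoom level 13 this works out to the 67 million tiles mentioned above, each roughly five kilometers on a side at the equator.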
==== (TODO: name section)
==== Voronoi Polygons turn Points into Regions
Now, let's use the Voronoi trick to turn a distribution of measurements at discrete points into the distribution over regions it is intended to represent. In particular, we will take the weather-station-by-weather-station measurements in the NCDC dataset and turn them into an hour-by-hour map of global weather data. The spatial distribution of weather stations varies widely in space and over time: for major cities in recent years there may be many dozens, while over stretches of the Atlantic Ocean, and in many places several decades ago, weather stations might be separated by hundreds of miles. Weather stations go in and out of service, so we will have to prepare multiple Voronoi maps. Even within their time of service, however, they can also go offline for various reasons, so we have to be prepared for missing data. We will generate one Voronoi map for each year, covering every weather station active within that year, acknowledging that the stretch before and after its time of service will therefore appear as missing data.
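
Computing the polygons themselves is a job for a computational-geometry library rather than hand-rolled code. As one hedged illustration -- the choice of library and the method names here are ours -- the JTS topology suite's `VoronoiDiagramBuilder` will produce a cell for each station location, treating longitude/latitude as a flat plane, which is a simplification you may want to refine:

[source,java]
----
import java.util.List;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.triangulate.VoronoiDiagramBuilder;

// Sketch: one Voronoi map per year, built from that year's active station locations.
public class StationVoronoi {
  public static Geometry diagramFor(List<Coordinate> stationLngLats) {
    VoronoiDiagramBuilder builder = new VoronoiDiagramBuilder();
    builder.setSites(stationLngLats);                            // one site per active station
    builder.setClipEnvelope(new Envelope(-180, 180, -90, 90));   // clip cells to the map bounds
    // Returns a GeometryCollection holding one polygon per station site.
    return builder.getDiagram(new GeometryFactory());
  }
}
----

Each cell then gets decomposed onto the multiscale quad tiles whose records are described below.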
@@ -104,7 +104,7 @@ The payoff for all this is pretty sweet. We only have to store and we only have
NOTE: The multiscale keys work very well in HBase too. For the case where you are storing multiscale regions and querying on points, you will want to use a replacement character that is lexicographically after the digits, say, the letter "x." To find the record for a given point, do a range request for one record on the interval starting with that point's quad key and extending to infinity (xxxxx…). For a point with the finest-grain quad key of 012012, if the database has a record for 012012, that will turn up; if, instead, that region only required zoom level 4, the appropriate row (0120xx) would be correctly returned.
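
Here is what that one-row range request looks like in code -- a minimal sketch against the HBase 2.x Java client, with a hypothetical table name `tile_weather`. Since we only read the first row the scanner returns, there is no need to spell out the trailing-infinity stop key.

[source,java]
----
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: fetch the multiscale tile record covering a point (table name is hypothetical).
public class TileLookup {
  public static Result tileCovering(Connection conn, String pointQuadKey) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("tile_weather"))) {
      Scan scan = new Scan();
      scan.withStartRow(Bytes.toBytes(pointQuadKey));   // e.g. "012012"
      scan.setCaching(1);                               // we only want the first row back
      try (ResultScanner scanner = table.getScanner(scan)) {
        // First row at or after the key: "012012" itself, or a coarser region like "0120xx".
        return scanner.next();                          // null if no tile record exists
      }
    }
  }
}
----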
==== (TODO: name section)
==== Smoothing the Distribution
We now have in hand, for each year, a set of multiscale quad tile records with each record holding the weather station IDs that cover it. What we want to produce is a dataset that has, for each hour and each such quad tile, a record describing the consensus weather on that quad tile. If you are a meteorologist, you will probably want to take some care in forming the right weighted summarizations -- averaging the fields that need averaging, thresholding the fields that need thresholding and so forth. We are going to cheat and adopt the consensus rule of "eliminate weather stations with missing data, then choose the weather station with the largest area coverage on the quad tile and use its data unmodified." To assist that, we made a quiet piece of preparation and have sorted the weather station IDs from largest to smallest in area of coverage, so that the Reducer simply has to choose from among its input records the earliest one on that list.
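
A sketch of that consensus Reducer follows. The record layout -- a coverage rank, a missing-data flag, then the weather fields, tab-separated -- is our own invention for illustration; the point is only that, with the station IDs pre-sorted by coverage, the Reducer just keeps the best-ranked complete observation it sees.

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: choose the consensus observation for one (quad tile, hour) key.
// Value layout (illustrative): coverage_rank TAB has_missing_data TAB weather fields...
public class ConsensusWeatherReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text tileAndHour, Iterable<Text> observations, Context context)
      throws IOException, InterruptedException {
    String best = null;
    int bestRank = Integer.MAX_VALUE;
    for (Text obs : observations) {
      String[] fields = obs.toString().split("\t", 3);
      if (Boolean.parseBoolean(fields[1])) continue;    // eliminate stations with missing data
      int rank = Integer.parseInt(fields[0]);           // lower rank = larger coverage area
      if (rank < bestRank) { bestRank = rank; best = fields[2]; }
    }
    if (best != null) {
      context.write(tileAndHour, new Text(best));       // use the winner's data unmodified
    }
  }
}
----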
File renamed without changes.
@@ -1,14 +1,11 @@
=== Hadoop System configurations ===
== Hadoop Tuning for the Brave and Foolish


First, here are some general themes to keep in mind:

The default settings are those that satisfy, in some mixture, the constituencies of a) Yahoo, Facebook, Twitter, etc.; and b) Hadoop developers, i.e., people who *write* Hadoop but rarely *use* Hadoop. This means that many low-stakes settings (like keeping job stats around for more than a few hours) are at the values that make sense when you have a petabyte-scale cluster and a hundred data engineers.

* If you're going to run two master nodes, you're a bit better off running one master as (namenode only) and the other master as (jobtracker, 2NN, balancer) -- the 2NN should be distinctly less utilized than the namenode. This isn't a big deal, as I assume your master nodes never really break a sweat even during heavy usage.
**Memory**
=== Memory ===

Here's a plausible configuration for a 16-GB physical machine with 8 cores:

@@ -52,7 +49,7 @@ Here's a plausible configuration for a 16-GB physical machine with 8 cores:

* Given the number of files and amount of data you're storing, I would set the NN heap size aggressively -- at least 4GB to start, and keep an eye on it. Having the NN run out of memory is Not Good. Always make sure the secondary name node has the same heap setting as the name node.

**Handlers and threads**
=== Handlers and threads ===

* `dfs.namenode.handler.count`: default `X`, recommended: `(0.1 to 1) * size of cluster`, depending on how many blocks your HDFS holds.
* `tasktracker.http.threads` default `X`, recommended `X`
@@ -64,7 +61,7 @@ Here's a plausible configuration for a 16-GB physical machine with 8 cores:

* `io.file.buffer.size`: default `X`, recommended `65536`; always use a multiple of `4096`.

**Storage**
=== Storage ===

* `mapred.system.dir`: default `X`, recommended `/hadoop/mapred/system` (note that this is a path on the HDFS, not the local filesystem).

@@ -85,7 +82,7 @@ Here's a plausible configuration for a 16-GB physical machine with 8 cores:
* `dfs.blockreport.intervalMsec`: default 3_600_000 (1 hour); recommended 21_600_000 (6 hours) for a large cluster.
- 100_000 blocks per data node for a good ratio of CPU to disk

**Other**
=== Other ===

* `mapred.map.output.compression.codec`: default XX, recommended ``. Enable Snappy codec for intermediate task output.
- `mapred.compress.map.output`
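
These two are job-level settings, so besides the cluster-wide default you can flip them on in a job driver. A minimal sketch using the old-style property names listed above (newer Hadoop releases still honor them as deprecated aliases), assuming the Snappy native libraries are installed on the cluster:

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: enable Snappy compression of intermediate (map) output for a single job.
public class SnappyMapOutputSetup {
  public static Job configuredJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");
    return Job.getInstance(conf, "snappy-map-output");
  }
}
----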
@@ -113,5 +110,3 @@ Here's a plausible configuration for a 16-GB physical machine with 8 cores:

* Bump `mapreduce.job.counters.limit` -- it's not configurable per-job.



File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
56 changes: 56 additions & 0 deletions book-all.asciidoc
@@ -0,0 +1,56 @@
= Big Data for Chimps

include::00-preface.asciidoc[]

include::01-first_exploration.asciidoc[]

include::02-hadoop_basics.asciidoc[]

include::03-map_reduce.asciidoc[]

include::04-structural_operations.asciidoc[]

include::05-analytic_patterns.asciidoc[]

include::06-big_data_ecosystem.asciidoc[]

include::07-filesystem_mojo.asciidoc[]

include::08-intro_to_storm+trident.asciidoc[]

include::09-statistics.asciidoc[]

include::10-event_streams.asciidoc[]

include::11-geographic.asciidoc[]

include::12-conceptual_model.asciidoc[]

include::13-data_munging.asciidoc[]

include::14-organizing_data.asciidoc[]

include::16-machine_learning.asciidoc[]

include::18-java_api.asciidoc[]

include::19-advanced_pig.asciidoc[]

include::19a-advanced_pig.asciidoc[]

include::20-hbase_data_modeling.asciidoc[]

include::21-hadoop_internals.asciidoc[]

include::21b-hadoop_internals-map_reduce.asciidoc[]

include::22-hadoop_tuning.asciidoc[]

include::23-hadoop_tuning-brave_and_foolish.asciidoc[]

include::24-storm+trident-internals.asciidoc[]

include::25-storm+trident-tuning.asciidoc[]

include::25-appendix.asciidoc[]

18 changes: 11 additions & 7 deletions book.asciidoc
@@ -1,5 +1,7 @@
= Big Data for Chimps

include::00-preface.asciidoc[]

include::01-first_exploration.asciidoc[]

include::02-hadoop_basics.asciidoc[]
@@ -8,7 +10,7 @@ include::03-map_reduce.asciidoc[]

include::04-structural_operations.asciidoc[]

include::05-hadoop_patterns.asciidoc[]
include::05-analytic_patterns.asciidoc[]

include::06-big_data_ecosystem.asciidoc[]

@@ -22,8 +24,12 @@ include::10-event_streams.asciidoc[]

include::11-geographic.asciidoc[]

include::12-conceptual_model.asciidoc[]

include::13-data_munging.asciidoc[]

include::14-organizing_data.asciidoc[]

include::16-machine_learning.asciidoc[]

include::18-java_api.asciidoc[]
@@ -36,16 +42,14 @@ include::20-hbase_data_modeling.asciidoc[]

include::21-hadoop_internals.asciidoc[]

include::21c-hadoop_internals-hdfs.asciidoc[]
include::21d-hadoop_internals-tuning.asciidoc[]
include::21b-hadoop_internals-map_reduce.asciidoc[]

include::22-hadoop_tuning.asciidoc[]

include::88-storm+trident-internals.asciidoc[]
include::23-hadoop_tuning-brave_and_foolish.asciidoc[]

include::88-storm+trident-tuning.asciidoc[]
include::24-storm+trident-internals.asciidoc[]

include::11-conceptual_model.asciidoc[]
include::25-storm+trident-tuning.asciidoc[]

include::25-appendix.asciidoc[]

File renamed without changes.
