Organized storm+trident sections
Philip (flip) Kromer committed Jan 10, 2014
1 parent ae3161f commit 251280c
Showing 11 changed files with 76 additions and 21 deletions.
8 changes: 4 additions & 4 deletions 11-geographic.asciidoc
@@ -1,7 +1,7 @@
[[geographic]]
== Geographic Data Processing

=== Weather Near You: Turning A Point Distribution Into Regions of Influence
=== Turning Points of Measurements Into Regions of Influence

Frequently, geospatial data is, for practical reasons, sampled at discrete points but should be understood to represent measurements at all points in space. For example, the measurements in the NCDC datasets are gathered at locations chosen for convenience or value -- in some cases, neighboring stations are separated by blocks, in other cases by hundreds of miles. It is useful to be able to reconstruct the underlying spatial distribution from point-sample measurements.

@@ -17,7 +17,7 @@ This effectively precomputes the “nearest x” problem: For any point in ques

It also presents a solution to the spatial sampling problem by assigning the measurements taken at each sample location to its Voronoi region. You can use these piece-wise regions directly or follow up with some sort of spatial smoothing as your application requires. Let’s dive in and see how to do this in practice.

==== (TODO: section name)
==== Finding Nearby Objects

Let’s use the GeoNames dataset to create a “nearest <whatever> to you” application, one that, given a visitor’s geolocation, will return the closest hospital, school, restaurant and so forth. We will do so by effectively pre-calculating all potential queries; this could be considered overkill for the number of geofeatures within the GeoNames dataset but we want to illustrate an approach that will scale to the number of cell towers, gas stations or anything else.
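
To make "pre-calculating all potential queries" concrete, here is a minimal sketch -- ours, not GeoNames' or Hadoop's -- of the standard web-Mercator quad-key math that turns a visitor's longitude/latitude into the key of the tile holding the precomputed answers. The class and method names are illustrative.

[source,java]
----
// Sketch only: standard quad-key math; class and method names are illustrative.
public class QuadKey {
  /** Longitude/latitude in degrees to a quad key string at the given zoom level. */
  public static String forPoint(double lng, double lat, int zoom) {
    int n = 1 << zoom;                                  // tiles per axis at this zoom
    double latRad = Math.toRadians(lat);
    int tileX = (int) Math.floor((lng + 180.0) / 360.0 * n);
    int tileY = (int) Math.floor(
        (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * n);
    StringBuilder key = new StringBuilder();            // (edge-of-map clamping omitted)
    for (int z = zoom - 1; z >= 0; z--) {
      int digit = 2 * ((tileY >> z) & 1) + ((tileX >> z) & 1);  // interleave y and x bits
      key.append(digit);
    }
    return key.toString();
  }
}
----

A lookup for a visitor is then a single fetch of whatever record was stored under `QuadKey.forPoint(lng, lat, zoom)`.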

@@ -61,7 +61,7 @@ To cover the entire globe at zoom level 13 requires 67 million records, each cov
If the preceding considerations leave you with a range of acceptable zoom levels, choose one in the middle. If they do not, you will need to use the multiscale decomposition approach (TODO: REF) described later in this chapter.
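
If you want to tabulate that trade-off before committing to a zoom level, the arithmetic is simple: each additional zoom level multiplies the record count by four and halves the tile edge. A throwaway sketch (ours, for illustration):

[source,java]
----
// Sketch: rough sizing of a global quad-tile grid at each candidate zoom level.
public class QuadTileSizing {
  static final double EQUATOR_KM = 40_075.0;      // Earth's circumference at the equator

  public static void main(String[] args) {
    for (int zoom = 10; zoom <= 16; zoom++) {
      long tiles = 1L << (2 * zoom);              // 4^zoom tiles cover the globe
      double edgeKm = EQUATOR_KM / (1 << zoom);   // tile edge at the equator
      System.out.printf("zoom %2d: %,12d tiles, ~%.1f km per side%n", zoom, tiles, edgeKm);
    }
  }
}
----

At zoom level 13 this works out to the 67 million tiles mentioned above, each roughly five kilometers on a side at the equator.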
==== (TODO: name section)
==== Voronoi Polygons turn Points into Regions
Now, let's use the Voronoi trick to turn a distribution of measurements at discrete points into the distribution over regions it is intended to represent. In particular, we will take the weather-station-by-weather-station measurements in the NCDC dataset and turn them into an hour-by-hour map of global weather data. The spatial distribution of weather stations varies widely in space and over time: for major cities in recent years there may be many dozens, while over stretches of the Atlantic Ocean, and in many places several decades ago, weather stations might be separated by hundreds of miles. Weather stations go in and out of service, so we will have to prepare multiple Voronoi maps. Even within their time of service, however, they can also go offline for various reasons, so we have to be prepared for missing data. We will generate one Voronoi map for each year, covering every weather station active within that year, acknowledging that the stretch before and after its time of service will therefore appear as missing data.
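
Computing the polygons themselves is a job for a computational-geometry library rather than hand-rolled code. As one hedged illustration -- the choice of library and the method names here are ours -- the JTS topology suite's `VoronoiDiagramBuilder` will produce a cell for each station location, treating longitude/latitude as a flat plane, which is a simplification you may want to refine:

[source,java]
----
import java.util.List;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.triangulate.VoronoiDiagramBuilder;

// Sketch: one Voronoi map per year, built from that year's active station locations.
public class StationVoronoi {
  public static Geometry diagramFor(List<Coordinate> stationLngLats) {
    VoronoiDiagramBuilder builder = new VoronoiDiagramBuilder();
    builder.setSites(stationLngLats);                            // one site per active station
    builder.setClipEnvelope(new Envelope(-180, 180, -90, 90));   // clip cells to the map bounds
    // Returns a GeometryCollection holding one polygon per station site.
    return builder.getDiagram(new GeometryFactory());
  }
}
----

Each cell then gets decomposed onto the multiscale quad tiles whose records are described below.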
@@ -104,7 +104,7 @@ The payoff for all this is pretty sweet. We only have to store and we only have
NOTE: The multiscale keys work very well in HBase too. For the case where you are storing multiscale regions and querying on points, you will want to use a replacement character that is lexicographically after the digits, say, the letter "x." To find the record for a given point, do a range request for one record on the interval starting with that point's quad key and extending to infinity (xxxxx…). For a point with the finest-grain quad key of 012012, if the database has a record for 012012, that will turn up; if, instead, that region only required zoom level 4, the appropriate row (0120xx) would be correctly returned.
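
Here is what that one-row range request looks like in code -- a minimal sketch against the HBase 2.x Java client, with a hypothetical table name `tile_weather`. Since we only read the first row the scanner returns, there is no need to spell out the trailing-infinity stop key.

[source,java]
----
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: fetch the multiscale tile record covering a point (table name is hypothetical).
public class TileLookup {
  public static Result tileCovering(Connection conn, String pointQuadKey) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("tile_weather"))) {
      Scan scan = new Scan();
      scan.withStartRow(Bytes.toBytes(pointQuadKey));   // e.g. "012012"
      scan.setCaching(1);                               // we only want the first row back
      try (ResultScanner scanner = table.getScanner(scan)) {
        // First row at or after the key: "012012" itself, or a coarser region like "0120xx".
        return scanner.next();                          // null if no tile record exists
      }
    }
  }
}
----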
==== (TODO: name section)
==== Smoothing the Distribution
We now have in hand, for each year, a set of multiscale quad tile records with each record holding the weather station IDs that cover it. What we want to produce is a dataset that has, for each hour and each such quad tile, a record describing the consensus weather on that quad tile. If you are a meteorologist, you will probably want to take some care in forming the right weighted summarizations -- averaging the fields that need averaging, thresholding the fields that need thresholding and so forth. We are going to cheat and adopt the consensus rule of "eliminate weather stations with missing data, then choose the weather station with the largest area coverage on the quad tile and use its data unmodified." To assist that, we made a quiet piece of preparation and have sorted the weather station IDs from largest to smallest in area of coverage, so that the Reducer simply has to choose from among its input records the earliest one on that list.
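
A sketch of that consensus Reducer follows. The record layout -- a coverage rank, a missing-data flag, then the weather fields, tab-separated -- is our own invention for illustration; the point is only that, with the station IDs pre-sorted by coverage, the Reducer just keeps the best-ranked complete observation it sees.

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: choose the consensus observation for one (quad tile, hour) key.
// Value layout (illustrative): coverage_rank TAB has_missing_data TAB weather fields...
public class ConsensusWeatherReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text tileAndHour, Iterable<Text> observations, Context context)
      throws IOException, InterruptedException {
    String best = null;
    int bestRank = Integer.MAX_VALUE;
    for (Text obs : observations) {
      String[] fields = obs.toString().split("\t", 3);
      if (Boolean.parseBoolean(fields[1])) continue;    // eliminate stations with missing data
      int rank = Integer.parseInt(fields[0]);           // lower rank = larger coverage area
      if (rank < bestRank) { bestRank = rank; best = fields[2]; }
    }
    if (best != null) {
      context.write(tileAndHour, new Text(best));       // use the winner's data unmodified
    }
  }
}
----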
File renamed without changes.
@@ -1,14 +1,11 @@
=== Hadoop System configurations ===
== Hadoop Tuning for the Brave and Foolish


First, here are some general themes to keep in mind:

The default settings are those that satisfy, in some mixture, the constituencies of a) Yahoo, Facebook, Twitter, etc.; and b) Hadoop developers, i.e., people who *write* Hadoop but rarely *use* Hadoop. This means that many low-stakes settings (like keeping job stats around for more than a few hours) are at the values that make sense when you have a petabyte-scale cluster and a hundred data engineers.

* If you're going to run two master nodes, you're a bit better off running one master as (namenode only) and the other master as (jobtracker, 2NN, balancer) -- the 2NN should be distinctly less utilized than the namenode. This isn't a big deal, as I assume your master nodes never really break a sweat even during heavy usage.
**Memory**
=== Memory ===

Here's a plausible configuration for a 16-GB physical machine with 8 cores:

@@ -52,7 +49,7 @@ Here's a plausible configuration for a 16-GB physical machine with 8 cores:

* Given the number of files and amount of data you're storing, I would set the NN heap size aggressively -- at least 4GB to start, and keep an eye on it. Having the NN run out of memory is Not Good. Always make sure the secondary name node has the same heap setting as the name node.

**Handlers and threads**
=== Handlers and threads ===

* `dfs.namenode.handler.count`: default `X`, recommended: `(0.1 to 1) * size of cluster`, depending on how many blocks your HDFS holds.
* `tasktracker.http.threads` default `X`, recommended `X`
@@ -64,7 +61,7 @@ Here's a plausible configuration for a 16-GB physical machine with 8 cores:

* `io.file.buffer.size`: default `X`, recommended `65536`; always use a multiple of `4096`.

**Storage**
=== Storage ===

* `mapred.system.dir`: default `X`, recommended `/hadoop/mapred/system` (note that this is a path on the HDFS, not the local filesystem).

@@ -85,7 +82,7 @@ Here's a plausible configuration for a 16-GB physical machine with 8 cores:
* `dfs.blockreport.intervalMsec`: default 3_600_000 (1 hour); recommended 21_600_000 (6 hours) for a large cluster.
- 100_000 blocks per data node for a good ratio of CPU to disk

**Other**
=== Other ===

* `mapred.map.output.compression.codec`: default XX, recommended ``. Enable Snappy codec for intermediate task output.
- `mapred.compress.map.output`
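
These two are job-level settings, so besides the cluster-wide default you can flip them on in a job driver. A minimal sketch using the old-style property names listed above (newer Hadoop releases still honor them as deprecated aliases), assuming the Snappy native libraries are installed on the cluster:

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: enable Snappy compression of intermediate (map) output for a single job.
public class SnappyMapOutputSetup {
  public static Job configuredJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");
    return Job.getInstance(conf, "snappy-map-output");
  }
}
----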
@@ -113,5 +110,3 @@ Here's a plausible configuration for a 16-GB physical machine with 8 cores:

* Bump `mapreduce.job.counters.limit` -- it's not configurable per-job.



File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
56 changes: 56 additions & 0 deletions book-all.asciidoc
@@ -0,0 +1,56 @@
= Big Data for Chimps

include::00-preface.asciidoc[]

include::01-first_exploration.asciidoc[]

include::02-hadoop_basics.asciidoc[]

include::03-map_reduce.asciidoc[]

include::04-structural_operations.asciidoc[]

include::05-analytic_patterns.asciidoc[]

include::06-big_data_ecosystem.asciidoc[]

include::07-filesystem_mojo.asciidoc[]

include::08-intro_to_storm+trident.asciidoc[]

include::09-statistics.asciidoc[]

include::10-event_streams.asciidoc[]

include::11-geographic.asciidoc[]

include::12-conceptual_model.asciidoc[]

include::13-data_munging.asciidoc[]

include::14-organizing_data.asciidoc[]

include::16-machine_learning.asciidoc[]

include::18-java_api.asciidoc[]

include::19-advanced_pig.asciidoc[]

include::19a-advanced_pig.asciidoc[]

include::20-hbase_data_modeling.asciidoc[]

include::21-hadoop_internals.asciidoc[]

include::21b-hadoop_internals-map_reduce.asciidoc[]

include::22-hadoop_tuning.asciidoc[]

include::23-hadoop_tuning-brave_and_foolish.asciidoc[]

include::24-storm+trident-internals.asciidoc[]

include::25-storm+trident-tuning.asciidoc[]

include::25-appendix.asciidoc[]

18 changes: 11 additions & 7 deletions book.asciidoc
@@ -1,5 +1,7 @@
= Big Data for Chimps

include::00-preface.asciidoc[]

include::01-first_exploration.asciidoc[]

include::02-hadoop_basics.asciidoc[]
@@ -8,7 +10,7 @@ include::03-map_reduce.asciidoc[]

include::04-structural_operations.asciidoc[]

include::05-hadoop_patterns.asciidoc[]
include::05-analytic_patterns.asciidoc[]

include::06-big_data_ecosystem.asciidoc[]

@@ -22,8 +24,12 @@ include::10-event_streams.asciidoc[]

include::11-geographic.asciidoc[]

include::12-conceptual_model.asciidoc[]

include::13-data_munging.asciidoc[]

include::14-organizing_data.asciidoc[]

include::16-machine_learning.asciidoc[]

include::18-java_api.asciidoc[]
@@ -36,16 +42,14 @@ include::20-hbase_data_modeling.asciidoc[]

include::21-hadoop_internals.asciidoc[]

include::21c-hadoop_internals-hdfs.asciidoc[]
include::21d-hadoop_internals-tuning.asciidoc[]
include::21b-hadoop_internals-map_reduce.asciidoc[]

include::22-hadoop_tuning.asciidoc[]

include::88-storm+trident-internals.asciidoc[]
include::23-hadoop_tuning-brave_and_foolish.asciidoc[]

include::88-storm+trident-tuning.asciidoc[]
include::24-storm+trident-internals.asciidoc[]

include::11-conceptual_model.asciidoc[]
include::25-storm+trident-tuning.asciidoc[]

include::25-appendix.asciidoc[]

File renamed without changes.
