
fixing build errors

1 parent 05ee8f5 commit fb20d9976e85b1f6f920becf2e45f9d14a025312 Philip (flip) Kromer committed Feb 19, 2013
@@ -106,9 +106,6 @@ This is the plan. We'll roll material out over the next few months. Should we fi
- Why Hadoop
- robots are cheap, people are important
-
-PRACTICAL
-
18. *Hadoop Native Java API*
- don't
@@ -129,8 +126,6 @@ PRACTICAL
- Tuning for the Wise and Lazy
- Tuning for the Brave and Foolish
- The USE Method for understanding performance and diagnosing problems
-
-APPENDIX
23. *Overview of Datasets and Scripts*
- Datasets
@@ -30,8 +30,6 @@
- MI for geotile
- (visualize)
-INTERMEDIATE
-
5. *The Toolset*
- toolset overview
- pig vs hive vs impala
@@ -187,8 +185,6 @@ INTERMEDIATE
- How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
- Why Hadoop
- robots are cheap, people are important
-
-PRACTICAL
18. *Hadoop Native Java API*
- don't
@@ -212,8 +208,6 @@ PRACTICAL
- Tuning for the Wise and Lazy
- Tuning for the Brave and Foolish
- The USE Method for understanding performance and diagnosing problems
-
-APPENDIX
23. *Overview of Datasets and Scripts*
- Datasets
@@ -1,6 +1,5 @@
-
Chimpanzee, Elephant and Elf worked in harmony according to the following workflow:
* Chimpanzee:
@@ -22,7 +22,7 @@ Libraries:
Line numbering is _astonishingly_ hard. Each reducer needs to know how many records came before it -- but that information is non-local. It's a total sort with a degree of difficulty.
-===== Mapper
+==== Mapper
Choose how you're going to partition and order the data.
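To make the difficulty concrete, here is a minimal sketch of one way the second pass could look, assuming a hypothetical first pass has already written per-partition running offsets to a side file named `partition_offsets.tsv`. This is not the book's actual implementation, only an illustration of why the non-local count matters.

[source,python]
----
#!/usr/bin/env python
# Hypothetical second pass of a two-pass line-numbering job: pass one counted
# records per partition; each reducer here starts from the running offset of
# every partition that sorts before its own.
import sys

def load_offsets(path='partition_offsets.tsv'):
    """Map partition key -> number of records in all earlier partitions."""
    offsets = {}
    with open(path) as f:
        for line in f:
            part, offset = line.rstrip('\n').split('\t')
            offsets[part] = int(offset)
    return offsets

def number_lines(stream, offsets):
    seen = {}                                   # records seen so far, per partition
    for line in stream:
        part, record = line.rstrip('\n').split('\t', 1)
        seen[part] = seen.get(part, 0) + 1
        print('%d\t%s' % (offsets[part] + seen[part], record))

if __name__ == '__main__':
    number_lines(sys.stdin, load_offsets())
----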
@@ -1,27 +1,27 @@
== Sampling ==
* Random sample:
- ** fixed size of final sample
- ** fixed probability (binomial) for each element
- ** spatial sampling
- ** with/without replacement
- ** weighted
- ** by interestingness
- ** stratified: partition important features into bins, and sample tastefully to achieve a smooth selection across bins. Think of density of phase space
- ** consistent sample: the same sampling parameters on the same population will always return the same sample.
+ - fixed size of final sample
+ - fixed probability (binomial) for each element
+ - spatial sampling
+ - with/without replacement
+ - weighted
+ - by interestingness
+ - stratified: partition important features into bins, and sample tastefully to achieve a smooth selection across bins. Think of density of phase space
+ - consistent sample: the same sampling parameters on the same population will always return the same sample.
* Algorithms:
- ** iterative
- ** batch
- ** scan
- ** reservoir
+ - iterative
+ - batch
+ - scan
+ - reservoir
* graph:
- ** sample to preserve connectivity
- ** sample to preserve local structure
- ** sample to preserve global representation
+ - sample to preserve connectivity
+ - sample to preserve local structure
+ - sample to preserve global representation
* random variates
- ** http://en.wikipedia.org/wiki/Ziggurat_algorithm[Ziggurat Algorithm] to use a simple lookup table to accelerate generation of complex distributions
+ - http://en.wikipedia.org/wiki/Ziggurat_algorithm[Ziggurat Algorithm], which uses a simple lookup table to accelerate generation of complex distributions
We're not going to worry about extracting samples larger than fit on one reducer.
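The list above mentions reservoir sampling; as a point of reference, here is a minimal single-process sketch of the classic Algorithm R, which keeps a fixed-size uniform sample from a stream of unknown length (the sample size and input file are illustrative).

[source,python]
----
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)              # fill the reservoir first
        else:
            j = rng.randint(0, n)            # item n survives with probability k/(n+1)
            if j < k:
                sample[j] = item
    return sample

# e.g., ten random lines from a (hypothetical) log file:
# with open('events.log') as f:
#     print(reservoir_sample(f, 10))
----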
@@ -86,7 +86,6 @@ A [http://github.com/mrflip/wukong/blob/master/examples/sample_records.rb Ruby e
Script.new( Mapper, nil ).run
-References:
* See this http://blog.rapleaf.com/dev/?p=187[rapleaf blog post] for why randomness is considered harmful.
=== Random Sampling using strides ===
@@ -116,12 +115,9 @@ Each mapper outputs the sampling index of each preserved row as the key, and the
It's essential that you keep the sampling index given by the first pass.
-=== References ===
+=== Refs ===
* http://db.cs.berkeley.edu/papers/UCB-PhD-olken.pdf[Random Sampling from Databases], Frank Olken, 1993
-
-* containers:
- ** https://github.com/skade/rbtree[RBTree] for ruby
- ** https://github.com/rubyworks/pqueue[Priority Queue]
-
-* http://stackoverflow.com/a/2584770/41857[Stack Overflow: How to pick random (small) data samples using Map/Reduce? answer by Bkkbrad]
+* https://github.com/skade/rbtree[RBTree] for ruby
+* https://github.com/rubyworks/pqueue[Priority Queue]
+* http://stackoverflow.com/a/2584770/41857[Stack Overflow: How to pick random (small) data samples using Map/Reduce?] answer by Bkkbrad
@@ -1,12 +1,12 @@
=== Histograms and Distributions ===
-In the section on <<munging_truth_and_error,Inconsistent Truth and Error>>,
-we made the point that
-the tools and patterns of though for dealing with numerical error and uncertainty are
+In the section on <<munging_truth_and_error,Inconsistent Truth and Error>>, we made the point that the tools and patterns of thought for dealing with numerical error and uncertainty are
==== Distribution of temperatures ====
-===== Filter weather of interest
+Find how temperature is distributed
+
+===== Filter weather of interest =====
The chapter on geodata will show general techniques for doing spatial aggregates. For now, we'll just choose the best-match weather station for each stadium and use that. (I did use my ability to skip ahead in the book to pull out those weather stations of interest.)
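As a sketch of what the histogram job could look like with Hadoop Streaming (the tab-separated layout and the temperature column are assumptions, not the book's actual schema), the mapper bins each observation and a wordcount-style reducer sums the counts per bin:

[source,python]
----
#!/usr/bin/env python
# Hypothetical streaming mapper for the temperature histogram: emit a
# 5-degree bin for each observation; a sum-the-counts reducer finishes the job.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    try:
        temp = float(fields[3])            # assumed temperature column
    except (IndexError, ValueError):
        continue                           # skip malformed records
    bin_floor = int(temp // 5) * 5         # 5-degree-wide bins
    print('%d\t1' % bin_floor)
----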
@@ -76,9 +76,7 @@ TODO: show math
===== Map-side JOIN
-We can do even better, though. The `stadium_wstns` table is tiny; we can do a ((map-side join)) footnote:[aka fragment replicate or hash-map join].
-TODO: reference forward to pig chapter
-http://pig.apache.org/docs/r0.10.0/perf.html#replicated-joins:[map-side join]
+We can do even better, though. The `stadium_wstns` table is tiny; we can do a <<advanced_pig_map_side_join,map-side join>>.
NOTE: if the observations were stored sorted by weather station ID, you could even do a merge join. When we get to the geographic data chapter you'll see why we made a different choice.
@@ -23,8 +23,6 @@ it to spatial analysis in more dimensions -- see <<brain_example>>, which also e
=== Geographic Data Model ===
-
-
Geographic data shows up in the form of
* Points -- a pair of coordinates. When given as an ordered pair (a "Position"), always use `[longitude,latitude]` in that order, matching the familiar `X,Y` order for mathematical points. When it's a point with other metadata, it's a Place footnote:[in other works you'll see the term Point of Interest ("POI") for a place.], and the coordinates are named fields.
@@ -38,7 +36,7 @@ Geographic data shows up in the form of
[NOTE]
===============================
There's a slight muddying of the term "feature" -- to a geographer, a feature is a generic term for the _thing_ being described; later, in the chapter on machine learning, a feature
-Since I'm writing as a data scientist dabbling in geography (and because that chapter's hairy enough as it is), we'll just say "object" in place of "geographic feature"
+Since we're data scientists dabbling in geography, we'll just say "object" in place of "geographic feature" (and reserve the term "feature" for its machine learning sense).
===============================
Geospatial Information Science ("GIS") is a deep subject, treated here shallowly -- we're interested in models that have a geospatial context, not in precise modeling of geographic features themselves. Without apology we're going to use the good-enough WGS-84 earth model and a simplistic map projection. We'll again take the approach of using existing traditional tools on partitioned data, with Hadoop to reshape and orchestrate their output at large scale. footnote:[If you can't find a good way to scale a traditional GIS approach, algorithms from Computer Graphics are surprisingly relevant.]
@@ -1,13 +1,13 @@
=== Keep Exploring ===
-===== Balanced Quadtiles =====
+==== Balanced Quadtiles ====
Earlier, we described how quadtiles define a tree structure, where each branch of the tree divides the plane exactly in half and leaf nodes hold features. The multiscale scheme handles skewed distributions by developing each branch only to a certain depth. Splits are even, but the tree is lopsided (think of the many finer zoom levels you need for New York City than for Irkutsk).
K-D trees are another approach. The rough idea: rather than blindly splitting in half by area, split the plane to have each half hold the same-ish number of points. It's more complicated, but it leads to a balanced tree while still accommodating highly skewed distributions. Jacob Perkins (`@thedatachef`) has a http://thedatachef.blogspot.com/2012/10/k-d-tree-generation-with-apache-pig.html[great post about K-D trees] with further links.
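A toy sketch of that rough idea (not the implementation from that post): alternate the split axis and split at the median, so each half of the plane holds about the same number of points.

[source,python]
----
# Toy k-d partitioning sketch: alternate the split axis and split at the
# median, so each branch holds roughly half the points.
def kd_partition(points, depth=0, leaf_size=64):
    if len(points) <= leaf_size:
        return {'points': points}
    axis = depth % 2                          # 0 = longitude, 1 = latitude
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        'axis':  axis,
        'split': points[mid][axis],
        'left':  kd_partition(points[:mid], depth + 1, leaf_size),
        'right': kd_partition(points[mid:], depth + 1, leaf_size),
    }
----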
-===== It's not just for Geo =====
+==== It's not just for Geo ====
=== Exercises ===
@@ -33,7 +33,7 @@ Now we meet our first two XML-induced complexities: _splitting_ the file among m
FIXME: we used `crack` not plain text the whole way
-===== The law of small numbers
+==== The law of small numbers
The law of small numbers: given millions of things, your one-in-a-million occurrences become commonplace.
@@ -42,7 +42,7 @@ Needless to say, we found this out not while developing locally, but rather some
This crashing is a _good feature_ of our script: it wasn't clear that an empty article body is permissible.
-===== Custom Splitter / InputFormat =====
+==== Custom Splitter / InputFormat ====
At 40GB uncompressed, the articles file will occupy about 320 HDFS blocks (assuming 128MB blocks), each destined for its own mapper. However, the division points among blocks are arbitrary: one might fall in the middle of a word in the middle of a record, with no regard for your feelings about the matter. But if you do Hadoop the courtesy of pointing to the first point within a block where a split _should_ have occurred, it will handle the details of patching the fragment onto the trailing end of the preceding block. Pretty cool.
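This isn't Hadoop's InputFormat API, but the core of a custom splitter is small enough to sketch: given an arbitrary byte offset, advance to the first place a record is actually allowed to start (here, a hypothetical `<page>` open tag).

[source,python]
----
# Core idea of a custom splitter, sketched outside Hadoop's API: from an
# arbitrary split offset, find the first byte where a record really begins.
def first_record_start(f, split_start, start_tag=b'<page>', chunk=64 * 1024):
    f.seek(split_start)
    buf, offset = b'', split_start            # offset = file position of buf[0]
    while True:
        data = f.read(chunk)
        if not data:
            return None                       # no record begins after this offset
        buf += data
        pos = buf.find(start_tag)
        if pos != -1:
            return offset + pos               # the split point Hadoop should use
        if len(buf) > len(start_tag):
            offset += len(buf) - (len(start_tag) - 1)
            buf = buf[-(len(start_tag) - 1):] # keep a tail in case the tag straddles reads
----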
@@ -53,7 +53,7 @@ If you're
Writing an input format and splitter is only as hard as your input format makes it, but it's the kind of pesky detail that lies right at the "do it right" vs "do it (stupid/simpl)ly" decision point. Luckily there's a third option, which is to steal somebody else's code footnote:[see Hadoop the Definitive Guide, chapter FIXME: XX for details of building your own splitter]. Oliver Grisel (@ogrisel) has written a Wikipedia XML reader as a raw Java API reader in the http://mahout.apache.org/[Mahout project], and as a Pig loader in his https://github.com/ogrisel/pignlproc[pignlproc] project.
Mahout's XmlInputFormat (https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java[src])
-===== Brute Force =====
+==== Brute Force ====
If all you need to do is yank the data out of its ill-starred format, or if the data format's complexity demands the agility of a high-level language, you can use Hadoop Streaming as a brute-force solution. In this case, we'll still be reading the data as a stream of lines, and use native libraries to do the XML parsing. We only need to ensure that the splits are correct, and the `StreamXmlRecordReader` (http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html[doc] / https://github.com/apache/hadoop-common/blob/branch-0.21/mapreduce/src/contrib/streaming/src/java/org/apache/hadoop/streaming/StreamXmlRecordReader.java[source])
that ships with Hadoop is sufficient.
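In that spirit, a hedged sketch of the brute-force mapper (the output fields are illustrative choices): buffer the lines of each `<page>...</page>` element and hand the complete record to a real XML parser.

[source,python]
----
#!/usr/bin/env python
# Brute-force streaming mapper sketch: buffer each <page>...</page> element
# and let ElementTree do the parsing; emit title and a crude wikilink count.
import sys
import xml.etree.ElementTree as ET

buf = []
for line in sys.stdin:
    stripped = line.strip()
    if stripped.startswith('<page>') or buf:
        buf.append(line)
    if buf and stripped.endswith('</page>'):
        page = ET.fromstring(''.join(buf))
        buf = []
        title = page.findtext('title') or ''
        text  = page.findtext('revision/text') or ''
        print('%s\t%d' % (title, text.count('[[')))
----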
@@ -1,9 +1,9 @@
-=== Graph propogation
+=== Graph propagation ===
Since the connections among pages are robot-legible, links within topics can be read to imply a geolocation -- the movie "Dazed and Confused" (which took place in Austin) and the artist Janis Joplin (who got her start in Austin) can be identifiably paired with the loose geolocation of Austin, TX.
-=== Basic pagerank
+=== Basic pagerank ===
----
include::page_counts.pig[]
@@ -7,4 +7,3 @@ It makes for fascinating leisure reading, but to get stuff done the Chimpanzee W
So we're going to
* show you how to picture the transformation
-*
@@ -1,3 +1,4 @@
+=== Graph of Airline Flights ===
==== Airline Passenger Flow Network ====
@@ -13,7 +13,7 @@ They're also a great illustration of three key scalability patterns.
Once you have a clear picture of how these joins work,
you can be confident you understand the map/reduce paradigm deeply.
-[[map_side_join]]
+[[advanced_pig_map_side_join]]
=== Map-side Join ===
A map-side (aka 'fragment replicate') join
@@ -24,7 +24,7 @@ The Pig manual calls this a "fragment replicate" join, because that is how Pig t
Throughout the book, I'll refer to it as a map-side join, because that's how you should think about it when you're using it.
The other common name for it is a Hash join -- and if you want to think about what's going on inside it, that's the name you should use.
-===== How a Map-side (Hash) join works =====
+==== How a Map-side (Hash) join works ====
If you've been to enough large conferences you've seen at least one registration-day debacle. Everyone leaves their hotel to wait in a long line at the convention center, where they have set up different queues for some fine-grained partition of attendees by last name and conference track. Registration is a direct join of the set of attendees on the set of badges; those check-in debacles are basically the stuck reducer problem come to life.
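Sticking with the registration analogy, here is a minimal streaming-mapper sketch of the hash-join idea (the file name and field layouts are assumptions): load the small side into a dictionary once per mapper, then stream the big side past it with no reduce step at all.

[source,python]
----
#!/usr/bin/env python
# Hash-join sketch for a streaming mapper: the small ("replicated") side is
# loaded into memory; the large side streams through on stdin.
import sys

badges = {}
with open('attendee_badges.tsv') as f:        # hypothetical small table
    for line in f:
        attendee_id, badge = line.rstrip('\n').split('\t', 1)
        badges[attendee_id] = badge

for line in sys.stdin:                        # the large table
    attendee_id, rest = line.rstrip('\n').split('\t', 1)
    badge = badges.get(attendee_id)
    if badge is not None:                     # inner join: keep only matches
        print('\t'.join([attendee_id, badge, rest]))
----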
@@ -58,15 +58,15 @@ One map-side only pass through the data is enough to do the join.
See ((distribution of weather measurements)) for an example.
-===== Example: map-side join of wikipedia page metadata with wikipedia pageview stats =====
+==== Example: map-side join of wikipedia page metadata with wikipedia pageview stats ====
[[merge_join]]
=== Merge Join ===
-===== How a merge join works =====
+==== How a merge join works ====
(explanation)
@@ -78,17 +78,17 @@ You will also see better performance if the data in the left table is partitione
____________________________________________________________________
-=== Example: merge join of user graph with page rank iteration ===
+==== Example: merge join of user graph with page rank iteration ====
-==== Skew Join ====
+=== Skew Join ===
(explanation of when needed)
-===== How a skew join works =====
+==== How a skew join works ====
(explanation how)
-===== Example: ? counting triangles in wikipedia page graph ? OR ? Pageview counts ? =====
+==== Example: ? counting triangles in wikipedia page graph ? OR ? Pageview counts ? ====
TBD
@@ -170,5 +170,7 @@ TBD
=== Pig and JSON ===
TBD
-''''
+=== References ===
+
+* http://pig.apache.org/docs/r0.10.0/perf.html#replicated-joins[map-side join]
@@ -42,6 +42,8 @@ An HBase data mode is typically designed around multiple tables, each serving on
1. What query do you want to make that _must_ happen at milliseconds speed?
2. There are a set of related queries or batch jobs -- which would you like to be efficient?
+If you are using it primarily for batch processing,
+
1. What is the batch job you are most interested in simplifying?
2. There are a set of related queries or batch jobs -- which would you like to be efficient?
@@ -47,9 +47,7 @@ Exercises:
=== Mappers ===
-
-
-===== Hadoop Streaming (Wukong, MrJob, etc) =====
+==== Hadoop Streaming (Wukong, MrJob, etc) ====
If it's a Hadoop "streaming" job (Wukong, MrJob, etc), the child process is a Java jar that itself hosts your script file:
@@ -72,7 +70,7 @@ Once the maps start, it's normal for them to seemingly sit at 0% progress for a
- the mapper sortbuf data threshold
- the mapper sortbuf total threshold
-===== Speculative Execution ======
+==== Speculative Execution ====
For exploratory work, it's worth
@@ -75,7 +75,7 @@ Raw ingredients:
- oneline -- mapred -- identity -- identity
- oneline -- mapred -- identity -- swallow
-===== Variation =====
+==== Variation ====
* non-local map tasks
* EBS volumes
@@ -43,7 +43,7 @@ There are further benefits:
The USE metrics described below help you to identify the limiting resource of a job; to diagnose a faulty or misconfigured system; or to guide tuning and provisioning of the base system.
-===== Improve / Understand Job Performance =====
+==== Improve / Understand Job Performance ====
Hadoop is designed to drive max utilization for its _bounding resource_ by coordinating _manageable saturation_ of the resources in front of it.
@@ -57,9 +57,9 @@ The "bounding resource" is the fundamental limit on performance -- you can't pro
At each step of a job, what you'd like to see is very high utilization of exactly _one_ bounding resource from that list, with reasonable headroom and managed saturation for everything else. What's "reasonable"? As a rule of thumb, utilization above 70% in a non-bounding resource deserves a closer look.
-===== Diagnose Flaws =====
+==== Diagnose Flaws ====
-===== Balanced Configuration/Provisioning of base system =====
+==== Balanced Configuration/Provisioning of base system ====
=== Resource List ===
@@ -236,8 +236,7 @@ Exchanges:
If at all possible, use a remote monitoring framework like Ganglia, Zabbix or Nagios. However, http://sourceforge.net/projects/clusterssh[clusterssh] or http://code.google.com/p/csshx[its OSX port], along with the following commands, will help.
-
-===== Exercises =====
+=== Exercises ===
**Exercise 1**: start an intensive job (eg <remark>TODO: name one of the example jobs</remark>) that will saturate but not overload the cluster. Record all of the above metrics during each of the following lifecycle steps:
@@ -1,4 +1,4 @@
-== Colophon
+== A sort of colophon
The http://github.com/schacon/git-scribe[git-scribe toolchain] was very useful in creating this book. Instructions on how to install the tool and use it for things like editing this book, submitting errata and providing translations can be found at that site.
@@ -1,5 +1,5 @@
[[hadoop_cluster_howto]]
-== Acquiring a Hadoop Cluster
+=== Acquiring a Hadoop Cluster ===
This book picks up where the internet leaves off,
and this stuff changes so fast
