getting book generation on lock

commit 36720df86f4b69926a484198a20a149438b55f95 1 parent 756fe90
Philip (flip) Kromer (mrflip) authored
3  .gitignore
@@ -39,5 +39,4 @@ pkg
## PROJECT::SPECIFIC
-*.rdb
-book.html
+output
11 .gitscribe
@@ -0,0 +1,11 @@
+--- # -*- YAML -*-
+publish: false
+edition: 0.1
+language: en
+version: 1.0
+author: Philip (flip) Kromer
+cover: image/cover.jpg
+
+book_file: book.asciidoc
+verbose: true
+output_types: ['pdf']
29 Gemfile
@@ -0,0 +1,29 @@
+source 'http://rubygems.org'
+
+gem 'gorillib', :github => 'infochimps-labs/gorillib', :branch => 'version_1'
+gem 'wukong', :github => 'infochimps-labs/wukong', :branch => 'vanilla_2'
+
+gem 'git-scribe', :path => '../git-scribe'
+
+gem 'guard', ">= 1.0"
+gem 'guard-git-scribe'
+
+# SciRuby/sciruby
+
+# Gems you would use if hacking on this gem (rather than with it)
+group :support do
+ gem 'pry'
+ #
+ gem 'guard-yard'
+ gem 'yard', ">= 0.7"
+ gem 'redcarpet', ">= 2.1"
+end
+
+# Gems for testing and coverage
+group :test do
+ gem 'rspec', "~> 2.8"
+ gem 'guard-rspec', ">= 0.6"
+ if RUBY_PLATFORM.include?('darwin')
+ gem 'rb-fsevent', ">= 0.9"
+ end
+end
0  Guardfile
No changes.
0  Rakefile
No changes.
5 aa01-about.asciidoc
@@ -1,5 +1,7 @@
+[[about]]
== About ==
+[[about_coverage]]
=== What this book covers ===
'Big Data for Chimps' shows you how to solve hard problems using simple, fun, elegant tools.
@@ -23,6 +25,7 @@ Most of the chapters have exercises included. If you're a beginning user, I high
Feel free to hop around among chapters; the application chapters don't have large dependencies on earlier chapters.
+[[about_is_for]]
=== Who this book is for ===
You should be familiar with at least one programming language, but it doesn't have to be Ruby. Ruby is a very readable language, and the code samples provided should correspond cleanly to languages like Python or Scala.
@@ -33,6 +36,7 @@ All of the code in this book will run unmodified on your laptop computer and on
You should have an actual project in mind that requires a big data toolkit to solve -- a problem that requires scaling out across multiple machines. If you don't already have a project in mind but really want to learn about the big data toolkit, take a quick browse through the exercises. At least a few of them should have you jumping up and down with excitement to learn this stuff.
+[[about_is_not_for]]
=== Who this book is not for ===
This is not "Hadoop the Definitive Guide" (that's been written, and well); this is more like "Hadoop: a Highly Opinionated Guide". The only coverage of how to use the bare Hadoop API is to say "In most cases, don't". We recommend storing your data in one of several highly space-inefficient formats and in many other ways willingly trade a small performance hit for a large increase in programmer joy. The book has a relentless emphasis on writing *scalable* code, but no content on writing *performant* code beyond the advice that the best path to a 2x speedup is to launch twice as many machines.
@@ -41,6 +45,7 @@ That is because for almost everyone, the cost of the cluster is far less than th
The book does have some content on machine learning with Hadoop, on provisioning and deploying Hadoop, and on a few important settings. But it does not cover advanced algorithms, operations or tuning in any real depth.
+[[about_how_written]]
=== How this book is being written ===
I plan to push chapters to the publicly-viewable http://github.com/infochimps-labs/big_data_for_chimps['Hadoop for Chimps' git repo] as they are written, and to post them periodically to the http://blog.infochimps.com[Infochimps blog] after minor cleanup.
58 aa02-TODO.asciidoc
@@ -1,3 +1,4 @@
+[[TODO]]
== TODO / Errata ==
@@ -6,12 +7,14 @@
* The technical chapters and the "Mu" chapters will be interleaved later, but they're held separate for now.
+[[todo_tasks]]
=== Tasks for later ===
* Link-shorten links
* For large export images: http://zoom.it/pages/about/
-==== Questions ====
+[[todo_hey_editor]]
+==== Hey Editor ====
* Does a bulleted list have a dot at the end of the line?
* How do I slide a chapter title to be before `1` (eg the about this book)?
@@ -20,44 +23,46 @@
* Indexing hints yes or no?
-[[style_references]]
+[[style_notes]]
==== Style notes:
-[[style_markup]]
===== Markup
+----
+== Chapter
+=== Section
+==== Sub-section
+===== Passage
+----
+
Conventions:
-* monospaced: `+plus signs+` for +monospaced text+
-* `<literal>`: `\`backticks\`` for generic `literal text`
-* `<userinput>`: `*+star plus+*` for *+literal commands typed by user+* or `pass:[<userinput>foo</userinput>]`
-*
-* `<replaceable>`: `_++bar plus plus++_` for _+replaceable+_ or `pass:[<replaceable>foo</replaceable>]`
-* u inp+replbl: `_**++bar star star plus plus++**_` _+user input replaceable+_ or `pass:[<userinput><replaceable>foo</replaceable></userinput>]`
+| monospaced | `+plus signs+` | +monospaced text+
+| `<literal>` | `\`backticks\`` | generic `literal text`
+| `<userinput>` | `*+star plus+*` | *+literal commands typed by user+* or `pass:[<userinput>foo</userinput>]`
+|
+| `<replaceable>` | `_++bar plus plus++_` | _+replaceable+_ or `pass:[<replaceable>foo</replaceable>]`
+| user inp+replbl | `_**++bar star star plus plus++**_` | _+user input replaceable+_ or `pass:[<userinput><replaceable>foo</replaceable></userinput>]`
Text; the first word on line is the docbook tag:
-* `<emphasis>`: `_underscores_` for _emphasized text_
+| `<emphasis>` | `_underscores_` | _emphasized text_
** indicates _new terms_, _URLs_, _email addresses_, _filenames_, and _file extensions_
- ** for consistency don't use `'single quotes'` for _emphasis_ except to indicate a filename or path:
-* `<filename>`: [role="filename"]'/path/to/file.ext'
-* `<book title>`: [role="citetitle"]'Hadoop: The Definitive Guide'
-* bold text: `*asterisks*` for *bold text* should _not_ be used for emphasis -- use underscores.
-* `<Math>`
+ ** for consistency don't use `'single quotes'` for _emphasis_ except to indicate a filename or path
+| `<filename>` | [role="filename"]'/path/to/file.ext'
+| `<book title>` | [role="citetitle"]'Hadoop: The Definitive Guide'
+| bold text | `*asterisks*` for *bold text* should _not_ be used for emphasis -- use underscores.
+| `<Math>`
** Subscripts and superscripts work like so: H~2~O and 2^5^ = 32 (but if you're doing math, you'd probably want to italicize the variables, like so: _x_^2^ + _y_^2^ = _z_^2^). For more on math see !!ORMTODO!!.
-* `<References>`
- ** Reference chapters with angle brackets `<<style_notes>>`: <<style_notes>>
-
-* Use a title (`.I am a Title!`) for figures, tables, examples, and sidebars, but not otherwise.
+| `<xref>` | `<<double angles>>` | <<style_notes>>
-* footnotes
- ** `footnote:[This is a standard footnote.]` Put them belly-up against text and following the punctuation -- `Hello.footnote:[hi!]`
- ** (note: this effs with emacs' markup, so during the early phases I will instead _always_ use a space; will strip them later)
+| Use a title (`.I am a Title!`) for figures, tables, examples, and sidebars, but not otherwise.
+| footnotes | `footnote:[This is a standard footnote.]`
+Put them belly-up against text and following the punctuation -- `Hello.footnote:[hi!]`
+(note: this effs with emacs' markup, so during the early phases I will instead _always_ use a space; will strip them later)
-
-[[style_code]]
-===== Code
+===== Blocks
Inline code block. You can use `include::code/HelloWorld.rb[]` to pull it in from a separate file.
@@ -88,9 +93,6 @@ EOF
----
====
-
-
-
.A Sidebar
****
This is a sidebar!!
0  ba05-data_classes.asciidoc
No changes.
71 ba06-semi_structured_data-a.asciidoc
@@ -1,74 +1,3 @@
[[semi_structured_data]]
== Semi-Structured Data ==
-To explore data, you need data, in a form you can use, and that means engaging in the necassary evils of data munging: code to turn the data you have into the data you'd like to use. Whether it's public or commercial data ((data_commons)), data from a legacy database, or managing unruly human-entered data, you will need to effectively
-
-Your goal is to take data from the _source domain_ (the shape, structure and cognitive model of the raw data) to the _target domain_ (the shape, structure and cognitive model you will work with) in a process that is
-
-// prepare? munge?
-
-* Describe target model schema
- // ie assemble Gorillib model
-
-* Extract: the _syntactic_ transformation from raw blobs to passive structured records
-* Transform: the _semantic_ transformation of structured data in the source domain to active models in the target domain
-* Normalize
- ** clean up errors and missing fields
- ** augment with data drom other tables
-* Canonicalization:
- ** choose exemplars and mountweazels
- ** pull summary statistics
- ** sort by the most likely join key
- ** pull out your best guess as to a subuniverse (later)
-* Land
- ** the data in its long-term home(s)
-
-(later):
-
-* fix your idea of subuniverse
-
-DL TODO: (fix style (document pig style); align terminology with above `wikipedia/pageviews_extract_a.rb`
-
-
-=== Extract, Translate, Canonicalize ===
-
-image::images/semistructured_data_workflow.png[Semistructured Data Workflow]
-
-* Raw data, as it comes from the source. This could be unstructured (lines of text,
-* Source domain models:
-* Target domain models:
-* Transformer:
-
-You'll be tempted to move code too far to the right -- to put transformation code into
-Resist the urge. At the beginning of the project you're thinking about the details of extracting the data, and possibly still puzzling out the shape of the target domain models.
-Make the source domain models match the raw data as closely as reasonable, doing only the minimum effective work to make the data active.
-
-Separate the model's _properties_ (fundamental intrinsic data), its _metadata_ (details of processing), and its _derived variables_ (derived values).
-
-In an enterprise context, this process is "ETL" -- extraction, transformation and loading. In data at scale, rather than centralizing access in a single data store, you'll more often syndicate data along a documented provenance chain. So we'll change this to "Extract, Transform and Land".
-
-=== Solitary, poor, nasty, brutish, and short
-
-Thomas Hobbes wrote that the natural life of man is "solitary, poor, nasty, brutish, and short".
-and so should your data munging scripts be:
-
-* Solitary: write discrete, single concern scripts. It's much easier to validate the data when your transformations are simple to reason about.
-* Poor: spend wisely. It's especially important to optimize for programmer time, not cluster efficency: you will probably spend more in opportunity cost to write the code than you ever will running it. Even if it will be a production ETL footnote:[ETL = Extraction, Transformation and Loading -- what you call data munging when it runs often and well enough to deserve an enterprise-y name] job, get it running well and then make it run quickly. The hard parts are not where you expect them to be.
-* Nasty: as you'll see, the real work in data munging are the special cases and domain incompatibilities. Ugly transformations will lead to necessarily ugly code, but even still there are ways to handle it. There's also a certain type of brittleness that is *good* -- a script that quietly introduces corrupt data
-* Brutish: be effective, not elegant. Fight the ever-strong temptation to automate the in-automatable. Here are some street-fighting techniques we'll use below:
- ** Hand-curating a 200-row join table is the kind of thing you learn computer programing to avoid, but it's often the smart move.
- ** Faced with a few bad records that don't fit any particular pattern, you can write a script to recognize and repair them; or, if it's of modern size, just hand-fix them and save a diff. You've documented the change concisely and given future-you a chance to just re-apply the patch.
- ** To be fully-compliant, most XML libraries introduce a necessary complexity that complexifies *up* the work to deal with simple XML structures. It's often more sensible to handle simple as plain text.
-* Short: use a high-level language. You'll invest much more in using the data than in writing the munging code. Two weeks later, when you discover that 2% of the documents used an alternate text encoding (or whatever card Loki had dealt) you'll be glad for brief readable scripts and a rich utility ecosystem.
-
-FIXME: the analogy will line up better with the source if I can make the point that 'your data munging scripts are to civilize data from the state of nature'.
-FIXME: the stuff above I like; need to chop out some of the over-purplish stuff from the rest of the chapter so the technical parts don't feel too ranty.
-
-=== Canonical Data ===
-
-* **sort** the data along its most-likely join field (sometimes but not always its primary key). This often enables a <<merge_join,merge join>>, with tremendous speedup.
-
-* **choose exemplars or add mountweazels**. Choose a few familiar records, and put their full contents close at hand to use for testing and provenance. You may also wish to add a ((mountweazel)), a purposefully-bogus record. Why? First, so you have something that fully exercises the data pipeline -- all the fields are present, text holds non-keyboard and escape characters, and so forth. Second, for production smoke testing: it's a record you can send through the full data pipeline, including writing into the database, without concern that you will clobber an actual value. Lastly, since exemplars are real records, they may change over time; you can hold your mountweazel's fields constant. Make sure it's unmissably bogus and unlikely to collide: "John Doe" is a _terrible_ name for a mountweazel, if there's any way user-contributed data could find its way into your database. The best-chosen bogus names appeal to the worst parts of a 5th-grader's sense of humor.
-
-* **sample** a coherent subuniverse. Ensure this includes, or go back and add, your exemplar records.
-
72 ba06-semi_structured_data-c-patterns.asciidoc
@@ -0,0 +1,72 @@
+
+To explore data, you need data, in a form you can use, and that means engaging in the necessary evils of data munging: code to turn the data you have into the data you'd like to use. Whether it's public or commercial data ((data_commons)), data from a legacy database, or unruly human-entered data, you will need to munge it effectively.
+
+Your goal is to take data from the _source domain_ (the shape, structure and cognitive model of the raw data) to the _target domain_ (the shape, structure and cognitive model you will work with), in a process that runs roughly as follows:
+
+// prepare? munge?
+
+* Describe target model schema
+ // ie assemble Gorillib model
+
+* Extract: the _syntactic_ transformation from raw blobs to passive structured records (see the sketch after this list)
+* Transform: the _semantic_ transformation of structured data in the source domain to active models in the target domain
+* Normalize
+ ** clean up errors and missing fields
+ ** augment with data from other tables
+* Canonicalization:
+ ** choose exemplars and mountweazels
+ ** pull summary statistics
+ ** sort by the most likely join key
+ ** pull out your best guess as to a subuniverse (later)
+* Land
+ ** the data in its long-term home(s)
+
+(later):
+
+* fix your idea of subuniverse
+
+DL TODO: fix style (document pig style); align terminology in `wikipedia/pageviews_extract_a.rb` with the above.
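+
+To make the Extract step above concrete, here is a minimal sketch of a purely _syntactic_ extraction -- it splits a raw line into named fields and nothing more. The field layout and names are invented for illustration; this is not the actual `wikipedia/pageviews_extract_a.rb`.
+
+[source,ruby]
+----
+# Sketch only: a syntactic extract pass over a hypothetical field layout.
+RawPageview = Struct.new(:project, :page_title, :visits, :bytes_sent)
+
+def extract_pageview(raw_line)
+  # Purely syntactic: split the blob into named fields and coerce the obvious
+  # types. Every semantic decision waits for the transform step.
+  project, page_title, visits, bytes_sent = raw_line.chomp.split(/\s+/, 4)
+  RawPageview.new(project, page_title, visits.to_i, bytes_sent.to_i)
+end
+
+extract_pageview("en Chimpanzee 907 13311903")
+# => #<struct RawPageview project="en", page_title="Chimpanzee", visits=907, bytes_sent=13311903>
+----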
+
+
+=== Extract, Translate, Canonicalize ===
+
+image::images/semistructured_data_workflow.png[Semistructured Data Workflow]
+
+* Raw data, as it comes from the source. This could be unstructured (lines of text, for example).
+* Source domain models:
+* Target domain models:
+* Transformer:
+
+You'll be tempted to move transformation code too far to the right in this workflow. Resist the urge. At the beginning of the project you're still thinking about the details of extracting the data, and possibly still puzzling out the shape of the target domain models.
+Make the source domain models match the raw data as closely as is reasonable, doing only the minimum effective work to make the data active.
+
+Separate the model's _properties_ (fundamental intrinsic data), its _metadata_ (details of processing), and its _derived variables_ (derived values).
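+
+As a sketch of that separation (plain Ruby here for brevity; the book's models use Gorillib, but the grouping is the same idea), keep the three kinds of attributes visibly apart. The field names below are invented for illustration:
+
+[source,ruby]
+----
+class PageviewRecord
+  # properties: fundamental, intrinsic facts about the record
+  attr_accessor :page_title, :visits, :bytes_sent
+
+  # metadata: details of processing, not facts about the subject
+  attr_accessor :source_file, :extracted_at
+
+  # derived variables: computed from properties, never set by hand
+  def bytes_per_visit
+    return nil if visits.to_i.zero?
+    bytes_sent.to_f / visits
+  end
+end
+----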
+
+In an enterprise context, this process is "ETL" -- extraction, transformation and loading. In data at scale, rather than centralizing access in a single data store, you'll more often syndicate data along a documented provenance chain. So we'll change this to "Extract, Transform and Land".
+
+=== Solitary, poor, nasty, brutish, and short
+
+Thomas Hobbes wrote that the natural life of man is "solitary, poor, nasty, brutish, and short" -- and so should your data munging scripts be:
+
+* Solitary: write discrete, single-concern scripts. It's much easier to validate the data when your transformations are simple to reason about.
+* Poor: spend wisely. It's especially important to optimize for programmer time, not cluster efficiency: you will probably spend more in opportunity cost to write the code than you ever will running it. Even if it will be a production ETL footnote:[ETL = Extraction, Transformation and Loading -- what you call data munging when it runs often and well enough to deserve an enterprise-y name] job, get it running well and then make it run quickly. The hard parts are not where you expect them to be.
+* Nasty: as you'll see, the real work in data munging is the special cases and domain incompatibilities. Ugly transformations will lead to necessarily ugly code, but even still there are ways to handle it. There's also a certain type of brittleness that is *good*: better a script that fails loudly than one that quietly introduces corrupt data.
+* Brutish: be effective, not elegant. Fight the ever-strong temptation to automate the in-automatable. Here are some street-fighting techniques we'll use below:
+ ** Hand-curating a 200-row join table is the kind of thing you learn computer programming to avoid, but it's often the smart move.
+ ** Faced with a few bad records that don't fit any particular pattern, you can write a script to recognize and repair them; or, if the problem is of modest size, just hand-fix them and save a diff. You've documented the change concisely and given future-you a chance to just re-apply the patch.
+ ** To be fully compliant, most XML libraries introduce complexity that drives *up* the work of handling even simple XML structures. It's often more sensible to treat simple XML as plain text.
+* Short: use a high-level language. You'll invest much more in using the data than in writing the munging code. Two weeks later, when you discover that 2% of the documents used an alternate text encoding (or whatever card Loki had dealt) you'll be glad for brief readable scripts and a rich utility ecosystem.
+
+FIXME: the analogy will line up better with the source if I can make the point that 'your data munging scripts are to civilize data from the state of nature'.
+FIXME: the stuff above I like; need to chop out some of the over-purplish stuff from the rest of the chapter so the technical parts don't feel too ranty.
+
+=== Canonical Data ===
+
+* **sort** the data along its most-likely join field (sometimes but not always its primary key). This often enables a <<merge_join,merge join>>, with tremendous speedup.
+
+* **choose exemplars or add mountweazels**. Choose a few familiar records, and put their full contents close at hand to use for testing and provenance. You may also wish to add a ((mountweazel)), a purposefully-bogus record (a sketch follows this list). Why? First, so you have something that fully exercises the data pipeline -- all the fields are present, text holds non-keyboard and escape characters, and so forth. Second, for production smoke testing: it's a record you can send through the full data pipeline, including writing into the database, without concern that you will clobber an actual value. Lastly, since exemplars are real records, they may change over time; you can hold your mountweazel's fields constant. Make sure it's unmissably bogus and unlikely to collide: "John Doe" is a _terrible_ name for a mountweazel, if there's any way user-contributed data could find its way into your database. The best-chosen bogus names appeal to the worst parts of a 5th-grader's sense of humor.
+
+* **sample** a coherent subuniverse. Ensure this includes, or go back and add, your exemplar records.
+
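+For the mountweazel, here is a sketch of what such a record might look like. Every name and value below is invented for illustration; the point is that it is unmissably fake, collision-proof, and exercises every field:
+
+[source,ruby]
+----
+# A deliberately-bogus record: obviously fake, unlikely to collide with real
+# data, and exercising awkward content (escapes, non-keyboard characters).
+# All field names and values are invented for illustration.
+MOUNTWEAZEL = {
+  user_id:     -1,
+  screen_name: "Sir_Reginald_Fartsworth_III",
+  bio:         %Q{Has "quotes", a\ttab, a backslash \\, and a snowman: \u{2603}},
+  followers:   0,
+  created_at:  Time.utc(1900, 1, 1),
+}
+----
+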
12 book.asciidoc
@@ -2,10 +2,10 @@ include::aa01-about.asciidoc[]
include::ba01-chimpanzee_and_elephant.asciidoc[]
-include::ba05-overview_of_datasets.asciidoc[]
-
include::ba06-semi_structured_data-a.asciidoc[]
+include::ba06-semi_structured_data-c-patterns.asciidoc[]
+
include::ba06-semi_structured_data-b-wikipedia_other.asciidoc[]
include::ba06-semi_structured_data-c-wikipedia_corpus.asciidoc[]
@@ -20,6 +20,8 @@ include::ba06-semi_structured_data-z-other_strategies.asciidoc[]
include::ba04-toolset.asciidoc[]
+include::ba02-simple_stream.asciidoc[]
+
include::fu01-sampling.asciidoc[]
include::fu02-statistics.asciidoc[]
@@ -34,9 +36,9 @@ include::fu10-tuning.asciidoc[]
include::fu06-processing_graphs.asciidoc[]
-include::fu00-overview_of_problems.asciidoc[]
+include::ba05-overview_of_datasets.asciidoc[]
-include::ba02-simple_stream.asciidoc[]
+include::fu00-overview_of_problems.asciidoc[]
include::ba03-herding_cats.asciidoc[]
@@ -76,4 +78,4 @@ include::aa02-TODO.asciidoc[]
include::README.asciidoc[]
-include::tr01-authors.asciidoc[]
+include::tr01-authors.asciidoc[]
6 tr01-authors.asciidoc
@@ -1,8 +1,8 @@
-== Authors ==
+== Book Metadata ==
-=== Philip (flip) Kromer ===
+=== Author ===
-I am founder and CTO at Infochimps.com, a big data platform that makes acquiring, storing and analyzing massive data streams transformatively easier. I enjoy Bowling, Scrabble, working on old cars or new wood, and rooting for the Red Sox.
+Philip (flip) Kromer is the founder and CTO at Infochimps.com, a big data platform that makes acquiring, storing and analyzing massive data streams transformatively easier. He enjoys bowling, Scrabble, working on old cars or new wood, and rooting for the Red Sox.
Graduate School, Dept. of Physics - University of Texas at Austin, 2001-2007
Bachelor of Arts, Computer Science - Cornell University, Ithaca NY, 1992-1996
13 tr02-LICENSE.asciidoc
@@ -0,0 +1,13 @@
+=== License ===
+
+TODO: actual license stuff
+
+Text and assets are released under CC-BY-NC-SA (Creative Commons Attribution, Non-commercial, derivatives encouraged but Share Alike)
+
+____
+This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
+____
+
+
+Code is Apache licensed unless specifically labeled otherwise.
+
4 tr03-colophon.asciidoc
@@ -0,0 +1,4 @@
+==== Colophon
+
+
+The http://github.com/schacon/git-scribe[git-scribe toolchain] was very useful in creating this book. Instructions on how to install the tool and use it for things like editing this book, submitting errata and providing translations can be found at that site.