Big Data Ecosystem

commit e24b866763463abb8b720519e401686e0f77599c (1 parent: cdf6eb7)
Philip (flip) Kromer (mrflip) authored
37 00a-intro-more_outlines.asciidoc
@@ -20,31 +20,6 @@ Introduce the chapter to the reader
* like a "map" of the book
* "First part is about blah, next is about blah, ..."
-==== Intro: Storm+Trident Fundamentals
-
-At this point, you have good familiarity with Hadoop’s batch processing power and the powerful inquiries it unlocks beyond, and as a counterpart to, the traditional database approach. Stream analytics is a third mode of data analysis and, it is becoming clear, one that is just as essential and transformative as massive-scale batch processing has been.
-
-Storm is an open-source framework developed at Twitter that provides scalable stream processing. Trident draws on Storm’s powerful transport mechanism to provide _exactly once_ processing of records in _windowed batches_ for aggregating and persisting
-to an external data store.
-
-The central challenge in building a system that can perform fallible operations on billions of records reliably is how to do so without yourself producing so much bookkeeping that it becomes its own scalable stream processing challenge. Storm handles all details of reliable transport and efficient routing for you, leaving you with only the business process at hand. (The remarkably elegant way Storm handles that bookkeeping challenge is one of its principal breakthroughs; you’ll learn about it in the later chapter on Storm Internals.)
-
-This takes Storm past the mere processing of records to stream analytics -- with some limitations and some advantages, you have the same ability to specify locality and write arbitrarily powerful general-purpose code to handle every record. A lot of Storm+Trident’s adoption is in application to real-time systems. footnote:[For reasons you’ll learn in the Storm internals chapter, it’s not suitable for ultra-low-latency (below, say, a few milliseconds), Wall Street-type applications, but if latencies above that are real-time enough for you, Storm+Trident shines.]
-
-But, just as importantly, the framework exhibits radical _tolerance_ of latency. It’s perfectly reasonable to, for every record, perform reads of a legacy data store, call an internet API and the like, even if those might have hundreds or thousands of milliseconds worst-case latency. That range of timescales is simply impractical within a batch processing run or database query. In the later chapter on the Lambda Architecture, you’ll learn how to use stream and batch analytics together for latencies that span from milliseconds to years.
-
-As an example, one of the largest hard drive manufacturers in the world ingests sensor data from its manufacturing line, test data from quality assurance processes, reports from customer support and post-mortem analyses of returned devices. They have been able to mine the accumulated millisecond-scale sensor data for patterns that predict flaws months and years later. Hadoop produces the “slow, deep” results, uncovering the patterns that predict failure. Storm+Trident produces the fast, relevant results: operational alerts when those anomalies are observed.
-
-Things you should take away from this chapter:
-
-* Understand the types of problems you can solve using stream processing, and apply them to real examples using best-in-class stream analytics frameworks.
-* Acquire the practicalities of authoring, launching and validating a Storm+Trident flow.
-* Understand Trident’s operators and how to use them: `each`, and how to apply `CombinerAggregator`s, `ReducerAggregator`s and `AccumulatingAggregator`s (generic aggregator?)
-* Persist records or aggregations directly to a backing database, or to Kafka for idempotent downstream storage.
-* (Probably not going to discuss how to do a streaming join, using either DRPC or a hashmap join.)
-
-NOTE: This chapter will only speak of Storm+Trident at the high level and from the outside. We won’t spend any time on how it makes all of this work until the chapter on Storm+Trident internals (TODO: ref).
-
==== Skeleton: Storm+Trident Internals
What should you take away from this chapter:
@@ -98,7 +73,7 @@ Lifecycle of a File:
* An interesting opportunity happens when the sequence order of records corresponds to one of your horizon keys.
* Explain using the example of weblogs, highlighting strict order and partial order.
* In the frequent case, the sequence order only somewhat corresponds to one of the horizon keys. There are several types of somewhat ordered streams: block disorder, bounded band disorder, band disorder. When those conditions hold, you can use windows to recover the power you have with ordered streams -- often, without having to order the stream.
-* Unbounded band disorder only allows “”convergent truth aggregators. If you have no idea when or whether that some additional record from a horizon group might show up, then you can’t treat your aggregation as anything but a best possible guess at the truth.
+* Unbounded band disorder only allows "convergent truth" aggregators. If you have no idea when or whether some additional record from a horizon group might show up, you can’t treat your aggregation as anything but a best possible guess at the truth.
* However, what limited disorder does get you is the ability to efficiently cache aggregations from a practically infinite backing data store.
* With bounded band or block disorder, you can perform accumulator-style aggregations.
* How to, with the abstraction of an infinite sorting buffer or an infinite binning buffer, efficiently re-present the stream as one where sequence order and horizon key directly correspond.
@@ -130,16 +105,6 @@ Continuous horizon: getting 1-d locality
==== Skeleton: Statistics
-Data is worthless. Actually, it's worse than worthless. It costs you money to gather, store, manage, replicate and analyze. What you really want is insight -- a relevant summary of the essential patterns in that data -- produced using relationships to analyze data in context.
-
-Statistical summaries are the purest form of this activity, and will be used repeatedly in the book to come, so now that you see how Hadoop is used it's a good place to focus.
-
-Some statistical measures let you summarize the whole from summaries of the parts: I can count all the votes in the state by summing the votes from each county, and the votes in each county by summing the votes at each polling station. Those types of aggregations -- average/standard deviation, correlation, and so forth -- are naturally scalable, but just having billions of objects introduces some practical problems you need to avoid. We'll also use them to introduce Pig, a high-level language for SQL-like queries on large datasets.
-
-Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That's especially unfortunate because so much data at large scale has a long-tail, not normal (Gaussian), distribution -- the median is a far more robust indicator of the "typical" value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.)
-
-But you don't always need an exact value -- you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to:
-
* Holistic vs algebraic aggregations
* Underflow and the "Law of Huge Numbers"
* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
2  05d-intro_to_pig.asciidoc
@@ -9,8 +9,6 @@ end
Reading it as prose, the script says "for each article: break it into a list of words; group all occurrences of each word and count them; then output the article id, word and count."
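
As a rough guide, a minimal Pig sketch of the flow just described might look like the following (the file path, relation names and field names are assumptions chosen for illustration, not necessarily the book's actual script):

----
articles = LOAD 'wikipedia_articles.tsv' AS (article_id:chararray, text:chararray);
-- break each article into one record per word
words    = FOREACH articles GENERATE article_id, FLATTEN(TOKENIZE(text)) AS word;
-- group all occurrences of each word within each article, then count them
counts   = FOREACH (GROUP words BY (article_id, word)) GENERATE
             group.article_id AS article_id, group.word AS word, COUNT(words) AS ct;
STORE counts INTO 'article_word_counts';
----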
-
-
.Snippet from the Wikipedia article on "Barbecue"
[quote, wikipedia, http://en.wikipedia.org/wiki/Barbeque]
____
311 06a-big_data_ecosystem.asciidoc
@@ -0,0 +1,311 @@
+=== Big Data Ecosystem and Toolset
+
+Big data is necessarily a polyglot sport. The extreme technical challenges demand diverse technological solutions and the relative youth of this field means, unfortunately, largely incompatible languages, formats, nomenclature and transport mechanisms. What’s more, every ecosystem niche has multiple open source and commercial contenders vying for prominence and it is difficult to know which are widely used, which are being adopted and even which of them work at all.
+
+Fixing a map of this ecosystem to print would be nearly foolish; predictions of success or failure will prove wrong, the companies and projects whose efforts you omit or downplay will inscribe your name in their “Enemies” list, the correct choice can be deeply use-case specific and any list will become out of date the minute it is committed to print. Your authors, fools both, feel you are better off with a set of wrong-headed, impolitic, oblivious and obsolete recommendations based on what has worked for us and what we have seen work for other people.
+
+==== Basic Toolset
+
+We think you are doing it wrong if you are not using:
+ - for most general-purpose jobs, a high-level data orchestration tool such as Pig or Hive
+ - for the dirty stuff (e.g. parsing, contacting external APIs and so forth), a high-level scripting language such as Ruby or Python
+ - for deeper exploration of your data once it's been cut down to size, a statistical analysis tool -- R, Pandas or Julia
+ - additionally: toolkits such as Mahout, scikit-learn or Kiji for machine learning, and NLTK, the Stanford NLP tools or OpenNLP for text processing
+
+Infrastructure:
+ - Batch analytics: Hadoop, from either the Cloudera or Hortonworks distributions (or, for the moneyed, MapR)
+ - Streaming Analytics: Storm+Trident, although Spark Streaming shows promise.
+ - One or more scalable datastores -- for storing billions of objects, HBase, Accumulo or Cassandra; and for storing millions of results, Elasticsearch or MongoDB
+ - Optionally, an in-memory OLAP tool such as Impala, Druid or Redshift.
+
+==== Core Platform: Batch Processing
+
+Hadoop is the one easy choice to make; Doug Cutting calls it the “kernel of the big data operating system” and we agree. It can’t be just that easy, though; you further have to decide which distribution to adopt. Almost everyone should choose either Cloudera’s distribution (CDH) or Hortonworks' (HDP). Both come in a complete, well-tested open source version as well as a commercial version backed by enterprise features and vendor support. Both employ significant numbers of core contributors and have unshakable expertise and open-source credibility.
+
+Cloudera started in 2009 with an all-star list of founders including Doug Cutting, the originator of the Hadoop project. It was the first to offer a packaged distribution and the first to offer commercial support; its offerings soon came to dominate the ecosystem.
+
+Hortonworks was founded two years later by Eric Baldeschwieler (aka Eric 14), who brought the project into Yahoo! and fostered its essential early growth, and a no-less-impressive set of core contributors. It has rapidly matured a first-class offering with its own key advantages.
+
+Cloudera was the first company to commercialize Hadoop; its distribution is, by far, the most widely adopted and, if you don’t feel like thinking, it’s the easy choice. The company is increasingly focused on large-scale enterprise customers and its feature velocity is increasingly devoted to its commercial-only components.
+
+Hortonworks’ offering is 100-percent open source, which will appeal to those uninterested in a commercial solution or who abhor the notion of vendor lock-in. More impressive to us has been Hortonworks’ success in establishing beneficial ecosystem partnerships. The most important of these partnerships is with Microsoft and, although we do not have direct experience, any Microsoft shop should consider Hortonworks first.
+
+There are other, smaller distributions, from IBM, VMware and others, which are only really interesting if you already use products from IBM, VMware or one of those others.
+
+The core project has a distribution of its own, but apart from people interested in core development, you are better off with one of the packaged distributions.
+
+The most important alternative to Hadoop is MapR, a C++-based rewrite that is 100-percent API-compatible with Hadoop. It is a closed-source commercial product for high-end Enterprise customers and has a free version with all essential features for smaller installations. Most compellingly, its HDFS-compatible file system also presents an NFS interface and so can be mounted as a native file system. See the section below (TODO: REF) on why this is such a big deal. MapR is faster, more powerful and much more expensive than the open source version, which is pretty much everything you need to know to decide if it is the right choice for you.
+
+There are two last alternatives worthy of note. Both discard compatibility with the Hadoop code base entirely, freeing them from any legacy concerns.
+
+Spark is, in some sense, an encapsulation of the iterative development process espoused in this book: prepare a sub-universe, author small self-contained scripts that checkpoint frequently, and periodically reestablish a beachhead by running against the full input dataset. The outputs of Spark’s Scala-based domain-specific statements are managed intelligently in memory and persisted to disk when directed. This eliminates, in effect, the often-unnecessary cost of writing out data from the Reducer tasks and reading it back in again to Mapper tasks. That’s just one example of the many ways in which Spark is able to impressively optimize development of Map/Reduce jobs.
+
+Disco is an extremely lightweight Python-based implementation of Map/Reduce that originated at Nokia. footnote:[If these analogies help, you could consider it the Leica to Hadoop’s Canon, or the nginx to its Apache.] Its advantage and disadvantage is that it is an essentials-only realization of what Hadoop provides, with a code base a small fraction of the size.
+
+We do not see either of them displacing Hadoop but since both are perfectly happy to run on top of any standard HDFS, they are reasonable tools to add to your repertoire.
+
+
+====== Sidebar: Which Hadoop Version?
+
+At the time of writing, Hadoop is at a crossroads between versions with fundamental differences in architecture and interface. Hadoop is, beyond a doubt, one of the greatest open source software success stories. Its code base has received contributions from thousands of committers, operators and users.
+
+But, as Hadoop entered Enterprise-adoption adulthood from its Web-2.0 adolescence, the core team determined that enough early decisions -- in naming, in architecture, in interface -- needed to be remade to justify a rewrite. The project was split into multiple pieces: principally, Map/Reduce (processing data), HDFS (storing data) and core essential code shared by each.
+
+Under the hood, the 2.0 branch still provides the legacy architecture of Job Tracker/Namenode/SecondaryNamenode masters, but the way forward is a new componentized and pluggable architecture. The most significant flaw in the 1.0 branch -- the terrifying lack of Namenode redundancy -- has been addressed by a ZooKeeper-based “high availability” implementation. (TODO: Describe YARN, Distributed Job Tracker and so forth). Finally, the names and meanings of Hadoop's hundreds of configuration variables have been rethought; you can find a distressingly-long Rosetta Stone from old to new at (TODO: add link).
+
+Even more important are the changes to interface. The HDFS is largely backward-compatible; you probably only need to recompile. The 2.0 branch offers an “MR1” toolkit -- backward compatible with the legacy API -- and the next-generation “MR2” toolkit that takes better advantage of 2.0’s new architecture. Programs written for “MR1” will not run on “MR2” and vice versa.
+
+So, which should you choose? The way forward is clearly with the 2.0 architecture. If you are just ramping up on Hadoop, use the componentized YARN-based systems from the start. If you have an existing legacy installation, plan to upgrade at a deliberate pace -- informed exactly by whether you are more terrified of having a Namenode fail or of being an early-ish adopter. For the Map/Reduce toolkit, our best advice is to use the approach described in this book: Do not use the low-level API. Pig, Hive, Wukong and most other high-level toolkits are fully compatible with either. Cascading is not yet compatible with “MR2” but likely will be if the market moves that way.
+
+The 2.0 branch has cleaner code, some feature advantages and the primary attention of the core team. However, the “MR1” toolkit has so much ecosystem support (documentation, applications, lessons learned from wide-scale deployment) that it continues to be our choice for production use. Note that you can have your redundant Namenode without having to adopt the new-fangled API. Adoption of “MR2” is highly likely (though not certain); if you are just starting out, adopting it from the start is probably a sound decision. If you have a legacy investment in “MR1” code, wait until you start seeing blog posts from large-scale deployers titled “We Spent Millions Upgrading To MR2 And Boy Are We Ever Happy We Did So.”
+
+The biggest pressure to move forward will be Impala, which requires the “MR2” framework. If you plan to invest in Impala heavily, it is probably best to uniformly adopt “MR2.”
+
+==== Core Platform: Streaming Data Processing
+
+While the batch processing landscape has largely settled around Hadoop, there are many more streaming data processing technologies vying for mind share. Roughly speaking, we see three types of solutions: Complex Event Processing (CEP) systems that grew out of high-frequency trading and security applications; Streaming Transport systems, which grew out of the need to centralize server logs at extremely high throughput; and Streaming Analytics systems, developed to perform sophisticated analysis of high-rate activity streams.
+
+The principal focus of a CEP system is to enable time-windowed predicates on ordered streams -- for example, “Trigger a buy order for frozen orange juice futures if Mortimer & Mortimer has sold more than 10,000 shares in the last hour” or “Lock down system access if a low-level Army analyst’s terminal is accessing thousands of State Department memos.” These platforms are conventionally programmed using a SQL-like query language but support low-level extension, and an ecosystem of expensive consultants exists to write same.
+
+These platforms are relentlessly focused on low latency, which is their gift and their curse. If you are looking for tightly-bound response times in the milliseconds, nothing else will do. The cost is a tightly-constrained programming model, poor tolerance for strongly-disordered data and a preference for high-grade hardware and expensive commercial software. The leading open source entrant is Esper, which is Java-based and widely used. Commercial offerings include (TODO: find out what commercial offerings there are, e.g. Tibco and Streambase).
+
+Most people with petabyte-scale data first have to figure out how to ship terabyte-scale data to their cluster. The best solutions here are Kafka or Flume. Kafka, a Java-based open source project from LinkedIn, is our choice. It is lightweight, increasingly well-adopted and has a wonderful architecture that allows the operating system to efficiently do almost all the work. Flume, from Cloudera and also Java-based, solves the same problem but less elegantly, in our opinion. It offers the ability to do rudimentary in-stream processing similar to Storm but lacks the additional sophistication Trident provides.
+
+Both Kafka and Flume are capable of extremely high throughput and scalability. Most importantly, they guarantee “at least once” processing. Within the limits of disk space and the laws of physics, they will reliably transport each record to its destination even as networks and intervening systems fail.
+
+Kafka and Flume can both deposit your data reliably onto an HDFS but take very different approaches to doing so. Flume uses the obvious approach of having an “always live” sink write records directly to a DataNode acting as a native client. Kafka’s Camus add-on uses a counterintuitive but, to our mind, superior approach. In Camus, data is loaded onto the HDFS using Mapper-only MR jobs running in an endless loop. Its Map tasks are proper Hadoop jobs and proper Kafka clients, elegantly leveraging the reliability mechanisms of each. Data is live on the HDFS as often as the import job runs -- not more, not less.
+
+Flume’s scheme has two drawbacks. First, the long-running connections it requires to individual DataNodes silently compete with the rest of the framework for those DataNodes' attention. (FOOTNOTE: Make sure you increase DataNode handler counts to match.) Second, a file does not become live on the HDFS until either a full block is produced or the file is closed. That’s fine if all your data streams are high rate, but if you have a range of rates or variable rates, you are forced to choose between inefficient block sizes (larger NameNode burden, more Map tasks) or exceedingly long delays until data is ready to process. There are workarounds, but they are workarounds.
+
+Both Kafka and Flume have evolved into general-purpose solutions from their origins in high-scale server log transport, but there are other use-case-specific technologies. You may see Scribe and S4 mentioned as alternatives but they are not seeing the same widespread adoption. Scalable message queue systems such as AMQP, RabbitMQ or Kestrel will make sense if (a) you are already using one; (b) you require complex event-driven routing; or (c) your system is zillions of sources emitting many events rather than many sources emitting zillions of events. AMQP is Enterprise-y and has rich commercial support. RabbitMQ is open source-y and somewhat fresher. Kestrel is minimal and fast.
+
+===== Stream Analytics
+
+The streaming transport solutions just described focus on getting your data from here to there as efficiently as possible. A streaming analytics solution allows you to perform, well, analytics on the data in flight. While a transport solution only guarantees _at least once_ processing, frameworks like Trident guarantee _exactly once_ processing, enabling you to perform aggregation operations. They encourage you to do anything to the data in flight that Java or your high-level language of choice permits you to do -- including even high-latency actions such as pinging an external API or legacy data store -- while giving you efficient control over locality and persistence. There is a full chapter introduction to Trident in Chapter (TODO: REF), so we won’t go into much more detail here.
+Trident, a Java- and Clojure-based open source project from Twitter, is the most prominent of these frameworks so far.
+
+There are two prominent alternatives. Spark Streaming, an offshoot of the Spark project mentioned above (TODO: REF), is receiving increasing attention. Continuity offers an extremely slick developer-friendly commercial alternative. It is extremely friendly with HBase (the company was started by some members of the HBase core team); as we understand it, most of the action actually takes place within HBase, an interesting alternative approach.
+
+Trident is extremely compelling: it is the most widely used, it is our choice for this book and it is our best recommendation for general use.
+
+===== Online Analytic Processing (OLAP) on Hadoop
+
+The technologies mentioned so far, for the most part, augment the mature, traditional data processing tool sets. There are now arising Hadoop-based solutions for online analytic processing (OLAP) that directly challenge the data warehousing technologies at the core of most large-scale enterprises. These rely on keeping a significant amount of your data in memory, so bring your wallet. (It may help to note that AWS offers instances with 244 GB of RAM -- yes, that’s one quarter of a terabyte -- for a mere $2500 per month, letting you try before you buy.)
+
+The extremely fast response times close the gap to existing Enterprise IT in two ways: first, by offering a SQL-like interface and database-like response times and, second, by providing the ODBC (FOOTNOTE: Open Database Connectivity)-compatible connectors that traditional business intelligence (BI) tools expect.
+
+Impala, an open source project from Cloudera, is the most promising. It reuses Hive’s query language, although current technological limitations prevent it from supporting the full range of commands available in Hive. Druid, a Java-based open source project from Metamarkets, offers a clean, elegant API and will be quite compelling to folks who think like programmers and not like database analysts. If you're interested in a commercial solution, Hadapt and VoltDB (software) and Amazon’s Redshift (cloud-hosted) all look viable.
+
+Lastly, just as this chapter was being written Facebook open sourced their Presto project. It is too early to say whether it will be widely adopted, but Facebook doesn't do anything thoughtlessly or at a small scale. We'd include it in any evaluation.
+
+Which to choose? If you want the simple answer, use Impala if you run your own clusters or Redshift if you prefer a cloud solution. But this technology only makes sense when you've gone beyond what traditional solutions support. You'll be spending hundreds of thousands of dollars here, so do a thorough investigation.
+
+NOTE: You’ll hear the word “realtime” attached to both streaming and OLAP technologies; there are actually three things meant by that term. The first, let’s call it “immediate realtime,” is provided by the CEP solutions: if the consequent actions of a new piece of data have not occurred within 50 milliseconds or less, forget about it. Let’s call what the streaming analytics solutions provide “prompt realtime”: there is a higher floor on the typical processing latency, but you are able to handle all the analytical processing and consequent actions for each piece of data as it is received. Lastly, the OLAP data stores provide what we will call “interactive realtime”: data is promptly manifested in the OLAP system’s tables, and the results of queries are returned and available within an analyst’s attention span.
+
+===== Database Crossloading
+
+All the tools above focus on handling massive streams of data in constant flight. Sometimes, what is needed is just to get that big pile of data over there into that thing over there. Sqoop, an Apache project from Cloudera, capably solves the unlovely task of transferring data in and out of legacy Enterprise stores: SQL Server, Netezza, Oracle, MySQL, PostgreSQL and more.
+
+Most large enterprises are already using a traditional ETL (FOOTNOTE: Extract, Transform and Load, although by now it means the thing ETL vendors sell) tool such as Informatica and (TODO: put in name of the other one). If you want a stodgy, expensive Enterprise-grade solution, their sales people will enthusiastically endorse it for your needs; but if extreme scalability is essential, and the relative immaturity of the newer tools is not a deal breaker, use Sqoop, Kafka or Flume to centralize your data.
+
+==== Core Platform: Data Stores
+
+In the old days, there was such a thing as “a” database. These adorable, all-in-one devices not only stored your data, they allowed you to interrogate it and restructure it. They did those tasks so well we forgot they were different things, stopped asking questions about what was possible and stopped noticing the brutal treatment the database inflicted on our programming models.
+
+As the size of data under management explodes beyond one machine, it becomes increasingly impossible to transparently support that abstraction. You can pay companies like Oracle or Netezza large sums of money to fight a rear-guard action against data locality on your behalf or you can abandon the Utopian conceit that one device can perfectly satisfy the joint technical constraints of storing, interrogating and restructuring data at arbitrary scale and velocity for every application in your shop.
+
+As it turns out, there are a few coherent ways to variously relax those constraints and around each of those solution sets has grown a wave of next-generation data stores -- referred to with the (TODO: change word) idiotic collective term “NoSQL” databases. The resulting explosion in the number of technological choices presents a baffling challenge to anyone deciding “which NoSQL database is the right one for me?” Unfortunately, the answer is far worse than that because the right question is “which NoSQL _databases_ are the right choices for me?”
+
+Big data applications at scale are best architected using a variety of data stores and analytics systems.
+
+The good news is that, by focusing on narrower use cases and relaxing selected technical constraints, these new data stores can excel at their purpose far better than an all-purpose relational database would. Let’s look at the respective data store archetypes that have emerged and their primary contenders.
+
+===== Traditional Relational Databases
+
+The reason the name “NoSQL” is so stupid is that it is not about rejecting traditional databases; it is about choosing the right database for the job. For the majority of jobs, that choice continues to be a relational database. Oracle, MS SQL Server, MySQL and PostgreSQL are not going away. The latter two have widespread open source support and PostgreSQL, in particular, has extremely strong geospatial functionality. As your data scales, fewer and fewer of their powerful JOIN capabilities survive, but for direct retrieval they will keep up with even the dedicated, lightweight key-value stores described below.
+
+If you are already using one of these products, find out how well your old dog performs the new tricks before you visit the pound.
+
+===== Billions of Records
+
+At the extreme far end of the ecosystem are a set of data stores that give up the ability to be queried in all but the simplest ways in return for the ability to store and retrieve trillions of objects with exceptional durability, throughput and latency. The choices we like here are Cassandra, HBase or Accumulo, although Riak, Voldemort, Aerospike, Couchbase and Hypertable deserve consideration as well.
+
+Cassandra is the pure-form expression of the “trillions of things” mission. It is operationally simple and exceptionally fast on write, making it very popular for time-series applications. HBase and Accumulo are architecturally similar in that they sit on top of Hadoop’s HDFS; this makes them operationally more complex than Cassandra but gives them an unparalleled ability to serve as source and destination of Map/Reduce jobs.
+
+All three are widely popular open source, Java-based projects. Accumulo was initially developed by the U.S. National Security Agency (NSA) and was open sourced in 2011. HBase has been an open source Apache project since its inception in 2006, and the two are nearly identical in architecture and functionality. As you would expect, Accumulo has unrivaled security support, while HBase’s longer visibility gives it a wider installed base.
+
+We can try to make the choice among the three sound simple: If security is an overriding need, choose Accumulo. If simplicity is an overriding need, choose Cassandra. For overall best compatibility with Hadoop, use HBase.
+
+However, if your use case justifies a data store in this class, it will also require investing hundreds of thousands of dollars in infrastructure and operations. Do a thorough bake-off among these three and perhaps some of the others listed above.
+
+What you give up in exchange is all but the most primitive form of locality. The only fundamental retrieval operation is to look up records or ranges of records by primary key. There is sugar for secondary indexing, and tricks that help restore some of the power you lost, but effectively that’s it. No JOINs, no GROUPs, no SQL.
+
+- HBase, Accumulo and Cassandra
+- Aerospike, Voldemort and Riak, Hypertable
+
+===== Scalable Application-Oriented Data Stores
+
+If you are using Hadoop and Storm+Trident, you do not need your database to have sophisticated reporting or analytic capability. For the significant number of use cases with merely hundreds of millions (but not tens of billions) of records, there are two data stores that give up the ability to do complex JOINS and GROUPS and instead focus on delighting the application programmer.
+
+MongoDB starts with a wonderful hack: It uses the operating system’s “memory-mapped file” (mmap) features to give the internal abstraction of an infinitely-large data space. The operating system’s finely-tuned virtual memory mechanisms handle all details of persistence, retrieval and caching. That internal simplicity and elegant programmer-friendly API make MongoDB a joy to code against.
+
+Its key tradeoff comes from its key advantage: The internal mmap abstraction delegates all matters of in-machine locality to the operating system. It also relinquishes any fine control over in-machine locality. As MongoDB scales to many machines, its locality abstraction starts to leak. Some features that so delighted you at the start of the project prove to violate the laws of physics as the project scales into production. Any claims that MongoDB “doesn’t scale,” though, are overblown; it scales quite capably into the billion-record regime but doing so requires expert guidance.
+
+Probably the best thing to do is think about it this way: The open source version of MongoDB is free to use on single machines by amateurs and professionals, one and all; anyone considering using it on multiple machines should only do so with commercial support from the start.
+
+The increasingly-popular ElasticSearch data store is our first choice for hitting the sweet spot of programmer delight and scalability. The heart of ElasticSearch is Lucene, which encapsulates the exceptionally difficult task of indexing records and text in a streamlined gem of functionality, hardened by a decade of wide open source adoption. (FOOTNOTE: Lucene was started, incidentally, by Doug Cutting several years before he started the Hadoop project.)
+
+ElasticSearch embeds Lucene into a first-class distributed data framework and offers a powerful programmer-friendly API that rivals MongoDB’s. Since Lucene is at its core, it would be easy to mistake ElasticSearch for a text search engine like Solr; it is one of those and, to our minds, the best one, but it is also a first-class database.
+
+===== Scalable Free-Text Search Engines: Solr, ElasticSearch and More
+
+The need to perform free-text search across millions and billions of documents is not new, and the Lucene-based Solr search engine is the dominant traditional solution with wide Enterprise support. It is, however, long in the tooth and difficult to scale.
+
+ElasticSearch, described above as an application-oriented database, is also our recommended choice for bringing Lucene’s strengths to Hadoop’s scale.
+
+Two recent announcements -- the Apache Blur project and the closely-related ClouderaSearch product -- also deserve consideration.
+
+
+===== Lightweight Data Structures
+
+ZooKeeper is basically “distributed correctness in a box.” Transactionally updating data within a distributed system is a fiendishly difficult task, enough that implementing it on your own should be a fireable offense. ZooKeeper and its ubiquitously available client libraries let you synchronize updates and state among arbitrarily large numbers of concurrent processes. It sits at the core of HBase, Storm, Hadoop’s newer high-availability Namenode and dozens of other high-scale distributed applications. It is a bit thorny to use; projects like etcd (TODO link) and Doozer (TODO link) fill the same need but provide friendlier APIs. We feel this is no place for liberalism, however -- ZooKeeper is the default choice.
+
+If you turn the knob for programmer delight all the way to the right, one request that would fall out would be, “Hey - can you take the same data structures I use while I’m coding but make it so I can have as many of them as I have RAM and shared across as many machines and processes as I like?” The Redis data store is effectively that. Its API gives you the fundamental data structures you know and love -- hashmap, stack, buffer, set, etc. -- and exposes exactly the set of operations that can be made performant and distributedly correct. It is best used when the amount of data does not much exceed the amount of RAM you are willing to provide, and should only be used when its data structures are a direct match to your application. Given those constraints, it is simple, light and a joy to use.
+
+Sometimes, the only data structure you need is “given name, get thing.” Memcached is an exceptionally fast in-memory key value store that serves as the caching layer for many of the Internet’s largest websites. It has been around for a long time and will not go away any time soon.
+
+If you are already using MySQL or PostgreSQL, and therefore only have to scale by the cost of RAM and not the cost of a license, you will find that they are perfectly defensible key-value stores in their own right. Just ignore 90-percent of their user manuals, until the need for better latency or a lower cost of compute forces you to change.
+
+Kyoto Tycoon is an open source C++-based distributed key value store with the venerable DBM database engine at its core. It is exceptionally fast and, in our experience, is the simplest way to efficiently serve a mostly-cold data set. It will quite happily serve hundreds of gigabytes or terabytes of data out of not much more RAM than you require for efficient caching.
+
+===== Graph Databases
+
+Graph-based databases have been around for some time but have yet to see general adoption outside of, as you might guess, the intelligence and social networking communities. We suspect that, as the price of RAM continues to drop and the number of data scientists continues to rise, sophisticated analysis of network graphs will become increasingly important and, with it, adoption of graph data stores will grow.
+
+The two open source projects we hear the most about are the longstanding Neo4j project and the newer, fresher TitanDB.
+
+Your authors do not have direct experience here, but the adoption rate of TitanDB is impressive and we believe that is where the market is going.
+
+==== Programming Languages, Tools and Frameworks
+
+===== SQL-like High-Level Languages: Hive and Pig
+
+Every data scientist toolkit should include either Hive or Pig, two functionally equivalent high-level languages that transform concise high-level statements into efficient Map/Reduce jobs. Both of them are widely-adopted open source projects, written in Java and easily extensible using Java-based User-Defined Functions (UDFs).
+
+Hive is more SQL-like, which will appeal to those with strong expertise in SQL. Pig’s language is sparer, cleaner and more orthogonal, which will appeal to people with a strong distaste for SQL. Hive’s model manages and organizes your data for you, which is good and bad. If you are coming from a data warehouse background, this will provide a very familiar model. On the other hand, Hive _insists_ on managing and organizing your data, making it play poorly with the many other tools that experimental data science requires. (The HCatalog project aims to fix this and is maturing nicely.)
+
+In Pig, every language primitive maps to a fundamental dataflow primitive; this harmony with the Map/Reduce paradigm makes it easier to write and reason about efficient dataflows. Hive aims to complete the set of traditional database operations; this is convenient and lowers the learning curve but can make the resulting dataflow more opaque.
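+
+To make this concrete, here is a small hypothetical snippet (the input path and field names are ours, invented for illustration), with a comment on each statement noting the dataflow stage it maps onto:
+
+----
+logs     = LOAD 'weblogs.tsv' AS (visitor_id:chararray, url:chararray, bytes:long);   -- map-side read
+big_hits = FILTER logs BY bytes > 100000L;                                            -- map-side filter
+grouped  = GROUP big_hits BY url;                                                     -- shuffle: partition and sort
+traffic  = FOREACH grouped GENERATE group AS url, SUM(big_hits.bytes) AS total_bytes; -- reduce-side aggregation
+STORE traffic INTO 'heavy_pages';                                                     -- reduce-side write
+----
+
+Reading it top to bottom, you can see roughly where each statement will land in the resulting Map/Reduce job; that transparency is the harmony described above.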
+
+Hive is seeing slightly wider adoption but both have extremely solid user bases and bright prospects for the future.
+
+Which to choose? If you are coming from a data warehousing background or think best in SQL, you will probably prefer Hive. If you come from a programming background and have always wished SQL just made more sense, you will probably prefer Pig. We have chosen to write all the examples for this book in Pig -- its greater harmony with Map/Reduce makes it a better tool for teaching people how to think in scale. Let us pause and suggestively point to this book’s Creative Commons license, thus perhaps encouraging an eager reader to translate the book into Hive (or Python, Chinese or Cascading).
+
+===== High-Level Scripting Languages: Wukong (Ruby), mrjob (Python) and Others
+
+Many people prefer to work strictly within Pig or Hive, writing Java UDFs for everything that cannot be done as a high-level statement. It is a defensible choice, and a better mistake than the other extreme of writing everything in the native Java API. Our experience, however, has been that, say, 60 percent of our thoughts are best expressed in Pig, perhaps 10 percent of them require a low-level UDF, and the remainder are far better expressed in a high-level language like Ruby or Python.
+
+Most Hadoop jobs are IO-bound, not CPU-bound, so performance concerns are much less likely to intervene. (Besides, robots are cheap but people are important. If you want your program to run faster, use more machines, not more code). These languages have an incredibly rich open source toolkit ecosystem and cross-platform glue. Most importantly, their code is simpler, shorter and easier to read; far more of data science than you expect is brass-knuckle street fighting, necessary acts of violence to make your data look like it should. These are messy, annoying problems, not deep problems and, in our experience, the only way to handle them maintainably is in a high-level scripting language.
+
+You probably come in with a favorite scripting language in mind, and so by all means, use that one. The same Hadoop streaming interface powering the ones we will describe below is almost certainly available in your language of choice. If you do not, we will single out Ruby, Python and Scala as the most plausible choices, roll our eyes at the language warhawks sharpening their knives and briefly describe the advantages of each.
+
+Ruby is elegant, flexible and maintainable. Among programming languages suitable for serious use, Ruby code is naturally the most readable and so it is our choice for this book. We use it daily at work and believe its clarity makes the thought we are trying to convey most easily portable into the reader’s language of choice.
+
+Python is elegant, clean and spare. It boasts two toolkits appealing enough to serve as the sole basis of choice for some people. The Natural Language Toolkit (NLTK) is not far from the field of computational linguistics set to code. SciPy is widely used throughout scientific computing and has a full range of fast, robust matrix and numerical algorithms.
+
+Lastly, Scala, a relative newcomer, is essentially “Java but readable.” Its syntax feels very natural to native Java programmers and it executes directly on the JVM, giving it strong performance and first-class access to native Java frameworks, which means, of course, native access to the code under Hadoop, Storm, Kafka, etc.
+
+If runtime efficiency and a clean match to Java are paramount, you will prefer Scala. If your primary use case is text processing or hardcore numerical analysis, Python’s superior toolkits make it the best choice. Otherwise, it is a matter of philosophy. Against Perl’s mad credo of “there is more than one way to do it,” Python says “there is exactly one right way to do it,” while Ruby says “there are a few good ways to do it; be clear and use good taste.” One of those philosophies best fits your world view; choose accordingly.
+
+===== Statistical Languages: R, Julia, Pandas and more
+
+For many applications, Hadoop and friends are most useful for turning big data into medium data, cutting it down enough in size to apply traditional statistical analysis tools. SPSS, SAS, MATLAB and Mathematica are long-running commercial examples of these, whose sales brochures will explain their merits better than we can.
+
+R is the leading open source alternative. You can consider it the “PHP of data analysis.” It is extremely inviting, has a library for everything, much of the internet runs on it and, considered as a language, it is inelegant, often frustrating and balkanized. Do not take that last part too seriously; whatever you are looking to do that can be done on a single machine, R can do. There are Hadoop integrations, like RHIPE, but we do not take them very seriously. R is best used on single machines or trivially parallelized using, say, Hadoop.
+
+Julia is an upstart language designed by programmers, not statisticians. It openly intends to replace R by offering cleaner syntax, significantly faster execution and better distributed awareness. If its library support begins to rival R’s, it is likely to take over but that probably has not happened yet.
+
+Lastly, Pandas, Anaconda and other Python-based solutions give you all the linguistic elegance of Python, a compelling interactive shell and the extensive statistical and machine-learning capabilities that NumPy and scikit-learn provide. If Python is your thing, you should likely start here.
+
+===== Mid-level Languages
+
+You cannot do everything in a high-level language, of course. Sometimes, you need closer access to the Hadoop API or to one of the many powerful, extremely efficient domain-specific frameworks provided within the Java ecosystem. Our preferred approach is to write Pig or Hive UDFs; you can learn more in Chapter (TODO: REF).
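+
+For a flavor of what that looks like on the Pig side, a script can register a jar of Java code and invoke it like any built-in function; the jar, class and field names below are invented for illustration:
+
+----
+REGISTER 'geo-udfs.jar';                       -- hypothetical jar of custom Java UDFs
+DEFINE ToQuadkey com.example.geo.ToQuadkey();  -- hypothetical UDF class
+sightings = LOAD 'ufo_sightings.tsv' AS (sighted_at:chararray, lng:double, lat:double);
+tiled     = FOREACH sightings GENERATE ToQuadkey(lng, lat, 12) AS quadkey, sighted_at;
+STORE tiled INTO 'sightings_by_quadtile';
+----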
+
+Many people, however, prefer to live exclusively at this middle level. Cascading strikes a wonderful balance here. It combines an elegant DSL for describing your Hadoop job as a dataflow and a clean UDF framework for record-level manipulations. Much of Trident’s API was inspired by Cascading; it is our hope that Cascading eventually supports Trident or Storm as a back end. Cascading is quite popular and, besides its native Java experience, offers first-class access from Scala (via the Scalding project) or Clojure (via the Cascalog project).
+
+Lastly, we will mention Crunch, an open source Java-based project from Cloudera. It is modeled after a popular internal tool at Google; it sits much closer to the Map/Reduce paradigm, which is either compelling to you or not.
+
+===== Frameworks
+
+Finally, for the programmers, there are many open source frameworks to address various domain-specific problems you may encounter as a data scientist. Going into any depth here is outside the scope of this book but we will at least supply you with a list of pointers.
+
+Elephant Bird, DataFu and Akela offer extremely useful additional Pig and Hive UDFs. While you are unlikely to need all of them, we consider no Pig or Hive installation complete without them. For more domain-specific purposes, anyone in need of a machine-learning algorithm should look first at Mahout, Kiji, Weka, scikit-learn or those available in a statistical language, such as R, Julia or NumPy.
+
+Apache Giraph and Gremlin are both useful for graph analysis. (NOTE TO TECH REVIEWERS: What else deserves inclusion?)
+
+// ==== Visualization And Business Insight (BI) Tools
+//
+// NOT TODAY(?):
+// - Datameer
+// - Pentaho
+// - Tableau
+// - Platfora
+// Amino
+// Spotfire
+// Tableau Desktop and Server
+// - chartio, Raw, d3, ???
+//
+// Lastly, because we do not know where else to put them, there are several Hadoop “environments,” some combination of IDE frameworks and conveniences that aim to make Hadoop friendlier to the Enterprise programmer. If you are one of those, they are worth a look.
+
+// ==== Cloud and Managed Services
+//
+// TODAY:
+//
+// - Qubole, Elastic Map/Reduce, Mortar Data, Treasure Data, Continuity and Infochimps
+// - Heroku-based options
+// - AWS Redshift
+// - Azure, HDInsight
+//
+// ==== Operational Components
+//
+// * Workflow Tools
+// - Azkaban, Oozie
+//
+// - Mesos
+// - WANDisco
+// * Administration
+// - Cloudera Manager
+// - Ambari - monitoring thru RESTful APIs
+// - Provisioning: Ironfan, Juju, Whirr, Serengeti, Openstack Hadoop barclamp
+// - Monitoring: Chukwa, Cactus, Ganglia,
+// - StackIQ
+//
+// Cloudera Enterprise
+// Hortonworks Data Platform
+//
+// ===== Alternative HDFS Implementations
+//
+// TODAY:
+//
+// - WANDisco
+// - OrangeFS
+// - glusterfs
+// - Quantcast QFS
+// - Map/R NFS
+// - ...
+// - Direct datastore: DataStax Brisk,
+//
+// ===== Security
+//
+// - Kerberos; MS/Hortonworks has Active Directory integration
+// - fundamental limitations
+// - Gazzang, Dataguise
+//
+// ==== Vertical-Focused and System Integrators
+//
+// - ThinkBig Analytics
+// - Tresata - Big Data Analytics Platform for the Financial Services Industry
+// - Mu Sigma
+// - Booz-Allen
+// - Wibidata Real-time personalization framework
+// - Metamarkets
+// - Infochimps/CSC
26 07-intro_to_storm+trident.asciidoc
@@ -1 +1,27 @@
== Intro to Storm+Trident
+
+
+==== Intro: Storm+Trident Fundamentals
+
+At this point, you have good familiarity with Hadoop’s batch processing power and the powerful inquiries it unlocks beyond, and as a counterpart to, the traditional database approach. Stream analytics is a third mode of data analysis and, it is becoming clear, one that is just as essential and transformative as massive-scale batch processing has been.
+
+Storm is an open-source framework developed at Twitter that provides scalable stream processing. Trident draws on Storm’s powerful transport mechanism to provide _exactly once_ processing of records in _windowed batches_ for aggregating and persisting
+to an external data store.
+
+The central challenge in building a system that can perform fallible operations on billions of records reliably is how to do so without yourself producing so much bookkeeping that it becomes its own scalable stream processing challenge. Storm handles all details of reliable transport and efficient routing for you, leaving you with only the business process at hand. (The remarkably elegant way Storm handles that bookkeeping challenge is one of its principal breakthroughs; you’ll learn about it in the later chapter on Storm Internals.)
+
+This takes Storm past the mere processing of records to stream analytics -- with some limitations and some advantages, you have the same ability to specify locality and write arbitrarily powerful general-purpose code to handle every record. A lot of Storm+Trident’s adoption is in application to real-time systems. footnote:[For reasons you’ll learn in the Storm internals chapter, it’s not suitable for ultra-low-latency (below, say, a few milliseconds), Wall Street-type applications, but if latencies above that are real-time enough for you, Storm+Trident shines.]
+
+But, just as importantly, the framework exhibits radical _tolerance_ of latency. It’s perfectly reasonable to, for every record, perform reads of a legacy data store, call an internet API and the like, even if those might have hundreds or thousands of milliseconds worst-case latency. That range of timescales is simply impractical within a batch processing run or database query. In the later chapter on the Lambda Architecture, you’ll learn how to use stream and batch analytics together for latencies that span from milliseconds to years.
+
+As an example, one of the largest hard drive manufacturers in the world ingests sensor data from its manufacturing line, test data from quality assurance processes, reports from customer support and post-mortem analyses of returned devices. They have been able to mine the accumulated millisecond-scale sensor data for patterns that predict flaws months and years later. Hadoop produces the "slow, deep" results, uncovering the patterns that predict failure. Storm+Trident produces the fast, relevant results: operational alerts when those anomalies are observed.
+
+Things you should take away from this chapter:
+
+* Understand the types of problems you can solve using stream processing, and apply them to real examples using best-in-class stream analytics frameworks.
+* Acquire the practicalities of authoring, launching and validating a Storm+Trident flow.
+* Understand Trident’s operators and how to use them: `each`, and how to apply `CombinerAggregator`s, `ReducerAggregator`s and `AccumulatingAggregator`s (generic aggregator?)
+* Persist records or aggregations directly to a backing database, or to Kafka for idempotent downstream storage.
+* (Probably not going to discuss how to do a streaming join, using either DRPC or a hashmap join.)
+
+NOTE: This chapter will only speak of Storm+Trident at the high level and from the outside. We won’t spend any time on how it makes all of this work until the chapter on Storm+Trident internals (TODO: ref).
18 09a-statistics-intro.asciidoc
@@ -1,4 +1,22 @@
+==== Skeleton: Statistics
+
+Data is worthless. Actually, it's worse than worthless. It costs you money to gather, store, manage, replicate and analyze. What you really want is insight -- a relevant summary of the essential patterns in that data -- produced using relationships to analyze data in context.
+
+Statistical summaries are the purest form of this activity, and will be used repeatedly in the book to come, so now that you see how Hadoop is used it's a good place to focus.
+
+Some statistical measures let you summarize the whole from summaries of the parts: I can count all the votes in the state by summing the votes from each county, and the votes in each county by summing the votes at each polling station. Those types of aggregations -- average/standard deviation, correlation, and so forth -- are naturally scalable, but just having billions of objects introduces some practical problems you need to avoid. We'll also use them to introduce Pig, a high-level language for SQL-like queries on large datasets.
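+
+For instance, a county-then-state rollup of vote counts in Pig (the file and field names are assumptions, chosen for illustration) produces the same totals as summing the raw records directly, which is exactly what makes this kind of aggregation naturally scalable:
+
+----
+votes     = LOAD 'votes.tsv' AS (state:chararray, county:chararray, station:chararray, ballots:long);
+-- roll each county up from its polling stations
+by_county = FOREACH (GROUP votes BY (state, county)) GENERATE
+              group.state AS state, group.county AS county, SUM(votes.ballots) AS county_ballots;
+-- then roll each state up from its county summaries
+by_state  = FOREACH (GROUP by_county BY state) GENERATE
+              group AS state, SUM(by_county.county_ballots) AS state_ballots;
+----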
+
+Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That's especially unfortunate because so much data at large scale has a long-tail, not normal (Gaussian), distribution -- the median is a far more robust indicator of the "typical" value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.)
+
+But you don't always need an exact value -- you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to:
+
+* Holistic vs algebraic aggregations
+* Underflow and the "Law of Huge Numbers"
+* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
+* Count-min sketch for most frequent elements
+* Approx histogram
+
- Counting
- total burgers sold - total commits, repos,
- counting a running and or smoothed average
4 11a-spatial_join.asciidoc
@@ -1,4 +1,3 @@
-
**TODO**:
Will be reorganizing below in this order:
@@ -79,6 +78,7 @@ This is a 1-d index into a 2-d space! What's more, nearby points in space are ty
Note: you'll sometimes see people refer to quadtile coordinates as `X/Y/Z` or `Z/X/Y`; the 'Z' here refers to zoom level, not a traditional third coordinate.
==== Patterns in UFO Sightings ====
+
////Introduce/buffer a bit first -- like, "The following approach can also be used to analyze x, y, or z..." Root in real-world applications, first. Amy////
Let's put Hadoop into practice for something really important: understanding where a likely alien invasion will take place. The National UFO Reporting Center has compiled a dataset of 60,000+ documented UFO sightings, with metadata. We can combine that with the 7 million labelled points of interest in the Geonames dataset: airports and zoos, capes to craters, schools, churches and more.
@@ -123,6 +123,7 @@ Mapper output:
==== Reducer: combine objects on each quadtile ====
+
////Introduce this - (it's true, you'll need to reorient the reader pretty consistently). "Here, we are looking for..." Amy////
The reducer is now fairly simple. Each quadtile will have a handful of UFO sightings, and a potentially large number of geonames places to test for nearbyness. The nearbyness test is straightforward:
@@ -365,7 +366,6 @@ In code:
Take care to put coordinates in the order "longitude, latitude", maintaining consistency with the (X, Y) convention for regular points. Natural English idiom switches their order, a pernicious source of error -- but the convention in http://www.geojson.org/geojson-spec.html#positions[geographic systems] is unambiguously to use `x, y, z` ordering. Also, don't abbreviate longitude as `long` -- it's a keyword in Pig and other languages. I like `lng`.
=========================
-
==== Exploration
* _Exemplars_
6 11b-multiscale_join.asciidoc
@@ -1,11 +1,10 @@
==== Adaptive Grid Size ====
+
////Very interesting; but, give an example (one, or two) of how this extends to equivalent real-world examples. Amy////
The world is a big place, but we don't use all of it the same. Most of the world is water. Lots of it is Siberia. Half the tiles at zoom level 2 have only a few thousand inhabitantsfootnote:[000 001 100 101 202 203 302 and 303].
-Suppose you wanted to store a "what country am I in" dataset -- a geo-joinable decomposition of the region boundaries of every country. You'll immediately note that
-Monaco fits easily within on one zoom-level 12 quadtile; Russia spans two zoom-level 1 quadtiles.
-Without multiscaling, to cover the globe at 1-km scale and 64-kB records would take 70 terabytes -- and 1-km is not all that satisfactory. Huge parts of the world would be taken up by grid cells holding no border that simply said "Yep, still in Russia".
+Suppose you wanted to store a "what country am I in" dataset -- a geo-joinable decomposition of the region boundaries of every country. You'll immediately note that Monaco fits easily within one zoom-level 12 quadtile; Russia spans two zoom-level 1 quadtiles. Without multiscaling, covering the globe at 1-km scale with 64-kB records would take 70 terabytes -- and 1-km is not all that satisfactory. Huge parts of the world would be taken up by grid cells holding no border that simply said "Yep, still in Russia".
There's a simple modification of the grid system that lets us very naturally describe multiscale data.
@@ -29,7 +28,6 @@ The quadkey `1330011xx` means "I carry the information for grids `133001100`, `1
image::images/fu05-quadkeys-multiscale-ZL8.png[Japan at Zoom Level 8]
-
image::images/fu05-quadkeys-multiscale-ZL9.png[Japan at Zoom Level 9]
145 12a-herding_cats.asciidoc
@@ -14,16 +14,17 @@ If you're already familiar with Unix pipes and chaining commands together, feel
One of the key aspects of the Unix philosophy is that it is built on simple tools that can be combined in nearly infinite ways to create complex behavior. Command line tools are linked together through 'pipes' which take output from one tool and feed it into the input of another. For example, the `cat` command reads a file and outputs its contents. You can pipe this output into another command to perform useful transformations. For instance, to select only the first field (delimited by tabs) of each line in a file you could:
-```
+----
cat somefile.txt | cut -f1
-```
+----
+
The vertical pipe character, '|', represents the pipe between the `cat` and `cut` commands. You can chain as many commands as you like, and it's common to construct chains of 5 commands or more.
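
For instance, the following chain (against a hypothetical tab-separated `access_log.tsv`) counts the most frequent values in the file's third column -- each command is trivial on its own, but together they do real work:

----
cat access_log.tsv | cut -f3 | sort | uniq -c | sort -rn | head -10
----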
In addition to redirecting a command's output into another command, you can also redirect output into a file with the `>` operator. For example:
-```
+----
echo 'foobar' > stuff.txt
-```
+----
writes the text 'foobar' to stuff.txt. stuff.txt is created if it doesn't exist, and is overwritten if it does.
@@ -37,21 +38,21 @@ Each Unix command has 3 input/output streams: standard input, standard output, a
The third stream, stderr, is generally used for error messages, progress information, or other text that is not strictly 'output'. Stderr is especially useful because it allows you to see messages from a command even if you have redirected the command's stdout. For example, if you wanted to run a command and redirect its output into a file, you could still see any errors generated via stderr. curl, a command used to make network requests, works this way: it writes the content it fetches to stdout, but sends its progress meter and any error messages to stderr, so you can redirect the content into a file and still watch the request's progress in your terminal:
-```
+----
curl http://example.com/data.txt > data.txt     # the response body goes to the file; the progress meter and any errors still show on stderr
-```
+----
It's occasionally useful to be able to redirect these streams independently or into each other. For example, if you're running a command and want to log its output as well as any errors generated, you should redirect stderr into stdout and then direct stdout to a file:
-```
+----
my_command > results.log 2>&1     # note the order: stdout is pointed at the file first, then stderr is pointed at stdout
-```
+----
Alternatively, you could redirect stderr and stdout into separate files:
-```
+----
my_command > results.log 2> errors.log
-```
+----
You might also want to suppress stderr if the command you're using gets too chatty. You can do that by redirecting stderr to /dev/null, which is a special file that discards everything you hand it.
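
For example (with `noisy_command` standing in for whatever chatty program you're running):

----
noisy_command 2> /dev/null
----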
@@ -61,23 +62,23 @@ Now that you understand the basics of pipes and output redirection, lets get on
`cat` reads the content of a file and prints it to stdout. It can accept multiple files, like so, and will print them in order:
-```
+----
cat foo.txt bar.txt bat.txt
-```
+----
`cat` is generally used to examine the contents of a file or as the start of a chain of commands:
-```
+----
cat foo.txt | sort | uniq > bar.txt
-```
+----
In addition to examining and piping around files, `cat` is also useful as an 'identity mapper', a mapper which does nothing. If your data already has a key that you would like to group on, you can specify `cat` as your mapper and each record will pass untouched through the map phase to the sort phase. Then, the sort and shuffle will group all records with the same key at the proper reducer, where you can perform further manipulations.
`echo` is very similar to `cat` except it prints the supplied text to stdout. For example:
-```
+----
echo foo bar baz bat > foo.txt
-```
+----
will result in foo.txt holding 'foo bar baz bat', followed by a newline. If you don't want the newline you can give echo the -n option.
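
For example, to write the same text without the trailing newline:

----
echo -n 'foo bar baz bat' > foo.txt
----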
@@ -87,31 +88,31 @@ will result in foo.txt holding 'foo bar baz bat', followed by a newline. If you
The `cut` command allows you to cut certain pieces from each line, leaving only the interesting bits. The -f option means 'keep only these fields', and takes a comma-delimited list of numbers and ranges. So, to select the first three fields and the fifth field of a tsv file you could use:
-```
+----
cat somefile.txt | cut -f 1-3,5
-```
+----
Watch out - the field numbering is one-indexed. By default cut assumes that fields are tab-delimited, but delimiters are configurable with the `-d` option.
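
For example, to pull the second field out of a comma-separated file:

----
cut -d, -f2 somefile.csv
----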
This is especially useful if you have TSV output on the HDFS and want to filter it down to only a handful of fields. You can create a Hadoop streaming job to do this like so:
-```
+----
wu-mapred --mapper='cut -f 1-3,5'
-```
+----
`cut` is great if you know the indices of the columns you want to keep, but if your data is schema-less or nearly so (like unstructured text), things get slightly more complicated. For example, if you want to select the last field of each record but the number of fields varies from record to record, you can combine `cut` with the `rev` command, which reverses text:
-```
+----
cat foobar.txt | rev | cut -f1 | rev
-```
+----
This reverses each line, selects the first field in the reversed line (which is really the last field), and then reverses the text again before outputting it.
`cut` also has a -c (for 'character') option that allows you to select ranges of characters. This is useful for quickly verifying the output of a job with long lines. For example, in the Regional Flavor exploration, many of the jobs output wordbags which are just giant JSON blobs, one line of which would overflow your entire terminal. If you want to quickly verify that the output looks sane, you could use:
-```
+----
wu-cat /data/results/wikipedia/wordbags.tsv | cut -c 1-100
-```
+----
==== Character encodings
@@ -129,33 +130,33 @@ Remember that if you're using these commands as Hadoop mappers or Reducers, you
While `cut` is used to select columns of output, `head` and `tail` are used to select lines of output. `head` selects lines at the beginning of its input while `tail` selects lines at the end. For example, to view only the first 10 lines of a file, you could use `head` like so:
-```
+----
head -10 foobar.txt
-```
+----
`head` is especially useful for sanity-checking the output of a Hadoop job without overflowing your terminal. `head` and `cut` make a killer combination:
-```
+----
wu-cat /data/results/foobar | head -10 | cut -c 1-100
-```
+----
`tail` works almost identically to `head`. Viewing the last ten lines of a file is easy:
-```
+----
tail -10 foobar.txt
-```
+----
`tail` also lets you specify the selection in relation to the beginning of the file by prefixing the line number with '+'. So, to select every line from the 10th line on:
-```
+----
tail -n +10 foobar.txt
-```
+----
What if you just finished uploading 5,000 small files to the HDFS and realized that you left a header on every one of them? No worries, just use `tail` as a mapper to remove the header:
-```
+----
wu-mapred --mapper='tail -n +2'
-```
+----
This outputs every line but the first one.
@@ -169,30 +170,30 @@ This outputs the end of the log to your terminal and waits for new content, upda
`grep` is a tool for finding patterns in text. You can give it a word, and it will diligently search its input, printing only the lines that contain that word:
-```
+----
grep foobar somefile.txt
-```
+----
`grep` has many options, and accepts regular expressions as well as words and word sequences:
-```
+----
grep 'foo bar' somefile.txt                # a word sequence (quote it so the shell passes it whole)
grep -E '(foo|bar)[0-9]+' somefile.txt     # an extended regular expression
-```
+----
The -i option is very useful; it makes `grep` ignore case:
-```
+----
grep -i foobar somefile.txt
-```
+----
As is `zgrep`, a wrapper that decompresses gzipped text before grepping through it. This can be tremendously useful if you keep files on your HDFS in a compressed form to save space.
When using `grep` in Hadoop jobs, beware its non-standard exit statuses. `grep` returns a 0 if it finds matching lines, a 1 if it doesn't find any matching lines, and a number greater than 1 if there was an error. Because Hadoop interprets any exit code greater than 0 as an error, any Hadoop job that doesn't find any matching lines will be considered 'failed' by Hadoop, which will result in Hadoop re-trying those jobs without success. To fix this, we have to swallow `grep`'s exit status like so:
-```
+----
(grep foobar || true)
-```
+----
This ensures that Hadoop doesn't erroneously kill your jobs.
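
As a mapper, that might look like the following -- a sketch that assumes, as the earlier `cut` and `tail` examples suggest, that `wu-mapred` hands the mapper string to a shell:

----
wu-mapred --mapper='(grep foobar || true)'
----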
@@ -202,56 +203,56 @@ This ensures that Hadoop doesn't erroneously kill your jobs.
As you might expect, `sort` sorts lines. By default it sorts alphabetically, considering the whole line:
-```
+----
sort somefile.txt
-```
+----
You can also tell it to sort numerically with the -n option, but -n doesn't understand numbers in scientific notation. To sort those properly, use the -g option:
-```
+----
sort -n counts.txt           # integers and plain decimals
sort -g measurements.txt     # also handles numbers like 6.02e23, at some cost in speed
-```
+----
You can reverse the sort order with -r:
-```
+----
sort -rn counts.txt
-```
+----
You can also specify a column to sort on with the -k option:
-```
+----
sort -k2,2 somefile.txt      # order by the second field only
-```
+----
By default the column delimiter is a non-blank to blank transition, so any content character followed by a whitespace character (tab, space, etc…) is treated as a column. This can be tricky if your data is tab delimited, but contains spaces within columns. For example, if you were trying to sort some tab-delimited data containing movie titles, you would have to tell sort to use tab as the delimiter. If you try the obvious solution, you might be disappointed with the result:
-```
+----
sort -t"\t"
sort: multi-character tab `\\t'
-```
+----
Instead we have to somehow give the -t option a literal tab. The easiest way to do this is:
-```
+----
sort -t$'\t'
-```
+----
`$'<string>'` is a special directive that tells your shell to expand `<string>` into its equivalent literal. You can do the same with other control characters, including `\n`, `\r`, etc…
Another useful way of doing this is by inserting a literal tab manually:
-```
+----
sort -t' '
-```
+----
To insert the tab literal between the single quotes, type `CTRL-V` and then `Tab`.
If you find your sort command is taking a long time, try increasing its sort buffer size with the `--buffer-size` (`-S`) option. This can make things go a lot faster:
-```
+----
sort --buffer-size=2G huge_file.tsv
-```
+----
**TALK ABOUT SORT'S USEFULNESS IN BIG DATA**
@@ -259,22 +260,22 @@ example
`uniq` is used for working with duplicate lines - you can count them, remove them, look for them, among other things. For example, here is how you would find the number of Oscars each actor has, given a list of annual Oscar winners:
-```
+----
sort oscar_winners.txt | uniq -c
-```
+----
Note the -c option, which prefixes each line of output with a count of how many times it occurred. Also note that we sort the list before piping it into uniq - input to uniq must always be sorted or you will get erroneous results.
You can also restrict the output to lines that appear only once with the -u option:
-```
+----
example
-```
+----
And only print duplicates with the -d option:
-```
+----
sort somefile.txt | uniq -d
-```
+----
* TALK ABOUT USEFULNESS, EXAMPLES*
@@ -288,21 +289,21 @@ example
`wc` is a utility for counting words, lines, and characters in text. Without options, it searches its input and outputs the number of lines, words, and bytes, in that order:
-```
+----
wc somefile.txt
-```
+----
With the -m option, `wc` will instead print out the number of characters, as defined by the LC_CTYPE environment variable:
-```
+----
wc -m somefile.txt
-```
+----
We can use wc as a mapper to count the total number of words in all of our files on the HDFS:
-```
+----
wu-mapred --mapper='wc -w'     # each map task emits the word count for its split; sum the outputs to get the grand total
-```
+----
==== md5sum and sha1sum
@@ -314,9 +315,9 @@ EXAMPLE
`expand` and `unexpand` are simple commands used to transform tabs to spaces and vice versa:
-```
+----
expand -t4 somefile.txt      # tabs become spaces, with tab stops every four columns
unexpand -a somefile.txt     # runs of spaces become tabs again
-```
+----
This comes in handy when trying to get data into or out of the TSV format.
1  13a-wikipedia_other.asciidoc
@@ -20,7 +20,6 @@ mapper do
end
--------------------
-
We're going to make the following changes:
* split the `lang_and_project` attribute into `in_language` and `wp_project`. They're different properties, and there's no good reason to leave them combined.
30 88-locality.asciidoc
@@ -1,3 +1,33 @@
+
+Big Data is two things:
+
+* The solution to a problem: what happens when volume of ....
+* The emergence of an opportunity: (the "Unreasonable Effectiveness of Data" phenomenon).
+
+....
+So big data is what happens when the required context volume for fully-justified synthesis of belief.
+There are several ways to get into trouble with the context volume -- too much data in space, in time, etc (explain).
+
+You have to hit a set of things.
+
+* **Truth** -- consistent, stable and complete synthesis of the facts (define?)
+
+You might think this is in general essential, overriding, and achievable. It isn't, isn't, and isn't.
+
=== Model for Data Analysis
The architecture, APIs and user experience of Storm+Trident and Hadoop are very different, and there are fundamental differences in what they provide: Hadoop cannot process an infinite stream of data, and Storm+Trident doesn't naturally lend itself to complex aggregations at terabyte scale. But there is so much commonality in how to think about an analytic flow in each that they must be reflections of the same core process -- with differences that are the result of engineering under constraint.
2  88-storm+trident-internals.asciidoc
@@ -70,7 +70,7 @@ Say how a tuple is cleared from the pending register when its tree is finally ac
=== Acking and Reliability
-Storm's elegant acking mechanism is probably its most significant breakthrough. It ensures that a tuple and its descendents are processed successfully or fail loudly and it does so with a minimum amount of bookkeeping chatter. The rest of the sections in this chapter, while advanced, ultimately are helpful for architecting and productionizing a dataflow. This section, however, . is comparatively optional -- the whole point of the reliability mechanism is that it Just Works. It's so fascinating we can't help but include it but if you're not looking to have your brain bent today, feel free to skip it. (It's fairly complicated, so in a few places, I will make mild simplifications and clarify the details in footnotes.)
+Storm's elegant acking mechanism is probably its most significant breakthrough. It ensures that a tuple and its descendants are processed successfully or fail loudly, and it does so with a minimum amount of bookkeeping chatter. The rest of the sections in this chapter, while advanced, ultimately are helpful for architecting and productionizing a dataflow. This section, however, is comparatively optional -- the whole point of the reliability mechanism is that it Just Works. It's so fascinating we can't help but include it, but if you're not looking to have your brain bent today, feel free to skip it. footnote:[This story is complicated enough that I'm going to make two minor simplifications that are really only interesting if you're knee-deep in the code. We'll let you know about them in the footnotes.]
Here's how it works.