palvaro committed Dec 10, 2011
# Big Data
- No buzzword-compliant discussion of cloud computing can avoid mention of Big Data. Today is the day.
- We'll talk about programming models and execution frameworks as well.
# The current sorry state of affairs
A typical analytic lifecycle:
- Data in DB -> sample.csv -> R -> spec.docx -> DB extract -> scores.csv -> DB import
- Better: push the functionality of R into the DB or Hadoop cluster
- This is not just a SMOP (Simple Matter Of Programming)!
- Need to figure out how to write analytic algorithms in data-parallel style, as SQL or MapReduce code.
## Programming Models
History: various people realized in the 1970s and 1980s that "disorderly" programming allowed for parallelism, one way or another. Two strands:
- "dataflow" programming (most prevalent in computer architecture)
- "declarative" programming (basically SQL).
These two models persist today as the only broadly successful parallel programming paradigms. Usually called the *data-parallel* style of computation. Sometimes called *Single-Program-Multiple-Data (SPMD)*.
- The Dead Parallel Computer Society of the 1980s (vs. Teradata)
- Caveat: MPI does exist, and is used some in HPC.
Side note: Bloom in many ways is an extension of the success of this history. If data-parallelism works for Big Data, why not for fine-grained computing?
### Parallel SQL
Why is it good for parallelism?
- Originally, the motivation for the relational model and languages--so-called "data independence"--was to enable reorganization of data on disks: new sort orders, indexing, and so on.
- But as we know, storage layouts are just one form of rendezvous in space/time. Another form is batched I/O, e.g. the partitioned hash join. And more generally the ability to do query optimization--e.g. to reorder entire batches of work based on high-level reasoning about commutativity and associativity.
- Partitioning and communication are just another spin on this. E.g. the partitioned symmetric hash join.
- Various research groups and one company--Teradata--figured this out in the 1980s.
Is SQL a "programming" language?
- By the time people were looking into parallel SQL, expectations were high: you had to do the whole suite of relational features, especially transactions and automatic query optimization.
- OTOH, the set of expressions to be evaluated on the data (both scalars and aggregates) was typically fixed.
- So really confined into being a query language: need to keep it simple to optimize it, etc. Even though in principle it's quite flexible.
This legacy still pervades many of the parallel database vendors, but things are changing thanks to pressure from MapReduce.
SQL extensibility
- A big topic in the late 80's and 90's
- At some level, really easy
 - UDFs (user-defined functions)
 - UDAs (user-defined aggregates)
 - OO-style UDTs (user-defined types)
- DBMS expectations set the bar really high though
 - auto-parallelization
 - query optimization
 - indexing
 - security
- Meanwhile, language limitations
 - Recursion not usually well treated
 - Cultural (and sometimes practical) aversion to loose structure. (Why not a table with two columns, key and val! Why not a table with one column and one row?)
- Result: not a lot of programmer enthusiasm. *But*: if your data lives in an SQL database, you should push your code to the data there.
- Would it work?
 - You bet. See MADlib: ML algorithms implemented in extended SQL running inside the database.
Bloom vs. SQL:
 - Bloom is explicitly partitioned and potentially MPMD
 - SQL is auto-partitioned and inherently SPMD.
 - Bloom schemas are more forgiving (you can work around this in SQL)
 - A thought: compile single-node Bloom down to parallel SQL!? Could this be the right way to generate complex code like MADlib?
### MapReduce
A topic that needs little introduction these days.
- A dataflow programming model.
- Very easy to explain.
- Low bar to entry, in the style of dynamic typing: record splitting and key/val pairs, focus on extensions not the core.
- Also cultural acceptance of text manipulation rather than a type system.
- Low expectations. Just parallelize -- no optimization, indexing, security, recursion, transactions...
- Arguably because it's simple, people have been willing to see it as an algorithmic building block.
 - Initial example of PageRank attractively algorithmic. Followed by various other machine learning algorithms in recent years (see Mahout).
 - MapReduce has done *wonders* for changing how people think about computing. Data-centric mindsets, disorderly programming, scale.
 - "The most interesting thing about MapReduce is that people are interested in it." This is not pejorative--it's really fascinating and useful.
- On the flip side, quite a low-level interface. Even simple matching (joins) is a hassle. Hence the evolution of Sawzall/Dremel/Tenzing (Google), Pig/JAQL/Hive/Cascading (Hadoop).
Bloom vs. MapReduce:
 - Again, Bloom explicitly partitioned and potentially MPMD
 - MapReduce auto-partitioned, SPMD
 - Thought 1: compile single-node Bloom down to Hadoop? Vs. Pig/Cascalog?
 - Thought 2: implement Hadoop in Bloom? Yes we can! See BOOM Analytics, below.
 - Thought 3: what happens when you combine Thought 1 and Thought 2?!?
 - Thought 4: any reason we didn't discuss this for SQL DBMSs?
### Three implementations of PageRank
PageRank is usually described as an iterative (multi-pass) algorithm for propagating weights on a graph. It's actually computing an eigenvector of the adjacency matrix. This makes it an intriguing example for parallel algorithmics: simple but non-trivial, definitely more interesting than word-counts or SQL rollups.
- In (single-site) [Bloom](../thursdaycode/pagerank/pagerank.rb)
- In [SQL](../thursdaycode/pagerank/pagerank.sql)
- In Hadoop
## Runtime issues
### Let's review Hadoop.
- JobTracker & TaskTrackers
- Job divided into a set of map & reduce tasks
- JobTracker farms out work to TaskTrackers
- Map tasks read input chunks from HDFS, split them into records, run the user map code, and partition the output k/v pairs to local disk.
- Reduce tasks pull buckets from all mappers (the shuffle!); mappers run combiners before handing data off. Each reducer then sorts locally, runs the user reduce code on each key, and stores output in HDFS.
- TaskTrackers have a fixed # of slots, heartbeat their status to the JobTracker
- Failure handling, straggler handling
- Obvious pros:
 - centralized knowledge and scheduling at the JobTracker
 - easy restart/re-execution of map tasks
 - relatively easy restart of reduce tasks
 - decoupling of scheduling between mappers and reducers, facilitated by big disk buffers
- Obvious cons:
 - SPOF at the JobTracker
 - pessimistic checkpointing
 - no pipelining!
 - potentially inefficient coordination between producers and consumers
### Your basic SQL engine
- Coordinator node, usually with a hot standby, does scheduling, query optimization.
- Worker nodes with storage, index and query processing capability
- Data pre-partitioned and replicated across workers (hash/range/random)
- Query optimizer chooses algorithms, order of ops, materialization points vs. pipelining at each stage. Other components determine admission control, memory utilization, multiprogramming level...
- Comm patterns include: local processing, all-to-all shuffling using hash and sort, "broadcast" joins, tree-based aggregation
- Obvious pros:
 - Query optimization can make a big difference in productivity
 - No overhead for checkpointing required
 - Pipelining is easy and quite common
- Obvious cons:
 - restart from the beginning only
 - straggler handling is not standard
 - hence higher variance in performance: fast runs should trump Hadoop, but slow runs can be very bad.
### The BOOM Analytics Story
See the BOOM Analytics research paper.
How much cleaner/easier would it be if we reimplemented HDFS and Hadoop in Overlog (the precursor to Bloom)?
- BFS + redo the Hadoop scheduler
- BFS redundancy and scale-out
- Hadoop JobTracker scheduling
### The MapReduce Online Story
See the MapReduce Online paper.
- Can we have Hadoop-style checkpointing and SQL-style pipelining?
- E.g. for "online aggregation" or infinite stream queries?
- You bet we can.
- Tricks:
 - Maps push to (live) reducers to couple the pipeline when they can; reducers pull the rest
 - Batch up pushes and run combiners before pushing.
 - Reduce publishes "snapshot" outputs for speculative consumption by subsequent maps
 - Fault tolerance
   - Map failure:
     - Reducers keep track of which mapper produced each spill file
     - Reducers treat incoming task outputs as tentative until the task is reported complete. Tentative output can only be merged with output from the same task.
   - Reduce failure:
     - Mappers have to save their buffers until the reducer completes.
# Bloom Feedback
- Break into groups. Each group to come up with 2 "critiques" and 2 "WIBNIs" (Wouldn't It Be Nice If...). To help, go back to project code and read it through for the parts that felt awkward. Try to construct code samples for these 4. Prioritize.
   # Critique: missing if/else construct.
   out1 <= inny {|i| i if i.c < 4 and i.d > 2}
   out2 <= inny {|i| i if i.c >= 4 or i.d <= 2}
   # WIBNI there was operator autocomplete in an editor
   # (e.g. <~ when channel on lhs)
Discuss top critique from each group, then top WIBNI from each group. Then any remaining.
- Go round again, ask for reflection on common sources of Bloom code bugs and how to debug them. Examples if at all possible.
# Bloom Lessons to live by
What have we learned that you can take away into other languages?
 - All state uniformly treated as disorderly collections of data.
   - Remember: no distinction between variables and data in Bloom
   - all in *collections*, so reorderable, partitionable by default.
   - Example: filesystem metadata in a KVS.
     - Partition? Yes.
     - Replicate? Multi-master? Yes.
   - No real difference between replicated *state* and replicated *processes*!
   - Note that caching is a form of (partial) replication.
   - Can you do this in a traditional PL? Sure you can!
 - Space-time rendezvous as a key construct:
   - Remember
     - sender-persist, receiver-persist, both-persist
     - storage is implicit sender-persist communication. "Implicitness" often leads to rigid thinking
     - always think of concurrency/consistency issues w.r.t. communication ordering!
   - Uses
     - data joins: table/table rendezvous
     - msg handlers = channel/table rendezvous
     - timeout logic = periodic/table rendezvous
     - heartbeats = periodic/table rendezvous
   - Manipulating the rendezvous of 2 scratches (channels, periodics)
     - example: request/response pattern
     - persisting one scratch, the other scratch, or both
     - when to "garbage collect" the persisted data?
   - When do you *need* time? (<+ or <~)
     - asynchronous tasks (<~)
     - non-monotonicity (esp. with cycles -- recursion):
       `kvs <+- (del_msg*kvs).rights(:key => :key)`
   - Understanding rendezvous makes it easy to switch between traditional programming metaphors
     - storage vs. communication: e.g. shared memory vs. "IPC".
     - sync vs. async: e.g. function call vs. RPC
     - state "at endpoints", "at proxies", "stateless" (carried in packets)
     - can you apply state-as-data tricks uniformly across these metaphors? I think so!
 - What of this can you take away into other languages?
   - lightweight event handlers instead of threads: not so hard?
   - stream query implementation: not so hard?
   - monotonicity analysis (CALM)
     - monotonic code is eventually consistent
       - set accumulation
       - incrementing integers
     - beware of non-monotonicity downstream of asynchrony:
       - delete, replace, set minus
       - aggregation
         - though sometimes you can convince yourself it's monotonic, e.g. (ints, +, max, <)
     - non-monotonicity, when guarded by coordination, becomes eventually consistent!
       - what kind of coordination will control the reordering you worry about?
         - global atomic broadcast?
         - FIFO point-to-point channels?
         - actions vs. transactions?
     - remember our 2 shopping carts
       - even though we didn't avoid coordination, the disorderly cart *moved* it to a less frequent dataflow transition (checkout).
     - can you think about this in a traditional PL? Yes! (Though it's up to you to prove and maintain.)
# Some stuff that's hiding (or should be) in the Bloom runtime
 - Making this high-level language work requires two things
   - fast single-node stream-query (relational) processing
     - network event handler
     - indexes, pipelined/symmetric hash-joins and hash-groups
     - Bud has a long way to go on this front!
     - but we know what to do
   - sophisticated "query optimizer"
     - e.g. rewrite programs to only populate collections as needed ("magic sets")
     - e.g. rewrite programs to share common sub-expressions across rules
     - e.g. choose orders of operation for stuff like multi-way joins
     - e.g. intelligently "garbage collect" persisted tuples that can no longer join with anything
     - Again, Bud has a long way to go on this front!
     - we know how to do some of this
     - we know how to build a nice framework for this
     - there will be research!
# If you like this stuff:
 - please keep using it!
   - even if your job requires another lang, Bloom is great for design/prototype
 - we'd love your continuing feedback
   - don't be shy with criticism!
 - you may want to do research
   - stay in touch!
 - you may want to work in distributed systems and/or big data and/or PL.
   - stay in touch!
# Thank you and congratulations!
 - You are the best Bloom programmers in the world.
 - You are enlightened distributed system designers.
 - We salute you.
# Critiques
- temps don't have column names. ("AS")
- lots of effort on the "user side" of interfaces in the way we designed them -- e.g. sequence IDs not baked into initial requests.
- module/mixin system kinda hacky. Really hard to keep track of overwriting names, overloading. Similarly-named bloom blocks can disappear.
 - multiple includes lead to bugs.
 - more than one way to do extension
 - neither of the two ways is enough
- block for notin works differently -- can't be used like a "map"
 - can't get projection on notin (block of notin is for notin-ing, rather than projection).
- difficult to have different modules run on different timescales!
 - e.g. output only after some number of timesteps
 - related to module/mixin system: modules might want their own "clock rates"
- synchrony vs. asynchrony of modules not obvious
 - async requires more work from the user of a module
- bootstrapping default values (e.g. for count) requires too much logic
- debugging is hard.
 - what are best practices? what mechanisms are available?
 - tracking behavior across timesteps, e.g. async rendezvous behavior
- budvis dies with multiple threads?
- how many ticks are enough?
 - for ordering-centric stuff we need to stimulate the system proactively
 - bloom is bad at this?
- hard to reason about randomness
 - e.g. choose m of n for quorum
- testing was really hard.
 - tedious to put in sync_callback_do
 - would be nice to have a declarative testing framework!
- more docs!! cheat sheet is too terse.
- using a channel in two different modules is hairy
 - have to make sure you agree on the fully-qualified name!
- localhost vs.!
 - strings for id is bogus!
# Wouldn't It Be Nice If...
- autocomplete documentation for state statements
- distinguish between protocol and implementation +1
 - generics for the language: parameterized polymorphism
- automatic schema inference, and operator chaining
- <~ on a scratch should work like <=, not be illegal
- budvis:
 - color highlighting of diffs across ticks
 - click-to-expand opens in a new window, then the next timestep reopens windows
- from rebl to interactive debugger:
 - import an entire chunk of code, modify, tick, dump, etc.
 - perhaps rebl should extend irb?
- left-side addressing to represent a subset of columns to fill in (or overwrite)
   foo[col] <= bar {|b| []}
   foo[col] <+- bar {|b| []}
   foo[col] <+- bar {|b| [nil]}
- bloom as a database/hadoop front-end??