# Stream Processing

Analysing data in real-time
* Twitter trends
* Google Analytics
* Intrusion detection

Map-reduce is a *batch-processing* system
* High(er) latency
* Have to wait for entire dataset to complete

## Storm
* Apache project
* JVM-based
* Users: Twitter, Flipboard, Weather, WebMD

Five major keywords
* Tuples
    * Ordered list of elements
    * For example, on Twitter, a tuple may be (<tweeter, tweet>)
    * Or if we're tracking clicks (<URL, clicker-IP, date, time>)
* Streams
    * Potentially infinite sequence of tuples
* Spouts
    * Stream generator
    * Also known as a reader, crawler, watcher
    * Could generate multiple streams
* Bolts
    * Stream processor
    * Modifies stream input into new stream output
    * Most of the code you write in a Storm app
* Topology
    * A directed graph of spouts and bolts (and output bolts)
    * Makes up a Storm app
    * Cycles are allowed; avoiding infinite loops is up to the user
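
To make the five keywords concrete, here is a minimal sketch in plain Python (not Storm's API). Generators stand in for streams, and the names `click_spout`, `filter_bolt` and `count_bolt` are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

Click = Tuple[str, str]  # (url, clicker_ip): a simplified click tuple


def click_spout(events: Iterable[Click]) -> Iterator[Click]:
    """Spout: turns an external source into a stream of tuples."""
    for event in events:
        yield event


def filter_bolt(stream: Iterable[Click], prefix: str) -> Iterator[Click]:
    """Bolt: forwards only clicks whose URL starts with `prefix`."""
    for url, ip in stream:
        if url.startswith(prefix):
            yield (url, ip)


def count_bolt(stream: Iterable[Click]) -> Dict[str, int]:
    """Output bolt: consumes the stream and counts clicks per URL."""
    counts: Dict[str, int] = {}
    for url, _ip in stream:
        counts[url] = counts.get(url, 0) + 1
    return counts


# Topology: spout -> filter bolt -> count bolt
events = [("/home", "10.0.0.1"), ("/blog/a", "10.0.0.2"), ("/blog/a", "10.0.0.3")]
result = count_bolt(filter_bolt(click_spout(events), "/blog"))
```

A real stream is potentially infinite, so an output bolt would update its counts continuously instead of returning once at the end.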
    
### More on Bolts
Bolts can apply many operations to streams:

* ``Filter``: Forward only tuples which satisfy a condition
* ``Join``: When receiving two streams, output pairs which satisfy a condition (think ``zip`` in Python)
* ``Apply``/``transform``: Modify each tuple with a function

But bolts need to work fast; there might be TBs of data coming in.
So, we can parallelize bolts by splitting them into sub-tasks.
* Split incoming streams into tasks via a "Grouping Strategy"
    * Shuffle Grouping
        * Distribute tuples evenly (round-robin)
    * Fields Grouping
        * Distribute tuples using a subset of their fields
            * IP addresses below 192.xxx.xxx.xxx go *here*, above go *there*
    * All Grouping
        * All tasks receive all input tuples
        * Good for Joins, where you combine with another stream
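
The three grouping strategies can be sketched as plain Python functions that split a list of tuples across task queues. This is only an illustration, not Storm code; `by_first_octet` is a hypothetical routing rule that mirrors the IP example above.

```python
from itertools import cycle


def shuffle_grouping(tuples, num_tasks):
    """Round-robin: tuples are dealt out to the tasks in turn."""
    tasks = [[] for _ in range(num_tasks)]
    for task, tup in zip(cycle(tasks), tuples):
        task.append(tup)
    return tasks


def fields_grouping(tuples, num_tasks, task_for):
    """Route each tuple by a function of (a subset of) its fields."""
    tasks = [[] for _ in range(num_tasks)]
    for tup in tuples:
        tasks[task_for(tup) % num_tasks].append(tup)
    return tasks


def all_grouping(tuples, num_tasks):
    """Every task receives a copy of every tuple."""
    tuples = list(tuples)
    return [list(tuples) for _ in range(num_tasks)]


def by_first_octet(click):
    """Hypothetical rule: IPs below 192.x.x.x go to task 0, the rest to task 1."""
    _url, ip = click
    return 0 if int(ip.split(".")[0]) < 192 else 1


clicks = [("/a", "10.0.0.1"), ("/b", "200.1.1.1"), ("/c", "10.0.0.2"), ("/d", "192.168.0.1")]
shuffled = shuffle_grouping(clicks, 2)
by_ip = fields_grouping(clicks, 2, by_first_octet)
broadcast = all_grouping(clicks, 2)
```

With fields grouping, tuples that share the routing field always reach the same task, which is what lets a task keep per-key state such as counts.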

### Storm Cluster
* Master node
    * Elected via a leader election protocol
    * Runs a daemon known as *Nimbus*
        * Distributes code around cluster
        * Failure detection
        * Assigns tasks to machines
* Worker
    * Runs on a machine
    * Runs a *supervisor* daemon
    * Listens for work assigned to its machine
* Zookeeper
    * Coordinates Nimbus and supervisors
    * Backs up states of servers
    
#### Node Failures
* A tuple has *failed* when its topology of resulting tuples fails to be fully processed within a timeframe
    * Implies a sub-tree of the overall topology, originating from a root tuple
* __Anchoring__: Map an output tuple to one or more input tuples. If the output tuple is not received, replay the input tuples.
    * This code lives in ``OutputCollector``
    * ``Emit``: Emit an output tuple
    * ``Ack``: Acknowledge you finished processing a tuple
    * ``Fail``: Immediately fail the spout tuple at the root of the tuple topology. Might do this if there was a database exception, etc.
        * Must ``ack``/``fail`` every tuple. Otherwise you might end up with memory leaks.
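
The bookkeeping behind anchoring can be sketched as below. This is a simplified model, not Storm's `OutputCollector`: the class `TupleTreeTracker` and its methods are hypothetical, and the timeout is left out.

```python
class TupleTreeTracker:
    """Tracks, per root (spout) tuple, which descendant tuples are still pending."""

    def __init__(self):
        self.pending = {}  # root_id -> set of tuple ids not yet acked
        self.replay = []   # root ids the spout must emit again

    def emit_root(self, root_id):
        self.pending[root_id] = {root_id}

    def emit(self, root_id, new_id):
        """Anchor a new tuple to the tree rooted at root_id."""
        if root_id in self.pending:
            self.pending[root_id].add(new_id)

    def ack(self, root_id, tuple_id):
        """Mark one tuple done; returns True once the whole tree is done."""
        ids = self.pending.get(root_id)
        if ids is None:
            return False
        ids.discard(tuple_id)
        if not ids:
            del self.pending[root_id]  # forgetting finished trees avoids a leak
            return True
        return False

    def fail(self, root_id):
        """Fail the whole tree immediately and schedule the root for replay."""
        if self.pending.pop(root_id, None) is not None:
            self.replay.append(root_id)


tracker = TupleTreeTracker()
tracker.emit_root("r1")
tracker.emit("r1", "a")                  # a bolt emits "a" anchored to "r1" ...
first_ack = tracker.ack("r1", "r1")      # ... and only then acks its input
second_ack = tracker.ack("r1", "a")
tracker.emit_root("r2")
tracker.fail("r2")                       # e.g. after a database exception
```

A tuple that is never acked or failed stays in `pending` forever, which is the memory leak mentioned above.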

## Distributed Graph Processing

__Sample Systems__
* Google's Pregel system
* Piccolo, Giraph, GraphLab, PowerGraph, LFGraph, X-Stream

#### What is a Graph?
* A "network"
* Take Facebook
    * Vertices (or nodes) would be users
    * Edges would be friend relationships
* Take the WWW:
    * Vertices: Web pages
    * Edges: URL Links
* "Directed" graphs have uni-directional edges
* "Undirected" graphs have bi-directional edges
    * Social networks tend to be this
    
#### Why do we need to process Graphs?
* Need to analyse graphs to derive properties
    * Shortest paths
    * Matching
    
### Prototypical Graph Processing Algorithm

This is for a non-distributed system.

* Works in *iterations*
* Assign each vertex a *value*
* For each iteration, each vertex
    * Gather values from immediate neighbors (one hop via an edge)
    * Perform computation
    * Update its value and broadcast new value to neighbors
* Terminate after either
    1. A fixed number of iterations
    1. Vertex values stop changing
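
The loop above can be sketched on a single machine as follows. The function name `iterate_graph` and the example (labelling connected components by the smallest vertex ID) are illustrative choices, not part of the original notes.

```python
def iterate_graph(neighbors, values, compute, max_iterations=100):
    """Run gather/compute/update rounds until values stop changing.

    neighbors: dict vertex -> list of adjacent vertices
    values:    dict vertex -> initial value
    compute:   function(own_value, neighbor_values) -> new value
    """
    for iteration in range(max_iterations):
        new_values = {
            v: compute(values[v], [values[n] for n in neighbors[v]])
            for v in neighbors
        }
        if new_values == values:   # vertex values stopped changing
            return values, iteration
        values = new_values
    return values, max_iterations  # fixed number of iterations reached


# Example: every vertex ends up with the smallest ID in its component.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, rounds = iterate_graph(
    graph,
    {v: v for v in graph},
    lambda own, gathered: min([own] + gathered),
)
```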
    
#### Distributed

We could use Hadoop (Map-Reduce):

* Each stage would be 1 graph iteration, so there are as many map-reduce runs as there are iterations
* Assign vertex IDs as keys in the reduce phase
* Actually very slow: must transfer vertices over the network and write all these values to HDFS
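
A rough in-memory sketch of the map-reduce formulation, assuming each vertex sends its value to itself and to its neighbors. The function names are hypothetical; in a real Hadoop job the output of every iteration would be written to HDFS and read back, which is where the cost comes from.

```python
from collections import defaultdict


def map_phase(neighbors, values):
    """Map: emit (receiver vertex ID, value) pairs."""
    for v, value in values.items():
        yield v, value              # keep the vertex's own value
        for n in neighbors[v]:
            yield n, value          # send the value to each neighbor


def reduce_phase(pairs, combine):
    """Reduce: group by vertex ID (the key), then combine the received values."""
    grouped = defaultdict(list)
    for vertex_id, value in pairs:
        grouped[vertex_id].append(value)
    return {vertex_id: combine(vals) for vertex_id, vals in grouped.items()}


def run_iterations(neighbors, values, combine, iterations):
    for _ in range(iterations):     # one map-reduce run per graph iteration
        values = reduce_phase(map_phase(neighbors, values), combine)
    return values


chain = {1: [2], 2: [1, 3], 3: [2]}
smallest = run_iterations(chain, {v: v for v in chain}, min, iterations=2)
```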

### Bulk Synchronous Parallel Model
Each vertex is computed by a separate processor
At the end of computation on each vertex, wait for other vertices to finish, then proceed as a whole to the next stage.

1. Assign each vertex to one server; thus, each server has many vertices
1. For each iteration,
    1. Gather
        * Get all neighboring vertices' values
    1. Apply
        * Compute new value
    1. Scatter
        * Re-distribute new value
        
Now, the locality of neighboring vertices plays a role, as we have to communicate over the network during the Gather and Scatter phases.
* Hash-based assignment
    * hash(Vertex ID) % # of servers 
        * Much like Chord
* Locality-based
    * Assign vertices with many neighbors to the same server as their neighbors
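
A small sketch of why the assignment matters: count the edges whose endpoints land on different servers, since those are the ones that cost network traffic during Gather and Scatter. It assumes integer vertex IDs (Python's `hash` of a small non-negative int is the int itself), and the locality-based assignment is simply chosen by hand.

```python
def hash_assign(vertex_id, num_servers):
    """Hash-based assignment: hash(vertex ID) % number of servers."""
    return hash(vertex_id) % num_servers


def cut_edges(edges, assignment):
    """Count edges whose two endpoints live on different servers."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Two triangles: vertices 0-1-2 and 3-4-5.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
hashed = {v: hash_assign(v, 2) for v in range(6)}
local = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # one triangle per server

hashed_cut = cut_edges(edges, hashed)
local_cut = cut_edges(edges, local)
```

Hash-based assignment is cheap and balanced but ignores the graph's structure; here it cuts four of the six edges, while the hand-picked locality-based assignment cuts none.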
        
### Pregel
* Pregel uses the master/worker model
    * Master
        * Monitors worker servers
        * Maintains membership list of worker servers
        * Has Web UI
    * Worker
        * Runs Gather-Apply-Scatter
* Uses Google File System or BigTable
* Temp data stored on disk

#### Execution
1. Many copies of program begin executing on cluster
1. Master assigns a partition of vertices to each worker
1. Master tells all workers to perform one iteration
1. Master waits for all workers to finish before initiating next iteration
1. Computation halts once all vertices are inactive and no messages are in transit
1. Master instructs each worker to save its portion of the graph.
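
The iteration and halting rule can be sketched on one machine with single-source shortest path as the vertex program. This is a toy model in Python, not Pregel's actual API; `pregel_sssp` and its message format are assumptions made for illustration.

```python
import math


def pregel_sssp(edges, source):
    """edges: dict vertex -> list of (neighbor, weight); every vertex is a key."""
    distance = {v: math.inf for v in edges}
    inbox = {v: [] for v in edges}
    inbox[source] = [0]                  # initial message activates the source
    supersteps = 0

    # Halt once no messages are in transit; every vertex is inactive by then.
    while any(inbox.values()):
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():
            if not messages:
                continue                 # inactive vertex: no work this iteration
            best = min(messages)
            if best < distance[v]:
                distance[v] = best
                for neighbor, weight in edges[v]:
                    outbox[neighbor].append(best + weight)
            # the vertex now goes inactive until a new message arrives
        inbox = outbox                   # barrier: deliver messages next iteration
        supersteps += 1
    return distance, supersteps


roads = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
distances, steps = pregel_sssp(roads, "a")
```

In the real system the inner `for` loop is what runs in parallel across workers, and the master enforces the barrier between iterations.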

#### Fault-Tolerance
1. Checkpointing
    * Workers periodically snapshot their partitions to persistent storage
1. Failure Detection
    * Ping messages from master->worker
1. Recovery
    * Master reassigns graph partitions to currently available workers
    * Workers all reload their partition state from last checkpoint
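
The checkpoint-and-recover cycle might look like the toy model below. Everything here is hypothetical: `ToyCluster` keeps all partition state in one object for brevity, and a deep copy in memory stands in for persistent storage.

```python
import copy


class ToyCluster:
    def __init__(self, partitions, workers):
        self.state = copy.deepcopy(partitions)        # partition id -> vertex values
        self.checkpoint = copy.deepcopy(partitions)   # stand-in for persistent storage
        self.workers = list(workers)
        self.owner = {}
        self.assign()

    def assign(self):
        """Spread partitions over the currently available workers."""
        for i, pid in enumerate(sorted(self.state)):
            self.owner[pid] = self.workers[i % len(self.workers)]

    def save_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)

    def recover(self, failed_worker):
        """Reassign partitions, then roll every partition back to the checkpoint."""
        self.workers.remove(failed_worker)
        self.assign()
        self.state = copy.deepcopy(self.checkpoint)


cluster = ToyCluster({0: {"a": 0}, 1: {"b": 0}}, ["w1", "w2"])
cluster.state[0]["a"] = 5
cluster.save_checkpoint()
cluster.state[0]["a"] = 9        # progress made after the checkpoint ...
cluster.state[1]["b"] = 7
cluster.recover("w2")            # ... is lost when a worker fails
```

Note that all partitions roll back, not just the failed worker's, so that every vertex resumes from the same iteration.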
    
#### Performance
* Single-Source Shortest Path
    * 1 B vertices
        * 50 workers = 180 seconds
        * 800 workers = 20 seconds
    * 50 B vertices
        * 800 workers = 700 seconds (< 12 minutes!)

## Network Structure

Network == Graph (synonyms)

Graphs have __nodes__, or __vertices__.
__Edges__ connect nodes.


### Complexity
1. Structural
    * Size and relations between nodes
1. Evolution
    * Churn and change in networks
1. Diversity
    * Variance in edges per node
    * Variance in node *weight* and edge *cost*
    * Some people are more popular, some friendships are more important
1. Node Complexity
    * Attribute and schema differences of nodes
1. Emergent Phenomena
    * Simple end behavior becomes complex system behavior
    * Butterfly effect
    
__Small-world networks__: Any node can be reached from any other in a small number of hops
* Why? Networks _evolve naturally_ from a starting nucleus

### Characterizing Networks
1. Clustering Coefficient (CC)
    * Given three vertices, A, B, C, an A-C edge, and a C-B edge, what is the probability there is an edge A-B?
    * Tree networks have a CC of 0 - no three nodes are all directly connected
    * Complete graph (every vertex connected to every other vertex) has a CC of 1.0
    * Random Graph: Low CC, Short Paths
        * ![](img/random_graph.png)
    * Extended Ring Graph: High CC, Long Paths
        * ![](img/extended_ring_graph.png)
    * Small World Network: High CC, Short Paths
1. Path Length
    * Between any pair of vertices in the graph, what is the shortest path between them?
    * Calculate path lengths for every pair of nodes and take the average = average path length
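
Both measures can be computed directly from an adjacency dict. The sketch below uses the definition given above for CC (the fraction of A-C-B triples closed by an A-B edge) and breadth-first search for path lengths; the function names are illustrative.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj):
    """Fraction of connected triples A-C-B that are closed by an A-B edge."""
    triples = closed = 0
    for c in adj:
        for a, b in combinations(adj[c], 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


def average_path_length(adj):
    """Mean BFS distance over all ordered pairs of distinct, connected vertices."""
    total = pairs = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


tree = {1: {2, 3}, 2: {1}, 3: {1}}             # CC of 0
complete = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}   # CC of 1.0
```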

You can convert an extended ring or a random graph into a small-world network
![](img/small_world_chart.png)

Most "naturally evolved" networks are small-world.
Thus, they also grow incrementally
* Preferential model of growth
    * When adding a vertex `u` to a graph `G`, connect it to existing vertex `v` with a probability proportional to `|v.neighbors|`
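
A minimal sketch of preferential growth, assuming the graph starts from a single edge and each new vertex attaches to exactly one existing vertex. Both assumptions, and the name `grow_preferential`, are simplifications added for illustration.

```python
import random


def grow_preferential(num_vertices, seed=0):
    """Grow a graph one vertex at a time, favouring high-degree vertices."""
    rng = random.Random(seed)
    neighbors = {0: {1}, 1: {0}}      # starting nucleus: a single edge
    for u in range(2, num_vertices):
        existing = list(neighbors)
        weights = [len(neighbors[v]) for v in existing]  # proportional to |v.neighbors|
        v = rng.choices(existing, weights=weights, k=1)[0]
        neighbors[u] = {v}
        neighbors[v].add(u)
    return neighbors


grown = grow_preferential(50)
degrees = sorted((len(nbrs) for nbrs in grown.values()), reverse=True)
```

Vertices that join early tend to collect more edges over time, since every new edge they receive raises their chance of being picked again.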
    
#### Degrees
__Degree of a vertex__: # of immediate neighbor vertices

__Degree Distribution__: Probability of a given node having `k` edges

* Regular graphs: all nodes have the same degree
* Many distributions can emerge: Gaussian, Random, Power law
![Power Law Graphs](img/power_law_graphs.png)

As you can see, graphs following a power law have a few nodes with many neighbors.
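
The degree distribution is easy to compute from an adjacency dict. In this sketch the `star` and `ring` graphs are made-up examples: the star has one node with many neighbors, the ring is a regular graph.

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: P(degree == k)} for an adjacency dict vertex -> set of neighbors."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}


star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}

star_dist = degree_distribution(star)
ring_dist = degree_distribution(ring)
```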

Examples of power law & small-world graphs:

* Telephone call graph, protein networks??
* WWW is small-world & power-law with $\alpha$ between $2.1$ and $2.4$

_But_, power-law != small-world

#### Resilience of Small-World Graphs
* Kill a large random selection of nodes = graph is likely to stay connected
* Kill a few high-degree nodes = likely to disconnect graph
    * Body relies on a few key nutrients as building blocks for many other nutrients
    * Certain key cities route large amounts of electricity in the grid

* But you don't have to kill a high-degree node; you could also overload it with shortest-path selections.
    * Solution (possibly): Sometimes take a random path
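
The two failure cases can be checked on a toy hub-and-spoke graph. The graph and the helper `is_connected` are illustrative only; a real small-world graph is far less extreme than a star.

```python
def is_connected(adj, removed=frozenset()):
    """True if the vertices that survive `removed` still form one component."""
    alive = [v for v in adj if v not in removed]
    if not alive:
        return True
    seen = {alive[0]}
    stack = [alive[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(alive)


# One high-degree hub (0) and five low-degree nodes.
hub_graph = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}

survives_leaf_loss = is_connected(hub_graph, removed={1, 2, 3})
survives_hub_loss = is_connected(hub_graph, removed={0})
```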

## Big Ideas

* Parallelism
    * Storm Bolts run many tasks in parallel
    * Pregel uses many workers, coordinated by a master
* Topologies (Directed Graphs)
* Fault-tolerance