# Algorithms for Massive Datasets



Hadoop is the first open-source implementation of a distributed file system based on the Google File System.

A tabular file index records where each part (chunk) of a file is stored.
Two replicas of each chunk are kept in one rack (fast search and access within the same rack) and one in another rack.

After the replicas of the chunks have been created, the main node updates the metadata for the file, and the file becomes available in the distributed file system.


## Map-Reduce
A file is divided into chunks, and each chunk is divided into records (or lines).

The map function operates on single records.

Example (word count):
$$
(k, l) \xrightarrow{MAP} (w, 1) \quad \forall w \in l.\text{split}()
$$
The Master Node Controller supervises all operations.

$$
(w, S) \xrightarrow{REDUCE} (w, \text{len}(S)), \qquad S := (1,1,\dots, 1)
$$
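
The word-count example above can be simulated on one machine. This is a minimal sketch, not Hadoop code: `run_mapreduce`, `map_words` and `reduce_counts` are hypothetical names, and the shuffle step is imitated with an in-memory dictionary.

```python
from collections import defaultdict


def map_words(key, line):
    # The key (e.g. a line offset) is ignored; emit (w, 1) for each word.
    for w in line.split():
        yield (w, 1)


def reduce_counts(word, ones):
    # ones is the list S = (1, 1, ..., 1); its length is the word count.
    yield (word, len(ones))


def run_mapreduce(records, mapper, reducer):
    # Hypothetical single-machine driver: map, group by key, then reduce.
    groups = defaultdict(list)
    for k, v in records:
        for out_k, out_v in mapper(k, v):
            groups[out_k].append(out_v)
    result = []
    for k, values in groups.items():
        result.extend(reducer(k, values))
    return result


lines = [(0, "a b a"), (1, "b c")]
counts = dict(run_mapreduce(lines, map_words, reduce_counts))
```

In a real cluster the grouping is done by the framework between the map and reduce phases; here the dictionary plays that role.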



## Relational Algebra
Abstract: The mathematical formulation of SQL queries.

Relation (represents a table):
$$
    R(A,B) \subseteq A \times B
$$

* Selection: $\sigma_C(R)$, where $C$ is a boolean criterion
$$
    \forall t\in R \xrightarrow{MAP} (t,t) \text{ if } C(t) \\
    (t, (t))\xrightarrow{REDUCE} (t,t)
$$

* Projection: $\pi_S(R)$, where $S$ is a set of attributes and $t'$ is $t$ restricted to the attributes in $S$
$$
    \forall t\in R \xrightarrow{MAP} (t',t') \\
    (t', (t',\dots, t')) \xrightarrow{REDUCE} (t',t')
$$
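
Selection and projection can be sketched as follows. The code assumes a relation is a list of Python tuples and that attributes are addressed by position; `run_mapreduce`, `select` and `project` are hypothetical helper names. For brevity the reducers emit `t` rather than the pair `(t, t)`.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    # Hypothetical single-machine driver: map, group by key, then reduce.
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    out = []
    for k, values in groups.items():
        out.extend(reducer(k, values))
    return out


def select(relation, criterion):
    def mapper(t):
        if criterion(t):      # emit (t, t) only if C(t) holds
            yield (t, t)

    def reducer(t, values):
        yield t               # identity reduce

    return run_mapreduce(relation, mapper, reducer)


def project(relation, indices):
    def mapper(t):
        t_prime = tuple(t[i] for i in indices)   # keep attributes in S
        yield (t_prime, t_prime)

    def reducer(t_prime, values):
        yield t_prime         # duplicates of t' collapse into one output

    return run_mapreduce(relation, mapper, reducer)


R = [(1, "x"), (2, "y"), (3, "x")]
selected = select(R, lambda t: t[0] >= 2)
projected = project(R, [1])
```

The reduce step matters only for projection: several tuples of $R$ may map to the same $t'$, and grouping by key removes the duplicates.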

* Set operations: union, intersection, set difference.

Union:
$$
    \forall t\in R \xrightarrow{MAP} (t,t)\; , \qquad \forall t\in S \xrightarrow{MAP} (t,t) \\
    (t, (t)) \xrightarrow{REDUCE} (t,t)\\
    (t, (t,t))\xrightarrow{REDUCE} (t,t)
$$
Intersection:
$$
    \forall t\in R \xrightarrow{MAP} (t,t)\; , \qquad \forall t\in S \xrightarrow{MAP} (t,t) \\
    (t, (t)) \xrightarrow{REDUCE} \emptyset\\
    (t, (t,t))\xrightarrow{REDUCE} (t,t)
$$
Set difference ($R \setminus S$):
$$
    \forall t\in R \xrightarrow{MAP} (t, 'R')\; , \qquad \forall t\in S \xrightarrow{MAP} (t,'S') \\
    (t, ('R')) \xrightarrow{REDUCE} (t,t)\\
    (t, ('S')) \xrightarrow{REDUCE} \emptyset \\
    (t, ('R','S')) \xrightarrow{REDUCE} \emptyset
$$
* Grouping and aggregation: $\gamma_{A, \theta(B)}(R)$, where $\theta$ is an aggregation function (e.g. sum or max)
$$
    \forall (a,b)\in R \xrightarrow{MAP} (a,b)\\
    (a,(b_1,\dots,b_n)) \xrightarrow{REDUCE} (a, \theta(b_1,\dots,b_n))
$$
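
A minimal sketch of grouping and aggregation, assuming $R$ is a list of `(a, b)` pairs and $\theta$ is any Python function over a list; `group_aggregate` is a hypothetical name.

```python
from collections import defaultdict


def group_aggregate(relation, theta):
    groups = defaultdict(list)
    for a, b in relation:
        groups[a].append(b)        # map: (a, b) -> (a, b), grouped by a
    # reduce: (a, (b_1, ..., b_n)) -> (a, theta(b_1, ..., b_n))
    return {a: theta(bs) for a, bs in groups.items()}


R = [("x", 1), ("y", 5), ("x", 3)]
totals = group_aggregate(R, sum)
maxima = group_aggregate(R, max)
```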
* Joins: $R(A,B) \bowtie S(B,C)$
$$
    \forall (a,b)\in R \xrightarrow{MAP_R} (b,(a, 'R')) \\
    \forall (b,c)\in S \xrightarrow{MAP_S} (b, (c, 'S')) \\
    (b, l) \xrightarrow{REDUCE} \{(a,b,c)\}, \qquad l := ((a_1,'R'),(c_1,'S'),(c_2,'S'),(a_2,'R'),\dots)
$$
The reduce step works as follows:
1. sort $l$ using the 2nd element (the relation tag) as primary key
2. $\alpha$: list of elements from $R$
3. $\beta$: list of elements from $S$
4. $\forall (a,c) \in \alpha\times\beta$
5. output: $(a,b,c)$
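
The join steps above can be sketched as follows. The code assumes $R$ and $S$ are lists of pairs; instead of sorting $l$ by tag, it filters the list twice, which yields the same lists $\alpha$ and $\beta$. `natural_join` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product


def natural_join(R, S):
    # Map phase plus shuffle: key every tuple by its B value and tag it.
    groups = defaultdict(list)
    for a, b in R:
        groups[b].append((a, "R"))
    for b, c in S:
        groups[b].append((c, "S"))

    out = []
    for b, values in groups.items():
        alpha = [v for v, tag in values if tag == "R"]   # elements from R
        beta = [v for v, tag in values if tag == "S"]    # elements from S
        for a, c in product(alpha, beta):                # alpha x beta
            out.append((a, b, c))
    return out


R = [(1, "k"), (2, "k"), (3, "m")]
S = [("k", "u"), ("k", "v"), ("n", "w")]
joined = natural_join(R, S)
```

Keys that appear in only one relation (`"m"` and `"n"` here) produce an empty product and therefore no output.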

### Product of matrices
$$
    A=[a_{ik}]_{m\times n}, \qquad B=[b_{kj}]_{n\times o}, \qquad P=AB=[p_{ij}]_{m\times o}, \quad p_{ij} =\sum_{k=1}^n a_{ik}b_{kj}\\
    (i, k, a_{ik}) \in A(I,K,V)\\
    (k, j, b_{kj}) \in B(K,J,W)\\
    (i,k,j, a_{ik}, b_{kj}) \in A\bowtie B
$$
First method:
$$
    \forall (i,k,a_{ik}) \in A \xrightarrow{M_A} (k, (A,i,a_{ik}))\\
    \forall (k,j,b_{kj}) \in B \xrightarrow{M_B} (k, (B,j,b_{kj}))\\
    (k, ((A,1,a_{1k}), \dots, (A,m,a_{mk}), (B,1,b_{k1}), \dots, (B,o,b_{ko}))) \xrightarrow{R} ((i,j), a_{ik} b_{kj}) \quad \forall i, j
$$
A second map-reduce pass sums the products $a_{ik} b_{kj}$ to obtain $p_{ij}$:
$$
    ((i,j), a_{ik} b_{kj}) \xrightarrow{M} ((i,j), a_{ik} b_{kj})\\
    ((i,j), (a_{i1} b_{1j}, \dots, a_{in} b_{nj})) \xrightarrow{R} ((i,j), p_{ij})
$$
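
The two-pass method can be sketched as follows. The code assumes each matrix is given as a list of `(row, column, value)` triples with zero-based indices (the formulas above are one-based); `matmul_two_pass` is a hypothetical name.

```python
from collections import defaultdict


def matmul_two_pass(A, B):
    # Pass 1, map + shuffle: group entries of A and B by the shared index k.
    by_k = defaultdict(lambda: ([], []))
    for i, k, a in A:
        by_k[k][0].append((i, a))
    for k, j, b in B:
        by_k[k][1].append((j, b))

    # Pass 1, reduce: emit ((i, j), a_ik * b_kj) for every pair in the group.
    products = []
    for a_side, b_side in by_k.values():
        for i, a in a_side:
            for j, b in b_side:
                products.append(((i, j), a * b))

    # Pass 2: identity map, group by (i, j), reduce by summing.
    P = defaultdict(int)
    for key, value in products:
        P[key] += value
    return dict(P)


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as triples.
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
P = matmul_two_pass(A, B)
```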

Alternative method using a single map-reduce process:
$$
    \forall (i,k,a_{ik}) \in A \xrightarrow{M_A} ((i,j), (A, k, a_{ik})) \quad \forall j=1,\dots, o \\
    \forall (k,j,b_{kj}) \in B \xrightarrow{M_B} ((i,j), (B, k, b_{kj})) \quad \forall i=1,\dots, m \\
    ((i,j), ((A, 1, a_{i1}),\dots, (A, n, a_{in}), (B, 1, b_{1j}), \dots, (B, n, b_{nj}))) \xrightarrow{R} ((i,j), p_{ij})
$$
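
A sketch of the single-pass method, under the same assumptions as before: matrices are lists of `(row, column, value)` triples with zero-based indices, and the output dimensions `m` and `o` are passed explicitly. `matmul_one_pass` is a hypothetical name.

```python
from collections import defaultdict


def matmul_one_pass(A, B, m, o):
    groups = defaultdict(list)
    # M_A: replicate each a_ik to every output cell (i, j), j = 0..o-1.
    for i, k, a in A:
        for j in range(o):
            groups[(i, j)].append(("A", k, a))
    # M_B: replicate each b_kj to every output cell (i, j), i = 0..m-1.
    for k, j, b in B:
        for i in range(m):
            groups[(i, j)].append(("B", k, b))

    # Reduce: pair the A and B entries on k and sum the products.
    P = {}
    for cell, values in groups.items():
        a_by_k = {k: v for tag, k, v in values if tag == "A"}
        b_by_k = {k: v for tag, k, v in values if tag == "B"}
        P[cell] = sum(a * b_by_k[k] for k, a in a_by_k.items() if k in b_by_k)
    return P


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as triples.
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
P = matmul_one_pass(A, B, m=2, o=2)
```

The single pass saves one round of map-reduce at the cost of more communication: each entry of $A$ is sent $o$ times and each entry of $B$ is sent $m$ times.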