# Lecture 11

# Cache oblivious

Recall that in the DAM/external memory model, we model a machine with a memory of size $M$ and a *disc* of unlimited size, and pay cost 1 to move $B$ consecutive items between memory and disc. In this model we described the B-tree structure, which supports insertion, deletion, and search in $O(\log_B n)$ time, and the $B^\epsilon$ tree, which improves the insertion and deletion time to $O(\frac{1}{B^{1-\epsilon}}\log_B n)$ for any $\epsilon>0$.

However, real computers have many different levels of registers, memory, address translation, and storage, with different speeds and block sizes. It would be difficult to figure out the parameters for each machine and create data structures and algorithms optimized for them. Instead, the inventors of the cache-oblivious model made a brilliant observation: if we analyze algorithms just as in the DAM model, but with $B$ and $M$ unknown to the algorithm, then the analysis is valid between any two levels of a memory hierarchy. So, that is the cache-oblivious model: the same as DAM, but the algorithm does not know $M$ or $B$.

For some algorithms, this is not a problem. For example, scanning $n$ items requires $O(\frac{n}{B})$ time in the DAM and cache-oblivious models, because the algorithm *scan* is not parameterized by $B$. However, the $B$-tree crucially must know $B$ in order to decide how big a node should be. As such, a completely different approach is required.

We now describe the *van Emde Boas* structure, which supports searching in the cache-oblivious model in the same $O(\log_B n)$ time as the B-tree, but without knowing $B$.

The structure is as follows: build a perfectly balanced binary search tree containing the data items, which has height $\log n$. Cut the tree in half height-wise, which gives trees of height $\frac{1}{2}\log n$: one is the top tree and $\approx \sqrt{n}$ are below the cut. Each of these trees has $\approx 2^{\frac{1}{2}\log n}=\sqrt{n}$ items. Recursively lay out each of these trees in the same way, so that each one occupies a contiguous range of a single array.
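The recursive layout just described can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `veb_order`, the heap-style node numbering (root 1, children $2i$ and $2i+1$), rounding the top height down when the height is odd, and placing the top tree before the bottom trees are all assumptions made here, not choices fixed by the notes.

```python
def veb_order(height, root=1):
    """List the nodes of a complete binary tree in van Emde Boas order.

    Nodes are named by heap index (root = 1, children of i are 2i and 2i + 1).
    `height` is the number of levels in the subtree rooted at `root`.
    Assumption: the top piece gets floor(height / 2) levels and is stored first.
    """
    if height == 1:
        return [root]
    top_h = height // 2
    bottom_h = height - top_h
    # Lay out the top tree first ...
    order = veb_order(top_h, root)
    # ... then each bottom tree, left to right. Their roots are the
    # descendants of `root` that sit exactly top_h levels below it.
    first = root << top_h
    for r in range(first, first + (1 << top_h)):
        order.extend(veb_order(bottom_h, r))
    return order


print(veb_order(4))
```

For a tree with 4 levels this prints `[1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]`: the top tree `1, 2, 3` comes first, followed by the four bottom trees, each stored contiguously.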

Now look at a root-to-leaf search path. At successive levels of the recursion, the path is contained in:

- The one big tree of $n$ nodes and height $\log n$
- Two recursive trees of $\sqrt{n}$ nodes of height $\frac{1}{2}\log n$
- Four recursive trees (in the second level of recursion) of $n^{\frac{1}{4}}$ nodes of height $\frac{1}{4}\log n$
- Generalizing: $2^i$ recursive trees (in the $i$th level of recursion) of $n^{\frac{1}{2^i}}$ nodes of height $\frac{1}{2^i}\log n$
- Set $i=\log \log_B n$: $2^{\log \log_B n}=\log_B n$ recursive trees (in the $i$th level of recursion) of $n^{\frac{1}{2^i}}=n^{\frac{1}{\log_B n}}=n^{\frac{\log B}{\log n}}=2^{\frac{\log B \log n}{\log n}}=2^{\log B}=B$ nodes of height $\frac{1}{2^i}\log n=\frac{\log n}{\log_B n}=\log B$

This last statement says that any search path passes through only $\log_B n$ trees, each of which is stored consecutively in memory and has size at most $B$. A tree of at most $B$ consecutive items overlaps at most two blocks, so the cost of a search in the cache-oblivious model is $O(\log_B n)$.
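To make the block-counting argument concrete, the following sketch counts how many distinct blocks a root-to-leaf path touches under the recursive layout and, for comparison, under a plain level-order (breadth-first) layout. The names `veb_order` and `worst_case_blocks`, the heap-style node numbering, and the values `HEIGHT = 16` and `B = 16` are illustrative assumptions. Note that `B` is used only to measure the cost, never to build the layout.

```python
def veb_order(height, root=1):
    """Heap indices (root = 1, children 2i and 2i + 1) in van Emde Boas order."""
    if height == 1:
        return [root]
    top_h = height // 2
    order = veb_order(top_h, root)
    first = root << top_h                      # leftmost bottom-tree root
    for r in range(first, first + (1 << top_h)):
        order.extend(veb_order(height - top_h, r))
    return order


def worst_case_blocks(position, height, B):
    """Largest number of distinct size-B blocks on any root-to-leaf path."""
    worst = 0
    for leaf in range(1 << (height - 1), 1 << height):
        blocks = set()
        node = leaf
        while node >= 1:                       # walk from the leaf up to the root
            blocks.add(position[node] // B)
            node //= 2
        worst = max(worst, len(blocks))
    return worst


HEIGHT, B = 16, 16                             # illustrative values only
n = (1 << HEIGHT) - 1

# position[node] = index of the node in the array (entry 0 is unused)
veb_pos = [0] * (n + 1)
for i, node in enumerate(veb_order(HEIGHT)):
    veb_pos[node] = i
bfs_pos = [node - 1 for node in range(n + 1)]  # level-order layout

worst_veb = worst_case_blocks(veb_pos, HEIGHT, B)
worst_bfs = worst_case_blocks(bfs_pos, HEIGHT, B)
print("van Emde Boas layout:", worst_veb, "blocks")
print("level-order layout:  ", worst_bfs, "blocks")
```

With these values every path crosses four contiguous subtrees of 15 nodes each, so the recursive layout touches at most 8 blocks, while the level-order layout touches a new block at almost every level once the levels become wider than a block.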

Observe that this structure did not use $B$ for the construction, only the analysis. The multi-level recursive nature is typical of cache-oblivious algorithms and is the main way to take advantage of many different levels of locality.

This structure does not support insertion and deletion, or the speedups we obtained for $B^\epsilon$ trees. However, structures achieving these results (and many others) have been obtained, but they are too complex to present in the limited time we have.

# Parallel Algorithms

This will be a very quick introduction to parallel algorithms. Computers are becoming increasingly parallel, with 2-30 cores being quite typical in consumer machines. GPUs on many modern computers can also run thousands of processes at the same time, but have a particular architecture that makes coding them challenging (and that varies with brand).

Here we will discuss algorithm design for parallel computers from a high-level perspective; actual parallel coding varies quite a bit among languages, and Python (which we have been using) is not the best example, as most Python implementations do not support multithreading well. In real parallel programming, you also need to worry about issues relating to locks and synchronization, which can be quite complicated to get right. Think of making our B-tree parallel. What if one thread is doing an insertion while another is searching? What if both are inserting at the same time?

The model we will be discussing is the PRAM (the *parallel RAM*) which has $p$ processors that run at the same time in lock-step and that share memory. The main parameters we will be using are:

- $T_p$: The running time with $p$ processors
- $T_1$: The *sequential* runtime, what you would get with only one processor
- $T_\infty$: The runtime with infinite processors. Often called the *span*
- $W$: The work, the sum of the amount of time each processor spends. For a given $p$, this is at most $pT_p$.

Brent's law says that if you know $T_1$ and $T_\infty$, you can get the runtime for any number of processors:

$$ T_p = O\left(T_\infty + \frac{T_1}{p} \right) $$

This makes sense, as you cannot go faster than $T_\infty$ and you cannot get more than a factor $p$ speedup over $T_1$.
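As a small sanity check of this bound, the sketch below simulates summing $n$ numbers by pairwise reduction with $p$ lock-step processors and compares the number of steps against $T_\infty + T_1/p$. The summation example and the function names `simulated_sum_steps` and `brent_bound` are assumptions introduced here for illustration; they do not appear elsewhere in these notes. For this example $T_1 = n-1$ additions and $T_\infty = \lceil \log n \rceil$ levels.

```python
def simulated_sum_steps(n, p):
    """Steps needed to sum n values by pairwise reduction on p processors.

    At each level, the remaining values are added in independent pairs;
    p processors perform at most p of those additions per step.
    """
    steps = 0
    remaining = n
    while remaining > 1:
        additions = remaining // 2             # independent additions at this level
        steps += -(-additions // p)            # ceil(additions / p)
        remaining -= additions                 # each addition removes one value
    return steps


def brent_bound(t1, t_inf, p):
    """The bound T_inf + T_1 / p, with constant factors taken to be 1."""
    return t_inf + t1 / p


n = 1 << 20
t1 = n - 1                                     # total additions (sequential time)
t_inf = (n - 1).bit_length()                   # ceil(log2 n) levels (span)
for p in (1, 4, 64, 1024, n):
    print(p, simulated_sum_steps(n, p), brent_bound(t1, t_inf, p))
```

With one processor the simulation takes exactly $T_1$ steps, with $n$ processors it takes exactly $T_\infty$ steps, and in between it stays below the bound.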

In an ideal world, we want algorithms that have the same work as the sequential algorithms, but run much faster as $p$ grows.

### Example: Sorting, take 1

Recall that mergesort is a sequential algorithm with running time $T_1=O(n \log n)$ and work $O(n \log n)$.

We assume the input is an array $A$ of size $n$ of distinct numbers, and $B$ is the destination array (indexed from 1). Using $p=n$ processors, processor $i$ counts how many items in $A$ are at most $A[i]$; call this $\ell_i$. It then runs $B[\ell_i]=A[i]$. Since the items are distinct, every processor writes to a different position, and $B$ now holds the items in sorted order.

This took time $O(n)$ with $n$ processors, so $T_n=O(n)$, which is faster than mergesort. However, the work is $O(n^2)$, which is significantly worse.
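The counting sort above can be sketched as follows. Python runs the loop over `i` sequentially, so this only simulates what the $n$ processors would do at the same time; the name `rank_sort` and the `work` counter are assumptions added here to make the $O(n^2)$ work visible.

```python
def rank_sort(A):
    """Sort distinct numbers by ranking; returns (sorted list, comparisons made).

    Each iteration of the outer loop is the job of one processor. The
    iterations are independent, so on a PRAM they would all run at once.
    """
    n = len(A)
    B = [None] * n
    work = 0
    for i in range(n):                 # "processor i"
        rank = 0
        for x in A:                    # O(n) time for this processor
            work += 1
            if x <= A[i]:
                rank += 1
        B[rank - 1] = A[i]             # ranks start at 1, Python lists at 0
    return B, work


print(rank_sort([5, 2, 9, 1]))         # ([1, 2, 5, 9], 16)
```

Each simulated processor does $n$ comparisons, so the span is $O(n)$ while the total work is $n^2$, which is 16 comparisons for the 4 items in the example.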

### Example: Merging

Suppose you have two sorted lists $A$ and $B$ of size $n/2$ and you want to merge them into a sorted list $C$. With $n$ processors, we can do something similar to the sorting example: assign a single processor to each data item and count how many items are less than or equal to it, which gives its position in the sorted list. However, as the lists are sorted, each processor can use binary search rather than sequential search. This takes time $T_n=O(\log n)$, and $O(n \log n)$ work. With $p\leq n$ processors this thus takes time:

$$ T_p = O \left( \log n + \frac{n \log n}{p} \right) $$

This is faster than normal merge when $p > \log n$, and the work is worse by a $\log n$ factor.
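A minimal sketch of this merge, using binary search from Python's standard `bisect` module, is shown below. As before, the loops only simulate the independent processors one after another, and the name `rank_merge` is an assumption made here. Using `bisect_left` for items of `A` and `bisect_right` for items of `B` is a choice made in this sketch so that equal items in the two lists still get different positions.

```python
from bisect import bisect_left, bisect_right


def rank_merge(A, B):
    """Merge two sorted lists by computing each item's final position.

    An item's position in the output is its index in its own list plus the
    number of items in the other list that precede it, found by binary search.
    Every iteration is independent, so each could run on its own processor.
    """
    C = [None] * (len(A) + len(B))
    for i, x in enumerate(A):              # one "processor" per item of A
        C[i + bisect_left(B, x)] = x       # items of B strictly smaller than x
    for j, y in enumerate(B):              # one "processor" per item of B
        C[j + bisect_right(A, y)] = y      # items of A smaller than or equal to y
    return C


print(rank_merge([1, 4, 7], [2, 3, 9]))    # [1, 2, 3, 4, 7, 9]
```

Each item costs one binary search, giving $O(\log n)$ time per processor and $O(n \log n)$ work in total.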

### Example: Sorting, take 2

Run mergesort, but with the parallel merge just described. This gives

$$ T_p = O \left( \log^2 n + \frac{n \log^2 n}{p} \right) $$

This is also faster than normal mergesort when $p > \log n$, and the work is worse by a $\log n$ factor. There are parallel sorting algorithms that have optimal $T_1 = O(n \log n)$ and $T_\infty = O(\log n)$, but they are more complicated.
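One way to see where the bound for mergesort with the parallel merge comes from is to write recurrences for its span and work. This assumes (it is not stated explicitly above) that the two recursive calls run in parallel, and that merging $n$ items has span $O(\log n)$ and work $O(n \log n)$ as in the merging example. Here $W(n)$ denotes the work on $n$ items:

$$ T_\infty(n) = T_\infty(n/2) + O(\log n) = O(\log^2 n) $$

$$ W(n) = 2\,W(n/2) + O(n \log n) = O(n \log^2 n) $$

Both recurrences have $\log n$ levels, each contributing at most $O(\log n)$ to the span and $O(n \log n)$ to the work. Substituting this span and work into Brent's law, with the work in place of $T_1$, gives the $T_p$ stated above.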




# Homework

This will be discussed in the last class (this is not real homework as you don't have time to work on it at home.)


- One complication of the PRAM model is what to do if two processors want to write to the same memory location at the same time. This could be forbidden, forbidden only if the processors attempt to write different values, or allowed with an arbitrary one of the values being written. Consider the problem of convex hull (described in class). Give algorithms to solve this that are as fast as possible (in terms of span) in each of these models.



