## The Cache Hierarchy

* Memory is an abstraction
  * looks to the processor like a 1-d address space of data locations
  * uniform access from all cores/processors
* Actually a steep hierarchy of caches in which different levels have different:
    * Performance
    * Capacity
    * Sharing
  
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*42rOo-Rl8seCDV5cZafUNA.png" width=512 title="Cache Hierarchy" />

Image from https://3000rain.medium.com/memory-hierarchy-afb83b61558c

* Caches are a place to store a smaller amount of data that is frequently/recently used to make data access faster.
  * Processor caches (on chip) are caches for memory.  Managed by hardware.
  * Memory (DRAM) is a cache for pages from disk.  Managed by a storage system (database, file system).
  * Management refers to the process of loading and evicting the contents in response to workload.
  
### The Hierarchy

<img src="http://www.imexresearch.com/newsletters/images/201009_SSDImages/20100913_SSD_0000.png" width=512 title="IMEX Data on Latency and Cost" />

<img src="https://eda360insider.files.wordpress.com/2012/05/wegener-1.gif?w=1400" width=512 title="Cache latency and granularity" />

### Latency

Delays (in clock cycles) to different levels in the cache hierarchy for an i7 (Nehalem, 2008).
 * $1$ cycle to registers (private to each core)
 * $1$ cycle to L1 (private to each core)
 * $4$ cycles to L2 (private to each core)
 * $35$ cycles to L3 (shared by cores)
 * $145$ cycles to memory (shared by processors)
 * $10^5$ cycles to NVRAM
 * $10^7$ cycles to magnetic disk

_Data Loading_: New data that has not been used must be loaded from SSD, disk, or memory.

_Data Sharing_: When two threads need to share data, they incur the cost of transferring data through the fastest shared cache.
  * 2 cores on the same processor take 70 cycles (35 to write to L3 and 35 to read from L3)
  * 2 processors take 290 cycles
  
The following figure is almost right. SMP should really say something like QPI (QuickPath Interconnect). It is helpful for visualizing sharing between cores in L2 and between processors in L3.

<img src="https://www.enterpriseai.news/wp-content/uploads/2014/06/shared-memory-cluster-story-1-processor-cluster.jpg" width=512 title="NUMA schematic from EnterpriseAI" />

This sharing results in _interference_ between threads or processes that share data, as in OpenMP and other threaded programs.  This is the major source of lost parallelism in these programming models.


**Cache examples on blackboard.**



### Processor Caching Concepts

The memory system should be thought of as a vectorized parallel system.  Whenever you 
get data, you get many words of data.  To get good memory throughput, you must 
use all that data.  Most important to understanding cache performance are:
* __cache line__: data are moved among levels in the cache one line at a time
  * 64 or 128 bytes are typical values for L1 or L2 (64 bytes on most x86 processors)
  * each access is a parallel load of an entire line
  * good parallel programs will use all 64 or 128 bytes of each line they load
* __shared vs. private__: whether the cache is shared (among cores or processors). Note that _unified_ more commonly describes a cache that holds both instructions and data.

Other concepts that don't matter as much.
* __inclusive vs exclusive__: has implications for hardware management policies.  We don't care.
  * __inclusive__: data in higher level caches are also in lower level caches
  * __exclusive__: data in higher level caches are not in lower level caches
* __associativity__: the number of hardware locations that a cache line can go into
  * important for HW design.  We typically don't care.

### Memory Access Patterns

* You can’t just access memory
  * Different memory access patterns result in large performance differences for the same computation
* Worry about:
  * Parallelism: am I using all the data in a cache line?
    * To access a single byte, one must load a whole line
    * Sequential access to memory is always parallel!
  * Sharing/reuse: is my program referencing data in the cache more than once?  At what levels?

Good memory access patterns are __aligned, sequential__ and __coalesced__.
  * Aligned – access range starts/ends on cache line boundaries
  * Sequential – a continuous range of bytes
  * Coalesced – combine multiple small accesses into fewer large accesses


* For good memory performance in looping programs
  * choose an iteration order that is sequential in memory
  * align data
      * use addresses that are 0 modulo 128 or 256
      * assume that large memory allocations are sequential
      * there are specific interfaces to allocate aligned memory (not portable)
      
      
#### Row versus Column Example

Example in [row_column.c](./examples/openmp/row_column.c)

* Nested loops are a good example
  * Row versus column order can make a big difference.
  * Think of memory as reading a sequential cache line at a time
  
<img src="./images/rowvcol.png" width=512 title="http://akira.ruc.dk/~keld/teaching/IPDC_f10/Slides/pdf4x/4_Performance.4x.pdf" />

* Reading data a row at a time results in sequential access of all elements.  
* Reading successive elements in a column results in strided access.
    * One element is accessed for every row's worth of data.
    
    
We will consider the following two snippets. In the first snippet, $x$ is the inside loop and varies fastest. One element of data is accessed for every $DIM$ elements.
    
```
    for (int y=0; y<DIM; y++) {
        for (int x=0; x<DIM; x++) {
            array[x*DIM+y] = (double)rand()/RAND_MAX;
        }        
    }
```

In this example $y$ is the inside loop and varies fastest. Elements are accessed sequentially in memory.

```
    for (int x=0; x<DIM; x++) {
        for (int y=0; y<DIM; y++) {
            array[x*DIM+y] = (double)rand()/RAND_MAX;
        }        
    }
```

### 2-d array conventions and programming languages
 
Programming languages that use 2-d array indexing use one of two conventions to serialize array elements to memory.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Row_and_column_major_order.svg/800px-Row_and_column_major_order.svg.png" width=256 title="Row-major versus column-major" />

* Row major systems include Python, C, and C++.
* Column major systems include R, Fortran, and image formats.

You must be careful when using 2-d indexing.  Class thought exercise. _Write a nested for loop in Python that accesses elements `ar[row][col]` in sequential order._


### False Sharing

Example in [sharing.c](./examples/openmp/sharing.c)

There is a good, simple treatment at this [blog](https://haryachyy.wordpress.com/2018/06/19/learning-dpdk-avoid-false-sharing/).  The following diagram is theirs.

<img src="https://haryachyy.wordpress.com/wp-content/uploads/2018/06/false-sharing-illustration.png" width=384 title="False sharing" />

The key thing to understand is that the processors/cores need to exchange synchronization events through the processor or main memory system whenever one of them writes to a cache line that another holds. Another developer [blog](https://learn.microsoft.com/en-us/archive/msdn-magazine/2008/october/net-matters-false-sharing) makes this point.

<img src="https://learn.microsoft.com/en-us/archive/msdn-magazine/2008/october/images/cc872851.figo2(en-us).gif" width=384 title="False sharing" />


The examples:
  * each thread writes to the same variable (sharing)
  * each thread writes to adjacent variables (false sharing)
  * each thread writes to a different region (no sharing)

These examples are just about memory access patterns. We will look at reducers in the next lecture as a way to solve shared write performance.

### LMBench: Understanding Cache Misses

LMBench is a suite of performance benchmarking tools written by Carl Staelin and Larry McVoy in 1996. The strided access benchmark still provides the best insight into the structure of cache latencies.  The experiment does the following:

> Access a single byte at 128 byte strides (i.e. 0, 128, 256, 384, ...) for an array of a specified size.  Loop over the array multiple times to amortize any initial load costs.

For arrays that fit into:
  * L1 cache: the L1 cache contains the entire array and each byte can be accessed in a single clock cycle.
  * L2 cache but not L1 cache: every byte access transfers a line from the L2 cache to the L1 cache (lines are 128 bytes). Each access occurs at L2 latency.
  * L3 cache but not L2 cache: every byte access transfers a line from L3->L2->L1. Each access occurs at L3 latency.
  * Larger than L3: latency increases as a function of the working set size. The operating system manages this cache and has access to predictive prefetching and other optimizations.
    
__Conclusion__: the exact same code at different sizes can have >20x performance differences.
  * you have to understand the cache hierarchy and reason about memory access patterns
    

### LMBench and NUMA

<img src="https://sites.utexas.edu/jdm4372/files/2012/03/RangerLatencyChart.jpg" width=768 title="Ranger Memory Performance at TACC" />

For reads that are bigger than L3, performance varies in multi-processor systems depending on which processor's memory holds the data. We will cover this later in a section on __NUMA__ (non-uniform memory access).

### Memory Best Practices

* Look at the read/write patterns in your inner loops. Make sure that the access pattern is sequential.
* Consider what data is being accessed on a cache miss. Are the other elements in the line useful to the same core?
* Avoid false sharing by making sure that each thread/core has a local variable to which it is writing, or separate writes into different cache lines.

I've assigned the following reading https://siboehm.com/articles/22/Fast-MMM-on-CPU. It is a GREAT treatment. Let's discuss.