# Module 6

## 1. Latency Hiding Techniques
The basic problem is that in a multiprocessor system, memory accesses take a long time. Memory access time has also improved far less than processor speed, so the gap between the two keeps growing. __Latency hiding__ techniques are used to reduce the impact of this gap.
#### Techniques
- Prefetching techniques
- Using coherent caches
- Using relaxed memory consistency
- Using multiple-context support

#### 1.1 Shared Virtual Memory
##### The problem:
- A __NUMA__ system usually has shared memory, and this shared memory is accessed through different addresses. This causes delays when updating the caches, invalidating copies, etc.
- In this approach, cache coherence was maintained using a distributed directory-based protocol.
- For each memory block, a directory kept track of all the nodes accessing it. Loads and writes were done with write buffers.
##### The solution:
- The solution to this problem is to use a __shared virtual memory__. 
- Instead of having separate addresses for the shared memory, a __virtual address__ is used to map to the shared memory.
- A memory location having read access was allowed to be held by the different processors.
- A memory location to be written was only allowed to be held by the processor writing it.
- When a page fault occurs, the memory manager retrieves the memory location and brings it to the processor's cache.
- This allows the code to be larger in size by storing it in a contiguous virtual address space.
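
The read and write rules above can be sketched as a small model. This is only an illustration under simplifying assumptions: the `SharedPage` class and its method names are hypothetical, a "page" stands for any shared memory location, and a writer that is demoted by a later read is assumed to keep a read-only copy.

```python
class SharedPage:
    """Hypothetical model of one shared location: many readers or one writer."""

    def __init__(self):
        self.readers = set()   # processors holding a read-only copy
        self.writer = None     # the single processor holding a writable copy

    def read_fault(self, proc):
        """`proc` faults on a read: demote any writer, then add a read copy."""
        if self.writer is not None:
            # Assumption: the old writer keeps a read-only copy.
            self.readers.add(self.writer)
            self.writer = None
        self.readers.add(proc)

    def write_fault(self, proc):
        """`proc` faults on a write: all other copies are invalidated."""
        self.readers.clear()
        self.writer = proc

    def holders(self):
        """Return the set of processors that currently hold the location."""
        if self.writer is None:
            return set(self.readers)
        return self.readers | {self.writer}


page = SharedPage()
page.read_fault("P0")
page.read_fault("P1")
print(sorted(page.holders()))   # ['P0', 'P1']
page.write_fault("P2")
print(sorted(page.holders()))   # ['P2']
```
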
#### 1.2 Prefetching Techniques
##### Overview
- Prefetching brings the data that will be needed next closer to the processor before the processor actually asks for it.
- There are two types of prefetches:
    - __Binding__ : The value is actually loaded at prefetch time, so it is bound to the processor. The copy can become stale if another processor writes to that location afterwards. In that case the data has to be read again, wasting more clock cycles.
    - __Non Binding__ : The data is brought closer to the processor (for example, into its cache) but is not bound to it. It stays visible to the coherence protocol, so it is loaded with the updated changes when it is actually required.
- __Hardware controlled__ prefetching includes __long cache lines__ and __look ahead__. It is faster because no extra instructions are executed.
- __Software controlled__ prefetching uses explicit prefetch instructions, which makes prefetching more customizable. However, there is the additional overhead of executing these instructions.
##### Benefits:
- It can speed up execution greatly.
- Issuing prefetches back to back hides the latency of all but the first prefetch, because the requests are pipelined.
- Ownership based prefetching can solve the coherence problem.
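
The effect of back-to-back prefetches can be estimated with a rough cycle count. The sketch below assumes an idealised, fully pipelined memory system in which a new request can be issued every `issue_interval` cycles. The function names and the numbers in the example are hypothetical and chosen only for illustration.

```python
def serial_cycles(n_loads, latency):
    """Cycles spent when each load waits for the previous one to finish."""
    return n_loads * latency


def pipelined_cycles(n_loads, latency, issue_interval=1):
    """Cycles spent when prefetches are issued back to back.

    Only the first prefetch exposes its full latency. Every later one
    completes `issue_interval` cycles after the one before it.
    """
    if n_loads == 0:
        return 0
    return latency + (n_loads - 1) * issue_interval


# Hypothetical example: 8 loads, 100-cycle memory latency.
print(serial_cycles(8, 100))      # 800
print(pipelined_cycles(8, 100))   # 107
```
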
#### 1.3 Distributed Coherent Caches
##### Problem:
- Even though snoopy protocols are very effective in small bus-based systems, they get really complicated in larger multiprocessor systems.
- Earlier multiprocessor systems avoided caches because of this. However, this slowed down execution a lot.
- The largest latency saving comes from reducing the cycles spent reading from memory, and that benefit is lost if caching is avoided.

#### 1.4 Scalable Coherence Interface
##### Solution:
- A __Scalable Coherence Interface__ (SCI) with low latency is used to connect the nodes instead of a traditional bused backplane.
- It connects nodes to the external interconnect using 16-bit links with speeds of up to 1 GB/s.
- Each node has an input link and an output link, which connect to an SCI ring or a crossbar.
- Instead of broadcast in the case of a bus, point to point communication is used.
- Its bandwidth, arbitration and addressing outperform those of a bus.
- Since there is no snoopy controller per node, it is also much less expensive.
- Although SCI is scalable, the memory used by the cache directory also scales up.
- The performance, however, does not scale.

##### Working
###### Sharing-List Structures
- They are data structures that hold the sharing information for a shared location.
- They are dynamically created, pruned and destroyed.
- They allow the coherence protocol to be bypassed for locally cached data.
- Communications are supported by shared memory controllers.
- A bit tags the first processor (the head) of the list.
- The other processors are linked using a doubly linked list.
###### Sharing-List Creation
- The state of each cached entry is defined as __clean__, __dirty__, __valid__ or __stale__.
- The head processor is always responsible for list management.
- Initially the location is in the home state.
- When a processor requests it, the memory state is changed from home to cached after the necessary changes are made.
###### Sharing-List Updates
- For subsequent memory requests, the memory is already in the cached state, and the head of the sharing list may hold dirty data.
- When a new request for that memory is made, the memory returns a pointer to the node currently accessing it (the old head) instead of a pointer to the data.
- A second cache-to-cache transaction called __prepend__ takes place, which makes the old head point its backward pointer to the new node.
- The new node becomes the new head.
- A processor may release itself from the list.
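
The list operations above can be sketched as a doubly linked list whose head pointer is kept by the memory. This is a simplified model, not the SCI protocol itself: the class and method names are hypothetical, only the home and cached memory states are modelled, and the per-entry clean/dirty/valid/stale states are left out.

```python
class CacheNode:
    """Hypothetical sharing-list entry for one processor."""

    def __init__(self, name):
        self.name = name
        self.forward = None    # next (older) sharer
        self.backward = None   # previous (newer) sharer, None for the head


class MemoryLine:
    """Hypothetical memory-side record for one shared location."""

    def __init__(self):
        self.state = "home"    # "home" (uncached) or "cached"
        self.head = None       # pointer to the head of the sharing list

    def request(self, node):
        """Memory hands back the old head; `node` prepends itself to it."""
        old_head = self.head
        self.head = node
        self.state = "cached"
        node.backward = None
        node.forward = old_head
        if old_head is not None:
            old_head.backward = node   # the prepend step

    def release(self, node):
        """`node` removes itself from the sharing list."""
        if node.backward is None and self.head is not node:
            raise ValueError("node is not in this sharing list")
        if node.backward is None:
            self.head = node.forward
        else:
            node.backward.forward = node.forward
        if node.forward is not None:
            node.forward.backward = node.backward
        node.forward = node.backward = None
        if self.head is None:
            self.state = "home"

    def sharers(self):
        """Names of the sharers, from the head to the tail."""
        names, node = [], self.head
        while node is not None:
            names.append(node.name)
            node = node.forward
        return names


line = MemoryLine()
a, b, c = CacheNode("A"), CacheNode("B"), CacheNode("C")
for n in (a, b, c):
    line.request(n)
print(line.sharers())   # ['C', 'B', 'A']
line.release(b)
print(line.sharers())   # ['C', 'A']
```

The most recent requester is always the head, which matches the rule that the head processor manages the list.
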

#### 1.5 Relaxed Memory Consistency
###### Processor Consistency

Writes issued by one processor are always observed in program order. However, writes from two different processors may be observed in different orders by different processors.

- The basic idea is to relax the consistency requirements on read and write operations, which allows additional buffering and pipelining opportunities.
- The following conditions allow reads that follow a write to bypass the write:
    - Before a __read__ is performed with respect to any other processor, all previous reads must be performed.
    - Before a __write__ is performed with respect to any other processor, all previous reads and writes must be performed.
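
One way to picture these conditions is a processor with a FIFO write buffer. The sketch below is a toy model under stated assumptions, not a full consistency model: `BufferedProcessor` and its methods are hypothetical names, shared memory is a plain dictionary, and unwritten locations are assumed to read as 0. Writes drain in program order, while a read may go to memory ahead of the buffered writes.

```python
from collections import deque


class BufferedProcessor:
    """Hypothetical processor with a FIFO write buffer."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.write_buffer = deque()   # pending (address, value), program order

    def write(self, addr, value):
        """Buffer the write; other processors cannot see it yet."""
        self.write_buffer.append((addr, value))

    def read(self, addr):
        """Read may bypass buffered writes to other addresses."""
        for pending_addr, value in reversed(self.write_buffer):
            if pending_addr == addr:
                return value          # forward own newest pending write
        return self.memory.get(addr, 0)

    def drain_one(self):
        """Perform the oldest buffered write, keeping program order."""
        if self.write_buffer:
            addr, value = self.write_buffer.popleft()
            self.memory[addr] = value


memory = {}
p0 = BufferedProcessor(memory)
p1 = BufferedProcessor(memory)
p0.write("x", 1)
p0.write("y", 2)
print(p0.read("z"))                  # 0, the read bypasses both pending writes
print(p1.read("x"), p1.read("y"))    # 0 0, nothing performed yet
p0.drain_one()
print(p1.read("x"), p1.read("y"))    # 1 0, x becomes visible before y
```

Because the buffer is drained in FIFO order, another processor can never observe the write to `y` without also observing the earlier write to `x`.
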
###### Release Consistency
The release consistency allows a relaxed way for releasing and acquiring locks thereby providing flexibility in buffering and pipelining.

- Before a __read or write__ access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed.
- Before a __release__ access is allowed to perform with respect to any other processor, all previous read and store accesses must be performed.
- __Special accesses__ are processor-consistent with one another. The ordering restrictions imposed by weak consistency are not present in release consistency. Instead, release consistency requires processor consistency and not sequential consistency.

__Weak consistency__ :- relies on programmer-defined synchronisation to keep the program consistent.



## 2. Multithreading Issues and Solutions

#### Parameters to analyze performance
1. __The latency__
2. __The number of threads__
3. __The context-switching overhead__
4. __The interval between switches__

#### Problem of asynchrony

When two threads read two separate pieces of data, reading and processing by the individual threads is fast. However, if another thread depends on data from both threads, it idles while waiting for them to complete.

##### Multithreading Solutions
- This solution involves using multiple threads to hide the latency.
- For a load operation, another thread can begin working while the data is being read, which saves the time otherwise spent waiting.
- However, the thread that is switched to must not issue a load itself right away, since this makes the situation worse.
- Using __continuations__: each load operation is assigned an appropriate operation that does not have this issue.
##### Distributed cache
- The distributed cache involves every node owning a cache location.
- Each location has an __import__ list and an __export__ list. They keep track of the nodes it shares data with and the nodes that share data with it.
- This provides a solution for remote loads, not for synchronizing loads.

## 3. Multiple Context Processors

The basic idea is to improve efficiency, defined as the fraction of time the processor is busy: __busy__ / (__busy__ + __switching__ + __idle__).

###### Context Switching Policies
- __Switch on cache miss__ - The context is switched when a cache miss occurs.
- __Switch on every load__ - The context is switched on every load, so another context does useful work while the load completes.
- __Switch on every instruction__ - Whether or not it is a load, every instruction has some delay. To make up for that, the context is switched after every instruction.
- __Switch on block of instructions__ - Blocks of instructions from different threads are interleaved. Switching at block granularity helps improve the cache hit ratio.
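
To see how the number of contexts affects the busy / (busy + switching + idle) ratio, here is a minimal simulation of the switch-on-cache-miss policy. It rests on several assumptions that are not part of the notes: every context runs for exactly `run_len` cycles between misses, every miss takes `miss_latency` cycles, contexts are scheduled in strict round-robin order, and all names and numbers are hypothetical.

```python
def switch_on_miss_efficiency(n_contexts, run_len, miss_latency,
                              switch_cost, n_bursts):
    """Return busy / (busy + switching + idle) for a toy processor."""
    ready_at = [0] * n_contexts       # cycle at which each context can run
    now = busy = switching = idle = 0
    for burst in range(n_bursts):
        ctx = burst % n_contexts      # strict round-robin choice
        if ready_at[ctx] > now:       # its miss is still outstanding
            idle += ready_at[ctx] - now
            now = ready_at[ctx]
        now += run_len                # run until the next cache miss
        busy += run_len
        ready_at[ctx] = now + miss_latency
        now += switch_cost            # pay for the context switch
        switching += switch_cost
    return busy / (busy + switching + idle)


# Hypothetical parameters: 10 busy cycles per burst, 40-cycle miss,
# 2-cycle context switch.
for n in (1, 2, 4, 5, 8):
    eff = switch_on_miss_efficiency(n, 10, 40, 2, 100)
    print(n, round(eff, 3))
```

With these numbers, efficiency grows as contexts are added until the miss latency is fully hidden. Beyond that point only the switching overhead remains, so adding more contexts gives no further gain.
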

## 4. Fine Grain Parallelism