The processor consists of multiple cores, each with private L1 instruction and data caches. The cores share the L2 cache.
Each core has a single-issue, in-order execution pipeline with multiple hardware threads. Each thread looks like a separate processor to software, with a full set of general purpose registers. However, the threads share L1 caches and arithmetic pipelines. The processor issues an instruction from a different thread each cycle, skipping threads that are waiting on a cache miss or data dependency. This keeps arithmetic and memory units utilized.
When a data cache miss, branch, or exception occurs, the core rolls back the thread that caused it. It invalidates instructions in the pipeline that were issued by that thread after the offending instruction. If a branch instruction caused the rollback, it resets the program counter to the target of the branch and continues execution from there. If a cache miss caused the rollback, it resets the program counter to the memory instruction that caused the miss and suspends the thread. When the memory subsystem satisfies the miss, the core reissues the memory instruction.
There are 16 parallel arithmetic pipelines, each corresponding to a vector lane. Scalar-only instructions use the lowest lane. Instructions can mix scalar and vector operands. A scalar source operand is duplicated across all lanes. The processor can optionally write back vector results to a subset of lanes if an instruction specifies a mask register.
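The lane behavior described above can be modeled in a few lines of Python. This is a hypothetical sketch, not the hardware interface: `vector_add`, the list-based lane representation, and the integer mask encoding are all illustrative.

```python
NUM_LANES = 16  # one arithmetic pipeline per vector lane

def vector_add(dest, src_vec, src_scalar, mask=None):
    """Model a mixed scalar/vector instruction across 16 lanes.

    The scalar source operand is duplicated across all lanes. If a mask
    register value is given, only lanes whose mask bit is set are written
    back; other lanes keep their previous destination value.
    """
    broadcast = [src_scalar] * NUM_LANES  # duplicate scalar across lanes
    result = [v + s for v, s in zip(src_vec, broadcast)]
    if mask is None:
        return result
    return [r if (mask >> lane) & 1 else d
            for lane, (r, d) in enumerate(zip(result, dest))]
```

For example, `vector_add([0] * 16, list(range(16)), 10, mask=0b11)` updates only lanes 0 and 1, leaving the other fourteen destination lanes untouched.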
This diagram shows the execution pipeline for a single core:
This stage contains a program counter for each hardware thread. It selects one to fetch each cycle, skipping threads that are waiting on instruction cache misses or whose instruction FIFOs are full. It increments the program counter of the selected thread, effectively always predicting that branches are not taken.
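A minimal Python model of that selection policy might look like the following (the function name and argument layout are assumptions for illustration; the 4-byte increment assumes fixed-width instructions):

```python
def select_fetch_thread(pcs, icache_wait, fifo_full, last):
    """Pick the next thread to fetch, round-robin starting after `last`.

    Skips threads waiting on instruction cache misses or whose
    instruction FIFOs are full. Advances the chosen thread's program
    counter (predict not-taken) and returns its index, or returns None
    if no thread is ready this cycle.
    """
    n = len(pcs)
    for i in range(1, n + 1):
        t = (last + i) % n
        if not icache_wait[t] and not fifo_full[t]:
            pcs[t] += 4  # fall through to the next sequential instruction
            return t
    return None
```

With thread 1 stalled on an instruction cache miss, a call starting after thread 0 skips to thread 2 and bumps only that thread's program counter.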
The instruction cache is set associative. Each way of the instruction cache has a separate bank of SRAM containing tag values for the cached lines as well as valid bits. This stage reads those in parallel using the middle bits of the instruction address (the set index) as the address. Tag memory has one cycle of latency, so the next stage handles the result. Tag memory has independent read and write ports, so it can process a read the same cycle the memory subsystem is fulfilling a cache miss.
The stage also contains a least recently used list of ways for each cache set. A signal from the next stage updates these when there is a cache hit. The L2 cache interface reads this list when a response comes in to determine which way to place newly loaded cache lines in.
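The per-set LRU bookkeeping can be sketched as a small Python class (an illustrative model only; the hardware presumably encodes the ordering more compactly):

```python
class LruList:
    """Least-recently-used ordering of ways for one cache set.

    The front of the list is the most recently used way; the back is
    the eviction candidate when the L2 interface fills a new line.
    """
    def __init__(self, num_ways):
        self.order = list(range(num_ways))  # front = most recently used

    def touch(self, way):
        """Hit signal from the tag-check stage: move `way` to the front."""
        self.order.remove(way)
        self.order.insert(0, way)

    def victim(self):
        """Way to place a newly loaded cache line into."""
        return self.order[-1]
```

Touching ways 3 and then 0 in a 4-way set leaves way 2 as the least recently used, so the next fill from the L2 cache would replace way 2.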
A TLB could be added to this stage to support virtual memory, but that is not currently implemented.
This stage compares each of the way tags read in the previous stage to the requested tag. If a cache hit occurs, it issues the instruction cache data address--formed from the way and the set indices--to instruction cache data memory. If a miss occurs, it enqueues a fill request in the instruction cache load miss queue. Both the L1 instruction and data caches can continue to process requests while misses are pending. This allows threads to hide the latency of cache misses.
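The tag comparison can be illustrated with a short Python sketch. The address split (64-byte lines, set count as a parameter) and the `tags`/`valid` array layout are assumptions for the example, not the actual cache geometry:

```python
def lookup(tags, valid, address, line_size=64, num_sets=64):
    """Compare the requested tag against each way's stored tag.

    `tags[way][set]` and `valid[way][set]` model the per-way tag SRAM
    banks read in the previous stage. Returns the hitting way index, or
    None on a miss (in which case a fill request would be enqueued in
    the load miss queue).
    """
    set_index = (address // line_size) % num_sets
    tag = address // (line_size * num_sets)
    for way in range(len(tags)):
        if valid[way][set_index] and tags[way][set_index] == tag:
            return way  # hit: way + set index form the data SRAM address
    return None
```

On a hit, the returned way index concatenated with the set index addresses the data SRAM; on a miss the caller would allocate a load miss queue entry instead.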
The decode stage determines which register operands are needed for the instruction. It uses multiplexers to extract the register indices from the instruction, and maps them to register file ports. It determines the latency of the instruction--which is equivalent to the number of execution stages in the pipeline. The thread select stage uses the latency to track register writeback conflicts.
The thread select stage contains a FIFO of decoded instructions for each thread. Each cycle, it picks an instruction from one of the threads to issue. It skips threads that are unable to run because of operand dependencies, structural hazards, or cache misses. It uses a round robin selection policy, switching threads each cycle whenever possible.
Each thread has its own scoreboard, which tracks operand dependencies between instructions. There are two situations where a dependency can occur. The most common is a read-after-write dependency, which occurs when an operand of an instruction is the result of a previous one. For example, in the following, the second instruction uses register s0, which is updated by the first instruction. It takes several cycles to compute the result. If the thread select stage were to issue the dependent instruction too soon, it would get the stale value in the register:
```
add_f s0, s1, s2
add_f s3, s0, s4
```
The other type of dependency is a write-after-write dependency, where a destination register is written back-to-back by instructions with different latencies.
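Both dependency types can be modeled with one pending bit per register, as in this minimal Python sketch (the class and method names are illustrative; the real scoreboard is per-thread hardware):

```python
class Scoreboard:
    """Per-thread scoreboard: one pending bit per register.

    A register's bit is set when an instruction that writes it issues
    and cleared at writeback. An instruction may not issue while any of
    its source registers (read-after-write) or its destination register
    (write-after-write) has its pending bit set.
    """
    def __init__(self, num_regs=32):
        self.pending = [False] * num_regs

    def can_issue(self, sources, dest):
        return (not any(self.pending[r] for r in sources)
                and not self.pending[dest])

    def issue(self, dest):
        self.pending[dest] = True

    def writeback(self, dest):
        self.pending[dest] = False
```

In the earlier example, the first `add_f` sets the pending bit for s0 at issue, so the second `add_f` (which reads s0) stalls in thread select until the first instruction's writeback clears the bit.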
Each scoreboard contains a pending bit for each register, which is used as follows:
Because the execution pipelines have different lengths, it is possible for instructions issued in different cycles to arrive at the writeback stage simultaneously. This must be avoided because the writeback stage can only retire one instruction per cycle. The thread select stage prevents this by delaying the issue of instructions that would conflict. A shift register, shared between all threads, tracks which stages have active instructions. Each thread checks the appropriate shift register bit based on the latency of the next instruction to be issued.
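The shared shift register can be sketched as follows (an illustrative model; the maximum latency and method names are assumptions):

```python
class WritebackConflictTracker:
    """Shift register, shared by all threads, marking which future
    cycles already have an instruction arriving at writeback.

    Bit i set means an instruction reaches writeback i+1 cycles from
    now. Before issuing an instruction, the thread select stage checks
    the bit for that instruction's latency, since only one instruction
    may retire per cycle.
    """
    def __init__(self, max_latency=8):
        self.bits = [False] * max_latency

    def can_issue(self, latency):
        return not self.bits[latency - 1]

    def issue(self, latency):
        self.bits[latency - 1] = True

    def tick(self):
        """Advance one cycle: every in-flight instruction moves one
        stage closer to writeback."""
        self.bits = self.bits[1:] + [False]
```

For example, after issuing a 5-cycle instruction, a 4-cycle instruction issued on the next cycle would arrive at writeback simultaneously, so its bit check fails and its issue is delayed.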
The thread select stage tracks whether a thread is waiting because the store queue is full or a data cache miss has occurred. It doesn't issue instructions from a waiting thread until a signal from the data cache marks it runnable again.
Some instructions must be issued more than once to complete. For example, scatter/gather vector memory operations perform a separate cache access for each lane. The thread select stage issues these instructions multiple times with a separate subcycle counter signal. When a rollback occurs, the thread select stage resets the subcycle to the one that triggered the rollback. The subcycle mechanism is generic and could support having fewer execution units than vector lanes or other complex instructions.
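The subcycle mechanism can be illustrated for a scatter/gather operation with a small Python sketch (the function shape and the callback-based cache model are assumptions for illustration):

```python
def issue_scatter_gather(lane_addresses, cache_access, start_subcycle=0):
    """Issue a scatter/gather memory instruction once per vector lane.

    The thread select stage reissues the instruction with an
    incrementing subcycle counter, performing one cache access per
    lane. If an access misses (`cache_access` returns False), issue
    stops and the current subcycle is returned so the instruction can
    restart there after the rollback; None means all lanes completed.
    """
    for subcycle in range(start_subcycle, len(lane_addresses)):
        if not cache_access(lane_addresses[subcycle]):
            return subcycle  # rollback: restart at this subcycle later
    return None
```

If lane 1's access misses, the instruction rolls back with subcycle 1 recorded; once the miss is filled, issue resumes at that lane rather than repeating the lanes that already completed.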
This stage has two substages. The first computes the scoreboard mask for the next instruction of each thread. The second checks for conflicts and selects a thread to issue.
The operand fetch stage issues the source register indices to the register file, which has one cycle of latency. The vector and scalar register files in this stage contain registers for all the threads. They form addresses by combining the register index with the thread ID. Both register files have two read ports and one write port. The writeback stage controls the write port.
There are two internal substages. The first fetches the operands. The second steers the results to the appropriate output ports for the next stages, one of the following:
This five stage pipeline is used for floating point operations and integer multiplication.
|Stage|Floating Point Add/Subtract|Multiply|
|----|----|----|
|1|Compare exponents (and significands if the exponents are equal). Use multiplexers to sort the smaller and larger numbers into the appropriate outputs.|Subtract exponents|
|2|Align the smaller significand by shifting right. This stage collects guard, round, and sticky bits from the shifted-out bits for rounding.|Multiply|
|3|Add/subtract (invert the significand for logical subtraction). Pre-normalization rounding occurs here based on the GRS bits from above.|Multiply|
|4|Leading zero detection on the significand for normalization.|Multiply|
|5|Normalize by shifting left. Post-normalization rounding (only necessary for overflows).|Normalize/Round|
In the current implementation, multiplication is performed in a single stage (stage 2), and the result is passed through stages 3 and 4. A better implementation would do something more sophisticated, such as using a Booth encoder and Wallace tree to reduce delay and improve clock speed (although this isn't the critical path on an FPGA).
A single stage executes instructions that have one cycle of latency such as integer addition and bitwise logical operations. Despite the name, it performs some simple floating point operations like reciprocal estimate. This stage also checks for branches.
Like the instruction cache, this stage performs a tag lookup on all ways in parallel. Tag memory in this stage has an extra read port that allows cache snoop operations to occur without interfering with load instructions in the execution pipeline. The cache coherence section below describes this in more detail.
This stage also contains the least recently used list, which operates the same as the one in the instruction fetch tag stage.
If a cache hit occurs, this stage reads the data from cache SRAM. This has a 512 bit wide read interface, so a contiguous vector load occurs in one cycle.
The data cache is write-through. Store instructions do not directly update the L1 cache but are instead inserted into a store queue and eventually sent to the L2 cache. When the L2 cache sends a response, it updates the L1 cache line. Each entry in the store queue contains a 512-bit-wide cache line. This stage aligns smaller writes to the proper location in the cache line and sets a 64-bit write mask with a single bit for each 8-bit lane. This stage does not load a cache line if a store occurs to an address that is not resident (it is not allocate-on-write).
When a load is requested, this stage checks the store queue to see if any stores are pending to the same address from the same thread. If there are, it bypasses the store data to the writeback stage so the write is visible to that thread.
If a load miss occurs or a store is attempted when the store queue is full, it rolls back the thread and suspends it. When a load miss is filled, the load miss queue issues a signal to the thread select stage to wake the thread up. If a store occurs to an address that is already pending in the store queue for the same thread, and has not been sent to the L2 cache yet, the store queue will update the existing entry with new store data (write combining).
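The store queue behavior described above (write alignment, the byte mask, write combining, and store-to-load bypass) can be sketched in Python. The 4-entry depth and all names here are illustrative assumptions, not the hardware parameters:

```python
LINE_SIZE = 64  # bytes: a 512-bit cache line with a 64-bit byte mask

class StoreQueueEntry:
    def __init__(self, line_addr):
        self.line_addr = line_addr
        self.data = bytearray(LINE_SIZE)
        self.mask = 0            # one bit per 8-bit lane
        self.sent_to_l2 = False

class StoreQueue:
    """Per-thread store queue model: aligns smaller writes within a
    line, sets the write mask, combines writes to a pending unsent
    line, and bypasses pending data to loads from the same thread."""
    def __init__(self, depth=4):
        self.entries = []
        self.depth = depth

    def store(self, address, value):
        line_addr = address - (address % LINE_SIZE)
        offset = address % LINE_SIZE
        for entry in self.entries:  # write combining with an unsent entry
            if entry.line_addr == line_addr and not entry.sent_to_l2:
                break
        else:
            if len(self.entries) == self.depth:
                return False        # full: the thread rolls back and suspends
            entry = StoreQueueEntry(line_addr)
            self.entries.append(entry)
        entry.data[offset:offset + len(value)] = value
        for i in range(len(value)):
            entry.mask |= 1 << (offset + i)
        return True

    def bypass(self, address, size):
        """Return pending store data fully covering a load, if any."""
        line_addr = address - (address % LINE_SIZE)
        offset = address % LINE_SIZE
        need = ((1 << size) - 1) << offset
        for entry in reversed(self.entries):
            if entry.line_addr == line_addr and (entry.mask & need) == need:
                return bytes(entry.data[offset:offset + size])
        return None
```

A 4-byte store at offset 4 of a line sets mask bits 4 through 7; a later 1-byte store to the same unsent line combines into the same entry, and a subsequent load from the same thread sees both through the bypass path.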
The highest portion of the address space is reserved for non-cached device registers. Accesses to this region cause transactions on a separate request/response bus. These may only be 32-bit transfers.
This stage performs the following functions:
Exceptions are precise in this architecture. The processor guarantees that all instructions before the instruction that caused the exception are retired, and no instructions after it are. Although the processor issues instructions in-order, they may be retired (have their results written back to the register file) out of order. This is because the execution pipelines have different lengths. The rollback logic must consider this to keep exceptions precise.
Interrupts cause a rollback. This prevents instructions that were already issued to the pipeline from causing rollbacks that return execution to the previous code path. If a thread is currently blocked waiting for a memory access, the interrupt resets its program counter, but the thread remains blocked until the previous wait completes. This avoids race conditions that can occur when the wakeup is later signaled.
When there is a request to send from a core to the L2 cache (for example, an L1 data cache miss), the core asserts its signals on its own L2 request interface. The L2 cache asserts an acknowledgement signal when it can process the request. Each requestor has a separate acknowledgement signal to allow arbitration between simultaneous requests. The request contains an entire 512-bit cache line and a 64-bit mask for stores.
When a load miss occurs, all pending entries in the L1 load miss queue are searched in parallel by tag/line. If a match occurs, the appropriate bit in the thread bitmap is set to indicate the thread is waiting on the entry. This combines the loads into one request. If the load is not already pending, a new load miss queue entry is allocated.
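The miss-combining behavior can be modeled compactly (a sketch with illustrative names; the hardware searches the entries with parallel comparators rather than a dictionary):

```python
class LoadMissQueue:
    """L1 load miss queue model: pending misses are matched by cache
    line address, and a per-entry thread bitmap records every thread
    waiting on that line, so duplicate misses combine into one L2
    request."""
    def __init__(self):
        self.entries = {}        # line address -> waiting-thread bitmap

    def miss(self, line_addr, thread):
        """Record a miss. Returns True if a new L2 request is needed."""
        if line_addr in self.entries:
            self.entries[line_addr] |= 1 << thread  # combine with pending
            return False
        self.entries[line_addr] = 1 << thread
        return True
    def fill(self, line_addr):
        """L2 response arrived: return the threads to wake up."""
        bitmap = self.entries.pop(line_addr)
        return [t for t in range(bitmap.bit_length()) if (bitmap >> t) & 1]
```

If threads 0 and 2 both miss on the same line, only the first miss generates an L2 request; the fill wakes both threads at once.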
When the L2 cache receives a store request, it updates its own internal cache line, loading it if it is not resident, then broadcasts the update to all cores.
The L2 cache may respond to requests out of order because the restart mechanism can reorder transactions. When the L2 cache finishes processing a request, it broadcasts a response on the L2 response interface, which all cores monitor. Identifier fields in the response indicate which request it is associated with.
The L2 cache has a four stage pipeline.
The L2 cache is write-back and allocates cache lines both on load and store misses. When a cache miss occurs, the L2 cache places the request in a miss queue. A state machine controls loading data from external memory. When the transaction finishes, the request restarts at the beginning of the L2 cache pipeline along with the fetched data. A writeback queue holds data that must be stored back to main memory. Requests in the writeback queue are processed before those in the load miss queue to avoid loading stale data. The L2 cache uses the AMBA AXI protocol to communicate with main memory. One Verilog module encapsulates protocol handling. This could be replaced to support different bus protocols.
The L2 cache is neither inclusive nor exclusive: a cache line may be only in the L1 cache, only in the L2 cache, or in both.
To simplify software, this processor supports a coherent physical address space, shared across all threads and cores. There are three rules that a cache coherence protocol must obey:
The L1 data cache uses a snooping write-update protocol. Because the cache is write-through, there are no 'dirty' or 'exclusive' L1 cache lines as there would be in a write-back cache. Therefore it doesn't need a complex protocol to obtain exclusive access to a line. However, it does need to update L1 data cache lines in other L1 caches whenever a write occurs.
Bypass logic in the store queue enforces rule 1 by making a pending write visible to the thread that issued it. The write update mechanism enforces rule 2. Rule 3 is observed because all writes go first to the L2 cache (without updating the L1 data cache), which serializes them and broadcasts the updates to all cores. To enforce this rule, the store bypass logic also does not make writes visible to other threads on the same core.
Each core snoops L2 responses for store updates. A three stage pipeline processes messages from the L2 cache. The response logic handles both store updates and L1 load miss fills:
This processor uses a relaxed memory consistency model. Because the pipelines issue memory instructions in order, read-to-read and write-to-write ordering is preserved. However, write-to-read ordering is relaxed between threads because execution of a thread can proceed while a write is in the store queue. This is known as "processor consistency" (PC). The membar (memory barrier) instruction enforces explicit ordering. If there is a pending store in the store queue, the membar instruction suspends the thread until the L2 cache acknowledges it. This guarantees all other L1 caches reflect the store.
If a block vector store writes a full line to the L2 cache and the line is not cache resident, the L2 cache does not load it from memory--which would be unnecessary--but instead places the new data directly into the line.
Synchronized load/store instructions enable primitives such as spinlocks, atomic counters, and more sophisticated lock-free/wait-free data structures. When a synchronized memory load is requested, the L2 cache latches the address of the load in an internal register (one register is reserved per thread per core). If a write occurs to that cache line by another thread, it invalidates the internal register. Synchronized store operations check this register. If the address has been invalidated, the synchronized store fails and zero is written back to the source register. If it does match, the store proceeds and a one is written back to the source register.
To guarantee proper operation, a synchronized load must atomically set the internal register in the L2 and return the current data in that line. Synchronized loads always behave like L1 cache misses to ensure atomicity. The cache load miss queue handles this request.
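The latched-address mechanism can be sketched as a small Python model (class and method names are illustrative; the hardware keeps one register per thread per core, not a dictionary):

```python
class L2SyncState:
    """Model of the L2 cache's synchronized load/store tracking: one
    latched address per (core, thread). A write to a latched line by
    any other thread invalidates the latch, making that thread's next
    synchronized store fail."""
    def __init__(self):
        self.latched = {}        # (core, thread) -> line address or None

    def sync_load(self, core, thread, line_addr):
        """Synchronized load: latch the line address for this thread."""
        self.latched[(core, thread)] = line_addr

    def write(self, core, thread, line_addr):
        """Any store: invalidate other threads' latches on this line."""
        for key, addr in self.latched.items():
            if key != (core, thread) and addr == line_addr:
                self.latched[key] = None

    def sync_store(self, core, thread, line_addr):
        """Returns the value written back to the source register:
        1 if the store proceeds, 0 if the latch was invalidated."""
        if self.latched.get((core, thread)) == line_addr:
            self.write(core, thread, line_addr)  # our store invalidates others
            return 1
        return 0
```

A spinlock would loop on this pair: synchronized-load the lock word, retry if it is held, otherwise synchronized-store a 1 and retry the whole sequence whenever the store reports failure.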
GPGPU-Sim is an architecture simulator written in C++ that is designed to model contemporary GPUs.