Jeff Bush edited this page · 196 revisions


  • Introduction
  • Execution Pipeline
  • L2 Cache Subsystem
  • Comparisons with Similar Architectures
    Introduction

    The processor consists of multiple cores, each with private L1 instruction and data caches. The cores share the L2 cache.

    system architecture diagram

    Execution Pipeline

    Each core has a single-issue, in-order execution pipeline with multiple hardware threads. Each thread looks like a separate processor to software, with a full set of general purpose registers. However, the threads share L1 caches and arithmetic pipelines. The processor issues an instruction from a different thread each cycle, skipping threads that are waiting on a cache miss or data dependency. This keeps arithmetic and memory units utilized.

    When a data cache miss, branch, or exception occurs, the core rolls back the thread that caused it. It invalidates instructions in the pipeline that were issued by that thread after the offending instruction. If a branch instruction caused the rollback, it resets the program counter to the target of the branch and continues execution from there. If a cache miss caused the rollback, it resets the program counter to the memory instruction that caused the miss and suspends the thread. When the memory subsystem satisfies the miss, the core reissues the memory instruction.

    There are 16 parallel arithmetic pipelines, each corresponding to a vector lane. Scalar-only instructions use the lowest lane. Instructions can mix scalar and vector operands. A scalar source operand is duplicated across all lanes. The processor can optionally write back vector results to a subset of lanes if an instruction specifies a mask register.
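
    The lane behavior described above can be sketched in software. This is only an illustrative model, not the hardware implementation: the function names are hypothetical, and the mapping of mask bit i to lane i is an assumption, since the text does not state the bit ordering.

```python
NUM_LANES = 16  # one arithmetic pipeline per vector lane, per the text


def broadcast(operand):
    """Duplicate a scalar across all lanes; pass vectors through unchanged."""
    if isinstance(operand, list):
        return operand
    return [operand] * NUM_LANES


def masked_add(dest, a, b, mask=(1 << NUM_LANES) - 1):
    """Add lane-wise and write back only the lanes whose mask bit is set.

    Assumption: bit i of the mask controls lane i.
    """
    va, vb = broadcast(a), broadcast(b)
    return [
        va[i] + vb[i] if (mask >> i) & 1 else dest[i]
        for i in range(NUM_LANES)
    ]
```

    For example, masked_add(old, vec, 10, mask=0b11) adds 10 to the first two lanes of vec and leaves the remaining lanes of old untouched.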

    This diagram shows the execution pipeline for a single core:

    execution pipeline diagram

    Instruction Fetch Tag

    This stage contains a program counter for each hardware thread. It selects one to fetch each cycle, skipping threads that are waiting on instruction cache misses or whose instruction FIFOs are full. It increments the program counter of the selected thread, effectively always predicting that branches are not taken.

    The instruction cache is set associative. Each way of the instruction cache has a separate bank of SRAM containing tag values for the cached lines as well as valid bits. This stage reads those in parallel using the middle bits of the instruction address (the set index) as the address. Tag memory has one cycle of latency, so the next stage handles the result. Tag memory has independent read and write ports, so it can process a read the same cycle the memory subsystem is fulfilling a cache miss.

    The stage also contains a least recently used (LRU) list of ways for each cache set. A signal from the next stage updates the list when there is a cache hit. When an L2 response comes in, the L2 cache interface reads the list to determine which way receives the newly loaded cache line.

    A TLB could be added to this stage to support virtual memory, but that is not currently implemented.

    Instruction Fetch Data

    This stage compares each of the way tags read in the previous stage to the requested tag. If a cache hit occurs, it issues the instruction cache data address--formed from the way and the set indices--to instruction cache data memory. If a miss occurs, it enqueues a fill request in the instruction cache load miss queue. Both the L1 instruction and data caches can continue to process requests while misses are pending. This allows threads to hide the latency of cache misses.
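
    The two fetch stages together amount to a standard set associative lookup, which the following sketch models. The line size, set count, and way count are assumptions chosen for illustration (the text only gives a 512 bit line for the data cache), and the class and function names are hypothetical.

```python
LINE_BYTES = 64  # assumed: 512 bit lines, as stated for the data cache
NUM_SETS = 64    # hypothetical
NUM_WAYS = 4     # hypothetical

OFFSET_BITS = LINE_BYTES.bit_length() - 1
SET_BITS = NUM_SETS.bit_length() - 1


def split_address(addr):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset


class TagMemory:
    def __init__(self):
        # tags[way][set] holds a (valid, tag) pair
        self.tags = [[(False, 0)] * NUM_SETS for _ in range(NUM_WAYS)]

    def fill(self, way, addr):
        """Record that the line containing addr now lives in the given way."""
        tag, set_index, _ = split_address(addr)
        self.tags[way][set_index] = (True, tag)

    def lookup(self, addr):
        """Compare every way's tag for the set; return the hit way or None."""
        tag, set_index, _ = split_address(addr)
        for way in range(NUM_WAYS):
            valid, stored = self.tags[way][set_index]
            if valid and stored == tag:
                return way
        return None


def data_address(way, set_index):
    """Data memory index formed from the way and set indices (assumed layout)."""
    return way * NUM_SETS + set_index
```

    The hardware compares all ways in parallel; the loop here is only a sequential stand-in for that comparison.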

    Instruction Decode

    The decode stage determines which register operands are needed for the instruction. It uses multiplexers to extract the register indices from the instruction, and maps them to register file ports. It determines the latency of the instruction--which is equivalent to the number of execution stages in the pipeline. The thread select stage uses the latency to track register writeback conflicts.

    Thread Select

    The thread select stage contains a FIFO of decoded instructions for each thread. Each cycle, it picks an instruction from one of the threads to issue. It skips threads that are unable to run because of operand dependencies, structural hazards, or cache misses. It uses a round robin selection policy, switching threads each cycle whenever possible.
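
    A minimal sketch of the round robin policy, assuming the stage has already reduced each thread's hazards to a single ready flag. The function name and argument layout are hypothetical.

```python
def select_thread(ready, last_issued):
    """Return the next ready thread after last_issued, wrapping around.

    ready is a list of booleans, one per hardware thread. Returns None
    when no thread can issue this cycle.
    """
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last_issued + step) % n
        if ready[candidate]:
            return candidate
    return None
```

    Starting the search one past the last issued thread is what makes the stage switch threads each cycle whenever another thread is ready; the same thread is chosen again only when it is the sole ready one.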

    Each thread has its own scoreboard, which tracks operand dependencies between instructions. There are two situations where a dependency can occur. The most common is a read-after-write dependency, which occurs when an operand of an instruction is the result of a previous one. For example, in the following, the second instruction reads register s0, which the first instruction updates. It takes several cycles to compute the result. If the thread select stage issued the dependent instruction too soon, it would read the stale value in the register:

        add_f s0, s1, s2
        add_f s3, s0, s4

    The other type of dependency is a write-after-write dependency, where a destination register is written back-to-back by instructions with different latencies.

    Each scoreboard contains a pending bit for each register, which is used as follows:

    1. When the thread select stage issues an instruction, it sets the pending bit for the destination register.
    2. When the writeback stage updates the register file with the result of an instruction, it clears the pending bit for the destination register. The thread select stage also clears pending bits for instructions that are squashed in the pipeline by a rollback.
    3. Before issuing an instruction, the thread select stage checks the pending bits for the source and destination registers of that instruction. If any are set, it doesn't issue from that thread that cycle. Checking the source registers protects against read-after-write dependencies, and checking the destination protects against write-after-write dependencies.
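
    The three rules above can be modeled with a single bitmask per thread. This is a simplified sketch: it covers one register file only, uses hypothetical method names, and numbers registers so that s0 is bit 0.

```python
class Scoreboard:
    def __init__(self):
        self.pending = 0  # bit r set => register r has a result in flight

    def can_issue(self, sources, dest):
        """Rule 3: block if any source or the destination is pending."""
        needed = 0
        for reg in list(sources) + [dest]:
            needed |= 1 << reg
        return (self.pending & needed) == 0

    def issue(self, dest):
        """Rule 1: mark the destination pending at issue."""
        self.pending |= 1 << dest

    def clear(self, dest):
        """Rule 2: clear on writeback, or when a rollback squashes the op."""
        self.pending &= ~(1 << dest)
```

    With the add_f example above, issuing the first instruction sets the bit for s0, so can_issue([0, 4], 3) returns False for the second instruction until the first one writes back.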

    Because the execution pipelines have different lengths, it is possible for instructions issued in different cycles to arrive at the writeback stage simultaneously. This must be avoided because the writeback stage can only retire one instruction per cycle. The thread select stage prevents this by delaying the issue of instructions that would conflict. A shift register, shared between all threads, tracks which stages have active instructions. Each thread checks the appropriate shift register bit based on the latency of the next instruction to be issued.
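
    One way to picture the shared shift register is as a bitmask in which bit i means that an instruction will reach writeback i cycles from now. The sketch below follows that reading; the exact bit layout in the hardware is an assumption, and the class name is hypothetical.

```python
class WritebackTracker:
    def __init__(self):
        self.slots = 0  # bit i set => writeback is claimed i cycles from now

    def can_issue(self, latency):
        """An instruction may issue only if its writeback slot is free."""
        return ((self.slots >> latency) & 1) == 0

    def issue(self, latency):
        """Claim the writeback slot that this instruction will occupy."""
        self.slots |= 1 << latency

    def tick(self):
        """Advance one cycle: every claimed slot moves one step closer."""
        self.slots >>= 1
```

    For instance, if a five stage floating point operation issued four cycles ago, bit 1 is set, so a single-cycle integer instruction has to wait one cycle before it can issue.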

    The thread select stage tracks if a thread is waiting because the store queue is full or a data cache miss has occurred. It doesn't issue instructions from waiting threads until a signal from the data cache marks it runnable again.

    Some instructions must be issued more than once to complete. For example, scatter/gather vector memory operations perform a separate cache access for each lane. The thread select stage issues these instructions multiple times with a separate subcycle counter signal. When a rollback occurs, the thread select stage resets the subcycle to the one that triggered the rollback. The subcycle mechanism is generic and could support having fewer execution units than vector lanes or other complex instructions.

    This stage has two substages. The first computes the scoreboard mask for the next instruction of each thread. The second checks for conflicts and selects a thread to issue.

    Operand fetch

    The operand fetch stage issues the source register indices to the register file, which has one cycle of latency. The vector and scalar register files in this stage contain registers for all the threads. They form addresses by combining the register index with the thread ID. Both register files have two read ports and one write port. The writeback stage controls the write port.

    There are two internal substages. The first fetches the operands. The second steers the results to the appropriate output ports for the next stage, which is one of the following:

    Floating Point Arithmetic Pipeline

    This five stage pipeline is used for floating point operations and integer multiplication.

    Stage | Addition | Multiplication
    1 | Compare exponents (and significands if exponents are equal). Use multiplexers to sort smaller and larger numbers into appropriate outputs. | Subtract exponents
    2 | Align smaller significand by shifting right. This stage collects guard, round, and sticky (GRS) bits from the shifted out bits for rounding. | Multiply
    3 | Add/subtract (invert the significand for logical subtraction). Pre-normalization rounding occurs here based on GRS bits from above. | Multiply
    4 | Leading zero detection on significand for normalization. | Multiply
    5 | Normalize shift left. Post normalization rounding (only necessary for overflows). | Normalize/Round

    In the current implementation, multiplication is performed in a single stage (stage 2), and the result is passed through stages 3 and 4. A better implementation would use something more sophisticated, like a Booth encoder and Wallace tree, to reduce delay and improve clock speed (although this isn't the critical path on FPGA).

    Integer Arithmetic

    A single stage executes instructions that have one cycle of latency such as integer addition and bitwise logical operations. Despite the name, it performs some simple floating point operations like reciprocal estimate. This stage also checks for branches.

    Data Cache Tag

    Like the instruction cache, this stage performs a tag lookup on all ways in parallel. Tag memory in this stage has an extra read port that allows cache snoop operations to occur without interfering with load instructions in the execution pipeline. The cache coherence section below describes this in more detail.

    This stage also contains the least recently used list, which operates the same as the one in the instruction fetch tag stage.

    Data Cache Data

    If a cache hit occurs, this stage reads the data from cache SRAM. This has a 512 bit wide read interface, so a contiguous vector load occurs in one cycle.

    The data cache is write through. Store instructions do not directly update the L1 cache but are instead inserted into a store queue and eventually sent to the L2 cache. When the L2 cache sends a response, it updates the L1 cache line. Each entry in the store queue contains a 512 bit wide cache line. This stage aligns smaller writes to the proper location in the cache line and sets a 64 bit write mask with a single bit for each 8 bit lane. This stage does not load a cache line if a store occurs to an address that is not resident (it does not allocate on write).
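
    The alignment and mask generation can be sketched as follows. The function name is hypothetical, and the assumption that mask bit i corresponds to byte i of the line is made only for illustration; the text does not state the bit ordering.

```python
LINE_BYTES = 64  # 512 bit cache line


def align_store(addr, data):
    """Place a small store into a full-width line and build its byte mask.

    Returns (line, mask), where line is 64 bytes and mask has one bit per
    byte. Assumption: bit i of mask covers byte i of the line.
    """
    offset = addr % LINE_BYTES
    if offset + len(data) > LINE_BYTES:
        raise ValueError("store crosses a cache line boundary")
    line = bytearray(LINE_BYTES)
    line[offset:offset + len(data)] = data
    mask = ((1 << len(data)) - 1) << offset
    return bytes(line), mask
```

    A 4 byte store to address 0x104 lands at offset 4 in its line and sets mask bits 4 through 7; a full 64 byte vector store sets all 64 bits.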

    When a load is requested, this stage checks the store queue to see if any stores are pending to the same address for the same thread. If there are, it bypasses the store data to the writeback stage so the write is visible to that thread.

    If a load miss occurs or a store is attempted when the store queue is full, it rolls back the thread and suspends it. When a load miss is filled, the load miss queue issues a signal to the thread select stage to wake the thread up. If a store occurs to an address that is already pending in the store queue for the same thread, and has not been sent to the L2 cache yet, the store queue will update the existing entry with new store data (write combining).
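
    The store queue behavior described in the last few paragraphs (bypass to the same thread, write combining, and a full condition) can be modeled as below. This is a sketch under a loud assumption: it gives each thread exactly one queue entry, which the text does not specify. The class, method, and field names are hypothetical.

```python
LINE_BYTES = 64


class StoreQueue:
    def __init__(self):
        self.entries = {}  # thread id -> entry (assumed: one entry per thread)

    def store(self, thread, line_addr, data, mask):
        """Queue a store. Returns False when the thread must be rolled back."""
        entry = self.entries.get(thread)
        if entry is None:
            self.entries[thread] = {
                "addr": line_addr, "data": bytearray(data),
                "mask": mask, "sent": False,
            }
            return True
        if entry["addr"] == line_addr and not entry["sent"]:
            # Write combining: merge the new bytes into the pending entry.
            for i in range(LINE_BYTES):
                if (mask >> i) & 1:
                    entry["data"][i] = data[i]
            entry["mask"] |= mask
            return True
        return False  # queue full for this thread

    def bypass(self, thread, line_addr):
        """Return (data, mask) for this thread's pending store, else None."""
        entry = self.entries.get(thread)
        if entry is not None and entry["addr"] == line_addr:
            return bytes(entry["data"]), entry["mask"]
        return None
```

    Note that bypass only matches entries belonging to the requesting thread, so a pending store is not visible to other threads on the same core.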

    The highest portion of the address space is reserved for non-cached device registers. Accesses to this region cause transactions on a separate request/response bus. These may only be 32-bit transfers.


    Writeback

    This stage performs the following functions:

    • Generates signals to update the register file. For vector results, a mask signal with one bit per lane indicates which lanes to update.
    • Extracts and aligns cache line data for memory loads. Combines it with pending store buffer data retrieved from the previous stage.
    • Asserts the rollback control signals, which are either raised by earlier pipeline stages or triggered in the writeback stage by interrupts. Rollbacks are only asserted from this stage for simplicity. The thread select stage ensures only one instruction arrives per cycle. This ensures only one rollback needs to be handled at a time.

    Exceptions are precise in this architecture. The processor guarantees that all instructions before the instruction that caused the exception are retired, and no instructions after it are. Although the processor issues instructions in-order, they may be retired (have their results written back to the register file) out of order. This is because the execution pipelines have different lengths. The rollback logic must consider this to keep exceptions precise.

    Interrupts cause a rollback. This prevents instructions that were already issued to the pipeline from causing rollbacks that return execution to the previous code path. If a thread is currently blocked waiting for a memory access, the interrupt resets its program counter, but the thread remains blocked until the previous wait completes. This avoids race conditions that can occur when the wakeup is later signaled.

    L2 Cache Subsystem

    L1/L2 Cache Interconnect

    When there is a request to send from a core to the L2 cache (for example, an L1 data cache miss), the core asserts signals on its own L2 request interface. The L2 cache asserts an acknowledgement signal when it can process the request. Each requestor has a separate acknowledgement signal to allow arbitration between simultaneous requests. The request contains an entire 512 bit cache line and a 64 bit mask for stores.

    When a load miss occurs, all pending entries in the L1 load miss queue are searched in parallel by tag/line. If a match occurs, the appropriate bit in the thread bitmap is set to indicate the thread is waiting on the entry. This combines the loads into one request. If the load is not already pending, a new load miss queue entry is allocated.
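
    The load combining described above can be sketched with a map from line address to a bitmap of waiting threads. The hardware searches its entries in parallel; the dictionary lookup here is a software stand-in, and the names are hypothetical.

```python
class LoadMissQueue:
    def __init__(self):
        self.entries = {}  # line address -> bitmap of waiting threads

    def miss(self, thread, line_addr):
        """Record a miss. Returns True if a new L2 request must be sent."""
        is_new = line_addr not in self.entries
        self.entries[line_addr] = self.entries.get(line_addr, 0) | (1 << thread)
        return is_new

    def fill(self, line_addr):
        """Retire the entry and return the bitmap of threads to wake."""
        return self.entries.pop(line_addr, 0)
```

    If threads 0 and 2 both miss on the same line, only the first miss produces an L2 request, and the fill returns 0b101 so that both threads are woken together.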

    When the L2 cache receives a store request, it updates its own internal cache line, loading it if it is not resident, then broadcasts the update to all cores.

    The L2 cache may respond to requests out of order because the restart mechanism can reorder transactions. When the L2 cache finishes processing a request, it broadcasts a response on the L2 response interface, which all cores monitor. Identifier fields in the response indicate which request it is associated with.

    L2 cache

    The L2 cache has a four stage pipeline.

    l2 cache pipeline diagram

    • Arbitrate: Picks one request to process from all of the cores or a restarted request. The latter always takes priority to avoid a deadlock if the miss queue fills up. The L2 pipeline cannot stall, but it can stop accepting new requests.
    • Tag: Issues the memory address to tag memory, which has one cycle of latency. Dirty bits are also checked here.
    • Read: If this is an L2 cache hit, reads the cache data, which will be sent in the L2 response. If this transaction is a fill, reads the dirty evicted line, which will be put into the writeback queue. Asserts signals that update tag memory if needed.
    • Update: Issues signals that update cache memory, combining with previously read data per byte based on the store mask.

    The L2 cache is write-back and allocates cache lines both on load and store misses. When a cache miss occurs, the L2 cache places the request in a miss queue. A state machine controls loading data from external memory. When the transaction finishes, the request restarts at the beginning of the L2 cache pipeline along with the fetched data. A writeback queue holds data that must be stored back to main memory. Requests in the writeback queue are processed before those in the load miss queue to avoid loading stale data. The L2 cache uses the AMBA AXI protocol to communicate with main memory. One Verilog module encapsulates protocol handling. This could be replaced to support different bus protocols.

    The L2 cache is not inclusive or exclusive: A cache line may only be in the L1 cache, only in the L2 cache, or in both.

    Cache Coherence

    To simplify software, this processor supports a coherent physical address space, shared across all threads and cores. There are three rules that a cache coherence protocol must obey:

    1. If a processor writes to a memory location and later reads back from it, and no other processor writes to that location in between, it must get the same value back.
    2. When a processor writes to a location, other processors must eventually "see" it (read the new value if they access the same address).
    3. All processors must see stores in the same order--although not necessarily the order that the stores were submitted. If one processor reads A then reads B from a memory location, another processor cannot read B then read A.

    The L1 data cache uses a snooping write-update protocol. Because the cache is write-through, there are no 'dirty' or 'exclusive' L1 cache lines as there would be in a write-back cache. Therefore it doesn't need a complex protocol to obtain exclusive access to a line. However, it does need to update L1 data cache lines in other L1 caches whenever a write occurs.

    Bypass logic in the store queue enforces rule 1 by making a pending write visible to the thread that issued it. The write update mechanism enforces rule 2. Rule 3 is observed because all writes go first to the L2 cache (without updating the L1D cache), which serializes them and broadcasts them to all cores. To enforce this rule, the store bypass logic also does not make writes visible to other threads on the same core.

    Each core snoops L2 responses for store updates. A three stage pipeline processes messages from the L2 cache. The response logic handles both store updates and L1 load miss fills:

    • The first stage sends the address from the response to L1D tag memory to determine if it is in the cache. L1D tag memory has a dedicated read port for snoops, so this occurs in parallel with accesses from the instruction pipeline.
    • A cycle later, the result comes back from tag RAM. If this is a new cache line fill, it asserts signals that update tag RAM.
    • Data memory is updated a cycle after tag memory. This ensures there isn't a race condition, as the execution pipeline reads data memory a cycle after checking the tag.

    This processor uses a relaxed memory consistency model. Because the pipelines issue memory instructions in order, read-to-read and write-to-write ordering is preserved. However, write-to-read ordering is relaxed between threads because execution of a thread can proceed while a write is in the store queue. This is known as "processor consistency" (PC). The membar (memory barrier) instruction enforces explicit ordering. If there is a pending store in the store queue, the membar instruction suspends the thread until the L2 cache acknowledges it. This guarantees all other L1 caches reflect the store.

    If a full line is written to the L2 cache as a block vector store and the line is not cache resident, it doesn't load it from memory--which is unnecessary--but rather puts the new data into the line.

    Synchronized load/stores

    Synchronized load/store instructions enable primitives such as spinlocks, atomic counters, and more sophisticated lock-free/wait-free data structures. When a synchronized memory load is requested, the L2 cache latches the address of the load in an internal register (one register is reserved per thread per core). If a write occurs to that cache line by another thread, it invalidates the internal register. Synchronized store operations check this register. If the address has been invalidated, the synchronized store fails and zero is written back to the source register. If it does match, the store proceeds and a one is written back to the source register.

    To guarantee proper operation, a synchronized load must atomically set the internal register in the L2 and return the current data in that line. Synchronized loads always behave like L1 cache misses to ensure atomicity. The cache load miss queue handles this request.
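
    The following sketch models the synchronized load/store mechanism at the L2 and shows the retry pattern that software would build on it. It is a behavioral illustration only: addresses stand in for whole cache lines, the names are hypothetical, and clearing a thread's own latch after a successful synchronized store is an assumption the text does not confirm.

```python
class L2SyncModel:
    def __init__(self):
        self.memory = {}   # line address -> value
        self.latched = {}  # (core, thread) -> latched line address

    def sync_load(self, core, thread, addr):
        """Latch the address and return the current data as one atomic step."""
        self.latched[(core, thread)] = addr
        return self.memory.get(addr, 0)

    def store(self, core, thread, addr, value):
        """Write a line and invalidate other threads' latches on it."""
        self.memory[addr] = value
        stale = [k for k, a in self.latched.items()
                 if a == addr and k != (core, thread)]
        for key in stale:
            del self.latched[key]

    def sync_store(self, core, thread, addr, value):
        """Return 1 and store if the latch is intact, else return 0."""
        if self.latched.get((core, thread)) != addr:
            return 0
        self.store(core, thread, addr, value)
        del self.latched[(core, thread)]  # assumption: latch is consumed
        return 1
```

    An atomic increment is then a loop: perform a synchronized load, add one, attempt the synchronized store, and repeat from the load whenever the store returns zero.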

    Comparisons with Similar Architectures

    Larrabee/MIC/Xeon Phi

    • The Xeon Phi instruction set allows memory operands (vector and scalar), and the pipeline is optimized to make this efficient.
    • Scalar and vector execution uses the same pipeline in Nyuzi rather than being broken into separate units like Xeon Phi.
    • Xeon Phi allows multiple issue with two pipelines, U and V, whereas Nyuzi is single issue.
    • Xeon Phi has a separate L2 cache for each core, connected using a ring bus utilizing a write invalidate cache coherence protocol. Nyuzi uses a shared L2 cache, separate L1 caches with a write-through data cache, and a write update coherence protocol. This presumably results in more interconnect traffic in Nyuzi and limits its ability to scale to higher core counts.


    OpenSPARC

    • This design uses thread specific flushes (rollbacks) similar to OpenSPARC.
    • The design of the L2 cache and interconnect in Nyuzi was influenced by OpenSPARC.
      • Like OpenSPARC, Nyuzi uses a shared L2 cache and write-through L1 data cache.
      • Nyuzi uses a pipelined L2 cache and split transaction protocol like OpenSPARC. However, Nyuzi omits the crossbar interconnect and banked L2 caches of OpenSPARC in favor of a single L2 cache with a multiplexed arbiter.
      • Like OpenSPARC, Nyuzi uses a write-update cache coherence protocol. However, OpenSPARC uses a shared directory in the L2 cache, whereas Nyuzi snoops updates via an extra port to the L1D tag memory.
      • OpenSPARC uses an inclusive L2 cache (lines in the L1 must be in L2). Nyuzi uses a non-inclusive/non-exclusive cache, which eliminates the need for invalidates when L2 lines are evicted and simplifies the logic. OpenSPARC most likely used this design to simplify implementation of its central cache directory. Since Nyuzi uses a snooping protocol, this is not required.
    • Unlike OpenSPARC, Nyuzi detects operand dependencies with a scoreboard for better pipeline utilization.
    • The pipeline structure is significantly different. Nyuzi decodes instructions before issue because the decoded information is used for scheduling. OpenSPARC decodes it after. OpenSPARC puts the memory stage after the execute stage. Nyuzi places them in parallel.
    • Unlike OpenSPARC, Nyuzi does not have a result forwarding network. It was a critical path in earlier implementations and is not necessary because the memory and arithmetic units run in parallel rather than being end-to-end like OpenSPARC.


    GPGPU-sim

    GPGPU-sim is an architecture simulator written in C++ that is designed to model contemporary GPUs.

    • The front end of Nyuzi's pipeline is similar to GPGPU-sim's. Both use a scoreboard to track operand dependencies.
    • The instruction cache has two stages in Nyuzi (vs. one in GPGPU-sim) to improve clock speed.
    • Nyuzi uses a FIFO of decoded instructions per thread instead of a single entry. This is necessary because of the deeper pipeline (when the FIFO full condition is detected, there still may be a few instructions in flight).
    • Nyuzi does not use a SIMT stack; predication is done explicitly with scalar register masks.
    • Nyuzi does not use an operand collector, but has a more traditional multi-ported register file.
    • The back end of the pipeline for Nyuzi is very different and more closely resembles a CPU.