Understanding DDR Memory Training

Damien Zammit edited this page Nov 24, 2016 · 30 revisions

The following is to aid in understanding both the electronic and software aspects of training DDR memory controllers in 'librecore'. This is one of the most challenging parts of modern firmware and is typically not well documented for newcomers.

Subsystem Latency Fundamentals

Typically in electronics we define 'latency' as the round-trip time via multiple subsystems. In the specific case of DDR memory and a CPU our architecture may look something like the following:

             +----------+-----+-----+                          |````\
 ( CPU ) <-> | DDR ctrl | phy | I/O |<--- mainboard traces --->|DRAM |
             +----------+-----+-----+                          |..../
             \  ,,area of focus,,  /

The path from the CPU to the Controller's command logic can be considered to take X cycles and the path from the Controller to the PHY Y cycles, while the I/O path is in the domain of nanoseconds. The board traces add their own latency. After all these cumulative latencies there is the response path of Y' and X', which need not be the same as Y and X. There are therefore several latencies to consider in the overall subsystem.

Additionally, the Controller need not send data in order, since it tries to optimize bus usage so that there is always traffic on the bus. You must therefore consider both the "optimal case" when looking at the latency of a given test pattern and the "average case" when looking at the latency of a series of test patterns. Naturally there is also a "worst case" latency that can be tolerated along the entire path.
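
As a rough illustration of how these latencies add up, the sketch below sums a request path and a response path. The function name, the parameter names and all of the numbers are hypothetical, chosen only to make the arithmetic concrete; real values depend on the controller, PHY and board, and the DRAM's own access latency is deliberately left out.

```python
def round_trip_ns(x_cycles, y_cycles, y_ret_cycles, x_ret_cycles,
                  clock_mhz, io_ns, board_ns):
    """Sum the request path (X, Y), the response path (Y', X') and the
    analogue delays, which are crossed once in each direction."""
    cycle_ns = 1000.0 / clock_mhz            # period of one controller clock
    digital_ns = (x_cycles + y_cycles + y_ret_cycles + x_ret_cycles) * cycle_ns
    analogue_ns = 2 * (io_ns + board_ns)     # out to the DRAM and back
    return digital_ns + analogue_ns


# Hypothetical figures: 400 MHz clock, X=4, Y=2, Y'=3, X'=4 cycles,
# 1.0 ns through the I/O cells and 0.5 ns across the board traces.
total = round_trip_ns(4, 2, 3, 4, clock_mhz=400, io_ns=1.0, board_ns=0.5)
print(f"{total:.2f} ns")  # 13 cycles * 2.5 ns + 3.0 ns = 35.50 ns
```

Keeping the cycle-based and nanosecond-based terms separate makes it easier to see which part of the budget scales with the clock and which part is fixed by the hardware.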

DDR latency

The DDR latency is the time the memory controller (MC) must wait between requesting data and the actual delivery of the data. It is also known as Column Address Strobe (CAS) Latency or simply CL. The value of CL is usually expressed in clock cycles. For example, a DIMM with CL3 implies that the memory controller must wait three clock cycles after a request is made until data is delivered.

 +--|--------------|---> Clock
 |  |              |
 ^  |___.          |
 |_|read|__        |
 +-----------------|---> Command
 |                 |
 ^  |              |
 |  |<--- CL ----->|&&|
 +---------------------> Data Out (CL3)

Note that latencies, counted in clock cycles, grow from one generation to the next (DDR3 > DDR2 > DDR). DDR2 and DDR3 also have an extra parameter called Additive Latency (AL), so the total latency is CL+AL. Fortunately AL is almost always 0 (AL0). This means a DDR2-800 CL5 DIMM takes less time (i.e., is faster) to start delivering data than a DDR3-800 CL7 DIMM. However, since both are "800 MHz" memories (800 million transfers per second on a 400 MHz clock), both provide the exact same maximum theoretical transfer rate (6,400 MB/s).
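
The 6,400 MB/s figure can be reproduced with a line of arithmetic. The sketch below assumes a standard 64-bit wide DIMM data bus, which the text does not state explicitly; the function names are made up for illustration.

```python
def peak_transfer_mb_s(data_rate_mt_s, bus_width_bits=64):
    """Peak rate in MB/s: transfers per second times bytes per transfer.
    The 64-bit default bus width is an assumption, not taken from the text."""
    return data_rate_mt_s * bus_width_bits // 8


def total_latency_cycles(cl, al=0):
    """Total latency in clock cycles, CL+AL, with AL defaulting to AL0."""
    return cl + al


print(peak_transfer_mb_s(800))         # DDR2-800 and DDR3-800 alike: 6400
print(total_latency_cycles(5))         # DDR2-800 CL5 with AL0: 5
print(total_latency_cycles(7))         # DDR3-800 CL7 with AL0: 7
```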

When comparing DIMM modules with different clock rates, you need to do some math to be able to compare the latencies. Notice that we are talking about "clock cycles." When the clock is higher, each clock cycle is shorter (i.e., shorter period).
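
The "math" amounts to converting cycles into nanoseconds. A minimal sketch follows, assuming the module name gives the data rate in MT/s and that the clock runs at half that rate (two transfers per clock); the function name and the DDR3-1600 CL9 example are illustrative, not taken from the text.

```python
def cas_latency_ns(cl_cycles, data_rate_mt_s):
    """Convert a CAS latency in clock cycles to nanoseconds."""
    clock_mhz = data_rate_mt_s / 2        # double data rate: two transfers per clock
    return cl_cycles * 1000.0 / clock_mhz


print(cas_latency_ns(5, 800))     # DDR2-800 CL5  -> 12.5 ns
print(cas_latency_ns(7, 800))     # DDR3-800 CL7  -> 17.5 ns
print(cas_latency_ns(9, 1600))    # DDR3-1600 CL9 -> 11.25 ns
```

Note how the CL9 module at the higher clock ends up with a shorter absolute latency than the CL5 module, even though its cycle count is larger.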

DDR 2n-Prefetch Architecture

Dynamic memories store data inside an array of tiny capacitors. DDR memories transfer two bits of data per clock cycle from the memory array to the memory internal I/O buffer. This is called 2-bit prefetch. On DDR2 this internal datapath was increased to four bits, and on DDR3 it was raised again to eight bits. This is actually the trick that allows DDR3 to work at higher clock rates than DDR2, and DDR2 at higher clock rates than DDR.

The clocks to which we have been referring so far are the clock rates on the “external world,” i.e., on the I/O interface from the memory, where the communication between the memory and the memory controller takes place. Internally, however, the memory works a little differently.

To the DRAM vendor, 2n-prefetch means that the internal data bus can be twice the width of the external data bus, and therefore the internal column access frequency can be half of the external data transfer rate.

That is, for each single read access cycle internal to the device, two external data words are provided, as shown:

Simplified Block Diagram of 2n-Prefetch (Read)

 F                +------------+  n-bit
 r          +---->| n-bit Data |`\  data  /`.(DQS)
 o   2n-bit |     |  Register  |  \      /    +-----------+ [ n-bit data + DQS ]
 m ----/-/--+     +------------+   +----`---->| D0   MUX  |    `.
            |                      +----.---->| D1   C  Q |----/---|>-- All DQ & DQS 
 R          |     +------------+  /      \    +------|----+               outputs.
 A          +---->| n-bit Data | / n-bit  \          `--- Clkd
 M core.          |  Register  |`    data  `.(DQS)

Similarly, two external data words written to the device are internally combined and written in one internal access, as shown:

Simplified Block Diagram of 2n-Prefetch (Write)

 DQ0-DQi ----|>--.----->| D       Q |`\ n-bit data
                 |      |           |  \.____.
                 |      | n-bit     |        |
                 |  +---|> Data Reg |        |
                 |  |   +-----------+        |
                 |  |                n-bit   |
                 |  |   +-----------+   data `\     +-----------+  2n-bit data
                 +--|-->| D       Q |`---/-----+--->| D       Q |.--/-- To DRAM
                    |   |           |               |  2n-bit   |         Core.
                    |   | n-bit     |               |  Data Reg |
 DQS --------|>-----+--o|> Data Reg |           +---|>          |
                        +-----------+           |   +-----------+
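
The effect of the prefetch depth on the internal column access frequency can be sketched numerically. The prefetch depths below are the ones given in the text (2, 4 and 8); the function name and the choice of example speed grades are illustrative assumptions.

```python
# Prefetch depth per generation, as described above.
PREFETCH = {"DDR": 2, "DDR2": 4, "DDR3": 8}


def core_frequency_mhz(generation, data_rate_mt_s):
    """Internal column access frequency: the external data rate divided
    by the number of words fetched per internal access."""
    return data_rate_mt_s / PREFETCH[generation]


for gen, rate in (("DDR", 400), ("DDR2", 800), ("DDR3", 1600)):
    print(gen, rate, core_frequency_mhz(gen, rate))   # each gives 200.0
```

Each generation doubles the external data rate while the memory core keeps running at the same 200 MHz, which is the trick the prefetch architecture provides.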

Signaling Fundamentals

The DDR system is composed of two fundamental blocks, the DDR controller and the DIMM modules themselves. These two blocks are connected across the mainboard by way of traces that form specific signal lines.

These signal groups are named as follows:

  • CA
  • CLK
  • DQ
  • DQS

connected in the following way:

  +----------+                 +---------+
  | D   C    |<----- CA ------>|  D   M  |
  | D   O    |<----- CLK ----->|  I   O  |
  | R   N    |                 |  M   D  |
  |     T    |<----- DQ ------>|  M   U  |
  |     R    |<----- DQS ----->|      L  |
  |     L.   |                 |      E. | 
  +----------+                 +---------+

Typically the DDR interface is a wide, parallel interface with three important timing relationships: 'CA' (Command/Address) is sampled by 'CLK' (Clock), 'DQ' (Data) is sampled by 'DQS' (Data Strobe), and finally 'CLK' and 'DQS' must be aligned so that the domain crossing between 'CLK' and 'DQS' happens properly.

Various training modes have been introduced to help optimize all of these timings.

Some examples, which vary between DDR generations, are:

  • OCD calibration (DDR2) or its automatic successor, ZQ calibration (DDR3).
  • RD training (read training pattern) to align the DQS/DQ.
  • CA training to allow the host to tune the CA relationship.
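
Training steps such as read training commonly boil down to sweeping a delay setting, recording which settings pass, and picking the middle of the passing window. The sketch below shows that idea only; it is not the algorithm of any particular controller, and the function names, the number of taps and the simulated pass window are all hypothetical.

```python
def find_best_delay(results):
    """results holds one boolean per delay tap (True = the training
    pattern was read back correctly). Return the centre tap of the
    widest run of passing taps, or None if no tap passed."""
    best_start, best_len = None, 0
    run_start = None
    # The trailing False is a sentinel that closes a run ending at the last tap.
    for tap, ok in enumerate(list(results) + [False]):
        if ok and run_start is None:
            run_start = tap
        elif not ok and run_start is not None:
            run_len = tap - run_start
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


def sweep(pattern_ok, taps):
    """Try every delay tap; pattern_ok(tap) stands in for a hardware read test."""
    return [pattern_ok(tap) for tap in range(taps)]


# Simulated hardware: of 32 taps, only taps 10 to 21 read the pattern correctly.
results = sweep(lambda tap: 10 <= tap <= 21, taps=32)
print(find_best_delay(results))   # 15
```

Choosing the centre of the widest window, rather than the first passing tap, leaves margin on both sides for voltage and temperature drift.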

The Fly-by Topology

For better signal quality at higher speed grades, DDR3 adopts a so-called "Fly-by" architecture for the command, address and clock signals. This reduces the number of stubs and their length compared with the DDR2 T-Branch architecture, giving a more elegant and straightforward design.

The Fly-by topology generally connects the DRAM chips on the memory module in series, and at the end of the linear connection is a termination point that absorbs residual signals, to prevent them from being reflected back along the bus.

In DDR3 a fly-by topology was introduced, as depicted:

           [ DRAM modules ]
 .--[-]--[-]--[-]--[-]--[-]--[-]--> CA bus "fly-by"
 |   ^    ^    ^    ^    ^    ^
 |   |    |    |    |    |    |
     |    |    |    |    |    |
      \   \     \   /    /   /
       .....( DQ + DQS ).....

Despite the advantages of the Fly-by topology, there is an added complexity: the sequential Fly-by connection of the Command-Address-Clock bus to the DRAMs causes increasing clock skew relative to the Data bus at every DRAM down the line. In short, the Command-Address-Clock bus signals arrive at each successive DRAM with increasing delay. In order to align 'CLK' to 'DQS' at each DRAM, the 'WL' (write-leveling) training algorithm was introduced.
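
The idea behind write leveling can be sketched as a small simulation. In write-leveling mode the DRAM samples 'CLK' on each rising 'DQS' edge and returns the sampled level on 'DQ'; the controller steps the 'DQS' delay until that value changes from 0 to 1. The model below is a simplified assumption (ideal 50% duty-cycle clock, no jitter), and the function names, step size and skew values are hypothetical.

```python
def sample_ck(dqs_delay_ps, ck_skew_ps, period_ps):
    """Model of the DRAM feedback on DQ: the CLK level seen at the
    rising DQS edge. CLK is high during the first half of each period."""
    phase = (dqs_delay_ps - ck_skew_ps) % period_ps
    return 1 if phase < period_ps // 2 else 0


def write_level(ck_skew_ps, period_ps=2500, step_ps=50):
    """Step the DQS delay until the sampled CLK goes from 0 to 1, i.e.
    until the DQS rising edge lines up with the CLK rising edge."""
    prev = sample_ck(0, ck_skew_ps, period_ps)
    for tap in range(1, 2 * period_ps // step_ps):
        delay = tap * step_ps
        cur = sample_ck(delay, ck_skew_ps, period_ps)
        if prev == 0 and cur == 1:
            return delay
        prev = cur
    return None   # no 0 -> 1 transition found within two clock periods


# Hypothetical fly-by skews (in ps) for four DRAMs further and further down the line.
skews = [100, 350, 600, 850]
print([write_level(s) for s in skews])   # [100, 350, 600, 850]
```

Each DRAM ends up with its own 'DQS' delay that matches the clock skew it sees, which is why the result is stored per byte lane rather than once per channel.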

In DDR4, 'WR' (write) training was introduced to align the write DQS arriving at the DRAM, as was 'CBT' (Command Bus Training).


DDR PHY Interface (DFI)

The DDR PHY Interface (DFI) is an interface protocol that defines the connectivity between a DDR memory controller (MC) and a DDR physical interface (PHY) for DDR memory devices. The protocol defines the signals, signal relationships, and timing parameters required to transfer control information, read and write data to and from the DRAM devices over the DFI. This interface does not encompass all of the features of the MC or the PHY, nor does it put any restrictions on how the PHY or the MC interface to other aspects of the system such as DFT, other system calibration capabilities, or other signals that may exist between the MC and the PHY for a particular implementation.

Serial Presence Detect (SPD)

Some memory modules are soldered to the mainboard as part of the hardware. This configuration is a fixed one, and thus the memory controller has a fixed configuration. However, most DIMM modules are not part of the board themselves, i.e. they are added later in the form of a socketed module. This means that the MC cannot know in advance what kind of DIMM is plugged in at boot time and would not know how to initialize the DIMM since all DIMMs have different electrical parameters. This problem is solved by the presence of a separate miniature EEPROM on every DIMM module.

 |          +------------+   |
 |  DIMM    | SPD EEPROM |   |
 |          +----||------+   |
{   ||||         ||           }
|   ||||         ||  DIMM X   |
|   ||||         ||  SOCKET   |
    ||||         ||
(DDR signals)  (SMBUS) Address 0xYY

The SPD EEPROM stores the electrical parameters of the DIMM, usually compacted into a maximum of 256 bytes. The SPD EEPROM is accessible over the SMBus, which is a variant of the two-wire I2C protocol. In I2C, there is one line for data and one for clock. The EEPROM has three additional pins (SA0-2) that are strapped per slot to give each DIMM a unique address in the range 0x5[0..7].

Each generation of DDR specifies its own SPD content layout, which can be found in the respective JEDEC documentation; alternatively, Wikipedia gives a good summary of each layout. Further, you can query the SPD data on GNU/Linux with the 'decode-dimms' command from the 'i2c-tools' package.
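
The addressing scheme and the first step of decoding can be sketched as follows. The address calculation follows the 0x5[0..7] range described above. The type codes are those stored in byte 2 of the JEDEC SPD layouts. The function names and the three example bytes are hypothetical, and actually reading the bytes over SMBus is hardware specific and not shown.

```python
SPD_BASE_ADDR = 0x50

# Byte 2 of the SPD identifies the DRAM type.
DRAM_TYPES = {0x07: "DDR", 0x08: "DDR2", 0x0B: "DDR3", 0x0C: "DDR4"}


def spd_address(sa2, sa1, sa0):
    """7-bit SMBus address of the SPD EEPROM for a given SA pin strapping."""
    return SPD_BASE_ADDR | (sa2 << 2) | (sa1 << 1) | sa0


def dram_type(spd_bytes):
    """Look up the DRAM type from an SPD dump (a bytes-like object)."""
    return DRAM_TYPES.get(spd_bytes[2], "unknown")


print(hex(spd_address(0, 0, 0)))   # 0x50, first slot
print(hex(spd_address(0, 1, 0)))   # 0x52
print(hex(spd_address(1, 1, 1)))   # 0x57, last possible slot

# Hypothetical first three bytes of an SPD dump; only byte 2 is decoded here.
print(dram_type(bytes([0x92, 0x10, 0x0B])))   # DDR3
```

Firmware typically checks the DRAM type byte first, because the meaning of every later byte depends on which generation's layout applies.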

Hardware specific configuration

Due to each target board having different physical electronic characteristics, the subsystem latencies will vary greatly and so each target must be considered with its own set of challenges. The steps required for memory initialization include the generic command sequencing according to the relevant JEDEC standard (DDR2/3/4), but the actual commands required vary from one memory controller to another due to hardware implementation.
