#### OpenCXD: An Open Real-Device-Guided Hybrid Evaluation Framework for CXL-SSDs

<u>Hyunsun Chung<sup>1</sup></u>, Junhyeok Park<sup>1</sup>, Taewan Noh<sup>1</sup>, Seonghoon Ahn<sup>1</sup>, Kihwan Kim<sup>1</sup>, Ming Zhao<sup>2</sup>, Youngjae Kim<sup>1</sup>







The 33<sup>rd</sup> International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Paris, France, October 21-23, 2025

- Background
- Motivation
- Design
- Evaluation
- Conclusion



# **Background**

# **CXL: A new memory interconnect**



- The surge of memory-heavy AI and LLM workloads has posed a new challenge: providing more memory capacity.
- **CXL**, a new memory interconnect, has emerged to answer the challenge.
- CXL provides additional memory space to host systems by exposing the on-board DRAM of PCIe-attached CXL devices.





#### **CXL Devices: DRAM based**



- Current forms of CXL mainly focus on memory expansion in the form of DRAM-backed CXL memory devices.
- Industry solutions include Samsung's CMM-D and SK hynix's CMM-DDR5.





#### **CXL Devices: the advent of CXL-SSDs**



- Interest in utilizing NAND flash-based SSDs with DRAM as memory-semantic CXL devices has grown in academia and industry alike.
  - High capacity-to-cost ratio, NAND reuse, etc.
  - Example of industry interest includes Samsung's CMM-H<sub>[1]</sub>.





# SkyByte[1]: SOTA CXL-SSD architecture





Fig. 2: Write/read flows of the state-of-the-art CXL-SSD [11], comprising a *Write Log*, *Data Cache*, and *Log Index*.

- Proposes a combination of a write log and data cache in SSD DRAM.
- Write log takes incoming write I/O of cacheline sizes (64B).
- Data Cache acts as a NAND page cache to achieve faster memory read I/O.
- Together, they enable both higher I/O performance and memory-semantic I/O.

- SkyByte also proposes a context-switch policy in which a CXL-SSD DRAM miss triggers a system context switch to cover the long NAND I/O time.
- SkyByte sets a 2µs threshold for the context switch to be triggered.
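
The write and read flows described above can be sketched as a toy model. This is a minimal illustration, not SkyByte's implementation: the 4 KiB page size, the address-to-offset mapping, the class and function names, and the reading of the 2µs threshold as a strict "greater than" test are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field

CACHELINE = 64         # bytes; cacheline-sized writes, as stated on the slide
PAGE_SIZE = 4096       # bytes; ASSUMED page size, for illustration only
CTX_SWITCH_NS = 2_000  # 2 µs threshold mentioned on the slide


@dataclass
class ToyCxlSsd:
    """Toy model of SSD DRAM holding a write log and a data cache (hypothetical)."""
    write_log: dict = field(default_factory=dict)   # (lpn, cacheline offset) -> 64 B line
    data_cache: dict = field(default_factory=dict)  # lpn -> cached page

    @staticmethod
    def split(addr: int) -> tuple[int, int]:
        """Map a byte address to (logical page number, cacheline offset). Assumed mapping."""
        return addr // PAGE_SIZE, (addr % PAGE_SIZE) // CACHELINE

    def store(self, addr: int, line: bytes) -> str:
        """Record a cacheline-sized write in the write log."""
        if len(line) != CACHELINE:
            raise ValueError("stores are cacheline-sized in this sketch")
        self.write_log[self.split(addr)] = line
        return "log_write"

    def load(self, addr: int) -> str:
        """Serve a read from the log, then the data cache, else treat it as a miss."""
        lpn, offset = self.split(addr)
        if (lpn, offset) in self.write_log:
            return "log_hit"
        if lpn in self.data_cache:
            return "cache_hit"
        # Miss: a real device would read the page from NAND here, then cache it.
        self.data_cache[lpn] = bytearray(PAGE_SIZE)
        return "cache_miss"


def should_context_switch(expected_latency_ns: float) -> bool:
    """Assumed reading of the policy: switch when latency exceeds the threshold."""
    return expected_latency_ns > CTX_SWITCH_NS
```

For example, a store to byte address 4224 followed by a load of address 4226 resolves to the same (page, cacheline) pair and is served from the write log in this model.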

#### Status quo of current CXL-SSD evaluation



Due to the lack of CXL-SSD hardware, prior works<sub>[1-3]</sub> adopt an evaluation framework consisting of an x86 system simulator and an SSD simulator.

#### x86 Simulator<sub>[4]</sub>:



Simulates the host-side CPU and memory operations.

#### SSD Simulator<sub>[5]</sub>:



Simulates the CXL-SSD controller ops (write log, data cache, flush, NAND I/O, etc.)

<sup>[1]:</sup> SkyByte: Architecting an Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design, Haoyang Zhang et al. (HPCA 2025)

<sup>[2]:</sup> RomeFS: A CXL-SSD Aware File System Exploiting Synergy of Memory-Block Dual Paths, Yekang Zhan et al. (SoCC 2024)

<sup>[3]:</sup> ByteFS: System Support for (CXL-based) Memory-Semantic Solid-State Drives, Shaobo Li et al. (ASPLOS 2025)

<sup>[4]:</sup> gem5: The gem5 simulator system

<sup>[5]:</sup> Amber: Enabling Precise Full-System Simulation with Detailed Modeling of All SSD Resources, Donghyun Gouk et al. (MICRO 2018)

→ This hybrid approach was acceptable for early-stage exploration of CXL-SSD concepts.




# **Motivation**

#### Limitations of status quo evaluation



Moving from storage to memory brings new demands to the evaluation platforms of CXL-SSDs.

- Reflecting minute latency variations of hardware components (DRAM, NAND) becomes essential for portraying memory-semantic SSDs.




- Evaluating the impact of new CXL-SSD features (write log, data cache) on overall performance is also critical.




Such requirements are difficult to meet with SSD simulation, which relies on **static parameters**.

→ How does a parameter-based SSD simulator compare to a real SSD device in terms of latency evaluation?

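[Figure: Write Log entries are indexed by (LPN, Cacheline Offset), where the Cacheline Offset is derived from MemAddr mod PageSize; each Data Cache entry corresponds to one LPN.]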

# Limitations of Software-Driven SSD Simulation: SimpleSSD vs OpenSSD



SimpleSSD provides NAND access latency as a single parameterized value.



Fig. 3: NAND read/program I/O times of two different types of NAND (a) and (b) with iodepth=1.

Fig. 4: NAND read/program I/O times of two different types of NAND (a) and (b) with iodepth=8. The data is zoomed in to show the 6000-7000 range for clarity.


However, as Figs. 3 and 4 show, real NAND access latency cannot be represented by a single unified value.








Depending on NAND vendor and workload, unexpected latency spikes can also be observed.




# Limitations of Software-Driven SSD Simulation: NAND Access Latency Breakdown



- The growing impact of the low-level NAND flash controller and firmware can be seen under more stressful workloads.



Fig. 5: Breakdown of NAND (b)'s average $t_R$ and $t_{Prog}$.

# Limitations of Software-Driven SSD Simulation: NAND (SK Hynix, Toshiba) Access Latency Pattern CDF



- The CDFs show that NAND of similar specifications from different vendors exhibits differing latency characteristics that cannot be represented by a single fixed value.



Fig. 6: NAND I/O latency Cumulative Distribution Function (CDF) of two different types of NAND in different workloads (a) randread, iodepth=1, (b) randwrite, iodepth=1, (c) randread, iodepth=8.
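
To make the CDF argument concrete, the sketch below compares an empirical latency distribution with a fixed-value model that has the same mean. The latency samples are made up for illustration and are not measurements from the OpenSSD platform; the helper names are hypothetical.

```python
import math
from bisect import bisect_right


def empirical_cdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    xs = sorted(samples)
    n = len(xs)

    def cdf(x):
        return bisect_right(xs, x) / n

    return cdf


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    xs = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(xs)))
    return xs[rank - 1]


# ILLUSTRATIVE latencies in microseconds (synthetic, not measured data).
measured = [60, 62, 65, 70, 70, 75, 80, 90, 150, 400]

# A single-parameter model: every request takes the mean latency.
mean_latency = sum(measured) / len(measured)
fixed = [mean_latency] * len(measured)

if __name__ == "__main__":
    for name, data in (("measured", measured), ("fixed", fixed)):
        print(name, "p50 =", percentile(data, 50), "p99 =", percentile(data, 99))
```

Both series share the same mean, but the fixed-value model collapses the CDF into a single step and hides the tail that per-request memory accesses would experience.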

#### **Takeaway from motivation experiments**



- NAND I/O latency varies widely with factors outside the NAND specifications → A single latency value is not enough!
  - Even NAND with similar specifications from different vendors shows differing performance characteristics.
- Such approximations are adequate for average performance metrics of storage devices but fall short in portraying the per-request performance required for memory devices.

How can we improve evaluation platforms to more accurately portray CXL-SSD performance on a memory-semantic level?



# Proposed Solution: OpenCXD

# Overview of *OpenCXD*



#### x86 simulator

- Provides cycle-accurate simulation of the entire memory hierarchy.
- Replays memory traces from the target workload.

#### SSD platform

- Complete SSD firmware, including the NVMe interface, FTL, and NAND I/O scheduling.
- Current implementation replicates the internal architecture of SkyByte.



Fig. 7: Architecture of OpenCXD and its execution flow.

# Custom NVMe commands in *OpenCXD*



| dword | description                  |
|-------|------------------------------|
| 0     | CID / Reserved / opcode      |
| 1     | Namespace ID                 |
| 2-3   | Reserved                     |
| 4-5   | Metadata Pointer             |
| 6-7   | PRP entry #1 (not used)      |
| 8-9   | PRP entry #2 (not used)      |
| 10    | Reserved                     |
| 11    | CXL Memory Address           |
| 12    | Load or Store                |
| 13-15 | Reserved                     |

(a) NVMe Command (SQE)



| dword | description                  |
|-------|------------------------------|
| 0     | R / SCT / Op Overhead / R    |
| 1     | Device Op Latency            |
| 2     | SQ ID / CQ ID                |
| 3     | Status field / P / CID       |

(b) NVMe CQE

Fig. 8: NVMe command and CQE for OPENCXD.

- The Submission Queue Entry (SQE) includes the memory operation type (load/store) and the memory address of that operation.
- The Completion Queue Entry (CQE) returns the operation latency and overhead values that are required for simulation integration.
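
The following sketch packs a 64-byte SQE and parses a 16-byte CQE along the lines of Fig. 8. It is an illustration under stated assumptions: the opcode value, the load/store encoding, and the function names are hypothetical, and the address is treated as 32 bits only because the figure assigns it a single dword. The bit positions inside CQE dword 0 are not given in the source, so that dword is returned raw.

```python
import struct

OPC_CXL_MEM = 0xC0   # HYPOTHETICAL vendor-specific opcode, not from the source
LOAD, STORE = 0, 1   # HYPOTHETICAL encoding of dword 12


def build_sqe(cid: int, nsid: int, addr: int, op: int) -> bytes:
    """Pack a 16-dword (64-byte) little-endian submission queue entry."""
    dw = [0] * 16
    dw[0] = ((cid & 0xFFFF) << 16) | OPC_CXL_MEM  # CID in bits 31:16, opcode in bits 7:0
    dw[1] = nsid & 0xFFFFFFFF                     # Namespace ID
    dw[11] = addr & 0xFFFFFFFF                    # CXL memory address (one dword in Fig. 8)
    dw[12] = op                                   # load or store
    return struct.pack("<16I", *dw)


def parse_cqe(raw: bytes) -> dict:
    """Unpack a 4-dword (16-byte) completion queue entry."""
    dw0, dw1, dw2, dw3 = struct.unpack("<4I", raw)
    return {
        "dw0_raw": dw0,            # holds Op Overhead; field positions unknown here
        "device_op_latency": dw1,  # unit assumed to be chosen by the firmware
        "dw2_raw": dw2,
        "cid": dw3 & 0xFFFF,       # standard NVMe: CID in bits 15:0
        "phase": (dw3 >> 16) & 1,  # standard NVMe: phase tag in bit 16
        "status": dw3 >> 17,       # standard NVMe: status field in bits 31:17
    }
```

On Linux, a command buffer like this would typically be submitted through the NVMe passthrough ioctl; that step is omitted here because it requires a device.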



- ① Memory access detection
  - LLC miss detection: checks whether the missing address falls within the memory-mapped region of the CXL-SSD.
- ② Issuing custom NVMe command
  - Encapsulates the memory request into a newly defined NVMe command for OpenCXD.
  - The NVMe command is issued to the SSD platform via the NVMe passthrough interface.
- ③ Perform CXL-SSD operations
  - Fetches the NVMe command.
  - Performs the corresponding CXL-SSD operations (e.g., Write Logging).
- ④ Return timing information
  - Returns the total in-device processing time to the host by using a reserved field of the NVMe Completion Queue Entry (CQE).
- ⑤ Integration with simulation
  - Extracts the device-side latency from the received CQE.
  - Subtracts the NVMe interface overhead.
  - Adds the CXL.mem protocol overhead used in SkyByte (~40 ns), a value generally regarded as consistent and reasonable.
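
The first and last steps of the execution flow reduce to simple arithmetic, sketched below. The function names, the region parameters, and the assumption that all values are in nanoseconds are illustrative; only the ~40 ns CXL.mem overhead comes from the slides.

```python
CXL_MEM_OVERHEAD_NS = 40  # approximate CXL.mem protocol overhead cited from SkyByte


def in_cxl_region(addr: int, base: int, size: int) -> bool:
    """Step 1: does the missing address fall inside the CXL-SSD's mapped range?"""
    return base <= addr < base + size


def simulated_latency_ns(device_ns: int, nvme_overhead_ns: int,
                         cxl_ns: int = CXL_MEM_OVERHEAD_NS) -> int:
    """Step 5: remove the NVMe interface overhead, then add the CXL.mem overhead."""
    if nvme_overhead_ns > device_ns:
        raise ValueError("overhead cannot exceed the total device-side latency")
    return device_ns - nvme_overhead_ns + cxl_ns


if __name__ == "__main__":
    # Hypothetical numbers: 1000 ns reported by the device, 300 ns of NVMe overhead.
    print(simulated_latency_ns(1000, 300))
```

With the hypothetical inputs shown, the latency charged to the simulated memory access would be 1000 - 300 + 40 = 740 ns.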

# Timing Flow integration in *OpenCXD*





Fig. 8: Timing flow of a CXL memory request in OpenCXD.

- 1) Pause the x86 simulator during each CXL.mem access.
- 2) Incorporate the latency measured from firmware execution on the SSD. This allows modeling of CXL.mem timing behavior while removing NVMe transport overhead from the equation.



# **Evaluation**

## **Evaluation Setup**



Testbed:

#### DaisyPlus OpenSSD Platform<sub>[1]</sub>



TABLE II: Specifications of the OpenSSD platform.

| Component    | Specification                                            |
|--------------|----------------------------------------------------------|
| SoC          | Xilinx Zynq UltraScale+ ZU17EG, with ARM Cortex-A53 Core |
| NAND Module  | 256GB, 4 Channel & 8 Way                                 |
| Interconnect | PCIe Gen3 ×16 End-Points                                 |
| DRAM         | 2GB LPDDR4 @ 2400MHz                                     |

TABLE III: Specifications of the host system.

| Component | Specification                                        |
|-----------|------------------------------------------------------|
| CPU       | Intel(R) Core(TM) i7-14700K CPU @ 5.60GHz (28 cores) |
| Memory    | 32GB DDR5                                            |
| OS        | Ubuntu 24.04.2, Linux Kernel 6.11.0                  |




- Workloads: 7 benchmarks (bc, bfs-dense, dlrm, radix, srad, tpcc, ycsb), with traces taken from SkyByte's artifact.

**TABLE I:** Benchmarks used in our experiments.

| Category         | Suite         | Name      | Memory Footprint | Write Ratio | LLC MPKI |
|------------------|---------------|-----------|------------------|-------------|----------|
| Graph Processing | Rodinia [17]  | bfs-dense | 9.13GB           | 25%         | 122.9    |
| Graph Processing | GAP [13]      | bc        | 8.18GB           | 11%         | 39.4     |
| HPC              | Splashv3 [51] | radix     | 9.60GB           | 29%         | 7.1      |
| Image Processing | Rodinia [17]  | srad      | 8.16GB           | 24%         | 7.5      |
| Database         | WHISPER [45]  | ycsb      | 9.61GB           | 5.0%        | 92.2     |
| Database         | WHISPER [45]  | tpcc      | 15.77GB          | 36%         | 1.0      |
| Machine Learning | DLRM [46]     | dlrm      | 12.35GB          | 32%         | 5.1      |

- The evaluation compares results from OpenCXD (which integrates real SSD hardware) against the SkyByte artifact (which uses SSD simulation).

## Evaluation: CXL-SSD operation latency



- OpenCXD shows varying latencies across all workloads.
  - In SkyByte, log writes and cache hits always take 712 ns and 640 ns, respectively.




- Some cache-hit latencies even exceed the 2µs threshold set in SkyByte, showing the effect of DRAM latency spikes.



#### **Evaluation: Cache miss latencies**



- Because it accounts for NAND controller and firmware overhead, OpenCXD shows a 2.4× higher average latency than SkyByte across all benchmarks.






→ OpenCXD reveals higher sensitivity to DRAM performance and latency spikes, differing based on each workload.


# Evaluation: Latency spread analysis





Histograms clearly show the skewed latency distribution in SkyByte.

87.2% of requests in srad and 94.3% in YCSB return the same fixed latency value of 99.72 µs.







Fig. 11: Histograms of NAND read I/O latency during cache misses for srad ((a): OPENCXD, (b): SkyByte) and ycsb ((c): OPENCXD, (d): SkyByte).
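
A share such as "87.2% of requests return the same value" can be computed from a latency trace as sketched below. The sample trace is synthetic and only illustrates the calculation; the function name and the two-decimal rounding are assumptions.

```python
from collections import Counter


def modal_share(latencies_us, ndigits=2):
    """Return (most common latency, fraction of requests that returned it)."""
    if not latencies_us:
        raise ValueError("need at least one latency sample")
    rounded = [round(x, ndigits) for x in latencies_us]
    value, count = Counter(rounded).most_common(1)[0]
    return value, count / len(rounded)


if __name__ == "__main__":
    # SYNTHETIC trace: nine requests at one fixed value, one outlier.
    trace = [99.72] * 9 + [130.5]
    print(modal_share(trace))
```

A share close to 1.0 indicates that the platform is effectively returning a parameterized constant, while a real device would spread requests across many histogram bins.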




→ OpenCXD enables a more accurate study of CXL-SSD performance with a varied spread of latencies.

# **Evaluation: CPU Cycles Comparison**



- Overall, OpenCXD required more CPU cycles than SkyByte to process one million memory accesses.
  - This is because OpenCXD reflects overall higher I/O latencies than SkyByte.
  - More threads are required to hide read latency through context switching in OpenCXD.
  - The results highlight the need for additional optimizations to effectively hide NAND I/O latency.




→ Evaluations with *OpenCXD* reveal that further CXL-SSD optimization work is needed beyond the current state of the art.



# Conclusion

#### Conclusion



- SSD simulation in an evaluation platform for CXL-SSDs is inadequate.
  - Current SSD simulators are designed for storage devices, not memory-semantic devices.
- OpenCXD proposes a new evaluation method that incorporates real SSD hardware with CXL-SSD firmware in simulation results.
- OpenCXD demonstrates the effectiveness of such integration of real hardware in simulations.
  - More nuanced SSD and DRAM latency incorporation
  - Critical insights into CXL-SSD optimization requirements



# Thank you!

#### Contact

- Hyunsun Chung / hchung1652@sogang.ac.kr
- Junhyeok Park / junttang@sogang.ac.kr
- Youngjae Kim / youkim@sogang.ac.kr
- Data-Intensive Computing & AI Systems Laboratory https://discos.sogang.ac.kr/

Camera-ready paper: can be found on Google Scholar



#### OPENCXD: An Open Real-Device-Guided Hybrid Evaluation Framework for CXL-SSDs

Hyunsun Chung<sup>1,\*</sup>, Junhyeok Park<sup>1,\*</sup>, Taewan Noh<sup>1</sup>, Seonghoon Ahn<sup>1</sup>, Kihwan Kim<sup>1</sup>, Ming Zhao<sup>2</sup>, Youngjae Kim<sup>1,†</sup>

<sup>1</sup>Sogang University, Seoul, Republic of Korea, <sup>2</sup>Arizona State University, Tempe, AZ, USA

*Abstract*: The advent of Compute Express Link (CXL) enables [...] approximate internal flash behaviors. While effective for early-stage exploration, this approach cannot faithfully model firmware-level [...] and low-level storage dynamics critical to CXL-SSD performance. In this paper, we present OPENCXD, a real-device-guided hybrid evaluation framework that bridges the gap between simulation and hardware. OPENCXD integrates a cycle-accurate CXL.mem simulator on the host side with a physical OpenSSD platform running real firmware. This enables in-situ [...] Through these contributions, OPENCXD reflects device-level phenomena unobservable in simulation-only setups, providing insights for future firmware design tailored to CXL-SSDs.

#### I. INTRODUCTION

The growing scale of modern deep learning models and data analytics applications has led to memory footprints reaching tens of terabytes [1]-[3], far beyond the limits of traditional DRAM installations. This widening gap between demand and capacity has resurrected the infamous memory wall [4], in which memory bandwidth and size become critical bottlenecks to system performance. To mitigate this issue, researchers have begun exploring memory expansion techniques that repurpose alternative technologies as additional memory [5]-[7], paving the way for new architectural paradigms.

Among these, Compute Express Link (CXL) [8] has emerged as a promising enabler of such memory expansion. CXL is a high-bandwidth, cache-coherent interconnect built on top of the PCI Express (PCIe) [9] infrastructure. By using CXL, large-capacity PCIe [...] attached directly to the host system memory space, creating a memory pool that augments or disaggregates DRAM [10]. Notably, CXL's memory semantics (CXL.mem) support byte-addressable access to device memory, meaning a CPU can read or write an SSD's onboard DRAM buffer with ordinary load and store instructions. This eliminates the need [...] overhead and access latency while enabling a unified, tiered [...] via the block interface (e.g., NVMe [12]).

The key challenge for CXL-SSDs is how to architect and evaluate these devices effectively, maximizing their performance potential while addressing inherent latency trade-offs. However, this evaluation remains difficult in practice, due to [...] protocol on SSDs. To overcome this, recent research has adopted [...] (e.g., MacSim [13], Gem5 [14]) extended with CXL.mem semantics and SSD simulators (e.g., SimpleSSD [15], FlashSim [16]) that model internal flash behaviors such as address translation and I/O scheduling. This methodology enables early-stage design exploration and has been widely used in prior work [17]-[19].

Notably, the CXL.mem interface overhead itself has been characterized in prior studies [20], [21], and shown to be relatively consistent and bounded [10]. As such, injecting this CXL interface time overhead into x86 simulations as a parameter is generally considered a reasonable approach for modeling host-side access costs. However, the same cannot be said for modeling device-side behavior. Replacing a real SSD with a simulator introduces significant challenges in capturing the full complexity of CXL-SSD-specific firmware logic and storage interactions, which are limitations that fundamentally [...]

[...] First, since CXL-SSDs function as memory rather than storage, they must handle fine-grained, cacheline-level memory accesses, making them highly sensitive to device-side performance fluctuations. However, simulators typically rely [...] Second, CXL-SSDs introduce new firmware-managed mechanisms, such as write logging and log compaction [11], that do not exist in conventional SSDs. Simulating these new

\*They are co-first authors and have contributed equally.