Threading and concurrency

The following sections explain how DiskSpd manages threads, I/O queuing, and CPU affinity. These settings control how I/O operations are dispatched and balanced across the system.

Thread models: per-target threads vs fixed threads

One thread can run a test against more than one target (file, physical drive, partition). The total number of threads can be set with the -F (fixed threads) parameter. For example, the following command uses two threads, each accessing five files:

diskspd -F2 -o2 file1 file2 file3 file4 file5

In this example, -F2 causes only two threads to be created. Each thread will access all of the files, with two outstanding operations per target (-o2). As a result, there are 5 x 2 = 10 outstanding operations per thread and 2 x 10 = 20 total operations.

By contrast, the -t (per-target threads) parameter creates sets of threads which only access a specific target.

diskspd -t2 -o2 file1 file2 file3 file4 file5

In this example, the -t2 parameter causes two threads per target to be started, for 5 x 2 = 10 total threads with two outstanding operations each. Again, this would produce 2 x 10 = 20 total operations.
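The arithmetic in the two examples above can be checked with a short sketch. The function names are illustrative and not part of DiskSpd, and the sketch assumes every thread keeps its full queue depth outstanding.

```python
def fixed_threads_totals(threads, outstanding_per_target, targets):
    """-F<threads> -o<n>: every thread accesses every target."""
    per_thread = outstanding_per_target * targets
    return {"threads": threads, "per_thread": per_thread,
            "total": threads * per_thread}


def per_target_threads_totals(threads_per_target, outstanding_per_thread, targets):
    """-t<threads> -o<n>: each thread accesses exactly one target."""
    threads = threads_per_target * targets
    return {"threads": threads, "per_thread": outstanding_per_thread,
            "total": threads * outstanding_per_thread}


# diskspd -F2 -o2 file1 ... file5
print(fixed_threads_totals(2, 2, 5))
# diskspd -t2 -o2 file1 ... file5
print(per_target_threads_totals(2, 2, 5))
```

Both configurations keep 20 operations in flight, but -F does so with 2 threads and -t with 10.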

The -O parameter can be used to specify the number of outstanding operations per fixed thread rather than per target. For example, the following command uses two threads, each accessing five files with a total of four requests outstanding per thread:

diskspd -F2 -O4 file1 file2 file3 file4 file5

In this example, a target file is chosen at random before each operation is issued.
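As a rough illustration of per-operation target selection with -O, the sketch below picks a target for each of the 2 x 4 = 8 operations initially outstanding under -F2 -O4. It assumes a uniform random choice, which the text does not specify, and pick_targets is a hypothetical helper rather than DiskSpd code.

```python
import random


def pick_targets(targets, operations, seed=None):
    """Choose one target per operation (assumed uniform, for illustration)."""
    rng = random.Random(seed)
    return [rng.choice(targets) for _ in range(operations)]


files = ["file1", "file2", "file3", "file4", "file5"]
threads, outstanding_per_thread = 2, 4  # -F2 -O4
picks = pick_targets(files, threads * outstanding_per_thread, seed=1)
print(picks)
```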

The -F parameter is especially useful for balancing the workload on multiprocessor (or multi-core) systems. You can set -F to the number of processors in the system. By default, the threads are affinitized in a round-robin manner, which distributes processing evenly across the CPUs and lets you test more devices simultaneously before you hit a CPU bottleneck.

The -F and -t parameters are mutually exclusive. Figure 1 and Figure 2 illustrate the differences between -t and -F.

Figure 1. Threads specified with the -F (-F3) fixed threads parameter

Figure 2. Threads specified with the -t (-t3) per-target threads parameter

Thread stride

By default, a target is accessed by only one thread, although that thread can have more than one overlapped I/O operation outstanding.

With the -t or -F parameters discussed in the previous section, multiple threads will access the same file. By default, threads performing sequential operations will all start from offset 0 (zero) of the target. Use the -T parameter to specify the offset between threads (thread stride) if necessary.

The -T parameter only applies to sequential I/O (-s) with more than one thread per target, and conflicts with -r and -si.

Figure 3 shows an example of access with a stride creating an interleaved pattern between multiple threads. Thread 1 accesses blocks 1, 4 and 7 of the file. Thread 2 accesses blocks 2, 5 and 8 and thread 3 accesses blocks 3, 6 and 9.

To get such a pattern, thread stride (-T) must be equal to block size (-b) and sequential stride size (-s) must equal the number of threads (in this case 3) times the block size. This is a case where specifying strides in units of blocks can be more concise. Either of the following commands will produce that pattern:

diskspd -t3 -T4k -b4k -s12k C:\testfile

or

diskspd -t3 -T1b -b4K -s3b c:\testfile
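The interleaved pattern can be reproduced with a few lines of arithmetic. This is a sketch only: it assumes a base offset of 0 and numbers blocks from 1 as the text does, and blocks_touched is an illustrative name.

```python
KIB = 1024


def blocks_touched(thread_index, thread_stride, seq_stride, block_size, count):
    """Return the 1-based block numbers a thread accesses, in order."""
    start = thread_index * thread_stride  # -T: offset between threads
    return [(start + i * seq_stride) // block_size + 1 for i in range(count)]


# -t3 -T4k -b4k -s12k
for t in range(3):
    print(f"Thread {t + 1}:", blocks_touched(t, 4 * KIB, 12 * KIB, 4 * KIB, 3))
```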

Figure 3. Accessing the file from multiple threads

The parameters used here are illustrated in Figure 4.

Figure 4. Parameters: base file offset (B), block size (b), stride size (s) and offset between threads (T)

In the previous example, while the pattern is suggestive, there is no interlock between the threads to maintain a strictly sequential pattern of access to the storage (see Test sequential I/O). Because of thread scheduling, the pattern can drift apart over time, with one or more threads falling behind or racing ahead of their peers.

A second use case for thread stride is to create multiple spatially separated sequential streams on a target:

diskspd -c3G -t3 -T1G -b4K -s c:\testfile

This pattern will create a 3GiB file and three threads, with each thread starting I/O at succeeding 1GiB intervals.

Thread 1: 0, 4KiB, 8KiB, …
Thread 2: 1GiB, 1GiB+4KiB, 1GiB+8KiB, …
Thread 3: 2GiB, 2GiB+4KiB, 2GiB+8KiB, …

Thread stride need not be a multiple of sequential strides (or vice versa). When the end of file is encountered, access wraps back to the beginning at an offset such that each thread will reproduce the same I/O offsets on its next sweep through the target. In the earlier examples each thread will loop back to 0 (zero). Consider the following counter-example:

diskspd -c3G -t3 -T13k -b4K -s c:\testfile

In this case, the second thread will loop back to offset 1K and then produce 5K, 9K, before returning to 13K and continuing through the file again.
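The wrap-around behavior described above can be modeled as follows. This is a sketch under stated assumptions, not DiskSpd's implementation: the base offset is 0, a thread wraps when the next block would not fit in the file, and it resumes at its start offset modulo the sequential stride. A 32KiB file stands in for the 3GiB one to keep the output short.

```python
KIB = 1024


def thread_offsets(thread_index, thread_stride, seq_stride,
                   block_size, file_size, count):
    """Return the first `count` I/O offsets for one thread, wrapping at EOF."""
    start = thread_index * thread_stride
    offset = start
    offsets = []
    for _ in range(count):
        if offset + block_size > file_size:
            # Assumed wrap rule: restart so the same offsets repeat each sweep.
            offset = start % seq_stride
        offsets.append(offset)
        offset += seq_stride
    return offsets


# Second thread (index 1) with -T13k -b4K -s on a scaled-down 32KiB file.
second = thread_offsets(1, 13 * KIB, 4 * KIB, 4 * KIB, 32 * KIB, 8)
print([o // KIB for o in second])
```

Under the same assumptions, a 1GiB thread stride gives a remainder of 0, so each thread wraps to offset 0 as in the earlier example.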

Number of outstanding I/O requests

The basic way to set the number of outstanding I/O requests is with the -o parameter; for example, -o32 specifies 32 I/O requests per thread per target. If -o1 is used and each thread accesses only one target (-F is not used, or only one target is specified, or -O1 is used), I/O is issued synchronously.

When the test starts, each thread begins issuing I/O up to the limit set by -o (or -O). As each completion arrives, another I/O is issued to replace it in the queue.

For sequential I/O, by default the next operation will be issued with respect to the most recent I/O operation started within the same thread. Figure 5 shows an example of this behavior with three outstanding I/Os per thread (-o3) and a sequential stride equal to the block size (-s). The next I/O operation will start at the offset immediately after I/O #3, which is marked with a dashed line.

Figure 5. Overlapped I/O (-o3)

The sequential I/O process can be explained with the following pseudo-code. Each thread has its own lastFileOffset variable.

UINT64 GetNextOffset()
{
    lastFileOffset += stride;
    return lastFileOffset;
}

This behavior changes with the -p parameter for parallel sequential I/Os. When used, the offset of the next I/O operation is calculated by using the offset of the I/O operation that has just finished instead of the most recent I/O operation that started. Figure 6 shows how the -p parameter changes the behavior with three I/Os per thread (-o3) and a sequential stride equal to the block size (-s).

Figure 6. I/O dispatch pattern of the -p parameter (at -o3)

In the figure above, the primed I/Os (marked with prime symbols) indicate the sequence of completion and dispatch, for example that I/O 2'' was issued on the completion of I/O 2'. At the time the diagram stops the three outstanding I/Os are 1', 2'' and 3.

With -p, the next sequential offset is calculated in the manner shown in the following pseudo-code, assuming there is enough data left in the file:

UINT64 GetNextOffset(struct IO *completedIO)
{
    return completedIO->startOffset + stride;
}

The -p option creates a very specific pattern, perhaps most suitable for cache stress, and its effect should be carefully considered before use in a test. The -p option is ignored if -r is specified and makes sense only with -o2 or greater.
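The two offset rules can be compared side by side in a small simulation. This is a sketch: it assumes the initial I/Os are placed at consecutive strides and issued in order, and the completion order is supplied by the caller as a list of queue slot indexes. None of these names come from DiskSpd.

```python
KIB = 1024


def simulate(initial_offsets, completed_slots, stride, parallel):
    """Return the offsets issued in response to each completion."""
    slots = list(initial_offsets)   # offset currently outstanding in each slot
    last_issued = slots[-1]         # most recently started I/O (default rule)
    issued = []
    for slot in completed_slots:
        if parallel:
            # -p: based on the I/O that just finished
            next_offset = slots[slot] + stride
        else:
            # default: based on the most recent I/O started by this thread
            last_issued += stride
            next_offset = last_issued
        slots[slot] = next_offset
        issued.append(next_offset)
    return issued


initial = [0, 4 * KIB, 8 * KIB]     # -o3 with -s equal to a 4KiB block
order = [0, 1, 1]                   # slot 0 completes, then slot 1 twice
print([o // KIB for o in simulate(initial, order, 4 * KIB, parallel=False)])
print([o // KIB for o in simulate(initial, order, 4 * KIB, parallel=True)])
```

With the default rule the thread keeps advancing past its furthest offset, while with -p each slot advances independently, so the same offsets can be issued more than once.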

Thread operation

Threads run their I/O independently, each with the number of outstanding I/O requests described in the previous section. Absent rate limits (-g), this means that an I/O operation is reissued as soon as it completes.

By default, a thread issuing more than one I/O request at a time will issue asynchronous I/O through an independent completion port. The thread behavior is shown in the following pseudo-code, where waitN is the GetQueuedCompletionStatusEx API.

thread {
    SetThreadIdealProcessor
    AffinitizeThread
    port = CreateIOCompletionPort
    WaitForStartSignal

    work = [queue depth]

    while (not end of test) {

        if (work not empty)
            workitem = dequeue(work)
            issueIO(workitem, port)

        if (not measuring latency and work not empty)
            loop

        if (measuring latency and work not empty)
            completions = waitN(port, 0ms)
        else
            completions = waitN(port, INFINITE)

        if (completions)
            updatestats(completions)
            insert(completions, work)
    }
}

With completion routines (-x), the asynchronous model changes so that I/O is reissued from the completion routine. In this case, the thread's primary work is done once it has issued all of the work items. The completion routines reissue the completed I/O until the end of the test is signaled.

thread {
    SetThreadIdealProcessor
    AffinitizeThread
    port = CreateIOCompletionPort
    WaitForStartSignal

    work = [queue depth]

    while (work)
        workitem = dequeue(work)
        issueIO(workitem, completionRoutine)

    WaitForEndOfTest
}

completionRoutine(workitem) {
    updatestats(workitem)

    if (not end of test)
        issueIO(workitem, completionRoutine)
}

When only a single I/O request is outstanding (-o1), DiskSpd does not use asynchronous I/O. Instead, operations are executed synchronously in a loop, as shown in the following pseudo-code:

SetThreadIdealProcessor
AffinitizeThread
WaitForStartSignal

while (not end of test) {
    status = issueIO(workitem)
    updatestats(workitem)
}

In all cases the sequencing of the test is controlled by the main thread, signalling the worker threads through a phase-of-test indication. The main thread of the I/O request generator works in the following manner:

OpenFiles
CreateThreads

phase = Warmup
SendStartSignal
Sleep(warmup)

phase = Measurement
Sleep(duration)

phase = Cooldown
Sleep(cooldown)

phase = End
WaitForThreadsToCleanUp
SendResultsToResultParser
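The phase sequencing above can be mimicked with ordinary threads. The sketch below is illustrative only: a short sleep stands in for a synchronous I/O, durations are in seconds and deliberately tiny, and it assumes that statistics are counted only during the measurement phase. All names are hypothetical.

```python
import threading
import time
from enum import Enum


class Phase(Enum):
    WARMUP = 1
    MEASUREMENT = 2
    COOLDOWN = 3
    END = 4


class Controller:
    def __init__(self):
        self.phase = Phase.WARMUP
        self.start = threading.Event()


def worker(ctl, counts, index):
    ctl.start.wait()                      # WaitForStartSignal
    while ctl.phase is not Phase.END:
        time.sleep(0.001)                 # stand-in for one synchronous I/O
        if ctl.phase is Phase.MEASUREMENT:
            counts[index] += 1            # assumed: count measured I/O only


def run_test(num_threads=2, warmup=0.01, duration=0.05, cooldown=0.01):
    ctl = Controller()
    counts = [0] * num_threads
    threads = [threading.Thread(target=worker, args=(ctl, counts, i))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    ctl.start.set()                       # SendStartSignal
    time.sleep(warmup)
    ctl.phase = Phase.MEASUREMENT
    time.sleep(duration)
    ctl.phase = Phase.COOLDOWN
    time.sleep(cooldown)
    ctl.phase = Phase.END
    for t in threads:                     # WaitForThreadsToCleanUp
        t.join()
    return counts


print(run_test())
```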

Think time and I/O bursts

An exception to continuous reissue is a specification of I/O dispatch scheduling in terms of per-thread per-target "think" time and I/O burst size. This can be specified with the combination of -i and -j parameters.

-i<count>: number of I/Os to issue per burst
-j<milliseconds>: number of milliseconds to pause between bursts

Ensure that there is sufficient outstanding I/O allowed (-o) to achieve the intended bursts. Storage latency may prevent the system from achieving the rates theoretically specified with these parameters.
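A quick way to reason about -i and -j is to compute the rate ceiling they imply. The sketch below assumes I/Os complete instantly, so real rates will be lower; the values -i10 and -j100 are arbitrary examples and the function name is illustrative.

```python
def burst_rate_upper_bound(burst_count, think_ms):
    """IOPS ceiling per thread per target for -i<burst_count> -j<think_ms>."""
    return burst_count * 1000 / think_ms


# -i10 -j100: ten I/Os per burst, 100 ms pause between bursts
print(burst_rate_upper_bound(10, 100))
```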

Rate limits

Another exception is rate limiting, specified with the -g parameter as a throughput limit per thread per target. By default, the value is in bytes per millisecond. With the i qualifier (-g<value>i), the value specifies IOPS of the given block size (-b).

The following example limits each thread to 1000 IOPS of 4KiB blocks:

diskspd -c1G -b4K -o32 -t1 -w50 -g1000i -d60 -Sh testfile.dat

The next example targets 80 bytes/millisecond. This is 80,000 bytes/second and at 8KiB/IO yields an effective target of ~9.8 IOPS.

diskspd -t1 -o1 -s8k -b8k -Sh -w100 -g80 c:\test1

By adding a second file, this doubles the total I/O target to 160,000 bytes/second, or ~19.5 IOPS.

diskspd -t1 -o1 -s8k -b8k -Sh -w100 -g80 c:\test1 c:\test2

Adding a second thread doubles the total I/O target again to 320,000 bytes/second, or ~39.1 IOPS.

diskspd -t2 -o1 -s8k -b8k -Sh -w100 -g80 c:\test1 c:\test2
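The three targets above follow from multiplying the per-thread, per-target limit by the number of threads and targets. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def total_target(bytes_per_ms, block_size, threads_per_target, targets):
    """Return (bytes/second, IOPS) implied by -g<bytes_per_ms>."""
    bytes_per_sec = bytes_per_ms * 1000 * threads_per_target * targets
    return bytes_per_sec, bytes_per_sec / block_size


BLOCK = 8 * 1024  # -b8k
for threads, targets in [(1, 1), (1, 2), (2, 2)]:
    rate, iops = total_target(80, BLOCK, threads, targets)
    print(f"-t{threads}, {targets} target(s): {rate} bytes/s, {iops:.1f} IOPS")
```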

The precision of rate limits can be affected by thread scheduling, total CPU consumption, instantaneous storage latency, and other factors. Longer total test times will generally converge on the requested rate limits. If rate limits aren't reached, consider providing additional outstanding I/Os (-o) to threads, dividing work across additional threads (-t or -F) or, as appropriate to test goals, adding targets.

Throughput limits cannot be specified when using completion routines (-x).

In general, effective use of rate limits may require some experimentation.

Completion routines

DiskSpd by default uses I/O completion ports to refill outstanding operation queues. However, completion routines can also be used. The -x parameter instructs DiskSpd to use I/O completion routines instead of I/O completion ports.

When using completion routines, the next I/O is dispatched from the completion routine as opposed to returning to a single, master loop as with I/O completion ports.

CPU affinity

Thread rescheduling between processors can produce inconsistent results during performance testing. For that reason, DiskSpd by default affinitizes all of its threads to CPU cores in a round-robin manner starting at logical CPU 0 in Processor Group 0, assigning one thread per CPU in that group and so forth for each subsequent Processor Group in the system. If there are more threads than logical CPUs, assignment returns to CPU 0 in Processor Group 0 and the process repeats until all threads are assigned.

This default behavior is the same as -ag and can be explicitly specified as such. This default affinity can be turned off by the -n parameter.

Advanced CPU affinity can be turned on by using the -a parameter with a specific processor group and CPU number ordering: g<group#>,<cpu#>,<cpu#>,... Threads will be affinitized to the specified CPUs in a round-robin manner until all threads are assigned.

- Multiple -a specifications can be made on the same command line; they accumulate in the order specified.
- Multiple g#,#[,#,…] specifications can be made within the same switch; the following are equivalent:
  - -ag0,0,1,2,g1,0,1,2
  - -ag0,0,1,2 -ag1,0,1,2
- The number of CPUs can differ from the number of threads.

Group number can be omitted on small, single group systems.
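The round-robin assignment for an explicit -a specification can be sketched as follows. This is not DiskSpd's parser: it handles only the g<group#>,<cpu#>,... form shown above, treats a missing group as group 0, and uses hypothetical function names.

```python
def parse_affinity(spec):
    """Parse e.g. 'g0,0,1,2,g1,0,1,2' into a list of (group, cpu) pairs."""
    cpus = []
    group = 0  # assumed default when the group number is omitted
    for token in spec.split(","):
        if token.startswith("g"):
            group = int(token[1:])
        else:
            cpus.append((group, int(token)))
    return cpus


def assign_threads(num_threads, cpus):
    """Assign threads to the listed CPUs in round-robin order."""
    return [cpus[i % len(cpus)] for i in range(num_threads)]


combined = parse_affinity("g0,0,1,2,g1,0,1,2")
accumulated = parse_affinity("g0,0,1,2") + parse_affinity("g1,0,1,2")
print(combined == accumulated)
print(assign_threads(4, parse_affinity("g0,0,1,2")))
```

With four threads and three CPUs, the fourth thread wraps back to the first CPU in the list.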

The processor group topology of the system (including Socket, NUMA, Core, and Power Efficiency Class (big/little core) information), along with the active/inactive processor mask, is provided in the <System> element of the XML output results. In text results, the hierarchy of processor topology elements is displayed when more than one is present (e.g. multi-group, multi-socket systems) alongside the CPU utilization for the test.

NOTE: Efficiency classes (big/little cores) can have a major impact on results. When working on heterogeneous systems, be aware of core properties in combination with thread affinity rules.
