# Performance analysis for CPU-based deconvolution code
Call plf.SetThreadFileName("threads.txt") (or similar) to make the multithreaded C code dump information about how long each work chunk takes to run. This notebook examines the information in that file and compares it between runs (e.g. when running with different numbers of threads).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import tifffile
import sys, time, os, csv, warnings, glob
import cProfile, pstats
from jutils import tqdm_alias as tqdm

import psfmatrix, lfimage
import projector, lfdeconv
import special_fftconvolve as special
import jutils as util
import py_light_field as plf
import lf_performance_analysis as perf

In [None]:
def LoadThreadInfo(path):
    # Load a file (e.g. "threads.txt") saved after a previous projection operation
    rows = []
    with open(path) as f:
        cf = csv.reader(f, delimiter='\t')
        for row in cf:
            rows.append(np.array(row).astype(np.double))
    rows = np.array(rows)
    tStart = rows[0,2]
    tEnd = np.max(rows[:,3])
    dt = rows[:,3]-rows[:,2]
    mt = (rows[:,5]-rows[:,4]) + (rows[:,7]-rows[:,6])
    return (rows, tStart, tEnd, dt, mt)

col = ['red', 'yellow', 'green', 'blue']
lab = ['FFT', 'transpose', 'mirror', 'convolve', 'convolve (1st)', 'convolve (2nd)', 'convolve (mutex)']
numWorkTypes = 4
workNames = { 0:'fft', 1:'transpose', 2:'mirror', 3:'convolve' }
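
For reference, here is a minimal sketch of the file layout that LoadThreadInfo appears to expect. The column meanings are inferred from how the rest of this notebook indexes each row, not from the C source, so treat them as an assumption: column 0 is the thread index, column 1 the work type, columns 2 and 3 the start and end time of the chunk, and (for convolve rows) columns 4-5 and 6-7 the start and end of up to two mutex waits. For FFT rows the later columns seem to hold intermediate checkpoints instead, so the wait time is only meaningful for convolve rows. The helper names and the two-row example file are hypothetical and the numbers are made up.

```python
import os
import tempfile
import numpy as np

# Column meanings inferred from how this notebook indexes each row
# (an assumption, not taken from the C source).
COL_THREAD, COL_WORK, COL_START, COL_END = 0, 1, 2, 3
COL_WAIT1_START, COL_WAIT1_END, COL_WAIT2_START, COL_WAIT2_END = 4, 5, 6, 7

def load_thread_rows(path):
    # Tab-separated numeric file -> 2D float array, one row per work chunk
    return np.loadtxt(path, delimiter='\t', ndmin=2)

def chunk_and_wait_times(rows):
    # Duration of each chunk, and total time spent in the two wait intervals
    dt = rows[:, COL_END] - rows[:, COL_START]
    mt = ((rows[:, COL_WAIT1_END] - rows[:, COL_WAIT1_START])
          + (rows[:, COL_WAIT2_END] - rows[:, COL_WAIT2_START]))
    return dt, mt

# Synthetic two-row file (made-up numbers, purely to exercise the parsing)
example = np.array([[0, 3, 10.0, 10.5, 10.1, 10.2, 0.0, 0.0],
                    [1, 0, 10.0, 10.3, 0.0, 0.0, 0.0, 0.0]])
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, 'threads_example.txt')
    np.savetxt(example_path, example, delimiter='\t')
    example_rows = load_thread_rows(example_path)
example_dt, example_mt = chunk_and_wait_times(example_rows)
print(example_dt, example_mt)
```

Using np.loadtxt here is just a shorter alternative to the csv loop above; both should give the same array for a purely numeric file.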

## Memory bandwidth benchmarks
### Information on my mac pro
RAM is 1066 MHz DDR3, which according to Wikipedia should give 68GB/s(!). However, reported bandwidths for the Mac Pro seem to be about 2GB/sec/core. I am not sure what the real limiting factor is here, but the reported bandwidth seems more plausible!

Source: https://en.wikipedia.org/wiki/Mac_Pro#Memory says up to 16GB/s total for early mac pros.

Interestingly though, https://macperformanceguide.com/MacPro2019-MemoryBandwidth.html quotes 2.5GB/s/core for the 2019 mac pro, which is not actually that much faster.

### My performance results
- My read performance seems pretty close to that quoted 2GB/sec/core.
- My read performance falls with number of threads when I have two independent reads going on simultaneously - but for a single thread it seems to be almost unchanged from single-read performance. How on earth can I explain all this!?
- My write performance is perplexingly *high* (but falling with number of threads). I can't explain that. I suppose it could be something like the bandwidth limitation is between the L3 cache (per processor) and the RAM? The speed with 8 threads is 4.4x slower than with 1 thread, and sits a bit below the implied maximum bandwidth.
- Single-thread read-write is *faster* than single-thread read! That makes no sense! Unless, I suppose, the arithmetic is causing me a problem somehow (messing up the pipeline, perhaps??). **Should think about this one**
- In-place increment is even faster (which feels plausible compared to single-thread read-write, even if I can't put my finger on exactly what this would change). It is close to write-only performance, which perhaps makes sense if it's limited by the same factor ultimately.
- It is utterly bizarre that single-thread performance for CalculateRow is actually **higher** than single-thread read performance!!
- I don't understand why calculate-row performance falls with number of threads. It falls by a factor of about 2 (from 1->8 threads). **I should think more** about why that could be. It seems as if this might be the underlying explanation for why I am not getting a greater speed boost overall from multithreading. 
- CalculateRow is two reads and an increment. 

Remember that, even if I don't understand exactly what is going on, I can still benchmark performance for various "batched" implementations of CalculateRow and observe empirically how their performance varies.

In [None]:
timeHistory = dict()

In [None]:
# Run memory benchmarks
bmName = ['read', '2xread', 'write', 'read/write', 'increment', 'calc-row', 'calc-row-dummy', 'calc-row-2']
threadsToUse = [1, 2, 4, 6, 8]

for numThreads in threadsToUse:
    # Sizes <1e5 elements have too much overhead (thread setup etc?), and the numbers don't make sense.
    problemSizes = np.array([1e5, 1e6, 5e6, 1e7, 2e7, 4e7, 6e7])
    #problemSizes = np.array([1e7, 2e7])
    #np.array([1e4, 1e5, 2e5, 5e5, 1e6, 5e6, 1e7, 2e7])

    bms = np.arange(len(bmName))
    times = np.zeros((problemSizes.shape[0], bms.shape[0]))
    t0 = time.time()
    for p in range(len(problemSizes)):
        problemSize = problemSizes[p]
        dts = plf.MemoryBenchmark(numThreads, int(problemSize))
        times[p] = np.array(dts)
    print('For info: calculating benchmarks took %.1fs (x%d)' % (time.time()-t0, numThreads))
    timeHistory[numThreads] = times.copy()

In [None]:
# Predictions (1 and 8 threads) for calculate-row-dummy benchmark, based on measurements for simpler benchmarks.
# The 8-thread prediction from the dual-read benchmark is actually close to reality, 
#  although the 1-thread prediction is half the reality!
# That seems to come back to the issue where a single thread is over-performing as far as
#  the increment (and dual-read) benchmarks go.

# Naive predictions from 2x single-read + 1x increment
print(1/(2/1800+1/4500))
print(1/(2/1800+1/1400))
# Predictions from dual-read + 1x increment
print(1/(1/1800+1/4500))
print(1/(1/1250+1/1400))
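
The predictions above all use the same model: if each element needs several memory operations that run one after another, the time per MB is the sum of the individual times per MB, so the combined rate is the reciprocal of that sum. A small helper makes the model explicit. The function name is hypothetical, and the rates are simply the MB/sec figures already used in the cell above.

```python
def combined_rate(stages):
    """Combined throughput for operations performed one after another.

    stages is a list of (count, rate) pairs: 'count' passes, each at 'rate' MB/sec.
    The time per MB adds up across stages, so the rates combine harmonically.
    """
    return 1.0 / sum(count / rate for count, rate in stages)

# Naive predictions from 2x single-read + 1x increment (1 and 8 threads)
naive_1 = combined_rate([(2, 1800), (1, 4500)])
naive_8 = combined_rate([(2, 1800), (1, 1400)])
# Predictions from dual-read + 1x increment (1 and 8 threads)
dual_1 = combined_rate([(1, 1800), (1, 4500)])
dual_8 = combined_rate([(1, 1250), (1, 1400)])
print(naive_1, naive_8, dual_1, dual_8)
```

This model assumes the operations do not overlap at all, which may be part of why the single-thread prediction is so far from the measurement.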

In [None]:
for b in range(len(bms)):
    plt.figure(figsize=(15,6))
    plt.title(bmName[b])
    for t in threadsToUse:
        _times = timeHistory[t]
        bm = bms[b]
        elementSize = 8   # Bytes per element
        # Note: these are megabytes (not elements), to match the y axis label
        mBytesProcessed = problemSizes*elementSize/1e6
        mBytesPerSec = mBytesProcessed / _times[:,b]
        plt.plot(problemSizes/1e6, mBytesPerSec, label='%s[x%d]'%(bmName[bm], t))
    plt.ylabel('MB/sec/thread')
    plt.xlabel('Melements')
    plt.ylim(0,None)
    plt.legend()
    plt.show()

## Timing measurements

In [None]:
projectorClass = projector.Projector_allC
if False:
    inputImage = lfimage.LoadLightFieldTiff('/Users/jonny/Movies/Nils files/Rectified/Left/Cam_Left_40_X1_N19.tif')
    hMatrix = psfmatrix.LoadMatrix('PSFmatrix/reducedPSFMatrix_M22.2NA0.5MLPitch125fml3125from-156to156zspacing4Nnum19lambda520n1.33.mat')

In [None]:
# Timing measurements
if False:
    inputImage = np.zeros((19*19,19*19), dtype='float32')
    for numJobs in [8, 4, 2, 1]:
        # Run the test, saving thread performance information
        plf.SetThreadFileName("threads_new_%d.txt" % numJobs)
        perf.main(['profile-prime-cache', 'profile-new-batch'],
                  inputImage=None, numJobs=numJobs, printProfileOutput=False)
    plf.SetThreadFileName("")
    
if True:
    inputImage = np.zeros((19*19,19*19), dtype='float32')
    for numJobs in [8, 4, 2, 1]:
        # Run the test, saving thread performance information
        plf.SetThreadFileName("threads_square_%d.txt" % numJobs)
        perf.main(['profile-prime-cache', 'profile-new-batch'],
                  matPath='PSFmatrix/fdnormPSFMatrix_M22.2NA0.5MLPitch125fml3125from-56to56zspacing4Nnum19lambda520n1.33.mat',
                  inputImage=inputImage, batchSize=2, numJobs=numJobs, printProfileOutput=False)
    plf.SetThreadFileName("")

## Visual plots of what the activity of each thread looks like over time
Note that for large operations, if zoomed out, we will not get a clear impression because (I think) the plot over-draws subsequent work types with a minimum width of one pixel. So, for example, the time on FFT appears to be exaggerated because that is drawn last.

Note also that this code is not very well structured - I should separate the parsing from the plotting to save re-parsing when e.g. I just want to change the range of the plot!

In [None]:
#(rows, t0, tEnd, dt, mt) = LoadThreadInfo('threads_square_8_copy.txt')
(rows, t0, tEnd, dt, mt) = LoadThreadInfo('threads_square_8_neworder.txt')
#(rows, t0, tEnd, dt, mt) = LoadThreadInfo('threads_square_8_oldorder.txt')

#(rows, t0, tEnd, dt, mt) = LoadThreadInfo('brutha-benchmarks-2/threads_square_16.txt')

In [None]:
numThreads = int(np.max(rows[:,0]))+1

def ParseThreadInfo(xStart=None, xEnd=None, monitorMutexes=True, minMutexWaitTime=5e-6):
    # minMutexWaitTime:            Do not bother displaying mutex wait times less than this (which is effectively an instant return)

    runTime = 0
    mutexTime = 0
    __intermediateTimes = []
    __boundaryTimes = []
    __x = []; __y = []
    __mx = []; __my = []
    reportWaits = 10 # Print info about the first N mutex wait intervals
    if (xStart is None):
        xStart = 0
    if (xEnd is None):
        xEnd = 1e100

    for w in tqdm(range(numWorkTypes)):
        workName = workNames[w]
        _intermediateTimes = []
        _boundaryTimes = []
        _x = []; _y = []
        _mx = []; _my = []

        for t in tqdm(range(numThreads), leave=False):
            x = []; y = []
            mx = []; my = []
            intermediateTimes = []
            boundaryTimes = []
            for r in tqdm(rows[:], leave=False):
                if (r[0] == t) and (r[1] == w):
                    x.extend([r[2], r[2], r[3], r[3]])
                    y.extend([0,1,1,0])
                    runTime += r[3]-r[2]
                    if monitorMutexes:
                        if ((workName == 'convolve') and (r[3] >= xStart) and (r[2] <= xEnd)):
                            if ((r[5]-r[4]) > minMutexWaitTime):
                                if (reportWaits):
                                    print("Mutex wait: %le"%(r[5]-r[4]))
                                    reportWaits = reportWaits - 1
                                mx.extend([r[4], r[4], r[5], r[5]])
                                my.extend([0,1,1,0])
                            mutexTime += r[5]-r[4]
                            if (r[6] != 0) and ((r[7]-r[6]) > minMutexWaitTime):
                                if (reportWaits):
                                    print("Mutex wait: %le"%(r[7]-r[6]))
                                    reportWaits = reportWaits - 1
                                mx.extend([r[6], r[6], r[7], r[7]])
                                my.extend([0,1,1,0])
                            mutexTime += r[7]-r[6]
                    if (workName == 'fft'):
                        intermediateTimes.extend([r[5], r[7]])
                    elif (workName == 'convolve'):
                        intermediateTimes.append(r[5])

                    boundaryTimes.append([r[3]])
            _intermediateTimes.append(np.array(intermediateTimes)-t0)
            _boundaryTimes.append(np.array(boundaryTimes)-t0)
            _x.append(np.array(x) - t0)
            _y.append(np.array(y))
            _mx.append(np.array(mx) - t0)
            _my.append(np.array(my))
        __intermediateTimes.append(_intermediateTimes)
        __boundaryTimes.append(_boundaryTimes)
        __x.append(_x)
        __y.append(_y)
        __mx.append(_mx)
        __my.append(_my)
    if (reportWaits == 0):
        warnings.warn('Did not report all waits')
    return (__x, __y, __mx, __my, __intermediateTimes, __boundaryTimes, runTime, mutexTime)

def PlotWorkThread(w, t, x, y, mx, my, intermediateTimes, boundaryTimes, 
                   monitorMutexes=True,
                   plotBoundaries=True,
                   plotIntermediateCheckpoints=True):
    # monitorMutexes:              Grey block to mark time spent waiting to acquire an accum mutex
    # plotBoundaries:              Black '|' to mark boundaries between individual operations
    # plotIntermediateCheckpoints: Brown '|' to mark intermediate checkpoints within individual operations

    if t == 0:
        thisLabel = lab[w]
    else:
        thisLabel = None
    # Show the time spent working on a task.
    # Note that this code is very slow for the convolve operation, just because there are
    # massive numbers of separate blocks. I could fuse adjacent blocks where there is just
    # a few µs of gap between them. But that would be getting distracted over code that
    # I am only running occasionally, for diagnostic purposes!!
    # (np.bool was removed in NumPy 1.24, so use the builtin bool here)
    plt.fill_between(x, t, t+y/2, color=col[w], where=y.astype(bool), label=thisLabel)
    if monitorMutexes:
        # Show the time spent waiting to acquire an accum mutex.
        plt.fill_between(mx, t+0.1, t+0.1+my*0.3, where=my.astype(bool), color='gray')
    if (len(intermediateTimes) > 0) and plotIntermediateCheckpoints:
        plt.plot(intermediateTimes, [t+0.25]*len(intermediateTimes), '|', color='brown')
    if plotBoundaries:
        plt.plot(boundaryTimes, [t+0.25]*len(boundaryTimes), '|', color='black')

(x, y, mx, my, intermediateTimes, boundaryTimes, runTime, mutexTime) = ParseThreadInfo()
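
The comment in PlotWorkThread mentions fusing adjacent blocks that are separated by only a few µs, to cut down the number of rectangles being drawn. A minimal sketch of that idea is below. The function name and the default gap threshold are hypothetical, and this is not wired into the plotting code above.

```python
import numpy as np

def fuse_intervals(starts, ends, max_gap=5e-6):
    """Merge [start, end] intervals whose separating gap is at most max_gap seconds."""
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    if starts.size == 0:
        return starts, ends
    # Process the intervals in order of start time
    order = np.argsort(starts)
    starts, ends = starts[order], ends[order]
    fused_starts = [starts[0]]
    fused_ends = [ends[0]]
    for s, e in zip(starts[1:], ends[1:]):
        if s - fused_ends[-1] <= max_gap:
            # Close enough to the previous block: extend it
            fused_ends[-1] = max(fused_ends[-1], e)
        else:
            # A real gap: start a new block
            fused_starts.append(s)
            fused_ends.append(e)
    return np.array(fused_starts), np.array(fused_ends)

# Made-up example: the first two blocks are 1 µs apart and get fused
demo_starts, demo_ends = fuse_intervals([0.0, 1.000001, 2.0], [1.0, 1.5, 3.0])
print(demo_starts, demo_ends)
```

The fused start and end arrays could then be expanded into the [start, start, end, end] pattern that fill_between is given above. The trade-off is that gaps shorter than max_gap disappear from the plot.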

In [None]:
wallclockRunTime = np.max(rows[:,3]) - np.min(rows[:,2])
(xStart, xEnd) = (0, wallclockRunTime) # Full range of data

def PlotForTimeRange(xStart, xEnd,                    
                     monitorMutexes=True,
                     plotBoundaries=True,
                     plotIntermediateCheckpoints=True,
                     minMutexWaitTime=1e-5):
    plt.figure(figsize=(14,4))
    plt.xlim(xStart, xEnd)
    plt.title('Thread breakdown')
    for w in tqdm(range(numWorkTypes-1,-1,-1)):  # (tqdm does not seem to like the 'reversed' syntax, so I do it this way!)
        workName = workNames[w]
        for t in tqdm(range(numThreads), leave=False):
            PlotWorkThread(w, t, x[w][t], y[w][t], mx[w][t], my[w][t], intermediateTimes[w][t], boundaryTimes[w][t],
                           monitorMutexes, plotBoundaries, plotIntermediateCheckpoints)
    plt.legend()
    plt.show()

if True:
    # Overview of the whole run
    PlotForTimeRange(xStart, xEnd, 
                     monitorMutexes=False, plotBoundaries=False, plotIntermediateCheckpoints=False)
if True:
    # Examine the run in more zoomed-in detail
    plotRange = 0.5
    for xStart in np.arange(0,wallclockRunTime,plotRange):
        PlotForTimeRange(xStart, xStart+plotRange, plotIntermediateCheckpoints=False)

cRows = rows[:,1] == numWorkTypes-1
convolveTime = (np.sum(dt[cRows]-mt[cRows]))
dta = rows[:,3]-rows[:,4]   # Only the accumulator stage of the convolution (including any mutex blocking time)
convolveAccumTime = (np.sum(dta[cRows]-mt[cRows])) # Time spent actively working on the accumulator
print('Wall %.2f, run %.2f, active %.2f, mutex %.2f, idle frac %.2f' % (wallclockRunTime, runTime, runTime-mutexTime, mutexTime, mutexTime/runTime))
# Report total CPU load (*not* actual speedup compared to single-threaded),
# and also report the average number of CPUs that are working on convolution operations
print('Parallelism %.1f (of which convolve: %.2f of which accum: %.2f)' % ((runTime-mutexTime)/runTime*numThreads, convolveTime/wallclockRunTime, convolveAccumTime/wallclockRunTime))
print('Time breakdown:')
for i in range(numWorkTypes-1):
    print(' %s %.2f [%d]' % (lab[i], np.sum(dt[rows[:,1]==i]), np.sum(rows[:,1]==i)))
print(' convolve %.2f [%d]' % (convolveTime, np.sum(cRows)))
print(' convolve/mutex %.2f' % (np.sum(mt[cRows])))
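
As a sanity check on what the summary numbers mean, the sketch below recomputes "idle frac" and "Parallelism" from a run time and a mutex wait time, using the same formulas as the print statements above. The function name is hypothetical. The example values are the 1 z plane, one timepoint figures recorded later in this notebook, and the thread count of 8 is an assumption (it is consistent with the "Parallelism 8.0" figures reported for the larger runs).

```python
def summarise_run(run_time, mutex_time, num_threads):
    # run_time:   summed duration of all work chunks across all threads
    # mutex_time: summed time spent waiting for an accumulator mutex
    active = run_time - mutex_time
    idle_frac = mutex_time / run_time
    # Average number of threads doing useful work, as a fraction of busy time
    parallelism = active / run_time * num_threads
    return active, idle_frac, parallelism

# Recorded figures for 1 z plane, one timepoint (8 threads assumed)
active, idle_frac, parallelism = summarise_run(19.32, 6.73, 8)
print('active %.2f, idle frac %.2f, parallelism %.1f' % (active, idle_frac, parallelism))
```

Note that this "Parallelism" figure is a measure of CPU load, not of speedup relative to a single-threaded run.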

In [None]:
# Temporary code looking at the early stuff in more detail
PlotForTimeRange(0, 0.1)
for xStart in np.arange(0,4,0.4):
    PlotForTimeRange(xStart, xStart+0.4)

In [None]:
def PlotAccumulatorUse(tFactor):
    # Plot to examine what fraction of the time *somebody* is using an accumulator 
    # (useful for t=z=1)

    for t in range(numThreads):
        x = []; y = []
        for r in rows:
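            # Select convolve rows. This used to test for work type 2, which was
            # 'convolve' in old files with only 3 work types; it is now 'mirror'.
            if (r[0] == t) and (r[1] == numWorkTypes-1):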
                if (r[6] != 0):
                    x.extend([r[5], r[5], r[6], r[6], r[7], r[7], r[3], r[3]])
                    y.extend([0,1,1,0,0,1,1,0])
                else:
                    x.extend([r[5], r[5], r[3], r[3]])
                    y.extend([0,1,1,0])
        x = np.array(x) - t0
        y = np.array(y)
        if t == 0:
            thisLabel = 'accum'
        else:
            thisLabel = None
        # Show the time spent holding any accumulator mutex
        plt.fill_between(x, t*tFactor+0.25, t*tFactor+y/2, color='orange', where=y.astype(bool), label=thisLabel)
        
if False:
    # Plot a separate figure showing how the accumulator is being used.
    # This is really only informative in the 1z,1t case, since I do not currently distinguish between
    # the different mutexes, and so the plot just shows whether *any* mutex is held at any given time
    plt.figure(figsize=(14,4))
    plt.title('Accumulator mutex')
    PlotAccumulatorUse(0)
    plt.show()        

## Explore how run times vary as a function of number of threads
- FFT simply takes consistently a little longer with 8 threads. 
- Mirror has some interesting patterns, but it takes negligible time anyway
- Convolve also has interesting patterns, but it looks like there's a dominant trend underneath all that. I think the most important point is that there is clearly a minimum run time, and both 1 and 8 threads cases *can* achieve that; either *might* take longer than that, but the 8 thread case is more likely to. I imagine this must be due either to cache residency or memory bandwidth contention. I would guess cache residency isn't such a big deal, simply because I think the work sizes are already larger than the caches.

It could possibly be an alignment issue I suppose, but memory bandwidth certainly seems plausible. If that is indeed the case, I suppose there isn't much I can do about it. For the most part, the bandwidth is unavoidable, though I did achieve a ~10% speedup by chunking together two convolves (the ones with and without x-mirroring) to reduce the read bandwidth slightly. I reckon anything further along those lines would be complicating the code quite a bit, without much prospect of getting a significant improvement.

Important observation: the two-thread case has pretty much exactly double performance of the one-thread case. I reckon that supports my theory about memory bandwidth, or at least something related to the number of actual physical CPUs.

In [None]:
def GetScaling(filenames):
    runTimes = []
    threadNumbers = []
    for f in filenames:
        (rows, tStart, tEnd, dt, mt) = LoadThreadInfo(f)
        runTimes.append(tEnd-tStart)
        threadNumbers.append(np.max(rows[:,0]+1))
    return (np.array(threadNumbers), np.array(runTimes))
    
(threadNumbers, runTimes) = GetScaling(glob.glob("brutha-benchmarks-2/threads_square_*.txt"))

In [None]:
plt.plot(threadNumbers, np.array(runTimes)*threadNumbers, 'x')
plt.ylim(0,None)
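
The plot above shows wall time multiplied by thread count, which would be a flat line under perfect scaling. The same information can be expressed as speedup and parallel efficiency relative to the run with the fewest threads. A minimal sketch follows; the function name is hypothetical and the numbers in the example are made up purely to show the arithmetic.

```python
import numpy as np

def scaling_summary(thread_numbers, run_times):
    """Speedup and efficiency relative to the run with the fewest threads."""
    n = np.asarray(thread_numbers, dtype=float)
    t = np.asarray(run_times, dtype=float)
    order = np.argsort(n)
    n, t = n[order], t[order]
    speedup = t[0] / t                  # How much faster than the baseline run
    efficiency = speedup * n[0] / n     # 1.0 means perfect scaling
    return n, speedup, efficiency

# Made-up timings: perfect scaling from 1 to 2 threads, no gain from 2 to 4
demo_n, demo_speedup, demo_eff = scaling_summary([4, 1, 2], [4.0, 8.0, 4.0])
print(demo_n, demo_speedup, demo_eff)
```

With the real data this would be called as scaling_summary(threadNumbers, runTimes).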

In [None]:
fileX = 'threads_square_1.txt'
fileY = 'threads_square_8.txt'
# LoadThreadInfo returns five values (rows, tStart, tEnd, dt, mt)
(rowsX, t0X, tEndX, dtX, mtX) = LoadThreadInfo(fileX)
(rowsY, t0Y, tEndY, dtY, mtY) = LoadThreadInfo(fileY)

In [None]:
# Check we are comparing two equivalent runs.
# This code assumes that both fileX and fileY are runs on the same problem
# (but perhaps e.g. using different numbers of threads).
assert(rowsX.shape == rowsY.shape)
# Watch out for files from old code, which I do not support here any more
if (np.max(rowsX[:,1]) == 2):
    warnings.warn('This looks like an old file (only 3 work types). This code will not work correctly.')
numWorkTypes = 4

ranges = [[0, 0.05], [0, 0.01], [0, 0.01], [0, 0.015], [0, 0.002], [0, 0.015], [0, 5e-5]]
for _workType in range(numWorkTypes+3):
    workType = np.minimum(_workType, numWorkTypes-1)
    if (_workType == numWorkTypes):
        # Convolve first part (up to the first mutex acquisition)
        selector = rowsX[:,1]==workType
        initialTimeY = rowsY[:,4]-rowsY[:,2]
        _dtY = initialTimeY[selector]
        initialTimeX = rowsX[:,4]-rowsX[:,2]
        _dtX = initialTimeX[selector]
        problem = _dtX > 1e9
    elif (_workType == numWorkTypes+1):
        # Convolve second part (excluding mutex acquisitions)
        # First consider case with two mutex acquisitions [which is actually obsolete now]
        selector = (rowsY[:,1]==workType) & (rowsY[:,6]!=0)
        initialTimeY = rowsY[:,6]-rowsY[:,5] + rowsY[:,3]-rowsY[:,7]
        _dtY = initialTimeY[selector]
        selector = (rowsX[:,1]==workType) & (rowsX[:,6]!=0)
        initialTimeX = rowsX[:,6]-rowsX[:,5] + rowsX[:,3]-rowsX[:,7]
        _dtX = initialTimeX[selector]
        # Now consider case with one mutex acquisition
        # Note that this graph won't make much sense if we compare new code with old, 
        # since the work has been reshuffled a bit
        selector = (rowsY[:,1]==workType) & (rowsY[:,6]==0)
        initialTimeY = rowsY[:,3]-rowsY[:,5]
        _dtY = np.append(_dtY, initialTimeY[selector])
        selector = (rowsX[:,1]==workType) & (rowsX[:,6]==0)
        initialTimeX = rowsX[:,3]-rowsX[:,5]
        _dtX = np.append(_dtX, initialTimeX[selector])
    elif (_workType == numWorkTypes+2):
        # Convolve mutex time
        selector = (rowsX[:,1]==workType)
        _dtY = mtY[selector]
        _dtX = mtX[selector]
    else:
        assert(workType < numWorkTypes)
        selector = rowsX[:,1]==workType
        _dtY = dtY[selector]
        _dtX = dtX[selector]
    avY = np.average(_dtY)
    avX = np.average(_dtX)
    gradient = avY / avX
    # Clip outliers, to avoid lots of whitespace in the plots
    _dtX = np.minimum(_dtX, ranges[_workType][1])
    _dtY = np.minimum(_dtY, gradient*ranges[_workType][1])

    plt.figure(figsize=(10,4))
    plt.title(lab[_workType])
    plt.xlabel(fileX)
    plt.ylabel(fileY)
    plt.plot(_dtX, _dtY, '.')
    plt.plot(ranges[_workType], ranges[_workType])
    plt.plot(ranges[_workType], np.array(ranges[_workType])*gradient)
    plt.plot([avX, avX, 0], [0, avY, avY])
    plt.show()
    print(lab[_workType], np.sum(_dtX), np.sum(_dtY), gradient)

These timings are old ones before I speeded up the convolution code a little bit.
As a result, these specific numbers are outdated, but I reckon the general theme remains, and I am not going to rerun them all right now...

## With one timepoint

### 1 z plane:
Wall 2.44, run 19.32, active 12.59, mutex 6.73, idle frac 0.35
Parallelism 5.2 (Convolve: 1.03, Accum: 0.97)

Time breakdown: FH 9.27, mirror 0.81, convolve 2.51, convolve/mutex 6.73

### 2 z planes:
Wall 3.68, run 28.93, active 26.25, mutex 2.68, idle frac 0.09
Parallelism 7.3 (Convolve: 1.44, Accum: 1.37)

Time breakdown: FH 19.01, mirror 1.92, convolve 5.31, convolve/mutex 2.68

### 4 z planes:
Run 49.41, active 49.11, mutex 0.30, idle frac 0.01
Parallelism 8.0

Time breakdown: FH 34.83, mirror 4.01, convolve 10.27, convolve/mutex 0.30

### Next 4 z planes:
Run 49.40, active 49.13, mutex 0.27, idle frac 0.01
Parallelism 8.0

### 8 z planes:
Run 108.06, active 108.06, mutex 0.00, idle frac 0.00
Parallelism 8.0

## With two timepoints

### 1 z plane:
Wall 2.90, run 23.01, active 15.99, mutex 7.02, idle frac 0.31
Parallelism 5.6 (Convolve: 1.67, Accum: 1.59)

Time breakdown: FH 10.18, mirror 0.96, convolve 4.85, convolve/mutex 7.02

### 2 z planes:
Run 30.14, active 28.15, mutex 2.00, idle frac 0.07
Parallelism 7.5

Time breakdown: FH 15.84, mirror 2.37, convolve 9.94, convolve/mutex 2.00

### 3 z planes:
Run 35.87, active 34.34, mutex 1.53, idle frac 0.04
Parallelism 7.7

### 4 z planes:
Run 64.77, active 64.75, mutex 0.02, idle frac 0.00
Parallelism 8.0 (Convolve: 2.80)

Time breakdown: FH 37.32, mirror 4.55, convolve 22.87, convolve/mutex 0.02

## With 16 timepoints
### 1 z plane
Run 51.78, active 51.78, mutex 0.01, idle frac 0.00
Parallelism 8.0 (Convolve: 6.51)

Time breakdown: FH 8.25, mirror 1.36, convolve 42.16, convolve/mutex 0.01

## With 32 timepoints
### 1 z plane
Run 87.08, active 87.07, mutex 0.01, idle frac 0.00
Parallelism 8.0 (Convolve: 7.11)

Time breakdown: FH 8.36, mirror 1.26, convolve 77.46, convolve/mutex 0.01

## Summary
More time is being lost waiting for a mutex for 1z,2t compared to 2z,1t. This might just be because there is less FH work to be getting on with, and so there are more threads ready to do work.

In fact, with 1z,1t, while threads are inevitably sitting idle, I am actually making pretty good use of time: the accumulator mutex is held by somebody almost all the time. However, we are being less efficient with the 2z,1t and 1z,2t cases. There, we often only hold one of the mutexes at a time. It might be possible to adjust the scheduling code to be more effective with the mutexes (which I think in practice means getting a bit ahead of ourselves with FH rather than waiting until we run out of work entirely). An alternative strategy would be to create additional temporary accumulators and combine them at the end, to support more parallelism. Some perspective though: I do care about the 1z,2t case, but I could only speed it up by 33% (respectable but not earth-shattering); the 4z,2t case is probably more representative of Nils' data, and that has no idle time.
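
To make the "additional temporary accumulators" idea concrete, here is a minimal Python sketch of its structure: each thread adds into its own private accumulator, so no mutex is needed during the main loop, and the partial results are summed once at the end. This only illustrates the pattern. The real work happens in the C code, the function name is hypothetical, and the cost is one extra accumulator-sized buffer per thread plus the final reduction.

```python
import threading
import numpy as np

def accumulate_with_private_buffers(chunks, num_threads):
    # One private accumulator per thread: no locking needed while accumulating
    partial = [np.zeros(chunks[0].shape) for _ in range(num_threads)]

    def worker(tid):
        # Each thread takes every num_threads-th chunk
        for c in chunks[tid::num_threads]:
            partial[tid] += c

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # Single reduction step at the end, once all threads have finished
    return np.sum(partial, axis=0)

# Made-up work: ten chunks whose values are 0..9, so every element should sum to 45
demo_chunks = [np.full(4, float(i)) for i in range(10)]
demo_total = accumulate_with_private_buffers(demo_chunks, num_threads=3)
print(demo_total)
```

Whether this would help in practice depends on whether the extra memory traffic from the additional buffers outweighs the time currently lost waiting on the mutexes.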