# Benchmarking code to generate Figs 2 and 3 of the accompanying paper

Run the cells below to plot performance graphs.

The following parameters are used in the benchmarking scenario, representing a typical light field workload:

| Parameter | Value |
| --------------- | ------- |
| Numerical Aperture | 0.5  |
| Magnification | 22.222   |
| ML Pitch ($\mu$m) | 125 |
| $f_\textrm{ML}$ ($\mu$m) | 3125  |
| Refractive index | 1.33   |
| $\lambda$ (nm) | 520 |
| $z$ range ($\mu$m) | ±60   |
| | |
| Number of planes | 25  |
| N | 19  |
| X | 1463   |
| Y | 1273   |
| $N_\textrm{iter}$ | 4 |
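
A few derived quantities follow directly from the table. The sketch below is only illustrative arithmetic: it assumes that N is the number of pixels per lenslet along each axis, that X and Y are the raw image dimensions in pixels, and that the planes are evenly spaced across the $z$ range with both end points included. The variable names are mine and do not come from the benchmarking code.

```python
# Values copied from the parameter table above
pixels_per_lenslet = 19          # "N" (assumed to mean pixels per lenslet)
image_x, image_y = 1463, 1273    # "X" and "Y" (assumed to be image size in pixels)
num_planes = 25
z_min_um, z_max_um = -60.0, 60.0
ml_pitch_um = 125.0
magnification = 22.222

# Number of whole lenslets across the image, and any leftover pixels
lenslets_x, remainder_x = divmod(image_x, pixels_per_lenslet)
lenslets_y, remainder_y = divmod(image_y, pixels_per_lenslet)

# Spacing between reconstructed planes, assuming even spacing with both ends included
z_spacing_um = (z_max_um - z_min_um) / (num_planes - 1)

# Lenslet pitch referred back to sample space (pitch divided by magnification)
sample_space_pitch_um = ml_pitch_um / magnification

print(lenslets_x, remainder_x, lenslets_y, remainder_y)
print(z_spacing_um, round(sample_space_pitch_um, 3))
```

Under these assumptions the image divides into a whole number of lenslets in both directions, with no leftover pixels.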


In [None]:
import py_light_field as plf
import benchmark

### Perform self-tests (optional)

In [None]:
import projector
import lfdeconv
# Run the built-in self-tests; the return values are discarded
_ = projector.selfTest()
_ = lfdeconv.main(['basic', 'full', 'parallel'])

# Testing thread scaling
For just the backprojection, the tests in the next cell take about an hour to run on suil-bheag (for 1-8 threads). This generates a file `stats.txt`, but the subsequent plotting code instead reads a file that is already in the repository, which records the performance I measured on my own server.

Interestingly, the thread scaling data looks very similar for the smaller image testcase.
The actual run time thread scaling looks a bit worse for the smaller testcase, but the scaling of actual CPU time used is similar in both cases (suggesting threads are idling in the smaller testcase).

The anomalously slow run time for two threads on cuinneag does seem to be reproducible. I assume this is connected with the two-processor architecture, and with memory and cache usage?
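
The distinction between run time and CPU time can be made concrete with a small sketch. The numbers below are invented purely for illustration (they are not measurements), and the column layout (thread count, run time, user CPU time, system CPU time) is an assumption inferred from how the plotting code in this notebook indexes the stats file. Efficiency is computed the same way as in that plotting code; "utilisation" is an extra quantity I have added here to show how idle threads would appear.

```python
import io
import numpy as np

# Illustrative values only (NOT measured data).
# Assumed columns: threads, run time (s), user CPU time (s), system CPU time (s)
example_stats = io.StringIO(
    "1\t800.0\t790.0\t10.0\n"
    "2\t420.0\t800.0\t12.0\n"
    "4\t220.0\t820.0\t15.0\n"
    "8\t125.0\t860.0\t20.0\n"
)
stats = np.loadtxt(example_stats, delimiter="\t")
threads, run_time = stats[:, 0], stats[:, 1]
cpu_time = stats[:, 2] + stats[:, 3]

# Multithreading efficiency relative to the first row:
# 1.0 means the run time dropped in exact proportion to the thread count
efficiency = run_time[0] * threads[0] / (run_time * threads)

# Fraction of the available thread-time that was actually spent on the CPU;
# a value below 1 suggests some threads were idle for part of the run
utilisation = cpu_time / (run_time * threads)

for n, e, u in zip(threads, efficiency, utilisation):
    print("{0:g} threads: efficiency {1:.2f}, utilisation {2:.2f}".format(n, e, u))
```

If the run time scales poorly while the total CPU time stays roughly constant, utilisation falls along with efficiency, which is the signature of idling threads rather than wasted work.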

In [None]:
import benchmark
if False:
    # Smaller problem just to road-test this code and analysis
    threadScalingResults = benchmark.main(benchmarkGPU=False,
                                          prefix=['smaller-image', 'x4'],
                                          prefix2=['parallel-scaling'])
else:
    threadScalingResults = benchmark.main(benchmarkGPU=False,
                                          prefix=['olaf-image', 'olaf-matrix', 'x16'],
                                          prefix2=['parallel-scaling'])

In [None]:
import csv
import numpy as np
import matplotlib.pyplot as plt
def PlotThreadStats(filename, rescaleAll=1, show=True):
    # Expects a tab-separated file with columns: thread count, run time, then two CPU time columns (which are summed)
    # rescaleAll: multiply all time values by a constant factor (useful if doing multiple plots on top of each other)
    # show: call plt.show(). Disable this to allow plotting multiple plots on top of each other
    threadStats = []
    with open(filename) as f:
        csv_reader = csv.reader(f, delimiter='\t')
        for row in csv_reader:
            threadStats.append(row)
    threadStats = np.array(threadStats).astype(np.float64)
    threadStats[:,1:] *= rescaleAll
    plt.figure(figsize=(12,4))
    plt.subplot(1,2,1)
    plt.title(filename)
    plt.xlabel("Number of threads")
    plt.ylabel("Benchmark time (s)")

    plt.plot(threadStats[:,0], threadStats[:,1], label='Run time')
    plt.plot(threadStats[:,0], threadStats[:,2]+threadStats[:,3], label='CPU time')
    plt.plot(threadStats[:,0], threadStats[:,1]*threadStats[:,0], label='Run time scaled')
    plt.ylim(0,None)
    plt.legend()
    plt.subplot(1,2,2)
    plt.title("{0} efficiency".format(filename))
    plt.plot(threadStats[:,0], threadStats[0,1]*threadStats[0,0]/(threadStats[:,1]*threadStats[:,0]), label='Efficiency')
    plt.ylim(0,None)
    if show:
        plt.show()
    
def PlotThreadStatsForPaper(filename):
    # Plot run time (left axis) and multithreading efficiency (right axis) against thread count,
    # and save the figure as thread_scaling.pdf.
    # Expects a tab-separated file whose first two columns are thread count and run time (s)
    threadStats = []
    with open(filename) as f:
        csv_reader = csv.reader(f, delimiter='\t')
        for row in csv_reader:
            threadStats.append(row)
    threadStats = np.array(threadStats).astype(np.float64)
    fig, ax = plt.subplots(figsize=(6,4))
    fdMapping = lambda y: y/threadStats[0,1]
    bkMapping = lambda y: y*threadStats[0,1]
    ax2 = ax.secondary_yaxis('right', functions=(fdMapping, bkMapping))
    ax.set_xlabel("Number of threads")
    ax.set_ylabel("Benchmark time (s)")
    ax2.set_ylabel("Efficiency")

    ax.plot(threadStats[:,0], threadStats[:,1], marker='.', label='Run time')
    ax.set_xticks([1] + list(range(2, 17, 2)))
    ax.set_ylim(0,900)
    ax.set_yticks(range(0, 801, 200))
    ax2.set_yticks(np.arange(0, 1.01, 0.2))
    ax.plot(threadStats[:,0], 
            bkMapping(threadStats[0,1]*threadStats[0,0]/(threadStats[:,1]*threadStats[:,0])), 
            marker='x',
            label='Multithreading efficiency')
    leg = fig.legend(loc='upper center', borderaxespad=4)
    fig.savefig("thread_scaling.pdf")
    # The next two lines are a cosmetic workaround.
    # For some reason I need a larger borderaxespad for the saved pdf figure,
    # compared to the one displayed inline in this notebook
    leg.remove()
    fig.legend(loc='upper center', borderaxespad=2.5)
    fig.show()
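    factor = threadStats[-1,1]/threadStats[0,1]
    # Report the thread count from the last row of the data rather than assuming 16 threads
    print("Time fraction with {0:g} threads: {1}%, {2}x faster".format(threadStats[-1,0], 100*factor, 1/factor))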
#PlotThreadStats('stats_large_suil-bheag.txt')
#PlotThreadStats('stats_large_cuinneag.txt')

PlotThreadStatsForPaper('stats_large_cuinneag.txt')


# Testing batch scaling

Look at how run time scales with batch size. To run this on your own machine, edit the code below so that batchScalingResultsCPU and batchScalingResultsGPU are empty lists; the code will then automatically run benchmarks for every batch size specified in batchSizesCPU/GPU.

Initially I just looked at the backprojection here, to keep the run times manageable. It's satisfying to see that the results are very well modelled by a fixed overhead (presumably calculation of F(H)) plus a term that scales linearly with the batch size.
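
Written out, this fixed-overhead-plus-linear model looks as follows. The symbols are my own notation, introduced here for illustration: $B$ is the batch size, $T_0$ the fixed overhead (the "baseline" printed by the plotting code) and $g$ the extra time per image (the "gradient").

$$
T(B) \approx T_0 + g\,B,
\qquad
\frac{T(B)}{B} \approx g + \frac{T_0}{B}
$$

The time per image therefore tends towards $g$ as $B$ grows, and the overhead $T_0$ matches the per-image work when $B = T_0/g$.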

Relatively speaking, the GPU baseline is much lower (i.e. we don't need as large a batch size to amortise it away). That's good news in terms of GPU RAM consumption. However, it does lead me to suspect that my GPU implementation of my special FFT is less optimised. It's *possible* the GPU is just less good at that, but it probably means I could be doing more to optimise that code, if I really put my mind to it.
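
One way to put a number on "how large a batch is needed to amortise the baseline" is the ratio of the fitted baseline to the fitted gradient, i.e. the batch size at which the fixed overhead equals the per-image work. The sketch below uses the precomputed suil-bheag backprojection timings from this notebook, rounded to two decimal places; the helper name `break_even_batch` is hypothetical and is not part of the benchmarking code.

```python
import numpy as np

def break_even_batch(batch_sizes, times):
    """Fit time = baseline + gradient*batch and return baseline/gradient."""
    gradient, baseline = np.polyfit(batch_sizes, times, 1)
    return baseline / gradient

# Backprojection timings (s) copied from this notebook, rounded to 2 d.p.
sizes_gpu = np.arange(1, 25)
times_gpu = np.array([2.09, 2.65, 3.16, 3.64, 4.19, 4.71, 5.52, 6.03,
                      6.54, 6.76, 7.90, 7.82, 9.09, 8.92, 9.33, 9.94,
                      11.80, 11.18, 12.98, 12.19, 12.58, 13.54, 14.44, 14.41])

sizes_cpu = np.array(list(range(1, 8)) + list(range(8, 16, 2)) + list(range(16, 33, 4)))
times_cpu = np.array([70.65, 77.02, 84.01, 90.81, 98.12, 104.45, 111.37, 117.51,
                      131.73, 144.45, 156.98, 170.16, 197.41, 223.09, 249.77, 274.76])

ratio_gpu = break_even_batch(sizes_gpu, times_gpu)
ratio_cpu = break_even_batch(sizes_cpu, times_cpu)
print("Break-even batch size: GPU {0:.1f}, CPU {1:.1f}".format(ratio_gpu, ratio_cpu))
```

A smaller ratio means the overhead is amortised at a smaller batch size, which is the sense in which the GPU baseline is relatively lower.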

The same pattern seems to apply to the full deconvolution (which I've sampled for a few batch sizes), and it's gratifying to see that we don't pick up any additional overheads or penalties here.

In [1]:
import benchmark
import projector as proj
import numpy as np

if True:
    # Datasets calculated earlier on suil-bheag CPU and GPU
    print("Assume running on suil-bheag")
    batchSizesGPU = np.array(1+np.arange(24))
    batchScalingResultsGPU = [2.0887858867645264, 2.650907516479492, 3.1582248210906982, 3.639864206314087, 4.187718629837036, 4.7083728313446045, 5.515699625015259, 6.032058954238892, 6.538254976272583, 6.759566068649292, 7.901719570159912, 7.820116281509399, 9.093420267105103, 8.917783260345459, 9.327208280563354, 9.943439483642578, 11.797546148300171, 11.179245710372925, 12.977349758148193, 12.188851594924927, 12.5778169631958, 13.543910503387451, 14.43958854675293, 14.407610893249512]

    batchSizesCPU = np.array(list(range(1,8,1)) + list(range(8,16,2)) + list(range(16,33,4)))
    batchScalingResultsCPU = [70.64740204811096,77.01506614685059,84.01113772392273,90.80544757843018,98.12121725082397,
                                  104.44658899307251,111.3729362487793,117.50704097747803,131.72836804389954,144.44977974891663,
                                  156.9799840450287,170.16073203086853,197.4131510257721,223.088045835495,249.77226161956787,
                                  274.7552146911621]

    batchSizesGPUdc = np.array([1, 2, 4, 8, 16, 24])
    batchScalingResultsGPUdc = [19.04511833190918, 23.87207317352295, 36.4692268371582, 51.96277904510498, 91.1813862323761, 135.37568163871765]

    batchSizesCPUdc = np.array([1, 2, 4, 8, 16, 32])
    batchScalingResultsCPUdc = [628.8385584354401, 686.4196479320526, 811.7965116500854, 1050.5633614063263, 1522.4902625083923, 2508.4796035289764]
else:
    # Datasets calculated earlier on macbook CPU
    print("Assume running on macbook")
    # Benchmarking results: [[2.9634785652160645, 2.926506519317627, 3.040677785873413, 2.7863423824310303], [1.3397929668426514, 1.3884947299957275, 1.3806533813476562, 1.3544728755950928]]
    batchSizesCPU = np.array([1, 32])
    batchScalingResultsCPU = [105.0, 449.6]

    batchSizesCPUdc = np.array([1, 32])
    batchScalingResultsCPUdc = [916.7, 4480.0]

# These loops will only calculate any missing entries that are not yet present in the above precalculated arrays,
# to avoid taking ages to run this cell every time you just want to analyze the results
for batchSize in batchSizesCPU[len(batchScalingResultsCPU):]:
    # Benchmark for this batch size
    batchScalingResultsCPU.append(benchmark.main(benchmarkGPU=False, prefix=['olaf-image', 'olaf-matrix', 'x{0}'.format(batchSize)])[0][0])
    
if proj.gpuAvailable:
    for batchSize in batchSizesGPU[len(batchScalingResultsGPU):]:
        # Benchmark for this batch size
        batchScalingResultsGPU.append(benchmark.main(benchmarkCPU=False, prefix=['olaf-image', 'olaf-matrix', 'x{0}'.format(batchSize), 'volumes-on-gpu'])[0][0])    
        # These next lines clear the FFT plan cache.
        # In every loop iteration we compute different-shaped FFTs, so the cached plans
        # are just taking up memory without being useful. Without clearing like this,
        # we gradually fill up the GPU memory with redundant cached data.
        import cupy as cp
        cp.fft.config.get_plan_cache().clear()
else:
    print("No GPU available - not benchmarking")    
    
# Run the full deconvolution for a limited set of batch sizes, just to confirm that the scalings still apply
for batchSize in batchSizesCPUdc[len(batchScalingResultsCPUdc):]:
    batchScalingResultsCPUdc.append(benchmark.main(benchmarkGPU=False, prefix=['olaf-image', 'olaf-matrix', 'deconv', 'x{0}'.format(batchSize)])[0][0])
    
if proj.gpuAvailable:    
    for batchSize in batchSizesGPUdc[len(batchScalingResultsGPUdc):]:
        batchScalingResultsGPUdc.append(benchmark.main(benchmarkCPU=False, prefix=['olaf-image', 'olaf-matrix', 'deconv', 'x{0}'.format(batchSize), 'volumes-on-gpu'])[0][0])    
        import cupy as cp
        cp.fft.config.get_plan_cache().clear()
else:
    print("No GPU available - not benchmarking")

Assume running on suil-bheag
No GPU available - not benchmarking
No GPU available - not benchmarking


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def PlotResults(batchSizes, batchScalingResults, desc):
    batchScalingResults = np.array(batchScalingResults)
    batchSizes = batchSizes[:len(batchScalingResults)]
    plt.figure(figsize=(10,4))
    plt.subplot(1,2,1)
    plt.title("Time ({0})".format(desc))
    plt.xlabel("Batch size")
    plt.ylabel("Time (s)")
    plt.plot(batchSizes, batchScalingResults)
    plt.plot(batchSizes, batchScalingResults, '.')
    mc = np.polyfit(batchSizes, batchScalingResults, 1)
    plt.plot(batchSizes, mc[0]*batchSizes+mc[1])
    plt.ylim(0,None)
    
    plt.subplot(1,2,2)
    plt.title("Time per image ({0})".format(desc))
    plt.xlabel("Batch size")
    plt.ylabel("Time (s)")
    plt.plot(batchSizes, batchScalingResults / batchSizes)
    plt.plot(batchSizes, batchScalingResults / batchSizes, '.')
    plt.ylim(0,None)
    plt.show()
    print("{0} baseline: {1:.2f}s. Gradient: {2:.2f}s/image".format(desc, mc[1], mc[0]))
    
def PlotResultsPaper(batchSizes, batchScalingResults, filename, desc):
    batchScalingResults = np.array(batchScalingResults)
    batchSizes = batchSizes[:len(batchScalingResults)]
    fig, ax = plt.subplots(figsize=(6,4))
    ax.set_xlabel("Batch size")
    ax.set_ylabel("Time (s)")
    fdMapping = lambda y: y*batchScalingResults[0]/batchScalingResults[-1]
    bkMapping = lambda y: y/batchScalingResults[0]*batchScalingResults[-1]
    ax2 = ax.secondary_yaxis('right', functions=(fdMapping, bkMapping))
    ax2.set_ylabel("Time per image (s)")

    ax.plot(batchSizes, batchScalingResults, '.')
    mc = np.polyfit(batchSizes, batchScalingResults, 1)
    ax.plot(batchSizes, mc[0]*batchSizes+mc[1], color='blue', label=f'Total elapsed time ({desc})')
    ax.set_ylim(0,None)
    
    ax.plot(batchSizes, bkMapping(batchScalingResults / batchSizes), '.')
    batchSizes2 = np.arange(1, batchSizes[-1]+1e-6, 0.01)
    ax.plot(batchSizes2, bkMapping((mc[0]*batchSizes2+mc[1])/batchSizes2), color='orange', label='Time per image (s)')

    leg = fig.legend(loc='upper center', borderaxespad=4)
    fig.savefig(filename)
    # The next two lines are a cosmetic workaround.
    # For some reason I need a larger borderaxespad for the saved pdf figure,
    # compared to the one displayed inline in this notebook
    leg.remove()
    fig.legend(loc='upper center', borderaxespad=2.5)
    fig.show()
    print("{0} baseline: {1:.2f}s. Gradient: {2:.2f}s/image".format(filename, mc[1], mc[0]))
    
if False:
    PlotResults(batchSizesCPU, batchScalingResultsCPU, "CPU")
    PlotResults(batchSizesGPU, batchScalingResultsGPU, "GPU")
    PlotResults(batchSizesCPUdc, batchScalingResultsCPUdc, "CPU deconv")
    PlotResults(batchSizesGPUdc, batchScalingResultsGPUdc, "GPU deconv")

PlotResultsPaper(batchSizesCPUdc, batchScalingResultsCPUdc, "cpu_deconv.pdf", "CPU")
PlotResultsPaper(batchSizesGPUdc, batchScalingResultsGPUdc, "gpu_deconv.pdf", "GPU")

## Work in progress: look at forward projection slowness
I've noticed that forward projection is slower than backprojection on my macbook (although I have not seen this on other platforms). Let's try and investigate why...

I am struggling to reproduce this, actually. With olaf and x16 I maybe have slightly elevated rusage for forward projection, but no clear difference in overall run time. With olaf x32 on cuinneag I see no trend at all.

I haven't actually rerun x32 on the macbook pro. I wonder whether that will show the problem again, or whether it will have gone away?
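
To judge whether any one pass stands out, it can help to summarise the spread of the logged timings. The sketch below parses the "Olaf, x16, macbook" log lines recorded in this notebook. It does not assume which entries correspond to forward or back projection, and it treats the first rusage value as user CPU time, which is my assumption about the log format.

```python
import re
import statistics

# Lines copied from the "Olaf, x16, macbook" log recorded in this notebook
log_text = """
work elapsed wallclock time 260.581940
Total work delta rusage: [1002.017153   25.666372]
work elapsed wallclock time 267.333674
Total work delta rusage: [1024.208123   28.725325]
work elapsed wallclock time 267.539262
Total work delta rusage: [993.014693  33.542868]
work elapsed wallclock time 264.884809
Total work delta rusage: [1011.825912   29.671406]
work elapsed wallclock time 267.695029
Total work delta rusage: [1001.905267   29.950696]
"""

# Extract the wallclock time of each pass
wallclock = [float(m) for m in re.findall(r"wallclock time ([\d.]+)", log_text)]
# Extract the first rusage value of each pass (assumed to be user CPU time)
cpu_user = [float(m.split()[0]) for m in re.findall(r"delta rusage: \[([^\]]+)\]", log_text)]

def relative_spread(values):
    """(max - min) as a fraction of the mean."""
    return (max(values) - min(values)) / statistics.mean(values)

wall_spread = relative_spread(wallclock)
cpu_spread = relative_spread(cpu_user)
print("wallclock spread: {0:.1%}, rusage spread: {1:.1%}".format(wall_spread, cpu_spread))
```

A spread of only a few percent across passes is consistent with there being no clear difference in overall run time.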

In [None]:
# See if I see this on a smaller testcase that's easier to play with

''' 
    Olaf, x16, macbook
        Running with batch image shape (16, 1463, 1273), batch x16
        work elapsed wallclock time 260.581940
        Total work delta rusage: [1002.017153   25.666372]
        work elapsed wallclock time 267.333674
        Total work delta rusage: [1024.208123   28.725325]
        work elapsed wallclock time 267.539262
        Total work delta rusage: [993.014693  33.542868]
        work elapsed wallclock time 264.884809
        Total work delta rusage: [1011.825912   29.671406]
        work elapsed wallclock time 267.695029
        Total work delta rusage: [1001.905267   29.950696]
    time: 1329.85. overall delta rusage: [5033.941084  148.399634]
    
    Olaf, x32, cuinneag
        Running with batch image shape (32, 1463, 1273), batch x32
        work elapsed wallclock time 104.480644
        Total work delta rusage: [1511.719767   44.051956]
        work elapsed wallclock time 104.862145
        Total work delta rusage: [1518.188842   53.990656]
        work elapsed wallclock time 109.053886
        Total work delta rusage: [1531.75683    52.955494]
        work elapsed wallclock time 104.540129
        Total work delta rusage: [1531.171942   41.989661]
        work elapsed wallclock time 108.401333
        Total work delta rusage: [1519.177746   56.641623]
    time: 533.41. overall delta rusage: [7613.053808  250.65336 ]

'''
import benchmark
import projector as proj
import numpy as np

batchSizesCPUdc = np.array([1, 16])
batchScalingResultsCPUdc = []

import py_light_field as plf
plf.SetProgressReportingInterval(10.0)

# Run the full deconvolution for a small set of batch sizes, to see whether the forward projection slowness shows up
for batchSize in batchSizesCPUdc[len(batchScalingResultsCPUdc):]:
    # Doesn't seem to occur for non-olaf x16:
    #batchScalingResultsCPUdc.append(benchmark.main(benchmarkGPU=False, prefix=['deconv', 'i2', 'x{0}'.format(batchSize)])[0][0])
    #batchScalingResultsCPUdc.append(benchmark.main(benchmarkGPU=False, prefix=['olaf-image', 'olaf-matrix', 'i2', 'deconv', 'x{0}'.format(batchSize)])[0][0])
    batchScalingResultsCPUdc.append(benchmark.main(benchmarkGPU=False, prefix=['olaf-image', 'olaf-matrix', 'i1', 'deconv', 'x{0}'.format(batchSize)])[0][0])

In [None]:
import benchmark
# Testing same mini-scenario as I was testing in matlab
benchmark.main(benchmarkGPU=False,
               prefix=['olaf-image', 'olaf-matrix-reduced', 'i4', 'deconv', 'x4'])