In [1]:
%matplotlib inline

# Optimization
## Michael McDermott

Optimization was carried out to improve the performance of the Lennard-Jones (LJ) molecular dynamics program.

In [1]:
import numpy as np

### Note on runtimes:
#### About halfway through this work I discovered that the run time depends on the step size. Most of the benchmarking is done with a 1e-12 s step, but the actual computation is all done at 1e-13 or 1e-14 s. I am not sure why this is, since the "virtual" time should have no influence on the wall-clock run time; my guess is that it relates to how often the force calculation exits early.

![iterspeed](./iterspeedpng.png)

### Bone-stock performance
#### The program was run on an i3-3120M at 2.5 GHz (4 threads). A typical speed benchmark uses a time step of 1e-12 seconds and an end time of 1e-10 seconds. Initialization (populating the simulation with atoms) is timed separately from the iterations, and its cost is fixed regardless of how long the simulation runs. Iterations per second are computed once the simulation finishes, as the number of iterations divided by the elapsed seconds.
Compiling: `gcc main.c aux.c -lm -o plain.run`

    Number of iterations: 101
    Initialization seconds: 2.751
    Elapsed seconds: 22.519
    Iterations per second: 4.49
    
    Number of iterations: 101
    Initialization seconds: 2.684
    Elapsed seconds: 30.022
    Iterations per second: 3.36
    
    Number of iterations: 101
    Initialization seconds: 2.106
    Elapsed seconds: 21.016
    Iterations per second: 4.81
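
As a check on how the reported rate relates to the other numbers, the following minimal sketch recomputes iterations per second from the iteration count and elapsed time of the three runs above. The function name `iterations_per_second` is introduced here for illustration and is not part of the original program.

```python
def iterations_per_second(n_iterations, elapsed_seconds):
    """Rate over the iteration phase only (initialization is timed separately)."""
    return n_iterations / elapsed_seconds

# (iterations, elapsed seconds) copied from the three baseline runs
runs = [(101, 22.519), (101, 30.022), (101, 21.016)]
for n, elapsed in runs:
    print("{:.2f} it/sec".format(iterations_per_second(n, elapsed)))
```

The recomputed rates agree with the logged values, which suggests that the elapsed time does not include initialization.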


In [2]:
baseline = np.mean([4.49, 3.36, 4.81])
print("Baseline speed (iterations/sec): {:.2f}".format(baseline))

Baseline speed (iterations/sec): 4.22


## Pipe to file
#### This optimization seeks to eliminate some overhead by sending output directly to a file (bash `>` redirect) rather than printing it to the terminal. It had no measurable effect; however, since all later attempts redirect to a file, I wanted to control for it.

    Initialization seconds: 2.688
    Elapsed seconds: 29.506
    Iterations per second: 3.42

    Initialization seconds: 2.141
    Elapsed seconds: 20.995
    Iterations per second: 4.81

    Initialization seconds: 2.080
    Elapsed seconds: 21.157
    Iterations per second: 4.77
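
One way to back up the "no effect" reading is to compare the difference between the two means with the run-to-run spread. The sketch below does this with the six rates logged so far; the helper `mean_and_spread` is a name chosen here, and with only three runs per case the comparison is indicative rather than a proper significance test.

```python
import numpy as np

stdout_rates = np.array([4.49, 3.36, 4.81])    # printing to the terminal
redirect_rates = np.array([3.42, 4.81, 4.77])  # output redirected to a file

def mean_and_spread(rates):
    """Mean and sample standard deviation (ddof=1) of a set of runs."""
    return rates.mean(), rates.std(ddof=1)

stdout_mean, stdout_sd = mean_and_spread(stdout_rates)
redirect_mean, redirect_sd = mean_and_spread(redirect_rates)
difference = redirect_mean - stdout_mean

print("Terminal: {:.2f} +/- {:.2f} it/sec".format(stdout_mean, stdout_sd))
print("Redirect: {:.2f} +/- {:.2f} it/sec".format(redirect_mean, redirect_sd))
print("Difference: {:+.2f} it/sec".format(difference))
```

If the difference is smaller than the spread of either set, the redirect cannot be distinguished from noise at this sample size.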


## Optimization flags
### -O1 - roughly 3.2x improvement


    Elapsed seconds: 9.326
    Iterations per second: 10.83

    Elapsed seconds: 6.677
    Iterations per second: 15.13
    
    Elapsed seconds: 7.022
    Iterations per second: 14.38


### -O2

    Initialization seconds: 2.049
    Elapsed seconds: 8.594
    Iterations per second: 11.75

    Initialization seconds: 1.991
    Elapsed seconds: 8.921
    Iterations per second: 11.32

    Initialization seconds: 1.627
    Elapsed seconds: 6.439
    Iterations per second: 15.68
    
### -O3

    Initialization seconds: 2.072
    Elapsed seconds: 8.269
    Iterations per second: 12.21

    Initialization seconds: 1.604
    Elapsed seconds: 6.123
    Iterations per second: 16.50

    Initialization seconds: 1.618
    Elapsed seconds: 6.085
    Iterations per second: 16.60

#### Because -O3 gives the largest speedup of the three flags at no cost to accuracy, it is used from here on out.
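
To put the three flags side by side, the sketch below averages the logged rates for each one and expresses them relative to the unflagged build. The dictionary layout and the variable names are choices made here for illustration; the numbers are copied from the runs above.

```python
import numpy as np

# Iterations per second for each build, copied from the logs above
rates = {
    "no flag": [4.49, 3.36, 4.81],
    "-O1": [10.83, 15.13, 14.38],
    "-O2": [11.75, 11.32, 15.68],
    "-O3": [12.21, 16.50, 16.60],
}

means = {flag: float(np.mean(values)) for flag, values in rates.items()}
reference = means["no flag"]

for flag, mean in means.items():
    print("{:8s} {:6.2f} it/sec  {:.2f}x".format(flag, mean, mean / reference))
```

With three runs per flag and a visibly slower first run in each set, the ordering between -O1 and -O2 should be treated as uncertain.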



In [3]:
o3flag = np.mean([12.21, 16.5, 16.6])
print("O3 flag (it/sec): {:.2f}\nGain: {:.2f}x".format(o3flag, o3flag/baseline))

O3 flag (it/sec): 15.10
Gain: 3.58x


## Bogothreading
#### Truly parallelizing a single simulation such as this is challenging, since the entire system is interconnected. What can be done instead is to run the experiment several times in parallel in order to improve the resolution of the results. 8 simultaneous experiments were run on an i7-6700 (8 threads).
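
A minimal sketch of launching several independent copies at once from Python is shown below. The helper `run_replicas` is hypothetical, and the demo uses a stand-in command so that it runs anywhere; for the real experiment the command would be the compiled simulation. It is also an assumption that each copy starts from different initial conditions (for example a different random seed), since identical runs would add no information.

```python
import subprocess
import sys

def run_replicas(cmd, n_replicas):
    """Start n_replicas copies of cmd at once and collect each one's stdout."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for _ in range(n_replicas)
    ]
    # All processes are already running; communicate() just waits and reads.
    return [p.communicate()[0] for p in procs]

# Stand-in command (hypothetical); replace with the simulation executable.
demo_cmd = [sys.executable, "-c", "print('replica done')"]
outputs = run_replicas(demo_cmd, n_replicas=8)
print(len(outputs), "replicas finished")
```

Because the copies share nothing, the operating system is free to schedule one per hardware thread with no synchronization between them.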

## Multithreading
#### With a fair bit of tweaking, I was able to split the most intensive part of the code, the force calculation (an O(N²) algorithm in the number of atoms), by sending 1/n of the calculations to each of n threads (in my case n = 4 was ideal). Spawning a thread incurs a small overhead, on the order of 10 ms. However, given that an average loop takes about 60 ms, 4 threads can theoretically reach up to 40 iterations/second (15 + 10 ms per iteration). The measured rate is 37.7 it/sec, a ninefold speedup over the original code (2.5x over single-threaded -O3).

    Elapsed seconds: 2.661
    Iterations per second: 37.96

    Elapsed seconds: 2.702
    Iterations per second: 37.39

    Elapsed seconds: 2.672
    Iterations per second: 37.79



In [12]:
thread4 = np.mean([37.96, 37.39, 37.79])
print("4 threads (it/sec): {:.2f}\nGain: {:.2f}x".format(thread4, thread4/baseline))
print("Gain w.r.t O3 single thread: {:.2f}x".format(thread4/o3flag))

4 threads (it/sec): 37.71
Gain: 8.94x
Gain w.r.t O3 single thread: 2.50x
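
The 40 iterations/second estimate from the Multithreading section can be reproduced with a simple model: the loop time divides evenly across threads and threading adds a fixed cost per iteration. The function `predicted_rate` and its parameter names are introduced here for illustration, and the 60 ms and 10 ms inputs are the rough figures quoted in the text, not fresh measurements.

```python
def predicted_rate(loop_ms, n_threads, overhead_ms):
    """Iterations/second if the loop splits evenly and threading adds a fixed cost."""
    per_iteration_ms = loop_ms / n_threads + overhead_ms
    return 1000.0 / per_iteration_ms

predicted = predicted_rate(loop_ms=60.0, n_threads=4, overhead_ms=10.0)
measured = 37.71  # mean of the three 4-thread runs above

print("Predicted: {:.1f} it/sec".format(predicted))
print("Measured / predicted: {:.0%}".format(measured / predicted))
```

The model ignores the number of hardware threads, so it should not be extrapolated past the 4 threads available on the benchmark machine.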
