# Julia is as fast as C or Fortran


## Case study: simulation of the Kuramoto-Sivashinsky equation

The Kuramoto-Sivashinsky (KS) equation is a nonlinear time-evolving partial differential equation (PDE) on a 1d spatial domain.

\begin{equation*}
u_t = - u_{xx} - u_{xxxx} - u u_x
\end{equation*}

where $x$ is space, $t$ is time, and subscripts indicate differentiation. We assume a spatial domain $x \in [0, L_x]$ with periodic boundary conditions and initial condition $u(x,0) = u_0(x)$. 
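
Periodic boundary conditions make a Fourier expansion a natural fit. As a sketch (the symbols $\hat{u}_k$ and $\alpha_k$ are notation introduced here, chosen to match the `alpha` variable in the Julia code further down), write $u(x,t) = \sum_k \hat{u}_k(t) \, e^{i \alpha_k x}$ with $\alpha_k = 2\pi k / L_x$. Substituting into the KS equation and using $u u_x = \frac{1}{2} (u^2)_x$ gives one ordinary differential equation per wavenumber:

\begin{equation*}
\frac{d \hat{u}_k}{dt} = \left(\alpha_k^2 - \alpha_k^4\right) \hat{u}_k - \frac{i \alpha_k}{2} \, \widehat{(u^2)}_k
\end{equation*}

The linear terms are diagonal in this basis, while the nonlinear term couples wavenumbers through the Fourier transform of $u^2$.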

The figure below shows a simulation for $u_0(x) = \cos(x) + 0.1 \sin(x/8) + 0.01 \cos(x/32)$ and $L_x = 64\pi$.
![Space-time plot of u(x,t) for the KS simulation](figs/ksdynamics.svg)

## The benchmark algorithm: KS-CNAB2

We implemented the same numerical integration method for the KS equation in six languages. The method uses finite Fourier expansions to discretize space and semi-implicit finite-differencing to discretize time, specifically the 2nd-order Crank-Nicolson/Adams-Bashforth (CNAB2) timestepping formula. All languages use the same FFTW library for the Fourier transforms.
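
Written out, with notation introduced here for illustration: let $\hat{u}^n$ be the vector of Fourier coefficients at time $t_n = n \, \Delta t$, let $L$ be the diagonal linear operator for $-\partial_x^2 - \partial_x^4$, and let $N^n$ be the Fourier transform of the nonlinear term $-u u_x$ at step $n$. CNAB2 treats the linear term with Crank-Nicolson and extrapolates the nonlinear term with 2nd-order Adams-Bashforth:

\begin{equation*}
\frac{\hat{u}^{n+1} - \hat{u}^n}{\Delta t} = \frac{1}{2} L \left( \hat{u}^{n+1} + \hat{u}^n \right) + \frac{3}{2} N^n - \frac{1}{2} N^{n-1}
\end{equation*}

Solving for $\hat{u}^{n+1}$ gives

\begin{equation*}
\hat{u}^{n+1} = \left( I - \frac{\Delta t}{2} L \right)^{-1} \left[ \left( I + \frac{\Delta t}{2} L \right) \hat{u}^n + \frac{3 \Delta t}{2} N^n - \frac{\Delta t}{2} N^{n-1} \right]
\end{equation*}

Because $L$ is diagonal in Fourier space, the inverse is an elementwise division. In the Julia code for KS-CNAB2, `A_inv` and `B` appear to hold these two diagonal factors.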


## Benchmark results: execution time versus simulation size $N_x$

![Execution time versus simulation size Nx](figs/cputime_vs_size.svg)


**Expectation of $N_x \log N_x$ scaling**. Execution time for this algorithm should ideally be dominated by the $N_x \log N_x$ cost of the FFTs. In the above plot, all the codes do appear to scale as $N_x \log N_x$ at large $N_x$, with different prefactors.
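
As a rough illustration of what $N_x \log N_x$ scaling predicts, the sketch below evaluates an idealized cost model. `fftcost` is a hypothetical helper, not part of the benchmark codes, and it ignores constant factors, cache effects, and the non-FFT work in each time step.

```julia
# Idealized FFT cost model: work proportional to N*log2(N).
# `fftcost` is an illustrative helper, not part of the benchmark codes.
fftcost(N) = N * log2(N)

# Predicted growth in run time when the grid size doubles.
for p in (5, 10, 16)
    N = 2^p
    ratio = fftcost(2N) / fftcost(N)
    println("Nx = 2^$p -> 2^$(p+1): predicted time ratio ", ratio)
end
```

Under this model the ratio is $2(p+1)/p$ for $N_x = 2^p$, so doubling the grid should slightly more than double the run time, with the excess shrinking as $N_x$ grows.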

**Two Julia codes.** 

   * **Julia** code is nearly a line-by-line translation of the Matlab code, but it eliminates temporary vectors in the inner loop by using in-place FFTs and julia-0.6's loop fusion capability.
  
   * **Julia unrolled** unrolls all the vector operations into explicit for loops. 
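
The difference between the two styles can be sketched on the CNAB2 update alone. The snippet below uses made-up data (the variable names mirror those in `ksintegrate`, but the values are random placeholders) and assumes Julia 0.7 or later for the `ComplexF64` type name. It is meant only to show that the fused broadcast and the hand-written loop compute the same thing.

```julia
# Illustrative data only; the names mirror the CNAB2 update in ksintegrate.
Nx = 8
dt2, dt32 = 0.5/16, 1.5/16
A_inv = rand(Nx)
B     = rand(Nx)
u   = rand(ComplexF64, Nx)
Nn  = rand(ComplexF64, Nx)
Nn1 = rand(ComplexF64, Nx)

# Fused broadcast: the dotted operations are combined into a single loop
# that writes directly into the preallocated output vector.
u_fused = similar(u)
u_fused .= A_inv .* (B .* u .+ dt32 .* Nn .- dt2 .* Nn1)

# Hand-unrolled equivalent: the same arithmetic as an explicit for loop.
u_loop = similar(u)
for i in 1:Nx
    u_loop[i] = A_inv[i] * (B[i]*u[i] + dt32*Nn[i] - dt2*Nn1[i])
end

println(u_fused ≈ u_loop)
```

Neither form allocates temporary vectors inside the update, which is the property the benchmark relies on.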
  

**Julia, C, C++, Fortran results are practically identical.** Both Julia codes, Fortran, C, and C++ beat naive Python and Matlab handily. Julia is close to C and C++ (factor of 1.06), and Julia unrolled is close to Fortran (factor of 1.03). Execution times of Julia, Julia unrolled, C, C++, and Fortran are all within about 15% of each other. The benchmarks were averaged over thousands of runs for $N_x = 32$, scaling down to 8 runs for $N_x = 2^{17}$.


**CPU, OS, and compiler:** All benchmarks were run single-threaded on a six-core Intel Core i7-3960X CPU @ 3.30GHz with 64 GB memory running openSUSE Leap 42.2. C and C++ were compiled with clang 3.8.0, Fortran with gfortran 4.8.3, Julia was julia-0.7-DEV, and all used optimization ``-O3``. For more details see [benchmark-data/cputime.asc](benchmark-data/cputime.asc). 



## Benchmark results: execution time versus line count

![Execution time versus line count](figs/cputime_vs_linecount.svg)

Julia clearly hits the sweet spot of low execution time and low line count. The extra lines in Julia over Matlab are for in-place FFTs. 

## Julia code for KS-CNAB2

In [1]:
using FFTW   # on Julia 0.7 and later, FFTs live in the FFTW.jl package

function ksintegrate(u0, Lx, dt, Nt, nsave)
    u = (1+0im)*u0                      # force u to be complex
    Nx = length(u)                      # number of gridpoints
    kx = vcat(0:Nx/2-1, 0:0, -Nx/2+1:-1) # integer wavenumbers: exp(2*pi*im*kx*x/Lx)
    alpha = 2*pi*kx/Lx                  # real wavenumbers:    exp(im*alpha*x)
    D = 1im*alpha                       # spectral D = d/dx operator
    L = alpha.^2 - alpha.^4             # spectral L = -D^2 - D^4 operator
    G = -0.5*D                          # spectral -1/2 D operator, to eval -u u_x = -1/2 d/dx u^2

    Nsave = div(Nt, nsave)+1        # number of saved time steps, including t=0
    t = (0:Nsave-1)*(dt*nsave)      # times of saved steps (Nsave values, matching rows of U)
    U = zeros(Nsave, Nx)            # matrix of u(xⱼ, tᵢ) values
    U[1,:] = u                      # assign initial condition to U
    s = 2                           # counter for saved data
    
    # convenience variables
    dt2  = dt/2
    dt32 = 3*dt/2
    A_inv = (ones(Nx) - dt2*L).^(-1)
    B     =  ones(Nx) + dt2*L
    
    # compute in-place FFTW plans
    FFT! = plan_fft!(u, flags=FFTW.ESTIMATE)
    IFFT! = plan_ifft!(u, flags=FFTW.ESTIMATE)

    # compute nonlinear term Nn == -u u_x 
    Nn  = G.*fft(u.^2);    # Nn == -1/2 d/dx (u^2) = -u u_x
    Nn1 = copy(Nn);        # Nn1 = Nn at first time step
    FFT!*u;
    
    # timestepping loop
    for n = 1:Nt

        Nn1 .= Nn       # shift nonlinear term in time
        Nn .= u         # put u into Nn in prep for comp of nonlinear term
        
        IFFT!*Nn;       # transform Nn to gridpt values, in place
        Nn .= Nn.*Nn;   # collocation calculation of u^2
        FFT!*Nn;        # transform Nn back to spectral coeffs, in place

        Nn .= G.*Nn;    # compute Nn == -1/2 d/dx (u^2) = -u u_x

        # loop fusion! Julia transforms the following line into
        # a single for-loop that avoids creating temporary vectors!

        u .= A_inv .* (B .* u .+ dt32.*Nn .- dt2.*Nn1); 
        
        if mod(n, nsave) == 0
            U[s,:] = real(ifft(u))
            s += 1            
        end
    end
   
    t,U
end

ksintegrate (generic function with 1 method)
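
The `kx` line in `ksintegrate` follows the standard FFT output ordering: zero and positive wavenumbers first, then the negative ones. A small sketch with `Nx = 8` (a value picked here purely for readability) makes the layout visible. Note that this code assigns wavenumber 0 to the Nyquist slot at index `Nx/2 + 1`.

```julia
# Wavenumber layout used by ksintegrate, shown for a small illustrative grid.
Nx = 8
kx = vcat(0:Nx/2-1, 0:0, -Nx/2+1:-1)

println(kx)            # eight values: 0..3, then 0 (Nyquist slot), then -3..-1
println(kx[div(Nx, 2) + 1])   # the Nyquist slot
```

Because `Nx/2` is a floating-point value, `kx` comes out as a `Vector{Float64}`, which is harmless here since it is immediately scaled by `2*pi/Lx`.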

## Execute the Julia code

In [3]:
# set parameters
Lx = 64*pi
Nx = 1024
dt = 1/16
nsave = 8
Nt = 3200

# set initial condition and run simulation
x = Lx*(0:Nx-1)/Nx
u = cos.(x) + 0.1*sin.(x/8) + 0.01*cos.((2*pi/Lx)*x);
t,U = ksintegrate(u, Lx, dt, Nt, nsave)

# load plotting module and plot results
using Plots; gr()
heatmap(x,t,U, xlim=(x[1], x[end]), ylim=(t[1], t[end]), xlabel="x", ylabel="t", 
    title="u(x,t)", fillcolor=:bluesreds)
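
For orientation, the parameter choices above fix the length of the run and the size of the saved data. The following sketch just repeats that arithmetic. The names `T` and `dtsave` are introduced here and do not appear in `ksintegrate`.

```julia
# Same parameter values as the run above.
dt, Nt, nsave = 1/16, 3200, 8

T      = Nt*dt               # final simulation time
Nsave  = div(Nt, nsave) + 1  # number of saved snapshots, including t = 0
dtsave = dt*nsave            # time between saved snapshots

println("T = $T, Nsave = $Nsave, dtsave = $dtsave")
```

So the simulation runs to $t = 200$ and `U` holds 401 snapshots spaced 0.5 time units apart, each with `Nx = 1024` grid values.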

## Benchmark codes

  * [ksbenchmark.py](codes/ksbenchmark.py), Python
  * [ksbenchmark.m](codes/ksbenchmark.m), Matlab
  * [ksbenchmark.jl](codes/ksbenchmark.jl), Julia 
  * [ksbenchmark.c](codes/ksbenchmark.c), C
  * [ksbenchmark.cpp](codes/ksbenchmark.cpp), C++ 
  * [ksbenchmark.f90](codes/ksbenchmark.f90), Fortran



# Julia performance benchmarks

The micro-benchmark codes test iteration, recursion, matrix operations, and I/O,
using identical algorithms in 11 languages. Results are normalized so that C = 1; smaller is better.


![language benchmarks plot](figs/benchmarks.svg)
