## Procedural Terrain Generation -- Fast Terrain

### Details

- Relies on an optimized Perlin noise method to generate randomized terrain.

- The CPU code is implemented in Numba and parallelized by passing `parallel=True` to the `@jit` decorator. It has been optimized and operates on single-precision floating-point numbers (`np.float32`).

- Dask: The Dask implementation is ready for testing but has not yet been tested in a distributed setup. On a single node, it runs slower than the parallel Numba + NumPy version.

- The GPU code is implemented in Numba and CuPy. Optimizations performed:
  - Uses a thread-local array to store the (4, 2) array used in the gradient function.
  - Each thread calculates multiple output elements. This way the code can handle arbitrarily large input dimensions and runs more efficiently.
  - The block and grid dimensions have been fine-tuned for the V100 GPU.
  - Uses `float32` numbers.
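The "multiple output elements per thread" point above is commonly realized with a grid-stride loop. The following is a minimal pure-Python simulation of that indexing pattern, assuming a flat 1-D mapping of threads to elements. It is not the library's actual CUDA kernel, and `grid_stride_fill` and `n_threads` are hypothetical names used only for illustration.

```python
import numpy as np

def grid_stride_fill(out, n_threads):
    """Simulate n_threads workers, each visiting several elements of `out`.

    Worker `tid` handles flat indices tid, tid + n_threads, tid + 2*n_threads, ...
    so any array size is covered with a fixed number of workers.
    """
    flat = out.reshape(-1)  # view on the same memory for a contiguous array
    for tid in range(n_threads):
        for i in range(tid, flat.size, n_threads):
            flat[i] += 1  # count how often each element is visited
    return out

# Small illustrative size; far fewer "threads" than elements
counts = grid_stride_fill(np.zeros((6, 10), dtype=np.int32), n_threads=8)
print(counts.min(), counts.max())
```

Every element is visited exactly once even though there are fewer workers than elements, which is why this pattern decouples the launch configuration from the input size.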

In [None]:
import os
import numpy as np
import datashader as ds
import xarray as xr

# make sure the right version of perlin.py is loaded
from xrspatial import generate_fast_terrain

In [None]:
# This is just an example input size. In general, the GPU speedup depends on the input size.
W = 1920
H = 1080
x_range = (-20e6, 20e6)
y_range = (-20e6, 20e6)
seed = 42
zfactor = 4000


### Load Data

In [None]:
# Numpy + Numba
# Use float32 datatype
terrain = xr.DataArray(np.zeros((H,W), dtype=np.float32),
                      name='numpy_terrain',
                      dims=('y', 'x'),
                      attrs={'res': 1})
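As a rough illustration of why `float32` is used throughout, the snippet below compares the memory footprint of the example raster in single and double precision. It uses only NumPy; the sizes are the example dimensions from this notebook, and the variable names are chosen for illustration.

```python
import numpy as np

H, W = 1080, 1920  # same example size as in this notebook

# nbytes reports the size of the array's data buffer in bytes
bytes_f32 = np.zeros((H, W), dtype=np.float32).nbytes
bytes_f64 = np.zeros((H, W), dtype=np.float64).nbytes

print(f"float32: {bytes_f32 / 2**20:.1f} MiB")
print(f"float64: {bytes_f64 / 2**20:.1f} MiB")
```

Halving the element size also halves the amount of data moved between host and device, which matters more as the raster grows.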





In [None]:
# CuPy + Numba
# Allocate the terrain directly on the GPU
import cupy
gpu_terrain = xr.DataArray(cupy.zeros((H,W), dtype=np.float32), 
                           name='cupy_terrain',
                           dims=('y', 'x'),
                           attrs={'res': 1})

In [None]:
# Setup dask cluster
from dask.distributed import Client
client = Client(processes=False, threads_per_worker=1, n_workers=4, memory_limit='2GB')

In [None]:
# Dask terrain, not tested yet
import dask.array as da
dask_terrain = xr.DataArray(da.zeros((H,W), dtype=np.float32, chunks=2048),
                           name='dask_numpy_terrain',
                           dims=('y', 'x'),
                           attrs={'res': 1})
# persist() returns a new object, so keep the result
dask_terrain = dask_terrain.persist()
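With `chunks=2048` and a 1080 x 1920 array, the whole raster fits in a single chunk, so Dask has nothing to distribute across the four workers. This may contribute to the single-node slowdown, although that has not been verified here. The sketch below shows how a uniform integer chunk size splits each axis; `chunk_grid` is a hypothetical helper, not a Dask API.

```python
import math

def chunk_grid(shape, chunk):
    """Number of chunks along each axis for a uniform chunk size `chunk`."""
    return tuple(math.ceil(n / chunk) for n in shape)

H, W = 1080, 1920

single = chunk_grid((H, W), 2048)  # chunk size used in this notebook
split = chunk_grid((H, W), 512)    # a smaller, purely illustrative chunk size

print(single, split)
```

A smaller chunk size produces several blocks that can be scheduled in parallel, at the cost of more task overhead; the best value would need to be measured.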

### CPU Benchmarking

- 4xLarge 16 Cores 128GB RAM

In [None]:
# run cpu benchmark
cpu_time = %timeit -o cpu_res = generate_fast_terrain(terrain, x_range, y_range, seed, zfactor)

### Dask Benchmarking

In [None]:
# run dask benchmark, not yet tested
dask_cpu_time = %timeit -o dask_cpu_res = generate_fast_terrain(dask_terrain, x_range, y_range, seed, zfactor).compute()

### GPU Benchmarking

- T4
- V100

In [None]:
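# run gpu benchmark; synchronize so that any asynchronous kernel launches are included in the timing
gpu_time = %timeit -o gpu_res = generate_fast_terrain(gpu_terrain, x_range, y_range, seed, zfactor); cupy.cuda.Device().synchronize()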

### Calculate and Report the Results

In [None]:
# CPU time  
mean_cpu_time = np.mean(cpu_time.all_runs)/cpu_time.loops
std_cpu_time = np.std(cpu_time.all_runs)/cpu_time.loops

# necessary initializations
mean_dask_cpu_time = std_dask_cpu_time = 0
mean_gpu_time = std_gpu_time = 0
speedup_dask = speedup_gpu = 0
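The per-call mean and standard deviation computed above can be reproduced outside IPython with the standard-library `timeit` module. The sketch below assumes a stand-in workload, `work`, which is hypothetical and should be replaced by the call being benchmarked; `loops` and `repeats` play the roles of `TimeitResult.loops` and the length of `TimeitResult.all_runs`.

```python
import timeit
import numpy as np

def work():
    # Stand-in workload (hypothetical); replace with the call being benchmarked
    return np.sqrt(np.arange(10_000, dtype=np.float32)).sum()

loops = 20    # calls per run
repeats = 5   # number of runs

# Each entry is the total wall time for `loops` calls
all_runs = timeit.repeat(work, number=loops, repeat=repeats)

# Divide by `loops` to get per-call statistics, as in the cell above
mean_time = np.mean(all_runs) / loops
std_time = np.std(all_runs) / loops
print(f"{mean_time:.6f} ± {std_time:.6f} sec per call")
```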

In [None]:
# Dask time and speedup
mean_dask_cpu_time = np.mean(dask_cpu_time.all_runs)/dask_cpu_time.loops
std_dask_cpu_time = np.std(dask_cpu_time.all_runs)/dask_cpu_time.loops

speedup_dask = mean_cpu_time / mean_dask_cpu_time

In [None]:
# GPU time and speedup
mean_gpu_time = np.mean(gpu_time.all_runs)/gpu_time.loops
std_gpu_time = np.std(gpu_time.all_runs)/gpu_time.loops

speedup_gpu = mean_cpu_time / mean_gpu_time

In [None]:
print('HxW        CPU Time (sec)    Dask Time (sec)    Speedup Dask    GPU Time (sec)    Speedup GPU')
print('{}x{}   {:.3f} ± {:.3f}    {:.3f} ± {:.3f}    {:.2f}x    {:.3f} ± {:.3f}    {:.2f}x'.format(
        H, W, mean_cpu_time, std_cpu_time,
        mean_dask_cpu_time, std_dask_cpu_time, speedup_dask,
        mean_gpu_time, std_gpu_time, speedup_gpu))

In [None]:
# stop the dask cluster
client.close()