# CEED Software Thrust and libCEED
## Jed Brown (CU Boulder)
* CU Boulder: Jeremy Thompson, Valeria Barra, Leila Ghaffari
* LLNL: Yohann Dudouit, Jean-Sylvain Camier, Veselin Dobrev, Tzanio Kolev, David Medina
* UTK: Natalie Beams, Ahmad Abdelfattah, Stan Tomov
* ANL: Misun Min, Oana Marin
* UIUC: Thilina Rathnayake, Paul Fischer

## CEED Annual Meeting, 2020-08-12

# [CEED-3.0 Release](https://ceed.exascaleproject.org/ceed-3.0/) (March)

```bash
$ spack install ceed
$ docker run -it --rm jedbrown/ceed
$ singularity run docker://jedbrown/ceed
```

* GSLIB-1.0.6
* libCEED-0.6
* MAGMA-2.5.3
* MFEM-4.1
  * Laghos-3.0
  * Remhos-1.0
* OCCA-1.0.9

* Nek
  * Nek5000-19.0
  * Nekbone-17.0
  * NekCEM-c8db04b
* PETSc-3.13
* PUMI-2.2.2


## [libCEED](https://libceed.readthedocs.io): A FEM library with no FEM (purely algebraic)

* Backend plugins with run-time selection
  * debug/memcheck, optimized
  * libxsmm, CUDA, HIP
  * MAGMA to CUDA and HIP
  * OCCA to OpenMP, OpenCL, CUDA, and HIP
* Single source vanilla C for QFunctions
  * Easy to debug, understand locally, C++ optional
  * Target for DSLs, AD
* 2-clause BSD, Python interface
* Available via MFEM, PETSc, Nek5000

<img src="figures/ceed/libceed-backends.svg" />

Thanks to many developers, including Jeremy Thompson, Yohann Dudouit, Valeria Barra, Natalie Beams, Ahmad Abdelfattah, Leila Ghaffari, Tzanio Kolev, Veselin Dobrev, and David Medina.

## libCEED development

![](figures/ceed/libceed-badges.png)

* [v0.6 (March)](https://libceed.readthedocs.io/en/latest/releasenotes/#v0-6-mar-29-2020)
  * <a href="https://doi.org/10.25080/Majora-342d178e-00c">Python interface (SciPy 2020)</a>
  * Diagonal assembly and FDM solvers
  * MAGMA performance improvement
  * Many examples: BPs on the sphere, $p$-MG for BPs, compressible fluids solver, $p$-MG for hyperelasticity
  


<img src="figures/ceed/libceed-sunburst.svg" width="40%" />
CI: OSX, Linux (x86-64, POWER, ARM64, HIP)

* [v0.7 (soon)](https://libceed.readthedocs.io/en/latest/releasenotes/#current-main)
  * Point-block diagonal and GPU
  * HIP backend, OCCA refactor
  * Examples

<img src="figures/ceed/libCEED-2.png" width="100%" />


## Quadrature functions: the math

\begin{gather*}
    v^T F(u) \sim \int_\Omega v \cdot \color{olive}{f_0(u, \nabla u)} + \nabla v \!:\! \color{olive}{f_1(u, \nabla u)} \quad
    v^T J w \sim \int_\Omega \begin{bmatrix} v \\ \nabla v \end{bmatrix}^T \color{teal}{\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}}
    \begin{bmatrix} w \\ \nabla w \end{bmatrix} \\
    u = B_I \mathcal E_e u_L \qquad \nabla u = \frac{\partial X}{\partial x} B_{\nabla} \mathcal E_e u_L \\
    J w = \sum_e \mathcal E_e^T \begin{bmatrix} B_I \\ B_{\nabla} \end{bmatrix}^T
    \underbrace{\begin{bmatrix} I & \\ & \left( \frac{\partial X}{\partial x}\right)^T \end{bmatrix} W_q \color{teal}{\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}} \begin{bmatrix} I & \\ & \left( \frac{\partial X}{\partial x}\right) \end{bmatrix}}_{\text{coefficients at quadrature points}} \begin{bmatrix} B_I \\ B_{\nabla} \end{bmatrix} \mathcal E_e w_L
\end{gather*}
  
* $B_I$ and $B_\nabla$ are tensor contractions -- independent of element geometry
* Choice of how to order and represent gathers $\mathcal E$ and scatters $\mathcal E^T$
* Similar for Neumann/Robin and nonlinear boundary conditions
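
The first bullet can be made concrete with a small sketch. It assumes a 2D tensor-product element whose interpolation operator is $B_I = B_1 \otimes B_1$ for a single 1D matrix $B_1$. The names `B1`, `apply_kron`, and `apply_sum_factorized` are illustrative and not part of libCEED, and the sample points are chosen only to keep the arithmetic easy to check.

```python
# Sketch (hypothetical names): apply B = B1 (x) B1 to nodal values u on one
# 2D tensor-product element. B1 maps P 1D nodes to Q 1D sample points and
# never sees the element geometry.

def apply_kron(B1, u, P, Q):
    """Reference: apply the (Q*Q) x (P*P) Kronecker product entry by entry."""
    v = [0.0] * (Q * Q)
    for q0 in range(Q):
        for q1 in range(Q):
            for p0 in range(P):
                for p1 in range(P):
                    v[q0 * Q + q1] += B1[q0][p0] * B1[q1][p1] * u[p0 * P + p1]
    return v

def apply_sum_factorized(B1, u, P, Q):
    """Same result via two 1D contractions: O(Q*P*(P+Q)) work, not O(Q^2 * P^2)."""
    # Contract the first index: t[q0][p1] = sum_p0 B1[q0][p0] * u[p0][p1]
    t = [[sum(B1[q0][p0] * u[p0 * P + p1] for p0 in range(P))
          for p1 in range(P)] for q0 in range(Q)]
    # Contract the second index: v[q0][q1] = sum_p1 B1[q1][p1] * t[q0][p1]
    return [sum(B1[q1][p1] * t[q0][p1] for p1 in range(P))
            for q0 in range(Q) for q1 in range(Q)]

P, Q = 2, 3
# Linear Lagrange basis on [-1, 1] evaluated at the sample points -1, 0, 1.
B1 = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
u = [1.0, 2.0, 3.0, 4.0]  # nodal values, row-major in (p0, p1)
v_ref = apply_kron(B1, u, P, Q)
v_fast = apply_sum_factorized(B1, u, P, Q)
```

Both functions return the same values. Only the 1D matrix `B1` is stored, which is why the basis can be shared by every element regardless of its geometry.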

## Quadrature functions: debuggable, vectorizable, and JITable

* Independent operations at each of `Q` quadrature points, order unspecified

```c
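// Pointwise L2 residual: v = rho * (u - target) at each quadrature point
int L2residual(void *ctx, CeedInt Q,
    const CeedScalar *const in[], CeedScalar *const out[]) {
  const CeedScalar *u = in[0], *rho = in[1], *target = in[2];
  CeedScalar *v = out[0];
  for (CeedInt i=0; i<Q; i++) // iterations are independent
    v[i] = rho[i] * (u[i] - target[i]);
  return 0; // zero indicates success
}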
```
```c
// Fields are indexed in the order they are added: in[0], in[1], in[2]; out[0]
CeedQFunctionAddInput(qf, "u", 1, CEED_EVAL_INTERP);
CeedQFunctionAddInput(qf, "rho", 1, CEED_EVAL_INTERP);
CeedQFunctionAddInput(qf, "target", 1, CEED_EVAL_INTERP);
CeedQFunctionAddOutput(qf, "v", 1, CEED_EVAL_INTERP);
```
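
As a plain-Python analogue of the `L2residual` kernel (this is not the libCEED Python interface, and the function name and data are hypothetical), the sketch below checks the property that makes QFunctions vectorizable: each quadrature point is independent, so evaluating the points in any order gives the same values.

```python
import random

def l2_residual(u, rho, target):
    """Pointwise v_i = rho_i * (u_i - target_i); no coupling between points."""
    return [r * (a - t) for a, r, t in zip(u, rho, target)]

# Hypothetical values at Q = 4 quadrature points.
u      = [1.0, 2.0, 3.0, 4.0]
rho    = [0.5, 0.5, 2.0, 2.0]
target = [0.0, 1.0, 1.0, 5.0]
v = l2_residual(u, rho, target)

# Evaluate the same points in a shuffled order, then undo the shuffle.
perm = list(range(len(u)))
random.Random(0).shuffle(perm)
v_perm = l2_residual([u[i] for i in perm],
                     [rho[i] for i in perm],
                     [target[i] for i in perm])
v_unshuffled = [0.0] * len(u)
for k, i in enumerate(perm):
    v_unshuffled[i] = v_perm[k]
```

Because the kernel can be called on small hand-written arrays like these, it can be tested in isolation from the mesh, basis, and backend.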

## Building Operators from QFunctions

* `Operator` $A_{\text{local}} = \mathcal E^T B^T D B \mathcal E$
  * `ElemRestriction` $\mathcal E$
  * `Basis` $B$
  * `QFunction` $D$
* `CeedCompositeOperatorCreate` sums multiple operators
  * Different polynomial degree, element topology, physical process
* Distributed parallelism handled external to libCEED (MFEM, PETSc, etc.)
  * $A = \mathcal P^T A_{\text{local}} \mathcal P$
  * Flexible for mixed CPU/GPU programming, load balancing
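
A minimal sketch of the factorization $A_{\text{local}} = \mathcal E^T B^T D B \mathcal E$, assuming a 1D mass operator on a uniform mesh of linear elements with a 2-point Gauss rule. The function `apply_operator` and its arguments are hypothetical and written in plain Python. They only mirror the roles of `ElemRestriction`, `Basis`, and `QFunction`, and are not how any libCEED backend is implemented.

```python
import math

def apply_operator(u, n_elem, h, coeff=lambda x: 1.0):
    """Matrix-free y = E^T B^T D B E u: 1D mass operator, linear elements."""
    g = 1.0 / math.sqrt(3.0)          # 2-point Gauss rule on [-1, 1]
    qpts, qwts = [-g, g], [1.0, 1.0]
    # B: rows are quadrature points, columns are the 2 nodes of an element.
    B = [[0.5 * (1 - q), 0.5 * (1 + q)] for q in qpts]
    y = [0.0] * (n_elem + 1)
    for e in range(n_elem):
        # E: gather the two nodal values of element e
        u_e = [u[e], u[e + 1]]
        # B: interpolate to quadrature points
        u_q = [B[q][0] * u_e[0] + B[q][1] * u_e[1] for q in range(2)]
        # D: pointwise work (weight * Jacobian * coefficient * u)
        x_q = [h * (e + 0.5 * (1 + q)) for q in qpts]
        v_q = [qwts[q] * (h / 2) * coeff(x_q[q]) * u_q[q] for q in range(2)]
        # B^T then E^T: apply the transpose basis and scatter-add to nodes
        for i in range(2):
            y[e + i] += B[0][i] * v_q[0] + B[1][i] * v_q[1]
    return y

n_elem, h = 4, 0.25                   # unit interval split into 4 elements
ones = [1.0] * (n_elem + 1)
y = apply_operator(ones, n_elem, h)
```

Applying the mass operator to the constant function 1 and summing the result integrates 1 over the domain, so `sum(y)` should equal the domain length (1.0 here) up to rounding. No element matrix is ever formed; only the pointwise data in `D` depends on geometry and coefficients.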

## GPU support

* Resources: `/gpu/cuda/{ref,reg,shared,gen}`, `/gpu/hip/ref`, `/gpu/occa/{cuda,opencl,hip}`, `/gpu/magma`
* Fastest implementations use atomics; some backends are [deterministic](https://libceed.readthedocs.io/en/latest/gettingstarted/#backends)

### Using host memory
```c
PetscScalar *y;
VecGetArray(Ypetsc, &y); // writable host pointer
CeedVectorSetArray(Yceed, CEED_MEM_HOST, CEED_USE_POINTER, y);
CeedOperatorApply(op, Xceed, Yceed, CEED_REQUEST_IMMEDIATE);
```
### Without device transfers
```c
VecGetArrayInPlace(Ypetsc, &y); // host, CUDA, HIP
// memtype is CEED_MEM_HOST or CEED_MEM_DEVICE, matching where y resides
CeedVectorSetArray(Yceed, memtype, CEED_USE_POINTER, y);
```

## libCEED mini-apps

### [Compressible fluids](https://libceed.readthedocs.io/en/latest/examples/fluids/)

* SUPG, implicit/IMEX time integration
* Working on CFD challenge

<img src="figures/ceed/CFDChallenge-WindTunnel.png" />

### [Hyperelastic solids](https://libceed.readthedocs.io/en/latest/examples/solids/)

* $p$-multigrid, low-memory representation of matrix-free Jacobian

<img src="figures/ceed/libceed-solids-twist.gif" />

## Apps beyond ECP: PyLith

* Quasi-static and dynamic rupture
* Rate and state fault models; extensible using Python, large user community (CIG)

```python
from IPython.display import Video

Video("figures/pylith-sf1906plan_hd.mp4")
```

## Apps beyond ECP: PHASTA

* Highly scalable unstructured DDES (flow control, cardiovascular flow); in-situ visualization

<img src="figures/Boeing_A2_isoQspeed2_lowRes.png" width="90%" />

## PSAAP: Micromorphic and grain-resolving composite inelasticity

<img src="figures/micromorph/workflow.png" />

# The future is all GPU, they said

<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/Fugaku?src=hash&amp;ref_src=twsrc%5Etfw">#Fugaku</a> has become No.1 in all the supercomputer performance benchmarks, <a href="https://twitter.com/hashtag/Top500?src=hash&amp;ref_src=twsrc%5Etfw">#Top500</a>, <a href="https://twitter.com/hashtag/HPCG?src=hash&amp;ref_src=twsrc%5Etfw">#HPCG</a>, HPL-AI, and the <a href="https://twitter.com/hashtag/Graph500?src=hash&amp;ref_src=twsrc%5Etfw">#Graph500</a> for the first time in history as a single machine simultaneously. Thanks for putting up the list! <a href="https://t.co/iM1o0gjrtZ">https://t.co/iM1o0gjrtZ</a></p>&mdash; Satoshi Matsuoka (@ProfMatsuoka) <a href="https://twitter.com/ProfMatsuoka/status/1275085055354974210?ref_src=twsrc%5Etfw">June 22, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

* #1 in Green500 in Nov 2019, now #4.

![](figures/hardware/fugaku.png)

![](figures/acme/acme-sypd-2014.png)

### "No DOE facility through 2020 will run ACME faster than Edison" -- 2013

![Bertagna et al. (2019)](figures/acme/Bertagna2019-HOMME.png)
Bertagna et al. (2019)

# Latency and throughput are different

<img src="figures/Kronbichler-fig4-crop.png" width="60%" />
Adapted from Kronbichler and Ljungkvist (2019)

## Fuhrer et al. (2018): Near-global climate simulation at 1 km

<img src="figures/fuhrer2018-scaling-nodes.png" width="75%" />

## Fuhrer et al. (2018): Near-global climate simulation at 1 km

<img src="figures/fuhrer2018-scaling-time-ann1.png" width="75%" />

## Fuhrer et al. (2018): Near-global climate simulation at 1 km

<img src="figures/fuhrer2018-scaling-time-ann4.png" width="75%" />

# A Pareto approach to capability

* Model complexity (physical scales, stochasticity)
* Accuracy/fidelity (V\&V, reliability, decisions)
* Time (execution time, total workflow time)
* Execution cost (node hours, joules, $, £)
* Human cost (training, cognitive load, community)

![](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b7/Front_pareto.svg/2560px-Front_pareto.svg.png)
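
To illustrate what a Pareto view of these objectives means in practice, the sketch below extracts the non-dominated configurations when every objective is to be minimized. The function `pareto_front` and the (time, cost) numbers are hypothetical and not taken from any measurement in this talk.

```python
def pareto_front(points):
    """Return the points not dominated by any other (all objectives minimized)."""
    def dominates(a, b):
        # a dominates b if it is no worse in every objective and better in one
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (execution time [s], cost [node-hours]) for five configurations.
configs = [(10.0, 8.0), (20.0, 3.0), (15.0, 5.0), (18.0, 6.0), (40.0, 2.9)]
front = pareto_front(configs)
```

Here `(18.0, 6.0)` is dropped because `(15.0, 5.0)` is both faster and cheaper. The remaining four configurations are all defensible choices, depending on how time is traded against cost.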

# Better plotting: $E(n)$ versus $t(n)$

```python
from ceed_postprocess_base import read_logs
import altair as alt
from glob import glob

runs = read_logs(glob('data/ceed/noether/petsc-bps-2020-*.txt'))
runs['FE_nodes_per_compute_node'] = runs['num_unknowns'] / (runs['num_procs'] / runs['num_procs_node']) / runs['dof_per_node']
highlight = alt.selection_single(
    on='mouseover',
    fields=['degree', 'time_per_it', 'backend'],
    nearest=True,
    empty='none',
)

bps_select = alt.selection_single(
    fields=['bp'],
)

base = alt.Chart(runs).encode(
    alt.Y('mdofs:Q', title='MDoF/s per CG iteration'),
    alt.Color('degree:N'),
    alt.Size('num_unknowns', scale=alt.Scale(type='log', domain=(1e3, 1e6))),
    alt.Shape('bp:N'),
    tooltip=('bp', 'num_procs', 'backend', 'num_elem', 'degree', 'num_unknowns', 'file'),
).transform_filter(
    bps_select,
).transform_calculate(
    mdofs='datum.cg_iteration_dps/1e6',
)

points = base.mark_point().encode(
    opacity=alt.condition(highlight, alt.value(1), alt.value(.5)),
).add_selection(
    highlight,
)

lines = base.mark_line().encode(
    size=alt.condition(alt.datum.degree - highlight.degree == 0, alt.value(2), alt.value(1))
)

pane = points + lines

composite = (
    pane.encode(
        alt.X('time_per_it', scale=alt.Scale(type='log')),
    ) |
    pane.encode(
        alt.X('FE_nodes_per_compute_node', scale=alt.Scale(type='log', domain=(3e4, 1e7), clamp=True)),
    )
)

activator = alt.Chart(runs).mark_point().encode(
    alt.Y('bp', title='BP'),
    alt.Shape('bp')
).add_selection(bps_select).properties(title='Selection')

activator | composite.properties(title='Noether (2x EPYC 7452), gcc-10')
```

* Implementations improved by competition between MFEM, libCEED, libParanumal (Warburton et al.), and deal.II (Kronbichler)

<img src="figures/ceed/mfem-libceed-lassen.png" width="60%" />

# Cloud and the grand un-distortion

* Funding agencies and facilities put their thumb on the scale
  * Use this machine
  * Use it at this scale
* Incentives sometimes poorly aligned with science & engineering objectives
* PIs have limited time to apply for allocations; easier to "play the game"
* Expensive to port and tune

### Cloud is all fungible
* including human vs compute
* ? cost to PI and program office

![](figures/acme/acme-bundling-poster.png)

![](figures/acme/acme-bundling-mira.png)

Most apps don't have the human resources to go through these contortions.

# Outlook

* libCEED
  * BDDC preconditioning with sum factorization (no assembly)
  * Backend optimization
  * Rust interface
  * Mixed precision
  * Algorithmic differentiation using Enzyme (LLVM, William Moses)
* PETSc
  * Improved HIP and SYCL support; more fusion in Krylov
  * Improved libCEED integration
  * "AMG"/BDDC using libCEED representation (basis independent)
* New solver BPs

Thanks to DOE ECP for support.