# Pragmatic performance-portable solids and fluids with Ratel, libCEED, and PETSc

## **Jed Brown**, CU Boulder
### Collaborators: Zach Atkins, Valeria Barra, Natalie Beams, Fabio Di Gioacchino, Leila Ghaffari, Ken Jansen, Matthew Knepley, William Moses, Rezgar Shakeri, Karen Stengel, Jeremy L. Thompson, James Wright III, Junchao Zhang


## NUWEST 2024

In [1]:
from IPython.display import SVG, Video, HTML, IFrame
import pandas as pd
import altair as alt
from io import StringIO
import numpy as np

import base64
from IPython.display import Image, display
#import matplotlib.pyplot as plt

def mm(graph):
  graphbytes = graph.encode("ascii")
  base64_bytes = base64.b64encode(graphbytes)
  base64_string = base64_bytes.decode("ascii")
  display(
    Image(
      url="https://mermaid.ink/img/"
      + base64_string
    )
  )

# David Keyes, "Petaflop/s, seriously" (ca. 2007)

<img src="figures/Keyes-PeanutButter-2008.png" class="center" width="70%" />

# Constants matter

## Relative cost of compute versus memory access
## Accuracy tolerances depend on application
## GPU vs CPU latencies
## Accuracy or conservation? Unbiased or biased error?

<img src="figures/app-perf-cartoon-2.png" width="100%" />

# Nonlinear solid mechanics


<video src="figures/ratel/schwarz-q2-5x5x5-t20-l2-r2.webm" width="90%" autoplay controls loop />

## Industrial state of practice

* Low order finite elements: $Q_1$ (trilinear) hexahedra, $P_2$ (quadratic) tetrahedra.
* Assembled matrices, sparse direct and algebraic multigrid solvers

## Myths

* High order doesn't help because real problems have singularities.
* Matrix-free is just for (very) high order

<video src="figures/ratel/tire-platen.webm" autoplay loop width="100%" />

# Approximation constants are good for high order

<img src="figures/ratel/accuracy_study_annotated.svg" width="95%" />

# Bandwidth is scarce compared to flops

<img src="figures/karlrupp/flop-per-byte-dp-2022.svg" width="90%" class="center" />

# Why matrix-free?
* Assembled matrices need at least 4 bytes transferred per flop, while hardware delivers about 10 flops per byte. Matrix-free methods store and move less data, so they compute faster.

<img src="figures/TensorVsAssembly-qstore.svg" width="90%" class="center" />
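The 4 bytes/flop figure can be checked with a little arithmetic. The sketch below is a simplified model, not a measurement: it assumes double-precision values, optional 32-bit column indices, two flops (multiply and add) per stored nonzero, and it ignores vector and row-pointer traffic. The function names are illustrative only.

```python
def spmv_bytes_per_flop(value_bytes=8, index_bytes=4):
    """Bytes streamed per flop in a sparse matrix-vector product."""
    flops_per_nonzero = 2  # one multiply and one add per stored entry
    return (value_bytes + index_bytes) / flops_per_nonzero


def fraction_of_peak(bytes_per_flop, machine_flops_per_byte=10.0):
    """Roofline-style bound on the achievable fraction of peak flops."""
    intensity = 1.0 / bytes_per_flop  # flops per byte of the kernel
    return min(1.0, intensity / machine_flops_per_byte)


# Matrix values only: the lower bound on traffic
print(spmv_bytes_per_flop(index_bytes=0))
# Values plus 32-bit column indices, as in a typical CSR format
print(spmv_bytes_per_flop())
print(fraction_of_peak(spmv_bytes_per_flop()))
```

Under these assumptions, a kernel that streams matrix entries is limited to a few percent of peak on hardware balanced at 10 flops/byte, which is the motivation for moving less data.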

# Matrix-free is already faster for $Q_1$ elements

<img src="figures/ratel/schwarz-apply.svg" width="90%" />

# $p$-multigrid algorithm and cost breakdown

<img src="figures/ratel/p-mg-cycle.png" width="80%" />

<img src="figures/ratel/op_schematic.svg" width="80%" />

In [7]:
IFrame("figures/ratel/schwarz-q2-flame.svg", width="1200", height="350")

# Nonlinear solve efficiency

## $Q_2$ elements
<img src="figures/ratel/schwarz-q2-t20-r2-l2-SNESSolve.svg" />

## $Q_3$ elements
<img src="figures/ratel/schwarz-q3-t20-r2-l1-SNESSolve.svg" />

# Linear solve efficiency

## $Q_2$ elements
<img src="figures/ratel/schwarz-q2-t20-r2-l2-KSPSolve.svg" />

## $Q_3$ elements
<img src="figures/ratel/schwarz-q3-t20-r2-l1-KSPSolve.svg" />

* Coarse solver is hypre BoomerAMG tuned for elasticity; thanks to Victor Paludetto Magri.

# Preconditioner setup efficiency

## $Q_2$ elements
<img src="figures/ratel/schwarz-q2-t20-r2-l2-PCSetUp.svg" />

## $Q_3$ elements
<img src="figures/ratel/schwarz-q3-t20-r2-l1-PCSetUp.svg" />

# Half-inch puck (F67), 50 MDoF, quadratic tets

<img src="figures/micromorph/f67-detail.png" width="100%" />

* 40k grains segmented from CT scans
* 1% global strain, neo-Hookean model
* 34 seconds per nonlinear solve (`rtol=1e-8`)
  * 7 seconds per linear solve
  * 45 CG iterations
* `/gpu/hip/shared` backend since `hip/gen` does not yet support non-tensor elements (such as the tets used here)
  * Will try `hip/magma`
* BoomerAMG coarse solve (linear elements)
* Pure-GPU assembly into hypre ParCSR

In [28]:
IFrame("figures/micromorph/flamegraph-tioga-f67-puck.svg", width="1400", height="200")

# Phase-field damage mechanics

$$
\begin{bmatrix}
A & B \\
C & D
\end{bmatrix}
\begin{bmatrix} \mathbf u \\ \phi \end{bmatrix}
=
\begin{bmatrix} \mathbf b \\ \mathbf 0 \end{bmatrix}
$$
* $A$ is elasticity operator
* $D$ is screened Laplacian for damage (Green's functions decay in a few elements)

## p-MG setup

* p-MG coarsen from quadratic to linear elements (tets in this example)
* specify 6-dimensional rigid body modes as near null space
* damage field $\phi$ is not needed in AMG
* optional: point-block Jacobi smoothing
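The six rigid body modes in 3D are three translations and three infinitesimal rotations. The following is a minimal NumPy sketch of how such a near null space could be built from nodal coordinates; the function name and the node-interleaved DoF layout are assumptions for illustration. In PETSc, `MatNullSpaceCreateRigidBody` and `MatSetNearNullSpace` serve this purpose.

```python
import numpy as np


def rigid_body_modes(coords):
    """Return a (3 * n_nodes, 6) array of rigid body modes.

    coords has shape (n_nodes, 3); DoFs are interleaved per node (ux, uy, uz).
    """
    n = coords.shape[0]
    x, y, z = coords.T
    modes = np.zeros((n, 3, 6))
    # Translations along x, y, z
    modes[:, 0, 0] = 1.0
    modes[:, 1, 1] = 1.0
    modes[:, 2, 2] = 1.0
    # Infinitesimal rotations u = omega x r
    modes[:, 0, 3], modes[:, 1, 3] = -y, x  # about z
    modes[:, 1, 4], modes[:, 2, 4] = -z, y  # about x
    modes[:, 0, 5], modes[:, 2, 5] = z, -x  # about y
    return modes.reshape(3 * n, 6)


coords = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
modes = rigid_body_modes(coords)
print(modes.shape, np.linalg.matrix_rank(modes))
```

Only the displacement block needs these vectors; the damage field is left out of the AMG coarse problem, consistent with the list above.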

<video src="figures/ratel/luke-damage-Gc-ratio-20.webm" autoplay loop width="100%" />

<video src="figures/ratel/luke-sigmaxx-Gc-ratio-20.webm" autoplay loop width="100%" />

# One node of Crusher vs historical Gordon Bell
* 184 MDoF $Q_2$ elements nonlinear analysis in seconds

### 2002 Gordon Bell (Bhardwaj et al)

<img src="figures/ratel/gordon-bell-2002-mems.png" width="70%" />

<img src="figures/ratel/gordon-bell-2002-mems-table.png" width="100%" />

### 2004 Gordon Bell (Adams et al)

<img src="figures/ratel/gordon-bell-2004-bone.png" width="70%" />

<img src="figures/ratel/gordon-bell-2004-bone-scaling.jpg" width="100%" />

# Comparison

<img src="figures/ratel/gordon-bell-2004-bone.png" width="85%" />

<img src="figures/ratel/schwarz-q2-8x8x8-t20-l2-r2.png" width="70%" class="center" />

| Metric | Adams et al 2004 | Ratel | Ratel |
|---|---|---|---|
| Discretization | linear | quadratic | cubic |
| Machine | ASCI White 130 nodes | Crusher 1 node | Crusher 1 node |
| Peak Bandwidth | 1.56 TB/s | 12 TB/s | 12 TB/s |
| Degrees of freedom | 237 M | 184 M | 331 M |
| kDoF/GB | 460 | 400 | 700 |
| load step strain | 0.5% | 12% | 12% |
| kDoF/s per load step | 600 | 6000 | 5500 |

# Same story for compressible turbulence

<img src="figures/Boeing_A2_isoQspeed2_lowRes.png" width="100%" />

<video src="figures/fluids/ROPI_OutView.webm" autoplay loop />

## PHASTA (Fortran)

* Extreme-scale unstructured CFD
* SUPG, implicit (gen-$\alpha$) Newton-Krylov
* Aurora ESP: 2y on the "Intel/ALCF plan"
  * GPU still slower than CPU

## CEED-PHASTA

* All-new code, using libCEED with PETSc
* Matrix-free cuts setup/helps strong scaling
* End-to-end GPU (NVIDIA, AMD, Intel)

| Code | Arch | Element | seconds/step |
|---|---|---|---|
| PHASTA | Skylake | $Q_1$ | 6-12 |
| CEED | A100 | $Q_1$ | 1.0 |
| CEED | A100 | $Q_2$ | 0.7 |
| CEED | A100 | $Q_3$ | 0.5 |

# Algorithmic framework

* SUPG/VMS for compressible NS in pressure-primitive variables
* Implicit integration using gen-$\alpha$
* 3 Newton iterations per time step
  * First two are very cheap (5-15 Krylov iterations), third is stiffer

<img src="figures/fluids/libceed-stored-jacobian-james.png" />

* Benchmark
  * flat plate $Re_\theta \approx 970$ STG inflow
  * $Ma \approx 0.1$
  * 12-30 nominal spanwise/streamwise resolution (plus units)

<img src="figures/fluids/libceed-apply-polaris-james.png" />

* 10 nodes of Polaris (4x A100/node)
* 250k mesh nodes (1.25 MDoF) per GPU

# Boeing Speed Bump

<img src="figures/fluids/speed-bump-3d.png" />

* "Easy" problem for which RANS prediction of separation is catastrophic.
* Good experimental data available

> Can a RANS model predict a high-lift flow for the right reasons?

<img src="figures/fluids/speed-bump-Cp-Cf-rans-dns.png" class="center" />

# Are structured grids dead?

## Uzun and Malik (2022)

* Prefactored 4th order compact FD
* up to 10th order compact filtering
* Subcycled implicit time integration

<img src="figures/fluids/UzunMalik-BumpVorticity-2022.png" />

### A tale of two bumps, $\mathrm{Re}_L = 2M, \mathrm{Ma} = 0.1$

| Property | Uzun & Malik | Balin & Jansen |
|---|---|---|
| Grid | overset FD | tet/prism FE |
| Domain width | 0.04L | 0.08L |
| # points | 10B | 4B |
| steps needed | 1969k | 154k |
| seconds/step | 1.4 | 12 |
| cores (nodes) | 40k (1000) | 39k (972) |
| days | 33 | 25 |
| Wall normal spacing ($+$ units) | 0.6 | 0.3 |
| Mach number | 0.2 | incompressible |

* $\Delta t$ incompressible $\approx 2 \Delta t$ compressible

# Fluids outlook

<img src="figures/fluids/flat-plate-validation-1410.png" />

* Data-driven subgrid stress model (Prakash, Jansen, Evans)
  * Online training using SmartSim
  * Reference frame and unit invariant

* Speed Bump $Re_L = 2M$
  * Determine DNS resolution for cubic elements
  * Hex-dominant mixed topology meshing adapted to Kolmogorov scale
  * Goal: reduce 30-40 days to 3 days on Aurora
* DDES/Hybrid for real geometry (e.g., HLPW)
* Fundamental numerics
  * Optimized dispersion: stabilization and basis
  * Low-Mach preconditioning, time integration

## [libCEED](https://libceed.readthedocs.io): fast algebra for finite elements

* Backend plugins with run-time selection
  * debug/memcheck, optimized
  * libxsmm, CUDA, HIP, SYCL
  * MAGMA to CUDA, HIP, SYCL*
* Single source vanilla C for QFunctions
  * Easy to debug, understand locally
  * C++ available, but not necessary
  * Target for DSLs, AD
* Python, Julia, Rust
* 2-clause BSD
* Available via MFEM, PETSc, Nek5000

<img src="figures/ceed/libCEEDBackends.svg" width="100%" />

Thanks to many developers, including Jeremy Thompson, Yohann Dudouit, Valeria Barra, Natalie Beams, Ahmad Abdelfattah, Leila Ghaffari, Will Pazner, Thilina Ratnayaka, Tzanio Kolev, Veselin Dobrev, David Medina

<img src="figures/ceed/libCEED-2.png" width="100%" />


## Quadrature functions: the math

\begin{gather*}
    v^T F(u) \sim \int_\Omega v \cdot \color{olive}{f_0(u, \nabla u)} + \nabla v \!:\! \color{olive}{f_1(u, \nabla u)} \quad
    v^T J w \sim \int_\Omega \begin{bmatrix} v \\ \nabla v \end{bmatrix}^T \color{teal}{\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}}
    \begin{bmatrix} w \\ \nabla w \end{bmatrix} \\
    u = B_I \mathcal E_e u_L \qquad \nabla u = \frac{\partial X}{\partial x} B_{\nabla} \mathcal E_e u_L \\
    J w = \sum_e \mathcal E_e^T \begin{bmatrix} B_I \\ B_{\nabla} \end{bmatrix}^T
    \underbrace{\begin{bmatrix} I & \\ & \left( \frac{\partial X}{\partial x}\right)^T \end{bmatrix} W_q \color{teal}{\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}} \begin{bmatrix} I & \\ & \left( \frac{\partial X}{\partial x}\right) \end{bmatrix}}_{\text{coefficients at quadrature points}} \begin{bmatrix} B_I \\ B_{\nabla} \end{bmatrix} \mathcal E_e w_L
\end{gather*}
  
* $B_I$ and $B_\nabla$ are tensor contractions -- independent of element geometry
* Choice of how to order and represent gathers $\mathcal E$ and scatters $\mathcal E^T$
* Similar for Neumann/Robin and nonlinear boundary conditions
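The factorization $\mathcal E^T B^T D B \mathcal E$ can be made concrete in one dimension. The sketch below is a hedged illustration, not libCEED code: it assumes a uniform mesh of linear elements on $[0, 1]$, the Laplacian (so $f_{1,1} = 1$ and the other blocks vanish), and 2-point Gauss quadrature. The name `apply_laplacian` is hypothetical.

```python
import numpy as np


def apply_laplacian(w, n_elem):
    """Matrix-free action of the 1D stiffness operator on nodal values w."""
    h = 1.0 / n_elem
    wq = np.array([1.0, 1.0])               # Gauss weights on [-1, 1]
    # B_grad: reference derivatives of the two linear basis functions
    # at each quadrature point, shape (Q, P); independent of geometry
    B_grad = np.tile([-0.5, 0.5], (2, 1))
    dXdx = 2.0 / h                           # reference-to-physical Jacobian inverse
    # Element restriction E: node indices of each element, shape (n_elem, P)
    idx = np.arange(n_elem)[:, None] + np.arange(2)[None, :]
    w_e = w[idx]                             # gather: E w
    grad = dXdx * (w_e @ B_grad.T)           # physical gradient at quadrature points
    flux = (wq * h / 2.0) * grad             # W_q times f_{1,1} = 1
    v_e = (dXdx * flux) @ B_grad             # transpose basis action
    v = np.zeros_like(w)
    np.add.at(v, idx, v_e)                   # scatter-add: E^T
    return v


n = 8
x = np.linspace(0.0, 1.0, n + 1)
# A linear field has zero interior residual; only boundary fluxes remain
print(apply_laplacian(x, n))
```

Nothing the size of a sparse matrix is stored here: only the basis table, the geometric factor, and the quadrature weights. For higher order or variable coefficients, the quadrature-point data grows while the basis tables stay shared across elements.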

# QFunctions: debuggable, vectorizable, and JITable

* Independent operations at each of `Q` quadrature points, order unspecified

```c
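#include <ceed.h>  // CeedInt, CeedScalar

// Pointwise residual for the L2 projection of `target`, weighted by `rho`.
int L2residual(void *ctx, CeedInt Q,
    const CeedScalar *const in[],
    CeedScalar *const out[]) {
  // Each input and output is an array with one entry per quadrature point
  const CeedScalar *u = in[0], *rho = in[1], *target = in[2];
  CeedScalar *f = out[0];
  for (CeedInt i=0; i<Q; i++) {
    // Weak form of the problem goes here
    f[i] = rho[i] * (u[i] - target[i]);
  }
  return 0;
}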
```

$$\int v \, \underbrace{\rho (u - \mathtt{target})}_{f} = 0, \quad \forall v$$
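As a rough NumPy analogue of the QFunction above (an illustrative sketch, not the libCEED Python interface), the same pointwise residual can be written as a vectorized expression. Because each quadrature point is independent, permuting the points simply permutes the result, which is what lets backends choose any evaluation order.

```python
import numpy as np


def l2_residual(u, rho, target):
    """Pointwise residual f = rho * (u - target) at Q quadrature points."""
    return rho * (u - target)


rng = np.random.default_rng(0)
Q = 8
u, rho, target = rng.random(Q), rng.random(Q) + 1.0, rng.random(Q)
f = l2_residual(u, rho, target)

# Evaluate the points in a shuffled order and compare with the shuffled result
perm = rng.permutation(Q)
f_perm = l2_residual(u[perm], rho[perm], target[perm])
print(np.allclose(f_perm, f[perm]))
```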

![](figures/ceed/solids-perf-disassembly.png)

## Example QFunctions
* Riemann problems
* Return mapping for plasticity
* Nitsche contact
* Synthetic turbulence generation
* Data-driven SGS model (small neural network)

# Why not a domain-specific language (DSL)?

## Developer experience

* Indexed, refactoring tools
* Libraries of materials
* Unit testing, property testing
* Debugger integration
  * Run in debugger with `-fp_trap`, see how your code computed a negative pressure.
  * Attach debugger to running job, see why return-mapping algorithm is converging slowly.
* Static analysis
* Performance transparency
  * Profiling tools, flamegraph reflects source

## Rust

* Excellent error messages.
* Guaranteed type- and memory-safety.
* Excellent tooling and libraries
* `no_std`: compiling for the host ensures no allocation/system access (that would fail on device)
* Zero-cost FFI: JIT fuse kernels with CUDA-C parts and Rust parts; result is fully inlined.
* Ergonomic and safe AD via Enzyme
  * Working to merge upstream for `+nightly`

# Modeling principles for matrix-free methods

## Seek well-conditioned formulations

* Nitsche contact vs Lagrange multipliers or penalties
* Conforming discretizations vs XFEM and immersed boundary
* Mixed FE vs displacement-only elasticity

## Smooth everything

* Leave extra degrees of freedom in
  * Skip static condensation
* Approx Braess-Sarazin vs segregated MG vs Vanka vs vertex-star
* "Optimal" asymptotics must be weighed against implementation efficiency

# Outlook: [petsc.org](https://petsc.org) [libceed.org](https://libceed.org) [ratel.micromorph.org](https://ratel.micromorph.org)

* You can move from $Q_1$ to $Q_2$ elements for about 2x cost (despite 8x more DoFs); $p>2$ is free
* Mesh to resolve geometry, $p$-refine to pragmatic accuracy (tools!)
* libCEED already offers 2x speedup for $Q_1$
* Gordon Bell scale from 20 years ago $\mapsto$ interactive on a workstation (if you can buy MI250X 😊)
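The "8x more DoFs" figure follows from node counting on a fixed mesh. A minimal sketch, assuming a structured cube of $n^3$ hexahedra and one scalar field (the function name is illustrative):

```python
def scalar_dofs(n_elem_per_side, degree):
    """Nodes of a continuous Q_p space on an n x n x n hexahedral mesh."""
    return (degree * n_elem_per_side + 1) ** 3


n = 100
ratio = scalar_dofs(n, 2) / scalar_dofs(n, 1)
print(f"Q2/Q1 DoF ratio on a {n}^3 mesh: {ratio:.2f}")
```

The ratio approaches 8 as the mesh grows, so a 2x cost increase corresponds to roughly 4x lower cost per DoF.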

## Come to the hands-on session
* https://github.com/jedbrown/nuwest24
* Run p-MG solvers for structural mechanics on Tioga
* Explore QFunctions in real code
* Discuss unstructured implicit discretization and solvers

## Thanks: DOE PSAAP, DOE ECP, DOE ASCR, NSF CISE

<video src="figures/ratel/schwarz-pendulum.webm" autoplay loop width="60%" />

<video src="figures/fluids/ROPI_OutView.webm" autoplay loop width="80%" />

# Old performance model

## Iterative solvers: Bandwidth
* SpMV arithmetic intensity of 1/6 flop/byte
* Preconditioners also mostly bandwidth
  * Architectural latency a big problem on GPUs, especially for sparse triangular solves.
  * Sparse matrix-matrix products for AMG setup
  
## Direct solvers: Bandwidth and Dense compute
* Leaf work in sparse direct solves
* Dense factorization of supernodes
  * Fundamentally nonscalable; granularity on GPUs is already too big to apply on subdomains
* Research on H-matrix approximations (e.g., in STRUMPACK)

# New performance model

## Still mostly bandwidth

* Reduce storage needed at quadrature points
  * Half the cost of a sparse matrix already for linear elements
  * Big efficiency gains for high order
* Assembled coarse levels are much smaller.

## Compute

* Kernel fusion is necessary
* Balance vectorization with cache/occupancy
* $O(n)$, but benefits from BLIS-like abstractions

| BLIS | libCEED |
|------|---------|
| packing | batched element restriction |
| microkernel | basis action |
| ? | user-provided QFunctions |


# Ideas for hands-on: Structures in Ratel

```console
$ mpiexec -n 2 ratel-static -options_file examples/ex01-*.yml [-ceed /gpu/hip]
$ ratel-quasistatic -options_file examples/ex02-*.yml
$ ratel-dynamic -options_file examples/ex03-*.yml
```

## AMG/pMG and the cost of assembly
* `-pc_type gamg` or `-pc_type hypre` use assembled sparse matrices.
  * Compute a coarse baseline with linear elements
  * $h$-refine and solve (AMG)
  * $p$-refine and solve (AMG)
  * $p$-refine and solve (matrix-free pMG)

## Accuracy $p$- vs $h$-refinement

## Robustness and direct solvers
* `-pc_type cholesky` (e.g., MUMPS in parallel)
* `-pc_type pmg -mg_coarse_pc_type cholesky`
* `-pc_type gamg` (or `hypre`)

## The cost of incompressibility
* Poisson ratio: $-1 < \nu < 0.5$, with the incompressible limit at $\nu \to 0.5$
* Mixed discretization, augmented Lagrangian (`-nu_primal 0.3`)
* Compare to displacement-only solver at "nice" Poisson ratio.
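To see why near-incompressibility is costly for a displacement-only formulation, it helps to look at how the isotropic moduli scale with $\nu$. The sketch below uses the standard relations for linear isotropic elasticity; the function name and the choice $E = 1$ are illustrative assumptions.

```python
def lame_parameters(E, nu):
    """Return (lambda, mu, bulk modulus) for Young's modulus E and Poisson ratio nu."""
    if not -1.0 < nu < 0.5:
        raise ValueError("Poisson ratio must satisfy -1 < nu < 0.5")
    mu = E / (2.0 * (1.0 + nu))
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    bulk = E / (3.0 * (1.0 - 2.0 * nu))
    return lam, mu, bulk


for nu in (0.3, 0.49, 0.4999):
    lam, mu, bulk = lame_parameters(1.0, nu)
    # The ratio of volumetric to shear stiffness drives ill-conditioning
    print(f"nu={nu}: bulk/mu = {bulk / mu:.1f}")
```

The ratio of bulk to shear modulus grows without bound as $\nu \to 0.5$, which is the stiffness contrast a mixed discretization is meant to handle.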

# Ideas for hands-on: CEED Fluids

```console
$ fluids-navierstokes -options_file examples/fluids/*.yaml
```

## BCs for compressible flow

* Preventing recirculation can be hard in physical domains: Riemann conditions can automatically switch inflow <-> outflow
* `vortexshedding.yaml`: How close can the exit be while still predicting the shedding frequency to within 5%?
* `gaussianwave.yaml`: What is the effect of HLL vs HLLC at the boundary?
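For reference when comparing boundary treatments, here is a minimal sketch of the HLL approximate Riemann flux for the 1D Euler equations. It is not the code used in the fluids example: it assumes an ideal gas with $\gamma = 1.4$, conservative states $(\rho, \rho u, E)$, and simple Davis-type wave speed estimates. At a boundary, one state would be the interior solution and the other the prescribed exterior state, so the wave speeds decide which information enters or leaves. HLLC additionally restores the contact wave that HLL smears.

```python
import numpy as np

GAMMA = 1.4  # assumed ratio of specific heats


def to_conservative(rho, u, p):
    """Convert primitive (rho, u, p) to conservative (rho, rho*u, E)."""
    return np.array([rho, rho * u, p / (GAMMA - 1.0) + 0.5 * rho * u * u])


def pressure(U):
    rho, mom, E = U
    return (GAMMA - 1.0) * (E - 0.5 * mom * mom / rho)


def euler_flux(U):
    rho, mom, E = U
    u, p = mom / rho, pressure(U)
    return np.array([mom, mom * u + p, (E + p) * u])


def hll_flux(UL, UR):
    """HLL flux between left and right states."""
    uL, uR = UL[1] / UL[0], UR[1] / UR[0]
    cL = np.sqrt(GAMMA * pressure(UL) / UL[0])
    cR = np.sqrt(GAMMA * pressure(UR) / UR[0])
    sL = min(uL - cL, uR - cR)  # slowest left-going wave estimate
    sR = max(uL + cL, uR + cR)  # fastest right-going wave estimate
    FL, FR = euler_flux(UL), euler_flux(UR)
    if sL >= 0.0:
        return FL               # all waves move right: pure upwinding
    if sR <= 0.0:
        return FR               # all waves move left
    return (sR * FL - sL * FR + sL * sR * (UR - UL)) / (sR - sL)


U = to_conservative(1.0, 0.3, 1.0)
print(np.allclose(hll_flux(U, U), euler_flux(U)))
```

The final line checks consistency: with identical left and right states, the numerical flux reduces to the physical flux.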

## Primitive vs Conservative?

<img src="figures/fluids/Temperature_p1.png" />

<video src="figures/fluids/cyl-vorticity.webm" autoplay loop />