# Micromorph Computing: Scaling and performance
## **Jed Brown**, Beichuan Yan, Jay Appleton, Ren Stengel, Thomas Allard, Sam Lamont, Ken Jansen, Henry Tufo


## PSAAP Virtual Site Visit, 2021-09-15

In [10]:
from IPython.display import SVG, Video, HTML, IFrame
import pandas as pd
import altair as alt
from io import StringIO
import numpy as np

## PSAAP: Micromorphic and grain-resolving composite inelasticity

<img src="figures/micromorph/workflow.png" />

# Grain-resolving simulation

* Typical grain size 0.1 mm (range 0.01 to 0.6 mm), $E = 23$ GPa, $\rho = 1.8\ \textrm{g/cm}^3$
* Strong-scale each code to 80% parallel efficiency, then compare estimated efficiencies (simulated time and volume per node hour)

In [210]:
data = StringIO("""
code,model,GPU,parallel,materials,time,ℓ_d (mm),Δt (μs),wall ms/step
libCEED,hyperelastic FE,✅,✅,🛠️,implicit,.6,100,1000
libCEED,hyperelastic FE,✅,✅,🛠️,explicit,.6,.001,1
GEOS-MPM,MPM fracture,❌,✅,✅,explicit,.4,.00032,840
LAMMPS,spherical DEM,❌,✅,✅,explicit,8,1.6,36
ParaEllip3d,DEM,❌,✅,✅,explicit,1.6,1.6,20
Abaqus,elastic FE,❌,🤷,✅,explicit,1,.00156,28
""")
model = pd.read_csv(data)
model["sim ms/wall hour"] = model["Δt (μs)"] / model['wall ms/step'] * 3600
model["sim cm^3 ms/node hour"] = model["sim ms/wall hour"] * (.1 * model["ℓ_d (mm)"])**3
model.set_index("code")

| code | model | GPU | parallel | materials | time | ℓ_d (mm) | Δt (μs) | wall ms/step | sim ms/wall hour | sim cm^3 ms/node hour |
|------|-------|-----|----------|-----------|------|----------|---------|--------------|------------------|-----------------------|
| libCEED | hyperelastic FE | ✅ | ✅ | 🛠️ | implicit | 0.6 | 100.0 | 1000 | 360.0 | 0.07776 |
| libCEED | hyperelastic FE | ✅ | ✅ | 🛠️ | explicit | 0.6 | 0.001 | 1 | 3.6 | 0.0007776 |
| GEOS-MPM | MPM fracture | ❌ | ✅ | ✅ | explicit | 0.4 | 0.00032 | 840 | 0.001371 | 8.777143e-08 |
| LAMMPS | spherical DEM | ❌ | ✅ | ✅ | explicit | 8.0 | 1.6 | 36 | 160.0 | 81.92 |
| ParaEllip3d | DEM | ❌ | ✅ | ✅ | explicit | 1.6 | 1.6 | 20 | 288.0 | 1.179648 |
| Abaqus | elastic FE | ❌ | 🤷 | ✅ | explicit | 1.0 | 0.00156 | 28 | 0.200571 | 0.0002005714 |


# Grain-resolving capability

In [211]:
alt.Chart(model).mark_point(size=300).encode(
    alt.X("sim ms/wall hour", scale=alt.Scale(type="log")),
    alt.Y("sim cm^3 ms/node hour", scale=alt.Scale(type="log")),
    alt.Shape("code"),
    alt.Size("ℓ_d (mm)", scale=alt.Scale(type="log", domain=(.15,5))),
    alt.Tooltip(["time", "code", "ℓ_d (mm)", "Δt (μs)"]),
    alt.Color("GPU"),
).properties(width=480, height=480)

* FEM/MPM$^*$ inelastic mechanics
  * GPU: multi-node on Lassen (NVIDIA) and Spock (AMD)
* GEOS-MPM/GEOSX
  * explicit, limited to small duration

* ParaEllip3d-CFD
  * 1024 nodes of Quartz
* LAMMPS DEM
  * efficient, but only spherical

# Achievements and limitations

# Achievements

* Implicit capability
  * debuggable, programmable
  * all-GPU matrix-free multigrid
  * near roofline performance
* Large-scale DEM simulations
  * MVAPICH bugs (fixed by Nat Shineman) 
* Explicit GEOS-MPM
* Profiling all components

# Limitations/vulnerabilities

* Implicit MPM in development
* GPU kernel fusion for complex materials
* Mesh quality
* Time stepping for coupled MPM-DEM
* GPU-based DEM
* Grain size distribution
  * forces small elements, small $\Delta t$
  * required resolution for accurate QoI

# Micromorph Computer Science Research

# Why matrix-free?

* Applying an assembled sparse matrix transfers at least 4 bytes per flop, while current hardware delivers roughly 10 flops per byte of memory bandwidth.
* Matrix-free methods store and move less data, so they compute faster.
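
The 4 bytes per flop figure can be reproduced with a back-of-the-envelope count. The sketch below is one reading of that number, assuming 8-byte matrix entries and 2 flops (multiply and add) per nonzero; the function names are illustrative and the 10 flops per byte machine balance is taken from the bullet above.

```python
def spmv_bytes_per_flop(value_bytes=8, index_bytes=0):
    """Bytes streamed per flop when applying an assembled sparse matrix.

    Each stored nonzero contributes one multiply and one add (2 flops).
    """
    flops_per_nonzero = 2
    return (value_bytes + index_bytes) / flops_per_nonzero


def attainable_fraction_of_peak(bytes_per_flop, machine_flops_per_byte=10):
    """Roofline-style bound for a bandwidth-limited kernel."""
    return min(1.0, 1.0 / (bytes_per_flop * machine_flops_per_byte))


lower_bound = spmv_bytes_per_flop()                 # matrix values only
with_indices = spmv_bytes_per_flop(index_bytes=4)   # plus 4-byte column indices
print(lower_bound, with_indices)
print(attainable_fraction_of_peak(lower_bound))
```

Under these assumptions an assembled operator can use only a few percent of peak flops, which is the gap matrix-free methods target.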

<img src="figures/karlrupp/flop-per-byte-dp-2021.svg" class="floatleft" />
<img src="figures/TensorVsAssembly-qstore.svg" class="floatright" />

# libCEED/PETSc BP3 flame graphs

In [5]:
IFrame("data/ceed/noether/noether-bp3-P6-xsmm.svg", width="1600", height="900")

# libCEED and PETSc asynchrony

![](figures/ceed/libceed-bp3-nsys-cuda-gen.png)

# [Hyperelastic solids](https://libceed.readthedocs.io/en/latest/examples/solids/)

* $p$-multigrid, low-memory representation of matrix-free Jacobian
* Multi-node GPU on CUDA and ROCm

<img src="figures/ceed/libceed-solids-twist.gif" class="floatleft" />
<img src="figures/micromorph/libceed-epoxy-traction-20210829.gif" class="floatright" />

# libCEED solid mechanics flame graphs

In [6]:
IFrame("data/ceed/noether/noether-solids-holes-P4-xsmm-coo.svg", width="1600", height="900")

# libCEED solid mechanics icicle graphs

In [7]:
IFrame("data/ceed/noether/noether-solids-holes-P4-xsmm-coo-icicle.svg", width="1600", height="900")

# Solids: efficient matrix-free Jacobians, cf. [Davydov et al. (2020)](https://doi.org/10.1002/nme.6336)

<img src="figures/ceed/libceed-solids-initial-current.png" width="80%" />
<img src="figures/ceed/libceed-solids-jacobian-table.png" width="80%" />

# GEOS-MPM scalability/efficiency

* 8 material points per element
* $144^3 \approx$ 3 million elements

In [152]:
geosmpm = pd.DataFrame([
    [1728, 48, 28726, 34093],
    [512, 16, 86000, 29315],
    [64, 2, 86000, 4751],
    [8, 1, 86000, 1573],
    [1, 1, 86000, 213],
], columns=["processes", "nodes", "time", "steps"])
geosmpm["cells"] = 144**3 # problem size, from Jay; 2^3=8 particles per cell
geosmpm["sec/step"] = geosmpm["time"] / geosmpm["steps"]
geosmpm["efficiency (kcell step/node sec)"] = 1e-3 * geosmpm["cells"] / (geosmpm["nodes"] * geosmpm["sec/step"])
geosmpm["cells/process"] = geosmpm["cells"] / geosmpm["processes"]

points = alt.Chart(geosmpm).mark_point().encode(
    alt.Y("sec/step:Q", scale=alt.Scale(type='log')),
    alt.X("nodes", scale=alt.Scale(type='log')),
    alt.Size("efficiency (kcell step/node sec)"),
    alt.Tooltip(["processes", "efficiency (kcell step/node sec)", "cells/process"])
)

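# idxmax() returns an index label, so select the row with .loc
best = geosmpm.loc[geosmpm["efficiency (kcell step/node sec)"].idxmax()]
# Ideal strong scaling: constant node-seconds per step, anchored at the most efficient run
best_node_sec = float(best["sec/step"] * best["nodes"])
ideal = alt.Chart(geosmpm).mark_line(clip=True).encode(
    alt.X('nodes', scale=alt.Scale(type='log')),
    alt.Y('sec/step', scale=alt.Scale(type='log')),
).transform_calculate(
    **{'sec/step': best_node_sec / alt.datum["nodes"]}
)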

In [153]:
alt.layer(points, ideal)

In [148]:
alt.Chart(geosmpm).mark_point().encode(
    alt.X("sec/step", scale=alt.Scale(type='log')),
    alt.Y("efficiency (kcell step/node sec)"),
    alt.Size("nodes:O"),
    alt.Tooltip(["processes", "sec/step", "efficiency (kcell step/node sec)", "cells/process"]))
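
The per-node throughput plotted above can also be read as a parallel efficiency relative to the best run. The following is a minimal sketch in plain Python using the same five runs as the `geosmpm` table; the helper name `throughput` and the choice to normalize by the best run are illustrative assumptions, not part of the original analysis.

```python
cells = 144**3  # problem size used for all runs

# (processes, nodes, wall seconds, steps), copied from the geosmpm table
runs = [
    (1728, 48, 28726, 34093),
    (512, 16, 86000, 29315),
    (64, 2, 86000, 4751),
    (8, 1, 86000, 1573),
    (1, 1, 86000, 213),
]


def throughput(nodes, seconds, steps):
    """Thousands of cell updates per node second."""
    sec_per_step = seconds / steps
    return 1e-3 * cells / (nodes * sec_per_step)


rates = {(procs, nodes): throughput(nodes, t, steps) for procs, nodes, t, steps in runs}
best_key = max(rates, key=rates.get)
relative = {key: rate / rates[best_key] for key, rate in rates.items()}

for (procs, nodes), eff in relative.items():
    print(f"{procs:5d} processes on {nodes:2d} nodes: {eff:.2f} of best per-node throughput")
```

With these numbers the 2-node run is the most efficient, and the 48-node run retains about 0.9 of its per-node throughput.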

# GEOS-MPM flame graphs

* "scales well", but spends a lot of time allocating/deallocating/copying memory


In [60]:
IFrame("data/micromorph/flame/geosmpm/perf-flame-20210910.svg", width="1600", height="400")

In [61]:
IFrame("data/micromorph/flame/geosmpm/perf-icicle-20210910.svg", width="1600", height="400")

# ParaEllip3d-CFD

* numerical kernels (eigensolver), `vfabs`, `pow`

In [14]:
IFrame("data/micromorph/flame/paraellip3d/04-isotropic-100k-to-300k-222.svg", width=1600, height=400)

In [15]:
IFrame("data/micromorph/flame/paraellip3d/04-isotropic-100k-to-300k-222-icicle.svg", width=1600, height=400)

# ParaEllip3d scalability

In [155]:
df = pd.read_excel('data/micromorph/YanRegueiro-2018.ods', engine='odf')
df['time'] /= 500 # 500 time steps for the reported tables

df['TDP'] = df['nodes'] * 300  # watts, taking 300 W per node as the power estimate
df['particles/joule'] = df['num_particles'] / (df['TDP'] * df['time'])  # energy per step = power * time per step
df.head()

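|   | Machine | num_particles | nodes | cores | time | TDP | particles/joule |
|---|---------|---------------|-------|-------|------|-----|-----------------|
| 0 | Excalibur | 2500 | 1 | 32 | 0.01364 | 300 | 610.948192 |
| 1 | Excalibur | 2500 | 2 | 64 | 0.00974 | 600 | 427.789185 |
| 2 | Excalibur | 2500 | 3 | 96 | 0.00808 | 900 | 343.784378 |
| 3 | Excalibur | 2500 | 4 | 128 | 0.0071 | 1200 | 293.42723 |
| 4 | Excalibur | 2500 | 6 | 192 | 0.0058 | 1800 | 239.463602 |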


In [156]:
alt.Chart(df).mark_point().encode(
    alt.Y('particles/joule', scale=alt.Scale(type='log')),
    alt.X('time', scale=alt.Scale(type='log')),
    color='num_particles:N',
    size='nodes:O',
    shape='num_particles:O',
    tooltip=('num_particles', 'nodes'),
).properties(
    title='O(n²) find neighbors',
)

In [158]:
df = pd.read_excel('data/micromorph/03-strongScaling-bigOn.ods', engine='odf')
df['time'] /= 500 # 500 time steps for the reported tables

df['TDP'] = df['nodes'] * 300  # watts, taking 300 W per node as the power estimate
df['particles/joule'] = df['num_particles'] / (df['TDP'] * df['time'])  # energy per step = power * time per step
df.describe()

alt.Chart(df).mark_point().encode(
    alt.Y('particles/joule', scale=alt.Scale(type='log')),
    alt.X('time', scale=alt.Scale(type='log')),
    alt.Color('num_particles:N'),
    alt.Shape('num_particles:O'),
    alt.Size('nodes:O'),
    tooltip=('num_particles', 'nodes'),
).properties(
    title='O(n) find neighbors',
)

In [165]:
pd.DataFrame([
    ["Quartz", 9.3, 60, 12.6, 4.58],
    ["Onyx", 8.3, 60, 10.4, 38.3],
    ["Centennial", 10.8, 68, 15.3, 6.16],
], columns=["machine", "small, 20 nodes (hours)", "large, 256 nodes (minutes)", "IO (sec)", "scatter (sec)"]).set_index("machine")

| machine | small, 20 nodes (hours) | large, 256 nodes (minutes) | IO (sec) | scatter (sec) |
|---------|-------------------------|----------------------------|----------|---------------|
| Quartz | 9.3 | 60 | 12.6 | 4.58 |
| Onyx | 8.3 | 60 | 10.4 | 38.3 |
| Centennial | 10.8 | 68 | 15.3 | 6.16 |


# LAMMPS DEM: balanced vs unbalanced for 1M particles


In [32]:
df = pd.read_excel('data/micromorph/Scaling_Data.xlsx', header=1, usecols=[2,3,5])
df = df.rename(columns={
    'Wall time': 'Wall time_unbalanced',
    'Wall time.1': 'Wall time_balanced',
})
tidy = pd.wide_to_long(df, stubnames=['Wall time'], i=["Number of nodes"], j='Balance', sep="_", suffix=r"\w+")
tidy['Wall time'] /= 60 # convert minutes to hours
tidy.reset_index(inplace=True)
points = alt.Chart(tidy).mark_point().encode(
    alt.X('Number of nodes', scale=alt.Scale(type='log')),
    alt.Y('Wall time', scale=alt.Scale(type='log')),
    alt.Color("Balance"),
)

T0 = tidy[tidy["Number of nodes"] == 1]["Wall time"].min()
ideal = alt.Chart(tidy).mark_line(clip=True).encode(
    alt.X('Number of nodes', scale=alt.Scale(type='log')),
    alt.Y('Wall time', scale=alt.Scale(type='log', domain=(1, 10))),
).transform_calculate(
    **{'Wall time': T0 / alt.datum["Number of nodes"]}
)

In [33]:
alt.layer(points, ideal)

![](https://docs.lammps.org/_images/balance_rcb.jpg)

# LAMMPS DEM: balanced vs unbalanced for 1M particles

In [34]:
alt.Chart(tidy).mark_point().encode(
    alt.X("Wall time"),
    alt.Y("Efficiency (runs per node hour):Q"),
    alt.Color("Balance"),
    alt.Tooltip(["Number of nodes"])
).transform_calculate(**{
    "Efficiency (runs per node hour)": 1/(alt.datum["Wall time"] * alt.datum["Number of nodes"]),
})

## [libCEED](https://libceed.readthedocs.io): A FEM library with no finite elements (purely algebraic)

* Backend plugins with run-time selection
  * debug/memcheck, optimized
  * libxsmm, CUDA, HIP
  * MAGMA to CUDA and HIP
  * OCCA to OpenMP, OpenCL, CUDA, and HIP
* Single source vanilla C for QFunctions
  * Easy to debug, understand locally, C++ optional
  * Target for DSLs, AD
* Python, Julia, Rust
* 2-clause BSD
* Available via MFEM, PETSc, Nek5000

<img src="figures/ceed/libceed-backends.svg" />

Thanks to many developers, including Jeremy Thompson, Yohann Dudouit, Valeria Barra, Natalie Beams, Ahmad Abdelfattah, Leila Ghaffari, Will Pazner, Thilina Ratnayaka, Tzanio Kolev, Veselin Dobrev, David Medina.

## libCEED development

<img src="figures/ceed/libceed-badges-2021-08-03.png" class="floatleft80" />

<img src="figures/ceed/libceed-sunburst-2021-08-03.png" class="floatright10" width="20%" />

* [v0.8 (March), v0.9 (July)](https://libceed.readthedocs.io/en/latest/releasenotes/)
  * [Python](https://pypi.org/project/libceed/) ([SciPy](https://doi.org/10.25080/Majora-342d178e-00c)), [Julia](https://ceed.exascaleproject.org/libCEED-julia-docs/dev/), [Rust](https://docs.rs/libceed)
  * Assembly (COO matrix, diagonal, point-block diagonal)
  * FDM solvers
  * Performance: CUDA, HIP, MAGMA
  * Many examples: GPU $p$-MG for BPs and hyperelasticity; compressible fluids; BPs on the sphere


* Continuous integration: OSX, Linux
* Cloud and local hardware: x86-64, POWER, ARM64, HIP, CUDA

* [In development](https://libceed.readthedocs.io/en/latest/releasenotes/)
  * BDDC/MG solvers, LFAToolkit
  * Mixed precision
  * More app/library integration
  * Algorithmic differentiation
  * SYCL/DPC++ backend

<img src="figures/ceed/libCEED-2.png" width="100%" />


## Quadrature functions: the math

\begin{gather*}
    v^T F(u) \sim \int_\Omega v \cdot \color{olive}{f_0(u, \nabla u)} + \nabla v \!:\! \color{olive}{f_1(u, \nabla u)} \quad
    v^T J w \sim \int_\Omega \begin{bmatrix} v \\ \nabla v \end{bmatrix}^T \color{teal}{\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}}
    \begin{bmatrix} w \\ \nabla w \end{bmatrix} \\
    u = B_I \mathcal E_e u_L \qquad \nabla u = \frac{\partial X}{\partial x} B_{\nabla} \mathcal E_e u_L \\
    J w = \sum_e \mathcal E_e^T \begin{bmatrix} B_I \\ B_{\nabla} \end{bmatrix}^T
    \underbrace{\begin{bmatrix} I & \\ & \left( \frac{\partial X}{\partial x}\right)^T \end{bmatrix} W_q \color{teal}{\begin{bmatrix} f_{0,0} & f_{0,1} \\ f_{1,0} & f_{1,1} \end{bmatrix}} \begin{bmatrix} I & \\ & \left( \frac{\partial X}{\partial x}\right) \end{bmatrix}}_{\text{coefficients at quadrature points}} \begin{bmatrix} B_I \\ B_{\nabla} \end{bmatrix} \mathcal E_e w_L
\end{gather*}
  
* $B_I$ and $B_\nabla$ are tensor contractions -- independent of element geometry
* Choice of how to order and represent gathers $\mathcal E$ and scatters $\mathcal E^T$
* Similar for Neumann/Robin and nonlinear boundary conditions
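
To illustrate why treating $B_I$ as a tensor contraction helps, the sketch below compares applying a Kronecker-product basis $B \otimes B$ as one large matrix with applying the 1D basis along each direction in turn (sum factorization). The 1D basis values and element data are made-up numbers, and the helper functions are plain-Python stand-ins, not libCEED calls.

```python
def matmul(A, B):
    """Dense matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def kron(A, B):
    """Kronecker product with row index (i, j) and column index (k, l)."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


# Hypothetical 1D basis: 3 quadrature points by 2 nodes
B1 = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
# Nodal values on one 2x2-node element
U = [[1.0, 2.0], [3.0, 4.0]]

# Naive: build the 9x4 operator and apply it to the row-major flattened element vector
u_vec = [x for row in U for x in row]
naive = [sum(k * x for k, x in zip(row, u_vec)) for row in kron(B1, B1)]

# Sum-factorized: contract one direction at a time, B1 U B1^T
fact = matmul(matmul(B1, U), transpose(B1))
fact_vec = [x for row in fact for x in row]

print(max(abs(a - b) for a, b in zip(naive, fact_vec)))
```

Both paths give the same values at the quadrature points, but the factorized form never stores the large operator and only needs the small 1D matrix, which is the same for every element.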

## Quadrature functions: debuggable, vectorizable, and JITable

* Independent operations at each of `Q` quadrature points, order unspecified

```c
int L2residual(void *ctx, CeedInt Q,
    const CeedScalar *const in[],
    CeedScalar *const out[]) {
  const CeedScalar *u = in[0], *rho = in[1], *target = in[2];
  CeedScalar *v = out[0];
  for (CeedInt i=0; i<Q; i++)
    v[i] = rho[i] * (u[i] - target[i]);
  return 0;
}
```

![](figures/ceed/solids-perf-disassembly.png)

```c
CeedQFunctionAddInput(qf, "u", 1, CEED_EVAL_INTERP);
CeedQFunctionAddInput(qf, "rho", 1, CEED_EVAL_INTERP);
CeedQFunctionAddInput(qf, "target", 1, CEED_EVAL_INTERP);
CeedQFunctionAddOutput(qf, "v", 1, CEED_EVAL_INTERP);
```

## Building Operators from QFunctions

* `Operator` $A_{\text{local}} = \mathcal E^T B^T D B \mathcal E$
  * `ElemRestriction` $\mathcal E$
  * `Basis` $B$
  * `QFunction` $D$
* `CeedCompositeOperatorCreate` sums multiple operators
  * Different polynomial degree, element topology, physical process
* Distributed parallelism handled external to libCEED (MFEM, PETSc, etc.)
  * $A = \mathcal P^T A_{\text{local}} \mathcal P$
  * Flexible for mixed CPU/GPU programming, load balancing
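
The decomposition $\mathcal E^T B^T D B \mathcal E$ can be sketched without any library. The example below applies a 1D mass operator with linear elements and 2-point Gauss quadrature, never forming a matrix. It is a plain-Python illustration of the structure, assuming a uniform mesh; the function name and arguments are hypothetical and this is not the libCEED API.

```python
import math


def apply_mass_matrix_free(u, num_elem, length=1.0):
    """Return v = E^T B^T D B E u for a 1D mass operator on a uniform mesh."""
    h = length / num_elem
    g = 1 / math.sqrt(3)  # Gauss points on the reference element [-1, 1]
    # B: linear shape functions at the quadrature points, B[q][local node]
    B = [[(1 - q) / 2, (1 + q) / 2] for q in (-g, g)]
    # D: quadrature weight (1 for 2-point Gauss) times Jacobian determinant h/2
    D = [h / 2, h / 2]
    v = [0.0] * (num_elem + 1)
    for e in range(num_elem):
        u_e = [u[e], u[e + 1]]                                        # E: gather
        u_q = [sum(B[q][a] * u_e[a] for a in range(2)) for q in range(2)]  # B
        w_q = [D[q] * u_q[q] for q in range(2)]                       # D: pointwise work
        for a in range(2):                                            # B^T, then E^T scatter-add
            v[e + a] += sum(B[q][a] * w_q[q] for q in range(2))
    return v


v = apply_mass_matrix_free([1.0] * 5, num_elem=4)
print(v, sum(v))
```

Applying the operator to a vector of ones integrates each basis function, so the entries sum to the domain length. Only the middle step `D` depends on the physics, which is the role a `QFunction` plays.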

# Verification and efficiency testing: MMS

<img src="figures/ceed/solids-mms-conv.svg" width="80%" />
<img src="figures/ceed/solids-sing-conv.png" width="70%" />

![](figures/ceed/solids-eccomas/error-cost.svg)

# Julia, Python, and Rust: safer, easier

### [Julia QFunctions](https://ceed.exascaleproject.org/libCEED-julia-docs/dev/UserQFunctions.html): defined and wired up in one place, CUDA.jl

```julia
@interior_qf apply_qfunc = (
    ceed, Q, dim=dim,
    (du, :in, EVAL_GRAD, Q, dim),
    (qdata, :in, EVAL_NONE, Q, dim*(dim+1)÷2),
    (dv, :out, EVAL_GRAD, Q, dim),
    @inbounds @simd for i=1:Q
        dXdxdXdxT = getvoigt(@view(qdata[i,:]), CeedDim(dim))
        dui = SVector{dim}(@view(du[i,:]))
        dv[i,:] .= dXdxdXdxT*dui
    end
)
```
### [Rust packaging](https://docs.rs/libceed): `Cargo.toml` takes the pain out of dependency management
```toml
[dependencies]
libceed = "0.9.0"
```
```console
$ cargo build
```

## GPU support

* Resources: `/gpu/cuda/{ref,reg,shared,gen}`, `/gpu/hip/ref`, `/gpu/occa/{cuda,opencl,hip}`, `/gpu/magma`
* Fastest implementations use atomics; some backends are [deterministic](https://libceed.readthedocs.io/en/latest/gettingstarted/#backends)
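
One reason atomics cost determinism: floating-point addition is not associative, so the order in which element contributions reach a shared node changes the last bits of the sum. A minimal illustration with made-up contributions (this shows the arithmetic effect only, not any libCEED backend):

```python
# Hypothetical contributions from three elements to one shared node
contributions = [0.1, 0.2, 0.3]

# Two possible arrival orders for an atomic scatter-add
left_to_right = (contributions[0] + contributions[1]) + contributions[2]
right_to_left = contributions[0] + (contributions[1] + contributions[2])

print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

The difference is at the level of rounding error, but it is enough to make runs differ bit-for-bit, which matters when debugging or comparing against reference output.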

### Using host memory
```c
PetscScalar *y;
VecGetArray(Ypetsc, &y); // writable host pointer
CeedVectorSetArray(Yceed, CEED_MEM_HOST, CEED_USE_POINTER, y);
CeedOperatorApply(op, Xceed, Yceed, CEED_REQUEST_IMMEDIATE);
```
### Without device transfers
```c
PetscMemType mem_type;
VecGetArrayAndMemType(Ypetsc, &y, &mem_type); // host, CUDA, HIP
CeedVectorSetArray(Yceed, P2C(mem_type), CEED_USE_POINTER, y);
```

# BP performance on CPU (2x EPYC 7452)

In [10]:
from postprocess_base import read_logs
import altair as alt
from glob import glob

runs = read_logs(glob('data/ceed/**/*.txt'))
runs['FE_nodes_per_compute_node'] = runs['num_unknowns'] / (runs['num_procs'] / runs['num_procs_node']) / runs['dof_per_node']
runs.head()

|   | file | backend | backend_memtype | hostname | test | num_procs | num_procs_node | degree | quadrature_pts | code | bp | case | num_unknowns | num_elem | dof_per_node | ksp_its | time_per_it | cg_iteration_dps | FE_nodes_per_compute_node |
|---|------|---------|-----------------|----------|------|-----------|----------------|--------|----------------|------|----|------|--------------|----------|--------------|---------|-------------|------------------|---------------------------|
| 0 | data/ceed/lassen/lassen-16-4.txt | /gpu/cuda/gen | device | lassen410 | PETSc CEED Benchmark Problem 1 | 1 | 1 | 1 | 3 | libCEED | 1 | scalar | 5616 | 4692 | 1 | 5 | 0.000362 | 15509500.0 | 5616.0 |
| 1 | data/ceed/lassen/lassen-16-4.txt | /gpu/cuda/gen | device | lassen410 | PETSc CEED Benchmark Problem 2 | 1 | 1 | 1 | 3 | libCEED | 2 | vector | 16848 | 4692 | 3 | 5 | 0.000366 | 46092000.0 | 5616.0 |
| 2 | data/ceed/lassen/lassen-16-4.txt | /gpu/cuda/gen | device | lassen410 | PETSc CEED Benchmark Problem 3 | 1 | 1 | 1 | 3 | libCEED | 3 | scalar | 3872 | 4692 | 1 | 1 | 0.000499 | 7762750.0 | 3872.0 |
| 3 | data/ceed/lassen/lassen-16-4.txt | /gpu/cuda/gen | device | lassen410 | PETSc CEED Benchmark Problem 4 | 1 | 1 | 1 | 3 | libCEED | 4 | vector | 11616 | 4692 | 3 | 1 | 0.000517 | 22485700.0 | 3872.0 |
| 4 | data/ceed/lassen/lassen-16-4.txt | /gpu/cuda/gen | device | lassen410 | PETSc CEED Benchmark Problem 1 | 1 | 1 | 1 | 3 | libCEED | 1 | scalar | 10800 | 9384 | 1 | 5 | 0.000363 | 29759700.0 | 10800.0 |


In [12]:
highlight = alt.selection_single(
    on='mouseover',
    fields=['degree', 'time_per_it', 'backend', 'hostname'],
    nearest=True,
    empty='none',
)

bps_select = alt.selection_single(
    fields=['bp'],
)

base = alt.Chart(runs[runs.hostname == "noether"]).encode(
    alt.Y('mdofs:Q', title='MDoF/s per CG iteration'),
    alt.Color('degree:N'),
    alt.Size('num_unknowns', scale=alt.Scale(type='log', domain=(1e3, 1e6))),
    alt.Shape('bp:N'),
    tooltip=('hostname', 'bp', 'num_procs', 'backend', 'num_elem', 'degree', 'num_unknowns', 'file'),
).transform_filter(
    bps_select,
).transform_calculate(
    mdofs='datum.cg_iteration_dps/1e6',
)

points = base.mark_point().encode(
    opacity=alt.condition(highlight, alt.value(1), alt.value(.5)),
).add_selection(
    highlight,
)

lines = base.mark_line().encode(
    size=alt.condition(alt.datum.degree - highlight.degree == 0, alt.value(2), alt.value(1))
)

pane = points + lines

composite = (
    pane.encode(
        alt.X('time_per_it', scale=alt.Scale(type='log'), title='Time per Iteration'),
    ) |
    pane.encode(
        alt.X('FE_nodes_per_compute_node', scale=alt.Scale(type='log', domain=(3e4, 1e7), clamp=True), title='FE Nodes per Compute Node'),
    )
)

activator = alt.Chart(runs).mark_point().encode(
    alt.Y('bp', title='BP'),
    alt.Shape('bp')
).add_selection(bps_select).properties(title='Selection')

activator | composite.properties(title='CEED BPs')

# BP performance on GPU (V100)

In [13]:
base = alt.Chart(runs[runs.hostname == "lassen385"]).encode(
    alt.Y('mdofs:Q', title='MDoF/s per CG iteration'),
    alt.Color('degree:N'),
    alt.Size('num_unknowns', scale=alt.Scale(type='log', domain=(1e3, 1e6))),
    alt.Shape('bp:N'),
    tooltip=('hostname', 'bp', 'num_procs', 'backend', 'num_elem', 'degree', 'num_unknowns', 'file'),
).transform_filter(
    bps_select,
).transform_calculate(
    mdofs='datum.cg_iteration_dps/1e6',
)

points = base.mark_point().encode(
    opacity=alt.condition(highlight, alt.value(1), alt.value(.5)),
).add_selection(
    highlight,
)

lines = base.mark_line().encode(
    size=alt.condition(alt.datum.degree - highlight.degree == 0, alt.value(2), alt.value(1))
)

pane = points + lines

composite = (
    pane.encode(
        alt.X('time_per_it', scale=alt.Scale(type='log'), title='Time per Iteration'),
    ) |
    pane.encode(
        alt.X('FE_nodes_per_compute_node', scale=alt.Scale(type='log', domain=(3e4, 1e7), clamp=True), title='FE Nodes per Compute Node'),
    )
)

activator = alt.Chart(runs).mark_point().encode(
    alt.Y('bp', title='BP'),
    alt.Shape('bp')
).add_selection(bps_select).properties(title='Selection')

activator | composite.properties(title='CEED BPs')

# Preconditioners and local Fourier analysis

<img src="figures/lfatoolkit/bddc-cartoon.png" />

<img src="figures/lfatoolkit/lowVsHighDirichletBounds.png" width="80%" />

$$\kappa \le C \Big(1 + \log\big(p^2 \frac H h\big)\Big)^2$$
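
To get a feel for how slowly the bound above grows, the sketch below evaluates the factor $\big(1 + \log(p^2 H/h)\big)^2$ for a few polynomial degrees and subdomain-to-element ratios. The constant $C$ is left out because it is not given here, and the function name is illustrative.

```python
import math


def polylog_bound_factor(p, H_over_h):
    """Evaluate (1 + log(p^2 * H/h))^2, the bound on kappa up to the constant C."""
    return (1.0 + math.log(p**2 * H_over_h)) ** 2


ratios = (1, 4, 16)
for p in (1, 2, 4, 8):
    print(p, [round(polylog_bound_factor(p, r), 1) for r in ratios])
```

Doubling $p$ or quadrupling $H/h$ adds the same constant inside the logarithm, so the factor grows polylogarithmically rather than algebraically.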

# Mesh quality, Cubit `sculpt`

* Elements needed for 40 grains:

| 18k | 220k | 1.77M | 6.98M |
|-----|------|-------|-------|
| ❌  | ❌   | ❌    | ✅    |

* All meshes had positive Jacobian determinants at element corners, but only sometimes at the $2\times 2\times 2$ Gauss points
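
A check of this kind can be sketched for a trilinear hexahedron by evaluating the Jacobian determinant at the corners and at the $2\times 2\times 2$ Gauss points. The code below is a plain-Python diagnostic under that assumption, with hypothetical helper names; it is not Cubit or libCEED functionality, and passing at sampled points does not by itself prove an element is valid.

```python
import itertools
import math

# Reference corner coordinates of the hex, each component -1 or +1
SIGNS = list(itertools.product((-1, 1), repeat=3))


def det3(J):
    return (J[0][0] * (J[1][1] * J[2][2] - J[1][2] * J[2][1])
            - J[0][1] * (J[1][0] * J[2][2] - J[1][2] * J[2][0])
            + J[0][2] * (J[1][0] * J[2][1] - J[1][1] * J[2][0]))


def jacobian_det(nodes, xi):
    """det(dx/dxi) at reference point xi; nodes[a] is the position of corner SIGNS[a]."""
    J = [[0.0] * 3 for _ in range(3)]
    for s, x in zip(SIGNS, nodes):
        for j in range(3):
            # Derivative in direction j of the shape function (1/8) prod_k (1 + s_k xi_k)
            dN = s[j] / 8
            for k in range(3):
                if k != j:
                    dN *= 1 + s[k] * xi[k]
            for i in range(3):
                J[i][j] += x[i] * dN
    return det3(J)


def min_det(nodes, points):
    return min(jacobian_det(nodes, xi) for xi in points)


g = 1 / math.sqrt(3)
corners = SIGNS
gauss = [tuple(g * s for s in c) for c in SIGNS]

unit_cube = [tuple((1 + s) / 2 for s in c) for c in SIGNS]
tangled = list(unit_cube)
tangled[0] = (2.0, 2.0, 2.0)  # push the (0, 0, 0) corner through the element

for name, nodes in (("unit cube", unit_cube), ("tangled", tangled)):
    print(name, min_det(nodes, corners), min_det(nodes, gauss))
```

Reporting the minimum over both point sets separately makes it easy to see elements that look fine at one set of sample points and fail at the other.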

<img src="figures/JohnenWeillRemacle2017-invalid-hex.png" class="floatleft" />
<img src="figures/JohnenWeillRemacle2017-invalid-hex-corners.png" class="floatright" />

<img src="figures/micromorph/libceed-epoxy-traction-20210829.gif" class="floatleft" />
<img src="figures/micromorph/sculpt-tangled.png" width="50%" />

# Data compression and in-situ visualization

<div class="floatright33">
<img src="data/micromorph/ken-jansen/PSAAP2021/figures_workflow/sensei_schematic.png" width="container" />
<img src="data/micromorph/ken-jansen/PSAAP2021/figures_workflow/immersive_framework.svg" width="container" />
<img src="data/micromorph/ken-jansen/PSAAP2021/figures_workflow/steering.png" width="container" />
</div>

<div class="floatleft66">

* Computation improves faster than IO bandwidth & capacity.
* In-situ visualization for mechanistic analysis with small data.
* Decouple QoI computation from simulation
* Steering opportunities, reanalysis/counter-factual simulation
</div>

<div class="row">
    <div class="column" style="width:100%">
        <img src="figures/logos/ascent_logo_wide_blue.svg" width="20%" />
    </div>
    <div class="column" style="width:100%">
        <img src="figures/logos/libE_logo.png" width="20%" />
    </div>
    <div class="column" style="width:100%">
        <img src="figures/logos/conduit_logo_blue_bold.png" width="20%" />
    </div>
</div>


# Outlook

* libCEED+PETSc MPM
  * Suite of inelastic materials
  * `DMSwarm` born from `pTatin` for particle-cell methods
  * Build out explicit/IMEX dynamics
  * BDDC preconditioning with sum factorization (no assembly)
  * Productivity with Julia and Rust?
  * Algorithmic differentiation using Enzyme (LLVM, William Moses)?
  * Mixed precision (w/Natalie Beams @ UTK)

* ParaEllip3d-CFD
  * GPU porting
  * angular geometries
* GEOSX
  * MPM capability from GEOS-MPM
  * GPU, efficiency, scalability
* CT to mesh/particles (adaptive seeding of MPM)
* UQ workflows (intrusive for efficiency, especially quasi-static)
* Accuracy and validation (cross-cutting)
  * What resolutions are needed to predict QoI to sufficient accuracy?

Thanks to PSAAP III and DOE ECP for support.