E3SM MMF Center for Accelerated Application Readiness (CAAR) for Summit Final Report

  • Matthew Norman (ORNL)
  • Sarat Sreepathi (ORNL)
  • Walter Hannah (LLNL)
  • Benjamin Hillman (SNL)
  • Christopher Jones (PNNL)
  • Mark Taylor (SNL)

Executive Summary

The Energy Exascale Earth System Model (E3SM) is a climate simulation code targeting DOE science drivers, which include resolving the water cycle, cryosphere, and biogeochemistry at high resolution. Uncertainty in future climate projections is strongly driven by uncertainty in clouds, which are difficult to represent in traditional cloud parameterizations. Explicitly resolving clouds across the whole globe would require roughly 22,000 times more computation than we currently use, which is infeasible. However, the Multi-scale Modeling Framework (MMF) approach approximates certain effects of clouds in a global model by explicitly simulating them on a reduced domain. While the MMF approach is less expensive than full cloud resolution, it is still 50-100 times more expensive than the standard E3SM, making it intractable on current computing platforms.

The MMF approach is ideal for Summit because the Cloud Resolving Models (CRMs) embedded in each global model grid column do not directly communicate with one another over MPI. Since the CRMs consume the vast majority of the runtime, MPI communication is no longer a bottleneck in this configuration. In this document, we detail the code transformations, testing frameworks, and debugging efforts we used to ready the E3SM-MMF to perform Early Science simulations on Summit. In the end, we achieved 87% weak scaling efficiency from 64 nodes to 1,024 nodes, and at 1,024 nodes, the 6 V100 GPUs per node delivered an 8.1x speed-up over the 2 POWER9 CPUs per node, with a GPU-enabled throughput of 0.5 Simulated Years Per wallclock Day (SYPD).

Science Drivers

The Energy Exascale Earth System Model (E3SM) is a DOE codebase, originally forked from the Community Earth System Model (CESM), that is aimed at DOE science targets while maintaining collaboration with CESM development. The code consists of five main components: atmosphere, ocean, land, sea ice, and land ice. The atmosphere and ocean components are by far the most costly. The atmospheric component, the Community Atmosphere Model – Spectral Element (CAM-SE), uses the cheap and scalable spectral element method with high-order accuracy on a quasi-uniform cubed-sphere topology. The ocean component, the Model for Prediction Across Scales – Ocean (MPAS-O), is a mixed finite-element model implemented on a statically refined mesh that sub-cycles over fast waves rather than solving the globally dependent “barotropic” portion of the model. Also tied to MPAS-O is the CICE (sea ice) model, which runs on the MPAS grid.

The science driver is to incorporate cloud-resolving scales for a high-resolution global climate model for the first time using the Multi-scale Modeling Framework (MMF) wherein 2-D high-resolution models are embedded within E3SM atmospheric columns to better approximate the effects of convection on the larger scale grid. While the MMF approach is hardly new, accelerating the Cloud Resolving Model on GPUs so that it can support climate-scale runs at high resolution has never been achieved before. This involves porting the System for Atmospheric Modeling (SAM), the model used for cloud resolution, to GPUs using OpenACC.

Scientific Method

As mentioned before, the Spectral Element (SE) method used in the global atmospheric model (the most costly portion of E3SM) is both cheap and scalable. Galerkin methods (the parent class of schemes to which SE belongs) have the benefit that only one reconstruction is performed for all Degrees Of Freedom (DOFs) in an element, as opposed to finite-volume and finite-difference methods, which must separately reconstruct each DOF. That makes SE very cheap computationally. Also, SE need only communicate a linear amount of data between elements (i.e., averaging element boundaries). This allows it to strong-scale extremely well while keeping costs down compared to other methods. Both of these features are crucial for achieving the needed climate throughput of five Simulated Years Per wallclock Day (SYPD), or about 2,000x realtime.

Generally, all components of E3SM are memory bandwidth bound. The CAM-SE model is memory bandwidth bound (even though its core is a matrix-vector multiply) because (1) limiting must be applied to each element, and limiting is typically not compute intensive, and (2) at fourth-order accuracy (the order we use in production), the averaging of element boundaries tends to dominate the cost because of heavy DRAM accesses and MPI communication times. Since Summit contains significantly more memory bandwidth per node, we fully expect both of these components to enjoy a speed-up from that increase. Additionally, since Summit is a “fat-node” architecture, CAM-SE will rely less on the network fabric overall, improving the strong scaling of the code. The SAM code uses low-order finite differences for dynamics, and the physics packages have a large number of DRAM accesses per floating point operation.

Explicitly resolved clouds could resolve long-standing uncertainties in climate model projections, but doing so in the full global model will require about 22,000 times more computational power than we currently require in high-resolution climate modeling. The MMF approach is a compromise toward this goal, giving a better approximation of certain effects of explicit cloud resolution on the global model at a significantly lower cost than full explicit cloud resolution. In the MMF approach, each global grid column contains its own persistent Cloud Resolving Model (CRM), which explicitly resolves clouds on a reduced domain. The global model receives tendencies from the CRM, and the CRM is nudged to have an aggregate similarity to the global model column to which it is associated.

The MMF approach, while an effective compromise toward global explicit cloud resolution for climate time scales, still requires generally about 50-100 times more computation than the standard E3SM. For this reason, it is intractable to simulate on existing computing resources. It requires the additional computing power of the Summit supercomputer in order to complete in a feasible amount of time (and with a feasible amount of resources). Summit will enable the first high resolution climate simulation that uses the MMF approach.

Application Readiness Approach

The MMF approach has certain computational characteristics that are advantageous on Summit. First, the CRMs do not directly communicate with one another using MPI data transfers. Rather, they communicate only through their interactions with the global model. Because of this, most of the MMF configuration’s runtime involves very little MPI communication. Further, the CRM code consists of only about 30K lines of code, making it a concentrated target for acceleration. Traditionally, climate simulation codes have no particular “hot spots” in the runtime profile to accelerate, meaning most of the nearly 2 million lines of code would need to be ported to benefit from accelerators. With the MMF configuration, we can achieve a significant acceleration of the simulation in a feasible time frame for Summit, with critical scientific advances toward resolving climate uncertainties.

We chose to use an OpenACC directives-based approach for the GPU port for several reasons. First, the code is in Fortran, and while CUDA Fortran exists, it has an arguably uncertain future, and the resulting code is highly un-Fortran-like. Rather, we chose the more portable approach of directives. While OpenACC is not widely supported, GPU offloading capabilities are rapidly developing in OpenMP 4.5+, which is expected to be much more widely supported across a range of compilers and hardware offload targets. Since OpenACC is mature enough now and similar enough to the eventual OpenMP 4.5+ specification, we consider it a good choice. Also, this port involves more than 200 GPU kernels, and with such a large development effort, it is not feasible to create a separate GPU codebase that differs significantly from the original source code. For the sake of maintainability going forward, we chose to use OpenACC directives.

Creating a standalone version of the CRM for testing

At the beginning of the project, we created a standalone version of the CRM to enable more rapid development cycles through faster-running tests (the full model can take upwards of 40 minutes to compile and run). Tests run faster because the standalone version does not depend on E3SM's large infrastructure and slow build process. To accomplish this, we first created a convenient NetCDF data utility to dump out many samples of the fields necessary to drive the CRM code. The CRM can be run with either a one-moment or a two-moment microphysics scheme, the latter being more complicated and roughly five times more expensive. A standalone CRM framework was possible for the one-moment microphysics configuration but was deemed infeasible for the two-moment configuration due to its significant dependencies on the rest of E3SM. About 15,000 sample columns were dumped to a file over five model days of simulation to drive the standalone framework. The full standalone framework with one-moment microphysics could run a test on a desktop system with a K40 GPU in about two minutes, including compilation time.

Unfortunately, we discovered certain bugs that showed up in the full model but did not manifest in the standalone framework. So, while the framework was useful, it was ultimately not sufficient for the full GPU port development. The full model required development cycles of 20-40 minutes due to the significant time required for compilation and testing.

Creating a Robust Test Suite

The full testing framework required significant thought. Since the standalone framework effectively tests only an ensemble of one-time-step simulations, it could use simple diffs between the refactored output and baseline output, which agreed nearly to machine precision. The full model, however, simulates a non-linear, chaotic fluid in which small differences amplify exponentially in time (scaled by the degree of non-linearity), such that small differences rapidly become large ones. Because of this, deciding whether a refactored code's differences are due solely to machine precision can be quite difficult.

We found it sufficient to create an envelope of expected differences after a single day of simulation. We created this envelope by compiling the unmodified code with -O0 and -O3 optimization flags using the compiler in question. The refactored code is then compared against the two baseline runs in a three-way diff that determines whether its differences from the -O0 baseline are in line with the presumed bit-level differences established by the -O0 versus -O3 baseline diff.
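
As a minimal sketch of the acceptance criterion (variable names and the tolerance factor are hypothetical, not the project's actual test harness):

! Accept the refactored run if its departure from the -O0 baseline is
! comparable to the envelope set by compiler optimization alone.
diff_refac = maxval(abs(field_refactored - field_o0))
diff_envel = maxval(abs(field_o3 - field_o0))
if (diff_refac <= tol_factor * diff_envel) then
  write(*,*) 'PASS: differences within the optimization envelope'
else
  write(*,*) 'FAIL: refactored differences exceed the envelope'
endif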

The full-model test suite was based on 4 x 4 x 6 elements, each with 4 x 4 basis functions per element, 30 vertical levels in the global model and CRMs of dimension 8 x 1 x 28 in the x-, y-, and z-dimensions, respectively, for 2-moment microphysics and CRMs of dimension 4 x 4 x 28 for 1-moment microphysics. In these two component sets, we were able to test all of the code we plan to run on Summit during Early Science and INCITE allocations.

Unfortunately, even with only a single day of simulation, this test (compilation included) generally took 20-40 minutes to complete. We used tests that exercised all of the code (in terms of code coverage) that we plan to use for the Early Science experiments. The early stages of code refactoring did not involve OpenACC directives but rather simply exposed as much threading as possible in a tightly nested manner. Therefore, we used the GNU compiler in the early stages because of its robustness and quick compilation. During this stage, tests generally took only about 10 minutes to complete. However, once we switched to the PGI compiler to add OpenACC directives, tests took significantly longer due to longer compilation times and the data transfers present in the nascent stages of adding GPU directives.

It was important to us that we have a test suite that would catch bugs during the development effort rather than after the fact so that a wrong answer could be attributed to only a few lines of code changes. Debugging in an entire codebase can be extremely lengthy and difficult, and we sought to avoid that if we could. We were able to avoid costly debugging efforts by testing frequently and micro-committing to our Github repository.

Granted, we had to redo nearly the entire porting effort after the initial port, which relied on the standalone framework tests, failed to catch certain PGI OpenACC-related bugs that gave wrong answers. This necessitated a second porting effort to identify and work around those compiler bugs and to identify and fix additional bugs caused by coding errors.

Responding to small per-node workloads

Climate is a unique science in the portfolio of applications running on DOE Leadership Computing Facility (LCF) machines. In climate modeling, the workload must be completed in a fixed amount of time. Because of this, every time we double the resolution, the workload per node is cut in half, leading to a simulation that is dominated by communication, rather than computation. The explanation is as follows: doubling the horizontal resolution increases the workload by a factor of 8x (2x for each spatial dimension (x and y) and another 2x for required refinement in time). So, to complete the simulation in the same amount of wall time, 8x as many compute nodes must be used. While space can be decomposed in parallel, time cannot because it is sequential. Thus, when using 8x more compute nodes, we only have 4x more data to decompose, meaning we have half the workload per node with ideal scaling. The situation is worse with more realistic scaling.

With half the workload per node for each 2x spatial grid refinement, two things are eventually inevitable. First, MPI overheads begin to dominate, and they only get worse with refinement. Second, workloads become too small for accelerators to provide any benefit: the accelerators are used poorly because it becomes impossible to effectively hide the latency of DRAM data transfers on the device, and kernel launch overheads eventually consume more time than the kernels themselves. Current high-resolution E3SM simulations have already hit this wall due to high throughput requirements and strong scaling, and it only worsens with grid refinement. While the MMF configuration successfully ameliorates the MPI problem, it still suffers from very small per-node (and per-GPU) workloads.

Therefore, we had to find ways to expose as much threading as possible in the GPU porting effort. In doing so, this code required significantly more refactoring than most codebases, and we encountered significantly more race conditions than most codebases exhibit, which had to be handled accordingly.

Putting all CRM routines into modules

In the second effort of this project to port the CRM to GPUs, we placed a high premium on correctness. Along this vein, we decided it best to ensure that all CRM routines existed inside Fortran modules. The reason is that all routines inside a module are checked for argument agreement at call sites, which is not the case otherwise. In the original code, nearly all routines existed outside modules. As we progressed through this effort, we found several instances where the wrong argument types were being passed. We even found one instance where a function was being passed by reference, something most people probably considered impossible in a Fortran 77 code (and by all means, it is hardly standard practice). We fixed that situation by simply using a module and calling the function directly, since the function pointer was unnecessary. Moving all routines into modules resulted in several instances of circular module dependencies that required passing data by argument rather than relying on module data use statements.
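
As a minimal sketch of the pattern (module and routine names hypothetical), wrapping a formerly bare routine in a module gives every call site an explicit interface:

module crm_kernels_mod
  implicit none
contains
  subroutine apply_forcing(n, t, q)
    integer, intent(in)    :: n
    real(8), intent(inout) :: t(n), q(n)
    ! ... physics ...
  end subroutine apply_forcing
end module crm_kernels_mod

Call sites then add "use crm_kernels_mod", and passing the wrong number or types of arguments becomes a compile-time error rather than silent memory corruption.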

We also considered trying to pass all model data by argument, which is a far better practice than using module data as if it were common blocks, but that refactoring effort was infeasible within this project's compressed timetable.

Moving data off the stack and allocating during runtime

The second porting effort also included a significant refactoring focused on moving module-level data from static Fortran arrays onto the heap by allocating it. We found that by allocating data for each CRM call rather than using static arrays inside modules, we alleviated a bug in which wrong answers resulted when running on the CPU with the PGI compiler (for reasons we still have not been able to determine). Bug fix aside, it is considered bad practice to put significant amounts of data on the stack. While it might be somewhat faster, it is significantly harder to debug with sanitizer tools like Valgrind, which do not check accesses to variables created on the stack. While static analysis tools abound for C/C++, they are not well developed for Fortran. Also, stacks are inherently serial data structures and must be replicated for each OpenMP thread on the CPU, wasting space that could otherwise be shared on the heap. We refactored the modules so that module-level data is always allocated rather than declared statically. Even after this refactoring effort, it took a significant enlargement of the default Valgrind stack size to get it to run at all.
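
A minimal sketch of the transformation (names and sizes hypothetical): static module arrays become allocatables sized at runtime:

module crm_vars_mod
  implicit none
  integer, parameter :: nx = 32, ny = 1, nz = 58   ! illustrative sizes
  ! before:  real(8) :: u(nx,ny,nz), v(nx,ny,nz)   (static module storage)
  real(8), allocatable :: u(:,:,:), v(:,:,:)
contains
  subroutine allocate_crm_vars()
    allocate( u(nx,ny,nz), v(nx,ny,nz) )
  end subroutine allocate_crm_vars
  subroutine deallocate_crm_vars()
    deallocate( u, v )
  end subroutine deallocate_crm_vars
end module crm_vars_mod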

Running valgrind and bounds tests on the CRM code

We ran the code through Valgrind to check for invalid memory accesses. We were not particularly worried about memory leaks because Fortran automatically deallocates allocatable arrays that go out of scope (though relying on this is poor practice, and module-level variables never go out of scope). We were, however, worried about invalid memory accesses in the CRM code, which we inherited and which was not developed using particularly robust software development practices or versioning. Again, Valgrind does not check variables on the stack, which still included a number of subroutine-level variables in the code, and we had to increase the stack size in Valgrind considerably.

We also ran the code with bounds checking, which did not reveal any errors; it is likely the original developers of the code did indeed use bounds checking. The issue with Fortran bounds checking is that it inherently trusts the bounds declared in subroutine headers. Since Fortran allows reshaping, and even potentially resizing, of arrays across subroutine boundaries, it is important to remember that Fortran bounds checking is not necessarily rigorous. To borrow a mathematical phrase, it is necessary but not sufficient.

Between bounds checking, running the code through Valgrind, and most importantly answer checking, we do have high confidence that the code is free of memory errors.

Re-dimensioning CRM data to include an extra dimension

The most significant refactoring effort by far was promoting nearly every variable in the CRM code to include an extra dimension. In the MMF configuration, there are generally many CRM instances per node because there are many global model columns per node. However, the original CRM code performed computations for only a single CRM instance at a time, and the loop over multiple instances lived outside the CRM code. It was not hard to determine that the parallelism within a single CRM instance was nowhere near enough for efficiency on the GPU.

Therefore, we needed to push the loop over multiple CRM instances inside the CRM code itself. To do this, the variables needed to include the extra dimension. In the first porting effort, we made this dimension the fastest varying. In the second effort, however, we decided it best to make it the slowest varying. The reasoning was that the configurations of the CRM not yet ported (such as the two-moment microphysics configuration) would still run quickly on the CPU without having to push the loop across CRM instances down to the innermost loop, which would be a significant refactor in its own right. This also enabled faster-running tests and therefore a quicker development effort the second time around.

To do this, we decided the most incremental unit of work would be to re-dimension one variable at a time. Luckily, variables in the CRM rarely changed names, and there were only a few instances of variables pointing into other variables, illuminated by a grep for “=>”. Further, there were several equivalence statements that had to be recast as Fortran pointers, because equivalence statements are only possible for statically defined variables, not runtime-allocated ones. These Fortran pointers also turned out to be the source of several difficult-to-find bugs in the PGI compiler's handling of pointers, especially within OpenACC.
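
A minimal sketch of recasting an equivalence as a pointer (variable names hypothetical):

! before: two names sharing static storage
!   real(8) :: q(nx,ny,nz), qv(nx,ny,nz)
!   equivalence (q, qv)
! after: an allocatable target aliased by a pointer
real(8), allocatable, target :: q(:,:,:)
real(8), pointer             :: qv(:,:,:)
allocate( q(nx,ny,nz) )
qv => q   ! aliasing formerly provided by the equivalence statement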

The Github Pull Request for this refactoring effort alone consisted of 20,000 lines of code changes, showing how large the effort was. However, due to the robust testing framework, the answers were bit-for-bit identical to the non-refactored code.

Exposing threading in a tightly nested manner

After much deliberation, desiring a porting approach that was both flexible and portable, we decided to use tightly nested looping wherever possible in order to collapse all available loops into a single level of parallelism, which the compiler can then decompose however it wishes among gangs and vectors. It turns out you really cannot allow the compiler much freedom even in that choice, but in general the approach was very effective. No matter the individual dimensions inside the CRM, which include the x, y, and z dimensions (denoted “nx”, “ny”, and “nz”) as well as the number of CRM instances (denoted “ncrms”), this gives a reasonable decomposition into threads. The alternative is to assume “nx” is 128 or higher, which we cannot guarantee in general. Also, with four levels of parallelism, it is difficult to assign each its own level in the GPU hierarchy of vector, worker, and gang (in OpenACC speak) and get efficiency that is robust to all choices of dimension lengths.
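
A minimal sketch of the pattern (loop body hypothetical; the index ordering reflects the slowest-varying ncrms dimension described earlier):

!$acc parallel loop collapse(4)
do icrm = 1 , ncrms
  do k = 1 , nz
    do j = 1 , ny
      do i = 1 , nx
        ! all nx*ny*nz*ncrms iterations form one flat pool of threads,
        ! which the compiler splits among gangs and vectors
        u(i,j,k,icrm) = u(i,j,k,icrm) + dt * dudt(i,j,k,icrm)
      enddo
    enddo
  enddo
enddo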

In exposing threading in a tightly nested manner, we had to push the loop across CRM instances all the way down the callstack, though thankfully the callstack is not generally very deep in the CRM code. Also, there were many instances where loops were interrupted by integer arithmetic (mainly indexing conveniences) and if-statements. To handle these cases, we simply pushed the arithmetic and if-statements down inside the innermost loop. It is worth mentioning that on the CPU, this harms performance, mostly because CPUs are highly reluctant to vectorize any loop with branching logic inside it. However, if code is running on the GPU, it is technically already vectorized. And as long as the branching does not diverge within a half-warp of computation (generally 16-32 consecutive threads), the if-statements have no impact on performance.
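
For instance (a hypothetical sketch of the transformation), index arithmetic that formerly sat between loop levels moves inside the innermost loop so the nest stays tightly nested:

! before:
!   do k = 1 , nz
!     kb = max(1, k-1)    <- interrupts the nest; prevents collapse(4)
!     do j = 1 , ny
!       ...
! after:
!$acc parallel loop collapse(4)
do icrm = 1 , ncrms
  do k = 1 , nz
    do j = 1 , ny
      do i = 1 , nx
        kb = max(1, k-1)
        w(i,j,k,icrm) = w(i,j,k,icrm) - (p(i,j,k,icrm) - p(i,j,kb,icrm)) * rdz
      enddo
    enddo
  enddo
enddo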

Reducing the number of kernels

This code uses a particularly large number of GPU kernels, each of which has relatively little runtime and therefore a proportionally large kernel launch overhead. Therefore, it is important to reduce the relative overhead of kernel launches by reducing the number of kernels as much as possible. To this end, there were many situations where kernels could be merged, and merging generally required adding if-statements to account for disparate loop bounds between successive loop nests. In a few cases, we duplicated work in order to merge kernels and remove algorithmic dependencies.
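
A minimal sketch of such a merge (hypothetical variables): two back-to-back nests with different vertical bounds become one kernel with a guard:

! before: two kernels, one over k = 1,nz and one over k = 1,nz-1
! after: one kernel with an if-statement covering the shorter range
!$acc parallel loop collapse(4)
do icrm = 1 , ncrms
  do k = 1 , nz
    do j = 1 , ny
      do i = 1 , nx
        a(i,j,k,icrm) = a(i,j,k,icrm) * factor
        if (k <= nz-1) then
          b(i,j,k,icrm) = c(i,j,k+1,icrm) - c(i,j,k,icrm)
        endif
      enddo
    enddo
  enddo
enddo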

Handling disparate thread behavior from sub-cycling

There were two cases where different threads would have significantly different workloads, and both of these cases are due to what is called “sub-cycling.” Basically, a thread will determine the number of iterations it must take based on certain stability criteria. The first issue is a very high-level loop over “cycles” within the time stepping. This loop is there in case the Courant-Friedrichs-Lewy (CFL) restriction is violated due to abnormally large winds, particularly in the lower vertical levels in our configurations (where the vertical grid spacing is the finest), and it allows the inner time step loop to cycle up to four times relative to the other work in the main time stepping loop.

We considered allowing threads to individually respond to the disparate workload, but this would require if-statements in every kernel in the cycling loop (which is nearly every kernel we ported). Not only would this negatively affect CPU performance for unported portions of the code, but it would also be a very difficult refactoring effort with a high chance of introducing bugs, and it would greatly decrease readability (and thus maintainability). Therefore, we decided instead to simply have all CRMs obey the most stringent CFL condition within the group of CRMs being simulated on the MPI task. It turns out that this cycling is rarely triggered, and therefore the performance impact is minimal.
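
A minimal sketch of the idea (names hypothetical): the cycle count is a max-reduction across the task's CRM instances, and every instance then runs that many cycles:

ncycle = 1
!$acc parallel loop reduction(max:ncycle)
do icrm = 1 , ncrms
  ! each instance's requirement, based on its own CFL estimate
  ncycle = max(ncycle, ncycle_req(icrm))
enddo
! all CRMs on this MPI task now sub-cycle ncycle times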

The other case in which this occurred is in the precipitation fallout routine that lives in the one-moment microphysics. In that routine, there is also a stability criterion that leads to sub-cycling. We applied the same treatment of making all CRMs obey the most stringent stability criterion of the group of CRMs on the MPI task.

Introducing OpenACC Directives

After performing most of the refactoring on the CPU using the GNU compiler, we began adding OpenACC directives one kernel at a time using the PGI compiler and the same full-model test suite mentioned before. To keep the OpenACC code as efficient as possible, we also created wrapping data statements to keep data on the GPU between kernels as much as possible. Efficiency at this stage is hard to achieve because much of the code still runs on the CPU, meaning it requires a significant number of MPI tasks per node to complete in a feasible amount of time. Yet, with many MPI tasks per node, the GPU kernels become too small to provide good speed-up.

Our approach was to port kernels in sequential order, starting at the beginning of the main CRM time stepping loop and progressing through it one kernel at a time, pushing the end of the wrapping data statements further down to keep data on the GPU between kernels and adding variables as the porting progressed.

We chose the parallel loop flavor of OpenACC over the kernels flavor because experience has shown that the more leeway you give the compiler in choosing its own optimizations, the worse it performs. Further, the kernels flavor is not as portable as the parallel loop form in practice. For one, GNU’s OpenACC implementation heavily prioritized development for the parallel loop form. Also, OpenMP’s 4.5+ specification heavily mirrors the parallel loop form and not the kernels form. “Kernels” is more of a descriptive mode of exposing threading, while parallel loop is more of a prescriptive mode, which OpenMP has long favored even for CPU threading.

Handling various types of race conditions

When a codebase has significant parallelism in the outer loops, race conditions often do not surface. However, for applications with relatively little threading available per GPU, one must extend threading down to the inner loops of the code, where race conditions are much more common. We encountered two main kinds of race conditions in the CRM porting. The most common by far was the case of partial reductions over a subset of the dimensions. Whenever you perform a global reduction over all variable indices, you can simply use the reduction() clause in OpenACC. However, when the reduction is over only some of the dimensions, it is more efficient to do an atomic update. This requires saving any excess computation on the right-hand side into a temporary scalar and then applying an !$acc atomic update directive to the reduction statement using that scalar temporary. An example is below:

!$acc parallel loop collapse(3) private(tmp)
do k = 1 , nlev
  do j = 1 , ny
    do i = 1 , nx
      ! You can only apply an atomic statement that looks like: a = a + b
      ! You must precompute the RHS, b, before the atomic statement
      tmp = (u(i,j,k) - u0(k))*factor_xy
      !$acc atomic update
      u_column(k) = u_column(k) + tmp
    enddo
  enddo
enddo

The other type of reduction we encountered was a prefix-sum type reduction. These cases look like array(k) = array(k+1) + [arithmetic]. Since the ordering matters and the intermediate states are saved, this cannot be performed in parallel directly. There are parallel scan algorithms, but like parallel reductions, they are generally only efficient when performed over the entire array rather than over only some of the dimensions. In these cases, we chose to save the intermediate computations into an array, fission the kernel, perform the prefix sum serially in the dimensions of the sum, and then continue in parallel after that. By minimizing the serial work as much as possible, the performance impact becomes minimal.
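
A minimal sketch of the fission (hypothetical names): the parallel arithmetic is computed and stored first, and the scan then runs serially in k while remaining parallel across instances:

! kernel 1: fully parallel, save the per-level contributions
!$acc parallel loop collapse(2)
do icrm = 1 , ncrms
  do k = 1 , nz
    contrib(k,icrm) = flux(k,icrm) * dz(k)
  enddo
enddo
! kernel 2: parallel across instances, serial scan in k
!$acc parallel loop
do icrm = 1 , ncrms
  colsum(nz,icrm) = contrib(nz,icrm)
  !$acc loop seq
  do k = nz-1 , 1 , -1
    colsum(k,icrm) = colsum(k+1,icrm) + contrib(k,icrm)
  enddo
enddo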

Development of a pool allocator

When we were investigating Managed Memory and discovered that the PGI pool allocator was behaving sub-optimally in terms of performance, we decided to implement our own pool allocator in Fortran. We used the c_loc() and c_f_pointer() routines of the iso_c_binding module from the Fortran 2003 specification to implement a pool allocator that could take a chunk of memory and cast it to arbitrary Fortran arrays of varying types and shapes. For the sake of simplicity, we used a stack-based implementation so that we did not have to handle segmentation or implement a true heap data structure. While this significantly improved allocation times when using Managed Memory, we eventually discovered that Managed Memory significantly increased kernel launch overheads, and thus we were not able to use it. Therefore, this development is not part of the final codebase, though we did document it on our project's webpage.
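
A minimal sketch of the mechanism (a simplified, hypothetical version of the allocator, without error checking): a backing slab is carved sequentially, and c_loc()/c_f_pointer() cast each carve to a Fortran array of the requested shape:

module pool_mod
  use iso_c_binding
  implicit none
  real(8), allocatable, target :: pool(:)   ! the backing slab
  integer :: top = 1                        ! stack pointer into the slab
contains
  subroutine pool_init(nwords)
    integer, intent(in) :: nwords
    allocate(pool(nwords))
    top = 1
  end subroutine pool_init
  ! carve a 3-D real(8) array off the top of the stack
  subroutine pool_get_r8_3d(p, n1, n2, n3)
    real(8), pointer, intent(out) :: p(:,:,:)
    integer, intent(in) :: n1, n2, n3
    call c_f_pointer( c_loc(pool(top)), p, [n1,n2,n3] )
    top = top + n1*n2*n3
  end subroutine pool_get_r8_3d
  ! stack discipline: release everything carved after the saved mark
  subroutine pool_release(mark)
    integer, intent(in) :: mark
    top = mark
  end subroutine pool_release
end module pool_mod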

Data Management

When it comes to data management, it would be nice to use a unified memory address space between the CPU and GPU DRAM. However, there are several reasons why this is not wise. First, we do not believe unified memory address spaces are portable at this time: code that relies on a truly unified address space like the one available on Summit cannot run on most other GPU machines.

Using the slightly more portable approach of CUDA Managed Memory (cudaMallocManaged in CUDA and -ta=tesla:managed in PGI), there are still some issues. First, the only compiler that supports this is PGI, making it unportable. Second, it significantly increases the cost of allocations. For codes with significant workloads per allocation, this is often not a problem, but codes that allocate often invariably have to create their own pool allocators to avoid the cost. On that note, PGI implements a pool allocator for Managed Memory, but in our experiments it performed quite poorly, even after tuning the available runtime environment variables. The most problematic aspect of Managed Memory, however, was that in our experiments it significantly increased kernel launch overheads. In some cases, gaps between kernels increased up to ten-fold. We believe this, again, is a symptom seen only when kernel runtimes are very small (on the order of 5-10 microseconds at times), as in our code.

Since we chose not to use Unified or Managed Memory, we had to manage all data movement ourselves. Fortunately, PGI automatically generates data statements on a per-kernel basis in Fortran codes, making this easier. While the PGI-generated data statements were usually correct, there were a few cases where things went wrong. When Fortran functions return arrays, for instance, array temporaries were created and allocated on the GPU, which generally caused the code to crash at runtime. We fixed this issue by transforming those functions into subroutines. Also, there were cases where a different index range is read from a variable than is written to that same variable, leading to confusion in the compiler; and in some cases, variables were merely “created” on the GPU rather than copied in and out. We fixed these on a case-by-case basis.

We found we had to be careful in some cases with copyout, because many kernels do not write to the entire variable. In those cases, a copyout clause is actually incorrect unless you limit the indices. Therefore, to ensure correctness in the intermediate port without having to inspect the bounds being written, we decided to simply “copy” most of the variables that were written to. Were there no parent data statements, this would lead to excessive copies between CPU and GPU, but we place all variables in parent data statements, causing OpenACC to ignore the per-kernel data statements.

Given that we had over 200 kernels and even more variables to deal with, we decided it was impractical to determine which variables were read, written, or read-write in the data statements wrapping the entire main time stepping loop. Therefore, since there is so much computation in the time stepping loop, we simply used an !$acc enter data copyin() for all of the loop's variables before it and a corresponding !$acc exit data copyout() after it.

Many of the subroutines have arrays local to the routine that need to be created on (and deleted from) the GPU to avoid passing data back and forth between kernels. In these cases, we use !$acc enter data create() at the beginning of the subroutine and !$acc exit data delete() at the end.
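
A minimal sketch of the overall pattern (hypothetical variable names; the async clauses anticipate the next section):

! wrapping data statements around the main time stepping loop
!$acc enter data copyin(u, v, w, t, qv) async(asyncid)
do icyc = 1 , ncycle
  ! ... the >200 kernels all operate on device-resident data ...
enddo
!$acc exit data copyout(u, v, w, t, qv) async(asyncid)

! device-resident locals inside a subroutine
subroutine diffuse()
  real(8), allocatable :: flux(:,:,:,:)
  allocate( flux(nx,ny,nz,ncrms) )
  !$acc enter data create(flux) async(asyncid)
  ! ... kernels that use flux ...
  !$acc exit data delete(flux) async(asyncid)
  deallocate( flux )
end subroutine diffuse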

The reason we use unstructured data statements instead of structured data statements will be illuminated in the next section; but in short, structured data statements do not allow the async clause, which we need.

Reducing the impact of kernel launch overheads and GPU allocations

Because the kernels generally consume very little time (tens of microseconds, and at times even less), it is important to reduce the impact of kernel launch overheads as much as possible. Further, GPU allocation is quite costly compared to normal CPU allocation. While these costs often do not impact performance for codes with more expensive kernels, in our case they are simply too high. To ameliorate both issues, we launch every kernel in the code asynchronously into the same OpenACC async queue (analogous to a CUDA stream). This overlaps kernel launch overhead and GPU allocation with kernel execution. It is not a common use of asynchronous launches, but in our tests, even using only 1 MPI task per GPU with the Early Science configuration of 800 large CRMs per node, asynchronous launches gave us a 1.5x speed-up over synchronous kernels. With more MPI tasks per GPU or lower workloads per GPU, this speed-up becomes significantly larger.
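
In practice (a minimal sketch), every kernel carries the same async queue, with a single wait before the host needs the results:

!$acc parallel loop collapse(4) async(asyncid)
do icrm = 1 , ncrms
  do k = 1 , nz
    do j = 1 , ny
      do i = 1 , nx
        t(i,j,k,icrm) = t(i,j,k,icrm) + dt * dtdt(i,j,k,icrm)
      enddo
    enddo
  enddo
enddo
! ... many more kernels, all async(asyncid) ...
!$acc wait(asyncid)   ! synchronize once, before the host touches the data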

With this choice, unfortunately, the maintainability of the GPU port takes a hit. If a developer introduces code that either does not run on the GPU or does not properly implement the asynchronous behavior, the code will produce wrong answers on the GPU. This is why it was all the more important that we developed a robust regression suite to determine when this occurs (which also allows us to use git bisect to determine the commit at which it occurs).

Kernel performance refactoring

There were a number of cases where explicit refactoring of kernels was required to get decent performance. In many cases, the PGI compiler chose a poor number of OpenACC gangs, which associate outer loop indices with different Streaming Multiprocessors (SMs) on the GPU. In several cases, only one gang was used, meaning the V100 would be underutilized by a factor of nearly 100x. In a couple of other cases, the maximum number of gangs was used even though the kernel did not have nearly that much threading available. In each of these cases, we recovered performance by explicitly calculating the correct number of gangs and specifying it in the parallel loop directive using the num_gangs() and vector_length() clauses.
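
A minimal sketch (the vector length of 128 is illustrative, not the project's tuned value): the gang count is derived from the actual iteration count so the kernel neither starves nor over-subscribes the GPU:

nwork = ncrms * nz * ny * nx
!$acc parallel loop collapse(4) num_gangs((nwork+127)/128) vector_length(128)
do icrm = 1 , ncrms
  do k = 1 , nz
    do j = 1 , ny
      do i = 1 , nx
        q(i,j,k,icrm) = max(0.d0, q(i,j,k,icrm))
      enddo
    enddo
  enddo
enddo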

In the pressure solver (pressure.F90), we found a kernel that was individually consuming roughly 20% of the total kernel runtime. Investigating further, we found two problems. First, the original code computed pi at every loop index by evaluating acos(-1). We replaced this with a simple double precision constant. Second, some terms requiring cosines depended only upon the horizontal indices, not the vertical or CRM instance indices. Therefore, we separated that calculation out into a smaller loop. These efforts reduced the cost of this section of code by about 20x.
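
A minimal before/after sketch of both fixes (names hypothetical; the actual trigonometric terms are elided):

! before: every (i,j,k,icrm) iteration paid for acos(-1.d0) and the cosines
! after: pi is a compile-time constant, and the horizontal-only cosine
!        factors are precomputed in a much smaller kernel
real(8), parameter :: pi = 3.14159265358979323846d0
!$acc parallel loop collapse(2)
do j = 1 , ny
  do i = 1 , nx
    cosfac(i,j) = cos(2.d0*pi*(i-1)/nx) + cos(2.d0*pi*(j-1)/ny)
  enddo
enddo
! the large collapse(4) kernel then simply reads cosfac(i,j)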

Other kernels presented a more difficult situation. They involved reducing 3-D CRM model variables down to a single vertical column of data. The problem was two-fold. First, an atomic statement had to be added. Second, the fastest varying index of the reduced data is the vertical index, k, while the fastest varying index of the 3-D data is the horizontal x-direction index, i. Thus, contiguous memory accesses throughout are impossible unless we permute the CRM instance index to be the fastest varying, which we may, in fact, do in the future. The best compromise was to see which variable is accessed most often in the kernel (usually the reduced variable) and make its fastest-varying index the innermost loop.

Increasing kernel size

The default E3SM code implements a “chunking” of the global model physics columns into groups of an arbitrary size called PCOLS, which allows the code to use a compile-time-known size for many of the variables and thus enables additional compiler optimizations. It defaults to 16, and we discovered through nvprof that this was significantly too small for efficiency on the GPU. Therefore, we found where this was set in the code and changed the value to 256, which in our tests was enough to perform efficiently by gluing CRMs together into a single large kernel.

Optimizing MPI Tasks versus kernel size

When the GPU port was finally complete and on-node tests with nvprof showed minimal data movement within the main time stepping loop, it was time to optimize the code at large. For our Early Science configuration, which is described in the “Performance Improvement” section, we found through trial and error that the most efficient configuration used 36 MPI tasks per node (6 MPI tasks per GPU). Ideally, the CRM would run with one MPI task per GPU, but this makes the CPU code run too slowly because of a lack of parallelism. However, with too many MPI tasks, the GPU kernels become too small. This gave rise to testing different numbers of MPI ranks per node, and 36 was found to be ideal. This requires the gpumps flag in bsub, which enables the CUDA MPS server that allows multiple MPI tasks to share a GPU simultaneously. We separated the 36 tasks into six “resource sets” of six MPI tasks each to divide them evenly among the GPUs. Thus, we did not have to consider explicitly coding for more than one GPU at a time.

Creating a custom mpirun script

The Common Infrastructure for Modeling the Earth (CIME) is a complex set of Python, Perl, and shell scripts intended to aid configuration management at both DOE and NSF computing centers through the E3SM and CESM projects. IBM's jsrun command turned out to be far too complex to handle directly through the CIME infrastructure, which would have to do arithmetic to evenly divide tasks among resource sets for a variety of options. Therefore, we ended up creating our own mpirun.summit script, which handles this arithmetic internally and takes only three flags: -n for the total number of MPI tasks, -N for the number of MPI tasks per node, and -gpu to create a jsrun configuration particularly suited to GPU runs.
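
A hypothetical invocation (task counts and executable name illustrative only; the flags are those described above):

mpirun.summit -n 2160 -N 36 -gpu ./e3sm.exe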

Debugging different aspects of the GPU port

For debugging, we used a variety of techniques and tools. In many cases, it was most efficient to simply print values to the terminal or comment things out of a kernel, performing a sort of bisection search for the causes of invalid memory address errors or wrong answers. In a few difficult cases, we used the DDT tool to find where answers began to diverge from expectations, or to automatically stop upon program failure and perform an in situ stack trace where core files had failed to provide anything useful. We also used environment variables to turn off asynchronicity in OpenACC with PGI and ran the code in DDT to determine via stack trace which kernel was the offender.

We also found that nvprof can sometimes be effective for debugging, namely when data transfers occur in places you do not expect them to because of compiler bugs that alter the memory addresses of pointers in Fortran (described more in a later section). In these cases, we placed a default(present) clause on the kernel in question to force the PGI compiler to fail and report exactly which variable was not present and in which kernel, source file, and line number the problem occurred. We found these aspects of the PGI compiler very helpful for debugging.
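
The clause itself is a small addition (a sketch with hypothetical variables): with default(present), any array missing from device memory turns a silent transfer into an immediate, descriptive runtime error:

!$acc parallel loop collapse(2) default(present)
do icrm = 1 , ncrms
  do k = 1 , nz
    ! if any array here is absent from the device, the PGI runtime
    ! aborts and names the variable, source file, and line number
    w0(k,icrm) = 0.d0
  enddo
enddo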

Working around PGI OpenACC bugs

While the PGI compiler is very powerful, and in our opinion provides the most performant and robust OpenACC implementation in Fortran, we did encounter some serious bugs during the porting process. As mentioned before, there were several cases where the compiler improperly handled functions whose return values were arrays. In these cases, the compiler would allocate a temporary variable (always called tmp, which we found confusing because we actually had a variable named tmp, but it was not the one being created), and the code would subsequently crash. We fixed this by changing the functions into subroutines.
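
A minimal sketch of that workaround (hypothetical names and illustrative physics, not the model's actual routines): the array-valued function becomes a subroutine with an explicit output argument, so no hidden array temporary is created:

! before: an array-valued function that provoked a GPU array temporary
!   qs(:) = sat_mixing_ratio( temp(:), pres(:) )
! after: a subroutine with an explicit output argument
call sat_mixing_ratio(temp, pres, qs, n)

subroutine sat_mixing_ratio(temp, pres, qs, n)
  integer, intent(in)  :: n
  real(8), intent(in)  :: temp(n), pres(n)
  real(8), intent(out) :: qs(n)
  integer :: i
  !$acc parallel loop
  do i = 1 , n
    ! illustrative Tetens-style formula
    qs(i) = 380.d0 / pres(i) * exp( 17.27d0*(temp(i)-273.16d0) / (temp(i)-35.86d0) )
  enddo
end subroutine sat_mixing_ratio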

There were also bugs where the compiler would generate erroneous data statements for some of the kernels, which we fixed by creating explicit data statements for those kernels. For most of the kernels, we ended up creating explicit data statements anyway.

We had a few cases where we collapsed loops and had thread-private array variables that the compiler privatized at the wrong level (gang instead of vector). In these cases, we had to create separate gang and vector loops in the kernel (rather than collapsing to a single level of parallelism) and declare the private variables on the vector loop.
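
A minimal sketch of that restructuring (hypothetical names): the private array is attached to the vector loop so each thread, not each gang, gets its own copy:

!$acc parallel loop gang collapse(2)
do icrm = 1 , ncrms
  do k = 1 , nz
    !$acc loop vector collapse(2) private(col)
    do j = 1 , ny
      do i = 1 , nx
        col(1) = t(i,j,k,icrm)    ! col is now private per vector thread
        col(2) = qv(i,j,k,icrm)
        t(i,j,k,icrm) = col(1) + dt * (col(2) - qsat0(k))
      enddo
    enddo
  enddo
enddo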

A pointer related compiler bug

We also encountered one particularly difficult-to-find PGI compiler bug related to Fortran pointers. Fortran pointers are very different from C pointers, which is precisely why “Cray pointers” exist. Fortran pointers are always automatically dereferenced upon mention. Further, they carry around descriptors to keep track of lower and upper bounds as well as striding. Since they can, in general, point to strided (non-contiguous) memory, all accesses must pass through the descriptor data structures, and the descriptors must be passed to the GPU as well.

But since pointers are always dereferenced in Fortran, what happens in OpenACC when you have a pointer aliased to memory that is already on the GPU? It turns out that PGI, at least, separately accounts for the descriptors and the data referenced by the pointer, and it is not clear there is another way to handle it. So even if the data is already on the GPU, you still have to copyin the pointer so the GPU has access to the descriptors, and there are separate reference counters for the data and the descriptors so the runtime does not lose track of either. We found cases where we had not taken this into account.

When you pass a Fortran pointer to a subroutine, several behaviors are possible depending upon the interface that accepts it. If the subroutine accepts the pointer as a pointer, then the full descriptor is maintained across the subroutine interface. Since you cannot specify lower bounds when accepting a pointer into a subroutine, we confirmed with compiler developers that the compiler does indeed preserve the lower bounds as part of the pointer's descriptor. If the subroutine accepts the pointer as a normal array, then the pointer's descriptor is discarded inside the scope of the subroutine. It was here that we discovered a bug in the compiler.

In several of our subroutines, when we passed the pointer into a subroutine as a normal array, it worked completely fine on the CPU. However, when we created a GPU kernel, we discovered (through nvprof and then error messages from default(present)) that the GPU pointer was changing across that subroutine interface for some reason. It turned out that the workaround was to discard the pointer descriptor before passing it to the subroutine by passing, for instance, array(1,1,1,1). This only works if the subroutine accepts the array with explicit bounds, as is customary in Fortran 77. If it accepts it as an assumed-shape Fortran 90 array, it will appear that you are passing a scalar to an argument that should be an array, leading to a compile-time failure. Therefore, we declared Fortran 77-style explicit-shape arrays in those subroutines and passed the first element of the data, which causes Fortran (which passes by reference by default) to pass only the address of the first element of the data the pointer was pointing to. This alleviated the bug.
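
A minimal sketch of the workaround (hypothetical names):

! the caller holds a Fortran pointer into device-resident data
real(8), pointer :: q(:,:,:,:)
! before: call advect(q)   (the descriptor crossed the interface, and the
!                           bug corrupted the device address)
! after: pass only the address of the first element
call advect(q(1,1,1,1))

subroutine advect(q)
  real(8) :: q(nx, ny, nz, ncrms)   ! F77-style explicit-shape dummy; no descriptor
  ! ... kernels operate on q as a normal array ...
end subroutine advect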

Best practices for Fortran pointers in OpenACC

By and large, the most commonly agreed-upon best practice for pointers is simply not to mix them with OpenACC if you can help it. It is easiest to pass the pointer into a subroutine that accepts it as a normal array, thus ditching the pointer-specific descriptors and making it easier for OpenACC to handle.

Reporting bugs and packaging kernels for vendors

We regularly reported bugs to vendors and provided kernel examples and mini-applications based on various aspects of our code to vendors to improve their compilers’ performance and robustness.

Investigating the OpenMP Offload Transition

We also heavily investigated OpenMP 4.5+ offload during this porting effort. We attempted to port the model with the IBM XL Fortran compiler using OpenMP 4.5, but we found that the implementation was not yet mature enough for this code, which is admittedly complex. In that effort, though, we created a set of Unified Directives using CPP variadic macro functions based on the intersection of the OpenMP 4.5 and OpenACC specifications, which is quite large. This came with significant help from Lixiang Luo of IBM and Jeff Larkin of Nvidia.

There is a strong one-to-one correspondence between much of OpenMP 4.5 target offload and OpenACC parallel loop structures. The one substantial difference is asynchronous behavior. OpenACC mirrors the CUDA streams approach, while OpenMP 4.5 implements a much more capable but much more cumbersome data-dependency-based approach. We believe the OpenMP approach is perhaps better for many applications. We, however, do not need asynchronicity to handle data dependencies; we need it to hide overheads, with everything in the same queue. Therefore, the OpenACC engine is far easier for us to use, and we have requested that OpenMP consider also implementing a streams-based approach. It was suggested that we try creating a single scalar variable that is depend(inout) for all kernels, but we deemed this too detrimental to code readability. In order to use OpenMP 4.5 asynchronous behavior, we plan to add kernel-wise data statements for all kernels and then use a pre-processor of some kind to transform them into OpenMP depend clauses.
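
A minimal sketch of the unified-directive idea (not the project's actual macro set; clauses spelled identically in both models pass through as macro arguments):

#ifdef USE_OMP45
#define UNIFIED_PARALLEL_LOOP(...) !$omp target teams distribute parallel do __VA_ARGS__
#else
#define UNIFIED_PARALLEL_LOOP(...) !$acc parallel loop __VA_ARGS__
#endif

UNIFIED_PARALLEL_LOOP(collapse(2) reduction(max:cflmax))
do icrm = 1 , ncrms
  do k = 1 , nz
    cflmax = max(cflmax, abs(w0(k,icrm)) * dt / dz(k))
  enddo
enddo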

The other notable difference between OpenACC and OpenMP 4.5 is in the handling of private scalars. In OpenACC, you do not need to declare scalars as private, since they are automatically private. In OpenMP, however, you traditionally do need to declare scalars as private. This could also prove quite cumbersome, and it would be nice for the OpenMP specification to consider allowing default privatization of scalars.

Documentation

Throughout the project, we regularly documented our approaches, best practices, and even tutorials on our project's Confluence site for the Exascale Computing Project, and we have interacted with several other projects that were interested in our approaches and had suggestions for improvements. We have also documented code changes in Github Pull Requests and Issues.

Capabilities delivered

Moist convection plays a critical role in many atmospheric phenomena, and the representation of convection in global models has long been one of the biggest sources of errors and uncertainty. Traditional convective parameterizations often provide a disjointed representation of the various cloud types and their interactions with other processes, such as radiation, microphysics, and boundary layer turbulence. Unifying these parameterizations to give a consistent treatment among these processes has been a subject of research for decades, but the MMF concept provides a simpler alternative approach, albeit an expensive one.

The MMF approach allows us to replace parameterizations of convection, microphysics, and turbulence with a single model that explicitly represents how these processes interact on the scale of convective circulations. It also allows the calculation of radiative heating to treat clear and cloudy conditions on these scales, which can drive secondary circulations and feedbacks on the convective scale that affect larger circulations. Aside from the general notion that explicitly resolved convective circulations are more realistic, the behavior of phenomena on the scale of the global grid has been shown to be dramatically improved by the MMF approach. This includes the pattern and intensity of precipitation, which is crucial for providing regionally relevant projections of water cycle statistics (i.e., floods and droughts) that are valuable to decision makers.

The MMF approach also provides unique flexibility for tailoring scientific experiments. We have begun exploring how the model responds to changes in the CRM domain and grid spacing, which have revealed surprising sensitivities that relate to the important scales of real-world convection. The MMF model also provides a unique way to probe cloud-radiative interactions. These types of experiments are possible with slight changes to the configuration and do not require any fundamental changes to the underlying model used to represent the cloud-scale physics. This capability sets the MMF approach apart from what can be explored with conventional models, and the GPU acceleration outlined here makes this type of research more tractable to the scientific community.

We also delivered documented best practices for codes with relatively little parallelism per GPU regarding kernel launch overheads, data management, and asynchronicity. New capability has also been delivered in the form of significantly more robust compilers, as well as test kernels for developing directives-based implementations such as GNU's OpenACC implementation and IBM XL's OpenMP 4.5 implementation (among others).
