
Writing QUDA Kernels

Dean Howarth edited this page Jul 5, 2020 · 6 revisions

This page describes the QUDA conventions for adding new kernels to QUDA. It serves as best-practice advice for kernel performance, as a style guide to ensure uniformity across QUDA, and as a promotion of an abstracted programming style that emphasises readability and code portability.

Overview

This documentation will be split into two relatively separate parts,

  1. Part 1: A colloquial, somewhat informal, example-driven explanation of how to write or modify a QUDA kernel.
  2. Part 2: A more formal, abstract, and general explanation of how QUDA kernels are written.

Users new to QUDA and/or users new to GPU computing are advised to read Part 1 before tackling Part 2 so that they can familiarise themselves with the vernacular and common structures used in QUDA. Experienced programmers may find reading Part 1 useful, but may find that they can write a QUDA kernel immediately after reading Part 2.

Writing QUDA Kernels, Part 1

There are three types of QUDA kernel one can write:

  1. Pure fermion
  2. Pure gauge
  3. Both fermion and gauge

For example, the contraction kernel routines are pure fermion. One need only load fermion fields and an array in which to store the results of the contraction. These are the simplest types of QUDA kernel, as they do not involve any form of communication: all the data one needs for a contraction is already on the MPI node.

As a second example, consider the plaquette kernel. This is a pure gauge routine, and loading gauge fields into QUDA kernels is a touch more involved. Furthermore, gauge field routines generally involve 'staples', that is, gauge-field links that reside on adjacent lattice sites. This naturally involves communication for lattice sites at the MPI boundary. QUDA handles this using ghost zones.

Thirdly, we will look at the Wilson dslash kernel. This is a kernel that uses both fermion and gauge fields (though the gauge fields remain unchanged during the execution of the Wilson kernel). We will see how QUDA uses ghost zones to overlap the communications and computations of the fermion field.

Let us then follow the code path from the interface_quda.cpp routine void contractQuda(...) to see what is involved in writing a pure fermion kernel. This will be a 'stream of consciousness' introduction to a QUDA kernel, with a linear narrative following the code path. For a more abstract explanation, please refer to the summary.

void contractQuda(...): A pure fermion field function

The contractQuda routine accepts three pointers and some metadata. The pointers are the two fermion fields to be contracted and an array that will store the result of the contraction, which is the 16 \mu\nu contractions associated with the operation \sum_{a} \phi_{a,\mu}(X) \psi_{a,\nu}(X) per lattice site. Once the necessary data transfers and memory allocations are complete, the GPU kernel is called:

contractQuda(*x[0], *y[0], d_result, cType);

which again accepts the two fermion fields and the result array (now residing on the GPU) and some metadata. We shall now follow the code path and highlight how a pure fermion field kernel is written.
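Before diving into the QUDA source, the mathematical operation itself can be sketched for a single lattice site. This is plain, illustrative C++, not QUDA code: the flat colour-within-spin layout and the conjugation of the first field are assumptions for the sketch (whether and how the first field enters conjugated depends on the contraction type).

```cpp
#include <array>
#include <cassert>
#include <complex>

using cplx = std::complex<double>;

// Per-site open-spin contraction: M[mu][nu] = sum_a conj(x[a,mu]) * y[a,nu],
// with 4 spins and 3 colours. The index a + 3*mu is an assumed flat layout.
std::array<std::array<cplx, 4>, 4> contract_site(const std::array<cplx, 12> &x,
                                                 const std::array<cplx, 12> &y)
{
  std::array<std::array<cplx, 4>, 4> M {}; // zero-initialised 4x4 result
  for (int mu = 0; mu < 4; mu++)
    for (int nu = 0; nu < 4; nu++)
      for (int a = 0; a < 3; a++) // sum over colour
        M[mu][nu] += std::conj(x[a + 3 * mu]) * y[a + 3 * nu];
  return M;
}
```

The GPU kernel performs exactly this small computation independently at every lattice site, which is why the work maps so naturally onto one thread per site.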

contractQuda.cu: A pure fermion field kernel file

The interface function has called the GPU function contractQuda. Here it is in full:

void contractQuda(const ColorSpinorField &x, const ColorSpinorField &y, void *result, const QudaContractType cType)                                                                                                                         
  {
#ifdef GPU_CONTRACT
    checkPrecision(x, y);

    if (x.GammaBasis() != QUDA_DEGRAND_ROSSI_GAMMA_BASIS || y.GammaBasis() != QUDA_DEGRAND_ROSSI_GAMMA_BASIS)
      errorQuda("Unexpected gamma basis x=%d y=%d", x.GammaBasis(), y.GammaBasis());
    if (x.Ncolor() != 3 || y.Ncolor() != 3) errorQuda("Unexpected number of colors x=%d y=%d", x.Ncolor(), y.Ncolor());
    if (x.Nspin() != 4 || y.Nspin() != 4) errorQuda("Unexpected number of spins x=%d y=%d", x.Nspin(), y.Nspin());

    if (x.Precision() == QUDA_SINGLE_PRECISION) {
      contract_quda<float>(x, y, (complex<float> *)result, cType);
    } else if (x.Precision() == QUDA_DOUBLE_PRECISION) {
      contract_quda<double>(x, y, (complex<double> *)result, cType);
    } else {
      errorQuda("Precision %d not supported", x.Precision());
    }

#else
    errorQuda("Contraction code has not been built");
#endif
  }
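The overall shape of this function, a compile-time guard around a runtime precision dispatch, can be sketched in standalone C++. This is illustrative only: ENABLE_FEATURE stands in for GPU_CONTRACT (in QUDA the definition comes from CMake, not from the source file), and scale is a hypothetical stand-in for the templated worker function.

```cpp
#include <cassert>
#include <stdexcept>

#define ENABLE_FEATURE // in QUDA this would be set (or not) at CMake time

enum Precision { PREC_SINGLE, PREC_DOUBLE };

// Templated worker, instantiated once per supported precision.
template <typename real> double scale(double v)
{
  return static_cast<double>(static_cast<real>(v) * real(2));
}

double dispatch(Precision prec, double v)
{
#ifdef ENABLE_FEATURE
  // Runtime switch selects the compile-time instantiation.
  if (prec == PREC_SINGLE) return scale<float>(v);
  else if (prec == PREC_DOUBLE) return scale<double>(v);
  else throw std::runtime_error("precision not supported");
#else
  throw std::runtime_error("feature has not been built");
#endif
}
```

The pattern keeps the set of template instantiations, and hence compile time and binary size, under explicit control.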

The first thing to notice is that the body of the code is guarded by a preprocessor definition, which one can toggle at CMake time. If the QUDA_CONTRACT CMake option is not set, none of the enclosed code will be compiled. This is an important feature to include as it keeps compile time to a minimum. After some checks, we see that the contract_quda function is instantiated at the desired precision:

template <typename real>
  void contract_quda(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *result,
                     const QudaContractType cType)
  {
    ContractionArg<real> arg(x, y, result);
    Contraction<real, ContractionArg<real>> contraction(arg, x, y, cType);
    contraction.apply(0);
    qudaDeviceSynchronize();
  }

This deceptively simple function executes the entire contraction kernel. We must study it line by line.

Parameter argument struct

ContractionArg<real> arg(x, y, result);

This line constructs an object that stores the pointers to the data on which we will work. You can see that it accepts the x and y fields (the two fermions to be contracted) and the result array. It is templated on the precision. Here is the full ContractionArg structure, which is located in quda/include/kernels/contraction.cuh:

template <typename real> struct ContractionArg {
    int threads; // number of active threads required                                                                                                                                                                                         
    int X[4];    // grid dimensions                                                                                                                                                                                                           

    static constexpr int nSpin = 4;
    static constexpr int nColor = 3;
    static constexpr bool spin_project = true;
    static constexpr bool spinor_direct_load = false; // false means texture load                                                                                                                                                             

    // Create a typename F for the ColorSpinorField (F for fermion)                                                                                                                                                                           
    typedef typename colorspinor_mapper<real, nSpin, nColor, spin_project, spinor_direct_load>::type F;

    F x;
    F y;
    matrix_field<complex<real>, nSpin> s;

    ContractionArg(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *s) :
      threads(x.VolumeCB()),
      x(x),
      y(y),
      s(s, x.VolumeCB())
    {
      for (int dir = 0; dir < 4; dir++) X[dir] = x.X()[dir];
    }
  };

This structure is fairly complex, but none of the elements are beyond the comprehension of a student of lattice gauge theory with a rudimentary knowledge of GPU computing. We first define the variables in the argument structure:

    int threads; // number of active threads required

Despite the name, this is not the CUDA block size: it is the total number of active threads the kernel requires, one per (checkerboarded) lattice site, as set in the constructor below.
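Since threads counts work items rather than threads per block, the launch geometry is derived from it: the autotuner picks a block size, the grid is the ceiling of threads/block, and each thread guards against running past the end. A minimal sketch (plain C++, not QUDA code):

```cpp
#include <cassert>

// Grid size = ceiling(threads / block): enough blocks to cover every site.
unsigned int grid_size(unsigned int threads, unsigned int block)
{
  return (threads + block - 1) / block; // ceiling division
}

// Inside a CUDA kernel the corresponding per-thread guard would look like:
//   int x_cb = blockIdx.x * blockDim.x + threadIdx.x;
//   if (x_cb >= arg.threads) return;
```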

    int X[4];    // grid dimensions                                                                                                                                                                                                           

These are the local lattice dimensions on this MPI node.

    static constexpr int nSpin = 4;
    static constexpr int nColor = 3;
    static constexpr bool spin_project = true;
    static constexpr bool spinor_direct_load = false; // false means texture load                                                                                                                                                             

These values are hardcoded to prevent template explosion. This particular kernel will only work for spin = 4, color = 3 fermion fields. The spin_project boolean is a neat feature of QUDA that will automatically convert any gamma-basis-ordered fermion into the desired gamma basis as it is loaded into the kernel. (KATE: do we need spinor_direct_load now that textures are eliminated?)

    // Create a typename F for the ColorSpinorField (F for fermion)                                                                                                                                                                           
    typedef typename colorspinor_mapper<real, nSpin, nColor, spin_project, spinor_direct_load>::type F;
    F x;
    F y;

This typedef names a QUDA accessor type that handles the way in which fermions are passed to the kernel. As you can see, it needs to know the precision, color, and spin of the fermion, as well as any operations (such as spin projection) that need to be performed during the load. We give it the name F.

    matrix_field<complex<real>, nSpin> s;

This is a QUDA data type that will store the contraction: an nSpin × nSpin matrix of type complex<real> per lattice site.
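The idea behind such an accessor can be sketched in miniature. This is not QUDA's actual matrix_field; the site-major layout below is an illustrative assumption.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Writes one n x n complex matrix per lattice site into a flat result buffer.
template <typename real, int n> struct MatrixFieldSketch {
  std::complex<real> *data;
  int volume;
  MatrixFieldSketch(std::complex<real> *d, int v) : data(d), volume(v) {}

  // store element (mu, nu) of the matrix at checkerboard site x_cb
  void save(const std::complex<real> &m, int x_cb, int mu, int nu)
  {
    data[(x_cb * n + mu) * n + nu] = m; // assumed site-major layout
  }
};
```

An accessor like this lets the kernel write results with logical (site, mu, nu) indices while the flat memory layout stays an implementation detail in one place.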

We now come to the invocation of this argument structure:

ContractionArg(const ColorSpinorField &x, const ColorSpinorField &y, complex<real> *s) :
      threads(x.VolumeCB()),
      x(x),
      y(y),
      s(s, x.VolumeCB())
    {
      for (int dir = 0; dir < 4; dir++) X[dir] = x.X()[dir];
    }

This is what was called by the ContractionArg<real> arg(x, y, result); line. threads is initialised with the checkerboarded sub-volume of a fermion field, the two fermion fields are, naturally, x and y, and the result accessor is initialised with the result pointer and the number of threads. The body of the constructor then populates the X array with the local lattice dimensions.

This argument structure provides the kernel with the pointers it needs to locate data, and the parameters it needs to perform its computation.
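The Arg pattern in miniature: a plain, trivially copyable struct bundling pointers and parameters, which in CUDA could be passed by value across the host/device boundary as a kernel argument. The axpy example below is illustrative, not QUDA code; a serial loop stands in for the thread index.

```cpp
#include <cassert>

struct AxpyArg {
  const double *x; // input vector
  double *y;       // in/out vector
  double a;        // scale factor
  int threads;     // number of active threads (one per element)
  AxpyArg(const double *x, double *y, double a, int threads) :
    x(x), y(y), a(a), threads(threads) {}
};

// In CUDA this would be a __global__ kernel taking AxpyArg by value,
// with each thread handling one value of i.
void axpy_kernel(AxpyArg arg)
{
  for (int i = 0; i < arg.threads; i++) arg.y[i] += arg.a * arg.x[i];
}
```

Because the struct carries everything the kernel needs, the kernel body itself stays free of host-side types and can be written once against the abstract accessors.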

Instantiation of the kernel

Now that all the parameters and pointers have been established, we construct the kernel that will perform the computation:

Contraction<real, ContractionArg<real>> contraction(arg, x, y, cType);

Kernel structure

Autotuning

Putting it all together
