# C Output and Parameter Interfaces

## Author: Zach Etienne
### Formatting improvements courtesy Brandon Clark

## Exploring C output and parameter interfaces in NRPy, this notebook initializes core Python/NRPy modules, performs common subexpression elimination (CSE), and generates C code. It further delves into the NRPy parameter interface and demonstrates how Single Instruction, Multiple Data (SIMD) paradigms can optimize NRPy generated C code. Additionally, it showcases several advanced features of the `c_codegen` function, including customization options, preprocessing, temporary variable control, symbol substitution, and automatic formatting.

### Required reading if you are unfamiliar with programming or [computer algebra systems](https://en.wikipedia.org/wiki/Computer_algebra_system). Otherwise, use for reference; you should be able to pick up the syntax as you follow the tutorial.
+ **[Python Tutorial](https://docs.python.org/3/tutorial/index.html)**
+ **[SymPy Tutorial](http://docs.sympy.org/latest/tutorial/intro.html)**

### NRPy Source Code for this module:  
* [c_codegen.py](../edit/c_codegen.py)

# Table of Contents

The module is organized as follows:

1. [Step 1](#Step-1:-Initialize-core-Python/NRPy-modules): Initialize core Python/NRPy modules
1. [Step 2](#Step-2:-Common-Subexpression-Elimination-(CSE)): Common Subexpression Elimination (CSE)
1. [Step 3](#Step-3:-Let's-generate-some-C-code!-NRPy's-core-C-code-output-routine,-c_codegen()): **Let's generate some C code!** NRPy's core C code output routine, `c_codegen()`
1. [Step 4](#Step-4:-Warp-speed!-SIMD-(Single-Instruction,-Multiple-Data)-in-NRPy-Generated-C-Code): **Warp speed!** SIMD (Single Instruction, Multiple Data) in NRPy-Generated C Code
1. [Step 5](#Step-5:-Customizing-Output-with-c_codegen-Options): Customizing Output with `c_codegen` Options
1. [Step 6](#Step-6:-Advanced-CSE:-Preprocessing-and-Custom-Prefixes): Advanced CSE: Preprocessing and Custom Prefixes
1. [Step 7](#Step-7:-Enforcing-Evaluation-Order-using-SCALAR_TMP): Enforcing Evaluation Order using `SCALAR_TMP`
1. [Step 8](#Step-8:-Post-processing-Symbol-Substitutions): Post-processing Symbol Substitutions
1. [Step 9](#Step-9:-Automatic-Code-Formatting-with-Clang): Automatic Code Formatting with Clang

# Step 1: Initialize core Python/NRPy modules
### \[Back to [top](#Table-of-Contents)\]

Let's start by importing all the needed modules from Python/NRPy for dealing with parameter interfaces and outputting C code. We also set the global infrastructure parameter to enable SIMD output.

In [None]:
# Step 1: Initialize core Python/NRPy modules
import nrpy.c_codegen as ccg    # NRPy: Core C code output module
import nrpy.params as par       # NRPy: parameter interface
import sympy as sp              # SymPy: The Python computer algebra package upon which NRPy depends

# Set infrastructure to BHaH for SIMD support (REAL_SIMD_ARRAY alias)
par.set_parval_from_str("Infrastructure", "BHaH")

# Step 2: Common Subexpression Elimination (CSE)
### \[Back to [top](#Table-of-Contents)\]

Let's begin with a simple [SymPy](http://www.sympy.org/) worksheet that uses SymPy's built-in C code generator function, [ccode](http://docs.sympy.org/dev/modules/utilities/codegen.html)(), to output C code for the expression $x = b^2 \sin (2a) + \frac{c}{\sin (2a)}$.

In [None]:
# Step 2: Common Subexpression Elimination

# Declare some variables, using SymPy's symbols() function
a,b,c = sp.symbols("a b c")

# Set x = b^2*sin(2*a) + c/sin(2*a).
x = b**2*sp.sin(2*a) + c/(sp.sin(2*a))

# Convert the expression into C code
sp.ccode(x)

Computation of this expression in C requires three multiplications, one division, two sin() function calls, and one addition. Multiplications, additions, and subtractions typically require one clock cycle per SIMD element on a modern CPU, while divisions can require ~3x longer, and transcendental functions ~20x longer than additions or multiplications (see, e.g., [this page](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX&expand=118), [this page](http://www.agner.org/optimize/microarchitecture.pdf), or [this page](http://nicolas.limare.net/pro/notes/2014/12/16_math_speed/) for more details).

One goal in generating C codes involving mathematical expressions in NRPy is to minimize the number of floating point operations, and SymPy provides a means to do this, known as [common subexpression elimination](https://en.wikipedia.org/wiki/Common_subexpression_elimination), or CSE.

CSE algorithms search for common patterns within expressions and declare them as new variables, so they need not be computed again. To call SymPy's CSE algorithm, we need only pass the expression to [sp.cse()](http://docs.sympy.org/latest/modules/simplify/simplify.html#sympy.simplify.cse_main.cse):

In [None]:
print(sp.cse(x))

As you can see, SymPy returned a tuple with two elements. The first element is a list of replacement pairs; its single entry, $(\texttt{x0, sin(2*a)})$, indicates that a new variable $\texttt{x0}$ should be set to $\texttt{sin(2*a)}$. The second element is a list containing our original expression $x$, rewritten in terms of the original variables and the new variable $\texttt{x0}$.

In other words, the common subexpression is
$$\texttt{x0} = \sin(2a),$$
so that the final expression $x$ is given by
$$x = \texttt{pow(b,2)}*\texttt{x0} + \texttt{c}/\texttt{x0}.$$

Thus, at the cost of a new variable assignment, SymPy's CSE has decreased the computational cost by one multiplication and one sin() function call.
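As a minimal sketch of how the two parts of the `sp.cse()` result could be turned into C statements by hand, consider the following. This is not how NRPy does it internally; the `const double` declarations and the output name `x` are illustrative choices.

```python
import sympy as sp

a, b, c = sp.symbols("a b c")
x = b**2 * sp.sin(2 * a) + c / sp.sin(2 * a)

# sp.cse() returns (replacements, reduced_exprs):
#   replacements  : list of (new_symbol, subexpression) pairs
#   reduced_exprs : the input expression(s) rewritten using the new symbols
replacements, reduced_exprs = sp.cse(x)

# Emit one C statement per common subexpression, then the final assignment.
c_lines = [f"const double {lhs} = {sp.ccode(rhs)};" for lhs, rhs in replacements]
c_lines.append(f"const double x = {sp.ccode(reduced_exprs[0])};")

print("\n".join(c_lines))
```

The temporaries must be emitted before the final assignment, since the reduced expression refers to them.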

NRPy makes full use of SymPy's CSE algorithm in generating optimized C codes, and in addition automatically adjusts expressions like `pow(x,2)` into `((x)*(x))`.
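To illustrate the idea of replacing `pow(x,2)` with explicit products, here is a small sketch built on SymPy's own C printer. The class name `SmallPowPrinter` and the cutoff of exponents 2 through 4 are assumptions made for this example; NRPy's actual rewriting logic may differ.

```python
import sympy as sp
from sympy.printing.c import C99CodePrinter


class SmallPowPrinter(C99CodePrinter):
    """Hypothetical printer: expands x**n into products for n = 2, 3, 4."""

    def _print_Pow(self, expr):
        base, exponent = expr.base, expr.exp
        if exponent.is_Integer and 2 <= exponent <= 4:
            factor = f"({self._print(base)})"
            return "(" + "*".join([factor] * int(exponent)) + ")"
        # Everything else falls back to SymPy's default handling.
        return super()._print_Pow(expr)


b = sp.Symbol("b")
printer = SmallPowPrinter()
print(sp.ccode(b**2))         # SymPy default: pow(b, 2)
print(printer.doprint(b**2))  # expanded form: ((b)*(b))
```

Wrapping each factor in parentheses keeps the output correct when the base is itself a compound expression such as `a + b`.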

*Caveat: In order for a CSE to function optimally, it needs to know something about the cost of basic mathematical operations versus the cost of declaring a new variable. SymPy's CSE algorithm does not make any assumptions about cost, instead opting to declare new variables any time a common pattern is found more than once. The degree to which this is suboptimal is unclear.*
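One rough way to gauge what CSE buys is SymPy's `count_ops()`, sketched below. Note that `count_ops()` weights every operation equally, so it does not capture the fact that a `sin()` call costs far more than an addition; treat the numbers as a coarse indicator only.

```python
import sympy as sp

a, b, c = sp.symbols("a b c")
x = b**2 * sp.sin(2 * a) + c / sp.sin(2 * a)

replacements, reduced_exprs = sp.cse(x)

# Operation count of the original expression.
ops_before = sp.count_ops(x)

# Operation count after CSE: every temporary plus the rewritten expression.
ops_after = sum(sp.count_ops(rhs) for _, rhs in replacements)
ops_after += sum(sp.count_ops(e) for e in reduced_exprs)

print(f"operations before CSE: {ops_before}, after CSE: {ops_after}")
```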

# Step 3: **Let's generate some C code!** NRPy's core C code output routine, `c_codegen()`
### \[Back to [top](#Table-of-Contents)\]

NRPy's `c_codegen()` function is the primary interface for converting SymPy expressions into optimized C code. It builds upon SymPy's `ccode()` and `cse()` functions and adds many features, including SIMD intrinsics, finite-difference stencil generation, and a variety of output customization options.

The basic signature is:
```python
c_codegen(sympyexpr, output_varname_str, **kwargs)
```

`c_codegen()` requires at least two arguments:
+ **sympyexpr** : a SymPy expression or a list of SymPy expressions.
+ **output_varname_str** : a string (or list of strings) specifying the variable name(s) to which the result(s) should be assigned.

Many optional keyword arguments control the details of the generated code. The most commonly used ones are:

- `prestring`, `poststring` : strings inserted before and after the generated code block.
- `include_braces` : whether to enclose the code in curly braces (default: `True`).
- `fp_type` : floating-point type (`"double"`, `"float"`, `"long double"`, `"double complex"`; default: `"double"`).
- `fp_type_alias` : an alternative name for the type (e.g., `"REAL"` in BHaH). Usually set automatically.
- `verbose` : whether to print a comment with the original SymPy expression (default: `True`).
- `enable_cse` : enable common subexpression elimination (default: `True`).
- `cse_varprefix` : prefix for CSE temporary variables (default: `""`).
- `enable_cse_preprocess` : perform additional factorization before CSE (default: `False`).
- `enable_simd` : generate SIMD intrinsics (default: `False`).
- `simd_find_more_FMAsFMSs` : try to detect more fused multiply-add/sub opportunities (default: `True` when SIMD enabled).
- `enable_fd_codegen` : generate finite-difference stencils and memory reads (default: `False`).
- `automatically_read_gf_data_from_memory` : automatically insert reads of grid functions from memory (default: `False`).
- `mem_alloc_style` : memory indexing order (`"210"` or `"012"`; default: `"210"`).
- `upwind_control_vec` : a list of symbols used for upwinding control.
- `SCALAR_TMP_varnames`, `SCALAR_TMP_sympyexprs` : enforce evaluation order by defining temporary variables.
- `postproc_substitution_dict` : substitute symbol names after code generation.
- `enable_clang_format` : run the output through `clang-format` (default: `False`).

Below we show the simplest usage, letting `c_codegen()` handle CSE automatically.

In [None]:
# Step 3: NRPy's C code output routine, `c_codegen()`

# Declare some variables, using SymPy's symbols() function
a,b,c = sp.symbols("a b c")

# Set x = b^2*sin(2*a) + c/sin(2*a).
x = b**2*sp.sin(2*a) + c/(sp.sin(2*a))

print(ccg.c_codegen(x,"x"))

# Step 4: Warp speed! SIMD (Single Instruction, Multiple Data) in NRPy-Generated C Code
### \[Back to [top](#Table-of-Contents)\]

Taking advantage of a CPU's SIMD instruction set can yield substantial performance boosts, but only when the workload consists of a large data set on which the same operations can be performed in parallel. SIMD enables the computation of multiple elements of the data set at once.

For example, given the expression 
$$\texttt{double x = a*b},$$ 
where $\texttt{double}$ precision variables $\texttt{a}$ and $\texttt{b}$ vary at each point on a computational grid, AVX compiler intrinsics will enable the multiplication at *four* grid points *each clock cycle*, *on each CPU core*. Therefore, without these intrinsics, the computation might take four times longer. Compilers can sometimes be smart enough to "vectorize" the loops over data, but when the mathematical expressions become too complex (e.g., in the context of numerically solving Einstein's equations of general relativity), the compiler will simply give up and refuse to enable SIMD vectorization.

As SIMD intrinsics can differ from one CPU to another, and even between compilers, NRPy outputs generic C macros for common arithmetic operations and transcendental functions. In this way, the C code's Makefile can select the optimal SIMD intrinsics for the given CPU's instruction set and compiler. For example, most modern CPUs support [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions), and a majority support up to [AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2), while some support up to [AVX512](https://en.wikipedia.org/wiki/AVX-512) instruction sets. For a full list of compiler intrinsics, see the [official Intel SIMD intrinsics documentation](https://software.intel.com/sites/landingpage/IntrinsicsGuide/).

To see how this works, let's return to our NRPy `c_codegen()` CSE example above, but this time enabling SIMD intrinsics:

In [None]:
# Step 4: Enable SIMD

# Declare some variables, using SymPy's symbols() function
a,b,c = sp.symbols("a b c")

# Set x = b^2*sin(2*a) + c/sin(2*a).
x = b**2*sp.sin(2*a) + c/(sp.sin(2*a))

print(ccg.c_codegen(x,"x", enable_simd=True))

The above SIMD code does the following.
* First, it fills a constant SIMD array of type `REAL_SIMD_ARRAY` with the double-precision value 2.0, converted from the integer 2. The larger C code in which the above-generated code will be embedded should automatically `#define REAL_SIMD_ARRAY` to, e.g., `__m256d` or `__m512d` for AVX or AVX-512, respectively. In other words, AVX intrinsics will need to set 4 double-precision variables in `REAL_SIMD_ARRAY` to 2.0, and AVX-512 intrinsics will need to set 8.
* Then, it changes all arithmetic operations to be in the form of SIMD "functions", which are in fact `#define`'d in the larger C code as compiler intrinsics.

`FusedMulAddSIMD(a,b,c)` performs a fused-multiply-add operation (i.e., `FusedMulAddSIMD(a,b,c)`=$a*b+c$), which can be performed on many CPUs nowadays (with FMA or AVX-512 instruction support) in a *single clock cycle*, at nearly the same expense as a single addition or multiplication.
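To make the $a*b+c$ pattern concrete, the following sketch recognizes only the simplest case, a two-term sum in which one term is a two-factor product, and prints it as a `FusedMulAddSIMD` call. The helper name `to_fma_string` is hypothetical, and NRPy's own pattern matching is considerably more general.

```python
import sympy as sp


def to_fma_string(expr):
    """Return 'FusedMulAddSIMD(p, q, r)' if expr has the form p*q + r, else None."""
    if expr.is_Add and len(expr.args) == 2:
        for i, term in enumerate(expr.args):
            if term.is_Mul and len(term.args) == 2:
                p, q = term.args
                r = expr.args[1 - i]  # the other term of the sum
                return f"FusedMulAddSIMD({p}, {q}, {r})"
    return None


a, b, c = sp.symbols("a b c")
print(to_fma_string(a * b + c))  # matches the p*q + r pattern
print(to_fma_string(a + b))      # no product term, so None
```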

Note that it is assumed that the SIMD code exists within a suitable set of nested loops, in which the innermost loop increments every 4 in the case of AVX double precision or 8 in the case of AVX-512 double precision.
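The loop structure described above can be mimicked in plain Python, purely as an illustration of the stride. Here `SIMD_WIDTH` and `multiply_in_simd_chunks` are names invented for this sketch; in real generated code each chunk would be handled by a single vector instruction rather than a Python list comprehension.

```python
SIMD_WIDTH = 4  # doubles per AVX register; this would be 8 for AVX-512


def multiply_in_simd_chunks(a, b):
    """Multiply two equal-length lists, SIMD_WIDTH elements per loop iteration."""
    if len(a) != len(b) or len(a) % SIMD_WIDTH != 0:
        raise ValueError("lengths must match and be a multiple of SIMD_WIDTH")
    out = [0.0] * len(a)
    # The loop index advances by SIMD_WIDTH, not by 1.
    for i in range(0, len(a), SIMD_WIDTH):
        # Stand-in for one vector multiply acting on all lanes at once.
        out[i:i + SIMD_WIDTH] = [a[i + lane] * b[i + lane] for lane in range(SIMD_WIDTH)]
    return out


print(multiply_in_simd_chunks([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [2.0] * 8))
```

The length check reflects a practical constraint: the innermost grid dimension is typically padded so that it is a multiple of the SIMD width.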

As an additional note, NRPy's SIMD routines are aware that the C `pow(x,y)` function is needlessly expensive when $|\texttt{y}|$ is a small integer. In such cases NRPy automatically converts the expression into either repeated multiplications of x or one over repeated multiplications of x, as follows (notice there are no calls to `PowSIMD()` intrinsics!):

In [None]:
# SIMD handling of integer powers and rational exponents
x = b**2 + a**(-3) + c*a**(sp.Rational(1,2))
print(ccg.c_codegen(x,"x", enable_simd=True))

For those who would like to maximize fused-multiply-adds (FMAs) and fused-multiply-subtracts (FMSs), NRPy has more advanced pattern matching, which can be enabled via the `simd_find_more_FMAsFMSs=True` option. **Note that finding more FMAs and FMSs may actually degrade performance, and the default behavior is found to be optimal on x86_64 CPUs.** In the below example, notice that the more advanced pattern matching finds another FMA:

In [None]:
print("// simd_find_more_FMAsFMSs=True:\n// searches for more FMAs/FMSs, which has been found to degrade performance on some CPUs:")
print(ccg.c_codegen(x,"x", enable_simd=True, simd_find_more_FMAsFMSs=True))

# Step 5: Customizing Output with `c_codegen` Options
### \[Back to [top](#Table-of-Contents)\]

The `c_codegen` function accepts many optional parameters to tailor the generated C code. Below we demonstrate a few of them.

In [None]:
# Example 5a: Suppress braces and verbose comment
print(ccg.c_codegen(x, "double result", include_braces=False, verbose=False))

# Example 5b: Add a pre-string and post-string
print(ccg.c_codegen(x, "double result", prestring="  // Compute result\n", poststring="  // Done\n", include_braces=False))

# Example 5c: Change floating-point type to float (uses sinf, sqrtf, etc.)
print(ccg.c_codegen(sp.sin(a), "float s", fp_type="float", include_braces=False, verbose=False))

# Example 5d: Custom prefix for CSE temporary variables
print(ccg.c_codegen(x, "double result", cse_varprefix="mycse_", include_braces=False, verbose=False))

# Step 6: Advanced CSE: Preprocessing and Custom Prefixes
### \[Back to [top](#Table-of-Contents)\]

When `enable_cse_preprocess=True`, NRPy performs additional factorization on the SymPy expressions before applying CSE. This can further reduce the operation count, especially for expressions involving rational coefficients. The preprocessing step also introduces symbols for common numeric constants (like 1/2) to avoid repeated divisions.

In [None]:
# Create a more complex expression
a,b,c = sp.symbols("a b c")
expr = b**2 * sp.sin(2*a) + c/sp.sin(2*a) + (a+b)**3 - (a+b)**2

print("Without preprocessing:")
print(ccg.c_codegen(expr, "double y", include_braces=False, verbose=False))

print("\nWith preprocessing (enable_cse_preprocess=True):")
print(ccg.c_codegen(expr, "double y", enable_cse_preprocess=True, include_braces=False, verbose=False))
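The idea of introducing symbols for rational constants can be sketched with plain SymPy as follows. The helper `symbolize_rationals` and the `Rational_p_q` naming scheme are assumptions made for this illustration and are not taken from NRPy's preprocessing code.

```python
import sympy as sp


def symbolize_rationals(expr, prefix="Rational_"):
    """Replace each non-integer rational in expr with a named symbol."""
    mapping = {}
    for r in expr.atoms(sp.Rational):
        if r.is_Integer:
            continue  # integers are cheap; leave them in place
        sign = "m" if r < 0 else ""  # 'm' marks a negative value
        mapping[r] = sp.Symbol(f"{prefix}{sign}{abs(r.p)}_{r.q}")
    return expr.xreplace(mapping), mapping


a, b, c = sp.symbols("a b c")
expr = a / 2 + b / 2 + c / 3

new_expr, mapping = symbolize_rationals(expr)
print(new_expr)
print(mapping)
```

Once the rationals are symbols, each one can be declared a single time as a constant in the generated C code, and CSE can treat it like any other variable.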

# Step 7: Enforcing Evaluation Order using `SCALAR_TMP`
### \[Back to [top](#Table-of-Contents)\]

Sometimes you need to ensure that certain subexpressions are computed first and stored in named temporary variables. The `SCALAR_TMP_varnames` and `SCALAR_TMP_sympyexprs` parameters allow you to define such temporaries. You must include the temporary definitions as separate entries in the expression list and mark them as SCALAR_TMP. The left-hand side of each temporary definition is taken from `SCALAR_TMP_sympyexprs`, while the right-hand side is the corresponding expression. The main expression can then reference the temporary symbols.

In [None]:
# Step 7: Enforcing Evaluation Order using SCALAR_TMP

a,b,c = sp.symbols('a b c')
tmp_symb = sp.Symbol('tmp')
# Define two expressions: the temporary (tmp = a+b) and the final result (result = tmp * c)
expr_list = [a+b, tmp_symb * c]
output_list = ['tmp', 'result']
SCALAR_TMP_varnames = ['tmp']          # mark the first output as a temporary
SCALAR_TMP_sympyexprs = [tmp_symb]    # left-hand side symbol for the equation

print(ccg.c_codegen(expr_list, output_list,
                    SCALAR_TMP_varnames=SCALAR_TMP_varnames,
                    SCALAR_TMP_sympyexprs=SCALAR_TMP_sympyexprs,
                    include_braces=False, verbose=False))

# Step 8: Post-processing Symbol Substitutions
### \[Back to [top](#Table-of-Contents)\]

If you need to rename certain symbols in the generated code (for example, to add a suffix), you can provide a `postproc_substitution_dict`. The keys are the original symbol names (as strings), and the values are the substrings to append to those names. This substitution is applied after all other processing.

Here we replace every occurrence of symbol `a` with `a_new` in the output.

In [None]:
# Step 8: Post-processing substitution
a,b = sp.symbols('a b')
expr = a + b
print(ccg.c_codegen(expr, "result",
                    postproc_substitution_dict={'a': '_new'},
                    include_braces=False, verbose=False))
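The suffix-appending behavior described above can be imitated at the SymPy level, as in this sketch. The function `append_suffixes` is a hypothetical stand-in written for illustration; it is not NRPy's implementation.

```python
import sympy as sp


def append_suffixes(expr, suffix_dict):
    """Rename symbols whose names are keys of suffix_dict by appending the value."""
    mapping = {
        sym: sp.Symbol(sym.name + suffix_dict[sym.name])
        for sym in expr.free_symbols
        if sym.name in suffix_dict
    }
    return expr.xreplace(mapping)


a, b = sp.symbols("a b")
renamed = append_suffixes(a + b, {"a": "_new"})
print(sp.ccode(renamed))
```

Renaming whole symbols, rather than doing a text search-and-replace on the generated C string, avoids accidentally altering other identifiers that merely contain the letter `a`.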

# Step 9: Automatic Code Formatting with Clang
### \[Back to [top](#Table-of-Contents)\]

Finally, if you have `clang-format` installed, you can ask `c_codegen` to format the output by setting `enable_clang_format=True`. This can be useful for producing clean, readable code that adheres to a specific style.

In [None]:
# Step 9: Automatic Code Formatting with Clang
x = a**2 + b**2
print(ccg.c_codegen(x, "double norm2", enable_clang_format=True))
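For readers who want to format C code outside of `c_codegen`, a minimal sketch using only the Python standard library is shown below. The helper name `format_c_code` and the choice of the LLVM style are assumptions for this example; the function returns the input unchanged when `clang-format` is not found on the `PATH`.

```python
import shutil
import subprocess


def format_c_code(code, style="LLVM"):
    """Pipe C code through clang-format if available; otherwise return it as-is."""
    if shutil.which("clang-format") is None:
        return code
    result = subprocess.run(
        ["clang-format", f"-style={style}"],
        input=code,           # clang-format reads stdin when no file is given
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


raw = "{const double norm2=a*a+b*b;}"
print(format_c_code(raw))
```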