# C Output and Parameter Interfaces

## Author: Zach Etienne
### Formatting improvements courtesy Brandon Clark

## Exploring C output and parameter interfaces in NRPy, this notebook initializes core Python/NRPy modules, performs common subexpression elimination (CSE), and generates C code. It further delves into the NRPy parameter interface and demonstrates how Single Instruction, Multiple Data (SIMD) paradigms can optimize NRPy generated C code.

### Required reading if you are unfamiliar with programming or [computer algebra systems](https://en.wikipedia.org/wiki/Computer_algebra_system ). Otherwise, use for reference; you should be able to pick up the syntax as you follow the tutorial.
+ **[Python Tutorial](https://docs.python.org/3/tutorial/index.html )**
+ **[SymPy Tutorial](http://docs.sympy.org/latest/tutorial/intro.html )**

### NRPy Source Code for this module:  
* [c_codegen.py](../edit/c_codegen.py)
* [NRPy_param_funcs.py](../edit/NRPy_param_funcs.py)
* [SIMD.py](../edit/SIMD.py)

# Table of Contents

The module is organized as follows:

1. [Step 1](#Step-1:-Initialize-core-Python/NRPy-modules): Initialize core Python/NRPy modules
1. [Step 2](#Step-2:-Common-Subexpression-Elimination-(CSE)): Common Subexpression Elimination (CSE)
1. [Step 3](#Step-3:-Let's-generate-some-C-code!-NRPy's-core-C-code-output-routine,-c_codegen()): **Let's generate some C code!** NRPy's core C code output routine, `c_codegen()`
1. [Step 4](#Step-4:-Warp-speed!-SIMD-(Single-Instruction,-Multiple-Data)-in-NRPy-Generated-C-Code): **Warp speed!** SIMD (Single Instruction, Multiple Data) in NRPy-Generated C Code
1. [Step 5](#Step-5:-Floating-point-precision-types-in-NRPy): Floating-point precision types in NRPy
1. [Step 6](#Step-6:-Integration-with-NRPy's-parameter-interface): Integration with NRPy's parameter interface
1. [Step 7](#Step-7:-Advanced-SIMD-optimizations): Advanced SIMD optimizations
1. [Step 8](#Step-8:-Code-formatting-and-customization-options): Code formatting and customization options

# Step 1: Initialize core Python/NRPy modules
### \[Back to [top](#Table-of-Contents)\]

Let's start by importing all the needed modules from Python/NRPy for dealing with parameter interfaces and outputting C code.

In [None]:
# Step 1: Initialize core Python/NRPy modules
import nrpy.c_codegen as ccg    # NRPy: Core C code output module
import nrpy.params as par       # NRPy: parameter interface
import sympy as sp              # SymPy: The Python computer algebra package upon which NRPy depends

# Step 2: Common Subexpression Elimination (CSE)
### \[Back to [top](#Table-of-Contents)\]

Let's begin with a simple [SymPy](http://www.sympy.org/ ) worksheet that uses SymPy's built-in C code generator function, [`ccode()`](http://docs.sympy.org/dev/modules/utilities/codegen.html ), to evaluate the expression $x = b^2 \sin (2a) + \frac{c}{\sin (2a)}$.

In [None]:
# Step 2: Common Subexpression Elimination

# Declare some variables, using SymPy's symbols() function
a,b,c = sp.symbols("a b c")

# Set x = b^2*sin(2*a) + c/sin(2*a).
x = b**2*sp.sin(2*a) + c/(sp.sin(2*a))

# Convert the expression into C code
sp.ccode(x)

Computation of this expression in C requires three multiplications, one `pow()` call, one division, two `sin()` function calls, and one addition. Multiplications, additions, and subtractions typically require one clock cycle per SIMD element on a modern CPU, while divisions can require ~3x longer, and transcendental functions ~20x longer than additions or multiplications (see, e.g., [this page](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX&expand=118 ), [this page](http://www.agner.org/optimize/microarchitecture.pdf ), or [this page](http://nicolas.limare.net/pro/notes/2014/12/16_math_speed/ ) for more details).
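
The tally above can be cross-checked with SymPy itself. The following is a minimal sketch that uses only standard SymPy calls (`preorder_traversal()` and `count_ops()`) and is independent of NRPy; the variable names `sin_calls` and `op_tally` are chosen here purely for illustration.

```python
import sympy as sp

a, b, c = sp.symbols("a b c")
x = b**2*sp.sin(2*a) + c/sp.sin(2*a)

# Walk the full expression tree (repeated subtrees are visited each time
# they occur) and count how many nodes are sin() calls.
sin_calls = sum(1 for node in sp.preorder_traversal(x) if isinstance(node, sp.sin))

# With visual=True, count_ops() returns a symbolic tally that has one term
# per kind of operation instead of a single integer.
op_tally = sp.count_ops(x, visual=True)

print("sin() calls:", sin_calls)
print("operation tally:", op_tally)
```

Note that SymPy's own bookkeeping may group operations slightly differently from a hand count, so the tally is best read as a sanity check rather than a cycle-accurate cost model.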

One goal in generating C code involving mathematical expressions in NRPy is to minimize the number of floating point operations, and SymPy provides a means to do this, known as [common subexpression elimination](https://en.wikipedia.org/wiki/Common_subexpression_elimination ), or CSE.

CSE algorithms search for common patterns within expressions and declare them as new variables, so they need not be computed again. To call SymPy's CSE algorithm, we need only pass the expression to [sp.cse()](http://docs.sympy.org/latest/modules/simplify/simplify.html#sympy.simplify.cse_main.cse):

In [None]:
print(sp.cse(x))

As you can see, SymPy returned a tuple with two elements. The first element is a list of replacements; its single entry, $(\texttt{x0, sin(2*a)})$, indicates that a new variable $\texttt{x0}$ should be set to $\texttt{sin(2*a)}$. The second element is a list containing our original expression $x$, rewritten in terms of the original variables and the new variable $\texttt{x0}$.

$$\texttt{x0 = sin(2*a)}$$ is the common subexpression, so that the final expression $x$ is given by $$\texttt{x = pow(b,2)*x0 + c/x0}.$$

Thus, at the cost of a new variable assignment, SymPy's CSE has decreased the computational cost by one multiplication and one sin() function call.
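
To make the connection to C output concrete, the sketch below turns the result of `sp.cse()` into a sequence of C assignments using plain `sp.ccode()`. It is an illustration built only on SymPy, not NRPy's actual code path; the `const double` declarations and the output variable name `x` are assumptions made for this example.

```python
import sympy as sp

a, b, c = sp.symbols("a b c")
x = b**2*sp.sin(2*a) + c/sp.sin(2*a)

# cse() returns (replacements, reduced_exprs):
#   replacements  : list of (new_symbol, subexpression) pairs
#   reduced_exprs : the input expression(s) rewritten using the new symbols
replacements, reduced = sp.cse(x)

# Emit one C declaration per common subexpression, then the final assignment.
c_lines = [f"const double {sym} = {sp.ccode(sub)};" for sym, sub in replacements]
c_lines.append(f"double x = {sp.ccode(reduced[0])};")

print("\n".join(c_lines))
```

Substituting each replacement back into the reduced expression recovers the original `x`, which is a quick way to convince yourself that CSE changes only the cost of evaluation and not the value.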

NRPy makes full use of SymPy's CSE algorithm in generating optimized C code, and in addition automatically adjusts expressions like `pow(x,2)` into `((x)*(x))`.

*Caveat: In order for a CSE to function optimally, it needs to know something about the cost of basic mathematical operations versus the cost of declaring a new variable. SymPy's CSE algorithm does not make any assumptions about cost, instead opting to declare new variables any time a common pattern is found more than once. The degree to which this is suboptimal is unclear.*

# Step 3: **Let's generate some C code!** NRPy's core C code output routine, `c_codegen()` 
### \[Back to [top](#Table-of-Contents)\]

NRPy's `c_codegen()` function provides the core of NRPy functionality. It builds upon SymPy's `ccode()` and `cse()` functions and adds the ability to generate [SIMD](https://en.wikipedia.org/wiki/SIMD ) [compiler intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/ ) for modern Intel and AMD-based CPUs. 

As `c_codegen()` is at the heart of NRPy, it will be useful to understand how it is called:

```python
c_codegen(sympyexpr, output_varname_str, **kwargs)
```

`c_codegen()` requires at least two arguments: 
+ **sympyexpr** is a SymPy expression or a list of SymPy expressions
+ **output_varname_str** is the variable name to assign the SymPy expression, or alternatively the list of variable names to assign the SymPy expressions. If a list is provided, it must be the same length as the list of SymPy expressions.

Additional keyword arguments include:
+ **fp_type**: Specifies the floating-point type ("double", "float", "long double", "double complex")
+ **include_braces**: Wrap the C output expression in curly braces (True/False)
+ **verbose**: Output a comment block displaying the input SymPy expressions (True/False)
+ **enable_cse**: Enable common-subexpression elimination (True/False)
+ **cse_varprefix**: Prefix for CSE temporary variables (string)
+ **enable_simd**: Generate SIMD intrinsics (True/False)
+ **prestring**: String to include before the code
+ **poststring**: String to include after the code

Let's explore these options:

In [None]:
# Step 3: NRPy's C code output routine, `c_codegen()`

# Declare some variables, using SymPy's symbols() function
a,b,c = sp.symbols("a b c")

# Set x = b^2*sin(2*a) + c/sin(2*a).
x = b**2*sp.sin(2*a) + c/(sp.sin(2*a))

# Basic usage with default parameters
print("## Basic usage:")
print(ccg.c_codegen(x,"x"))

# With custom CSE variable prefix and no braces
print("\n## Custom CSE prefix, no braces:")
print(ccg.c_codegen(x,"double y", cse_varprefix="temp_", include_braces=False))

# With verbose output disabled
print("\n## Non-verbose output:")
print(ccg.c_codegen(x,"double z", verbose=False))
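
For readers without NRPy at hand, the calling convention described above (a single expression or a list of expressions, paired with one output name each) can be imitated with SymPy alone. The function `mini_codegen()` below is a hypothetical stand-in written for this tutorial; it mirrors only the `cse_varprefix` and `include_braces` ideas and makes no claim to reproduce the output format of `c_codegen()`.

```python
import sympy as sp

def mini_codegen(exprs, names, cse_varprefix="tmp", include_braces=True):
    """Hypothetical, SymPy-only imitation of a CSE-based C code generator."""
    # Accept either a single expression/name or matching lists of both.
    if not isinstance(exprs, (list, tuple)):
        exprs, names = [exprs], [names]
    if len(exprs) != len(names):
        raise ValueError("need exactly one output name per expression")

    # Run CSE over all expressions at once so shared pieces are computed once.
    replacements, reduced = sp.cse(exprs, symbols=sp.numbered_symbols(cse_varprefix))

    lines = [f"const double {sym} = {sp.ccode(sub)};" for sym, sub in replacements]
    lines += [f"{name} = {sp.ccode(expr)};" for name, expr in zip(names, reduced)]
    body = "\n".join(lines)
    return "{\n" + body + "\n}" if include_braces else body

a, b, c = sp.symbols("a b c")
x = b**2*sp.sin(2*a) + c/sp.sin(2*a)
y = sp.sin(2*a) + c

code = mini_codegen([x, y], ["double x", "double y"])
print(code)
```

Because both expressions contain `sin(2*a)`, passing them together lets CSE hoist the shared call into a single temporary, which is the main reason to hand a list to a code generator rather than calling it once per expression.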

# Step 4: Warp speed! SIMD (Single Instruction, Multiple Data) in NRPy-Generated C Code
### \[Back to [top](#Table-of-Contents)\]

Taking advantage of a CPU's SIMD instruction set can yield substantial performance boosts, but only when the same operations must be applied to a large data set whose elements can be processed in parallel. SIMD instructions then compute several elements of the data set at once.

For example, given the expression 
$$\texttt{double x = a*b},$$ 
where $\texttt{double}$ precision variables $\texttt{a}$ and $\texttt{b}$ vary at each point on a computational grid, AVX compiler intrinsics will enable the multiplication computation at *four* grid points *each clock cycle*, *on each CPU core*. Therefore, without these intrinsics, the computation might take four times longer. Compilers can sometimes be smart enough to "vectorize" the loops over data, but when the mathematical expressions become too complex (e.g., in the context of numerically solving Einstein's equations of general relativity), the compiler will simply give up and refuse to enable SIMD vectorization.

As SIMD intrinsics can differ from one CPU to another, and even between compilers, NRPy outputs generic C macros for common arithmetic operations and transcendental functions. In this way, the C code's Makefile can select the best SIMD intrinsics for the given CPU's instruction set and compiler. For example, most modern CPUs support [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions ), and a majority support up to [AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2 ), while some support up to [AVX512](https://en.wikipedia.org/wiki/AVX-512 ) instruction sets. For a full list of compiler intrinsics, see the [official Intel SIMD intrinsics documentation](https://software.intel.com/sites/landingpage/IntrinsicsGuide/ ).

**Important**: Before using SIMD, we must set the `Infrastructure` parameter to enable proper SIMD type handling.

In [None]:
# Step 4: Taking Advantage of SIMD (Single Instruction, Multiple Data) in NRPy-Generated C Code

# Set Infrastructure parameter for SIMD support
par.set_parval_from_str("Infrastructure", "BHaH")

# Declare some variables, using SymPy's symbols() function
a,b,c = sp.symbols("a b c")

# Set x = b^2*sin(2*a) + c/sin(2*a).
x = b**2*sp.sin(2*a) + c/(sp.sin(2*a))

print(ccg.c_codegen(x,"REAL_SIMD_ARRAY x", enable_simd=True))

The above SIMD code does the following.
* First, it converts the integer 2 to the double-precision value 2.0 and fills a constant SIMD array of type `REAL_SIMD_ARRAY` with it. The larger C code in which the above-generated code will be embedded should automatically `#define REAL_SIMD_ARRAY` to, e.g., `__m256d` or `__m512d` for AVX or AVX512, respectively. In other words, AVX intrinsics will need to set 4 double-precision variables in `REAL_SIMD_ARRAY` to 2.0, and AVX-512 intrinsics will need to set 8.
* Then it changes all arithmetic operations to be in the form of SIMD "functions", which are in fact #define'd in the larger C code as compiler intrinsics. 
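
As a rough illustration of how such `#define`s might be produced, the sketch below generates a small C header for a chosen instruction set. The Intel intrinsic names (`_mm256_mul_pd`, `_mm512_fmadd_pd`, and so on) are real, but the table `SIMD_BACKENDS`, the function `simd_header()`, and the macro names `ConstSIMD`, `AddSIMD`, and `MulSIMD` are assumptions made for this example; only `REAL_SIMD_ARRAY` and `FusedMulAddSIMD` appear in the text above.

```python
# Hypothetical lookup table: vector type, lane count (doubles per register),
# and intrinsic prefix for each instruction set.
SIMD_BACKENDS = {
    "AVX":    {"vector_type": "__m256d", "lanes": 4, "prefix": "_mm256"},
    "AVX512": {"vector_type": "__m512d", "lanes": 8, "prefix": "_mm512"},
}

def simd_header(isa):
    """Return C preprocessor lines mapping generic macros to intrinsics."""
    cfg = SIMD_BACKENDS[isa]
    p = cfg["prefix"]
    return "\n".join([
        f"#define REAL_SIMD_ARRAY {cfg['vector_type']}",
        f"// {cfg['lanes']} doubles per register",
        f"#define ConstSIMD(a) {p}_set1_pd(a)",
        f"#define AddSIMD(a, b) {p}_add_pd((a), (b))",
        f"#define MulSIMD(a, b) {p}_mul_pd((a), (b))",
        # The 256-bit fmadd intrinsic additionally requires FMA support.
        f"#define FusedMulAddSIMD(a, b, c) {p}_fmadd_pd((a), (b), (c))",
    ])

print(simd_header("AVX"))
```

Keeping the generated C code in terms of generic macros means the same output compiles against either header, so the instruction set becomes a build-time choice.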

`FusedMulAddSIMD(a,b,c)` performs a fused-multiply-add operation (i.e., `FusedMulAddSIMD(a,b,c)`=$a*b+c$), which can be performed on many CPUs nowadays (with FMA or AVX-512 instruction support) in a *single clock cycle*, at nearly the same expense as a single addition or multiplication.

Note that it is assumed that the SIMD code exists within a suitable set of nested loops, in which the innermost loop index increments by 4 in the case of AVX double precision or by 8 in the case of AVX-512 double precision.
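
The loop structure just described can be emulated in plain Python to show the data flow. This is only a conceptual sketch: the helper names `fused_mul_add_lanes()` and `strided_fma()` are invented for this example, and Python lists stand in for vector registers, so no actual SIMD instructions are issued.

```python
def fused_mul_add_lanes(a, b, c):
    """Emulate FusedMulAddSIMD on one 'register': lane-wise a*b + c.

    A hardware FMA rounds once; this emulation rounds after the multiply
    and again after the add, so results may differ in the last bit.
    """
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]

def strided_fma(a, b, c, lanes=4):
    """Innermost loop: advance by `lanes` grid points per iteration."""
    n = len(a)
    if n % lanes != 0:
        raise ValueError("grid size must be a multiple of the SIMD width")
    out = []
    for i in range(0, n, lanes):
        # Each slice plays the role of one REAL_SIMD_ARRAY load.
        out.extend(fused_mul_add_lanes(a[i:i + lanes], b[i:i + lanes], c[i:i + lanes]))
    return out

grid_a = [float(i) for i in range(1, 9)]   # 8 grid points
grid_b = [2.0] * 8
grid_c = [1.0] * 8

result_avx = strided_fma(grid_a, grid_b, grid_c, lanes=4)     # 2 iterations
result_avx512 = strided_fma(grid_a, grid_b, grid_c, lanes=8)  # 1 iteration
print(result_avx)
```

The two calls produce the same values; only the number of loop iterations changes with the SIMD width.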

As an additional note, NRPy's SIMD routines are aware that the C `pow(x,y)` function is needlessly expensive when $|\texttt{y}|$ is a small integer, since a few multiplications give the same result at far lower cost. NRPy will automatically convert such expressions into either multiplications of x or one-over multiplications of x, as follows (notice there are no calls to `PowSIMD()` intrinsics!):

In [None]:
# Declare some variables, using SymPy's symbols() function
a,b,c = sp.symbols("a b c")

# Set x = b^2 + a^(-3) + c*a^(1/2).
x = b**2 + a**(-3) + c*a**(sp.Rational(1,2))

print(ccg.c_codegen(x,"REAL_SIMD_ARRAY x", enable_simd=True))
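
The same idea can be sketched for ordinary (non-SIMD) C output by customizing SymPy's C printer. The class `SmallPowPrinter` below is a hypothetical illustration, not NRPy's implementation: it overrides the printer's `_print_Pow()` hook so that integer exponents with magnitude between 2 and 4 (an arbitrary cutoff chosen here) become explicit products, and defers to SymPy's default for everything else.

```python
import sympy as sp

try:
    from sympy.printing.c import C99CodePrinter       # SymPy >= 1.7
except ImportError:
    from sympy.printing.ccode import C99CodePrinter   # older SymPy releases

class SmallPowPrinter(C99CodePrinter):
    """Hypothetical printer: expand small integer powers into products."""

    def _print_Pow(self, expr):
        if expr.exp.is_Integer and 2 <= abs(int(expr.exp)) <= 4:
            n = int(expr.exp)
            factor = f"({self._print(expr.base)})"
            product = "(" + "*".join([factor] * abs(n)) + ")"
            # Negative exponents become one over the product.
            return product if n > 0 else f"(1.0/{product})"
        # Everything else (e.g. square roots) uses SymPy's default rule.
        return super()._print_Pow(expr)

a, b = sp.symbols("a b")
printer = SmallPowPrinter()
code = printer.doprint(b**2 + a**(-3))
sqrt_code = printer.doprint(sp.sqrt(a))
print(code)
print(sqrt_code)
```

Half-integer powers such as `a**(1/2)` fall through to the default printer, which emits a `sqrt()` call rather than `pow()`.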

# Step 5: Floating-point precision types in NRPy
### \[Back to [top](#Table-of-Contents)\]

NRPy's `c_codegen()` supports multiple floating-point precision types through the `fp_type` parameter. This allows generation of code for different numerical requirements, from single-precision floats to extended-precision long doubles, and even complex numbers. The default is double precision, but you can explicitly specify other types.

In [None]:
# Step 5: Floating-point precision types in NRPy

# Declare variables
x, y = sp.symbols("x y")
expr = sp.sin(x)/x + sp.cos(y)

# Generate code for different floating-point types
print("## Float (single precision):")
print(ccg.c_codegen(expr, "float result", fp_type="float", verbose=False))

print("\n## Long double (extended precision):")
print(ccg.c_codegen(expr, "long double result", fp_type="long double", verbose=False))

print("\n## Double complex:")
print(ccg.c_codegen(expr + sp.I*expr, "double complex result", fp_type="double complex", verbose=False))
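
One visible consequence of choosing a precision is which C math functions are called: `sinf()` for `float`, `sin()` for `double`, and `sinl()` for `long double`. The sketch below shows how SymPy's own `ccode()` exposes this through its `type_aliases` setting; it illustrates the SymPy mechanism only, and whether NRPy relies on it internally is not something this notebook states.

```python
import sympy as sp
from sympy.codegen.ast import real, float32, float80

x, y = sp.symbols("x y")
expr = sp.sin(x)/x + sp.cos(y)

# Default: 'real' maps to double precision, so plain sin()/cos() are emitted.
code_double = sp.ccode(expr)

# Remap SymPy's abstract 'real' type to single or extended precision.
code_float = sp.ccode(expr, type_aliases={real: float32})
code_long_double = sp.ccode(expr, type_aliases={real: float80})

print("double:     ", code_double)
print("float:      ", code_float)
print("long double:", code_long_double)
```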

# Step 6: Integration with NRPy's parameter interface
### \[Back to [top](#Table-of-Contents)\]

`c_codegen()` seamlessly integrates with NRPy's parameter interface, reading configuration values such as floating-point type and infrastructure settings automatically. When `fp_type` is set to "set by NRPyParameter par.parval_from_str('fp_type')", it retrieves the value from NRPy's parameter system.

In [None]:
# Step 6: Integration with NRPy's parameter interface

# Set a parameter value
par.set_parval_from_str("fp_type", "float")

# Declare a simple expression
x = sp.symbols("x")
expr = sp.exp(-x**2)

# The fp_type parameter can read from NRPy's parameter system
print("## Using parameter interface for fp_type:")
print(ccg.c_codegen(expr, "result", fp_type="set by NRPyParameter par.parval_from_str('fp_type')", verbose=False))

# Infrastructure parameter affects type aliases
par.set_parval_from_str("Infrastructure", "BHaH")
print("\n## BHaH infrastructure sets REAL alias:")
print(ccg.c_codegen(expr, "result", fp_type="float", verbose=False))

# Reset to default
par.set_parval_from_str("fp_type", "double")
par.set_parval_from_str("Infrastructure", "NRPy")
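
The deferred lookup described above can be pictured with a toy registry. Everything in the sketch below is hypothetical and written only to illustrate the pattern: the class `ToyParams`, its assumed default values, and the helper `resolve_fp_type()` are not NRPy code, although the method names deliberately echo `set_parval_from_str()` and `parval_from_str()`.

```python
class ToyParams:
    """Hypothetical miniature of a global parameter registry."""

    def __init__(self):
        # Assumed defaults, chosen only for this illustration.
        self._values = {"fp_type": "double", "Infrastructure": "BHaH"}

    def set_parval_from_str(self, name, value):
        if name not in self._values:
            raise KeyError(f"unregistered parameter: {name}")
        self._values[name] = value

    def parval_from_str(self, name):
        return self._values[name]

def resolve_fp_type(fp_type, params):
    """Return an explicit type as-is; resolve the sentinel via the registry."""
    if fp_type.startswith("set by NRPyParameter"):
        return params.parval_from_str("fp_type")
    return fp_type

toy_par = ToyParams()
toy_par.set_parval_from_str("fp_type", "float")

deferred = resolve_fp_type("set by NRPyParameter par.parval_from_str('fp_type')", toy_par)
explicit = resolve_fp_type("long double", toy_par)
print(deferred, "|", explicit)
```

The point of the sentinel is that a caller who passes nothing special picks up whatever the registry currently holds, while an explicit argument always wins.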

# Step 7: Advanced SIMD optimizations
### \[Back to [top](#Table-of-Contents)\]

NRPy provides several advanced SIMD optimization flags that can fine-tune performance. The `simd_find_more_FMAsFMSs` option enables more aggressive pattern matching for fused multiply-add/subtract operations. **Note**: SIMD debug verification has limitations with transcendental functions.

In [None]:
# Step 7: Advanced SIMD optimizations

# Re-enable SIMD infrastructure
par.set_parval_from_str("Infrastructure", "BHaH")

# Declare variables and expression (avoiding transcendental functions for SIMD)
a, b, c, d = sp.symbols("a b c d")
expr = a*b + c*d + a*c + b*d  # Rich in FMA opportunities

# Default SIMD optimization
print("## Default SIMD optimization:")
print(ccg.c_codegen(expr, "REAL_SIMD_ARRAY result", enable_simd=True, verbose=False))

# More aggressive FMA search (may or may not improve performance)
print("\n## Aggressive FMA search:")
print(ccg.c_codegen(expr, "REAL_SIMD_ARRAY result", enable_simd=True, 
                    simd_find_more_FMAsFMSs=True, verbose=False))

# Example with sqrt (supported SIMD function)
print("\n## SIMD with sqrt (supported function):")
print(ccg.c_codegen(a*sp.sqrt(b) + c*sp.sqrt(d), "REAL_SIMD_ARRAY result", 
                    enable_simd=True, verbose=False))
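
To give a feel for what searching for fused multiply-adds involves, here is a deliberately simple SymPy sketch. It is not the algorithm NRPy uses: the function `fuse_first_product()` is invented for this example, handles only the first two-factor product it finds in a sum, and represents the macro with an undefined SymPy function named `FusedMulAddSIMD`.

```python
import sympy as sp

# Undefined SymPy function standing in for the FusedMulAddSIMD macro.
FusedMulAddSIMD = sp.Function("FusedMulAddSIMD")

def fuse_first_product(expr):
    """Rewrite a sum p*q + rest as FusedMulAddSIMD(p, q, rest), if possible."""
    if not expr.is_Add:
        return expr
    terms = list(expr.args)
    for i, term in enumerate(terms):
        if term.is_Mul and len(term.args) == 2:
            # Everything except the chosen product becomes the addend.
            rest = sp.Add(*(terms[:i] + terms[i + 1:]))
            return FusedMulAddSIMD(term.args[0], term.args[1], rest)
    return expr

a, b, c = sp.symbols("a b c")
fused = fuse_first_product(a*b + c)

# Expanding the stand-in back to p*q + r should reproduce the input.
roundtrip = fused.replace(FusedMulAddSIMD, lambda p, q, r: p*q + r)
print(fused)
```

A production-quality search would apply the rewrite recursively and weigh competing matches, which is presumably where a more aggressive option can find additional fusions.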

# Step 8: Code formatting and customization options
### \[Back to [top](#Table-of-Contents)\]

NRPy provides additional options for customizing the generated code's appearance and structure. The `enable_clang_format` option applies automatic code formatting, while `postproc_substitution_dict` allows post-processing substitutions on variable names.

In [None]:
# Step 8: Code formatting and customization options

# Declare variables
x, y = sp.symbols("x y")
expr = x**3 + y**3

# Generate code with clang formatting
print("## With clang-format enabled:")
print(ccg.c_codegen(expr, "double result", enable_clang_format=True, verbose=False))

# Demonstrate post-processing substitution
print("\n## With variable name substitution:")
print(ccg.c_codegen(expr, "double result", 
                    postproc_substitution_dict={"x": "_input", "y": "_input"},
                    verbose=False))

# Custom prestring and poststring for code wrapping
print("\n## With custom prestring and poststring:")
print(ccg.c_codegen(x*y, "double result", 
                    prestring="// Begin computation\n",
                    poststring="// End computation\n",
                    include_braces=True, verbose=False))
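
The substitution example above passes `"_input"` for both `x` and `y`, which suggests that each dictionary value is a suffix appended to the matching symbol name. Under that assumption (it is an inference from the example, not a documented guarantee), the effect on a SymPy expression can be sketched as follows; `append_suffixes()` is a hypothetical helper written for this illustration.

```python
import sympy as sp

def append_suffixes(expr, suffix_by_name):
    """Rename free symbols by appending a per-symbol suffix (assumed semantics)."""
    mapping = {
        sym: sp.Symbol(sym.name + suffix_by_name[sym.name])
        for sym in expr.free_symbols
        if sym.name in suffix_by_name
    }
    # xreplace() performs an exact, simultaneous structural replacement.
    return expr.xreplace(mapping)

x, y = sp.symbols("x y")
expr = x**3 + y**3

renamed = append_suffixes(expr, {"x": "_input", "y": "_input"})
print(sp.ccode(renamed))
```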