# SWFFT
A Distribution object is instantiated from a parent MPI_Comm. It creates
and tracks the Cartesian communicators for the initial 3D distribution and
the three 2D pencil distributions. A Dfft object is then instantiated from
the Distribution object and coordinates the operations that execute the
3D distributed-memory DFFT. The Dfft instance also provides convenience
methods for accessing the communicators and geometric information of the MPI
distribution in "real space" (the initial 3D distribution) and "k space"
(2D pencils in z).

---
## Parameters
```
Build_Flags = '-g -O3 -march=native -lfftw3 -lm'
Run_Flags = 'mpirun -n 1 <.exe> 3 720'
```

---
## Roofline - Intel Haswell - 1 Thread - 2301.0 MHz
| GB/sec | L1 B/W | L2 B/W | L3 B/W | DRAM B/W |
|:---------|:------:|:------:|:------:|:--------:|
| **1 Thread per Node**   | 142.7  |  45.0  |  33.7  |   16.0   |

## Program Aggregate
| Experiment Aggregate Metrics | CPUTIME % | Inst/Cycle per Core | L1 DC Miss % | L2 DC Miss %  | L3 Miss % | L1 Loads/Cycle per Core | L2 B/W Used | L3 B/W Used  | DRAM B/W Used |
|:-----------------------------|:---------:|:-------------------:|:------------:|:-------------:|:---------:|:-----------------------:|:-----------:|:------------:|:-------------:|
| 1                            |  100.0 %   |       0.76          |         3.1% |         51.3% |     51.1% |                0.92     |       11.6% |        7.9%  |       8.5%    |

![](../assets/SWFFTBreakdown.png)

## [I] redistribute_2_and_3 - distribution.c line 1488 to line 1980
| redistribute_2_and_3 | CPUTIME % | Inst/Cycle per Core | L1 DC Miss % | L2 DC Miss %  | L3 Miss % | L1 Loads/Cycle per Core | L2 B/W Used | L3 B/W Used  | DRAM B/W Used |
|:---------------|:---------:|:-------------------:|:------------:|:-------------:|:---------:|:-----------------------:|:-----------:|:------------:|:-------------:|
| 1              |  49.4 %   |       0.51          |        27.8% |         60.8% |     50.4% |                0.11     |       18.3% |       14.9%  |      15.8%    |

```c
///
// redistribute between 2- and 3-d distributions.
//   a    input
//   b    output
//   d    distribution descriptor
//   dir  direction of redistribution
//
// This actually does the work.
///
static void redistribute_2_and_3(const complex_t *a,
                                 complex_t *b,
                                 distribution_t *d,
                                 int direction,
                                 int z_dim)
```

## t1_12 - generated from libfftw3
| t1_12.c | CPUTIME % | Inst/Cycle per Core | L1 DC Miss % | L2 DC Miss %  | L3 Miss % | L1 Loads/Cycle per Core | L2 B/W Used | L3 B/W Used  | DRAM B/W Used |
|:--------|:---------:|:-------------------:|:------------:|:-------------:|:---------:|:-----------------------:|:-----------:|:------------:|:-------------:|
| 1       |  16.3 %   |        1.2          |         0.1% |         11.5% |     12.0% |                1.97     |        1.1% |        0.2%  |       0.0%    |
```c
/* Generated by: ../../../genfft/gen_twiddle.native -compact -variables 4 -pipeline-latency 4 -n 12 -name t1_12 -include t.h */

/*
 * This function contains 118 FP additions, 60 FP multiplications,
 * (or, 88 additions, 30 multiplications, 30 fused multiply/add),
 * 47 stack variables, 2 constants, and 48 memory accesses
 */
#include "t.h"

static void t1_12(R *ri, R *ii, const R *W, stride rs, INT mb, INT me, INT ms)
```

## n1_15 - generated from libfftw3
| n1_15.c | CPUTIME % | Inst/Cycle per Core | L1 DC Miss % | L2 DC Miss %  | L3 Miss % | L1 Loads/Cycle per Core | L2 B/W Used | L3 B/W Used  | DRAM B/W Used |
|:--------|:---------:|:-------------------:|:------------:|:-------------:|:---------:|:-----------------------:|:-----------:|:------------:|:-------------:|
| 1       |  18.6 %   |       0.88          |         1.2% |         10.5% |    128.3% |                0.80     |        3.8% |        0.5%  |       1.5%    |

```c
/* Generated by: ../../../genfft/gen_notw.native -compact -variables 4 -pipeline-latency 4 -n 15 -name n1_15 -include n.h */

/*
 * This function contains 156 FP additions, 56 FP multiplications,
 * (or, 128 additions, 28 multiplications, 28 fused multiply/add),
 * 69 stack variables, 6 constants, and 60 memory accesses
 */
#include "n.h"

static void n1_15(const R *ri, const R *ii, R *ro, R *io, stride is, stride os, INT v, INT ivs, INT ovs)
```

## t1_4 - generated from libfftw3
| t1_4.c | CPUTIME % | Inst/Cycle per Core | L1 DC Miss % | L2 DC Miss %  | L3 Miss % | L1 Loads/Cycle per Core | L2 B/W Used | L3 B/W Used  | DRAM B/W Used |
|:-------|:---------:|:-------------------:|:------------:|:-------------:|:---------:|:-----------------------:|:-----------:|:------------:|:-------------:|
| 1      |  12.0 %   |       0.99          |         0.7% |         10.1% |      0.0% |                1.00     |        2.5% |        0.3%  |       0.0%    |
```c
/* Generated by: ../../../genfft/gen_twiddle.native -compact -variables 4 -pipeline-latency 4 -n 4 -name t1_4 -include t.h */

/*
 * This function contains 22 FP additions, 12 FP multiplications,
 * (or, 16 additions, 6 multiplications, 6 fused multiply/add),
 * 13 stack variables, 0 constants, and 16 memory accesses
 */
#include "t.h"

static void t1_4(R *ri, R *ii, const R *W, stride rs, INT mb, INT me, INT ms)
```