# Solving the Darcy-Flow problem using AFNO

In this notebook, we briefly introduce the theory behind the Adaptive Fourier Neural Operator (AFNO) and use it to solve the same data-driven Darcy flow problem that was introduced in the [FNO notebook](Darcy_Flow_using_Fourier_Neural_Operators.ipynb).

In contrast with the Fourier Neural Operator, which has a convolutional architecture, the AFNO leverages contemporary transformer architectures from the computer vision domain. Vision transformers have delivered tremendous success in computer vision, primarily due to effective self-attention mechanisms. However, the cost of self-attention grows quadratically with the number of tokens, which makes high-resolution inputs expensive to process. To cope with this challenge, Guibas et al. proposed the <a href="https://www.researchgate.net/publication/356601975_Adaptive_Fourier_Neural_Operators_Efficient_Token_Mixers_for_Transformers" rel="nofollow">Adaptive Fourier Neural Operator (AFNO)</a> as an efficient attention mechanism in the Fourier domain. AFNO is based on the principled foundation of operator learning, which allows attention to be framed as a continuous global convolution that is computed efficiently in the Fourier domain. To handle challenges in vision, such as discontinuities in images and high-resolution inputs, AFNO proposes principled architectural modifications to FNO that improve memory and computational efficiency. These include imposing a block-diagonal structure on the channel mixing weights, adaptively sharing weights across tokens, and sparsifying the frequency modes via soft-thresholding and shrinkage.
This notebook uses the AFNO transformer for modelling a PDE system. While AFNO has been designed for scaling to extremely high-resolution inputs that the FNO cannot handle as well (<a href="https://arxiv.org/pdf/2202.11214.pdf" rel="nofollow">FourCastNet</a>), here we present a simple example using Darcy flow. This problem is intended as an illustrative starting point for data-driven training using AFNO in PhysicsNeMo, but should not be regarded as leveraging the full extent of AFNO's functionality. 


#### Contents of the Notebook

- [Theory of the Adaptive Fourier Neural Operator](#Theory-of-the-Adaptive-Fourier-Neural-Operator)
- [Solving the Darcy-Flow problem](#Solving-the-Darcy-Flow-problem)
    - [Problem Description](#Problem-Description)
    - [Step 1: Loading the Data](#Step-1:-Loading-the-Data)
    - [Step 2: Creating the nodes](#Step-2:-Creating-the-nodes)
    - [Step 3: Creating the Domain and defining the Constraints](#Step-3:-Creating-the-Domain-and-defining-the-Constraints)
    - [Step 4: Adding the Validator](#Step-4:-Adding-the-Validator)
    - [Step 5: Hydra Configuration](#Step-5:-Hydra-Configuration)
    - [Step 6: Solver and Training the model](#Step-6:-Solver-and-Training-the-model)
    - [Visualising the solution](#Visualising-the-solution)

#### Learning Outcomes
- How to use the AFNO transformer architecture in PhysicsNeMo
- Differences between the AFNO transformer and the FNO


## Theory of the Adaptive Fourier Neural Operator

The Adaptive Fourier Neural Operator (AFNO) architecture is highly effective and computationally efficient for high-resolution inputs. It combines a key recent advance in modelling PDE systems, namely the Fourier Neural Operator (FNO), with the powerful Vision Transformer (ViT) model for image processing. FNO has shown great results in modelling PDE systems such as Navier-Stokes flows, while the ViT and related transformer variants have achieved state-of-the-art performance in image processing tasks. The multi-head self-attention (MHSA) mechanism of the ViT is key to its impressive performance: it models long-range interactions at each layer of the neural network, a feature that is absent in most convolutional neural networks. The drawback of self-attention is that its cost grows quadratically with the length of the token sequence, and therefore, at a fixed patch size, quadratically with the number of pixels in the input image. The AFNO addresses this scaling problem by implementing the token mixing operation in the Fourier domain. The computational complexity of the mixing operation is $\mathcal{O}(N_{token}\log N_{token})$ as opposed to the $\mathcal{O}({N_{token}^2})$ complexity of the vanilla ViT architecture.
The first step in the architecture involves dividing the input image into a regular grid with $h \times w$ equal-sized patches of size $p\times p$. The parameter $p$ is referred to as the patch size. For simplicity, we consider a single-channel image. Each patch is embedded into a token of size $d$, the embedding dimension. The patch embedding operation results in a token tensor ($X_{h\times w \times d}$) of size $h \times w \times d$. The patch size and embedding dimension are user-selected parameters. A smaller patch size allows the model to capture fine-scale details better while increasing the computational cost of training the model. A higher embedding dimension also increases the parameter count of the model. The token tensor is then processed by multiple layers of the transformer architecture performing spatial and channel mixing. 
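
The patch embedding step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the PhysicsNeMo implementation: the function name `patch_embed`, the random projection `weight`, and the use of a plain linear projection are assumptions made for the example.

```python
import numpy as np


def patch_embed(image, patch_size, weight, bias):
    """Split a single-channel image into p x p patches and embed each one.

    image:  (H, W) array with H and W divisible by patch_size
    weight: (p * p, d) projection matrix, bias: (d,) vector
    returns a token tensor of shape (h, w, d) with h = H // p, w = W // p
    """
    H, W = image.shape
    p = patch_size
    if H % p or W % p:
        raise ValueError("image size must be divisible by the patch size")
    h, w = H // p, W // p
    # (H, W) -> (h, p, w, p) -> (h, w, p, p) -> (h, w, p * p)
    patches = image.reshape(h, p, w, p).transpose(0, 2, 1, 3).reshape(h, w, p * p)
    # linear projection of every flattened patch to the embedding dimension d
    return patches @ weight + bias


rng = np.random.default_rng(0)
p, d = 16, 256  # patch size and embedding dimension (illustrative values)
image = rng.standard_normal((240, 240))
weight = rng.standard_normal((p * p, d)) / p  # hypothetical, untrained weights
bias = np.zeros(d)

tokens = patch_embed(image, p, weight, bias)
print(tokens.shape)  # (15, 15, 256)
```

Halving the patch size in this sketch would quadruple the number of tokens, which is the cost trade-off mentioned above.
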
The AFNO architecture implements the following operations in each layer.
The token tensor is first transformed to the Fourier domain with
\begin{equation}
z_{m,n} = [\mathrm{DFT}(X)]_{m,n},
\end{equation}
where $m,n$ index the patch location and DFT denotes a 2D discrete Fourier transform. The model then applies token weighting in the Fourier domain and promotes sparsity with a soft-thresholding and shrinkage operation as
\begin{equation} 
\tilde{z}_{m,n} = S_{\lambda} ( \mathrm{MLP}(z_{m,n})),
\end{equation}
where $S_{\lambda}(x) = \mathrm{sign}(x) \max(|x| - \lambda, 0)$ with the sparsity controlling parameter $\lambda$, and $\mathrm{MLP}(\cdot)$ is a 2-layer multi-layer perceptron with block-diagonal weight matrices which are shared across all patches. The number of blocks in the block-diagonal MLP weight matrices is a user-selected hyperparameter that should be tuned appropriately. The last operation in an AFNO layer is an inverse Fourier transform of $\tilde{Z}$, the tensor of all $\tilde{z}_{m,n}$, back to the patch domain, followed by a residual connection:
\begin{equation}
y_{m,n} = [\mathrm{IDFT}(\tilde{Z})]_{m,n} + X_{m,n}.
\end{equation}
At the end of all the transformer layers, a linear decoder converts the feature tensor back to the image space.
There are several important hyperparameters that affect the accuracy and computational cost of the AFNO. Empirically, the most important hyperparameters to tune for the task at hand are the number of layers, the patch size, the embedding dimension, and the number of blocks.
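
The three operations of a single layer can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the PhysicsNeMo implementation: the function names are hypothetical, a real-input FFT (`rfft2`/`irfft2`) is used so that the output stays real, and the ReLU and soft-thresholding are applied to the real and imaginary parts separately.

```python
import numpy as np


def soft_shrink(x, lam):
    """S_lambda(x) = sign(x) * max(|x| - lambda, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def block_diag_mlp(z, w1, b1, w2, b2):
    """Two-layer MLP with block-diagonal weights shared across all (m, n).

    z:      (h, w, k, d_b) complex, channels split into k blocks of size d_b
    w1, w2: (k, d_b, d_b) complex, one dense matrix per block
    b1, b2: (k, d_b) complex
    """
    hidden = np.einsum("hwkd,kde->hwke", z, w1) + b1
    # ReLU on real and imaginary parts separately (an assumption of this sketch)
    hidden = np.maximum(hidden.real, 0.0) + 1j * np.maximum(hidden.imag, 0.0)
    return np.einsum("hwkd,kde->hwke", hidden, w2) + b2


def afno_layer(x, w1, b1, w2, b2, lam=0.01):
    """One mixing layer: DFT -> block-diagonal MLP -> soft-shrink -> IDFT + X."""
    h, w, d = x.shape
    k = w1.shape[0]
    z = np.fft.rfft2(x, axes=(0, 1))                # z = DFT(X) over the patch grid
    z = z.reshape(h, z.shape[1], k, d // k)         # split channels into k blocks
    z = block_diag_mlp(z, w1, b1, w2, b2)
    z = soft_shrink(z.real, lam) + 1j * soft_shrink(z.imag, lam)
    z = z.reshape(h, z.shape[1], d)
    y = np.fft.irfft2(z, s=(h, w), axes=(0, 1))     # back to the patch domain
    return y + x                                    # residual connection


rng = np.random.default_rng(0)
h, w, d, k = 15, 15, 256, 8  # illustrative sizes
x = rng.standard_normal((h, w, d))
shape = (k, d // k, d // k)
w1 = 0.02 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
w2 = 0.02 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
b1 = np.zeros((k, d // k), dtype=complex)
b2 = np.zeros((k, d // k), dtype=complex)

y = afno_layer(x, w1, b1, w2, b2, lam=0.01)
print(y.shape)  # (15, 15, 256)
```

Because the weights are block-diagonal, each of the two MLP layers in this sketch stores $k \cdot (d/k)^2 = d^2/k$ weights instead of the $d^2$ weights of a dense layer.
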


## Solving the Darcy-Flow problem

### Problem Description

The problem description follows from the FNO notebook. We will build a surrogate model that learns the mapping between the permeability field and the pressure field of a Darcy flow system. The AFNO is based on an image transformer backbone, and as with all transformer architectures, the AFNO tokenizes the input field. Each token is the embedding of one patch of the image. The tokenized image is processed by the transformer layers, followed by a linear decoder which generates the output image.
<center><img src="images/afno_darcy.png" alt="Drawing" style="width:900px" /></center>


As in the FNO notebook, the training and validation data for this example can be found on the [Fourier Neural Operator GitHub page](https://github.com/zongyi-li/fourier_neural_operator). The data can be downloaded using an automated script similar to the one used in the FNO notebook.

**Note:** In this notebook we will walk through the contents of the [`darcy_AFNO.py`](../../source_code/darcy/darcy_AFNO.py) script.

### Step 1: Loading the Data

Loading the training and validation datasets into memory follows the same process as in the FNO notebook. We use eager data loading for both datasets in this case.

```python
import physicsnemo
from physicsnemo.sym.hydra import instantiate_arch
from physicsnemo.sym.hydra.config import PhysicsNeMoConfig
from physicsnemo.sym.key import Key

from physicsnemo.sym.domain import Domain
from physicsnemo.sym.domain.constraint import SupervisedGridConstraint
from physicsnemo.sym.domain.validator import GridValidator
from physicsnemo.sym.dataset import DictGridDataset
from physicsnemo.sym.solver import Solver

from physicsnemo.sym.utils.io.plotter import GridValidatorPlotter

from utilities import download_FNO_dataset, load_FNO_dataset


@physicsnemo.sym.main(config_path="conf", config_name="config_AFNO")
def run(cfg: PhysicsNeMoConfig) -> None:

    # load training/ test data
    input_keys = [Key("coeff", scale=(7.48360e00, 4.49996e00))]
    output_keys = [Key("sol", scale=(5.74634e-03, 3.88433e-03))]

    download_FNO_dataset("Darcy_241", outdir="datasets/")
    invar_train, outvar_train = load_FNO_dataset(
        "datasets/Darcy_241/piececonst_r241_N1024_smooth1.hdf5",
        [k.name for k in input_keys],
        [k.name for k in output_keys],
        n_examples=1000,
    )
    invar_test, outvar_test = load_FNO_dataset(
        "datasets/Darcy_241/piececonst_r241_N1024_smooth2.hdf5",
        [k.name for k in input_keys],
        [k.name for k in output_keys],
        n_examples=100,
    )
```

The inputs for AFNO need to be perfectly divisible by the specified patch size (in this case, `patch_size=16`), which is not the case for this dataset. Therefore, we trim the input/output features so that they have an appropriate dimensionality: `241x241 -> 240x240`.

```python
    # get training image shape
    img_shape = [
        next(iter(invar_train.values())).shape[-2],
        next(iter(invar_train.values())).shape[-1],
    ]

    # crop out some pixels so that img_shape is divisible by patch_size of AFNO
    img_shape = [s - s % cfg.arch.afno.patch_size for s in img_shape]
    print(f"cropped img_shape: {img_shape}")
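    for d in (invar_train, outvar_train, invar_test, outvar_test):
        for k in d:
            d[k] = d[k][:, :, : img_shape[0], : img_shape[1]]
            print(f"{k}: {d[k].shape}")

    # make datasets (used by the constraint and validator in later steps)
    train_dataset = DictGridDataset(invar_train, outvar_train)
    test_dataset = DictGridDataset(invar_test, outvar_test)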
```
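
The cropping rule can be checked in isolation with a small NumPy sketch. The helper name `crop_to_patch_multiple` and the dummy batch are hypothetical; the batch layout `(samples, channels, height, width)` is assumed from the four-dimensional slicing in the snippet above.

```python
import numpy as np


def crop_to_patch_multiple(field, patch_size):
    """Trim the last two axes down to the nearest multiple of patch_size."""
    h, w = field.shape[-2:]
    new_h = h - h % patch_size
    new_w = w - w % patch_size
    return field[..., :new_h, :new_w]


# dummy batch shaped like the Darcy data: (samples, channels, 241, 241)
batch = np.zeros((4, 1, 241, 241))
cropped = crop_to_patch_multiple(batch, patch_size=16)
print(cropped.shape)  # (4, 1, 240, 240)
```

Only the last row and column are discarded here, since `241 % 16 == 1`.
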

### Step 2: Creating the nodes

Initializing the model and domain again follows the same steps as in the FNO notebook. For AFNO, we calculate the size of the domain after loading the dataset. The AFNO model needs to know the domain size, which is provided through the keyword argument `img_shape` in the `instantiate_arch` call.

```python
    # make list of nodes to unroll graph on
    model = instantiate_arch(
        input_keys=input_keys,
        output_keys=output_keys,
        cfg=cfg.arch.afno,
        img_shape=img_shape,
    )
    nodes = [model.make_node(name="AFNO")]
```

### Step 3: Creating the Domain and defining the Constraints 

The data-driven constraints and validators are then added to the domain in the same fashion as the FNO notebook. 

```python
    # make domain
    domain = Domain()

    # add constraints to domain
    supervised = SupervisedGridConstraint(
        nodes=nodes,
        dataset=train_dataset,
        batch_size=cfg.batch_size.grid,
    )
    domain.add_constraint(supervised, "supervised")
```

### Step 4: Adding the Validator

We can now proceed and add the Validators in the same fashion as in the previous notebook.

```python
    # add validator
    val = GridValidator(
        nodes,
        dataset=test_dataset,
        batch_size=cfg.batch_size.validation,
        plotter=GridValidatorPlotter(n_examples=5),
    )
    domain.add_validator(val, "test")
```

### Step 5: Hydra Configuration

The AFNO is based on the ViT architecture and requires tokenization of the inputs. Here each token is a patch of the image, with the patch size defined in the configuration file through the parameter `patch_size`. The `embed_dim` parameter defines the size of the latent embedded features used inside the model for each patch. The contents of [`config_AFNO.yaml`](../../source_code/darcy/conf/config_AFNO.yaml) are shown below.

```yaml
defaults :
  - physicsnemo_default
  - arch:
      - afno
  - scheduler: tf_exponential_lr
  - optimizer: adam
  - loss: sum
  - _self_

arch:
  afno:
    patch_size: 16
    embed_dim: 256
    depth: 4
    num_blocks: 8
    
scheduler:
  decay_rate: 0.95
  decay_steps: 1000

training:
  rec_results_freq : 1000
  max_steps : 10000

batch_size:
  grid: 32
  validation: 32
```
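
To get a feel for what these settings imply, the following sketch derives a few sizes from the values in the configuration above. It is plain arithmetic rather than PhysicsNeMo code; the assumption that `embed_dim` is split evenly across `num_blocks` follows from the block-diagonal weight structure described in the theory section.

```python
import math

img_shape = (240, 240)  # cropped Darcy grid from Step 1
patch_size = 16
embed_dim = 256
num_blocks = 8

# number of patches along each axis and total token count
h, w = (s // patch_size for s in img_shape)
n_tokens = h * w

# channels handled by each block of the block-diagonal MLP weights
block_size = embed_dim // num_blocks

print(f"token grid: {h} x {w} = {n_tokens} tokens")  # token grid: 15 x 15 = 225 tokens
print(f"channels per block: {block_size}")           # channels per block: 32
print(f"dense weights per MLP layer: {embed_dim ** 2}")
print(f"block-diagonal weights per MLP layer: {num_blocks * block_size ** 2}")
print(f"N^2 = {n_tokens ** 2}, N log2 N = {n_tokens * math.log2(n_tokens):.0f}")
```

For this small problem the token count is modest, so the gap between the quadratic and the log-linear mixing cost is far smaller than it would be for high-resolution inputs.
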

### Step 6: Solver and Training the model

Once the domain and the configuration are set up, the `Solver` can be defined and the training can be started as seen in earlier notebooks.

```python
    # make solver
    slv = Solver(cfg, domain)

    # start solver
    slv.solve()


if __name__ == "__main__":
    run()
```

Before starting the training, we can set up TensorBoard to visualize the loss values and the convergence of the other monitors. Inside the Jupyter interface, select the directory in which the checkpoints will be stored by clicking the small checkbox next to it. Once TensorBoard is open, switch between the SCALARS, IMAGES, TEXT, and TIME SERIES tabs to view the validation results and other information related to training.

For this application, verify that you are inside the `/jupyter_notebook/Operators` folder after launching TensorBoard.


1. The option to launch a Tensorboard then shows up in that directory.

<center><img src="../projectile/images/tensorboard.png" alt="Drawing" style="width:900px" /></center>

2. We can launch TensorBoard using the following command:

```
tensorboard --logdir /workspace/python/jupyter_notebook/ --port 8889
```

3. Open a new tab in your browser and head to [http://127.0.0.1:8889](http://127.0.0.1:8889). You should see a screen similar to the one below.

<center><img src="../projectile/images/tensorboard_browser.png" alt="Drawing" style="width:900px" /></center>

```python
import os
os.environ["RANK"]="0"
os.environ["WORLD_SIZE"]="1"
os.environ["MASTER_ADDR"]="localhost"
```

```python
!python ../../source_code/darcy/darcy_AFNO.py
```

### Visualising the solution

Results are written to the checkpoint directory at the recording frequency specified by the `rec_results_freq` parameter or its derivatives. The network directory contains several plots of the different validation predictions, some of which are shown below.

AFNO validation predictions. (Left to right) Input permeability, true pressure, predicted pressure, error.

<center><img src="images/afno_darcy_pred1.png" alt="Drawing" style="width: 900px;"/></center>
<center><img src="images/afno_darcy_pred2.png" alt="Drawing" style="width: 900px;"/></center>
<center><img src="images/afno_darcy_pred3.png" alt="Drawing" style="width: 900px;"/></center>

It is important to recognize that AFNO's strengths lie in its ability to scale to much larger model sizes and datasets than those used in this notebook. Although that scaling is not illustrated here, this example demonstrates the fundamental implementation of data-driven training using the AFNO architecture in PhysicsNeMo, which you can extend to larger problems.

--- 

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

---

# Licensing

Copyright © 2023 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.