# Set up Nvidia Modulus v22.03 on Sunbird using an interactive GPU session

As of 17 Apr 2022, the link to the Modulus tutorial is a bit hard to find. Here it is: https://docs.nvidia.com/deeplearning/modulus/index.html

# Installation
Conda environments turned out to cause a lot of issues, so I will use a Python virtual environment instead, without JupyterLab.

## Installing latest Python
The Python versions installed system-wide on Sunbird are very old. The latest one is 3.6.
```sh
[s.1915438@sl1 ~]$ ls /usr/bin/python*
/usr/bin/python  /usr/bin/python2  /usr/bin/python2.7  /usr/bin/python2.7-config  /usr/bin/python2-config  /usr/bin/python3  /usr/bin/python3.6  /usr/bin/python3.6m  /usr/bin/python-config
```

To create a virtual environment with a recent Python, we can borrow the Python interpreter from a conda environment.

Create a new conda environment as follows. On Sunbird this gave me an environment with a recent Python (3.9.12, as the version check further down shows).
```sh
module load anaconda/2021.05
conda create --name modulus
source activate modulus
```
Let us check the Python version in the `modulus` environment.
```sh
(modulus) [s.1915438@sl1 ~]$ which python
/lustrehome/home/s.1915438/modulus/bin/python
(modulus) [s.1915438@sl1 ~]$ python --version
Python 3.9.12
```


We can use this Python to create our virtual environment as follows. I will create it in the `/scratch/` partition, as it is faster than the `/lustrehome/` partition.
```sh
(modulus) [s.1915438@sl1 ~]$ cd /scratch/s.1915438
(modulus) [s.1915438@sl2 s.1915438]$ mkdir env
(modulus) [s.1915438@sl2 s.1915438]$ ls
ansys195  env  jupyter_env.sh  jupyter_log  jupyter.sh  modulus  Modulus_examples  Modulus_source
(modulus) [s.1915438@sl2 s.1915438]$ cd env
(modulus) [s.1915438@sl2 env]$ python3 -m venv modulus 
(modulus) [s.1915438@sl2 env]$
```

Now it is time to leave the conda environment. The simplest way is to re-establish the SSH connection.

## Running Python virtual environment
The Python virtual environment can be activated with this command:

```sh
[s.1915438@sl1 ~]$ cd /scratch/s.1915438
[s.1915438@sl1 s.1915438]$ source env/modulus/bin/activate
(modulus) [s.1915438@sl1 s.1915438]$ 
```

Now we can check the Python version:
```sh
(modulus) [s.1915438@sl1 s.1915438]$ which python
/scratch/s.1915438/env/modulus/bin/python
(modulus) [s.1915438@sl1 s.1915438]$ python --version
Python 3.9.12
(modulus) [s.1915438@sl1 s.1915438]$ 
```
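The same check can be done from inside Python, which is handy when a script is launched through a scheduler and the shell prompt is not visible. The following is a minimal sketch: the function names `in_virtualenv` and `describe_interpreter` are my own, and the check relies on the standard `venv` behaviour that `sys.prefix` differs from `sys.base_prefix` inside a virtual environment.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix points at the installation the venv was created from.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


def describe_interpreter():
    """Summarise version, location and venv status of the running Python."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"Python {version} at {sys.executable} (venv: {in_virtualenv()})"


print(describe_interpreter())
```

With the `modulus` environment active, the printed path should point into `/scratch/s.1915438/env/modulus` and the venv flag should be `True`.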

## Installing PyTorch
Remember to install a PyTorch build that supports the Nvidia A100. Version `'1.11.0+cu102'`, i.e. 1.11 built for CUDA 10.2, is incompatible, and you will see the following warning:
```sh
(modulus) [s.1915438@sl2 helmholtz]$ srun python helmholtz.py
/scratch/s.1915438/env/modulus/lib/python3.9/site-packages/torch/cuda/__init__.py:145: UserWarning: 
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
```
So install a build for a newer CUDA version, such as `'1.11.0+cu113'`, using `pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113`.
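To check an installed build before starting a long run, the compiled architecture list can be compared with the GPU's compute capability. This is a sketch under assumptions: the helper `capability_supported` is my own, and comparing major versions is my reading of how the warning above is triggered, not a documented guarantee. `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are existing PyTorch calls.

```python
def capability_supported(capability, arch_list):
    """True if a compiled 'sm_XY' arch shares the device's major version.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list:  strings such as 'sm_70', as reported by PyTorch.
    Assumption: matching on the major version is sufficient.
    """
    device_major = capability[0]
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip non-binary entries such as 'compute_80'
        digits = arch[len("sm_"):]
        if digits.isdigit() and int(digits) // 10 == device_major:
            return True
    return False


try:
    import torch
except ImportError:
    torch = None  # PyTorch is not installed in this environment

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print(torch.__version__, cap, archs, capability_supported(cap, archs))
else:
    print("PyTorch with a visible CUDA device is not available here.")
```

The script only sees a GPU when it runs on a GPU node, so it should be launched with `srun python` inside an allocation rather than on the login node.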

## Installing Nvidia Modulus v22.03
A `requirements.txt` file is present in this directory. It contains the commands to install the prerequisites for Modulus. Please do not follow Nvidia's online instructions.

```sh
pip3 install matplotlib transforms3d future typing numpy quadpy numpy-stl==2.11.2 h5py sympy==1.5.1 termcolor psutil symengine==0.6.1 numba Cython chaospy torch_optimizer vtk omegaconf hydra-core einops timm tensorboard pandas orthopy ndim

pip3 install -U https://github.com/paulo-herrera/PyEVTK/archive/v1.1.2.tar.gz
```
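Because three of the prerequisites are pinned to exact versions, it can be worth confirming that `pip` really installed those versions. The sketch below uses the standard-library `importlib.metadata` module; the helper names `installed_version` and `pin_mismatches` are my own, and the `PINNED` table simply repeats the pins from the command above.

```python
from importlib import metadata

# Pins copied from the pip3 command above.
PINNED = {"sympy": "1.5.1", "symengine": "0.6.1", "numpy-stl": "2.11.2"}


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


problems = pin_mismatches(PINNED)
print("All pins satisfied." if not problems else f"Mismatches: {problems}")
```

An empty result means all three pins are satisfied in the active environment.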
Go to the Nvidia Modulus source directory and install Modulus into the `modulus` virtual environment.
```sh
[s.1915438@sl1 Modulus]$ ls
accompanying_licences  build  changelog_tensorflow.md  dist  Dockerfile  external  MANIFEST.in  modulus  modulus.egg-info  NVIDIA-OptiX-SDK-7.0.0-linux64.sh  README.md  requirements.txt  setup.cfg  setup.py
[s.1915438@sl1 Modulus]$ pwd
/scratch/s.1915438/Modulus_source/Modulus
[s.1915438@sl1 Modulus]$ python setup.py install
```
After some time you should see a success message:
```sh
Using /scratch/s.1915438/modulus/lib/python3.9/site-packages
Finished processing dependencies for modulus==22.3
```

### Installing PySDF
Relevant forum thread: https://forums.developer.nvidia.com/t/modulus-22-03-bare-metal-installation-no-module-named-easy-install/210970

Copy the PySDF files from the previous release (v21.06), i.e. `./Modulus/external/pysdf`, and paste them into `./Modulus/external` of v22.03. I am doing this because installing `egg` files with `easy_install`, which is the default way to install PySDF in Modulus v22.03, no longer works in this Python 3.9 environment (see the forum thread above).

Now we can proceed with the instructions from the older manual, as follows.

```sh
(/scratch/s.1915438/modulus) [s.1915438@sl1 Modulus]$ pwd
/scratch/s.1915438/Modulus_source/Modulus
(/scratch/s.1915438/modulus) [s.1915438@sl1 Modulus]$ cd external/
(/scratch/s.1915438/modulus) [s.1915438@sl1 external]$ ls
eggs  lib  pysdf
(/scratch/s.1915438/modulus) [s.1915438@sl1 external]$ export LD_LIBRARY_PATH=$(pwd)/pysdf/:${LD_LIBRARY_PATH}
```
Now install PySDF:
```sh
(modulus) [s.1915438@sl2 pysdf]$ pwd
/scratch/s.1915438/Modulus_source/Modulus/external/pysdf
(modulus) [s.1915438@sl2 pysdf]$ python setup.py install
```
After some time you will see:
```sh
Installed /scratch/s.1915438/env/modulus/lib/python3.9/site-packages/pysdf-0.1-py3.9-linux-x86_64.egg
Processing dependencies for pysdf==0.1
Finished processing dependencies for pysdf==0.1
```

# Running an interactive GPU session
Request an interactive allocation with `salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2`.

Set the number of GPUs (the value after `--gres=gpu:`) as you wish; the number of CPUs does not matter here.
```sh
(modulus) [s.1915438@sl2 helmholtz]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
salloc: Granted job allocation 7161838
salloc: Waiting for resource configuration
salloc: Nodes scs2041 are ready for job
```
We can see our job in two ways: using `squeue --user=s.1915438` or `squeue --partition=accel_ai`.
```sh
[s.1915438@sl2 ~]$ squeue --partition=accel_ai
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7161842  accel_ai     bash s.191543  R       0:38      1 scs2041
           7161825  accel_ai Eval_ens   a.bip5  R    1:08:17      1 scs2041
```
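The allocation can also be inspected from inside Python by reading the environment variables that Slurm sets. This is a minimal sketch: the helper name `slurm_summary` is my own, and whether `CUDA_VISIBLE_DEVICES` is populated depends on how the cluster configures GPU resources, so an unset value is not necessarily an error.

```python
import os


def slurm_summary(env):
    """Pick out the Slurm and GPU variables that describe the current job."""
    keys = ("SLURM_JOB_ID", "SLURM_JOB_NODELIST", "CUDA_VISIBLE_DEVICES")
    return {key: env.get(key, "<unset>") for key in keys}


for key, value in slurm_summary(os.environ).items():
    print(f"{key}={value}")
```

Running it through `srun python` shows what a job step on the compute node sees, which can differ from what the login-node shell reports.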

## Running Nvidia Modulus example
We can use `srun` to run any Python script on the GPU as follows:
```sh
(modulus) [s.1915438@sl2 seismic_wave]$ srun python wave_2d.py 
training:
  max_steps: 40000
  grad_agg_freq: 1
  rec_results_freq: 1000
  :
  <Output continues>
```

## Cancelling model training
Nvidia Modulus keeps training the model for a very long time (up to `max_steps`, 40000 in the output above) and stores the data in the `checkpoint` folder. We can cancel the training at any time, for example once the loss is satisfactory, by pressing `Ctrl+C` multiple times.

## Cannot run the SDF library and STL file support
This is something I still have to look into. For now, here is the error:
```sh
(modulus) [s.1915438@sl1 s.1915438]$ cd Modulus_examples/examples/aneurysm/
(modulus) [s.1915438@sl1 aneurysm]$ ls
aneurysm.py  conf  openfoam  stl_files
(modulus) [s.1915438@sl1 aneurysm]$ srun python aneurysm.py
Error importing pysdf. Make sure 'libsdf.so' is in LD_LIBRARY_PATH and pysdf is installed
Traceback (most recent call last):
  File "/scratch/s.1915438/Modulus_examples/examples/aneurysm/aneurysm.py", line 25, in <module>
    from modulus.geometry.tessellation.tessellation import Tessellation
  File "/scratch/s.1915438/env/modulus/lib/python3.9/site-packages/modulus-22.3-py3.9.egg/modulus/geometry/tessellation/tessellation.py", line 11, in <module>
    import pysdf.sdf as pysdf
ImportError: libsdf.so: cannot open shared object file: No such file or directory
srun: error: scs2041: task 0: Exited with exit code 1
```
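The error message points at `LD_LIBRARY_PATH`, so a first diagnostic step is to check whether any directory on that path actually contains `libsdf.so`. The sketch below does that; the helper name `find_shared_library` and the script name used afterwards are my own. One possible cause, not confirmed here, is that an `export LD_LIBRARY_PATH=...` only lasts for the shell session in which it was typed, so it has to be repeated after reconnecting.

```python
import os


def find_shared_library(name, search_path):
    """Return the directories in a colon-separated path that contain `name`."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Empty entries appear when the path starts or ends with a separator.
        if directory and os.path.isfile(os.path.join(directory, name)):
            hits.append(directory)
    return hits


ld_path = os.environ.get("LD_LIBRARY_PATH", "")
print("LD_LIBRARY_PATH =", ld_path or "<empty>")
print("libsdf.so found in:", find_shared_library("libsdf.so", ld_path))
```

Saved as, say, `check_libsdf.py` (a hypothetical name) and launched with `srun python check_libsdf.py`, it reports the path as seen on the compute node. An empty list means the directory holding `libsdf.so` still has to be added to `LD_LIBRARY_PATH` before running the example.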