![logo](../../_static/images/NCI_logo.png)

-------

# Setup Pangeo Environment


In this notebook:

- Load the Pangeo module
- Activate the Pangeo environment
- Submit a batch job script to the queue system
- Set up port forwarding on the client machine
- Connect to the remote Jupyter lab server from the client machine
- Utilise the dask server in a Jupyter notebook
- Visualise the dask dashboard
- Run a Jupyter notebook in a batch job

--- 




**Pangeo** is a community platform for big data in geoscience, funded by the US NSF. The Pangeo project serves as a coordination point between scientists, software and computing infrastructure. The Pangeo software ecosystem involves open source tools such as xarray, iris, dask, jupyter, and many other packages. [This site](http://pangeo.io) provides guidance for accessing data and performing analysis using these tools. NCI has installed the Pangeo environment on Raijin by following the instructions [here](http://pangeo.io/setup_guides/hpc.html). Please note that Pangeo will be transferred to Gadi when the major Raijin/Gadi transition happens in Nov/Dec, and it will not be available on Raijin from the transition period onwards. This notebook provides instructions on how to use the Pangeo environment to open a Jupyter notebook in a browser on your local machine while the computation runs remotely on Raijin.

### Load Pangeo module from Raijin and activate Pangeo environment

```
$ module load pangeo/2019.10
$ source ${PANGEO_ROOT}/etc/profile.d/conda.sh
$ conda activate pangeo
```
You will see `pangeo` appear in brackets in front of the prompt. You can quit the environment using **conda deactivate**:

```
$ conda deactivate
```

![1](images/pangeo_setup1.png)

If you ask where your Python command lives (for example with `which python`), it should point to where Pangeo was installed on Raijin.

![2](images/pangeo_setup2.png)
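The same check can be made from inside Python. The sketch below assumes that the `PANGEO_ROOT` environment variable (used in the `source` command above) points at the installation directory; the helper name `is_under` is hypothetical and only for illustration.

```python
import os
import sys
from pathlib import Path


def is_under(path, root):
    """Return True if `path` is `root` itself or lies inside `root`."""
    path = Path(path).resolve()
    root = Path(root).resolve()
    return root == path or root in path.parents


# Assumption: PANGEO_ROOT is set by `module load pangeo/2019.10`.
pangeo_root = os.environ.get("PANGEO_ROOT")
if pangeo_root is None:
    print("PANGEO_ROOT is not set; load the pangeo module first")
else:
    print("python:", sys.executable)
    print("inside Pangeo install:", is_under(sys.executable, pangeo_root))
```

If the last line prints `False`, the environment is probably not activated in the current shell.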



### Configure Jupyter

Run the following two commands.

```
$ jupyter notebook --generate-config
$ jupyter notebook password
```
It will prompt you to enter a password, which you will need later to open the Jupyter notebook from your local machine. Choose any password, but make sure you remember it!

If the command does not work (often the case in older versions of Jupyter), there are [instructions](http://pangeo.io/setup_guides/hpc.html) on how to set it up step by step.
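To confirm that a password was stored, you can inspect the Jupyter configuration. This is a minimal sketch that assumes the classic Notebook server, which records the hashed password under `NotebookApp.password` in `~/.jupyter/jupyter_notebook_config.json`; other Jupyter versions may use a different file or key. The function name is hypothetical.

```python
import json
from pathlib import Path


def has_notebook_password(config_path):
    """Return True if the JSON config file holds a non-empty hashed password."""
    config_path = Path(config_path)
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text())
    return bool(config.get("NotebookApp", {}).get("password"))


# Assumed default location for the classic Notebook server.
default_config = Path.home() / ".jupyter" / "jupyter_notebook_config.json"
print("password configured:", has_notebook_password(default_config))
```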

### Start a Jupyter Notebook Server

First create a directory where you will run the Jupyter notebook; let's call it "tutorial".

Submit a batch job script like the one below to the queue system. You can create the shell script by copying the following commands into a file; let's name it run_ipynb_job.sh. Or you can download the example script here. We request 2 nodes with a total of 32 CPUs and 64GB of memory in this instance. Further instructions about job submission and running jobs on Raijin can be found [here](https://opus.nci.org.au/display/Help/Running+Jobs).


```
#!/bin/bash
#PBS -N pangeo_test
#PBS -P YOUR_PROJECT_ID
#PBS -q express
#PBS -l walltime=5:00:00
#PBS -l ncpus=32
#PBS -l mem=64GB
#PBS -l jobfs=100GB
module load pangeo/2019.10
pangeo.ini.all.sh
sleep infinity
```

![3](images/pangeo_setup3.png)

Replace the requested resources, queue type and project ID with those suitable for you. Please always request one or more whole nodes in your job script. The above job loads the pangeo module, runs the initialization script **pangeo.ini.all.sh**, and then stays alive for the lifetime of the job. The initialization script sets up the dask scheduler on one node and multiple workers on all nodes. After that, it starts Jupyter lab and writes the port forwarding commands for the user into a file named `client_cmd`.
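Requesting whole nodes means that `ncpus` should be a multiple of the number of cores per node. The sketch below assumes 16 cores per node, inferred from the example above where 32 CPUs correspond to 2 nodes; other node types may differ, so treat the default as an assumption. The function name is hypothetical.

```python
def whole_nodes(ncpus, cores_per_node=16):
    """Return the node count for `ncpus`, or raise if it is not whole nodes."""
    if ncpus <= 0 or ncpus % cores_per_node != 0:
        raise ValueError(
            f"ncpus={ncpus} is not a positive multiple of {cores_per_node}"
        )
    return ncpus // cores_per_node


# The request in the job script above: 32 CPUs.
print(whole_nodes(32))  # prints 2
```

A request such as `ncpus=20` raises a `ValueError` with this default, which is a quick way to catch a partial-node request before submitting.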

### Set up port forwarding at the client machine

Once the job is running, two files appear in your current directory.

* client_cmd
* scheduler.json

![4](images/pangeo_setup4.png)


The file client_cmd contains commands that forward network traffic from the given ports on the worker node to the client machine via the login node raijin.nci.org.au. In the example below, Jupyter lab uses port 8343 and the dask dashboard uses port 8890 on the Raijin worker node r225. Note that both port numbers are picked at random for each job, so they change from one job submission to the next.

### SSH login with/without password

If you have set up SSH login without a password, you can paste the two lines from client_cmd into a single command line interface (CLI) on the client machine (we recommend using VDI or a computer running macOS or Linux).

![5](images/pangeo_setup5.png)

Otherwise, you may need to run each command in a separate CLI and type in your Raijin password when prompted:
 
CLI_1: 
```
$ ssh -N -L 8343:r225:8343 jbw900@raijin.nci.org.au 
jbw900@raijin.nci.org.au's password: 
```

CLI_2: 
```
$ ssh -N -L 8890:r225:8890 jbw900@raijin.nci.org.au 
jbw900@raijin.nci.org.au's password: 
```
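The structure of these commands can be sketched in Python. In `ssh -N -L port:node:port user@host`, `-N` means no remote command is run and `-L` forwards the local port to the same port on the worker node. The helper `forward_command` is hypothetical; the port, node and user values are taken from the example above and will differ for your job.

```python
def forward_command(port, node, user, login_host="raijin.nci.org.au"):
    """Build the ssh command that forwards localhost:port to node:port."""
    return f"ssh -N -L {port}:{node}:{port} {user}@{login_host}"


# Values from the example above; yours will differ for every job.
for port in (8343, 8890):
    print(forward_command(port, "r225", "jbw900"))
```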

### Connect to the remote Jupyter lab server from your client machine

Type “localhost:8343” into a web browser on the client machine to open the remote Jupyter lab interface.

![6](images/pangeo_setup6.png)

You will then be prompted for a password. Type the password that you set in the second step of this tutorial (Configure Jupyter).

![7](images/pangeo_setup7.png)

Once you are authenticated, the JupyterLab interface will launch within a few seconds.

![8](images/pangeo_setup8.png)

Now you are ready to run your own notebooks.

### Let's import a notebook example

You can drag and drop a notebook from your local computer into JupyterLab. The file will then also appear in your working directory on Raijin.

![9](images/pangeo_setup9.png)

The screenshot above shows:

- left: Jupyter notebook interface
- top right: local directory from which a notebook is dragged and dropped into JupyterLab
- bottom right: Raijin command window showing that the notebook appears instantly

### Utilize the dask server in Jupyter notebook

To utilize the dask server established from the PBS job, it is necessary to add and run the following cell at the beginning of your notebook: 

```
from dask.distributed import Client

# Connect to the dask scheduler started by the PBS job
client = Client(scheduler_file='scheduler.json')
print(client)
```

Its output shows the configuration of the client and cluster. Make sure the number of cores matches what you requested in the job script. You can now run your notebook as usual.
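The core-count check can also be done programmatically. This is a sketch under assumptions: the helper names are hypothetical, the worker addresses in the example are made up, and it relies on `Client.ncores()` returning a mapping from worker address to core count, which may be named differently in your dask version.

```python
def total_cores(cores_by_worker):
    """Sum the per-worker core counts."""
    return sum(cores_by_worker.values())


def cores_match(cores_by_worker, requested_ncpus):
    """Return True if the workers together provide the requested cores."""
    return total_cores(cores_by_worker) == requested_ncpus


# Hypothetical worker addresses, for illustration only.
example = {"tcp://10.0.0.1:40001": 16, "tcp://10.0.0.2:40002": 16}
print(cores_match(example, requested_ncpus=32))

# In the notebook, after creating `client` as above, something like
#   cores_match(client.ncores(), requested_ncpus=32)
# should work, assuming your dask version provides Client.ncores().
```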

### Terminate the job

After all work has finished, add and run the cell below to stop the job.

```
!pangeo.end.sh
```

### Recap important notes

Please make sure the following two commands are added at the beginning and the end of the notebook, respectively.

```
# start the dask client
client = Client(scheduler_file='scheduler.json')

# stop the PBS job
!pangeo.end.sh
```

### View threads using Dask dashboard

Open a new tab in the web browser and enter the following address, using the second port listed in the client_cmd file.
Once the job is running, you should see the resource usage of the processing update live.

```
localhost:8890
```
![10](images/pangeo_setup10.png)
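If the dashboard page does not load, it can help to check whether anything is listening on the forwarded port. The sketch below is a generic TCP check, not part of the Pangeo tooling; the function name is hypothetical and port 8890 is the example value from above. A successful connection only shows that the local end of the SSH tunnel is up, not that the dashboard itself is healthy.

```python
import socket


def port_is_open(port, host="localhost", timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 8890 is the example dashboard port; use the one from your client_cmd.
print("dashboard port reachable:", port_is_open(8890))
```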

## Run Jupyter notebook in a batch job

### Convert your Jupyter notebook to a Python script

```
jupyter nbconvert --to script YOUR_NOTEBOOK.ipynb
```
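The conversion can also be driven from Python, which is convenient when several notebooks need converting. This is a minimal sketch that simply wraps the command line; the helper names are hypothetical, and the `.py` output name assumes the notebook uses a Python kernel.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook):
    """Build the nbconvert command line for one notebook."""
    return ["jupyter", "nbconvert", "--to", "script", str(notebook)]


def expected_script(notebook):
    """Guess the output file name; assumes a Python kernel (.py suffix)."""
    return Path(notebook).with_suffix(".py")


notebook = Path("YOUR_NOTEBOOK.ipynb")  # placeholder name from the text
if notebook.exists():
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(nbconvert_command(notebook), check=True)
    print("wrote", expected_script(notebook))
else:
    print("notebook not found:", notebook)
```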

Make sure you have added the following lines at the beginning of the python script.

```
from dask.distributed import Client

# Connect to the dask scheduler started by the PBS job
client = Client(scheduler_file='scheduler.json')
print(client)
```


### Create the job script as below

```
#!/bin/bash 
#PBS -N pangeo_test 
#PBS -P YOUR_PROJECT_ID 
#PBS -q YOUR_QUEUE_TYPE 
#PBS -l walltime=5:00:00 
#PBS -l ncpus=32 
#PBS -l mem=64GB 
#PBS -l jobfs=100GB 
 
module load pangeo/2019.10 
pangeo.ini.all.sh 
source ${PANGEO_ROOT}/etc/profile.d/conda.sh 
conda activate pangeo 
 
cd $PBS_O_WORKDIR 
python YOUR_PYSCRIPT_NAME.py 
pangeo.end.sh
```

Modify the parameters to suit your case, and save the script as **run_py.sh**.

### Submit your job script via 'qsub' command

```
qsub run_py.sh
```

### Reference

- http://pangeo.io