# Using Acacia in HPC workflows

Until now we have been using Rclone and MinIO client on the command line. You can certainly interact with Acacia using the command line on Pawsey systems, but there are more powerful ways to integrate object storage into your HPC workflows. Since the time limit for files on **/scratch** is 30 days, object storage on Acacia is one of the **primary means** of longer-term data storage at Pawsey. In this tutorial we are going to look at:


* Using **rclone** and **mc** on Pawsey systems
* Using the Unix utility **tar** to create, examine, compress, and extract files in archive format.
* Leveraging job dependencies to combine Acacia with your supercomputer workflows.

## Getting onto a Pawsey system

For this tutorial you need to be using a Pawsey system. If you have not already logged in to a Pawsey system then please see the <a href="../T1_Getting_Access/L1_SSH_access.html">L1_SSH_access</a> page for logging in via SSH. Then revisit the <a href="../T1_Getting_Access/L3_MinIO_client.html">MinIO</a> and <a href="../T1_Getting_Access/L4_Rclone_client.html">Rclone</a> setup pages to set up access to Acacia from MinIO and Rclone clients on Pawsey systems. In each page, go to the section called "Configure (MinIO/Rclone) with your personal access key".

## Prepare the mock data

If you haven't already prepared the mock data, then follow the instructions at  <a href="../T1_Getting_Access/L5_Mock_data.html">T1_Getting_Access -> L5_Mock_data</a> to unpack the mock data for working with Acacia.

Using the **command line** change directory to the **data** directory. This will be something like

```bash
cd /scratch/${PAWSEY_PROJECT}/${USER}/acacia_training/data
```

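The environment variables **${PAWSEY_PROJECT}** and **${USER}** should expand to your project and your training login. If they are not set, replace them with those values.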

## A mini tutorial on tar 

TAR (Tape ARchive) is a Unix tool to sequentially aggregate many files and directories into **one** file. Tar's intended purpose was to prepare files for being written to tape, but the file it creates now serves as a general archive format that supports POSIX file metadata. A tar file may be compressed, and **tar** supports compression using **gzip** and **bzip2**. 

### When to use tar

It might be useful to integrate tar into your workflow under the following circumstances:

* When there are so many files that their storage in Acacia will exceed the nominal limit of 100,000 objects per bucket
* When there is a performance benefit in aggregating a large number of small files. Each upload to Acacia uses a separate HTTPS connection, which takes time to set up and tear down.
* When you don't need individual access to files.
* When you need to preserve empty directories
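
To judge whether a directory falls under these circumstances, it can help to count its files and empty directories first. The following is a minimal sketch; the function name `survey_dir` is hypothetical and chosen only for this illustration.

```bash
#!/bin/bash
# Count regular files and empty directories under a directory.
# The function name "survey_dir" is hypothetical, chosen for this sketch.
survey_dir() {
    local dir="$1"
    local nfiles nempty
    nfiles=$(find "$dir" -type f | wc -l)
    nempty=$(find "$dir" -type d -empty | wc -l)
    # $(( )) strips any padding that wc adds on some systems
    echo "files=$((nfiles)) empty_dirs=$((nempty))"
}

# Example: survey the current directory
survey_dir .
```

In the **data** directory you could run `survey_dir simulation` instead. A file count approaching the nominal limit of 100,000 objects per bucket, or any empty directories you need to keep, suggests using tar.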

### When to use compression with tar

Tar files can be compressed as well. This is useful when the contents compress well, as plain text usually does. For many types of binary data the space saved through compression is often marginal.
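
The difference can be seen with a small experiment. This sketch compares gzip compression of a text file with that of a file of random bytes, which stands in for incompressible binary data. All file names are hypothetical and everything is created in a throwaway directory.

```bash
#!/bin/bash
# Compare gzip compression of a text file and a random (incompressible) binary file.
# All file names here are hypothetical and live in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 1 200000 > numbers.txt                  # highly compressible text, about 1.3 MB
head -c 1000000 /dev/urandom > random.bin   # 1 MB of random bytes

for f in numbers.txt random.bin; do
    tar cf "$f.tar" "$f"        # uncompressed archive
    tar zcf "$f.tar.gz" "$f"    # gzip-compressed archive
    plain=$(wc -c < "$f.tar")
    packed=$(wc -c < "$f.tar.gz")
    echo "$f: tar=$((plain)) bytes, tar.gz=$((packed)) bytes"
done
```

The text archive should shrink considerably, while the archive of random bytes stays at roughly its original size.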

### Creating archives

The basic syntax to create a tar archive is:

```bash
tar cf <output_name>.tar <things_to_include>
```
The **c** flag means **create** and the **f** (or **--file**) flag means use a **file** for output. For example to archive the **simulation** directory and all of the contents (including hidden files) run
```bash
tar cf simulation.tar simulation 
```
or to include everything in the current directory, including hidden files, use this:
```bash
tar cf ../simulation.tar .
```
Usually it's better to put the tar file somewhere else other than the directory being archived, otherwise you get a warning about the tar file trying to include itself in the archive.

**Exercise: Run these commands on the simulation directory**

#### Verbosity

If you need to see extra information use the **v** or **--verbose** flag for verbosity. This generates a lot of output for a big directory! Change directory back to **data** and run this

```bash
cd ../data
tar cvf simulation.tar simulation
```

#### Using compression to create an archive

The **z** or **j** flag (but not both) switches on compression. The **j** flag corresponds to **bzip2** compression and the resulting file usually is given a **.tar.bz2** extension. The flag **z** is for gzip compression and the file has a **.tar.gz** extension. Normally bzip2 achieves a better compression ratio than gzip but is slower.

```bash
tar zcf simulation.tar.gz simulation
```
or 
```bash
tar jcf simulation.tar.bz2 simulation
```

**Exercise: Use the "time" utility to see how long the tar archive creation process takes without compression.**

```bash
time tar cf simulation.tar simulation
```

Now try timing archive creation using the two compression options. 

* Which process took the longest? 
* How many times longer did it take than uncompressed? 
* Which process achieved the smallest size? Use **du -h \<filename\>** to look at the size of the file.

#### Compression in parallel

The standard tar with compression only uses one core. However, there is a program called **pigz** (Parallel Implementation of GZip) which can use multiple cores, speeding up the compression and decompression steps. We can enable this program using the **--use-compress-program="pigz"** option.

```bash
time tar cf simulation.tar.gz simulation --use-compress-program="pigz"
```
and to uncompress we use

```bash
time tar xf simulation.tar.gz --use-compress-program="pigz"
```

#### Exclude files

Files can be left out of the archive using one or more **--exclude** flags. Here we exclude all files ending with **.dat**. Place the **--exclude** flag after the tar file name **simulation.tar**, because the **f** flag takes the next argument as the archive name.

```bash
tar cvf simulation.tar --exclude='*.dat' simulation
```

### Listing archive contents

The **--list** or **t** flag shows what is in an archive. You can even use this on a compressed archive by using it in conjunction with a compression flag.

```bash
tar tf simulation.tar
```
or for a tar file that is compressed with bzip2 we do this.
```bash
tar tjf simulation.tar.bz2
```

There is even a rudimentary search facility using wildcards. Here we use it to look for log files in the archive.
```bash
tar tf simulation.tar --wildcards '*.log'
tar tf simulation.tar --wildcards 'simulation/log/data_0*.log'
```


**Exercise: list the contents of simulation.tar**

### Extracting archives

The **x** flag is used to extract files from an archive

```bash
tar xvf simulation.tar
```
You can extract the contents of a tar archive to another directory, which must already exist, using the **--directory** flag
```bash
tar xvf simulation.tar --directory other_directory
```
Compressed volumes are extracted using their respective compression flags. Use **j** for a bzip2 compressed archive and **z** for a gzip compressed archive.
```bash
tar xjvf simulation.tar.bz2
```
You can also extract a single file or directory from an archive, for example just the **log** directory and everything in it.
```bash
tar xvf simulation.tar simulation/log
```
Or you can use wildcards to extract specific things based on shell-style glob patterns.
```bash
tar xvf simulation.tar --wildcards '*.log'
```

**Exercise: extract all of the .dat files from the simulation.tar archive and place them in another directory of your choosing.**

### Adding files to an archive

Duplicate the **simulation** directory

```bash
cp -r simulation simulation2
```

Adding files to a tar archive is accomplished using the **--append** or **r** flag. Run this to add the directory **simulation2** to the archive.

```bash
tar rf simulation.tar simulation2
```

Now list the contents of simulation.tar and check the size

```bash
tar tf simulation.tar
du -h simulation.tar
```

> Note: compressed tar files **cannot** be updated. You need to unpack the compressed file before appending more contents.
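
One way to work around this is to decompress the archive, append to it, and compress it again. The sketch below shows the sequence for a gzip-compressed archive. The file and directory names are hypothetical and everything happens in a throwaway directory.

```bash
#!/bin/bash
# Append to a gzip-compressed archive by decompressing, appending, and recompressing.
# File and directory names are hypothetical; everything happens in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir first second
echo "one" > first/a.txt
echo "two" > second/b.txt

tar zcf demo.tar.gz first     # start with a compressed archive
gunzip demo.tar.gz            # leaves demo.tar
tar rf demo.tar second        # append works on the uncompressed archive
gzip demo.tar                 # recreates demo.tar.gz
tar tzf demo.tar.gz           # list the result
```

For large archives this costs a full decompress and recompress, so it may be simpler to keep the archive uncompressed until no more files need to be added.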

### Removing files from an archive

Now remove the **simulation2** directory from the tar archive using the **--delete** flag. There is no short form of the delete flag. 

```bash
tar --delete --file=simulation.tar simulation2
```
Note that the size of simulation.tar decreases accordingly.

### Comparing files in an archive

The **--compare**, **--diff**, or **-d** flag checks an archive to see if there are any differences between what is in the archive and what is on disk. Comparison is **not a recursive operation** though. Let's delete a file from the archive and add an extra file to the local copy.

```bash
tar --delete --file=simulation.tar simulation/results/data_00.dat
cp simulation/results/data_00.dat simulation/results/data_100.dat
```
Now compare the archive to what is on disk
```bash
tar --compare --file=simulation.tar simulation/results/data*.dat
```

### Streaming a tar archive to and from Acacia

Sometimes it can be problematic and/or slow to create an intermediate tar file. On Pawsey systems it is possible to stream the output from **tar** directly to Acacia using the **rclone rcat** command. I wouldn't recommend doing this for extremely large or mission-critical transfers though. Nor would I recommend this method for transferring data to Acacia from outside Pawsey. For those transactions you need the checksumming abilities in **rclone** and **mc**.

#### Streaming tar files to Acacia with progress

A single hyphen (-) instead of a filename tells **tar** to use standard output (input) as the destination (source). 

```bash
tar cf - simulation | rclone rcat acacia-mine:$BucketName/simulation.tar --progress
```
or tar with multicore compression
```bash
tar cf - simulation --use-compress-program="pigz" | rclone rcat acacia-mine:$BucketName/simulation.tar.gz --progress
```

When using these commands in scripts, replace **--progress** with **-q** or **--quiet** to suppress unnecessary output.
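
Because a streamed upload bypasses the checksumming that **rclone** and **mc** otherwise provide, one option is to record a checksum of the stream yourself while it is uploaded. The following is a sketch of that idea using a named pipe, not a tested Pawsey recipe. The function `upload` is a hypothetical stand-in that writes to a local file so the example runs anywhere; on a Pawsey system it would be replaced by the **rclone rcat** command shown in the comments.

```bash
#!/bin/bash
# Record a SHA-256 checksum of a tar stream while it is being uploaded.
# "upload" is a hypothetical stand-in that writes to a local file; on a Pawsey
# system it would be replaced by: rclone rcat acacia-mine:$BucketName/demo.tar
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir demo && echo "payload" > demo/file.txt

upload() { cat > uploaded.tar; }

mkfifo checksum.pipe
# Background reader: computes the checksum of whatever is written to the pipe
sha256sum < checksum.pipe | cut -d ' ' -f 1 > demo.tar.sha256 &
# tee copies the tar stream to the pipe and passes it on to the upload
tar cf - demo | tee checksum.pipe | upload
wait                  # let the checksum reader finish
rm checksum.pipe

# Later, verify by streaming the object back. Here we read the local stand-in file;
# on a Pawsey system: rclone cat acacia-mine:$BucketName/demo.tar | sha256sum
downloaded=$(sha256sum < uploaded.tar | cut -d ' ' -f 1)
[ "$downloaded" = "$(cat demo.tar.sha256)" ] && echo "checksum matches"
```

Keeping the checksum file alongside the object lets you verify the archive whenever it is staged back from Acacia.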

#### Streaming from the archive

When streaming tar files from Acacia you can use the **cat** command. Don't use it with the **--progress** flag though!

```bash
rclone cat acacia-mine:$BucketName/simulation.tar | tar xf - 
```
or with multicore de-compression to another directory
```bash
mkdir -p simulation2
rclone cat acacia-mine:$BucketName/simulation.tar.gz | tar xf - --directory=simulation2/ --use-compress-program="pigz"
```

### Chunk limits and maximum file sizes

When streaming to an S3 backend such as Acacia, the uploads are chunked with a hard limit of 10,000 chunks. By default for **rclone** and **mc** each chunk size is 5MiB, giving a maximum tar file size of just under 50GiB. You can increase the chunk size for Rclone by adding the line

```bash
chunk_size = 1G
```
to each remote in ~/.config/rclone/rclone.conf. Each entry should have the **chunk_size** flag added, like this:

```bash
[acacia-mine]
type = s3
provider = Ceph
access_key_id = <Personal Access ID>
secret_access_key = <Personal Access Key>
endpoint = https://projects.pawsey.org.au
chunk_size = 1G
```
That will increase the maximum allowed tar file size to just under 10TiB.
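
The limits quoted above follow from multiplying the chunk limit by the chunk size. This short sketch reproduces the arithmetic, assuming that rclone interprets `1G` as 1 GiB (1024 MiB). The function name `max_size` is hypothetical.

```bash
#!/bin/bash
# Maximum streamed object size = chunk limit x chunk size.
# The function name "max_size" is hypothetical, chosen for this sketch.
max_chunks=10000

max_size() {
    # $1 = chunk size in MiB; prints the limit in GiB and TiB
    awk -v n="$max_chunks" -v c="$1" \
        'BEGIN { mib = n * c; printf "%d MiB chunks: %.1f GiB (%.2f TiB)\n", c, mib / 1024, mib / 1024 / 1024 }'
}

max_size 5      # default 5 MiB chunks  -> 48.8 GiB
max_size 1024   # chunk_size = 1G       -> 9.77 TiB
```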

MinIO client also has options for increasing the chunk size. I haven't been able to find out which config file switch to use. 

> From hpc-data I was able to upload 50GB of files to Acacia using tar streaming at a rate of around 133MB/s over nearly 6 minutes. Please note there are **no integrity checks** with this streaming method. For comparison **rclone sync** with option **--transfers 12** was able to sync this same directory of files at an average rate of 671 MB/s.

## Integrating object storage in Pawsey job submission scripts

Since **/scratch** is not intended for long-term storage, it can be advantageous to incorporate Acacia into Pawsey supercomputer workflows. Acacia can be both the **origin** and **destination** for data that is processed on a Pawsey supercomputer by your job scripts. Data is **staged** in from Acacia to scratch, a **compute** job is run, then processed data is **stored** back to Acacia where it can be accessed from anywhere.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:80%">
    <img style="vertical-align:middle" src="../images/hpc_workflow.svg"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: HPC workflows with Acacia.</figcaption>
</figure>

Two pieces of information are important to make this work:

* Job dependencies
* Use of the copy queues

### Leveraging job dependencies

A typical supercomputing job script might look something like this. You would normally replace the variables in curly brackets {} with information specific to you.

```bash
#!/bin/bash --login
#SBATCH --account={account}
#SBATCH --partition={work queue}
#SBATCH --job-name=superJob
#SBATCH --nodes=2
#SBATCH --ntasks-per-node={number of mpi tasks per node}
#SBATCH --cpus-per-task={number of openmp threads per mpi task}
#SBATCH --time=24:00:00
#SBATCH --export=NONE

module load sometool

srun mkdir -p {scratchDir}
cd {scratchDir}

srun toolName arguments
```

We call this script **superscript.sh**. It is normally run like this:

```bash
sbatch superscript.sh
```
and produces the **Job ID** (e.g. 5554515) in the output at job submission.

```bash
Submitted batch job 5554515
```
The **--parsable** flag reduces the sbatch output to just the job ID, followed by a semicolon and the cluster name if one is defined.
```bash
5554515
```

If you capture the **Job ID** you can use it as a dependency for another script. Here we submit a script to stage data from Acacia and capture the Job ID using the **cut** command.

```bash
stageJobID=$(sbatch --parsable stagescript.sh | cut -d ';' -f 1)
```
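
To see what **cut** does with each form of sbatch output, the following sketch applies it to sample strings rather than to a real submission. The strings imitate sbatch output, and the cluster name `setonix` is used purely for illustration.

```bash
#!/bin/bash
# How cut behaves on the different forms of sbatch output.
# The sample strings imitate sbatch output; "setonix" is an illustrative cluster name.
plain="Submitted batch job 5554515"   # sbatch without --parsable
parsable="5554515"                    # sbatch --parsable, no cluster name
parsable_cluster="5554515;setonix"    # sbatch --parsable, cluster name defined

echo "$plain" | cut -d ' ' -f 4               # fourth space-separated word
echo "$parsable" | cut -d ';' -f 1            # no semicolon, so the whole line is returned
echo "$parsable_cluster" | cut -d ';' -f 1    # text before the semicolon
```

Each of the three commands prints `5554515`. Splitting on the semicolon keeps the job ID clean whether or not a cluster name is present.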

Now we can use that job ID to submit a job with the flags **--dependency=afterok:$stageJobID --parsable**, so that it **waits** for the staging script to finish successfully before starting.

> **You must specify the --dependency flag before the name of the job script when running sbatch!**

```bash
superJobID=$(sbatch --dependency=afterok:$stageJobID --parsable superscript.sh | cut -d ';' -f 1)
```

Job dependencies are the key to leveraging Acacia with your Pawsey jobs!

### Stage and store scripts

Stage and store scripts don't usually need much CPU. If you submit them to the **copy** queue, they **won't count** towards your allocation. A typical staging script (stagescript.sh) might look like this. Notice the use of the **copy** queue and low CPU counts in the request.

```bash
#!/bin/bash --login
#SBATCH --account={account}
#SBATCH --job-name=stageTar
#SBATCH --partition={copy_queue}
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --export=NONE

module load rclone/{rclone_version}

# Stage files from Acacia

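# Streaming approach. Use "tar xzf -" if the archive is gzip-compressed,
# because tar cannot auto-detect compression when reading from a pipe.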
mkdir -p {scratchDir}
srun rclone cat {acaciaAlias}:{acaciaInPath} | tar xf - --directory {scratchDir}/ 
```

A typical storage script (storescript.sh) might look like this:

```bash
#!/bin/bash --login
#SBATCH --account={account}
#SBATCH --job-name=storeTar
#SBATCH --partition={copy_queue}
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --export=NONE

module load rclone/{rclone_version}

# Store files to Acacia

# Streaming approach
cd {scratchDir}
srun tar cf - . | rclone rcat {acaciaAlias}:{acaciaOutPath}
```

Of course you can put any data movement technique that you prefer into the **stage** and **store** scripts. For the store script we must also depend on the supercomputing job.

```bash
storeJobID=$(sbatch --dependency=afterok:$superJobID --parsable storescript.sh | cut -d ';' -f 1)
```

### Master workflow script

All of these steps can be combined into one master control script

```bash
#!/bin/bash

# Master script to control your Pawsey workflow

# Run the staging script, there is no prior dependency
stageJobID=$(sbatch --parsable stagescript.sh | cut -d ';' -f 1)

# Now the supercomputing script, depend on staging
superJobID=$(sbatch --dependency=afterok:$stageJobID --parsable superscript.sh | cut -d ';' -f 1)

# And the storage script, depend on supercomputing
storeJobID=$(sbatch --dependency=afterok:$superJobID --parsable storescript.sh | cut -d ';' -f 1)

# Check the queues to see how we are going
squeue --me
```
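
If a submission fails, the captured job ID is empty and the following dependencies are malformed. A possible refinement, sketched here under the assumption that you want the workflow to stop at the first failed submission, is to wrap **sbatch** in a helper that checks its output. The function name `submit` is hypothetical.

```bash
#!/bin/bash
# Sketch of a submission helper that stops the workflow if sbatch fails.
# The function name "submit" is hypothetical; the script names are those used above.
submit() {
    local out id
    out=$(sbatch --parsable "$@") || { echo "sbatch failed: $*" >&2; return 1; }
    id=${out%%;*}                       # drop ";cluster" if present
    case "$id" in
        ''|*[!0-9]*) echo "unexpected sbatch output: $out" >&2; return 1 ;;
    esac
    echo "$id"
}

# Only submit when Slurm is available on this machine
if command -v sbatch > /dev/null 2>&1; then
    stageJobID=$(submit stagescript.sh) || exit 1
    superJobID=$(submit --dependency=afterok:"$stageJobID" superscript.sh) || exit 1
    storeJobID=$(submit --dependency=afterok:"$superJobID" storescript.sh) || exit 1
    squeue --me
fi
```

The helper prints only a numeric job ID, so each later step either receives a valid dependency or the script exits before submitting it.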

## Exercise 1: Enable a trial workflow using job dependencies

In this exercise we are going to use the mock data as part of an exercise in using job dependencies in a Pawsey workflow. We are going to stage data from Acacia, run a supercomputer job, and store data back to Acacia. 

### Preparation of input tar file

We first need data on Acacia. Use **cd** to change directory to the **data** directory of the course material folder and unzip the **data.zip** file if the **simulation** directory is not present.

```bash
cd data
# If simulation is not present
unzip data.zip
```
Assuming rclone is set up, we **tar** the simulation directory and upload to Acacia as follows: 

```bash
tar zcf - simulation | rclone rcat acacia-mine:$BucketName/simulation.tar.gz
```

### Editing scripts

In the **scripts** directory of the course material folder are four scripts to accomplish this:

* stagescript.sh
* superscript.sh
* storescript.sh
* masterWorkflow.sh

The script **stagescript.sh** is responsible for staging data from Acacia, **superscript.sh** is a fake supercomputer job, and **storescript.sh** stores the data back to Acacia. Your task is to edit these scripts, replacing everything in curly brackets {} with meaningful values, so that the master workflow script **masterWorkflow.sh** is able to run the scripts without issue. You will need to supply the following values while editing the scripts.

* {account} - your project name/ID.
* {acaciaAlias} - alias to access Acacia.
* {acaciaInPath} - path (on Acacia) of the input tar file, e.g. <bucket_name\>/simulation.tar.gz (with no leading forward slash!)
* {acaciaOutPath} - path (on Acacia) of the tar file to use as output e.g. <bucket_name\>/simulation_out.tar.gz
* {copy_queue} - The queue to use for copying files. On Setonix it is **copy**, on other Pawsey systems it is **copyq**.
* {debug_queue} - The queue to use for the supercomputer job. On Setonix it is **debug**, on other Pawsey systems it is **debugq**.
* {rclone_version} run the command **module spider rclone** to see which version is available.
* {scratchDir} - path (on /scratch) of the directory where the incoming tar file will be unpacked, e.g. /scratch/${PAWSEY_PROJECT}/username/working


### Running the scripts

Now that you've edited the scripts in the **scripts** directory you can make them executable and go ahead and run them.

```bash
chmod u+x *.sh
./masterWorkflow.sh
```

Success means you see output similar to the following. The text **Dependency** must appear in the REASON column for **storeTar** and **superJob**.


```bash
JOBID    USER     ACCOUNT     PARTITION            NAME EXEC_HOST ST     REASON   START_TIME     END_TIME  TIME_LEFT NODES   PRIORITY
5555102  <username>  <account>   copyq            storeTar       n/a PD Dependency          N/A          N/A    1:00:00     1      75325
5555101  <username>  <account>   debugq           superJob       n/a PD Dependency          N/A          N/A       1:00     1      75325
5555100  <username>  <account>   copyq            stageTar       n/a PD   Priority          N/A          N/A    1:00:00     1      75325
```

List the contents of your bucket to make sure data is being copied there from the job

```bash
rclone ls acacia-mine:
```

By all means download the tar file that is created, and examine the contents to make sure everything is there.


## Exercise 2: Slightly more advanced scripting (Bonus)

Keeping track of the same variables over four scripts can be prone to error. It is more robust to define variables once. The job submission command **sbatch** has the ability to read a script from standard input, so we can compose job scripts as multi-line strings and feed them directly to **sbatch** without creating separate files. In the **scripts** directory are two files:

* advancedWorkflow.sh
* advancedWorkflow.py

Each performs the same task as the four scripts in the previous exercise, however they use either [Bash Heredocs](https://tldp.org/LDP/abs/html/here-docs.html) or [Python f-strings](https://realpython.com/python-f-strings/) to compose the **stage**, **super**, and **store** job scripts as multiline strings. You only need to edit variables at the start of the script. There is some extra functionality included in these scripts that might be useful for your workflows.

* If **acaciaInPath** is not set then the stage script is not queued
* If **runSuper** is not set then the super script is not queued
* If **acaciaOutPath** is not set then the store script is not queued
* Steps in the workflow will attempt to depend on previous ones. 
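
As an illustration of the heredoc technique, the sketch below composes a staging job script as a multi-line string and prints it. It is not the course script itself, and every value assigned at the top is a hypothetical placeholder rather than a real project, bucket, or module version.

```bash
#!/bin/bash
# Compose a staging job script as a multi-line string with a heredoc.
# Every value below is a hypothetical placeholder, not a real project or bucket.
account="project123"
copyQueue="copy"
rcloneVersion="<version>"     # see: module spider rclone
acaciaAlias="acacia-mine"
acaciaInPath="my-bucket/simulation.tar"
scratchDir="/scratch/project123/username/working"

# Variables are expanded when the string is built, so they are defined only once
stageScript=$(cat <<EOF
#!/bin/bash --login
#SBATCH --account=${account}
#SBATCH --job-name=stageTar
#SBATCH --partition=${copyQueue}
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --export=NONE

module load rclone/${rcloneVersion}

mkdir -p ${scratchDir}
rclone cat ${acaciaAlias}:${acaciaInPath} | tar xf - --directory ${scratchDir}/
EOF
)

# Inspect the generated script; to submit it, pipe it to sbatch instead:
#   stageJobID=$(printf '%s\n' "$stageScript" | sbatch --parsable | cut -d ';' -f 1)
printf '%s\n' "$stageScript"
```

Printing the string first is a convenient way to check the substitutions before anything is queued.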

### Prepare the scripts

Choose a script to run (or both) and edit to set variables **account, acaciaAlias, acaciaInPath, scratchDir, and acaciaOutPath** as before. Examine the code where the job scripts are created, and see how these variables are included dynamically into the variables **stageScript**, **superScript**, and **storeScript**.

### Run the scripts

To run the script **advancedWorkflow.sh** do the following:

```bash
chmod u+x advancedWorkflow.sh
./advancedWorkflow.sh
```


The Python script **advancedWorkflow.py** needs the Python 3 module loaded

```bash
chmod u+x advancedWorkflow.py
module load python/<version>
./advancedWorkflow.py
```

Success means you see something like the following output. Note the **Dependency** text in the REASON field.

```bash
JOBID        USER ACCOUNT                   NAME EXEC_HOST ST     REASON START_TIME       END_TIME  TIME_LEFT NODES   PRIORITY
290501     <username> <account>             storeTar       n/a PD Dependency N/A                   N/A    1:00:00     1      75355
290500     <username> <account>             superJob       n/a PD Dependency N/A                   N/A       1:00     1      75355
290499     <username> <account>             stageTar       n/a PD   Priority N/A                   N/A    1:00:00     1      75355
```

As before, list the contents of your bucket to make sure data is being copied there from the job

```bash
rclone ls acacia-mine:
```

By all means download the tar file that is created, and examine the contents to make sure everything is there.


## Conclusion

If you got this far then congratulations! You have successfully run a workflow that includes Acacia as part of your supercomputing jobs. In this lesson we looked at different tar techniques and how they can work to support file uploads to Acacia. Then we looked at ways of integrating these methods into your Pawsey workflows.