# Getting started with Acacia

Rclone is the preferred and supported client for Acacia. For more advanced usage of Acacia you may want to look at the [AWS command line interface](https://docs.aws.amazon.com/cli/latest/reference/s3/), or at the [AWS boto3 library](https://aws.amazon.com/sdk-for-python/) to use Acacia from Python.

In this tutorial we are going to explore:

* Getting help
* Working with buckets
    * Creating, listing, and removing buckets
* Working with files and metadata
    * Copying simple files
    * Copying multiple files
    * Adding and extracting metadata
    * Constructing URLs for download and upload
    * Removing objects
    * Synchronising directories to Acacia

## Getting help for Rclone

The complete guide to Rclone commands is [here](https://rclone.org/commands/). To get help from the Rclone client use these commands, replacing **command** in the second one with the name of an rclone command:

```bash
rclone --help
```
```bash
rclone command --help
```

**Exercise: Look up help on the "mkdir" command with rclone**

## Buckets as storage containers

Buckets are basic containers for grouping objects. There is no concept of a sub-bucket, so the similarity to a folder ends there! 

### Important information

On Acacia: 

* A user or project may each have up to 1000 buckets.
* A bucket may hold up to a million objects (fewer than 100,000 is preferred).
* A user has 100 GB of personal storage.
* Projects are given 1 TB of storage by default.

The bucket name:

* **Must** be unique across the system!
* **Must** not contain any confidential information, because bucket names can potentially be made public using links
    * Usernames can be a target for an attack
    * Email addresses can be exploited
    * Passwords and secret keys are obviously not OK!
    * Initials are OK if they aren't a Pawsey username
* **Must** be 3-63 characters long
* **Must** begin and end with a lowercase letter or a digit
* **Must** be lowercase
* **May** contain lowercase letters, numbers, hyphens (-) and periods (.)
* **Must** not be formatted as an IP address (e.g. 192.168.0.5), as that would be problematic for uniqueness!
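
The format rules above can be checked before you try to create a bucket. The following is a minimal sketch: `is_valid_bucket_name` is a hypothetical helper name, not part of rclone, and it only tests the format rules (length, allowed characters, first and last character, IP address form). It cannot tell you whether a name is unique or whether it contains confidential information.

```bash
# Hypothetical helper: returns 0 if the name satisfies the format rules
# listed above, 1 otherwise. Uniqueness and confidentiality are NOT checked.
is_valid_bucket_name() {
    name="$1"
    # 3-63 characters drawn from lowercase letters, digits, hyphens and
    # periods, beginning and ending with a lowercase letter or a digit
    printf '%s\n' "$name" |
        LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$' || return 1
    # Reject names formatted like an IP address, e.g. 192.168.0.5
    if printf '%s\n' "$name" |
        LC_ALL=C grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'; then
        return 1
    fi
    return 0
}
```

The function is silent and only sets an exit status, so it can be combined with other commands, for example `is_valid_bucket_name "courses01-acacia-workshop-2024" && echo "format looks fine"`.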

One way to create a unique bucket name is to incorporate something related to your work, e.g. the project name. For example, this is a valid bucket name:

```bash
courses01-acacia-workshop-2024
```

### Creating buckets

Here is the `rclone` command to create a bucket. Replace **\$BucketName** with the name you chose for the bucket.

```bash
rclone mkdir acacia-mine:$BucketName
```

> If you forget to use the colon (:) with rclone and use a forward slash (/) instead, then it starts working with your local filesystem!

Once a bucket is created it cannot be renamed. The fallback is to make a new bucket and transfer the objects to it.

**Exercise: Come up with a unique bucket name and make your own bucket on Acacia.**

There might even be a clash as your colleagues might make the same bucket name!

There will be quite a bit of copy-paste in the next few sections. To save typing, you can store your chosen bucket name in a shell variable.

In Windows PowerShell:

```powershell
Set-Variable -Name "BucketName" -Value "<insert bucket name here>"
```

In Bash shells, such as on Pawsey:

```bash
BucketName="<insert bucket name here>"
```


### Listing available buckets and objects

The **ls** family of commands in Rclone lists available buckets and objects:

* **ls** lists objects only, **lsl** gives more information
* **lsd** lists buckets and pseudofolders
* **lsf** lists objects and buckets in an easy-to-parse format. Must be used with a bucket name.
* **lsjson** lists objects and buckets in JSON format

```bash
rclone lsd acacia-mine:
```

```bash
rclone lsf acacia-mine:
```

### Removing buckets

You can remove an empty bucket with this command.

```bash
rclone rmdir acacia-mine:$BucketName
```


If the bucket contains objects and you want to delete everything in it, you can force the removal with this command.

```bash
rclone purge acacia-mine:$BucketName
```

As this command is dangerous, `rclone` supports the **--dry-run** option to perform a non-destructive test run. The Rclone help messages say that all flags come after the destination.

```bash
rclone purge acacia-mine:$BucketName --dry-run
```
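
One way to make the dry run a habit is to wrap both steps in a small shell function. This is only a sketch: `purge_with_preview` is a hypothetical helper name, not an rclone feature. It runs the dry run first, then asks for confirmation before running the real purge.

```bash
# Hypothetical helper: preview a purge with --dry-run, then ask before
# deleting anything for real.
purge_with_preview() {
    target="$1"
    # Show what would be deleted without deleting anything
    rclone purge "$target" --dry-run || return 1
    # Ask for explicit confirmation (the prompt is written to stderr)
    printf 'Really purge %s? Type yes to continue: ' "$target" >&2
    read -r answer
    if [ "$answer" = "yes" ]; then
        rclone purge "$target"
    else
        echo "Aborted, nothing was deleted."
    fi
}
```

It would be called as `purge_with_preview acacia-mine:$BucketName`. Anything other than the exact word `yes` leaves the bucket untouched.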

**Exercise: Using one of the commands above, remove your bucket and make a new one.**


## Mock data

Now we can start working with data using Rclone. If you haven't already prepared the mock data, then follow the instructions at <a href="../T1_Getting_Access/L5_Mock_data.html">T1_Getting_Access -> L5_Mock_data</a> to unpack the mock data for working with Acacia.

In the **command line** change directory to **data -> simulation -> results**. On Windows this is:

```powershell
cd C:\path\to\acacia_training\data\simulation\results
```

On Linux and macOS change directory to the same **results** directory using something like this:

```bash
cd /path/to/acacia_training/data/simulation/results
```

## Working with files as objects

An object is a file that is uploaded to the data store.

### Limits

There is no minimum or maximum limit on the **size** of objects that can be uploaded to Acacia. Only the number of objects per bucket (1 million max, 100,000 preferred) and your storage allocation are the limits. 

The object name:

* **Must** be unique within a bucket. 
* **May** contain alpha-numeric characters
    * 0-9
    * A-Z
    * a-z
* **May** also contain these special characters:
    * Forward slash ( **/** )
    * Exclamation point ( **!** )
    * Hyphen or dash ( **-** )
    * Underscore ( **\_** ), as used in the file names of this tutorial
    * Period ( **.** )
    * Asterisk ( **\*** )
    * Single quote (')
    * Open and close parentheses (())

### Copy a file to the object store

#### Simple copies **to** Acacia

Below we use `rclone` to copy a single file to Acacia. You can enable a progress bar with the **--progress** option.

```bash
rclone copy data_00.dat acacia-mine:$BucketName/ --progress
```

Congratulations, you just copied your first file to Acacia! It is important to remember that `rclone` works a bit like **rsync** and only copies if the **size** or **modification time** have changed. 

**Exercise: Make another bucket and copy the file again to the new bucket.**

This shows that different buckets can have the same object name.

#### Simple copies **from** Acacia

We can copy a single file from Acacia to your system in much the same way.

```bash
rclone copy acacia-mine:$BucketName/data_00.dat .
```

> When copying files from Acacia to /scratch with **rclone**, always set the flag **--local-no-set-modtime** so that rclone doesn't set an old modification time on the copied files. Files with old modification times get deleted as a result of the 30-day data retention policy.

```bash
rclone copy acacia-mine:$BucketName/data_00.dat . --local-no-set-modtime
```
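
If you copy to /scratch often, the flag is easy to forget. The sketch below wraps `rclone copy` in a hypothetical helper, `copy_from_acacia`, which adds **--local-no-set-modtime** whenever the destination path is /scratch or lies beneath it. It assumes the destination is reached through a path that literally begins with /scratch. Any extra arguments are passed through to rclone.

```bash
# Hypothetical helper: copy from Acacia and add --local-no-set-modtime
# automatically when the local destination is under /scratch.
copy_from_acacia() {
    src="$1"
    dest="$2"
    shift 2
    # Turn the destination into an absolute path if it already exists,
    # otherwise fall back to the path exactly as given
    abs_dest=$(cd "$dest" 2>/dev/null && pwd) || abs_dest="$dest"
    case "$abs_dest" in
        /scratch|/scratch/*)
            rclone copy "$src" "$dest" --local-no-set-modtime "$@" ;;
        *)
            rclone copy "$src" "$dest" "$@" ;;
    esac
}
```

A call such as `copy_from_acacia acacia-mine:$BucketName/data_00.dat . --progress` then behaves like the plain copy command everywhere except on /scratch.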

#### Pseudofolders

All storage within a bucket is flat. However, when working with Acacia using `rclone` we can prepend any number of pseudofolders (separated by forward slashes) to the name of the object in order to keep file names unique. In this instance we use the pseudofolder **test**. Notice that we don't need to make the **test** pseudofolder beforehand.

```bash
rclone copy data_00.dat acacia-mine:$BucketName/test/ --progress 
```

```bash
rclone ls acacia-mine:$BucketName
```

In such instances the text **test/** is prepended to the name of the object. In a similar manner we can use the pseudofolder when copying back from object storage.

```bash
rclone copy acacia-mine:$BucketName/test/data_00.dat . --progress
```

#### Copies with checksums

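Rclone can use MD5 checksums when copying data. With the `--checksum` option it decides whether a file needs copying by comparing checksum and size, rather than modification time and size: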

```bash
rclone copy data_00.dat acacia-mine:$BucketName --checksum --progress
```

### Copy multiple files by selection

When copying multiple files with rclone you select the directory to copy and use the `--include` option to select files using glob-style patterns (similar to shell wildcards, not full regular expressions). See [this page](https://rclone.org/filtering/) for more information.

```bash
# Copy to Acacia, don't forget the dot .
rclone copy . acacia-mine:$BucketName/ --include "data_0*.dat" --progress
```
```bash
# Copy from Acacia, don't forget the dot .
rclone copy acacia-mine:$BucketName/ . --include "data_0*.dat" --progress
```

### Moving objects

Rclone supports moving files on Acacia. Be careful with this command as it can overwrite existing objects.

```bash
rclone moveto acacia-mine:$BucketName/data_00.dat.temp acacia-mine:$BucketName/data_00.dat
```

### Working with metadata

Acacia has the ability to associate metadata with objects in the form of **key:value** pairs. This is very useful because the right metadata might save an expensive retrieval operation.

For example, we have two key:value pairs that we would like to associate with objects uploaded to our bucket:

```python
"island": "Rottnest"
"season": "winter"
```

#### Adding metadata

With **rclone**, uploading metadata is slightly complex: the key:value pairs are sent as HTTP header entries whose names are prefixed by **X-Amz-Meta-**. Each key:value pair must have its own **--header** option.

```bash
rclone copy data_00.dat acacia-mine:$BucketName/ --header "X-Amz-Meta-<key>: <value>" --header "X-Amz-Meta-<key>: <value>"
```
For our example above, this is:

```bash
rclone copy data_00.dat acacia-mine:$BucketName/ --header "X-Amz-Meta-island: Rottnest" --header "X-Amz-Meta-season: winter"
```
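
Typing one **--header** option per pair becomes tedious with more metadata. The sketch below builds the options from `key=value` arguments. `copy_with_metadata` is a hypothetical helper name, and it assumes every extra argument has the form `key=value`.

```bash
# Hypothetical helper: copy a file and attach metadata given as key=value
# arguments, one --header option per pair.
copy_with_metadata() {
    file="$1"
    dest="$2"
    shift 2
    # Replace each key=value argument with: --header "X-Amz-Meta-key: value"
    for pair in "$@"; do
        shift
        set -- "$@" --header "X-Amz-Meta-${pair%%=*}: ${pair#*=}"
    done
    rclone copy "$file" "$dest" "$@"
}
```

With this helper, `copy_with_metadata data_00.dat acacia-mine:$BucketName/ island=Rottnest season=winter` runs the same rclone command as the example above.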

> Note: **Metadata is only created when an object is created**. This is rather disappointing. With rclone, if a file is already on the object store then the above command will **not** update the metadata. In such instances the only way I can see to update the metadata is to use the option **--no-check-dest**, which will upload the data again.

#### Extracting metadata

You can use the `lsjson` command with the `-M` option to extract metadata. To extract metadata for a single file, add the `--stat` option.

```bash
rclone lsjson -M --stat acacia-mine:$BucketName/data_00.dat
```

### Constructing URLs for sharing your data

#### Creating a download link

Sometimes you need to share data with someone else. Rclone supports the creation of a publicly accessible link using this command:

```bash
rclone link acacia-mine:$BucketName/data_00.dat --expire 1d
```

The maximum time allowed for a valid link is 1 week. You can use the **--expire** option to set a shorter duration. Once a link is created, it appears that it cannot be revoked before it expires.

**Exercise: create your own download link to the file data_00.dat**

Test the link by downloading the file with a web browser.

### Removing objects

In addition to the brute force bucket removal options, `rclone` has the ability to remove individual objects.

#### Simple removal

If you just need to delete one file here is the command:

```bash
rclone delete acacia-mine:$BucketName/data_00.dat
```

#### Removal of more than one file

Rclone has filters to select objects when deleting. Just use the **--include** option and you can do glob-style pattern matching.

```bash
rclone delete acacia-mine:$BucketName --include "data_*.dat"
```

**Exercise: use one of the delete commands above to remove all .dat files from your test bucket**

### Synchronising folder structures to and from Acacia

Object storage is indeed flat within a bucket, but we can replicate a directory structure using **pseudofolders**. Both the MinIO client (**mc**) and rclone have the ability to recursively synchronise a directory to the object store, and each will only copy files that need copying. Recent Pawsey experience suggests that rclone is the more robust tool for this task, with one huge caveat: **rclone sync** deletes anything at the destination that is not present in the source!

#### Mirror the contents of a local directory to Acacia

In your terminal use the **cd** command to change directory to **data**, the one above **simulation**. We are going to mirror the **contents** of the **simulation** directory to an Acacia bucket and prepend **test** to object names so that the contents of simulation appear to be in the **test** pseudofolder.

```bash
rclone sync simulation acacia-mine:$BucketName/test/ --progress
```

#### Parallel transfers

With **rclone** we can also specify how many transfers to perform in parallel with the **--transfers** flag.

```bash
rclone delete acacia-mine:$BucketName/test
rclone sync simulation acacia-mine:$BucketName/test/ --progress --transfers 12
```

The Pawsey team recommends **--transfers 12**, but you can certainly try your own number of parallel transfers to see which is optimal. 

#### Excluding files 

You may also use multiple **--exclude** options with **rclone** to exclude certain files based on glob-style patterns.

```bash
rclone sync simulation acacia-mine:$BucketName/test/ --progress --exclude "*.log"
```

#### Checksums for verified copies

As with the copy command, **rclone sync** has the option of using MD5 checksums to ensure data consistency.

```bash
rclone sync simulation acacia-mine:$BucketName/test/ --progress --transfers 12 --checksum
```

You can also use the **check** command with rclone to compare a directory of files against Acacia. The **--combined** option writes a report of the comparison to a file:

```bash
rclone check simulation acacia-mine:$BucketName/test/ --checksum --combined report.txt
```
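
In the combined report each line starts with a symbol followed by the path: `=` for identical files, `-` for files only in the destination, `+` for files only in the source, `*` for files that differ and `!` for errors. For a large directory a per-category tally is easier to read than the full report. The following sketch uses a hypothetical helper name, `summarise_check_report`, and only needs `awk`.

```bash
# Hypothetical helper: tally the lines of an "rclone check --combined"
# report by their leading symbol.
summarise_check_report() {
    awk '
        # The first character of each line is the status symbol
        { count[substr($0, 1, 1)]++ }
        END {
            printf "identical: %d\n", count["="]
            printf "only in destination: %d\n", count["-"]
            printf "only in source: %d\n", count["+"]
            printf "different: %d\n", count["*"]
            printf "errors: %d\n", count["!"]
        }
    ' "$1"
}
```

After the check command above you would run `summarise_check_report report.txt`. Any non-zero count other than `identical` deserves a closer look at the report itself.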

#### Mirror the contents of an Acacia pseudofolder to a local directory

Synchronising a folder structure back from Acacia works much the same way. Here we use another local directory called **simulation2** and synchronise the structure back from Acacia. Note that **rclone sync** will **destructively synchronise** a local directory: any local file that is not present on Acacia is deleted!

```bash
rclone sync acacia-mine:$BucketName/test/ simulation2/
ls simulation2
# Notice that simulation2/temp is now gone
```

> It is advisable **not** to use **rclone sync** to move data from Acacia to a local filesystem as it could destroy data! Use a copy instead.

#### **Exercise: use rclone to mirror the simulation directory to your chosen bucket name.** 

Feel free to use the **--exclude** option to omit, for example, the log files.

> Note: empty directories can't be represented directly on the object store. If you need that functionality then use tar to preserve file and directory structures. Try creating an empty directory in **simulation** and see if it is replicated to Acacia.

### Checking the size and number of objects in a storage location

From time to time you might need to count the number of objects in a bucket. You can do this at the [portal](https://portal.pawsey.org.au): click on ACACIA and then on your storage access.

<figure style="margin-bottom 3em; margin-top: 2em; margin-left:auto; margin-right:auto; width:80%">
    <img style="vertical-align:middle" src="../images/s3_bucket_viewer.png"> <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Figure: S3 bucket viewer on the Pawsey portal.</figcaption>
</figure>

Alternatively, you can use the rclone **size** command to tally the number of objects in a storage location.

```bash
# Get the size and number of all objects
rclone size acacia-mine:

# Get the size and number of objects in a bucket
rclone size acacia-mine:$BucketName

# Get the size and number of objects in a pseudofolder called temp
rclone size acacia-mine:$BucketName/temp
```
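
The object count can also be compared against the per-bucket limits mentioned earlier (1 million maximum, fewer than 100,000 preferred). The sketch below uses the `--json` option of `rclone size` and extracts the `count` field with `sed`. `check_object_count` is a hypothetical helper name, and the messages are only illustrative.

```bash
# Hypothetical helper: report how a bucket's object count compares with
# the limits of 1 million (maximum) and 100,000 (preferred).
check_object_count() {
    # rclone size --json prints a JSON object that includes "count"
    count=$(rclone size "$1" --json |
        sed -n 's/.*"count":[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    if [ -z "$count" ]; then
        echo "Could not read the object count for $1" >&2
        return 2
    fi
    if [ "$count" -ge 1000000 ]; then
        echo "$count objects: at the 1 million per-bucket limit"
    elif [ "$count" -gt 100000 ]; then
        echo "$count objects: above the preferred 100,000 per bucket"
    else
        echo "$count objects: within the preferred range"
    fi
}
```

It would be called on a single bucket, for example `check_object_count acacia-mine:$BucketName`.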

### A brief conversation about permissions on project storage

Permissions on **project** storage are an **advanced** topic and still under investigation by Pawsey staff. However, if you are using Acacia project storage, here are some best practices to follow when the time comes.

* By default **rclone** creates objects and buckets with **private** permissions. For personal storage this means no other ordinary user has access to it. In the context of project storage, private permissions means **private to your project**. All users in your project have **full access** to files in project storage on Acacia. No other ordinary users have access to project storage, but you can make access available using URLs.
* Don't use Amazon S3 ACLs (Access Control Lists). They are a relic from the early history of the S3 storage protocol and recent Pawsey experience has shown them to be potentially dangerous. An Acacia user locked themselves out **of their own bucket** using ACLs and lost their data.
* If you need to make certain files in your project storage **read-only**, then use an object lock such as a **legal hold**.
* If you need to share data with someone else outside your project then:
    * Use URLs to share data with people outside your project.
    * Look at applying **policies** to buckets and objects to share them in a specific way.

## Clean up

You really don't want your test buckets taking up space. Purge them with the following command:

```bash
rclone purge acacia-mine:$BucketName
```