# Example: Uploading an xarray dataset to zarr

This notebook explains the basic steps needed to manually upload a small to moderately sized dataset to the [persistent cloud bucket](https://infrastructure.2i2c.org/en/latest/topic/features.html?highlight=persistent%20bucket#persistent-buckets-on-object-storage) for LEAP.

You will learn the following things:

- How to upload an xarray Dataset into a cloud bucket using the zarr format
- How to share the URL and enable other LEAP users to access the data from within the JupyterHub
- How to delete data from the cloud bucket

<div class="admonition note" name="html-admonition" style="background: pink; padding: 5px">
<p class="title">Warning:🚨 Before you start!
    
Storing large amounts of data in the persistent cloud bucket can <b>dramatically increase</b> cost for the whole project.
Please be mindful of this community resource by following these rules:
<ul>
    <li>Make sure that your data is backed up in a second location. The cloud storage is meant for easy collaboration and analysis, not as a permanent archive</li>
    <li>Discuss adding more than a few GB of data with the Data and Computation Manager Julius Busecke (Slack: <i>@Julius Busecke</i>, 📧: julius@ldeo.columbia.edu)</li>
    <li>Delete data immediately when it is no longer needed</li>
</ul>
</p>
</div>

In [1]:
import xarray as xr
import gcsfs
import os

First we need to identify the persistent cloud bucket and user folder to which we will upload our data. This information is stored in the environment variable `"PERSISTENT_BUCKET"`.

> Note: You can use the same workflow to store temporary data by using the `"SCRATCH_BUCKET"` environment variable instead (this data is deleted after 7 days).

In [2]:
os.environ['PERSISTENT_BUCKET']

'gs://leap-persistent/jbusecke'

👆 This should show a path that includes your GitHub username.
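
Rather than typing the prefix by hand, the target path for a new store can be assembled from this variable. The sketch below is one way to do that. `bucket_path` is a hypothetical helper name made up for this example (it is not part of any library), and the `environ` argument exists only so the function can be tried outside the hub.

```python
import os


def bucket_path(relative_path, env_var="PERSISTENT_BUCKET", environ=None):
    """Join your personal bucket prefix with a path relative to it.

    `bucket_path` is a hypothetical helper written for this example.
    """
    # Use the real environment unless a stand-in mapping is passed in.
    environ = os.environ if environ is None else environ
    prefix = environ[env_var]  # raises KeyError if the variable is not set
    # Avoid doubled or missing slashes at the join.
    return prefix.rstrip("/") + "/" + relative_path.lstrip("/")


# Stand-in for the hub environment, using the value shown above.
demo_env = {"PERSISTENT_BUCKET": "gs://leap-persistent/jbusecke"}
print(bucket_path("testing/demo_write.zarr", environ=demo_env))
```

On the hub, calling `bucket_path("testing/demo_write.zarr")` without `environ` reads the real variable, and passing `env_var="SCRATCH_BUCKET"` should give the scratch equivalent.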

For the sake of this example I am going to create a very simple dataset in the notebook, but you can substitute any other `xarray.Dataset` (e.g. one loaded from a netCDF file via `xr.open_dataset()`).

In [3]:
ds = xr.DataArray([1, 4, 6]).to_dataset(name='data')
ds
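
Before uploading a real dataset it may be worth checking how large it is, given the cost warning at the top. The sketch below formats a byte count for reading; `human_size` is a hypothetical helper, and the choice of decimal units (1 GB = 10**9 bytes) is my assumption.

```python
def human_size(nbytes):
    """Format a byte count using decimal units (1 GB = 10**9 bytes).

    `human_size` is a hypothetical helper written for this example.
    """
    size = float(nbytes)
    for unit in ("B", "kB", "MB", "GB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000  # move on to the next larger unit
    return f"{size:.1f} TB"


print(human_size(24))
print(human_size(3_500_000_000))
```

In the notebook you would call `human_size(ds.nbytes)`. Note that `ds.nbytes` reports the uncompressed in-memory size, so the size of the zarr store in the bucket can differ.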

To save a zarr store much as you would on a traditional filesystem, we need to set up a cloud filesystem and a mapper (a dictionary-like view of a path in the bucket).

In [4]:
fs = gcsfs.GCSFileSystem()
mapper = fs.get_mapper("gs://leap-persistent/jbusecke/testing/demo_write.zarr")

This mapper can then be used much like a filepath to save an xarray Dataset as a zarr store using `xr.Dataset.to_zarr()`.

In [5]:
ds.to_zarr(mapper)

<xarray.backends.zarr.ZarrStore at 0x7fdd7ec9a340>
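
If you rerun the write cell, it helps to be explicit about what should happen when the store already exists. The sketch below wraps the steps above in a function that refuses to replace existing data unless asked. `write_store` is a hypothetical helper name. It assumes `ds` is an `xarray.Dataset` and `fs` is the `gcsfs.GCSFileSystem` created earlier, and it relies on the `mode="w"` (overwrite) and `mode="w-"` (fail if the store exists) options of `to_zarr` as I understand them.

```python
def write_store(ds, fs, path, overwrite=False):
    """Write `ds` to `path` on `fs`, refusing to clobber data by default.

    `write_store` is a hypothetical helper written for this example.
    `ds` is expected to be an xarray.Dataset and `fs` a gcsfs.GCSFileSystem.
    """
    if fs.exists(path) and not overwrite:
        raise FileExistsError(
            f"{path} already exists; pass overwrite=True to replace it"
        )
    mapper = fs.get_mapper(path)
    # mode="w" replaces an existing store, mode="w-" fails if one exists.
    ds.to_zarr(mapper, mode="w" if overwrite else "w-")
    return path
```

With the objects from this notebook, the call would look like `write_store(ds, fs, "gs://leap-persistent/jbusecke/testing/demo_write.zarr")`.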

## Checking that files were written

You can use the convenience method `.ls`, which works like the Unix `ls` command in a terminal, to list the contents of your bucket (or its subfolders).

In [6]:
fs.ls("gs://leap-persistent/jbusecke/testing/")

['leap-persistent/jbusecke/testing/another_store.zarr',
 'leap-persistent/jbusecke/testing/demo_write.zarr']

🎉 You just wrote your first data to the cloud!
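
Listing shows that the store exists, but not how much space it takes up. Since storage cost is the main concern with this bucket, here is a minimal sketch for checking the stored size. `stored_size_gb` is a hypothetical helper name, and it assumes `fs` offers the `exists` and `du` methods that fsspec filesystems such as `gcsfs.GCSFileSystem` provide.

```python
def stored_size_gb(fs, path):
    """Return the total size of everything under `path`, in decimal GB.

    `stored_size_gb` is a hypothetical helper written for this example.
    """
    if not fs.exists(path):
        raise FileNotFoundError(path)
    total_bytes = fs.du(path)  # sums the sizes of all objects below `path`
    return total_bytes / 1e9
```

In this notebook, `stored_size_gb(fs, "gs://leap-persistent/jbusecke/testing/")` should report the combined size of all stores in the `testing` folder.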

## How to access the data from the cloud

This data is now readable by every LEAP member!
If you want to enable a collaborator to easily load the data into an xarray Dataset, you can give them the following snippet (or add it to a notebook in a GitHub repository):

In [7]:
import xarray as xr
import gcsfs

fs = gcsfs.GCSFileSystem()
mapper = fs.get_mapper("leap-persistent/jbusecke/testing/demo_write.zarr")
ds_new = xr.open_dataset(mapper, engine='zarr')
ds_new

Make sure to hardcode the path 👆 so that another user does not accidentally use the path defined in their own `os.environ["PERSISTENT_BUCKET"]`.
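
One way to avoid mistakes when hardcoding is to generate the snippet from the path you actually wrote to. The sketch below builds the text of the cell above with the path filled in; `sharing_snippet` is a hypothetical helper name, not an existing tool.

```python
def sharing_snippet(store_path):
    """Return Python source that opens `store_path` via a hardcoded path.

    `sharing_snippet` is a hypothetical helper written for this example.
    """
    lines = [
        "import xarray as xr",
        "import gcsfs",
        "",
        "fs = gcsfs.GCSFileSystem()",
        # repr() quotes the path so it appears as a string literal.
        f"mapper = fs.get_mapper({store_path!r})",
        "ds = xr.open_dataset(mapper, engine='zarr')",
    ]
    return "\n".join(lines)


print(sharing_snippet("leap-persistent/jbusecke/testing/demo_write.zarr"))
```

The printed text can be pasted into a message or a notebook. It contains no reference to `os.environ`, so it resolves to the same store for every user.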

## How to delete existing cloud stores

As mentioned above, you should delete data as soon as you are no longer working on it. Again you can use a convenience method, `.rm`, on the cloud filesystem object, which works similarly to the [rm](https://en.wikipedia.org/wiki/Rm_(Unix)) Unix command in the shell.

> As with the `rm -r` command on Linux, make sure you know exactly what you are deleting!

In [8]:
fs.rm('leap-persistent/jbusecke/testing/demo_write.zarr', recursive=True)

OK, let's quickly check that the store is indeed deleted.

In [9]:
fs.ls('leap-persistent/jbusecke/testing')

['leap-persistent/jbusecke/testing/another_store.zarr']

And indeed `demo_write.zarr` is gone. 
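
Because a recursive delete cannot be undone, you might want a small guard around `fs.rm`. The sketch below only deletes paths that sit strictly inside a folder you name, so a typo cannot remove the folder itself or someone else's data. `remove_store` and its `allowed_prefix` argument are hypothetical names for this example, and it assumes `fs` is the `gcsfs.GCSFileSystem` used above.

```python
def remove_store(fs, path, allowed_prefix):
    """Recursively delete `path`, but only if it lies below `allowed_prefix`.

    `remove_store` is a hypothetical helper written for this example.
    """
    def normalise(p):
        # fs.ls returns paths without the protocol, so drop "gs://" if present.
        if p.startswith("gs://"):
            p = p[len("gs://"):]
        return p.strip("/")

    target = normalise(path)
    prefix = normalise(allowed_prefix)
    # Require the target to be strictly inside the prefix, not the prefix itself.
    if not target.startswith(prefix + "/"):
        raise ValueError(
            f"{path!r} is not inside {allowed_prefix!r}; nothing deleted"
        )
    fs.rm(target, recursive=True)
    return target
```

For the example in this notebook, the call would be `remove_store(fs, "leap-persistent/jbusecke/testing/demo_write.zarr", "gs://leap-persistent/jbusecke")`.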