# Example: Uploading an xarray dataset to zarr

This notebook explains the basic steps needed to manually upload a small to moderately sized dataset to the [persistent cloud bucket](https://infrastructure.2i2c.org/en/latest/topic/features.html?highlight=persistent%20bucket#persistent-buckets-on-object-storage) for LEAP.

You will learn the following things:

- How to upload an xarray Dataset into a cloud bucket using the zarr format
- How to share the URL and enable other LEAP users to access the data from within the JupyterHub
- How to delete data from the cloud bucket

<div class="admonition note" name="html-admonition" style="background: pink; padding: 5px">
<p class="title">Warning: 🚨 Before you start!
    
Storing large amounts of data in the persistent cloud bucket can <b>dramatically increase</b> cost for the whole project.
Please be mindful of this community resource by following these rules:
<ul>
    <li>Make sure that your data is backed up in a second location. The cloud storage is meant for easy collaboration and analysis, not as a permanent archive</li>
    <li>Discuss adding more than a few GB of data with the Data and Computation Manager Julius Busecke (Slack: <i>@Julius Busecke</i>, 📧: julius@ldeo.columbia.edu)</li>
    <li>Delete data immediately when it is no longer needed</li>
</ul>
</p>
</div>

# Importing packages

In [1]:
import os

import gcsfs
import xarray as xr
from dask.diagnostics import ProgressBar
from google.oauth2.credentials import Credentials

# Creating functions and defining local filepaths

In [2]:
# local filepath

data_path_local = '/ocean/projects/atm200007p/jlin96/longSPrun_o3/'

# functions for loading in data


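def ls(data_path=''):
    """List the entries of data_path (current directory if empty), sorted by name."""
    return sorted(os.listdir(data_path or '.'))


def get_filenames(month, year, data_path):
    """Return full paths of files whose name contains 'h1.YYYY-MM'."""
    month = str(month).zfill(2)
    year = str(year).zfill(4)
    tag = 'h1.' + year + '-' + month
    # data_path is expected to end with '/', as data_path_local does
    return [data_path + x for x in ls(data_path) if tag in x]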
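
To see what the substring filter in `get_filenames` selects, the following minimal sketch applies the same `h1.YYYY-MM` test to a temporary directory. The function name `select_month` and the file names are made up for illustration; only the `h1.` year-month pattern comes from the notebook.

```python
import os
import tempfile


def select_month(month, year, data_path):
    """Return sorted full paths of files whose name contains 'h1.YYYY-MM'."""
    tag = f"h1.{int(year):04d}-{int(month):02d}"
    return [
        os.path.join(data_path, name)
        for name in sorted(os.listdir(data_path))
        if tag in name
    ]


# Hypothetical file names, created empty in a throwaway directory
demo_names = [
    "run.h1.0000-02-01.nc",  # matches year 0, month 2
    "run.h1.0000-03-01.nc",  # wrong month
    "run.h0.0000-02.nc",     # wrong stream (h0 instead of h1)
    "run.h1.0001-02-01.nc",  # wrong year
]

with tempfile.TemporaryDirectory() as tmp:
    for name in demo_names:
        open(os.path.join(tmp, name), "w").close()
    selected = [os.path.basename(p) for p in select_month(2, 0, tmp)]

print(selected)
```

Only the first file survives the filter, because the year and month are zero-padded before matching.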

# Load data in locally

In [3]:
# getting the filenames in a single list: months 2-12 of year 0, then months 1-12 of year 1

filenames = []
for i in range(11):
    filenames = filenames + get_filenames(month=i + 2, year=0, data_path=data_path_local)
for i in range(12):
    filenames = filenames + get_filenames(month=i + 1, year=1, data_path=data_path_local)
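
The two loops above cover 23 months in total. As a quick check of the index arithmetic, this sketch lists the `h1.YYYY-MM` tags the loops ask for, without touching the file system. The names `periods` and `tags` are introduced here for illustration only.

```python
# (year, month) pairs requested by the two loops
periods = [(0, i + 2) for i in range(11)] + [(1, i + 1) for i in range(12)]

# The substring each call to get_filenames looks for
tags = [f"h1.{year:04d}-{month:02d}" for year, month in periods]

print(len(tags), tags[0], tags[-1])
```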

In [4]:
# loading in the data

spData = xr.open_mfdataset(filenames, compat='override', join='override', coords='minimal')

In [5]:
# checking the size of the data in GB (1e9 bytes)

spData.nbytes / 1e9

371.678456584

In [None]:
# splitting the data into two parts along time, one per upload

spData1 = spData.isel(time=slice(0, 18000))
spData2 = spData.isel(time=slice(18000, None))
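
The split index 18000 is hard-coded in the cell above. If a dataset needs more than two parts, a small helper can compute contiguous slices of near-equal length. This is only a sketch: `time_slices` is a hypothetical helper, not part of xarray, and the numbers in the demo are arbitrary.

```python
def time_slices(n_time, n_parts):
    """Split range(n_time) into n_parts contiguous slices of near-equal length."""
    base, extra = divmod(n_time, n_parts)
    slices = []
    start = 0
    for i in range(n_parts):
        # The first `extra` parts get one additional element
        stop = start + base + (1 if i < extra else 0)
        slices.append(slice(start, stop))
        start = stop
    return slices


parts = time_slices(n_time=10, n_parts=3)
print(parts)
```

Each resulting slice could then be passed to `isel`, for example `spData.isel(time=parts[0])`.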

In [None]:
# # Unit test to check if file was saved correctly
# checking = xr.open_dataset("spData_Bridges2.zarr", engine="zarr", decode_times = False)
# print(spData == checking)
# print(spData.identical(checking))

# On leap.2i2c.cloud

### Step 1
Start an instance.
### Step 2
Open terminal.
### Step 3
Get a temporary token:

- (On JupyterHub) Install google-cloud-sdk on the running server with the terminal command `mamba install google-cloud-sdk`
- (On JupyterHub) Generate a token with `gcloud auth print-access-token`
- (On HPC) Copy the token into a text file `token.txt`
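
Access tokens generated this way are short-lived (typically about an hour), so it can be worth checking that `token.txt` actually contains something before starting a long upload. The sketch below is one way to do that. `read_token` is a hypothetical helper, and the demo writes a fake token to a temporary file.

```python
import os
import tempfile


def read_token(path):
    """Read an access token from a text file and fail early if it is empty."""
    with open(path) as f:
        token = f.read().strip()
    if not token:
        raise ValueError(f"{path} is empty; generate a new token first")
    return token


# Demo with a throwaway file and a fake token value
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "token.txt")
    with open(path, "w") as f:
        f.write("  fake-token-for-illustration\n")
    token = read_token(path)

print(token)
```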

# Upload data to LEAP

## Token 1

In [None]:
with open('token.txt') as f:
    access_token = f.read().strip()

# build a credentials object from the access token
credentials = Credentials(access_token)

In [None]:
fs = gcsfs.GCSFileSystem(token=credentials)
mapper = fs.get_mapper('leap-persistent/jerrylin96/spData_Bridges2.zarr')
with ProgressBar():
    spData1.to_zarr(mapper, mode='w')

## Token 2

In [None]:
# REQUEST A BRAND NEW TOKEN BEFORE RUNNING THE NEW CELLS

In [7]:
with open('token.txt') as f:
    access_token = f.read().strip()

# build a credentials object from the access token
credentials = Credentials(access_token)

In [8]:
fs = gcsfs.GCSFileSystem(token=credentials)
mapper = fs.get_mapper('leap-persistent/jerrylin96/spData_Bridges2.zarr')
with ProgressBar():
    spData2.to_zarr(mapper, mode='a', append_dim='time')

[########################################] | 100% Completed | 1.06 s
[########################################] | 100% Completed | 1.16 s
[########################################] | 100% Completed | 39m 47s


In [None]:
# read the uploaded store back from the bucket (mapper is defined in the upload cell)
checking = xr.open_dataset(mapper, engine='zarr', decode_times=False)

In [None]:
spData == checking