# My attempts to save datasets as zarr stores in the Google bucket
I am trying it with BedMachine and with PISM output.

One problem was with the BedMachine dataset. I sorted that out and then managed to get a small dataset (initially loaded from a netCDF file) into and back out of the Google bucket. Now the issue is that I am running out of memory on the full dataset and the server crashes.

Next step is to try to get it running on the cluster, but .to_zarr does not run on the cluster as it stands. 

In [1]:
import dask
import dask.array as da
import dask.delayed
from dask.distributed import Client
import dask_gateway
import numpy as np
import xarray as xr
xr.set_options(display_style="html")
import fsspec
import gcsfs

### load the netcdf

In [2]:
url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc'  
with  fsspec.open(url, mode='rb')  as openfile:  
    bm = xr.open_dataset(openfile)  

# remove the variable 'mapping' because it was causing an error in the write to zarr
bm.attrs = {**bm.attrs, **bm.mapping.attrs}  # but keep its information in the attributes of the whole dataset
bm = bm.drop_vars('mapping')   # remove the variable

### take small subset of the data 

In [3]:
bm_small = bm.isel(x=slice(0, 20), y=slice(0, 20))  # create a very small version of the dataset for testing the upload and download
bm_small.nbytes/1e3   # size in kB: it is only about 10 kB

9.76
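The number printed above is `nbytes/1e3`, so it is in kilobytes. To avoid mixing up units when comparing the small subset with the full dataset, a small helper can print sizes with an explicit unit. This is a minimal sketch; `format_bytes` is a hypothetical helper written for these notes (it is not part of xarray), and the value 9760 is the byte count implied by the 9.76 output.

```python
def format_bytes(nbytes):
    """Return a byte count as a string with a decimal (SI) unit."""
    units = ["B", "kB", "MB", "GB", "TB"]
    size = float(nbytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1000,
        # or at the largest unit we know about.
        if size < 1000 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1000


# 9760 bytes corresponds to the 9.76 shown for bm_small.nbytes/1e3
print(format_bytes(9760))
```

In the notebook this would be called as `format_bytes(bm_small.nbytes)` and `format_bytes(bm.nbytes)`.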

### This can be written to a zarr directory in our bucket with no problem.

In [4]:
bm_small_mapper = fsspec.get_mapper('gs://ldeo-glaciology/temp/bm_small4.zarr', mode='ab',
                            token='../secrets/ldeo-glaciology-bc97b12df06b.json')  # get a mapper object using the token stored in the ooi environment
bm_small.to_zarr(bm_small_mapper, mode='w');   # write the dataset to zarr in the Google bucket

In [5]:
#bm_small_mapper = fsspec.get_mapper('gs://ldeo-glaciology/temp/bm_small4.zarr') # This also works - just to check that we don't need the token for read access
bm_small_reloaded = xr.open_zarr(bm_small_mapper) # reload the dataset using the same mapper as before
bm_small_reloaded.identical(bm_small)    # check that what we get back is the same as what we uploaded

True

### Start a cluster

In [3]:
# get the dask-gateway version
dask_gateway.__version__
# show the default dask-gateway settings
dask.config.config['gateway']
#default gateway call
gateway = dask_gateway.Gateway()
# default new_cluster call
cluster = gateway.new_cluster()
#gateway = Gateway()
gateway.list_clusters()
# the dashboard_link property will show the link that can be pasted into the Dask labextension
cluster.dashboard_link
# scale cluster to 8 workers using the scale() method
cluster.scale(8)
# connect a client
# the distributed client is used for running parallel tasks with Dask
client = Client(cluster)
cluster


### According to https://gist.github.com/rabernat/4cc2eca3868abda7ddf89ed10f8007fb, this cell is how you get .to_zarr to run on the cluster. Currently it throws an error because gcfs_auth.tokens comes back as an empty dict, when I think it should have some entries.

In [7]:
gcfs_auth = gcsfs.GCSFileSystem(project='ldeo-glaciology', token='../secrets/ldeo-glaciology-bc97b12df06b.json')
token = gcfs_auth.tokens[('ldeo-glaciology', 'full_control')]
gcfs_w_token = gcsfs.GCSFileSystem(project='ldeo-glaciology', token=token)
gcsmap = gcsfs.GCSMap('gs://ldeo-glaciology/temp/bm_small4.zarr', gcs=gcfs_w_token)
bm_small.to_zarr(gcsmap, mode='w')

KeyError: ('ldeo-glaciology', 'full_control')
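One thing to try instead of looking the token up in `gcfs_auth.tokens`: read the service-account key file into a plain dictionary and pass that dictionary as `token`, so that the credentials travel with the mapper when it is pickled and sent to the workers. This is an untested sketch. It assumes the installed gcsfs accepts a credentials dictionary for `token` (the gcsfs documentation lists a dictionary as one of the accepted forms), and `load_service_account_info` and `make_bucket_mapper` are hypothetical helper names, not gcsfs functions.

```python
import json


def load_service_account_info(path):
    """Read a service-account key file into a plain dict (picklable, so it can reach the workers)."""
    with open(path) as f:
        return json.load(f)


def make_bucket_mapper(token_info, url, project):
    """Build a key-value mapper for `url` using credentials held in memory.

    gcsfs is imported inside the function so that the helper above can be
    used and tested without gcsfs or network access.
    """
    import gcsfs

    # Assumption: `token` may be a dict of credentials rather than a file path.
    fs = gcsfs.GCSFileSystem(project=project, token=token_info)
    return fs.get_mapper(url)


# Usage sketch (not run here; needs the key file and network access):
# token_info = load_service_account_info('../secrets/ldeo-glaciology-bc97b12df06b.json')
# gcsmap = make_bucket_mapper(token_info, 'gs://ldeo-glaciology/temp/bm_small4.zarr', 'ldeo-glaciology')
# bm_small.to_zarr(gcsmap, mode='w')
```

The point of the design is that the workers cannot see `../secrets/` on the notebook server, so anything that refers to the key by file path will only work locally.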

### Writing the full dataset to zarr fails on the largest server option on ooi.pangeo.io, but it works on the larger https://us-central1-b.gcp.pangeo.io/.

## In neither case does it appear to use the cluster for the computation, meaning that the whole dataset has to be loaded onto the notebook server, causing it to crash unless we use a large machine.

In [7]:
bm_mapper = fsspec.get_mapper('gs://ldeo-glaciology/bedmachine/bm.zarr', mode='ab',
                            token='../secrets/ldeo-glaciology-bc97b12df06b.json')
bm.to_zarr(bm_mapper, mode='w');

In [10]:
bm_reloaded = xr.open_zarr(bm_mapper)  
bm_reloaded.identical(bm)

True

In [15]:
cluster.shutdown()

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError


## How to get .to_zarr to use the cluster?