## Data management on the cluster

All data lives in one place, our GCS bucket: we pull it down from there to work on it and push results back there.

As much as possible, avoid creating data on the local disk of the machine that is running your notebook. This helps us:

1. Collaborate more readily -- everyone's code points to one place.
2. Work more rapidly -- each worker node on the cluster reads its own data directly from GCS.
3. Avoid losing data -- GCS is durable, but cluster nodes are ephemeral.
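
One way to keep everyone's code pointing at the same place is to build every object path from a single constant. The sketch below assumes the bucket name `learning2learn` that appears in the paths used in this notebook; `BUCKET` and `gcs_path` are hypothetical names introduced for illustration and are not part of `gcsfs`.

```python
import posixpath

# Assumption: the shared bucket is the one that appears in this notebook's paths.
BUCKET = "learning2learn"


def gcs_path(*parts):
    """Join path components under the shared bucket (hypothetical helper)."""
    # GCS object names use forward slashes, so use posixpath rather than
    # os.path, which would insert backslashes on Windows.
    return posixpath.join(BUCKET, *parts)


test_set = gcs_path("Buffalo", "BuffaloLabTestSet")
print(gcs_path("Buffalo", "BuffaloLabTestSet", "CSC84a_ex.mat"))
```

If the bucket is ever renamed, only the constant changes.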

In the following code, we read data from and write data to GCS using `gcsfs`.

In [None]:
import h5py
import scipy.io as sio
import gcsfs


In [None]:
fs = gcsfs.GCSFileSystem(project='learning-2-learn-221016', token='browser')

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=586241054156-8986sjc0h0683jmpb150i0m8cucrttds.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.full_control&state=EDZhtNpeerAg6k7qVzRKCd0Opgaosg&prompt=consent&access_type=offline


In [16]:
fs.ls('/learning2learn/Buffalo/BuffaloLabTestSet/')

['learning2learn/Buffalo/BuffaloLabTestSet/CSC84a_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC45_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC107_ex/',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC67a_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC84_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC48_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC74_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC68a_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC78_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC66_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC46a_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC77a_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC33_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC38a_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC60a_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC72a_ex.mat',
 'learning2learn/Buffalo/BuffaloLabTestSet/CSC95a_ex.mat',
 'learn
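
The listing mixes `.mat` files with at least one directory-like entry (`CSC107_ex/`). A minimal sketch for keeping only the `.mat` files, assuming the list-of-strings form shown in the output; `listing` is a hand-copied subset of that output and `only_matfiles` is a hypothetical helper name.

```python
def only_matfiles(paths):
    """Return the entries that name .mat files, sorted for a stable order."""
    return sorted(p for p in paths if p.endswith(".mat"))


# A few entries copied from the listing; in the notebook this would be the
# list returned by fs.ls(...).
listing = [
    "learning2learn/Buffalo/BuffaloLabTestSet/CSC84a_ex.mat",
    "learning2learn/Buffalo/BuffaloLabTestSet/CSC45_ex.mat",
    "learning2learn/Buffalo/BuffaloLabTestSet/CSC107_ex/",
]

matfiles = only_matfiles(listing)
print(len(matfiles))
```

The resulting list can be passed directly as `fnames` to a loader that expects one path per file.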

In [19]:
import gcsfs

from scipy.io import loadmat
import numpy as np

from dask import delayed
import dask.array as da

@delayed
def get_matfile(fname, project, keys):
    """
    Gets data from a matfile stored in GCS.

    `keys` is a sequence of nested field names to descend through.
    """
    fs = gcsfs.GCSFileSystem(project=project)
    with fs.open(fname, 'rb') as f:
        mat = loadmat(f, squeeze_me=True)

    for k in keys:
        mat = mat[k]
    
    return mat.item()


def get_spkwv(fnames, project, shape=(100000, 33)):
    """
    Gets spike waveforms from data as one lazy dask array.

    `shape` is the assumed shape of the array stored in each file.
    """
    data = []
    for fname in fnames:
        matfile = get_matfile(fname, project, ["dataToSave", "spkwv"])
        data.append(da.from_delayed(matfile, 
                                    shape=shape, 
                                    dtype=np.float64))
    return da.concatenate(data)

In [21]:
foo = get_spkwv(['learning2learn/Buffalo/BuffaloLabTestSet/CSC84a_ex.mat'], 'learning-2-learn-221016')
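
`foo` is a lazy dask array: nothing is downloaded until a result is requested, and the `shape` passed to `da.from_delayed` is not checked against the file when the array is built. The sketch below shows the same pattern with a local stand-in for `get_matfile` (the hypothetical `fake_matfile`, which fabricates an array instead of reading GCS), so it runs without bucket access.

```python
import numpy as np
import dask.array as da
from dask import delayed


@delayed
def fake_matfile(n_spikes, n_samples):
    """Stand-in for get_matfile: builds an array instead of reading GCS."""
    return np.ones((n_spikes, n_samples), dtype=np.float64)


shape = (5, 3)  # small, made-up shape; the notebook uses (100000, 33)
blocks = [
    da.from_delayed(fake_matfile(*shape), shape=shape, dtype=np.float64)
    for _ in range(2)
]
lazy = da.concatenate(blocks)

# Only metadata so far: the shape is derived from the declared per-file shapes.
print(lazy.shape)

# compute() runs the delayed functions and returns a NumPy array.
result = lazy.compute()
print(result.shape, result.sum())
```

Comparing the shape of a computed result with the declared shape is a cheap sanity check when the per-file shape is hard-coded.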

In [9]:
fs.put('/home/jovyan/dask-tutorial/data/random.hdf5', '/learning2learn/tutorial/random.hdf5')

OSError: Forbidden: https://www.googleapis.com/upload/storage/v1/b/learning2learn/o
Insufficient Permission

In [3]:
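with fs.open('learning2learn/Buffalo/BuffaloLabTestSet/CSC28_ex.mat', 'rb') as f:
    # h5py only reads HDF5 (MATLAB v7.3) files; use a separate name so the
    # GCS file handle is not shadowed.
    with h5py.File(f, 'r') as h5:
        spkwv = h5["dataToSave"]["spkwv"][()]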

OSError: Unable to open file (file signature not found)
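
`h5py` can only open HDF5 files, which for MATLAB means files saved in the v7.3 format; "file signature not found" is what it reports when the bytes are not HDF5. One plausible reading, not confirmed here, is that `CSC28_ex.mat` is in the older v5 format, which `scipy.io.loadmat` reads. A minimal sketch for telling the two apart from the text header that MATLAB writes at the start of the file; `mat_flavor` is a hypothetical helper, and the byte strings below are fabricated headers, not real recordings.

```python
import io


def mat_flavor(fileobj):
    """Guess the MAT-file flavor from its leading header bytes."""
    header = fileobj.read(19)
    fileobj.seek(0)  # rewind so the caller can hand the file to a reader
    if header.startswith(b"MATLAB 7.3 MAT-file"):
        return "hdf5"  # read with h5py
    if header.startswith(b"MATLAB 5.0 MAT-file"):
        return "v5"  # read with scipy.io.loadmat
    return "unknown"


# Fabricated headers for illustration only.
v5_like = io.BytesIO(b"MATLAB 5.0 MAT-file, Platform: GLNXA64" + b" " * 90)
v73_like = io.BytesIO(b"MATLAB 7.3 MAT-file, Platform: GLNXA64" + b" " * 90)

print(mat_flavor(v5_like), mat_flavor(v73_like))
```

The same function should work on any seekable binary file object opened in `'rb'` mode, so the reader can be chosen before parsing.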