# Getting data from the cloud

One of the major use cases for the cloud is large-scale computing. Another is straightforward data sharing, and naturally there is also the combination of the two. Let's consider why that is. One of the growing concerns with data sharing is that, for different users to gain access to the (ever-larger) datasets that are available through "brain observatories", multiple copies of the data would have to be generated. This is costly in the best case and prohibitive in the worst case. The mantra you will hear about this issue is that as the data grows larger, you will have to "bring the compute to the data".

What does this mean in practice? Using cloud computing as the basis for data sharing means that the data does not have to be copied out of the cloud at any point during data analysis. In many cases, analysis of a large-scale dataset aims to distill it into a set of conclusions. This usually means that the ultimate outputs of the analysis, for example a figure or a few numbers, can be very small relative to the data inputs (while acknowledging that there might be intermediate steps where the data grows larger). If computation can be done to minimize transfer of very large datasets, this could




## Amazon's Simple Storage Service (S3) and open datasets

In practice, Amazon Web Services (AWS) has taken the lead in providing open access to neuroscience datasets, through its [open data program](https://registry.opendata.aws/). 

Some of the interesting datasets provided through this program: 

- [International Neuroimaging Data-Sharing Initiative (INDI)](https://registry.opendata.aws/fcp-indi/)
- [OpenNeuro](https://registry.opendata.aws/openneuro/)
- [Open NeuroData](https://registry.opendata.aws/open-neurodata/)
- [Allen Brain Observatory](https://registry.opendata.aws/allen-brain-observatory/)
- [Human Connectome Project](https://registry.opendata.aws/hcp-openaccess/)
- [NYU Langone & FAIR FastMRI Dataset](https://registry.opendata.aws/nyu-fastmri/)

Most of these datasets provide access to a lot of data without requiring a data use agreement or any form of authentication. Anyone can download the data. An important exception is the Human Connectome Project (HCP). Access to this dataset requires acquiring and using a special set of credentials. To get these credentials, you will need to register at https://db.humanconnectome.org/, follow the instructions there, and agree to the terms and conditions of use (these are fairly straightforward). Once you have agreed, you can receive AWS credentials. AWS credentials are composed of two keys: an access key ID and a secret access key. In general, you want to be very careful with your AWS credentials, because they can typically be used to do whatever you can do on AWS. The HCP credentials can be used only to access the data that is publicly provided. Still, please keep them private, because they do provide access to the data.
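As a rough illustration of how the two keys might be handled without writing them into a notebook, the sketch below reads them from environment variables and packages them as keyword arguments for `s3fs.S3FileSystem`, which accepts `key`, `secret`, and `anon` parameters. The helper name `hcp_storage_options` is hypothetical, and the choice of the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variable names is an assumption about how you store the credentials.

```python
import os


def hcp_storage_options(env=os.environ):
    """Build keyword arguments for s3fs.S3FileSystem from environment variables.

    Hypothetical helper: the environment variable names are an assumption
    about where the two keys of the credential pair have been stored.
    """
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        raise RuntimeError(
            "Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before connecting."
        )
    # anon=False asks s3fs to sign requests with the credentials given here
    return {"key": key, "secret": secret, "anon": False}


# Possible usage (requires s3fs and valid credentials):
#   import s3fs
#   fs = s3fs.S3FileSystem(**hcp_storage_options())
```

Keeping the keys in the environment rather than in the notebook means that sharing the notebook does not also share the credentials.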

That said, we'll use another dataset as an example here. If you are interested in learning more about the HCP data, you should look into Noah Benson's [Wednesday lecture/tutorial]().

Here, let's consider some data stored in OpenNeuro. OpenNeuro is the BRAIN Initiative's archive for human neuroimaging data. It provides ready access 

In [1]:
import s3fs

In [3]:
# anon=True makes unsigned requests, which is enough for public buckets
fs = s3fs.S3FileSystem(anon=True)

In [4]:
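# For comparison, the AWS command-line equivalent for downloading a whole dataset:
# aws s3 sync --no-sign-request s3://openneuro.org/ds000233 ds000233-download/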

In [6]:
ll = fs.ls('/openneuro.org/')

In [8]:
len(ll)

388
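The listing can be turned into a list of dataset accession numbers with a few lines of plain Python. This is a minimal sketch, assuming that `fs.ls` returns strings of the form `bucket/key`; the helper name `dataset_ids` and the `sample_listing` values (including the non-dataset entry) are made up for illustration and stand in for the real `ll`.

```python
def dataset_ids(listing, prefix="ds"):
    """Return the sorted dataset identifiers found in a bucket listing."""
    ids = set()
    for path in listing:
        # Keep only the last path component, e.g. 'ds000001'
        name = path.rstrip("/").split("/")[-1]
        if name.startswith(prefix):
            ids.add(name)
    return sorted(ids)


# Hypothetical stand-in for the first few entries of `ll`
sample_listing = [
    "openneuro.org/ds000001",
    "openneuro.org/ds000233",
    "openneuro.org/index.html",
]
print(dataset_ids(sample_listing))
```

With the real listing, `dataset_ids(ll)` would give the identifiers that can then be appended to `openneuro.org/` to browse a single dataset.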

In [9]:
fs.get('openneuro.org/ds000001/sub-01/anat/sub-01_T1w.nii.gz', './foo.nii.gz')

In [10]:
ls

README.md                             xx-reading-data-from-the-cloud.ipynb
foo.nii.gz


In [None]:
fs.ls('fcp-indi/data/Projects/RocklandSample/RawDataBIDS/sub-A00008326/ses-ALGA/dwi/sub-A00008326_ses-ALGA_dwi.bvec')

In [11]:
fname = 'openneuro.org/ds000001/sub-01/anat/sub-01_T1w.nii.gz'

In [18]:
import gzip
from io import BytesIO
import nibabel as nib

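# Read the image straight from S3 into memory, without saving a local copy
with fs.open(fname) as ff:
    # Decompress the gzipped NIfTI file as it is read from the bucket
    zz = gzip.open(ff)
    rr = zz.read()
    # Wrap the raw bytes in a file-like object that nibabel can read from
    bb = BytesIO(rr)
    fh = nib.FileHolder(fileobj=bb)
    # A single-file NIfTI keeps header and image data in the same file
    img = nib.Nifti1Image.from_file_map({'header': fh, 'image': fh})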
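The decompression step in the cell above can be factored into a small reusable helper. The sketch below uses only the standard library and is exercised on an in-memory gzip payload rather than on a real S3 object; the helper name `gunzip_to_buffer` is hypothetical.

```python
import gzip
from io import BytesIO


def gunzip_to_buffer(fileobj):
    """Decompress a gzip-compressed binary file-like object into a BytesIO."""
    with gzip.open(fileobj) as zz:
        return BytesIO(zz.read())


# Stand-in for a remote file: gzip-compressed bytes held in memory
remote_like = BytesIO(gzip.compress(b"not a real NIfTI file"))
buffer = gunzip_to_buffer(remote_like)
print(buffer.getvalue())
```

With s3fs and nibabel available, the buffer returned for `fs.open(fname)` could be passed to `nib.FileHolder(fileobj=...)` exactly as in the cell above.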