# 02. Data Retrieval

This module shows how to access local data and how to retrieve files through the assets API, including data held in Zarr storage.

## Local Data
Once you have started up your Workspace, you might notice the _datasets_ folder. This contains datasets that are linked to the Workspace. The files in this folder are accessible in the same way as on your local machine. 

In [None]:
# !pip install pandas requests anndata zarr aiohttp fsspec

In [None]:
import json
import os
import requests
import warnings

import pandas as pd

import zarr
import anndata as ad

In this module we will reuse the dataset from the last module, together with a new one. You can also inspect the second dataset in the list below on its [Dataset Detail Page](https://portal.hubmapconsortium.org/browse/dataset/a1d17fdd270a69c813b872a927dfa5f3).

In [None]:
uuids = ['69c70762689b20308bb049ac49653342', 'a1d17fdd270a69c813b872a927dfa5f3']

Let's access some local data.

In [None]:
adata = ad.read_h5ad('./datasets/' + uuids[1] + '/secondary_analysis.h5ad')
adata

This folder is read-only! If you create outputs based on a dataset, you will have to write them to a separate folder.
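
As a minimal sketch of that separate folder, the helper below creates one writable output directory per dataset. The name `output_dir_for` and the `outputs` base folder are assumptions made for illustration; they are not part of the Workspace or the HuBMAP APIs.

```python
import os
import tempfile


def output_dir_for(uuid, base="outputs"):
    """Create (if needed) and return a writable output folder for one dataset."""
    # Hypothetical layout: <base>/<uuid>, kept apart from the read-only datasets folder.
    path = os.path.join(base, uuid)
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrated in a temporary directory so nothing is written next to the notebook.
with tempfile.TemporaryDirectory() as tmp:
    out = output_dir_for("a1d17fdd270a69c813b872a927dfa5f3", base=os.path.join(tmp, "outputs"))
    print(os.path.isdir(out))
```

In a Workspace you would likely call it with the default `base` and write results such as plots or tables into the returned folder.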

## Using the assets API
Perhaps you want to export your notebook, so you want the notebook itself to fetch the data it needs instead of relying on the _datasets_ folder.

In [None]:
url = 'https://assets.hubmapconsortium.org/' + uuids[0] + '/' + 'sprm_outputs/reg001_expr.ome.tiff-SPRM_Image_Quality_Measures.json'

res = requests.get(url)
res.raise_for_status()  # stop on an HTTP error instead of saving an error page

with open('./quality.json', mode='wb') as f:
    f.write(res.content)

We have now retrieved the first data product of our dataset, and written it to _quality.json_. We can open the file and see what it contains.

In [None]:
# use a context manager so the file is closed after reading
with open('./quality.json') as f:
    quality = json.load(f)
quality
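
The asset URL used above is the base URL, the dataset UUID, and the file's relative path joined by slashes. A small sketch of a helper that builds such URLs is shown here; `asset_url` and `ASSETS_BASE` are hypothetical names, and percent-encoding the path is an added precaution rather than something the assets API is known to require.

```python
from urllib.parse import quote

ASSETS_BASE = "https://assets.hubmapconsortium.org"


def asset_url(uuid, rel_path, base=ASSETS_BASE):
    """Build the assets URL for a file, percent-encoding unsafe characters."""
    # Keep '/' so sub-folders stay path segments; strip a leading '/' to avoid '//'.
    return f"{base}/{uuid}/{quote(rel_path.lstrip('/'), safe='/')}"


print(asset_url(
    "69c70762689b20308bb049ac49653342",
    "sprm_outputs/reg001_expr.ome.tiff-SPRM_Image_Quality_Measures.json",
))
```

For paths made only of letters, digits, dots, dashes, and underscores, the result is identical to plain string concatenation.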

Of course, we can also download many more files. The function below downloads any file from a dataset.

In [None]:
def retrieve_files_remote(uuid, file_name, outdir='.'): 
    '''
    For a given UUID and file name, retrieve this file and save it locally.

    Parameters
    ----------
    uuid : str
        UUID of dataset
    file_name : str
        relative location of desired file. 
    outdir : str, optional
        name of output folder. Default: '.'
    '''
    url = 'https://assets.hubmapconsortium.org/' + uuid + '/' + file_name

    if file_name.endswith('.h5ad'):
        warnings.warn('Large files such as .h5ad files may take long to retrieve.')

    res = requests.get(url)
    res.raise_for_status()  # do not write an error page to disk

    # mirror the dataset's folder structure below outdir/uuid
    out_path = os.path.join(outdir, uuid, file_name)
    # unlike os.mkdir, os.makedirs creates directories recursively
    os.makedirs(os.path.dirname(out_path), exist_ok=True)

    with open(out_path, mode='wb') as f:
        f.write(res.content)
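
The function above holds the whole response in memory before writing it, which can be a problem for large files such as .h5ad. The sketch below shows one possible alternative that copies the body to disk in chunks. It swaps `requests` for the standard library's `urllib` so it runs without extra packages; `download_streaming` is a hypothetical name and the 1 MiB chunk size is an arbitrary choice.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


def download_streaming(url, dest, chunk_size=1024 * 1024):
    """Copy the response body to `dest` in chunks instead of holding it in memory."""
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses, and the output
    # file is only opened after the request succeeded.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as f:
        shutil.copyfileobj(response, f, chunk_size)
    return dest


# Offline demonstration with a file:// URL; an https:// assets URL is passed the same way.
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp, "source.bin")
    src.write_bytes(b"example bytes")
    dest = download_streaming(src.as_uri(), os.path.join(tmp, "data", "copy.bin"))
    print(os.path.getsize(dest))
```

With `requests`, a comparable effect can be had by passing `stream=True` to `requests.get` and writing the chunks from `iter_content`.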
        

In Module 2, we showed how to retrieve files for a dataset. Let's reuse that functionality here.

In [None]:
def get_files_for_uuids(uuids, search_api='https://search.api.hubmapconsortium.org/v3/portal/search'):
    '''
    Create a dictionary of files per dataset.

    Parameters
    ----------
    uuids : list of str or str
        UUID(s) of dataset(s)
    search_api : str, optional
        URL of search_api. Default: 'https://search.api.hubmapconsortium.org/v3/portal/search'

    Returns
    -------
    dictionary mapping each UUID to a list of relative file paths
    '''
    hits = json.loads(
        requests.post(
            search_api,
            json={
                "size": 10000,
                "query": {"ids": {"values": uuids}},
                "_source": ["files"]
            }, 
        ).text
    )["hits"]["hits"]

    uuid_to_files = {}
    for hit in hits:
        file_paths = [file['rel_path'] for file in hit['_source']['files']]
        uuid_to_files[hit['_id']] = file_paths

    return uuid_to_files
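
The request body sent to the Search API can also be built separately, which makes it easy to inspect before any request is made. This is a small sketch under the assumption that the body keeps the same three keys as in the function above; `build_files_query` is a hypothetical helper, not part of any HuBMAP client.

```python
def build_files_query(uuids, size=10000):
    """Return the search request body that lists the files of the given datasets."""
    # Accept a single UUID as well as a list, so both call styles behave the same.
    if isinstance(uuids, str):
        uuids = [uuids]
    return {
        "size": size,
        "query": {"ids": {"values": list(uuids)}},
        "_source": ["files"],  # only return the file listing of each hit
    }


print(build_files_query("a1d17fdd270a69c813b872a927dfa5f3"))
```

The returned dictionary could then be passed as the `json` argument of `requests.post`.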

In [None]:
uuid_to_files = get_files_for_uuids(uuids)

# Run to download all files from this dataset.
# This takes a few minutes.
# for file in uuid_to_files[uuids[0]]: 
#     retrieve_files_remote(uuids[0], file, outdir='data')
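
Before downloading everything, it can help to see which kinds of files a dataset contains and to pick only some of them. The sketch below works on any list of relative paths, such as one of the values of `uuid_to_files`. The function names and the example paths are made up for illustration and do not come from a real dataset.

```python
import os
from collections import Counter


def count_extensions(file_paths):
    """Count files per extension (lower-cased; '' for files without one)."""
    return Counter(os.path.splitext(path)[1].lower() for path in file_paths)


def select_by_extension(file_paths, extension):
    """Return only the paths that end with the given extension, e.g. '.json'."""
    return [path for path in file_paths if path.lower().endswith(extension.lower())]


# Hypothetical paths, only to show the shape of the input and output.
example_paths = ["tables/a.json", "images/b.ome.tiff", "store.zarr/.zgroup"]
print(count_extensions(example_paths))
print(select_by_extension(example_paths, ".json"))
```

Note that `os.path.splitext` only looks at the last suffix, so an `.ome.tiff` file is counted under `.tiff`, and dot-files such as `.zgroup` count as having no extension.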

## Zarr
H5ad files can be very large and take a long time to retrieve. [Zarr](https://zarr.readthedocs.io/) is a storage format for chunked, compressed N-dimensional arrays. Because each array is stored separately and in chunks, we can load only the parts we need, which significantly speeds up loading. This is why many files in HuBMAP datasets are also available as Zarr stores. We can load these objects directly from the remote Zarr storage.

In the files overview, you may have already seen some Zarr files, such as _anndata-zarr/reg001_expr-anndata.zarr/.zgroup_. These _.zgroup_ files do not contain any array data; they are created when a Zarr group is created and indicate that the group exists.
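
To give a feel for why chunked storage loads faster, the sketch below lists which chunk files would have to be fetched to read a range of rows from a 2-D array. It assumes the Zarr v2 layout, where each chunk is stored under a key such as `1.0` (chunk indices joined by a dot). The shape and chunk sizes are invented for the example and are not those of any HuBMAP dataset; `chunk_keys_for_rows` is a hypothetical helper, not a Zarr function.

```python
def chunk_keys_for_rows(shape, chunks, row_start, row_stop):
    """List the Zarr v2 chunk keys that hold rows [row_start, row_stop) of a 2-D array."""
    n_rows, n_cols = shape
    chunk_rows, chunk_cols = chunks
    row_stop = min(row_stop, n_rows)
    if row_start >= row_stop:
        return []
    first = row_start // chunk_rows          # index of the first row-chunk touched
    last = (row_stop - 1) // chunk_rows      # index of the last row-chunk touched
    n_col_chunks = -(-n_cols // chunk_cols)  # ceiling division
    return [f"{i}.{j}" for i in range(first, last + 1) for j in range(n_col_chunks)]


# Reading rows 150..319 of a made-up 1000 x 50 array stored in 100 x 50 chunks.
print(chunk_keys_for_rows(shape=(1000, 50), chunks=(100, 50), row_start=150, row_stop=320))
```

Only the listed chunks need to be transferred, whereas a single monolithic file would usually be downloaded in full first.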

In [None]:
# get the zarr_url for this dataset and file
zarr_url = f'https://assets.hubmapconsortium.org/{uuids[1]}/hubmap_ui/anndata-zarr/secondary_analysis.zarr'

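# open the X array read-only (we only read from the remote store)
X_arr = zarr.open(zarr_url + "/X", mode='r')

# read the full array into memory and load it as a pandas DataFrame
X_df = pd.DataFrame(X_arr[:])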
X_df.head()

Perhaps we are very familiar with AnnData or the dataset's pipelines (as mentioned in Module 2) and know exactly which Zarr stores we want to retrieve. If not, we would usually explore the Zarr hierarchy through [_zarr.hierarchy_](https://zarr.readthedocs.io/en/stable/api/hierarchy.html). However, since this is a remote store, this is not possible. We can, however, use the _.zgroup_ files to figure out the structure of the Zarr stores.

We can modify our file retrieval to get all the Zarr paths.

In [None]:
def get_zarr_paths(uuids, search_api = 'https://search.api.hubmapconsortium.org/v3/portal/search'):
    '''
    Get dictionary of zarr extensions for datasets.
    Each dataset maps to a new dictionary, with the base zarr stores as keys and
    a list of extensions as values.
    The base zarr stores can also be interpreted as the different anndata files.
    
    Parameters
    ----------
    uuids : list of str
        list with dataset UUIDs
    search_api : str, optional
        URL of HuBMAP Search API. Default: 'https://search.api.hubmapconsortium.org/v3/portal/search'
    
    Returns
    -------
    dictionary with for each UUID a new dictionary with base zarr stores and extensions
    '''
    hits = json.loads(
            requests.post(
                search_api,
                json={
                    'size': 10000,
                    'query': {'ids': {'values': uuids}},
                    '_source': ['files']
                }, 
            ).text
        )['hits']['hits']

    uuid_to_files = {}
    for hit in hits:
        # get all the file_paths for a dataset
        file_paths = [file['rel_path'] for file in hit['_source']['files']]

        # filter file_paths for zarr
        file_paths_zarr = [file_name for file_name in file_paths if 'zarr' in file_name]
        
        # get the roots of the zarr groups
        root_files = [file_name.replace('.zarr/.zgroup', '') for file_name in file_paths_zarr if '.zarr/.zgroup' in file_name]

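        # create a dictionary from root to extension
        # match on the full '<root>.zarr/' prefix, so a root whose name is contained
        # in another root's name does not pick up the other store's files
        root_files_to_files = {
            root_file: [
                file.replace(root_file + '.zarr/', '')
                for file in file_paths_zarr
                if file.startswith(root_file + '.zarr/')
            ]
            for root_file in root_files
        }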
        
        uuid_to_files[hit['_id']] = root_files_to_files
    
    return uuid_to_files

In [None]:
zarr_paths = get_zarr_paths(uuids)[uuids[1]]
zarr_paths
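
Each value in `zarr_paths` is a flat list of extensions. One way to turn such a list into an overview of the hierarchy is to look at the metadata files: in Zarr v2, a folder holding a _.zgroup_ file is a group and a folder holding a _.zarray_ file is an array. The sketch below assumes that the _.zarray_ files appear in the file listing as well, which has not been verified here. The helper name and the example extensions are hypothetical.

```python
import posixpath


def split_groups_and_arrays(extensions):
    """Separate Zarr v2 group paths from array paths using their metadata files."""
    groups, arrays = [], []
    for ext in extensions:
        parent, name = posixpath.split(ext)
        if name == ".zgroup":
            groups.append(parent or "/")  # '/' stands for the root group
        elif name == ".zarray":
            arrays.append(parent)
        # every other entry is a chunk or attribute file and is skipped
    return sorted(groups), sorted(arrays)


# Made-up extensions in the same shape as one value of zarr_paths.
example = [".zgroup", "X/.zarray", "X/0.0", "obs/.zgroup", "obs/index/.zarray", "obs/index/0"]
groups, arrays = split_groups_and_arrays(example)
print(groups)
print(arrays)
```

Each array path can then be appended to the store URL and passed to _zarr.open_, as was done for `X` earlier.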

## Try it for yourself!
We can use this file overview and _zarr.open_ to retrieve all Zarr stores. We can even automate this to retrieve the entire AnnData object for a dataset much faster than through the .h5ad file. Try it yourself! If you want a hint, you can look at the _load\_zarr_ template in the Portal.