# 02. Data Examination

This module introduces the types of data and processing pipelines available in the [HuBMAP Data Portal](https://portal.hubmapconsortium.org/). It shows how to retrieve dataset metadata and how to filter datasets based on that metadata.

## Data Types
The Portal hosts many dataset types. On the [Datasets page](https://portal.hubmapconsortium.org/search?entity_type[0]=Dataset), the filter panel on the left lists them.

The Portal contains both unprocessed and processed datasets. Click the collapsing arrow next to 'CODEX' to reveal two subsections, 'CODEX' and 'CODEX [Cytokit + SPRM]'. The second subsection ('CODEX [Cytokit + SPRM]') contains the processed datasets. One of these is [HBM926.SHNZ.594](https://portal.hubmapconsortium.org/browse/dataset/69c70762689b20308bb049ac49653342).

### Exploring a Dataset on the Portal Page
Let's explore the detail page of [HBM926.SHNZ.594](https://portal.hubmapconsortium.org/browse/dataset/69c70762689b20308bb049ac49653342).
- The Data Products section lists a few of the files available in this dataset.
- A Vitessce visualization is automatically rendered for each dataset. Explore the different panels to get the gist of the data in the dataset.
- In the Provenance section (below the visualization), the table on the right shows the primary dataset from which this processed dataset is derived.
- In the Provenance section, under Analysis Details, the analysis pipelines are listed. These are the ingest pipelines used to process the primary dataset.
- In the Metadata section, we can find a few key metadata values.
- In the Files section, we can see all available files.


We can also retrieve all of this information programmatically.

### Exploring a Dataset through the Search API
Each dataset is identified by two IDs. One is the HuBMAP ID, of the form HBMXXX.XXXX.XXX, in our case _HBM926.SHNZ.594_. The other is the UUID, in our case _69c70762689b20308bb049ac49653342_. The UUID is used in the backend, and also when referring to datasets in the Workspaces. Let's first define a function that maps UUIDs to their HuBMAP IDs.
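
The two identifier formats are easy to tell apart by shape. The sketch below classifies a string as a HuBMAP ID or a UUID. The regular expressions are assumptions inferred from the single example above (`HBM`, three digits, four uppercase letters, three digits; and 32 lowercase hexadecimal characters), not an official specification, and `classify_id` is a hypothetical helper name.

```python
import re

# Assumed patterns, inferred from the example IDs in this module.
HUBMAP_ID_PATTERN = re.compile(r"HBM\d{3}\.[A-Z]{4}\.\d{3}")
UUID_PATTERN = re.compile(r"[0-9a-f]{32}")


def classify_id(identifier):
    """Return 'hubmap_id', 'uuid', or 'unknown' based on the string's shape."""
    if HUBMAP_ID_PATTERN.fullmatch(identifier):
        return "hubmap_id"
    if UUID_PATTERN.fullmatch(identifier):
        return "uuid"
    return "unknown"


print(classify_id("HBM926.SHNZ.594"))
print(classify_id("69c70762689b20308bb049ac49653342"))
```

A check like this only looks at the shape of the string; it does not confirm that the dataset exists in the Portal.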



In [2]:
# Installing the required packages
# !pip install --upgrade pip
# !pip install numpy pandas requests wheel

In [4]:
# Importing the required packages
import requests
import json

from csv import DictReader, excel_tab
from io import StringIO

import pandas as pd

In [5]:
# This is the UUID of our dataset
uuids = ["69c70762689b20308bb049ac49653342"]

In [6]:
# This function retrieves a mapping from UUIDs to HuBMAP IDs.
def get_uuid_to_hubmap(uuids):
    '''
    Retrieve a dictionary mapping uuids to hubmap ids.

    Parameters
    ----------
    uuids : list
        list with uuids
    
    Returns
    -------
    dictionary mapping uuids to hubmap ids
    '''
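    ## Fetch metadata as TSV, and read it into a dataframe
    response = requests.post(
        'https://portal.hubmapconsortium.org/metadata/v0/datasets.tsv', json={'uuids': uuids}
    )
    response.raise_for_status()  # fail early on an HTTP error
    metadata = list(DictReader(StringIO(response.text), dialect=excel_tab))
    ## Drop the first parsed row (assumed to hold field descriptions rather than a dataset)
    metadata = pd.DataFrame(metadata[1:])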

    ## Create mapping from uuid to hubmap id
    uuid_to_hubmap = dict(zip(metadata['uuid'], metadata['hubmap_id']))
    return uuid_to_hubmap

uuid_to_hubmap = get_uuid_to_hubmap(uuids)
uuid_to_hubmap

{'69c70762689b20308bb049ac49653342': 'HBM926.SHNZ.594'}
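
To convert in the other direction, from HuBMAP ID back to UUID, one option is simply to invert the dictionary. This is a minimal sketch that reuses the mapping shown in the output above; it assumes each UUID maps to a distinct HuBMAP ID, and the name `hubmap_to_uuid` is introduced here for illustration.

```python
# Mapping copied from the output of get_uuid_to_hubmap above.
uuid_to_hubmap = {"69c70762689b20308bb049ac49653342": "HBM926.SHNZ.594"}

# Swap keys and values so that a UUID can be looked up by its HuBMAP ID.
hubmap_to_uuid = {hubmap_id: uuid for uuid, hubmap_id in uuid_to_hubmap.items()}

print(hubmap_to_uuid["HBM926.SHNZ.594"])
```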

Using the Search API, we can then find information about processed datasets. To do so, we send a POST request containing our query to the Search API. To retrieve the metadata of our dataset, we include its UUID in the query. We can also set `size`, the maximum number of results to return.
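
The request body is a plain JSON object. As a rough sketch, it can be assembled and inspected before anything is sent; `build_ids_query` is a hypothetical helper name, and the body shape follows the `ids` query used in this module.

```python
import json


def build_ids_query(uuids, size=10000):
    """Build a request body that selects documents by their UUID."""
    return {
        "size": size,  # upper bound on the number of returned hits
        "query": {"ids": {"values": list(uuids)}},
    }


body = build_ids_query(["69c70762689b20308bb049ac49653342"])
print(json.dumps(body, indent=2))
```

Printing the body first makes it easier to spot a malformed query before sending it.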

In [24]:
# Query the Search API for the metadata of our dataset.
search_api = "https://search.api.hubmapconsortium.org/v3/portal/search"

hits = json.loads(
    requests.post(
        search_api,
        json={
            "size": 10000,  # Set this high to make sure the list is not truncated.
            "query": {"ids": {"values": uuids}},
        },
    ).text
)["hits"]["hits"]

hits

[{'_id': '69c70762689b20308bb049ac49653342',
  '_index': 'hm_prod_public_portal',
  '_score': 1.0,
  '_source': {'analyte_class': 'protein',
   'anatomy_0': ['body'],
   'anatomy_1': ['spleen'],
   'ancestor_counts': {'entity_type': {'Dataset': 1, 'Donor': 1, 'Sample': 3}},
   'ancestor_ids': ['804df200e0003180cc5a62493ea5dced',
    '852dae577b0393aa888f8c3c66cd38ac',
    '5a878bd17066ab20aa35d1bab8e9ebc8',
    'ebbe62a5095f993c72f5f10b13118bc6',
    '54e9f1a94821a716e8e1546aee7b0f7a'],
   'ancestors': [{'contacts': [{'affiliation': 'Stanford',
       'first_name': 'John',
       'is_contact': 'TRUE',
       'last_name': 'Hickey',
       'middle_name_or_initial': 'W',
       'name': 'John Hickey',
       'orcid_id': '0000-0001-9961-7673',
       'version': '1'},
      {'affiliation': 'Stanford',
       'first_name': 'Chiara',
       'is_contact': 'TRUE',
       'last_name': 'Caraccio',
       'middle_name_or_initial': '',
       'name': 'Chiara Caraccio',
       'orcid_id': '0000-0002-
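
Each hit keeps its metadata under `_source`, often several levels deep. The sketch below pulls individual values out of such a nested dictionary with a dotted path. The sample dictionary is a small excerpt copied from the output above, and `get_nested` is a hypothetical helper, not part of any HuBMAP package.

```python
# Excerpt of the '_source' dictionary shown in the output above.
source = {
    "analyte_class": "protein",
    "anatomy_1": ["spleen"],
    "ancestor_counts": {"entity_type": {"Dataset": 1, "Donor": 1, "Sample": 3}},
}


def get_nested(document, dotted_path, default=None):
    """Follow a dotted path such as 'a.b.c' through nested dictionaries."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


print(get_nested(source, "anatomy_1"))
print(get_nested(source, "ancestor_counts.entity_type.Sample"))
print(get_nested(source, "ancestor_counts.missing_key"))
```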

That is a lot of information, more than is shown in the Portal! Instead of requesting everything, we can ask for specific fields, such as the files and data types of a dataset. To do so, we add a `_source` field to the request that lists the fields we want returned.

In [17]:
hits = json.loads(
    requests.post(
        search_api,
        json={
            'size': 10000,
            'query': {'ids': {'values': uuids}},
            '_source': ['files', 'data_types']
        }, 
    ).text
)['hits']['hits']

hits[0]['_source']

{'data_types': ['codex_cytokit'],
 'files': [{'description': 'AnnData Zarr store for storing and visualizing SPRM outputs.',
   'edam_term': 'EDAM_1.24.format_2333',
   'is_data_product': False,
   'is_qa_qc': False,
   'mapped_description': 'AnnData Zarr store for storing and visualizing SPRM outputs. (ZGROUP file)',
   'rel_path': 'anndata-zarr/reg001_expr-anndata.zarr/.zgroup',
   'size': 24,
   'type': 'unknown'},
  {'description': 'AnnData Zarr store for storing and visualizing SPRM outputs.',
   'edam_term': 'EDAM_1.24.format_2333',
   'is_data_product': False,
   'is_qa_qc': False,
   'mapped_description': 'AnnData Zarr store for storing and visualizing SPRM outputs. (ZGROUP file)',
   'rel_path': 'anndata-zarr/reg001_expr-anndata.zarr/layers/.zgroup',
   'size': 24,
   'type': 'unknown'},
  {'description': 'AnnData Zarr store for storing and visualizing SPRM outputs.',
   'edam_term': 'EDAM_1.24.format_2333',
   'is_data_product': False,
   'is_qa_qc': False,
   'mapped_descrip
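
Once the `files` list is retrieved, it can be summarized with ordinary Python. The sketch below counts the files, sums their sizes, and lists the files flagged as data products. The sample is an abbreviated copy of the first two entries in the output above; `summarize_files` is a hypothetical helper, and the unit of `size` is assumed to be bytes.

```python
# Abbreviated copy of the first two 'files' entries from the output above.
files = [
    {
        "rel_path": "anndata-zarr/reg001_expr-anndata.zarr/.zgroup",
        "size": 24,
        "is_data_product": False,
    },
    {
        "rel_path": "anndata-zarr/reg001_expr-anndata.zarr/layers/.zgroup",
        "size": 24,
        "is_data_product": False,
    },
]


def summarize_files(files):
    """Count the files, sum their sizes, and list those flagged as data products."""
    return {
        "count": len(files),
        "total_size": sum(f["size"] for f in files),
        "data_products": [f["rel_path"] for f in files if f.get("is_data_product")],
    }


summary = summarize_files(files)
print(summary)
```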

We can also use Elasticsearch queries to find specific datasets. Instead of listing the UUIDs we want information on, we can describe the kind of dataset we are interested in, such as datasets with the CODEX data type that contain ome.tiff files.
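
In an Elasticsearch `bool` query, clauses under `must` have to match and contribute to the relevance score, while clauses under `filter` have to match but do not affect the score. The sketch below assembles such a body from separate clause lists; `build_bool_query` is a hypothetical helper, and the field names are the ones used in the query in this section.

```python
def build_bool_query(must=None, filters=None, source_fields=None, size=100):
    """Assemble a request body with scored 'must' and unscored 'filter' clauses."""
    body = {
        "size": size,
        "query": {
            "bool": {
                "must": list(must or []),
                "filter": list(filters or []),
            }
        },
    }
    if source_fields is not None:
        # Restrict the returned documents to the listed fields.
        body["_source"] = list(source_fields)
    return body


body = build_bool_query(
    must=[{"match": {"mapped_data_types": "CODEX"}}],
    filters=[{"term": {"entity_type.keyword": "Dataset"}}],
    source_fields=["hubmap_id", "uuid"],
)
print(body)
```

Keeping the clauses in separate lists makes it easy to add or drop a condition without editing a deeply nested literal.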

In [21]:
# Filter datasets by file name and data type.

hits = json.loads(
    requests.post(
        search_api,
        json={
            "size": 100,
            "query": {
                "bool": {
                    "must": [
                        {
                            "match": {
                                "files.rel_path": "ome.tiff" # find entities with an ome.tiff file
                            }
                        },
                        {
                            "match": {
                                "mapped_data_types": "CODEX" # find entities with CODEX data types
                            }
                        }
                    ],
                    "filter": [
                        {
                            "bool": {
                                "must_not": {
                                    "exists": {
                                        "field": "next_revision_uuid" # an artifact of the Portal: this filters out older data that has a newer revision.
                                    }
                                }
                            }
                        },
                        {
                            "term": {
                                "entity_type.keyword": "Dataset" # find entities that are datasets
                            }
                        }
                    ]
                }
            },
            "_source": [
                "hubmap_id",
                "group_uuid",
                "uuid",
                "entity_type",
                "mapped_data_types",
                "metadata.metadata.assay_type", # this is how to return specific metadata fields
                "files"
            ]
        }, 
    ).text
)['hits']['hits']

hits

[{'_id': 'b69d1e2ad1bf1455eee991fce301b191',
  '_index': 'hm_prod_public_portal',
  '_score': 11.504652,
  '_source': {'entity_type': 'Dataset',
   'files': [{'description': "File containing Cytokit's calculations from deconvolution, drift compensation, and focal plan selection, in JSON format",
     'edam_term': 'EDAM_1.24.format_3464',
     'mapped_description': "File containing Cytokit's calculations from deconvolution, drift compensation, and focal plan selection, in JSON format (JSON file)",
     'rel_path': 'data.json',
     'size': 929112,
     'type': 'json'},
    {'description': 'Cytokit cytometry output for region 001, in OME-TIFF format',
     'edam_term': 'EDAM_1.24.format_3727',
     'mapped_description': 'Cytokit cytometry output for region 001, in OME-TIFF format (TIFF file)',
     'rel_path': 'output/cytometry/tile/ome-tiff/R001_X001_Y001.ome.tiff',
     'size': 238354,
     'type': 'unknown'},
    {'description': 'Cytokit cytometry output for region 001, in OME-TIFF fo
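
The returned hits can then be reduced to just the parts of interest. The sketch below collects, per dataset, the relative paths of files ending in `.ome.tiff`. The sample hit is abbreviated from the output above (only `rel_path`, `size`, and `type` are kept), and `ome_tiff_paths` is a hypothetical helper name.

```python
# Abbreviated copy of the first hit from the output above.
sample_hits = [
    {
        "_id": "b69d1e2ad1bf1455eee991fce301b191",
        "_source": {
            "entity_type": "Dataset",
            "files": [
                {"rel_path": "data.json", "size": 929112, "type": "json"},
                {
                    "rel_path": "output/cytometry/tile/ome-tiff/R001_X001_Y001.ome.tiff",
                    "size": 238354,
                    "type": "unknown",
                },
            ],
        },
    }
]


def ome_tiff_paths(hits):
    """Map each hit's ID to the relative paths of its OME-TIFF files."""
    paths_by_id = {}
    for hit in hits:
        files = hit["_source"].get("files", [])
        paths_by_id[hit["_id"]] = [
            f["rel_path"] for f in files if f["rel_path"].endswith(".ome.tiff")
        ]
    return paths_by_id


print(ome_tiff_paths(sample_hits))
```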

## Try it for yourself!
Try examining some data yourself. You might search for datasets that match certain criteria, or retrieve information about specific datasets. Feel free to use our daily office hours to ask questions about this as well!