# 02. Data Examination

This module shows the types of data and pipelines that are in the [HuBMAP Data Portal](https://portal.hubmapconsortium.org/). It shows how to retrieve metadata, and how to filter datasets based on that metadata.

## Data Types
There are many dataset types. On the [Datasets page](https://portal.hubmapconsortium.org/search?entity_type[0]=Dataset), you see a filter on the left with many dataset types.

There are unprocessed and processed datasets in the Portal. Click the collapsing arrow next to 'CODEX'. It shows two subsections, 'CODEX' and 'CODEX [Cytokit + SPRM]'. The second subsection ('CODEX [Cytokit + SPRM]') contains the processed datasets. One of these is [HBM926.SHNZ.594](https://portal.hubmapconsortium.org/browse/dataset/69c70762689b20308bb049ac49653342).

### Exploring a Dataset on the Portal page
Let's explore the detail page of [HBM926.SHNZ.594](https://portal.hubmapconsortium.org/browse/dataset/69c70762689b20308bb049ac49653342).

#### Primary and processed datasets
HBM926.SHNZ.594 is a processed (also called derived) dataset. A processed dataset is created by running a primary dataset through one or more pipelines. The detail page shows both the primary dataset and any processed datasets.

##### Primary dataset information & metadata
- The information about the primary dataset is shown at the top. Here we can see the primary dataset that this processed dataset is derived from.
- Metadata: In this section, we can find a few key metadata values about the assay and the donor.

##### Processed dataset information
Under the processed dataset, there are a few sections: 
- Summary: This section contains some basic information about this dataset.
- Visualization: A Vitessce Visualization is automatically rendered for each dataset. Explore the different panels to get a sense of the data in the dataset.
- Files: The Data Products tab lists a few commonly used files from this dataset. The File Browser tab lists all available files.
- Analysis Details & Protocols: Here, the analysis pipelines are listed. These are the ingest pipelines that the primary dataset was run through.

##### Other sections
Now let's examine the information below the derived dataset.
- Bulk Data Transfer: In this section, we can find links to the Globus directories for the primary and processed datasets.
- Provenance: Here we see a graphical overview of how the datasets are related.
- Attribution: This section lists the individuals who provided this dataset.

### Exploring a Dataset through the search API
We can also retrieve most of this information programmatically.

Each dataset is identified by two IDs. One is the HuBMAP ID, in the form of HBMXXX.XXXX.XXX, in our case _HBM926.SHNZ.594_. The other is the UUID, in our case _69c70762689b20308bb049ac49653342_. The UUID is used in the backend, and it is also used when referring to datasets in the Workspaces. There is a HuBMAP template helper package that has a method to retrieve a mapping from UUIDs to HuBMAP IDs.
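As a rough illustration of the two identifier formats, the sketch below tells them apart by shape. The patterns are assumptions inferred only from the examples above (groups of 3, 4, and 3 uppercase letters or digits after `HBM` for the HuBMAP ID; 32 lowercase hexadecimal characters for the UUID), not an official specification, and `classify_id` is a hypothetical helper name.

```python
import re

# Assumed patterns, inferred from the examples in this notebook only.
HUBMAP_ID_PATTERN = re.compile(r"HBM[A-Z0-9]{3}\.[A-Z0-9]{4}\.[A-Z0-9]{3}")
UUID_PATTERN = re.compile(r"[0-9a-f]{32}")


def classify_id(identifier: str) -> str:
    """Return 'hubmap_id', 'uuid', or 'unknown' based on the identifier's shape."""
    if HUBMAP_ID_PATTERN.fullmatch(identifier):
        return "hubmap_id"
    if UUID_PATTERN.fullmatch(identifier):
        return "uuid"
    return "unknown"


print(classify_id("HBM926.SHNZ.594"))
print(classify_id("69c70762689b20308bb049ac49653342"))
```

A check like this only looks at the format. It does not confirm that the identifier exists in the Portal.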

In [None]:
# !pip install numpy pandas requests wheel hubmap_template_helper

In [None]:
# Importing the required packages
import requests
import json

from hubmap_template_helper import uuids as hth_uuids

In [None]:
# This is the UUID of our dataset
uuids = ['69c70762689b20308bb049ac49653342']

HuBMAP IDs are more readable than UUIDs, so we can convert these.

In [None]:
# This method retrieves a mapping between the UUID and HuBMAP ID.
# This uses a post request to https://portal.hubmapconsortium.org/metadata/v0/datasets.tsv

uuid_to_hubmap = hth_uuids.get_uuid_to_hubmap_mapping(uuids)
uuid_to_hubmap[uuids[0]]

Using the Search API, we can then find information about processed datasets. We send a POST request to the Search API with our query in the request body. To find the metadata of our dataset, the query contains the dataset's UUID. We can also set `size`, the maximum number of results to return.

> Note: the search API only retrieves this metadata for processed datasets, not for primary datasets.

In [None]:
search_api = "https://search.api.hubmapconsortium.org/v3/portal/search"

hits = json.loads(
    requests.post(
        search_api,
        json={
            "size": 10000,  # Set this high to make sure the list is not truncated.
            "query": {"ids": {"values": uuids}},
        },
    ).text
)["hits"]["hits"]

hits

The response contains a lot of information, more than is available in the Portal! Instead of requesting all information, we can also ask for specific things, such as the files and data types that a dataset contains. Here, we add an extra field, `_source`, to specify what information we want returned.

In [None]:
hits = json.loads(
    requests.post(
        search_api,
        json={
            'size': 10000,
            'query': {'ids': {'values': uuids}},
            '_source': ['files', 'data_types']
        }, 
    ).text
)['hits']['hits']

hits[0]['_source']
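To get a quick overview of what a dataset contains, the file list can be summarized by extension. This is a minimal sketch on a made-up `_source` dictionary. It assumes that each entry in `files` has a `rel_path` key (the field that the filter query in this notebook also refers to). `count_extensions` is a hypothetical helper, not part of any HuBMAP package, and the file names and `data_types` value are invented for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(source: dict) -> Counter:
    """Count files per extension, keeping compound suffixes such as .ome.tiff."""
    counts = Counter()
    for entry in source.get("files", []):
        # .suffixes returns every dot-separated suffix, e.g. ['.ome', '.tiff'].
        # File names with extra dots will therefore yield longer "extensions".
        suffixes = PurePosixPath(entry["rel_path"]).suffixes
        counts["".join(suffixes) or "(none)"] += 1
    return counts


# Invented example data, shaped like the `_source` of a single hit (assumption).
example_source = {
    "data_types": ["example_type"],
    "files": [
        {"rel_path": "output/expr/reg001_expr.ome.tiff"},
        {"rel_path": "output/mask/reg001_mask.ome.tiff"},
        {"rel_path": "output/summary.csv"},
    ],
}

extension_counts = count_extensions(example_source)
print(extension_counts)
```

With a real response, the same function could be called as `count_extensions(hits[0]['_source'])`, provided the assumption about `rel_path` holds.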

We can also use Elasticsearch queries to find specific datasets. Instead of listing the UUIDs we want information on, we can specify the type of dataset that we are interested in, such as datasets with the CODEX data type that contain ome.tiff files.

In [None]:
# Combine "must" (match) and "filter" clauses to search for specific datasets

hits = json.loads(
    requests.post(
        search_api,
        json={
            "size": 100,
            "query": {
                "bool": {
                    "must": [
                        {
                            "match": {
                                "files.rel_path": "ome.tiff" # find entities with an ome.tiff file
                            }
                        },
                        {
                            "match": {
                                "assay_display_name": "CODEX" # find entities with CODEX data types
                            }
                        }
                    ],
                    "filter": [
                        {
                            "bool": {
                                "must_not": {
                                    "exists": {
                                        "field": "next_revision_uuid" # this is an artifact of the Portal, filtering out some old data.
                                    }
                                }
                            }
                        },
                        {
                            "term": {
                                "entity_type.keyword": "Dataset" # find entities that are datasets
                            }
                        }
                    ]
                }
            },
            "_source": [
                "hubmap_id",
                "group_uuid",
                "uuid",
                "entity_type",
                "assay_display_name",
                "donor.mapped_metadata.sex" # use dot notation to get specific fields
            ]
        }, 
    ).text
)['hits']['hits']

hits
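The returned hits are nested dictionaries, which are awkward to scan. The sketch below flattens each hit's `_source` into one row, keyed by the same dot notation that the request uses in its `_source` list. `get_by_path` and `hits_to_rows` are hypothetical helpers, and the example hits use placeholder values rather than real Portal data.

```python
from typing import Any


def get_by_path(record: dict, path: str, default: Any = None) -> Any:
    """Follow a dotted path such as 'donor.mapped_metadata.sex' through nested dicts."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # The field is missing for this hit.
        current = current[key]
    return current


def hits_to_rows(hits: list, fields: list) -> list:
    """Turn a list of search hits into flat rows with one column per requested field."""
    return [
        {field: get_by_path(hit.get("_source", {}), field) for field in fields}
        for hit in hits
    ]


# Placeholder hits, shaped like the search response (assumption, not real data).
example_hits = [
    {
        "_source": {
            "hubmap_id": "HBM000.XXXX.000",
            "entity_type": "Dataset",
            "donor": {"mapped_metadata": {"sex": ["Placeholder"]}},
        }
    },
    {"_source": {"hubmap_id": "HBM000.YYYY.000", "entity_type": "Dataset"}},
]

rows = hits_to_rows(example_hits, ["hubmap_id", "donor.mapped_metadata.sex"])
for row in rows:
    print(row)
```

Missing fields come back as `None` instead of raising an error, which is convenient when not every dataset has donor metadata. The resulting list of dictionaries can be passed to `pandas.DataFrame(rows)` for a tabular view.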

## Try it for yourself!
Try to examine some data yourself. Perhaps you want to find specific datasets, or retrieve information about datasets you already know. Feel free to use our daily office hours to ask questions about this as well!