# Metadata access and use

This notebook contains three tutorials. The first shows how to access metadata about studies (used to calculate frequencies) through one of our services. The second presents Python functions that transform that metadata into dictionaries indexed by bioproject id, with the populations of each study in turn indexed by their biosample ids. This transformation preserves the hierarchical structure of the studies and their populations. The third presents a way to index populations by their biosample ids using a flatter structure than the original.

## Accessing the metadata

In this tutorial we provide code that allows you to access the metadata using our services. The function `get_metadata` below shows not only how to retrieve the metadata in JSON format, but also how to check if any errors occurred that prevented that retrieval.

In [None]:
%pip install --quiet requests
%pip install --quiet ratelimit

In [None]:
import json
from ratelimit import limits
from requests import get, codes as http_code
from typing import Any, Dict, List


VarFreqMetadata = List[Dict[str, Any]]
MetadataDict = Dict[str, Dict[str, Any]]


# At most one call per second; calling again sooner raises
# ratelimit.RateLimitException
@limits(calls=1, period=1)
def get_metadata() -> VarFreqMetadata:
    """
    Retrieve information that describes all studies and populations
    used by the frequency endpoints
    """
    METADATA_URL = ("https://api.ncbi.nlm.nih.gov/variation/v0/"
                    "metadata/frequency")

    reply = get(METADATA_URL)
    if reply.status_code != http_code.ok:
        raise Exception("Request failed: {}\n{}".format(
            reply.status_code, METADATA_URL))

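    # The header may carry parameters (e.g. "; charset=utf-8"), so
    # compare only the media type itself
    content_type = reply.headers.get('content-type', '')
    if content_type.split(';')[0].strip().lower() != 'application/json':
        raise Exception("Unexpected content type: {}\n{}".format(
            content_type, METADATA_URL))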

    return reply.json()

Now that we have that function available, we can use it to retrieve the metadata. We also print it in a human-friendly manner.

In [None]:
metadata = get_metadata()

print(json.dumps(metadata, indent=4, sort_keys=True))

The metadata retrieved from our backend is a list of dictionaries. Each dictionary corresponds to a study, or "Bioproject", which has an accession called `bioproject_id` and contains a list of populations. Each of those populations is called a "Biosample" and has an accession called `biosample_id`. A population may, in turn, have subpopulations. Each subpopulation has its own biosample id and may itself contain subpopulations. Biosample ids are unique across all studies.
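The uniqueness of biosample ids matters for the later tutorials, because a dictionary keyed by biosample id silently overwrites entries that share a key. The sketch below shows one way to check that property. It is only an illustration: the helper `iter_biosample_ids`, the toy data, and all accessions in it are hypothetical. The only keys assumed are `bioproject_id`, `populations`, `biosample_id`, and `subs`, as used by the code in this notebook.

```python
from collections import Counter
from typing import Any, Dict, Iterator, List

# Toy metadata in the shape of the raw list (all accessions are made up)
toy_metadata: List[Dict[str, Any]] = [
    {
        "bioproject_id": "PRJ_TOY_1",
        "populations": [
            {
                "biosample_id": "SAMN_A",
                "subs": [
                    {"biosample_id": "SAMN_B"},
                    {"biosample_id": "SAMN_C"},
                ],
            }
        ],
    },
    {
        "bioproject_id": "PRJ_TOY_2",
        "populations": [{"biosample_id": "SAMN_D"}],
    },
]


def iter_biosample_ids(pops: List[Dict[str, Any]]) -> Iterator[str]:
    """Yield the id of every population and subpopulation, depth first"""
    for pop in pops:
        yield pop["biosample_id"]
        yield from iter_biosample_ids(pop.get("subs") or [])


# Count how often each biosample id occurs across all studies
counts = Counter(
    biosample_id
    for study in toy_metadata
    for biosample_id in iter_biosample_ids(study.get("populations") or [])
)
duplicates = [biosample_id for biosample_id, n in counts.items() if n > 1]
print(duplicates)
```

To run the same check on the real data, replace `toy_metadata` with the `metadata` list retrieved above.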

## Transforming the metadata to dictionaries indexed by accessions

In its raw form, the metadata coming from our backends is a list of dictionaries that contain information about studies (Bioprojects). The populations that provided data for each such study are stored in lists inside each study dictionary. Thus, to find the information about a given population, you need to first iterate sequentially over the list of studies. Then you iterate over the list of populations until you find the accession (stored with the key `biosample_id`) you are looking for.
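The sequential search described above can be sketched as follows. This is a minimal illustration and not part of our services: the helpers `find_in_populations` and `find_population`, the toy data, its accessions, and the `name` field are all hypothetical. The sketch also assumes that subpopulations are stored under the key `subs`, as in the code later in this notebook.

```python
from typing import Any, Dict, List, Optional

# Toy metadata in the shape of the raw list (ids and "name" are made up)
toy_metadata: List[Dict[str, Any]] = [
    {
        "bioproject_id": "PRJ_TOY",
        "populations": [
            {
                "biosample_id": "SAMN_A",
                "name": "Total",
                "subs": [
                    {"biosample_id": "SAMN_B", "name": "Group 1"},
                    {"biosample_id": "SAMN_C", "name": "Group 2"},
                ],
            }
        ],
    }
]


def find_in_populations(pops: List[Dict[str, Any]],
                        biosample_id: str) -> Optional[Dict[str, Any]]:
    """Depth-first search of a list of populations and their subpopulations"""
    for pop in pops:
        if pop.get("biosample_id") == biosample_id:
            return pop
        found = find_in_populations(pop.get("subs") or [], biosample_id)
        if found is not None:
            return found
    return None


def find_population(metadata: List[Dict[str, Any]],
                    biosample_id: str) -> Optional[Dict[str, Any]]:
    """Scan every study in turn until the population is found"""
    for study in metadata:
        found = find_in_populations(
            study.get("populations") or [], biosample_id)
        if found is not None:
            return found
    return None


print(find_population(toy_metadata, "SAMN_C"))
```

Every lookup walks the studies and their population trees again, which is the cost that indexing by accession avoids.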

In this tutorial we provide Python code that allows you to transform that list into a dictionary of dictionaries. The outermost dictionary contains the information of each study indexed by that study's bioproject id. This is done by the `convert_metadata_to_dict` function below. 

Each study dictionary contains another dictionary whose keys are the accessions (biosample ids) of its populations. Because a population can contain subpopulations, each population dictionary in turn stores its subpopulations as a dictionary keyed by their biosample ids. This reorganization is performed by the function `to_pop_dict` below, which `convert_metadata_to_dict` calls for each population in a study's list.

In [None]:
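def to_pop_dict(pop: Any) -> Dict[str, Any]:
    """
    Convert one population into a {biosample_id: population} dictionary,
    re-keying its subpopulations recursively in the same way.
    Note: `pop` is modified in place.
    """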
    pop_accession = pop.pop("biosample_id", None)

    sub_pops = pop.pop("subs", None)

    # at this point, the population object contains
    # just simple elements
    pop_dict: Dict[str, Any] = pop.copy()

    if sub_pops:
        pop_sub_pops: Dict[str, Any] = dict()
        for sub_pop in sub_pops:
            pop_sub_pops.update(to_pop_dict(sub_pop))
        pop_dict["subs"] = pop_sub_pops

    return {pop_accession: pop_dict}


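def convert_metadata_to_dict(metadata_orig: VarFreqMetadata) -> MetadataDict:
    """
    Convert the list of studies into a dictionary keyed by bioproject id.
    Note: the studies in `metadata_orig` are modified in place.
    """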
    metadata_dict: MetadataDict = dict()
    for study in metadata_orig:
        bioproject_id = study.pop("bioproject_id", None)
        # fall back to an empty list so the loop below never iterates None
        study_pop_md = study.pop("populations", None) or []

        # at this point, study contains just simple elements
        study_metadata: Dict[str, Any] = study.copy()

        populations: Dict[str, Any] = dict()
        for pop in study_pop_md:
            populations.update(to_pop_dict(pop))

        study_metadata.update(
            {"populations": populations})

        metadata_dict.update({bioproject_id: study_metadata})

    return metadata_dict

In [None]:
import copy

metadata_dict = convert_metadata_to_dict(copy.deepcopy(metadata))

In [None]:
print(json.dumps(metadata_dict, indent=4, sort_keys=True))
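With the metadata keyed by accession, a study or one of its top-level populations becomes a direct dictionary lookup. The sketch below illustrates this on a hand-written dictionary in the shape described above. The accessions and the `name` field are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict

# Hand-written toy dictionary in the shape produced by the conversion
# (all accessions and the "name" field are made up)
toy_metadata_dict: Dict[str, Dict[str, Any]] = {
    "PRJ_TOY": {
        "populations": {
            "SAMN_A": {
                "name": "Total",
                "subs": {
                    "SAMN_B": {"name": "Group 1"},
                    "SAMN_C": {"name": "Group 2"},
                },
            }
        }
    }
}

# A study and its top-level populations are direct key lookups
study = toy_metadata_dict["PRJ_TOY"]
top_pop = study["populations"]["SAMN_A"]

# A subpopulation is reached through the "subs" dictionary of its parent
sub_pop = top_pop["subs"]["SAMN_C"]

print(sorted(study["populations"]))
print(sub_pop["name"])
```

Reaching a subpopulation this way still requires knowing its study and its parent populations, which is the limitation the next tutorial addresses.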

## Indexing populations by their biosample ids

Here, we demonstrate how to index populations by their biosample ids using a flatter structure than the original. In the original metadata, when a population has one or more subpopulations, they are included in a list. If one of those subpopulations itself has subpopulations, you need to descend further into the structure until you find the population you need.

In the flatter version, all populations sit at the top level of a single dictionary keyed by biosample id. If a population has subpopulations, only their biosample ids are stored, as a list. You can then look up any of those ids directly in the same flat dictionary.
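As a rough illustration of that shape, the sketch below uses a hand-written flat dictionary and resolves the ids in a `subs` list back to their entries. The accessions and the `name` field are hypothetical, and the example uses a single level of nesting to keep it small.

```python
from typing import Any, Dict, List

# Hand-written toy dictionary in the flat shape (ids and "name" are made up)
toy_pop_dict: Dict[str, Dict[str, Any]] = {
    "SAMN_A": {"name": "Total", "subs": ["SAMN_B", "SAMN_C"]},
    "SAMN_B": {"name": "Group 1"},
    "SAMN_C": {"name": "Group 2"},
}

# Any population, at any depth of the original tree, is one lookup away
print(toy_pop_dict["SAMN_C"]["name"])

# The ids stored in "subs" are resolved with further lookups
# in the same dictionary
sub_names: List[str] = [
    toy_pop_dict[sub_id]["name"]
    for sub_id in toy_pop_dict["SAMN_A"].get("subs", [])
]
print(sub_names)
```

Populations without subpopulations simply have no `subs` key, so `.get("subs", [])` is used when reading it.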

The function `build_population_dictionary` builds this dictionary of populations by visiting each bioproject in the metadata, and adding each of its populations to the dictionary. It uses the function `flatten_population_tree` to do this.

In [None]:
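def flatten_population_tree(pop_list: List[Dict[str, Any]]) -> MetadataDict:
    """
    Return a flat {biosample_id: population} dictionary for the given
    populations and all their descendants. Each "subs" list is replaced
    by a list of the biosample ids found below that population.
    Note: the populations in `pop_list` are modified in place.
    """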
    result: MetadataDict = dict()
    for population in pop_list:
        biosample_id = population.pop("biosample_id", None)
        sub_pops = population.pop("subs", None)

        result.update({biosample_id: population})
        if sub_pops:
            subs = flatten_population_tree(sub_pops)
            if subs:
                population["subs"] = list(subs.keys())
                result.update(subs)
    return result

def build_population_dictionary(metadata_orig: VarFreqMetadata) -> MetadataDict:
    result: MetadataDict = dict()

    for study in metadata_orig:
        result.update(
            flatten_population_tree(study["populations"]))

    return result

Now we can call that function and use it to first print the information about the global population. Then we print the entire dictionary.

In [None]:
pop_dict = build_population_dictionary(copy.deepcopy(metadata))

# prints information about the global population
print(json.dumps(pop_dict["SAMN10492705"], indent=4, sort_keys=True))

In [None]:
# prints the entire dictionary
print(json.dumps(pop_dict, indent=4, sort_keys=True))