# Consuming ICGC datasets from APIs

**Goal**: obtain simple somatic mutation (SSM) data from the [BRCA projects](https://pubmed.ncbi.nlm.nih.gov/27135926/) for donor samples that were analyzed using whole genome sequencing (WGS) platforms.

The ICGC Data Portal documents its [API endpoints](https://docs.icgc.org/portal/api-endpoints/). We will query the API using its [Portal Query Language (PQL)](https://github.com/icgc-dcc/dcc-portal/blob/develop/dcc-portal-pql/PQL.md).

1. To obtain file sizes, we will query the [/download/sizePQL](https://docs.icgc.org/portal/api-endpoints/#!/download/getDataTypeSizePerFileTypeFromPQL) endpoint.

In [1]:
from pathlib import Path

import requests

In [2]:
data_dir = Path.cwd() / "data"

In [3]:
url_base = "https://dcc.icgc.org/api/v1/"

In [4]:
def get_filesize(url_base: str, pql_query: str, datatype: str = "ssm"):
    """Return the size reported by /download/sizePQL for one data type."""
    url_filesize = f"{url_base}download/sizePQL"

    # Pass the PQL string via `params` so that requests percent-encodes it.
    response = requests.get(url_filesize, params={"pql": pql_query}, timeout=30)
    if response.status_code != 200:
        raise IOError(f"GET {response.url} resulted in status code {response.status_code}")

    # The response lists one entry per data type; return the matching one.
    file_sizes = response.json()["fileSize"]
    for dataset in file_sizes:
        if dataset["label"] == datatype:
            return dataset["sizes"]

    raise ValueError(f"GET {response.url} does not contain the {datatype} data type.")


def download_brca_ssm_datasets():
    """Print the SSM file size for each BRCA project (nothing is downloaded yet)."""
    url_base = "https://dcc.icgc.org/api/v1/"
    brca_projects = ["BRCA-EU", "BRCA-FR", "BRCA-UK", "BRCA-US"]

    for brca_project in brca_projects:
        pql_query = (
            "select(*),"
            f"in(donor.projectId,'{brca_project}'),"
            "in(donor.availableDataTypes,'ssm'),"
            "in(donor.analysisTypes,'WGS')"
        )
        size_mb = get_filesize(url_base, pql_query, datatype="ssm") / 1024**2
        print(f"SSM file in project {brca_project} is of size: {size_mb:.2f} MB.")

In [5]:
download_brca_ssm_datasets()

SSM file in project BRCA-EU is of size: 152.09 MB.
SSM file in project BRCA-FR is of size: 29.09 MB.
SSM file in project BRCA-UK is of size: 22.33 MB.
SSM file in project BRCA-US is of size: 1.89 MB.
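
The PQL string in `download_brca_ssm_datasets` contains parentheses, commas, and quotes, so it can help to assemble it from its filters and percent-encode it before it goes into a URL. The following is a minimal sketch that uses only the standard library; the helper name `build_pql` and its parameters are hypothetical and not part of the ICGC API or of the notebook above.

```python
from urllib.parse import parse_qs, urlencode


def build_pql(project_id: str, data_type: str = "ssm", analysis_type: str = "WGS") -> str:
    """Assemble the PQL filter string used in this notebook (hypothetical helper)."""
    filters = [
        "select(*)",
        f"in(donor.projectId,'{project_id}')",
        f"in(donor.availableDataTypes,'{data_type}')",
        f"in(donor.analysisTypes,'{analysis_type}')",
    ]
    return ",".join(filters)


pql = build_pql("BRCA-EU")

# urlencode percent-encodes the reserved characters in the query value.
encoded = urlencode({"pql": pql})

# Decoding the query string should give back the original PQL text.
decoded = parse_qs(encoded)["pql"][0]

print(pql)
print(encoded)
```

Building the query from a list keeps each filter on its own line, which makes it easier to add or remove a condition than editing one long f-string.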


## Multiple entries for the same mutation in a donor

A simple somatic mutation (SSM) donor dataset in the ICGC portal can contain multiple records for the same variant in a donor. These records differ in the following fields: `consequence_type`, `aa_mutation`, `cds_mutation`, `gene_affected`, and `transcript_affected`. This is a result of annotation with [SnpEff](http://pcingola.github.io/SnpEff/), a genome variant annotation and effect prediction tool.

A single variant can have multiple functional effects (`consequence_type`). One reason is the presence of [multiple gene isoforms](https://en.wikipedia.org/wiki/Gene_isoform). These isoforms, while coming from the same locus, can differ in transcription start site, coding DNA sequence, and/or untranslated regions. As a result, [these gene isoforms can have different functions](https://en.wikipedia.org/wiki/Protein_isoform). Sometimes a variant is transcribed and introduces a synonymous or missense mutation into the transcript. Other times the variant is not present in the transcript isoform but can still influence splice site recognition. For these reasons, the same variant in a donor can have multiple `transcript_affected` values for the same `gene_affected`.

Additionally, a variant can lie some distance upstream or downstream of another gene and influence its transcription. As a result, `gene_affected` can also differ for the same variant in a donor.
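
To see how these repeated records relate to one another, it can help to group them by donor and mutation and list the affected genes and transcripts under each variant. The sketch below does this with the standard library only. The records are made-up toy values, not real ICGC data, and the identifier column names `icgc_donor_id` and `icgc_mutation_id` are assumptions about the SSM file layout; the function name `group_annotations` is hypothetical.

```python
from collections import defaultdict

# Made-up toy records for illustration only (NOT real ICGC data).
# The identifier column names are assumed, not confirmed by this notebook.
records = [
    {"icgc_donor_id": "DO_A", "icgc_mutation_id": "MU_1", "gene_affected": "GENE_X",
     "transcript_affected": "TX_1", "consequence_type": "missense_variant"},
    {"icgc_donor_id": "DO_A", "icgc_mutation_id": "MU_1", "gene_affected": "GENE_X",
     "transcript_affected": "TX_2", "consequence_type": "synonymous_variant"},
    {"icgc_donor_id": "DO_A", "icgc_mutation_id": "MU_1", "gene_affected": "GENE_Y",
     "transcript_affected": "TX_3", "consequence_type": "upstream_gene_variant"},
    {"icgc_donor_id": "DO_A", "icgc_mutation_id": "MU_2", "gene_affected": "GENE_Z",
     "transcript_affected": "TX_4", "consequence_type": "intron_variant"},
]


def group_annotations(rows):
    """Map (donor, mutation) -> gene -> set of affected transcripts."""
    grouped = defaultdict(lambda: defaultdict(set))
    for row in rows:
        key = (row["icgc_donor_id"], row["icgc_mutation_id"])
        grouped[key][row["gene_affected"]].add(row["transcript_affected"])
    return grouped


grouped = group_annotations(records)
for (donor, mutation), genes in grouped.items():
    # Count the distinct (gene, transcript) annotations for this variant.
    n_annotations = sum(len(transcripts) for transcripts in genes.values())
    summary = {gene: sorted(transcripts) for gene, transcripts in genes.items()}
    print(donor, mutation, f"{n_annotations} annotation(s)", summary)
```

In this toy input, the variant `MU_1` appears three times: twice for different transcripts of the same gene and once for a neighbouring gene, mirroring the two cases described above.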