<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/law-staging/Notebooks/How_to_use_nested_tables.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Nested Tables in BigQuery


ISB-CGC Community Notebooks
Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!
```
Title:   Working with Nested Tables in BigQuery
Author:  Lauren Wolfe
Created: 2023-07-17
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_use_nested_tables.ipynb
Purpose: To demonstrate strategies for working with nested table structures in Google BigQuery.
Notes:
```

## Introduction

The data that the GDC provides through its API is naturally nested. A patient may, for instance, have multiple diagnoses at the same time, receive multiple treatments to combat their cancer, or have multiple follow-up visits to monitor its progression. As a result, the number of records of each type varies from one research subject to the next. Traditionally, ISB-CGC has stored these supplemental records in separate tables. However, they are extracted from the GDC in the form of a nested JSON object.

BigQuery supports nested columns, and there are advantages to storing the data in this format. Since all of the data is stored in one table, you don't have to locate and join the supplemental tables in order to access data. Instead, you can use UNNEST clauses to flatten the data as needed for your research. This guide provides an introduction to [using UNNEST in your SQL queries](https://cloud.google.com/bigquery/docs/arrays#querying_nested_arrays).
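
To make the contrast concrete, here is a minimal pure-Python sketch of the two storage styles. The records, the `primary_site` field, and all variable names are hypothetical and chosen only for illustration; no BigQuery access is needed. It shows that walking a nested list yields the same flat rows as joining separate tables on `case_id`.

```python
# Hypothetical data: the same case stored two ways.

# 1) Separate tables: supplemental rows point back to their case by key.
case_table = [{"case_id": "case-1", "primary_site": "Lung"}]
diagnosis_table = [
    {"case_id": "case-1", "diagnosis_id": "dx-1"},
    {"case_id": "case-1", "diagnosis_id": "dx-2"},
]

# 2) Nested: supplemental rows live inside the case record as a list.
nested_table = [
    {
        "case_id": "case-1",
        "primary_site": "Lung",
        "diagnoses": [{"diagnosis_id": "dx-1"}, {"diagnosis_id": "dx-2"}],
    }
]

# Flattening the separate tables requires a join on case_id.
joined_rows = [
    {"case_id": c["case_id"], "primary_site": c["primary_site"], "diagnosis_id": d["diagnosis_id"]}
    for c in case_table
    for d in diagnosis_table
    if d["case_id"] == c["case_id"]
]

# Flattening the nested table only requires walking the list,
# which is roughly what an UNNEST clause does.
unnested_rows = [
    {"case_id": c["case_id"], "primary_site": c["primary_site"], "diagnosis_id": d["diagnosis_id"]}
    for c in nested_table
    for d in c["diagnoses"]
]

print(joined_rows == unnested_rows)
```

Both approaches produce one row per diagnosis; the nested form simply removes the need to find and join a second table.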

## Getting Started

Here, we import library modules necessary to run our script, including the bigquery module.

In [1]:
import time
import datetime
import typing

from google.cloud import bigquery

# When receiving a BQ result, we can convert each record row to a dict.
# This is the typing definition for the resulting object.
BigQueryRowObject = dict[str, typing.Union[str, bool, int, float, datetime.datetime, None, dict]]

Next, we need to authorize our access to BigQuery and Google Cloud. For more information, see '[Quick Start Guide to ISB-CGC](https://nbviewer.org/github/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb)'. Alternative authentication methods can be found [here](https://googleapis.dev/python/google-api-core/latest/auth.html).

In [None]:
!gcloud auth application-default login

Finally, we'll define our billing project. Edit the variable below.

In [None]:
# Set the billing project that the BigQuery client will use
project_num = 'your-billing-project-CHANGE-ME' # Update with your Google Cloud project ID
location = 'US'

if project_num == 'your-billing-project-CHANGE-ME':
    print('Please update project_num with your Google Cloud project ID')
else:
    client = bigquery.Client(project_num)

## Example 1: Retrieve treatment_ids for a Given Patient

Say that we want to retrieve all of the treatment_ids associated with a given patient. There are multiple ways to achieve that. One option is to use a BigQuery QueryJob to retrieve the patient record, then use Python to locate all the treatment_ids, as shown below:

In [None]:
def get_query_result(sql: str) -> typing.Union[bigquery.table.RowIterator, None]:
  # Executes a given SQL statement and returns the query result.

  # initialize the QueryJob
  job_config = bigquery.QueryJobConfig()

  # execute the query
  query_job = client.query(query=sql, location=location, job_config=job_config)

  while query_job.state != 'DONE':
    query_job = client.get_job(query_job.job_id, location=location)
    # wait for the job to complete
    if query_job.state != 'DONE':
      time.sleep(5)

  query_job = client.get_job(query_job.job_id, location=location)

  if query_job.error_result is not None:
    print(f'[ERROR] {query_job.error_result}')
    return None

  # Return the query result as a RowIterator object
  return query_job.result()


sql = """
  SELECT *
  FROM `isb-cgc-bq.CGCI_versioned.clinical_nested_gdc_r37`
  WHERE case_id = 'c3f876f4-2d3a-4d60-b6c4-019f94010330'
"""

query_result = get_query_result(sql=sql)

case_record_list = list()

# get_query_result returns None when the query fails, so check before iterating
if query_result is not None:
  for case_row in query_result:
    # convert each Row object into a dictionary
    case_record = dict(case_row.items())
    case_record_list.append(case_record)

for case_record in case_record_list:
  diagnosis_list = case_record['diagnoses']

  for diagnosis in diagnosis_list:
    treatment_list = diagnosis['treatments']

    for treatment in treatment_list:
      print(treatment['treatment_id'])
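
The nested loops above can also be wrapped in a small function and tried on hand-made data, without querying BigQuery. This is a sketch only: `extract_treatment_ids` and the sample records are hypothetical, and the `or []` guards are an assumption about how missing or empty nested columns might appear after converting rows to dictionaries.

```python
from typing import Any


def extract_treatment_ids(case_records: list[dict[str, Any]]) -> list[str]:
    """Collect every treatment_id found under diagnoses -> treatments."""
    treatment_ids = []
    for case_record in case_records:
        # `or []` treats a missing key, a None value, or an empty list as "no records".
        for diagnosis in case_record.get("diagnoses") or []:
            for treatment in diagnosis.get("treatments") or []:
                treatment_ids.append(treatment["treatment_id"])
    return treatment_ids


# Hypothetical records shaped like the dictionaries built above (not real CGCI data).
sample_records = [
    {
        "case_id": "case-1",
        "diagnoses": [
            {"treatments": [{"treatment_id": "tx-1"}, {"treatment_id": "tx-2"}]},
            {"treatments": []},
        ],
    },
    {"case_id": "case-2", "diagnoses": None},
]

print(extract_treatment_ids(sample_records))
```

In the notebook, the same function could be called as `extract_treatment_ids(case_record_list)`.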

An alternative, and simpler, approach is to use an UNNEST clause to retrieve the treatment_ids. This example employs the bigquery magic command, which you can learn more about [here](https://notebook.community/GoogleCloudPlatform/python-docs-samples/notebooks/tutorials/bigquery/BigQuery%20query%20magic).

In [None]:
%%bigquery --project your-billing-project-CHANGE-ME
SELECT treatment.treatment_id
FROM `isb-cgc-bq.CGCI_versioned.clinical_nested_gdc_r37`,
UNNEST(diagnoses) AS diagnosis, # first we unnest diagnoses to access its columns
UNNEST(treatments) AS treatment # then we unnest treatments, a child column of diagnoses
WHERE case_id = 'c3f876f4-2d3a-4d60-b6c4-019f94010330'


## Example 2: Retrieve submitter_ids from Diagnoses and Treatments

In the query below, we select a patient record using the case_id, then use UNNEST to retrieve the diagnosis and treatment submitter_ids.

Note: diagnoses and treatments both contain a column named submitter_id, which creates a naming conflict when unnested. You can address this by explicitly renaming the columns, as shown in the example below. If you don't, BigQuery will append an integer suffix to every duplicate column name (e.g. submitter_id, submitter_id_1).

In [None]:
%%bigquery --project your-billing-project-CHANGE-ME
SELECT diagnosis.submitter_id AS diagnosis_submitter_id,
  treatment.submitter_id AS treatment_submitter_id
FROM `isb-cgc-bq.CGCI_versioned.clinical_nested_gdc_r37` AS base_case,
UNNEST(diagnoses) AS diagnosis
LEFT JOIN UNNEST(diagnosis.treatments) AS treatment
WHERE case_id = '39dce88d-112c-4a3d-b2d2-11e0616594d8'
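
Example 1 joins treatments with a comma, while this query uses LEFT JOIN. The practical difference is what happens to a diagnosis that has no treatments. The sketch below imitates both behaviors in plain Python at the treatments level only; `flatten_treatments` and the sample case are hypothetical, and it is an illustration of the join semantics rather than of BigQuery's implementation.

```python
from typing import Any, Optional


def flatten_treatments(
    case: dict[str, Any], keep_empty: bool
) -> list[tuple[str, Optional[str]]]:
    """Return (diagnosis_submitter_id, treatment_submitter_id) pairs.

    keep_empty=False imitates a comma (cross) join: diagnoses without
    treatments produce no rows. keep_empty=True imitates LEFT JOIN: such
    diagnoses are kept, with None standing in for NULL.
    """
    rows = []
    for diagnosis in case["diagnoses"]:
        treatments = diagnosis["treatments"]
        if not treatments and keep_empty:
            treatments = [None]  # pad so the diagnosis still yields one row
        for treatment in treatments:
            treatment_id = treatment["submitter_id"] if treatment else None
            rows.append((diagnosis["submitter_id"], treatment_id))
    return rows


# Hypothetical case: one diagnosis with a treatment, one without.
sample_case = {
    "diagnoses": [
        {"submitter_id": "dx-A", "treatments": [{"submitter_id": "tx-A1"}]},
        {"submitter_id": "dx-B", "treatments": []},
    ]
}

cross_rows = flatten_treatments(sample_case, keep_empty=False)
left_rows = flatten_treatments(sample_case, keep_empty=True)
print(cross_rows)
print(left_rows)
```

With the cross-join behavior, `dx-B` disappears from the output; with the LEFT JOIN behavior, it is kept alongside a `None` treatment.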

## Example 3: Unnesting Multiple Column Groups

The case we used in the above query doesn't have any follow_up records, so let's look at a different `case_id`: `18395371-3c84-4d39-8ace-a3546e9ea34e`

If we unnest all three nested columns in a single query, as the query below does, the output is the Cartesian product: every possible combination of the diagnosis, treatment and follow_up records.

The new case has one diagnosis record, 8 treatment records, and 6 follow_up records, so our result will have $1 \times 8 \times 6 = 48$ rows.

The best way to avoid this is to unnest one set of nested columns at a time, e.g. diagnoses (and optionally its children: diagnoses.treatments, diagnoses.pathology_details, diagnoses.annotations).

In [None]:
%%bigquery --project your-billing-project-CHANGE-ME
SELECT base_case.case_id,
  diagnosis.diagnosis_id,
  treatment.treatment_id,
  follow_up.follow_up_id
FROM `isb-cgc-bq.CGCI_versioned.clinical_nested_gdc_r37` AS base_case
LEFT JOIN UNNEST(diagnoses) AS diagnosis
LEFT JOIN UNNEST(diagnosis.treatments) AS treatment
LEFT JOIN UNNEST(follow_ups) AS follow_up
WHERE case_id = '18395371-3c84-4d39-8ace-a3546e9ea34e'
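
The row-count arithmetic can be checked with a short Python sketch. The IDs below are hypothetical placeholders that only match the record counts described for this case. The two SQL strings show one possible way to split the query so that each set of nested columns is unnested separately; they reuse the table and column names from the query above but have not been run here.

```python
from itertools import product

# Hypothetical IDs with the same counts as the case described above.
diagnosis_ids = ["dx-1"]
treatment_ids = [f"tx-{n}" for n in range(1, 9)]   # 8 treatments
follow_up_ids = [f"fu-{n}" for n in range(1, 7)]   # 6 follow-ups

# Unnesting everything in one query pairs every treatment row
# with every follow-up row.
combined_rows = list(product(diagnosis_ids, treatment_ids, follow_up_ids))
print(len(combined_rows))

# Unnesting one group per query keeps the row counts additive.
diagnosis_rows = list(product(diagnosis_ids, treatment_ids))
follow_up_rows = [(follow_up_id,) for follow_up_id in follow_up_ids]
print(len(diagnosis_rows), len(follow_up_rows))

# The same split expressed as two separate queries (untested sketch).
DIAGNOSIS_SQL = """
  SELECT base_case.case_id, diagnosis.diagnosis_id, treatment.treatment_id
  FROM `isb-cgc-bq.CGCI_versioned.clinical_nested_gdc_r37` AS base_case
  LEFT JOIN UNNEST(diagnoses) AS diagnosis
  LEFT JOIN UNNEST(diagnosis.treatments) AS treatment
  WHERE case_id = '18395371-3c84-4d39-8ace-a3546e9ea34e'
"""

FOLLOW_UP_SQL = """
  SELECT base_case.case_id, follow_up.follow_up_id
  FROM `isb-cgc-bq.CGCI_versioned.clinical_nested_gdc_r37` AS base_case
  LEFT JOIN UNNEST(follow_ups) AS follow_up
  WHERE case_id = '18395371-3c84-4d39-8ace-a3546e9ea34e'
"""
```

The combined approach yields 48 rows, while the split approach yields 8 diagnosis/treatment rows and 6 follow_up rows, each of which corresponds to exactly one real record.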