<a href="https://colab.research.google.com/github/isb-cgc/Community-Notebooks/blob/master/HTAN/Python%20Notebooks/Creating_CDS_Data_Import_Manifests_Using_BQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating CDS Data Import Manifests Using Google BigQuery




        Title:   Creating CDS Data Import Manifests Using Google BigQuery
        Author:  Clarisse Lau
        Created: April 2024
        Purpose: Demonstrate how Google BigQuery tables can be used to generate a DRS manifest for retrieving data from the Cancer Data Service (CDS)


# 1. Introduction & Overview
The Human Tumor Atlas Network ([HTAN](https://humantumoratlas.org/)) is a National Cancer Institute (NCI)-funded Cancer Moonshot<sup>SM</sup> initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease [[Cell April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)].


### 1.1 Goal
The Cancer Data Service (CDS) houses HTAN controlled-access genomic level 1 & 2 data, and open-access imaging level 2 data accessible through the Seven Bridges Cancer Genomics Cloud (SB-CGC).

To access CDS data, users are encouraged to utilize the [Import from a manifest file](https://docs.cancergenomicscloud.org/docs/import-from-a-drs-server#import-from-a-manifest-file) method as detailed in the SB-CGC documentation. This involves importing a manifest file into their SB-CGC Data Studio workspace, which automatically retrieves files specified in the manifest.

Manifests for HTAN data can be generated in three different ways. This notebook illustrates one of them: using Google BigQuery to create a CDS Data Repository Service (DRS) manifest compatible with the above import method.

The remaining two methods for generating manifests are described in the [HTAN Missing Manual](https://docs.humantumoratlas.org/access_controlled/cds_access/).

### 1.2 Inputs, Outputs, & Data
The originating data can be found on the [HTAN Data Portal](https://humantumoratlas.org/), and the compiled tables are on the [ISB-Cancer Gateway in the Cloud](https://isb-cgc.appspot.com/bq_meta_search/).

Each query output loads into a Data Table, an interactive display of the resulting columns and rows. Select the link below the table to open the [Data Table Notebook](https://colab.research.google.com/notebooks/data_table.ipynb), which gives tips on filtering and further customizing the table.

### 1.3 Notes
The queries and results in this notebook correspond to ISB-CGC HTAN Release 5. To choose a different release, edit the BigQuery table names in this notebook by replacing the string `r5` with the desired numbered release, e.g. `r4`. To get results for the most current data release, replace `r5` with `current` and `HTAN_versioned` with `HTAN`.

(For example, replace `isb-cgc-bq.HTAN_versioned.clinical_tier1_demographics_r5` with `isb-cgc-bq.HTAN.clinical_tier1_demographics_current`.)
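
The substitution rule described in these notes can be sketched as a small helper. The function name `htan_table` is hypothetical and not part of any ISB-CGC library; it only assembles a table ID string following the naming pattern shown above.

```python
def htan_table(name: str, release: str = "r5") -> str:
    """Build a fully qualified HTAN BigQuery table ID (hypothetical helper).

    `name` is the base table name, e.g. 'clinical_tier1_demographics'.
    `release` is a numbered release such as 'r5' or 'r4', or 'current'.
    """
    if release == "current":
        # The most current release lives in the HTAN dataset
        return f"isb-cgc-bq.HTAN.{name}_current"
    # Numbered releases live in the HTAN_versioned dataset
    return f"isb-cgc-bq.HTAN_versioned.{name}_{release}"


versioned = htan_table("clinical_tier1_demographics")
current = htan_table("clinical_tier1_demographics", release="current")
```

Building table names this way keeps the release in one place, so switching releases does not require editing every query string by hand.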



# 2. Environment & Module Setup

In [1]:
# import libraries
import pandas as pd

Enable interactive table formatting:

In [2]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

# 3. Google Authentication

Running the BigQuery cells in this notebook requires a Google Cloud Project; instructions for creating one can be found in the [Google Cloud Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console). The notebook instance must be authorized to bill that project for queries. For more information on getting started with ISB-CGC, see the [Quick Start Guide to ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods are described in the [Google Cloud Authentication Documentation](https://cloud.google.com/docs/authentication).

## 3.1 Authenticating with Google Credentials



#### Option 1. Running in Google Colab

If you are using Google Colab, run the code block below to authenticate.

In [3]:
from google.colab import auth
auth.authenticate_user()

#### Option 2. Running on local machine

Alternatively, if you're running the notebook locally, take the following steps to authenticate.

1.   Run `gcloud auth application-default login` on your local machine
2.   Uncomment and run the command below, replacing `<path to key>` with the path to your credentials file

In [None]:
# %env GOOGLE_APPLICATION_CREDENTIALS=<path to key>

## 3.2 Initializing the Google BigQuery client


In [4]:
# Import the Google BigQuery client
from google.cloud import bigquery

# Set the Google Cloud project that will be billed for this notebook's queries
# (replace 'my-project' with your own project ID)
google_project = 'my-project'

# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)

# 4. Selecting Data




In this example, we select a subset of CyCIF images submitted by the HTAN TNP SARDANA center. Filtering on a non-null `CDS_Release` value restricts the results to files that are available in CDS.

In [9]:
sardana_cycif = client.query("""
  SELECT DISTINCT entityId,Data_Release,CDS_Release,Filename,HTAN_Center
  FROM `isb-cgc-bq.HTAN_versioned.imaging_level2_metadata_r5`
  WHERE HTAN_Center = 'HTAN TNP SARDANA'
  AND Imaging_Assay_Type = 't-CyCIF'
  AND CDS_Release IS NOT NULL
""").result().to_dataframe()

sardana_cycif

Unnamed: 0,entityId,Data_Release,CDS_Release,Filename,HTAN_Center
0,syn25075311,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-002.ome.tif,HTAN TNP SARDANA
1,syn25075782,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-007.ome.tif,HTAN TNP SARDANA
2,syn25076441,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-014.ome.tif,HTAN TNP SARDANA
3,syn25075400,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-020.ome.tif,HTAN TNP SARDANA
4,syn25075901,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-025.ome.tif,HTAN TNP SARDANA
5,syn25075429,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-029.ome.tif,HTAN TNP SARDANA
6,syn25075926,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-034.ome.tif,HTAN TNP SARDANA
7,syn25075282,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-039.ome.tif,HTAN TNP SARDANA
8,syn25075948,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-044.ome.tif,HTAN TNP SARDANA
9,syn25114909,Release 2.0,v22.6.2.img,imaging_level_2/WD-76845-045.ome.tif,HTAN TNP SARDANA
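
To reuse the selection above for other centers or assay types, the filter values can be passed as BigQuery query parameters rather than edited into the SQL text. The following is a minimal sketch, not part of the original notebook: the helper name `select_cds_images` and the constant `CDS_IMAGES_SQL` are hypothetical, and the function expects a `bigquery.Client` like the one created in section 3.2.

```python
# Named parameters (@center, @assay) keep the filter values out of the SQL text
CDS_IMAGES_SQL = """
  SELECT DISTINCT entityId, Data_Release, CDS_Release, Filename, HTAN_Center
  FROM `isb-cgc-bq.HTAN_versioned.imaging_level2_metadata_r5`
  WHERE HTAN_Center = @center
  AND Imaging_Assay_Type = @assay
  AND CDS_Release IS NOT NULL
"""


def select_cds_images(client, center, assay):
    """Run the query above for one center and assay type (hypothetical helper)."""
    # Imported here so the definitions load even before the library is needed
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("center", "STRING", center),
            bigquery.ScalarQueryParameter("assay", "STRING", assay),
        ]
    )
    return client.query(CDS_IMAGES_SQL, job_config=job_config).result().to_dataframe()
```

Under these assumptions, `select_cds_images(client, 'HTAN TNP SARDANA', 't-CyCIF')` would run the same selection as the cell above.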


# 5. Obtaining DRS URIs

A mapping of HTAN Synapse entity IDs (`entityId`) and HTAN Data File IDs (`HTAN_Data_File_ID`) to their corresponding CDS DRS URIs can be found in the ISB-CGC BigQuery table `isb-cgc-bq.HTAN_versioned.cds_drs_map_r5`. Below, we view the table contents.

In [10]:
cds_drs = client.query("""
  SELECT * FROM `isb-cgc-bq.HTAN_versioned.cds_drs_map_r5`
""").result().to_dataframe()

In [11]:
cds_drs.head(100)

Unnamed: 0,name,entityId,HTAN_Data_File_ID,drs_uri
0,S1.bam,syn26546212,HTA7_1_1453,drs://nci-crdc.datacommons.io/dg.4DFC/adde2232...
1,S2.bam,syn26546217,HTA7_1_1564,drs://nci-crdc.datacommons.io/dg.4DFC/e836d169...
2,S3.bam,syn26546241,HTA7_1_1598,drs://nci-crdc.datacommons.io/dg.4DFC/b7a010f7...
3,S4.bam,syn26546518,HTA7_1_1609,drs://nci-crdc.datacommons.io/dg.4DFC/897a0598...
4,S5.bam,syn26546529,HTA7_1_1620,drs://nci-crdc.datacommons.io/dg.4DFC/3a036835...
...,...,...,...,...
95,S96.bam,syn26546615,HTA7_1_1671,drs://nci-crdc.datacommons.io/dg.4DFC/c639614d...
96,S97.bam,syn26546632,HTA7_1_1672,drs://nci-crdc.datacommons.io/dg.4DFC/e39c917f...
97,S98.bam,syn26546633,HTA7_1_1673,drs://nci-crdc.datacommons.io/dg.4DFC/c27ca2f0...
98,S99.bam,syn26546634,HTA7_1_1674,drs://nci-crdc.datacommons.io/dg.4DFC/69bd494b...


### Subset DRS IDs to files of interest

Using the tables above, we can create a dataframe of our files of interest containing the `name` and `drs_uri` columns, the minimum required in a [DRS manifest file](https://docs.cancergenomicscloud.org/docs/import-from-a-drs-server#manifest-file-format).

In [12]:
drs_table = client.query("""
  SELECT name,drs_uri FROM `isb-cgc-bq.HTAN_versioned.cds_drs_map_r5`
  WHERE entityId in (
  SELECT DISTINCT entityId FROM
  `isb-cgc-bq.HTAN_versioned.imaging_level2_metadata_r5`
  WHERE HTAN_Center = 'HTAN TNP SARDANA'
  AND Imaging_Assay_Type = 't-CyCIF'
  AND CDS_Release IS NOT NULL
  )
""").result().to_dataframe()

drs_table

Unnamed: 0,name,drs_uri
0,CRC01b_07.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/624649d2...
1,CRC01b_08.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/62467074...
2,WD-76845-002.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/6246b71e...
3,WD-76845-007.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/6246fe54...
4,WD-76845-014.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/6247468e...
5,WD-76845-020.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/62478d6a...
6,WD-76845-025.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/62422b68...
7,WD-76845-029.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/624820d6...
8,WD-76845-034.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/6247d978...
9,WD-76845-039.ome.tif,drs://nci-crdc.datacommons.io/dg.4DFC/6248b49c...


### Save result to CSV

#### Option 1. Save to Disk

In [None]:
table_name = 'sardana_cycif.csv'
drs_table.to_csv(table_name,index=False)

Save the resulting table to disk as a CSV file. If running in Google Colab, the saved file can be found under the folder icon in the left sidebar and then downloaded locally.
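
Before uploading, it may be worth confirming that the saved file has the `name` and `drs_uri` columns the manifest format requires. The check below is a minimal sketch using only the Python standard library; the function name `check_manifest` is hypothetical, and the `drs://` prefix test is an assumption based on the URIs shown in the tables above.

```python
import csv
import io

REQUIRED_COLUMNS = ("name", "drs_uri")  # minimum columns named in the manifest format


def check_manifest(handle):
    """Return a list of problems found in a manifest CSV; an empty list means none found."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["name"] or "").strip():
            problems.append(f"line {line_no}: empty name")
        if not (row["drs_uri"] or "").startswith("drs://"):
            problems.append(f"line {line_no}: drs_uri does not start with drs://")
    return problems


# Placeholder content for illustration only; not a real DRS URI
sample = io.StringIO("name,drs_uri\nexample.ome.tif,drs://example.org/placeholder-id\n")
print(check_manifest(sample))  # prints []
```

To check the file written above, open it with `open('sardana_cycif.csv', newline='')` and pass the handle to `check_manifest`.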



#### Option 2. Copy/Paste table to CSV (Google Colab only)

In the resulting data table above, click the copy icon in the upper right (next to `Filter`). Copy the provided CSV and paste it into a local file.

This file can now be uploaded to SB-CGC to obtain the data files of interest following the [Import a DRS Manifest](https://docs.cancergenomicscloud.org/docs/import-from-a-drs-server#import-from-a-manifest-file) instructions.


# 6. Relevant Citations and Links



[HTAN Portal](https://humantumoratlas.org/)

[Missing Manual](https://docs.humantumoratlas.org/)

[Overview paper, Cell, April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)

[NCI Cancer Data Service](https://datacommons.cancer.gov/repository/cancer-data-service)