# Integrating Controlled and Open Access 10X Visium Data in SB-CGC Data Studio

    Title:   Integrating Controlled and Open Access 10X Visium Data in SB-CGC Data Studio
    Author:  Clarisse Lau (clau@systemsbiology.org)
    Created: May 2023

# 1. Introduction & Overview
[HTAN](https://humantumoratlas.org/) is a National Cancer Institute (NCI)-funded Cancer Moonshot<sup>SM</sup> initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease ([Cell, April 2020](https://www.sciencedirect.com/science/article/pii/S0092867420303469)).

__Important__: This notebook is intended to be run within the Seven Bridges Cancer Genomics Cloud (SB-CGC) Data Studio. You must have dbGaP authorization to access controlled-access HTAN data within SB-CGC. See the [HTAN Missing Manual](https://docs.humantumoratlas.org/access_controlled/db_gap/) for instructions on how to request dbGaP access.

### 1.1 Goal

This notebook demonstrates how open-access HTAN data can be pulled from Synapse into SB-CGC and used alongside lower-level, dbGaP-authorized data in the cloud. We use ISB-CGC BigQuery metadata tables to obtain the relevant file information.

### 1.2 Inputs and Outputs

In this example, we aim to replicate outputs of the spaceranger pipeline using 10X Visium data submitted by the Washington University in St. Louis (WashU) HTAN Center for run _HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test_.



# 2. Environment & Module Setup

### 2.1 Google Authentication

Running the BigQuery cells in this notebook requires a Google Cloud project; instructions for creating one are in the [Google Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console). The instance must be authorized to bill that project for queries. For more information on getting started in the cloud, see the [Quick Start Guide to ISB-CGC](https://nbviewer.org/github/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb). Alternative authentication methods are described in the [Google Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console).

Before running this notebook, follow Google's documentation to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) for your OS.
Then open a Terminal from the Data Studio Launcher and run the following command to set up application credentials for Google BigQuery:

`gcloud auth application-default login`

Follow the prompts to complete authentication.
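
Once the login completes, it can be useful to confirm from the notebook that a credentials file is actually discoverable. The following is a minimal sketch, not part of the original workflow: the helper name `find_adc_file` is hypothetical, and the default path assumes a Linux or macOS environment where the Cloud SDK writes to `~/.config/gcloud` (the `CLOUDSDK_CONFIG` variable can relocate it).

```python
import os
from pathlib import Path


def find_adc_file(environ=None, home=None):
    """Return the path of an Application Default Credentials file, or None.

    Hypothetical helper for illustration; it only checks well-known locations
    and does not validate the credentials themselves.
    """
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    # An explicit credentials file takes precedence when this variable is set.
    explicit = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return Path(explicit)

    # Assumed default location used by `gcloud auth application-default login`
    # on Linux and macOS; CLOUDSDK_CONFIG overrides the configuration directory.
    config_dir = Path(environ.get("CLOUDSDK_CONFIG", home / ".config" / "gcloud"))
    candidate = config_dir / "application_default_credentials.json"
    return candidate if candidate.is_file() else None


adc = find_adc_file()
print("ADC file:", adc if adc else "not found; rerun the gcloud login command")
```

If no file is reported, rerunning the `gcloud` command above in the Data Studio terminal is the first thing to try.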

### 2.2 Download and Install Spaceranger

Follow the 10X [documentation](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/installation) to download and install Spaceranger.
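
After installation, a quick check that the `spaceranger` executable is visible to the notebook kernel can save a failed run later. This is a small sketch using only the standard library; the helper name `locate_tool` is an assumption made for illustration.

```python
import shutil


def locate_tool(name="spaceranger"):
    """Return the absolute path of an executable found on PATH, or None."""
    return shutil.which(name)


path = locate_tool()
if path is None:
    print("spaceranger not found on PATH; add its install directory to PATH")
else:
    print("spaceranger found at", path)
```

If the tool is not found, the directory where Spaceranger was unpacked likely still needs to be added to `PATH` for the kernel's environment.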

### 2.3 Install Libraries

In [None]:
%pip install google-cloud-bigquery
%pip install synapseclient
%pip install protobuf==3.20.1 
%pip install db-dtypes

# 3. Import and Instantiate Libraries

In [8]:
import sevenbridges as sbg
import os
import pandas as pd

from google.cloud import bigquery
import synapseclient

In [5]:
# set the google project that will be billed for this notebook's computations
google_project = '<your-google-project>'

# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)

In [None]:
# instantiate synapse client
syn = synapseclient.Synapse()
syn.login()

In [7]:
# instantiate SB python client
# Requires SB developer auth token: https://docs.sevenbridges.com/docs/get-your-authentication-token

auth_token = '<your-auth-token>'

os.environ['SB_API_ENDPOINT'] = 'https://cgc-api.sbgenomics.com/v2' 
os.environ['SB_AUTH_TOKEN'] = auth_token

api = sbg.Api()
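
To avoid saving the developer token in the notebook itself, the token can be read from the environment or prompted for at run time. The sketch below is one possible approach, not the notebook's original method; `resolve_token` is a hypothetical helper, and it assumes the token is either already exported as `SB_AUTH_TOKEN` or typed in by the user.

```python
import getpass
import os


def resolve_token(environ=None, prompt=getpass.getpass):
    """Return the Seven Bridges auth token from the environment, prompting if unset.

    Hypothetical helper: `environ` and `prompt` are injectable so the logic
    can be exercised without real credentials.
    """
    environ = os.environ if environ is None else environ
    token = environ.get("SB_AUTH_TOKEN", "").strip()
    if not token:
        # Prompt without echoing so the token does not appear in cell output.
        token = prompt("Seven Bridges auth token: ").strip()
    if not token:
        raise ValueError("No Seven Bridges auth token provided")
    return token
```

In the cell above, `auth_token = resolve_token()` could then take the place of the hardcoded placeholder string.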

# 4. Analysis

### 4.1 Obtain relevant file metadata info from ISB-CGC

In [15]:
f = client.query("""
    WITH l1 AS (
        SELECT Filename,
            HTAN_Parent_Biospecimen_ID,
            Component,
            File_Format,
            entityId,
            Run_ID
        FROM `isb-cgc-bq.HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level1_metadata_current`
        WHERE RUN_ID = 'HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test'
    ),
    l2 AS (
        SELECT Filename,
            HTAN_Parent_Biospecimen_ID,
            Component,
            File_Format,
            entityId,
            Run_ID
        FROM `isb-cgc-bq.HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level2_metadata_current`
        WHERE RUN_ID = 'HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test'
    ),
    aux AS (
        SELECT Filename,
            HTAN_Parent_Biospecimen_ID,
            Component,
            File_Format,
            entityId,
            Run_ID
        FROM `isb-cgc-bq.HTAN.10xvisium_spatialtranscriptomics_auxiliaryfiles_metadata_current`
        WHERE RUN_ID = 'HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test'
    )
    SELECT * FROM l1
    UNION ALL 
    SELECT * FROM l2
    UNION ALL
    SELECT * FROM aux

""").result().to_dataframe()

with pd.option_context('display.max_colwidth', None):
    display(f)

| | Filename | HTAN_Parent_Biospecimen_ID | Component | File_Format | entityId | Run_ID |
|---|---|---|---|---|---|---|
| 0 | visium_level_2_pdac_bam/HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test-possorted_genome_bam.bam | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-RNA-seqLevel2 | bam | syn51201377 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test |
| 1 | visium_auxiliary_pdac/HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test-scalefactors_json.json | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-AuxiliaryFiles | json | syn51283237 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test |
| 2 | visium_auxiliary_pdac/HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test-tissue_lowres_image.png | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-AuxiliaryFiles | png | syn51283252 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test |
| 3 | visium_auxiliary_pdac/B1-HT264P1-S1H2Fc2U1.tif | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-AuxiliaryFiles | tif | syn51283214 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test |
| 4 | visium_level_1/TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test_S3_L002_R1_001.fastq.gz | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-RNA-seqLevel1 | fastq | syn29282084 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test |
| 5 | visium_level_1/TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test_S3_L002_R2_001.fastq.gz | HTA12_27_5 | 10xVisiumSpatialTranscriptomics-RNA-seqLevel1 | fastq | syn29290193 | HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test |
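
The query in 4.1 repeats the same SELECT for three tables and embeds the run ID as a literal. As an alternative sketch (not the notebook's original code), the statement can be generated from a list of table names and the run ID passed as a BigQuery query parameter. The function names `build_metadata_query` and `run_metadata_query` are hypothetical; the table and column names are taken from the query above.

```python
TABLES = [
    "isb-cgc-bq.HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level1_metadata_current",
    "isb-cgc-bq.HTAN.10xvisium_spatialtranscriptomics_scRNAseq_level2_metadata_current",
    "isb-cgc-bq.HTAN.10xvisium_spatialtranscriptomics_auxiliaryfiles_metadata_current",
]
COLUMNS = [
    "Filename", "HTAN_Parent_Biospecimen_ID", "Component",
    "File_Format", "entityId", "Run_ID",
]


def build_metadata_query(tables=TABLES, columns=COLUMNS):
    """Build one SELECT per table, filtered on @run_id, joined by UNION ALL."""
    cols = ", ".join(columns)
    selects = [
        f"SELECT {cols} FROM `{table}` WHERE Run_ID = @run_id" for table in tables
    ]
    return "\nUNION ALL\n".join(selects)


def run_metadata_query(client, run_id):
    """Run the query with a bound parameter; `client` is a bigquery.Client."""
    # Imported lazily so the query builder works without google-cloud-bigquery.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("run_id", "STRING", run_id)]
    )
    query = build_metadata_query()
    return client.query(query, job_config=job_config).result().to_dataframe()


print(build_metadata_query())
```

With the client from section 3, `run_metadata_query(client, 'HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test')` would be expected to return the same rows as the table above, and changing the run ID no longer requires editing the SQL in three places.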


### 4.2 Pull in fastq files

RNA-seq Level 1 FASTQ files are controlled access and can be accessed through SB-CGC with dbGaP authorization:

1. Navigate to the CDS Data File Explorer: https://cgc.sbgenomics.com/datasets/file-repository 
2. Search by Sample ID 'HTA12_27_5'
3. Select for 'FASTQ.GZ' files
4. Add the resulting files to your project

Check that the files have been added to your workspace:

In [12]:
! ls /sbgenomics/project-files

TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test_S3_L002_R1_001.fastq.gz
TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test_S3_L002_R2_001.fastq.gz
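
The same check can be done programmatically, which is handy when several runs are imported at once. This is a minimal sketch; `missing_fastqs` is a hypothetical helper, and it assumes the FASTQ names follow the `<sample>_S*_L*_R1_001.fastq.gz` pattern seen in the listing above.

```python
from pathlib import Path


def missing_fastqs(directory, sample, reads=("R1", "R2")):
    """Return the reads (e.g. 'R2') that have no matching FASTQ in `directory`."""
    names = [p.name for p in Path(directory).glob(f"{sample}_*.fastq.gz")]
    return [read for read in reads if not any(f"_{read}_" in n for n in names)]


project_dir = Path("/sbgenomics/project-files")
if project_dir.is_dir():
    # Sample prefix taken from the FASTQ filenames listed above.
    print("Missing reads:",
          missing_fastqs(project_dir, "TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test"))
```

An empty list means both the R1 and R2 files are present in the project.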


### 4.3 Download image file

Auxiliary files, including TIFFs, are open access. We can download the high-resolution image from Synapse.

In [5]:
tiff = syn.get('syn51283214')

In [6]:
tiff.path

'/home/jovyan/.synapseCache/778/123019778/B1-HT264P1-S1H2Fc2U1.tif'
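
The Synapse ID passed to `syn.get` matches the `tif` row of the metadata returned in 4.1, so it can be looked up rather than typed by hand. The sketch below works on a plain list of records copied from the auxiliary rows of that output; `entity_ids` is a hypothetical helper name.

```python
# Auxiliary-file rows copied from the metadata output in section 4.1.
rows = [
    {"Filename": "visium_auxiliary_pdac/HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test-scalefactors_json.json",
     "File_Format": "json", "entityId": "syn51283237"},
    {"Filename": "visium_auxiliary_pdac/HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test-tissue_lowres_image.png",
     "File_Format": "png", "entityId": "syn51283252"},
    {"Filename": "visium_auxiliary_pdac/B1-HT264P1-S1H2Fc2U1.tif",
     "File_Format": "tif", "entityId": "syn51283214"},
]


def entity_ids(records, file_format):
    """Return the Synapse entity IDs of all records with the given file format."""
    return [r["entityId"] for r in records if r["File_Format"] == file_format]


tif_ids = entity_ids(rows, "tif")
print(tif_ids)  # → ['syn51283214']
```

With the DataFrame from 4.1, the same records can be obtained with `f.to_dict("records")`, and each returned ID can then be passed to `syn.get`.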

### 4.4 Run Spaceranger pipeline

In [19]:
# https://github.com/reykajayasinghe/HTAN/blob/main/Single_cell_preprocessing/run_spaceranger_1.2.2.sh

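# Each `!` line runs in its own subshell, so shell assignments made there do not
# persist. Store the settings in os.environ instead; later `!` commands inherit
# them as ${name}. Avoid defining Python variables with these same names.
os.environ["sample"] = "HT264P1-Test"  # Output directory
os.environ["sample2"] = "TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test"  # Sample name from FASTQ filename
os.environ["TIF_image"] = tiff.path  # Path to brightfield image input (downloaded in 4.3)
os.environ["SLIDE_SERIAL_ID"] = "V10Y07-094"  # Slide ID
os.environ["AREA"] = "B1"  # https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/slide-info
os.environ["datadirectory"] = "/sbgenomics/project-files/"  # Path to FASTQs
os.environ["reference"] = "refdata-gex-GRCh38-2020-A"  # Path to reference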


In [None]:
! spaceranger count --id=${sample} --transcriptome=${reference} --fastqs=${datadirectory} --sample=${sample2} --image=${TIF_image} --slide=${SLIDE_SERIAL_ID} --area=${AREA} --reorient-images=true --localcores=32 --localmem=150
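
The same invocation can also be assembled as an argument list in Python, which avoids shell quoting issues and makes the command easy to log. This is a sketch under the assumption that only the flags used above are needed; `build_spaceranger_cmd` is a hypothetical helper, and the image path is the `tiff.path` value shown in 4.3.

```python
import shlex


def build_spaceranger_cmd(sample_id, transcriptome, fastqs, sample, image,
                          slide, area, localcores=32, localmem=150):
    """Return the `spaceranger count` command as a list of arguments."""
    return [
        "spaceranger", "count",
        f"--id={sample_id}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastqs}",
        f"--sample={sample}",
        f"--image={image}",
        f"--slide={slide}",
        f"--area={area}",
        "--reorient-images=true",
        f"--localcores={localcores}",
        f"--localmem={localmem}",
    ]


cmd = build_spaceranger_cmd(
    sample_id="HT264P1-Test",
    transcriptome="refdata-gex-GRCh38-2020-A",
    fastqs="/sbgenomics/project-files/",
    sample="TWAS-HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test",
    image="/home/jovyan/.synapseCache/778/123019778/B1-HT264P1-S1H2Fc2U1.tif",
    slide="V10Y07-094",
    area="B1",
)
# Print the command for review; to launch it, pass `cmd` to
# subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Printing the command before launching it gives a record of the exact parameters used for the run.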

### 4.5 Check output files

In [21]:
! ls -R ${sample}/outs/

HT264P1-Test/outs/:
analysis		       possorted_genome_bam.bam
cloupe.cloupe		       possorted_genome_bam.bam.bai
filtered_feature_bc_matrix     raw_feature_bc_matrix
filtered_feature_bc_matrix.h5  raw_feature_bc_matrix.h5
metrics_summary.csv	       spatial
molecule_info.h5	       web_summary.html

HT264P1-Test/outs/analysis:
clustering  diffexp  pca  tsne	umap

HT264P1-Test/outs/analysis/clustering:
gene_expression_graphclust	    gene_expression_kmeans_5_clusters
gene_expression_kmeans_10_clusters  gene_expression_kmeans_6_clusters
gene_expression_kmeans_2_clusters   gene_expression_kmeans_7_clusters
gene_expression_kmeans_3_clusters   gene_expression_kmeans_8_clusters
gene_expression_kmeans_4_clusters   gene_expression_kmeans_9_clusters

HT264P1-Test/outs/analysis/clustering/gene_expression_graphclust:
clusters.csv

HT264P1-Test/outs/analysis/clustering/gene_expression_kmeans_10_clusters:
clusters.csv

HT264P1-Test/outs/analysis/clustering/gene_expression_kmeans_2_clusters:
clusters.c
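
Rather than scanning the listing by eye, the key outputs can be checked programmatically. The sketch below assumes the output names shown in the listing above; `missing_outputs` and the `EXPECTED_OUTPUTS` selection are illustrative choices, not an exhaustive list of what Spaceranger produces.

```python
from pathlib import Path

# A subset of the files and directories shown in the listing above.
EXPECTED_OUTPUTS = [
    "web_summary.html",
    "metrics_summary.csv",
    "possorted_genome_bam.bam",
    "filtered_feature_bc_matrix.h5",
    "raw_feature_bc_matrix.h5",
    "spatial",
]


def missing_outputs(outs_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected output names that are absent from `outs_dir`."""
    outs = Path(outs_dir)
    return [name for name in expected if not (outs / name).exists()]


print("Missing outputs:", missing_outputs("HT264P1-Test/outs"))
```

An empty list indicates that all of the checked outputs were written.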

# 5. Relevant Citations and Links

https://github.com/reykajayasinghe/HTAN/blob/main/Single_cell_preprocessing/run_spaceranger_1.2.2.sh

https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/count