## Notebook to access TCGA open data from Azure

This notebook shows an example of querying and downloading TCGA open data hosted on Azure, then formatting and analyzing it downstream with TCGAbiolinks.

### Step 1: Install necessary packages

In [1]:
# Install tcgaazureR (https://github.com/mamtagiri/tcgaazureR), a package for accessing TCGA open data on Azure
devtools::install_github('mamtagiri/tcgaazureR')

Skipping install of 'tcgaazureR' from a github remote, the SHA1 (f26190a7) has not changed since last install.
  Use `force = TRUE` to force installation



In [2]:
# Install TCGAbiolinks for downstream analysis of TCGA open data
if (!requireNamespace("BiocManager", quietly = TRUE))
    suppressMessages(install.packages("BiocManager"))

suppressMessages(BiocManager::install("TCGAbiolinks"))

“package(s) not installed when version(s) same as current; use `force = TRUE` to
  re-install: 'TCGAbiolinks'”


### Step 2: Load the packages

In [3]:
packages<-c("TCGAbiolinks","AzureStor","AzureRMR","tcgaazureR","dplyr")
lapply(packages, require, character.only = TRUE)


Loading required package: TCGAbiolinks

Loading required package: AzureStor

Loading required package: AzureRMR

Loading required package: tcgaazureR

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




### Step 3: Load metadata and query the data

In [4]:
data(metadata, package = "tcgaazureR")  # load the TCGA metadata bundled with the tcgaazureR package

# Example query: select the files to download by project ID, data category, data type, and sample type
query <- tcgaquery(metadata,
                   projectid = "TCGA-DLBC",
                   datacategory = "Copy Number Variation",
                   datatype = "Masked Copy Number Segment",
                   sampletype = "Primary Tumor")


In [5]:
# Inspect the query object, which is very similar to the TCGAbiolinks query format
query
query$results[[1]][1:5,1:7]

| results | project | data.category | data.type | access | legacy | experimental.strategy | sample.type |
|---|---|---|---|---|---|---|---|
| `<I<list>>` | `<chr>` | `<chr>` | `<chr>` | `<I<list>>` | `<lgl>` | `<I<list>>` | `<I<list>>` |
| `c("8805a....` | TCGA-DLBC | Copy Number Variation | Masked Copy Number Segment | open | False | Genotypi.... | Primary .... |


|   | id | cases | data_format | access | file_name | file_size | state |
|---|---|---|---|---|---|---|---|
|   | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` |
| 1 | 8805ab31-c7a8-4543-80e3-02bfd08f9211 | TCGA-GS-A9TU-01A-11D-A381-01 | TXT | open | VOLAR_p_TCGAb_397_398_399_NSP_GenomeWideSNP_6_A09_1473382.nocnv_grch38.seg.v2.txt | 10014 | released |
| 2 | 8edbbd43-a8ec-4743-a27a-cb975d81101e | TCGA-FF-8041-01A-11D-2209-01 | TXT | open | CENTS_p_TCGASNP_212_216_217_N_GenomeWideSNP_6_D10_1039584.nocnv_grch38.seg.v2.txt | 15879 | released |
| 3 | 6339ae78-f0f1-45e0-a315-0f898d9e6f94 | TCGA-G8-6907-01A-11D-2209-01 | TXT | open | CENTS_p_TCGASNP_212_216_217_N_GenomeWideSNP_6_E07_1039560.nocnv_grch38.seg.v2.txt | 8504 | released |
| 4 | c2253381-06d0-43e7-a621-7bba63c7d5cd | TCGA-FF-A7CX-01A-12D-A381-01 | TXT | open | VOLAR_p_TCGAb_397_398_399_NSP_GenomeWideSNP_6_B09_1473338.nocnv_grch38.seg.v2.txt | 28039 | released |
| 5 | a82aacde-3814-4516-9449-2c8a86e5f66d | TCGA-G8-6914-01A-11D-2209-01 | TXT | open | CENTS_p_TCGASNP_212_216_217_N_GenomeWideSNP_6_G03_1039550.nocnv_grch38.seg.v2.txt | 34434 | released |
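
The `cases` column holds TCGA aliquot barcodes, which encode useful information by position: the first 12 characters identify the participant, and characters 14 to 15 are the sample type code (`01` is a primary solid tumor, consistent with the `sampletype` used in the query). The sketch below pulls these fields out with base R. The `results` data frame is a hand-built stand-in for `query$results[[1]]`, using three rows copied from the output above, and the file sizes are assumed to be in bytes.

```r
# Stand-in for query$results[[1]], with three rows copied from the query output.
results <- data.frame(
  cases = c("TCGA-GS-A9TU-01A-11D-A381-01",
            "TCGA-FF-8041-01A-11D-2209-01",
            "TCGA-G8-6907-01A-11D-2209-01"),
  file_size = c(10014L, 15879L, 8504L),
  stringsAsFactors = FALSE
)

# Characters 1-12 of a TCGA barcode identify the participant.
results$patient <- substr(results$cases, 1, 12)

# Characters 14-15 are the sample type code ("01" = primary solid tumor).
results$sample_type_code <- substr(results$cases, 14, 15)

# Total size of the selected files (assumed to be bytes).
total_size <- sum(results$file_size)
```

A summary like this can help estimate the download size before any data is fetched.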


In [6]:
# Subset the query results to two files for the purpose of this demo
file_ids <- c("f207eeee-e644-4eeb-8de8-91e086bb2324", "ff5bfff6-6716-477b-966a-18ac6ed4aa50")
query$results[[1]] <- dplyr::filter(query$results[[1]], id %in% file_ids)
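
Filtering with `%in%` silently drops any requested ID that is not present in the results, so it can be worth checking that every ID was matched. A minimal sketch with base R follows. The `results` data frame is a stand-in for `query$results[[1]]` built from IDs shown in the output above, and the all-zero ID is a made-up value used only to demonstrate a miss.

```r
# Stand-in for query$results[[1]], with IDs copied from the query output.
results <- data.frame(
  id = c("8805ab31-c7a8-4543-80e3-02bfd08f9211",
         "8edbbd43-a8ec-4743-a27a-cb975d81101e",
         "6339ae78-f0f1-45e0-a315-0f898d9e6f94"),
  stringsAsFactors = FALSE
)

wanted <- c("8805ab31-c7a8-4543-80e3-02bfd08f9211",
            "00000000-0000-0000-0000-000000000000")  # hypothetical ID, not a real file

# Keep only the requested rows.
selected <- results[results$id %in% wanted, , drop = FALSE]

# Report requested IDs that were not found in the results.
not_found <- setdiff(wanted, results$id)
if (length(not_found) > 0) {
  message("IDs not found in query results: ", paste(not_found, collapse = ", "))
}
```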

### Step 4: Download the data from Azure

In [7]:
# Provide the container URL and SAS token for the TCGA data on Azure
container<-blob_container("https://datasettcga.blob.core.windows.net/dataset",sas ="sp=rl&st=2022-11-07T16:59:11Z&se=2030-12-01T00:59:11Z&spr=https&sv=2021-06-08&sr=c&sig=A4AWnyISkPi9JZRanNwcQNgAagxUih1J%2FeJ9T5kHyfc%3D")
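
A SAS token is a URL query string, so its fields can be inspected before use. In the token above, `sp=rl` grants read and list permissions, and `st` and `se` give the start and expiry times in UTC. The sketch below parses those fields with base R and checks whether the token has expired. `parse_sas` is a hypothetical helper written for this example, and the signature is replaced by a placeholder.

```r
# Hypothetical helper: split a SAS token into a named list of its fields.
parse_sas <- function(sas) {
  pairs <- strsplit(strsplit(sas, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  keys <- vapply(pairs, function(p) p[1], character(1))
  # Re-join on "=" in case a value itself contains that character.
  values <- vapply(pairs, function(p) paste(p[-1], collapse = "="), character(1))
  setNames(as.list(values), keys)
}

# Same fields as the token used in this notebook, with a placeholder signature.
sas <- paste0("sp=rl&st=2022-11-07T16:59:11Z&se=2030-12-01T00:59:11Z",
              "&spr=https&sv=2021-06-08&sr=c&sig=PLACEHOLDER")

fields <- parse_sas(sas)
expiry <- as.POSIXct(fields[["se"]], format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
is_expired <- Sys.time() > expiry
```

If `is_expired` is `TRUE`, requests to the container would be expected to fail with an authorization error, and a fresh token would be needed.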


In [8]:
# Download the two selected files (out of 19) into a folder structure similar to the one TCGAbiolinks uses
querydownload(query,"GDCdata",container=container)

The data is now downloaded into the local directory under a folder named "GDCdata". You can use the downstream tool of your choice, or format the data for TCGAbiolinks.
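
Before moving on, it can be useful to confirm that every file in the query actually arrived. The sketch below compares file names and sizes on disk against the `file_name` and `file_size` columns of the query results. It assumes the downloaded files keep the names listed in the query and that `file_size` is in bytes. `verify_downloads` is a hypothetical helper, and the demo runs against a temporary directory rather than the real "GDCdata" folder.

```r
# Hypothetical helper: check that each expected file exists somewhere under
# `directory` and has the expected size.
verify_downloads <- function(results, directory) {
  paths <- list.files(directory, recursive = TRUE, full.names = TRUE)
  idx <- match(results$file_name, basename(paths))
  found <- !is.na(idx)
  size <- rep(NA_real_, length(idx))
  size[found] <- file.size(paths[idx[found]])
  data.frame(
    file_name = results$file_name,
    found = found,
    size_ok = found & size == results$file_size,
    stringsAsFactors = FALSE
  )
}

# Demo: one 3-byte file in a nested folder, plus one file that is missing.
demo_dir <- file.path(tempdir(), "GDCdata_demo")
dir.create(file.path(demo_dir, "nested"), recursive = TRUE, showWarnings = FALSE)
writeBin(charToRaw("abc"), file.path(demo_dir, "nested", "a.txt"))

expected <- data.frame(
  file_name = c("a.txt", "b.txt"),
  file_size = c(3, 5),
  stringsAsFactors = FALSE
)
report <- verify_downloads(expected, demo_dir)
```

In this notebook the equivalent call would be `verify_downloads(query$results[[1]], "GDCdata")`.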

### Step 5: Prepare data for analysis

In [9]:
#Prepare the data for downstream analysis as in TCGAbiolinks (you can skip this step if you use other tools for analysis)

In [10]:
cnvdata<-GDCprepare(query)

Reading copy number variation files



In [11]:
# The copy number files in our example are now combined into a single table; look at the first 10 rows
cnvdata[1:10,]



| GDC_Aliquot | Chromosome | Start | End | Num_Probes | Segment_Mean | Sample |
|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 3301765 | 6176837 | 1980 | -0.5139 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 6183275 | 30120508 | 12420 | -0.0386 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 30122383 | 30147019 | 20 | 0.2789 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 30150062 | 65356699 | 20940 | -0.0431 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 65356834 | 66871428 | 1092 | -0.0005 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 66871602 | 85961574 | 11340 | -0.0329 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 85962623 | 90939165 | 3161 | -0.0054 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 90941020 | 103963580 | 7533 | -0.0398 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 103963601 | 113784174 | 5732 | -0.0099 | TCGA-G8-6909-01A-11D-2209-01 |
| 7c1555c9-336c-474b-997a-5d6229e3469e | 1 | 113787580 | 115219844 | 921 | -0.0477 | TCGA-G8-6909-01A-11D-2209-01 |
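
As one possible next step, the segments can be labeled as gains or losses. `Segment_Mean` in these files is a log2 ratio, log2(copy number / 2), so values near zero correspond to two copies. The sketch below uses three rows copied from the output above. The ±0.2 cutoff is an assumption chosen for illustration, not a recommended value, and segment length is computed treating both endpoints as inclusive, which is also an assumption.

```r
# Three segments copied from the cnvdata output above.
segments <- data.frame(
  Chromosome = c("1", "1", "1"),
  Start = c(3301765, 6183275, 30122383),
  End = c(6176837, 30120508, 30147019),
  Segment_Mean = c(-0.5139, -0.0386, 0.2789),
  stringsAsFactors = FALSE
)

threshold <- 0.2  # assumed cutoff for calling a gain or loss

# Segment length in base pairs (both endpoints counted).
segments$Length <- segments$End - segments$Start + 1

# Label each segment by comparing its log2 ratio with the cutoff.
segments$Call <- ifelse(segments$Segment_Mean > threshold, "gain",
                 ifelse(segments$Segment_Mean < -threshold, "loss", "neutral"))

# Convert the log2 ratio back to an estimated copy number.
segments$Copy_Number <- 2 * 2^segments$Segment_Mean
```

The same column operations could be applied directly to `cnvdata`, since it has the same `Start`, `End`, and `Segment_Mean` columns.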
