## Download multiple modalities of pan-cancer data from TCGA

The data is accessed directly from the [Genomic Data Commons](https://gdc.cancer.gov/about-data/publications/pancanatlas) (GDC).

NOTE: this download script uses the `md5sum` shell utility to verify file hashes. This script was developed and tested on a Linux machine, and `md5sum` commands may have to be changed to work on other platforms.
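On platforms without `md5sum`, the same check could be done in pure Python. The following is a minimal sketch using the standard library's `hashlib`; the function name `md5_of_file` and the chunk size are illustrative choices, not part of the original script.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that multi-gigabyte downloads
    do not have to fit in memory. `md5_of_file` is a hypothetical
    helper, not something defined elsewhere in this notebook.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps reading until f.read() returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

With this helper, a check such as `assert md5_of_file(exp_filepath) == manifest_df.loc['rna_seq'].md5` would play the same role as the `md5sum` cells in this notebook.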

In [1]:
import os
import pandas as pd
from urllib.request import urlretrieve

First, we load a manifest file containing the GDC API ID and filename for each relevant file, along with its MD5 checksum, which we use to confirm that each file was downloaded completely and without corruption.

The manifest included in this GitHub repo was downloaded from https://gdc.cancer.gov/node/971 on December 1, 2020.

In [2]:
manifest_df = pd.read_csv(os.path.join('data', 'manifest.tsv'),
                          sep='\t', index_col=0)
manifest_df.head()

name,id,filename,md5,size
mirna_sample,55d9bf6f-0712-4315-b588-e6f8e295018e,PanCanAtlas_miRNA_sample_information_list.txt,02bb56712be34bcd58c50d90387aebde,553408
methylation_27k,d82e2c44-89eb-43d9-b6d3-712732bf6a53,jhu-usc.edu_PANCAN_merged_HumanMethylation27_H...,5cec086f0b002d17befef76a3241e73b,5022150019
methylation_450k,99b0c493-9e94-4d99-af9f-151e46bab989,jhu-usc.edu_PANCAN_HumanMethylation450.betaVal...,a92f50490cf4eca98b0d19e10927de9d,41541692788
rppa,fcbb373e-28d4-4818-92f3-601ede3da5e1,TCGA-RPPA-pancan-clean.txt,e2b914c7ecd369589275d546d9555b05,18901234
rna_seq,3586c0da-64d0-4b74-a449-5ff4d9136611,EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2....,02e72c33071307ff6570621480d3c90b,1882540959


### Download gene expression data

In [3]:
rnaseq_id, rnaseq_filename = manifest_df.loc['rna_seq'].id, manifest_df.loc['rna_seq'].filename
url = 'https://api.gdc.cancer.gov/data/{}'.format(rnaseq_id)
exp_filepath = os.path.join('data', rnaseq_filename)

if not os.path.exists(exp_filepath):
    urlretrieve(url, exp_filepath)
else:
    print('Downloaded data file already exists, skipping download')

Downloaded data file already exists, skipping download


In [4]:
md5_sum = !md5sum $exp_filepath
print(md5_sum[0])
assert md5_sum[0].split(' ')[0] == manifest_df.loc['rna_seq'].md5

02e72c33071307ff6570621480d3c90b  data/EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv


### Download mutation data

In [5]:
mutation_id, mutation_filename = manifest_df.loc['mutation'].id, manifest_df.loc['mutation'].filename
url = 'https://api.gdc.cancer.gov/data/{}'.format(mutation_id)
mutation_filepath = os.path.join('data', mutation_filename)

if not os.path.exists(mutation_filepath):
    urlretrieve(url, mutation_filepath)
else:
    print('Downloaded data file already exists, skipping download')

Downloaded data file already exists, skipping download


In [6]:
md5_sum = !md5sum $mutation_filepath
print(md5_sum[0])
assert md5_sum[0].split(' ')[0] == manifest_df.loc['mutation'].md5

639ad8f8386e98dacc22e439188aa8fa  data/mc3.v0.2.8.PUBLIC.maf.gz
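The expression and mutation downloads above follow the same pattern: look up the manifest entry, download the file if it is missing, then compare checksums. That pattern could be wrapped in one helper, sketched below. The names `download_and_verify`, `GDC_DATA_URL`, and the `retrieve` parameter are hypothetical, and the sketch assumes the HTTPS form of the GDC data endpoint and uses `hashlib` in place of `md5sum`.

```python
import hashlib
import os
from urllib.request import urlretrieve

# Assumed HTTPS form of the endpoint used in the cells above
GDC_DATA_URL = 'https://api.gdc.cancer.gov/data/{}'


def download_and_verify(file_id, filename, expected_md5,
                        data_dir='data', retrieve=urlretrieve):
    """Download one GDC file if it is missing, then check its MD5.

    `retrieve` defaults to urlretrieve but can be swapped for another
    callable taking (url, filepath), e.g. to exercise the helper offline.
    """
    filepath = os.path.join(data_dir, filename)
    if not os.path.exists(filepath):
        os.makedirs(data_dir, exist_ok=True)
        retrieve(GDC_DATA_URL.format(file_id), filepath)

    # Hash the file in 1 MiB chunks to keep memory use low
    digest = hashlib.md5()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)

    if digest.hexdigest() != expected_md5:
        raise ValueError('MD5 mismatch for {}'.format(filepath))
    return filepath
```

For example, with `row = manifest_df.loc['mutation']`, calling `download_and_verify(row.id, row.filename, row.md5)` would cover both mutation cells. Raising `ValueError` rather than using `assert` is a design choice here, since assertions are skipped when Python runs with `-O`.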


### Download gene set from Park et al. paper

We want to download the set of genes analyzed in [Park et al. 2021](https://www.nature.com/articles/s41467-021-27242-3). We are particularly interested in the "Class 2/3/4" genes from Figure 1, which are inferred to be "two-hit" genes where non-synonymous mutations and CNVs tend to co-occur more often than would be expected by chance.

In [7]:
park_df = pd.read_excel(
    'https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-021-27242-3/MediaObjects/41467_2021_27242_MOESM11_ESM.xlsx',
    sheet_name='SupplFig3', header=1
)
print(park_df.shape)
park_df.head()

(453, 17)


Unnamed: 0,Gene,Tissue,Pair,Effect,Pval,FDR,LogFDR,Target,Unnamed: 8,Gene.1,Tissue.1,Pair.1,Effect.1,Pval.1,FDR.1,LogFDR.1,Target.1
0,ACVR1,UCEC,ACVR1_UCEC,0.0,0.997445,0.681521,0.166514,B_Target,,ACVR1,UCEC,ACVR1_UCEC,-0.482699,0.370095,0.571956,0.24263,B_Target
1,ACVR2A,COADREAD,ACVR2A_COADREAD,0.0,0.997631,0.681315,0.166646,B_Target,,ACVR2A,COADREAD,ACVR2A_COADREAD,0.187531,0.753981,0.780954,0.107369,B_Target
2,ACVR2A,LIHC,ACVR2A_LIHC,-1.000178,0.040008,0.122307,0.912515,B_Target,,ACVR2A,LIHC,ACVR2A_LIHC,0.163421,0.834858,0.821548,0.085362,B_Target
3,AJUBA,HNSC,AJUBA_HNSC,-0.93605,0.008733,0.045816,1.338886,A_Hit,,AJUBA,HNSC,AJUBA_HNSC,0.711518,0.026392,0.147049,0.832509,B_Target
4,AKT1,BRCA,AKT1_BRCA,0.0,0.997163,0.681829,0.166318,B_Target,,AKT1,BRCA,AKT1_BRCA,0.535765,0.092176,0.312738,0.504805,B_Target


In [8]:
# the paper uses an FDR threshold of 0.1, so we do the same here
fdr_threshold = 0.1

park_loss_df = (park_df
  .loc[park_df.FDR < fdr_threshold, :]
  .iloc[:, :8]
  .set_index('Pair')
)
print(park_loss_df.shape)
print(park_loss_df.Gene.unique())
park_loss_df.head()

(79, 7)
['AJUBA' 'ARID1A' 'ARID2' 'BAP1' 'BCOR' 'BRD7' 'CDH1' 'CDKN2A' 'CTCF'
 'CTNNB1' 'CUL3' 'CYLD' 'EP300' 'EPAS1' 'FAT1' 'FGFR3' 'HRAS' 'KEAP1'
 'MAP2K4' 'MAP3K1' 'NF1' 'NF2' 'NRAS' 'NSD1' 'PBRM1' 'PIK3R1' 'PPP2R1A'
 'PSIP1' 'PTEN' 'PTPDC1' 'RB1' 'RUNX1' 'SIN3A' 'SMAD4' 'SMARCA4' 'STK11'
 'TP53' 'TSC1' 'TSC2' 'VHL' 'ZFHX3' 'ZNF750']


Pair,Gene,Tissue,Effect,Pval,FDR,LogFDR,Target
AJUBA_HNSC,AJUBA,HNSC,-0.93605,0.008732955,0.045816,1.338886,A_Hit
ARID1A_LGG,ARID1A,LGG,-0.904202,0.01956617,0.076291,1.117471,A_Hit
ARID1A_STAD,ARID1A,STAD,1.133548,0.0003979932,0.000294,3.517309,B_Target
ARID1A_UCEC,ARID1A,UCEC,1.597876,3.451937e-07,0.0,5.0,B_Target
ARID2_LIHC,ARID2,LIHC,-0.978419,0.01019284,0.050276,1.298552,A_Hit


In [9]:
park_gain_df = (park_df
  .loc[park_df['FDR.1'] < fdr_threshold, :]
  .iloc[:, 9:]
)
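# strip the '.1' suffix pandas added to the duplicated column names;
# regex=False so that '.' is matched literally, not as a wildcard
park_gain_df.columns = park_gain_df.columns.str.replace('.1', '', regex=False)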
park_gain_df.set_index('Pair', inplace=True)
print(park_gain_df.shape)
print(park_gain_df.Gene.unique())
park_gain_df.head()

(30, 7)
['ARID1A' 'ATRX' 'BRAF' 'CTNNB1' 'CUL3' 'EGFR' 'EPAS1' 'ERBB4' 'FBXW7'
 'GATA3' 'IDH1' 'KIT' 'KMT2B' 'KRAS' 'NFE2L2' 'NRAS' 'PDGFRA' 'PPP2R1A'
 'TP53']


Pair,Gene,Tissue,Effect,Pval,FDR,LogFDR,Target
ARID1A_UCEC,ARID1A,UCEC,-2.351526,0.001154365,-0.000447,3.340299,B_Target
ATRX_LGG,ATRX,LGG,1.440987,4.025141e-08,0.0,5.0,A_Hit
BRAF_SKCM,BRAF,SKCM,1.239939,2.042839e-11,0.0,5.0,A_Hit
BRAF_THCA,BRAF,THCA,-2.449684,0.0007509064,-0.000209,3.659041,B_Target
CTNNB1_UCEC,CTNNB1,UCEC,-1.257861,0.01597968,0.096292,1.016366,B_Target


In [10]:
park_loss_df.to_csv('./data/park_loss_df.tsv', sep='\t')
park_gain_df.to_csv('./data/park_gain_df.tsv', sep='\t')