## Download multiple modalities of pan-cancer data from TCGA

The data is accessed directly from the [Genomic Data Commons](https://gdc.cancer.gov/about-data/publications/pancanatlas) (GDC).

NOTE: this download script uses the `md5sum` shell utility to verify file hashes. This script was developed and tested on a Linux machine, and `md5sum` commands may have to be changed to work on other platforms.

In [1]:
import os
import pandas as pd
from urllib.request import urlretrieve

First, we load a manifest file containing the GDC API ID and filename for each relevant file, as well as its md5 checksum, which we use to confirm that the file was downloaded completely and without corruption.

The manifest included in this GitHub repo was downloaded from https://gdc.cancer.gov/node/971 on December 1, 2020.

In [2]:
manifest_df = pd.read_csv(os.path.join('data', 'manifest.tsv'),
                          sep='\t', index_col=0)
manifest_df.head()

name,id,filename,md5,size
mirna_sample,55d9bf6f-0712-4315-b588-e6f8e295018e,PanCanAtlas_miRNA_sample_information_list.txt,02bb56712be34bcd58c50d90387aebde,553408
methylation_27k,d82e2c44-89eb-43d9-b6d3-712732bf6a53,jhu-usc.edu_PANCAN_merged_HumanMethylation27_H...,5cec086f0b002d17befef76a3241e73b,5022150019
methylation_450k,99b0c493-9e94-4d99-af9f-151e46bab989,jhu-usc.edu_PANCAN_HumanMethylation450.betaVal...,a92f50490cf4eca98b0d19e10927de9d,41541692788
rppa,fcbb373e-28d4-4818-92f3-601ede3da5e1,TCGA-RPPA-pancan-clean.txt,e2b914c7ecd369589275d546d9555b05,18901234
rna_seq,3586c0da-64d0-4b74-a449-5ff4d9136611,EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2....,02e72c33071307ff6570621480d3c90b,1882540959
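The lookup pattern used in the cells below (`manifest_df.loc[name]`) can be tried without the real manifest. This sketch builds a two-row manifest in memory, using rows copied from the output above; it assumes the first column of `manifest.tsv` is headed `name`, which is inferred from the displayed index and not confirmed by the source.

```python
import io

import pandas as pd

# Two rows copied from the manifest output above; the header row is an
# assumption based on the column names shown in that output.
manifest_tsv = (
    "name\tid\tfilename\tmd5\tsize\n"
    "mirna_sample\t55d9bf6f-0712-4315-b588-e6f8e295018e\t"
    "PanCanAtlas_miRNA_sample_information_list.txt\t"
    "02bb56712be34bcd58c50d90387aebde\t553408\n"
    "rppa\tfcbb373e-28d4-4818-92f3-601ede3da5e1\t"
    "TCGA-RPPA-pancan-clean.txt\t"
    "e2b914c7ecd369589275d546d9555b05\t18901234\n"
)

manifest_df = pd.read_csv(io.StringIO(manifest_tsv), sep='\t', index_col=0)

# Look up one file by its short name
row = manifest_df.loc['rppa']
print(row.id, row.filename, row.md5)

# Use bracket access for the `size` column: `row.size` is the built-in
# pandas attribute giving the number of elements in the row.
print(row['size'])
```

Attribute access works for `id`, `filename`, and `md5`, but the `size` column has to be read with `row['size']` because `Series.size` is already defined by pandas.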


### Download gene expression data

In [5]:
rnaseq_id, rnaseq_filename = manifest_df.loc['rna_seq'].id, manifest_df.loc['rna_seq'].filename
url = 'http://api.gdc.cancer.gov/data/{}'.format(rnaseq_id)
exp_filepath = os.path.join('data', rnaseq_filename)

if not os.path.exists(exp_filepath):
    urlretrieve(url, exp_filepath)
else:
    print('Downloaded data file already exists, skipping download')

Downloaded data file already exists, skipping download


In [6]:
md5_sum = !md5sum $exp_filepath
print(md5_sum[0])
assert md5_sum[0].split(' ')[0] == manifest_df.loc['rna_seq'].md5

02e72c33071307ff6570621480d3c90b  data/EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv


### Download mutation data

In [9]:
mutation_id, mutation_filename = manifest_df.loc['mutation'].id, manifest_df.loc['mutation'].filename
url = 'http://api.gdc.cancer.gov/data/{}'.format(mutation_id)
mutation_filepath = os.path.join('data', mutation_filename)

if not os.path.exists(mutation_filepath):
    urlretrieve(url, mutation_filepath)
else:
    print('Downloaded data file already exists, skipping download')

In [10]:
md5_sum = !md5sum $mutation_filepath
print(md5_sum[0])
assert md5_sum[0].split(' ')[0] == manifest_df.loc['mutation'].md5

639ad8f8386e98dacc22e439188aa8fa  data/mc3.v0.2.8.PUBLIC.maf.gz
