## Download multiple modalities of pan-cancer data from TCGA

The data is accessed directly from the [Genomic Data Commons](https://gdc.cancer.gov/about-data/publications/pancanatlas).

NOTE: this download script uses the `md5sum` shell utility to verify file hashes. This script was developed and tested on a Linux machine, and `md5sum` commands may have to be changed to work on other platforms.
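
On platforms without `md5sum`, the same check could be done in pure Python. The following is a minimal sketch using the standard-library `hashlib` module; the helper name `file_md5` and the chunk size are illustrative assumptions, not part of the original script.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper (not in the original notebook). Chunked reads
    keep memory use flat even for multi-gigabyte downloads.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# small self-check on a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'demo.txt'
    demo_path.write_bytes(b'hello')
    demo_md5 = file_md5(demo_path)

print(demo_md5)
```

The returned string can be compared directly against the `md5` column of the manifest, in place of parsing the output of the shell command.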

In [1]:
import pandas as pd
from urllib.request import urlretrieve

import wget

import sys; sys.path.append('..')
import config as cfg

First, we load a manifest file containing the GDC API ID and filename for each relevant file, along with its md5 checksum, which we use to verify that each file was downloaded completely and without corruption.

The manifest included in this GitHub repo was downloaded from https://gdc.cancer.gov/node/971 on December 1, 2020.
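
To make the manifest layout concrete, here is a small sketch that parses a two-row excerpt and builds the download URL for one entry. The two rows are copied from the manifest preview in this notebook (the `rna_seq` filename is taken from the checksum output); the names `demo_manifest_df` and `gdc_data_url` are hypothetical and introduced only for illustration.

```python
import io

import pandas as pd

# two-row excerpt of the manifest; the real file contains more entries
manifest_text = (
    "name\tid\tfilename\tmd5\tsize\n"
    "rppa\tfcbb373e-28d4-4818-92f3-601ede3da5e1\t"
    "TCGA-RPPA-pancan-clean.txt\t"
    "e2b914c7ecd369589275d546d9555b05\t18901234\n"
    "rna_seq\t3586c0da-64d0-4b74-a449-5ff4d9136611\t"
    "EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv\t"
    "02e72c33071307ff6570621480d3c90b\t1882540959\n"
)
demo_manifest_df = pd.read_csv(io.StringIO(manifest_text),
                               sep='\t', index_col=0)


def gdc_data_url(file_id):
    """Build the data endpoint URL in the same form the notebook uses."""
    return 'http://api.gdc.cancer.gov/data/{}'.format(file_id)


row = demo_manifest_df.loc['rppa']
rppa_url = gdc_data_url(row['id'])
# bracket access is needed here: row.size is the Series length attribute
rppa_size = int(row['size'])

print(rppa_url, rppa_size)
```

Attribute access such as `row.id` works for most columns, but `row.size` returns the number of elements in the Series rather than the `size` column, so bracket access is the safer habit for this table.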

In [2]:
manifest_df = pd.read_csv(cfg.data_dir / 'manifest.tsv',
                          sep='\t', index_col=0)
manifest_df.head()

name,id,filename,md5,size
mirna_sample,55d9bf6f-0712-4315-b588-e6f8e295018e,PanCanAtlas_miRNA_sample_information_list.txt,02bb56712be34bcd58c50d90387aebde,553408
methylation_27k,d82e2c44-89eb-43d9-b6d3-712732bf6a53,jhu-usc.edu_PANCAN_merged_HumanMethylation27_H...,5cec086f0b002d17befef76a3241e73b,5022150019
methylation_450k,99b0c493-9e94-4d99-af9f-151e46bab989,jhu-usc.edu_PANCAN_HumanMethylation450.betaVal...,a92f50490cf4eca98b0d19e10927de9d,41541692788
rppa,fcbb373e-28d4-4818-92f3-601ede3da5e1,TCGA-RPPA-pancan-clean.txt,e2b914c7ecd369589275d546d9555b05,18901234
rna_seq,3586c0da-64d0-4b74-a449-5ff4d9136611,EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2....,02e72c33071307ff6570621480d3c90b,1882540959


### Download gene expression data

In [3]:
rnaseq_id, rnaseq_filename = manifest_df.loc['rna_seq'].id, manifest_df.loc['rna_seq'].filename
url = 'http://api.gdc.cancer.gov/data/{}'.format(rnaseq_id)
exp_filepath = cfg.data_dir / rnaseq_filename

if not exp_filepath.is_file():
    urlretrieve(url, exp_filepath)
else:
    print('Downloaded data file already exists, skipping download')

Downloaded data file already exists, skipping download


In [4]:
md5_sum = !md5sum $exp_filepath
print(md5_sum[0])
assert md5_sum[0].split(' ')[0] == manifest_df.loc['rna_seq'].md5

02e72c33071307ff6570621480d3c90b  /home/jake/research/mutation-fn/data/EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv


### Download mutation data

In [5]:
mutation_id, mutation_filename = manifest_df.loc['mutation'].id, manifest_df.loc['mutation'].filename
url = 'http://api.gdc.cancer.gov/data/{}'.format(mutation_id)
mutation_filepath = cfg.data_dir / mutation_filename

if not mutation_filepath.is_file():
    urlretrieve(url, mutation_filepath)
else:
    print('Downloaded data file already exists, skipping download')

Downloaded data file already exists, skipping download


In [6]:
md5_sum = !md5sum $mutation_filepath
print(md5_sum[0])
assert md5_sum[0].split(' ')[0] == manifest_df.loc['mutation'].md5

639ad8f8386e98dacc22e439188aa8fa  /home/jake/research/mutation-fn/data/mc3.v0.2.8.PUBLIC.maf.gz
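
The download-then-verify pattern used for the expression and mutation files could be factored into one function. This is a sketch under stated assumptions: `download_and_verify` and its injectable `fetch` argument are hypothetical names, and the demonstration swaps in a stand-in for `urlretrieve` so that it runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_and_verify(url, dest, expected_md5, fetch=urlretrieve):
    """Download url to dest unless it already exists, then check its MD5.

    Hypothetical helper; `fetch` defaults to urlretrieve but can be
    replaced by any callable taking (url, dest), e.g. for offline tests.
    """
    dest = Path(dest)
    if not dest.is_file():
        fetch(url, dest)
    else:
        print('Downloaded data file already exists, skipping download')
    digest = hashlib.md5()
    with open(dest, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    observed = digest.hexdigest()
    if observed != expected_md5:
        raise ValueError('md5 mismatch for {}: {}'.format(dest, observed))
    return dest


def fake_fetch(url, dest):
    """Stand-in for urlretrieve that writes a fixed payload to dest."""
    Path(dest).write_bytes(b'example payload')


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / 'example.txt'
    expected = hashlib.md5(b'example payload').hexdigest()
    result = download_and_verify('http://example.invalid/file', target,
                                 expected, fetch=fake_fetch)
    verified_ok = result.is_file()
    # a wrong checksum should be reported rather than silently accepted
    try:
        download_and_verify('http://example.invalid/file', target,
                            '0' * 32, fetch=fake_fetch)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Raising an exception instead of using `assert` is a design choice in this sketch: assertions are skipped when Python runs with the `-O` flag, whereas an explicit check always runs.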


### Download CNV data

We download the same copy number data used in the [pancancer classifiers](https://github.com/greenelab/pancancer/blob/d1b3de7fa387d0a44d0a4468b0ac30918ed66886/scripts/initialize/download_data.sh#L33).

These CNV calls have been thresholded using GISTIC, so each entry takes one of 5 values: [-2, -1, 0, 1, 2], which correspond to "deep loss", "moderate loss", "no change", "moderate gain", and "deep gain", respectively.
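
The encoding above can be written as a lookup table. In this sketch the gene names come from this notebook, but the copy number values assigned to them are made up purely for illustration; `GISTIC_LABELS` and `describe_cnv` are hypothetical names.

```python
# thresholded GISTIC values and their meanings, as described above
GISTIC_LABELS = {
    -2: 'deep loss',
    -1: 'moderate loss',
    0: 'no change',
    1: 'moderate gain',
    2: 'deep gain',
}


def describe_cnv(value):
    """Map a thresholded GISTIC value to its label.

    Raises KeyError for anything outside the five expected values.
    """
    return GISTIC_LABELS[value]


# toy calls for a single sample (values are invented, not real data)
toy_calls = {'ACVR1': 0, 'AJUBA': -2, 'AKT1': 1}
toy_labels = {gene: describe_cnv(value) for gene, value in toy_calls.items()}

print(toy_labels)
```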

In [7]:
copy_url = 'https://ndownloader.figshare.com/files/11095412'
copy_filepath = cfg.data_dir / 'pancan_GISTIC_threshold.tsv'

if not copy_filepath.is_file():
    wget.download(copy_url, out=str(copy_filepath))
else:
    print('Downloaded data file already exists, skipping download')

Downloaded data file already exists, skipping download


### Download gene set from Park et al. paper

We want to download the set of genes analyzed in [Park et al. 2021](https://www.nature.com/articles/s41467-021-27242-3). We are particularly interested in the "Class 2/3/4" genes from Figure 1, which are inferred to be "two-hit" genes where non-synonymous mutations and CNVs tend to co-occur more often than would be expected by chance.

In [8]:
import pandas as pd

park_df = pd.read_excel(
    'https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-021-27242-3/MediaObjects/41467_2021_27242_MOESM11_ESM.xlsx',
    sheet_name='SupplFig3', header=1
)
print(park_df.shape)
park_df.head()

(453, 17)


,Gene,Tissue,Pair,Effect,Pval,FDR,LogFDR,Target,Unnamed: 8,Gene.1,Tissue.1,Pair.1,Effect.1,Pval.1,FDR.1,LogFDR.1,Target.1
0,ACVR1,UCEC,ACVR1_UCEC,0.0,0.997445,0.681521,0.166514,B_Target,,ACVR1,UCEC,ACVR1_UCEC,-0.482699,0.370095,0.571956,0.24263,B_Target
1,ACVR2A,COADREAD,ACVR2A_COADREAD,0.0,0.997631,0.681315,0.166646,B_Target,,ACVR2A,COADREAD,ACVR2A_COADREAD,0.187531,0.753981,0.780954,0.107369,B_Target
2,ACVR2A,LIHC,ACVR2A_LIHC,-1.000178,0.040008,0.122307,0.912515,B_Target,,ACVR2A,LIHC,ACVR2A_LIHC,0.163421,0.834858,0.821548,0.085362,B_Target
3,AJUBA,HNSC,AJUBA_HNSC,-0.93605,0.008733,0.045816,1.338886,A_Hit,,AJUBA,HNSC,AJUBA_HNSC,0.711518,0.026392,0.147049,0.832509,B_Target
4,AKT1,BRCA,AKT1_BRCA,0.0,0.997163,0.681829,0.166318,B_Target,,AKT1,BRCA,AKT1_BRCA,0.535765,0.092176,0.312738,0.504805,B_Target


In [9]:
# left-hand block of columns, indexed by gene/tissue pair
park_loss_df = park_df.iloc[:, :8].set_index('Pair')
print(park_loss_df.shape)
park_loss_df.head()

(453, 7)


Pair,Gene,Tissue,Effect,Pval,FDR,LogFDR,Target
ACVR1_UCEC,ACVR1,UCEC,0.0,0.997445,0.681521,0.166514,B_Target
ACVR2A_COADREAD,ACVR2A,COADREAD,0.0,0.997631,0.681315,0.166646,B_Target
ACVR2A_LIHC,ACVR2A,LIHC,-1.000178,0.040008,0.122307,0.912515,B_Target
AJUBA_HNSC,AJUBA,HNSC,-0.93605,0.008733,0.045816,1.338886,A_Hit
AKT1_BRCA,AKT1,BRCA,0.0,0.997163,0.681829,0.166318,B_Target


In [10]:
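# right-hand block of columns; .copy() avoids modifying a view of park_df
park_gain_df = park_df.iloc[:, 9:].copy()
# pandas appended '.1' to the duplicated header names, so strip it
park_gain_df.columns = park_gain_df.columns.str.replace('.1', '', regex=False)
park_gain_df.set_index('Pair', inplace=True)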
print(park_gain_df.shape)
park_gain_df.head()

(453, 7)


Pair,Gene,Tissue,Effect,Pval,FDR,LogFDR,Target
ACVR1_UCEC,ACVR1,UCEC,-0.482699,0.370095,0.571956,0.24263,B_Target
ACVR2A_COADREAD,ACVR2A,COADREAD,0.187531,0.753981,0.780954,0.107369,B_Target
ACVR2A_LIHC,ACVR2A,LIHC,0.163421,0.834858,0.821548,0.085362,B_Target
AJUBA_HNSC,AJUBA,HNSC,0.711518,0.026392,0.147049,0.832509,B_Target
AKT1_BRCA,AKT1,BRCA,0.535765,0.092176,0.312738,0.504805,B_Target
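
The split of the side-by-side sheet into two tables can be illustrated on a toy frame. This sketch assumes a reduced set of columns (values for the two pairs are taken from the previews in this notebook), and the helper `strip_dup_suffix` is hypothetical. It removes the `.1` suffix only at the end of a name, whereas a plain substring replacement of `'.1'` would also alter a name such as `X.10`; for the column names in this sheet the two approaches give the same result.

```python
import pandas as pd

# toy stand-in for the sheet: pandas appends '.1' to repeated header names
wide_df = pd.DataFrame({
    'Gene': ['ACVR1', 'AJUBA'],
    'Pair': ['ACVR1_UCEC', 'AJUBA_HNSC'],
    'Effect': [0.0, -0.93605],
    'Unnamed: 3': [None, None],   # empty spacer column between the blocks
    'Gene.1': ['ACVR1', 'AJUBA'],
    'Pair.1': ['ACVR1_UCEC', 'AJUBA_HNSC'],
    'Effect.1': [-0.482699, 0.711518],
})


def strip_dup_suffix(name, suffix='.1'):
    """Remove the duplicate-column suffix only when it ends the name."""
    return name[:-len(suffix)] if name.endswith(suffix) else name


# left block: first three columns; right block: everything after the spacer
left_df = wide_df.iloc[:, :3].set_index('Pair')
right_df = (wide_df.iloc[:, 4:]
    .rename(columns=strip_dup_suffix)
    .set_index('Pair')
)

print(left_df)
print(right_df)
```

Because `rename` returns a new DataFrame, this version never modifies a slice of the original frame in place.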


### Download oncogene/tumor suppressor information for Park et al. genes

In [11]:
# downloaded from https://www.sciencedirect.com/science/article/pii/S009286741830237X
# oncogene/TSG predictions for genes/cancer types using 20/20+ classifier
class_df = pd.read_excel(
    cfg.data_dir / '1-s2.0-S009286741830237X-mmc1.xlsx', 
    sheet_name='Table S1', index_col='KEY', header=3
)
class_df.rename(columns={'Tumor suppressor or oncogene prediction (by 20/20+)':
                         'classification'},
                inplace=True)

class_df.head()



KEY,Gene,Cancer,classification,Decision,Tissue Frequency,Pancan Frequency,Consensus Score,Correlation adusted score,Novel,Rescue Notes,Note about previous publication
ABL1_PANCAN,ABL1,PANCAN,,rescued,,0.011675,0.0,,0,Evidence from OncoImpact/DriverNET overlap (SN...,
ACVR1_UCEC,ACVR1,UCEC,oncogene,official,0.05303,0.00749,1.5,1.5,0,,0
ACVR1B_PANCAN,ACVR1B,PANCAN,possible tsg,official,,0.010904,1.0,0.0,0,,Found in 24132290
ACVR2A_COADREAD,ACVR2A,COADREAD,tsg,official,0.028481,0.013988,1.5,1.5,0,,Found in 22810696
ACVR2A_LIHC,ACVR2A,LIHC,possible tsg,official,0.031073,0.013988,1.5,1.5,0,,Found in private communication about integrati...


In [12]:
loss_class_df = (park_loss_df
    .merge(class_df.loc[:, ['classification']], left_index=True, right_index=True)
)

# format oncogene/TSG classification to work with Vogelstein genes
loss_class_df['classification'] = (
    loss_class_df.classification.str.replace('possible ', '')
                                    .replace('tsg', 'TSG')
                                    .replace('oncogene', 'Oncogene')
)

print(loss_class_df.classification.unique())
loss_class_df.head()

['Oncogene' 'TSG']


,Gene,Tissue,Effect,Pval,FDR,LogFDR,Target,classification
ACVR1_UCEC,ACVR1,UCEC,0.0,0.997445,0.681521,0.166514,B_Target,Oncogene
ACVR2A_COADREAD,ACVR2A,COADREAD,0.0,0.997631,0.681315,0.166646,B_Target,TSG
ACVR2A_LIHC,ACVR2A,LIHC,-1.000178,0.040008,0.122307,0.912515,B_Target,TSG
AJUBA_HNSC,AJUBA,HNSC,-0.93605,0.008733,0.045816,1.338886,A_Hit,TSG
AKT1_BRCA,AKT1,BRCA,0.0,0.997163,0.681829,0.166318,B_Target,Oncogene
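
The label cleanup chains two different pandas methods that are easy to confuse: `Series.str.replace` edits substrings inside each value, while `Series.replace` swaps whole values. A minimal sketch on a toy Series (the three labels are those visible in the classification preview; the variable names are illustrative):

```python
import pandas as pd

labels = pd.Series(['possible tsg', 'tsg', 'oncogene'])

# substring edit: drop the 'possible ' qualifier wherever it appears
stripped = labels.str.replace('possible ', '', regex=False)

# whole-value mapping: only entries exactly equal to a key are changed
normalized = stripped.replace({'tsg': 'TSG', 'oncogene': 'Oncogene'})

# for contrast, whole-value replacement alone leaves 'possible tsg' untouched
whole_value_only = labels.replace('tsg', 'TSG')

print(normalized.tolist())
print(whole_value_only.tolist())
```

This is why the qualifier has to be stripped first: applied to the raw labels, the whole-value step would not match `possible tsg`.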


In [13]:
gain_class_df = (park_gain_df
    .merge(class_df.loc[:, ['classification']], left_index=True, right_index=True)
)

# format oncogene/TSG classification to work with Vogelstein genes
gain_class_df['classification'] = (
    gain_class_df.classification.str.replace('possible ', '')
                                    .replace('tsg', 'TSG')
                                    .replace('oncogene', 'Oncogene')
)

print(gain_class_df.classification.unique())
gain_class_df.head()

['Oncogene' 'TSG']


,Gene,Tissue,Effect,Pval,FDR,LogFDR,Target,classification
ACVR1_UCEC,ACVR1,UCEC,-0.482699,0.370095,0.571956,0.24263,B_Target,Oncogene
ACVR2A_COADREAD,ACVR2A,COADREAD,0.187531,0.753981,0.780954,0.107369,B_Target,TSG
ACVR2A_LIHC,ACVR2A,LIHC,0.163421,0.834858,0.821548,0.085362,B_Target,TSG
AJUBA_HNSC,AJUBA,HNSC,0.711518,0.026392,0.147049,0.832509,B_Target,TSG
AKT1_BRCA,AKT1,BRCA,0.535765,0.092176,0.312738,0.504805,B_Target,Oncogene


In [14]:
loss_class_df.to_csv(cfg.data_dir / 'park_loss_df.tsv', sep='\t')
gain_class_df.to_csv(cfg.data_dir / 'park_gain_df.tsv', sep='\t')