## Download pan-cancer RNA-seq read counts data from UCSC Xena Browser

The other RNA-seq data we downloaded in `0_data_download` contains pre-processed RPKM values. Most differential expression methods recommend raw count data instead, or something similar such as [RSEM expected counts](https://support.bioconductor.org/p/90672/#90678), which is what we'll download here.

As far as I'm aware, GDC does not store RNA-seq read counts, so we'll download them from the UCSC Xena Browser instead. This data was generated as part of the Pan-Cancer Atlas project, so it should cover the same set of samples.

In [1]:
import sys

import pandas as pd

# make the config module in the parent directory importable from this notebook
sys.path.append('..')
import config as cfg

cfg.de_data_dir.mkdir(parents=True, exist_ok=True)
cfg.raw_de_data_dir.mkdir(parents=True, exist_ok=True)

In [2]:
base_url = 'https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/'
filename = 'tcga_gene_expected_count'

url = base_url + filename + '.gz'
output_filename = cfg.raw_de_data_dir / (filename + '.tsv.gz')

if not output_filename.is_file():
    print('Raw data file does not exist, downloading...')
    counts_df = pd.read_csv(url, sep='\t')
    counts_df.to_csv(output_filename, sep='\t')
else:
    print('Loading from existing raw data file')
    counts_df = pd.read_csv(output_filename, sep='\t', index_col=0)
    
counts_df.iloc[:5, :5]

Loading from existing raw data file


,sample,TCGA-19-1787-01,TCGA-S9-A7J2-01,TCGA-G3-A3CH-11,TCGA-EK-A2RE-01
0,ENSG00000242268.2,0.0,4.6439,0.0,0.0
1,ENSG00000259041.1,0.0,0.0,0.0,0.0
2,ENSG00000270112.3,2.0,2.8074,0.0,0.0
3,ENSG00000167578.16,10.3835,9.9144,8.9539,10.0543
4,ENSG00000278814.1,0.0,0.0,0.0,0.0
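
The download-or-load logic in the cell above can be factored into a small helper. This is only a sketch of the same caching pattern, not part of the original notebook: `load_or_fetch` and its `fetch` argument are hypothetical names, and `fetch` stands in for the `pd.read_csv(url, sep='\t')` call so the pattern can be tried without network access.

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(cache_path, fetch):
    """Return the cached table if it exists, otherwise fetch and cache it.

    `fetch` is any zero-argument callable returning a DataFrame
    (hypothetical stand-in for the remote read in the notebook).
    """
    cache_path = Path(cache_path)
    if cache_path.is_file():
        # the cached file was written with its index, so read it back as such
        return pd.read_csv(cache_path, sep='\t', index_col=0)
    df = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, sep='\t')
    return df
```

With this helper, the cell above would reduce to something like `load_or_fetch(output_filename, lambda: pd.read_csv(url, sep='\t'))`, assuming the same `url` and `output_filename` variables.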


## Process counts matrix

In [3]:
print(counts_df.shape)

counts_df = (counts_df
    .set_index('sample')
    .dropna(axis='rows')
    .transpose()
    .sort_index(axis='rows')
    .sort_index(axis='columns')
)

counts_df.index.rename('sample_id', inplace=True)
counts_df.columns.name = None

(60498, 10531)


In [4]:
counts_df.iloc[:5, :5]

sample_id,ENSG00000000003.14,ENSG00000000005.5,ENSG00000000419.12,ENSG00000000457.13,ENSG00000000460.16
TCGA-02-0047-01,11.0587,1.0,9.1111,8.5989,8.0783
TCGA-02-0055-01,11.3393,9.5372,11.0437,8.9492,8.8225
TCGA-02-2483-01,12.116,1.0,11.1235,9.5243,9.3524
TCGA-02-2485-01,12.1724,2.3219,9.9986,9.2997,9.394
TCGA-04-1331-01,12.5887,3.4594,11.4268,10.1968,10.0932
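
To make the reshaping step easier to follow, here is the same chain of operations on a tiny made-up table. The gene and sample labels (`ENSG_A`, `TCGA-01`, and so on) are invented for illustration and do not come from the real dataset.

```python
import pandas as pd

# toy genes x samples table, laid out like the raw Xena file
toy = pd.DataFrame({
    'sample': ['ENSG_B', 'ENSG_A', 'ENSG_C'],  # gene IDs live in the 'sample' column
    'TCGA-02': [1.0, 2.0, None],               # ENSG_C has a missing value
    'TCGA-01': [3.0, 4.0, 5.0],
})

tidy = (toy
    .set_index('sample')         # gene IDs become the row index
    .dropna(axis='rows')         # drop genes with any missing value
    .transpose()                 # samples x genes
    .sort_index(axis='rows')     # sort sample IDs
    .sort_index(axis='columns')  # sort gene IDs
)
tidy.index.name = 'sample_id'
tidy.columns.name = None

print(tidy)
```

The result has one row per sample and one column per gene, with `ENSG_C` removed because it had a missing value in one sample.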


In [5]:
# per the documentation for the Xena Browser, these are log-transformed
# expected counts - see: 
# https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_gene_expected_count.json
#
# we want to un-log transform them here (2^x - 1), and round to the nearest integer,
# to prepare for DE analysis
print('Before transform:', counts_df.min().min(), counts_df.max().max())
counts_df = ((2 ** counts_df) - 1).round(0).astype(int)
print('After transform:', counts_df.min().min(), counts_df.max().max())

Before transform: 0.0 24.3103
After transform: 0 20803168
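
As a sanity check on the inverse transform, the round trip can be tried on a small made-up table. This sketch assumes the forward transform is log2(x + 1), as the comment in the cell above describes; the gene names and counts are illustrative only.

```python
import numpy as np
import pandas as pd

# toy integer counts (hypothetical values)
raw = pd.DataFrame({
    'gene_a': [0, 1, 1000],
    'gene_b': [3, 15, 20803168],
})

# forward transform, assumed to match the hosted data: log2(x + 1)
logged = np.log2(raw + 1)

# inverse transform used in the notebook: 2^x - 1, rounded to integers
recovered = ((2 ** logged) - 1).round(0).astype(int)

print(recovered)
```

Rounding before the integer cast matters here: floating-point error can leave a value slightly below the true count, and a bare `astype(int)` would truncate it downward.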


In [6]:
counts_df.to_csv(cfg.processed_counts_file, sep='\t')