## Preprocess TCGA mutational signatures

Load the downloaded data and curate sample IDs.

Mutational signature data for the TCGA whole-exome samples are not available from the GDC, unlike the other datasets we're using, but they can be downloaded from the [ICGC data portal](https://dcc.icgc.org/releases/PCAWG/mutational_signatures/). The signatures were originally generated in [this paper](https://www.nature.com/articles/s41586-020-1943-3).

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import mpmp.config as cfg
import mpmp.utilities.tcga_utilities as tu

### Read TCGA Barcode Curation Information

Extract the `cancer-type` and `sample-type` information encoded in TCGA barcodes. See https://github.com/cognoma/cancer-data for more details.

In [2]:
(cancer_types_df,
 cancertype_codes_dict,
 sample_types_df,
 sampletype_codes_dict) = tu.get_tcga_barcode_info()
cancer_types_df.head(2)

Unnamed: 0,TSS Code,Source Site,Study Name,BCR,acronym
0,1,International Genomics Consortium,ovarian serous cystadenocarcinoma,IGC,OV
1,2,MD Anderson Cancer Center,glioblastoma multiforme,IGC,GBM


In [3]:
sample_types_df.head(2)

Unnamed: 0,Code,Definition,Short Letter Code
0,1,Primary Solid Tumor,TP
1,2,Recurrent Solid Tumor,TR
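
As a rough illustration of where these codes sit inside a barcode, the sketch below splits one barcode from this dataset into its hyphen-separated fields. The helper name `parse_tcga_barcode` is hypothetical and is not part of `mpmp`; the actual lookup in this notebook is done with the dictionaries returned by `tu.get_tcga_barcode_info()`.

```python
def parse_tcga_barcode(barcode):
    """Split a TCGA barcode into the fields relevant for curation.

    Hypothetical helper for illustration only (not part of mpmp).
    """
    fields = barcode.split('-')
    return {
        # tissue source site (TSS) code, which maps to a cancer type
        'tss_code': fields[1],
        # participant identifier
        'participant': fields[2],
        # first two characters of the sample field are the sample type code
        'sample_type_code': fields[3][:2],
    }

parsed = parse_tcga_barcode('TCGA-AB-2802-03B-01W-0728-08')
print(parsed)
```

The TSS code corresponds to the `TSS Code` column of `cancer_types_df`, and the sample type code to the `Code` column of `sample_types_df`.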


### Load and process mutational signatures data

In [4]:
# these are the "single base substitution" (SBS) signatures described in the
# paper linked above; for more information see:
# https://cancer.sanger.ac.uk/cosmic/signatures/SBS/index.tt
# as far as I can tell, DBS and ID signatures weren't generated for TCGA
# whole-exome samples
url = (
    'https://dcc.icgc.org/api/v1/download'
    '?fn=/PCAWG/mutational_signatures/Signatures_in_Samples/SP_Signatures_in_Samples/'
    'TCGA_WES_sigProfiler_SBS_signatures_in_samples.csv'
)
mut_sigs_df = pd.read_csv(url, index_col=1)
mut_sigs_df.index.rename('sample_id', inplace=True)

print(mut_sigs_df.shape)
print(mut_sigs_df.columns)
mut_sigs_df.iloc[:5, :5]

(9493, 67)
Index(['Cancer Types', 'Accuracy', 'SBS1', 'SBS2', 'SBS3', 'SBS4', 'SBS5',
       'SBS6', 'SBS7a', 'SBS7b', 'SBS7c', 'SBS7d', 'SBS8', 'SBS9', 'SBS10a',
       'SBS10b', 'SBS11', 'SBS12', 'SBS13', 'SBS14', 'SBS15', 'SBS16',
       'SBS17a', 'SBS17b', 'SBS18', 'SBS19', 'SBS20', 'SBS21', 'SBS22',
       'SBS23', 'SBS24', 'SBS25', 'SBS26', 'SBS27', 'SBS28', 'SBS29', 'SBS30',
       'SBS31', 'SBS32', 'SBS33', 'SBS34', 'SBS35', 'SBS36', 'SBS37', 'SBS38',
       'SBS39', 'SBS40', 'SBS41', 'SBS42', 'SBS43', 'SBS44', 'SBS45', 'SBS46',
       'SBS47', 'SBS48', 'SBS49', 'SBS50', 'SBS51', 'SBS52', 'SBS53', 'SBS54',
       'SBS55', 'SBS56', 'SBS57', 'SBS58', 'SBS59', 'SBS60'],
      dtype='object')


sample_id,Cancer Types,Accuracy,SBS1,SBS2,SBS3
TCGA-AB-2802-03B-01W-0728-08,AML,0.811,3,0,0
TCGA-AB-2803-03B-01W-0728-08,AML,0.608,4,0,0
TCGA-AB-2804-03B-01W-0728-08,AML,0.826,0,0,0
TCGA-AB-2805-03B-01W-0728-08,AML,0.903,12,0,0
TCGA-AB-2806-03B-01W-0728-08,AML,0.896,40,0,0
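
To make the effect of `index_col=1` concrete, here is a small sketch on an in-memory CSV that mimics the layout of the downloaded file. The header `Sample Names` for the second column is an assumption (the original column name is not shown in the output above); the values are copied from the first two rows shown.

```python
import io

import pandas as pd

# toy stand-in for the downloaded CSV; 'Sample Names' is an assumed header
toy_csv = io.StringIO(
    'Cancer Types,Sample Names,Accuracy,SBS1\n'
    'AML,TCGA-AB-2802-03B-01W-0728-08,0.811,3\n'
    'AML,TCGA-AB-2803-03B-01W-0728-08,0.608,4\n'
)

# use the second column (position 1) as the row index, then rename it
toy_df = pd.read_csv(toy_csv, index_col=1)
toy_df.index = toy_df.index.rename('sample_id')

print(toy_df.columns.tolist())
print(toy_df.index.name)
```

The sample barcode column moves into the index, so the remaining columns start with `Cancer Types` and `Accuracy`, as in the output above.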


In [5]:
# truncate barcodes to their first 15 characters (through the sample type code)
# so they map to the clinical information, then drop any IDs that become
# duplicated (multiple samples measured on the same tumor)
mut_sigs_df.index = mut_sigs_df.index.str.slice(start=0, stop=15)
mut_sigs_df = mut_sigs_df.loc[~mut_sigs_df.index.duplicated(), :]
print(mut_sigs_df.shape)

(9493, 67)
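
The shape is unchanged here, which suggests no barcodes collapsed to the same 15-character ID in this dataset. A minimal sketch of what the same two steps do when a collision does occur, using a toy frame in which the second barcode is a hypothetical extra vial that is not in the real data:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'SBS1': [3, 5, 4]},
    index=[
        'TCGA-AB-2802-03B-01W-0728-08',
        'TCGA-AB-2802-03A-01W-0728-08',  # hypothetical second vial, same sample
        'TCGA-AB-2803-03B-01W-0728-08',
    ],
)

# keep only the first 15 characters: project, TSS, participant, sample type
toy_df.index = toy_df.index.str.slice(start=0, stop=15)

# duplicated() marks every repeat after the first occurrence, so ~ keeps the first
deduped_df = toy_df.loc[~toy_df.index.duplicated(), :]
print(deduped_df)
```

Only the first row for each truncated ID survives, so which measurement is kept depends on row order in the input file.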


In [6]:
(mut_sigs_df
    .drop(columns=['Cancer Types', 'Accuracy'])
    .to_csv(cfg.data_types['mut_sigs'], sep='\t')
)

### Process TCGA cancer type and sample type info from barcodes

See https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes for more details.

In [7]:
# get sample info and save to file

tcga_id = tu.get_and_save_sample_info(mut_sigs_df,
                                      sampletype_codes_dict,
                                      cancertype_codes_dict,
                                      training_data='mut_sigs')

print(tcga_id.shape)
tcga_id.head()

(9493, 4)


Unnamed: 0,sample_id,sample_type,cancer_type,id_for_stratification
0,TCGA-AB-2802-03,Primary Blood Derived Cancer - Peripheral Blood,LAML,LAMLPrimary Blood Derived Cancer - Peripheral ...
1,TCGA-AB-2803-03,Primary Blood Derived Cancer - Peripheral Blood,LAML,LAMLPrimary Blood Derived Cancer - Peripheral ...
2,TCGA-AB-2804-03,Primary Blood Derived Cancer - Peripheral Blood,LAML,LAMLPrimary Blood Derived Cancer - Peripheral ...
3,TCGA-AB-2805-03,Primary Blood Derived Cancer - Peripheral Blood,LAML,LAMLPrimary Blood Derived Cancer - Peripheral ...
4,TCGA-AB-2806-03,Primary Blood Derived Cancer - Peripheral Blood,LAML,LAMLPrimary Blood Derived Cancer - Peripheral ...
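
The `id_for_stratification` values above look like the cancer type concatenated with the sample type. The sketch below rebuilds that layout on toy data. It is an assumption about what `tu.get_and_save_sample_info` does, not its actual implementation, and the two lookup dictionaries are hypothetical stand-ins whose key format may differ from the real `cancertype_codes_dict` and `sampletype_codes_dict`.

```python
import pandas as pd

# hypothetical lookup tables; the real dictionaries may use different key types
toy_cancertype_codes = {'AB': 'LAML'}
toy_sampletype_codes = {'03': 'Primary Blood Derived Cancer - Peripheral Blood'}

sample_ids = pd.Series(['TCGA-AB-2802-03', 'TCGA-AB-2803-03'])

toy_info = pd.DataFrame({
    'sample_id': sample_ids,
    # characters 13-14 of the truncated barcode hold the sample type code
    'sample_type': sample_ids.str.slice(13, 15).map(toy_sampletype_codes),
    # characters 5-6 hold the tissue source site code
    'cancer_type': sample_ids.str.slice(5, 7).map(toy_cancertype_codes),
})

# assumed construction: plain string concatenation of the two labels
toy_info['id_for_stratification'] = toy_info.cancer_type + toy_info.sample_type
print(toy_info)
```

A combined label like this lets a train/test split be stratified on cancer type and sample type at once.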


In [8]:
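# get cancer type counts and save to file
# (naming the index and the counts explicitly keeps the column names stable
# across pandas versions, where value_counts() output is named differently)
cancertype_count_df = (
    tcga_id.cancer_type.value_counts()
    .rename_axis('cancertype')
    .reset_index(name='n =')
)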

file = os.path.join(cfg.sample_info_dir, 'tcga_mut_sigs_sample_counts.tsv')
cancertype_count_df.to_csv(file, sep='\t', index=False)

cancertype_count_df.head()

Unnamed: 0,cancertype,n =
0,BRCA,930
1,LUAD,528
2,LGG,512
3,PRAD,484
4,THCA,478
