# TCGA data

@mmm, December 24, 2025

Data sources and dataset descriptions for TCGA.  
Goal: select the RNAseq data and generate complete metadata.  
See also the previous (2021) metadata in:
- /Users/mmalumbres/Desktop/BioDATA/TCGA/210226 TCGA_metadata_analysis.ipynb
- /Users/mmalumbres/Desktop/BioDATA/TCGA/210226 TCGA_metadata.ipynb

In [8]:
DESKTOP = '/Users/mmalumbres/Desktop/'
TCGA = "/Users/mmalumbres/Desktop/BioDATA/TCGA/"
XENA = "/Users/mmalumbres/Desktop/BioDATA/TCGA/Xena/"
PANCANATLAS = '/Users/mmalumbres/Desktop/BioDATA/TCGA/GDC TCGA PanCanAtlas/'
PANCAN = '/Users/mmalumbres/Desktop/BioDATA/TCGA/GDC Pan-Cancer (PANCAN)/'

In [3]:
import pandas as pd

Data obtained from:  
- PanCanAtlas: https://gdc.cancer.gov/about-data/publications/pancanatlas
- GDC TCGA PANCAN from Xena: https://xenabrowser.net/datapages/?cohort=GDC%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

# Summary of data  

### 1. RNAseq
- 1.1. `xena`: XENA `RSEM_Hugo_norm_count`: 58581 genes (gene symbol) x 10535 samples (e.g. TCGA-S9-A7J2-01).
- 1.2. `transcripts`: PANCAN HTSeq counts: 60488 rows (Ensembl_ID, including HTSeq summary rows such as `__no_feature`) x 11057 samples (e.g. TCGA-OR-A5JP-01A).
- 1.3. `transcripts_PanCanAtlas`: PanCanAtlas: 20531 genes (`gene_id` in a different format, e.g. ?|100134869) x 11069 samples (e.g. TCGA-OR-A5J1-01A-11R-A29S-07; full aliquot-level barcodes).

### 2. Gene based
- 2.1. `cnv`: > 8 million rows, multiple segments per sample. DO NOT USE
- 2.2. `masked_cnv`: > 3 million rows, multiple segments per sample. DO NOT USE
- 2.3. `maf` (PANCANATLAS + mc3.v0.2.8.PUBLIC.maf.gz): > 3 million mutations. DO NOT USE

### 3. Tumor data (e.g. TCGA-OR-A5J1-01)
- 3.1. `basic_phenotype`: PANCAN. Basic clinical data per sample.
- 3.2. `survival`: PANCAN. Overall survival (OS, OS.time) per sample.
- 3.3. `phenotype`: PANCAN. Phenotype per sample.
- 3.4. `purity`: PANCANATLAS ABSOLUTE. Genome purity and doublings. OK
- 3.5. `paradigm`: PANCANATLAS PARADIGM. Inference of pathway activities. 19503 pathway features x 9829 columns (patient-level barcodes, e.g. TCGA-06-2557). Too many features. CHECK!
- 3.6. `gistic`: Copy number per gene: 19729 genes x 11368 samples. Numeric values. OK, but CHECK: too large.
- 3.7. `mc3`: Xena MC3 nonsilent mutations per gene and sample: 40543 genes (e.g. RB1) x 9104 samples (e.g. TCGA-02-0003-01). OK, but select the most frequent genes, or use externally?

### 4. Patient data (e.g. TCGA-OR-A5J1)
- 4.1. `clinical`: PANCANATLAS PFI, PFS, OS, etc. OK
- 4.2. `clinical_follow`: PANCANATLAS additional follow up. >10k patients. OK

### 5. Others
- 5.1. `quality`: Sample quality: PANCANATLAS + "merged_sample_quality_annotations.tsv". Multiple aliquots per patient and platform. 79,286 rows. DO NOT USE
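
The datasets above use barcodes at different levels: patient (TCGA-OR-A5J1), sample (TCGA-OR-A5J1-01), sample plus vial (TCGA-OR-A5JP-01A) and full aliquot (TCGA-OR-A5J1-01A-11R-A29S-07). The sketch below shows one way to split a barcode into those levels. `parse_barcode` is a hypothetical helper, not part of the original notebook, and it assumes the standard dash-separated TCGA barcode layout.

```python
# Hypothetical helper (not in the original notebook): split a TCGA barcode
# into patient, sample and vial levels. Assumes dash-separated fields.
def parse_barcode(barcode: str) -> dict:
    parts = barcode.split("-")
    info = {"patient": "-".join(parts[:3])}      # e.g. TCGA-OR-A5J1
    if len(parts) >= 4:
        sample_code = parts[3][:2]               # e.g. "01" (tumor) or "11" (normal)
        info["sample"] = info["patient"] + "-" + sample_code
        info["sample_type_code"] = sample_code
        info["vial"] = parts[3][2:] or None      # "A" in "01A"; None when absent
    return info


for bc in ["TCGA-OR-A5J1", "TCGA-S9-A7J2-01",
           "TCGA-OR-A5JP-01A", "TCGA-OR-A5J1-01A-11R-A29S-07"]:
    print(bc, "->", parse_barcode(bc))
```

The 15-character truncation used later in this notebook (`.str[:15]`) corresponds to the `sample` field returned here.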

# 1. Download data

## 1. RNAseq data

### 1.1. Xena

In [22]:
# RNAseq data
XENA = '/Users/mmalumbres/Desktop/BioDATA/TCGA/data/Xena/'
xena = pd.read_csv(XENA + "Gene expression/tcga_RSEM_Hugo_norm_count.tsv", sep="\t")
print(xena.shape)
xena.head(3)

(58581, 10536)


Unnamed: 0,sample,TCGA-S9-A7J2-01,TCGA-G3-A3CH-11,TCGA-EK-A2RE-01,TCGA-44-6778-01,TCGA-F4-6854-01,TCGA-AB-2863-03,TCGA-C8-A1HL-01,TCGA-EW-A2FS-01,TCGA-05-4420-01,...,TCGA-DJ-A2QC-01,TCGA-A8-A09K-01,TCGA-61-1907-01,TCGA-IB-7885-01,TCGA-B6-A0IA-01,TCGA-VQ-AA6F-01,TCGA-BR-8588-01,TCGA-24-2254-01,TCGA-DD-A115-01,TCGA-FV-A3I0-11
0,CTD-2588J6.1,0.7385,0.0,0.0,0.0,0.0,1.1763,0.0,0.0,0.9652,...,0.0,0.0,0.0,0.0,0.0,1.1745,0.0,0.0,0.0,0.0
1,RP11-433M22.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CTD-2588J6.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 1.2. PANCAN

In [65]:
# RNAseq data
transcripts = pd.read_csv(PANCAN + "RNAseq/GDC-PANCAN.htseq_counts.tsv", sep="\t")
transcripts

Unnamed: 0,Ensembl_ID,TCGA-OR-A5JP-01A,TCGA-OR-A5JG-01A,TCGA-OR-A5K1-01A,TCGA-OR-A5JR-01A,TCGA-OR-A5KU-01A,TCGA-OR-A5L9-01A,TCGA-OR-A5JQ-01A,TCGA-OR-A5K4-01A,TCGA-OR-A5JL-01A,...,TCGA-VD-A8KI-01A,TCGA-V4-A9E9-01A,TCGA-V4-A9F7-01A,TCGA-V4-A9EO-01A,TCGA-V4-A9EU-01A,TCGA-WC-A87T-01A,TCGA-WC-AA9A-01A,TCGA-V4-A9EA-01A,TCGA-RZ-AB0B-01A,TCGA-V4-A9F8-01A
0,ENSG00000000003.13,10.769838,10.721099,10.253847,11.448116,10.843921,7.787903,11.971184,11.188589,11.762797,...,10.372865,10.380461,11.545930,10.696968,10.588715,11.394463,9.748193,9.533330,9.667112,8.405141
1,ENSG00000000005.5,2.584963,4.000000,1.000000,1.000000,1.000000,0.000000,2.000000,1.584963,3.700440,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,ENSG00000000419.11,11.361396,11.456868,11.002112,11.324181,10.714246,6.442943,11.385862,11.252665,10.682117,...,8.791163,8.451211,10.055282,9.879583,10.205793,10.062046,10.133142,6.629357,10.018200,7.781360
3,ENSG00000000457.12,9.152285,8.654636,7.781360,9.055282,7.948367,5.727920,9.057992,8.527477,9.226412,...,8.447083,7.483816,9.797662,9.525521,9.535275,9.702173,9.303781,6.475733,9.350939,7.499846
4,ENSG00000000460.15,7.693487,7.832890,6.918863,7.754888,6.459432,3.807355,7.592457,7.475733,6.584963,...,6.584963,6.247928,7.912889,7.714246,7.768184,7.942515,7.787903,5.247928,7.906891,6.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60483,__no_feature,20.857955,20.913038,20.642255,21.594382,20.915012,20.335652,21.262439,20.937683,21.253776,...,20.078441,20.108500,21.910417,20.804356,21.859678,22.122194,21.561402,20.012948,21.125540,19.634309
60484,__ambiguous,21.196319,20.855109,20.250448,21.013543,20.282017,18.897063,21.192075,20.704917,21.144501,...,21.207991,21.127259,21.393229,21.368524,21.066775,21.344917,21.649232,20.622553,20.987240,20.894101
60485,__too_low_aQual,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
60486,__not_aligned,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [79]:
# Truncate sample barcodes to 15 characters (TCGA-OR-A5JP-01A -> TCGA-OR-A5JP-01)
transcripts.columns = transcripts.columns.str[:15]
transcripts.head(3)

Unnamed: 0_level_0,TCGA-OR-A5JP-01,TCGA-OR-A5JG-01,TCGA-OR-A5K1-01,TCGA-OR-A5JR-01,TCGA-OR-A5KU-01,TCGA-OR-A5L9-01,TCGA-OR-A5JQ-01,TCGA-OR-A5K4-01,TCGA-OR-A5JL-01,TCGA-OR-A5LS-01,...,TCGA-VD-A8KI-01,TCGA-V4-A9E9-01,TCGA-V4-A9F7-01,TCGA-V4-A9EO-01,TCGA-V4-A9EU-01,TCGA-WC-A87T-01,TCGA-WC-AA9A-01,TCGA-V4-A9EA-01,TCGA-RZ-AB0B-01,TCGA-V4-A9F8-01
Ensembl,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ENSG00000000003,10.769838,10.721099,10.253847,11.448116,10.843921,7.787903,11.971184,11.188589,11.762797,10.932953,...,10.372865,10.380461,11.54593,10.696968,10.588715,11.394463,9.748193,9.53333,9.667112,8.405141
ENSG00000000005,2.584963,4.0,1.0,1.0,1.0,0.0,2.0,1.584963,3.70044,4.584963,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419,11.361396,11.456868,11.002112,11.324181,10.714246,6.442943,11.385862,11.252665,10.682117,10.590587,...,8.791163,8.451211,10.055282,9.879583,10.205793,10.062046,10.133142,6.629357,10.0182,7.78136


In [70]:
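# Assumption: the version-less `Ensembl` column seen in the output was created in a cell not shown here
transcripts["Ensembl"] = transcripts["Ensembl_ID"].str.split(".").str[0]
transcripts = transcripts.drop("Ensembl_ID", axis=1).set_index("Ensembl")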
print(transcripts.shape)
transcripts.head(3)

(60488, 11057)


Unnamed: 0_level_0,TCGA-OR-A5JP-01A,TCGA-OR-A5JG-01A,TCGA-OR-A5K1-01A,TCGA-OR-A5JR-01A,TCGA-OR-A5KU-01A,TCGA-OR-A5L9-01A,TCGA-OR-A5JQ-01A,TCGA-OR-A5K4-01A,TCGA-OR-A5JL-01A,TCGA-OR-A5LS-01A,...,TCGA-VD-A8KI-01A,TCGA-V4-A9E9-01A,TCGA-V4-A9F7-01A,TCGA-V4-A9EO-01A,TCGA-V4-A9EU-01A,TCGA-WC-A87T-01A,TCGA-WC-AA9A-01A,TCGA-V4-A9EA-01A,TCGA-RZ-AB0B-01A,TCGA-V4-A9F8-01A
Ensembl,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ENSG00000000003,10.769838,10.721099,10.253847,11.448116,10.843921,7.787903,11.971184,11.188589,11.762797,10.932953,...,10.372865,10.380461,11.54593,10.696968,10.588715,11.394463,9.748193,9.53333,9.667112,8.405141
ENSG00000000005,2.584963,4.0,1.0,1.0,1.0,0.0,2.0,1.584963,3.70044,4.584963,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419,11.361396,11.456868,11.002112,11.324181,10.714246,6.442943,11.385862,11.252665,10.682117,10.590587,...,8.791163,8.451211,10.055282,9.879583,10.205793,10.062046,10.133142,6.629357,10.0182,7.78136
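
The PANCAN table mixes real genes with HTSeq summary rows (`__no_feature`, `__ambiguous`, ...) and carries versioned Ensembl IDs. A minimal sketch of one possible clean-up on a toy table follows. The toy values are rounded from the output above, and dropping the summary rows is a suggestion rather than something the notebook does.

```python
import pandas as pd

# Toy stand-in for the HTSeq table (values rounded from the output above).
toy = pd.DataFrame({
    "Ensembl_ID": ["ENSG00000000003.13", "ENSG00000000005.5",
                   "__no_feature", "__ambiguous"],
    "TCGA-OR-A5JP-01A": [10.77, 2.58, 20.86, 21.20],
    "TCGA-OR-A5JG-01A": [10.72, 4.00, 20.91, 20.86],
})

# Keep only gene rows: HTSeq summary rows start with "__".
is_gene = ~toy["Ensembl_ID"].str.startswith("__")
genes = toy.loc[is_gene].copy()

# Strip the version suffix: ENSG00000000003.13 -> ENSG00000000003
genes["Ensembl"] = genes["Ensembl_ID"].str.split(".").str[0]
genes = genes.drop(columns="Ensembl_ID").set_index("Ensembl")

# Shorten sample barcodes to the 15-character sample level.
genes.columns = genes.columns.str[:15]
print(genes)
```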


### 1.3. PanCanAtlas

In [21]:
# RNAseq data
transcripts_PanCanAtlas = pd.read_csv(PANCANATLAS + "EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv", sep="\t")
print(transcripts_PanCanAtlas.shape)
transcripts_PanCanAtlas.head(3)

(20531, 11070)


Unnamed: 0,gene_id,TCGA-OR-A5J1-01A-11R-A29S-07,TCGA-OR-A5J2-01A-11R-A29S-07,TCGA-OR-A5J3-01A-11R-A29S-07,TCGA-OR-A5J5-01A-11R-A29S-07,TCGA-OR-A5J6-01A-31R-A29S-07,TCGA-OR-A5J7-01A-11R-A29S-07,TCGA-OR-A5J8-01A-11R-A29S-07,TCGA-OR-A5J9-01A-11R-A29S-07,TCGA-OR-A5JA-01A-11R-A29S-07,...,TCGA-CG-4449-01A-01R-1157-13,TCGA-CG-4462-01A-01R-1157-13,TCGA-CG-4465-01A-01R-1157-13,TCGA-CG-4466-01A-01R-1157-13,TCGA-CG-4469-01A-01R-1157-13,TCGA-CG-4472-01A-01R-1157-13,TCGA-CG-4474-01A-02R-1157-13,TCGA-CG-4475-01A-01R-1157-13,TCGA-CG-4476-01A-01R-1157-13,TCGA-CG-4477-01A-01R-1157-13
0,?|100130426,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
1,?|100133144,3.2661,2.6815,1.7301,0.0,0.0,1.1673,1.4422,0.0,4.4556,...,4.358154,5.676995,5.21935,14.846708,20.115492,6.997533,18.311906,12.057112,18.62874,17.874417
2,?|100134869,3.9385,8.9948,6.565,1.5492,4.4709,6.0529,2.2876,1.3599,5.0581,...,2.65636,3.342794,2.423442,5.055287,11.626054,13.654193,7.417109,11.585177,11.482418,14.919338


In [85]:
# Truncate aliquot barcodes to 15 characters (TCGA-OR-A5J1-01A-11R-A29S-07 -> TCGA-OR-A5J1-01)
transcripts_PanCanAtlas.columns = transcripts_PanCanAtlas.columns.str[:15]
transcripts_PanCanAtlas.head(3)

Unnamed: 0,gene_id,TCGA-OR-A5J1-01,TCGA-OR-A5J2-01,TCGA-OR-A5J3-01,TCGA-OR-A5J5-01,TCGA-OR-A5J6-01,TCGA-OR-A5J7-01,TCGA-OR-A5J8-01,TCGA-OR-A5J9-01,TCGA-OR-A5JA-01,...,TCGA-CG-4449-01,TCGA-CG-4462-01,TCGA-CG-4465-01,TCGA-CG-4466-01,TCGA-CG-4469-01,TCGA-CG-4472-01,TCGA-CG-4474-01,TCGA-CG-4475-01,TCGA-CG-4476-01,TCGA-CG-4477-01
0,?|100130426,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
1,?|100133144,3.2661,2.6815,1.7301,0.0,0.0,1.1673,1.4422,0.0,4.4556,...,4.358154,5.676995,5.21935,14.846708,20.115492,6.997533,18.311906,12.057112,18.62874,17.874417
2,?|100134869,3.9385,8.9948,6.565,1.5492,4.4709,6.0529,2.2876,1.3599,5.0581,...,2.65636,3.342794,2.423442,5.055287,11.626054,13.654193,7.417109,11.585177,11.482418,14.919338
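
The `gene_id` values in this file appear to combine a gene symbol and a numeric ID separated by `|`, with `?` where no symbol is available. A possible way to separate the two parts is sketched below on a toy frame. The `A1BG|1` row is an illustrative entry, not copied from the output above, and the column names `symbol` and `entrez_id` are assumptions.

```python
import pandas as pd

# Toy frame; the first two gene_id values come from the output above,
# "A1BG|1" is an illustrative entry with a known symbol.
toy = pd.DataFrame({
    "gene_id": ["?|100130426", "?|100133144", "A1BG|1"],
    "TCGA-OR-A5J1-01": [0.0, 3.2661, 5.0],
})

# Split "SYMBOL|ID" into two columns.
parts = toy["gene_id"].str.split("|", n=1, expand=True)

# "?" means that no symbol is available: store it as missing.
toy["symbol"] = parts[0].where(parts[0] != "?")
toy["entrez_id"] = parts[1].astype(int)
print(toy[["gene_id", "symbol", "entrez_id"]])
```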


### Common samples

In [81]:
xena.columns

Index(['sample', 'TCGA-S9-A7J2-01', 'TCGA-G3-A3CH-11', 'TCGA-EK-A2RE-01',
       'TCGA-44-6778-01', 'TCGA-F4-6854-01', 'TCGA-AB-2863-03',
       'TCGA-C8-A1HL-01', 'TCGA-EW-A2FS-01', 'TCGA-05-4420-01',
       ...
       'TCGA-DJ-A2QC-01', 'TCGA-A8-A09K-01', 'TCGA-61-1907-01',
       'TCGA-IB-7885-01', 'TCGA-B6-A0IA-01', 'TCGA-VQ-AA6F-01',
       'TCGA-BR-8588-01', 'TCGA-24-2254-01', 'TCGA-DD-A115-01',
       'TCGA-FV-A3I0-11'],
      dtype='object', length=10536)

In [82]:
transcripts.columns

Index(['TCGA-OR-A5JP-01', 'TCGA-OR-A5JG-01', 'TCGA-OR-A5K1-01',
       'TCGA-OR-A5JR-01', 'TCGA-OR-A5KU-01', 'TCGA-OR-A5L9-01',
       'TCGA-OR-A5JQ-01', 'TCGA-OR-A5K4-01', 'TCGA-OR-A5JL-01',
       'TCGA-OR-A5LS-01',
       ...
       'TCGA-VD-A8KI-01', 'TCGA-V4-A9E9-01', 'TCGA-V4-A9F7-01',
       'TCGA-V4-A9EO-01', 'TCGA-V4-A9EU-01', 'TCGA-WC-A87T-01',
       'TCGA-WC-AA9A-01', 'TCGA-V4-A9EA-01', 'TCGA-RZ-AB0B-01',
       'TCGA-V4-A9F8-01'],
      dtype='object', length=11057)

In [83]:
# xena & PANCAN
common = xena.columns.isin(transcripts.columns)
common.sum()

10334

In [86]:
# PANCAN & PanCanAtlas
common = transcripts.columns.isin(transcripts_PanCanAtlas.columns)
common.sum()

10904
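
The `isin(...).sum()` counts above work on whole column lists, so non-sample columns are included in the comparison and any column name duplicated by truncation would be counted more than once. Whether such duplicates exist in these tables is not checked in the notebook. A hedged alternative using sets of sample IDs is sketched here with toy column lists (`sample_ids` is a hypothetical helper):

```python
import pandas as pd

# Toy column lists; the duplicated ID mimics two vials collapsing to one sample.
xena_cols = pd.Index(["sample", "TCGA-S9-A7J2-01", "TCGA-OR-A5JP-01", "TCGA-OR-A5J1-01"])
pancan_cols = pd.Index(["TCGA-OR-A5JP-01", "TCGA-OR-A5J1-01", "TCGA-OR-A5JP-01"])
atlas_cols = pd.Index(["gene_id", "TCGA-OR-A5J1-01"])


def sample_ids(cols):
    """Unique TCGA sample IDs among the column names (hypothetical helper)."""
    return {c for c in cols if c.startswith("TCGA-")}


shared_two = sample_ids(xena_cols) & sample_ids(pancan_cols)
shared_all = shared_two & sample_ids(atlas_cols)

print("isin count (duplicates included):", pancan_cols.isin(xena_cols).sum())
print("unique shared, Xena & PANCAN:", len(shared_two))
print("unique shared, all three:", len(shared_all))
```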

## 2. Gene & Chromosome data

### 2.1. Copy number

In [33]:
cnv = pd.read_csv(PANCAN + "copy number/GDC-PANCAN.cnv.tsv", sep="\t")
print(cnv.shape)
cnv.head(3)

(8235503, 5)


Unnamed: 0,sample,Chrom,Start,End,value
0,TCGA-OR-A5JE-01A,1,62920,15381037,-0.554
1,TCGA-OR-A5JE-01A,1,15383915,21811974,-0.1369
2,TCGA-OR-A5JE-01A,1,21815733,21898797,0.2401


### 2.2. Masked CNV

In [35]:
masked_cnv = pd.read_csv(PANCAN + "copy number/GDC-PANCAN.masked_cnv.tsv", sep="\t")
print(masked_cnv.shape)
masked_cnv.head(3)

(3490756, 5)


Unnamed: 0,sample,Chrom,Start,End,value
0,TCGA-OR-A5JK-01A,1,3301765,119984772,-0.2971
1,TCGA-OR-A5JK-01A,1,146716041,197788606,0.0862
2,TCGA-OR-A5JK-01A,1,197788716,197799377,-1.3311


### 2.3. Somatic Mutation Datasets

All four call sets have more than 2 million rows.

In [28]:
varscan2 = pd.read_csv(PANCAN + "somatic mutation/GDC-PANCAN.varscan2_snv.tsv", sep="\t")
print(varscan2.shape)
varscan2.head(3)

(2854561, 11)


Unnamed: 0,Sample_ID,gene,chrom,start,end,ref,alt,Amino_Acid_Change,effect,filter,dna_vaf
0,TCGA-OR-A5J6-01A,OMA1,chr1,58539205,58539205,T,C,p.T30T,synonymous_variant,PASS,0.117647
1,TCGA-OR-A5J6-01A,SH3RF3,chr2,109437072,109437072,C,G,p.S585W,missense_variant,PASS,0.212871
2,TCGA-OR-A5J6-01A,SCN2A,chr2,165344781,165344781,A,T,p.H930L,missense_variant,PASS,0.220401


In [29]:
somaticsniper = pd.read_csv(PANCAN + "somatic mutation/GDC-PANCAN.somaticsniper_snv.tsv", sep="\t")
print(somaticsniper.shape)
somaticsniper.head(3)

(2202277, 11)


Unnamed: 0,Sample_ID,gene,chrom,start,end,ref,alt,Amino_Acid_Change,effect,filter,dna_vaf
0,TCGA-OR-A5K8-01A,RP1-163M9.7,chr1,16707615,16707615,C,T,,downstream_gene_variant,PASS,0.542857
1,TCGA-OR-A5K8-01A,HHIPL2,chr1,222547868,222547868,C,A,p.E59D,missense_variant,PASS,0.811321
2,TCGA-OR-A5K8-01A,GALNT2,chr1,230279475,230279475,C,G,,3_prime_UTR_variant,PASS,0.777778


In [30]:
mutect2 = pd.read_csv(PANCAN + "somatic mutation/GDC-PANCAN.mutect2_snv.tsv", sep="\t")
print(mutect2.shape)
mutect2.head(3)

(3175929, 11)


Unnamed: 0,Sample_ID,gene,chrom,start,end,ref,alt,Amino_Acid_Change,effect,filter,dna_vaf
0,TCGA-P6-A5OH-01A,RER1,chr1,2402218,2402219,-,CAAGGGCATCCTTGTGGCTATGGTCTGTACTTTCTTCTACGCTTT,p.F138_D139insYAFKGILVAMVCTFF,inframe_insertion,PASS,0.172414
1,TCGA-P6-A5OH-01A,WASF2,chr1,27410171,27410171,G,A,p.A287V,missense_variant,PASS,0.255319
2,TCGA-P6-A5OH-01A,MACF1,chr1,39455094,39455094,T,A,p.H6923Q,missense_variant,PASS,0.072072


In [31]:
muse = pd.read_csv(PANCAN + "somatic mutation/GDC-PANCAN.muse_snv.tsv", sep="\t")
print(muse.shape)
muse.head(3)

(2684788, 11)


Unnamed: 0,Sample_ID,gene,chrom,start,end,ref,alt,Amino_Acid_Change,effect,filter,dna_vaf
0,TCGA-OR-A5J7-01A,CTBS,chr1,84570701,84570701,C,T,p.G66E,missense_variant,PASS,0.146667
1,TCGA-OR-A5J7-01A,ATF6,chr1,161791444,161791444,C,G,p.P131A,missense_variant,PASS,0.491525
2,TCGA-OR-A5J7-01A,SLC35F3,chr1,234309160,234309160,C,A,p.A154E,missense_variant,PASS,0.218182
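
These long tables have one row per mutation call. If a compact sample x gene matrix is needed (similar to the Xena MC3 matrix in section 3.7), one option is to pivot the calls into a 0/1 table. The sketch below uses a few rows taken from the outputs above. Treating everything except `synonymous_variant` as nonsilent is a simplifying assumption for illustration only.

```python
import pandas as pd

# A few calls taken from the outputs above.
calls = pd.DataFrame({
    "Sample_ID": ["TCGA-OR-A5J6-01A", "TCGA-OR-A5J6-01A",
                  "TCGA-OR-A5J6-01A", "TCGA-OR-A5J7-01A"],
    "gene": ["OMA1", "SH3RF3", "SCN2A", "CTBS"],
    "effect": ["synonymous_variant", "missense_variant",
               "missense_variant", "missense_variant"],
    "filter": ["PASS", "PASS", "PASS", "PASS"],
})

# Illustrative filter: PASS calls that are not synonymous.
keep = (calls["filter"] == "PASS") & (calls["effect"] != "synonymous_variant")
kept = calls.loc[keep]

# Count calls per sample and gene, then cap at 1 to get presence/absence.
binary = pd.crosstab(kept["Sample_ID"], kept["gene"]).clip(upper=1)
print(binary)
```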


## 3. Sample data

### 3.1. Basic clinical data by sample

In [38]:
basic_phenotype = pd.read_csv(PANCAN + "basic-phenotype/GDC-PANCAN.basic_phenotype.tsv", sep="\t")
print(basic_phenotype.shape)
basic_phenotype.head(3)

(19188, 7)


Unnamed: 0,sample,program,sample_type_id,sample_type,project_id,Age at Diagnosis in Years,Gender
0,TCGA-69-7978-01A,TCGA,1,Primary Tumor,TCGA-LUAD,59.0,Male
1,TCGA-AR-A24Z-01A,TCGA,1,Primary Tumor,TCGA-BRCA,57.0,Female
2,TCGA-D1-A103-01A,TCGA,1,Primary Tumor,TCGA-UCEC,87.0,Female


In [107]:
basic_phenotype["sample"] = basic_phenotype["sample"].str[:15]
basic_phenotype = basic_phenotype.drop_duplicates(subset="sample")
print(basic_phenotype.shape)
basic_phenotype.head(3)

(16186, 7)


Unnamed: 0,sample,program,sample_type_id,sample_type,project_id,Age at Diagnosis in Years,Gender
0,TCGA-69-7978-01,TCGA,1,Primary Tumor,TCGA-LUAD,59.0,Male
1,TCGA-AR-A24Z-01,TCGA,1,Primary Tumor,TCGA-BRCA,57.0,Female
2,TCGA-D1-A103-01,TCGA,1,Primary Tumor,TCGA-UCEC,87.0,Female
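
Truncating to 15 characters followed by `drop_duplicates` keeps whichever vial happens to come first in the file. Before dropping, it may be worth checking which samples had several vials and making the choice explicit. The sketch below uses a toy table: the `-01B` row and the `vial_barcode` column are hypothetical, and preferring the alphabetically first vial is only one possible rule.

```python
import pandas as pd

# Toy table: the "-01B" row is a hypothetical second vial of the same sample.
toy = pd.DataFrame({
    "sample": ["TCGA-69-7978-01B", "TCGA-69-7978-01A", "TCGA-AR-A24Z-01A"],
    "sample_type": ["Primary Tumor", "Primary Tumor", "Primary Tumor"],
})

toy["vial_barcode"] = toy["sample"]          # keep the original barcode
toy["sample"] = toy["sample"].str[:15]       # 15-character sample ID

# Which samples collapse from more than one vial?
n_vials = toy.groupby("sample")["vial_barcode"].nunique()
multi = n_vials[n_vials > 1]
print("Samples with several vials:", multi.index.tolist())

# Deterministic choice: keep the alphabetically first vial (A before B).
dedup = toy.sort_values("vial_barcode").drop_duplicates(subset="sample", keep="first")
print(dedup)
```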


### 3.2. Survival per sample

In [43]:
survival = pd.read_csv(PANCAN + "phenotype/GDC-PANCAN.survival.tsv", sep="\t")
print(survival.shape)
survival.head(3)

(18492, 4)


Unnamed: 0,sample,OS,_PATIENT,OS.time
0,TCGA-OR-A5KZ-01A,1,TCGA-OR-A5KZ,125
1,TCGA-OR-A5LC-01A,1,TCGA-OR-A5LC,159
2,TCGA-P6-A5OF-01A,1,TCGA-P6-A5OF,207


In [108]:
survival["sample"] = survival["sample"].str[:15]
survival = survival.drop_duplicates(subset="sample")
print(survival.shape)
survival.head(3)

(15719, 4)


Unnamed: 0,sample,OS,_PATIENT,OS.time
0,TCGA-OR-A5KZ-01,1,TCGA-OR-A5KZ,125
1,TCGA-OR-A5LC-01,1,TCGA-OR-A5LC,159
2,TCGA-P6-A5OF-01,1,TCGA-P6-A5OF,207


### 3.3. Phenotype per sample

In [45]:
phenotype = pd.read_csv(PANCAN + "phenotype/GDC-PANCAN.TCGA_phenotype.tsv", sep="\t")
print(phenotype.shape)
phenotype.head(3)

(14318, 58)


Unnamed: 0,sample,demographic.age_at_index,demographic.created_datetime,demographic.days_to_birth,demographic.days_to_death,demographic.demographic_id,demographic.ethnicity,demographic.gender,demographic.race,demographic.state,...,exposures.years_smoked,id,project.name,project.project_id,tissue_source_site.name,samples.is_ffpe,samples.sample_id,samples.sample_type,samples.sample_type_id,samples.tissue_type
0,TCGA-AX-A064-01A,81.0,,-29625.0,,7a2e57b2-e19b-5407-967c-2f076d5ee7fc,not reported,female,white,released,...,,633e6994-3e8b-4c0b-9b79-06cf4ab6f9bc,Uterine Corpus Endometrial Carcinoma,TCGA-UCEC,Gynecologic Oncology Group,False,3d85e546-d9f4-49ba-adbd-301118cd1307,Primary Tumor,1,Not Reported
1,TCGA-AP-A0L8-01A,70.0,,-25733.0,1484.0,dc7a4a86-4e5d-5763-939f-1ef9f9df5bf5,not reported,female,black or african american,released,...,,9aa8c2b4-db32-4899-bf0c-d578cb175e90,Uterine Corpus Endometrial Carcinoma,TCGA-UCEC,MSKCC,False,d1a7a58f-7a35-4a94-a337-051ae56ded00,Primary Tumor,1,Not Reported
2,TCGA-D1-A177-01A,70.0,,-25602.0,,dd1d384b-f3e5-53cb-b3e1-cff1a4d84850,not hispanic or latino,female,white,released,...,,e8c58d8b-51d7-4599-af79-469ee6f269db,Uterine Corpus Endometrial Carcinoma,TCGA-UCEC,Mayo Clinic,False,97587fd4-0f6f-4af5-ac0a-7328abb40e9c,Primary Tumor,1,Not Reported


In [109]:
phenotype["sample"] = phenotype["sample"].str[:15]
phenotype = phenotype.drop_duplicates(subset="sample")
print(phenotype.shape)
phenotype.head(3)

(13926, 58)


Unnamed: 0,sample,demographic.age_at_index,demographic.created_datetime,demographic.days_to_birth,demographic.days_to_death,demographic.demographic_id,demographic.ethnicity,demographic.gender,demographic.race,demographic.state,...,exposures.years_smoked,id,project.name,project.project_id,tissue_source_site.name,samples.is_ffpe,samples.sample_id,samples.sample_type,samples.sample_type_id,samples.tissue_type
0,TCGA-AX-A064-01,81.0,,-29625.0,,7a2e57b2-e19b-5407-967c-2f076d5ee7fc,not reported,female,white,released,...,,633e6994-3e8b-4c0b-9b79-06cf4ab6f9bc,Uterine Corpus Endometrial Carcinoma,TCGA-UCEC,Gynecologic Oncology Group,False,3d85e546-d9f4-49ba-adbd-301118cd1307,Primary Tumor,1,Not Reported
1,TCGA-AP-A0L8-01,70.0,,-25733.0,1484.0,dc7a4a86-4e5d-5763-939f-1ef9f9df5bf5,not reported,female,black or african american,released,...,,9aa8c2b4-db32-4899-bf0c-d578cb175e90,Uterine Corpus Endometrial Carcinoma,TCGA-UCEC,MSKCC,False,d1a7a58f-7a35-4a94-a337-051ae56ded00,Primary Tumor,1,Not Reported
2,TCGA-D1-A177-01,70.0,,-25602.0,,dd1d384b-f3e5-53cb-b3e1-cff1a4d84850,not hispanic or latino,female,white,released,...,,e8c58d8b-51d7-4599-af79-469ee6f269db,Uterine Corpus Endometrial Carcinoma,TCGA-UCEC,Mayo Clinic,False,97587fd4-0f6f-4af5-ac0a-7328abb40e9c,Primary Tumor,1,Not Reported


### 3.4. Genome Purity and doublings

ABSOLUTE purity/ploidy file.

In [47]:
purity = pd.read_csv(PANCANATLAS + "TCGA_mastercalls.abs_tables_JSedit.fixed.txt", sep="\t")
print(purity.shape)
purity.head(3)

(10786, 10)


Unnamed: 0,array,sample,call status,purity,ploidy,Genome doublings,Coverage for 80% power,Cancer DNA fraction,Subclonal genome fraction,solution
0,TCGA-OR-A5J1-01,TCGA-OR-A5J1-01A-11D-A29H-01,called,0.9,2.0,0.0,9.0,0.9,0.02,new
1,TCGA-OR-A5J2-01,TCGA-OR-A5J2-01A-11D-A29H-01,called,0.89,1.3,0.0,6.0,0.84,0.16,new
2,TCGA-OR-A5J3-01,TCGA-OR-A5J3-01A-11D-A29H-01,called,0.93,1.27,0.0,5.0,0.89,0.11,new


### 3.5. PARADIGM Pathway Inference Matrix

PARADIGM pathway analysis of mRNASeq expression data. PARADIGM is a method for inferring patient-specific genetic activities by incorporating curated pathway interactions among genes. Each gene is modeled by a factor graph as a set of interconnected variables encoding the expression and known activity of the gene and its products, which allows many types of omic data to be incorporated as evidence. The method uses probabilistic inference to predict the degree to which a pathway's activities (e.g. internal gene states, interactions or high-level 'outputs') are altered in the patient.
- https://academic.oup.com/bioinformatics/article/26/12/i237/282591



In [16]:
# Data compressed in `merge_merged_reals.tar.gz`
paradigm = pd.read_csv(PANCANATLAS + "merge_merged_reals.txt", sep="\t")
print(paradigm.shape)
paradigm.head(3)

(19503, 9830)


Unnamed: 0,Gene,TCGA-06-2557,TCGA-85-8287,TCGA-DH-A7UV,TCGA-YG-AA3O,TCGA-EE-A29Q,TCGA-FP-A9TM,TCGA-94-A5I4,TCGA-BP-4799,TCGA-CZ-4856,...,TCGA-EK-A2RL,TCGA-XC-AA0X,TCGA-E2-A150,TCGA-BR-A4J4,TCGA-EL-A3H3,TCGA-AJ-A3NE,TCGA-C8-A12N,TCGA-DU-7302,TCGA-44-7669,TCGA-WK-A8Y0
0,UBE2Q1,0.0,0.0,-1.57024,0.0,0.104788,-1.57024,-2.16255,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.288492
1,GRIK_3_homomer_(complex),0.364151,-1.22353,0.334967,-1.00429,0.334967,-1.07048,-1.22353,0.334967,0.334967,...,0.213701,0.334967,0.364151,-1.07048,-1.07048,0.213701,0.334967,0.239549,-1.00429,-1.00429
2,UBE2Q2,0.0,0.0,0.0,0.0,0.0,-1.44315,-1.10958,0.192315,0.0,...,0.192315,0.0,0.0,-1.10958,-1.44315,0.0,0.192315,0.0,0.0,0.0
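
With about 19k features, the PARADIGM matrix is too wide to merge into the metadata directly. One common way to reduce such a matrix, shown here only as a sketch and not as the notebook's method, is to keep the features with the highest variance across patients. The toy values are taken from the output above, and `top_n` is an arbitrary illustrative cutoff.

```python
import pandas as pd

# Toy slice of the PARADIGM matrix (values from the output above).
toy = pd.DataFrame({
    "Gene": ["UBE2Q1", "GRIK_3_homomer_(complex)", "UBE2Q2"],
    "TCGA-06-2557": [0.0, 0.364151, 0.0],
    "TCGA-85-8287": [0.0, -1.22353, 0.0],
    "TCGA-DH-A7UV": [-1.57024, 0.334967, 0.0],
}).set_index("Gene")

# Variance of each feature across patients (row-wise).
variances = toy.var(axis=1)

# Keep the top_n most variable features (arbitrary cutoff for illustration).
top_n = 2
top = toy.loc[variances.nlargest(top_n).index]
print(top)
```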


### 3.6. Copy number per gene

In [50]:
# `PATH` is not defined in this notebook; `PANCAN` is assumed from the GDC-PANCAN file name
gistic = pd.read_csv(PANCAN + "copy number gene/GDC-PANCAN.gistic.tsv", sep="\t")
print(gistic.shape)
gistic.head(3)

(19729, 11369)


Unnamed: 0.1,Unnamed: 0,TCGA-OR-A5KO-01A,TCGA-OR-A5KZ-01A,TCGA-OR-A5LA-01A,TCGA-OR-A5LP-01A,TCGA-OR-A5LR-01A,TCGA-OR-A5LJ-01A,TCGA-OR-A5KY-01A,TCGA-OR-A5L9-01A,TCGA-OR-A5LH-01A,...,TCGA-VD-A8KM-01A,TCGA-VD-AA8Q-01A,TCGA-V3-A9ZX-01A,TCGA-V4-A9EC-01A,TCGA-V4-A9EK-01A,TCGA-V4-A9EU-01A,TCGA-V4-A9F1-01A,TCGA-WC-A885-01A,TCGA-WC-A87W-01A,TCGA-YZ-A983-01A
0,ENSG00000000003.13,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ENSG00000000005.5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ENSG00000000419.11,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3.7. Xena SNP  
Processed subset derived from the MC3 data, with nonsilent calls summarized at the gene level per sample. Samples without any PASS nonsilent mutation may be excluded, and only genes with at least one mutation are included. Entries are usually:
- 0: no nonsilent mutation present
- 1: nonsilent mutation present

In [17]:
mc3 = pd.read_csv(XENA + "SNP/mc3.v0.2.8.PUBLIC.nonsilentGene.xena.txt", sep="\t")
mc3 = mc3.set_index("sample").T
print(mc3.shape)
mc3.head(3)

(9104, 40543)


sample,UBE2Q2,CHMP1B,PSMA2P1,SHQ1P1,CPHL1P,SSXP10,REM1,TCOF1,NSRP1,OPA6,...,TULP2,OR1E5,RP11-390F4.3,GNGT2,GNGT1,PTRF,DIAPH2-AS1,SELV,NFIX,SELP
TCGA-02-0003-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-02-0033-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-02-0047-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
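
To follow up on the idea of keeping only the most frequently mutated genes, the mutation frequency per gene is simply the column mean of the 0/1 matrix. A minimal sketch on a toy matrix follows. The values, the fourth sample ID and the `min_freq` threshold are made up for illustration.

```python
import pandas as pd

# Toy sample x gene matrix of 0/1 calls (values are made up).
toy = pd.DataFrame(
    {"TP53": [1, 1, 0, 1], "RB1": [0, 1, 0, 0], "SELP": [0, 0, 0, 0]},
    index=["TCGA-02-0003-01", "TCGA-02-0033-01",
           "TCGA-02-0047-01", "TCGA-02-0055-01"],
)

# Fraction of samples carrying a nonsilent mutation in each gene.
freq = toy.mean(axis=0)

# Keep genes mutated in at least min_freq of the samples (illustrative threshold).
min_freq = 0.25
frequent = freq[freq >= min_freq].sort_values(ascending=False)
subset = toy[frequent.index]
print(frequent)
```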


## 4. Patient data

### 4.1. Clinical data

The PanCanAtlas page (https://gdc.cancer.gov/about-data/publications/pancanatlas) recommends using TCGA-CDR-SupplementalTableS1.xlsx (see PATH):
- It is a curated resource of clinical annotations for TCGA data and provides recommendations for the use of clinical endpoints.
- It is strongly recommended that this file be used first for clinical elements and survival outcome data; for more details, see the TCGA-CDR paper: Liu et al. Cell 2018: https://www.cell.com/cell/fulltext/S0092-8674(18)30229-0
- Columns are explained in the sheets TCGA-CDR_Notes and ExtraEndpoints_Notes in PATH+TCGA-CDR-SupplementalTableS1.xlsx
  

In [10]:
clinical = pd.read_csv(PANCANATLAS + "Clinical end points.tsv", sep="\t")
print(clinical.shape)
clinical.head(3)

(11160, 19)


Unnamed: 0.1,Unnamed: 0,bcr_patient_barcode,type,PFI.1,PFI.time.1,PFI.2,PFI.time.2,PFS,PFS.time,DSS_cr,DSS.time.cr,DFI.cr,DFI.time.cr,PFI.cr,PFI.time.cr,PFI.1.cr,PFI.time.1.cr,PFI.2.cr,PFI.time.2.cr
0,1,TCGA-OR-A5J1,ACC,1,754,1,754,1,754,1,1355,1,754,1,754,1,754,1,754
1,2,TCGA-OR-A5J2,ACC,1,289,1,289,1,289,1,1677,#N/D,#N/D,1,289,1,289,1,289
2,3,TCGA-OR-A5J3,ACC,1,53,1,53,1,53,0,2091,1,53,1,53,1,53,1,53


### 4.2. Clinical data with follow-up

In [20]:
clinical_follow = pd.read_csv(PANCANATLAS + "clinical_PANCAN_patient_with_followup.tsv", sep="\t",
                             encoding="latin1")
print(clinical_follow.shape)
clinical_follow.head(3)

(10956, 746)



Unnamed: 0,bcr_patient_uuid,bcr_patient_barcode,acronym,gender,vital_status,days_to_birth,days_to_death,days_to_last_followup,days_to_initial_pathologic_diagnosis,age_at_initial_pathologic_diagnosis,...,total_bilirubin_upper_limit,platelet_result_count,fibrosis_ishak_score,fetoprotein_outcome_value,fetoprotein_outcome_upper_limit,fetoprotein_outcome_lower_limit,inter_norm_ratio_lower_limit,family_cancer_type_txt,bilirubin_upper_limit,days_to_last_known_alive
0,B3164F7B-C826-4E08-9EE6-8FF96D29B913,TCGA-OR-A5J1,ACC,MALE,Dead,-21496,1355.0,[Not Available],0,58,...,,,,,,,,,,
1,8E7C2E31-D085-4B75-A970-162526DD07A0,TCGA-OR-A5J2,ACC,FEMALE,Dead,-16090,1677,[Not Available],0,44,...,,,,,,,,,,
2,DFD687BC-6E69-42F7-AF94-D17FC150D1A1,TCGA-OR-A5J3,ACC,FEMALE,Alive,-8624,[Not Applicable],2091.0,0,23,...,,,,,,,,,,
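
Both clinical tables contain text placeholders for missing values (`#N/D`, `[Not Available]`, `[Not Applicable]`), which leaves otherwise numeric columns stored as strings. One option is to declare them as missing at load time through the `na_values` argument of `pd.read_csv`. The sketch below reads a small in-memory table built from the rows above. The placeholder list is an assumption and may need extending for the real files.

```python
import io
import pandas as pd

# Small in-memory stand-in for "Clinical end points.tsv" (rows from the output above).
raw = (
    "bcr_patient_barcode\ttype\tDFI.cr\tDFI.time.cr\n"
    "TCGA-OR-A5J1\tACC\t1\t754\n"
    "TCGA-OR-A5J2\tACC\t#N/D\t#N/D\n"
)

# Placeholders seen in the two clinical tables; the list is an assumption.
placeholders = ["#N/D", "[Not Available]", "[Not Applicable]"]

clinical_toy = pd.read_csv(io.StringIO(raw), sep="\t", na_values=placeholders)
print(clinical_toy.dtypes)
print(clinical_toy)
```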


## 5. Others

### 5.1. Quality annotations

In [15]:
# Sample quality annotations (one row per aliquot and platform)
quality = pd.read_csv(PANCANATLAS + "merged_sample_quality_annotations.tsv", sep="\t")
print(quality.shape)
quality.head(3)

(79286, 12)


Unnamed: 0,patient_barcode,aliquot_barcode,cancer type,platform,patient_annotation,sample_annotation,aliquot_annotation,aliquot_annotation_updated,AWG_excluded_because_of_pathology,AWG_pathology_exclusion_reason,Reviewed_by_EPC,Do_not_use
0,TCGA-01-0628,TCGA-01-0628-11A-01D-0356-01,OV,Genome_Wide_SNP_6,Organ-Specific Control,,,,0.0,,0.0,False
1,TCGA-01-0628,TCGA-01-0628-11A-01D-0383-05,OV,HumanMethylation27,Organ-Specific Control,,,,0.0,,0.0,False
2,TCGA-01-0630,TCGA-01-0630-11A-01D-0356-01,OV,Genome_Wide_SNP_6,Organ-Specific Control,,,,0.0,,0.0,False


# 2. Merge metadata

## 2.1. Original metadata 2021.  
Metadata per sample (TCGA-S9-A7J2-01). 10535 samples x 234 metadata columns

In [110]:
metadata = pd.read_csv(XENA + "210301_TCGA_RNAseq_metadata.tsv", sep="\t")
print(metadata.shape)
metadata.head(3)

(10535, 234)


Unnamed: 0,Sample_ID,Patient_ID,Project_ID,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,...,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,25.0,MALE,WHITE,,,Oligodendroglioma,G3,...,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,53.0,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,...,,,,,,,,,,
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,26.0,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2.2. Add new metadata

In [111]:
# Basic_phenotype
common = metadata["Sample_ID"].isin(basic_phenotype["sample"])
common.sum()

10534

In [112]:
# Basic_phenotype
metadata = metadata.merge(basic_phenotype, how="left", left_on="Sample_ID", right_on="sample")
print(metadata.shape)
metadata.head(3)

(10535, 241)


Unnamed: 0,Sample_ID,Patient_ID,Project_ID,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,...,chr20,chr21,chr22,sample,program,sample_type_id_y,sample_type_y,project_id,Age at Diagnosis in Years,Gender
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,25.0,MALE,WHITE,,,Oligodendroglioma,G3,...,0.0,0.0,0.0,TCGA-S9-A7J2-01,TCGA,1.0,Primary Tumor,TCGA-LGG,25.0,Male
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,53.0,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,...,,,,TCGA-G3-A3CH-11,TCGA,11.0,Solid Tissue Normal,TCGA-LIHC,53.0,Male
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,26.0,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,...,0.0,0.0,0.0,TCGA-EK-A2RE-01,TCGA,1.0,Primary Tumor,TCGA-CESC,26.0,Female


In [113]:
# Survival
common = metadata["Sample_ID"].isin(survival["sample"])
common.sum()

10320

In [114]:
# Survival
metadata = metadata.merge(survival, how="left", left_on="Sample_ID", right_on="sample")
print(metadata.shape)
metadata.head(3)

(10535, 245)


Unnamed: 0,Sample_ID,Patient_ID,Project_ID,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,...,program,sample_type_id_y,sample_type_y,project_id,Age at Diagnosis in Years,Gender,sample_y,OS_y,_PATIENT,OS.time_y
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,25.0,MALE,WHITE,,,Oligodendroglioma,G3,...,TCGA,1.0,Primary Tumor,TCGA-LGG,25.0,Male,TCGA-S9-A7J2-01,0.0,TCGA-S9-A7J2,62.0
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,53.0,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,...,TCGA,11.0,Solid Tissue Normal,TCGA-LIHC,53.0,Male,TCGA-G3-A3CH-11,0.0,TCGA-G3-A3CH,780.0
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,26.0,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,...,TCGA,1.0,Primary Tumor,TCGA-CESC,26.0,Female,TCGA-EK-A2RE-01,0.0,TCGA-EK-A2RE,57.0


In [115]:
# Phenotype
common = metadata["Sample_ID"].isin(phenotype["sample"])
common.sum()

10534

In [116]:
# phenotype
metadata = metadata.merge(phenotype, how="left", left_on="Sample_ID", right_on="sample")
print(metadata.shape)
metadata.head(3)

(10535, 303)


Unnamed: 0,Sample_ID,Patient_ID,Project_ID,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,...,exposures.years_smoked,id,project.name,project.project_id,tissue_source_site.name,samples.is_ffpe,samples.sample_id,samples.sample_type,samples.sample_type_id,samples.tissue_type
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,25.0,MALE,WHITE,,,Oligodendroglioma,G3,...,,194e8231-e8d4-477f-8c62-e11f2484ecce,Brain Lower Grade Glioma,TCGA-LGG,Dept of Neurosurgery at University of Heidelberg,False,abd56cd9-8545-4874-91b4-c77834f3e3df,Primary Tumor,1.0,Not Reported
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,53.0,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,...,,3a2d7482-c246-411e-841f-71d96e353328,Liver Hepatocellular Carcinoma,TCGA-LIHC,Alberta Health Services,False,6b915dde-b8ac-4cc6-839b-e226e914b499,Solid Tissue Normal,11.0,Not Reported
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,26.0,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,...,,9aa36ac2-8418-4109-a3c1-63ca8742baab,Cervical Squamous Cell Carcinoma and Endocervi...,TCGA-CESC,Gynecologic Oncology Group,False,7bcf54b6-3ffa-4014-92c6-51a7099e6f02,Primary Tumor,1.0,Not Reported


In [117]:
# Purity
common = metadata["Sample_ID"].isin(purity["array"])
common.sum()

9461

In [118]:
# purity
metadata = metadata.merge(purity, how="left", left_on="Sample_ID", right_on="array")
print(metadata.shape)
metadata.head(3)

(10535, 313)



Unnamed: 0,Sample_ID,Patient_ID,Project_ID,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,...,array,sample_y,call status,purity,ploidy,Genome doublings,Coverage for 80% power,Cancer DNA fraction,Subclonal genome fraction,solution
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,25.0,MALE,WHITE,,,Oligodendroglioma,G3,...,TCGA-S9-A7J2-01,TCGA-S9-A7J2-01A-11D-A349-01,called,0.89,3.89,1.0,17.0,0.94,0.01,new
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,53.0,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,...,,,,,,,,,,
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,26.0,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,...,TCGA-EK-A2RE-01,TCGA-EK-A2RE-01A-11D-A18H-01,called,0.86,1.97,0.0,9.0,0.85,0.0,new


In [119]:
# PARADIGM & GISTIC not included... too large

In [120]:
# Clinical
common = metadata["Patient_ID"].isin(clinical["bcr_patient_barcode"])
common.sum()

10496

In [121]:
# clinical
metadata = metadata.merge(clinical, how="left", left_on="Patient_ID", right_on="bcr_patient_barcode")
print(metadata.shape)
metadata.head(3)

(10535, 332)


Unnamed: 0,Sample_ID,Patient_ID,Project_ID,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,...,DSS_cr,DSS.time.cr,DFI.cr,DFI.time.cr,PFI.cr,PFI.time.cr,PFI.1.cr,PFI.time.1.cr,PFI.2.cr,PFI.time.2.cr
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,25.0,MALE,WHITE,,,Oligodendroglioma,G3,...,0,62,#N/D,#N/D,0,62,0,62,0,62
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,53.0,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,...,0,780,1,116,1,116,1,116,1,116
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,26.0,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,...,0,57,#N/D,#N/D,0,57,0,57,0,57


In [122]:
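# Clinical with followup
# NOTE: this repeats the overlap check against `clinical`; to count matches for the
# follow-up table, test against clinical_follow["bcr_patient_barcode"] instead.
common = metadata["Patient_ID"].isin(clinical["bcr_patient_barcode"])
common.sum()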

10496

In [123]:
# clinical with followup
metadata = metadata.merge(clinical_follow, how="left", left_on="Patient_ID", right_on="bcr_patient_barcode")
print(metadata.shape)
metadata.head(3)

(10535, 1078)


Unnamed: 0,Sample_ID,Patient_ID,Project_ID,age_at_initial_pathologic_diagnosis_x,gender_x,race_x,ajcc_pathologic_tumor_stage,clinical_stage_x,histological_type_x,histological_grade,...,total_bilirubin_upper_limit,platelet_result_count,fibrosis_ishak_score,fetoprotein_outcome_value,fetoprotein_outcome_upper_limit,fetoprotein_outcome_lower_limit,inter_norm_ratio_lower_limit,family_cancer_type_txt,bilirubin_upper_limit,days_to_last_known_alive
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,25.0,MALE,WHITE,,,Oligodendroglioma,G3,...,,,,,,,,,,
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,53.0,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,...,,,,,,,,,,
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,26.0,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,...,,,,,,,,,,
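
After this merge several columns carry an `_x` suffix (for example `gender_x`), because the follow-up table repeats column names already present in `metadata`. Two possible ways to avoid that are sketched below on toy data; `left_demo` and `follow_demo` are hypothetical stand-ins, and which option fits depends on whether the repeated columns hold different information.

```python
import pandas as pd

left_demo = pd.DataFrame({"Patient_ID": ["P1", "P2"], "gender": ["MALE", "FEMALE"]})
follow_demo = pd.DataFrame({
    "bcr_patient_barcode": ["P1", "P2"],
    "gender": ["MALE", "FEMALE"],
    "days_to_last_known_alive": [100.0, None],
})
key = "bcr_patient_barcode"

# Option A: bring in only the follow-up columns that are new, plus the key
new_cols = [key] + [c for c in follow_demo.columns
                    if c not in left_demo.columns and c != key]
merged_a = left_demo.merge(follow_demo[new_cols], how="left",
                           left_on="Patient_ID", right_on=key)

# Option B: keep both copies, but leave the existing column names untouched
merged_b = left_demo.merge(follow_demo, how="left",
                           left_on="Patient_ID", right_on=key,
                           suffixes=("", "_follow"))

print(list(merged_a.columns))
print(list(merged_b.columns))
```

Option A keeps the column names stable for downstream code; option B is safer when the repeated columns might disagree and need to be compared.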


In [124]:
metadata.to_csv(DESKTOP + "251225_TCGA_metadata.tsv", sep="\t")

## 2.3. Add mutant genes from XENA SNP MC3

In [10]:
TCGA_BULLKpy = "/Users/mmalumbres/Library/CloudStorage/OneDrive-VHIO/BioInformatics/BioProjects/MM01_BULLKpy/bullkpy-skeleton/"

In [12]:
metadata_final = pd.read_csv(TCGA_BULLKpy + "251225_TCGA_metadata.tsv", sep="\t",
                            low_memory = False)
print(metadata_final.shape)
metadata_final.head(3)

(10534, 1085)


Unnamed: 0,Sample_ID,Patient_ID,Project_ID,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,initial_pathologic_dx_year,...,kmeans_20,kmeans_25,kmeans_30,Proliferation_score,T_cell_score,Neuroendocrine_score,score_prolif,S_score,G2M_score,phase
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,MALE,WHITE,,,Oligodendroglioma,G3,2013.0,...,10,13,27,-1.28115,-3.294353,2.421188,-2.581529,-0.847903,-0.972469,G1
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,2010.0,...,13,6,13,-1.226665,0.957415,-0.122859,-3.387491,-0.637804,-1.100115,G1
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,2010.0,...,9,15,5,1.434147,-0.939168,-0.041094,2.428172,1.475654,1.156046,S


In [18]:
mc3.head(3)

sample,UBE2Q2,CHMP1B,PSMA2P1,SHQ1P1,CPHL1P,SSXP10,REM1,TCOF1,NSRP1,OPA6,...,TULP2,OR1E5,RP11-390F4.3,GNGT2,GNGT1,PTRF,DIAPH2-AS1,SELV,NFIX,SELP
TCGA-02-0003-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-02-0033-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-02-0047-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
list_genes_lung = ["TP53", "RB1", "KRAS", "EGFR", "KEAP1", "NFE2L2", "STK11", "PIK3CA",
                 "SMARCA4", "BRAF", "PTEN", "CDKN2A", "CCND1", "NOTCH1", "CREBBP", "EP300",
                 "KMT2D", "KMT2C", "NOTCH1", "NOTCH2", "NOTCH3", "SLIT2", "MYC", "MYCL", "MYCN",
                 "SOX2", "ASCL1", "NEUROD1", "POU2F3", "YAP1"]

In [24]:
mc3_genes_lung = mc3[list_genes_lung]
print(mc3_genes_lung.shape)
mc3_genes_lung.head(3)

(9104, 30)


sample,TP53,RB1,KRAS,EGFR,KEAP1,NFE2L2,STK11,PIK3CA,SMARCA4,BRAF,...,NOTCH3,SLIT2,MYC,MYCL,MYCN,SOX2,ASCL1,NEUROD1,POU2F3,YAP1
TCGA-02-0003-01,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-02-0033-01,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-02-0047-01,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
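
Note that `list_genes_lung` names `NOTCH1` twice, so selecting with it returns two `NOTCH1` columns (30 columns for 29 distinct genes). A minimal sketch of an order-preserving de-duplication, with a check for genes missing from the mutation matrix, is shown below. `genes_demo` and `mut_demo` are toy stand-ins for the gene list and `mc3`.

```python
import pandas as pd

genes_demo = ["TP53", "NOTCH1", "KMT2C", "NOTCH1", "NOTCH2"]

# dict.fromkeys keeps the first occurrence of each gene and preserves order
unique_genes = list(dict.fromkeys(genes_demo))

# Toy mutation matrix: samples as index, genes as columns
mut_demo = pd.DataFrame(
    {"TP53": [1, 0], "NOTCH1": [0, 0], "KMT2C": [0, 1]},
    index=pd.Index(["S1-01", "S2-01"], name="sample"),
)

# Split the list into genes that exist as columns and genes that do not
present = [g for g in unique_genes if g in mut_demo.columns]
missing = [g for g in unique_genes if g not in mut_demo.columns]

subset_demo = mut_demo[present]
print(subset_demo.shape, missing)
```

Selecting with `present` also avoids a `KeyError` when one of the listed genes is absent from the matrix.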


In [28]:
mc3_genes_lung.to_csv(TCGA_BULLKpy + "251225_TCGA_lung_mutant_genes.tsv", sep="\t")

In [26]:
# metadata plus 30 mutation columns for the lung cancer gene list (NOTCH1 is listed twice, so 29 distinct genes)
metadata_final2 = metadata_final.merge(mc3_genes_lung, how="left", left_on="Sample_ID", right_index=True)
print(metadata_final2.shape)
metadata_final2.head(3)

(10534, 1115)


Unnamed: 0,Sample_ID,Patient_ID,Project_ID,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,initial_pathologic_dx_year,...,NOTCH3,SLIT2,MYC,MYCL,MYCN,SOX2,ASCL1,NEUROD1,POU2F3,YAP1
0,TCGA-S9-A7J2-01,TCGA-S9-A7J2,LGG,MALE,WHITE,,,Oligodendroglioma,G3,2013.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,TCGA-G3-A3CH-11,TCGA-G3-A3CH,LIHC,MALE,ASIAN,Stage IIIA,,Hepatocellular Carcinoma,G2,2010.0,...,,,,,,,,,,
2,TCGA-EK-A2RE-01,TCGA-EK-A2RE,CESC,FEMALE,WHITE,,Stage IIA,Cervical Squamous Cell Carcinoma,G2,2010.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
metadata_final2.to_csv(TCGA_BULLKpy + "251225_TCGA_metadata_lung_mutant_genes.tsv", sep="\t")

# PENDING
## 2.2. Add mutations at the gene level

In [51]:
# Transpose so samples are rows and genes are columns (the columns axis keeps the name "sample")
gistic2 = gistic.rename(columns={"Unnamed: 0": "sample"}).set_index("sample").T
gistic2

sample,ENSG00000000003.13,ENSG00000000005.5,ENSG00000000419.11,ENSG00000000457.12,ENSG00000000460.15,ENSG00000000938.11,ENSG00000000971.14,ENSG00000001036.12,ENSG00000001084.9,ENSG00000001167.13,...,ENSG00000281760.1,ENSG00000281781.1,ENSG00000281817.1,ENSG00000281844.1,ENSG00000281855.1,ENSG00000281873.1,ENSG00000281883.1,ENSG00000281887.1,ENSG00000281889.1,ENSG00000281899.1
TCGA-OR-A5KO-01A,0,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
TCGA-OR-A5KZ-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
TCGA-OR-A5LA-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-OR-A5LP-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-OR-A5LR-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-V4-A9EU-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-V4-A9F1-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-WC-A885-01A,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,1,0,0,0,0,0
TCGA-WC-A87W-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
# Keep only genes (columns) whose values sum to more than 0 across all samples
gistic_clean = gistic2.loc[:, gistic2.sum(axis=0) > 0]
print(gistic_clean.shape)
gistic_clean.head(3)

(11368, 11939)


sample,ENSG00000000003.13,ENSG00000000005.5,ENSG00000000419.11,ENSG00000000457.12,ENSG00000000460.15,ENSG00000000971.14,ENSG00000001084.9,ENSG00000001167.13,ENSG00000001497.15,ENSG00000001561.6,...,ENSG00000281656.1,ENSG00000281674.1,ENSG00000281676.1,ENSG00000281700.1,ENSG00000281741.1,ENSG00000281781.1,ENSG00000281817.1,ENSG00000281844.1,ENSG00000281855.1,ENSG00000281889.1
TCGA-OR-A5KO-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
TCGA-OR-A5KZ-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-OR-A5LA-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
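
The `sum(axis=0) > 0` filter behaves as "drop all-zero genes" only if the matrix holds non-negative values. If the GISTIC table contains signed calls (losses as negative numbers, which is an assumption here since only 0 and 1 are visible in the preview), genes whose losses cancel or outweigh gains are dropped as well. A sketch of the difference on a toy matrix, with hypothetical gene names:

```python
import pandas as pd

# Toy copy-number matrix: samples as rows, genes as columns
cn_demo = pd.DataFrame(
    {
        "GENE_A": [0, 1, 0],    # gained in one sample
        "GENE_B": [-1, -1, 0],  # lost in two samples
        "GENE_C": [0, 0, 0],    # never altered
        "GENE_D": [1, -1, 0],   # gain and loss cancel out
    },
    index=["S1", "S2", "S3"],
)

# Filter used in the notebook: keeps only columns with a positive sum
kept_by_sum = cn_demo.loc[:, cn_demo.sum(axis=0) > 0]

# Alternative: keep any column with at least one non-zero call
kept_by_any = cn_demo.loc[:, (cn_demo != 0).any(axis=0)]

print(list(kept_by_sum.columns))
print(list(kept_by_any.columns))
```

Checking `gistic2.min().min()` on the real table would show whether negative values are present at all.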


In [78]:
gistic_clean.columns = gistic_clean.columns.str.replace(r"\.\d+$", "", regex=True)
gistic_clean.head(3)

sample,ENSG00000000003,ENSG00000000005,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000971,ENSG00000001084,ENSG00000001167,ENSG00000001497,ENSG00000001561,...,ENSG00000281656,ENSG00000281674,ENSG00000281676,ENSG00000281700,ENSG00000281741,ENSG00000281781,ENSG00000281817,ENSG00000281844,ENSG00000281855,ENSG00000281889
TCGA-OR-A5KO-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
TCGA-OR-A5KZ-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-OR-A5LA-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [85]:
gistic_FINAL = gistic_clean.rename(columns=mapping)
gistic_FINAL

sample,TSPAN6,TNMD,DPM1,SCYL3,FIRRM,CFH,GCLC,NFYA,LAS1L,ENPP4,...,ENSG00000281656,ENSG00000281674,ENSG00000281676,ENSG00000281700,ENSG00000281741,ENSG00000281781,ENSG00000281817,ENSG00000281844,ENSG00000281855,ENSG00000281889
TCGA-OR-A5KO-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
TCGA-OR-A5KZ-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-OR-A5LA-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-OR-A5LP-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-OR-A5LR-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-V4-A9EU-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-V4-A9F1-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
TCGA-WC-A885-01A,0,0,0,0,0,0,0,1,0,1,...,0,1,0,0,0,0,0,1,1,0
TCGA-WC-A87W-01A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [88]:
gistic_FINAL.to_csv(DESKTOP + "gistic.tsv", sep="\t")

In [86]:
gistic_FINAL_selected = gistic_FINAL[["TP53", "RB1"]]
gistic_FINAL_selected

KeyError: "None of [Index(['TP53', 'RB1'], dtype='object', name='sample')] are in the [columns]"
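
The `KeyError` says neither symbol is among the columns of `gistic_FINAL`. Two candidate causes, neither confirmed here: the gene's column was removed by the `sum(axis=0) > 0` filter before renaming, or its Ensembl ID has no entry in `mapping` and so kept its `ENSG…` name. The helper below is a hypothetical diagnostic (not part of the notebook) that separates the two cases; it is shown on toy IDs and names.

```python
import pandas as pd

def trace_symbols(symbols, mapping, columns_before, columns_after):
    """For each symbol, list the IDs that map to it and where it gets lost."""
    before, after = set(columns_before), set(columns_after)
    rows = []
    for symbol in symbols:
        ids = [ens for ens, name in mapping.items() if name == symbol]
        rows.append({
            "symbol": symbol,
            "ids_in_mapping": ids,
            "id_in_filtered_columns": any(i in before for i in ids),
            "symbol_in_final_columns": symbol in after,
        })
    return pd.DataFrame(rows)

# Toy example: GENE_Y was filtered out, GENE_Q is absent from the mapping
mapping_demo = {"ENSG_X1": "GENE_X", "ENSG_Y1": "GENE_Y"}
filtered_cols = pd.Index(["ENSG_X1", "ENSG_Z1"])
final_cols = [mapping_demo.get(c, c) for c in filtered_cols]

report = trace_symbols(["GENE_X", "GENE_Y", "GENE_Q"],
                       mapping_demo, filtered_cols, final_cols)
print(report)
```

On the real objects the call would be along the lines of `trace_symbols(["TP53", "RB1"], mapping, gistic_clean.columns, gistic_FINAL.columns)`. An empty `ids_in_mapping` points to the mapping; a non-empty one with `id_in_filtered_columns` set to `False` points to the filter.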

# Tools: Gene IDs

Ensembl → BioMart → Ensembl Genes database

In [5]:
annot = pd.read_csv(DESKTOP + "251224_biomart_export.tsv", sep="\t")
annot.head(3)


Unnamed: 0,Gene stable ID,Gene stable ID version,Transcript stable ID,Transcript stable ID version,Chromosome/scaffold name,Gene start (bp),Gene end (bp),Strand,Gene name,Source of gene name,Transcript name,Source of transcript name,Gene type,Gene Synonym,HGNC symbol,HGNC ID
0,ENSG00000210049,ENSG00000210049.1,ENST00000387314,ENST00000387314.1,MT,577,647,1,MT-TF,HGNC Symbol,MT-TF-201,Transcript name,Mt_tRNA,MTTF,MT-TF,HGNC:7481
1,ENSG00000210049,ENSG00000210049.1,ENST00000387314,ENST00000387314.1,MT,577,647,1,MT-TF,HGNC Symbol,MT-TF-201,Transcript name,Mt_tRNA,TRNF,MT-TF,HGNC:7481
2,ENSG00000211459,ENSG00000211459.2,ENST00000389680,ENST00000389680.2,MT,648,1601,1,MT-RNR1,HGNC Symbol,MT-RNR1-201,Transcript name,Mt_rRNA,12S,MT-RNR1,HGNC:7470


In [6]:
gene_names = annot[["Gene stable ID", "Gene name"]].drop_duplicates().set_index("Gene stable ID")
print(gene_names.shape)
gene_names.head(3)

(86369, 1)


Unnamed: 0_level_0,Gene name
Gene stable ID,Unnamed: 1_level_1
ENSG00000210049,MT-TF
ENSG00000211459,MT-RNR1
ENSG00000210077,MT-TV


In [7]:
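# drop_duplicates() compares only the "Gene name" column here (the Ensembl ID is the index),
# so only the first Ensembl ID per gene name is kept
gene_names2 = gene_names.drop_duplicates()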
print(gene_names2.shape)
gene_names2.head(3)

(41452, 1)


Unnamed: 0_level_0,Gene name
Gene stable ID,Unnamed: 1_level_1
ENSG00000210049,MT-TF
ENSG00000211459,MT-RNR1
ENSG00000210077,MT-TV


In [8]:
mapping = gene_names2["Gene name"].to_dict()
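
Because `gene_names.drop_duplicates()` ignores the index, every Ensembl ID that shares its gene name with an earlier row (including all IDs with a missing name) is removed, which is consistent with the drop from 86369 to 41452 rows. Any ID removed this way has no entry in `mapping`. A possible alternative, sketched on a toy annotation table with made-up IDs, is to de-duplicate on the Ensembl ID instead and report shared names separately.

```python
import pandas as pd

# Toy annotation: ENSG_1 is repeated (one row per transcript or synonym),
# ENSG_2 and ENSG_3 share a name, ENSG_4 has no name
annot_demo = pd.DataFrame({
    "Gene stable ID": ["ENSG_1", "ENSG_1", "ENSG_2", "ENSG_3", "ENSG_4"],
    "Gene name": ["GENE_A", "GENE_A", "GENE_B", "GENE_B", None],
})

pairs = (
    annot_demo[["Gene stable ID", "Gene name"]]
    .dropna(subset=["Gene name"])              # unnamed genes keep their Ensembl ID
    .drop_duplicates(subset="Gene stable ID")  # one row per Ensembl ID
)

# Ensembl ID -> gene name, with every named ID retained
mapping_demo = pairs.set_index("Gene stable ID")["Gene name"].to_dict()

# Names shared by several IDs would produce duplicate columns after renaming
shared_names = (
    pairs.loc[pairs["Gene name"].duplicated(keep=False), "Gene name"]
    .unique()
    .tolist()
)
print(mapping_demo, shared_names)
```

Shared names still need a decision (keep the first, or leave those columns under their Ensembl ID), but with this layout the choice is explicit rather than a side effect of `drop_duplicates()`.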