# Load PanCancer data

First, we load the pre-processed PanCancer gene expression data (see the `PanCancer_Gene_Filtering` notebook). Expression values are given as log2(TPM+0.001).

In [1]:
import pandas as pd
import numpy as np

In [2]:
%%time
df_gene_exp = pd.read_hdf("../data/PanCancer/mad_filter_pancan_all_TCGA_20.h5", key="expression")

CPU times: user 2.17 s, sys: 817 ms, total: 2.99 s
Wall time: 3 s


In [3]:
print("Samples={}; Genes={};".format(*df_gene_exp.shape))

Samples=10535; Genes=20000;


In [4]:
df_gene_exp.head()

sample,ENSG00000279009.1,ENSG00000160182.2,ENSG00000257767.2,ENSG00000211935.3,ENSG00000105388.14,ENSG00000129455.15,ENSG00000143556.8,ENSG00000230937.9,ENSG00000242371.1,ENSG00000131002.11,...,ENSG00000179454.13,ENSG00000099974.7,ENSG00000168807.16,ENSG00000146067.15,ENSG00000114127.10,ENSG00000233476.3,ENSG00000169241.17,ENSG00000184428.12,ENSG00000275202.1,ENSG00000267544.1
TCGA-02-0047-01,-9.9658,-9.9658,5.1811,-1.4305,-4.035,-3.1714,-2.7274,-9.9658,0.346,3.483,...,2.5213,1.5709,2.2051,4.8059,2.1606,4.0454,3.7614,2.9432,1.1833,0.1124
TCGA-02-0055-01,1.2756,-9.9658,1.7532,-0.9132,-9.9658,-9.9658,-2.2447,-9.9658,0.1519,-4.6082,...,0.8246,1.9712,2.4386,4.6697,1.8282,4.9842,5.3509,3.3856,0.5955,0.4447
TCGA-02-2483-01,2.7314,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,3.6042,...,1.5064,0.605,1.2333,3.8611,1.3679,3.9883,4.1211,4.6277,-0.013,0.9493
TCGA-02-2485-01,-9.9658,-9.9658,-9.9658,-2.8262,-9.9658,-9.9658,-9.9658,-9.9658,-9.9658,3.863,...,1.8524,1.766,1.5758,4.889,2.167,4.1587,3.5399,4.4176,0.9862,1.6604
TCGA-04-1331-01,-9.9658,-1.1172,-9.9658,-9.9658,-9.9658,7.5607,-9.9658,4.1764,-1.2481,-9.9658,...,1.1577,0.7321,2.4753,5.9693,1.7009,2.2573,5.2235,4.2795,-0.4521,-0.6193
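
The value -9.9658 that recurs in the table is what log2(TPM+0.001) gives for a gene with TPM = 0. The following is a minimal sketch of the transform and its inverse; the helper names `tpm_to_log2` and `log2_to_tpm` are illustrative and do not come from the original notebooks.

```python
import math

PSEUDOCOUNT = 0.001  # offset stated in the text: log2(TPM + 0.001)

def tpm_to_log2(tpm):
    """Forward transform (hypothetical helper): TPM -> log2(TPM + 0.001)."""
    return math.log2(tpm + PSEUDOCOUNT)

def log2_to_tpm(value):
    """Inverse transform (hypothetical helper): log2(TPM + 0.001) -> TPM."""
    return 2.0 ** value - PSEUDOCOUNT

# A gene that is not expressed at all (TPM = 0) maps to the floor value
floor_value = tpm_to_log2(0.0)
print(round(floor_value, 4))  # -9.9658
```

Entries equal to this floor can therefore be read as "no expression detected" rather than as measured low expression.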


Next, we check whether the data frame contains any missing values (NA). If it does, either remove the rows that contain them (`dropna` method) or fill them in with an imputation method:

In [5]:
df_gene_exp.isnull().values.any()

False
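
No missing values were found here, so neither option is needed. For reference, this is a small sketch of both options on a hypothetical toy matrix (the data and the choice of median imputation are assumptions for illustration, not part of the original pipeline):

```python
import numpy as np
import pandas as pd

# Toy expression matrix (hypothetical values): rows are samples, columns are genes
toy = pd.DataFrame(
    {"gene_a": [1.0, np.nan, 3.0], "gene_b": [0.5, 1.5, 2.5]},
    index=["s1", "s2", "s3"],
)

has_na = toy.isnull().values.any()

# Option 1: drop every sample (row) that has at least one missing value
toy_dropped = toy.dropna(axis=0, how="any")

# Option 2: impute each gene (column) with its median over the observed samples
toy_imputed = toy.fillna(toy.median())
```

Dropping rows reduces the number of samples, whereas imputation keeps every sample but introduces estimated values.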

# Data-Split and Exploration

To apply the transfer learning (TL) approach, we split the PanCancer gene expression dataset into two subsets: a Lung dataset and a non-Lung dataset. The Lung dataset contains the samples from both LUAD and LUSC tumors.

We first load the survival clinical outcomes dataset curated by [*Liu et al.*](https://www.ncbi.nlm.nih.gov/pubmed/29625055), previously saved in the `PanCancer_Data_Obtaining` notebook. As we are interested in performing survival analysis using a TL approach, we only use the Lung and non-Lung samples contained in the survival dataset.

In [7]:
df_survival = pd.read_hdf("../data/PanCancer/pancan.h5", key="sample_clinical")

In [8]:
print(df_survival.shape)

df_survival.head()

(10496, 33)


Unnamed: 0,_PATIENT,cancer type abbreviation,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,clinical_stage,histological_type,histological_grade,initial_pathologic_dx_year,...,residual_tumor,OS,OS.time,DSS,DSS.time,DFI,DFI.time,PFI,PFI.time,Redaction
TCGA-02-0047-01,TCGA-02-0047,GBM,78.0,MALE,WHITE,,,Untreated primary (de novo) GBM,,2005.0,...,,1.0,448.0,1.0,448.0,,,1.0,57.0,
TCGA-02-0055-01,TCGA-02-0055,GBM,62.0,FEMALE,WHITE,,,Untreated primary (de novo) GBM,,2005.0,...,,1.0,76.0,1.0,76.0,,,1.0,6.0,
TCGA-02-2483-01,TCGA-02-2483,GBM,43.0,MALE,ASIAN,,,Untreated primary (de novo) GBM,,2008.0,...,,0.0,466.0,0.0,466.0,,,0.0,466.0,
TCGA-02-2485-01,TCGA-02-2485,GBM,53.0,MALE,BLACK OR AFRICAN AMERICAN,,,Untreated primary (de novo) GBM,,2009.0,...,,0.0,470.0,0.0,470.0,,,1.0,186.0,
TCGA-04-1331-01,TCGA-04-1331,OV,78.0,FEMALE,WHITE,,Stage IIIC,Serous Cystadenocarcinoma,G3,2004.0,...,,1.0,1336.0,1.0,1336.0,1.0,459.0,1.0,459.0,Redacted


We check that there are no duplicated samples:

In [9]:
survival_sample = df_survival.index
survival_sample.duplicated().any()

False

In [10]:
df_survival['cancer type abbreviation'].value_counts(normalize=False, dropna=False)

BRCA    1211
KIRC     603
LUAD     574
THCA     571
HNSC     564
PRAD     548
LUSC     548
LGG      522
SKCM     470
STAD     450
OV       427
BLCA     426
LIHC     421
COAD     329
KIRP     321
CESC     309
SARC     264
ESCA     195
UCEC     194
PCPG     185
PAAD     183
LAML     173
GBM      165
TGCT     137
THYM     121
READ     102
KICH      91
MESO      87
UVM       79
ACC       77
UCS       57
DLBC      47
CHOL      45
Name: cancer type abbreviation, dtype: int64

Given the similarity observed between LUAD and LUSC tumor types in [*Liu et al.*](https://www.ncbi.nlm.nih.gov/pubmed/29625055) from a survival analysis perspective (median follow-up times, K-M plots, etc.), we combine the samples from both tumor types into a single dataset:

In [11]:
# Samples are selected from the survival dataset, as we are interested in survival analysis
lung_sample = df_survival[df_survival['cancer type abbreviation'].isin(["LUAD", "LUSC"])].index
lung_sample.shape

(1122,)

In [12]:
len(df_gene_exp.index.intersection(lung_sample))

1122
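
The intersection has as many elements as `lung_sample`, so every Lung sample in the survival dataset has an expression profile. If the two counts ever disagreed, the samples without expression data could be listed as in this sketch (toy data with hypothetical sample names):

```python
import pandas as pd

# Hypothetical miniature of the survival table, indexed by sample barcode
toy_survival = pd.DataFrame(
    {"cancer type abbreviation": ["GBM", "LUAD", "LUSC", "OV"]},
    index=["s1", "s2", "s3", "s4"],
)
# Hypothetical expression index that covers only part of the samples
toy_expression_index = pd.Index(["s1", "s2", "s4"])

is_lung = toy_survival["cancer type abbreviation"].isin(["LUAD", "LUSC"])
toy_lung = toy_survival.loc[is_lung].index

# Lung samples found in the survival table but absent from the expression data
missing = toy_lung.difference(toy_expression_index)
print(list(missing))  # ['s3']
```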

We also load the sample type information dataset, previously saved in the `PanCancer_Data_Obtaining` notebook:

In [14]:
df_sample = pd.read_hdf("../data/PanCancer/pancan.h5", key="sample_type")

In [15]:
print(df_sample.shape)

df_sample.head()

(10534, 4)


Unnamed: 0,sample_type_id,sample_type,_primary_disease,tumor_normal
TCGA-02-0047-01,1.0,Primary Tumor,glioblastoma multiforme,Tumor
TCGA-02-0055-01,1.0,Primary Tumor,glioblastoma multiforme,Tumor
TCGA-02-2483-01,1.0,Primary Tumor,glioblastoma multiforme,Tumor
TCGA-02-2485-01,1.0,Primary Tumor,glioblastoma multiforme,Tumor
TCGA-04-1331-01,1.0,Primary Tumor,ovarian serous cystadenocarcinoma,Tumor


We check that there are no duplicated samples:

In [16]:
sample_info = df_sample.index
sample_info.duplicated().any()

False

We check the number of Lung samples contained in the sample type dataset:

In [17]:
len(df_sample.index.intersection(lung_sample))

1122

## Lung

This subset contains all the Lung (LUSC and LUAD) samples from the PanCancer dataset, and is intended to be used during the fine-tuning phase of a TL approach.

We first create the Lung gene expression dataset:

In [18]:
df_gene_exp_lung = df_gene_exp.loc[lung_sample]
df_gene_exp_lung.shape

(1122, 20000)

### Data exploration

Now we select the Lung samples from the sample type and survival clinical outcomes datasets and explore the information available for them:

In [19]:
df_sample_lung = df_sample.loc[lung_sample]
df_sample_lung.shape

(1122, 4)

In [20]:
# Verify that the samples are in the same order as in the gene expression datasets
df_gene_exp_lung.index.equals(df_sample_lung.index)

True

In [21]:
df_survival_lung = df_survival.loc[lung_sample, :]
df_survival_lung.shape

(1122, 33)

In [22]:
# Verify that the samples are in the same order as in the gene expression datasets
df_gene_exp_lung.index.equals(df_survival_lung.index)

True
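
Both checks return `True` because the three tables were indexed with the same `lung_sample` index. When tables come from different sources, the order may differ; a possible way to enforce a common order, sketched on hypothetical toy tables, is `reindex`:

```python
import pandas as pd

toy_expr = pd.DataFrame({"gene_a": [0.1, 0.2, 0.3]}, index=["s1", "s2", "s3"])
# Same samples, stored in a different order
toy_info = pd.DataFrame(
    {"tumor_normal": ["Normal", "Tumor", "Tumor"]},
    index=["s3", "s1", "s2"],
)

same_order_before = toy_expr.index.equals(toy_info.index)

# reindex returns the rows of toy_info in the order given by toy_expr.index
toy_info_aligned = toy_info.reindex(toy_expr.index)
same_order_after = toy_expr.index.equals(toy_info_aligned.index)
```

Note that `reindex` inserts rows of NaN for labels missing from the table, so it is worth checking for missing samples first.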

#### Tumor-Normal binary variable

In [23]:
# Tumor/normal variable
variable = "tumor_normal"
print("Number of samples with this information:",
      sum(df_sample_lung[variable].value_counts(normalize=False, dropna=False)))

df_sample_lung[variable].value_counts(normalize=False, dropna=False)

Number of samples with this information: 1122


Tumor     1013
Normal     109
Name: tumor_normal, dtype: int64

In [24]:
# Sample type variable
variable = "sample_type"
print("Number of samples with this information:",
      sum(df_sample_lung[variable].value_counts(normalize=False, dropna=False)))

df_sample_lung[variable].value_counts(normalize=False, dropna=False)

Number of samples with this information: 1122


Primary Tumor          1011
Solid Tissue Normal     109
Recurrent Tumor           2
Name: sample_type, dtype: int64

#### Survival clinical outcomes

In [25]:
# Overall survival
print("Number of samples with this information:",
      sum(df_survival_lung.OS.value_counts(normalize=False)))

df_survival_lung.OS.value_counts(normalize=False, dropna=False)

Number of samples with this information: 1122


0.0    663
1.0    459
Name: OS, dtype: int64

In [26]:
# Disease specific survival
print("Number of samples with this information:",
      sum(df_survival_lung.DSS.value_counts(normalize=False)))

df_survival_lung.DSS.value_counts(normalize=False, dropna=False)

Number of samples with this information: 1023


0.0    791
1.0    232
NaN     99
Name: DSS, dtype: int64

In [27]:
# Progression-free interval
print("Number of samples with this information:",
      sum(df_survival_lung['PFI'].value_counts(normalize=False)))

df_survival_lung['PFI'].value_counts(normalize=False, dropna=False)

Number of samples with this information: 1122


0.0    726
1.0    396
Name: PFI, dtype: int64

In [28]:
# Disease-free interval
print("Number of samples with this information:",
      sum(df_survival_lung['DFI'].value_counts(normalize=False)))

df_survival_lung['DFI'].value_counts(normalize=False, dropna=False)

Number of samples with this information: 664


0.0    502
NaN    458
1.0    162
Name: DFI, dtype: int64
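
The four cells above can also be condensed into one table. This is a sketch on hypothetical toy outcomes (the values are invented; only the column names follow the survival dataset):

```python
import numpy as np
import pandas as pd

# Toy event indicators (hypothetical values): 1 = event, 0 = censored, NaN = unknown
toy_outcomes = pd.DataFrame({
    "OS":  [0.0, 1.0, 1.0, 0.0],
    "DSS": [0.0, 1.0, np.nan, 0.0],
    "PFI": [0.0, 1.0, 1.0, 1.0],
    "DFI": [np.nan, np.nan, 1.0, 0.0],
})

summary = pd.DataFrame({
    "available": toy_outcomes.notna().sum(),  # samples with a known status
    "events": (toy_outcomes == 1).sum(),      # samples where the event occurred
    "missing": toy_outcomes.isna().sum(),     # samples without this information
})
print(summary)
```

Using `notna().sum()` counts the available samples directly, without summing the output of `value_counts`.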

### Export

We write the Lung gene expression, sample type, and survival outcome datasets into an HDF5 file, in machine learning format (rows as samples):

In [29]:
%%time
# Export h5 format file: create an HDF5 file with three datasets (contained in the root group, the file object)
with pd.HDFStore("../data/PanCancer/Lung_pancan.h5", "w") as store:
    store["expression"] = df_gene_exp_lung
    store["sample"] = df_sample_lung
    store["survival_outcome"] = df_survival_lung

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['_PATIENT', 'cancer type abbreviation', 'gender', 'race',
       'ajcc_pathologic_tumor_stage', 'clinical_stage', 'histological_type',
       'histological_grade', 'menopause_status', 'vital_status',
       'tumor_status', 'cause_of_death', 'new_tumor_event_type',
       'new_tumor_event_site', 'new_tumor_event_site_other',
       'treatment_outcome_first_course', 'margin_status', 'residual_tumor',
       'Redaction'],
      dtype='object')]

  exec(code, glob, local_ns)


CPU times: user 272 ms, sys: 173 ms, total: 444 ms
Wall time: 650 ms
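
The warning printed during the export concerns the columns of dtype `object` in the survival table (the ones listed in the message), which PyTables stores by pickling. As a rough sketch on a hypothetical toy table, such columns can be identified before writing with `select_dtypes`:

```python
import numpy as np
import pandas as pd

# Toy clinical table (hypothetical values); text columns are stored as object dtype
toy_clinical = pd.DataFrame({
    "gender": pd.Series(["MALE", "FEMALE", "MALE"], dtype=object),
    "clinical_stage": pd.Series([np.nan, "Stage IIIC", np.nan], dtype=object),
    "OS": [1.0, 0.0, 1.0],
})

# Columns that cannot be mapped directly to a C type
object_columns = toy_clinical.select_dtypes(include="object").columns.tolist()
print(object_columns)  # ['gender', 'clinical_stage']
```

The warning affects performance only; the data are still written to the file.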


## non-Lung

This subset includes all the PanCancer samples except the Lung ones, containing samples from 31 different tumor types. The non-Lung dataset is intended to be used during the pre-training phase of a TL approach.

In [30]:
no_lung_sample = df_survival.index.difference(lung_sample)
len(no_lung_sample)

9374

In [31]:
# Check that the number of non-Lung samples is correct
len(no_lung_sample) == (df_survival.shape[0] - df_gene_exp_lung.shape[0])

True
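
Beyond comparing counts, we can verify that the two subsets are disjoint and together cover all samples. The sketch below uses hypothetical sample names; it also shows that `Index.difference` returns its result sorted, so the non-Lung samples need not keep the order of the original table.

```python
import pandas as pd

all_samples = pd.Index(["s4", "s1", "s3", "s2"])
lung = pd.Index(["s3", "s1"])

# Set difference; pandas sorts the result by default
non_lung = all_samples.difference(lung)
print(list(non_lung))  # ['s2', 's4']

# The two subsets share no sample and together recover every sample
is_disjoint = len(non_lung.intersection(lung)) == 0
is_complete = len(non_lung.union(lung)) == len(all_samples)
```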

We first create the non-Lung gene expression dataset:

In [32]:
df_gene_exp_no_lung = df_gene_exp.loc[no_lung_sample]
df_gene_exp_no_lung.shape

(9374, 20000)

### Data exploration

We now explore the sample types (tumor or normal) and then some of the clinical information associated with the samples.

#### Tumor-Normal binary variable

We select the non-Lung samples of the expression dataset that are also present in the PanCancer sample type dataset:

In [33]:
sample_no_lung_common = df_gene_exp_no_lung.index.intersection(df_sample.index)
len(sample_no_lung_common)

9374

In [34]:
df_sample_no_lung = df_sample.loc[sample_no_lung_common]
df_sample_no_lung.shape

(9374, 4)

In [35]:
# Sample type variable
df_sample_no_lung.sample_type.value_counts(normalize=True, dropna=False)

Primary Tumor                                      0.869639
Solid Tissue Normal                                0.064327
Metastatic                                         0.041818
Primary Blood Derived Cancer - Peripheral Blood    0.018455
Recurrent Tumor                                    0.004587
Additional - New Primary                           0.001067
Additional Metastatic                              0.000107
Name: sample_type, dtype: float64

In [36]:
# Tumor/Normal variable
df_sample_no_lung.tumor_normal.value_counts(normalize=True, dropna=False)

Tumor     0.935673
Normal    0.064327
Name: tumor_normal, dtype: float64
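
The Lung exploration reported absolute counts, whereas proportions are reported here. To compare both subsets more easily, counts and proportions can be placed side by side, as in this sketch on a hypothetical toy variable:

```python
import pandas as pd

# Toy tumor/normal labels (hypothetical values)
toy_types = pd.Series(["Tumor", "Tumor", "Normal", "Tumor"], name="tumor_normal")

counts = toy_types.value_counts(dropna=False)
proportions = toy_types.value_counts(normalize=True, dropna=False)

# Both Series share the same labels, so they align on the index
table = pd.DataFrame({"count": counts, "proportion": proportions})
print(table)
```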

#### Clinical variables

We select the non-Lung samples of the expression dataset that are also present in the PanCancer clinical dataset:

In [37]:
survival_no_lung_common = df_gene_exp_no_lung.index.intersection(df_survival.index)
len(survival_no_lung_common)

9374

In [38]:
df_survival_no_lung = df_survival.loc[survival_no_lung_common]
df_survival_no_lung.shape

(9374, 33)

In [39]:
# Overall survival
variable = "OS"
print("Number of samples with this information:",
      sum(df_survival_no_lung[variable].value_counts(normalize=False)))

df_survival_no_lung[variable].value_counts(normalize=True, dropna=False)

Number of samples with this information: 9367


0.0    0.698208
1.0    0.301045
NaN    0.000747
Name: OS, dtype: float64

In [40]:
# Progression-free interval
variable = "PFI"
print("Number of samples with this information:",
      sum(df_survival_no_lung[variable].value_counts(normalize=False)))

df_survival_no_lung[variable].value_counts(normalize=True, dropna=False)

Number of samples with this information: 9194


0.0    0.641988
1.0    0.338809
NaN    0.019202
Name: PFI, dtype: float64

In [41]:
# Disease-specific survival
variable = "DSS"
print("Number of samples with this information:",
      sum(df_survival_no_lung[variable].value_counts(normalize=False)))

df_survival_no_lung[variable].value_counts(normalize=True, dropna=False)

Number of samples with this information: 8990


0.0    0.755174
1.0    0.203862
NaN    0.040964
Name: DSS, dtype: float64

In [42]:
# Disease-free interval
variable = "DFI"
print("Number of samples with this information:",
      sum(df_survival_no_lung[variable].value_counts(normalize=False)))

df_survival_no_lung[variable].value_counts(normalize=True, dropna=False)

Number of samples with this information: 4671


NaN    0.501707
0.0    0.400469
1.0    0.097824
Name: DFI, dtype: float64
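
Since some endpoints are missing for part of the samples, a model trained on a given endpoint can only use the samples where both the event indicator and its time are known. A minimal sketch, assuming the `OS` / `OS.time` column pairing of the survival dataset; the helper `usable_for_endpoint` and the toy values are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy survival table (hypothetical values) with gaps in the indicator and the time
toy_surv = pd.DataFrame(
    {
        "OS": [1.0, 0.0, np.nan, 1.0],
        "OS.time": [448.0, 466.0, 120.0, np.nan],
    },
    index=["s1", "s2", "s3", "s4"],
)

def usable_for_endpoint(df, event_col, time_col):
    """Keep rows where both the event indicator and the time are present."""
    return df.dropna(subset=[event_col, time_col])

toy_os = usable_for_endpoint(toy_surv, "OS", "OS.time")
print(list(toy_os.index))  # ['s1', 's2']
```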

### Export

We write the non-Lung gene expression, sample type, and clinical datasets into an HDF5 file, in machine learning format (rows as samples):

In [43]:
%%time
# Export h5 format file: create an HDF5 file with three datasets (contained in the root group, the file object)
with pd.HDFStore("../data/PanCancer/non_Lung_pancan.h5", "w") as store:
    store["expression"] = df_gene_exp_no_lung
    store["sample_type"] = df_sample_no_lung
    store["sample_clinical"] = df_survival_no_lung

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['_PATIENT', 'cancer type abbreviation', 'gender', 'race',
       'ajcc_pathologic_tumor_stage', 'clinical_stage', 'histological_type',
       'histological_grade', 'menopause_status', 'vital_status',
       'tumor_status', 'cause_of_death', 'new_tumor_event_type',
       'new_tumor_event_site', 'new_tumor_event_site_other',
       'treatment_outcome_first_course', 'margin_status', 'residual_tumor',
       'Redaction'],
      dtype='object')]

  exec(code, glob, local_ns)


CPU times: user 3.12 s, sys: 1.37 s, total: 4.49 s
Wall time: 11.7 s
