# Preparing the expression background - Subset of GTEx toil dataset
This notebook shows how to prepare a background dataset containing only a subset of tissues from the GTEx toil dataset. The dataset is restructured in the directory `../input/gene_expression/[DATASETNAME]`.

The code generates one `metadata_[DATASETNAME].csv` file inside that folder. This file is a table with two columns: `tissue` (name of the tissue) and `nsamples` (number of samples). The tissue names in this table are also used to name the individual tissue folders that contain the expression data files and to build the job array list.

In addition, the code creates one folder per tissue inside `../input/gene_expression/[DATASETNAME]/subgroups`. Each tissue folder contains two files, `[tissue].RDS` and `[tissue].csv`, which store the same counts matrix in two formats.

In summary, the folder will look like this:

- `../input/gene_expression/[DATASETNAME]`   
    - `metadata_[DATASETNAME].csv` dataset metadata file generated automatically when splitting the data into tissues
    - `genes_[DATASETNAME].csv` single-column file generated automatically with gene ids of all genes in universe
    - `subgroups`   
        - `tissue_1`   
            - `tissue_1.csv` expression matrix table in CSV format   
        - `tissue_2`    
            - `tissue_2.csv`
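
As a rough illustration of the layout above, the sketch below builds the expected file paths for a dataset. The helper `expected_paths` and the dataset name `mydataset` are hypothetical and are not part of `funcs/misc.R`; the sketch only composes path strings and does not touch the file system.

```r
# Hypothetical helper (an assumption for illustration, not part of the notebook's code):
# compose the paths implied by the folder layout described above.
expected_paths <- function(base_dir, dsname, tissues) {
    ds_dir <- file.path(base_dir, dsname)
    list(
        metadata   = file.path(ds_dir, paste0("metadata_", dsname, ".csv")),
        genes      = file.path(ds_dir, paste0("genes_", dsname, ".csv")),
        # file.path is vectorized, so this yields one CSV path per tissue
        tissue_csv = file.path(ds_dir, "subgroups", tissues, paste0(tissues, ".csv"))
    )
}

p <- expected_paths("../input/gene_expression", "mydataset", c("tissue_1", "tissue_2"))
print(p$metadata)
print(p$tissue_csv)
```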
         
**Definition of the gene universe**

The file `genes_[DATASETNAME].csv` contains the list of genes in the gene universe of this dataset. This list contains the genes that have at least `min_cts` counts in at least `min_sam_cts` samples. By default, these parameters are set to 3 counts in at least 1 sample. These filters are part of the changes implemented in version 2.2 and generate datasets with the tag `gfilter`.
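
The filter rule can be sketched on a single toy matrix as follows. This is a minimal illustration of the "at least `min_cts` counts in at least `min_sam_cts` samples" criterion, not the code used by the notebook: the matrix values are made up, and how the notebook's own function combines the filter across tissues is not shown here.

```r
# Toy matrix with genes in rows and samples in columns (values are made up)
cts <- matrix(c(0, 1, 5,
                2, 2, 2,
                3, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3")))

min_cts <- 3      # minimum number of counts
min_sam_cts <- 1  # minimum number of samples reaching min_cts

# For each gene, count the samples with at least min_cts counts,
# then keep the genes where that count reaches min_sam_cts
keep <- rowSums(cts >= min_cts) >= min_sam_cts
gene_universe <- rownames(cts)[keep]
print(gene_universe)
```

In this toy case `g2` never reaches 3 counts in any sample, so it is the only gene left out.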

In [1]:
suppressPackageStartupMessages({
    library(plyr)  # load plyr before dplyr so that dplyr verbs are not masked
    library(dplyr)
    library(biomaRt)
    library(data.table)
    library(furrr)
})
source('../funcs/misc.R')
options(stringsAsFactors=FALSE)

## PDxN requirements

In [2]:
# Dataset name
dsname<-"gtextoil_iBrain" # Dataset name (will be used in the config file) 

# Filtering parameters
min_cts<-3 # Minimum number of gene counts
min_sam_cts<-1 # Minimum number of samples with min_cts
min_sam_tissue<-10 # Minimum number of samples per tissue

In [3]:
# Directory structure - DO NOT CHANGE
output_dir <- file.path("../../input/gene_expression",dsname)
metadata_file <- file.path(output_dir,paste0("metadata_",dsname,".csv"))
gene_file <- file.path(output_dir,paste0("genes_",dsname,".csv"))
create_directory(output_dir)
message("Creating dataset directory at ",output_dir)

Creating dataset directory at ../../input/gene_expression/gtextoil_iBrain



In [4]:
# Specify the names of the primary sites that will be included in the new dataset
ps_list<-c('Blood','BloodVessel','BoneMarrow','Brain','Nerve','Pituitary','Spleen')

## Pre-processed inputs

Load the pre-processed files for the whole GTEx toil dataset, generated in the notebook `prep_background-gtex.ipynb`.

In [5]:
# Pre-processed GTex counts and metadata files
indir<-"~/projects/pdxn_2.0/data/background/gtex_toil"
metafile<-"GTEX_phenotype"
cts.proc<-read.table(file.path(indir,"gtex_RSEM_Hugo_norm_count_entrez_mapped.csv")) 
meta<-read.table(paste0(indir,"/",metafile,"_processed.tsv"),header = TRUE)

In [6]:
message('Original dataset has ',ncol(cts.proc),' samples')

Original dataset has 7851 samples



In [7]:
dim(meta)

## Split dataset by tissue

In this section we take the processed input files (counts and metadata), extract the samples that belong to the selected subset of tissues, and then split the global count matrix by tissue to create the dataset structure required for PDxN.
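
The splitting idea can be sketched in base R on toy data. This is only an illustration under assumed inputs: the matrix, the metadata table, and the threshold value are made up, and the notebook itself uses `tidyr::nest` and `purrr::map` rather than `split` and `lapply`.

```r
# Toy count matrix (genes x samples) and matching metadata (values are made up)
cts <- matrix(1:12, nrow = 2,
              dimnames = list(c("g1", "g2"), paste0("s", 1:6)))
meta <- data.frame(Sample_name = paste0("s", 1:6),
                   tissue = c("A", "A", "B", "B", "B", "C"))
min_sam_tissue <- 2  # toy threshold for this example

# Group sample names by tissue, then drop tissues with too few samples
samples_by_tissue <- split(meta$Sample_name, meta$tissue)
samples_by_tissue <- samples_by_tissue[lengths(samples_by_tissue) >= min_sam_tissue]

# Extract one sub-matrix per tissue; drop = FALSE keeps the matrix shape
cts_by_tissue <- lapply(samples_by_tissue, function(s) cts[, s, drop = FALSE])
print(sapply(cts_by_tissue, dim))
```

Tissue `C` has a single sample and is removed by the threshold, leaving one matrix each for `A` and `B`.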



In [8]:
# Filter samples corresponding to selected tissues 
meta.filt<-meta %>%
            mutate(Sample_name = gsub("-",".",Sample)) %>%
            filter(Sample_name %in% colnames(cts.proc)) %>% # keep only samples present in the count matrix
            filter(primary_site %in% ps_list) # extract desired subset of primary sites
        
cts.proc.filt<-cts.proc[,meta.filt$Sample_name] # Filter counts to match samples in metadata
message('Filtered dataset contains ',ncol(cts.proc.filt),' samples')

Filtered dataset contains 2746 samples



In [10]:
# Build list of count matrices by tissue
cts.df <- meta.filt %>%
            dplyr::select(tissue,Sample_name) %>%
            group_by(tissue) %>%
            tidyr::nest(samples=c(Sample_name)) %>%
            ungroup() %>%
            mutate(data=purrr::map(samples,function(s,...){ cts.proc.filt[,s[[1]]] } )) %>%
            mutate(nsamples=purrr::map_int(samples,nrow)) %>% # samples per tissue (length(samples) would return the number of tissues)
            filter(nsamples>=min_sam_tissue) %>%
            dplyr::select(tissue,data)
cts.list <- cts.df$data
names(cts.list) <- cts.df$tissue

In [11]:
# Filter gene expression matrices - this step creates the gene universe file internally
cts.list.filt <- filter_tissue_expr_list(cts_list = cts.list,
                                         min_counts = min_cts,
                                         min_samples = min_sam_cts,
                                         gu_file = gene_file)
names(cts.list.filt$filtered_cts) <- names(cts.list)

Gene universe contains 18994 genes

Wrote gene universe to file ../../input/gene_expression/gtextoil_iBrain/genes_gtextoil_iBrain.csv

Returning filtered matrices and gene universe



In [12]:
# Create tissue files  - the function writes the files internally 
res <- tissue_list_to_dirs(cts_list = cts.list.filt$filtered_cts,
                           output_dir = output_dir,
                           meta_file = metadata_file)

In [13]:
# Verify that the matrices have the correct dimensions and the splitting was successful
str(res)

List of 2
 $ ndirs: int 22
 $ sizes:List of 22
  ..$ BloodVessel_ArteryTibial                : int [1:2] 18994 281
  ..$ BloodVessel_ArteryCoronary              : int [1:2] 18994 118
  ..$ Brain_BrainCortex                       : int [1:2] 18994 105
  ..$ Nerve_NerveTibial                       : int [1:2] 18994 278
  ..$ Spleen                                  : int [1:2] 18994 100
  ..$ Pituitary                               : int [1:2] 18994 107
  ..$ Brain_BrainCerebellum                   : int [1:2] 18994 119
  ..$ Blood_WholeBlood                        : int [1:2] 18994 337
  ..$ Blood_CellsEBV-transformedlymphocytes   : int [1:2] 18994 107
  ..$ BloodVessel_ArteryAorta                 : int [1:2] 18994 207
  ..$ Brain_BrainSubstantianigra              : int [1:2] 18994 57
  ..$ Brain_BrainAnteriorcingulatecortex-BA24 : int [1:2] 18994 83
  ..$ Brain_BrainFrontalCortex-BA9            : int [1:2] 18994 95
  ..$ Brain_BrainCerebellarHemisphere         : int [1:2] 18994 93
  ..$