# Gene expression background - Microarray barcode dataset

The dataset will be restructured in the directory `../input/gene_expression/[DATASETNAME]`. The code generates one `metadata_[DATASETNAME].csv` file inside that folder. This file is a table with two columns: `tissue` (name of the tissue) and `nsamples` (number of samples). The tissue names in this table are also used to name the individual tissue folders containing the expression data files and to build the job array list. In addition, the code creates one folder per tissue inside `../input/gene_expression/[DATASETNAME]/subgroups`. Inside each tissue folder, two files are created: `[tissue].RDS` and `[tissue].csv`, which store the same counts matrix in two different formats.

In summary, the folder will look like this:

- `../input/gene_expression/[DATASETNAME]`   
    - `metadata_[DATASETNAME].csv` dataset metadata file generated automatically when splitting the data into tissues
    - `genes_[DATASETNAME].csv` single-column file generated automatically with gene ids of all genes in universe
    - `subgroups`   
        - `tissue_1`   
            - `tissue_1.csv` expression matrix table in CSV format   
        - `tissue_2`    
            - `tissue_2.csv`
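
As a rough illustration of this layout, the following base-R sketch writes a toy dataset into a temporary directory. It is not the pipeline's own helper code: the dataset name `TOY`, the toy matrices, and the use of `tempdir()` are assumptions made only for this example.

```r
# Toy list of expression matrices, one per tissue (hypothetical data)
cts_list <- list(
  tissue_1 = matrix(1:6, nrow = 2,
                    dimnames = list(c("g1", "g2"), c("s1", "s2", "s3"))),
  tissue_2 = matrix(1:4, nrow = 2,
                    dimnames = list(c("g1", "g2"), c("s4", "s5")))
)

dsname <- "TOY"  # hypothetical dataset name
output_dir <- file.path(tempdir(), "gene_expression", dsname)

# One folder per tissue under `subgroups`, each holding a CSV of the matrix
for (tissue in names(cts_list)) {
  tissue_dir <- file.path(output_dir, "subgroups", tissue)
  dir.create(tissue_dir, recursive = TRUE, showWarnings = FALSE)
  write.csv(cts_list[[tissue]], file.path(tissue_dir, paste0(tissue, ".csv")))
}

# Metadata table with the two columns described above
metadata <- data.frame(
  tissue = names(cts_list),
  nsamples = unname(vapply(cts_list, ncol, integer(1))),
  stringsAsFactors = FALSE
)
write.csv(metadata,
          file.path(output_dir, paste0("metadata_", dsname, ".csv")),
          row.names = FALSE)

# Inspect the resulting tree
list.files(output_dir, recursive = TRUE)
```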
         
**Definition of the gene universe**

The file `genes_[DATASETNAME].csv` contains the list of genes in the gene universe of this dataset. A gene is included if it has at least `min_cts` counts in at least `min_sam_cts` samples. By default, these parameters are set to 3 counts in at least 1 sample. These filters are part of the changes implemented in version 2.2 and generate datasets with the tag `gfilter`.
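
The rule can be sketched on a single toy matrix as follows. This is only an illustration of the threshold logic under the default parameters. The matrix and gene names are made up, and the sketch does not show how the pipeline's own filtering function combines several tissues.

```r
min_cts <- 3      # minimum value a gene must reach
min_sam_cts <- 1  # minimum number of samples reaching min_cts

# Hypothetical expression matrix: genes in rows, samples in columns
expr <- matrix(c(0, 1, 2,
                 5, 0, 0,
                 3, 3, 3,
                 2, 2, 2),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                               c("s1", "s2", "s3")))

# Count, per gene, the samples at or above min_cts, then apply the sample threshold
keep <- rowSums(expr >= min_cts) >= min_sam_cts
gene_universe <- rownames(expr)[keep]
print(gene_universe)
```

In this toy case `geneA` and `geneD` never reach 3 in any sample, so only `geneB` and `geneC` remain.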

In [1]:
suppressPackageStartupMessages({
    # Load plyr before dplyr so that dplyr verbs (mutate, summarize, ...) are not masked by plyr
    library(plyr)
    library(dplyr)
    library(biomaRt)
    library(data.table)
    library(furrr)
})
source('../funcs/misc.R')
options(stringsAsFactors=FALSE)

## PDxN requirements

In [2]:
# Dataset name
dsname<-"HGU133plus2_gfilter"

# Filtering parameters
min_cts<-3 # Minimum number of gene counts
min_sam_cts<-1 # Minimum number of samples with min_cts
min_sam_tissue<-10 # Minimum number of samples per tissue

In [3]:
# Directory structure - DO NOT CHANGE 
output_dir <- file.path("../../input/gene_expression",dsname)
metadata_file <- file.path(output_dir,paste0("metadata_",dsname,".csv"))
gene_file <- file.path(output_dir,paste0("genes_",dsname,".csv"))
create_directory(output_dir)
message("Creating dataset directory at ",output_dir)

Creating dataset directory at ../../input/gene_expression/HGU133plus2_gfilter



## Dataset-specific inputs

In [4]:
# Dataset original inputs
barcode_dir <- "~/projects/pdxn_2.0/data/background/microarray/HGU133plus2"
tissue_file <- "~/projects/pdxn_2.0/data/background/microarray/Barcode3.tissue.RDS"

In [5]:
# Load dataset
bcdfiles <- list.files(barcode_dir, recursive = TRUE, pattern = "\\.collapse\\.RDS", full.names = TRUE) # dots escaped so they match literally
tissue_annot <- readRDS(tissue_file)

## Process original files 

This section loads the legacy objects from the original PDxN pipeline and extracts the count matrices and tissue annotation into separate objects, which are used to split the dataset into tissues in the next step.

In [6]:
# Clean up tissue names
tissue_annot <- tissue_annot %>%
                mutate(tissue=gsub(",_ie,_|_\\(.*","",tissue) %>%
                              gsub("\\.+|\\._|%:|%_","_",.))
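
To make the effect of the two substitutions concrete, here is a small base-R sketch applied to made-up tissue labels. The labels are hypothetical and not taken from the Barcode annotation. The patterns are the ones used in the cell above.

```r
# Hypothetical raw labels, for illustration only
raw <- c("adipose_tissue_(subcutaneous)", "airway.epithelial.cells")

# Step 1: drop ",_ie,_" and everything from "_(" onwards
step1 <- gsub(",_ie,_|_\\(.*", "", raw)

# Step 2: replace runs of dots (and the other listed separators) with "_"
clean <- gsub("\\.+|\\._|%:|%_", "_", step1)
print(clean)
```

The first label loses its parenthesised suffix and the second has its dots turned into underscores, giving `adipose_tissue` and `airway_epithelial_cells`.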

In [7]:
# Process dataset
cts.list<-bcdfiles %>%
          lapply(.,function(path){
                tissue_exprs <- readRDS(path)$datETcollapsed
                return(tissue_exprs)
           })
names(cts.list)<-gsub("\\.collapse.*","",basename(bcdfiles)) %>%  # clean up tissue names
                 gsub("\\._ie\\._|_\\(.*","",.) %>%
                 gsub("\\.+|\\._","_",.)
names(cts.list)

## Split dataset by tissue

This section generates a named list with the tissue expression matrices of all the experiments available for that tissue or subgroup. The file structure will be the following:

- `../input/gene_expression/[DATASETNAME]`   
    - `metadata_[DATASETNAME].csv` dataset metadata file generated automatically when splitting the data into tissues
    - `genes_[DATASETNAME].csv` single-column file generated automatically with gene ids of all genes in universe
    - `subgroups`   
        - `tissue_1`   
            - `tissue_1.csv` expression matrix table in CSV format   
        - `tissue_2`    
            - `tissue_2.csv`
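
The grouping by tissue and series can be sketched on toy data as follows. This base-R version is only an approximation of the idea, not the notebook's dplyr code. The sample names, series labels, and the threshold of 2 samples are assumptions chosen to keep the example small.

```r
# Hypothetical annotation: one row per sample
annot <- data.frame(
  sample = paste0("S", 1:6),
  tissue = c("liver", "liver", "liver", "lung", "lung", "lung"),
  series = c("GSE_A", "GSE_A", "GSE_B", "GSE_C", "GSE_C", "GSE_C"),
  stringsAsFactors = FALSE
)

# Hypothetical per-tissue expression matrices (genes x samples)
genes <- paste0("g", 1:4)
cts <- list(
  liver = matrix(1:12, nrow = 4, dimnames = list(genes, c("S1", "S2", "S3"))),
  lung = matrix(13:24, nrow = 4, dimnames = list(genes, c("S4", "S5", "S6")))
)

min_n <- 2  # toy threshold on samples per tissue/series group

# Group samples by tissue and series, then drop groups that are too small
groups <- split(annot, paste(annot$tissue, annot$series, sep = "_"))
groups <- Filter(function(g) nrow(g) >= min_n, groups)

# Extract the matching columns from the tissue matrix for each remaining group
sub <- lapply(groups, function(g) {
  as.data.frame(cts[[g$tissue[1]]][, g$sample, drop = FALSE])
})

# Number of genes and samples per group
sapply(sub, dim)
```

Here the single-sample group `liver_GSE_B` is dropped, and the resulting list is named by the `tissue_series` combination, which mirrors how the tissue folders are named.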

In [8]:
# Build list of count matrices by tissue

all.samples<-lapply(cts.list,colnames)%>%
             unlist()
cts.df<-tissue_annot %>%
        filter(sample %in% all.samples) %>%
        group_by(tissue,series) %>%
        dplyr::summarize(samples=list(sample),
                         nsamples=length(sample)) %>%
        mutate(data=purrr::map2(tissue,
                                samples,
                                function(tis,sam,...){
                                   # drop = FALSE keeps the matrix shape even if a group has a single sample
                                   as.data.frame(cts.list[[tis]][,unlist(sam),drop=FALSE])
                                })
              ) %>%
        filter(nsamples>=min_sam_tissue) %>% # threshold defined with the filtering parameters
        mutate(tissue_series=paste(tissue,series,sep="_"))
cts.list<-cts.df$data
names(cts.list)<-cts.df$tissue_series

[1m[22m`summarise()` has grouped output by 'tissue'. You can override using the `.groups` argument.


In [9]:
# Filter gene expression matrices 
cts.list.filt <- filter_tissue_expr_list(cts_list = cts.list,
                                         min_counts = min_cts,
                                         min_samples = min_sam_cts,
                                         gu_file = gene_file)
names(cts.list.filt$filtered_cts) <- names(cts.list)

Gene universe contains 20590 genes

Wrote gene universe to file ../input/gene_expression/HGU133plus2_gfilter/genes_HGU133plus2_gfilter.csv

Returning filtered matrices and gene universe



In [10]:
# Check that all matrices are correct
lapply(cts.list.filt$filtered_cts,function(mat){
    sum(is.na(mat))
}) %>% 
unlist() %>% 
sum()

In [11]:
# Create tissue files - written automatically by the function
res <- tissue_list_to_dirs(cts_list = cts.list.filt$filtered_cts,
                           output_dir = output_dir,
                           meta_file = metadata_file)

In [12]:
# Verify that the matrices have the correct dimensions and the splitting was successful
str(res)

List of 2
 $ ndirs: int 134
 $ sizes:List of 134
  ..$ accumbens_GSE7307                                                 : int [1:2] 20590 14
  ..$ adipose_tissue_GSE13070                                           : int [1:2] 20590 34
  ..$ adipose_tissue_GSE13506                                           : int [1:2] 20590 49
  ..$ adipose_tissue_GSE28005                                           : int [1:2] 20590 13
  ..$ adipose_tissue_subcutaneous_GSE17170                              : int [1:2] 20590 25
  ..$ adipose_tissue_subcutaneous_GSE26339                              : int [1:2] 20590 11
  ..$ adipose_tissue_subcutaneous_GSE27949                              : int [1:2] 20590 11
  ..$ adrenal_gland_cortex_GSE10927                                     : int [1:2] 20590 10
  ..$ airway_epithelial_cells_GSE11784                                  : int [1:2] 20590 40
  ..$ airway_epithelial_cells_GSE13933                                  : int [1:2] 20590 11
  ..$ aortic_valve_GS