# About the notebook
[Back to the topic](pathway_toc.ipynb)

We are in step 04. Now that the expression matrix and the list of gene sets are prepared, we can perform gene set analysis. gage is a widely used and flexible package for conducting gene set analyses of RNA-Seq data. **There are several technical aspects of gage that could be improved (it makes very strong assumptions).** We should expect that, over time, new methods will be published that address these technical issues.


<img src="./fig/03 pathway analysis steps.png">

----

# Set up environment

In [33]:
source("Pathway_config.R")
source("Pathway_util.R")

# Read in data

Import count table

In [34]:
attach(file.path(OUTDIR, "dds_rld.RData"))

The following objects are masked from file:/home/jovyan/work/scratch/analysis_output/out/dds_rld.RData (pos = 4):

    dds_add, rld_add



Import pathway information

In [51]:
attach(file.path(OUTDIR, "genesets_cne_h99.RData"))

The following object is masked from file:/home/jovyan/work/scratch/analysis_output/out/genesets_cne_h99.RData (pos = 3):

    genesets_cne_h99

The following object is masked from file:/home/jovyan/work/scratch/analysis_output/out/genesets_cne_h99.RData (pos = 5):

    genesets_cne_h99

The following object is masked from file:/home/jovyan/work/scratch/analysis_output/out/genesets_cne_h99.RData (pos = 7):

    genesets_cne_h99
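The repeated "masked" messages above most likely appear because the same `.RData` file was attached more than once in this session, so several copies of the object sit on the search path. As a minimal sketch of one alternative (not what this notebook does), the objects can be loaded into a dedicated environment with base R's `load()`. The file and object names below are toy stand-ins, not the real analysis outputs.

```r
# Toy stand-in for an .RData file written by an earlier step (hypothetical content)
toy_file <- tempfile(fileext = ".RData")
toy_genesets <- list(setA = c("g1", "g2"), setB = c("g2", "g3", "g4"))
save(toy_genesets, file = toy_file)

# Load into a dedicated environment instead of attaching to the search path.
# Re-running this cell replaces the objects rather than stacking masked copies.
toy_env <- new.env()
loaded_names <- load(toy_file, envir = toy_env)

# load() returns the names of the objects it restored
print(loaded_names)

# Retrieve an object from the environment by name
toy_genesets_loaded <- get("toy_genesets", envir = toy_env)
str(toy_genesets_loaded)
```

With this pattern the objects are reached through `toy_env` (for example `toy_env$toy_genesets`), so nothing is added to the search path.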



In [52]:
genesets_cne_h99[1]

# Extract the fold change

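There are two ways to use the gage function. One is to supply the log2 fold changes of genes from a comparison of two groups of samples; the other is to supply the expression matrix together with the column indices of the reference and target samples.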

In [37]:
head(colData(dds_add), 2)

DataFrame with 2 rows and 11 columns
              Label   Strain    Media experiment_person libprep_person
        <character> <factor> <factor>          <factor>       <factor>
1_RZ_J       1_RZ_J      H99      YPD              expA          prepB
10_RZ_C     10_RZ_C    mar1d      YPD              expA          prepA
        enrichment_method prob.gene prob.nofeat prob.unique     depth
                 <factor> <numeric>   <numeric>   <numeric> <numeric>
1_RZ_J                 RZ 0.6689001   0.2170956   0.8859957   3541358
10_RZ_C                RZ 0.7497438   0.2006517   0.9503955   1742594
        sizeFactor
         <numeric>
1_RZ_J   1.3586026
10_RZ_C  0.8098675

Get the fold change of medium TC over YPD.

In [38]:
### Get results from DESeq2 DE analysis
### contrast = c(factor, numerator, denominator), so this is TC over YPD
ddsres_add_media <- DESeq2::results(dds_add, contrast = c("Media", "TC", "YPD"))

### Extract the estimated fold changes
ddsfc_add_media  <- ddsres_add_media$log2FoldChange

### Assign the gene name to the fold change vector
names(ddsfc_add_media) <- rownames(ddsres_add_media)

In [39]:
head(ddsfc_add_media)
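The sign of a log2 fold change depends on which level is the numerator. DESeq2 documents `contrast` as `c(factor, numerator, denominator)`. The sketch below uses made-up mean counts for hypothetical genes and a plain ratio of means, which is only an illustration of the sign convention and not how DESeq2 estimates fold changes.

```r
# Made-up mean normalized counts for three hypothetical genes
mean_tc  <- c(geneA = 400, geneB = 50,  geneC = 100)
mean_ypd <- c(geneA = 100, geneB = 200, geneC = 100)

# TC over YPD: positive values mean higher expression in TC
lfc_tc_over_ypd <- log2(mean_tc / mean_ypd)

# YPD over TC: same magnitudes, opposite signs
lfc_ypd_over_tc <- log2(mean_ypd / mean_tc)

print(lfc_tc_over_ypd)
print(lfc_ypd_over_tc)
```

Swapping the two levels flips every sign, which in turn swaps which gene sets appear as up-regulated or down-regulated downstream.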

Get the fold change of strain mar1d over H99.

In [40]:
### Get results from DESeq2 DE analysis
### contrast = c(factor, numerator, denominator), so this is mar1d over H99
ddsres_add_strain <- DESeq2::results(dds_add, contrast = c("Strain", "mar1d", "H99"))

### Extract the estimated fold changes
ddsfc_add_strain  <- ddsres_add_strain$log2FoldChange

### Assign the gene name to the fold change vector
names(ddsfc_add_strain) <- rownames(ddsres_add_strain)

In [41]:
head(ddsfc_add_strain)

# Pathway analysis performed using gage package

Calculate pathway-level statistics using the gage package. For details of the gage method, see the [package documentation](https://bioconductor.org/packages/release/bioc/manuals/gage/man/gage.pdf) and [the gage paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-161).

```
### Notes
### This example uses the fold changes estimated by DESeq2 for the inference.
### Accordingly, use.fold is set to TRUE and the indices for the ref and target
### samples are set to NULL. The theory behind using fold changes is iffy.
### Also, set.size puts limits on the min and max gene set sizes. This is a
### tuning parameter and our choices are arbitrary.
### Finally, same.dir = TRUE tests whether the genes within a gene set change
### in the same direction.

### ddsfc_add_media is the named fold change vector extracted above
gageres <- gage::gage(ddsfc_add_media,
                      gsets = genesets_cne_h99,
                      use.fold = TRUE,
                      ref = NULL,
                      samp = NULL,
                      set.size = c(10, 500),
                      same.dir = TRUE)
```
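To make the effect of the `set.size` limits concrete, here is a minimal base R sketch of size filtering. It assumes the size that matters is the number of set members actually present in the data (the "effective" size), and it uses hypothetical gene sets and smaller limits than the `c(10, 500)` used here. It illustrates the idea only and is not gage's internal code.

```r
# Hypothetical gene sets and measured genes
toy_genesets <- list(
    small_set  = c("g1", "g2"),
    medium_set = c("g1", "g2", "g3", "g4", "g99"),  # g99 is not measured
    large_set  = paste0("g", 1:8)
)
measured_genes <- paste0("g", 1:10)

# Effective size = number of set members present in the data
effective_size <- sapply(toy_genesets, function(gs) sum(gs %in% measured_genes))

# Keep sets whose effective size lies within [min, max]
size_limits <- c(3, 6)
kept <- names(effective_size)[effective_size >= size_limits[1] &
                              effective_size <= size_limits[2]]

print(effective_size)
print(kept)
```

Widening or narrowing `size_limits` changes which sets are tested at all, which is why the choice of limits is a tuning decision.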

In [44]:
### Geneset analysis using the "microarray" approach. We will use rlog
### transformed expressions for the purpose of this demonstration

gageres_media <- gage::gage(
    assay(rld_add),
    gsets = genesets_cne_h99,
    use.fold = FALSE,

    ### a numeric vector of column numbers for
    ### the reference condition or phenotype
    ### (i.e. the control group)
    ref = which(colData(rld_add)[["Media"]] == "YPD"),

    ### a numeric vector of column numbers for
    ### the target condition or phenotype
    ### (i.e. the experiment group)
    samp = which(colData(rld_add)[["Media"]] == "TC"),
    compare = "unpaired",

    ### limits on the effective gene set size,
    ### i.e. the number of genes included in the gene set test
    set.size = c(10, 500),

    ### test for changes in one direction at a time;
    ### results are reported separately in "greater" and "less"
    same.dir = TRUE)

In [45]:
gageres_strain <- gage::gage(
    assay(rld_add),
    gsets = genesets_cne_h99,
    use.fold = FALSE,

    ### a numeric vector of column numbers for
    ### the reference condition or phenotype
    ### (i.e. the control group)
    ref = which(colData(rld_add)[["Strain"]] == "H99"),

    ### a numeric vector of column numbers for
    ### the target condition or phenotype
    ### (i.e. the experiment group)
    samp = which(colData(rld_add)[["Strain"]] == "mar1d"),
    compare = "unpaired",

    ### limits on the effective gene set size,
    ### i.e. the number of genes included in the gene set test
    set.size = c(10, 500),

    ### test for changes in one direction at a time;
    ### results are reported separately in "greater" and "less"
    same.dir = TRUE)

The contents of the gage results

In [46]:
print(class(gageres_media))
print(names(gageres_media))

[1] "list"
[1] "greater" "less"    "stats"  


Let's take a look at the elements of the results.

In [49]:
gageres_media$greater %>% head(2)

| | p.geomean | stat.mean | p.val | q.val | set.size | exp1 |
|---|---|---|---|---|---|---|
| ec00053 \| Ascorbate and aldarate metabolism | 4.502818e-06 | 4.561043 | 4.502818e-06 | 0.001328574 | 100 | 4.502818e-06 |
| ec00051 \| Fructose and mannose metabolism | 1.034578e-05 | 4.365729 | 1.034578e-05 | 0.001328574 | 103 | 1.034578e-05 |


In [19]:
gageres_media$less %>% head(2)

| | p.geomean | stat.mean | p.val | q.val | set.size | exp1 |
|---|---|---|---|---|---|---|
| PWY-3781 \| aerobic respiration I (cytochrome c) | 2.54021e-11 | -7.647006 | 2.54021e-11 | 2.156638e-08 | 43 | 2.54021e-11 |
| ec00190 \| Oxidative phosphorylation | 1.274301e-10 | -6.810042 | 1.274301e-10 | 5.409407e-08 | 72 | 1.274301e-10 |
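A common next step is to keep only the gene sets below a q-value cutoff. The sketch below works on a toy matrix shaped like `gageres_media$less`: the first two rows copy values from the output above, and the third row is a made-up set with missing statistics. It assumes that sets excluded by the size limits show up as `NA` rows, and the 0.05 cutoff is an arbitrary choice for illustration.

```r
# Toy matrix with a subset of the columns shown above
toy_less <- rbind(
    "PWY-3781 | aerobic respiration I (cytochrome c)" =
        c(-7.647006, 2.54021e-11, 2.156638e-08, 43),
    "ec00190 | Oxidative phosphorylation" =
        c(-6.810042, 1.274301e-10, 5.409407e-08, 72),
    "toy_set | made-up set without statistics" =
        c(NA, NA, NA, 5)
)
colnames(toy_less) <- c("stat.mean", "p.val", "q.val", "set.size")

# Arbitrary cutoff chosen for this illustration
q_cutoff <- 0.05

# Drop NA rows explicitly so they are not carried along by the comparison
is_sig <- !is.na(toy_less[, "q.val"]) & toy_less[, "q.val"] < q_cutoff

# drop = FALSE keeps the result a matrix even if only one row survives
sig_less <- toy_less[is_sig, , drop = FALSE]
print(rownames(sig_less))
```

The same filter could be applied to the `greater` element to list the sets that are up-regulated in the target group.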


# Store the results

In [50]:
outfile <- file.path(OUTDIR, "res_gage.RData")
save(gageres_media, 
     gageres_strain,
     file = outfile)