# Statistical analysis
This notebook performs differential expression analysis on the real template experiment and on the simulated experiments, which serve as a null set. The differentially expressed genes (DEGs) obtained from this analysis are then used in gene set enrichment analysis (GSEA) to identify pathways enriched in each set of DEGs.

In [1]:
%load_ext autoreload
%load_ext rpy2.ipython
%autoreload 2

import os
import sys
import pandas as pd
import numpy as np
import random
import rpy2

sys.path.append("../")
from functions import utils

from numpy.random import seed
randomState = 123
seed(randomState)

In [2]:
# Read in config variables
base_dir = os.path.abspath(os.path.join(os.getcwd(),"../"))

config_file = os.path.abspath(os.path.join(base_dir,
                                           "Rank_pathways",
                                           "init_config.tsv"))
params = utils.read_config(config_file)

In [3]:
# User selected experiment id
project_id = "SRP000762"

In [4]:
# Load params
local_dir = params["local_dir"]
dataset_name = params['dataset_name']
num_runs = 25  # number of simulated experiments to analyze

In [5]:
# Load real template experiment
template_data_file = os.path.join(
    local_dir,
    "recount2_template_data.tsv")

# Load metadata file with grouping assignments for samples
metadata_file = os.path.join(
    base_dir,
    "Rank_pathways",
    "data",
    "metadata",
    "SRP000762_groups.tsv")

## Differential expression analysis

In [6]:
%%R
# Select 59
# Run one time
#if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
#BiocManager::install("limma")

NULL


In [7]:
%%R
library('limma')

**Get differentially expressed genes from template experiment**

In [10]:
%%R -i metadata_file -i project_id -i template_data_file -i local_dir

source('../functions/DE_analysis.R')

get_DE_stats(metadata_file,
             project_id, 
             template_data_file,
             "template",
             local_dir,
             "real")

[1] 60


**Get differentially expressed genes from each simulated experiment**

In [12]:
%%R -i metadata_file -i project_id -i base_dir -i local_dir -i num_runs

source('../functions/DE_analysis.R')

num_sign_DEGs_simulated <- c()

for (i in 0:(num_runs-1)){
  simulated_data_file <- paste(local_dir, "pseudo_experiment/selected_simulated_data_", project_id, "_", i, ".txt", sep="")
  cat(paste("running file: ", simulated_data_file, "...\n", sep=""))
  
  run_output <- get_DE_stats(metadata_file,
                             project_id, 
                             simulated_data_file,
                             "simulated",
                             local_dir,
                             i)
  
  num_sign_DEGs_simulated <- c(num_sign_DEGs_simulated, run_output)
}
median(num_sign_DEGs_simulated)

running file: /home/alexandra/Documents/Data/Rank_pathways/pseudo_experiment/selected_simulated_data_SRP000762_0.txt...
running file: /home/alexandra/Documents/Data/Rank_pathways/pseudo_experiment/selected_simulated_data_SRP000762_1.txt...
running file: /home/alexandra/Documents/Data/Rank_pathways/pseudo_experiment/selected_simulated_data_SRP000762_2.txt...
running file: /home/alexandra/Documents/Data/Rank_pathways/pseudo_experiment/selected_simulated_data_SRP000762_3.txt...
running file: /home/alexandra/Documents/Data/Rank_pathways/pseudo_experiment/selected_simulated_data_SRP000762_4.txt...
running file: /home/alexandra/Documents/Data/Rank_pathways/pseudo_experiment/selected_simulated_data_SRP000762_5.txt...
running file: /home/alexandra/Documents/Data/Rank_pathways/pseudo_experiment/selected_simulated_data_SRP000762_6.txt...
running file: /home/alexandra/Documents/Data/Rank_pathways/pseudo_experiment/selected_simulated_data_SRP000762_7.txt...
running file: /home/alexandra/Documents/
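
The simulated runs serve as a null set, so one way to use them is to ask how often a simulated experiment yields at least as many DEGs as the template experiment. The sketch below is a minimal illustration of that comparison, not part of the original analysis: `empirical_exceedance` is a hypothetical helper, and the simulated counts are made-up placeholders rather than output from this notebook.

```python
import numpy as np


def empirical_exceedance(observed, null_counts):
    """Fraction of null runs with a DEG count >= the observed count.

    Adds 1 to numerator and denominator so the estimate is never exactly 0.
    """
    null_counts = np.asarray(null_counts)
    return (np.sum(null_counts >= observed) + 1) / (null_counts.size + 1)


# Placeholder counts for illustration only (NOT results from this notebook)
placeholder_null = [0, 2, 5, 1, 0, 3, 8, 0, 1, 4]

# Number of significant DEGs reported above for the template experiment
template_count = 60

print("median null DEG count:", np.median(placeholder_null))
print("empirical exceedance:", empirical_exceedance(template_count, placeholder_null))
```

In practice the placeholder list would be replaced by the per-run counts collected in `num_sign_DEGs_simulated`.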

**Check**

As a check, we compared the number of DEGs identified here with the number reported in the [Reddy et al. publication](https://www.ncbi.nlm.nih.gov//pubmed/19801529), which found:
* 234 genes with a significant (FDR < 5%) change in expression in response to DEX treatment.
* After removing pseudogenes (listed in Supplemental Table S3), 209 differentially expressed genes remained.
* Of the DEX-responsive genes, more showed increases in transcript levels (123, 59%) than decreases (86, 41%), and the up-regulation was slightly but significantly stronger than the down-regulation.

By comparison:
* Our analysis found 60 DEGs instead of 234.
* Spot-checking our DEGs against their list of DEX-responsive genes (Supplementary Dataset 2), we found the same genes with a consistent FC direction, though their magnitudes were lower than ours.
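
The spot check above could be made systematic along the following lines. This is only a sketch under stated assumptions: `compare_to_published` is a hypothetical helper, both tables are assumed to have a `logFC` column with gene IDs as the index, and the toy values are placeholders, not values from Supplementary Dataset 2.

```python
import pandas as pd


def compare_to_published(ours, published, fc_col="logFC"):
    """Return the shared genes and the fraction whose fold-change sign agrees."""
    ours = ours.copy()
    # Strip version suffixes (e.g. "GENE_A.3" -> "GENE_A") so IDs can be matched
    ours.index = [gene_id.split(".")[0] for gene_id in ours.index]

    shared = ours.index.intersection(published.index)
    if len(shared) == 0:
        return shared, float("nan")

    same_sign = (ours.loc[shared, fc_col] > 0) == (published.loc[shared, fc_col] > 0)
    return shared, float(same_sign.mean())


# Toy placeholder tables (hypothetical gene IDs and fold changes)
ours = pd.DataFrame(
    {"logFC": [2.0, -1.5, 0.7]},
    index=["GENE_A.3", "GENE_B.1", "GENE_C.2"],
)
published = pd.DataFrame(
    {"logFC": [0.4, 0.3, -1.0]},
    index=["GENE_A", "GENE_B", "GENE_D"],
)

shared, concordance = compare_to_published(ours, published)
print(sorted(shared), concordance)
```

The helper assumes that gene IDs are unique once the version suffix is removed; duplicated IDs would need to be resolved first.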

In [21]:
# Load association statistics for template experiment
template_DE_stats_file = os.path.join(
    local_dir,
    "DE_stats",
    "DE_stats_template_data_"+project_id+"_real.txt")

In [35]:
template_DE_stats = pd.read_csv(
    template_DE_stats_file,
    header=0,
    sep='\t',
    index_col=0)

print(template_DE_stats[template_DE_stats['adj.P.Val']<0.05].shape)
template_DE_stats[template_DE_stats['adj.P.Val']<0.05]

(60, 6)


| | logFC | AveExpr | t | P.Value | adj.P.Val | B |
|---|---|---|---|---|---|---|
| ENSG00000067082.14 | 19.385246 | 18.673139 | 14.055682 | 2.34201e-12 | 1.359233e-07 | 9.739709 |
| ENSG00000177606.6 | -8.236487 | 9.185753 | -13.126898 | 8.859555e-12 | 2.57091e-07 | 9.292456 |
| ENSG00000114315.3 | -6.285228 | 6.042663 | -12.840532 | 1.355507e-11 | 2.622318e-07 | 9.141666 |
| ENSG00000185950.8 | 8.790781 | 17.705828 | 12.594113 | 1.96614e-11 | 2.852721e-07 | 9.006626 |
| ENSG00000095752.6 | -7.50579 | 7.299177 | -12.204218 | 3.582671e-11 | 4.15855e-07 | 8.782431 |
| ENSG00000116285.12 | 46.615799 | 44.6677 | 11.279883 | 1.577568e-10 | 1.525955e-06 | 8.194544 |
| ENSG00000155090.14 | -10.7781 | 12.558313 | -10.361364 | 7.527217e-10 | 5.66231e-06 | 7.521248 |
| ENSG00000163545.8 | -8.541758 | 8.566756 | -10.340694 | 7.805104e-10 | 5.66231e-06 | 7.504966 |
| ENSG00000113916.17 | 4.155621 | 4.467837 | 9.972028 | 1.502563e-09 | 9.689362e-06 | 7.205605 |
| ENSG00000204055.4 | -5.577962 | 6.383792 | -9.523395 | 3.407894e-09 | 1.97784e-05 | 6.817415 |
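
The `adj.P.Val` column holds p-values adjusted for multiple testing. limma's `topTable` uses the Benjamini-Hochberg (BH) method by default; whether `get_DE_stats` keeps that default is an assumption here, since `DE_analysis.R` is not shown. The first rows are at least consistent with it: the ratio of `adj.P.Val` to `P.Value` in the top row is about 58,000, which is what BH gives for the smallest of roughly 58,000 p-values. A minimal NumPy sketch of the BH adjustment, for illustration only:

```python
import numpy as np


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)

    # Scale the i-th smallest p-value by n / i
    scaled = p[order] * n / np.arange(1, n + 1)

    # Enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]

    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


example = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(example)
```

Filtering on `adj.P.Val < 0.05`, as done above, then corresponds to controlling the false discovery rate at 5%.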


In [23]:
#%%R
#BiocManager::install('EnhancedVolcano')
#devtools::install_github('kevinblighe/EnhancedVolcano')

In [24]:
%%R
library(EnhancedVolcano)

In [26]:
%%R -i project_id -i template_DE_stats_file -i local_dir

source('../functions/DE_analysis.R')

create_volcano(template_DE_stats_file,
               project_id,
               "adj.P.Val",
               local_dir)

## Gene set enrichment analysis
Use DE association statistics to rank pathways that are enriched....
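
This section of the notebook is not filled in, so the following is only a sketch of the general idea behind a pre-ranked enrichment analysis, not the method used here. It assumes genes are ranked by the `t` column of the DE statistics and uses an unweighted running-sum score; `enrichment_score` and the toy gene sets are hypothetical. Standard GSEA additionally weights hits by the ranking statistic and assesses significance by permutation.

```python
import numpy as np
import pandas as pd


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    in_set = np.array([gene in gene_set for gene in ranked_genes])
    n_hit = int(in_set.sum())
    n_miss = in_set.size - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0

    # Step up at each gene in the set, step down otherwise
    steps = np.where(in_set, 1.0 / n_hit, -1.0 / n_miss)
    running = np.cumsum(steps)

    # The score is the running-sum value with the largest absolute deviation
    return float(running[np.argmax(np.abs(running))])


# Toy DE statistics (placeholder genes and t-statistics)
toy_stats = pd.DataFrame(
    {"t": [5.0, 4.0, 1.0, -1.0, -3.0, -6.0]},
    index=["g1", "g2", "g3", "g4", "g5", "g6"],
)
ranked = toy_stats.sort_values("t", ascending=False).index.tolist()

es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
print(es_top, es_bottom)
```

A positive score indicates that the set is concentrated among up-regulated genes, and a negative score that it is concentrated among down-regulated genes.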

**Check**

Again, we want to compare the pathways we rank against those reported in the original publication:

*The DEX-responsive genes that we identified are primarily implicated in two broad classes: stress response and development (Table 1; Reimand et al. 2007). Comparison to Gene Ontology (GO) categories (Ashburner et al. 2000) revealed that the identified genes are involved in stress response (P = 6 × 10⁻¹¹), organ development (P = 5 × 10⁻¹⁵), cell differentiation (P = 1 × 10⁻¹¹), hormone secretion (P = 4 × 10⁻⁷), and apoptosis (P = 5 × 10⁻¹²).*