# Application: new experiment

This notebook lets users identify genes that are specific to their experiment of interest, using an existing VAE model.

This notebook generates a `generic_gene_summary_<experiment id>.tsv` file containing a z-score per gene that indicates how specific that gene is to the experiment in question.

In [1]:
%load_ext autoreload
%load_ext rpy2.ipython
%autoreload 2

In [2]:
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from ponyo import utils
from generic_expression_patterns_modules import process, new_experiment_process, stats, ranking

examples.directory is deprecated; in the future, examples will be found relative to the 'datapath' directory.
  "found relative to the 'datapath' directory.".format(key))
Using TensorFlow backend.


## User input

The user needs to define the following in the [config file](../configs/config_new_experiment.tsv):

1. Template experiment. This is the experiment you are interested in studying.
2. Training compendium used to train the VAE, including the unnormalized gene-mapped version and the normalized version.
3. Scaler transform used to normalize the training compendium.
4. Directory containing the trained VAE model.
5. Experiment id used to label the newly created simulated experiments.

The user also needs to provide metadata files:
1. `<experiment id>_process_samples.tsv` contains 2 columns: sample ids, and a label indicating whether the sample is kept or removed. See [example](data/metadata/cis-gem-par-KU1919_process_samples.tsv). **Note: This file is not required if the user wishes to use all the samples in the template experiment file.**
2. `<experiment id>_groups.tsv` contains 2 columns: sample ids, and the group label used to perform the DE analysis. See [example](data/metadata/cis-gem-par-KU1919_groups.tsv).
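
As a rough illustration of the two metadata files, the sketch below builds both tables in memory with pandas. The column names (`sample`, `processing`, `group`), the sample ids, and the label values (`keep`, `drop`, `1`, `2`) are assumptions made for this example; the linked example files show the actual layout.

```python
import io
import pandas as pd

# Hypothetical contents of `<experiment id>_process_samples.tsv`:
# one row per sample, plus a label saying whether to keep it.
process_tsv = "sample\tprocessing\nS1\tkeep\nS2\tkeep\nS3\tdrop\nS4\tkeep\n"

# Hypothetical contents of `<experiment id>_groups.tsv`:
# one row per kept sample, plus its group for the DE comparison.
groups_tsv = "sample\tgroup\nS1\t1\nS2\t1\nS4\t2\n"

process_df = pd.read_csv(io.StringIO(process_tsv), sep="\t", index_col=0)
groups_df = pd.read_csv(io.StringIO(groups_tsv), sep="\t", index_col=0)

# Samples labelled "keep" should be exactly the samples that have a group.
kept_samples = process_df.index[process_df["processing"] == "keep"]
assert set(kept_samples) == set(groups_df.index)
```

A consistency check like the final assertion can catch a mismatch between the two files before any differential expression analysis is run.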

In [3]:
# Read in config variables
base_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))

config_filename = os.path.abspath(
    os.path.join(base_dir, "configs", "config_new_experiment.tsv")
)

params = utils.read_config(config_filename)

In [4]:
# Load config params

# Local directory to store intermediate files
local_dir = params['local_dir']

# Number of simulated experiments to generate
num_runs = params['num_simulated']

# Directory containing trained VAE model
vae_model_dir = params['vae_model_dir']

# Dimension of latent space used in VAE model
latent_dim = params['latent_dim']

# ID for template experiment
# This ID will be used to label new simulated experiments
project_id = params['project_id']

# Template experiment filename
template_filename = params['raw_template_filename']
mapped_template_filename = params['mapped_template_filename']
normalized_template_filename = params['normalized_template_filename']
processed_template_filename = params['processed_template_filename']

# Training dataset used for existing VAE model
mapped_compendium_filename = params['mapped_compendium_filename']

# Normalized compendium filename
normalized_compendium_filename = params['normalized_compendium_filename']

# Scaler transform used to scale compendium data into 0-1 range for training
scaler_filename = params['scaler_filename']

# Test statistic used to rank genes by
col_to_rank_genes = params['rank_genes_by']

# Minimum mean count per gene
count_threshold = params['count_threshold']

In [5]:
# Load metadata files

# Load metadata file with processing information
sample_id_metadata_filename = os.path.join(
    "data",
    "metadata",
    f"{project_id}_process_samples.tsv"
)

# Load metadata file with grouping assignments for samples
metadata_filename = os.path.join(
    "data",
    "metadata",
    f"{project_id}_groups.tsv"
)

In [6]:
# Output filename
gene_summary_filename = f"generic_gene_summary_{project_id}.tsv"

## Map template experiment to same feature space as training compendium

To simulate a new gene expression experiment, we first need to encode the template experiment into the learned latent space. This requires that the feature space (i.e. the genes) of the template experiment match the features of the compendium used to train the VAE model. The following cells process the template experiment into the expected input format:
* The template data is expected to be a sample x gene matrix.
* The template experiment is expected to have the same genes as the compendium. Genes that are in the template experiment but not in the compendium are removed. Genes that are in the compendium but missing from the template experiment are added, with their expression set to the median expression of that gene across the compendium samples.
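
The gene-matching rule described above can be sketched with plain pandas. This is only an illustration on toy data, not the code inside `new_experiment_process.process_template_experiment`; the gene names, sample names, and values are made up.

```python
import pandas as pd

# Toy compendium (sample x gene); values are made up for illustration.
compendium = pd.DataFrame(
    {
        "geneA": [1.0, 3.0, 5.0],
        "geneB": [2.0, 4.0, 9.0],
        "geneC": [0.0, 10.0, 20.0],
    },
    index=["c1", "c2", "c3"],
)

# Toy template (sample x gene): has an extra gene (geneX) and lacks geneC.
template = pd.DataFrame(
    {"geneA": [7.0, 8.0], "geneX": [1.0, 1.0], "geneB": [5.0, 6.0]},
    index=["t1", "t2"],
)

# Reindexing drops genes absent from the compendium (geneX) and adds the
# missing ones (geneC) as NaN, in the compendium's column order.
mapped_template = template.reindex(columns=compendium.columns)

# Fill each missing gene with that gene's median across compendium samples.
mapped_template = mapped_template.fillna(compendium.median())
```

After these two steps the template has exactly the compendium's genes, in the same order, which is what an encoder trained on the compendium expects as input.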

In [7]:
# Template experiment needs to be of the form sample x gene
template_filename_only = template_filename.split("/")[-1].split(".")[0]
transposed_template_filename = os.path.join(local_dir, template_filename_only+"_transposed.txt")

new_experiment_process.transpose_save(template_filename, transposed_template_filename)

In [8]:
new_experiment_process.process_template_experiment(
    transposed_template_filename,
    mapped_compendium_filename,
    scaler_filename,
    mapped_template_filename,
    normalized_template_filename,
)

(72, 58528)
(49651, 17755)


## Simulate experiments based on template experiment

Embed the template experiment into the learned latent space, then linearly shift it to different locations in that space to create new simulated experiments.
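
The idea of a linear shift in latent space can be sketched with NumPy. The `encode` and `decode` functions below are hypothetical stand-ins (a random linear projection and its pseudo-inverse), not the trained VAE, and drawing the new location from a normal distribution fitted to the encoded compendium is an assumption of this sketch rather than a description of `embed_shift_template_experiment`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained encoder/decoder pair: a random linear projection
# and its pseudo-inverse. A real VAE would replace these two functions.
n_genes, latent_dim = 6, 2
weights = rng.normal(size=(n_genes, latent_dim))


def encode(x):
    return x @ weights


def decode(z):
    return z @ np.linalg.pinv(weights)


# Toy normalized data (sample x gene) in the 0-1 range.
compendium = rng.uniform(size=(50, n_genes))
template = rng.uniform(size=(4, n_genes))

encoded_compendium = encode(compendium)
encoded_template = encode(template)

# Pick a new location in latent space, here drawn from a normal
# distribution fitted to the encoded compendium (an assumption).
new_centroid = rng.normal(
    loc=encoded_compendium.mean(axis=0),
    scale=encoded_compendium.std(axis=0),
)

# Shift every template sample by the same vector, so the experiment moves
# as a unit and the relationships between its samples are preserved.
shift = new_centroid - encoded_template.mean(axis=0)
shifted_template = encoded_template + shift

# Decode back to gene space to obtain one simulated experiment.
simulated_experiment = decode(shifted_template)
```

Because every sample is moved by the same vector, the simulated experiment keeps the template's internal structure (the offsets between samples) while sitting in a different region of the latent space.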

In [9]:
# Simulate experiments based on template experiment
normalized_compendium_data = pd.read_csv(normalized_compendium_filename, sep="\t", index_col=0, header=0)
normalized_template_data = pd.read_csv(normalized_template_filename, sep="\t", index_col=0, header=0)

for run_id in range(num_runs):
    new_experiment_process.embed_shift_template_experiment(
        normalized_compendium_data,
        normalized_template_data,
        vae_model_dir,
        project_id,
        scaler_filename,
        local_dir,
        latent_dim,
        run_id
    )

Instructions for updating:
If using Keras pass *_constraint arguments to layers.



## Process template and simulated experiments

* Remove samples not required for comparison
* Make sure ordering of samples matches metadata for proper comparison
* Make sure values are cast as integers if using DESeq
* Filter lowly expressed genes if using DESeq
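
A minimal pandas sketch of these four steps for the DESeq case is shown below. It is not the implementation of `stats.process_samples_for_DESeq`; the sample ids, values, the threshold of 1, and the use of `>=` for the filter are all assumptions made for illustration.

```python
import pandas as pd

# Toy expression matrix (sample x gene) with non-integer values.
# S9 is a sample that is not part of the comparison.
expression = pd.DataFrame(
    {
        "geneA": [10.4, 12.6, 7.0, 9.9],
        "geneB": [0.2, 0.0, 0.3, 0.4],
        "geneC": [55.1, 60.7, 1.0, 58.2],
    },
    index=["S3", "S1", "S9", "S2"],
)

# Hypothetical grouping metadata; its row order defines the expected order.
groups = pd.DataFrame({"group": [1, 1, 2]}, index=["S1", "S2", "S3"])

# Keep only samples that appear in the metadata, in the metadata's order.
expression = expression.loc[groups.index]

# Round and cast to integers, since DESeq2 expects count data.
counts = expression.round().astype(int)

# Drop genes whose mean count falls below the threshold (assumed value).
count_threshold = 1
counts = counts.loc[:, counts.mean(axis=0) >= count_threshold]
```

Here `S9` is removed, the rows are reordered to match the metadata, and `geneB` is filtered out because its mean count is below the threshold.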

In [10]:
if "human_general_analysis" in vae_model_dir:
    method = "deseq"
else:
    method = "limma"

In [11]:
if not os.path.exists(sample_id_metadata_filename):
    sample_id_metadata_filename = None
    
if method == "deseq":
    stats.process_samples_for_DESeq(
        mapped_template_filename,
        metadata_filename,
        processed_template_filename,
        count_threshold,
        sample_id_metadata_filename,
    )

    for i in range(num_runs):
        simulated_filename = os.path.join(
            local_dir,
            "pseudo_experiment",
            f"selected_simulated_data_{project_id}_{i}.txt"
        )
        out_simulated_filename = os.path.join(
            local_dir,
            "pseudo_experiment",
            f"selected_simulated_data_{project_id}_{i}_processed.txt"
        )
        stats.process_samples_for_DESeq(
            simulated_filename,
            metadata_filename,
            out_simulated_filename,
            count_threshold,
            sample_id_metadata_filename,
        )
else:
    stats.process_samples_for_limma(
        mapped_template_filename,
        metadata_filename,
        processed_template_filename,
        sample_id_metadata_filename,
    )

    for i in range(num_runs):
        simulated_filename = os.path.join(
            local_dir,
            "pseudo_experiment",
            f"selected_simulated_data_{project_id}_{i}.txt"
        )
        stats.process_samples_for_limma(
            simulated_filename,
            metadata_filename,
            None,
            sample_id_metadata_filename,
        )

sample ids are ordered correctly
sample ids are ordered correctly
sample ids are ordered correctly


## Differential expression analysis

* If the data is RNA-seq, use DESeq2 (selected when using the human_general_analysis model)
* If the data is microarray, use Limma (selected when using the human_cancer_analysis or pseudomonas_analysis models)

In [12]:
# Create subdirectory: "<local_dir>/DE_stats/"
os.makedirs(os.path.join(local_dir, "DE_stats"), exist_ok=True)

In [13]:
%%R -i metadata_filename -i project_id -i processed_template_filename -i local_dir -i base_dir -i method

source(paste0(base_dir, '/generic_expression_patterns_modules/DE_analysis.R'))

# File created: "<local_dir>/DE_stats/DE_stats_template_data_<project_id>_real.txt"
if (method == "deseq"){
    get_DE_stats_DESeq(
        metadata_filename,
        project_id, 
        processed_template_filename,
        "template",
        local_dir,
        "real"
    )
}
else{
    get_DE_stats_limma(
        metadata_filename,
        project_id, 
        processed_template_filename,
        "template",
        local_dir,
        "real"
    ) 
}






[1] "Checking sample ordering..."
[1] TRUE


In [14]:
%%R -i metadata_filename -i project_id -i base_dir -i local_dir -i num_runs -i method

source(paste0(base_dir, '/generic_expression_patterns_modules/DE_analysis.R'))

# Files created: "<local_dir>/DE_stats/DE_stats_simulated_data_<project_id>_<n>.txt"
for (i in 0:(num_runs-1)){
    simulated_data_filename <- paste(
        local_dir, 
        "pseudo_experiment/selected_simulated_data_",
        project_id,
        "_", 
        i,
        "_processed.txt",
        sep = ""
    )
    if (method == "deseq"){
        get_DE_stats_DESeq(
            metadata_filename,
            project_id, 
            simulated_data_filename,
            "simulated",
            local_dir,
            i
            )
    }
    else {
        get_DE_stats_limma(
            metadata_filename,
            project_id, 
            simulated_data_filename,
            "simulated",
            local_dir,
            i
            )
        }
    }

   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.



[1] "Checking sample ordering..."
[1] TRUE
[1] "Checking sample ordering..."
[1] TRUE


## Rank genes

Genes are ranked by their "generic-ness": how frequently they are changed across the simulated experiments, according to the user-specified test statistic (e.g. log2 fold change).
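
As a small illustration of ranking by a test statistic, the sketch below ranks toy genes by the magnitude of their log2 fold change. It is not the logic of `ranking.process_and_rank_genes_pathways`; the values are made up, and giving the largest change the highest rank is chosen to mirror the summary table, where larger ranks accompany larger absolute log2 fold changes.

```python
import pandas as pd

# Toy DE statistics for one experiment; values are made up.
de_stats = pd.DataFrame(
    {"log2FoldChange": [0.9, -2.5, 0.1, -0.4]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

# Rank on the magnitude of change so that up- and down-regulated
# genes are treated alike; the largest change gets the highest rank.
de_stats["abs(log2FoldChange)"] = de_stats["log2FoldChange"].abs()
de_stats["rank"] = de_stats["abs(log2FoldChange)"].rank(ascending=True)
```

With these values `geneB`, which has the largest absolute change, receives the highest rank even though it is down-regulated.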

In [15]:
analysis_type = "DE"
template_DE_stats_filename = os.path.join(
    local_dir,
    "DE_stats",
    f"DE_stats_template_data_{project_id}_real.txt"
)

template_DE_stats, simulated_DE_summary_stats = ranking.process_and_rank_genes_pathways(
    template_DE_stats_filename,
    local_dir,
    num_runs,
    project_id,
    analysis_type,
    col_to_rank_genes,
)

## Summary table

* Gene ID: Gene identifier (HGNC symbols for human data or PA numbers for *P. aeruginosa* data)
* (Real): Statistics for the template experiment
* (Simulated): Statistics across the simulated experiments
* Number of experiments: Number of simulated experiments
* Z-score: A high z-score indicates that the gene is changed more in the template experiment than in the null set of simulated experiments (high z-score = highly specific to the template experiment)

Note: If using DESeq, genes with NaN in the `Adj P-value (Real)` column are those flagged because of the `cooksCutoff` parameter. Cook's distance is a diagnostic that indicates whether a single sample has a count with a disproportionate impact on the log fold change and p-values. These genes are flagged with an NA in the pvalue and padj columns of the results table. For more information, see the [DESeq2 FAQs](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pvaluesNA).
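
The z-score can be read as the distance between the template statistic and the mean of the simulated statistics, in units of the simulated standard deviation. The sketch below assumes that definition and the sample standard deviation (`ddof=1`, the pandas default); the input values are made up. The SLK row in the output of the next cell is consistent with this reading: (0.940282 - 0.064046) / 0.002037 ≈ 430.2.

```python
import pandas as pd

# abs(log2FoldChange) of one gene in the template and in each simulated
# experiment; values are made up for illustration.
real_value = 0.94
simulated_values = pd.Series([0.062, 0.066])

mean_simulated = simulated_values.mean()
std_simulated = simulated_values.std()  # sample standard deviation (ddof=1)

z_score = (real_value - mean_simulated) / std_simulated

# With a single simulated value the sample standard deviation is undefined,
# so both the standard deviation and the z-score come out as NaN.
single_run_std = pd.Series([0.53]).std()
```

Under this assumption, a gene that appears in only one simulated experiment has no defined standard deviation, which would explain NaN values in the `Std deviation (simulated)` and `Z score` columns.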

In [16]:
# Get summary table
summary_gene_ranks = ranking.generate_summary_table(
    template_DE_stats_filename,
    template_DE_stats,
    simulated_DE_summary_stats,
    col_to_rank_genes,
    local_dir,
    'gene',
    params
)

summary_gene_ranks.sort_values(by="Z score", ascending=False).head(10)



| Gene ID | Adj P-value (Real) | Rank (Real) | abs(log2FoldChange) (Real) | log2FoldChange (Real) | Median adj p-value (simulated) | Rank (simulated) | Mean abs(log2FoldChange) (simulated) | Std deviation (simulated) | Number of experiments (simulated) | Z score |
|---|---|---|---|---|---|---|---|---|---|---|
| SLK | 0.0005911571 | 6483.0 | 0.940282 | 0.940282 | 0.940476 | 252.0 | 0.064046 | 0.002037 | 2 | 430.165315 |
| LRRC8A | 1.902213e-08 | 6881.0 | 1.343314 | 1.343314 | 0.618578 | 1654.0 | 0.285278 | 0.002915 | 2 | 362.935455 |
| LSM14A | 0.003610193 | 5202.0 | 0.547553 | -0.547553 | 0.781809 | 951.0 | 0.193577 | 0.001562 | 2 | 226.66877 |
| FNDC3B | 0.4231353 | 2653.0 | 0.232234 | -0.232234 | 0.997149 | 17.0 | 0.004305 | 0.001086 | 2 | 209.788662 |
| MLF2 | 0.001210815 | 5534.0 | 0.614119 | -0.614119 | 0.77907 | 1303.0 | 0.242409 | 0.001872 | 2 | 198.594709 |
| NF2 | 5.559495e-07 | 6750.0 | 1.131627 | 1.131627 | 0.827481 | 1603.0 | 0.279273 | 0.005274 | 2 | 161.623695 |
| USP9X | 0.2603234 | 3620.0 | 0.336301 | 0.336301 | 0.931154 | 269.0 | 0.068864 | 0.002355 | 2 | 113.539452 |
| GOLGB1 | 0.006112168 | 6233.0 | 0.822894 | 0.822894 | 0.365382 | 3906.0 | 0.636068 | 0.002012 | 2 | 92.857931 |
| EIF2S2 | 1.716679e-05 | 6751.0 | 1.135946 | -1.135946 | 0.813965 | 4268.0 | 0.711979 | 0.004639 | 2 | 91.395952 |
| ELL2 | 0.0001977546 | 6917.0 | 1.412726 | -1.412726 | 0.969871 | 404.0 | 0.096958 | 0.015325 | 2 | 85.858717 |


In [17]:
summary_gene_ranks.isna().any()

Gene ID                                 False
Adj P-value (Real)                       True
Rank (Real)                             False
abs(log2FoldChange) (Real)              False
log2FoldChange (Real)                   False
Median adj p-value (simulated)          False
Rank (simulated)                        False
Mean abs(log2FoldChange) (simulated)    False
Std deviation (simulated)                True
Number of experiments (simulated)       False
Z score                                  True
dtype: bool

In [18]:
summary_gene_ranks[summary_gene_ranks.isna().any(axis=1)]

| Gene ID | Adj P-value (Real) | Rank (Real) | abs(log2FoldChange) (Real) | log2FoldChange (Real) | Median adj p-value (simulated) | Rank (simulated) | Mean abs(log2FoldChange) (simulated) | Std deviation (simulated) | Number of experiments (simulated) | Z score |
|---|---|---|---|---|---|---|---|---|---|---|
| FAM83A | 2.241092e-24 | 7075.0 | 2.794028 | 2.794028 | 0.537293 | 3332.0 | 0.531827 | NaN | 1 | NaN |
| TRPV2 | 1.196376e-17 | 7074.0 | 2.771896 | 2.771896 | 0.342293 | 5420.0 | 0.985449 | NaN | 1 | NaN |
| TRIB3 | NaN | 7069.0 | 2.587617 | -2.587617 | 0.586547 | 1213.0 | 0.230900 | NaN | 1 | NaN |
| ICAM1 | 3.981838e-02 | 7067.0 | 2.557440 | 2.557440 | 0.724900 | 2190.0 | 0.355869 | NaN | 1 | NaN |
| UCA1 | 3.158603e-08 | 7060.0 | 2.364145 | 2.364145 | 0.903759 | 2645.0 | 0.419016 | NaN | 1 | NaN |
| STC1 | 1.088229e-21 | 7056.0 | 2.320779 | 2.320779 | 0.301070 | 3449.0 | 0.551488 | NaN | 1 | NaN |
| ZC3H12A | 2.698803e-03 | 7055.0 | 2.316884 | 2.316884 | 0.673281 | 4190.0 | 0.689088 | NaN | 1 | NaN |
| SCNN1A | 1.241827e-08 | 7053.0 | 2.251175 | -2.251175 | 0.847718 | 3532.0 | 0.565898 | NaN | 1 | NaN |
| DDX58 | 4.092460e-12 | 7044.0 | 2.132860 | 2.132860 | 0.379020 | 4273.0 | 0.713044 | NaN | 1 | NaN |
| KRT80 | 3.176570e-08 | 7037.0 | 2.051189 | -2.051189 | 0.001941 | 5494.0 | 1.008882 | NaN | 1 | NaN |


In [19]:
# Save
summary_gene_ranks.to_csv(gene_summary_filename, sep='\t')