# Simulate pseudo experiments using template experiment

This notebook generates new pseudo-experiments using the experiment-preserving approach from the [experiment level simulation](../Pseudomonas/Pseudomonas_experiment_lvl_sim.ipynb). The simulation preserves the experiment type but not the actual experiment: the relationships between samples within an experiment are kept, but the genes that are expressed will differ (module [simulate_compendium](../functions/generate_data_parallel.py)).

The expression patterns in these new experiments are compared against the patterns in the experiments generated in [generate_random_sampled_experiment.ipynb](generate_random_sampled_experiment.ipynb) using [differential expression analysis](DE_analysis_run.R) and [pathway enrichment analysis](find_enrichment_run.R).

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
from pathlib import Path
import ast
import pandas as pd
import numpy as np
import seaborn as sns
import random
import glob
from sklearn import preprocessing

sys.path.append("../")
from ponyo import utils
import generate_labeled_data

import warnings
warnings.filterwarnings(action='ignore')

from numpy.random import seed
randomState = 123
seed(randomState)



In [2]:
# Read in config variables
config_file = os.path.abspath(os.path.join(os.getcwd(),"../configs", "config_Pa_experiment_limma.tsv"))
params = utils.read_config(config_file)
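
The cell above loads all settings from a TSV file. As a rough illustration of what such a loader might do, here is a minimal sketch. The function name `read_tsv_config`, the two-column `name<TAB>value` layout, and the type-parsing rule are all assumptions made for this example; they are not taken from `utils.read_config`.

```python
import ast
import csv


def read_tsv_config(path):
    """Read a two-column TSV (name<TAB>value) into a dict.

    Hypothetical stand-in for utils.read_config; the real file layout
    and parsing rules are assumptions here.
    """
    params = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2 or row[0].startswith("#"):
                continue  # skip blank, malformed, or comment lines
            name, raw_value = row[0].strip(), row[1].strip()
            try:
                # Turn "100" into 100, "[8, 4]" into a list, and so on
                params[name] = ast.literal_eval(raw_value)
            except (ValueError, SyntaxError):
                # Plain strings such as paths or names stay as text
                params[name] = raw_value
    return params
```

Parsing values this way would let a setting like `num_simulated_experiments` arrive as an integer while directory paths and dataset names remain strings.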

In [3]:
# Load parameters
num_runs = 100
dataset_name = params["dataset_name"]
num_simulated_experiments = params["num_simulated_experiments"]
NN_architecture = params["NN_architecture"]
local_dir = params["local_dir"]

In [4]:
# Input files
base_dir = os.path.abspath(
    os.path.join(os.getcwd(), "../"))  # base dir of the repo

# Load experiment id file
# Contains ALL experiment ids
experiment_ids_file = os.path.join(
    base_dir,
    dataset_name,
    "data",
    "metadata",
    "experiment_ids.txt")

normalized_data_file = os.path.join(
    base_dir,
    dataset_name,
    "data",
    "input",
    "train_set_normalized.pcl")

original_data_file = os.path.join(
    local_dir,
    "pseudo_experiment",
    "Pa_compendium_02.22.2014.pcl")

mapping_file = os.path.join(
    base_dir,
    dataset_name,
    "data",
    "metadata",
    "sample_annotations.tsv")

## Generate simulated data with labels

Simulate a compendium by experiment and label each new sample with the id of the experiment it originated from.

In [6]:
# Generate simulated data
simulated_labeled_data_file = os.path.join(
    local_dir,
    "pseudo_experiment",
    "simulated_data_labeled.txt.xz")
if not Path(simulated_labeled_data_file).exists():
    generate_labeled_data.simulate_compendium_labeled(experiment_ids_file, 
                                                      num_simulated_experiments,
                                                      normalized_data_file,
                                                      NN_architecture,
                                                      dataset_name,
                                                      local_dir,
                                                      base_dir)

## Process data

Note: The expression data was originally 0-1 normalized for training the VAE. However, when we performed differential expression analyses, the normalized data had reduced variance, so the number of DEGs found was inconsistent with the number reported in the publication. We therefore re-scale the normalized data back into the range of the original data.
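
To make the re-scaling concrete, here is a small NumPy sketch of per-gene min-max scaling and its inverse, which is the arithmetic that `MinMaxScaler` applies with its default `[0, 1]` range. The matrices and the helper names `to_unit_range` and `to_original_range` are made up for illustration.

```python
import numpy as np

# Toy "original" expression matrix: 4 samples x 2 genes (made-up values)
original = np.array([[9.6, 5.5],
                     [9.3, 5.7],
                     [8.9, 5.3],
                     [9.1, 5.4]])

# Per-gene (column-wise) minimum and range, as learned by a min-max fit
gene_min = original.min(axis=0)
gene_range = original.max(axis=0) - gene_min


def to_unit_range(values):
    """Map expression values into [0, 1] gene by gene."""
    return (values - gene_min) / gene_range


def to_original_range(values):
    """Invert the mapping: [0, 1] values back to expression units."""
    return values * gene_range + gene_min


# Pretend these came out of the VAE decoder, already in [0, 1]
simulated_unit = np.array([[0.50, 0.25],
                           [0.00, 1.00]])
simulated_rescaled = to_original_range(simulated_unit)
print(simulated_rescaled)
```

Because the minimum and range come from the original data, the simulated values land in the same per-gene range as the original measurements.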

In [7]:
# Load simulated data
simulated_data_file = os.path.join(
    local_dir,
    "pseudo_experiment",
    "simulated_data_labeled.txt.xz")

In [8]:
# Read data
original_data = pd.read_table(
    original_data_file,
    header=0,
    sep='\t',
    index_col=0).T

simulated_data = pd.read_table(
    simulated_data_file,
    header=0,
    sep='\t',
    index_col=0)

print(original_data.shape)
print(simulated_data.shape)

(950, 5549)
(5346, 5550)
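
The simulated table has one more column than the original because it carries the `experiment_id` label. Since a scaler fitted on the original genes is later applied to the simulated genes, it may be worth confirming that both tables list the same genes in the same order. The helper below is a hypothetical sketch (the name `check_gene_columns` is not part of the notebook) that works on any sequence of column names.

```python
def check_gene_columns(original_columns, simulated_columns,
                       label_column="experiment_id"):
    """Return the gene columns if both tables list the same genes in the same order."""
    # Ignore the label column that only the simulated table carries
    simulated_genes = [c for c in simulated_columns if c != label_column]
    original_genes = list(original_columns)
    if simulated_genes != original_genes:
        missing = set(original_genes) - set(simulated_genes)
        extra = set(simulated_genes) - set(original_genes)
        raise ValueError(
            f"gene columns differ: {len(missing)} missing, "
            f"{len(extra)} unexpected, or the order does not match")
    return simulated_genes
```

In this notebook the call would look like `check_gene_columns(original_data.columns, simulated_data.columns)`.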


In [9]:
original_data.head(5)

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA5561,PA5562,PA5563,PA5564,PA5565,PA5566,PA5567,PA5568,PA5569,PA5570
05_PA14000-4-2_5-10-07_S2.CEL,9.62009,10.575783,9.296287,9.870074,8.512268,7.903954,7.039473,10.209826,9.784684,5.485688,...,7.740609,9.730384,10.516061,10.639916,9.746849,5.768592,9.224442,11.512176,12.529719,11.804896
54375-4-05.CEL,9.327996,10.781977,9.169988,10.269239,7.237999,7.663758,6.855194,9.631573,9.404465,5.684067,...,7.127736,9.687607,10.199612,9.457152,9.318372,5.523898,7.911031,10.828271,11.597643,11.26852
AKGlu_plus_nt_7-8-09_s1.CEL,9.368599,10.596248,9.714517,9.487155,7.804147,7.681754,6.714411,9.497601,9.523126,5.766331,...,7.343241,9.717993,10.419979,10.164667,10.305005,5.806817,8.57573,10.85825,12.255953,11.309662
anaerobic_NO3_1.CEL,9.083292,9.89705,8.068471,7.310218,6.723634,7.141148,8.492302,7.740717,7.640251,5.267993,...,7.37474,8.287819,9.437053,8.936576,9.418147,5.956482,7.481406,7.687985,9.205525,9.395773
anaerobic_NO3_2.CEL,8.854901,9.931392,8.167126,7.526595,6.864015,7.154523,8.492109,7.716687,7.268094,5.427256,...,7.425398,8.588969,9.313851,8.684602,9.272818,5.729479,7.699086,7.414436,9.363494,9.424762


In [10]:
simulated_data.head(5)

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA5562,PA5563,PA5564,PA5565,PA5566,PA5567,PA5568,PA5569,PA5570,experiment_id
0,0.649,0.707,0.453,0.678,0.305,0.465,0.375,0.544,0.437,0.403,...,0.497,0.673,0.509,0.645,0.174,0.462,0.229,0.508,0.687,E-MEXP-2606_0
1,0.622,0.69,0.438,0.676,0.323,0.5,0.362,0.551,0.445,0.353,...,0.511,0.64,0.516,0.644,0.188,0.487,0.264,0.524,0.678,E-MEXP-2606_0
2,0.641,0.703,0.451,0.671,0.325,0.458,0.37,0.549,0.449,0.387,...,0.512,0.684,0.523,0.655,0.169,0.453,0.256,0.538,0.695,E-MEXP-2606_0
3,0.573,0.65,0.402,0.692,0.301,0.533,0.369,0.573,0.432,0.299,...,0.534,0.534,0.494,0.622,0.227,0.529,0.303,0.489,0.647,E-MEXP-2606_0
4,0.617,0.676,0.459,0.665,0.334,0.469,0.368,0.545,0.442,0.351,...,0.527,0.679,0.529,0.647,0.186,0.459,0.289,0.553,0.686,E-MEXP-2606_0


In [11]:
# 0-1 normalize per gene
scaler = preprocessing.MinMaxScaler()
original_data_scaled = scaler.fit_transform(original_data)
original_data_scaled_df = pd.DataFrame(original_data_scaled,
                                columns=original_data.columns,
                                index=original_data.index)

original_data_scaled_df.head(5)

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA5561,PA5562,PA5563,PA5564,PA5565,PA5566,PA5567,PA5568,PA5569,PA5570
05_PA14000-4-2_5-10-07_S2.CEL,0.853357,0.72528,0.640617,0.811465,0.69446,0.533958,0.158865,0.889579,0.884945,0.176558,...,0.466871,0.702785,0.790965,0.893249,0.789939,0.164157,0.97047,0.887472,0.900484,0.880012
54375-4-05.CEL,0.77879,0.767873,0.614859,0.907865,0.3988,0.460849,0.113876,0.761351,0.80174,0.222709,...,0.35202,0.694387,0.733186,0.639074,0.681204,0.110301,0.619554,0.747656,0.749893,0.805374
AKGlu_plus_nt_7-8-09_s1.CEL,0.789155,0.729508,0.725913,0.718989,0.53016,0.466327,0.079507,0.731643,0.827707,0.241847,...,0.392405,0.700352,0.773422,0.791118,0.931585,0.17257,0.797148,0.753785,0.856253,0.811099
anaerobic_NO3_1.CEL,0.71632,0.585079,0.390211,0.193248,0.279456,0.301781,0.513547,0.342051,0.415668,0.125914,...,0.398308,0.419574,0.593955,0.527203,0.706524,0.20551,0.504767,0.105662,0.363409,0.54478
anaerobic_NO3_2.CEL,0.658015,0.592172,0.410331,0.245504,0.312028,0.305852,0.513499,0.336723,0.334226,0.162965,...,0.407801,0.478697,0.57146,0.473054,0.669643,0.155548,0.562927,0.049738,0.388931,0.548814


In [12]:
# Re-scale simulated data back into the same range as the original data
simulated_data_numeric = simulated_data.drop(columns=['experiment_id'])
simulated_data_scaled = scaler.inverse_transform(simulated_data_numeric)

simulated_data_scaled_df = pd.DataFrame(simulated_data_scaled,
                                columns=simulated_data_numeric.columns,
                                index=simulated_data_numeric.index)

simulated_data_scaled_df['experiment_id'] = simulated_data['experiment_id']
simulated_data_scaled_df.head(5)

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA5562,PA5563,PA5564,PA5565,PA5566,PA5567,PA5568,PA5569,PA5570,experiment_id
0,8.819587,10.487285,8.376346,9.317434,6.833727,7.677395,7.924795,8.651416,7.737733,6.459046,...,8.682196,9.869977,8.85187,9.175711,5.813316,7.321338,8.291288,10.100462,10.417829,E-MEXP-2606_0
1,8.713823,10.404987,8.302797,9.309152,6.911305,7.792385,7.871545,8.682983,7.77429,6.244121,...,8.753506,9.689239,8.884444,9.17177,5.876926,7.414908,8.462489,10.199493,10.353151,E-MEXP-2606_0
2,8.78825,10.467921,8.366539,9.288449,6.919925,7.654397,7.904314,8.673964,7.792569,6.39027,...,8.7586,9.930223,8.917017,9.215116,5.790598,7.287653,8.423357,10.286145,10.47532,E-MEXP-2606_0
3,8.521882,10.211342,8.126278,9.375404,6.816487,7.900805,7.900218,8.782193,7.714885,6.012002,...,8.870659,9.108687,8.78207,9.085079,6.054123,7.572106,8.653256,9.982863,10.130371,E-MEXP-2606_0
4,8.694238,10.337211,8.405766,9.263605,6.958714,7.690537,7.896122,8.655925,7.760581,6.235524,...,8.835004,9.902838,8.944937,9.183592,5.867838,7.31011,8.584776,10.378987,10.410642,E-MEXP-2606_0


In [13]:
# Read in metadata
metadata = pd.read_table(
    mapping_file, 
    header=0, 
    sep='\t', 
    index_col=0)

metadata.head()

experiment,sample_name,ml_data_source,description,nucleic_acid,medium,genotype,od,growth_setting_1,growth_setting_2,strain,temperature,treatment,additional_notes,variant_phenotype,abx_marker,biotic_int_lv_2,biotic_int_lv_1
E-GEOD-46947,GSM1141730 1,GSM1141730_PA01_ZnO_PZO_.CEL,Pseudomonas aeruginosa PAO1 LB aerated 5 h wi...,RNA,LB,,,planktonic,aerated,PAO1,37.0,1 mM ZnO nanoparticles,Grown for 5h,,,,
E-GEOD-46947,GSM1141729 1,GSM1141729_PA01_none_PC_.CEL,Pseudomonas aeruginosa PAO1 LB aerated 5 h,RNA,LB,,,planktonic,aerated,PAO1,37.0,,Grown for 5h,,,,
E-GEOD-65882,GSM1608059 1,GSM1608059_Planktonic_1.CEL,PAO1 WT. Planktonic. Rep1,RNA,PBM plus 1 g / L glucose.,WT,0.26,Planktonic,Aerated,PAO1,37.0,,Grown shaking at 200rpm,,,,
E-GEOD-65882,GSM1608060 1,GSM1608060_Planktonic_2.CEL,PAO1 WT. Planktonic. Rep2,RNA,PBM plus 1 g / L glucose.,WT,0.26,Planktonic,Aerated,PAO1,37.0,,Grown shaking at 200rpm,,,,
E-GEOD-65882,GSM1608061 1,GSM1608061_Planktonic_3.CEL,PAO1 WT. Planktonic. Rep3,RNA,PBM plus 1 g / L glucose.,WT,0.26,Planktonic,Aerated,PAO1,37.0,,Grown shaking at 200rpm,,,,


In [14]:
map_experiment_sample = metadata[['sample_name', 'ml_data_source']]
map_experiment_sample.head()

experiment,sample_name,ml_data_source
E-GEOD-46947,GSM1141730 1,GSM1141730_PA01_ZnO_PZO_.CEL
E-GEOD-46947,GSM1141729 1,GSM1141729_PA01_none_PC_.CEL
E-GEOD-65882,GSM1608059 1,GSM1608059_Planktonic_1.CEL
E-GEOD-65882,GSM1608060 1,GSM1608060_Planktonic_2.CEL
E-GEOD-65882,GSM1608061 1,GSM1608061_Planktonic_3.CEL


# Template experiment E-GEOD-51409

This experiment measures the transcriptome of *P. aeruginosa* under two growth temperatures: 22 and 37 degrees Celsius, matching the `PAO1-22` and `PAO1-37` labels in the sample names.

In [15]:
# Get experiment id
experiment_id = 'E-GEOD-51409'

In [16]:
# Get original samples associated with experiment_id
selected_mapping = map_experiment_sample.loc[experiment_id]
original_selected_sample_ids = list(selected_mapping['ml_data_source'].values)

selected_original_data = original_data.loc[original_selected_sample_ids]

selected_original_data.head(10)

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA5561,PA5562,PA5563,PA5564,PA5565,PA5566,PA5567,PA5568,PA5569,PA5570
GSM1244967_PAO1-22-replicate-01.CEL,9.049257,9.927143,8.885547,8.80466,5.988185,7.764461,8.350282,7.774165,7.774105,5.510923,...,9.356331,7.377822,9.644671,7.429156,7.654936,6.131965,6.241217,7.684121,9.072845,10.695175
GSM1244968_PAO1-22-replicate-02.CEL,8.833167,9.917035,9.009681,8.900852,6.096402,7.749163,8.319773,8.301315,7.600733,5.706577,...,9.486076,6.894396,9.723022,7.788555,7.605215,6.231041,6.303251,7.815412,8.896422,10.714222
GSM1244969_PAO1-22-replicate-03.CEL,8.884647,9.907316,8.737792,8.628758,6.270097,7.989337,8.304787,7.953646,7.694889,5.679747,...,9.080317,7.296979,9.65595,7.345836,7.659305,5.702764,6.390429,7.863996,9.196426,10.671757
GSM1244970_PAO1-37-replicate-01.CEL,8.778351,9.872437,8.755126,8.662006,7.212951,8.4263,8.670188,8.653405,7.737566,5.69507,...,9.153401,7.927784,9.57273,8.284933,8.580799,5.959742,6.265907,7.824123,10.907826,12.177075
GSM1244971_PAO1-37-replicate-02.CEL,9.061243,9.828194,8.342299,8.842026,6.465854,7.970151,8.43216,8.22671,7.877492,5.785088,...,9.337408,8.013215,9.553362,8.590636,8.628973,5.778791,6.830793,8.052581,10.929728,12.152272
GSM1244972_PAO1-37-replicate-03.CEL,8.808541,10.165703,8.627964,8.689108,6.905749,8.306635,8.780996,8.567739,7.190508,5.742351,...,9.013357,7.840225,9.405921,8.208101,8.480249,6.232756,5.991988,7.848818,10.879207,12.175344
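
The lookup above relies on the experiment id appearing several times in the metadata index. If an experiment had a single sample, `.loc[experiment_id]` would return that one row as a Series instead of a DataFrame. The sketch below, with made-up metadata and a hypothetical helper `sample_ids_for`, shows a variant that returns a list of sample ids in both cases.

```python
import pandas as pd

# Made-up metadata indexed by experiment id, mimicking the layout above
toy_metadata = pd.DataFrame(
    {"ml_data_source": ["a1.CEL", "a2.CEL", "b1.CEL"]},
    index=pd.Index(["EXP-A", "EXP-A", "EXP-B"], name="experiment"),
)


def sample_ids_for(metadata, experiment_id, column="ml_data_source"):
    """List the sample ids of one experiment, even if it has a single sample."""
    # A list label keeps the row axis, so the result is always a Series of ids
    return metadata.loc[[experiment_id], column].tolist()


print(sample_ids_for(toy_metadata, "EXP-A"))
print(sample_ids_for(toy_metadata, "EXP-B"))
```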


In [17]:
# Get the simulated experiment id that was generated from this template experiment
# Since experiments were sampled with replacement, several simulated ids can match;
# the loop below has no break, so it keeps the last match
match_experiment_id = ''
for experiment_name in simulated_data_scaled_df['experiment_id'].values:
    if experiment_name.split("_")[0] == experiment_id:
        match_experiment_id = experiment_name 
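
Because experiments are sampled with replacement, more than one simulated id can stem from the same template. The following sketch collects every distinct match so that the choice between them is explicit. It assumes labels of the form `<experiment id>_<suffix>`, as in `E-MEXP-2606_0`; the function name `find_simulated_ids` is hypothetical.

```python
def find_simulated_ids(labels, template_id):
    """Return the distinct simulated ids derived from template_id, in order of appearance."""
    matches = []
    for label in labels:
        # Split off the trailing suffix only, so ids containing "_" still work
        source_id = label.rsplit("_", 1)[0]
        if source_id == template_id and label not in matches:
            matches.append(label)
    return matches
```

With `matches = find_simulated_ids(simulated_data_scaled_df['experiment_id'].values, experiment_id)`, `matches[0]` is the first matching set and `matches[-1]` the last, and an empty list signals that the template was never sampled.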

In [18]:
# Get simulated samples associated with experiment_id
selected_simulated_data = simulated_data_scaled_df[simulated_data_scaled_df['experiment_id'] == match_experiment_id]

# Map sample ids from original data to simulated data
selected_simulated_data.index = original_selected_sample_ids
selected_simulated_data = selected_simulated_data.drop(columns=['experiment_id'])

selected_simulated_data.head(5)

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA5561,PA5562,PA5563,PA5564,PA5565,PA5566,PA5567,PA5568,PA5569,PA5570
GSM1244967_PAO1-22-replicate-01.CEL,8.298603,10.182295,8.297893,9.429233,6.833727,7.733248,8.17466,8.051645,7.193946,5.603645,...,6.802119,8.259426,8.856749,7.674571,7.638906,6.235865,6.434294,8.853806,9.419624,9.131456
GSM1244968_PAO1-22-replicate-02.CEL,8.290769,10.138725,8.288087,9.404389,6.825107,7.766102,8.145987,8.015568,7.166528,5.603645,...,6.87149,8.20849,8.840318,7.665264,7.627084,6.258582,6.404352,8.824457,9.413434,9.210507
GSM1244969_PAO1-22-replicate-03.CEL,8.314272,10.225865,8.332216,9.458218,6.859586,7.71025,8.17466,8.09674,7.230503,5.616541,...,6.775438,8.310362,8.878657,7.693184,7.650727,6.222234,6.460494,8.897829,9.444382,9.109896
GSM1244970_PAO1-37-replicate-01.CEL,8.380864,10.196819,8.33712,9.197353,7.264719,7.509838,7.965756,8.385352,7.504681,5.921734,...,6.770102,8.743319,9.097733,8.367921,8.490059,6.453954,6.913373,9.494587,10.162356,9.72793
GSM1244971_PAO1-37-replicate-02.CEL,8.40045,10.211342,8.356733,9.213916,7.251789,7.513123,7.969852,8.389861,7.509251,5.908838,...,6.802119,8.712757,9.130594,8.377228,8.49794,6.43578,6.90963,9.494587,10.162356,9.778236


In [19]:
# Save selected samples
# This will be used as input into R script to identify differentially expressed genes
selected_simulated_data_file = os.path.join(
    local_dir,
    "pseudo_experiment",
    "selected_simulated_data_"+experiment_id+"_example.txt")

selected_original_data_file = os.path.join(
    local_dir,
    "pseudo_experiment",
    "selected_original_data_"+experiment_id+"_example.txt")

selected_simulated_data.to_csv(
        selected_simulated_data_file, float_format='%.3f', sep='\t')

selected_original_data.to_csv(
        selected_original_data_file, float_format='%.3f', sep='\t')

## Generate multiple simulated experiments 

Generate `num_runs` different simulated datasets from the same E-GEOD-51409 template experiment. Each run shifts the template experiment linearly in a different direction in the VAE's latent space.

In [20]:
# Generate multiple simulated datasets
for i in range(num_runs):
    generate_labeled_data.shift_template_experiment(
        normalized_data_file,
        experiment_id,
        NN_architecture,
        dataset_name,
        scaler,
        local_dir,
        base_dir,
        i)
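
The idea of shifting a template experiment can be illustrated with a toy NumPy sketch. This is not the code inside `generate_labeled_data.shift_template_experiment`, whose internals are not shown here. It assumes the samples have already been encoded into a latent space and that a new experiment centre is drawn from a standard normal distribution; a real run would encode with the trained VAE and decode the shifted points back into expression values.

```python
import numpy as np

rng = np.random.default_rng(123)

# Made-up latent coordinates: 6 samples of one template experiment, 3 latent dims
template_latent = rng.normal(loc=1.0, scale=0.2, size=(6, 3))


def shift_experiment(latent_samples, rng, scale=1.0):
    """Move a whole experiment to a new latent location, keeping its internal layout."""
    centroid = latent_samples.mean(axis=0)
    # Draw a new centre for the experiment (assumed standard-normal latent space)
    new_centroid = rng.normal(loc=0.0, scale=scale, size=centroid.shape)
    # The same offset is added to every sample, so pairwise differences are unchanged
    return latent_samples + (new_centroid - centroid)


# Each call draws a new centre, giving a different simulated experiment
shifted_runs = [shift_experiment(template_latent, rng) for _ in range(3)]
```

Adding one shared offset per run is what keeps the relationships between samples within the experiment intact while the experiment as a whole moves.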