# Generate data and calculate similarity

The goal of this notebook is to determine how much of the structure in the original dataset (single experiment) is retained after adding some number of experiments.

For this simulation, we want to capture the structure of individual experiments.
In particular, we simulate data by (1) preserving the relationship between samples within an experiment and (2) shifting the samples in space.

Criterion (1) accounts for the type of experiment, such as treatment vs. non-treatment. Criterion (2) reflects a different type of perturbation, such as a different antibiotic.

The approach is to:
1. Randomly sample an experiment from the Pseudomonas compendium
2. Embed samples from the experiment into the trained latent space
3. Randomly shift the samples to a new location in the latent space. The new location is selected based on the distribution of samples in the latent space.
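
The shift in step 3 can be sketched as follows. This is a minimal illustration rather than the code in `generate_data`: the function `shift_experiment`, the array names, and the choice to draw the new centroid from a per-dimension normal distribution fitted to the compendium are all assumptions, and the random arrays stand in for real encoded data.

```python
import numpy as np


def shift_experiment(encoded_experiment, encoded_compendium, rng):
    """Move an experiment to a new latent location, keeping its internal structure.

    Hypothetical helper: both inputs are (samples x latent dimensions) arrays.
    """
    # Offsets of each sample from the experiment centroid encode the
    # within-experiment relationships that should be preserved.
    centroid = encoded_experiment.mean(axis=0)
    offsets = encoded_experiment - centroid

    # Assumption: draw the new centroid from a normal distribution fitted
    # independently to each latent dimension of the compendium.
    new_centroid = rng.normal(
        loc=encoded_compendium.mean(axis=0),
        scale=encoded_compendium.std(axis=0),
    )

    # Re-attach the original offsets at the new location.
    return offsets + new_centroid


rng = np.random.default_rng(123)
# Placeholder data: 950 samples in a 30-dimensional latent space.
encoded_compendium = rng.normal(size=(950, 30))
encoded_experiment = encoded_compendium[:6]

shifted = shift_experiment(encoded_experiment, encoded_compendium, rng)
```

Because every sample in the experiment receives the same translation, distances between samples within the experiment are unchanged; only the experiment's position relative to the rest of the compendium moves.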

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import glob
import pandas as pd
import numpy as np
import random

import warnings
warnings.filterwarnings(action='ignore')

sys.path.append("../")
from functions import generate_data

from numpy.random import seed
randomState = 123
seed(randomState)

Using TensorFlow backend.


In [2]:
# User parameters
NN_architecture = 'NN_2500_30'
analysis_name = 'analysis_1'
num_simulated_experiments = 50

In [3]:
# Input files

# base dir on repo
base_dir = os.path.abspath(os.path.join(os.getcwd(),"../..")) 

# base dir on local machine for data storage
# os.makedirs doesn't recognize `~`
local_dir = os.path.abspath(os.path.join(os.getcwd(), "../../../.."))

NN_dir = os.path.join(base_dir, "models", NN_architecture)

normalized_data_file = os.path.join(
    base_dir,
    "data",
    "input",
    "train_set_normalized.pcl")

### Load file with experiment ids

In [4]:
experiment_ids_file = os.path.join(
    base_dir,
    "data",
    "metadata",
    "experiment_ids.txt")
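
As a rough sketch of how experiment ids might be drawn from this file, the snippet below parses one id per line and samples with replacement. The file layout, the helper names `parse_experiment_ids` and `sample_experiment_ids`, and sampling with replacement are assumptions (the repeated ids in the logged output are consistent with replacement, but the source does not state it). An in-memory buffer with a few ids stands in for the real file.

```python
import io
import random


def parse_experiment_ids(lines):
    """Hypothetical parser: assumes one experiment id per line."""
    return [line.strip() for line in lines if line.strip()]


def sample_experiment_ids(experiment_ids, num_experiments, seed=123):
    """Draw experiment ids uniformly at random, with replacement (assumption)."""
    rng = random.Random(seed)
    return rng.choices(experiment_ids, k=num_experiments)


# Placeholder for the contents of experiment_ids.txt.
id_buffer = io.StringIO("E-GEOD-9989\nE-GEOD-6741\nE-MEXP-1183\n")
experiment_ids = parse_experiment_ids(id_buffer)

sampled_ids = sample_experiment_ids(experiment_ids, num_experiments=5)
```

With replacement, the same experiment can be selected more than once, so the number of simulated experiments may exceed the number of distinct experiments sampled.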

### Generate simulated data

In [5]:
# Generate simulated data
generate_data.simulate_compendium(experiment_ids_file, 
                                  num_simulated_experiments,
                                  normalized_data_file,
                                  NN_architecture,
                                  analysis_name
                                 )

Directory already exists: 
 /home/alexandra/Documents/Data/Batch_effects/simulated/analysis_1


Normalized gene expression data contains 950 samples and 5549 genes
E-MEXP-2606
E-GEOD-15697
E-GEOD-18594
E-GEOD-13252
E-GEOD-24038
E-GEOD-9989
E-GEOD-33241
E-GEOD-26932
E-GEOD-22665
E-GEOD-6741
E-GEOD-46603
E-GEOD-24784
E-GEOD-28953
E-GEOD-33160
E-MEXP-2812
E-GEOD-7266
E-GEOD-17179
E-GEOD-2430
E-GEOD-16970
E-GEOD-45695
E-GEOD-28953
E-MTAB-1381
E-GEOD-10362
E-GEOD-46603
E-GEOD-15697
E-GEOD-9989
E-GEOD-2430
E-GEOD-16970
E-GEOD-22999
E-GEOD-45695
E-GEOD-6741
E-GEOD-7968
E-GEOD-36647
E-MEXP-1183
E-GEOD-22665
E-GEOD-28953
E-GEOD-8083
E-GEOD-32032
E-GEOD-36753
E-GEOD-28953
E-MEXP-1183
E-GEOD-35632
E-GEOD-33244
E-GEOD-7704
E-GEOD-7704
E-GEOD-24784
E-GEOD-9592
E-GEOD-9592
E-GEOD-51076
E-GEOD-24784
Return: simulated gene expression data containing 353 samples and 5550 genes
