Update readme and latent dim argument (#34)
* add more details to readme on simulations and how to run them

* fix typos

* add newlines to table

* update functions to take latent_dim as arg

* Update README.md

Co-authored-by: Ben Heil <ben.jer.heil@gmail.com>

* Update README.md

Co-authored-by: Ben Heil <ben.jer.heil@gmail.com>

* update readme based on PR

Co-authored-by: Ben Heil <ben.jer.heil@gmail.com>
ajlee21 and ben-heil committed Apr 28, 2021
1 parent 5668219 commit d44a4b6
Showing 6 changed files with 77 additions and 52 deletions.
25 changes: 20 additions & 5 deletions README.md
@@ -3,12 +3,12 @@
# ponyo
[![Coverage Status](https://coveralls.io/repos/github/greenelab/ponyo/badge.svg?branch=master)](https://coveralls.io/github/greenelab/ponyo?branch=master)

**Alexandra Lee and Casey Greene 2020**
**Alexandra J. Lee and Casey S. Greene 2020**

**University of Pennsylvania**

This repository is named after the character [Ponyo](https://en.wikipedia.org/wiki/Ponyo), from Hayao Miyazaki's animated film *Ponyo*, as she uses her magic to simulate a human appearance after getting a sample of human blood.
The method simulates a compendia of new gene expression data based on existing gene expression data to learn a representation of gene expression patterns.
The method simulates new gene expression data by training a generative neural network on existing gene expression data to learn a representation of gene expression patterns.

## Installation

@@ -18,6 +18,15 @@ This package can be installed using pip:
pip install ponyo
```

## Types of simulations
There are 3 types of simulations that ponyo implements:

| Name | Description |
| :--- | :---------- |
| Simulation by random sampling| This approach simulates gene expression data by randomly sampling from the latent space distribution. The function to run this approach is divided into 2 components: `simulate_by_random_sampling`, a wrapper which loads trained VAE models from the directory `<root>/<analysis name>/"models"/<NN_architecture>`, and `run_sample_simulation`, which runs the simulation. Note: `simulate_by_random_sampling` assumes the files are organized as described above. If this directory organization doesn't apply to you, you can use `run_sample_simulation` directly by passing in the trained VAE models. An example of how to use this can be found [here](https://github.com/greenelab/ponyo/blob/master/human_tests/Human_random_sampling_simulation.ipynb). A minimal usage sketch follows this table. |
| Simulation by latent transformation| This approach simulates gene expression data by encoding experiments into the latent space and then shifting samples from that experiment in the latent space. Unlike the "Simulation by random sampling" approach, this method accounts for experiment-level information by shifting samples from the same experiment together. The function to run this approach is divided into 2 components: `simulate_by_latent_transformation`, a wrapper which loads trained VAE models from the directory `<root>/<analysis name>/"models"/<NN_architecture>`, and `run_latent_transformation_simulation`, which runs the simulation. Note: `simulate_by_latent_transformation` assumes the files are organized as described above. If this directory organization doesn't apply to you, you can use `run_latent_transformation_simulation` directly by passing in the VAE models trained using `run_tybalt_training` in [vae.py](https://github.com/greenelab/ponyo/blob/master/ponyo/vae.py). <br><br>There are 2 flavors of this approach: `simulate_by_latent_transform` takes a dataset with multiple experiments (your template experiments) and outputs the same number of new simulated experiments, created by shifting each input template experiment; an example of how to use this can be found [here](https://github.com/greenelab/ponyo/blob/master/human_tests/Human_latent_transform_simulation.ipynb). The second flavor, `shift_template_experiment`, takes a single template experiment and can output multiple simulated experiments based on that one template by shifting it to different locations in the latent space; an example of how to use this can be found [here](https://github.com/greenelab/ponyo/blob/master/human_tests/Human_template_simulation.ipynb).|

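To make the wrapper-versus-direct split concrete, here is a minimal, hypothetical sketch of the random-sampling entry points. The function names come from the table above, but the import path and all argument names are assumptions for illustration rather than ponyo's exact signatures; the latent-transformation functions follow the same wrapper/direct pattern. See the notebooks linked in the table for working calls.

```python
# Hypothetical sketch only -- function names are from the table above, but the
# import path and argument names are assumptions, not ponyo's exact signatures.
from ponyo import simulate_expression_data

normalized_data_filename = "normalized_expression.tsv"  # placeholder path

# Wrapper path: assumes trained VAE models are stored under
# <root>/<analysis name>/models/<NN_architecture>.
simulated_data = simulate_expression_data.simulate_by_random_sampling(
    normalized_data_filename,
    num_simulated_samples=100,  # assumed parameter name
)

# Direct path: if your files are organized differently, pass the trained VAE
# models in yourself (encoder/decoder here stand for models such as those
# produced by run_tybalt_training in ponyo/vae.py).
simulated_data = simulate_expression_data.run_sample_simulation(
    encoder,
    decoder,
    normalized_data_filename,
    num_simulated_samples=100,
)
```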

## How to use
Example notebooks using ponyo on test data can be found in [human_tests](https://github.com/greenelab/ponyo/tree/master/human_tests).

@@ -40,12 +49,12 @@ The table lists the core parameters required to generate simulated data using m
| local_dir| str: Parent directory on local machine to store intermediate results|
| dataset_name| str: Name for analysis directory containing notebooks using ponyo|
| raw_data_filename| str: File storing raw gene expression data|
| normalized_data_filename| str: File storing normalized gene expression data|
| normalized_data_filename| str: File storing normalized gene expression data. This file is generated by [normalize_expression_data()](https://github.com/greenelab/ponyo/blob/master/ponyo/train_vae_modules.py).|
| metadata_filename*| str: File containing metadata associated with data|
| experiment_ids_filename*| str: File containing list of experiment ids that have gene expression data available|
| scaler_transform_filename| str: File to store mapping from normalized to raw gene expression range|
| scaler_transform_filename| str: Python pickle file to store mapping from normalized to raw gene expression range. This file is generated by [normalize_expression_data()](https://github.com/greenelab/ponyo/blob/master/ponyo/train_vae_modules.py).|
| simulation_type | str: Name of simulation approach directory to store results locally|
| NN_architecture | str: Name of neural network architecture to use. Format 'NN_<intermediate layer>_<latent layer>'|
| NN_architecture | str: Name of neural network architecture to use. Format `NN_<intermediate layer>_<latent layer>`, e.g. `NN_2500_30`|
| learning_rate| float: Step size used for gradient descent; in other words, how quickly the method learns|
| batch_size | int: Training is performed in batches, so this sets the number of samples considered at a given time|
| epochs | int: Number of times to train over the entire input dataset|
@@ -61,3 +70,9 @@ The table lists the core parameters required to generate simulated data using m
| metadata_experiment_colname* | str: Column header that contains experiment id that maps expression data and metadata|
| metadata_sample_colname* | str: Column header that contains sample id that maps expression data and metadata|
| project_id*| int: If using template-based approach, experiment id to use as template experiment|

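To make the table concrete, here is what these parameters might look like gathered into the `params` dictionary that the example notebooks read from. Every value below is an illustrative placeholder rather than a recommended setting.

```python
# Illustrative placeholders only -- keys mirror the parameter table above.
params = {
    "local_dir": "/tmp/ponyo_results",           # parent dir for intermediate results
    "dataset_name": "human_analysis",            # analysis directory name
    "raw_data_filename": "raw_expression.tsv",
    "normalized_data_filename": "normalized_expression.tsv",
    "scaler_transform_filename": "scaler_transform.pickle",
    "simulation_type": "latent_transformation",  # hypothetical directory label
    "NN_architecture": "NN_2500_30",             # NN_<intermediate layer>_<latent layer>
    "latent_dim": 30,                            # size of the VAE latent layer
    "learning_rate": 0.001,
    "batch_size": 10,
    "epochs": 10,
    "num_simulated_experiments": 10,
    # parameters marked * in the table:
    "metadata_filename": "metadata.tsv",
    "experiment_ids_filename": "experiment_ids.tsv",
    "metadata_delimiter": "\t",
    "metadata_experiment_colname": "experiment_id",
    "metadata_sample_colname": "sample_id",
    "project_id": 1,                             # template experiment id (placeholder)
}
```
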
For guidance on setting VAE training parameters, see the configurations used in the [simulate-expression-compendia](https://github.com/greenelab/simulate-expression-compendia/configs) and [generic-expression-patterns](https://github.com/greenelab/generic-expression-patterns/configs) repositories.
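
The `normalize_expression_data()` preprocessing step referenced in the table might be invoked roughly as follows. This is a sketch under assumed argument names; check [train_vae_modules.py](https://github.com/greenelab/ponyo/blob/master/ponyo/train_vae_modules.py) for the actual signature.

```python
# Hypothetical sketch -- normalize_expression_data() is the function the table
# points to, but the argument names below are assumptions.
from ponyo import train_vae_modules

# Produces the normalized expression file and the pickled scaler used to map
# normalized values back to the raw expression range.
train_vae_modules.normalize_expression_data(
    "raw_expression.tsv",         # raw_data_filename (placeholder)
    "normalized_expression.tsv",  # normalized_data_filename (placeholder)
    "scaler_transform.pickle",    # scaler_transform_filename (placeholder)
)
```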


## Acknowledgements
We would like to thank Marvin Thielk for adding coverage to tests and Ben Heil for contributing code to add more flexibility.
44 changes: 23 additions & 21 deletions human_tests/Human_latent_transform_simulation.ipynb
@@ -129,6 +129,7 @@
"metadata_filename = params[\"metadata_filename\"]\n",
"experiment_id_filename = params[\"experiment_ids_filename\"]\n",
"NN_architecture = params['NN_architecture']\n",
"latent_dim = params['latent_dim']\n",
"num_simulated_experiments = params['num_simulated_experiments']\n",
"metadata_delimiter = params[\"metadata_delimiter\"]\n",
"experiment_id_colname = params['metadata_experiment_colname']\n",
@@ -175,7 +176,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@@ -196,7 +197,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 8,
"metadata": {},
"outputs": [
{
@@ -225,7 +226,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@@ -239,7 +240,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 10,
"metadata": {
"scrolled": true
},
@@ -260,23 +261,23 @@
"\n",
"Train on 45 samples, validate on 5 samples\n",
"Epoch 1/10\n",
"45/45 [==============================] - 4s 88ms/step - loss: 2511.2365 - val_loss: 2078.2676\n",
"45/45 [==============================] - 4s 98ms/step - loss: 2511.2365 - val_loss: 2078.2676\n",
"Epoch 2/10\n",
"45/45 [==============================] - 4s 79ms/step - loss: 1688.8236 - val_loss: 2374.3589\n",
"Epoch 3/10\n",
"45/45 [==============================] - 4s 79ms/step - loss: 1664.0755 - val_loss: 1454.6667\n",
"Epoch 4/10\n",
"45/45 [==============================] - 4s 79ms/step - loss: 1509.4538 - val_loss: 1387.5260\n",
"Epoch 5/10\n",
"45/45 [==============================] - 4s 80ms/step - loss: 1474.1985 - val_loss: 1371.2039\n",
"45/45 [==============================] - 4s 79ms/step - loss: 1474.1985 - val_loss: 1371.2039\n",
"Epoch 6/10\n",
"45/45 [==============================] - 4s 80ms/step - loss: 1489.1452 - val_loss: 1350.6823\n",
"45/45 [==============================] - 4s 79ms/step - loss: 1489.1452 - val_loss: 1350.6823\n",
"Epoch 7/10\n",
"45/45 [==============================] - 4s 80ms/step - loss: 1502.0319 - val_loss: 1949.6031\n",
"45/45 [==============================] - 4s 79ms/step - loss: 1502.0319 - val_loss: 1949.6031\n",
"Epoch 8/10\n",
"45/45 [==============================] - 4s 80ms/step - loss: 1381.4732 - val_loss: 1232.3323\n",
"45/45 [==============================] - 4s 79ms/step - loss: 1381.4732 - val_loss: 1232.3323\n",
"Epoch 9/10\n",
"45/45 [==============================] - 4s 80ms/step - loss: 1419.9623 - val_loss: 1151.1223\n",
"45/45 [==============================] - 4s 79ms/step - loss: 1419.9623 - val_loss: 1151.1223\n",
"Epoch 10/10\n",
"45/45 [==============================] - 4s 80ms/step - loss: 1384.7468 - val_loss: 1161.4500\n"
]
@@ -309,7 +310,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 11,
"metadata": {
"scrolled": false
},
@@ -329,6 +330,7 @@
" num_simulated_experiments,\n",
" normalized_data_filename,\n",
" NN_architecture,\n",
" latent_dim,\n",
" dataset_name,\n",
" analysis_name,\n",
" metadata_filename,\n",
@@ -349,7 +351,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
@@ -380,7 +382,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
@@ -389,7 +391,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
@@ -399,7 +401,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
@@ -424,7 +426,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
@@ -445,7 +447,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
@@ -466,7 +468,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 18,
"metadata": {},
"outputs": [
{
@@ -546,7 +548,7 @@
"SRR592749 5.665261 4.215596 background"
]
},
"execution_count": 20,
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
@@ -563,7 +565,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 19,
"metadata": {},
"outputs": [
{
@@ -580,7 +582,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"<ggplot: (8741103944825)>\n"
"<ggplot: (8733738930705)>\n"
]
}
],
2 changes: 1 addition & 1 deletion human_tests/Human_random_sampling_simulation.ipynb
@@ -898,7 +898,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
"version": "3.7.3"
}
},
"nbformat": 4,
