# Generative Adversarial Network Training for Genes
This notebook describes the usage of the __geneGAN_train.py__ script to train a GAN for gene similarity measures or connection discovery.

In [1]:
from geneGAN_train import geneGAN
import numpy as np

## Dependencies
__dataPath__: Path to the GAN training matrix of shape i x j, where i is the number of training samples and j is the number of genes used in training. For gene similarity, each row should have exactly two entries set to one (corresponding to co-occurring genes), while the rest are zero.

__autoencoderPath__: Path to the autoencoder training matrix, which includes all possible gene pairs and is used for pretraining the autoencoder.
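As an illustration of the two-hot layout described above, the following minimal sketch builds a small training matrix and the matching all-pairs matrix. The helper `pairs_to_two_hot`, the toy gene count, and the example pairs are assumptions made for this illustration and are not part of __geneGAN_train.py__.

```python
import itertools
import numpy as np

def pairs_to_two_hot(pairs, n_genes):
    """Encode gene-index pairs as rows with exactly two entries set to one."""
    mat = np.zeros((len(pairs), n_genes), dtype=np.float32)
    for row, (a, b) in enumerate(pairs):
        mat[row, a] = 1.0
        mat[row, b] = 1.0
    return mat

n_genes = 4                                 # toy value, hypothetical
observed_pairs = [(0, 2), (1, 3), (0, 1)]   # hypothetical co-occurring genes

# Matrix in the format expected at dataPath: one row per observed pair.
train_matrix = pairs_to_two_hot(observed_pairs, n_genes)

# Matrix in the format expected at autoencoderPath: every possible gene pair.
all_pairs = list(itertools.combinations(range(n_genes), 2))
autoencoder_matrix = pairs_to_two_hot(all_pairs, n_genes)

print(train_matrix.shape)        # (3, 4)
print(autoencoder_matrix.shape)  # (6, 4)
```

Both matrices could then be written with `np.save` and referenced through the two paths.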


In [2]:
dataPath = 'data/cosmic_patients_min_50_in.npy'
autencoderPath = 'data/all_comb_two_hot.npy'
data = np.load(dataPath)

## GAN Setting

### GAN Architecture in terms of feature vector dimension:

__Autoencoder__: inputDim x [compressor_size] x embed_size x [decompressor_size] x inputDim

__Generator__: noise_size x [generator_size] x embed_size

__Generator + Decoder__: noise_size x [generator_size] x embed_size x [decompressor_size] x inputDim

__Discriminator__: inputDim x [discriminator_size]
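The dimension chains above can be spelled out for a concrete setting. The sketch below only concatenates the size lists in the order given above; the function `layer_dims` and the value `input_dim=1000` are hypothetical and chosen purely for illustration.

```python
def layer_dims(input_dim, embed_size, noise_size, generator_size,
               discriminator_size, compressor_size, decompressor_size):
    """Return the feature dimension chain of each network as a list."""
    autoencoder = ([input_dim] + list(compressor_size) + [embed_size]
                   + list(decompressor_size) + [input_dim])
    generator = [noise_size] + list(generator_size) + [embed_size]
    # The generator output is passed through the decoder half of the autoencoder.
    generator_decoder = generator + list(decompressor_size) + [input_dim]
    discriminator = [input_dim] + list(discriminator_size)
    return {
        'autoencoder': autoencoder,
        'generator': generator,
        'generator+decoder': generator_decoder,
        'discriminator': discriminator,
    }

dims = layer_dims(input_dim=1000,  # hypothetical number of genes
                  embed_size=256, noise_size=256,
                  generator_size=[256, 256],
                  discriminator_size=[256, 128, 1],
                  compressor_size=[], decompressor_size=[])

for name, chain in dims.items():
    print(name + ': ' + ' x '.join(str(d) for d in chain))
```

With empty compressor and decompressor lists, the autoencoder reduces to a single hidden embedding layer between input and output.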

### Other Settings

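__dataType__: if 'binary', the last layer of the decoder uses a squashing activation so that outputs lie in [0,1] (described here as tanh; note that tanh on its own has range [-1,1], so check the script for the exact activation), otherwise ReLU.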

__bnDecay__: batch normalization decay

__l2scale__: weight of L2 regularization loss

__modelPath__: path to an existing model from which training is continued. If '', a new training run is started.

__outPath__: output folder plus filename prefix for the model and results. For 'trainings/training_test/test', 'training_test' is the folder and 'test' is the start of the filenames, so all files will be named 'test...'.

__pretrainEpochs__: number of autoencoder pretraining epochs

__nEpochs__: number of main GAN training epochs

__discriminatorTrainPeriod__: number of discriminator trainings at each training step, used to set D/G training ratio.

__generatorTrainPeriod__: number of generator trainings at each training step, used to set D/G training ratio.

__pretrainBatchSize__: batch size for autoencoder pretraining.

__batchSize__: batch size for the main GAN training.

__saveMaxKeep__: maximum number of intermediate models saved.

__keepProb__: dropout keep probability.
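To make the D/G training ratio concrete, here is a minimal sketch of how the two period settings could translate into an update order. It assumes that every training step runs all discriminator updates first and the generator updates afterwards; the actual ordering inside __geneGAN_train.py__ may differ, and `training_schedule` is a hypothetical helper.

```python
def training_schedule(n_steps, discriminator_train_period, generator_train_period):
    """List the network updated at each point of an epoch ('D' or 'G')."""
    updates = []
    for _ in range(n_steps):
        # Assumption: discriminator updates come before generator updates.
        updates.extend(['D'] * discriminator_train_period)
        updates.extend(['G'] * generator_train_period)
    return updates

schedule = training_schedule(n_steps=3,
                             discriminator_train_period=2,
                             generator_train_period=1)
print(''.join(schedule))  # DDGDDGDDG
```

With `discriminatorTrainPeriod=2` and `generatorTrainPeriod=1`, the discriminator is therefore updated twice as often as the generator.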

In [3]:
inputDim = data.shape[1]
embed_size = 256
noise_size = 256
generator_size = [256, 256]
discriminator_size = [256, 128, 1]
compressor_size = []
decompressor_size = []

ggan = geneGAN(dataType='binary',
            inputDim=inputDim,
            embeddingDim=embed_size,
            randomDim=noise_size,
            generatorDims=generator_size,
            discriminatorDims=discriminator_size,
            compressDims=compressor_size,
            decompressDims=decompressor_size,
            bnDecay=0.99,
            l2scale=0.001)

ggan.train(dataPath=dataPath,
           autoencoderData=autencoderPath,
           modelPath='',
           outPath='trainings/training_test/test',
           pretrainEpochs=2, #100,
           nEpochs=2, #1000,
           discriminatorTrainPeriod=2,
           generatorTrainPeriod=1,
           pretrainBatchSize=100,
           batchSize=1000,
           saveMaxKeep=0,
           keepProb=0.5)

Pretrain_Epoch:0, trainLoss:20.905212, validLoss:11.789073, validReverseLoss:0.000000
Pretrain_Epoch:1, trainLoss:7.404072, validLoss:2.667455, validReverseLoss:0.000000
Epoch:00000, time:43.19 d_loss:0.97, g_loss:22.59, acc:1.00, score_t:0.99, score_v:0.99, gen_v:0
Epoch:00001, time:36.33 d_loss:0.10, g_loss:14.92, acc:1.00, score_t:1.00, score_v:1.00, gen_v:0
INFO:tensorflow:trainings/training_test/test is not in all_model_checkpoint_paths. Manually adding it.
trainings/training_test/test
best epoch scaled: 0
best epoch unscaled: 0


## Outputs
### test_training_stats.npz
This NumPy file stores various training indicators. They are divided into four variables, each holding a vector of metrics for every epoch.
#### ae_training_status
This variable stores 4 quantities per epoch: 
- epoch
- training time
- training loss
- validation loss

#### main_training_status
This variable stores 16 quantities per epoch:
- epoch
- training time
- discriminator loss
- generator loss
- validation accuracy (batch mode)
- validation AUC (batch mode)
- validation accuracy (single mode)
- validation AUC (single mode)
- training accuracy (batch mode)
- training AUC (batch mode)
- training accuracy (single mode)
- training AUC (single mode)
- mean discriminator value for pairs in dataset (train+valid in single mode)
- mean discriminator value for pairs not in dataset (single mode)
- mean discriminator value for pairs in dataset (train+valid in batch mode)
- mean discriminator value for pairs not in dataset (batch mode)
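For plotting or inspection it can help to address these quantities by name. The sketch below assumes that `main_training_status` is a 2-D array with one row per epoch and the 16 columns in the order listed above; the array layout and the short column names are assumptions for this illustration, and a synthetic array stands in for the real file.

```python
import numpy as np

# Hypothetical short names, in the order of the list above.
MAIN_COLUMNS = [
    'epoch', 'time', 'd_loss', 'g_loss',
    'valid_acc_batch', 'valid_auc_batch', 'valid_acc_single', 'valid_auc_single',
    'train_acc_batch', 'train_auc_batch', 'train_acc_single', 'train_auc_single',
    'mean_d_in_single', 'mean_d_out_single', 'mean_d_in_batch', 'mean_d_out_batch',
]

def column(status, name):
    """Return one named metric for all epochs (assumes shape epochs x 16)."""
    return status[:, MAIN_COLUMNS.index(name)]

# Synthetic stand-in for stats['main_training_status'] with two epochs.
status = np.arange(2 * 16, dtype=float).reshape(2, 16)

g_loss = column(status, 'g_loss')
print(g_loss.shape)  # (2,)
```

If the stored array turns out to be transposed, indexing with `status[MAIN_COLUMNS.index(name)]` would be the equivalent lookup.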

#### quality_status
This variable stores 11 quantities related to the discriminator performance. Unscaled in this context means that the decision boundary is drawn at 0.5 (scores in [0,0.5) are classified as false pairs, scores in [0.5,1.0] as real pairs), while scaled uses the best-performing boundary. The quality score is defined as $\sqrt{(Sensitivity-1)^2+(Specificity-1)^2}$; it is zero for a perfect classifier, so lower values indicate better performance.
The stored quantities are:
- number of correctly classified training pairs (unscaled)
- number of correctly classified validation pairs (unscaled)
- number of falsely classified pairs not in dataset (unscaled)
- quality score training (unscaled)
- quality score validation (unscaled)
- number of correctly classified training pairs (scaled)
- number of correctly classified validation pairs (scaled)
- number of falsely classified pairs not in dataset (scaled)
- quality score training (scaled)
- quality score validation (scaled)
- best boundary (used for scaled classification)
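The quality score and the difference between the unscaled and scaled variants can be sketched as follows. This is an illustration under assumptions: sensitivity is taken over pairs in the dataset, specificity over pairs not in the dataset, and the best boundary is picked by minimizing the quality score over a grid of candidates. How __geneGAN_train.py__ actually selects the boundary is not stated here, and all names and scores below are hypothetical.

```python
import numpy as np

def quality_score(scores_real, scores_fake, boundary=0.5):
    """Distance to the ideal point (sensitivity = 1, specificity = 1)."""
    sensitivity = np.mean(scores_real >= boundary)  # real pairs accepted
    specificity = np.mean(scores_fake < boundary)   # false pairs rejected
    return float(np.sqrt((sensitivity - 1) ** 2 + (specificity - 1) ** 2))

def best_boundary(scores_real, scores_fake, candidates):
    """Assumed selection rule: the candidate with the lowest quality score."""
    scores = [quality_score(scores_real, scores_fake, b) for b in candidates]
    best = int(np.argmin(scores))
    return float(candidates[best]), scores[best]

# Hypothetical discriminator outputs.
scores_real = np.array([0.90, 0.80, 0.47, 0.70])  # pairs in the dataset
scores_fake = np.array([0.10, 0.38, 0.30, 0.20])  # pairs not in the dataset

unscaled = quality_score(scores_real, scores_fake)  # boundary fixed at 0.5
boundary, scaled = best_boundary(scores_real, scores_fake,
                                 np.linspace(0.05, 0.95, 19))
print(unscaled, scaled, boundary)
```

In this toy example the fixed 0.5 boundary misses one real pair, while a boundary slightly below it separates both groups perfectly and gives a scaled score of zero.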


#### generator_training_status
This variable stores two quantities related to the generator output:
- number of valid generator samples during the epoch. Valid means that the rounded output corresponds to a two-hot vector
- number of unique valid generator samples during the epoch
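The validity check described above can be sketched as follows, assuming the generator output is a matrix with one sample per row and values that round to 0 or 1. The helper `count_valid_two_hot` and the sample values are hypothetical and not taken from the script.

```python
import numpy as np

def count_valid_two_hot(generated):
    """Count samples whose rounded output is a two-hot vector, and the unique ones."""
    rounded = np.rint(generated).astype(int)
    is_binary = np.all((rounded == 0) | (rounded == 1), axis=1)
    is_two_hot = is_binary & (rounded.sum(axis=1) == 2)
    valid = rounded[is_two_hot]
    n_unique = len(np.unique(valid, axis=0)) if len(valid) else 0
    return len(valid), n_unique

generated = np.array([
    [0.9, 0.1, 0.8, 0.0],  # valid: genes 0 and 2
    [0.7, 0.2, 0.6, 0.1],  # valid, but the same pair as the first row
    [0.9, 0.9, 0.9, 0.0],  # invalid: three entries round to one
    [0.2, 0.1, 0.8, 0.0],  # invalid: only one entry rounds to one
])

n_valid, n_unique = count_valid_two_hot(generated)
print(n_valid, n_unique)  # 2 1
```

A large gap between the two counts would suggest that the generator keeps producing the same few pairs.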
***
### train_ind.npy and valid_ind.npy
These two arrays contain the split of the main training dataset into training and validation sets. For example, data[valid_ind] retrieves the validation data.
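A minimal sketch of this indexing, assuming both files hold integer row indices into the main training matrix. Stand-in arrays are used here instead of loading the saved files, and the 80/20 proportion is only an example.

```python
import numpy as np

# Stand-in for the main training matrix (normally np.load(dataPath)).
data = np.eye(5, dtype=np.float32)

# Stand-ins for the saved index arrays (normally loaded with np.load).
rng = np.random.default_rng(0)
perm = rng.permutation(len(data))
train_ind, valid_ind = perm[:4], perm[4:]

train_data = data[train_ind]  # rows used for training
valid_data = data[valid_ind]  # rows held out for validation

print(train_data.shape, valid_data.shape)  # (4, 5) (1, 5)
```

If the arrays were stored as boolean masks instead, the same `data[valid_ind]` expression would still select the validation rows.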
***
### TensorFlow model files
The remaining files in the folder are the model files saved by TensorFlow.

In [4]:
# Use a separate name so the training matrix in `data` is not overwritten.
stats = np.load('trainings/training_test/test_training_stats.npz')
quality_status = stats['quality_status']