# Bias Invariant RNA-Seq Annotation

Welcome to this notebook where we'll run an example using our novel RNA-Seq annotation method.

In this notebook you will be able to reproduce some of our results yourself!

We'll go through the following steps:

<ol>
<li>Install the source code for the DA model as published in our GitHub repo</li>
<li>Load training, test and bias injection data sets</li>
<li>Run and evaluate a full training cycle for our model</li>
</ol>





Before we can import the package, we first have to install it. Run the following cell.<br>

In [2]:
!pip install .

Processing /projectbig/jupyternotebook/rnaseq_augmentation/rna_augment
Building wheels for collected packages: rna-augment
  Building wheel for rna-augment (setup.py) ... done
  Created wheel for rna-augment: filename=rna_augment-1.0-cp37-none-any.whl size=4821 sha256=5a330fd1a116c87151c1d91740a555c9d846f07f3e41b57b700b186f4ec3fe2b
  Stored in directory: /tmp/pip-ephem-wheel-cache-n0kp13i6/wheels/6b/93/40/f3e9867b3e873b21b81882c92ddac3850b395e576c949465e7
Successfully built rna-augment
Installing collected packages: rna-augment
Successfully installed rna-augment-1.0


In [3]:
from src import da_model, load_data

Next, we'll load some source, target, and bias data. We'll reproduce the results of the DA G+S-T experiment for tissue prediction as described in our paper. <br>
As source data, we load all the GTEx data originally used in the paper, and as bias data we load all the SRA data. As target data, we load a random subset (frac=0.5) of the original TCGA test data. Using a subset of the TCGA data saves space and time but leads to comparable results.

In [4]:
source, target, bias = load_data.load_data()

BIRA comes with a number of hyperparameters that can be chosen freely. Here we provide the parameters chosen in the paper for this experiment.

<ul>
    <li>mapper_layers: a list of integers giving the number of nodes per layer for the source and bias mappers; [512] creates one layer with 512 nodes</li>
    <li>classifier_layers: a list of integers giving the number of nodes per layer for the classifier; [] creates only a single output layer with n=classes</li>
    <li>lr: learning rate applied in the second training cycle</li>
    <li>classes: number of classes in the data</li>
    <li>batch_size: batch size</li>
    <li>margin: size of the margin applied in the triplet loss</li>
    <li>print: True / False, whether the test accuracy should be printed after every epoch during the second training cycle</li>
</ul>
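To make the layer-list convention above concrete, here is a minimal sketch of how such a list could translate into dense-layer shapes. `layer_sizes` is a hypothetical helper written for this illustration and is not part of the package; the input width of 1000 features is an arbitrary placeholder, not the real number of genes.

```python
def layer_sizes(n_inputs, hidden_layers, n_outputs=None):
    """Return (inputs, outputs) pairs for each dense layer implied by a layer list.

    Hypothetical helper for illustration only; the package may build its
    layers differently.
    """
    sizes = []
    previous = n_inputs
    for width in hidden_layers:
        sizes.append((previous, width))
        previous = width
    # An optional output layer is appended after the hidden layers.
    if n_outputs is not None:
        sizes.append((previous, n_outputs))
    return sizes


# A mapper with one hidden layer of 512 nodes (input width is a placeholder).
print(layer_sizes(1000, [512]))              # [(1000, 512)]
# An empty classifier list leaves only the output layer with n=classes.
print(layer_sizes(512, [], n_outputs=16))    # [(512, 16)]
```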

In [8]:
config = {'mapper_layers': [512],
          'classifier_layers': [],
          'lr': 0.0005,
          'classes': 16,
          'batch_size': 64,
          'margin': 5,
          'print': True}
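To give some intuition for the `margin` setting, the sketch below computes a standard triplet loss for a single triplet. It assumes Euclidean distance and the common formulation max(d(anchor, positive) - d(anchor, negative) + margin, 0); how the model actually selects triplets and measures distance is not shown in this notebook, so treat this as an illustration rather than the package's implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Standard triplet loss for one triplet (assumed formulation)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
far_negative = [6.0, 8.0]    # distance 10 from the anchor
near_negative = [0.0, 6.0]   # distance 6 from the anchor

# The negative is at least `margin` farther away than the positive: no loss.
print(triplet_loss(anchor, positive, far_negative))    # 0.0
# The negative is only 1 farther away, so the loss is 5 - 6 + 5 = 4.
print(triplet_loss(anchor, positive, near_negative))   # 4.0
```

A larger margin asks for a wider separation between the positive and negative distances before a triplet stops contributing to the loss.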

Finally, we start the first training cycle. Here we train the source mapper and the classification layer as a vanilla MLP.

In [9]:
model = da_model.DaModel(source, target, bias, config=config)
model.train_source_mapper(epochs=10)
model.eval_source_mapper()

0.6729802860456127

The accuracy above is what we achieved using GTEx to train an MLP to predict TCGA.
Note that we use a different network than the MLP G-T we present in the paper, so results may vary.
Let's see if we can do better by injecting some SRA data set biases:

In [11]:
model.train_bias_mapper(epochs=10)

0.6590645535369154
0.8051797448782373
0.845380749903363
0.8442211055276382
0.8554310011596443
0.8596830305373019
0.8763045999226904
0.8778507924236567
0.872825666795516
0.8755315036722072
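
The numbers above can be summarized with a few lines of plain Python. This sketch assumes the ten printed values are the per-epoch test accuracies of the second training cycle (as the `print` option suggests) and compares them with the accuracy reported after the first cycle; the variable names are only illustrative.

```python
# Accuracy printed after the first training cycle (source mapper only).
baseline = 0.6729802860456127

# Values printed during the second training cycle, assumed to be one per epoch.
epoch_accuracies = [
    0.6590645535369154,
    0.8051797448782373,
    0.845380749903363,
    0.8442211055276382,
    0.8554310011596443,
    0.8596830305373019,
    0.8763045999226904,
    0.8778507924236567,
    0.872825666795516,
    0.8755315036722072,
]

# Epochs are counted from 1.
best_index = max(range(len(epoch_accuracies)), key=epoch_accuracies.__getitem__)
best_epoch = best_index + 1
best_accuracy = epoch_accuracies[best_index]
final_gain = epoch_accuracies[-1] - baseline

print(f"best epoch: {best_epoch} ({best_accuracy:.4f})")      # best epoch: 8 (0.8779)
print(f"final gain over baseline: {final_gain:+.4f}")         # final gain over baseline: +0.2026
```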
