# BIRA: Bias Invariant RNA-Seq Annotation Using Domain Adaptation

Welcome to this notebook where we'll run an example using our novel RNA-Seq annotation method.

In this notebook you will be able to reproduce some of our results yourself!

We'll go through the following steps:

<ol>
<li>Install the source code for BIRA as published in our GitHub repo</li>
<li>Load training, test and bias injection data sets</li>
<li>Run and evaluate a full training cycle for BIRA</li>
</ol>





Before we can import BIRA, we first have to install the package by running the following cell.<br>
After installing, you need to <b>restart the kernel</b>. To do so, select Kernel -> Restart Kernel...

In [23]:
!pip install .

Processing /projectbig/jupyternotebook/rnaseq_augmentation/rna_augment
Building wheels for collected packages: rna-augment
  Building wheel for rna-augment (setup.py) ... done
  Created wheel for rna-augment: filename=rna_augment-1.0-cp37-none-any.whl size=4605 sha256=ae28f26622ca23407dd3ac1d446c475512a1587841ee093435b4256fbb3c3f36
  Stored in directory: /tmp/pip-ephem-wheel-cache-dr817v6z/wheels/6b/93/40/f3e9867b3e873b21b81882c92ddac3850b395e576c949465e7
Successfully built rna-augment
Installing collected packages: rna-augment
  Found existing installation: rna-augment 1.0
    Uninstalling rna-augment-1.0:
      Successfully uninstalled rna-augment-1.0
Successfully installed rna-augment-1.0


In [1]:
from src import bira, load_data

Next, we'll load the training (source) data, the target data and the bias data. We'll reproduce the results for the BIRA G+S-T experiment as described in our paper. <br>
For "source" we load all the GTEx data originally used in the paper, and for "bias" we load all the SRA data. For "target" we load a random subset (frac=0.5) of the original TCGA test data. Using a subset of the TCGA data saves some space but will lead to comparable results.

In [2]:
source, target, bias = load_data.load_data()

BIRA comes with a number of hyperparameters that can be chosen freely. Here we provide the values chosen in the paper for this experiment.

<ul>
    <li>source_layers: a list of integers representing the number of nodes to be used per layer for the source and bias mapper, [512] will create one layer with 512 nodes</li>
    <li>classifier_layers: a list of integers representing the number of nodes to be used per layer for the classifier layer, [] will only create a single output layer with n=classes</li>
    <li>lr: learning rate applied in the second training cycle</li>
    <li>classes: number of classes in the data</li>
    <li>batch_size: batch size</li>
    <li>margin: size of the margin applied in the triplet loss</li>
    <li>print: True / False, whether the test accuracy should be printed after every epoch during the second training cycle</li>
</ul>
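To give a feel for what the margin does, here is a minimal sketch of the textbook triplet loss, max(d(a, p) - d(a, n) + margin, 0). It is not taken from the BIRA source code: the Euclidean distance, the function names and the toy vectors are all assumptions made for illustration, and how BIRA picks its anchors, positives and negatives is not shown here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    """Textbook triplet loss: penalise triplets where the negative is not
    at least `margin` further from the anchor than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Toy embeddings (hypothetical values, not real RNA-Seq data).
anchor = [0.0, 0.0]
positive = [3.0, 4.0]   # distance 5 from the anchor
negative = [6.0, 8.0]   # distance 10 from the anchor

# With a small margin the triplet is already "satisfied" and contributes no loss.
loss_small_margin = triplet_loss(anchor, positive, negative, margin=1.0)

# With the margin used in this notebook (11) the same triplet is still penalised.
loss_paper_margin = triplet_loss(anchor, positive, negative, margin=11.0)

print(loss_small_margin, loss_paper_margin)
```

A larger margin therefore keeps pushing samples apart for longer, which is why it is exposed as a hyperparameter.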

In [5]:
config = {'source_layers': [512],
          'classifier_layers': [],
          'lr': 0.0005,
          'classes': 16,
          'batch_size': 64,
          'margin': 11,
          'print': True}
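As a rough illustration of how the two layer lists translate into network shapes, the sketch below expands them into (inputs, outputs) pairs per dense layer. The helper `layer_shapes` and the input size `n_genes = 1000` are hypothetical and not part of BIRA; the real number of input features depends on the loaded data.

```python
def layer_shapes(n_inputs, hidden_layers, n_outputs=None):
    """Return (in_features, out_features) for each dense layer.

    hidden_layers is a list of node counts as in the config; n_outputs,
    if given, appends a final output layer of that size.
    """
    sizes = [n_inputs] + list(hidden_layers)
    if n_outputs is not None:
        sizes.append(n_outputs)
    return list(zip(sizes[:-1], sizes[1:]))


n_genes = 1000  # hypothetical number of input features

# 'source_layers': [512] -> one mapper layer with 512 nodes.
mapper_shapes = layer_shapes(n_genes, [512])

# 'classifier_layers': [] with 'classes': 16 -> only a single output layer.
classifier_shapes = layer_shapes(512, [], n_outputs=16)

print(mapper_shapes)
print(classifier_shapes)
```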

Finally, we start the first training cycle. Here we train the source mapper and the classification layer as a vanilla MLP.

In [16]:
model = bira.Bira(source, target, bias, config=config)
model.train_source_mapper(epochs=10)
model.eval_source_mapper()

0.622342481638964


The accuracy above (about 62%) is what we achieved using GTEx to train an MLP to predict TCGA. Let's see if we can do better by injecting some SRA data set biases:

In [17]:
model.train_da(epochs=10)

0.6899884035562428
0.8001546192500967
0.8229609586393506
0.8438345574023965
0.8573637417858523
0.860456126787785
0.8689601855431001
0.8685736374178585
0.8689601855431001
0.8797835330498647
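To put these numbers next to the MLP baseline, the short sketch below copies the accuracies printed in this notebook into plain Python and reports the best epoch and the absolute gain. The variable names are ours, and the values will differ slightly if you rerun the training.

```python
# Test accuracy of the vanilla MLP from the first training cycle (copied from above).
baseline_accuracy = 0.622342481638964

# Test accuracy after each epoch of the second training cycle (copied from above).
da_accuracies = [
    0.6899884035562428,
    0.8001546192500967,
    0.8229609586393506,
    0.8438345574023965,
    0.8573637417858523,
    0.860456126787785,
    0.8689601855431001,
    0.8685736374178585,
    0.8689601855431001,
    0.8797835330498647,
]

# Best epoch (1-based) and its accuracy.
best_accuracy = max(da_accuracies)
best_epoch = da_accuracies.index(best_accuracy) + 1

# Absolute improvement over the baseline, in accuracy points.
gain = best_accuracy - baseline_accuracy

# Number of epochs in which the accuracy dropped compared to the previous epoch.
n_drops = sum(1 for prev, cur in zip(da_accuracies, da_accuracies[1:]) if cur < prev)

print(f"best epoch: {best_epoch}, accuracy: {best_accuracy:.4f}, gain: {gain:.4f}, drops: {n_drops}")
```

In this run the accuracy is already above the baseline after the first epoch and keeps rising, with one small dip, through epoch 10.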
