<a href="https://colab.research.google.com/github/hwartmann/rna_augment/blob/master/rna_augment_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BIRA: Bias Invariant RNA-Seq Annotation Using Domain Adaptation

Welcome to this notebook where we'll run an example using our novel RNA-Seq annotation method.

In this notebook you will be able to reproduce some of our results yourself!

We'll go through the following steps:

<ol>
<li>Install the source code for BIRA as published in our GitHub repository</li>
<li>Load training, test and bias injection data sets</li>
<li>Run and evaluate a full training cycle for BIRA</li>
</ol>





In [None]:
!git clone https://github.com/imsb-uke/rna_augment.git

Cloning into 'rna_augment'...
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 57 (delta 10), reused 19 (delta 4), pack-reused 28
Unpacking objects: 100% (57/57), done.


In [None]:
!pip install rna_augment/

Processing ./rna_augment
Building wheels for collected packages: rna-augment
  Building wheel for rna-augment (setup.py) ... done
  Created wheel for rna-augment: filename=rna_augment-1.0-cp36-none-any.whl size=4233 sha256=fe0e1d5bbebe42b36c66b79e2b89e2b61ade9b60db4bea98dae38d46a24b3254
  Stored in directory: /tmp/pip-ephem-wheel-cache-2q4u1jwv/wheels/5d/a2/2a/09db3901a39f38b5a2cc80f741037a4934972c328799e0394e
Successfully built rna-augment
Installing collected packages: rna-augment
Successfully installed rna-augment-1.0


In [None]:
from rna_augment.src import bira, load_data

For "source" we load all the GTEx data originally used in the paper, and for "bias" we load all the SRA data. For "target" we load a random subset (frac=0.5) of the original TCGA test data. Using a subset of the TCGA data saves some space but leads to comparable results.
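As a rough illustration of what a random subset with frac=0.5 means, the sketch below draws half of a list of sample identifiers without replacement. The helper `sample_fraction`, the seed and the identifiers are hypothetical and only serve the example; this is not how `load_data` prepares the TCGA subset internally.

```python
import random


def sample_fraction(rows, frac, seed=0):
    """Return a random subset containing roughly `frac` of `rows` (no repeats)."""
    rng = random.Random(seed)      # fixed seed so the draw is reproducible
    k = round(len(rows) * frac)    # number of rows to keep
    return rng.sample(rows, k)


# Hypothetical sample identifiers standing in for TCGA test samples.
sample_ids = [f"sample_{i}" for i in range(10)]
subset = sample_fraction(sample_ids, frac=0.5)
print(len(subset), "of", len(sample_ids), "samples kept")
```

Fixing the seed keeps the subset identical between runs, which matters when accuracies are compared across experiments.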

In [None]:
source, target, bias = load_data.load_data()

BIRA comes with a number of hyperparameters that can be chosen freely. Here we provide the parameters chosen in the paper for this experiment.

<ul>
    <li>source_layers: a list of integers representing the number of nodes to be used per layer for the source and bias mapper, [512] will create one layer with 512 nodes</li>
    <li>classifier_layers: a list of integers representing the number of nodes to be used per layer for the classifier layer, [] will only create a single output layer with n=classes</li>
    <li>lr: learning rate applied in the second training cycle</li>
    <li>classes: number of classes in the data</li>
    <li>batch_size: batch size</li>
    <li>margin: size of margin applied in triplet loss</li>
    <li>print: True / False, if test accuracy should be printed after every epoch during the second training cycle</li>
</ul>
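To give a feeling for the `margin` parameter, here is a minimal sketch of a generic triplet loss in plain Python. It assumes the common formulation max(0, d(anchor, positive) - d(anchor, negative) + margin) with Euclidean distance. How BIRA selects anchors, positives and negatives, and which distance it uses, is not shown in this notebook, so the function and the toy vectors below are illustrative only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Generic triplet loss: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings (hypothetical values, not real expression data).
anchor = [0.0, 0.0]
positive = [3.0, 4.0]   # distance 5 from the anchor
negative = [6.0, 8.0]   # distance 10 from the anchor

loss_small = triplet_loss(anchor, positive, negative, margin=1.0)    # 5 - 10 + 1 < 0, clipped to 0
loss_large = triplet_loss(anchor, positive, negative, margin=11.0)   # 5 - 10 + 11 = 6
print(loss_small, loss_large)
```

A larger margin demands a wider gap between positive and negative distances before the loss reaches zero, so more triplets keep contributing to training.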

In [None]:
config = {'source_layers': [512],
          'classifier_layers': [],
          'lr': 0.0005,
          'classes': 16,
          'batch_size': 64,
          'margin': 11,
          'print': True}
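The following sketch shows how the layer lists in the configuration above could translate into weight-matrix shapes. The helper `layer_shapes` and the input size `n_genes` are hypothetical and chosen only for illustration; the actual model construction lives in `bira.Bira` and may differ.

```python
def layer_shapes(n_inputs, hidden_layers, n_outputs=None):
    """List (in_features, out_features) pairs for a stack of dense layers."""
    sizes = [n_inputs] + list(hidden_layers)
    if n_outputs is not None:
        sizes.append(n_outputs)    # optional final output layer
    return list(zip(sizes[:-1], sizes[1:]))


config = {'source_layers': [512], 'classifier_layers': [], 'classes': 16}
n_genes = 1000  # hypothetical number of input features

# Mapper: one hidden layer with 512 nodes.
mapper_shapes = layer_shapes(n_genes, config['source_layers'])

# Classifier: no hidden layers, only an output layer with one node per class.
classifier_shapes = layer_shapes(config['source_layers'][-1],
                                 config['classifier_layers'],
                                 config['classes'])

print(mapper_shapes)      # [(1000, 512)]
print(classifier_shapes)  # [(512, 16)]
```

With `classifier_layers` left empty, the classifier reduces to a single linear map from the 512-dimensional mapper output to the 16 classes.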

Finally, we start the first training cycle. Here we train the source mapper and the classification layer as a vanilla MLP.

In [None]:
model = bira.Bira(source, target, bias, config=config)
model.train_source_mapper(epochs=10)
model.eval_source_mapper()

The accuracy above is what we achieved by training an MLP on GTEx and predicting on TCGA. Let's see if we can do better by injecting some SRA data set biases:


In [None]:
model.train_da(epochs=10)