Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Training a generative model and a classifier for genomic OOD dataset
PiperOrigin-RevId: 256221996
- Loading branch information
Jie Ren
authored and
Copybara-Service
committed
Jul 2, 2019
1 parent
c80499a
commit 6c0c4cf
Showing
12 changed files
with
1,329 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Out-of-Distribution Detection for Genomics Sequences | ||
|
||
This directory contains implementation of the generative model and the classification model for the bacteria genomic dataset that are used in the paper of [Ren J, Liu PJ, Fertig E, Snoek J, Poplin R, DePristo MA, Dillon JV, Lakshminarayanan B. Likelihood Ratios for Out-of-Distribution Detection. arXiv preprint arXiv:1906.02845. 2019 Jun 7.](https://arxiv.org/abs/1906.02845) | ||
|
||
## Installation | ||
|
||
```bash | ||
virtualenv -p python3 . | ||
source ./bin/activate | ||
|
||
pip install -r genomics_ood/requirements.txt | ||
``` | ||
|
||
## Usage | ||
|
||
This directory contains two python scripts: | ||
generative.py: build an autoregressive generative model for DNA sequences using LSTM. | ||
classifier.py: build a classifier for DNA sequences using ConvNets. | ||
|
||
To test the models on a toy dataset, | ||
|
||
```bash | ||
DATA_DIR=./genomics_ood/test_data | ||
OUT_DIR=./genomics_ood/test_out | ||
python -m genomics_ood.generative \ | ||
--hidden_lstm_size=30 \ | ||
--val_freq=100 \ | ||
--num_steps=1000 \ | ||
--in_tr_data_dir=$DATA_DIR/before_2011_in_tr \ | ||
--in_val_data_dir=$DATA_DIR/between_2011-2016_in_val \ | ||
--ood_val_data_dir=$DATA_DIR/between_2011-2016_ood_val \ | ||
--out_dir=$OUT_DIR | ||
|
||
python -m genomics_ood.classifier \ | ||
--num_motifs=30 \ | ||
--val_freq=100 \ | ||
--num_steps=1000 \ | ||
--in_tr_data_dir=$DATA_DIR/before_2011_in_tr \ | ||
--in_val_data_dir=$DATA_DIR/between_2011-2016_in_val \ | ||
--ood_val_data_dir=$DATA_DIR/between_2011-2016_ood_val \ | ||
--label_dict_file=$DATA_DIR/label_dict.json \ | ||
--out_dir=$OUT_DIR | ||
``` | ||
|
||
## Real Bacteria Dataset | ||
|
||
The real bacteria dataset with 10 in-distribtution classes, 60 validation out-of-distribution (OOD) classes, and 60 test OOD classes can be downloaded at [Google Drive](https://drive.google.com/open?id=1Ht9xmzyYPbDouUTl_KQdLTJQYX2CuclR) | ||
|
||
To run models on the real dataset, one needs to set DATA_DIR=/<path to the real data directory>/ and specify the OUT_DIR. | ||
|
||
|
||
## Likelihood Ratios | ||
|
||
To compute likelihood ratios, we train two generative models using the generative.py. The full model is trained with L2 regularization weight and mutation rate both 0.0. The background model is trained with L2 regularization weight 0.0001 and mutation rate 0.1. | ||
|
||
|
Oops, something went wrong.