Skip to content
Branch: master
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.

Files

Permalink
Type Name Latest commit message Commit time
..
Failed to load latest commit information.
images_ood Internal change Feb 6, 2020
test_data Training a generative model and a classifier for genomic OOD dataset Jul 2, 2019
README.md Pixel_cnn evaluation code for LLR OOD paper Jan 16, 2020
classifier.py internal Jan 22, 2020
generative.py Explicitly replace "import tensorflow" with "tensorflow.compat.v1" fo… Feb 18, 2020
requirements.txt Training a generative model and a classifier for genomic OOD dataset Jul 2, 2019
run.sh internal Jan 22, 2020
utils.py internal Jan 22, 2020

README.md

Out-of-Distribution Detection for Genomics Sequences

This directory contains implementation of the generative model and the classification model for the bacteria genomic dataset that are used in the paper of Ren J, Liu PJ, Fertig E, Snoek J, Poplin R, DePristo MA, Dillon JV, Lakshminarayanan B. Likelihood Ratios for Out-of-Distribution Detection. arXiv preprint arXiv:1906.02845.

Installation

virtualenv -p python3 .
source ./bin/activate

pip install -r genomics_ood/requirements.txt

Usage

This directory contains two python scripts: generative.py: build an autoregressive generative model for DNA sequences using LSTM. classifier.py: build a classifier for DNA sequences using ConvNets.

To test the models on a toy dataset,

DATA_DIR=./genomics_ood/test_data
OUT_DIR=./genomics_ood/test_out
python -m genomics_ood.generative \
--hidden_lstm_size=30 \
--val_freq=100 \
--num_steps=1000 \
--in_tr_data_dir=$DATA_DIR/before_2011_in_tr \
--in_val_data_dir=$DATA_DIR/between_2011-2016_in_val \
--ood_val_data_dir=$DATA_DIR/between_2011-2016_ood_val \
--out_dir=$OUT_DIR

python -m genomics_ood.classifier \
--num_motifs=30 \
--val_freq=100 \
--num_steps=1000 \
--in_tr_data_dir=$DATA_DIR/before_2011_in_tr \
--in_val_data_dir=$DATA_DIR/between_2011-2016_in_val \
--ood_val_data_dir=$DATA_DIR/between_2011-2016_ood_val \
--label_dict_file=$DATA_DIR/label_dict.json \
--out_dir=$OUT_DIR

Real Bacteria Dataset

The real bacteria dataset with 10 in-distribtution classes, 60 validation out-of-distribution (OOD) classes, and 60 test OOD classes can be downloaded at Google Drive

To run models on the real dataset, one needs to set DATA_DIR=// and specify the OUT_DIR.

Likelihood Ratios

To compute likelihood ratios, we train two generative models using the generative.py. The full model is trained with L2 regularization weight and mutation rate both 0.0. The background model is trained with L2 regularization weight 0.0001 and mutation rate 0.1.

You can’t perform that action at this time.