Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time


popVAE fits a variational autoencoder (VAE) to a set of genotypes and outputs the latent space.

A manuscript describing popVAE's methods and testing it on several empirical datasets can be found at


Popvae requires python 3.7 and tensorflow 1.15. We recommend you first install miniconda3 (, then install in a new environment.

Clone this repo and install with:

conda create --name popvae python=3.7.7
conda activate popvae
git clone
cd popvae
python install


popVAE requires input genotypes in .vcf, .vcf.gz, or .zarr formats. This repo includes a small test dataset of ddRADseq genotypes from migratory Painted Buntings (from Fit a model with: --infile data/pabu/pabu_test_genotypes.vcf --out out/pabu_test --seed 42

It should fit in less than a minute on a regular laptop CPU. For running on larger datasets we recommend using a CUDA-enabled GPU.


At default settings popvae will output 4 files:
pabu_test_latent_coords.txt -- best-fit latent space coordinates by sample.
pabu_test_history.txt -- training and validation loss by epoch.
pabu_test_history.pdf -- a plot of training and validation loss by epoch.
pabu_test_training_preds.txt -- latent coordinates output during model training, stored every --prediction_freq epochs.


Many hyperparameters and filtering options can be adjusted at the command line. Run --h to see all parameters.

Default settings work well on most datasets, but validation loss can usually be improved by tuning hyperparameters. We've seen most effects from changing three settings: network size, early stopping patience, and the proportion of samples used for model training versus validation.

--search_network_sizes runs short optimizations for a range of network sizes and selects the network with lowest validation loss. Alternately, --depth and --width set the number of layers and the number of hidden units per layer in the network. If you're running low on GPU memory, reducing --width will help.

--patience sets the number of epochs the optimizer will run after the last improvement in validation loss -- we've found that increasing this value (to, say, 300) sometimes helps with small datasets.

--train_prop sets the proportion of samples used for model training, with the rest used for validation.

To run a grid search over a specific set of network sizes with increased patience and a larger validation set on the test data, use: --infile data/pabu/pabu_test_genotypes.vcf \
--out out/pabu_test --seed 42 --patience 300 \
--search_network_sizes --width_range 32,256,512 \
--depth_range 3,5,8 --train_prop 0.75


Using the optional --plot and --metadata options will generate an interactive scatterplot with metadata for each sample visible on mouseover. Running the test data with --infile data/pabu/pabu_test_genotypes.vcf --out out/pabu_test --seed 42 --plot --metadata data/pabu/pabu_test_sample_data.txt

Should cause a plot like this to open in your default web browser:

You can (slightly) customize this plot with the script. See options with python scripts/ --h.

Alternately, simple scatter plots can be produced in R with:


#load data
pd <- fread("out/pabu_test_latent_coords.txt",header=T)
names(pd)[1:2] <- c("LD1","LD2")
sd <- fread("data/pabu/pabu_test_sample_data.csv") #this has sample metadata for interpreting plots

#merge tables
pd <- merge(pd,sd,by="sampleID")

#plot VAE

Note there are two main groups of samples corresponding to eastern and western sampling localities, as well as cline within the western group. These are allopatric (the big gap) and parapatric (the cline in western samples) breeding populations with different migratory strategies (see to compare these results with PCA and STRUCTURE).

Generating Artificial Genotypes

We're still working on the best way to allow users to generate artificial genotypes from trained models, since this tends to be a more interactive task than just fitting the model and visualizing the latent space. For now we have included a working example of fitting a VAE, generating artificial genotypes, and analyzing them with PCA and Admixture clustering at scripts/ Stay tuned for updates.


genotype dimensionality reduction with a VAE







No releases published


No packages published

Contributors 4