Bayesian inference of hybrid genealogy from population genetic data.
MONGRAIL classifies individuals as purebred or hybrid based on multilocus genotype data from two reference populations. It computes log-likelihoods and posterior probabilities for six genealogical models representing different hybrid categories.
Download the latest release for your platform from the Releases page:
- Linux (x86_64):
mongrail2-linux-x86_64.tar.gz - macOS (Apple Silicon):
mongrail2-macos-arm64.tar.gz - Windows (x86_64):
mongrail2-windows-x86_64.zip
Each archive contains the executable and example data (puma dataset in GT and VCF formats).
Linux/macOS:
tar -xzf mongrail2-linux-x86_64.tar.gz
cd mongrail2-linux-x86_64
./mongrail2 --helpWindows:
Extract the zip file and run mongrail2.exe from the command prompt.
The pre-built binaries require htslib to be installed on your system:
Ubuntu/Debian:
sudo apt-get install libhts3Fedora/RHEL:
sudo dnf install htslibmacOS (Homebrew):
brew install htslibWindows (MSYS2):
pacman -S mingw-w64-x86_64-htslib- GCC compiler
- GNU Make
- Standard C math library
- htslib development files
Ubuntu/Debian:
sudo apt-get install build-essential libhts-devFedora/RHEL:
sudo dnf install gcc make htslib-develmacOS (Homebrew):
brew install htslibgit clone https://github.com/mongrail/mongrail2.git
cd mongrail2
makeThis produces the mongrail2 executable.
The default build uses -O3 -march=native -ffast-math for optimal performance. For debugging, edit the Makefile:
CFLAGS = -Wall -gmongrail2 [options] popA popB hybrids
| Argument | Description |
|---|---|
popA |
Genotype file for reference population A |
popB |
Genotype file for reference population B |
hybrids |
Genotype file for putative hybrid individuals |
| Option | Description |
|---|---|
-c |
Read VCF format (requires phased, biallelic SNPs; max 16 per chromosome) |
-p THRESH |
Prune haplotype pairs by posterior probability threshold (0-1, e.g., 0.99) |
-r RATE |
Recombination rate per base pair (default: 1e-8) |
-k K |
Max recombinations for sparse enumeration (required if >16 loci per chromosome) |
-v |
Verbose output (show input file summary) |
-h |
Show help message |
-V |
Show version |
# Using .GT format
./mongrail2 popA.GT popB.GT hybrids.GT
# Using VCF format
./mongrail2 -c popA.vcf popB.vcf hybrids.vcf
# With options
./mongrail2 -c -r 1e-7 -v popA.vcf popB.vcf hybrids.vcfInput files use the .GT format with space or tab-separated fields:
Chrom:Position Ind1 Ind2 Ind3 ...
The first field combines chromosome name and position with a colon. Each individual's genotype is encoded as allele1|allele2 or allele1/allele2:
|indicates phased data/indicates unphased data- Alleles are
0or1
Chr1:1000 0|0 0|1 1|1
Chr1:2000 0|1 1|1 0|0
Chr1:3500 0|0 0|1 0|1
Chr2:500 1|0 0|0 1|1
Chr2:1200 0|1 0|1 0|0
- Maximum 32 SNPs per chromosome/scaffold
- Maximum 1000 chromosomes/scaffolds
- Maximum 1000 individuals per population
- Input files may have different loci; missing loci are treated as missing data and integrated over
MONGRAIL handles phased and unphased genotype data differently for optimal performance and accuracy.
When all genotypes in the hybrid file use the | separator, MONGRAIL uses direct likelihood computation. This computes the probability of the specific observed haplotype pair for each individual, with O(H) complexity per chromosome (where H is the number of distinct haplotypes).
Advantages:
- Significantly faster computation
- No limit on the number of loci per chromosome
- The
-ppruning and-ksparse enumeration options are not needed
Note: The -p option is ignored for phased data (a warning is printed).
When genotypes use the / separator, MONGRAIL marginalizes over all possible phasings by enumerating haplotype pairs. This computes P(genotype | model) = Σ P(haplotype_pair | model) over all compatible phasings.
For chromosomes with many loci, use the -p or -k options to make computation tractable:
-p THRESH: Prune haplotype pairs with posterior probability below threshold-k K: Limit enumeration to haplotype pairs with at most K recombination events
If the hybrid file contains a mixture of phased (|) and unphased (/) genotypes, the entire dataset is treated as unphased. This is a conservative approach that ensures correct likelihood computation. To use the faster phased computation, ensure all genotypes are phased.
The likelihoods computed for phased and unphased data represent different quantities:
- Phased: P(specific haplotype pair | model)
- Unphased: P(genotype | model) = sum over all compatible phasings
For individuals homozygous at all sites, both methods produce identical results. For heterozygous individuals, unphased likelihoods are generally higher because they sum over multiple possible phasings.
When using the -c option, MONGRAIL reads standard VCF files.
- Phased genotypes: All genotypes must use
|separator (e.g.,0|1) - Biallelic SNPs only: No indels or multiallelic sites
- Maximum 16 SNPs per chromosome: Hard limit for computational tractability
- Different loci allowed: Files may have different loci; missing loci are treated as missing data
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT IND1 IND2 IND3
chr1 1000 . A G . PASS . GT 0|0 0|1 1|1
chr1 2000 . C T . PASS . GT 0|1 1|1 0|0
chr1 3500 . G A . PASS . GT 0|0 0|1 0|1
| Error | Cause |
|---|---|
variant at X:Y is not a biallelic SNP |
Indel or multiallelic variant found |
unphased genotype at X:Y |
Genotype uses / instead of ` |
too many SNPs on chromosome X |
More than 16 SNPs on a single chromosome |
Output is written to stdout in a format suitable for downstream processing:
# mongrail 2.1
# recomb_rate=1.00e-08 chroms=2 n_A=4 n_B=4 n_hyb=3
#
# Log-likelihoods followed by posterior probabilities (assuming uniform priors)
#
# indiv Ma Mb Mc Md Me Mf P(Ma) P(Mb) P(Mc) P(Md) P(Me) P(Mf)
0 -0.8825 -2.2860 -9.0206 -12.9229 -10.3721 -3.6376 0.7635 0.1876 0.0002 0.0000 0.0001 0.0486
1 -12.9229 -10.2429 -8.8892 -0.6351 -2.0255 -3.3791 0.0000 0.0001 0.0002 0.7612 0.1895 0.0490
2 -7.4921 -2.1017 -0.7920 -7.4921 -2.1017 -2.1018 0.0007 0.1489 0.5518 0.0007 0.1489 0.1489
- Header lines begin with
# - Each row contains log-likelihoods and posterior probabilities for one individual across all six models
- Log-likelihoods (Ma-Mf): The model with the highest (least negative) value is the best fit
- Posterior probabilities (P(Ma)-P(Mf)): Probability of each model given the data, assuming uniform priors. These sum to 1.0 for each individual
# Data only (no header)
./mongrail2 popA.GT popB.GT hybrids.GT | grep -v '^#'
# Pipe to file
./mongrail2 popA.GT popB.GT hybrids.GT > results.txtMONGRAIL evaluates six genealogical models for each individual:
| Model | Code | Description | Parental Origin |
|---|---|---|---|
| Ma | d | Pure population A | Both haplotypes from pop A |
| Mb | b | Backcross to A | One haplotype from A, one recombinant (F1 parent) |
| Mc | c | F1 hybrid | One haplotype from A, one from B |
| Md | a | Pure population B | Both haplotypes from pop B |
| Me | e | Backcross to B | One haplotype from B, one recombinant (F1 parent) |
| Mf | f | F2 hybrid | Both haplotypes recombinant (both parents F1) |
Population A (Ma) Population B (Md)
| |
+----------+----------+
|
F1 (Mc)
|
+----------+----------+
| |
Backcross A (Mb) Backcross B (Me)
| |
+----------+----------+
|
F2 (Mf)
For backcross (Mb, Me) and F2 (Mf) models, MONGRAIL accounts for recombination in the F1 parent's gamete. The recombinant haplotype probability is:
U(x) = Σ_z U(x|z) · Q(z|d,r)
Where:
zis the ancestry vector indicating which loci derive from population A vs BQ(z|d,r)is the probability of ancestry configuration given marker distances and recombination rateU(x|z)is the probability of observing haplotype x given ancestry z
The default recombination rate is 1e-8 per base pair (approximately 1 cM per Mb), which can be adjusted with the -r option.
MONGRAIL uses a Bayesian approach with:
- Prior: Symmetric Dirichlet prior on haplotype frequencies; uniform prior over genealogical models
- Likelihood: Posterior predictive distribution (Dirichlet-Multinomial)
- Inference: Posterior probability computation across models
The posterior predictive probability for observing a haplotype is:
P(haplotype | data) = (count + α) / (total + H·α)
Where α = 1/H is the prior pseudocount and H is the number of possible haplotypes.
The posterior probability for each genealogical model is computed as:
P(Model_i | data) = L_i / Σ_j L_j
Where L_i = exp(log-likelihood_i).
MONGRAIL 2.1 includes several optimizations for fast likelihood computation:
- Pre-computed recombination probabilities Q(z)
- Cached haplotype probabilities U(x)
- Static memory allocation in hot loops
- Compiler optimizations (-O3)
- Direct computation for phased data: O(H) per chromosome instead of O(n²) pair enumeration
Typical performance: ~0.3 seconds for 31 individuals across 34 chromosomes (unphased data). Phased data is significantly faster, especially for chromosomes with many heterozygous sites.
./mongrail2 examples/simulated_data/popA.GT \
examples/simulated_data/popB.GT \
examples/simulated_data/hybrids.GT./mongrail2 -r 2e-8 examples/puma/popA.GT \
examples/puma/popB.GT \
examples/puma/hybrids.GT./mongrail2 -v popA.GT popB.GT hybrids.GT > results.txt 2> summary.txt-
Use posterior probabilities: The P(M*) columns give the probability of each genealogical class. Values close to 1.0 indicate high confidence; values spread across multiple models indicate uncertainty.
-
Identify the best model: For each individual, the model with the highest posterior probability (or equivalently, highest log-likelihood) is the most probable classification.
-
Assess confidence:
- P > 0.95: Strong support for classification
- P = 0.5-0.95: Moderate support, consider alternative models
- P < 0.5: Weak support, multiple models plausible
-
Consider biological context:
- Pure individuals (Ma, Md) should have much higher likelihoods than hybrid models
- F1 hybrids (Mc) should show intermediate ancestry
- Backcrosses (Mb, Me) and F2 (Mf) show evidence of recombination
# indiv Ma Mb Mc Md Me Mf P(Ma) P(Mb) P(Mc) P(Md) P(Me) P(Mf)
0 -0.88 -2.29 -9.02 -12.92 -10.37 -3.64 0.7635 0.1876 0.0002 0.0000 0.0001 0.0486
1 -12.92 -10.24 -8.89 -0.64 -2.03 -3.38 0.0000 0.0001 0.0002 0.7612 0.1895 0.0490
2 -7.49 -2.10 -0.79 -7.49 -2.10 -2.10 0.0007 0.1489 0.5518 0.0007 0.1489 0.1489
- Individual 0: P(Ma)=0.76 indicates ~76% probability of being pure population A, with some support for Mb (backcross to A)
- Individual 1: P(Md)=0.76 indicates ~76% probability of being pure population B
- Individual 2: P(Mc)=0.55 suggests an F1 hybrid, but with notable uncertainty (Mb, Me, Mf each ~15%)
If you use MONGRAIL in your research, please cite:
Sneha Chakraborty, Bruce Rannala. 2025. Improved Bayesian inference of hybrids using genome sequences. bioRxiv 2025.12.26.696621; doi: https://doi.org/10.64898/2025.12.26.69662
Sneha Chakraborty, Bruce Rannala. 2023. An efficient exact algorithm for identifying hybrids using population genomic sequences, Genetics 223:4, iyad011, https://doi.org/10.1093/genetics/iyad011
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
For questions and bug reports, please open an issue on GitHub.
