Inversion Benchmark Data Set

We composed a benchmark data set for evaluating SNP-based inversion detection methods. We collected variant data from three insect data sets. We provide links to the original data sources, a pipeline for processing the raw data into the final usable product for the benchmark, and labels of samples' inversion genotypes.

Overview of Data Sets

Genotype labels are provided under the inversion_genotypes directory, and inversion boundaries are provided under the inversion_boundaries directory.

Data Set	Citation	Populations (samples)
1000 Anopheles genomes	Miles, et al. (2017)	gambiae (81), coluzzii (69)
DGRPv2	Mackay, et al. (2012); Huang, et al. (2014)	all (198)
UBC Sunflower Genome Project	Todesco, et al. (2020)	petiolaris (166), fallax (234), niveus (86)
Peach	Guan, et al. (2021)	persica (149), kansuensis (37)
Blue tit	Perrier, et al. (2020)	mainland (111), corsica (343)

Positives

These positives examples were curated to select subpopulations with clear signals.

Chromosome	Population	Inversions	Genotypes Available
ag1000g_2L	An. gambiae (B. Faso)	2La	Yes
ag1000g_2R	An. gambiae (B. Faso)	2Rb	Yes
ag1000g_2R	An. coluzzii (B. Faso)	2Rbc, 2Rd, 2Ru	Yes
cyan_chrom03	C. caeruleus (mainland)	chrom03.01	Predicted
dgrp2_2L	D. melanogaster	In(2L)t	Yes
dgrp2_2R	D. melanogaster	In(2R)ns	Yes
dgrp2_3R	D. melanogaster	In(3R)p, In(3R)mo, In(3R)k	Yes
peach_pp06	P. persica	pp06.01	No
sunflowers_pet17	H. niveus canescens	pet17.03	Predicted
sunflowers_pet05	H. petiolaris petiolaris	pet05.01	Predicted
sunflowers_pet09	H. petiolaris petiolaris	pet09.01	Predicted
sunflowers_pet11	H. petiolaris petiolaris	pet11.01	Predicted
sunflowers_pet17	H. petiolaris petiolaris	pet17.01, pet17.02, pet17.04	Predicted
sunflowers_pet09	H. petiolaris fallax	pet09.01	Predicted
sunflowers_pet11	H. petiolaris fallax	pet11.01	Predicted
sunflowers_pet17	H. petiolaris fallax	pet17.01, pet17.03	Predicted

Negatives

Chromosome	Population
ag1000g_3L	An. coluzzii (B. Faso)
ag1000g_3R	An. gambiae (B. Faso)
ag1000g_3R	An. coluzzii (B. Faso)
cyan_chrom03	C. caeruleus (corsica)
dgrp2_3L	D. melanogaster
peach_pp06	P. kansuensis
sunflowers_pet05	H. niveus canescens
sunflowers_pet09	H. niveus canescens
sunflowers_pet11	H. niveus canescens
sunflowers_pet05	H. petiolaris fallax

Uncharacterized phenomena

Chromosome	Population
ag1000g_3L	An. gambiae (B. Faso)
ag1000g_X	An. gambiae (B. Faso)
ag1000g_X	An. coluzzii (B. Faso)

Machine Learning Problems

There are several potential machine learning problems associated with these data.

Inversion Segmentation

Given variant data from a population of samples for a chromosome, determine if there are any inversions and their locations. Existing tools require quite a bit of manual parameter tuning to perform localization (segmentation). The goal would be a develop a method that is completely and capable of working even when data include "noise" resulting from pooling samples from multiple locations or closely-related species.

The input to the problem would be a 2D matrix of allele counts. Each position along the chromosome in which nucleotide variation was detected has a column in the matrix. (Positions with fixed nucleotides are excluded -- the matrix is effectively sparse.) The allele counts for each variant are stored in each row. The output would be a 1D vector with an entry for each column with a value of 0 indicating "no inversion" and 1 indicating "inversion detected". See the numpy output format description.

Some thoughts on the approach and associated challenges:

Perform the evaluation with each chromosome held out once. There are 16 total chromosomes so there would 16 models trained and evaluated.
The number of samples will vary across chromosomes. You can add rows with zeros (zero padding) so that all matrices have the same number of rows.
The ordering of the samples is not relevant. The rows can be permuted in the matrix and you should get the same result. There are several ways to handle this. You can generate permutations of the matrices (increasing the training set size) or find a way to sort the entries in a canonical order, or explore deep learning methods like Deep Sets that are permutation invariant.
The inversion frequencies will vary quite substantially in potential target data sets. You can use bootstrap sampling (sampling with replacement) to create new data sets for each chromosome that simulate different inversion frequencies. This is effectively a form of data augmentation.
Given the small number of samples, methods such as RNNs and LSTMs might not work. Instead, you might want to consider dividing the chromosomes into overlapping windows and performing classification on each window. Combine the results by averaging the predictions from each window.
Use the Sørensen–Dice coefficient, Jaccard Similarity, or precision / recall on the predicted and ground truth masks.

Genotype Prediction

Given variant data in a known inversion region from a population of samples for a chromosome, determine the inversion genotypes of the samples (homozygeous inverted, homozygeous standard, and heterozygeous). Existing tools require quite a bit of manual parameter tuning to perform localization (segmentation). The goal would be a develop a method that is completely and capable of working even when data include "noise" resulting from pooling samples from multiple locations or closely-related species.

The input would be a 2D matrix of allele counts. Each position along the chromosome in which nucleotide variation was detected has a column in the matrix. (Positions with fixed nucleotides are excluded -- the matrix is effectively sparse.) The allele counts for each variant are stored in each row. The output would be a 1D vector with an entry for each row (sample) indicating the number of inversion copies (0, 1, or 2). A simpler problem would be to determine if the sample had at least 1 inversion; in which case, the output vector would only contain 0s or 1s. See the numpy output format description.

Some thoughts on the approach and associated challenges:

When chromosomes only have a small number of samples of a particular inversion type, exclude that class. For example, only two of the Anopheles gambiae samples arehomozygous for the standard orientation of 2Rb, only 1 Anopheles coluzzii sample is homozygous for one of the inversion orientations for 2Rc and 2Rd, and only two of the blue tit samples are homozygous for one of the inversion genotypes.

Setup Instructions

You will need to download:

Drosophila: The dgrp2.vcf VCF file from the Drosophila Genetics Reference Panel v2
Anopheles: The biallelic 2L, 2R, 3L, 3R, and X VCF files from the FTP site for the 1000 Anopheles Genomes project phase 1 AR3 data release (ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/AR3/variation/main/vcf/). These files are named: ag1000g.phase1.ar3.pass.biallelic.{chrom}.vcf.gz where chrom is 3L, 2R, or 2L.
H. petiolaris: Petiolaris.pet_gwas.tranche90_snps_bi_AN50_AF99.vcf.gz from the UBC Sunflower Genome project
Prunus: SNP.vcf.gz (for Prunus) from Figshare
Cyanistes: BLUE2020VCF.vcf.gz from Dryad

Place these files in the directory input_files. The directory contains an empty file named PLACE_INPUT_FILES_HERE.

To run the pipeline, you will need to install:

Running Pipeline to Generate Output Files

Once you've downloaded the data, you can process the individual data sets with the following commands:

$ snakemake prepare_all

Each task is assigned one thread. If you want to run multiple tasks concurrently, use Snakemake's --cores flag:

$ snakemake --cores 4 prepare_all

Output File Formats

The pipeline will convert the variants into three file formats:

VCF (potentially gzipped): This file format can be read by Asaph
Plink bed: This file format can be read by pcadapt
inveRsion: This text file format can be read by inveRsion
Numpy matrices (.npz): Each Numpy .npz file contains three objects. Allele counts are stored as a 2D matrix. Each position along the chromosome in which nucleotide variation was detected has a column in the matrix. (Positions with fixed nucleotides are excluded -- the matrix is effectively sparse.) The allele counts for each variant are stored in each row. Values are either 0 (homozygeous reference), 1 (heterozygous), 2 (homozygeous alternate), or -1 (unknown). A mapping of column indices to genomic coordinates is provided as a 1D array with one entry for each column of the allele counts matrix. A ground truth mask is provided as a 1D array (1 for inversion, 0 for no inverson) with one entry for each column of the allele counts matrix.

Citing

If you use this data set, please cite the original papers from which the data are derived:

Anopheles 1000 Genomes: Miles, A., Harding, N., Bottà, G. et al. Genetic diversity of the African malaria vector Anopheles gambiae. Nature 552, 96–100 (2017).
Anopheles 1000 Genomes: Corbett-Detig, R. B., Said, I., Calzetta, M., et al. Fine-Mapping Complex Inversion Breakpoints and Investigating Somatic Pairing in the Anopheles gambiae Species Complex Using Proximity-Ligation Sequencing, Genetics, Volume 213, Issue 4, 1 December 2019, Pages 1495–1511
Drosophila Genetics Reference Panel: Mackay, T., Richards, S., Stone, E., et al. The Drosophila melanogaster Genetic Reference Panel. Nature 482, 173–178 (2012).
Drosophila Genetics Reference Panel: Huang, W., Massouras, A., Inoue, Y., et al. Natural variation in genome architecture among 205 Drosophila melanogaster Genetic Reference Panel lines. Genome Research 24:1193-1208 (2014).
UBC Sunflower Genome project: Todesco, et al. Massive haplotypes underlie ecotypic differentiation in sunflowers Nature (2020).
Peach: Guan, et al. Genome structure variation analyses of peach reveal population dynamics and a 1.67 Mb causal inversion for fruit shape. Genome Biology. (2021)
Blue tit: Perrier, et al. Demographic history and genomics of local adaptation in blue tit populations. Evolutionary Applications. 2020.

and our paper describing the test set:

Nowling, R.J., Manke, K.R., and Emrich, S.J. Detection of Inversions with PCA in the Presence of Population Structure. PLOS One (2020).

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
input_files		input_files
inversion_boundaries		inversion_boundaries
inversion_genotypes		inversion_genotypes
sample_lists		sample_lists
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
inversions.yaml		inversions.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

input_files

input_files

inversion_boundaries

inversion_boundaries

inversion_genotypes

inversion_genotypes

sample_lists

sample_lists

scripts

scripts

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Snakefile

Snakefile

inversions.yaml

inversions.yaml

Repository files navigation

Inversion Benchmark Data Set

Overview of Data Sets

Positives

Negatives

Uncharacterized phenomena

Machine Learning Problems

Inversion Segmentation

Genotype Prediction

Setup Instructions

Running Pipeline to Generate Output Files

Output File Formats

Citing

About

Releases

Packages

Languages

License

nowling-lab/inversion-test-set

Folders and files

Latest commit

History

Repository files navigation

Inversion Benchmark Data Set

Overview of Data Sets

Positives

Negatives

Uncharacterized phenomena

Machine Learning Problems

Inversion Segmentation

Genotype Prediction

Setup Instructions

Running Pipeline to Generate Output Files

Output File Formats

Citing

About

Resources

License

Stars

Watchers

Forks

Languages