A model-based tool to quantify the difficulty of calling a variant given genomic context.
Intuitively, we understand that accurately calling variants in a genome can be more or less difficult depending on the genomic context of each variant. For example, many sequencing technologies have higher error rates in homopolymers, and this error rate generally increases as homopolymers get longer. However, precisely quantifying the relationship between these errors, the length of the homopolymer, and the impact on the resulting variant call remains challenging. Analogous arguments can be made for other "repetitive" regions in the genome, such as tandem repeats, segmental duplications, transposable elements, and difficult-to-map regions.
The solution we present here is to use an interpretable modeling framework called explainable boosting machines (EBMs) to predict variant calling errors as a function of genomic features (e.g., whether or not the variant is in a tandem repeat, homopolymer, etc.). The interpretability of the model is important because it allows end users to understand the relationship each feature has to the prediction, which facilitates understanding (for example) at what homopolymer lengths the likelihood of incorrectly calling a variant drastically increases. This precision is an improvement over existing methods we have developed for stratifying the genome by difficulty into discrete bins. Furthermore, this modeling framework can capture interactions between different genomic contexts, which is important because many repetitive characteristics do not occur in isolation.
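As a rough illustration of the modeling approach (not the pipeline's actual training code), fitting an EBM with the open-source interpret package might look like the sketch below; the feature names, label encoding, and toy data are invented for the example:

```python
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier

# toy training table: one row per variant call, with invented context features
df = pd.DataFrame({
    "homopolymer_length": [0, 4, 12, 2, 15, 1, 8, 20, 0, 6],
    "tandem_repeat_length": [0, 0, 30, 0, 60, 0, 0, 45, 10, 0],
    "in_segdup": [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    "label": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],  # 1 = correct call, 0 = error
})

# allow pairwise interactions so that combined contexts (e.g. a homopolymer
# inside a tandem repeat) can be modeled and inspected
ebm = ExplainableBoostingClassifier(interactions=10)
ebm.fit(df.drop(columns="label"), df["label"])

# per-feature (and per-interaction) shape functions drive interpretation
explanation = ebm.explain_global()
```

The per-feature shape functions are what make EBMs interpretable: each one shows how the predicted error likelihood changes across that feature's range.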
We anticipate StratoMod will be useful both for method developers and for clinicians who wish to better understand variant calling error modalities. In the case of method development, StratoMod can be used to accurately compare the error modalities of different sequencing technologies. For clinicians, it can be used to determine in which regions/genes (which may be clinically interesting for a given study) variant calling errors are likely to occur, which may in turn inform which technologies and/or other mitigation strategies should be employed.
Further information can be found in our preprint.
- Compare a user-supplied query vcf with a GIAB benchmark vcf to produce labels (true positive, false positive, false negative). These labels comprise the dependent variable used in model training downstream.
- Intersect the comparison output labels with genomic features to produce the features (independent variables) used in model training.
- Train the EBM model with a random holdout for testing.
- If desired, test the model on other query vcfs (which may or may not also be labeled with a benchmark comparison).
- Inspect the output features (plots showing the profile of each feature and its effect on the label).
NOTE: because the model is a binary classifier, only two labels can be used at once. This means either one of the three labels must be omitted or two must be combined into a single label (see the sketch below).
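For instance, one way to binarize the labels (assuming a column of tp/fp/fn strings; the actual column names and label spellings in the pipeline's output may differ) would be:

```python
import pandas as pd

labels = pd.Series(["tp", "fp", "fn", "tp", "fp"])  # from the benchmark comparison

# option 1: combine FP and FN into a single "error" class
y_error = (labels != "tp").astype(int)

# option 2: omit one label entirely, e.g. keep only TP vs FP
mask = labels.isin(["tp", "fp"])
y_tp_vs_fp = (labels[mask] == "fp").astype(int)
```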
The only mandatory user-supplied input is a query vcf. Optionally, one can supply other vcfs for testing the model.
Unless one is using esoteric references or benchmarks, the pipeline is preconfigured to retrieve commonly-used data defined by flags in the configuration. This includes:
- a GIAB benchmark, including the vcf, bed, and reference fasta
- reference-specific bed files which will provide "contextual" features for each
variant call, including:
- difficult-to-map regions (GIAB stratification bed file)
- segmental duplications (UCSC superdups database)
- tandem repeats (UCSC simple repeats database)
- transposable elements (UCSC Repeat Masker)
This assumes the user has a working conda or mamba installation (tested below with mamba 1.5.1/conda 23.7.4).
Run the following to set up the runtime environment:
mamba env create -f env.yml
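Then activate the environment before running the pipeline (assuming the environment defined in env.yml is named stratomod; check the name field in env.yml if unsure):

conda activate stratomod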
A sample configuration file may be found in config/testing.yml, which may be copied as a starting point and modified to one's liking. This file is heavily annotated to explain all the options/flags and their purpose. For a list of features which may be used, see FEATURES.md.
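As a hypothetical outline of the overall shape (only the models, runs, labeled_queries, and unlabeled_queries keys are taken from this document; the other names are invented and most fields are elided, so consult config/testing.yml for the real structure):

```yaml
labeled_queries:
  HG002_query: ...        # <query_key>: a query vcf paired with a benchmark
unlabeled_queries:
  clinical_sample: ...    # <query_key>: a query vcf used only for testing
models:
  small_variants:         # <model_key>
    runs:
      default: ...        # <run_key>
```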
Execute the pipeline using snakemake:
snakemake --use-conda -c <num_cores> --rerun-incomplete --configfile=config/<confname.yml> all
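For example, to run the bundled test configuration on 8 cores:

snakemake --use-conda -c 8 --rerun-incomplete --configfile=config/testing.yml all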
Each model has a report at results/model/<model_key>-<filter_key>-<run_key>/summary.html which contains model performance curves and feature plots (the latter allow model interpretation). Here <model_key> is the key under the models section in the config, <filter_key> is either SNV or INDEL depending on what was requested, and <run_key> is the key under the models -> <model_key> -> runs section in the config.
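For example, with a (hypothetical) model key of small_variants and run key of default, the SNV report would be at results/model/small_variants-SNV-default/summary.html.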
All raw data for the models will be saved alongside the model report (see above). This includes the input tsv of data used to train the EBM, a config yml file with all settings used to train the EBM for reference, and python pickles for the X/Y train/test datasets as well as a pickle for the final model itself.
Within the run directory will also be a test directory containing all test runs (e.g. the results of the model test and the input data used for the test).
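To poke at a trained model outside the pipeline, the pickled model can be loaded directly; the run directory below reuses the hypothetical keys from above, and the pickle filename is a guess (check the run directory for the actual name):

```python
import pickle

# hypothetical run directory and pickle filename; adjust to your actual run
run_dir = "results/model/small_variants-SNV-default"
with open(f"{run_dir}/model.pickle", "rb") as f:
    ebm = pickle.load(f)

# the loaded object is the fitted EBM and can be inspected or re-scored
print(ebm)
```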
In addition to the model data itself, the raw input data (that is, the master dataframe with all features for each query vcf prior to filtering/transformation) can be found in results/annotated_variants/{unlabeled,labeled}/<query_key>, where <query_key> is the key under either labeled_queries or unlabeled_queries in the config.
Each of these directories contains the raw dataframe itself (for both SNVs and INDELs) as well as an HTML report summarizing the dataframe (statistics for each feature, distributions, correlations, etc.).
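These raw dataframes can also be loaded directly for ad-hoc analysis; the query key and filename below are hypothetical (check the directory for the actual file names and format):

```python
import pandas as pd

df = pd.read_csv(
    "results/annotated_variants/labeled/HG002_query/variants.tsv", sep="\t"
)
print(df.describe())  # quick numeric summary, complementary to the HTML report
```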
By convention, the conda environment specified by env.yml only has runtime dependencies for the pipeline itself.
To install development environments, run the following:
./setup_dev.sh
In addition to creating new environments, this script will update existing ones if they are changed during development.
Note that scripts in the pipeline are segregated by environment in order to prevent dependency hell while maintaining reproducible builds. When editing, one will need to switch between environments in the IDE in order to benefit from the features they provide. Further details on which environments correspond to which files can be found in workflow/scripts.
Note that this will only install environments necessary for running scripts (e.g. rules with a script directive).
All Python code should be error-free when finalizing any new features. Linting is performed automatically as part of the CI/CD pipeline, but to run it manually, invoke the following:
./lint.sh
This assumes all development environments are installed (see above).
There are two primary branches: master and develop.
Make a new branch off of develop for a new feature/bugfix, then merge it into develop when done (note the --no-ff flag below).
git checkout develop
git branch <new_feature>
git checkout <new_feature>
# do a bunch of stuff...
git checkout develop
git merge --no-ff <new_feature>
After feature(s) have been added and all tests have succeeded, update the changelog, add a tag, and merge into master. Use semantic versioning for tags.
# update changelog
vim CHANGELOG.md
git add CHANGELOG.md
git commit
git tag vX.Y.Z
git checkout master
git merge --no-ff vX.Y.Z
NOTE: do not add an experiment-specific configuration to master or develop; the yml files in config on these branches are used for testing.
To make an experiment, one has several options:
- import this pipeline as a module in a wrapper pipeline which contains the experimental config (recommended; see the sketch below)
- track experimental configurations in a separate directory
- fork this repo and add experimental configurations in your fork
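For the first option, a minimal wrapper Snakefile might look like the sketch below; the repository coordinates, tag, and config filename are placeholders and should point at the desired release of this repo:

```
# wrapper workflow/Snakefile for an experiment (sketch; placeholders throughout)
configfile: "config/experiment.yml"  # experiment-specific config lives in the wrapper repo

module stratomod:
    snakefile:
        github("example-org/StratoMod", path="workflow/Snakefile", tag="vX.Y.Z")
    config:
        config

# re-export all rules from the imported pipeline so the usual targets still work
use rule * from stratomod
```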