Skip to content

Targeted subsampling of SARS-CoV-2 genome sequence ensembles used in genomic epidemiology

License

Notifications You must be signed in to change notification settings

nodrogluap/nybbler

Repository files navigation

Nybbler

Remove redundancy in SARS-CoV-2 genome sequence ensembles. The output can be easily "nibbled" to generate a useful subset of the ensemble, before ingestion by phylogenomic tools such as NextStrain.

Motivation

With the global viral sequencing effort generating 1,000,000+ genomes for SARS-CoV-2, we are starting to hit the limits of phylogenetic (a.k.a. evolutionary tree) methods to handle large inputs. This tool is intended to help researchers select a subset of those 1,000,000+ that fits their analysis needs, ideally with as little sampling bias as possible since to date some countries (e.g. U.K.) produce many sequences relative to their population, and others (e.g. U.S.) produce few genomes relative to the number of cases.

Installation

Clone this repository if you have not done so already:

$ git clone https://github.com/nodrogluap/nybbler
$ cd nybbler

This program is fairly simple but depends on minimap2 being in your path, as well as the Perl packages YAML::XS (for reading the config file), File::Fetch (for downloading reference data), and Time::Piece (for metadata date parsing and manipulations).

If you have these requirements already, you are good to go. If not, these requirements can be easily met with the included conda environment, with no administrator privileges required, like so:

$ conda env create -f nybbler.yml

Then any time you want to run the program start with:

$ conda activate nybbler

The input sequence.fasta and metadata.tsv data files needed to run Nybbler would typically be downloaded directly from GISAID under EpiCov -> Downloads, labelled "nextmeta" and "nextfasta".

The basics

As of 2020-08-28, there are 92144 "full" SARS-CoV-2 genomes in GISAID. Using a straight text comparison, 81730 of these are unique. On the other hand, only 73602 are non-redundant after filtering changes in the error-prone end regions of the genome, and selected internal high frequency error sites as defined by careful manual tree analysis (e.g. Nybbler uses Weilguny et al., 2020 by default). See the "Advanced usage" section below for details on how a core of ~7000 genomes can be derived using Nybbler.

Nybbler maps all of the reads to the reference genome (currently MN908947.3), then constructs a hash table key consisting of the CIGAR and MD tags of a sensitivity-tweaked minimap2 alignment output, supplemented with the actual reference insertion bases from the alignments. The CIGAR and MD are modified by Nybbler to incorporate the end mask and internal site mask configuration.

Sequences that are exactly the same after masking, but are from different countries or more specific locations can be of particular interest for epidemiology, therefore Nybbler allows non-redundancy to be rolled up globally (default, no arg), per region (0), country (1), state/province (2), or city/area (3) with the optional last argument. So, global analysis:

$ ./nybbler minimum_config.yaml my_output_dir

Whereas for country-level analysis:

$ ./nybbler minimum_config.yaml my_output_dir_by_country 1

Nybbler is designed to be compatible with the defaults/parameters.yaml file in NextStrain's ncov, which could be used for downstream processing of the sequence (FASTA) and metadata (TSV) files generated by this tool in my_output_dir. The minimum format of the config file is provided as minimum_config.yaml in this repository, in case you aren't using NextStrain.

Advanced usage

Troubleshooting

Occasionally you will get an error message like below saying that something at the NCBI is not available. This is a transient bug in the NCBI e-utils availability, simply rerun the script and it should go away.

(nybbler) -bash-4.2$ ./nybbler minimum_config.yaml testoutdir
Fetching masking site definitions
Loading sequences and calculating keys
Could not find QueryKey in results for NCBI E-Utils query https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&usehistory=y&term=MN908947.3[accn]
Search response was:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult>
	<ERROR>Search Backend failed: Database is not supported: nuccore</ERROR>
</eSearchResult>

About

Targeted subsampling of SARS-CoV-2 genome sequence ensembles used in genomic epidemiology

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages