Nybbler

Remove redundancy in SARS-CoV-2 genome sequence ensembles. The output can be easily "nibbled" to generate a useful subset of the ensemble, before ingestion by phylogenomic tools such as NextStrain.

Motivation

With the global viral sequencing effort generating 1,000,000+ genomes for SARS-CoV-2, we are starting to hit the limits of phylogenetic (a.k.a. evolutionary tree) methods to handle large inputs. This tool is intended to help researchers select a subset of those 1,000,000+ that fits their analysis needs, ideally with as little sampling bias as possible since to date some countries (e.g. U.K.) produce many sequences relative to their population, and others (e.g. U.S.) produce few genomes relative to the number of cases.

Installation

Clone this repository if you have not done so already:

$ git clone https://github.com/nodrogluap/nybbler
$ cd nybbler

This program is fairly simple but depends on minimap2 being in your path, as well as the Perl packages YAML::XS (for reading the config file), File::Fetch (for downloading reference data), and Time::Piece (for metadata date parsing and manipulations).

If you have these requirements already, you are good to go. If not, these requirements can be easily met with the included conda environment, with no administrator privileges required, like so:

$ conda env create -f nybbler.yml

Then any time you want to run the program start with:

$ conda activate nybbler

The input sequence.fasta and metadata.tsv data files needed to run Nybbler would typically be downloaded directly from GISAID under EpiCov -> Downloads, labelled "nextmeta" and "nextfasta".

The basics

As of 2020-08-28, there are 92144 "full" SARS-CoV-2 genomes in GISAID. Using a straight text comparison, 81730 of these are unique. On the other hand, only 73602 are non-redundant after filtering changes in the error-prone end regions of the genome, and selected internal high frequency error sites as defined by careful manual tree analysis (e.g. Nybbler uses Weilguny et al., 2020 by default). See the "Advanced usage" section below for details on how a core of ~7000 genomes can be derived using Nybbler.

Nybbler maps all of the reads to the reference genome (currently MN908947.3), then constructs a hash table key consisting of the CIGAR and MD tags of a sensitivity-tweaked minimap2 alignment output, supplemented with the actual reference insertion bases from the alignments. The CIGAR and MD are modified by Nybbler to incorporate the end mask and internal site mask configuration.

Sequences that are exactly the same after masking, but are from different countries or more specific locations can be of particular interest for epidemiology, therefore Nybbler allows non-redundancy to be rolled up globally (default, no arg), per region (0), country (1), state/province (2), or city/area (3) with the optional last argument. So, global analysis:

$ ./nybbler minimum_config.yaml my_output_dir

Whereas for country-level analysis:

$ ./nybbler minimum_config.yaml my_output_dir_by_country 1

Nybbler is designed to be compatible with the defaults/parameters.yaml file in NextStrain's ncov, which could be used for downstream processing of the sequence (FASTA) and metadata (TSV) files generated by this tool in my_output_dir. The minimum format of the config file is provided as minimum_config.yaml in this repository, in case you aren't using NextStrain.

Advanced usage

Troubleshooting

Occasionally you will get an error message like below saying that something at the NCBI is not available. This is a transient bug in the NCBI e-utils availability, simply rerun the script and it should go away.

(nybbler) -bash-4.2$ ./nybbler minimum_config.yaml testoutdir
Fetching masking site definitions
Loading sequences and calculating keys
Could not find QueryKey in results for NCBI E-Utils query https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&usehistory=y&term=MN908947.3[accn]
Search response was:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult>
	<ERROR>Search Backend failed: Database is not supported: nuccore</ERROR>
</eSearchResult>

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
defaults		defaults
LICENSE		LICENSE
README.md		README.md
country_data.csv		country_data.csv
country_data.md		country_data.md
country_ideal_sequences.csv		country_ideal_sequences.csv
minimum_config.yaml		minimum_config.yaml
nybbler		nybbler
nybbler.yml		nybbler.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

defaults

defaults

LICENSE

LICENSE

README.md

README.md

country_data.csv

country_data.csv

country_data.md

country_data.md

country_ideal_sequences.csv

country_ideal_sequences.csv

minimum_config.yaml

minimum_config.yaml

nybbler

nybbler

nybbler.yml

nybbler.yml

Repository files navigation

Nybbler

Motivation

Installation

The basics

Advanced usage

Troubleshooting

About

Releases

Packages

Contributors 2

Languages

License

nodrogluap/nybbler

Folders and files

Latest commit

History

Repository files navigation

Nybbler

Motivation

Installation

The basics

Advanced usage

Troubleshooting

About

Resources

License

Stars

Watchers

Forks

Languages