GitHub - khyox/draftGenomes: Script to collect sequence files from multiple NCBI WGS projects and process them (LEGACY SOFTWARE)

Collect all the NCBI WGS sequences for any taxonomic subtree

IMPORTANT NOTICE

LEGACY SOFTWARE: Due to recent changes in the NCBI database framework, this software no longer works as expected. While it will still retrieve old projects from the WGS database, you will need a complementary method to get new projects. See the most recent issues for some examples of the problems. Sorry, I had to move to a different research topic and I have no time to develop the major update needed to keep this working after the NCBI changes.

Overview

NCBI WGS (Whole Genome Shotgun) is a huge NCBI database of sequences from incomplete genomes that have been sequenced by a whole genome shotgun strategy. Those sequences belong to hundreds of thousands of different sequencing projects, each of which must be located and downloaded individually.

draftGenomes greatly simplifies the otherwise arduous task of collecting all the NCBI WGS sequences related to a taxonomic identifier (taxid) at any taxonomic level. The script downloads the appropriate sequence files from NCBI WGS projects and processes them to generate a single coherent FASTA file, parsing the sequence headers and updating them if needed.

Details

draftGenomes was originally conceived as a Python version of the taxid2wgs Perl script from NCBI, but it now goes beyond that initial purpose.

Because downloading and parsing NCBI WGS projects can take a long time (and require a lot of disk space) depending on the taxid selected, the script shows progress indicators and recovers from several kinds of errors. It also has a resume mode in case the process is fatally interrupted.
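The idea behind resuming can be sketched as follows. This is only a minimal illustration, not the script's actual logic: the function name `pending_projects`, the one-file-per-project layout, and the `.fsa_nt.gz` suffix are all assumptions made for the example.

```python
from pathlib import Path


def pending_projects(projects, download_dir):
    """Return the projects whose file is not yet present in download_dir.

    Assumption (hypothetical layout): each project is stored as a single
    file named "<project>.fsa_nt.gz". Files with any other name, such as
    a partial download saved as "<project>.fsa_nt.gz.part", do not count
    as complete.
    """
    download_dir = Path(download_dir)
    return [
        name for name in projects
        if not (download_dir / f"{name}.fsa_nt.gz").is_file()
    ]
```

A resumed run would then fetch only what `pending_projects` returns. Writing each download under a temporary name and renaming it once it is complete keeps a file cut short by an interruption from being mistaken for a finished one.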

In addition, there are some other modes of operation:

The reverse mode lets a second instance of the script manage the download of sequences without interfering with the first instance, which is also parsing the sequences to generate the resulting FASTA file.
The force mode ignores previous downloads and recreates the final FASTA file regardless of any previous run.
The download mode only downloads the WGS project files, without parsing them.
The verbose mode replaces the progress indicator with details about every project parsed.
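One plausible reading of the reverse mode, which the text above does not confirm, is that the second instance walks the project list from the opposite end and skips anything already fetched. The sketch below simulates that idea only; `download_order` and `simulate` are hypothetical names, and the real script may coordinate its instances differently.

```python
def download_order(projects, reverse=False):
    """Return projects in the order a given instance would fetch them."""
    return sorted(projects, reverse=reverse)


def simulate(projects):
    """Alternate steps between a forward and a reverse instance.

    Each instance skips any project the other has already fetched, so
    every project is fetched exactly once. Returns a dict mapping each
    project to the instance that fetched it.
    """
    forward = download_order(projects)
    backward = download_order(projects, reverse=True)
    done = {}
    for fwd, bwd in zip(forward, backward):
        for worker, name in (("forward", fwd), ("reverse", bwd)):
            if name not in done:
                done[name] = worker
    return done
```

Under this assumption the two instances meet roughly in the middle of the list, and neither repeats work the other has finished.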

It has been tested successfully on downloads on the order of terabytes, with several forced and unforced interruptions.

Installing

Just clone the GitHub repository or, even easier, download the script or copy and paste its source code.

Running

draftGenomes only requires a Python 3 interpreter. No packages beyond the Python Standard Library are needed.

The output file name has the format WGS4taxid{include}-{exclude}.fa, where {include} is the taxid of the root of the taxonomic subtree of interest and {exclude} (optional) is the taxid of the root of the taxa excluded from that subtree. Both taxids are options of the script (a run with no taxid-related arguments will test the script).
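As a small illustration of this naming scheme, the helper below builds the file name from the two taxids. `output_filename` is a hypothetical function written for this example, not part of the script, and dropping the `-{exclude}` part when no excluded taxid is given is an assumption: the text only states that {exclude} is optional.

```python
def output_filename(include, exclude=None):
    """Build a name of the form WGS4taxid{include}-{exclude}.fa.

    Assumption: when no exclude taxid is given, the "-{exclude}" part
    is omitted entirely.
    """
    name = f"WGS4taxid{include}"
    if exclude is not None:
        name += f"-{exclude}"
    return name + ".fa"
```

For example, with taxid 2 as the subtree root and taxid 1224 excluded, the helper returns `WGS4taxid2-1224.fa`.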

Please run ./draftGenomes --help to see all the options and their details.

References

GenBank WGS Projects: ftp://ftp.ncbi.nih.gov/genbank/wgs/README.genbank.wgs
WGS projects browser: https://www.ncbi.nlm.nih.gov/Traces/wgs/
WGS projects data: ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/
