Prepare_gg.py
The GreenGenes 16S rRNA taxonomic database has been updated (May 2013) and has a new home on the web (http://greengenes.secondgenome.com/). However, the format of the database has changed: it no longer comes with a .fasta file that is amenable to BLAST database creation. This script combines the available taxonomy, GenBank, and sequence references to construct such a file.
$ generate_ncbi_gss.py --fasta_dir fasta_directory [-p gss_parameter_file] -o output
where,
- fasta_directory is a directory containing FASTA files (.fa, .fas, .fasta, .fna, or .f) to be transformed into .gss files
- gss_parameter_file is an optional parameter file (see below) containing GSS metadata related to the submission. If it is not provided, "dummy data" will be loaded into each field; these values can be replaced later programmatically (or with a text editor's search-and-replace function)
- output is the name of the output files (.pub, .lib, and .cont) that will be associated with the .gss series.
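To illustrate how the files in fasta_directory might be selected, here is a minimal Python sketch. It is not the code used by generate_ncbi_gss.py; the function name find_fasta_files and the case-insensitive extension match are assumptions made for illustration.

```python
from pathlib import Path

# Extensions named in the fasta_directory description above.
FASTA_EXTENSIONS = {".fa", ".fas", ".fasta", ".fna", ".f"}


def find_fasta_files(fasta_dir):
    """Return the sorted paths of files in fasta_dir with a FASTA extension.

    Hypothetical helper: matching is case-insensitive, which is an
    assumption and may differ from what the real script does.
    """
    return sorted(
        path
        for path in Path(fasta_dir).iterdir()
        if path.is_file() and path.suffix.lower() in FASTA_EXTENSIONS
    )
```

Calling find_fasta_files("example/") would return only the entries whose suffix is in the set, so stray files such as notes or parameter files in the same directory are skipped.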
The library name for the .lib file is taken from the input FASTA files. For each sequence, the identifier is taken from the header line (the line beginning with >) and placed in the GSS#: field. These identifiers should be renamed appropriately before submission to dbGSS.
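The identifier handling described above can be sketched in Python as follows. This is an illustration only, not the script's actual code: the helper names are hypothetical, and both the choice of the first whitespace-delimited token as the identifier and the exact spacing after "GSS#:" are assumptions.

```python
def read_fasta_ids(lines):
    """Yield one identifier per FASTA header line (a line starting with '>').

    Assumption: the identifier is the first whitespace-delimited token
    after the '>' character; the real script may keep the whole line.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def gss_id_line(identifier):
    """Format an identifier as a 'GSS#:' field line (spacing is assumed)."""
    return f"GSS#: {identifier}"
```

For example, passing an open FASTA file handle to read_fasta_ids and mapping gss_id_line over the result gives one field line per sequence, which makes it easy to check which identifiers still need renaming before submission.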
Using the example .fasta files in the /generate_ncbi_gss/example/ directory of this repo and the template_gss_param.txt file in /generate_ncbi_gss/, one can create a series of output files (.gss, .pub, .lib, and .cont) with the following command:
$ python generate_ncbi_gss.py --fasta_dir example/ -p template_gss_param.txt -o test
This will generate the three .gss files corresponding to the input .fasta files (12010.gss, 12500.gss, and 12200.gss) and the contact, library, and publication files (test.cont, test.lib, test.pub) for dbGSS submission.
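As a quick sanity check after a run, the expected output names can be derived from the inputs. The sketch below assumes that each .gss file is named after the stem of its input FASTA file (for example, an input called 12010.fasta producing 12010.gss); that input name and the helper expected_outputs are hypothetical and not part of the script.

```python
from pathlib import Path


def expected_outputs(fasta_names, output):
    """List the output file names a run is expected to produce.

    Assumptions: one .gss file per input, named after the input's stem,
    plus <output>.cont, <output>.lib, and <output>.pub.
    """
    gss_files = [Path(name).stem + ".gss" for name in fasta_names]
    series_files = [f"{output}.{ext}" for ext in ("cont", "lib", "pub")]
    return gss_files + series_files
```

Comparing this list against the contents of the working directory shows at a glance whether any file is missing.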