nielshanson edited this page Sep 13, 2013 · 2 revisions

About

The GreenGenes 16S rRNA taxonomic database has been updated (May 2013) and has a new home on the web (http://greengenes.secondgenome.com/). However, the format of the database has changed: it no longer comes with a .fasta file that is amenable to BLAST database creation. This script combines the available Taxonomy, GenBank, and sequence references to construct such a file.

Common Usage

$ generate_ncbi_gss.py --fasta_dir fasta_directory [-p gss_parameter_file] -o output

where,

  • fasta_directory is a directory containing .fasta (.fa|.fas|.fasta|.fna|.f) files to be transformed into .gss files.
  • gss_parameter_file is an optional parameter file (see below) containing gss metadata related to the submission. If it is not provided, "dummy data" is loaded into each of the fields; these values can be replaced later, either programmatically or with a text editor's search-and-replace function.
  • output is the base name of the output files (.pub, .lib, and .cont) that will be associated with the .gss series.
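
As a rough illustration of how a directory could be scanned for the extensions listed above, here is a minimal Python sketch. It is not taken from generate_ncbi_gss.py; the function name find_fasta_files and the case-insensitive matching are assumptions made for this example.

```python
from pathlib import Path

# Extensions named in the usage notes: .fa, .fas, .fasta, .fna, .f
FASTA_EXTENSIONS = {".fa", ".fas", ".fasta", ".fna", ".f"}


def find_fasta_files(fasta_dir):
    """Return the sorted paths of files in fasta_dir that have a FASTA extension.

    Hypothetical helper: the real script may select its input files differently.
    """
    return sorted(
        path
        for path in Path(fasta_dir).iterdir()
        # Assumption: extensions are compared case-insensitively.
        if path.is_file() and path.suffix.lower() in FASTA_EXTENSIONS
    )
```

Calling find_fasta_files("example/") would return the FASTA files in that directory and skip anything else, such as parameter files.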

The library name for the .lib file is taken from the input fasta files. For each sequence, the identifier is taken from the header line (the line beginning with >) and placed in the GSS#: field. These identifiers should be renamed appropriately before submission to dbGSS.
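
The identifier extraction described above can be sketched in a few lines of Python. This is an illustration only, not the script's own code: the function name read_identifiers is hypothetical, and treating the first whitespace-delimited token after > as the identifier is an assumption.

```python
def read_identifiers(lines):
    """Yield one identifier per FASTA record from an iterable of text lines.

    Assumption: the identifier is the first whitespace-delimited token that
    follows the leading '>' on a header line.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip header lines that contain only '>'
                yield fields[0]


# Made-up records, used only to show the call.
example = [">seq_001 sample read", "ACGT", ">seq_002", "GGCC"]
identifiers = list(read_identifiers(example))
```

Listing the identifiers this way before running the script makes it easier to spot names that will need to be changed in the GSS#: fields.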

Example

Using some example .fasta files, contained in the /generate_ncbi_gss/example/ directory of this repo, and the template_gss_param.txt file, contained in /generate_ncbi_gss/, one can create a series of output files (.gss, .pub, .lib, and .cont) using the following command:

$ python generate_ncbi_gss.py --fasta_dir example/ -p template_gss_param.txt -o test

This will generate the three .gss files corresponding to the input .fasta files (12010.gss, 12500.gss, and 12200.gss) and the contact, library, and publication files (test.cont, test.lib, test.pub) for dbGSS submission.
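
To check that a run produced everything listed above, a small Python sketch along these lines may help. It assumes that each .gss file is named after the stem of its input file and that all outputs are written to a single directory; both are inferences from the example, and the function names are hypothetical.

```python
from pathlib import Path


def expected_outputs(fasta_names, output_prefix):
    """List the output file names expected for the given inputs and -o value.

    Assumption: each input yields '<stem>.gss', and the -o value is used as the
    base name of the .cont, .lib, and .pub files.
    """
    gss_files = [Path(name).stem + ".gss" for name in fasta_names]
    shared_files = [output_prefix + ext for ext in (".cont", ".lib", ".pub")]
    return gss_files + shared_files


def missing_outputs(fasta_names, output_prefix, directory="."):
    """Return the expected output names that are not present in directory."""
    return [
        name
        for name in expected_outputs(fasta_names, output_prefix)
        if not (Path(directory) / name).exists()
    ]
```

For instance, missing_outputs(["12010.fasta", "12500.fasta", "12200.fasta"], "test") returns an empty list when all six files are present in the current directory. The input file names here are guesses based on the .gss names mentioned above.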
