generate_ncbi_gss
A script to create .gss files and the parallel .pub, .lib, and .cont files from input .fasta files containing sequences for submission to dbGSS. As of May 2013, .gss is the preferred format for submitting fosmid-end sequences to the NCBI. This script generally assumes that the sequences have already undergone some form of quality control from the original sequence trace files, for example using Phred, Phrap, or Consed from the Phil Green Laboratory at the University of Washington. However, it also performs additional quality control steps for head-tail ambiguous bases, general ambiguous bases, length distribution, and heteropolymer repeats. These settings were appropriate for fosmid-end submission to the NCBI in Summer 2013, but are subject to change. Please contact the NCBI (info@ncbi.nlm.nih.gov) about fosmid-end submission settings to stay up to date on their submission requirements.
$ generate_ncbi_gss.py --fasta_dir fasta_directory [-p gss_parameter_file] -o output
where,
- fasta_directory is a directory containing .fasta (.fa|.fas|.fasta|.fna|.f) files to be transformed into .gss files
- gss_parameter_file is an optional parameter file (see below) containing gss metadata related to the submission. If not provided, "dummy data" will be loaded into each of the fields, which can be replaced programmatically (or using a text editor's search and replace function) later
- output is the name of the output files (.pub, .lib, and .cont) that will be associated with the .gss series.
The library name for the .lib file is taken from the input fasta files. For each sequence, the identifier is taken from the header line (the line beginning with >) and placed in the GSS#: field. These identifiers should be renamed appropriately before dbGSS submission.
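As a rough illustration of the input handling described above, the sketch below collects FASTA files by extension and pulls the identifier from each header line. It is not the script's actual code: the function names `find_fasta_files` and `read_fasta` are hypothetical, and taking the first whitespace-delimited token of the header as the identifier is an assumption.

```python
from pathlib import Path

# Extensions listed in this README; the real script may match them differently.
FASTA_EXTENSIONS = {".fasta", ".fa", ".fas", ".fna", ".f"}


def find_fasta_files(fasta_dir):
    """Return the FASTA files found directly inside fasta_dir, sorted by path."""
    return sorted(
        p for p in Path(fasta_dir).iterdir()
        if p.is_file() and p.suffix.lower() in FASTA_EXTENSIONS
    )


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    identifier, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first token after '>'.
            tokens = line[1:].split()
            identifier, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)
```

With these helpers, looping over `find_fasta_files("example/")` and passing each opened file to `read_fasta` would give the identifiers that end up in the GSS#: fields, which makes it easy to check them before renaming.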
Using the example .fasta files contained in the /generate_ncbi_gss/example/ directory of this repo and the template_gss_param.txt file contained in /generate_ncbi_gss/, one can create a series of output files (.gss, .pub, .lib, and .cont) using the following command:
$ python generate_ncbi_gss.py --fasta_dir example/ -p template_gss_param.txt -o test
This will generate the three .gss files corresponding to the input .fasta files (12010.gss, 12500.gss, and 12200.gss) and the contact, library, and publication files (test.cont, test.lib, test.pub) for dbGSS submission.
Because the dbGSS database also contains a number of high-quality sequences (e.g., random "single pass read" genome survey sequences, single pass reads from cosmid/BAC/YAC ends, exon trapped genomic sequences, or Alu PCR sequences), the NCBI has additional quality control requirements for submission. By default, generate_ncbi_gss.py performs the following QC steps; each is parameter-controlled and can be changed should submission requirements change:
- Ambiguous bases (non-ATCG) in the first or last N nucleotides of the sequence. The sequence is trimmed from the left and right to exclude all ambiguous bases. By default N is set to 100 nucleotides.
- Length distribution restrictions on fosmid-ends. generate_ncbi_gss.py has two parameters, L_min and L_max, that specify the minimum and maximum length of sequences being submitted. Sequences that do not fall into this range are excluded from the output .gss file.
- Ambiguous bases in the sequence overall. A sequence is excluded from the output .gss file if more than p percent of the nucleotides in the sequence are ambiguous (non-ATCG). By default this is set to p=5%.
- Sequences with repetitive sequences (heteropolymers). A repetitive violation is m consecutive repetitive subunits of length less than l. A sequence is excluded from the output .gss file if more than n such violations are found. By default l=2, m=10, and n=0, so a single violation is enough to exclude a sequence from the output .gss file.
Note: This last stringent requirement of excluding repetitive sequences appears to be required for fosmid-end submissions to the NCBI. If your sample contains an excessive number of repetitive sequences, it might be prudent to consult the original sequence trace files and rerun with more stringent sequencing cut-offs.
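The four QC steps above can be sketched roughly as follows. This is one possible reading of the description, not the script's actual implementation, and all function names are hypothetical. Assumptions: trimming is a single pass that cuts past the last ambiguous base in the first N positions and before the first ambiguous base in the last N positions; repeat subunits are taken to have length 1 through l inclusive (the wording "less than l" could also mean strictly shorter); and a violation is a run of at least m consecutive copies. Defaults for L_min and L_max are not stated here, so they are required arguments.

```python
UNAMBIGUOUS = set("ACGT")


def trim_ambiguous_ends(seq, n=100):
    """Single-pass trim of ambiguous bases found in the first and last n positions."""
    seq = seq.upper()
    # Index just past the last ambiguous base in the head window (0 if none).
    start = max((i + 1 for i, b in enumerate(seq[:n]) if b not in UNAMBIGUOUS), default=0)
    seq = seq[start:]
    offset = max(len(seq) - n, 0)
    # Index of the first ambiguous base in the tail window (sequence end if none).
    end = min((offset + i for i, b in enumerate(seq[offset:]) if b not in UNAMBIGUOUS),
              default=len(seq))
    return seq[:end]


def ambiguous_percent(seq):
    """Percentage of non-ACGT characters in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return 100.0 * sum(1 for b in seq.upper() if b not in UNAMBIGUOUS) / len(seq)


def count_repeat_violations(seq, l=2, m=10):
    """Count runs of at least m consecutive copies of a subunit of length 1..l."""
    seq = seq.upper()
    violations = 0
    for k in range(1, l + 1):
        i = 0
        while i + k <= len(seq):
            j = i + k
            # Extend the run while the sequence keeps repeating with period k.
            while j < len(seq) and seq[j] == seq[j - k]:
                j += 1
            unit = seq[i:i + k]
            # Skip units that are repeats of a shorter unit (e.g. "AA"),
            # so a homopolymer run is not counted once per subunit length.
            is_primitive = unit not in (unit + unit)[1:-1]
            if is_primitive and (j - i) // k >= m:
                violations += 1
            i = max(i + 1, j - k + 1)
    return violations


def passes_qc(seq, l_min, l_max, p=5.0, l=2, m=10, n=0):
    """Apply the length, ambiguity, and repeat filters to an already-trimmed sequence."""
    if not (l_min <= len(seq) <= l_max):
        return False
    if ambiguous_percent(seq) > p:
        return False
    return count_repeat_violations(seq, l, m) <= n
```

A sequence would be trimmed first and then filtered, for example `passes_qc(trim_ambiguous_ends(seq), l_min, l_max)`. Running something like this on your own reads before submission can give a quick estimate of how many sequences the repeat filter is likely to drop.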
In addition to converting .fasta files into corresponding .gss files, the script also creates the three contact (.cont), library (.lib), and publication (.pub) files required for dbGSS submission. To learn more about these files, see the NCBI dbGSS website. These files contain metadata about the library being submitted, which must be provided in the optional template_gss_param.txt file in order to be used. This file specifies the minimum set of fields that were required for a metagenomic fosmid-end submission (Summer 2013); however, this is subject to change at any time by the NCBI, so early consultation with someone at the NCBI is critical.
Example: template_gss_param.txt
# NCBI GSS Parameter File
# A parameter file to create valid NCBI GSS submission files.
# As of May 2013, the preferred method of preparing fosmid end
# sequences for the NCBI
# See: http://www.ncbi.nlm.nih.gov/dbGSS/how_to_submit.html
# Publication (.pub)
# REQUIRED
# Title of the publication
PUB_TITLE:A potential role for Marine Group A bacteria in the marine sulfur cycle
PUB_AUTHORS:Name1,I.I.;Name2,I.
PUB_JOURNAL:ISMEJ
# Publication status 1=unpublished, 2=submitted, 3=in press, 4=published
PUB_STATUS:X
# OPTIONAL
# Additional publication information
PUB_VOLUME:XX
PUB_ISSUE:X
PUB_PAGES:XXX-XX
PUB_YEAR:20XX
# Library (.lib)
# OPTIONAL
# Library vector information
LIB_VECTOR:pcc1fos
LIB_V_TYPE:Fosmid
LIB_DESCR:Fosmid paired end sequenced library, in pCC1fos vector produced using Copy Control Fosmid Library Production Kit (Epicentre), bidirectionally end sequenced on Sanger ABI PRISM3730
# Contact (.cont)
# REQUIRED
CONT_NAME:Hallam, S.J.
CONT_TEL:604 827 4216
CONT_EMAIL:shallam@mail.ubc.ca
CONT_LAB:Hallam
CONT_INST:University of British Columbia
CONT_ADDR:Life Sciences Institute, 2552-2350 Health Sciences Mall, Vancouver, B.C., V6T 1Z3
Most of these fields are fairly intuitive if you follow along with the comments. For additional detail please consult the NCBI dbGSS website.
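For readers who want to check or fill in a parameter file programmatically, here is a minimal sketch of reading the KEY:value format shown above. It is not taken from the script itself: `parse_gss_params`, `missing_required`, and `REQUIRED_KEYS` are hypothetical names, and the required-key list is simply read off the REQUIRED comments in the template rather than from NCBI documentation.

```python
# Keys under the "REQUIRED" comments of the template above (an assumption,
# not an authoritative list of what the NCBI currently requires).
REQUIRED_KEYS = [
    "PUB_TITLE", "PUB_AUTHORS", "PUB_JOURNAL", "PUB_STATUS",
    "CONT_NAME", "CONT_TEL", "CONT_EMAIL", "CONT_LAB", "CONT_INST", "CONT_ADDR",
]


def parse_gss_params(lines):
    """Parse KEY:value lines into a dict, skipping blank lines and # comments."""
    params = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a KEY:value line
        params[key.strip()] = value.strip()
    return params


def missing_required(params):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not params.get(key)]
```

Opening template_gss_param.txt and passing the file object to `parse_gss_params` would return a dictionary of the fields, and `missing_required` then lists anything left blank before the submission files are generated.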