SeqSender

Public Database Submission Pipeline

Version: 0.1 (Beta)

Beta Version This pipeline is currently in Beta testing and issues could appear during submission, use at your own risk. Feedback and suggestions are welcome!

General disclaimer This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.

Overview

SeqSender is used to generate the files necessary to upload via FTP to NCBI's databases Genbank, BioSample, and SRA, as well as, GISAID. The pipeline then automatically performs the sequential upload to these databases to ensure proper linkage of data. The pipeline is dynamic in that the user creates a config file to select which databases they would like to upload and allows for any possible metadata fields by using a YAML to pair the database's metadata fields which your personal metadata field columns. This pipeline is currently tested with uploading SARS-COV2 data but the dynamic nature of this pipeline will allow for other organism's in future updates.

Setup:

1. Account Creation:

NCBI: A NCBI account is required along with a center account approved for submitting via FTP. Contact NCBI at gb-admin@ncbi.nlm.nih.gov to get a center account created.
GISAID: A GISAID account is required for submitting, register at https://www.gisaid.org/. A test submission is required before production submissions are allowed. After making a test submission contact GISAID at hcov-19@gisaid.org to receive your personal CID.

2. Environment Setup:

Clone files to working space and setup Python environment.

We recommend Miniconda

  conda env create -f config_files/conda_environment.yaml
  conda activate seqsender

Create a folder where you would like the program to generate the output to.
Submission files will be created and processed here.

3. Config File Creation:

The script will automatically default to the default_config.yaml. Multiple config files can be created and passed to the script via --config <filename>.yaml.
To create your config file fill out the empty spaces in the config file. Place the full path for the output directory you created and set what databases you want to submit to, to True or False. For the column_names sections of the config file place the corresponding datafield for the public repository to your metadata's column name for example{"Public repository field":"Your metadata column"}.
You must also create the naming schema for how you want the sequence to be named for the database and give the associated column name for the fields Genbank_sample_name_col, SRA_sample_name_col, BioSample_sample_name_col, and gisaid_sample_name_col. This is because the naming schema can vary between databases.
Refer to the database for what fields are required for submission and what options are available. For a full list of what is required in the config file, see the table below.

4. Submission File Creation:

Create all the submission files by running seqsender.py prep --unique_name <> --fasta <> --metadata <>. Provide the full path to the fasta and metadata file and give a unique name which will be used for the submission.
These cannot repeat as the submission name is what is used to upload to via FTP. Using a submission name again could result in your submission not processing.

5. GISAID Authentication (Required if submitting to GISAID):

GISAID requires the script to be authenticated with the CID. To authenticate your script run gisaid_uploader.py COV authenticate --cid TEST-EA76875B00C3.
It will then ask you to provide your GISAID username and password. If this test CID no longer works refer to the GISAID upload CLI for the latest test CID on https://www.gisaid.org/.
After performing a test submission you will need to run this command again with your official production CID provided by GISAID.

6. Test Submission:

To perform your test submission run seqsender.py submit --unique_name <> --fasta <> --metadata <> --test. The flag --test will allow you to perform a test submission anytime to NCBI. However, this will not work for GISAID as submissions are based off the CID. This will submit the test submission to the automated pipeline using the test command. The automated pipeline will not submit test submissions to GISAID. To perform the test submission to GISAID run the command seqsender.py gisaid --unique_name <> --test, after running the submit command.
After performing the test submission run the command seqsender.py update_submissions for the script to automatically check the progress of submissions and to continue submitting sequences to the next database after accessions are generated for linking BioSample, SRA, and Genbank submissions together.

7. Final Setup:

After successfully performing a test submission to every database you plan to submit to contact GISAID at hcov-19@gisaid.org to receive your production CID. Remember to update your config file to this new CID and authenticate the GISAID script with the new CID.
Contact gb-admin@ncbi.nlm.nih.gov to begin production submissions to NCBI after performing test submissions.
For production submissions run seqsender.py submit --unique_name <> --fasta <> --metadata <>, this will generate all the required file and place it in the automated pipeline.
To progress the automated pipeline run seqsender.py update_submissions every couple hours to process submissions.

Commands:

seqsender.py submit --unique_name <> --fasta <> --metadata <> Creates the files for submission and adds to automated submission pipeline and starts submission process.
seqsender.py prep --unique_name <> --fasta <> --metadata <> Creates the files for submission.
seqsender.py update_submissions Updates process of all sequences in submission pipeline, performs submission to subsequent databases based on submission status.
seqsender.py gisaid --unique_name <> Performs manual submission to GISAID.
seqsender.py genbank --unique_name <> Performs manual submission to Genbank.
seqsender.py biosample --unique_name <> Performs manual submission to BioSample.
seqsender.py sra --unique_name <> Performs manual submission to SRA.
seqsender.py biosample_sra --unique_name <> Performs manual joint submission to BioSample/SRA.

Optional flags:

--config <> If using a different config file than the default config. Provide the full name of the config file stored in config_files folder.
--test Performs test submission to NCBI. Does not perform test submission to GISAID. You must used authenticated CID for test submission to GISAID.
--overwrite Overwrites an existing submission on NCBI FTP. Used to update errored submissions.

Tips and Troubleshooting:

If you need to update a submissions metadata mid submission run seqsender.py prep --unique_name <> --fasta <> --metadata <>. Then run seqsender.py <database> --unique_name <> --fasta <> --metadata <> --overwrite to overwrite an existing submission with the new files on the FTP server.
If you receive an error for the config file it will notify you which line in the config file this is occurring. Common errors are missing quotes or having a comma after the last item.
Large GISAID submissions occassionally time-out. The automated pipeline will attempt to make the submission again the next time it is ran.

Config File Fields:

Section:	Name:	Description:	Required:
General	submission_directory	Output directory for script to process submissions at	Yes
General	submit_Genbank	Perform submission to Genbank	Yes, True/False
General	submit_GISAID	Perform submission to GISAID	Yes, True/False
General	submit_SRA	Perform submission to SRA	Yes, True/False
General	submit_BioSample	Perform submission to BioSample	Yes, True/False
General	joint_SRA_BioSample_submission	Submit SRA and BioSample together as one submission	Yes, True/False
General	contact_email1	Primary contact email	Yes
General	contact_email2	Secondary contact email	No
General	organization_name	Organization name	Yes
General	authorset	List of authors for submission	Yes
General	ncbi_org_id	Center account organization ID	Yes
General	submitter_info	Primary submitter info	Yes
General	organism_name	Organism	Yes
General	metadata_file_sep	File seperator used in your metadata file	Yes
General	fasta_sample_name_col	Metadata column of sequence name used in fasta file	Yes
General	collection_date_col	Collection date for sequence	Yes
General	baseline_surveillance	If performing baseline_sequencing setting this to True will add baseline_sequencing tag to all of your submissions	Yes, True/False
ncbi	hostname	FTP site. Use ftp-private.ncbi.nlm.nih.gov	Yes
ncbi	api_url	API url for pulling down files from NCBI https://submit.ncbi.nlm.nih.gov/api/2.0/files/FILE_ID/?format=attachment	Yes
ncbi	username	NCBI center account username	Yes
ncbi	password	NCBI center account password	Yes
ncbi	publication_title	Public facing title for submission	Yes
ncbi	ncbi_ftp_path_to_submission_folders	If your FTP site does not directly have the folders for Production/Test then provide the path for this. Typically left blank	No
ncbi	BioProject	BioProject to link submissions to	No
ncbi	BioSample_sample_name_col	Sequence names for BioSample	Yes, if submitting to BioSample
ncbi	SRA_sample_name_col	Sequence names for SRA	Yes, if submitting to SRA
ncbi	Genbank_sample_name_col	Sequence names for Genbank	Yes, if submitting to Genbank
ncbi	BioSample_package	Use SARS-CoV-2.cl.1.0	Yes
ncbi	Center_title	Title for center account	Yes
ncbi	Genbank_organization_type	Center organization type	Yes
ncbi	Genbank_organization_role	Use owner unless otherwise specified	Yes
ncbi	Genbank_spuid_namespace	Use ncbi-sarscov2-genbank unless specified	Yes
ncbi	Genbank_auto_remove_sequences_that_fail_qc	Genbank can automatically remove sequences that fail submission qc in order to not stall submissions	Yes True/False
ncbi	Genbank_wizard	Use BankIt_SARSCoV2_api unless specified	Yes
ncbi	citation_address	Address of submitter organization	Yes
ncbi	SRA_file_location	Whether you are submitting SRA files via manual upload or cloud link	Yes, if submitting to SRA File/Cloud
ncbi	SRA_file_column1	Either name of file if uploading or full path to cloud link	Yes, if submitting to SRA
ncbi	SRA_file_column2	Either name of file if uploading or full path to cloud link	Yes, if submitting to SRA and you have multiple files
ncbi	SRA_file_loader	Leave blank unless notified by SRA team you need to use a specific loader for your SRA files	No
genbank_src_metadata	column_names	Database field to column name of metadata	Yes
genbank_cmt_metadata	create_cmt	Create cmt file to go with submission	Yes True/False
genbank_cmt_metadata	column_names	Database field to column name of metadata	Yes
BioSample_attributes	column_names	Database field to column name of metadata	Yes
SRA_attributes	column_names	Database field to column name of metadata	Yes
gisaid	column_names	Database field to column name of metadata	Yes
gisaid	gisaid_sample_name_col	Sequence names for GISAID	Yes, if submitting to GISAID
gisaid	cid	CID used for submission	Yes
gisaid	username	gisaid username	Yes
gisaid	password	GISAID password	Yes
gisaid	type	Use betacoronavirus unless specified	Yes
gisaid	Update_sequences_on_Genbank_auto_removal	If using Genbank auto-remove qc then use this to update the GISAID submission based on what is accepted by Genbank	Yes True/False

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

The source code forked from other open source projects will inherit its license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.github		.github
.vscode		.vscode
config_files		config_files
docs		docs
test_input		test_input
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
biosample_sra_submission.py		biosample_sra_submission.py
genbank_submission.py		genbank_submission.py
gisaid_submission.py		gisaid_submission.py
gisaid_uploader.py		gisaid_uploader.py
seqsender.py		seqsender.py
submission_preparation.py		submission_preparation.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SeqSender

Public Database Submission Pipeline

Overview

Setup:

1. Account Creation:

2. Environment Setup:

3. Config File Creation:

4. Submission File Creation:

5. GISAID Authentication (Required if submitting to GISAID):

6. Test Submission:

7. Final Setup:

Commands:

Tips and Troubleshooting:

Config File Fields:

Public Domain Standard Notice

License Standard Notice

Privacy Standard Notice

Contributing Standard Notice

Records Management Standard Notice

Additional Standard Notices

About

Releases

Packages

Languages

License

nbx0/seqsender

Folders and files

Latest commit

History

Repository files navigation

SeqSender

Public Database Submission Pipeline

Overview

Setup:

1. Account Creation:

2. Environment Setup:

3. Config File Creation:

4. Submission File Creation:

5. GISAID Authentication (Required if submitting to GISAID):

6. Test Submission:

7. Final Setup:

Commands:

Tips and Troubleshooting:

Config File Fields:

Public Domain Standard Notice

License Standard Notice

Privacy Standard Notice

Contributing Standard Notice

Records Management Standard Notice

Additional Standard Notices

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages