Version: 0.1 (Beta)
Beta Version This pipeline is currently in Beta testing and issues could appear during submission, use at your own risk. Feedback and suggestions are welcome!
General disclaimer This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.
SeqSender is used to generate the files necessary to upload via FTP to NCBI's databases Genbank, BioSample, and SRA, as well as, GISAID. The pipeline then automatically performs the sequential upload to these databases to ensure proper linkage of data. The pipeline is dynamic in that the user creates a config file to select which databases they would like to upload and allows for any possible metadata fields by using a YAML to pair the database's metadata fields which your personal metadata field columns. This pipeline is currently tested with uploading SARS-COV2 data but the dynamic nature of this pipeline will allow for other organism's in future updates.
- NCBI: A NCBI account is required along with a center account approved for submitting via FTP. Contact NCBI at gb-admin@ncbi.nlm.nih.gov to get a center account created.
- GISAID: A GISAID account is required for submitting, register at https://www.gisaid.org/. A test submission is required before production submissions are allowed. After making a test submission contact GISAID at hcov-19@gisaid.org to receive your personal CID.
- Clone files to working space and setup Python environment.
- We recommend Miniconda
-
conda env create -f config_files/conda_environment.yaml conda activate seqsender
- Create a folder where you would like the program to generate the output to.
- Submission files will be created and processed here.
- The script will automatically default to the default_config.yaml. Multiple config files can be created and passed to the script via
--config <filename>.yaml
. - To create your config file fill out the empty spaces in the config file. Place the full path for the output directory you created and set what databases you want to submit to, to True or False. For the column_names sections of the config file place the corresponding datafield for the public repository to your metadata's column name for example
{"Public repository field":"Your metadata column"}
. - You must also create the naming schema for how you want the sequence to be named for the database and give the associated column name for the fields Genbank_sample_name_col, SRA_sample_name_col, BioSample_sample_name_col, and gisaid_sample_name_col. This is because the naming schema can vary between databases.
- Refer to the database for what fields are required for submission and what options are available. For a full list of what is required in the config file, see the table below.
- Create all the submission files by running
seqsender.py prep --unique_name <> --fasta <> --metadata <>
. Provide the full path to the fasta and metadata file and give a unique name which will be used for the submission. - These cannot repeat as the submission name is what is used to upload to via FTP. Using a submission name again could result in your submission not processing.
- GISAID requires the script to be authenticated with the CID. To authenticate your script run
gisaid_uploader.py COV authenticate --cid TEST-EA76875B00C3
. - It will then ask you to provide your GISAID username and password. If this test CID no longer works refer to the GISAID upload CLI for the latest test CID on https://www.gisaid.org/.
- After performing a test submission you will need to run this command again with your official production CID provided by GISAID.
- To perform your test submission run
seqsender.py submit --unique_name <> --fasta <> --metadata <> --test
. The flag--test
will allow you to perform a test submission anytime to NCBI. However, this will not work for GISAID as submissions are based off the CID. This will submit the test submission to the automated pipeline using the test command. The automated pipeline will not submit test submissions to GISAID. To perform the test submission to GISAID run the commandseqsender.py gisaid --unique_name <> --test
, after running the submit command. - After performing the test submission run the command
seqsender.py update_submissions
for the script to automatically check the progress of submissions and to continue submitting sequences to the next database after accessions are generated for linking BioSample, SRA, and Genbank submissions together.
- After successfully performing a test submission to every database you plan to submit to contact GISAID at hcov-19@gisaid.org to receive your production CID. Remember to update your config file to this new CID and authenticate the GISAID script with the new CID.
- Contact gb-admin@ncbi.nlm.nih.gov to begin production submissions to NCBI after performing test submissions.
- For production submissions run
seqsender.py submit --unique_name <> --fasta <> --metadata <>
, this will generate all the required file and place it in the automated pipeline. - To progress the automated pipeline run
seqsender.py update_submissions
every couple hours to process submissions.
seqsender.py submit --unique_name <> --fasta <> --metadata <>
Creates the files for submission and adds to automated submission pipeline and starts submission process.seqsender.py prep --unique_name <> --fasta <> --metadata <>
Creates the files for submission.seqsender.py update_submissions
Updates process of all sequences in submission pipeline, performs submission to subsequent databases based on submission status.seqsender.py gisaid --unique_name <>
Performs manual submission to GISAID.seqsender.py genbank --unique_name <>
Performs manual submission to Genbank.seqsender.py biosample --unique_name <>
Performs manual submission to BioSample.seqsender.py sra --unique_name <>
Performs manual submission to SRA.seqsender.py biosample_sra --unique_name <>
Performs manual joint submission to BioSample/SRA.
Optional flags:
--config <>
If using a different config file than the default config. Provide the full name of the config file stored inconfig_files
folder.--test
Performs test submission to NCBI. Does not perform test submission to GISAID. You must used authenticated CID for test submission to GISAID.--overwrite
Overwrites an existing submission on NCBI FTP. Used to update errored submissions.
- If you need to update a submissions metadata mid submission run
seqsender.py prep --unique_name <> --fasta <> --metadata <>
. Then runseqsender.py <database> --unique_name <> --fasta <> --metadata <> --overwrite
to overwrite an existing submission with the new files on the FTP server. - If you receive an error for the config file it will notify you which line in the config file this is occurring. Common errors are missing quotes or having a comma after the last item.
- Large GISAID submissions occassionally time-out. The automated pipeline will attempt to make the submission again the next time it is ran.
Section: | Name: | Description: | Required: |
---|---|---|---|
General | submission_directory | Output directory for script to process submissions at | Yes |
General | submit_Genbank | Perform submission to Genbank | Yes, True/False |
General | submit_GISAID | Perform submission to GISAID | Yes, True/False |
General | submit_SRA | Perform submission to SRA | Yes, True/False |
General | submit_BioSample | Perform submission to BioSample | Yes, True/False |
General | joint_SRA_BioSample_submission | Submit SRA and BioSample together as one submission | Yes, True/False |
General | contact_email1 | Primary contact email | Yes |
General | contact_email2 | Secondary contact email | No |
General | organization_name | Organization name | Yes |
General | authorset | List of authors for submission | Yes |
General | ncbi_org_id | Center account organization ID | Yes |
General | submitter_info | Primary submitter info | Yes |
General | organism_name | Organism | Yes |
General | metadata_file_sep | File seperator used in your metadata file | Yes |
General | fasta_sample_name_col | Metadata column of sequence name used in fasta file | Yes |
General | collection_date_col | Collection date for sequence | Yes |
General | baseline_surveillance | If performing baseline_sequencing setting this to True will add baseline_sequencing tag to all of your submissions | Yes, True/False |
ncbi | hostname | FTP site. Use ftp-private.ncbi.nlm.nih.gov | Yes |
ncbi | api_url | API url for pulling down files from NCBI https://submit.ncbi.nlm.nih.gov/api/2.0/files/FILE_ID/?format=attachment | Yes |
ncbi | username | NCBI center account username | Yes |
ncbi | password | NCBI center account password | Yes |
ncbi | publication_title | Public facing title for submission | Yes |
ncbi | ncbi_ftp_path_to_submission_folders | If your FTP site does not directly have the folders for Production/Test then provide the path for this. Typically left blank | No |
ncbi | BioProject | BioProject to link submissions to | No |
ncbi | BioSample_sample_name_col | Sequence names for BioSample | Yes, if submitting to BioSample |
ncbi | SRA_sample_name_col | Sequence names for SRA | Yes, if submitting to SRA |
ncbi | Genbank_sample_name_col | Sequence names for Genbank | Yes, if submitting to Genbank |
ncbi | BioSample_package | Use SARS-CoV-2.cl.1.0 | Yes |
ncbi | Center_title | Title for center account | Yes |
ncbi | Genbank_organization_type | Center organization type | Yes |
ncbi | Genbank_organization_role | Use owner unless otherwise specified | Yes |
ncbi | Genbank_spuid_namespace | Use ncbi-sarscov2-genbank unless specified | Yes |
ncbi | Genbank_auto_remove_sequences_that_fail_qc | Genbank can automatically remove sequences that fail submission qc in order to not stall submissions | Yes True/False |
ncbi | Genbank_wizard | Use BankIt_SARSCoV2_api unless specified | Yes |
ncbi | citation_address | Address of submitter organization | Yes |
ncbi | SRA_file_location | Whether you are submitting SRA files via manual upload or cloud link | Yes, if submitting to SRA File/Cloud |
ncbi | SRA_file_column1 | Either name of file if uploading or full path to cloud link | Yes, if submitting to SRA |
ncbi | SRA_file_column2 | Either name of file if uploading or full path to cloud link | Yes, if submitting to SRA and you have multiple files |
ncbi | SRA_file_loader | Leave blank unless notified by SRA team you need to use a specific loader for your SRA files | No |
genbank_src_metadata | column_names | Database field to column name of metadata | Yes |
genbank_cmt_metadata | create_cmt | Create cmt file to go with submission | Yes True/False |
genbank_cmt_metadata | column_names | Database field to column name of metadata | Yes |
BioSample_attributes | column_names | Database field to column name of metadata | Yes |
SRA_attributes | column_names | Database field to column name of metadata | Yes |
gisaid | column_names | Database field to column name of metadata | Yes |
gisaid | gisaid_sample_name_col | Sequence names for GISAID | Yes, if submitting to GISAID |
gisaid | cid | CID used for submission | Yes |
gisaid | username | gisaid username | Yes |
gisaid | password | GISAID password | Yes |
gisaid | type | Use betacoronavirus unless specified | Yes |
gisaid | Update_sequences_on_Genbank_auto_removal | If using Genbank auto-remove qc then use this to update the GISAID submission based on what is accepted by Genbank | Yes True/False |
This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.
This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.
This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.
You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html
The source code forked from other open source projects will inherit its license.
This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.
Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.
All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.
This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.