Klebsiella-genome-metadata

Klebsiella genome metadata scheme, plus guidance, examples and submission template.

This is a community-driven data curation effort to facilitate the use and reuse of public genome collections for maximum knowledge gain. These efforts are focussed on Klebsiella pneumoniae and closely related organisms in the K. pneumoniae Species Complex (KpSC) and are coordinated by the KlebNET-GSP project team as a major activity of the KlebNET-GSP Epidemiology Consortium.

The submitted data will be compiled, collated and made publicly available in the KlebNET Metadata Repository and the PathogenWatch and BIGSdb platforms, which host public KpSC genome collections and report associated genotypes. The KlebNET Metadata Repository is also available for secondary analysis according to the principles of the PHA4GE Microbial Data Sharing Accord, with three exceptions: Clause 3 (Onward sharing of data) does not apply, as inclusion of this database in external platforms is allowed so long as this statement is included; Clause 2 (Overview of Outputs prior to Publication) is required for only a subset of the data; and Clause 7 (Opportunity for Collaboration) is required for only a subset of the data. We encourage contributors to the KlebNET Metadata Repository to waive Clauses 2 and 7 where feasible, as this will accelerate open data sharing and streamline KpSC research by enabling more effective reuse of data.

Our goal is to collect information that has broad utility for research focussed on KpSC, and that can be readily harmonised for easy and effective reuse. We aim to capture information that is not currently well represented in the public data repositories. Notably, the National Center for Biotechnology Information (NCBI) already allows submission of detailed Antimicrobial Susceptibility Testing (AST) information that is directly applicable to the KpSC; AST data are therefore excluded from our data curation effort. If you have generated and are able to share AST data for KpSC isolates, please consider submitting them to NCBI.

Our scheme includes data fields divided into four sections:

1. Isolate metadata fields capture information about the individual KpSC genomes and their associated isolates, as well as the sample sources and/or hosts from which the isolates were collected.

2. Sampling fields capture information about how and why isolates were collected and/or chosen for sequencing. These data are essential to understand the underlying biases in genome collections, and to make decisions about the inclusion or exclusion of isolates for comparative and aggregate analyses.

3. Citation information fields capture information regarding genomes that are part of a study that has not yet been published or is in preprint. This allows direct attribution of multiple authors for each project.

4. PHA4GE Microbial Data Sharing Accord - Clause Waiver fields indicate whether the data contributor agrees to waive Clause 2 and/or Clause 7 of the PHA4GE Microbial Data Sharing Accord on a per-isolate basis.

The submission template is available here. Detailed instructions and guidance for data submissions can be found below.

PLEASE NOTE: Data submitters must only send anonymised and de-identified data, and should not send any human data (age, sex, clinical information, etc.) for which they do not have ethical approval to release into the public domain. If the requisite approvals to share certain information are lacking or sharing is prohibited, the relevant data fields should be marked as ‘Restricted access’ or ‘Not available’ as indicated in the template instructions, and the fields that are able to be shared publicly should be completed.

Join the KlebNET-GSP Epidemiology Consortium

Data submitters are eligible to join the KlebNET-GSP Epidemiology Consortium, via this registration form. Members are required to read and agree to the Consortium Terms of Reference.

Contents

1. Data submission
2. Isolate metadata fields
3. Sampling fields
    i. Term definitions for 'purpose of sampling'
    ii. Examples of how to describe study designs using the sampling fields
4. Citation information fields
5. PHA4GE Microbial Data Sharing Accord - Clause Waiver
6. Queries and suggestions
7. License

Data submission

The data submission template is available here. Please MAKE A COPY before inputting your own data. You cannot enter data directly into the master copy of the template. Once completed, email your copy to, or share it with, klebsiella.genome.metadata@gmail.com.

The full list of metadata fields, value formats and options is shown in the tables below.

Fields with restricted vocabularies

Some fields have restricted vocabularies and/or require selection from a list of predefined data values. In most cases the list of possible values can be accessed and searched via a drop-down list within the submission template (also shown in the tables below, marked 'Choose from list') and only values matching those in the list will be accepted. However, in a minority of cases the possible set of values is derived from an established ontology that is too large for inclusion within the submission template. These fields are marked as 'Controlled vocabulary', with a link to the appropriate ontology (e.g., the NCBI taxonomy database or the MeSH disease ontology).
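
As a rough illustration, a submitter could pre-check 'Choose from list' fields before sending a completed template. The sketch below is not part of any KlebNET-GSP tooling: the function name `check_restricted` and the dictionary layout are assumptions made for this example, and the value lists are copied from the isolate metadata table in this README.

```python
# Hypothetical pre-submission check for 'Choose from list' fields.
# The vocabularies are copied from the tables in this README; the
# function and variable names are illustrative assumptions.
RESTRICTED_VOCAB = {
    "Genome source": {"Isolate WGS", "MAG", "Unknown"},
    "Infection": {"Infection", "Colonisation", "Unknown"},
    "Repeat isolate status": {"Primary isolate", "Repeat isolate"},
}


def check_restricted(field: str, value: str) -> bool:
    """Return True only if value exactly matches an allowed term."""
    return value in RESTRICTED_VOCAB[field]


print(check_restricted("Genome source", "MAG"))
print(check_restricted("Genome source", "mag"))
```

The comparison is deliberately case-sensitive here, on the assumption that only exact matches to the listed values are accepted.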

Fields with a list of suggested values

In some cases it is desirable to have a restricted vocabulary to support data harmonisation, but there are no appropriate predefined ontologies and too many foreseeable options to create a definitive list. In these cases, we provide a list of suggested values that we expect to capture the vast majority of scenarios, but also provide the option to enter alternative values as free text. These fields are marked in the tables below as 'Choose common values from the list, or if none are appropriate, enter free text'. The submission template includes a drop-down list of the suggested values, but will allow other values to be entered (these free text entries will be marked with warnings).
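
The behaviour described above (accept a free-text value, but flag it) can be sketched as follows. This is a minimal illustration rather than the template's actual validation logic; `check_suggested` is a hypothetical helper, and the list is the set of suggested 'Host tissue sampled' values from the isolate metadata table.

```python
from typing import Optional

# Suggested values for 'Host tissue sampled', copied from the table below.
SUGGESTED_TISSUES = {
    "Blood", "Cerebrospinal fluid (CSF)", "Urine", "Sputum",
    "Bronchoalveolar lavage (BAL)", "Other respiratory", "Wound", "Skin",
    "Faeces", "Rectal swab", "Throat swab", "Cecal swab", "Unknown",
}


def check_suggested(value: str, suggested: set) -> Optional[str]:
    """Return None for a suggested value, else a warning message.

    Free text is still accepted; the warning only flags it for review.
    """
    if value in suggested:
        return None
    return f"free-text value {value!r} is not in the suggested list"


print(check_suggested("Urine", SUGGESTED_TISSUES))
print(check_suggested("Peritoneal fluid", SUGGESTED_TISSUES))
```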

Isolate metadata

These data describe individual genome sequences and the bacterial isolates from which they were derived. Please complete one row per sequence (i.e., one set of sequence read data and/or a de novo assembly).

Variable fields, and guidance for completing them, are shown in the table below.

For text fields, please DO NOT enter 'unknown' or 'missing' unless otherwise specified. Instead, leave the field blank if you do not have any data to input for that field.

Status Variable Definition; Guidance Value format
RECOMMENDED; REQUIRED if no Assembly accession provided Run accession Sequence archive run accession (sequence read accession); SRRxxx, ERRxxx. If multiple sequences for the same ISOLATE, please enter the repeat run accession as a separate row, and indicate that this is a repeat run sequence (in the field ‘Repeat sequence status’) and which run accession is the primary sequence (in the field ‘Primary sequence name’). {text}
REQUIRED Project accession BioProject accession; PRJxxx. If multiple projects for the same ISOLATE, a list of accessions can be given (comma-separated). {text}
REQUIRED Sample accession BioSample accession; SAMxxx {text}
RECOMMENDED; REQUIRED if no Assembly accession provided Experiment accession Sequence archive experiment accession; SRXxxx, ERXxxx. If multiple experiments for the same ISOLATE, please enter each unique experiment accession in a separate row. {text}
optional Secondary sample accession NCBI Biosample; ERSxxx {text}
optional; REQUIRED if no Run accession provided Assembly accession GenBank assembly accession; GCA_xxx or GCF_xxx. The accession for the entire assembly, including chromosome and plasmids. {text}
optional Secondary assembly accession GenBank WGS master record accession {text}
REQUIRED Genome source Type of sequence from which this genome was derived; Indicate if the sequence represents a single cultured isolate whole genome sequence (WGS) or is derived from a mixed sequence / metagenome-assembled genome (MAG). Choose from the list. Isolate WGS | MAG | Unknown
REQUIRED Isolate name A name that you choose for the isolate. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. Every isolate name from a single submitter must be unique, however duplicate isolate names may exist across the entire metadata repository. {text}
optional Isolate alias Other IDs associated with this isolate. Multiple IDs can be given (comma-separated). {text}
REQUIRED Collection year The year that the isolate was collected; YYYY. {int}
RECOMMENDED Collection month The month that the isolate was collected; MM. If collection month is not known, leave blank. {int}
RECOMMENDED Collection day The day that the isolate was collected within the month specified in 'Collection month'; DD. If collection day is not known, leave blank. {int}
REQUIRED Country Country of isolate collection. Controlled vocabulary, choose from the list of values as defined in https://www.insdc.org/submitting-standards/geo_loc_name-qualifier-vocabulary/ {term}
REQUIRED City or region City or region of isolate collection. For human-associated isolates, please only provide the broad city or region, not specific addresses or area codes as this may lead to host re-identification. {text}
REQUIRED Isolate source Short free text description of the sample source from which the Klebsiella was isolated. E.g., ‘human blood’, ‘animal feed’, ‘river water grab sample’. {text}
REQUIRED Source type Controlled vocabulary describing the source of the isolate. Choose from the list. Enables high level grouping of isolates. Human | Animal - food producing | Animal - other | Food | Aquaculture | Environmental | Experimentally derived | Other | Missing | Restricted access | Not applicable | Not collected | Not provided
REQUIRED Host Scientific name of the host from which the isolate was collected. Controlled vocabulary as defined in https://www.ncbi.nlm.nih.gov/taxonomy. If not host-associated, specify 'not host-associated'. Ensure the source is appropriately described under 'Isolate source' and consider submitting detailed source information to NCBI via the One Health Enteric metadata template. {term}
REQUIRED if host-associated Host tissue sampled Name of body site or specimen type from which the sample was obtained, such as a specific organ, tissue, or clinical specimen. Choose common values from the list, or if none are appropriate, enter free text. Blood | Cerebrospinal fluid (CSF) | Urine | Sputum | Bronchoalveolar lavage (BAL) | Other respiratory | Wound | Skin | Faeces | Rectal swab | Throat swab | Cecal swab | Unknown | {text}
REQUIRED if host-associated Infection For host-associated isolates, indicate if infecting or colonising isolate, or if the infection status is unknown. Choose from the list. Infection | Colonisation | Unknown
REQUIRED if Infection = 'Infection' Host disease For host-associated infecting isolates, provide the name of the relevant disease, e.g., Pneumonia, Bacteremia. Controlled vocabulary as defined in https://meshb.nlm.nih.gov/treeView. If unknown, leave blank. {term}
optional Infection outcome For host-associated and infecting isolates, indicate the broad infection outcome at 28 days post-infection. Choose from the list. Death within 28 days | Alive at 28 days | Restricted access | Unknown
optional Infection severity For host-associated infecting isolates, if severity information could be made available (upon request), indicate the type of information here. If none available or none can be shared with the community, leave blank. {text}
optional Host age group For human-associated isolates, indicate the age range of the host. Choose from the list. 0-30 days | 1-12 months | 1-5 years | 5-18 years | 18-60 years | >60 years | Restricted access | Not collected | Not applicable | Missing
optional Host sex For human-associated isolates, indicate the biological sex of the host. Choose from the list. Male | Female | Restricted access | Not collected | Not applicable | Missing
optional Travel associated For human-associated isolates, indicate if associated with recent travel. Leave blank if travel status is unknown. Travel associated | NOT travel associated
optional Travel country If travel associated, indicate the travel country. This should be one of the countries listed here: https://www.insdc.org/submitting-standards/geo_loc_name-qualifier-vocabulary/. Leave blank if unknown. {term}
REQUIRED Repeat isolate status If this is the only ISOLATE sequenced for this host infection or colonisation episode, select 'Primary isolate'. If more than one ISOLATE is sequenced from the same host infection or colonisation episode, indicate whether this is the primary isolate or a repeat isolate. Primary isolate | Repeat isolate
REQUIRED if Repeat isolate status = 'Repeat isolate' Primary isolate name If other ISOLATES are sequenced from the same host infection or colonisation episode, and this entry is NOT the primary isolate in the series, provide the isolate name of the primary isolate. {text}
REQUIRED Repeat sequence status If this is the only sequence record for this isolate (READS and/or ASSEMBLY), select 'Primary sequence'. If multiple sequences are provided for the same isolate, indicate whether this is the primary sequence or a repeat sequence. Primary sequence | Repeat sequence
REQUIRED if Repeat sequence status = 'Repeat sequence' Primary sequence name If multiple sequences of the same isolate, and this entry is NOT the primary sequence in the series, provide the READ or ASSEMBLY accession for the primary sequence, otherwise leave blank. {text}
REQUIRED Sample collected by The name of the agency or institution (in full) that collected the original bacterial isolate. {text}
REQUIRED Sequenced by The name of the agency or institution (in full) that generated the sequence data. {text}
REQUIRED Sequence submitted by The name of the agency or institution (in full) that is submitting the metadata. {text}
REQUIRED Data custodian organisation name The name of the agency or institution (in full) with authority over how the bacterial isolate and its associated data can be used. {text}
REQUIRED Lab contact Contact email address for the person providing the metadata. Note this information will only be made available to the KlebNET-GSP team. {text}
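
Several rows in the table above are conditionally required (for example, an Assembly accession is required only when no Run accession is given). The sketch below shows one way a submitter might check a row for a few of these rules before submission. It is an illustration only: `check_isolate_row` and `_value` are hypothetical helpers, the row is assumed to be a dictionary keyed by the variable names in the table, and the regular expressions are loose checks based on the prefixes quoted in the table rather than the archives' full accession rules.

```python
import re

# Patterns follow the prefixes quoted in the table (SRRxxx/ERRxxx and
# GCA_xxx/GCF_xxx); they are a loose sanity check, not the archives' rules.
RUN_RE = re.compile(r"[SE]RR\d+")
ASSEMBLY_RE = re.compile(r"GC[AF]_\d+(\.\d+)?")


def _value(row: dict, field: str) -> str:
    """Return the field as stripped text, treating absent/None as blank."""
    return str(row.get(field) or "").strip()


def check_isolate_row(row: dict) -> list:
    """Return a list of human-readable problems found in one row."""
    problems = []
    run = _value(row, "Run accession")
    assembly = _value(row, "Assembly accession")

    # At least one of the two accessions must be present.
    if not run and not assembly:
        problems.append("Run accession or Assembly accession is required")
    if run and not RUN_RE.fullmatch(run):
        problems.append(f"Unexpected Run accession format: {run}")
    if assembly and not ASSEMBLY_RE.fullmatch(assembly):
        problems.append(f"Unexpected Assembly accession format: {assembly}")

    # Fields that become required because of another field's value.
    dependent = [
        ("Repeat isolate status", "Repeat isolate", "Primary isolate name"),
        ("Repeat sequence status", "Repeat sequence", "Primary sequence name"),
    ]
    for trigger, trigger_value, required in dependent:
        if _value(row, trigger) == trigger_value and not _value(row, required):
            problems.append(f"{required} is required when {trigger} = '{trigger_value}'")
    return problems


print(check_isolate_row({"Run accession": "SRR1234567"}))
print(check_isolate_row({"Repeat isolate status": "Repeat isolate"}))
```

Returning a list of messages, rather than raising on the first problem, lets a submitter see every issue in a row at once.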

Sampling fields

These contextual data describe the purpose of sampling, and the sampling strategy for the collection from which each isolate is derived. Please complete one row per isolate.

Variable fields, and guidance for completing them, are summarised in the table below. Definitions and detailed examples are also shown below the table.

Status Variable Definition Guidance Value format
REQUIRED Purpose of sampling Primary purpose for sampling bacterial isolates Indicate the primary purpose for the collection and sequencing of these isolates (e.g., routine diagnostics, outbreak investigation, research). Choose from the list, or if none of the values are appropriate, provide the reason as free text. Definitions are shown below this table. Routine diagnostics and / or infection control | Routine surveillance | Outbreak investigation / outbreak-initiated surveillance | Research | {text}
REQUIRED Study population Population from which bacterial isolates were sampled Give details about the population of hosts or environments represented in the sample (e.g., Hospital patients, Neonates, Hospital wastewater). This information is essential to inform the inclusion and exclusion of studies for aggregate or comparative epidemiological analyses. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated). Hospital patients | Intensive Care Unit (ICU) patients | Primary care patients | Community participants | Neonates | Clinical environment: sinks and drains | Clinical environment: surfaces | Medical devices | Hospital wastewater | Wastewater (not hospital) | Fresh water | Seawater | Soil | Rhizosphere | Plants | Livestock | Companion animals | Captive animals | Wild animals | Food | {text}
REQUIRED Target epi Broad epidemiological category of the study Indicate the broad epidemiological category of the study (e.g., Host colonisation, Host infection, Environmental). This information is useful to inform aggregate or comparative analyses of disease-associated vs non-disease associated isolates. Choose from the list, or if none of the values are appropriate, enter the information as free text. Host infection | Host colonisation | Environmental | Host infection & colonisation | Host infection, colonisation & environmental | {text}
REQUIRED if target epi includes 'Host infection' Selected by clinical phenotype Flag to indicate whether isolates were selected for inclusion on the basis of host clinical phenotype Indicate whether isolates were selected for inclusion on the basis of host clinical phenotype (e.g., bloodstream infection, liver abscess, severe infection) or choose 'NOT selected' if no selection was applied. Choose from the list. This information is essential to inform studies focussed on specific infection types or disease severity, e.g., to determine serotype distributions among invasive infection isolates or to compare rates of drug resistance among bloodstream infections. The specific phenotype used for selection can be indicated in the 'selected clinical phenotype' field. Selected by clinical phenotype | NOT selected by clinical phenotype
REQUIRED if selected by clinical phenotype = 'Selected by clinical phenotype' Selected clinical phenotype Clinical phenotype used to select isolates for inclusion Indicate the specific clinical phenotype that was used to select samples for collection and/or sequencing. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated). Liver abscess | Invasive infection | Bloodstream infection | Respiratory infection | Urinary tract infection | Healthcare-associated infection | Community acquired infection | Severe disease | {text}
REQUIRED Selected by organism trait Flag to indicate whether isolates were selected for inclusion on the basis of microbial trait Indicate if samples were selected for inclusion on the basis of a microbial phenotype or genotype (e.g., specific drug resistance or serotype, presence of a specific gene) or choose 'NOT selected' if no selection was applied. Choose from the list. This information is essential to inform studies aiming to estimate the prevalence of microbial phenotypes / genotypes by study populations, geographies, etc. For example, to estimate national prevalence of ceftriaxone or carbapenem resistant isolates. The specific phenotype or genotype used for selection can be indicated in the 'selected organism trait' field. Selected by organism trait | NOT selected by organism trait
REQUIRED if selected by organism trait = 'Selected by organism trait' Selected organism trait Microbial trait used to select isolates for inclusion Indicate the specific microbial phenotype or genotype that was used to select isolates for collection and/or sequencing. Choose common values from the list, or if none of the values are appropriate, enter the information as free text. Multiple values can be specified (comma-separated). Ceftriaxone resistance | Carbapenem resistance | Drug resistance (not ceftriaxone or carbapenem) | ESBL producers | Carbapenemase producers | KPC positive | NDM positive | iuc (aerobactin) positive | iro (salmochelin) positive | rmpA positive | peg-344 positive | OXA positive | String-test positive | Hypermucoviscous by low-speed centrifugation | Hypermucoviscous by Percoll-gradient sedimentation | 7-gene multi-locus sequence type | Serotype | {text}
RECOMMENDED Sampling period start Start date for the sampling period Indicate when the sample collection began (YYYY, or YYYY-MM or YYYY-MM-DD). This information is useful for understanding the temporal coverage of data to inform trend analysis. {ISO format}
RECOMMENDED Sampling period end End date for the sampling period Indicate when the sample collection ended (YYYY, or YYYY-MM or YYYY-MM-DD). This information is useful for understanding the temporal coverage of data to inform trend analysis. If collection and sequencing are on-going, leave this field blank. {ISO format}
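
The sampling period fields accept three levels of date precision (YYYY, YYYY-MM or YYYY-MM-DD). A minimal sketch of how such partial dates could be checked is shown below; `is_valid_sampling_date` is a hypothetical helper written for this example, not part of the submission template.

```python
import re
from datetime import date

# Matches YYYY, YYYY-MM or YYYY-MM-DD and captures each component.
_PARTIAL_DATE = re.compile(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")


def is_valid_sampling_date(value: str) -> bool:
    """Return True if value is a real date at year, month or day precision."""
    match = _PARTIAL_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing components default to 1 so the calendar check still runs.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


for candidate in ("2019", "2019-02", "2019-02-28", "2019-02-30", "2019/02"):
    print(candidate, is_valid_sampling_date(candidate))
```

Delegating to `datetime.date` rejects impossible dates such as 30 February without hand-written calendar logic.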

Term definitions for 'purpose of sampling'

Routine diagnostics and / or infection control

Samples collected through the routine and ongoing activities of clinical or veterinary microbiology laboratories for the purposes of clinical diagnosis and/or infection control. This may include isolates confirmed as infecting agents and/or those considered asymptomatic or environmental colonisers. E.g., isolates identified from hospital sinks or patient screening swabs as part of routine infection prevention and control procedures.

Routine surveillance

Samples collected through the routine and ongoing activities of other laboratories (not clinical or veterinary microbiology laboratories) and/or collected for purposes other than clinical diagnostics and infection control, e.g., laboratories processing samples from non-healthcare environmental sources or food products.

Outbreak investigation / outbreak-initiated surveillance

Samples collected as part of a response to a specific outbreak, e.g., within a hospital or other healthcare setting (human or veterinary). This may include isolates confirmed as infecting agents and/or those considered asymptomatic colonisers (e.g., from screening swabs) and/or those from environmental sources (e.g., hospital sinks, drains, etc.).

Research

Samples collected for specific research purposes (excluding outbreak investigation / outbreak-initiated surveillance) that would not have otherwise been collected via routine diagnostics, infection control or surveillance activities as described above.

Examples of how to describe study designs using the sampling fields

Below we describe various hypothetical study designs and show how the sampling fields would be populated for each.

Neonatal sepsis study

K. pneumoniae were isolated from the blood of neonates via routine diagnostic procedures. All isolates collected between 01 Jan 2019 and 31 Dec 2020 were stocked and subjected to whole genome sequencing.

[Figure: Neonatal sepsis study flow diagram]
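
For readers without access to the diagram, the neonatal sepsis example might be encoded along the following lines. This is one possible reading based on the field definitions in the sampling table, not a transcription of the original diagram, which may differ (for instance in the choice of study population terms). The dictionary name is hypothetical.

```python
# One possible reading of the neonatal sepsis example. The original flow
# diagram is not reproduced here and may populate the fields differently.
neonatal_sepsis_sampling = {
    "Purpose of sampling": "Routine diagnostics and / or infection control",
    "Study population": "Neonates",
    "Target epi": "Host infection",
    "Selected by clinical phenotype": "Selected by clinical phenotype",
    "Selected clinical phenotype": "Bloodstream infection",
    "Selected by organism trait": "NOT selected by organism trait",
    "Selected organism trait": "",  # blank: no trait-based selection
    "Sampling period start": "2019-01-01",
    "Sampling period end": "2020-12-31",
}

for field, value in neonatal_sepsis_sampling.items():
    print(f"{field}: {value or '(blank)'}")
```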

Ceftriaxone-resistant infection study

K. pneumoniae identified via routine diagnostic procedures from hospitalised patients in a tertiary care centre between February 2016 and February 2018 were collected. Isolates resistant to ceftriaxone were selected for sequencing.

[Figure: Ceftriaxone-resistant infection study flow diagram]

CPE outbreak study

In May 2019, there was a sudden increase in CPE infections in the ICU of a large tertiary care centre. Enhanced infection prevention and control procedures were activated from 18 May 2019 until 31 August 2019 when the outbreak was declared contained. Rectal screening swabs were collected on patient admission and every three days thereafter, in addition to sink and drain screening swabs. All swabs were cultured on selective media and presumptive carbapenem-resistant K. pneumoniae were sequenced alongside all carbapenem-resistant K. pneumoniae identified from ICU patients via routine diagnostic procedures.

[Figure: CPE outbreak study flow diagram]

CR-hvKp study

Carbapenem-resistant K. pneumoniae were isolated from liver abscess patients as part of a research study focussed on diabetic patients, between 01 June 2018 and 30 June 2020. Strains carrying K. pneumoniae carbapenemase genes were detected by PCR and a string test was used to determine hypermucoviscosity. String test positive isolates harbouring blaKPC were subjected to whole genome sequencing.

[Figure: CR-hvKp study flow diagram]

Pig gut carriage study

Veterinary researchers collected 100 faecal samples from each of six pig farms in June 2017. K. pneumoniae were isolated by culture on SCAI media and subjected to whole genome sequencing as part of a One Health research project.

[Figure: Pig carriage study flow diagram]

(Note that the specific hosts, i.e., pigs, should be indicated in the isolate metadata field 'Host', rather than in the sampling fields.)

Water surveillance study

K. pneumoniae were isolated from fresh and wastewaters in a metropolitan centre as part of routine water surveillance conducted by the Environmental Protection Authority. Since 2021, all isolates have been stocked and 100 isolates have been randomly selected for sequencing each year. Sampling and sequencing are ongoing.

[Figure: Water surveillance study flow diagram]

Citation Information

These data fields ask for the PubMed ID, study title and authors for published data. The title of the study and author list can also be provided for genomes that are not yet published with a PubMed ID or genomes that are currently only in preprint with a DOI. Please complete one row per isolate.

Variable fields, and guidance for completing them, are summarised in the table below.

Status Variable Definition Guidance Value format
REQUIRED if published References PubMed ID for associated publication reporting genome data DOI is acceptable for preprints only. Multiple references can be provided as a list (comma-separated). If no associated publications, leave blank. {text}
RECOMMENDED Study title Title of the study for this isolate For isolates that are part of a study that has not yet been published (no PubMed ID) or isolates that are part of a study that is in preprint (with DOI), please provide the title of the study. {text}
REQUIRED if Study title provided Author list A list of all contributing authors for this study For isolates that are part of a study that has not yet been published (no PubMed ID) or isolates that are part of a study that is currently in preprint (with DOI), please list all contributing authors. {text}
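
Because the References field may hold a comma-separated mix of PubMed IDs and preprint DOIs, a submitter may want to check each entry before submission. The sketch below is illustrative only: `classify_references` is a hypothetical helper, the PubMed ID pattern assumes a purely numeric identifier, and the DOI pattern is a loose check that assumes the DOI itself contains no commas or spaces.

```python
import re

PMID_RE = re.compile(r"\d+")              # assumption: PubMed IDs are all digits
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")   # loose DOI shape, not a full validator


def classify_references(field: str) -> list:
    """Split a comma-separated References value into (kind, reference) pairs."""
    results = []
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue  # a blank field means no associated publication
        if PMID_RE.fullmatch(item):
            kind = "pmid"
        elif DOI_RE.fullmatch(item):
            kind = "doi"
        else:
            kind = "unrecognised"
        results.append((kind, item))
    return results


print(classify_references("12345678, 10.1101/2023.01.01.123456"))
```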

PHA4GE Microbial Data Sharing Accord - Clause Waiver

The data submitted to this metadata repository will be shared in accordance with the principles of the PHA4GE Microbial Data Sharing Accord. The following data fields indicate whether data contributors wish to waive Clause 2 and Clause 7 for each data record they submit.

Variable fields, and guidance for completing them, are summarised in the table below.

Status Variable Definition Guidance Value format
REQUIRED Clause 2 waiver Explicit waiver of Clause 2 of the PHA4GE Accord For this isolate, do you agree to waive Clause 2 of the PHA4GE Accord? Waiving Clause 2 means that authors of publications using your submitted data will not be required to notify you or share a confidential copy for review before publication. Waive | Do Not Waive
REQUIRED Clause 7 waiver Explicit waiver of Clause 7 of the PHA4GE Accord For this isolate, do you agree to waive Clause 7 of the PHA4GE Accord? Waiving Clause 7 means that data users are not required to make a reasonable attempt to collaborate with you before using the data. Waive | Do Not Waive
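
Since the waivers are recorded per isolate, a data user would need to check them record by record. A minimal sketch of that check follows; `clauses_in_force` is a hypothetical helper, and each record is assumed to be a dictionary keyed by the variable names in the table above.

```python
WAIVER_VALUES = {"Waive", "Do Not Waive"}
WAIVER_FIELDS = {"Clause 2 waiver": "Clause 2", "Clause 7 waiver": "Clause 7"}


def clauses_in_force(record: dict) -> list:
    """Return the Accord clauses the contributor has NOT waived."""
    in_force = []
    for field, clause in WAIVER_FIELDS.items():
        value = record.get(field)
        if value not in WAIVER_VALUES:
            # Both fields are REQUIRED, so anything else is an error.
            raise ValueError(f"{field} must be 'Waive' or 'Do Not Waive'")
        if value == "Do Not Waive":
            in_force.append(clause)
    return in_force


record = {"Clause 2 waiver": "Waive", "Clause 7 waiver": "Do Not Waive"}
print(clauses_in_force(record))
```

For the record shown, only Clause 7 remains in force, so a data user would still be expected to make a reasonable attempt to collaborate with the contributor.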

Queries and suggestions

We welcome queries and suggestions from the community on any aspect of this scheme. In particular, please notify us if you think we have missed key data fields or options, or if the guidance is unclear. You can contact us via the issue tracker.

License

The metadata scheme, template and associated resources contained within this repository are freely available for reuse and adaptation under the GNU General Public License v3. We encourage the development of similar schemes for other organisms.
