Pathogen Background

Background

NCBI Pathogen Detection integrates bacterial and fungal pathogen genomic sequences from numerous ongoing surveillance and research efforts. The sequenced samples come from food, from environmental sources such as water or production facilities, and from patients. Foodborne, hospital-acquired, and other clinically infectious pathogens are included.

The system provides two major automated real-time analyses. First, it quickly clusters related pathogen genome sequences to identify potential transmission chains, helping public health scientists investigate disease outbreaks. Second, as part of the National Database of Antibiotic Resistant Organisms (NDARO), NCBI screens bacterial genomic sequences with AMRFinderPlus to identify antimicrobial resistance, stress response, and virulence genes. This screening enables scientists to track the spread of resistance genes and to understand the relationships among antimicrobial resistance, stress response, and virulence.

In this workshop, we will explore NCBI Pathogen Detection data in Google Cloud, with an emphasis on the antimicrobial resistance data.

Learning Objectives

Demonstrate use of BigQuery in the Google Cloud Console and with the command-line bq tool
Show how BigQuery can be used to analyze the microbigge, isolates, and isolate_exceptions tables and how they relate to the web interface
Demonstrate downloading sequences from the Reference Gene Catalog, performing phylogenetic analysis on them, and visualizing the results using iTOL
Demonstrate using gsutil to download MicroBIGG-E contig sequences from cloud storage buckets
Demonstrate the use of seqkit to perform some common operations on FASTA files
Show how to slice out coding sequences from contig sequences and perform simple selection analysis on genes in MicroBIGG-E
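
To make the last objective concrete: slicing a coding sequence out of a contig comes down to coordinate arithmetic plus a reverse complement for genes on the minus strand. The sketch below is only an illustration of that idea in Python, not the procedure used in the workshop projects, which work with seqkit and BED files. It assumes 0-based, end-exclusive coordinates as used in BED; coordinates reported in another convention would need converting first. The function names and the example contig are hypothetical.

```python
from typing import Dict, Iterable, List

# Translation table used to complement DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into {identifier: sequence}.

    The identifier is the first whitespace-delimited word of each header line.
    """
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def slice_cds(contig: str, start: int, end: int, strand: str = "+") -> str:
    """Cut contig[start:end] (0-based, end-exclusive) and orient it by strand."""
    if not 0 <= start < end <= len(contig):
        raise ValueError(f"coordinates {start}-{end} fall outside the contig")
    region = contig[start:end]
    return reverse_complement(region) if strand == "-" else region


if __name__ == "__main__":
    # Hypothetical two-line FASTA record, wrapped to show multi-line handling.
    fasta_text = ">contig_1 hypothetical example\nAAATGGCC\nTAAGG\n"
    contigs = parse_fasta(fasta_text.splitlines())
    print(slice_cds(contigs["contig_1"], 2, 11, "+"))  # ATGGCCTAA
    print(slice_cds(contigs["contig_1"], 2, 11, "-"))  # TTAGGCCAT
```

Keeping the strand handling inside `slice_cds` means every extracted sequence comes out in coding orientation, which is what downstream alignment and selection analysis generally expect.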

Help documentation

Detailed information on the Pathogen Detection resources
Pathogen Detection HowTo page with demonstrations of how to perform some common analysis tasks using our web interfaces
Isolates Browser help
MicroBIGG-E help
Isolates Browser at GCP
MicroBIGG-E at GCP
AMRFinderPlus

Background for workshop

Background PowerPoint

Continue to Project 1

This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM) and the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health.

2022 ASM NGS workshop
