Pathogen Background
NCBI Pathogen Detection integrates bacterial and fungal pathogen genomic sequences from numerous ongoing surveillance and research efforts whose sources include food, environmental sources such as water or production facilities, and patient samples. Foodborne, hospital-acquired, and other clinically infectious pathogens are included.
The system provides two major automated real-time analyses. First, it quickly clusters related pathogen genome sequences to identify potential transmission chains, helping public health scientists investigate disease outbreaks. Second, as part of the National Database of Antibiotic Resistant Organisms (NDARO), NCBI screens bacterial genomic sequences using AMRFinderPlus to identify antimicrobial resistance, stress response, and virulence genes. This enables scientists to track the spread of resistance genes and to understand the relationships among antimicrobial resistance, stress response, and virulence.
In this workshop we will look at NCBI Pathogen Detection data in Google Cloud, with an emphasis on the antimicrobial resistance data.
- Demonstrate use of BigQuery in the Google Cloud Console and with the bq command-line tool
- Show how BigQuery can be used to analyze the microbigge, isolates, and isolate_exceptions tables and how they relate to the web interface
- Demonstrate downloading sequences from the Reference Gene Catalog, phylogenetic analysis, and visualization using iTOL
- Demonstrate using gsutil to download MicroBIGG-E contig sequences from cloud storage buckets
- Demonstrate the use of seqkit to perform some common operations on FASTA files
- Show how to slice out coding sequences from contig sequences and perform simple selection analysis on genes in MicroBIGG-E
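To make the idea of slicing a coding sequence out of a contig concrete, here is a minimal Python sketch. It is not the workshop's own method (the projects use command-line tools for this step). The function names are hypothetical, and the sketch assumes that gene coordinates are 1-based and inclusive and that the strand is given as "+" or "-"; check the MicroBIGG-E help for the actual conventions before relying on it.

```python
# Hypothetical helpers for illustration only; not part of the workshop materials.
# Assumption: start/stop are 1-based, inclusive positions on the contig,
# and strand is "+" or "-".

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def slice_cds(contig: str, start: int, stop: int, strand: str) -> str:
    """Cut the region start..stop out of a contig, oriented 5' to 3'.

    For a gene on the minus strand, the slice is reverse-complemented so
    that the returned sequence reads in the coding direction.
    """
    if not (1 <= start <= stop <= len(contig)):
        raise ValueError("coordinates fall outside the contig")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    region = contig[start - 1:stop]  # convert 1-based inclusive to a Python slice
    return reverse_complement(region) if strand == "-" else region


if __name__ == "__main__":
    # Toy contig with a tiny open reading frame (ATG AAA TAG) at positions 3-11.
    plus_contig = "TTATGAAATAGCC"
    print(slice_cds(plus_contig, 3, 11, "+"))

    # The same gene as it would appear on the minus strand of the
    # reverse-complemented contig.
    minus_contig = reverse_complement(plus_contig)
    print(slice_cds(minus_contig, 3, 11, "-"))
```

Both calls return the same coding sequence, which is the point of handling the strand explicitly: sequences cut from differently oriented contigs need to be in the same orientation before they can be aligned for selection analysis.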
- Detailed information on the Pathogen Detection resources
- Pathogen Detection HowTo page with demonstrations of how to perform some common analysis tasks using our web interfaces
- Isolates Browser help
- MicroBIGG-E help
- Isolates Browser at GCP
- MicroBIGG-E at GCP
- AMRFinderPlus
Continue to Project 1
This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM) and the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health.