Home

2022 ASM NGS Pre-conference workshop

Microbial Pathogen and SARS-CoV-2 Resources in the Cloud

This is the wiki for the pre-conference workshop. It takes you through the projects used in the workshop.

Workshop organization

Setup
Part 1: SRA and NCBI Virus
- Introduction
- Background
- Introductory Exercises
- Projects
  - Project A: Find SARS-CoV-2 data with paired Illumina and ONT samples, generated using ARTICv3
  - Project B: Find SARS-CoV-2 data with low reference coverage
  - Project C: Find variant calls that are common between paired Illumina and ONT SARS-CoV-2 records
Part 2: NCBI Pathogen Detection
- Background
- Exercises
  - Project 1: Use BigQuery to search MicroBIGG-E and Isolates data
  - Project 2: Build a phylogeny of reference blaKPC alleles
  - Project 3: Look for evidence of positively selected sites in blaKPC genes

Workshop goals

Demonstrate the use of BigQuery in the Google Cloud Console and with the bq command-line tool
Show how to use BigQuery to analyze the microbigge, isolates, and isolate_exceptions tables, and how those tables relate to the web interface
Demonstrate downloading sequences from the Reference Gene Catalog, running a phylogenetic analysis on them, and visualizing the result using iTOL
Demonstrate using gsutil to download MicroBIGG-E contig sequences from cloud storage buckets
Demonstrate the use of seqkit to perform some common operations on FASTA files
Show how to slice out coding sequences from contig sequences and perform a simple selection analysis on genes in MicroBIGG-E
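
To make the last goal concrete, here is a minimal Python sketch of what "slicing out a coding sequence" means. It is not the workshop's own method: Project 3 builds a BED file of coordinates and cuts the sequences out of the downloaded contigs. The function names and the toy contig below are hypothetical and chosen only for illustration. The sketch assumes BED-style coordinates (0-based start, end-exclusive) and reverse-complements features on the minus strand.

```python
# Illustrative sketch only: function names and the toy contig are hypothetical.
# Coordinates follow the BED convention: 0-based start, end-exclusive.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def slice_cds(contig: str, start: int, end: int, strand: str = "+") -> str:
    """Cut the interval [start, end) out of a contig.

    Features annotated on the minus strand are reverse-complemented so
    the returned sequence always reads 5' to 3' in the coding direction.
    """
    if not 0 <= start < end <= len(contig):
        raise ValueError("interval falls outside the contig")
    region = contig[start:end]
    return reverse_complement(region) if strand == "-" else region


if __name__ == "__main__":
    contig = "GGATGAAATAGCC"  # toy contig, not real data
    print(slice_cds(contig, 2, 11, "+"))  # plus-strand feature
    print(slice_cds(contig, 2, 11, "-"))  # same interval, minus strand
```

Keeping the strand handling inside the slicing function means every extracted sequence ends up in the same orientation, which downstream steps such as alignment and codon-based selection analysis generally require.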

Don't have a cloud account?

The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative has launched a new NIH Cloud Lab program that lets you experiment with using the cloud for your research. You can request a GCP or AWS account, and you will receive $500 of credit and three months of access, along with biomedical tutorials that walk you through common cloud-based research use cases. The program is currently available to intramural researchers and is expected to open to extramural researchers in the coming months. Learn more at https://cloud.nih.gov/resources/cloudlab/

Has this been useful?

Please let us know at suggest@ncbi.nlm.nih.gov if this has been useful, if you're using any of these resources in your work, and especially if you use them in your publications.

Post-workshop updates

Commands used to set up the VM image: the commands used to install the workshop software on the stock Google Cloud Ubuntu 20.04 image.

This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM) and the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health.

2022 ASM NGS workshop

Setup
Part 1: SRA and Viral Resources
- Introduction
- Background
- Introductory Exercises
  - Exercise 1: Review metadata table contents and help docs
  - Exercise 2: Review taxonomy table and help docs
  - Exercise 3: Review STAT results table and help docs
  - Exercise 4: Review VCF results table and help docs
- Project A: Find SARS-CoV-2 data with paired Illumina and ONT samples, generated using ARTICv3
  - Step 1: Figuring out how to implement each filter individually
    - Query 1: Filter by Organism
    - Query 2: Filter by Platform
    - Query 3: Find additional fields that may have primer strategy information
  - Step 2: Putting it all together
    - Query 4
  - Step 3: Grouping by collection date
    - Query 5
- Project B: Find SARS-CoV-2 data with low reference coverage
  - Part 1: Retrieving SRA data
    - Command 1: Letting the SRAToolkit do all the work
    - Command 2: Using SRAToolkit's SDL function to locate the data and then retrieve it yourself
    - Command 3: Then we download a file and extract the fastq from one of those locations.
  - Part 2: Retrieving a reference genome sequence
    - Command 4: To retrieve the reference genome as a fasta, we can use edirect.
  - Part 3: Writing bash scripts to calculate and plot reference genome coverage
    - Part 3a: A bash script to calculate reference coverage, using minimap2 and samtools.
      - Command 5: First create a file for the coverage calculation script
      - Command 6: Then copy and paste the following code into the file
    - Part 3b: Create a Coverage Plot script.
      - Command 7: Create a file for the coverage plot script.
      - Command 8: Then copy and paste the following code into the file
  - Part 4: Analyzing your SRA records
    - Command 9: Now analyze the two SRA records
- Project C: Find variant calls that are common between paired Illumina and ONT SARS-CoV-2 records
  - Step 1: Different approaches to filtering by mutation
    - Query 1: Filtering by genomic position in nucleotide space
    - Query 2: Filtering by mutation types
    - Query 3: Filtering by associated protein
    - Query 4: Filtering by read support
  - Step 2: Combine what's been learned in Projects A, B and C thus far
    - Query 5: Putting it all together
  - Step 3: Describe our resulting dataset
    - Query 6: Finding average read support per group
Part 2: NCBI Pathogen Detection
- Background
- Project 1
  - Introduction
  - Background
  - Exercises
    - Exercise 1
    - Exercise 2
- Project 2 Generate tree of KPC alleles to examine evolution of size variants
  - Background
  - Step 1 Download FASTA file of all blaKPC alleles from the Reference Gene Catalog
    - Step 1a Make a working directory
    - Step 1b Download the Reference Coding Sequence from the AMRFinderPlus database
  - Step 2 Filter for sequences less than 297 amino-acids in length
    - Step 2a Download the Reference Gene Catalog table
    - Step 2b Get a list of blaKPC genes < 297 amino-acids in length
    - Step 2c Filter and reformat the FASTA file
  - Step 3 Align the sequences with Muscle
  - Step 4 Infer the tree using RAxML
  - Step 5 Visualize the tree in iTOL
    - Step 5a Copy and paste into iTOL
    - Step 5b Download and add iTOL annotation files
    - Step 5c Upload the annotation file to color the taxon names
  - Discussion
- Project 3 Selection analysis on 293-aa blaKPC genes
  - Background
  - Step 1 Get a list of contigs with sequences of interest
    - Step 1a Create a working directory for this project
    - Step 1b Use BigQuery to get a list of contigs with KPC genes
    - Step 1c How many contigs are we looking at?
  - Step 2 Download contig sequences
    - Step 2a Download a subset of the sequences
    - Step 2b Copy in tarball of all contigs
  - Step 3 Cut out coding sequences
    - Step 3a Reformat FASTA contig identifiers and combine to one file
    - Step 3b Create BED file of coordinates we want to cut out
    - Step 3c Cut out coding sequence
  - Step 4 Prepare sequences for selection analysis
  - Step 5 Use RAxML to infer a tree
  - Step 6 Run FUBAR test with HyPhy
  - Discussion
