This repository contains a (re-)analysis of SARS-CoV-2 FASTQs from PRJCA008874 (Lu et al., The Lancet 2020) using the SARS-CoV-2 analysis pipeline developed for UC San Diego's Return to Learn program (C-VIEW).
The FASTQ files from PRJCA008874 can be found here:
- HTTP: https://download.cncb.ac.cn/gsa/CRA006587
- FTP: ftp://download.big.ac.cn/gsa/CRA006587
- Metadata:
CRA006587.xlsx
(Markdown)
Here's a mapping of BioSample to various relevant IDs for convenience:
BioSample | Description | Experiment Accession | Run Accession(s) |
---|---|---|---|
SAMC703641 | WH19001 and WH19005 | CRX398523 | CRR456568 |
SAMC703642 | WH19002 | CRX398524 | CRR456569 |
SAMC703643 | WH19004 | CRX398525 | CRR456570 |
SAMC703644 | WH19008 | CRX398526 | CRR456571 |
SAMC703645 | YS8011 | CRX398527 | CRR456572, CRR456573 |
SAMC703646 | WH01 | CRX398528 | CRR456574, CRR456575, CRR456576, CRR456577, CRR456578, CRR456579, CRR456580, CRR456581, CRR456582, CRR456583, CRR456584, CRR456585, CRR456586, CRR456587 |
SAMC703647 | WH02 | CRX398529 | CRR456588, CRR456589, CRR456590, CRR456591, CRR456592, CRR456593, CRR456594, CRR456595, CRR456596, CRR456597 |
SAMC703648 | WH03 | CRX398530 | CRR456598, CRR456599, CRR456600, CRR456601 |
SAMC703649 | WH04 | CRX398531 | CRR456602, CRR456603, CRR456604, CRR456605 |
I'm using Minimap2 v2.24 to map reads against the NC_045512.2 reference genome, piped to samtools v1.14 to sort and output as BAM.
minimap2 -t 2 -d NC_045512.2.fas.mmi NC_045512.2.fas > NC_045512.2.fas.mmi.log 2>&1
- Input:
NC_045512.2.fas
- Output:
NC_045512.2.fas.mmi
- Log:
NC_045512.2.fas.mmi.log
Some of the FASTQ files are quite huge, so rather than downloading them locally, I will map reads on-the-fly while downloading. For convenience, I have created a CSV file containing the FASTQ FTP paths (fastq_list.csv
). The individual Minimap2 command is as follows:
minimap2 -t THREADS -a -x sr REF.MMI READ1.FASTQ.GZ READ2.FASTQ.GZ | samtools sort --threads THREADS -o SORTED.BAM
-a
means "output in the SAM format"-x sr
means "use Minimap2's genomic short-read mapping preset"
However, I'll use named pipes for the FASTQ files to download + map on-the-fly, and I'll throw out unmapped reads (via Minimap2's --sam-hit-only
flag):
minimap2 --sam-hit-only -t THREADS -a -x sr REF.MMI <(wget -qO- READ1.FASTQ.GZ.URL) <(wget -qO- READ2.FASTQ.GZ.URL) | samtools sort --threads THREADS -o SORTED.BAM
Putting everything together, here's the looped command for doing everything:
for f in $(cat fastq/fastq_list.csv) ; do crr=$(echo $f | cut -d'/' -f6) && url1=$(echo $f | cut -d',' -f1) && url2=$(echo $f | cut -d',' -f2) && minimap2 --sam-hit-only -t 1 -a -x sr ref/NC_045512.2.fas.mmi <(wget -qO- "$url1") <(wget -qO- "$url2") 2> "bam/$crr.01.sorted.untrimmed.bam.log" | samtools sort --threads 1 -o "bam/$crr.01.sorted.untrimmed.bam" ; done
The resulting BAM files can be found in the data/bam
folder.
As can be seen in Table: BioSamples, some BioSamples have multiple FASTQ pairs (and thus BAMs). However, for any downstream analyses, I wanted to merge all of the BAMs corresponding to a single BioSample into a single BAM file. The samtools merge
command merges multiple sorted BAMs into a single sorted output BAM.
samtools merge -o OUTPUT.BAM INPUT1.BAM INPUT2.BAM ...
The resulting merged BAM files can be found in the data/merged_bam
folder.
Note: A preprint was released in June 2022 in which the authors called consensus sequences (and performed downstream analyses on the consensus sequences, e.g. phylogenetic inference, tree dating) using the untrimmed BAMs from this repo. This is a technical Bioinformatics mistake: reads need to be trimmed in order to remove low-quality bases (which we will do in Step 2) prior to any downstream analyses. Trimming is an important Quality Control (QC) step in any sequence data analysis, something I explained to the authors. If you want to learn more about read trimming and why QC is so important in sequence data analysis, this Galaxy tutorial is an excellent resource. This Data Carpentry tutorial about assessing read quality as well as this subsequent tutorial about trimming and filtering are also excellent resources.
I'm using iVar v1.3.1 to quality trim the mapped reads, and I'm using samtools v1.14 to sort the trimmed BAMs.
ivar trim -e -i SORTED_UNTRIMMED.BAM -p UNSORTED_TRIMMED_PREFIX
samtools sort --threads 1 -o SORTED_TRIMMED.BAM UNSORTED_TRIMMED.BAM
-e
means "include reads without primers" (we're not trimming primers, as there were no primers to trim)
The resulting trimmed BAM files can be found in the data/trimmed_bam
folder.
I'm using SamBamViz v0.0.4 to compute various statistics and produce various visualizations from the trimmed BAMs. I'm using a minimum base quality score of 20 (-q 20
) to match standard consensus calling approaches such as iVar Consensus.
SamBamViz.py -i TRIMMED_SORTED.BAM -o OUT_DIR -q 20 --ylog
The resulting SamBamViz output files can be found in the data/sambamviz
folder.
I'm using samtools v1.14 to create pile-up files from the trimmed BAMs. Because iVar Variants and Consensus read the pile-up from standard input anyways, I gzip the pile-up files to save space.
samtools mpileup -B -A -aa -d 0 -Q 0 --reference REF_GENOME.FASTA TRIMMED_SORTED.BAM | gzip -9 > PILEUP.TXT.GZ
-B
means "disable BAQ (per-base alignment quality)"-A
means "count orphans (do not discount anomalous read pairs)"-aa
means "output absolutely all positions, including unused reference sequences"-d 0
means "don't cap the per-file depth"-Q 0
means "skip bases with less than 0 base quality (i.e., don't skip any bases based on base quality)"
The resulting pile-up files can be found in the data/pileup
folder.
I'm using iVar v1.3.1 to call variants.
zcat PILEUP.TXT.GZ | ivar variants -m 10 -r REF_GENOME.FASTA -g REF_GENOME.GFF -p VARIANTS.TSV
-m 10
means "minimum read depth to call a variant = 10 reads"
The resulting variant TSV files can be found in the data/variants
folder.
I'm using iVar v1.3.1 to call consensus sequences.
zcat PILEUP.TXT.GZ | ivar consensus -m 10 -n N -t 0.5 -p OUT_PREFIX
-m 10
means "minimum read depth to call a base in the consensus sequence = 10 reads"-t 0.5
means "a single base must show up in at least 50% of the reads covering a given position to show up in the consensus"-n N
means "use the letterN
to denote ambiguous positions (i.e., positions in which we were unable to call a base)"- To clarify, a position in the consensus sequence can be
N
if (1) it had a coverage of less than 10, or (2) it had a coverage of greater than 10 but none of the 4 possible bases had a frequency greater than or equal to 0.5
- To clarify, a position in the consensus sequence can be
The resulting consensus sequences can be found in the data/consensus
folder.
As mentioned at the beginning of this document, this analysis was conducted using the C-VIEW pipeline developed for UC San Diego's Return to Learn program, which uses iVar Trim to perform trimming. If you suspect that iVar Trim might be trimming the reads too aggressively, rather than skipping quality trimming entirely (which is objectively a bad idea), I would recommend trying a different read trimmer.
One popular option is fastp. Note that fastp (and many other read trimmers) trim FASTQ files before read mapping (whereas iVar Trim trims BAM files after read mapping). As such, if you want to try out a read trimmer that trims FASTQ files (rather than BAM files), you will need to convert the merged untrimmed BAM files back into FASTQ files, e.g. using bamtofastq
. The FASTQ files you will get from using bamtofastq
on the merged untrimmed BAM files will be much smaller than the original FASTQ files because the converted FASTQ files will only contain mapped reads (because the BAM files only contain mapped reads).
After you have converted the untrimmed BAM files into FASTQ files and then trimmed the FASTQ files using fastp (or whichever read trimmer you choose), you will need to remap the trimmed FASTQ files to the reference genome (as described in Step 1; this should be quite fast), which will result in trimmed BAM files. You will then be able to continue the rest of the analysis starting at Step 3.
I have added some extra supplementary exploratory files for samples of interest in the supplement
folder.