ViReport v0.0.1 — 2020-09-20

Input Dataset

The analysis was conducted on a dataset containing 104803 sequences. The average sequence length was 29832.809, with a standard deviation of 142.127. The earliest sample date was 2019-12-24, the median sample date was 2020-04-11, and the most recent sample date was 2020-09-12.

Distribution of input sequence lengths

Distribution of input sample dates

Distribution of input sample categories

Preprocessed Dataset

The input dataset was preprocessed so that sequences had safe names: every character in a sequence ID that was not a letter or digit was converted to an underscore. After preprocessing, the dataset contained 104803 sequences. The average sequence length was 29832.809, with a standard deviation of 142.127. The earliest sample date was 2019-12-24, the median sample date was 2020-04-11, and the most recent sample date was 2020-09-12.
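The renaming rule described above can be illustrated with a minimal Python sketch. It is not taken from the ViReport source: the function name `safe_name` and the example ID are hypothetical, and the sketch assumes that "letters/digits" means ASCII letters and digits.

```python
import re

def safe_name(seq_id: str) -> str:
    """Replace every character that is not an ASCII letter or digit with '_'.

    Assumption: "letters/digits" is read as [A-Za-z0-9]; the real tool may
    treat non-ASCII letters differently.
    """
    return re.sub(r"[^A-Za-z0-9]", "_", seq_id)

# Hypothetical sequence ID, for illustration only.
print(safe_name("hCoV-19/USA/CA-1234/2020|2020-04-11"))
# prints hCoV_19_USA_CA_1234_2020_2020_04_11
```

Note that two distinct IDs can map to the same safe name under a rule like this (for example `a/b` and `a|b`), so a pipeline relying on it may need a uniqueness check.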

Distribution of preprocessed sequence lengths

Distribution of preprocessed sample dates

Distribution of preprocessed sample categories

Multiple Sequence Alignment

Multiple sequence alignment was performed using Minimap2 (Li, 2018). Each input sequence was aligned to the reference sequence (MT072688), and the multiple sequence alignment was constructed based on positions in the reference. There were 29808 positions (2 invariant) and 85463 unique sequences in the multiple sequence alignment. Pairwise distances were computed from the multiple sequence alignment using the tn93 tool of HIV-TRACE (Pond et al., 2018).
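To make "constructed based on positions in the reference" concrete, the following Python sketch projects one pairwise alignment onto reference coordinates. It is an illustration under stated assumptions, not ViReport's implementation: it assumes each alignment is available as a SAM-style CIGAR string with a 0-based reference start, and the function name `project_onto_reference` is hypothetical. Insertions relative to the reference are dropped, and deletions and unaligned flanks become gaps, so every row has exactly the reference length.

```python
import re

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def project_onto_reference(query: str, cigar: str, ref_start: int, ref_len: int) -> str:
    """Return a gapped row of length ref_len for one aligned query.

    ref_start is the 0-based reference position where the alignment begins.
    """
    row = ["-"] * ref_len
    q = 0          # current index into the query
    r = ref_start  # current index into the reference
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":    # aligned bases: copy query onto reference columns
            row[r:r + n] = query[q:q + n]
            q += n
            r += n
        elif op in "IS":   # insertion / soft clip: consumes query only
            q += n
        elif op in "DN":   # deletion / skip: consumes reference only
            r += n
        # H and P consume neither sequence
    return "".join(row)

print(project_onto_reference("ACGTTGA", "3M1I2M1D1M", ref_start=2, ref_len=10))
# prints --ACGTG-A-
```

Because every row shares the reference coordinate system, stacking the rows yields a multiple sequence alignment whose number of columns equals the reference length.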

Distribution of pairwise sequence distances

Across the positions of the multiple sequence alignment, the minimum coverage was 0.192, the maximum coverage was 0.998, and the average coverage was 0.974, with a standard deviation of 0.0352.

Coverage (proportion of non-gap characters) across the positions of the multiple sequence alignment
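Coverage as defined in the caption above can be computed column by column. The sketch below is a minimal illustration, assuming the alignment is held in memory as equal-length strings and that only `-` counts as a gap; whether ambiguous characters such as `N` should also count is not stated in this report. The name `column_coverage` is hypothetical.

```python
def column_coverage(msa: list[str], gap_chars: str = "-") -> list[float]:
    """Proportion of non-gap characters in each alignment column."""
    n = len(msa)
    # zip(*msa) iterates over columns of the alignment.
    return [sum(c not in gap_chars for c in col) / n for col in zip(*msa)]

toy_msa = ["ACGT", "A-GT", "--GT", "ACG-"]
print(column_coverage(toy_msa))
# prints [0.75, 0.5, 1.0, 0.75]
```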

Across the positions of the multiple sequence alignment that had non-zero Shannon entropy, the minimum Shannon entropy was 0.000173, the maximum Shannon entropy was 0.968, and the average Shannon entropy was 0.0025, with a standard deviation of 0.0181.
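For reference, per-position Shannon entropy can be sketched as follows. This is an illustration rather than the report's actual code: it assumes gaps are excluded from the symbol counts and that the logarithm is base 2, neither of which is stated in this report (a natural logarithm would scale every value by ln 2). The name `column_entropy` is hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column: str, gap_chars: str = "-") -> float:
    """Shannon entropy (in bits) of the non-gap symbols in one column."""
    counts = Counter(c for c in column if c not in gap_chars)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0  # empty or invariant column carries no entropy
    return -sum((k / total) * log2(k / total) for k in counts.values())

columns = ["AAAA", "AACC", "AAA-"]
entropies = [column_entropy(col) for col in columns]
# Keep only positions with non-zero entropy, as in the summary above.
nonzero = [h for h in entropies if h > 0]
print(nonzero)
# prints [1.0]
```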

Shannon entropy across the positions of the multiple sequence alignment. A significance threshold was computed using Tukey's Rule: 1.5x the interquartile range added to the third quartile, which was 0.00279. The significance threshold is shown as a red dashed line, and significant points are shown in red.
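Tukey's Rule, as described in the caption above, places the threshold at the third quartile plus 1.5 times the interquartile range. A minimal sketch using the Python standard library follows. The quartile convention (`method="inclusive"`) is an assumption, and other conventions give slightly different thresholds; the name `tukey_upper_fence` and the toy values are hypothetical.

```python
from statistics import quantiles

def tukey_upper_fence(values, k: float = 1.5) -> float:
    """Return Q3 + k * IQR for the given values."""
    # Assumption: inclusive quartiles; the report does not state the method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
threshold = tukey_upper_fence(toy_values)
significant = [v for v in toy_values if v > threshold]
print(threshold, significant)
# prints 14.5 [40]
```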

Phylogenetic Inference

Phylogenetic inference was not performed. Phylogenetic rooting was not performed.

Phylogenetic Dating

Phylogenetic dating was not performed.

Citations

  • Li H. (2018). "Minimap2: pairwise alignment for nucleotide sequences". Bioinformatics. 34(18), 3094-3100.
  • Moshiri N. (2020). "ViReport" (https://github.com/niemasd/ViReport).
  • Pond S.L.K., Weaver S., Leigh Brown A.J., Wertheim J.O. (2018). "HIV-TRACE (TRAnsmission Cluster Engine): a Tool for Large Scale Molecular Epidemiology of HIV-1 and Other Rapidly Evolving Pathogens". Molecular Biology and Evolution. 35(7), 1812-1819.