Skip to content

Latest commit

 

History

History
212 lines (173 loc) · 10.9 KB

abundance.rst

File metadata and controls

212 lines (173 loc) · 10.9 KB

Minimum Abundance Threshold

THAPBI PICT has a default minimum absolute abundance threshold of 100 reads per marker per sample, and 0.1% of the reads per marker per sample, before accepting any unique sequence. Background contamination and PCR noise levels will vary, so having multiple negative_controls will help set this objectively.

In this dataset there is a single negative control for the MOL16S marker, library BIM8M aka SRR5534986. However, we can also treat all the SPH16S libraries as negative controls for the MOL16S marker, and vice versa. You could do this automatically within THAPBI PICT via the -n or --negctrls command line option, but as we shall see in this example it will discard most of the data.

In order to examine an appropriate minimum abundance threshold, the run.sh script provided uses -a 10 -f 0 to accept any unique sequence seen in sample at least ten times (regardless the fraction of the sample read total). This does allow unwanted noise though to the reports.

SPH16S

This was the more specific primer pair, expected to only amplify sphaeriid mussel species, so in general we expect less unique sequences than with the more general MOL16S primers.

Looking at some key columns in the sample report,

$ cut -f 1,2,4,7,9,14 summary/SPH16S.samples.onebp.tsv
<SEE TABLE BELOW>

Or, open SPH16S.samples.onebp.xlsx in Excel. Focusing on the left hand columns, you should see:

#Marker Group Library-name Raw FASTQ Cutadapt Accepted
MOL16S Aquarium BIR2M 306311 2 0
MOL16S Aquarium BIR6M 291954 14 0
MOL16S Control BIM8M 2433 0 0
MOL16S Mock Community SC3PRO1 689712 17 0
MOL16S Mock Community SC3PRO2 405048 0 0
MOL16S Mock Community SC3PRO3 402219 16 0
MOL16S Mock Community NFSC3PRO3 349590 33 10
MOL16S Mock Community SC3PRO4 671241 6 0
MOL16S Mock Community NFSC3PRO4 420015 7 0
MOL16S Mock Community SC3PRO5 480606 13 0
MOL16S River BIM6M 821849 0 0
MOL16S River BIM2M 1119271 0 0
MOL16S River BIM4M 709472 40 19
SPH16S Aquarium BIR2S 498926 251148 209402
SPH16S Aquarium BIR6S 240360 226012 191456
SPH16S Mock Community SPSC3PRO1 425271 317960 224689
SPH16S Mock Community SPSC3PRO2 341476 282516 204249
SPH16S Mock Community SPSC3PRO4 410780 303957 197507

Things to note:

  • In the "Raw FASTQ" column, the control has far fewer raw reads (good).
  • The "Cutadapt" column shows reads after SPH16S primer trimming. There are hundreds of thousands for the final five samples amplified with these primers (good). The first 13 samples were amplified with the MOL16S primers, but still have low levels of sequences matching the SPH16S primers (bad).
  • The "Read count" column is after applying the minimum abundance threshold (here 10). Two negative controls still have reads, lifting the threshold to 20 or more would fix this. These are Sphaerium simile in mock community NFSC3PRO3, and an unknown in river sample BIM4M.

So, using the MOL16S samples as negative controls suggests that for the SPH16S the default minimum abundance threshold is perhaps overly harsh - but using at least 20 would be wise.

MOL16S

We'll initially looking at the same key columns in the sample report,

$ cut -f 1,2,4,7,9,14 summary/MOL16S.samples.onebp.tsv
<SEE TABLE BELOW>

Or, open MOL16S.samples.onebp.xlsx in Excel. Focusing on the left hand columns, you should see:

#Marker Group Library-name Raw FASTQ Cutadapt Accepted
MOL16S Aquarium BIR2M 306311 297657 256386
MOL16S Aquarium BIR6M 291954 286427 256470
MOL16S Control BIM8M 2433 1014 551
MOL16S Mock Community SC3PRO1 689712 656661 550293
MOL16S Mock Community SC3PRO2 405048 377026 297877
MOL16S Mock Community SC3PRO3 402219 380347 304626
MOL16S Mock Community NFSC3PRO3 349590 328956 262963
MOL16S Mock Community SC3PRO4 671241 628644 494257
MOL16S Mock Community NFSC3PRO4 420015 364233 262726
MOL16S Mock Community SC3PRO5 480606 458896 383865
MOL16S River BIM6M 821849 799349 703578
MOL16S River BIM2M 1119271 954787 823782
MOL16S River BIM4M 709472 367539 317363
SPH16S Aquarium BIR2S 498926 25 0
SPH16S Aquarium BIR6S 240360 27 0
SPH16S Mock Community SPSC3PRO1 425271 35 0
SPH16S Mock Community SPSC3PRO2 341476 168 27
SPH16S Mock Community SPSC3PRO4 410780 420 108

Looking at the same points, I see two problems:

  • The control sample BIM8M (SRR5534986) had almost a thousand unwanted MOL16S matches, reduced to 551 with a minimum abundance threshold of 10.
  • All the SPH16S mock community samples have unwanted MOS16S matches, the worst case being SPSC3PRO4 (SRR5534980) with over four hundred reads reduced to 108 with the minimum abundance threshold of 10.

To see exactly what is in these two problematic samples, we can turn to the read report summary/MOL16S.reads.onebp.xlsx in Excel, or the TSV version at the command line (using grep to drop the rows ending with a zero count):

$ cut -f 1,2,3,7,10 summary/MOL16S.reads.onebp.tsv | grep -v "[[:space:]]0$"
#                                                                  Marker           MOL16S
#                                                                  Group            Control
#                                                                  Sample           Blank MOL16S
#                                                                  Library-name     BIM8M
#                                                                  Raw FASTQ        2433
#                                                                  Flash            1963
#                                                                  Cutadapt         1014
#                                                                  Threshold pool   raw_data
#                                                                  Threshold        10
#                                                                  Control          Sample
#                                                                  Singletons       258
#                                                                  Accepted         551
#                                                                  Unique           4
#Marker       MD5                               onebp-predictions  Total-abundance  SRR5534986
MAX or TOTAL  -                                 -                  4914872          478
MOL16S        20c0669e4c6f8436c9d42736df727c83  Sphaerium simile   152924           478
MOL16S        e1d838b4f39bffe88d8c0e79b52700f1  Sphaerium simile   3215             13
MOL16S        778e3dace4b993135e11d450e6c559ff  Sphaerium simile   249              11
MOL16S        a36d3f7291c173c4243f22c2afbd111e  Sphaerium simile   147              49

The unwanted sequences in the control sample are dominated by a single sequence (and three variants of it), which were matched to Sphaerium simile.

This is consistent with the original author's analysis - although our pipeline has produced higher read counts:

Finally, our water blank sample had 71 reads, eight of those being singletons with the remaining belonging to Sphaerium striatinum (Table 9), likely due to amplicon contamination in the lab.

What about the other problematic sample? Again, you can find this in the Excel read report, or at the command line:

$ cut -f 1,2,7,25 summary/MOL16S.reads.onebp.tsv | grep -v "[[:space:]]0$"
#                                               Marker           SPH16S
#                                               Group            Mock Community
#                                               Sample           Mock Community 4 SPH16S
#                                               Library-name     SPSC3PRO4
#                                               Raw FASTQ        410780
#                                               Flash            375539
#                                               Cutadapt         420
#                                               Threshold pool   raw_data
#                                               Threshold        10
#                                               Control          Sample
#                                               Singletons       272
#                                               Accepted         108
#                                               Unique           3
#Marker       MD5                               Total-abundance  SRR5534980
MAX or TOTAL  -                                 4914872          46
MOL16S        ecdaa082b70f5e268f76128693531760  269109           45
MOL16S        98dc259e48de3e258cb93a34c38a9484  120026           17
MOL16S        20c0669e4c6f8436c9d42736df727c83  152924           46

The species names are too long to include in the above, listing them directly:

$ grep -E "(MD5|20c0669e4c6f8436c9d42736df727c83|ecdaa082b70f5e268f76128693531760|98dc259e48de3e258cb93a34c38a9484)" \
  summary/MOL16S.reads.onebp.tsv | cut -f 2,3
<SEE TABLE BELOW>

Giving:

MD5 onebp-predictions
ecdaa082b70f5e268f76128693531760 Dreissena bugensis;Dreissena rostriformis
98dc259e48de3e258cb93a34c38a9484 Dreissena polymorpha
20c0669e4c6f8436c9d42736df727c83 Sphaerium simile

The unwanted mock community sample content is split between Sphaerium and Dreissena, and suggest using a minimum threshold of perhaps 50 reads?

Minimum threshold

Clearly using a minimum abundance threshold of 10 is too low, and it should be increased to at perhaps 50 based on this. However, we have one exceptional sequence present at almost 500 copies. Setting the minimum that high seems excessive - but perhaps the THAPBI PICT default of 100 is more reasonable?