This short BASH script allows Illumina MiSeq FASTQ files to be demultiplexed by their primer pairs before sample demultiplexing. This is useful when there are duplicate barcodes across primers, which will result in multiple samples being merged when sample demultiplexing if the primers aren't separated first.
If you are on a Windows machine, you must be running a Linux virtual machine either with WSL2, or as a Virtual Disk/Emulator. I recommend using WSL2 if you are on Windows 10+.
If you are already on Linux or MacOS, no further action is required, however I have not tested this code on a Mac.
To download the necessary files, first, log in to your BaseSpace account, then follow along with the screenshots below:
- Click the Paper icon > Download > Run.
There are 3 required options to run this code:
- -f forward original fastq (required)
- -r reverse original fastq (required)
- -p forward primer sequence [ITS2; LSUA; LSUBG, rbcLa, psbA3] (required)
- -h show the help message
Each primer sequence is defined within the script, if you need to modfy the script to add in your specific forward primer sequence, please do so at the primer definition section ~line 54, under a new primer heading1. Any wobble bases must be defined and enclosed within square brackets [ ].
- Unzip the BaseSpace files and place the forward and reverse .fastq files into your working directory. This can be done using commandline
gunzip
(WSL2, MacOS, Linux) or via gui (Windows Explorer) using 7zip (Windows). - Download and place the
primer_pull.sh
script into your working directory. - Open a Terminal window and navigate to your working directory using commands such as
cd
,ls
. - Set executable permissions for the script in Terminal using
chmod u+x ./primer_pull.sh
. This will allow your program to run. - Run the script using
./primer_pull.sh -f <forward reads.fastq> -r <reverse_reads.fastq> -p <primer name>
The output files will be created in order:
<primer>_R1.fastq
- this will be your forward reads with the specified primer (keep)<primer>_samples.txt
- this will contain the unique sequence ID that will be used to search your reverse fastq file (can be discarded)<primer>_R2.fastq
- this will be your reverse reads with the specified primer (keep)
A console output will provide a sanity-check, to make sure the number of reads in each file matches. The difference should be 0.
A final elapsed time output will be provided.
Repeat the above for each set of primers within your fastq files.
- After each of your primers have been demultiplexed, proceed with the
demultiplex_dada2.pl
from Dr. Greg Gloor's GitHub repository.- You will need to create a
samples.txt
file for each primer set, examples can be found in the readme section. For ease of creation, I recommend using MS Excel to create a template with all your samples across all primers, and then convert to a text file. Again, you need a separate samples_xx.txt file for each primer set. - Open the samples.txt file in a text editor and double check (!!) that the format is tab-delimited, plain text, Unicode UTF-8, and UNIX line feeds.
- You will need to create a
- To run the
demultiplex_dada2.pl
script from ggloor, you will have to:- Modify the shebang line (if you are in Windows) to:
#!/usr/bin/perl -w
; otherwise for Unix machines, leave the shebang as:#!/usr/bin/env perl -w
- Modify the primer variables to include the length of your primer (counting wobble bases only as 1 nt). These lines start with
my @lprimerlen
andmy @rprimerlen
. - Specify your barcode length just below the primer definition section to the length of most of your barcodes (8). Individual primers with different barcode lengths can be specified as ggloor has done in their original script, e.g.:
- Modify the shebang line (if you are in Windows) to:
$bclen = 8 if $ARGV[3] eq "MCHII_SOSP"; # check that the primer names match, capitalizations included
$bclen = 8 if $ARGV[3] eq "SOSP";
- Alternatively, copy and paste the following chunk of code into your unedited copy from ggloor if you are in the Thorn Lab or are using their primers. Check the primer list below to make sure the same primers are being used. If not, you will have to manually add your primers in and specify the primer length, or your barcodes will not work and your final output table will be unusable.
#!/usr/bin/perl -w # Windows shebang
use strict;
my @lprimerlen = (16, 22, 20, 18, 27, 23, 19); # length of forward primer
my @rprimerlen = (20, 26, 21, 17, 21, 24, 22); # length of reverse primer
my $primer = 1;
if ( defined $ARGV[3]){ # list of all possible primers
$primer = 0 if $ARGV[3] eq "PROKV4"; # fwd: 16 nt, rev: 20 nt
$primer = 2 if $ARGV[3] eq "LSUA"; # fwd: 20, rev: 21
$primer = 1 if $ARGV[3] eq "ITS2"; # fwd: 22, rev: 26
$primer = 2 if $ARGV[3] eq "LSUBG"; # fwd: 20, red: 21
$primer = 4 if $ARGV[3] eq "rbcLa"; # fed: 27, rev: 21
$primer = 5 if $ARGV[3] eq "psbA3"; # fwd: 23, rev: 24
$primer = 6 if $ARGV[3] eq "AMFV4"; # fwd: 19, rev: 22
my %samples;
my $bclen = 8; # Change this to the length of your barcodes (ALL barcodes must be this length, unless specified below)
Run the script using the instructions provided in ggloor's GitHub page for each primer.
- After demultiplexing within samples, you are free to filter, overlap, chimera-check, and classify your sequences2. The easiest is to combine code from the following three scripts:
- the dada2 tutorial. Read and follow along with a smaller subset to make sure you fully understand the process before attempting the Big Data tutorial. Useful segments of code is the Sanity Check section where you get intermediary files that show you read loss information.
- the dada2 Big Data: Paired-End tutorial since it will be likely that you have many samples per primer and you find your R program crashing due to insufficient computer RAM or processing power. Note: this will likely happen if you are on a personal computer. Please read through the details on the single-read version beforehand.
- the dada2 ITS tutorial if any of your primers are ITS primers, please go through this and modify your script to accomodate the varying read lengths expected in ITS sequences.
To successfully publish your data in a manuscript you will have to upload and accession your demultiplexed FASTQ sequence files to an online repository. The easiest option is using the European Nucleotide Archive. The files that will be required for upload will be the zipped, demultiplexed sample files in the demultiplex_$group
folder, with the $group
names matching those in your samples.txt files.
Additionally, you will have to provide information on the parameters used for filtering, merging, chimera-checking, taxonomy assignment, etc. from within dada2. Save your .R scripts for this purpose.
1 It is recommended to download a reliable text editor for your platform: notepad++ for Windows; Sublime Text for Mac or Linux.
2 ASVs are created through dada2, which are becoming increasingly popular over OTUs created by Mothur, QIIME and other demultiplexing pipelines.
AMFV4 (AMV4.5N-F/AMDG-R): rRNA 18S V4; Fungi; Glomeromycota (Sato et al., 2005)
ITS2 (5.8S_Fun [F]/ITS4_Fun [R]): rRNA ITS-2; Fungi (Taylor et al., 2016)
LSUA (28S200A-F/28S476A-R): rRNA 28S D1-D2 region; Fungi; Ascomycota (Asemaninejad et al., 2016)
LSUBG (28S200-F/28S481-R): rRNA 28S D1-D2 region; Fungi; non-Ascomycota (Asemaninejad et al., 2016)
ProkV4 (U518F/806R): rRNA 18S V4 region; Bacteria/Archaea (Caporaso et al., 2011)
rbcLa (rbcLa-F/rbcLa-R): ribulose-1,5-bisphosphate 156 carboxylase/oxygenase; Viridiplantae (Kress et al., 2009)
psbA3 (psbA3-F/trnH-R): trnH-psbA spacer; Viridiplantae (Kress et al., 2009)
**
Asemaninejad, Asma, et al. 2016. “New Primers for Discovering Fungal Diversity Using Nuclear Large Ribosomal DNA.” PLoS ONE 11 (7) doi:10.1371/journal.pone.0159043.
Caporaso, J Gregory, et al. 2011. “Global Patterns of 16S RRNA Diversity at a Depth of Millions of Sequences per Sample.” Proceedings of the National Academy of Sciences 108 (Supplement 1): 4516–4522. doi:10.1073/pnas.1000080107.
Kress, W.J., et al. 2009. “Plant DNA Barcodes and a Community Phylogeny of a Tropical Forest Dynamics Plot in Panama.” Proceedings of the National Academy of Sciences USA 106 (44): 18621–18626 doi:10.1073/pnas.0909820106.
Taylor, D Lee, et al. 2016. “Accurate Estimation of Fungal Diversity and Abundance through Improved Lineage-Specific Primers Optimized for Illumina Amplicon Sequencing.” Applied and Environmental Microbiology 82 (24): 7217–7226 doi.org/10.1128/AEM.02576-16.
Sato, Kouichi, et al. 2005. “A New Primer for Discrimination of Arbuscular Mycorrhizal Fungi with Polymerase Chain Reaction-Denature Gradient Gel Electrophoresis.” Grassland Science 51 (2): 179–181. doi:10.1111/j.1744-697X.2005.00023.x
For citing this script, please use something similar to:
Weerasuriya, Nimalka M. 2021. primer_pull.sh for Demultiplexing Primers in a MiSEQ Run. GitHub repository, https://github.com/nweerasu/primer_pull