#!/bin/bash
## Copyright (C) 2023 Jonathan D. Magasin
OUTDIR="Data.nobackup"
PARAMS=$1
## If missing params file or if -h,--help, -help, etc., then print usage.
if [ -z "$PARAMS" ] || [ ! -z `echo $PARAMS | grep '\-h'` ] ; then
cat <<-EOUSAGE
This script uses DADA2 to identify nifH amplicon sequence
variants (ASVs) in paired MiSeq data sets.
Usage:
run_DADA2_pipeline.sh params.csv > log.date.txt
Will run the pipeline using the specified parameters, most of which specify
how DADA2 should filter reads for quality. Saving output in the log file is
recommended but not required.
Before you run this script you must: (1) Install the required tools [see
note at end]; (2) Use organizeFastqs.R to create a directory structure (in
LinksToFastqs) that organizes the FASTQs into "processing groups":
Processing groups: The FASTQ files are partitioned into non-overlapping
sets that are run through DADA2 independently (during stage 4). Each
group also gets its own error models (stage 3). Although you can define
processing groups however you like, you probably want them to partition
the FASTQs by sequencing type (DNA vs. mRNA) and size fraction, e.g. so
that nifH genes vs. transcripts are processed separately for each size
fraction. I prefer this approach for three reasons. First, amplicon
relative abundances differ greatly with respect to sequence type and
size fraction, and I do not want that to influence DADA2's ASV
inference. Second, we gain confidence in the ASVs if we see them in
different processing groups. Third, it is easier to manage and write
scripts for the outputs when they are organized by directories that
follow the processing groups, e.g. path/to/output/DNA/Filter0.2_3.
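For example, a hypothetical LinksToFastqs layout with two processing
groups, DNA/Filter0.2_3 and RNA/Filter0.2_3, would look like:
    LinksToFastqs/DNA/Filter0.2_3/sampleA/sampleA_R1.fastq.gz
    LinksToFastqs/DNA/Filter0.2_3/sampleA/sampleA_R2.fastq.gz
    LinksToFastqs/RNA/Filter0.2_3/sampleB/sampleB_R1.fastq.gz
    LinksToFastqs/RNA/Filter0.2_3/sampleB/sampleB_R2.fastq.gz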
The main stages of this script are:
1. Primer trimming using cutadapt. The default primers are:
forward: 5’ – TGY GAY CCN AAR GCN GA – 3’
reverse: 5’ – ADN GCC ATC ATY TCN CC – 3’
but you can specify other primers by setting 'forward' and 'reverse'
in the parameters file. The script runCutadapt.sh uses the specified
primers and puts the trimmed FASTQs in $OUTDIR/Data.trimmed.
2. Identification of reads that likely encode NifH. These are used in the
next stage. Output is stored in $OUTDIR/Data.NifH_prefilter.
3. Build error models for each processing group. Only the NifH-like
reads found in stage 2 are used, because I have found that excluding
the non-NifH-like reads (presumably off-target PCR products) results
in better error models. The intuition is that non-nifH amplicons are
highly variable (compared to nifH) garbage that makes it harder for
DADA2 to learn the sequencing error models. Error models are stored in
$OUTDIR/ErrorModels.
4. Run DADA2 on each processing group. The script that runs DADA2,
dada2_nifH_amplicons.R, has these stages: read quality filtering; ASV
inference ("denoising", using the stage 3 error models); paired-read
merging to create full-length ASVs; and removal of chimeric ASVs.
Output is stored in $OUTDIR/Dada2PipeOutput/<processing group>/Out.<datestamp>
If you rerun this script, previous outputs will be reused. Usually this is
desired for stages 1-3 (e.g. there is no point in trimming primers again)
and allows you to experiment with DADA2 (stage 4) parameters. However, if
you want to rerun DADA2 but the datestamp already exists, just rename or
delete the previous Out.<datestamp> directory.
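For example (hypothetical name): mv Out.2023Jan05 Out.2023Jan05.firstTry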
Required tools: This script depends on many external tools (R packages,
cutadapt, HMMER3, ...) nearly all of which can be installed using
miniconda3. The document INSTALL.txt describes how to get set up to run
the pipeline.
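For illustration only, a hypothetical minimal params.csv (these are the
parameter names this script reads; the values shown are placeholders,
not necessarily the defaults):
    forward,TGYGAYCCNAARGCNGA
    reverse,ADNGCCATCATYTCNCC
    allowMissingPrimers,false
    NifH_minLen,33
    NifH_minBits,150
    skipNifHErrorModels,FALSE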
EOUSAGE
exit 1
elif [ ! -f "$PARAMS" ] ; then
echo "Cannot find the parameters file \"$PARAMS\". Note that you may pass the"
echo "example params.csv file in which every parameter has been commented out"
echo "(in which case defaults will be used)."
exit 1
fi
## Absolute path of the scripts dir. 'scripts' must be next to run_DADA2_pipeline.sh.
SDIR="$(dirname $(realpath "${BASH_SOURCE[0]}"))/scripts"
if [ ! -d "$SDIR" ] ; then echo "Cannot find pipeline scripts directory."; exit -1 ; fi
CWD=`pwd`
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
echo "Started DADA2 analysis scripts on" `date`
echo "Working directory is $CWD"
echo "Will use parameters file $PARAMS"
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
echo
echo "All outputs will be in the directory $OUTDIR. Previous intermediate"
echo "results will generally be reused, e.g. primer-trimmed reads and error models."
echo "The DADA2 analyses for each processing group will be redone and saved in"
echo "date-stamped directories."
mkdir -p "$OUTDIR"
## Use LinksToFastqs, which is hard coded in organizeFastqs.R. We could instead
## pass a name to this script, but that seems error prone. One should not run
## the pipeline on different data partitions in the same working directory.
FQDIR=LinksToFastqs
if [ ! -d "$FQDIR" ] ; then
echo "There should be a directory called $FQDIR that you created with the"
echo "organizeFastqs.R script."
exit 1
fi
echo
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
echo "Building FASTQ lists that define which samples will be processed together."
## Note that this code is simplified compared to the Baltic data where there are
## separate lists for {DNA,RNA}x{sizeFracA,sizeFracB}. See the Baltic script
## for future projects that require separate processing groups.
rflist="RawFastq.list"
if [ ! -f "$rflist" ] ; then
find -L "$FQDIR" -name "*.fastq.gz" | sort > "$rflist"
else
echo "$rflist already exists."
fi
## Check whether the FASTQ names are structured as expected by the pipeline,
## so that we do not have to wait until a later stage only to get a crash (in
## Fastq2Samp(), which defines the convention encoded right here).
echo
echo "Checking if FASTQ names follow format: {Samp}{_stuff1}_R{1,2}{stuff2}.fastq.gz"
## Derive the implied sample name from each R1 FASTQ (drop .fastq.gz, then
## everything from _R1 onward, then everything from the first remaining "_").
basename -a $(cat "$rflist") | grep '_R1' \
| sed -e 's/\.fastq\.gz$//' -e 's/_R1.*$//' -e 's/_.*$//' \
| sort | uniq -c \
> checkSampleNamesImpliedByFastqNames.tmp
## Count sample names implied by more than one R1 FASTQ. Compare the count
## field numerically (a grep for " 1" would wrongly pass counts of 10-19).
numDupSamps=$(awk '$1 != 1' checkSampleNamesImpliedByFastqNames.tmp | wc -l)
if [ "$numDupSamps" -ge 1 ] ; then
echo "It looks like you have used invalid FASTQ file names because they start with"
echo "sample names that are duplicated. FASTQ names must be structured like this:"
echo " {Samp}{_stuff1}_R{1,2}{stuff2}.fastq.gz"
echo "where:"
echo " - Samp can have any character other than \"_\"."
echo " - stuff1, if present, can have any character but must be flanked by \"_\""
echo " Usually stuff1 will be/include the sequencing lane (L001). stuff1 is"
echo " not interesting to the study and is excluded from column names in the"
echo " final abundance table."
echo " - stuff2 can be anything, or absent"
echo " - The .fastq.gz could be absent, but that would be bad style."
echo "Your FASTQ names imply the following $numDupSamps duplicated sample names,"
echo "each preceded by the number of FASTQ pairs that share the sample name."
cat checkSampleNamesImpliedByFastqNames.tmp
exit 1
fi
rm checkSampleNamesImpliedByFastqNames.tmp
echo "FASTQ names seem okay -- at least they do not imply duplicated sample names."
echo
echo "There are "`cat "$rflist" | wc -l`" FASTQ files in $rflist."
## Processing groups are defined as the distinct paths up to but excluding
## the folder that holds the FASTQs, with the leading LinksToFastqs component
## removed. E.g.,
## LinksToFastqs/foo/bar/s1/s1_R1.fastq.gz
## LinksToFastqs/foo/bar/s1/s1_R2.fastq.gz
## LinksToFastqs/foo/bif/s2/s2_R1.fastq.gz
## LinksToFastqs/foo/bif/s2/s2_R2.fastq.gz
## has two processing groups, foo/bar and foo/bif. For this script the key is
## that the processing groups encode partial paths so we can store and find
## outputs as we go.
while read line; do
x="$(dirname $line)"
echo "$(dirname $x)"
done < "$rflist" | sed 's/^'"$FQDIR"'\///' | sort | uniq > processingGroups.txt
echo
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
echo "Running cutadapt. Presuming paired reads and the standard nifH primers."
echo "Results will be in Data.trimmed and each sample will have its own detailed"
echo "log from cutadapt."
if [ -d "$OUTDIR/Data.trimmed" ] ; then
echo "Directory Data.trimmmed already exists so skipping cutadapt."
else
## Not necessary to use processing groups for cutadapt.
## Note that the runCutadapt output FASTQ paths have the same directory
## structure as their inputs except that the first folder is renamed as directed.
fwd=`grep "^forward" "$PARAMS" | cut -d, -f2 | tr -d '[:space:]'`
rev=`grep "^reverse" "$PARAMS" | cut -d, -f2 | tr -d '[:space:]'`
anp=`grep "^allowMissingPrimers" "$PARAMS" | cut -d, -f2 | tr -d '[:space:]'`
"$SDIR/runCutadapt.sh" "$rflist" "$OUTDIR/Data.trimmed" "$fwd" "$rev" "$anp" 2>&1
if [ "$?" -ne 0 ] ; then
echo "runCutadapt.sh failed. Aborting pipeline."
exit 1
fi
## One summary across all processing groups.
casum="$OUTDIR/Data.trimmed/summary.cutadapt.txt"
echo "Summarizing all of the cutadapt logs (all processing groups) in $casum"
echo "# Summary of reads that passed cutadapt filters." > "$casum"
for x in `find "$OUTDIR/Data.trimmed" -name cutadapt.log`; do
echo -n -e "$x:\t\t" >> "$casum"
echo "$(grep '^Pairs written (passing filters):' "$x" | cut -d: -f2 | tr -d '[:space:]')" \
>> "$casum"
done
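## Each summary line pairs a cutadapt log with its "Pairs written" value,
## e.g. (hypothetical): Data.trimmed/DNA/Filter0.2_3/sampleA/cutadapt.log: 123,456(95.1%)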
fi
echo
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
echo "Identifying reads that look like NifH because those will lead to better DADA2"
echo "error models. Will predict ORFs and then search them for Fer4_NifH domains"
echo "using HMMER and trusted cutoffs for the domain (defined in the hmm). Will"
echo "then filter for predicted domains that have sufficient length and bit score"
echo "as specified in the params file (or default values will be used). Only the R1"
echo "reads are used in this stage."
## This is to support 16S runs. It forces run_DADA2_pipeline.sh to let DADA2
## build its own error models rather than use models precomputed from
## NifH-like R1 reads.
SKIP_NIFH_ERROR_MODELS=`grep -i '^skipNifHErrorModels' "$PARAMS" | grep -i 'true'`
if [ ! -z "$SKIP_NIFH_ERROR_MODELS" ] ; then
echo "Skipping this step because you asked not to use precomputed error models." ;
elif [ -d "$OUTDIR/Data.NifH_prefilter" ] ; then
echo "Looks like you already did this step. Skipping."
else
NIFH_MINLEN=`grep "^NifH_minLen" "$PARAMS" | cut -d, -f2 | tr -d '[:space:]'`
NIFH_MINBITS=`grep "^NifH_minBits" "$PARAMS" | cut -d, -f2 | tr -d '[:space:]'`
if [ -z "$NIFH_MINLEN" ] ; then echo " - Will use the default length cutoff." ; fi
if [ -z "$NIFH_MINBITS" ] ; then echo " - Will use the default bit score cutoff."; fi
cd "$OUTDIR"
## Not necessary to use processing groups.
"$SDIR/NifH_prefilter/run_NifH_prefilter.sh" Data.trimmed "$NIFH_MINLEN" "$NIFH_MINBITS"
if [ "$?" -ne 0 ] ; then
echo "run_NifH_prefilter.sh failed. Aborting pipeline."
exit 1
fi
echo
echo "Done. The NifH-like search results are in $OUTDIR/Data.NifH_prefilter."
echo "Now summarizing how many R1 reads were processed, how many ORFs were predicted,"
echo "and how many of the reads had NifH identified at specified cutoffs."
pfsum="Data.NifH_prefilter/summary.NifH_prefilter.txt"
printf "%-30s%15s%15s%15s%15s\n" Sample R1.reads Orfs OrfsTC R1.reads.NifH > "$pfsum"
for lf in `find Data.NifH_prefilter -name log.nifScan.txt`; do
## Print just the matching text (only on any lines matched).
samp=$(basename "$(dirname "$lf")")
nseq=`cat $lf | sed -n 's/^no\. of seqs: \([0-9][0-9]*\)/\1/p'`
norf=`cat $lf | sed -n 's/^Predicted \([0-9][0-9]*\) orfs\./\1/p'`
## Approximately how many ORFs beat the trusted cutoffs for the NifH model.
## Slow so just count the domtab lines -- assume an ORF only gets 1 table entry.
hmmtc=`gunzip -c "$(dirname $lf)/hmmsearch.Fer_NifH.domtab.gz" | grep -vc '^#'`
nread=`cat $lf | sed -n 's/^Identified \([0-9][0-9]*\) reads with NifH domains\./\1/p'`
printf "%-30s%15s%15s%15s%15s\n" $samp $nseq $norf $hmmtc $nread >> "$pfsum"
done
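## For example, a hypothetical row of the summary (same format as the printf above):
## sampleA                                50000          48000          21000          20500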
cd "$CWD"
fi
echo
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
echo "Building error models from the forward reads that appeared to have NifH domains."
echo "Separate models will be built for each processing group."
if [ ! -z "$SKIP_NIFH_ERROR_MODELS" ] ; then
echo "Skipping precomputed error models based on NifH-like reads." ;
else
while read pgrp; do
pdesc=`echo "$pgrp" | tr '/' '.'`
logfile="log.learnErrors.${pdesc}.txt"
if [ -f "$OUTDIR/ErrorModels/$pgrp/${logfile}" ] ; then
echo "Error models already exist for $pgrp ($logfile found). Skipping."
else
cd "$OUTDIR"
echo " -- Working on error models for processing group $pgrp"
## pgrp encodes the path for the processing group.
Rscript "${SDIR}/learnErrorsFromFastq.R" "Data.NifH_prefilter/$pgrp" > "$logfile"
if [ "$?" -ne 0 ] ; then
echo "learnErrorsFromFastq.R exited abnormally. Aborting pipeline."
exit 1
fi
mkdir -p "ErrorModels/$pgrp"
mv errorModel.rds errorModel.pdf "$logfile" "ErrorModels/$pgrp"
cd "$CWD"
fi
done < processingGroups.txt
fi
echo
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
echo "The main event, DADA2! Will run DADA2 separately for each processing group,"
echo "with independently inferred ASVs across DADA2 runs. A later stage of the"
echo "pipeline will relabel ASV id's to AUID's (ASV unique id's) for identifying"
echo "sequences found across DADA2 runs."
## One date stamp is used for all pgroups (even if DADA2 finishes the last group
## after midnight). The stamp is so that we can have different DADA2 runs (with
## parameter tweaks) stored alongside each other. (Previous stages of the
## pipeline change less frequently and their params should be finalized before
## playing with DADA2 params.)
STAMP=$(date '+%Y%b%d')  ## E.g. 2023Jan05. More robust than parsing date's default output.
while read pgrp; do
pdesc=`echo "$pgrp" | tr '/' '.'`
logfile="log.dada2.${pdesc}.${STAMP}.txt"
if [ -f "$OUTDIR/Dada2PipeOutput/$pgrp/${logfile}" ] ; then
echo "DADA2 output already exists for $pgrp ($logfile found). Skipping."
else
cd "$OUTDIR"
echo ""
echo " #######################################################"
echo " ## Running DADA2 on processing group $pgrp"
echo " ##"
echo " ##"
## Shared results directory for this processing group. Add a time stamp for
## the DADA2 outdir so different runs (e.g. parameter tweaks) can be stored
## alongside each other.
mkdir -p "Dada2PipeOutput/$pgrp"
trimdir="Data.trimmed/$pgrp"
dada2dir="Dada2PipeOutput/$pgrp/Out.${STAMP}"
echo "For this DADA2 run the output will be in $dada2dir"
if [ -z "$SKIP_NIFH_ERROR_MODELS" ] ; then
EMPARAM="ErrorModels/$pgrp/errorModel.rds"
fi
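## $EMPARAM is left unquoted below so that it expands to no argument at all
## when error models were skipped (quoting it would pass an empty string).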
Rscript "$SDIR/dada2_nifH_amplicons.R" "$trimdir" "$dada2dir" \
"../$PARAMS" $EMPARAM \
2> "$logfile"
if [ "$?" -ne 0 ] ; then
echo "dada2_nifH_amplicons.R exited abnormally. Aborting pipeline."
exit 1
fi
mv "$logfile" "${dada2dir}/"
## Tally reads retained at each stage. The first arg is the raw reads dir
## for this pgroup and takes a while to load. Although the script caches
## the counts for subsequent runs, it uses a hard-coded cache name which
## would have collisions across pgroups. So delete the cache.
"$SDIR/countReadsAtStages.R" "$CWD/$FQDIR/$pgrp" "$trimdir" "$dada2dir"
if [ "$?" -ne 0 ] ; then
echo "countReadsAtStages.R exited abnormally. Aborting pipeline."
exit 1
fi
rm -f readCountsCacheForScript_countReadsAtStages.RData
rcdp="readCountsDuringProcessing.${pdesc}.${STAMP}.csv"
rcdpPng="readCountsDuringProcessing.${pdesc}.${STAMP}.png"
mv readCountsDuringProcessing.csv "$rcdp"
mv readCountsDuringProcessing.png "$rcdpPng"
mv "$rcdp" "$rcdpPng" "${dada2dir}/"
echo " ##"
echo " ## Finished DADA2 processing group $pgrp"
echo " #######################################################"
echo ""
cd "$CWD"
fi
done < processingGroups.txt
echo
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
echo "Finished DADA2 analysis scripts on" `date`
echo "Working directory is $CWD"
echo "##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### ##### #####"
exit 0
### END END END ####