Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion error in few processesed files #369

Closed
LauriPa opened this issue Mar 19, 2019 · 4 comments

Comments

@LauriPa
Copy link

commented Mar 19, 2019

I am running cutadapt as part of DADA2 pipeline to analyze paired-end MiSeq 16S sequencing data. Since my sequences contains 0-7 b long heterogeneity spacer sequence, as described in Fadrosh et al. 2014, in addition to primer sequences, I had to remove them by using primer sequences as a template instead of just cutting fixed number of bases from each end.

So, I am relatively new to running things on command prompt, but after persistent trial and error, I managed to install cutadapt to my WSL Ubuntu 18.04.

As DADA2 is R program, I am naturally running it on latest version of R that is 3.5.3. Python is also updated to 3.6.7 and cutadapt is version 2.1

This is the code that I was running in R:
`

FWD <- "CCTACGGGNGGCWGCAG"
REV <- "GACTACHVGGGTATCTAATCC"

#Removal of ambiguous bases
fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filterd files in filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = FALSE)

cutadapt <- "/home/lauri/.local/bin/cutadapt"

path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)

R1.flags <- paste("-g", FWD, "-a", REV.RC)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
fnFs.filtN[i], fnRs.filtN[i])) # input files
}

Running this begun smoothly and I got these prints as expected:

=== Summary ===

Total read pairs processed: 71,065
Read 1 with adapter: 70,393 (99.1%)
Read 2 with adapter: 70,102 (98.6%)
Pairs written (passing filters): 71,065 (100.0%)

Total basepairs processed: 42,922,924 bp
Read 1: 23,166,984 bp
Read 2: 19,755,940 bp
Total written (filtered): 38,799,372 bp (90.4%)
Read 1: 21,163,432 bp
Read 2: 17,635,940 bp

=== First read: Adapter 1 ===

Sequence: GGATTAGATACCCBDGTAGTC; Type: regular 3'; Length: 21; Trimmed: 3938 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2

Bases preceding removed adapters:
A: 16.5%
C: 3.2%
G: 44.7%
T: 35.6%
none/other: 0.0%

.... (I skipped lengthy print of removed sequences, some of which were suspiciously long)

=== First read: Adapter 2 ===

Sequence: CCTACGGGNGGCWGCAG; Type: regular 5'; Length: 17; Trimmed: 70265 times.

No. of allowed errors:
0-9 bp: 0; 10-16 bp: 1

Overview of removed sequences
length count expect max.err error counts
3 25 1110.4 0 25
11 3 0.0 1 1 2
12 1 0.0 1 0 1
14 5 0.0 1 4 1
15 14 0.0 1 8 6
16 282 0.0 1 87 195
17 18379 0.0 1 18261 118
18 324 0.0 1 62 262
19 18035 0.0 1 17885 150
20 305 0.0 1 61 244
21 15572 0.0 1 15458 114
22 575 0.0 1 292 283
23 16654 0.0 1 16539 115
24 86 0.0 1 29 57
25 2 0.0 1 2
27 1 0.0 1 1
28 1 0.0 1 1
67 1 0.0 1 1

....(skipped prints for the second read, they were pretty similar to this one)

Things went pretty good until I got this error message:

This is cutadapt 2.1 with Python 3.6.7
Command line parameters: -g CCTACGGGNGGCWGCAG -a GGATTAGATACCCBDGTAGTC -G GACTACHVGGGTATCTAATCC -A CTGCWGCCNCCCGTAGG -n 2 -o /mnt/d/DADA2analyysi/cutadapt/A059-7M-3-CAGTTCCA-TGTCTAAG-Paulamaki-run20190304R_S59_L001_R1_001.fastq.gz -p /mnt/d/DADA2analyysi/cutadapt/A059-7M-3-CAGTTCCA-TGTCTAAG-Paulamaki-run20190304R_S59_L001_R2_001.fastq.gz /mnt/d/DADA2analyysi/filtN/A059-7M-3-CAGTTCCA-TGTCTAAG-Paulamaki-run20190304R_S59_L001_R1_001.fastq.gz /mnt/d/DADA2analyysi/filtN/A059-7M-3-CAGTTCCA-TGTCTAAG-Paulamaki-run20190304R_S59_L001_R2_001.fastq.gz
Processing reads on 1 core in paired-end mode ...
[ 8<---------] 00:00:06 30,000 reads @ 209.6 µs/read; 0.29 M reads/minuteTraceback (most recent call last):
File "/home/lauri/.local/bin/cutadapt", line 11, in
sys.exit(main())
File "/home/lauri/.local/lib/python3.6/site-packages/cutadapt/main.py", line 803, in main
stats = runner.run()
File "/home/lauri/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 727, in run
(n, total1_bp, total2_bp) = self._pipeline.process_reads(progress=self._progress)
File "/home/lauri/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 324, in process_reads
read1, read2 = modifier(read1, read2, matches1, matches2)
File "/home/lauri/.local/lib/python3.6/site-packages/cutadapt/modifiers.py", line 33, in call
return self._modifier1(read1, matches1), self._modifier2(read2, matches2)
File "/home/lauri/.local/lib/python3.6/site-packages/cutadapt/modifiers.py", line 179, in call
match = AdapterCutter.best_match(self.adapters, trimmed_read)
File "/home/lauri/.local/lib/python3.6/site-packages/cutadapt/modifiers.py", line 107, in best_match
match = adapter.match_to(read)
File "/home/lauri/.local/lib/python3.6/site-packages/cutadapt/adapters.py", line 841, in match_to
alignment = self.aligner.locate(read_seq)
File "src/cutadapt/_align.pyx", line 515, in cutadapt._align.Aligner.locate
AssertionError

Code kept running after this and proceed to process some more files succesfully. Only 5 samples out of 30 were affected and looks like they are missing most of the reads. As an example one sample lost over 70% of its reads. This ended up being pretty lengthy post, but I wasn't sure what to include.

@marcelm

This comment has been minimized.

Copy link
Owner

commented Mar 19, 2019

It appears you have found a bug. Can you make one of the problematic file pairs available to me?

@LauriPa

This comment has been minimized.

Copy link
Author

commented Mar 19, 2019

I just emailed them to you.

@marcelm

This comment has been minimized.

Copy link
Owner

commented Mar 19, 2019

Thanks, I got the sequences.

Note for myself: I’ve been able to pinpoint the problem to an empty read in the input file, combined with using a wildcard in the adapter sequence. This will reproduce the problem:

printf "@r\n\n+\n\n" | cutadapt -o /dev/null -g CWC -

marcelm added a commit that referenced this issue Mar 19, 2019

marcelm added a commit that referenced this issue Mar 19, 2019

marcelm added a commit that referenced this issue Mar 19, 2019

@marcelm

This comment has been minimized.

Copy link
Owner

commented Mar 19, 2019

This is fixed now in the development version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.