Support UMI pre-processing with a 3rd file containing UMIs #23

chapmanb · 2018-01-03T20:47:27Z

Shifu;
Thanks for this great tool and adding pre-processing for UMIs. I've been looking for faster options to replace our use of umis (https://github.com/vals/umis) for pre-processing UMI outputs and adding into read headers. We typically end up with UMIs in a 3rd file as outputs from bcl2fastq when the UMIs are present in the input reads, and I wondered if this is possible to support?

I had a quick dig into the code to start implementing but realized you have specialized iterators for pairs so didn't want to break too much by trying to have a 3 input iterator, thinking there might be a better way to integrate.

Here is an example case with R1/R3 as the first/second read pair and R2 as the UMI:

https://s3.amazonaws.com/chapmanb/testcases/fastp_umi_example.tar.gz

Thanks for any thoughts and suggestions for processing these with fastp.

sfchen · 2018-01-04T00:09:37Z

Brad,

It's possible to support, but it may take much time to implement since it breaks the design of paired-end reading. I need to check whether this case is common.

However, I can't understand why you put the UMIs in a 3rd file. If UMIs are presented in the read (either R1 or R2), bcl2fastq can shift them to the corresponding read headers directly.

chapmanb · 2018-01-04T01:38:12Z

Shifu;
Thanks much for considering this. The problem with putting them in the header using bcl2fastq is that the UMIs aren't present in read 1 or read 2, they're in a separate index read so are only supported coming out as the third file. This is fairly common method but unfortunately bcl2fastq doesn't support doing anything clever like attaching them into the other paired reads. Thanks again for looking into support for this.

sfchen · 2018-01-04T04:43:40Z

If my guess is right, you're using one of the two Illumina sample indexes (I7 or I5) as UMI. This is exactly the same as my lab does, and the UMI processing for such kind of data is already supported by fastp.

Note that the sample indexes are also present in the reads (both R1 and R2), so actually you don't have to input a 3rd index-only file. For example, a typical read is like:

@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA

In this case, index 1 (I7) is TATAGCCT, and index 2 (I5) is GACCCCCA

By specifying --umi_loc=index2, fastp will append index 2 (I5) to the end of the first part of the read header (both R1 and R2), and the read will become:

@NS500713:64:HFKJJBGXY:1:11101:1675:1101:GACCCCCA 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA

And --umi_loc=index1 will do the same thing for index 1 (I7).

If my guess was right, you can look into the README (https://github.com/OpenGene/fastp#unique-molecular-identifer-umi-processing) for more details.

But if my guess was wrong, please kindly let me know.

chapmanb · 2018-01-04T12:13:59Z

Shifu;
Thanks for this discussion. It would be great if we had a way to export directly into the indexes but am not sure how to do this with bcl2fastq. The setup we have is dual indexes with the UMI in I7 so it looks like IIIIIIIINNNNNN where I is the index and N is the UMI. We want to use the dual index for demultiplexing and then export the UMIs. So assuming 150 bp reads, 8 bp dual indices and single 6bp UMI in the first index, the bcl2fastq looks like 150Y, 8I6Y, 8I, 150Y where the 150Y, 6Y and 150Y generate the three files: read1, UMI and read2, respectively. The 8Is generate the two indexes used in demultiplexing.

Is there a way to handle both the demultiplexing and export into the header from bcl2fastq? I definitely see how to use fastp for this case if we could export either to an index or one of read1 or read2, but I might easily be ignorant about how best to setup bcl2fastq to generate something compatible. Thanks again for helping with this.

sfchen · 2018-01-04T13:39:50Z

Thanks Brad, I now understood your difficulties. You're right, AFAIK, bcl2fastq cannot use a part of index as UMI.

If you use fastp only for processing UMI (without any filtering/adapter/correction), there is a way to do: processing R1+R3 and R2+R3 individually. For example, your command must look like:

fastp -i R1.fq -I R3.fq -o R1.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1
fastp -i R2.fq -I R3.fq -o R2.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1

The -Q -A -L options disable quality filtering, adapter trimming and length filtering so all input reads will be output, to keep R1.out.fq and R2.out.fq consistent.

You may also specify -G to disable polyG end trimming if you're processing data from NextSeq / NovaSeq, but polyG end trimming will not discard reads so it will not break the read consistency.

The performance will be mainly unaffected since only R3.fq is loaded twice, which is very small since it's only 6 bp long. And actually you can run these two commands in parallel so that the performance may be even better :)

Updated on 1/11/2018: add -w 1 to the commands to limit the worker threads to keep read order of R1 and R2.

chapmanb · 2018-01-04T14:13:32Z

Shifu;
That is perfect, thanks so much for the clever idea. I was trying to envision all these more complicated approaches to concatenating R1 + R2 and streaming into fastp but this is much simpler. We don't need trimming/filtering so will turn off all those options to get identical outputs. Thanks again for helping think through this, much appreciated.

sfchen · 2018-01-11T10:30:08Z

Hi Brad,

Although you've closed this issue, I still have an important note about the method I proposed to run R1+R3 and R2+R3 individually.

You should always run these two commands with only 1 worker thread by specifying -w 1. If you run them with more than 1 thread, the read order of output.R1 and output.R2 may be inconsistent since read order is not kept by multi-threading output.

So the commands should be:

fastp -i R1.fq -I R3.fq -o R1.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1
fastp -i R2.fq -I R3.fq -o R2.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1

I hope you have noted this issue already.

Thanks
Shifu

sfchen · 2018-01-11T10:39:56Z

I also commented on bcbio: bcbio/bcbio-nextgen@c5b60bc

Fixes OpenGene/fastp#23

chapmanb · 2018-01-11T10:51:48Z

Thanks so much for the heads up, apologies for missing this on my side. I've pushed the fix to use a single core, thank you again.

dh10 · 2019-02-26T18:42:13Z

It seems to help to add: -u 100 -n $UMI_LENGTH -Y 100 -G

Low quality sequences otherwise seem to break fastq sync using just the options originally proposed

chapmanb closed this as completed Jan 4, 2018

sfchen reopened this Jan 11, 2018

chapmanb added a commit to bcbio/bcbio-nextgen that referenced this issue Jan 11, 2018

UMI prep: single core to maintain fastq order

dfbdd89

Fixes OpenGene/fastp#23

chapmanb closed this as completed Jan 11, 2018

sfchen mentioned this issue Aug 3, 2018

Support UMI located in separate index read #80

Closed

dh10 mentioned this issue Nov 12, 2019

Adding UMI from third file fails with zero length reads #207

Open

migrau mentioned this issue Nov 23, 2022

UMI in fastq read names nf-core/sarek#746

Closed

lauren-tjoeka mentioned this issue Jul 25, 2024

Support for UMI FASTQ nf-core/sarek#871

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support UMI pre-processing with a 3rd file containing UMIs #23

Support UMI pre-processing with a 3rd file containing UMIs #23

chapmanb commented Jan 3, 2018

sfchen commented Jan 4, 2018 •

edited

Loading

chapmanb commented Jan 4, 2018

sfchen commented Jan 4, 2018 •

edited

Loading

chapmanb commented Jan 4, 2018

sfchen commented Jan 4, 2018 •

edited

Loading

chapmanb commented Jan 4, 2018

sfchen commented Jan 11, 2018

sfchen commented Jan 11, 2018

chapmanb commented Jan 11, 2018

dh10 commented Feb 26, 2019

Support UMI pre-processing with a 3rd file containing UMIs #23

Support UMI pre-processing with a 3rd file containing UMIs #23

Comments

chapmanb commented Jan 3, 2018

sfchen commented Jan 4, 2018 • edited Loading

chapmanb commented Jan 4, 2018

sfchen commented Jan 4, 2018 • edited Loading

chapmanb commented Jan 4, 2018

sfchen commented Jan 4, 2018 • edited Loading

chapmanb commented Jan 4, 2018

sfchen commented Jan 11, 2018

sfchen commented Jan 11, 2018

chapmanb commented Jan 11, 2018

dh10 commented Feb 26, 2019

sfchen commented Jan 4, 2018 •

edited

Loading

sfchen commented Jan 4, 2018 •

edited

Loading

sfchen commented Jan 4, 2018 •

edited

Loading