Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support UMI pre-processing with a 3rd file containing UMIs #23

Closed
chapmanb opened this issue Jan 3, 2018 · 10 comments
Closed

Support UMI pre-processing with a 3rd file containing UMIs #23

chapmanb opened this issue Jan 3, 2018 · 10 comments

Comments

@chapmanb
Copy link

chapmanb commented Jan 3, 2018

Shifu;
Thanks for this great tool and adding pre-processing for UMIs. I've been looking for faster options to replace our use of umis (https://github.com/vals/umis) for pre-processing UMI outputs and adding into read headers. We typically end up with UMIs in a 3rd file as outputs from bcl2fastq when the UMIs are present in the input reads, and I wondered if this is possible to support?

I had a quick dig into the code to start implementing but realized you have specialized iterators for pairs so didn't want to break too much by trying to have a 3 input iterator, thinking there might be a better way to integrate.

Here is an example case with R1/R3 as the first/second read pair and R2 as the UMI:

https://s3.amazonaws.com/chapmanb/testcases/fastp_umi_example.tar.gz

Thanks for any thoughts and suggestions for processing these with fastp.

@sfchen
Copy link
Member

sfchen commented Jan 4, 2018

Brad,

It's possible to support, but it may take much time to implement since it breaks the design of paired-end reading. I need to check whether this case is common.

However, I can't understand why you put the UMIs in a 3rd file. If UMIs are presented in the read (either R1 or R2), bcl2fastq can shift them to the corresponding read headers directly.

@chapmanb
Copy link
Author

chapmanb commented Jan 4, 2018

Shifu;
Thanks much for considering this. The problem with putting them in the header using bcl2fastq is that the UMIs aren't present in read 1 or read 2, they're in a separate index read so are only supported coming out as the third file. This is fairly common method but unfortunately bcl2fastq doesn't support doing anything clever like attaching them into the other paired reads. Thanks again for looking into support for this.

@sfchen
Copy link
Member

sfchen commented Jan 4, 2018

If my guess is right, you're using one of the two Illumina sample indexes (I7 or I5) as UMI. This is exactly the same as my lab does, and the UMI processing for such kind of data is already supported by fastp.

Note that the sample indexes are also present in the reads (both R1 and R2), so actually you don't have to input a 3rd index-only file. For example, a typical read is like:

@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA

In this case, index 1 (I7) is TATAGCCT, and index 2 (I5) is GACCCCCA

By specifying --umi_loc=index2, fastp will append index 2 (I5) to the end of the first part of the read header (both R1 and R2), and the read will become:

@NS500713:64:HFKJJBGXY:1:11101:1675:1101:GACCCCCA 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA

And --umi_loc=index1 will do the same thing for index 1 (I7).

If my guess was right, you can look into the README (https://github.com/OpenGene/fastp#unique-molecular-identifer-umi-processing) for more details.

But if my guess was wrong, please kindly let me know.

@chapmanb
Copy link
Author

chapmanb commented Jan 4, 2018

Shifu;
Thanks for this discussion. It would be great if we had a way to export directly into the indexes but am not sure how to do this with bcl2fastq. The setup we have is dual indexes with the UMI in I7 so it looks like IIIIIIIINNNNNN where I is the index and N is the UMI. We want to use the dual index for demultiplexing and then export the UMIs. So assuming 150 bp reads, 8 bp dual indices and single 6bp UMI in the first index, the bcl2fastq looks like 150Y, 8I6Y, 8I, 150Y where the 150Y, 6Y and 150Y generate the three files: read1, UMI and read2, respectively. The 8Is generate the two indexes used in demultiplexing.

Is there a way to handle both the demultiplexing and export into the header from bcl2fastq? I definitely see how to use fastp for this case if we could export either to an index or one of read1 or read2, but I might easily be ignorant about how best to setup bcl2fastq to generate something compatible. Thanks again for helping with this.

@sfchen
Copy link
Member

sfchen commented Jan 4, 2018

Thanks Brad, I now understood your difficulties. You're right, AFAIK, bcl2fastq cannot use a part of index as UMI.

If you use fastp only for processing UMI (without any filtering/adapter/correction), there is a way to do: processing R1+R3 and R2+R3 individually. For example, your command must look like:

fastp -i R1.fq -I R3.fq -o R1.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1
fastp -i R2.fq -I R3.fq -o R2.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1

The -Q -A -L options disable quality filtering, adapter trimming and length filtering so all input reads will be output, to keep R1.out.fq and R2.out.fq consistent.

You may also specify -G to disable polyG end trimming if you're processing data from NextSeq / NovaSeq, but polyG end trimming will not discard reads so it will not break the read consistency.

The performance will be mainly unaffected since only R3.fq is loaded twice, which is very small since it's only 6 bp long. And actually you can run these two commands in parallel so that the performance may be even better :)

Updated on 1/11/2018: add -w 1 to the commands to limit the worker threads to keep read order of R1 and R2.

@chapmanb
Copy link
Author

chapmanb commented Jan 4, 2018

Shifu;
That is perfect, thanks so much for the clever idea. I was trying to envision all these more complicated approaches to concatenating R1 + R2 and streaming into fastp but this is much simpler. We don't need trimming/filtering so will turn off all those options to get identical outputs. Thanks again for helping think through this, much appreciated.

@chapmanb chapmanb closed this as completed Jan 4, 2018
@sfchen
Copy link
Member

sfchen commented Jan 11, 2018

Hi Brad,

Although you've closed this issue, I still have an important note about the method I proposed to run R1+R3 and R2+R3 individually.

You should always run these two commands with only 1 worker thread by specifying -w 1. If you run them with more than 1 thread, the read order of output.R1 and output.R2 may be inconsistent since read order is not kept by multi-threading output.

So the commands should be:

fastp -i R1.fq -I R3.fq -o R1.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1
fastp -i R2.fq -I R3.fq -o R2.out.fq -O R3.out.fq --umi --umi_loc=read2 --umi_len=6 -Q -A -L -w 1

I hope you have noted this issue already.

Thanks
Shifu

@sfchen
Copy link
Member

sfchen commented Jan 11, 2018

I also commented on bcbio: bcbio/bcbio-nextgen@c5b60bc

@sfchen sfchen reopened this Jan 11, 2018
chapmanb added a commit to bcbio/bcbio-nextgen that referenced this issue Jan 11, 2018
@chapmanb
Copy link
Author

Thanks so much for the heads up, apologies for missing this on my side. I've pushed the fix to use a single core, thank you again.

@dh10
Copy link

dh10 commented Feb 26, 2019

It seems to help to add: -u 100 -n $UMI_LENGTH -Y 100 -G

Low quality sequences otherwise seem to break fastq sync using just the options originally proposed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants