Support UMI pre-processing with a 3rd file containing UMIs #23
Comments
Brad, it's possible to support this, but it may take a long time to implement because it breaks the design of paired-end reading. I need to check whether this case is common. However, I don't understand why you put the UMIs in a 3rd file. If the UMIs are present in the reads (either R1 or R2), bcl2fastq can move them to the corresponding read headers directly.
Shifu; |
If my guess is right, you're using one of the two Illumina sample indexes (I7 or I5) as the UMI. This is exactly what my lab does, and UMI processing for this kind of data is already supported by fastp. Note that the sample indexes are also present in the read headers (for both R1 and R2), so you don't actually have to input a 3rd index-only file. For example, a typical read looks like:
In this case, index 1 (I7) is By specifying
If my guess is right, you can look at the README (https://github.com/OpenGene/fastp#unique-molecular-identifer-umi-processing) for more details. If my guess is wrong, please let me know.
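To illustrate the point about indexes already being in the read headers, here is a minimal Python sketch that pulls the index field out of a header line. It assumes the common Illumina (Casava 1.8+) header layout, where the part after the space is read:is_filtered:control_number:index and dual indexes are joined with "+". The function names and the example header are hypothetical, and this is not how fastp itself parses headers.

```python
def index_from_header(header: str) -> str:
    """Return the index field of a Casava 1.8+ style FASTQ header.

    Assumed layout: '@<read id> <read>:<is_filtered>:<control>:<index>'.
    """
    parts = header.rstrip("\n").split(" ")
    if len(parts) < 2:
        raise ValueError("header has no comment field after the read id")
    fields = parts[1].split(":")
    if len(fields) < 4:
        raise ValueError("comment field does not contain an index")
    return fields[3]


def split_dual_index(index: str) -> tuple:
    """Split 'I7+I5' into (I7, I5); I5 is '' for a single-index run."""
    i7, _, i5 = index.partition("+")
    return i7, i5


# Hypothetical header, made up for illustration only.
example = "@M00001:1:000000000-ABCDE:1:1101:15589:1332 1:N:0:ACGTACGT+TTGCAAGC"
i7, i5 = split_dual_index(index_from_header(example))
print(i7, i5)
```

If one of the two indexes is being used as the UMI, the corresponding half of this field is the sequence of interest.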
Shifu; Is there a way to handle both the demultiplexing and the export into the header from bcl2fastq? I definitely see how to use fastp for this case if we could export either to an index or to one of read1 or read2, but I may well be ignorant of how best to set up bcl2fastq to generate something compatible. Thanks again for helping with this.
Thanks Brad, I now understand your difficulty. You're right: AFAIK, bcl2fastq cannot use part of an index as the UMI. If you use fastp only for UMI processing (without any filtering, adapter trimming, or correction), there is a way to do it: process R1+R3 and R2+R3 separately. For example, your commands would look like:
The You may also specify The performance will be largely unaffected, since the only file loaded twice is R3.fq, which is very small because its reads are only 6 bp long. You can also run these two commands in parallel, so the performance may be even better :) Updated on 1/11/2018: add
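As a rough picture of what each of the two runs (R1+R3 and R2+R3) accomplishes, the sketch below pairs every read with the record at the same position in the UMI file and moves the UMI sequence into the read name. This is only an illustration in Python, not fastp's implementation: the function names, the ":" separator, and the placement of the UMI right after the first part of the read name are assumptions.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # (header, sequence, plus line, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group FASTQ lines into 4-line records (a trailing partial record is ignored)."""
    stripped = (line.rstrip("\n") for line in lines)
    return zip(*[stripped] * 4)


def tag_with_umi(reads: Iterable[Record], umis: Iterable[Record],
                 sep: str = ":") -> Iterator[Record]:
    """Append the UMI sequence to the first part of each read name.

    Reads and UMIs are matched purely by position, so both inputs must be
    in the same order. zip() stops at the shorter input.
    """
    for (name, seq, plus, qual), (umi_name, umi_seq, _, _) in zip(reads, umis):
        read_id, space, comment = name.partition(" ")
        if read_id != umi_name.split(" ")[0]:
            raise ValueError(f"read {read_id} does not match UMI record {umi_name}")
        yield (f"{read_id}{sep}{umi_seq}{space}{comment}", seq, plus, qual)


# Tiny made-up records for illustration.
r1 = ["@r1 1:N:0:X\n", "ACGT\n", "+\n", "IIII\n"]
r3 = ["@r1 2:N:0:X\n", "AACCGG\n", "+\n", "IIIIII\n"]
tagged = list(tag_with_umi(read_fastq(r1), read_fastq(r3)))
print(tagged[0][0])
```

Running the same function once with R1 and once with R2 against the same UMI file mirrors the two-command approach; the two outputs only stay paired if both runs keep the reads in their original order.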
Shifu; |
Hi Brad, Although you've closed this issue, I still have an important note about the method I proposed to run R1+R3 and R2+R3 individually. You should always run these two commands with only 1 worker thread by specifying So the commands should be:
I hope you have already noticed this issue. Thanks
I also commented on |
Thanks so much for the heads up, and apologies for missing this on my side. I've pushed the fix to use a single core. Thank you again.
It seems to help to add: -u 100 -n $UMI_LENGTH -Y 100 -G. Otherwise, low-quality sequences seem to break FASTQ sync when using just the options originally proposed.
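Since R1 and R2 are processed in two separate runs, it may be worth checking that the two output files are still in sync afterwards. The following is a minimal Python sketch of such a check, assuming that paired records should share the same read id (the header up to the first space, ignoring a trailing /1 or /2). The function names are hypothetical and the check is not part of fastp.

```python
from itertools import zip_longest
from typing import Iterable, Optional


def read_id(header: str) -> str:
    """Read id: header up to the first space, without a trailing /1 or /2."""
    rid = header.rstrip("\n").split(" ")[0]
    if rid.endswith("/1") or rid.endswith("/2"):
        rid = rid[:-2]
    return rid


def first_desync(r1_lines: Iterable[str], r2_lines: Iterable[str]) -> Optional[int]:
    """Return the 0-based index of the first record where R1 and R2 disagree.

    Returns None when every record pairs up. A missing record in either
    file (different record counts) also counts as a disagreement.
    """
    # Header lines are every 4th line, starting with the first.
    h1 = (line for i, line in enumerate(r1_lines) if i % 4 == 0)
    h2 = (line for i, line in enumerate(r2_lines) if i % 4 == 0)
    for n, (a, b) in enumerate(zip_longest(h1, h2)):
        if a is None or b is None or read_id(a) != read_id(b):
            return n
    return None


# Tiny made-up records for illustration.
r1 = ["@a 1\n", "A\n", "+\n", "I\n", "@b 1\n", "C\n", "+\n", "I\n"]
r2 = ["@a 2\n", "T\n", "+\n", "I\n", "@b 2\n", "G\n", "+\n", "I\n"]
print(first_desync(r1, r2))      # in sync
print(first_desync(r1, r2[4:]))  # R2 lost its first record
```

With real files, the two arguments can be open file handles, for example first_desync(open(path_to_r1), open(path_to_r2)), where both paths are placeholders.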
Shifu;
Thanks for this great tool and adding pre-processing for UMIs. I've been looking for faster options to replace our use of umis (https://github.com/vals/umis) for pre-processing UMI outputs and adding into read headers. We typically end up with UMIs in a 3rd file as outputs from bcl2fastq when the UMIs are present in the input reads, and I wondered if this is possible to support?
I had a quick dig into the code to start implementing but realized you have specialized iterators for pairs so didn't want to break too much by trying to have a 3 input iterator, thinking there might be a better way to integrate.
Here is an example case with R1/R3 as the first/second read pair and R2 as the UMI:
https://s3.amazonaws.com/chapmanb/testcases/fastp_umi_example.tar.gz
Thanks for any thoughts and suggestions for processing these with fastp.