Read names do not match with dual UMIs #46

Closed
carlandt opened this issue Mar 28, 2018 · 10 comments

Comments

@carlandt

Thank you for making your wonderful tool!

For dual-UMI experiments, there may/should be different UMI tags on the forward and reverse read of a pair. Is there an option (now or in development) to remove the UMI tags from each read and place them on both of the resultant reads? Downstream tools require that the read names be the same so if there are different UMI tags on the forward and reverse of a pair, it will fail. Instead it should have the read name, followed by a delimiter between the forward and reverse UMI tags.

For instance, in fastq_1.fq.gz
read_1_name:etc:etc:etc:etc:etc:etc:read_1_tagread_2_tag

And in the pair, fastq_2.fq.gz
read_2_name:etc:etc:etc:etc:etc:etc:read_1_tagread_2_tag

sfchen added a commit that referenced this issue Mar 29, 2018
@sfchen
Member

sfchen commented Mar 29, 2018

I just updated the behaviour of UMI preprocessing for the per_index and per_read modes.

  • per_index: index1_index2 is used as the UMI for both read1 and read2.
  • per_read: umi1 is taken from the head of read1 and umi2 from the head of read2; umi1_umi2 is used as the UMI for both read1 and read2.

Could you please try building fastp from the latest code on master, or download http://opengene.org/fastp/fastp to test?
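
The per_read behaviour described above can be sketched in a few lines of Python. This is only an illustration, not fastp's actual code: the function name `tag_pair_with_umis` and the tuple layout are hypothetical, and the `:` and `_` separators are assumed from the read names pasted later in this thread.

```python
def tag_pair_with_umis(read1, read2, umi_len=8):
    """Move the first umi_len bases of each mate into both read names.

    Each read is a (header, sequence, quality) tuple.
    Hypothetical helper for illustration; not part of fastp.
    """
    (h1, s1, q1), (h2, s2, q2) = read1, read2
    # umi1_umi2: the same combined tag is written to both mates.
    umi = s1[:umi_len] + "_" + s2[:umi_len]

    def retag(header, seq, qual):
        # Append the tag to the read name, i.e. before the first space.
        name, sep, comment = header.partition(" ")
        # Trim the UMI bases (and their qualities) off the read itself.
        return (name + ":" + umi + sep + comment, seq[umi_len:], qual[umi_len:])

    return retag(h1, s1, q1), retag(h2, s2, q2)
```

Because both mates receive the identical `umi1_umi2` tag, their read names stay equal, which is what the original request asked for.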

@carlandt
Author

carlandt commented Apr 1, 2018

Thanks for the fast response! I'll hit it on Monday and let you know. Sorry for the poor spelling in the name of the bug, that's a bit embarrassing =P

Thanks again for the wonderful tool, loving it so far =)

@sfchen
Member

sfchen commented Apr 9, 2018

Any update?

@carlandt carlandt changed the title Read names do not matched with dual UMIs Read names do not match with dual UMIs Apr 11, 2018
@carlandt
Author

Tried out
fastp -i read_1.fastq.gz -I read_2.fastq.gz -o read_umi_1.fastq -O read_umi_2.fastq -U --umi_loc=per_read --umi_len=8
Was that what you were thinking?

If so, I'm still just getting the UMI from each read put on that read, not shared across both reads, e.g. umi1_umi2.

Used version 0.12.6 - the one at http://opengene.org/fastp/fastp

@sfchen
Member

sfchen commented Apr 11, 2018

Can you paste some reads here?

@carlandt
Author

carlandt commented Apr 16, 2018

Sure! Here are the input reads, the output reads, and what I was hoping to see:


$ zcat 782404_V1_L1.N701_505_1.fastq.gz | head -n 8
@M01378:492:000000000-BLK46:1:1101:9647:3917 1:N:0: TAAGGCGA-TTAAGGAG
CACGCGAGTGGAGCTGAGCAGCCTGAGATCTGACCGTCTCCTCAGGTACGCACCCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGGTATGTTGGTTTATGTTTTTTTTGGTTGGATGTGGTATGGTTTTTTTTTGTTTGTTTTTTGTTAT
+
ABBBBBBBBBBFGGGGFGFGFFHHHHHHGFHFHFFGGGHGGHHFHFFHD1AEEEEHHFHGH5FFGFH5GHHHHHGFFHHGCEEEFHGC?FDFFGFBBG//<E/F?CF/GFBDDG@GF2G2@DHF0F2FFC#######################################################################
@M01378:492:000000000-BLK46:1:1101:11212:3937 1:N:0: TAAGGCGA-TTAAGGAG
TGGGGGAATGGAGCTGAGCAGCCTGAGATCTGGGCGTTCACCCAGGCTTCCACGTTCCCCTCGCTTGGGTCACCGTCTCCTCAGGTAAGAGGTCAGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACTCAGGTGCGATGTGCAGCTGTCTTGTG
+
BBBBBBBBFFFFGGGGDACE4FGEFFHGHGDHCBGGGGFHGFH1G1BAFGHHHDCGFFGFHG?EFHGGEEFFFGGHGHHHHHHBGGHBB4FEFGGFBGHFHHHHHHFGEFHFGFHGHG/@/<BFFFC?/?DHHH1GFCFGCEHHFCEGFFDFDFAEFHBGHFF.00CC#################################


$ zcat 782404_V1_L1.N701_505_2.fastq.gz | head -n 8
@M01378:492:000000000-BLK46:1:1101:9647:3917 2:N:0: TAAGGCGA-TTAAGGAG
TGAGGGTGCTTACCTGCGGCGACGGTCAGATCTCTTTCTTCTCATCTCCACTCGCTTGCTGTCTCTTATACACATCTGACGCTGCCGACTACTCCTTTCTTGTTTTTTTTTTTTTTCTTCTTTTTTTTTTTTTTCATCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTT
+
11111111AFFFGGGE1A0000AAAE?/01D1G22212A12DA22ADA11ADE/////>D1B1@@FG1BB22@1BBF12@/>?######################################################################################################################
@M01378:492:000000000-BLK46:1:1101:11212:3937 2:N:0: TAAGGCGA-TTAAGGAG
TCTGACCTCTTACCTGCGGAGACGGTGACCCAAGCTAGTTGAACTTGGAAGCCTGTTTGAACTCCCAGATCTCAGGCTGCTCAGCTCCATTCCCCCACTGTCTCTTATACACATCTGTCGCTGCCGACGTCTCCTTACGTGTTGTTCTCTGTTTTCTTCTTTTCTTTTTTATCTCTTTTTTTGTTGTTCTTTTGTTTTTTC
+
11>1>11BFFF@FGGG1A10AABABECE1A000//11112122A1A11110/0B0111011A1AAA01/BAFE220>>0>FE101BB@1@F2@1>//>01BBGFHG1D22211B>E122//?//0//////0011121/@F@0/@/11111?1<01111111?######################################


# here is the fastp command, correct me if I'm wrong
$ fastp -i 782404_V1_L1.N701_505_1.fastq.gz -I 782404_V1_L1.N701_505_2.fastq.gz -o 782404_V1_L1.N701_505_umi_1.fastq -O 782404_V1_L1.N701_505_umi_2.fastq -U --umi_loc=per_read --umi_len=8


# the output reads only have the UMI per read, not both as I had hoped and perhaps explained incorrectly

$ head -n 8 782404_V1_L1.N701_505_umi_1.fastq
@M01378:492:000000000-BLK46:1:1101:11212:3937:TGGGGGAA 1:N:0: TAAGGCGA-TTAAGGAG
TGGAGCTGAGCAGCCTGAGATCTGGGCGTTCACCCAGGCTTCCACGTTCCCCTCGCTTGGGTCACCGTCTCCTCAGGTAAGAGGTCAGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACTCAGGTGCGATGTGCAGCTGTCTTGTG
+
FFFFGGGGDACE4FGEFFHGHGDHCBGGGGFHGFH1G1BAFGHHHDCGFFGFHG?EFHGGEEFFFGGHGHHHHHHBGGHBB4FEFGGFBGHFHHHHHHFGEFHFGFHGHG/@/<BFFFC?/?DHHH1GFCFGCEHHFCEGFFDFDFAEFHBGHFF.00CC#################################
@M01378:492:000000000-BLK46:1:1101:11675:3965:CAGGGGGG 1:N:0: TAAGGCGA-CTAAGGAG
TGGAGCTGAGCAGCCTGAAGTGCAACCGGGAAGGGAAGGAGTGGGAGACGGTACTCACCAGCCGGACCCTCACTGCTGCGGGCAGCTGTGACGTGGTGTGTGTCGCCTGTGAAAAAAGGATGCTGTCAGTGTTCTCCACCTGTGGTCACCGTCTCCTCAGGTAAGGCGCTTCTCTGTCTCTTATACACATCTC
+
@DDAGGF1@GFGGGF0BFHB>FGF1C0E/<B//C//0/<0/?FFAGFCFGAAGCFHE1FD0FF--<-<CGHHGCGHEHEG-?@@.9FFC/0;CEBAAAABAFFF--;B?9B/;/9-9B?BFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFB/;FFFF?@@FFFFFFFFFFFBFFFFFFFFFF9


$ head -n 8 782404_V1_L1.N701_505_umi_2.fastq
@M01378:492:000000000-BLK46:1:1101:11212:3937:TCTGACCT 2:N:0: TAAGGCGA-TTAAGGAG
CTTACCTGCGGAGACGGTGACCCAAGCTAGTTGAACTTGGAAGCCTGTTTGAACTCCCAGATCTCAGGCTGCTCAGCTCCATTCCCCCACTGTCTCTTATACACATCTGTCGCTGCCGACGTCTCCTTACGTGTTGTTCTCTGTTTTCTTCTTTTCTTTTTTATCTCTTTTTTTGTTGTTCTTTTGTTTTTTC
+
FFF@FGGG1A10AABABECE1A000//11112122A1A11110/0B0111011A1AAA01/BAFE220>>0>FE101BB@1@F2@1>//>01BBGFHG1D22211B>E122//?//0//////0011121/@F@0/@/11111?1<01111111?######################################
@M01378:492:000000000-BLK46:1:1101:11675:3965:TGAAGCGC 2:N:0: TAAGGCGA-CTAAGGAG
CTTACCTGAGGAGACGGTGACCACAGGTTGTTCACACTGACAGCCTCCTTTTTTCACAGTCTACACACACCACGTCACAGCTGCCCGCAGCAGTGAGGTTCCGGCTGTTGTGTACCGTCTCCCACTCCTTCCCTTCCCGTTTGCACTTCAGTCTGCTCAGCTCCACCCCCCTGCTGTCTCTTATACACATCTG
+
>AADGGGF1B01A0BA0AA01B1A00AB10//1A12A011111//AB/A1FDFGB2D211@121101?/?>///B//>B1/B110//////00>11100B1<E//>/0B10/B2@2<</@0100<?0<GDF0FGHG00..0<A<11>B1<11=F0=D<0//=</C/..-:;@ACB0;BFFFFGB0B00CFFF0

@carlandt
Author

carlandt commented Apr 16, 2018

As an example, I was hoping that first forward read would have come out with the UMI of both itself and the reverse read delimited in some way:

@M01378:492:000000000-BLK46:1:1101:9647:3917:CACGCGAG-TGAGGGTG 1:N:0: TAAGGCGA-TTAAGGAG
CACGCGAGTGGAGCTGAGCAGCCTGAGATCTGACCGTCTCCTCAGGTACGCACCCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGGTATGTTGGTTTATGTTTTTTTTGGTTGGATGTGGTATGGTTTTTTTTTGTTTGTTTTTTGTTAT
+
ABBBBBBBBBBFGGGGFGFGFFHHHHHHGFHFHFFGGGHGGHHFHFFHD1AEEEEHHFHGH5FFGFH5GHHHHHGFFHHGCEEEFHGC?FDFFGFBBG//<E/F?CF/GFBDDG@GF2G2@DHF0F2FFC#######################################################################

That way the read name of the forward and the reverse read would be the same (except for the 1:N:0 part) and BWA would still take it.
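
The matching requirement can be checked with a small Python sketch. It assumes, as in the headers pasted in this thread, that the read name is everything before the first space and that the `1:N:0` / `2:N:0` part lives in the comment after it. The function name `mate_names_match` is hypothetical.

```python
def mate_names_match(header1, header2):
    """Return True if two FASTQ headers carry the same read name.

    The read name is taken to be the text before the first space;
    the comment after it (e.g. "1:N:0: ...") is ignored.
    """
    name1 = header1.split(" ", 1)[0]
    name2 = header2.split(" ", 1)[0]
    return name1 == name2
```

With a per-read tag on each mate (`...:3937:TGGGGGAA` versus `...:3937:TCTGACCT`) this returns False, while a shared tag such as `...:3937:TGGGGGAA_TCTGACCT` on both mates returns True.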

@sfchen
Member

sfchen commented Apr 17, 2018

Could you please confirm you used the latest fastp?

It seems your result was obtained with an old version of fastp. The fastp on bioconda is still old, so you have to download it from http://opengene.org/fastp/fastp, or use git to clone the latest code and build it.

With the --umi_loc=per_read option, the latest version of fastp will output reads like:

@NS500713:64:HFKJJBGXY:1:11101:1675:1101:TAGGAGGC_TAGGGCAA 1:N:0:TATAGCCT+GACCCCCA
TTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAATTTTTAAACCCAGGCAGCTTCCTGGCAGTGACATTTGGAGCATCAAAGTGGTAAATAAAATTTCATTTACATTAATAT
+
EEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAAA/E/A/6E/6//6<EAAEEE/EEEA/EA/EEEEEE/<<EEEE//A/EE<AEEEEE/</AA</E<AAAE/E<E/

@carlandt
Author

carlandt commented Apr 17, 2018

The work up above should have been 0.12.6

Sure, downloaded again:

$ ./fastp --version
fastp: an ultra-fast all-in-one FASTQ preprocessor
version 0.13.1

Yup, output looks mostly as you described. Thank you!

Checked a handful of reads, here is the output. Of the first four read pairs, two pairs gave output. Were the others just low quality? This sample did lose a great many of the reads due to that...

# the first four forward reads:
$ zcat 782404_V1_L1.N701_505_1.fastq.gz | head -n 16
@M01378:492:000000000-BLK46:1:1101:9647:3917 1:N:0: TAAGGCGA-TTAAGGAG
CACGCGAGTGGAGCTGAGCAGCCTGAGATCTGACCGTCTCCTCAGGTACGCACCCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAGGTATGTTGGTTTATGTTTTTTTTGGTTGGATGTGGTATGGTTTTTTTTTGTTTGTTTTTTGTTAT
+
ABBBBBBBBBBFGGGGFGFGFFHHHHHHGFHFHFFGGGHGGHHFHFFHD1AEEEEHHFHGH5FFGFH5GHHHHHGFFHHGCEEEFHGC?FDFFGFBBG//<E/F?CF/GFBDDG@GF2G2@DHF0F2FFC#######################################################################
@M01378:492:000000000-BLK46:1:1101:11212:3937 1:N:0: TAAGGCGA-TTAAGGAG
TGGGGGAATGGAGCTGAGCAGCCTGAGATCTGGGCGTTCACCCAGGCTTCCACGTTCCCCTCGCTTGGGTCACCGTCTCCTCAGGTAAGAGGTCAGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACTCAGGTGCGATGTGCAGCTGTCTTGTG
+
BBBBBBBBFFFFGGGGDACE4FGEFFHGHGDHCBGGGGFHGFH1G1BAFGHHHDCGFFGFHG?EFHGGEEFFFGGHGHHHHHHBGGHBB4FEFGGFBGHFHHHHHHFGEFHFGFHGHG/@/<BFFFC?/?DHHH1GFCFGCEHHFCEGFFDFDFAEFHBGHFF.00CC#################################
@M01378:492:000000000-BLK46:1:1101:20250:3942 1:N:0: TAAGGCGA-TTAAGGAG
TCCCCAAATGGAGCTGAGCAGCCTGAGATCTGTCACCGTCTCCTCAGGTAAGCCTACCGACTGTCTCTTATACACATCTCCTATCCCACGATACTAAGGCGAATCTCGTATTCCTTCTTCTTCTTTAAAAAAAAAATTTTTGGTTTATTTTATTTTGTTTTGTTGTTGTTTTTTATTTTTTTGTTTTTGTTTTTTTGTTGT
+
@AAAAFAFFFFFGGGGGF1F1C0A0BB1FHBHH321B0FCGBF1BF10G121A11B0B//AEA1AG2FEFHFB2F1A@F1D1121B1BEE/B>>G2210>///>/2B0?/01222B1FGBF122B12B1B#######################################################################
@M01378:492:000000000-BLK46:1:1101:11675:3965 1:N:0: TAAGGCGA-CTAAGGAG
CAGGGGGGTGGAGCTGAGCAGCCTGAAGTGCAACCGGGAAGGGAAGGAGTGGGAGACGGTACTCACCAGCCGGACCCTCACTGCTGCGGGCAGCTGTGACGTGGTGTGTGTCGCCTGTGAAAAAAGGATGCTGTCAGTGTTCTCCACCTGTGGTCACCGTCTCCTCAGGTAAGGCGCTTCTCTGTCTCTTATACACATCTC
+
@AAAADDD@DDAGGF1@GFGGGF0BFHB>FGF1C0E/<B//C//0/<0/?FFAGFCFGAAGCFHE1FD0FF--<-<CGHHGCGHEHEG-?@@.9FFC/0;CEBAAAABAFFF--;B?9B/;/9-9B?BFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFB/;FFFF?@@FFFFFFFFFFFBFFFFFFFFFF9


# the first four reverse reads
$ zcat 782404_V1_L1.N701_505_2.fastq.gz | head -n 16
@M01378:492:000000000-BLK46:1:1101:9647:3917 2:N:0: TAAGGCGA-TTAAGGAG
TGAGGGTGCTTACCTGCGGCGACGGTCAGATCTCTTTCTTCTCATCTCCACTCGCTTGCTGTCTCTTATACACATCTGACGCTGCCGACTACTCCTTTCTTGTTTTTTTTTTTTTTCTTCTTTTTTTTTTTTTTCATCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTT
+
11111111AFFFGGGE1A0000AAAE?/01D1G22212A12DA22ADA11ADE/////>D1B1@@FG1BB22@1BBF12@/>?######################################################################################################################
@M01378:492:000000000-BLK46:1:1101:11212:3937 2:N:0: TAAGGCGA-TTAAGGAG
TCTGACCTCTTACCTGCGGAGACGGTGACCCAAGCTAGTTGAACTTGGAAGCCTGTTTGAACTCCCAGATCTCAGGCTGCTCAGCTCCATTCCCCCACTGTCTCTTATACACATCTGTCGCTGCCGACGTCTCCTTACGTGTTGTTCTCTGTTTTCTTCTTTTCTTTTTTATCTCTTTTTTTGTTGTTCTTTTGTTTTTTC
+
11>1>11BFFF@FGGG1A10AABABECE1A000//11112122A1A11110/0B0111011A1AAA01/BAFE220>>0>FE101BB@1@F2@1>//>01BBGFHG1D22211B>E122//?//0//////0011121/@F@0/@/11111?1<01111111?######################################
@M01378:492:000000000-BLK46:1:1101:20250:3942 2:N:0: TAAGGCGA-TTAAGGAG
TCTTTCTTCTTACCTGAGGAGACGGTGACATTTCTCCTTCTTCTCTGCTCCTTTTTTTTTCTTTCTCTTTTACTCTTCTGTCGCTGCCGTCTTCTCCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTT
+
111>13BBBDFFGGGG1AA11AB00A0A133332A21111A12212111111121B////0A1B2A2@D1B211@##############################################################################################################################
@M01378:492:000000000-BLK46:1:1101:11675:3965 2:N:0: TAAGGCGA-CTAAGGAG
TGAAGCGCCTTACCTGAGGAGACGGTGACCACAGGTTGTTCACACTGACAGCCTCCTTTTTTCACAGTCTACACACACCACGTCACAGCTGCCCGCAGCAGTGAGGTTCCGGCTGTTGTGTACCGTCTCCCACTCCTTCCCTTCCCGTTTGCACTTCAGTCTGCTCAGCTCCACCCCCCTGCTGTCTCTTATACACATCTG
+
11111111>AADGGGF1B01A0BA0AA01B1A00AB10//1A12A011111//AB/A1FDFGB2D211@121101?/?>///B//>B1/B110//////00>11100B1<E//>/0B10/B2@2<</@0100<?0<GDF0FGHG00..0<A<11>B1<11=F0=D<0//=</C/..-:;@ACB0;BFFFFGB0B00CFFF0


# looking for the read name of the first read pair in the output fastq files, no luck
$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:9647:3917"
<nothing>

# the second pair works
$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:11212:3937"
@M01378:492:000000000-BLK46:1:1101:11212:3937:TGGGGGAA_TCTGACCT 1:N:0: TAAGGCGA-TTAAGGAG
@M01378:492:000000000-BLK46:1:1101:11212:3937:TGGGGGAA_TCTGACCT 2:N:0: TAAGGCGA-TTAAGGAG

# the third pair doesn't show
$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:20250:3942"
<nothing>

# the fourth pair does
$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:11675:3965"
@M01378:492:000000000-BLK46:1:1101:11675:3965:CAGGGGGG_TGAAGCGC 1:N:0: TAAGGCGA-CTAAGGAG
@M01378:492:000000000-BLK46:1:1101:11675:3965:CAGGGGGG_TGAAGCGC 2:N:0: TAAGGCGA-CTAAGGAG
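
Rather than grepping one read name at a time, the dropped reads can be listed in one pass. The following is a rough Python sketch, not a fastp feature: the helper names are hypothetical, it assumes plain four-line FASTQ records, and it assumes every output read name ends in a `:UMI` field that has to be stripped before comparing. Records are located by line position rather than by a leading `@`, because quality lines can also start with `@` (several in this thread do).

```python
import gzip

def read_names(path):
    """Return the set of read names (text before the first space) in a FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    names = set()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # first line of each four-line record is the header
                names.add(line.rstrip().split(" ", 1)[0])
    return names

def dropped_reads(input_path, output_path):
    """Names found in the input but not in the UMI-tagged output."""
    # Assumption: each output name is "<original name>:<UMI>", so drop the last field.
    kept = {name.rsplit(":", 1)[0] for name in read_names(output_path)}
    return read_names(input_path) - kept
```

For example, `dropped_reads("782404_V1_L1.N701_505_1.fastq.gz", "782404_V1_L1.N701_505_umi_1.fastq")` would return the forward read names that did not make it into the output.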

@carlandt
Author

Ah, yes, it was quality: if I turn off quality filtering, they all come back. Looks like this issue is closed, thank you again!

$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:20250:3942"
@M01378:492:000000000-BLK46:1:1101:20250:3942:TCCCCAAA_TCTTTCTT 1:N:0: TAAGGCGA-TTAAGGAG
@M01378:492:000000000-BLK46:1:1101:20250:3942:TCCCCAAA_TCTTTCTT 2:N:0: TAAGGCGA-TTAAGGAG

$ cat *.fastq | grep "@M01378:492:000000000-BLK46:1:1101:9647:3917"
@M01378:492:000000000-BLK46:1:1101:9647:3917:CACGCGAG_TGAGGGTG 1:N:0: TAAGGCGA-TTAAGGAG
@M01378:492:000000000-BLK46:1:1101:9647:3917:CACGCGAG_TGAGGGTG 2:N:0: TAAGGCGA-TTAAGGAG
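
To get a feel for why a pair might fail quality filtering, one can estimate the share of low-quality bases in a read. The Python sketch below assumes Phred+33 encoding, and the threshold of 15 is only an illustrative value; the function name is hypothetical, and the actual cutoffs are whatever `fastp --help` reports for your version.

```python
def low_quality_fraction(quality, threshold=15, offset=33):
    """Fraction of bases whose Phred score is below `threshold`.

    `quality` is the FASTQ quality string; Phred+33 encoding is assumed.
    """
    if not quality:
        return 0.0
    low = sum(1 for ch in quality if ord(ch) - offset < threshold)
    return low / len(quality)
```

Many of the quality strings pasted above end in long runs of `#`, which is Phred 2 under this encoding, so such reads score a high fraction here.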
