Stitch together overlapping reads? #31

tseemann · 2018-03-03T08:00:23Z

When the DNA library fragments are overly short, it is possible that most read pairs overlap.

Can fastp stitch these reads together (instead of just correcting errors) ?

So input R1, R2 would produce output R1, R2 and SR (stitched, longer single end reads)


sfchen · 2018-03-03T13:09:17Z

Yes, I can implement this feature.

tseemann · 2018-03-05T03:01:32Z

I think it will be used a lot, because:

FLASH is on SourceForge and has not been downloadable lately due to problems at SourceForge
PEAR is no longer fully open source; you now need to accept a click-through licence

So there is a gap in the open source market for an overlapper tool.

It would be amazing to have a tool that does adapters, quality AND stitching!

ndaniel · 2018-03-05T12:06:42Z

@tseemann

As far as I understand, there are already open source tools that stitch overlapping paired-end reads, for example BBMerge from BBMAP.

It would be interesting to see how fastp compares against BBMerge.

tseemann · 2018-03-19T05:11:14Z

fastp is "a tool designed to provide fast all-in-one preprocessing for FastQ files".

You'll need stitching support for that to be really true? :-)

sfchen · 2018-03-19T08:39:47Z

Haha, I've put this feature in fastp's roadmap.

sjackman · 2018-08-13T18:25:08Z

I'm also interested in this feature!

oschwengers · 2018-10-23T10:36:27Z

me too!

sfchen · 2018-10-23T14:37:00Z

Okay, I will implement it soon, probably in 1 week.

sjackman · 2018-10-23T16:01:32Z

There's a lot of literature and existing tools for stitching together reads. It'd be nice to implement whichever is considered "the best", as in, the most accurate. Is there a review paper? Does the peanut gallery have any comments on which is perceived to be the best tool by the community?

tseemann · 2018-10-24T00:10:42Z

My old blog post is a start, but there are probably newer tools now:
http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html

Please note that PEAR is no longer open source and should not be considered.
Heng Li also has one buried in fermi-kit somewhere too I think!

sjackman · 2018-10-24T01:20:20Z

ABySS has abyss-mergepairs too. I have no idea how it compares to other tools.

tseemann · 2018-10-27T01:29:42Z

      --chastity          discard unchaste reads [default]
      --no-chastity       do not discard unchaste reads

Our old nesoni toolkit had chastity and fidelity options too :)

Regards, the 🥜 gallery.

sjackman · 2018-10-27T16:39:24Z

Random bit of trivia: ABySS discards unchaste reads when building the de Bruijn graph, but uses unchaste reads when mapping back to the assembly (if they map, it may as well use them).

brucemoran · 2019-03-06T11:07:35Z

Was this implemented? Aligners can penalize unpaired reads, so is it possible that the overlap can be 'clipped' from the read with lower base quality (or randomly if tied)?

ndaniel · 2019-03-06T12:05:08Z

@tseemann
BBMerge (which is part of BBMAP) stitches paired reads really well!
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmerge-guide/

Now even the STAR aligner is stitching together the overlapping reads before mapping them in order to get better alignments.

sfchen · 2019-03-09T00:04:53Z

I promise to implement this in 3 days


sfchen · 2019-03-16T10:39:16Z

Hi guys, this function is now implemented. Please give it a try and update this thread with your results.

merge paired-end reads

For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode, the output will be a single file.

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15 means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.
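For anyone post-processing the merged output, the tag described above can be pulled out of a read name with a small parser. This is only a sketch: it assumes the tag is the last whitespace-separated field of the header, and the helper name parse_merged_tag is made up for illustration.

```python
import re
from typing import Optional, Tuple

# Assumption: the tag is the final field of the header and has the form
# merged_<bases from read1>_<bases from read2>.
_MERGED_TAG = re.compile(r"merged_(\d+)_(\d+)$")


def parse_merged_tag(header: str) -> Optional[Tuple[int, int]]:
    """Return (bases_from_read1, bases_from_read2), or None if there is no tag."""
    match = _MERGED_TAG.search(header.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


header = "@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15"
from_read1, from_read2 = parse_merged_tag(header)
print(from_read1, from_read2, from_read1 + from_read2)  # 150 15 165
```

The sum of the two numbers is the length of the merged read, which makes the tag handy for a quick insert-size histogram.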

Pairs of reads that cannot be merged successfully are both included in the output by default, but you can specify the --discard_unmerged option to discard the unmerged reads.

Like the base correction feature, this function is based on overlap detection, which has adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).
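To make the two parameters concrete, here is a rough sketch of overlap-based merging. It is not fastp's actual implementation: it simply slides the reverse complement of read2 along read1, accepts the first offset where the overlap is at least overlap_len_require bases long with at most overlap_diff_limit mismatches, and keeps read1's bases in the overlap. The function names and the simulated reads are illustrative assumptions.

```python
import random
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_overlap(read1: str, read2: str,
                 overlap_len_require: int = 30,
                 overlap_diff_limit: int = 5) -> Optional[int]:
    """Return the offset in read1 where rc(read2) starts, or None if no overlap."""
    rc2 = reverse_complement(read2)
    for offset in range(0, len(read1) - overlap_len_require + 1):
        overlap_len = min(len(read1) - offset, len(rc2))
        if overlap_len < overlap_len_require:
            break  # rc(read2) itself is too short to ever satisfy the minimum
        window1 = read1[offset:offset + overlap_len]
        diffs = sum(a != b for a, b in zip(window1, rc2[:overlap_len]))
        if diffs <= overlap_diff_limit:
            return offset
    return None


def merge_pair(read1: str, read2: str, **params) -> Optional[str]:
    """Merge a pair, preferring read1 bases inside the overlap; None if unmergeable."""
    offset = find_overlap(read1, read2, **params)
    if offset is None:
        return None
    rc2 = reverse_complement(read2)
    overlap_len = min(len(read1) - offset, len(rc2))
    return read1 + rc2[overlap_len:]


# Simulated 100 bp fragment sequenced as 2x70 bp, so the reads overlap by 40 bp.
rng = random.Random(42)
template = "".join(rng.choice("ACGT") for _ in range(100))
read1 = template[:70]
read2 = reverse_complement(template[30:])

merged = merge_pair(read1, read2)
print(merged == template)
print(f"merged_{len(read1)}_{len(merged) - len(read1)}")
```

A real merger would also weigh base qualities and handle adapter read-through; this sketch only shows how the length and mismatch thresholds gate a merge.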

sjackman · 2019-03-16T15:50:45Z

Thank you, Shifu! A couple of questions.

Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.

But you can specify the --chastity option to discard the unmerged reads.

Chastity refers to the Illumina chastity filter, which is a different thing: the :N: or :Y: in the FASTQ header comment. I'd suggest naming this option something like --only-merged.

sfchen · 2019-03-17T01:28:46Z

@sjackman thanks for your reply.

Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.

Yes, it handles that case.
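For readers unfamiliar with this case, the geometry looks like the sketch below. It is an illustration only, not fastp's code: when the template is shorter than the read length, the reverse complement of read2 starts with adapter sequence that hangs off the left end of read1, so the overlap has to be searched with a leftward shift. The adapter sequences here are random placeholders and merge_read_through is a hypothetical name.

```python
import random
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def merge_read_through(read1: str, read2: str,
                       min_overlap: int = 30,
                       max_diffs: int = 5) -> Optional[str]:
    """Find the template when both reads run through it into adapter."""
    rc2 = reverse_complement(read2)
    # shift = number of leading rc(read2) bases (adapter) left of read1's start
    for shift in range(1, len(rc2) - min_overlap + 1):
        overlap_len = min(len(read1), len(rc2) - shift)
        if overlap_len < min_overlap:
            break  # read1 is too short to satisfy the minimum overlap
        diffs = sum(a != b for a, b in
                    zip(read1[:overlap_len], rc2[shift:shift + overlap_len]))
        if diffs <= max_diffs:
            # Everything comes from read1; trailing adapter bases are dropped.
            return read1[:overlap_len]
    return None


# 2x150 bp sequencing of a 120 bp template: 30 bp of adapter on each read.
rng = random.Random(1)
template = "".join(rng.choice("ACGT") for _ in range(120))
adapter1 = "".join(rng.choice("ACGT") for _ in range(30))  # placeholder adapter
adapter2 = "".join(rng.choice("ACGT") for _ in range(30))  # placeholder adapter
read1 = template + adapter1
read2 = reverse_complement(template) + adapter2

merged = merge_read_through(read1, read2)
print(merged == template)
print(f"merged_{len(merged)}_0")
```

In this layout the whole merged read can be taken from read1, which is where a tag like merged_120_0 would come from.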

As you suggested, I renamed --chastity to --discard_unmerged.

Please try with the latest code.

sjackman · 2019-03-17T04:34:47Z

Thanks, Shifu! I'll give it a spin.

sfchen · 2019-04-08T03:17:51Z

Hi guys, this feature has been revised and improved a lot in fastp v0.19.9 (to be released soon). See the update here:

merge paired-end reads

For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode:

--merged_out should be given to specify the file to store merged reads; otherwise you should enable --stdout to stream the merged reads to STDOUT. The merged reads are also filtered.
--out1 and --out2 receive the pairs that cannot be merged successfully but in which both reads pass all the filters.
--unpaired1 receives the reads that cannot be merged where read1 passes the filters but read2 doesn't.
--unpaired2 receives the reads that cannot be merged where read2 passes the filters but read1 doesn't.
--include_unmerged can be enabled to redirect the reads of --out1, --out2, --unpaired1 and --unpaired2 to --merged_out, so you get a single output file. This option is disabled by default.

--failed_out can still be given to store the reads (either merged or unmerged) that fail to pass the filters.
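As a rough illustration of how these options fit together, the helper below assembles a fastp command line from them. It is a sketch under assumptions: --in1/--in2 are fastp's paired-end input options and are not mentioned in this thread, all file names are placeholders, and build_fastp_merge_command is a hypothetical wrapper. The command is only built and printed, not executed.

```python
import shlex
from typing import List, Optional


def build_fastp_merge_command(in1: str, in2: str, merged_out: str,
                              out1: Optional[str] = None,
                              out2: Optional[str] = None,
                              unpaired1: Optional[str] = None,
                              unpaired2: Optional[str] = None,
                              failed_out: Optional[str] = None,
                              include_unmerged: bool = False) -> List[str]:
    """Assemble a fastp argument list for merge mode (does not run fastp)."""
    cmd = ["fastp", "--in1", in1, "--in2", in2,
           "--merge", "--merged_out", merged_out]
    optional_outputs = {
        "--out1": out1,            # unmerged pairs, both reads pass filters
        "--out2": out2,
        "--unpaired1": unpaired1,  # unmerged, only read1 passes filters
        "--unpaired2": unpaired2,  # unmerged, only read2 passes filters
        "--failed_out": failed_out,
    }
    for flag, path in optional_outputs.items():
        if path is not None:
            cmd += [flag, path]
    if include_unmerged:
        # Redirects the unmerged reads into --merged_out (off by default).
        cmd.append("--include_unmerged")
    return cmd


cmd = build_fastp_merge_command(
    "R1.fq.gz", "R2.fq.gz", "merged.fq.gz",
    out1="unmerged_R1.fq.gz", out2="unmerged_R2.fq.gz",
)
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it, assuming fastp is installed and on the PATH.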

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15 means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.

This function is also based on overlap detection, which has adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).

sfchen added the enhancement label Mar 9, 2019

sfchen added a commit that referenced this issue Mar 13, 2019

support merging PE reads by add option -m/--merge

e3fa1da


sfchen added a commit that referenced this issue Mar 14, 2019

report merging result

0c7e526


sfchen added a commit that referenced this issue Mar 16, 2019

Introduce the merge function in README

db0fe91


sfchen mentioned this issue Mar 16, 2019

Option to merge overlapping PE reads? #139

Closed

jrostudent mentioned this issue Oct 9, 2023

Hi guys, this feature is revised and improved a lot in fastp v0.19.9 (will be released soon), see the update here: #526

Open

mmokrejs mentioned this issue Oct 30, 2024

merging read pair mates using a reference sequence #582

Open
