Stitch together overlapping reads? #31

Open
tseemann opened this issue Mar 3, 2018 · 21 comments


@tseemann

tseemann commented Mar 3, 2018

When the DNA library is overly short, it is possible that most reads overlap.

Can fastp stitch these reads together (instead of just correcting errors)?

So input R1, R2 would produce output R1, R2 and SR (stitched, longer single end reads)

@sfchen
Member

sfchen commented Mar 3, 2018

Yes, I can implement this feature.

@tseemann
Author

tseemann commented Mar 5, 2018

I think it will be used a lot, because

  1. FLASH is on SourceForge and can't be downloaded lately due to problems at SourceForge
  2. PEAR is no longer fully open source; you need a click-through licence now

So there is a gap in the open source market for an overlapper tool.

It would be amazing to have a tool that does adapters, quality AND stitching!

@ndaniel

ndaniel commented Mar 5, 2018

@tseemann

As far as I understand, there are already open source tools out there that stitch overlapping paired reads, for example BBMerge from BBMap.

It would be interesting to see how fastp compares against BBMerge.

@tseemann
Author

fastp is "a tool designed to provide fast all-in-one preprocessing for FastQ files".

You'll need stitching support for that to be really true? :-)

@sfchen
Member

sfchen commented Mar 19, 2018

Haha, I've put this feature in fastp's roadmap.

@sjackman

I'm also interested in this feature!

@oschwengers
Contributor

me too!

@sfchen
Member

sfchen commented Oct 23, 2018

Okay, I will implement it soon, probably in 1 week.

@sjackman

There's a lot of literature and existing tools for stitching together reads. It'd be nice to implement whichever is considered "the best", as in, the most accurate. Is there a review paper? Does the peanut gallery have any comments on which is perceived to be the best tool by the community?

@tseemann
Author

tseemann commented Oct 24, 2018

My old blog post is a start, but there are probably newer tools now:
http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html

Please note that PEAR is no longer open source and should not be considered.
Heng Li also has one buried in fermi-kit somewhere too I think!

@sjackman

ABySS has abyss-mergepairs too. I have no idea how it compares to other tools.

@tseemann
Author

tseemann commented Oct 27, 2018

      --chastity          discard unchaste reads [default]
      --no-chastity       do not discard unchaste reads

Our old nesoni toolkit had chastity and fidelity options too :)

Regards, the 🥜 gallery.

@sjackman

Random bit of trivia: ABySS discards unchaste reads when building the de Bruijn graph, but uses unchaste reads when mapping back to the assembly (if they map, may as well use them).

@brucemoran

Was this implemented? Aligners can penalize unpaired reads, so is it possible that the overlap can be 'clipped' from the read with lower base quality (or randomly if tied)?

@ndaniel

ndaniel commented Mar 6, 2019

@tseemann
BBMerge (which is part of BBMap) stitches paired reads really well!
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmerge-guide/

Now even the STAR aligner is stitching together the overlapping reads before mapping them in order to get better alignments.

@sfchen
Member

sfchen commented Mar 9, 2019

I promise to implement this in 3 days

@sfchen
Member

sfchen commented Mar 16, 2019

Hi guys, this function is implemented. Please give it a try and update this thread with your results.

merge paired-end reads

For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode, the output will be a single file.

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15
means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.
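
For downstream scripts, the tag can be pulled back out of the read name. The following is a minimal sketch, not part of fastp; it assumes the tag is the last whitespace-separated token of the header, as in the example above, and the helper name parse_merged_tag is hypothetical.

```python
import re

# Tag format taken from the example above: merged_<bases from read1>_<bases from read2>.
# Assumption: the tag is the final token of the read name.
MERGED_TAG = re.compile(r"\bmerged_(\d+)_(\d+)$")


def parse_merged_tag(read_name):
    """Return (bases_from_read1, bases_from_read2), or None if there is no tag."""
    match = MERGED_TAG.search(read_name.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


name = "@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15"
from_read1, from_read2 = parse_merged_tag(name)
# The merged read length is the sum of the two contributions.
print(from_read1, from_read2, from_read1 + from_read2)
```

Headers without the tag (unmerged reads) return None, so the same function can be used to separate merged from unmerged records in a mixed output file.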

Pairs of reads that cannot be merged successfully will both be included in the output by default. But you can specify the --discard_unmerged option to discard the unmerged reads.

Like the base correction feature, this function is based on overlap detection, which has the adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).

@sjackman

Thank you, Shifu! A couple of questions.

Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.

> But you can specify the --chastity option to discard the unmerged reads.

Chastity refers to the Illumina chastity filter, which is a different thing: the :N: or :Y: in the FASTQ header comment. I'd suggest naming this option something like --only-merged.

@sfchen
Member

sfchen commented Mar 17, 2019

@sjackman thanks for your reply.

> Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.

Yes, it handles that case.

As you suggested, I renamed --chastity to --discard_unmerged.

Please try with the latest code.

@sjackman

Thanks, Shifu! I'll give it a spin.

@sfchen
Member

sfchen commented Apr 8, 2019

Hi guys, this feature has been revised and improved a lot in fastp v0.19.9 (will be released soon). See the update here:

merge paired-end reads

For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode:

  • --merged_out should be given to specify the file to store merged reads; otherwise you should enable --stdout to stream the merged reads to STDOUT. The merged reads are also filtered.
  • --out1 and --out2 will be the reads that cannot be merged successfully, but both pass all the filters.
  • --unpaired1 will be the reads that cannot be merged, where read1 passes the filters but read2 doesn't.
  • --unpaired2 will be the reads that cannot be merged, where read2 passes the filters but read1 doesn't.
  • --include_unmerged can be enabled to redirect the reads of --out1, --out2, --unpaired1 and --unpaired2 to --merged_out, so you will get a single output file. This option is disabled by default.

--failed_out can still be given to store the reads (either merged or unmerged) that failed to pass the filters.
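
To show how these output options fit together, here is a small sketch that only assembles an argument list and does not run fastp. The output flags are the ones listed above; --in1 and --in2 are fastp's input options and are not mentioned in this thread, and all file names and the helper name are made up for illustration.

```python
import shlex


def build_fastp_merge_command(in1, in2, merged_out, out1=None, out2=None,
                              unpaired1=None, unpaired2=None,
                              failed_out=None, include_unmerged=False):
    """Assemble a fastp argument list for merge mode (nothing is executed)."""
    cmd = ["fastp", "--in1", in1, "--in2", in2,
           "--merge", "--merged_out", merged_out]
    optional_outputs = [
        ("--out1", out1), ("--out2", out2),
        ("--unpaired1", unpaired1), ("--unpaired2", unpaired2),
        ("--failed_out", failed_out),
    ]
    # Only add the outputs the caller actually asked for.
    for flag, path in optional_outputs:
        if path is not None:
            cmd += [flag, path]
    if include_unmerged:
        cmd.append("--include_unmerged")
    return cmd


# Hypothetical file names.
cmd = build_fastp_merge_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.merged.fastq.gz",
    out1="sample.unmerged_R1.fastq.gz", out2="sample.unmerged_R2.fastq.gz",
    failed_out="sample.failed.fastq.gz",
)
print(shlex.join(cmd))
```

With a fastp build that includes merge mode, the list could be passed to subprocess.run(cmd, check=True); setting include_unmerged=True instead of giving out1/out2 would collect everything in the merged file.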

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15
means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.

This function is also based on overlap detection, which has the adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).
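
To make the two parameters concrete, here is a simplified sketch of overlap-based merging. It is not fastp's actual implementation: it just reverse-complements read2, slides it along read1, and accepts the first placement whose overlap is at least overlap_len_require bases long with at most overlap_diff_limit mismatches. Bases in the overlap are taken from read1, as described above. The sequences, adapters and function names are made up, and the example uses smaller thresholds than the defaults so the toy reads can stay short. The second case mirrors the short-fragment question raised earlier in the thread (template shorter than the read, so both reads run into adapter).

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_offset(read1, rc2, overlap_len_require=30, overlap_diff_limit=5):
    """Offset of rc2 relative to read1 (negative: rc2 starts before read1)."""
    len1, len2 = len(read1), len(rc2)
    # Try rc2 starting inside read1 first, then starting before read1.
    offsets = list(range(0, len1 - overlap_len_require + 1))
    offsets += list(range(-1, -(len2 - overlap_len_require) - 1, -1))
    for offset in offsets:
        start1, start2 = max(offset, 0), max(-offset, 0)
        overlap = min(len1 - start1, len2 - start2)
        if overlap < overlap_len_require:
            continue
        diffs = sum(a != b for a, b in zip(read1[start1:start1 + overlap],
                                           rc2[start2:start2 + overlap]))
        if diffs <= overlap_diff_limit:
            return offset
    return None


def merge_pair(read1, read2, overlap_len_require=30, overlap_diff_limit=5):
    """Return (merged, bases_from_read1, bases_from_read2), or None."""
    rc2 = reverse_complement(read2)
    offset = find_offset(read1, rc2, overlap_len_require, overlap_diff_limit)
    if offset is None:
        return None
    if offset >= 0:
        # Keep all of read1, then append the part of rc2 that extends past it.
        overlap = min(len(read1) - offset, len(rc2))
        tail = rc2[overlap:]
        return read1 + tail, len(read1), len(tail)
    # Short fragment: both reads run into adapter, keep only the overlap.
    overlap = min(len(read1), len(rc2) + offset)
    return read1[:overlap], overlap, 0


template = "ACGTTGCAAGGCTTACCGATAGGCTCAATGCGTACCTGAA"  # 40 bp toy fragment

# Case 1: 30 bp reads from both ends of the 40 bp fragment (20 bp overlap).
read1 = template[:30]
read2 = reverse_complement(template[10:])
merged, n1, n2 = merge_pair(read1, read2, overlap_len_require=15, overlap_diff_limit=0)
print(f"merged_{n1}_{n2}", merged == template)

# Case 2: 20 bp fragment, 30 bp reads, so each read ends in 10 bp of (made-up) adapter.
short_read1 = template[:20] + "TTTTTTTTTT"
short_read2 = reverse_complement(template[:20]) + "GGGGGGGGGG"
short_merged, s1, s2 = merge_pair(short_read1, short_read2, 15, 0)
print(f"merged_{s1}_{s2}", short_merged == template[:20])
```

Raising overlap_len_require makes false merges less likely but misses pairs with short overlaps; raising overlap_diff_limit tolerates more sequencing errors in the overlap at the cost of more spurious matches.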
