Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken pipe with samtools sort and multiple selects #92

Closed
unode opened this issue Oct 29, 2018 · 3 comments

Comments

@unode
Copy link
Member

commented Oct 29, 2018

Current master dc769d7 fails with:

[Mon 29-10-2018 02:23:33]: Script OK. Starting interpretation...
[Mon 29-10-2018 02:23:33] Line 6: Running garbage collection.
[Mon 29-10-2018 02:23:33] Line 6: Interpreting [interpretIO]: input = samfile("input.bam")
[Mon 29-10-2018 02:23:33] Line 6: Interpreting [assignment]: samfile("input.bam")
[Mon 29-10-2018 02:23:33] Line 6: Interpreting [samfile]: NGOString "input.bam"
[Mon 29-10-2018 02:23:33] Line 8: Running garbage collection.
[Mon 29-10-2018 02:23:33] Line 8: Interpreting [interpretIO]: input = select(Lookup 'input' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 02:23:33] Line 8: Interpreting [assignment]: select(Lookup 'input' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 02:23:33] Line 9: Running garbage collection.
[Mon 29-10-2018 02:23:33] Line 9: Interpreting [interpretIO]: input = select(Lookup 'input' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 02:23:33] Line 9: Interpreting [assignment]: select(Lookup 'input' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 02:23:33] Line 11: Running garbage collection.
[Mon 29-10-2018 02:23:33] Line 11: Interpreting [interpretIO]: input = samtools_sort(Lookup 'input' as NGLMappedReadSet; __output_bam=True; by={name})
[Mon 29-10-2018 02:23:33] Line 11: Interpreting [assignment]: samtools_sort(Lookup 'input' as NGLMappedReadSet; __output_bam=True; by={name})
[Mon 29-10-2018 02:23:33] Line 11: Interpreting [executing module function: 'samtools_sort']: NGOMappedReadSet {nglgroupName = "input.bam", nglSamFile = <STREAM>, nglReference = Nothing}
[Mon 29-10-2018 02:23:33] Line 11: Created & opened temporary file /tmp/sorted_selected_selected_input.23787-0.bam
[Mon 29-10-2018 02:23:33] Line 11: Calling binary /home/u/system/apps/ngless/bin/../share/ngless/bin/ngless-0.9.1-samtools with args: sort -n -@ 1 -O bam -T /tmp/samtools_sort_temp.tmp23787/samruntmp
[Mon 29-10-2018 02:23:33] Line 11: Starting samtools view of input.bam
Exiting after internal error. If you can reproduce this issue, please run your script with the --trace flag and report a bug at http://github.com/ngless-toolkit/ngless/issues
fd:13: hPutBuf: resource vanished (Broken pipe)

when using:

ngless "0.8"
import "parallel" version "0.6"
import "mocat" version "0.0"
import "samtools" version "0.0"

input = samfile("input.bam")

input = select(input, keep_if=[{mapped}])
input = select(input, keep_if=[{mapped}])

input = samtools_sort(input, by={name})
write(input, ofile='namesorted.bam')

and a sufficiently large input.bam file.

I tried reproducing this with the .bam files we have in the repository but none was big enough to trigger the broken pipe.

I managed to reproduce it locally with a 13MB bam file generated with:

ngless "0.8"
import "parallel" version "0.6"
import "mocat" version "0.0"
import "samtools" version "0.0"

samples = readlines(ARGV[3])
sample = lock1([ARGV[1]])
input = load_mocat_sample(ARGV[2] + '/' + sample)

mapped = map(input, fafile=ARGV[4], mode_all=True)
mapped = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=97, action={unmatch})

write(mapped, ofile='outputs/' + sample + '.test.bam')
@unode

This comment has been minimized.

Copy link
Member Author

commented Oct 29, 2018

While trying to workaround the problem above I ran into a different issue but the error above may be related to the same problem.

Using:

ngless "0.8"
import "parallel" version "0.6"
import "mocat" version "0.0"
import "samtools" version "0.0"

samples = readlines(ARGV[3])
sample = lock1([ARGV[1]])
input = load_mocat_sample(ARGV[2] + '/' + sample)

mapped = map(input, fafile=ARGV[4], mode_all=True)
mapped = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=97, action={unmatch})

mapped_non_unique = select(mapped, keep_if=[{mapped}])

# workaround
write(mapped_non_unique, ofile='outputs/' + sample + '.test.bam')
mapped_non_unique = samfile('outputs/' + sample + '.test.bam')
# workaround

mapped_non_unique = samtools_sort(mapped_non_unique, by={name})
write(mapped_non_unique, ofile='outputs/' + sample + '.namesorted.bam')

collect(qcstats({fastq}), ofile='outputs/fqstats.txt', current=sample, allneeded=samples)
collect(qcstats({mapping}), ofile='outputs/mapstats.txt', current=sample, allneeded=samples)

I got:

...
[Mon 29-10-2018 03:03:45] Line 10: Finished mapping to db.fna
[Mon 29-10-2018 03:03:45] Line 10: Finished mapping to db.fna
[Mon 29-10-2018 03:03:45] Line 10: Total reads: 95047
[Mon 29-10-2018 03:03:45] Line 10: Total reads aligned: 54160 [56.98%]
[Mon 29-10-2018 03:03:45] Line 10: Total reads Unique map: 38849 [40.87%]
[Mon 29-10-2018 03:03:45] Line 10: Total reads Non-Unique map: 15311 [16.11%]
[Mon 29-10-2018 03:03:45] Line 11: Running garbage collection.
[Mon 29-10-2018 03:03:45] Line 11: Interpreting [interpretIO]: mapped = select(Lookup 'mapped' as NGLMappedReadSet)using {Block {blockVariable = [Variable "mr"], blockBody = Sequence [mr = (Lookup 'mr' as NGLMappedRead).MethodName {unwrapMethodName = "filter"}( Nothing; min_match_size=45; min_identity_pc=97; action={unmatch} )]}}
[Mon 29-10-2018 03:03:45] Line 11: Interpreting [assignment]: select(Lookup 'mapped' as NGLMappedReadSet)using {Block {blockVariable = [Variable "mr"], blockBody = Sequence [mr = (Lookup 'mr' as NGLMappedRead).MethodName {unwrapMethodName = "filter"}( Nothing; min_match_size=45; min_identity_pc=97; action={unmatch} )]}}
[Mon 29-10-2018 03:03:45] Line 11: Executing blocked select on file /tmp/23095980/mapped_db.27976-0.sam
[Mon 29-10-2018 03:03:45] Line 11: Created & opened temporary file /tmp/23095980/block_selected_mapped_db27976-1.sam
[Mon 29-10-2018 03:03:46] Line 14: Running garbage collection.
[Mon 29-10-2018 03:03:46] Line 14: Removing temporary file: /tmp/23095980/mapped_db.27976-0.sam
[Mon 29-10-2018 03:03:46] Line 14: Interpreting [interpretIO]: mapped_non_unique = select(Lookup 'mapped' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 03:03:46] Line 14: Interpreting [assignment]: select(Lookup 'mapped' as NGLMappedReadSet; keep_if=[{mapped}])
[Mon 29-10-2018 03:03:46] Line 17: Running garbage collection.
[Mon 29-10-2018 03:03:46] Line 17: Removing temporary file: /tmp/23095980/mapped_db.27976-0.sam
[Mon 29-10-2018 03:03:46] Line 17: Interpreting [interpretIO]: write(Lookup 'mapped_non_unique' as NGLMappedReadSet; __hash="96241298d14e84a92cabbadc8bc7d00a"; ofile="outputs/"BOpAddLookup 'sample' as NGLStringBOpAdd".test.bam")
[Mon 29-10-2018 03:03:46] Line 17: Interpreting [write]: NGOMappedReadSet {nglgroupName = "data/sample", nglSamFile = <STREAM>, nglReference = Nothing}
[Mon 29-10-2018 03:03:46] Line 17: Created & opened temporary file /tmp/23095980/selected_block_selected_mapped_db27976-1streamed_.27976-2.sam
[Mon 29-10-2018 03:03:46] Line 17: Finished mapping to select.lno14
[Mon 29-10-2018 03:03:46] Line 17: Total reads: 35647
[Mon 29-10-2018 03:03:46] Line 17: Total reads aligned: 35647 [100.00%]
[Mon 29-10-2018 03:03:46] Line 17: Total reads Unique map: 27792 [77.96%]
[Mon 29-10-2018 03:03:46] Line 17: Total reads Non-Unique map: 7855 [22.04%]
Exiting after fatal error while loading and running script
System Error
Failure on converting sam to bam1

[Mon 29-10-2018 03:03:46] Line 17: Created & opened temporary file /tmp/23095980/converted_selected_block_selected_mapped_db27976-1streamed_27976-3.bam
[Mon 29-10-2018 03:03:46] Line 17: SAM->BAM Conversion start ('/tmp/23095980/selected_block_selected_mapped_db27976-1streamed_.27976-2.sam' -> '/tmp/23095980/converted_selected_block_selected_mapped_db27976-1streamed_27976-3.bam')
[Mon 29-10-2018 03:03:46] Line 17: Message from samtools: [E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] Parse error at line 43804
[main_samview] truncated file.

luispedro added a commit that referenced this issue Oct 30, 2018

TST Crash NGLess with a corner case
When re-injecting the sequence into a Samline, NGLess does not ensure
that the CIGAR string is correctly transformed.

This is not an issue for NGLess' own downstream processing, but samtools
will refuse to transform it to a BAM file

Reported by @unode as issue #92
@luispedro

This comment has been minimized.

Copy link
Collaborator

commented Oct 30, 2018

It's actually not related to large BAM files at all, it's just a corner case that doesn't happen that frequently (something like 1 out of 50,000 sequences)

@luispedro luispedro closed this in 3eb9336 Oct 30, 2018

@unode

This comment has been minimized.

Copy link
Member Author

commented Oct 30, 2018

In the case of the broken pipe above, is there any way that ngless can still capture and display the error?
It seems like the actual message is lost and the only diagnostic is the broken pipe.

Thanks for the fix.

luispedro added a commit that referenced this issue Nov 12, 2018

BLD Release 0.10
Many small fixes rather than any large new features.

Full ChangeLog:

    * Fix to lock1's return value when used with paths (#68 - reopen)
    * Support _F/_R suffixes for forward/reverse in load_mocat_sample
    * samtools_sort() now accepts by={name} to sort by read name
    * Fixed bug where header was printed even when STDOUT was used
    * Fixed bug where writing interleaved FastQ to STDOUT did not work as
    expected
    * Indices created by bwa and minimap2 are now versioned
    * arg1 in external modules is no longer always treated as a path
    * Added expand_searchpath to external modules API (closes #56)
    * Fixed bug where detection of Fastq encoding was not performed on the second pair
    * Fix saving fastq sets with --subsample (issue #85)
    * Add __extra_megahit_args to assemble() (issue #86)
    * Better error message when user mis-specifies the ngless version string
    (issue #84)
    * Support NO_COLOR environment variable (issue #83)
    * Garbage collection for temporary files (issue #79)
    * Rename --search-dir to --search-path for consistency with other API
    * Fix corner case with select() producing incorrect CIGAR strings (#92)
    * Always check output file writability (#91)
    * Make paired() accept encoding argument

unode added a commit that referenced this issue Feb 8, 2019

MNT Implement fixCigar for select() blocks
The previous solution implemented on 3eb9336 only covered the
select() function and not blocked select.

refs #92

luispedro added a commit that referenced this issue Feb 22, 2019

RLS Release 0.11.0
A collection of several bugfixes and performance improvements over the
last few months.

Full ChangeLog:

    * Switch to diagrams package for plotting
    * Update minimap2 version to 2.14
    * Module samtools (version 0.1) now includes samtools_view
    * Update to LTS-13 (GHC 8.6)
    * Fix bug with orf_find & prots_out argument
    * Call bwa/minimap2 with interleaved fastq files
    * Add --verbose flag to check-install mode
    * Avoid leaving open file descriptors after FastQ encoding detection
    * Fix bug in garbage collection
    * Compress intermediate SAM files (#22)
    * Tar extraction uses much less memory (#77)
    * Add early checks for input files in more situations (#33)
    * Support compression in collect() output (#42)
    * Fix CIGAR (#92) for select() blocks

luispedro added a commit that referenced this issue Mar 15, 2019

RLS Release 0.11.0
A collection of several bugfixes and performance improvements over the
last few months.

Full ChangeLog:

    * Switch to diagrams package for plotting
    * Update minimap2 version to 2.14
    * Module samtools (version 0.1) now includes samtools_view
    * Fix bug with orf_find & prots_out argument
    * Call bwa/minimap2 with interleaved fastq files
    * Add --verbose flag to check-install mode
    * Avoid leaving open file descriptors after FastQ encoding detection
    * Fix bug in garbage collection
    * Compress intermediate SAM files (#22)
    * Tar extraction uses much less memory (#77)
    * Add early checks for input files in more situations (#33)
    * Support compression in collect() output (#42)
    * Fix CIGAR (#92) for select() blocks
    * Update to LTS-13 (GHC 8.6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.