sam-dump produces invalid SAM #23

Closed
DaGaMs opened this issue Jan 12, 2016 · 31 comments

DaGaMs commented Jan 12, 2016

I'm not able to ascertain whether this is a problem of the raw data or of sam-dump, but when I do e.g. this:

sam-dump -r SRR2141560 | samtools view -b - > SRR2141560.bam

I get errors of the type:

[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] parse error at line 187
[main_samview] truncated file.

The offending line looks like this:

35      163     1       17649   0       41S110M =       17898   400     GAGCACAAGCACTACTTACTGGCCTAGGTTGTGAGAGAAGTTGATGCTGCTGGGAAGACCCCCAAGTCCCTCTTCTGCATCGTCCTCGGGCTCCGGCTTGGTGCTCACGCACACAGGAAAGTCCTTCAGCTTCTCCTGAGAGGGCCAGGAT   ::;=@@B@CBCACBACABBDC@CAB7DAAADBD@DCDCACCBDCDDCD        RG:Z:4638-MDA-03_CGGCTATG_H0BLEALXX_L004        NH:i:2  NM:i:1

I've tried with sra tools versions 2.3.5 and 2.5.7, with the same result. Any idea what's going on there?
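For reference, a rough way to list every record with this problem (just a sketch using standard awk; header lines and records whose QUAL is "*" are skipped, since those legitimately carry no qualities):

sam-dump -r SRR2141560 | awk '!/^@/ && $11 != "*" && length($10) != length($11) {print NR": "$1" SEQ="length($10)" QUAL="length($11)}'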


osris commented Jan 13, 2016

It's an issue with the raw data: an inconsistency in the BAM flags where one read of a pair has no primary alignment, either because it is in the BAM but never marked as primary, or because it is not in the BAM at all.
This produces an internal inconsistency in the SRA archive that only surfaces when you try to read the data back.
It's subtle enough that we have not been catching it when we load the files, and it is not even clear what we should do when we do catch it. I've looked at the offending SAM records in the source files and many are questionable, bordering on nonsense. We are working on a fix and will either reprocess the source files or provide an updated sam-dump utility.
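(As an illustration only, not how the loader actually checks: for a paired-end BAM, each read name should appear exactly twice once secondary (0x100) and supplementary (0x800) records are excluded, so something along these lines, with input.bam standing in for the submitted file, lists templates that break that rule:

samtools view -F 2304 input.bam | cut -f1 | sort | uniq -c | awk '$1 != 2 {print $2, $1}'

Here 2304 = 0x100 + 0x800, i.e. secondary plus supplementary alignments are filtered out before counting.)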


DaGaMs commented Jan 13, 2016

Interesting - I wonder what happened there. I'm now trying to dump to fastq and re-map. But it would be good to make this part of the QC in the future...


osris commented Jan 13, 2016

I conflated two issues. I think this one is caused by an aligner that hard-clips the reads but not the qualities in some SAM rows, or possibly the qualities but not the reads. Either way, the data is corrupt, because the bases cannot be matched to their corresponding qualities. When SRA processes BAM files, we have been using the SAM flags to determine the primary read, and multiple placements of a read with conflicting flag values can lead to processing errors. Regardless, we need to add further checks to our loading procedures.
SAM/BAM is the best hope for a defined file format for NGS data, but the spec is too loose, and samtools doesn't help the situation because it doesn't do any validation beyond simple formatting checks. Any application that has faithfully implemented the SAM spec (SRA, Picard, GATK) will complain about these BAM files, and those following the spec are forced to play catch-up.


DaGaMs commented Jan 13, 2016

So, the data is inherently corrupt and fastq-dump will produce invalid fastq files, or missing reads?

@kwrodarmer

fastq-dump will produce what was loaded, with reads and qualities. The lengths of each will be what was submitted; reads that were filtered out during load will be missing.


DaGaMs commented Jan 14, 2016

Further to this, I've now tried to dump these files into fastq, but fastq-dump is consuming huge amounts of memory. For example,

fastq-dump --gzip --split-3 -G -T SRR2141573

fails even with 70GB of memory allocated to it. I'm now re-running it with 120GB of memory, but I'm not too hopeful that even that will be enough. The same happens for all runs in that project. Any idea what the problem is there?

@kwrodarmer

What is the failure?


DaGaMs commented Jan 14, 2016

Either our cluster scheduler kills it because it uses > 70GB memory, or it dies by itself with

2016-01-14T08:41:17 fastq-dump.2.5.7 err: memory exhausted while resizing buffer within runtime module - failed /users/bschuster/ncbi/public/sra/SRR2141561.sra

=============================================================
An error occurred during processing.
A report was generated into the file '/users/bschuster/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================

You can see an example error report here: https://gist.github.com/DaGaMs/74338cc5e2af7e3c52c3

@kwrodarmer

Okay, this is good. Yes, your cluster imposed a resource limit that causes us to receive an almost unheard-of out-of-memory error during processing. I would drop the gzipping of output and the splitting and try it without.

The run you mention has 206,754,609,912 bases, which is almost twice the size of SRR1264615 that you mention in another issue. The output of a simple, single-file fastq-dump of this run will produce about a half terabyte file.
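(Roughly: 206,754,609,912 bases × 2 bytes each for the base and its quality character comes to about 414 GB before read names, '+' separators and newlines are counted, which is where the half-terabyte estimate comes from.)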


DaGaMs commented Jan 14, 2016

I'll try without the splitting/zipping, but I'll need to split by read group and mate pair eventually to map the data correctly.

For clarification: SRR1264615 belongs to project SRP041470. All runs in SRP041470 extract ok and have no SAM issues. On the other hand, all runs in SRP061939 have SAM issues as described above, and all of them lead to humongous memory consumption upon fastq-dump.

@kwrodarmer

Another thing to point out is that this run, in addition to being quite large (52G as SRA, but ~500G as fastq, which doesn't even account for its alignments), contains 83 references. In the release you have, our code keeps all 83 of those references in memory. Release 2.6.0 will reduce this considerably. We'll test to see whether this could account for some of the memory consumption.

Regarding the original topic of this issue, we have discovered a number of cases where the submitter has been hard-clipping secondary alignments. Hard-clipping is invalid for submission to archives, for the obvious reason that it discards sequence. The specific case is that secondary alignments occurring before their related primaries are seen as clipped, and their truncated sequence is extracted during load. When the primary is seen later, the full sequence is given, but by then it is too late. We are addressing this in our ETL.
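(As a rough diagnostic sketch only, with input.bam as a placeholder for the submitted file: secondary alignments whose CIGAR contains a hard clip can be listed before submission with something like

samtools view -f 256 input.bam | awk '$6 ~ /H/ {print $1, $6}' | head

where -f 256 selects secondary alignments and field 6 is the CIGAR string.)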


DaGaMs commented Jan 14, 2016

It seems unlikely to me that the reference issue is responsible for the memory inflation. First of all, I think runs from SRP041470 and SRP061939 were aligned to the same reference, but only the latter causes memory issues. Secondly, the combined uncompressed size of all the reference sequences is 1.4G. Even if everything was loaded into memory, at a 2x overhead per base, it would still only use up < 3G of memory.

@kwrodarmer

Yes, and this is why we have traditionally cached them.


DaGaMs commented Jan 14, 2016

So, the output of fastq-dump $HOME/ncbi/public/sra/SRR2141580 puts mate pairs concatenated on the same line? So, I should be able to split them by length, hopefully?

However, read-group annotation is completely lost. I won't be able to re-capture the sequencing run for each read, right? That's a bit sub-optimal...

@kwrodarmer

Well, it's a different use case. For example, 454 mate pairs will contain linkers, and for anyone who wants to perform their own read splitting, that's the way to go. Illumina read-pairs are fixed size, so they can be split externally as well.

vdb-dump will show the splitting information that we use within fastq-dump to create split files. I think we may even have a perl script that does this for you... I'll investigate and post later.

So if you want them to be split within fastq-dump, go ahead - but you may do better by not compressing the output. You can continue to experiment or wait for us to examine this ourselves, the latter of which will take some time.

@kwrodarmer

You'll notice that the spot length for SRR2141580 is 302, and in fact
$ vdb-dump -R1 -CREAD_LEN SRR2141580
shows
READ_LEN: 151, 151
It is not difficult to split this using text processing tools.
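For example, assuming the default four-line FASTQ records and a fixed 151+151 layout for every spot (true for this run per READ_LEN above, but worth verifying before reuse; the /1 and /2 mate suffixes are just one possible naming convention), a single awk pass can split the concatenated mates:

awk 'NR%4==1 {h=$0} NR%4==2 {s=$0} NR%4==0 {q=$0;
  print h"/1" > "SRR2141580_1.fastq"; print substr(s,1,151) > "SRR2141580_1.fastq";
  print "+" > "SRR2141580_1.fastq"; print substr(q,1,151) > "SRR2141580_1.fastq";
  print h"/2" > "SRR2141580_2.fastq"; print substr(s,152) > "SRR2141580_2.fastq";
  print "+" > "SRR2141580_2.fastq"; print substr(q,152) > "SRR2141580_2.fastq"}' SRR2141580.fastq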

Of course, we also provide the ability to split the reads, and that's what we normally recommend. We're just exploring other avenues here, and showing that in fact the output is perfectly useful.


DaGaMs commented Jan 14, 2016

Sorry to keep going on about this, but the experiment unfortunately failed. I tried running fastq-dump $HOME/ncbi/public/sra/SRR2141580 without splitting or compressing, but fastq-dump still used huge amounts of memory. I gave it 15G, which it used up within 20 minutes, and then it died. So whatever the problem is, it's not related to splitting or compressing.

@kwrodarmer

Okay, then let us work on it for a bit and we'll be back with you.

However, I'm confused - you gave it 15G of running space? I thought before you were talking in terms of 70G and beyond?


DaGaMs commented Jan 15, 2016

So, a few of the runs finished with a max memory usage of 105G, but most of them still failed after they exceeded 120G of memory... I'll try again with 256G each...


osris commented Jan 15, 2016

We are producing cache files that will speed the process and reduce mem use. I'll check on the progress in a bit.

@kwrodarmer

Sorry that this issue didn't get updated before - it went into our internal NCBI issue tracking DB and didn't get updated here. But in general, the files that were giving these problems have been reloaded. Please let us know if this is still a problem for you.

@kwrodarmer

When running against the reloaded data, there does not appear to be any further error:

sam-dump -r SRR2141560 | samtools view -b -S - > SRR2141560.bam

This works.


hammer commented Aug 30, 2016

@osris @kwrodarmer I'm seeing a nearly identical error on SRR2170057:

[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] parse error at line 76842889
[main_samview] truncated file


osris commented Aug 30, 2016

This one had not been reloaded; it was only released a couple of weeks ago and must have been missed in our first scan. I'm reloading it now and checking for others in this study.


hammer commented Aug 31, 2016

@osris Thanks for looking into it! We're having problems converting several of the files in this collection to BAMs, including SRR2177298, SRR2177289, SRR3198441, and SRR2177293. Some of these die with a segmentation fault in sam-dump; others die in samtools view.


hammer commented Aug 31, 2016

A few more that look busted: SRR3198427, SRR2177269


osris commented Aug 31, 2016

@hammer No need to send more examples. We are checking to find out why we missed these when we scanned for inconsistent SEQ/QUAL lengths (BWA-MEM plays fast and loose with the SAM spec). I'm running a test on the reloaded SRR2170057 right now and will queue the others after verifying the fix.


hammer commented Sep 10, 2016

@osris how's the cleanup of these files coming? I have another set of IDs of broken files to send you if you're interested!


hammer commented Sep 30, 2016

@osris Any updates? We'd really like to be able to use this data.


osris commented Sep 30, 2016

@hammer Apologies, this dropped off my radar. I'm running the reloads now and will check progress over the weekend.


osris commented Oct 3, 2016

@hammer Reprocessing is complete. Please let me know if you encounter any issues.
