sam-dump produces invalid SAM #23
I'm not able to ascertain whether this is a problem of the raw data or of `sam-dump`, but when I do e.g. this: … I get errors of the type: … The offending line looks like this: … I've tried with sra-tools versions 2.3.5 and 2.5.7, with the same result. Any idea what's going on there?

Comments
It's an issue with the raw data: an inconsistency in the BAM flags where one read of a pair has no primary alignment, either because it is in the BAM but never marked as primary, or because it is not found in the BAM at all.
Interesting - I wonder what happened there. I'm now trying to dump to fastq and re-map. But it would be good to make this part of the QC in the future...
I conflated two issues. I think this one is caused by an aligner that hard-clips reads but not qualities in some SAM rows, or maybe qualities but not reads. Either way, the data is corrupt, as the bases cannot be matched to their corresponding qualities. When SRA processes BAM files, we have been using the SAM flags to determine the primary read; multiple placements of a read with conflicting flag values can lead to processing errors. Regardless, we need to add further checks to our loading procedures.
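As a rough sketch of the kind of loader-side checks being described, assuming `samtools` plus `gawk` (for its bitwise `and()`) and a hypothetical input file `run.bam`; this is not the actual SRA loading code:

```sh
# 1. Flag rows where base string and quality string differ in length
#    (SAM field $10 = SEQ, $11 = QUAL; "*" means absent and is skipped).
samtools view run.bam \
  | gawk '$10 != "*" && $11 != "*" && length($10) != length($11) {
            print "seq/qual length mismatch:", $1
          }'

# 2. Report read names that only ever appear as secondary (0x100) or
#    supplementary (0x800) alignments, i.e. that have no primary record.
samtools view run.bam \
  | gawk '{ if (and($2, 256) || and($2, 2048)) sec[$1] = 1; else pri[$1] = 1 }
          END { for (q in sec) if (!(q in pri)) print "no primary alignment:", q }'
```

Note that the second pass keeps every read name in memory, so it is only practical for modestly sized files.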
So, the data is inherently corrupt and `fastq-dump` will produce invalid fastq files or missing reads?
`fastq-dump` will produce what was loaded, with reads and qualities. The lengths of each will be what was submitted; reads that were filtered out during load will be missing.
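For what it's worth, that claim can be spot-checked on the fastq side with a one-liner; a minimal sketch, assuming an uncompressed dump in a placeholder file `run.fastq`:

```sh
# In a 4-line fastq record, line 2 is the sequence and line 4 the qualities;
# they should always have equal lengths in the dumped output.
awk 'NR % 4 == 2 { seqlen = length($0) }
     NR % 4 == 0 && length($0) != seqlen { print "length mismatch at record", NR / 4 }' run.fastq
```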
Further to this, I've now tried to dump these files into fastq, but `fastq-dump` fails even with 70GB of memory allocated to it. I'm now re-running it with 120GB of memory, but I'm not too hopeful even that will be enough. The same happens for all runs in that project. Any idea what the problem is there?
What is the failure?
Either our cluster scheduler kills it because it uses more than 70GB of memory, or it dies by itself with an out-of-memory error. You can see an example error report here: https://gist.github.com/DaGaMs/74338cc5e2af7e3c52c3
Okay, this is good. Yes, your cluster imposed a resource limit that causes us to receive an almost unheard-of out-of-memory error during processing. I would drop the gzipping and splitting of the output and try again without them. The run you mention has 206,754,609,912 bases, which is almost twice the size of SRR1264615 that you mention in another issue. A simple, single-file `fastq-dump` of this run will produce a file of about half a terabyte.
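For illustration, the two invocations being contrasted might look like this, with `SRRXXXXXXX` standing in for the actual accession:

```sh
# Simple, single-file dump, no compression, no splitting, as suggested above.
# Expect roughly half a terabyte of output for a run of this size.
fastq-dump SRRXXXXXXX

# The heavier invocation being debugged would have looked something like:
#   fastq-dump --split-files --gzip SRRXXXXXXX
```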
I'll try without the splitting/zipping, but I'll eventually need to split by read group and mate pair to map the data correctly. For clarification: SRR1264615 belongs to project SRP041470. All runs in SRP041470 extract OK and have no SAM issues. On the other hand, all runs in SRP061939 have SAM issues as described above, and all of them lead to humongous memory consumption upon `fastq-dump`.
Another thing to point out is that this run, in addition to being quite large (52G as SRA, but ~500G as fastq, which doesn't even account for its alignments), contains 83 references. In the release you have, our code keeps all 83 of those references in memory. Release 2.6.0 will reduce this down a lot. We'll test to see if this could account for some of the memory consumption.

Regarding the original topic of this issue, we have discovered a number of cases where the submitter has been hard-clipping secondary alignments. Hard-clipping is invalid for submission to archives, for the obvious reason that it discards sequence. The specific case is that secondary alignments that occur before their related primaries are seen as clipped, and their sequence is extracted during load. When the primary is seen later, the full sequence is given, but by then it is too late. We are addressing this in our ETL.
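A quick way to spot the hard-clipped-secondary pattern just described, sketched with `samtools` and `gawk` against a hypothetical `run.bam` (this is not the archive's own scan):

```sh
# Secondary alignments (flag bit 0x100) whose CIGAR ($6) contains a hard
# clip (H): print name, flags, and CIGAR for inspection.
samtools view run.bam \
  | gawk 'and($2, 256) && $6 ~ /H/ { print $1, $2, $6 }'
```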
It seems unlikely to me that the reference issue is responsible for the memory inflation. First of all, I think runs from SRP041470 and SRP061939 were aligned to the same reference, but only the latter causes memory issues. Secondly, the combined uncompressed size of all the reference sequences is 1.4G. Even if everything was loaded into memory, at a 2x overhead per base, it would still only use up < 3G of memory.
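One way to reproduce that estimate directly from the run itself is to sum the `LN:` fields of the `@SQ` header lines, e.g. via `sam-dump -r` as used elsewhere in this thread; `SRRXXXXXXX` is a placeholder:

```sh
# Total reference bases = sum of LN: values over all @SQ header lines.
sam-dump -r SRRXXXXXXX \
  | awk -F'\t' '/^@SQ/ { for (i = 2; i <= NF; i++)
                           if ($i ~ /^LN:/) total += substr($i, 4) }
                !/^@/  { exit }   # header done; stop reading alignments
                END    { printf "total reference bases: %d\n", total }'
```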
Yes, and this is why we have traditionally cached them.
So, the output of a plain `fastq-dump` looks sane. However, read-group annotation is completely lost. I won't be able to re-capture the sequencing run for each read, right? That's a bit sub-optimal...
Well, it's a different use case. For example, 454 mate pairs will contain linkers, and for anyone who wants to perform their own read splitting, that's the way to go. Illumina read-pairs are fixed size, so they can be split externally as well.
So if you want them to be split within `fastq-dump`, that is supported as well.
You'll notice that the spot length for SRR2141580 is 302, and in fact that is what an unsplit dump delivers: the two 151-base mates concatenated per spot. Of course, we also provide the ability to split the reads, and that's what we normally recommend. We're just exploring other avenues here, and showing that in fact the output is perfectly useful.
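To make the fixed-size case concrete, here is a minimal external splitter, assuming each 302-base spot is exactly two 151-base mates and that the unsplit dump is in a placeholder file `unsplit.fastq` (in practice, `fastq-dump --split-files` does this for you):

```sh
awk -v half=151 '
  NR % 4 == 1 { name = $0 }        # @ header line (reused for both mates)
  NR % 4 == 2 { seq  = $0 }        # bases, expected length 2 * half
  NR % 4 == 3 { plus = $0 }        # "+" separator line
  NR % 4 == 0 {                    # qualities: emit both mates
    print name                 > "mate1.fastq"
    print substr(seq, 1, half) > "mate1.fastq"
    print plus                 > "mate1.fastq"
    print substr($0, 1, half)  > "mate1.fastq"
    print name                 > "mate2.fastq"
    print substr(seq, half+1)  > "mate2.fastq"
    print plus                 > "mate2.fastq"
    print substr($0, half+1)   > "mate2.fastq"
  }' unsplit.fastq
```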
Sorry to keep going on about this, but the experiment unfortunately failed. I tried running `fastq-dump` without splitting or zipping, and it still failed.
Okay, then let us work on it for a bit and we'll be back with you. However, I'm confused - you gave it 15G of running space? I thought before you were talking in terms of 70G and beyond?
So, a few of the runs finished with a max memory usage of 105G, but most of them still failed after they exceeded 120G of memory... I'll try again with 256G each...
We are producing cache files that will speed the process and reduce mem use. I'll check on the progress in a bit. -Chris
Sorry that this issue didn't get updated before - it went into our internal NCBI issue tracking DB and didn't get updated here. But in general, the files that were giving these problems have been reloaded. Please let us know if this is still a problem for you. |
When running against the reloaded data, there does not appear to be any further error: `sam-dump -r SRR2141560 | samtools view -b -S - > SRR2141560.bam` works.
@osris @kwrodarmer I'm seeing a nearly identical error on SRR2170057: …
This one had not been reloaded; it was only released a couple of weeks ago and must have been missed in our first scan. Reloading and checking for others in this study.
@osris thanks for looking into it! We're having problems converting several of the files in this collection to BAMs, including SRR2177298, SRR2177289, SRR3198441, and SRR2177293. Some of these die with a segmentation fault in `sam-dump`.
A few more that look busted: SRR3198427, SRR2177269.
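Since several accessions are in play, a batch check along the lines of the working command above can save round trips; the accession list simply mirrors the ones reported in this thread:

```sh
set -o pipefail   # so a sam-dump crash fails the pipeline, not just samtools
for acc in SRR2177298 SRR2177289 SRR3198441 SRR2177293 SRR3198427 SRR2177269; do
  if { sam-dump -r "$acc" | samtools view -b -S - > "$acc.bam"; } 2> "$acc.err"; then
    echo "$acc ok"
  else
    echo "$acc FAILED, see $acc.err"
  fi
done
```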
@hammer No need to send more examples. We are checking to find out why we missed these when we scanned for inconsistent seq/qual lengths (BWA-MEM plays fast and loose with the SAM spec). I'm running a test on reloaded SRR2170057 right now and will queue the others after verifying the fix.
@osris how's the cleanup of these files coming? I have another set of IDs of broken files to send you if you're interested!
@osris Any updates? We'd really like to be able to use this data. |
@hammer Apologies, this dropped off my radar. I'm running the reloads now and will check progress over the weekend.
@hammer Reprocessing is complete. Please let me know if you encounter any issues.