Question about eventalign parallelization at file level #770

Open
mmiladi opened this issue Apr 25, 2020 · 5 comments

Comments

@mmiladi

mmiladi commented Apr 25, 2020

Hi,

Is it possible to speed up eventalign computations by splitting the input files and/or windowing by region?

For example, to speed up nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa > all.tsv, could I split the fastq file and then run:

nanopolish eventalign --reads half1.fastq --bam all.bam --genome genome.fa > half1.tsv
nanopolish eventalign --reads half2.fastq --bam all.bam --genome genome.fa > half2.tsv
cat half1.tsv half2.tsv > all.tsv
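
One detail to watch when concatenating per-chunk outputs: if each TSV starts with its own header line, a plain cat repeats that header in the middle of all.tsv. A minimal sketch that keeps only the first header, assuming every input file begins with the same one-line header (merge_tsv is a hypothetical helper name, not part of nanopolish):

```shell
# Concatenate TSV files, keeping the header line from the first file only.
# Assumption: every input file starts with the same one-line header.
merge_tsv() {
    # FNR == 1 is the first line of each file; NR != 1 excludes the very
    # first line read overall, so only the later headers are skipped.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Example usage with the file names from the commands above:
#   merge_tsv half1.tsv half2.tsv > all.tsv
```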

Best,

@jts
Owner

jts commented Apr 25, 2020 via email

@mmiladi
Author

mmiladi commented Apr 25, 2020

Great, thanks.
Would this also work with the window option '-w'? For the data I am using, -w seems to have no effect, as I can see positions outside the requested range within the .tsv table.

@jts
Owner

jts commented Apr 25, 2020

Sorry, I misread your issue initially (I shouldn't try to answer emails first thing in the morning...).

Splitting the fastq would work, but it isn't the recommended way: each run will still iterate over every read in the bam and ignore the ones whose signal data it can't find. Instead, you should provide a coordinate range as the last argument (without -w, though):

nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa chrA:0-1,000,000
nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa chrA:1,000,000-2,000,000
[...]
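
The per-window commands above can be generated rather than typed by hand. Below is a minimal sketch, assuming a contig named chrA of length 2,500,000 and a window size of 1,000,000 (all three values are placeholders). make_windows and print_commands are hypothetical helper names, and the output file naming is an assumption. It prints the commands as a dry run instead of running nanopolish, and writes the coordinates without thousands separators, as in chr:21000-22000.

```shell
# Print one region string per window, formatted as contig:start-end.
# Arguments: contig name, contig length, window size.
# The ( ) body runs in a subshell, so the variables stay local.
make_windows() (
    chrom=$1
    length=$2
    step=$3
    start=0
    while [ "$start" -lt "$length" ]; do
        end=$((start + step))
        # Clamp the last window to the contig length.
        if [ "$end" -gt "$length" ]; then
            end=$length
        fi
        echo "${chrom}:${start}-${end}"
        start=$end
    done
)

# Dry run: print one eventalign command per window instead of executing it.
# Input file names follow the commands in this thread; the per-window
# output file name (e.g. chrA_0-1000000.tsv) is a hypothetical choice.
print_commands() (
    make_windows "$1" "$2" "$3" | while read -r region; do
        out="$(echo "$region" | tr ':' '_').tsv"
        echo "nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa ${region} > ${out}"
    done
)

print_commands chrA 2500000 1000000
```

The printed lines can be reviewed and then run as separate jobs, one per window. Adjacent windows share a boundary coordinate, as in the example above; how reads that span a boundary are reported is not established in this thread, so the merged output is worth checking for missing or duplicated reads.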

@mmiladi
Author

mmiladi commented Apr 25, 2020

Thanks a lot for your prompt support. The coordinate option hint will be a real life (time) saver :-)

@mmiladi mmiladi closed this as completed Apr 25, 2020
@mmiladi mmiladi reopened this May 7, 2020
@mmiladi
Author

mmiladi commented May 7, 2020

Hi @jts ,

I have stumbled over the expected input of the eventalign range option. There are cases where the output tsv is empty, with no errors reported:

nanopolish eventalign --reads seq.fastq.gz --bam align.bam --genome ref.fa --samples --print-read-names --scale-events chr:21000-22000

[bam process] iterating over region:chr:21000-22000

[post-run summary] total reads: 17556, unparseable: 0, qc fail: 2, could not calibrate: 0, no alignment: 1, bad fast5: 0

Here, I have spliced reads whose 5' ends lie upstream of position 21000, but all the reads fully cover the range 21000-22000. It seems, though I am not sure, that I only get the aligned events if I use a range whose start covers the 5' end of the read. Is this the expected behavior?
Is there a way to parallelize over a region so that it includes all the reads that have bases (partially or completely) aligned to the region?
Best,
-M
