
HAC models #14

Closed

MichelMoser opened this issue Aug 5, 2019 · 17 comments

@MichelMoser commented Aug 5, 2019

Dear HELEN developers,

When running MarginPolish with the allParams.np.human.guppy-ff-235.json model, I get a Calloc error.

udocker run -v /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K:/data mGPolish reads_2_assembly.bam assembly.fasta allParams.np.human.guppy-ff-235.json -t 32 -o /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305 -f

 ******************************************************************************
 *                                                                            *
 *               STARTING 137351bb-4e04-3309-9bf5-ae016625cef7                *
 *                                                                            *
 ******************************************************************************
 executing: sh
Set log level to INFO
Running OpenMP with 32 threads.
> Parsing model parameters from file: allParams.np.human.guppy-ff-235.json
Calloc failed with request for -2 lots of 16 bytes
Command exited with non-zero status 1

DEBUG_MAX_MEM:4608
DEBUG_RUNTIME:0:00.06

The program runs if using another model:

udocker run -v /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K:/data mGPolish reads_2_assembly.bam assembly.fasta allParams.np.human.guppy-ff-233.json -t 32 -o /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305 -f

 ******************************************************************************
 *                                                                            *
 *               STARTING 137351bb-4e04-3309-9bf5-ae016625cef7                *
 *                                                                            *
 ******************************************************************************
 executing: sh
Set log level to INFO
Running OpenMP with 32 threads.
> Parsing model parameters from file: allParams.np.human.guppy-ff-233.json
> Parsing reference sequences from file: assembly.fasta
> Going to write polished reference in : /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305.fa
...

Is the 235 model file corrupted?

Also, I saw that your latest polishing model is named guppy 2.3.5.
Was it trained on the HAC configuration files?

We are currently using PromethION data basecalled with the HAC models of guppy 3.0.5 provided by ONT, and I wonder which model would fit the data best.

model files used for basecalling:

md5sum dna_r9.4.1_450bps_hac_prom.cfg   c9dc5f42f63c005085ed89e4094e0bb4
md5sum template_r9.4.1_450bps_hac_prom.jsn     6ee479f9ae82a7d26cb47bd24a7882fd

Maybe it would be more accurate to name the models after the basecall models they were trained on rather than after guppy versions?

Thanks,
michel

@kishwarshafin (Owner)

Hi Michel,

Regarding the MarginPolish error: @tpesout can help you with it.

HAC model: all models are trained on HAC base-called data. We are still trying to assess the fast model and its applicability.

And the model: we are trying to assess and provide a model for the latest guppy version, but the summer hiatus and the expense of basecalling are holding us back. As we regroup after summer and get newly basecalled data, we will publish updated models. For now, guppy 2.3.3 would be the one to use for 3.0.2, as the RLE confusion matrix matches closely.

@kishwarshafin (Owner)

@MichelMoser ,

I was able to replicate the model error you were getting. Can you delete the local copy of your guppy 235 file and do this:
wget https://raw.githubusercontent.com/UCSC-nanopore-cgl/MarginPolish/master/params/allParams.np.human.guppy-ff-235.json

This downloads the raw JSON file and makes sure you don't end up with HTML content instead.
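As a quick sanity check (a sketch, assuming python3 is on your path), you can let Python's JSON parser validate the file; an HTML error page will fail immediately:

    # Sanity check: succeeds only if the file is valid JSON, not an HTML page.
    python3 -m json.tool allParams.np.human.guppy-ff-235.json > /dev/null \
        && echo "valid JSON" \
        || echo "not JSON; re-download the raw file"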

@MichelMoser (Author)

Great, thanks for fixing this. It works fine now!
I will run a few model comparisons on my guppy305-basecalled data to see which performs best.

I also have some datasets from different basecalling models (from older and newer firmware), and I am not sure yet how to apply HELEN in such cases. Do you have any suggestions?
I assume marginPolish's and HELEN's main source of confidence in the consensus is coverage, so if I split the dataset into sets of same-guppy-basecalled data, I will probably lose a lot of "consensus power" as coverage drops.
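For concreteness, the per-basecaller split I have in mind would look roughly like this (a sketch with hypothetical file names; it assumes a recent samtools with the -N/--qname-file option, and that each run's sequencing_summary.txt is available):

    # Hypothetical sketch: pull out only the reads from one basecaller run.
    # Column 2 of guppy's sequencing_summary.txt is usually read_id; adjust if yours differs.
    awk 'NR > 1 {print $2}' sequencing_summary_guppy305.txt > guppy305_read_ids.txt
    # Subset the combined alignment to just those reads (samtools >= 1.12).
    samtools view -b -N guppy305_read_ids.txt -o reads_guppy305.bam reads_2_assembly.bam
    samtools index reads_guppy305.bam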

@kishwarshafin (Owner)

@MichelMoser, we describe why we need different models for different versions of basecallers:
https://www.biorxiv.org/content/biorxiv/early/2019/07/26/715722.full.pdf

If you look at figure 18 on page 41, you'll see that the confusion matrices for the two basecallers are different. I think 3.0.5 is the closest to 2.3.3, so you would get the best results for 3.0.5 data with the 2.3.3 model. Do you have training data (HG002) for these basecaller versions? If yes, then you can train a model for each version and use that.

@MichelMoser (Author)

@kishwarshafin, yes, I totally understand why different models are needed to match the different guppy basecall models. I wonder how I could get optimal polishing results with mP+H for genomes assembled from a combination of different nanopore reads (guppy233-called and guppy235-called). We work with non-model organisms (fishes) without reference genomes, so training is difficult to do.

@kishwarshafin (Owner)

I see. Sorry, I misinterpreted your question. OK, in that case you also need to consider that different basecallers report different base qualities, and the model relies heavily on the qualities reported by the basecaller. It is extremely important that you pick the right model for the right basecaller. I would suggest that you gather all your data and basecall it with a single guppy version, i.e. 3.0.5; then, when we release a model for 3.0.5 or higher, you can use that. Mixing 233 and 235 will make the model perform badly.

It's a bit frustrating to keep up with all the frequent basecaller upgrades, but that's the best we can do right now.
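For reference, the single-version re-basecalling step I am suggesting is roughly the following (a sketch; the config name comes from your md5sum listing above, and the input/output paths are placeholders):

    # Sketch: re-basecall all runs with one guppy version and one HAC config (placeholder paths).
    guppy_basecaller \
        --input_path fast5_all_runs/ \
        --save_path basecalled_guppy305/ \
        --config dna_r9.4.1_450bps_hac_prom.cfg \
        --recursive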

@kishwarshafin (Owner)

@MichelMoser ,

I can confidently say that I was wrong in my statement last month. I ran a simple test by mixing reads from 235 and 233, performed polishing, and it looked fine. So you can mix your reads and do the polishing to get better results. Sorry for being so late on this.

@MichelMoser (Author)

@kishwarshafin

That's great news. Thanks for the follow-up.
Did you do the polishing of the mixed reads with the 235 model?

@kishwarshafin (Owner)

Yes, the data was a 233 + 235 mixture and the model was 235. Also, if you want to wait, we are working on a 305 model and should be able to deliver it in a week or so.

Thanks for your patience.
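In case it helps, mixing and polishing with the 235 model would look roughly like this (a sketch with placeholder file names; the marginPolish arguments follow the same pattern as your command earlier in this thread, so substitute your udocker invocation as needed):

    # Sketch: merge the 233- and 235-basecalled alignments, then polish with the 235 model.
    samtools merge reads_mixed.bam reads_guppy233.bam reads_guppy235.bam
    samtools index reads_mixed.bam
    marginPolish reads_mixed.bam assembly.fasta allParams.np.human.guppy-ff-235.json \
        -t 32 -o polished_mixed -f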

@MichelMoser (Author)

Great, can't wait for this!

@kishwarshafin (Owner)

@MichelMoser

The marginPolish model has been updated on master, and here's the HELEN model for 305:
HELEN_guppy305_model

There are still a few tests we need to run before making it public, but the initial ones look very good. :-) Good luck!

@MichelMoser (Author)

@kishwarshafin

Great! I already started a run to see how it compares to the previous models. BUSCO will tell =)
Thank you so much!

@kishwarshafin (Owner)

Great! Please share the results if you can.

@MichelMoser (Author)

Hi,

I did a comparison of the newest models for HELEN and marginPolish (guppy305) against the previous ones (guppy235) and against medaka (r941_prom_high).

Unfortunately, the guppy305 model did not perform well on our data.
Did you use MinION data for training? We exclusively have PromethION data for our genomes.

Here is the overview:

allParams.np.human.guppy-ff-305.json+guppy305_hg002_splitRleWeight.pkl    C:80.3%[S:77.2%,D:3.1%],F:11.1%,M:8.6%,n:4584

allParams.np.human.guppy-ff-235.json+r941_flip235_v001.pkl                C:86.4%[S:83.4%,D:3.0%],F:5.5%,M:8.1%,n:4584

racon (1 round) + medaka                                                  C:90.2%[S:87.1%,D:3.1%],F:3.9%,M:5.9%,n:4584

marginPolish tpesout/margin_polish:latest
HELEN commit 84f3575
medaka version 0.9.1
racon version v1.4.7
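For anyone wanting to reproduce this kind of comparison, a BUSCO run on each polished assembly looks roughly like this (a sketch; it assumes the BUSCO v4+ command-line interface, and the lineage name is a placeholder):

    # Sketch: score one polished assembly with BUSCO (v4+ CLI assumed; lineage is a placeholder).
    busco -i polished_assembly.fasta -o busco_polished -l your_lineage_odb10 -m genome -c 32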

@kishwarshafin (Owner) commented Oct 9, 2019

Hi @MichelMoser ,

I have a few questions for you:

  1. I believe it's a non-model organism, but do you have a good assembly of that organism that can be used as a reference?
  2. Would you be willing to test another version of MP+HELEN that is in development, or to provide us with the reads and the assembly so we can do it for you?

In our own evaluations we found that BUSCO is not a great metric for assembly accuracy, but a difference this big is concerning.

@kishwarshafin reopened this Oct 9, 2019
@MichelMoser (Author)

Hi,

I was surprised by the result as well.

  1. Yes, I tested it on a non-model organism because it has a relatively small genome (700 Mbp) and the nanopore assembly is very good (N50 of 18 Mbp). Unfortunately, we currently only have nanopore data available for this species; Illumina data is yet to come.

What kind of metric do you prefer for assessing assembly accuracy? Comparison to Illumina contigs?

  2. Yes, I would be happy to test other versions or share datasets. We have multiple de novo genome assemblies and nanopore datasets from several species (basecalled with either guppy 2.2.3 or 3.0.5), with different genome sizes. I could switch to a larger genome (2.5 Gbp) where both nanopore and Illumina data are available and references exist.

Let me know what you think. email: michel.moser at nmbu.no

@kishwarshafin (Owner)

I'll follow up with you over email and close this issue here, as it's not related to the pipeline.
