Workflow Training new Bonito Model #22
Hey @menickname, that sounds like a great idea and I'm happy to help. Did you see the notebook directory, which you can run in Google Colab? Bonito uses the same HDF5 chunkify format as Taiyaki, so you'll want to start with that; there is good documentation in the Taiyaki repo. The training file shipped with Bonito has 66,000 reads split equally across conditions, so you will want to filter your own reads to take an equal sample of reads per genome. Once you have created a chunkify file, convert it into the four .npy training files described below:

>>> # references.npy
>>> # Integer encoded target sequence {'A': 1, 'C': 2, 'G': 3, 'T': 4}
>>> # Variable length and zero padded (default range between 128 and 256).
>>> np.load('references.npy').shape
(1000000, 256)
>>> np.load('references.npy').dtype
dtype('uint8')
>>>
>>> # reference_lengths.npy
>>> # Lengths of target sequences in references.npy
>>> np.load('reference_lengths.npy').shape
(1000000,)
>>> np.load('reference_lengths.npy').dtype
dtype('uint8')
>>>
>>> # chunks.npy
>>> # Sections of squiggle that correspond with the target reference sequence
>>> # Variable length and zero padded (upto 4096 samples).
>>> np.load('chunks.npy').shape
(1000000, 4096)
>>> np.load('chunks.npy').dtype
dtype('float32')
>>>
>>> # chunk_lengths.npy
>>> # Lengths of squiggle sections in chunks.npy
>>> np.load('chunk_lengths.npy').shape
(1000000,)
>>> np.load('chunk_lengths.npy').dtype
dtype('uint16')

Bonito uses CTC to train; this is a good article explaining how it works: https://distill.pub/2017/ctc/. For training, you just need to give a directory to store the model output and the model config, and use bonito train.
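The four .npy arrays described above fit together in a specific way. Here is a minimal sanity-check sketch with small toy arrays standing in for the real files (the sizes are shrunk for readability; this is illustrative, not Bonito code):

```python
import numpy as np

# Toy stand-ins for the four training files described above
# (2 examples instead of 1,000,000; widths shrunk for readability).
references = np.zeros((2, 8), dtype=np.uint8)        # zero-padded integer targets
references[0, :5] = [1, 2, 3, 4, 1]                  # ACGTA
references[1, :3] = [2, 2, 4]                        # CCT
reference_lengths = np.array([5, 3], dtype=np.uint8)

chunks = np.zeros((2, 16), dtype=np.float32)         # zero-padded squiggle sections
chunks[0, :12] = np.random.randn(12).astype(np.float32)
chunks[1, :9] = np.random.randn(9).astype(np.float32)
chunk_lengths = np.array([12, 9], dtype=np.uint16)

# Invariants implied by the description above:
assert references.shape[0] == chunks.shape[0]            # one target per chunk
assert (reference_lengths <= references.shape[1]).all()  # lengths fit the padding
assert (chunk_lengths <= chunks.shape[1]).all()
for i, n in enumerate(reference_lengths):
    assert (references[i, n:] == 0).all()                # zero padding after length
    assert set(references[i, :n]) <= {1, 2, 3, 4}        # {'A':1,'C':2,'G':3,'T':4}
print("all invariants hold")
```

In the real files the same checks should pass with shapes (1000000, 256) and (1000000, 4096).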
Give it a shot and let me know if anything isn't clear or if you run into any problems. Chris.
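For intuition on what CTC training produces, here is a toy greedy CTC decode in plain Python, assuming label 0 is the blank and 1–4 map to A/C/G/T as in the target encoding above (an illustration of the idea from the linked article, not Bonito's decoder):

```python
def greedy_ctc_decode(labels, alphabet="NACGT", blank=0):
    """Collapse repeated labels, then drop blanks: the standard CTC
    post-processing applied to a per-timestep argmax path."""
    out = []
    prev = None
    for lab in labels:
        if lab != prev and lab != blank:
            out.append(alphabet[lab])
        prev = lab
    return "".join(out)

# An 8-timestep path collapses to a 4-base call:
print(greedy_ctc_decode([0, 1, 1, 0, 2, 3, 3, 4]))  # -> ACGT
# Repeated bases survive when separated by a blank:
print(greedy_ctc_decode([1, 0, 1]))                 # -> AA
```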
Dear @iiSeymour, thanks a lot for your prompt response. I will have a look into that; I think I needed this short summary to get from step A to B and so forth, up to the bonito train step. I did end up reading the notebook directory, but I got confused reading the issues on GitHub alongside the notebook. One small question for now: "The training file shipped with Bonito has 66,000 reads split equally across conditions, so you will want to filter your own reads to take an equal sample of reads per genome." Does this mean I should also stick to 66,000 reads (I don't think the exact size matters here)? I would like to train the model for one specific species, as I think the poor basecalling performance is due to the high AT content and numerous repeat stretches of the bacteria we are currently working on. However, I have 10 independent read files of the same isolate, which I might split to get an equal sample read input to train Bonito. If so, I might start training with a single dataset first and then retrain Bonito with a mixture of all 10 independent runs? Thank you in advance.
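On taking an equal sample of reads per run or genome, one possible sketch (the function and read names are illustrative, not part of Bonito):

```python
import random

def equal_sample(reads_by_group, seed=0):
    """Downsample each group's read list to the size of the smallest group,
    so every run/genome contributes the same number of training reads."""
    rng = random.Random(seed)
    n = min(len(reads) for reads in reads_by_group.values())
    return {group: rng.sample(reads, n) for group, reads in reads_by_group.items()}

# Hypothetical read IDs from 3 independent runs of the same isolate:
reads = {
    "run1": [f"r1_{i}" for i in range(120)],
    "run2": [f"r2_{i}" for i in range(80)],
    "run3": [f"r3_{i}" for i in range(200)],
}
balanced = equal_sample(reads)
print({g: len(v) for g, v in balanced.items()})  # every run contributes 80 reads
```

The balanced read IDs could then be used to filter the fast5s fed into Taiyaki.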
Dear @iiSeymour, once again, thank you for your answer earlier today. I have been playing around with Taiyaki today and succeeded in getting to the final step, prepare_mapped_reads.py. For this step we have to supply a model file. Currently, I have downloaded and used the r941_dna_minion.checkpoint file available in the taiyaki_walkthrough for Guppy. I was still wondering the following:
Thank you in advance.
Right, the exact size (66,000) isn't important; it was more just to let you know that you don't need millions of reads for model training. Stick with Guppy for both questions (1, 2); you take the reference sequence from the mapping anyway, so the caller isn't important. HTH, Chris.
Dear @iiSeymour, I have successfully generated the chunkify file; however, when running the Bonito convert-data command I get an error:

$ ./convert-data --chunks 1000000 ../../Bonito_Train/Bonito_Nick.hdf5 ../../Bonito_Train/

I went through all the previous steps again, but could not figure out where something went wrong.
Sorry @menickname, I had left a bad debug print in the conversion script; can you do a pull and try again?
@menickname, indeed! It performs perfectly now. The 4 .npy files have been generated and I will try to start the training on HPC GPU resources now. Thank you for your support.
Dear @iiSeymour, the first Bonito train is currently running on our HPC GPU cluster. However, with the current release we are not able to run on multiple GPUs. In issue #13 (comment) I saw you suggested adding some patches to the train.py file. Is there a specific reason this has not been implemented in the code itself? Thank you in advance.
On the multi-gpu support:
I will look to get this merged in today. |
@iiSeymour Do you have more information on that hang issue? How does it manifest, etc., so we can keep an eye on it? |
There is a PyTorch thread on the issue here. The hanging happens on import. This is all single-node multi-GPU b.t.w., and I can confirm NVIDIA DGX systems are fine as they use NVLink/NVSwitch. Is multi-node something you are wanting to look at?
We have 4 NVIDIA Tesla V100 GPUs in each box, so we're only looking at single node multi-GPU for now, which would be a nice step up over single GPU. |
Seems like we do have NVLink:
Yes, it looks like you are good! The samples are optional; they are here if you are interested: https://github.com/NVIDIA/cuda-samples
Dear @iiSeymour, Bonito train is running perfectly on multi-GPU after installing the patched Bonito version. Please find the working version here: https://gist.github.com/boegel/e0c303497ba6275423b39ea1c10c7d73 from @boegel. This one works on our HPC GPU resources and fixes 2 of the aforementioned issues. Still 2 questions:
Thank you in advance.
For distributed training, I think @iiSeymour is planning to look into adding support for leveraging RaySGD from Bonito. |
Version 0.1.2 is on PyPI now and includes the multi-GPU patch along with the reworked solution for including the old scripts directory. What was ./convert-data is now bonito convert.
Hi, thanks for your tool :-) I am trying to train an RNA model... but Bonito only has DNA models, so I have used Taiyaki as a starting point.
With Taiyaki v5.0.0 only, because with the latest version (with mLstm_flipflop.py, advised for Guppy compatibility) I got an error. So, I'm happy :-) and I hope to be on the right track. However, when running the Bonito convert command I get the following message:

bonito -v
bonito convert --chunksize 1000000 outputTayaki.fast5 bonito_convert

I have tried with --chunksize 100000, but got the same error.
Without the chunksize argument (the default is 3600) it works! But I have some doubts: what is the impact of this chunksize parameter?
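Roughly, chunksize sets how many raw signal samples go into each fixed-length training example, so it controls both the shape of the chunks array and how many chunks each read can yield. A sketch of the chunking idea (not Bonito's actual implementation):

```python
import numpy as np

def chunk_signal(signal, chunksize):
    """Split a 1-D squiggle into fixed-size windows, zero-padding the tail."""
    n_chunks = int(np.ceil(len(signal) / chunksize))
    padded = np.zeros(n_chunks * chunksize, dtype=np.float32)
    padded[:len(signal)] = signal
    return padded.reshape(n_chunks, chunksize)

signal = np.random.randn(10_000).astype(np.float32)   # a toy read of 10k samples
print(chunk_signal(signal, 3600).shape)      # (3, 3600)
print(chunk_signal(signal, 100_000).shape)   # (1, 100000), mostly zero padding
```

A chunksize far longer than a typical read leaves each chunk mostly padding, which may be why the very large values failed while the 3600 default worked.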
Dear @iiSeymour,
I am currently experimenting with the new Bonito basecaller (in comparison with the Guppy basecaller).
Recently, we generated a big dataset of bacterial ONT sequences from a highly AT-rich organism. In a de novo assembly, we noticed that with the Guppy basecaller we were not able to reach good final assembly Q-scores compared to generated Illumina data. As such, we started investing some time in running the Bonito basecaller, followed by de novo assembly; however, no improvement was observed. We are aware the current Bonito basecaller has not been trained on this type of data, so we first checked whether our Bonito basecaller performed well on an in-house generated E. coli dataset. This rendered highly improved final genome assembly Q-scores (in line with Illumina Q-scores) when using Bonito compared with Guppy.
Now we would like to train the Bonito basecaller with our own organism dataset(s) to reach similarly improved genome assembly Q-scores as with the E. coli dataset. I have been reading through multiple Bonito-related issues on GitHub, but could not figure out a clear workflow on how to train the Bonito basecaller. Is it possible to get a (little) walk-through on how to train the Bonito model with our own input data?
Thank you in advance.
Best regards,
Nick Vereecke