Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Insect 5hmC values anomalous #29

Closed
dithiii opened this issue Aug 25, 2022 · 7 comments
Closed

Insect 5hmC values anomalous #29

dithiii opened this issue Aug 25, 2022 · 7 comments

Comments

@dithiii
Copy link

dithiii commented Aug 25, 2022

I am working with an insect genome and trying to call 5mC and optionally 5hmC. When using megalodon with --remora-modified-bases dna_r9.4.1_e8 model calling 5mC only, I get around 6% 5mC (too high). while calling 5hmC_5mC, I get 0.55% 5mC (about right), but I am getting near 60% 5hmC which seems FAR too high to be realistic. I've never heard of an insect with such incredibly high 5hmC.

  1. Shouldn't the 5mC values match from both calls?
  2. What's going on with the 5hmC? If I can't trust that one, why should I trust the 5hmC levels?

My calls are

  1. megalodon /path/to/wasp-runs/ --sort-mappings --outputs mod_mappings mods per_read_mods --reference waspassembly.fasta --devices 0 --processes 23 --output-directory megalodon-out-5mc-sup --guppy-params " --use_tcp" --overwrite --guppy-server-path /opt/ont/guppy/bin/guppy_basecall_server --guppy-config dna_r9.4.1_450bps_sup.cfg --remora-modified-bases dna_r9.4.1_e8 sup 0.0.0 5mc CG 0
  2. megalodon /path/to/wasp-runs/ --sort-mappings --outputs mod_mappings mods per_read_mods --reference waspassembly.fasta --devices 0 --processes 23 --output-directory megalodon-out-5hmC-5mc-sup --guppy-params " --use_tcp" --overwrite --guppy-server-path /opt/ont/guppy/bin/guppy_basecall_server --guppy-config dna_r9.4.1_450bps_sup.cfg --remora-modified-bases dna_r9.4.1_e8 sup 0.0.0 5hmc_5mc CG 0
@marcus1487
Copy link
Collaborator

I would first recommend using Guppy for modified base calling as Megalodon is being deprecated and all the necessary features from Megalodon to achieve the highest accuracy results are ported into Guppy.

As for the correspondence of the 5mC and the 5mC+5hmC results, these should not necessarily be equal. The 5mC model was trained to detect 5mC, not both 5mC and 5hmC as 5mC (as would occur in bisulfite data to some extent). Thus when the Remora model sees data from a new modified base not modeled the results are not known. It may be that some other modified base also exists here (4mC, 5caC, 5fC, etc) that may be causing these results. Again with the principle that modified bases not seen by the model may fall into any modeled category with unknown frequencies. This is one of the advantages of nanopore modified base detection over chemical methods (such as bisulfite). When we are able to extend our modified base catalogue the results will be much richer than chemical conversion methods. The 5mC+5hmC model is the first step in this direction.

As for the higher frequency of 5hmC in your data I would recommend a bit more investigation. Are these confident calls? Try to adjust the aggregation thresholds in modbam2bed once you have guppy modified basescalls. Or explore the distribution of probabilities in the modbam files. The other option here would be to produce some real ground truth for this dataset. Do you have alternative evidence that there are no other modified bases in this sample (e.g. via HPLC or mass spec)? We have performed several validation experiments on data orthogonal to the training data with good results, so results so far out of the expected range are certainly concerning, but warrant further exploration.

@dithiii
Copy link
Author

dithiii commented Aug 25, 2022

Thanks for the detailed response. I used megalodon rather than guppy because I already had an assembled genome that I wanted to get methylation calls for, rather than calling with guppy, then having to map the unmapped bam file back to the genome. I have done that once, but it's a hassle. In the meantime it's easier to run megalodon with remora against the genome to get modification data.

I appreciate your comments regarding miscalling of non-5mC mods as 5mC, thats' why I went with the 5hmc-5mc model, to reduce those miscalls. I believe this dual model worked, by giving me expected levels of 5mC, but the 5hmC results are hard to explain.

The input here is whole genomic DNA derived from adult wasps, with no conversions or chemical modifications, sequenced on an 9.4.1 flow cell to 60X coverage. I filtered them with even more strict confidence than default by modbam2bed --aggregate -e -m 5mC --cpg -t 4 -d 300 -a 0.2 -b 0.8 wasp.fasta mod_mappings.sorted.bam | awk '($5>0){print}' > modified_bases.5mC.frommodbam2-strict-no5-aggregated.bed. I calculated the average value for sites over 10 reads, same was as I did for 5mC, awk '$5>0 && $12+$13>10 {total+=$11; count++} END{print total/count}' modified_bases.5hmC.frommodbam2-strict.bed and I get around 53% 5hmC. That's wildly high.

For comparison, I recently called 5mC and 5hmC in an ant genome, where they both came out as expected values (less than 1% for 5mC and nearly 0% for 5hmC.) There I had used a previous model from rerio, res_dna_r941_min_modbases_5mC_5hmC_v001.cfg So why did that model work and in this case, the dna_r9.4.1_e8 sup model gave such off 5hmC?

As for ground truth, I haven't checked via HPLC or mass spec. I am willing to look with a cheap ELISA assay for 5hmC, but honestly, nobody has ever seen such incredibly high 5hmC levels in any type of cell from any organism, from bacteria, plants, fungi, or animals that I am aware of, and certainly not any insects. I can't imagine that 4mC, 5-carboxy, or 5-formylC are explaining >50% of modified C called from this wasp DNA. More likely I'm doing something wrong or this model isn't handling wasp DNA very well.

I am willing to try the old res_dna_r941_min_modbases_5mC_5hmC_v001.cfg model again but it will take me awhile to set it up because megalodon is crashing when I try it at the moment. I will probably have to do the guppy -> unaligned bam -> aligned bam -> modbam2bed -> awk summary to get there. It'll take me a couple days.

@marcus1487
Copy link
Collaborator

rather than calling with guppy, then having to map the unmapped bam file back to the genome
Guppy accepts a reference for mapping -a [ --align_ref ] arg

I agree that this certainly seem very high, but as noted this may suggest another modified base that is not modeled. I don't know enough about wasp genomics to know if this has been tested, but if the only bases present are C, 5mC and 5hmC I would highly doubt that the released Remora model would produce so many false positives, though it is certainly in the realm of possibilities and we will look to address this if this is the case.

As for the counting method, this seems reasonable, though I would suggest {can+=$12; mod+=$13} END{print mod/(can+mod)} to count calls not an average of percent modified over different regions. I would also suggest visualizing the per-read calls in a genome browser (the modbam tag is supported in IGV and jbrowse) to see if this might be coming from a few select regions or is a global issue over the whole reference.

@dithiii
Copy link
Author

dithiii commented Sep 7, 2022

Just an update. I re-ran the basecaller with guppy, using the old research model with >80% stringency, res_dna_r941_min_modbases_5mC_5hmC_v001 and got:
0.2% 5hmC
0.13% 5mC

Then I ran the updated model, dna_r9.4.1_450bps_modbases_5mc_cg_sup at >80% stringency yielded:
1.7% 5mC.
It does not report 5hmC. It also gave 1.0% when stringency was increased to 90% on modbam2bed. I decided to use this model because it is the recommended "production" model for 5mC at CpG sites, and the 5hmC models are only recommended when 5hmC is explicitly targeted, which I am not expecting to see in these wasps anyway.

I believe with such sparse and low incidence of 5mC, the data is replicating within 2% no matter what model is used, so long as stringency is applied. That's reassuring. And with low values it's going to be near limit of detection probably.

Also, marcus1487, I respect the method of averaging all calls that you proposed, however the method I am using averages the percentage methylation itself, therefore it normalizes any uneven coverage. I don't want errant contigs with 1000x coverage biasing my overall methylation. In practice though, after I applied the 80% filter, both count methods came back with indistinguishable values anyway!

@dithiii dithiii closed this as completed Sep 7, 2022
@kaanokay
Copy link

kaanokay commented Oct 21, 2022

I used megalodon rather than guppy because I already had an assembled genome that I wanted to get methylation calls for, rather than calling with guppy, then having to map the unmapped bam file back to the genome. I have done that once, but it's a hassle. In the meantime it's easier to run megalodon with remora against the genome to get modification data.

Hi @dithiii, I'm now using Guppy and Remora pipeline. This pipeline gave me fast5, fastq, and bam files for each sample. The problem is that I have unaligned bam files with methylation tags and I want to align those bam files to reference genome. I think, minimap2 doest not accept bam files as input. I want to keep alignment info and match this info to methylation info. Do you have any experience with that to recommend me something?

Thank you.
Kind regards.

@dithiii
Copy link
Author

dithiii commented Oct 21, 2022

Hi @kaanokay ,
Yes I have dealt with that issue in the following way.
Concatenate all the unmapped bam files containing modified base information.
samtools cat -o mod_basecalls.bam *input.bam

Map unmapped Bams to a reference. The -TXX,YY flag carries over the modifications from bam to fastq. The -y flag copies input fastq commands to output.
samtools fastq -TMm,Ml mod_basecalls.bam | minimap2 -y -ax map-ont assembly.fa - -t 23 > assembly.modmapped.sam

Convert sam to sorted bam.
samtools view -u modmapped.sam | samtools sort -@ 24 -o modmapped.bam

Then I am using Clair3 to call variants, which uses Whatshap internally to call haplotypes. You can visualize regions with Methylartist and convert to bed file with modbam2bed. Good luck!

@kaanokay
Copy link

kaanokay commented Oct 23, 2022

Dear @dithiii,

I'll try your pipeline soon. My first aim is differential methylation analysis and second one is phasing for differential methylation. If you have any suggestion for DMR and phasing analyses, I'm very appreciated with that.

Thank you so much for sharing.
Bw.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants