Compare mapping to sample graph vs. mapping to full graph #76
David Heller:
This is a good comment that Jonas made in the Google doc:

> It might be interesting to compare mapping/identity statistics for the sample graph with those for the full graph. This would give an indication of the quality of the genotyping, since wrong genotypes would likely result in fewer mapped reads and lower identity.

I have the data now and will generate a plot tomorrow.
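One way to pull those statistics out of the alignments is sketched below; the file names are hypothetical and this is not necessarily the pipeline used here. It maps the same reads to both graphs and extracts per-read identity and mapping quality from the GAM JSON:

```sh
# Hypothetical file names; assumes xg/GCSA indexes exist for both graphs.
for g in full sample; do
    vg map -x "$g.xg" -g "$g.gcsa" -f reads.fq > "$g.gam"
    # GAM -> JSON, one alignment per line; missing fields (unmapped reads)
    # default to 0 so every read contributes a row
    vg view -a "$g.gam" \
        | jq -r '[.identity // 0, .mapping_quality // 0] | @tsv' > "$g.stats.tsv"
done
```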
David Heller:
All the plots from the yeast data basically visualize the same thing: how well the same sets of Illumina short reads map to graphs generated with the VCF or cactus approach.

- For the mapping evaluation, the plot shows how well the reads map to the original (full) graphs.
- For the genotyping evaluation, the plot shows how well the reads map to the sample graph (derived from the genotype calls).

Until now, I had never overlaid both kinds of data, but the results look quite interesting:

Mapping quality:
[image: mapping mapq full vs sample]
https://user-images.githubusercontent.com/6477692/57377777-8f579380-71a3-11e9-9804-02c3ae0a960a.png

Alignment identity:
[image: mapping id full vs sample]
https://user-images.githubusercontent.com/6477692/57377782-91b9ed80-71a3-11e9-8da3-44605fa52532.png

For the cactus graph, the reads map better to the full graph, as we would expect. Surprisingly, this is not the case for the VCF graph (x-axis): more reads are mapped with mapq > 0 to the sample graph than to the full graph, and far more reads are mapped with 100% identity to the sample graph than to the full graph.

I have two ideas on this and would be curious to hear yours as well:

1. The genotyping (run with --recall for the cactus graph but without --recall for the VCF graph) not only picks up SVs but also SNVs and indels, which all get included in the sample graph. As a consequence, the sample graph is much better than the original graph and more reads get mapped.
2. The original VCF graph might be more repetitive because it contains the variation from several strains, and, unlike in the cactus graph, each variant exists as a separate branch off the reference path. Due to the repeat content, many reads might be mapped with mapq = 0 to the full graph but not to the sample graph, where much of that repetitive content is removed.

I'm not sure which conclusions to draw from this. One problem is that the cactus graph gets a head start because it contains all types of variation, not only SVs. I do not see a way around this in our experiments, but it's one more reason to move the mapping evaluation plot to the supplement. It's not saying much beyond that the set of variants included in the VCF is incomplete. But it's a good point in favor of the cactus approach: it does not require running different variant callers to obtain all these different types of variants.
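The "sample graph" above is derived from the genotype calls. A minimal sketch of that step, assuming the calls are simply fed back through vg construct (file names hypothetical):

```sh
# Build the sample graph from the genotyped VCF (hypothetical file names).
bgzip -c sample.calls.vcf > sample.calls.vcf.gz
tabix -p vcf sample.calls.vcf.gz
vg construct -r reference.fa -v sample.calls.vcf.gz > sample.vg
```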
Glenn:
Good points. The mapping experiments don't allow us to just cut off variation below 50 bp like we do in the rest of the paper, so we have to be careful.

For "whole-graph" comparisons, I guess we would need to include SNP calls in the VCF graph in order to properly compare it to the Cactus graph. This isn't hard from a vg point of view, I don't think, but I don't know if you have a way of getting SNPs from your data. Moving to the supplement and making clear that this effect could be the source of the signal we're seeing seems reasonable too. Like you say, it's a benefit of Cactus that you get all variation from it, as opposed to just SVs from asmvar.
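On the vg side, that would just mean merging SNP calls into the VCF before construction. A sketch, assuming a separate SNP/indel VCF exists (file names hypothetical):

```sh
# Both inputs must be bgzipped and tabix-indexed; -a interleaves the two
# sorted VCFs in coordinate order.
bcftools concat -a svs.vcf.gz snps.vcf.gz -Oz -o all.vcf.gz
tabix -p vcf all.vcf.gz
vg construct -r reference.fa -v all.vcf.gz > full.vg
```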
Allowing the VCF sample graph to get augmented by vg call mitigates this somewhat, but I think there could still be a bias here. Toggling --recall on and off changes the output substantially. It may be fairer to run them both without --recall.

Graph normalization can have a substantial effect on MAPQs for the reasons you state in point 2). Are you normalizing your graphs? Putting them through "vg mod -U 10 graph.vg | vg mod -X 32 -" could help considerably. It did for the HGSVC graphs, which suffered from the same problem (similar insertions from different samples lowering MAPQ).
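Spelled out as a pipeline (graph.vg stands in for either graph; the flags are exactly the ones suggested above):

```sh
# -U 10: normalize the graph (up to 10 iterations); this also unchops nodes,
# -X 32: so re-chop nodes to <=32 bp afterwards, keeping them short enough
#        for GCSA indexing and mapping.
vg mod -U 10 graph.vg | vg mod -X 32 - > graph.norm.vg
```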
David Heller:
Thanks Glenn,

I included --recall only for the cactus graph because this was the best parameter setting for each graph individually. In that way, it is a fair comparison because it reflects the structural difference between the graphs. Running both without --recall would make the results on the cactus graph slightly worse, and I'm also a bit hesitant to re-run everything now at this late stage.

I'm normalizing both graphs with vg mod -X 32 but not with vg mod -U 10. Is the missing part the important bit? Changing it would require re-running everything, though, and would probably take some time.
Glenn:
It's the mod -U 10 that does the normalizing (it also unchops the nodes, so in practice it always needs to be followed by mod -X 32). Not using it could lower the MAPQ, though there is no way to know without trying. I think pushing the MAPQ comparison on the whole graphs into the supplement and mentioning that normalization could be a factor is fine for now. Though it would be interesting to rerun at some point.

I guess using --recall with cactus and not using it with construct is okay, if these are individually the best parameter sets for each graph. I would expect these results to be fairly robust to minor parameter differences, but --recall does lower the support threshold at which a variant can be called.
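For concreteness, the toggle under discussion looks roughly like this; everything except the --recall flag is a placeholder, since the rest of the vg call command line depends on the pipeline being used:

```sh
# CACTUS_GRAPH, VCF_GRAPH, and CALL_ARGS are placeholders for your own
# vg call inputs; only the presence/absence of --recall is the point here.
vg call "$CACTUS_GRAPH" $CALL_ARGS --recall > cactus.calls.vcf  # recall mode: lower support threshold
vg call "$VCF_GRAPH"    $CALL_ARGS > vcf.calls.vcf              # default calling thresholds
```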