Compare mapping to sample graph vs. mapping to full graph #76

Closed
eldariont opened this issue May 7, 2019 · 4 comments

@eldariont
Collaborator

This is a good comment that Jonas made in the Google doc:
It might be interesting to compare the mapping/identity statistics for the sample graph with those for the full graph. This would give an indication of the quality of the genotyping, since wrong genotypes would likely result in fewer mapped reads and lower identity.
I have the data now and will generate a plot tomorrow.

@eldariont eldariont self-assigned this May 7, 2019
@eldariont
Collaborator Author

eldariont commented May 8, 2019

All the plots from the yeast data basically visualize the same thing: how well the same sets of Illumina short reads map to graphs generated with the VCF or cactus approach.

  • For the mapping evaluation, the plot shows how well the reads map to the original (full) graphs.
  • For the genotyping evaluation, the plot shows how well the reads map to the sample graph (derived from the genotype calls).

Until now, I had never overlaid both kinds of data but the results look quite interesting:

Mapping quality:
[Figure: mapping mapq full vs sample]

Alignment identity:
[Figure: mapping id full vs sample]

For the cactus graph, reads map better to the full graph, as we would expect. Surprisingly, this is not the case for the VCF graph (x-axis): more reads are mapped with mapq>0 to the sample graph than to the full graph, and far more reads are mapped with 100% identity to the sample graph than to the full graph.
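The two statistics compared here (fraction of reads with mapq>0 and fraction with 100% identity) can be sketched roughly as below. This is only an illustration: the `Alignment` record, the `summarize` helper, and the toy numbers are all hypothetical and are not the scripts or data behind the plots.

```python
from typing import Dict, Iterable, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical per-read record: mapping quality and identity in [0, 1]."""
    mapq: int
    identity: float


def summarize(alignments: Iterable[Alignment]) -> Dict[str, float]:
    """Return the fraction of reads with mapq > 0 and with 100% identity."""
    reads = list(alignments)
    total = len(reads)
    if total == 0:
        # No reads: report zero rather than dividing by zero.
        return {"mapq_gt_0": 0.0, "perfect_identity": 0.0}
    return {
        "mapq_gt_0": sum(a.mapq > 0 for a in reads) / total,
        "perfect_identity": sum(a.identity >= 1.0 for a in reads) / total,
    }


# Made-up toy values for illustration only (not data from this issue).
full_graph = [Alignment(0, 0.98), Alignment(60, 1.0),
              Alignment(0, 0.97), Alignment(60, 0.99)]
sample_graph = [Alignment(60, 1.0), Alignment(60, 1.0),
                Alignment(0, 0.97), Alignment(60, 1.0)]

for name, reads in [("full", full_graph), ("sample", sample_graph)]:
    print(name, summarize(reads))
```

Computing both fractions over the same read set for each graph is what makes the full vs. sample comparison direct.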

I have two ideas on this and would be curious to hear yours as well:

  1. The genotyping (run with --recall for the cactus graph but without --recall for the VCF graph) picks up not only SVs but also SNVs and indels, all of which get included in the sample graph. As a consequence, the sample graph represents the sample much better than the original graph does, and more reads get mapped.
  2. The original VCF graph might be more repetitive because it contains the variation from several strains. And unlike in the cactus graph, each variant exists as a separate branch off the reference path. Due to this repeat content, many reads might be mapped with mapq=0 to the full graph but not to the sample graph, where much of it is removed.

I'm not sure which conclusions to draw from this. One problem is that the cactus graph gets a head start because it contains all types of variation and not only SVs. I do not see a way around this in our experiments, but it's one more reason to move the mapping evaluation plot to the supplement: it says little beyond the fact that the set of variants included in the VCF is incomplete. It is still a good point in favor of the cactus approach, though, because that approach does not require running different variant callers to obtain all these different types of variants.

@glennhickey
Collaborator

glennhickey commented May 8, 2019 via email

@eldariont
Collaborator Author

Thanks Glenn,
I included --recall only for the cactus graph because that was the best parameter setting for each graph: with --recall for cactus, without it for VCF. In that sense it is a fair comparison, because it reflects the structural difference between the graphs. Running both without --recall would make the results on the cactus graph slightly worse, and I'm also a bit hesitant to re-run everything at this late stage.

I'm normalizing both graphs with vg mod -X 32 but not with vg -U 10. Is the missing part the important bit? Changing it would require re-running everything, though, and would probably take some time.

@glennhickey
Collaborator

glennhickey commented May 8, 2019 via email
