Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancy between graph length, reference length, and novel base pairs #25

Closed
CCLittlefield opened this issue Jun 6, 2024 · 3 comments

Comments

@CCLittlefield
Copy link

Hi,

I have created a pangenome using the minigraph-cactus pipeline in a manner very similar to how the HPRC created their pangenome (using T2T-CHM13 as the primary reference and including GRCh38 as an assembly), I then ran panacus and identified the amount of base pairs added per assembly, and in total. The total amount of basepairs added (coverage 1, quorum 0, last row) was 0.489 Gb. The length of the T2T-CHM13 reference is 3.11 Gb and the total length of the graph (determined using odgi stats) is 3.38 Gb. Am I wrong in thinking that the length of the reference, 3.11 Gb, plus the length of total bp added, 0.489 Gb, should equal the total length of the graph (3.38 Gb)? Or do I misunderstand? I've tried a variety of parameters and still get the same results.
The code:

grep -e '^W' $gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' > corrected.paths.txt

grep -e 'GRCh38' corrected.paths.txt > corrected.paths.grch38.txt

grep -ive 'grch38|chm13' corrected.paths.txt > corrected.paths.haplotypes.txt

panacus ordered-histgrowth --count bp --order $order --exclude corrected.paths.grch38.txt --subset corrected.paths.haplotypes.txt --output-format table --coverage 1,2,4,67 --groupby-sample --threads $SLURM_CPUS_ON_NODE $gfa > $gfa.ordered-histgrowth.bp.tsv

@danydoerr
Copy link
Member

danydoerr commented Jun 6, 2024

From your code, it looks like you exclude GRCh38, and compute growth for the subset, omitting GRCh38 and CHM13. Then you reason about the size of the total graph, but taking CHM13 as baseline. Even if your subset omits CHM13, it still counts nodes/bps covered by CHM13 if they are covered also by haplotypes of the subset. So this is the difference between exclude and subset: The exclude option makes sure that no bp of the specified paths is ever counted, while the subset option does count bps covered by the subset.

@danydoerr
Copy link
Member

In other words, if you exclude CHM13 rather than GRCh38, this should check out.

@CCLittlefield
Copy link
Author

Thank you for your quick response, It seems I misunderstood the function of the subset and exclude options. I reran the analysis and all the numbers check out. Thanks again for your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants