Discrepancy between graph length, reference length, and novel base pairs #25

CCLittlefield · 2024-06-06T14:28:46Z

Hi,

I have created a pangenome using the minigraph-cactus pipeline in a manner very similar to how the HPRC created their pangenome (using T2T-CHM13 as the primary reference and including GRCh38 as an assembly), I then ran panacus and identified the amount of base pairs added per assembly, and in total. The total amount of basepairs added (coverage 1, quorum 0, last row) was 0.489 Gb. The length of the T2T-CHM13 reference is 3.11 Gb and the total length of the graph (determined using odgi stats) is 3.38 Gb. Am I wrong in thinking that the length of the reference, 3.11 Gb, plus the length of total bp added, 0.489 Gb, should equal the total length of the graph (3.38 Gb)? Or do I misunderstand? I've tried a variety of parameters and still get the same results.
The code:

grep -e '^W' $gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' > corrected.paths.txt

grep -e 'GRCh38' corrected.paths.txt > corrected.paths.grch38.txt

grep -ive 'grch38|chm13' corrected.paths.txt > corrected.paths.haplotypes.txt

panacus ordered-histgrowth --count bp --order $order --exclude corrected.paths.grch38.txt --subset corrected.paths.haplotypes.txt --output-format table --coverage 1,2,4,67 --groupby-sample --threads $SLURM_CPUS_ON_NODE $gfa > $gfa.ordered-histgrowth.bp.tsv

danydoerr · 2024-06-06T14:49:42Z

From your code, it looks like you exclude GRCh38, and compute growth for the subset, omitting GRCh38 and CHM13. Then you reason about the size of the total graph, but taking CHM13 as baseline. Even if your subset omits CHM13, it still counts nodes/bps covered by CHM13 if they are covered also by haplotypes of the subset. So this is the difference between exclude and subset: The exclude option makes sure that no bp of the specified paths is ever counted, while the subset option does count bps covered by the subset.

danydoerr · 2024-06-06T14:54:32Z

In other words, if you exclude CHM13 rather than GRCh38, this should check out.

CCLittlefield · 2024-06-07T02:05:21Z

Thank you for your quick response, It seems I misunderstood the function of the subset and exclude options. I reran the analysis and all the numbers check out. Thanks again for your help.

CCLittlefield closed this as completed Jun 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discrepancy between graph length, reference length, and novel base pairs #25

Discrepancy between graph length, reference length, and novel base pairs #25

CCLittlefield commented Jun 6, 2024

danydoerr commented Jun 6, 2024 •

edited

Loading

danydoerr commented Jun 6, 2024

CCLittlefield commented Jun 7, 2024

Discrepancy between graph length, reference length, and novel base pairs #25

Discrepancy between graph length, reference length, and novel base pairs #25

Comments

CCLittlefield commented Jun 6, 2024

danydoerr commented Jun 6, 2024 • edited Loading

danydoerr commented Jun 6, 2024

CCLittlefield commented Jun 7, 2024

danydoerr commented Jun 6, 2024 •

edited

Loading