How to construct a pangenome graph from two assembled genomes of two species #147

Aannaw · 2021-12-27T08:26:41Z

Hello, Professor,
I have assembled genomes from two species and want to construct a graph from them using PGGB. I first combine the two genomes into input.fasta, then index it, and finally run PGGB. Following the human example in the README, my commands are:
cat A.fasta B.fasta > input.fasta
samtools faidx input.fasta
pggb -i input.fasta -s 100000 -p 70 -t 40 -v -L -U -o out -T 20 -n 2 -H 2 -G 20000
I have used "mash dist" to calculate divergence between A.fasta and B.fasta. The result is:
A.fasta B.fasta 0.0181313 0 519/1000
I do not know how to convert this result into an approximate percent identity to pass as -p, or how to adapt the -k, -s, and -G parameters. I set -n to 2 to match the number of assembled genomes. I cannot run by chromosome because I cannot identify the corresponding chromosomes between the two genomes.
Also, could you tell me what memory and CPU usage to expect? My two genomes are about 3G in size.
I would appreciate any suggestions.
Looking forward to your reply.

ekg · 2021-12-27T14:07:51Z

Thank you for reaching out with your questions. Here are some suggestions.

The mash distance estimate would be matched by setting -p 90 or -p 95.
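
For reference, the conversion asked about can be sketched as follows. The third column of the mash dist output (0.0181313 here) is the estimated distance D, and a common rough reading is percent identity ≈ 100 × (1 − D). The helper name mash_to_identity is made up for illustration, and this is a back-of-the-envelope estimate, not something pggb computes for you.

```shell
# Hypothetical helper (not part of mash or pggb): turn a mash distance
# into an approximate percent identity, assuming identity ~ 1 - distance.
mash_to_identity() {
    # LC_ALL=C keeps awk's decimal point as "." regardless of locale.
    LC_ALL=C awk -v d="$1" 'BEGIN { printf "%.2f\n", 100 * (1 - d) }'
}

# Distance reported in this thread for A.fasta vs B.fasta.
mash_to_identity 0.0181313
```

For the distance above this comes out at roughly 98% identity, so -p 90 or -p 95 sits some way below the genome-wide average.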

You might want to set -n 1. I'm not certain that, for two genomes (or other small numbers), it should be == -H. We tend to "oversaturate" the mapping slightly when working with larger numbers of genomes (e.g. setting number of mappings == number of genomes), but here it might be best to set -n 1 -H 2. Frankly, I'm thinking that might make sense as best practice going forward, but I haven't yet tested it.

I'll try to work your feedback into documentation about how to go from mash dist to settings. Really, we want to automate this, and it's probably possible to do so for the mash distance. The segment length determination is somewhat arbitrary, because it depends on the lengths of homologies that you want to support as approximately linear in the graph.

-G 20000 might cause you to run out of memory. abPOA (at least as we're running it) has some quadratic memory costs in the length of the segment. For HPRC and work on mouse (20-90 haplotypes) we use -G 13117,13219 on a system with 386GB of RAM and tend to use around half of memory (~150G) at peak. The successive numbers indicate two smoothxg passes with different target abPOA lengths. In practice, these passes really help normalize the graph well.

I would expect this process to take a day or so on one system, possibly less given that you just have two genomes. For human and mouse I've been partitioning the contigs by chromosome, but it should be fine to directly build the graph from everything.
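
One possible way to do the per-chromosome partitioning mentioned above is sketched below. It assumes both assemblies use matching pseudo-chromosome names (for example Chr1 in each), which may not hold for every dataset, and it prefixes each sequence name with a sample tag in the PanSN style (sample#haplotype#contig) so the two copies stay distinguishable. The tags A and B, the haplotype id 1, the helper names prefix_fasta and select_chrom, and the file names in the comments are all placeholders.

```shell
# Hypothetical helper: prefix every FASTA header with "<sample>#1#".
# $1 = sample tag; reads FASTA on stdin, writes renamed FASTA on stdout.
prefix_fasta() {
    sed "s/^>/>$1#1#/"
}

# Hypothetical helper: keep only records whose sequence name ends in
# "#<chrom>", e.g. "#Chr1" (so "Chr10" is not picked up by "Chr1").
select_chrom() {
    awk -v suffix="#$1" '
        /^>/ {
            name = substr($1, 2)    # first word of the header, without ">"
            keep = (length(name) >= length(suffix) &&
                    substr(name, length(name) - length(suffix) + 1) == suffix)
        }
        keep'
}

# Example usage (file names are placeholders):
#   prefix_fasta A < A.fasta >  renamed.fasta
#   prefix_fasta B < B.fasta >> renamed.fasta
#   select_chrom Chr1 < renamed.fasta > Chr1.fasta
#   samtools faidx Chr1.fasta
```

This only covers sequences with a matching name in both assemblies; it does not decide where unplaced contigs should go.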

Let me know how it goes and if you need any more hints. I'll try to roll your perspective into an update to the pggb documentation.

Aannaw · 2021-12-28T04:17:36Z

Hello, Professor,
Thanks for your reply!
I have just run the command you recommended:
pggb -i input.fasta -s 100000 -p 95 -t 40 -v -L -U -o out -T 20 -n 1 -H 2 -G 13117,13219
I am still somewhat confused about partitioning the contigs by chromosome. My two initial genomes are scaffolded into several pseudo-chromosomes plus some unlocalized contigs. After combining the two genomes, the combined input.fasta contains a pair of each of Chr1, Chr2, Chr3, and so on. If I choose to run pggb by chromosome, should I put the pair of Chr1 sequences into a Chr1.fasta? And what should I do with the unlocalized contigs?
I also have another question. When I run the command above, the first step, wfmash, seems to take a long time. The temporary file it generates, wfmash-KnfItp, is empty. When I checked with htop, the process state was sleeping. Is it out of memory?
Best wishes!

ekg · 2021-12-28T06:22:01Z

I'm not sure what would cause a stall at that stage. That's very strange. What does the output log say it's doing? It is strongly recommended that you run pggb in a directory on fast disk, ideally an SSD; a slow disk can cause apparent stalls. Did you run out of memory during smoothxg? I'm curious why you set -T 20 to reduce the parallelism of that step.


AndreaGuarracino · 2021-12-28T09:10:40Z

Hi @Aannaw, can you share your input (input.fasta) and specify which version (which commit) of pggb you are using?

Aannaw · 2021-12-29T07:24:39Z

Hello, Professors,
I am sorry for the delayed reply. It seems to be working: according to htop, it is now running the command "smoothxg -t 40 -T 20 -g out/input.fasta.15ccfd3.2ff309f.seqwish.gfa -w 26234 -K -X 100 -I 0.95 -R 0 -j 0 -e 0 -l 13117 -p 1,19,39,3,81,1,1 -o 0.03 -Y 200 -d 0 -D 0 -V -o out/input.fasta.15ccfd3.2ff309f.6754527.smooth.1.gfa". The file input.fasta.15ccfd3.2ff309f.6754527.smooth.1.gfa has not been generated yet. I initially set -T 20 because I read in the README that smoothxg consumes a huge amount of memory in the POA step, so I set -T to reduce the number of threads.
@AndreaGuarracino
Hello, Professor,
I installed noarch/pggb-0.2.0-hdfd78af_0.tar.bz2 with conda. Because input.fasta is large, I can send it to you by email if needed.
Thanks again for your help.
Best wishes!

Aannaw · 2021-12-30T09:48:01Z

Hello, Professor,
The final .smooth.gfa graph has been generated, along with the 1D and 2D visualizations.
There seem to be mappings between pseudo-chromosomes or scaffolds of the same assembly. Is that right? Is there any explanation of, or guidance for assessing, the output graph (.smooth.gfa)? Should I perhaps run pggb independently on each pair of matching pseudo-chromosomes or scaffolds from the two genomes?
Looking forward to your reply.
Best wishes!
