how to construct pangenome graph from two assembly genome of two species #147

Open
Aannaw opened this issue Dec 27, 2021 · 6 comments

@Aannaw

Aannaw commented Dec 27, 2021

Hello, Professor,
I have assembled genomes for two species and want to construct a graph from them using PGGB. I first combined the two genomes into input.fasta, then indexed it, and finally ran PGGB. Following the human example in the README, my commands are:
cat A.fasta B.fasta > input.fasta
samtools faidx input.fasta
pggb -i input.fasta -s 100000 -p 70 -t 40 -v -L -U -o out -T 20 -n 2 -H 2 -G 20000
I used "mash dist" to calculate the divergence between A.fasta and B.fasta. The result is:
A.fasta B.fasta 0.0181313 0 519/1000
I do not know how to convert this result to an approximate percent identity to provide as -p, or how to adapt the parameters -k, -s, and -G. I set -n to 2 to match the number of assembled genomes. I cannot run by chromosome because I cannot identify the corresponding chromosomes between the two genomes.
Also, can you tell me the expected memory and CPU usage? My two genomes are about 3G in size.
I would appreciate any suggestions you could give me.
Looking forward to your reply.

@ekg
Collaborator

ekg commented Dec 27, 2021

Thank you for reaching out with your questions. Here are some suggestions.

The mash distance estimate would be matched by setting -p 90 or -p 95.
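
As a rough sketch of how the mash output in the question relates to a percent identity: the third column of `mash dist` is a distance D, and a common approximation (an assumption here, not a rule stated in this thread) treats identity as roughly 100 × (1 − D). The helper name `mash_to_identity` is hypothetical.

```shell
# Hypothetical helper: approximate percent identity from a Mash distance,
# ASSUMING identity ~= 100 * (1 - D). This is an approximation, not a pggb rule.
mash_to_identity() {
    awk -v d="$1" 'BEGIN { printf "%.2f\n", 100 * (1 - d) }'
}

# Distance reported in the question (third column of the mash dist output).
mash_to_identity 0.0181313
```

For the distance reported above this gives roughly 98.2%, so a setting of -p 90 or -p 95 sits below that estimate, which presumably leaves some margin for regions that are more diverged than the genome-wide average.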

You might want to set -n 1. I'm not certain that, for two genomes (or other small numbers), it should be == -H. We tend to "oversaturate" the mapping slightly when working with larger numbers of genomes (e.g. setting number of mappings == number of genomes), but here it might be best to set -n 1 -H 2. Frankly, I think that might make sense as best practice going forward, but I haven't yet tested it.

I'll try to work your feedback into documentation about how to go from mash dist to settings. Really, we want to automate this, and it's probably possible to do for the mash distance. The segment length determination is somewhat arbitrary, because it depends on the lengths of homologies that you want to support as approximately linear in the graph.

-G 20000 might cause you to run out of memory. abPOA (at least as we're running it) has some quadratic memory costs in the length of the segment. For HPRC and work on mouse (20-90 haplotypes) we use -G 13117,13219 on a system with 386GB of RAM and tend to use around half of memory (~150G) at peak. The successive numbers indicate two smoothxg passes with different target abPOA lengths. In practice, these passes really help normalize the graph well.
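
To get a feel for why -G 20000 is riskier than -G 13117, here is a back-of-the-envelope sketch. It assumes memory grows purely quadratically with the target length, which is a simplification of "some quadratic memory costs"; real usage also depends on other factors such as the number of sequences. The helper name `poa_memory_factor` is hypothetical.

```shell
# Hypothetical helper: relative memory factor when the target POA length
# changes from $1 to $2, ASSUMING memory scales purely quadratically in length.
poa_memory_factor() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (b / a) ^ 2 }'
}

# Compare -G 20000 with the first pass length quoted above (-G 13117).
poa_memory_factor 13117 20000
```

Under that assumption, going from 13117 to 20000 would a bit more than double the quadratic part of the memory cost.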

I would expect this process to take a day or so on one system, possibly less given that you just have two genomes. For human and mouse I've been partitioning the contigs by chromosome, but it should be fine to directly build the graph from everything.
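
Partitioning by chromosome, as mentioned above, could be sketched roughly as follows. This is only an illustration under assumptions: sequence names in each assembly are first prefixed with a sample label (so `>Chr1` becomes `>A#Chr1`), and homologous chromosomes share the same name in both assemblies. The helper names and file names are hypothetical, and the sketch does not decide what to do with unlocalized contigs.

```shell
# Hypothetical helpers, ASSUMING homologous chromosomes share a name
# (e.g. Chr1) in both assemblies.

# Prefix every FASTA header in file $2 with the sample label $1.
prefix_fasta() {
    sed "s/^>/>${1}#/" "$2"
}

# Print only the records of file $2 whose name ends in "#<chromosome $1>".
extract_chrom() {
    awk -v chr="$1" '/^>/ { keep = ($1 ~ ("#" chr "$")) } keep' "$2"
}

# Example usage (file names are placeholders):
#   prefix_fasta A A.fasta >  input.fasta
#   prefix_fasta B B.fasta >> input.fasta
#   extract_chrom Chr1 input.fasta > Chr1.fasta
```

Prefixing the names keeps the two copies of Chr1 distinguishable after concatenation, and anchoring the match at the end of the name avoids pulling in Chr10 when Chr1 is requested.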

Let me know how it goes and if you need any more hints. I'll try to roll your perspective into an update to the pggb documentation.

@Aannaw
Author

Aannaw commented Dec 28, 2021

Hello, Professor,
Thanks for your reply!
I have just run the command you recommended:
pggb -i input.fasta -s 100000 -p 95 -t 40 -v -L -U -o out -T 20 -n 1 -H 2 -G 13117,13219
I am still confused about partitioning the contigs by chromosome. Each of my two genomes was scaffolded into several pseudo-chromosomes plus some unlocalized contigs. After combining the two genomes, there is a pair of Chr1, Chr2, Chr3, ... in the combined input.fasta. If I choose to run pggb by chromosome, should I put the pair of Chr1 sequences into a Chr1.fasta? And what about the unlocalized contigs?
I also have another question. When I run the command above, the first step, wfmash, seems to take a long time. The temporary file it generates, wfmash-KnfItp, is empty. When I checked with htop, the process state was sleeping. Is it out of memory?
Best wishes!

@ekg
Collaborator

ekg commented Dec 28, 2021 via email

@AndreaGuarracino
Member

Hi @Aannaw, can you share your input (input.fasta) and specify which version (which commit) of pggb you are using?

@Aannaw
Author

Aannaw commented Dec 29, 2021

Hello, Professors,
I am sorry for the delayed reply. It seems to work: according to htop, it is now running the command "smoothxg -t 40 -T 20 -g out/input.fasta.15ccfd3.2ff309f.seqwish.gfa -w 26234 -K -X 100 -I 0.95 -R 0 -j 0 -e 0 -l 13117 -p 1,19,39,3,81,1,1 -o 0.03 -Y 200 -d 0 -D 0 -V -o out/input.fasta.15ccfd3.2ff309f.6754527.smooth.1.gfa". The file input.fasta.15ccfd3.2ff309f.6754527.smooth.1.gfa has not been generated yet. I initially set -T 20 because I saw in the README that smoothxg consumes a huge amount of memory in the POA step, so I used -T to reduce the number of threads.
@AndreaGuarracino
Hello, Professor,
I installed noarch/pggb-0.2.0-hdfd78af_0.tar.bz2 via conda. Because input.fasta is large, I will send it to you by email if needed.
Thanks again for your help.
Best wishes!

@Aannaw
Author

Aannaw commented Dec 30, 2021

Hello, Professor,
The final .smooth.gfa graph was generated, and the 1D and 2D visualizations were generated as well.
It seems that there are mappings between pseudo-chromosomes or scaffolds of the same assembly. Is that right? Is there any explanation or assessment of the output graph (.smooth.gfa)? Should I perhaps run pggb separately on each pair of corresponding pseudo-chromosomes or scaffolds from the two genomes?
Looking forward to your reply.
Best wishes!
[Image: Untitled 1640857306]
[Image: Ma6-Mp5 all fata 15ccfd3 2ff309f 6754527 smooth og viz_inv]
[Image: Ma6-Mp5 all fata 15ccfd3 2ff309f 6754527 smooth og lay draw]
