how to construct pangenome graph from two assembly genome of two species #147
Comments
Thank you for reaching out with your questions. Here are some suggestions.

The mash distance estimate would be matched by setting -p 90 or -p 95. You might want to set -n 1; I'm not certain that for two genomes (or other small numbers) it should be == -H. We tend to "oversaturate" the mapping slightly when working with larger numbers of genomes (e.g. setting the number of mappings == the number of genomes), but here it might be best to set -n 1 -H 2. Frankly, I'm thinking that might make sense as best practice going forward, but I haven't yet tested it.

I'll try to work your feedback into documentation about how to go from mash dist to settings. Really, we want to automate this, and it's probably possible to do for the mash distance. The segment length determination is somewhat arbitrary, because it depends on the lengths of homologies that you want to support as approximately linear in the graph.

-G 20000 might cause you to run out of memory. abPOA (at least as we're running it) has some quadratic memory costs in the length of the segment. For HPRC and work on mouse (20-90 haplotypes) we use -G 13117,13219 on a system with 386GB of RAM and tend to use around half of memory (~150G) at peak. The successive numbers indicate two smoothxg passes with different target abPOA lengths. In practice, these passes really help normalize the graph well.

I would expect this process to take a day or so on one system, possibly less given that you just have two genomes. For human and mouse I've been partitioning the contigs by chromosome, but it should be fine to directly build the graph from everything.

Let me know how it goes and if you need any more hints. I'll try to roll your perspective into an update to the pggb documentation.
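To make the quadratic-memory remark concrete, here is a minimal sketch that compares two -G values under the assumption that the relevant memory term grows exactly with the square of the target length. The function name `poa_mem_ratio` is hypothetical, and the result is only a relative factor for that one term, not a prediction of total memory use.

```shell
# poa_mem_ratio A B
# Prints (A/B)^2: how much larger a purely quadratic cost term would be
# at target length A compared with target length B (an assumption, not
# a measurement of abPOA or smoothxg).
poa_mem_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { r = a / b; printf "%.2f\n", r * r }'
}

# Compare the -G 20000 from the question with the 13219 used above.
poa_mem_ratio 20000 13219
```

Under that assumption, -G 20000 would make the quadratic term a bit more than twice as large as -G 13219.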
Hello,Professor |
I'm not sure what would cause a stall at that stage. That's very strange.
What does the output log say it's doing?
It is strongly recommended that you run pggb in a directory on a fast disk,
ideally an SSD. A slow disk can cause apparent stalls.
Did you run out of memory during smoothxg? I'm curious why you set -T 20 to
reduce the parallelism of that step.
…On Tue, Dec 28, 2021, 05:17 Aannaw ***@***.***> wrote:
Hello, Professor
Thanks for your reply!
I have just run the command you recommended:
pggb -i input.fasta -s 100000 -p 95 -t 40 -v -L -U -o out -T 20 -n 1 -H 2
-G 13117,13219
I still have some confusion about partitioning the contigs by chromosome.
My two initial genomes are scaffolded into several pseudo-chromosomes and
some unlocalized contigs. After combining the two genomes, there is a pair
of each of Chr1, Chr2, Chr3, ... in the combined input.fasta. If I choose to
run pggb by chromosome, should I put the pair of Chr1 sequences into a
Chr1.fasta? And what about the unlocalized contigs?
Also, I have another question. When I run the command above, the first
step, wfmash, seems to take a long time. The temporary file it generates,
wfmash-KnfItp, is empty. When I used htop to check, the process state
is sleeping. Is it out of memory?
Best wishes!
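
For the per-chromosome question above, one possible way to collect a pair of matching sequences into a single file is sketched below. It assumes, purely for illustration, that each header name ends in the chromosome label after a `#` separator (for example the hypothetical names `A#Chr1` and `B#Chr1`); the function name `extract_chromosome` is also hypothetical. The sketch does not decide what to do with unlocalized contigs.

```shell
# extract_chromosome FASTA LABEL
# Prints every FASTA record whose sequence name (first word of the header,
# taken after the last '#') is exactly LABEL.
extract_chromosome() {
  awk -v label="$2" '
    /^>/ {
      # Split the name on "#" and compare only the final part,
      # so that Chr1 does not also match Chr11.
      n = split(substr($1, 2), parts, "#")
      keep = (parts[n] == label)
    }
    keep { print }
  ' "$1"
}

# Example use (assumes input.fasta exists with names like A#Chr1, B#Chr1):
# extract_chromosome input.fasta Chr1 > Chr1.fasta
```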
Hi @Aannaw, can you share your input (input.fasta) and specify which version (which commit) of |
Hello,Professors |
Hello, Professor
I have assembled genomes of two species, and I want to construct a graph from the two genomes using PGGB. I first combine the two genomes into input.fasta, then index it, and finally run PGGB. Following the human example in the README, my commands are:
cat A.fasta B.fasta > input.fasta
samtools faidx input.fasta
pggb -i input.fasta -s 100000 -p 70 -t 40 -v -L -U -o out -T 20 -n 2 -H 2 -G 20000
I have used "mash dist" to calculate the divergence between A.fasta and B.fasta. The result is:
A.fasta B.fasta 0.0181313 0 519/1000
I do not know how to convert this result to the approximate percent identity to provide as -p, or how to adapt the parameters "-k -s -G". I set -n to 2 according to the number of my assembled genomes. I cannot run by chromosome because I cannot identify the corresponding chromosomes between the two genomes.
Also, can you tell me the expected memory and CPU usage? The size of my two genomes is about 3G.
I would appreciate any suggestions.
Looking forward to your reply.
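
As a rough illustration of the conversion asked about here, the sketch below turns the third column of `mash dist` output (the distance) into an approximate percent identity using 100 × (1 − distance). This is a simple approximation, not an official pggb rule, and the function name `mash_identity` is hypothetical. For the line above it gives about 98.19; the value actually passed to -p may reasonably be set lower than this estimate to leave some margin.

```shell
# mash_identity
# Reads `mash dist` output on stdin (reference, query, distance, p-value,
# shared hashes) and prints reference, query and 100 * (1 - distance).
mash_identity() {
  awk '{ printf "%s\t%s\t%.2f\n", $1, $2, (1 - $3) * 100 }'
}

# Using the result line from the question:
printf 'A.fasta\tB.fasta\t0.0181313\t0\t519/1000\n' | mash_identity
```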