File format and list #48
Comments
Hi, You can have both file types in the same list! Also, I have never tried using JGI files, so I hope everything will go fine! If not, please do not hesitate to open a new issue. Sincerely,
Hi, It's been a while. Thank you for your help before. I was able to run the fasta files and got the results. I then tried to run the workflow with the annotated files. content/PPanGGOLiN/testingDataset/GenomeData The above exception was the direct cause of the following exception: Traceback (most recent call last):
Also, since I could run my fasta list, do I need to run the annotated files separately? Kindly let me know. Thank you.
Hi, I will try to replicate the issue. Would it be possible for you to share the first gff in your list, the one that raised the error? In the meantime, I will try to download some genomes from JGI to see whether this has to do with the way they format gff files. As for your question, it depends on what you are trying to achieve with the pangenomic analysis. If you want to keep the same gene identifiers as in your annotated genomes, for example to see which partitions they fall in or whether they are in genomic islands, then using your annotated files is probably better. On the other hand, if you are building a pangenome to get a general idea of the dynamics of your species and its composition in terms of persistent / shell / cloud genes, using only the fasta files is perfectly fine. Adelme
I did manage to replicate the issue with a genome downloaded from JGI; it is indeed linked to the missing "##sequence-region" pragma.
The latest commit on the master branch should fix the issue. I noticed that gff files from JGI do not contain the fasta sequences, meaning that if you want to use the annotations of your genomes and not just the fasta files, you will have to provide the fasta files anyway, as such:
Because ppanggolin needs to get the sequences somehow. Thank you for opening this issue; I hope everything will work from now on! Adelme
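The two observations above (the missing "##sequence-region" pragma and the absence of embedded sequences in JGI gff files) can be checked on a file before running the workflow. The following is a minimal Python sketch, not part of PPanGGOLiN; `inspect_gff` is a hypothetical helper name, and it assumes the GFF3 convention that embedded sequences follow a `##FASTA` directive.

```python
def inspect_gff(path):
    """Report whether a GFF3 file declares sequence regions and embeds FASTA.

    Hypothetical helper for illustration only; it is not a PPanGGOLiN function.
    """
    has_region = False
    has_fasta = False
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("##sequence-region"):
                has_region = True
            elif line.startswith("##FASTA"):
                has_fasta = True
                break  # everything after this directive is sequence data
    return {"sequence_region": has_region, "embedded_fasta": has_fasta}
```

If `embedded_fasta` is false for a file, that genome needs a matching entry in the fasta list so the sequences can be read from there.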
Thank you so much for already working on this issue. I am running it on Google Colab, where I call ppanggolin. Since you have fixed the issue, I think I can just copy it again and rerun. It should work. Thank you again. I will run it and let you know soon.
One more quick question: should I use the command line you mentioned? Previously, I had only used either a gbff or a fasta list to run the workflow. I have the fasta and annotation files (gbff and gff) listed in the same order.
To be honest, I have no idea how Google Colab works, so you probably know much better than I do how to make it work in this case. Yes, if you wish to use the annotations. It is for the genomes in gff files, since the gff files from JGI do not have fasta sequences.
Thank you, Adelme. I tried the command line, but it shows the error below.
I then removed the fasta line and entered only this: Now it showed the following error: I think I must enter the previous command, but I do not know why it is not accepted. Thank you.
No problem don't worry ! My guess is that you may have written |
Thank you so much. It was my blunder. The command was accepted, but now it shows this.
I have two lists: fasta (114 files) and gbff (114 files, some of which are gff files from JGI).
The first lines are very strange, as there is no call to 'mv' in PPanGGOLiN, and they are not raised by ppanggolin. For the rest, if you indeed have 114 lines in each of your file lists, it may be related to the first problem. If not, I will look at it on Monday when I have access to a Linux machine.
Hi, Looking at the code, it seems possible to me that the names in 'organisms.gbff.list' and 'organisms.fasta.list' are different. Is that the case? If so, the entries that correspond to the same genome should carry the same name in both files; otherwise the program cannot tell which entry is supposed to be associated with which. Adelme
Hi, `/content/PPanGGOLiN/testingDataset/GenomeData During handling of the above exception, another exception occurred: Traceback (most recent call last): During handling of the above exception, another exception occurred: Traceback (most recent call last): |
Hi, Adelme |
organisms.gbff.list.txt Thank you for your reply. I dragged the files here. I hope you can access them. Thank you so much.
Thank you, I see the problem. Basically, you have what looks like file extensions and some descriptions in the genome names (first column), which makes the genome names differ between the two lists. For example, the first line of organisms.fasta.list looks like this:
And the first line of organisms.gbff.list looks like this: I see that both are for the same genome (GCF_000010665.1), but since the entire strings differ, the names are considered different. You should use 'GCF_000010665.1' or 'GCF_000010665.1_ASM1066v1', for example, as the genome name in both files, so that the program can know that those two lines refer to the same genome. This should be done for all genomes for ppanggolin to work. The names in the first column of both files need to be strictly identical. Adelme
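The requirement that first-column names be strictly identical in both lists can be verified with a short script before launching ppanggolin. This is a minimal sketch under the assumption that each list is tab-separated with the genome name in the first column; `read_genome_names` and `compare_name_columns` are hypothetical helper names, not PPanGGOLiN functions.

```python
def read_genome_names(list_path):
    """Return the first (genome name) column of a tab-separated list file."""
    names = []
    with open(list_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.strip():  # skip blank lines
                names.append(line.split("\t")[0])
    return names


def compare_name_columns(fasta_list, anno_list):
    """Return names found only in the fasta list and only in the annotation list."""
    fasta_names = set(read_genome_names(fasta_list))
    anno_names = set(read_genome_names(anno_list))
    only_fasta = sorted(fasta_names - anno_names)
    only_anno = sorted(anno_names - fasta_names)
    return only_fasta, only_anno
```

Calling `compare_name_columns("organisms.fasta.list", "organisms.gbff.list")` should return two empty lists once every genome carries the same name in both files; any name that appears in either result needs to be corrected.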
Oh, I understand now. Thank you so much. I had only matched the numbers and assumed the rest was correct. I am correcting those names, and I will check the entire list and try again. Thank you. I highly appreciate your help.
Hi, I could run it; however, the output is large and the pangenome file could not be produced. I think my effort to run it on Google Colab was not worth it. We recently got an Ubuntu system. I will try to run it locally so that there is no restriction on disk usage. Do you have instructions for installing it on an Ubuntu system? I would be very thankful. Thank you.
Hi, |
Thank you for your reply. I installed all the requirements. After I run this command |
The same problem was encountered in #60; it is due to the Python version of the installed conda. The following commands solved their problem, so they should work for you too:
This will create the environment 'ppanggolin_env' and install ppanggolin in it. Adelme
Thank you. It worked. I think I have installed it successfully. I see the ppanggolin_env folder. It might be a silly question, but I do not know where the folder is in which I can replace the fasta and gbff files. There is no testingDataset folder inside the ppanggolin_env folder. Can you please help me locate it? Thank you.
The commands just installed the software; they did not download the testingDataset files, which are not needed. To use it, you can use You can place your gbff or fasta files anywhere you want, as long as the paths in your organisms.list are valid from your working directory.
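Whether the paths in a list file resolve from the working directory can be checked up front. Below is a minimal sketch that assumes a tab-separated list with the genome name in the first column and the file path in the second; `missing_paths` is a hypothetical helper name, not a PPanGGOLiN function.

```python
from pathlib import Path


def missing_paths(list_path, workdir="."):
    """Return (genome name, path) pairs whose file does not exist.

    Relative paths are resolved against `workdir`, the directory from which
    the analysis would be launched.
    """
    missing = []
    with open(list_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip blank or malformed lines
            name, file_path = fields[0], Path(fields[1])
            if not file_path.is_absolute():
                file_path = Path(workdir) / file_path
            if not file_path.is_file():
                missing.append((name, str(file_path)))
    return missing
```

An empty result means every listed genome file can be found from the chosen working directory.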
Hi, |
Hi,
I am new to using code from GitHub to run a pangenome analysis. I am using more than 50 genomes, and some of them are from JGI, so for those I have .gff and .fna files. Most of my genomes are from NCBI, for which I could easily get .gbff and .fna files. I updated the gbff and fasta files and the list of organisms. Now, should I list the genome files from JGI in a separate file, or can I put them in the same gbff list? If I make another list, I think I may have to include this information somewhere in the code. I am sorry for the very basic questions. I am trying to learn this. Thank you so much for your help.
Sincerely,
Shailabh