File format and list #48

Closed
shailabhr opened this issue Jan 8, 2021 · 26 comments

Comments

@shailabhr

Hi,
I am new to using GitHub code to run a pangenome analysis. I am using more than 50 genomes. Some of them are from JGI, so for those I have .gff and .fna files. Most of my genomes are from NCBI, for which I could easily get .gbff and .fna files. I updated the gbff and fasta files and the list of organisms. Now, should I list the genome files from JGI in a separate file, or can I put them in the same gbff list? If I make another list, I think I may have to include this information somewhere in the code. I am sorry for such basic questions; I am still learning. Thank you so much for your help.
Sincerely,
Shailabh

@axbazin
Member

axbazin commented Jan 8, 2021

Hi,

You can have both filetypes in the same list !
Normally you should be fine with the gbff and the gff files, if your gff files have the fasta sequences in them (gbff files have them by standard). I do not know the policy of JGI about those.

Also, I have never tried using JGI files, so I hope everything will go fine! If not, please do not hesitate to write a new issue.

Sincerely,
Adelme

@shailabhr
Author

Hi, it's been a while. Thank you for your help before. I could run the fasta files and got the results. I was then trying to run the workflow with the annotated files.
I combined all the GBFF and GFF files in the same list and tried to run it, but it did not work. It shows the following error. It stopped at the point where the file type changes from gbff to gff. I hope you can help me. Thank you so much.

content/PPanGGOLiN/testingDataset/GenomeData
2021-02-06 18:02:26 main.py:l180 INFO Command: /usr/local/bin/ppanggolin workflow --anno organisms.gbff.list
2021-02-06 18:02:26 main.py:l181 INFO PPanGGOLiN version: 1.1.121
2021-02-06 18:02:26 annotate.py:l329 INFO Reading organisms.gbff.list the list of organism files ...
45% 51/114 [00:20<00:29, 2.12file/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.121-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 317, in launchReadAnno
return readAnnoFile(*args)
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.121-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 322, in readAnnoFile
return read_org_gff(organism_name, filename, circular_contigs, getSeq, pseudo)
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.121-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 275, in read_org_gff
if contig.name != gff_fields[GFF_seqname]:
UnboundLocalError: local variable 'contig' referenced before assignment
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/ppanggolin", line 33, in <module>
sys.exit(load_entry_point('ppanggolin==1.1.121', 'console_scripts', 'ppanggolin')())
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.121-py3.6-linux-x86_64.egg/ppanggolin/main.py", line 191, in main
ppanggolin.workflow.workflow.launch(args)
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.121-py3.6-linux-x86_64.egg/ppanggolin/workflow/workflow.py", line 29, in launch
readAnnotations(pangenome, args.anno, cpu = args.cpu, getSeq = getSeq, show_bar=args.show_prog_bars)
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.121-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 341, in readAnnotations
for org, flag in p.imap_unordered(launchReadAnno, args):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
UnboundLocalError: local variable 'contig' referenced before assignment
46% 52/114 [00:21<00:25, 2.46file/s]

@shailabhr
Author

Also, if I could run my fasta list, is it required to run annotated files separately? Kindly let me know. Thank you.

@axbazin
Member

axbazin commented Feb 19, 2021

Hi,
Looking at the code, this error can be raised if the gff file is missing a '##sequence-region' line at the beginning, before the genes are listed. That line is usually expected in gff files, even if it is not mandatory, since you often have to differentiate the contigs the genes can be on.

I will try to replicate the issue. Would it be possible for you to share the first gff in your list that raised the error? In the meantime, I will try to download some genomes from JGI, to see if this has to do with their way of formatting gff files.

As for your question, it depends on what you are trying to achieve through the pangenomic analysis. If you want to keep the same gene identifiers as in your annotated genomes, for example to see which partitions they are in or whether they are in genomic islands, then using your annotated files is probably better. Conversely, if you are building a pangenome to get a general idea of the dynamics of your species and its composition in terms of persistent / shell / cloud genes, using only the fasta files is perfectly fine.

Adelme

@axbazin
Member

axbazin commented Feb 19, 2021

I did manage to replicate the issue with a genome downloaded from JGI, it is indeed linked to the missing "##sequence-region" pragma.
Since this is not mandatory in the gff3 specifications, I'll update the parser to cope with this case. I'll tell you when this is done, sorry for the inconvenience.
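
For readers wondering what "coping with this case" can look like, here is a minimal sketch. It is not the actual PPanGGOLiN parser: the function names `feature_id` and `group_cds_by_contig` are hypothetical, and the only idea shown is that a contig gets created the first time its name is seen (in a '##sequence-region' line or in column 1 of a feature line), so nothing depends on the pragma being present.

```python
def feature_id(attributes):
    """Return the value of the ID attribute from GFF3 column 9, or None."""
    for item in attributes.split(";"):
        key, _, value = item.partition("=")
        if key.strip() == "ID":
            return value.strip()
    return None


def group_cds_by_contig(gff_lines):
    """Map each contig name to the IDs of the CDS features found on it.

    Hypothetical helper, not PPanGGOLiN code. Contigs are created lazily,
    so GFF3 files without '##sequence-region' lines are handled as well.
    """
    contigs = {}
    for raw in gff_lines:
        line = raw.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow, no more features
        if line.startswith("##sequence-region"):
            fields = line.split()
            if len(fields) > 1:
                contigs.setdefault(fields[1], [])
            continue
        if not line or line.startswith("#"):
            continue  # blank line, comment, or another directive
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        seqid, feature_type, attributes = fields[0], fields[2], fields[8]
        cds_ids = contigs.setdefault(seqid, [])  # created on first sight
        if feature_type == "CDS":
            cds_ids.append(feature_id(attributes) or attributes)
    return contigs
```

The function takes any iterable of lines, so an open file handle can be passed directly.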

@axbazin
Member

axbazin commented Feb 19, 2021

The latest commit on the master branch should fix the issue.

I noticed that gff files from JGI do not contain the fasta sequences, meaning that IF you want to use the annotations in your genome and not just the fasta files, you'll have to provide the fasta anyways, as such:

ppanggolin workflow --anno organisms.gbff.list --fasta organisms.fasta.list

This is because ppanggolin needs to get the sequences from somewhere.
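
To find out which gff files need a separate fasta, a quick check along these lines may help. This is only a sketch: `has_embedded_fasta` is a hypothetical helper, and it relies on the GFF3 convention that embedded sequences are introduced by a '##FASTA' line. Files that append sequences without that directive would not be detected.

```python
def has_embedded_fasta(gff_lines):
    """Return True if the GFF3 content contains a '##FASTA' directive.

    Hypothetical helper: it only looks for the directive line and does
    not validate the sequences that follow it.
    """
    return any(line.strip() == "##FASTA" for line in gff_lines)


# Example use on a file (the path is a placeholder):
# with open("my_genome.gff") as handle:
#     print(has_embedded_fasta(handle))
```

If this returns False for any file in the annotation list, the --fasta option shown above is needed.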

Thank you for the issue about this, I hope everything will be fine from now !

Adelme

@shailabhr
Author

Thank you so much for already working on this issue. I am running it on Google Colab, where I call ppanggolin. Since you have fixed the issue, I think I can just re-copy and run it. It should work. Thank you again. I will run it and let you know soon.
Thanks,
Shailabh

@shailabhr
Author

One more quick question: should I use the command line you mentioned?
ppanggolin workflow --anno organisms.gbff.list --fasta organisms.fasta.list

Previously, I had only used either gbff or fasta list to run the workflow. I have the fasta and annotation file (gbff & gff) listed in the same order.
Thank you.

@axbazin
Member

axbazin commented Feb 19, 2021

To be honest, I have no idea how Google Colab works, so you probably know much better than me how to make it work in this case.

Yes, if you wish to use the annotations. The fasta list is needed for the genomes in gff files, since the gff files from JGI do not have fasta sequences.
As long as the contig names are the same in both files, it should be fine.
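
The contig-name condition can be checked before launching the workflow. The sketch below is an assumption-laden illustration, not part of PPanGGOLiN: the three function names are hypothetical, contig names are taken from column 1 of the gff feature lines, and the fasta contig name is assumed to be the first word after '>' in each header.

```python
def gff_contig_names(gff_lines):
    """Collect contig names from column 1 of GFF3 feature lines."""
    names = set()
    for raw in gff_lines:
        line = raw.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # stop before any embedded sequences
        if not line or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def fasta_contig_names(fasta_lines):
    """Collect the first word of every '>' header line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def contigs_missing_from_fasta(gff_lines, fasta_lines):
    """Return the sorted gff contig names that have no fasta record."""
    return sorted(gff_contig_names(gff_lines) - fasta_contig_names(fasta_lines))
```

An empty list means every annotated contig has a matching sequence name in the fasta file.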

@shailabhr
Author

Thank you, Adelme. I tried the command line, but it is showing the error below.
I am sorry to trouble you.

/content/PPanGGOLiN/testingDataset/GenomeData
usage: ppanggolin [-h] [-v] ...
ppanggolin: error: unrecognized arguments: fasta organisms.fasta.list

I then removed the fasta line and entered only this:
ppanggolin workflow --anno organisms.gbff.list

Now it showed the following error:
/content/PPanGGOLiN/testingDataset/GenomeData
2021-02-19 17:22:42 main.py:l180 INFO Command: /usr/local/bin/ppanggolin workflow --anno organisms.gbff.list
2021-02-19 17:22:42 main.py:l181 INFO PPanGGOLiN version: 1.1.132
2021-02-19 17:22:42 annotate.py:l338 INFO Reading organisms.gbff.list the list of organism files ...
100% 114/114 [00:30<00:00, 3.72file/s]
2021-02-19 17:23:13 writeBinaries.py:l481 INFO Writing genome annotations...
100% 114/114 [00:01<00:00, 72.93genome/s]
2021-02-19 17:23:15 writeBinaries.py:l530 INFO Done writing the pangenome. It is in file : ppanggolin_output_DATE2021-02-19_HOUR17.22.42_PID3932/pangenome.h5
Traceback (most recent call last):
File "/usr/local/bin/ppanggolin", line 33, in <module>
sys.exit(load_entry_point('ppanggolin==1.1.132', 'console_scripts', 'ppanggolin')())
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.132-py3.6-linux-x86_64.egg/ppanggolin/main.py", line 191, in main
ppanggolin.workflow.workflow.launch(args)
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.132-py3.6-linux-x86_64.egg/ppanggolin/workflow/workflow.py", line 32, in launch
raise Exception("The gff/gbff provided did not have any sequence informations, you did not provide clusters and you did not provide fasta file. Thus, we do not have the information we need to continue the analysis.")
Exception: The gff/gbff provided did not have any sequence informations, you did not provide clusters and you did not provide fasta file. Thus, we do not have the information we need to continue the analysis.

I think I must enter the previous command but I do not know why it is not taking it.

Thank you.

@axbazin
Member

axbazin commented Feb 19, 2021

No problem, don't worry!
Indeed, the first command should be used here.

My guess is that you may have written ppanggolin workflow --anno organisms.gbff.list fasta organisms.fasta.list ? I get exactly the same error message when doing so myself.
What should be written is ppanggolin workflow --anno organisms.gbff.list --fasta organisms.fasta.list

@shailabhr
Author

Thank you so much. It was my blunder. It did take the command, but now it is showing this:

/content/PPanGGOLiN/testingDataset/GenomeData
mv: cannot move 'NCBI_fna_Files' to 'FASTA/NCBI_fna_Files': Directory not empty
mv: cannot move 'NCBI_gbff_Files' to 'GBFF/NCBI_gbff_Files': Directory not empty
2021-02-19 18:06:47 main.py:l180 INFO Command: /usr/local/bin/ppanggolin workflow --anno organisms.gbff.list --fasta organisms.fasta.list
2021-02-19 18:06:47 main.py:l181 INFO PPanGGOLiN version: 1.1.132
2021-02-19 18:06:47 annotate.py:l338 INFO Reading organisms.gbff.list the list of organism files ...
100% 114/114 [00:29<00:00, 3.82file/s]
2021-02-19 18:07:16 writeBinaries.py:l481 INFO Writing genome annotations...
100% 114/114 [00:01<00:00, 74.71genome/s]
2021-02-19 18:07:19 writeBinaries.py:l530 INFO Done writing the pangenome. It is in file : ppanggolin_output_DATE2021-02-19_HOUR18.06.47_PID4106/pangenome.h5
Traceback (most recent call last):
File "/usr/local/bin/ppanggolin", line 33, in <module>
sys.exit(load_entry_point('ppanggolin==1.1.132', 'console_scripts', 'ppanggolin')())
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.132-py3.6-linux-x86_64.egg/ppanggolin/main.py", line 191, in main
ppanggolin.workflow.workflow.launch(args)
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.132-py3.6-linux-x86_64.egg/ppanggolin/workflow/workflow.py", line 35, in launch
getGeneSequencesFromFastas(pangenome, args.fasta)
File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.132-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 373, in getGeneSequencesFromFastas
raise Exception(f"Not all of your pangenome's organisms are present within the provided fasta file. {missing} are missing (out of {len(pangenome.organisms)}).")
Exception: Not all of your pangenome's organisms are present within the provided fasta file. 114 are missing (out of 228).

I have two lists: FASTA (114 files) and GBFF (114 files, some of which are gff files from JGI).

@axbazin
Member

axbazin commented Feb 19, 2021

The first lines are very strange, as there is no call to 'mv' in PPanGGOLiN, and they are not raised by ppanggolin.

For the rest, if you do have 114 lines in each of your file lists, it may be related to the first problem. If not, I will look at it on Monday when I have access to a Linux machine.

@axbazin
Member

axbazin commented Feb 22, 2021

Hi,

Looking at the code, it seems possible to me that the genome names in 'organisms.gbff.list' and 'organisms.fasta.list' are different.

Is that the case? If so, the entries that correspond to the same genome should carry the same name in both files, otherwise the program can't tell which is supposed to be associated with which.

Adelme

@shailabhr
Author

Hi,
I am extremely sorry for not replying. I got involved with other work and could not continue on this. I checked both lists, and both have the same sequence of names in the exact same order. I also have all the fasta and annotation files in their respective folders. I am not able to find out why it is still showing an error :( I think the %mv was just used to move the files, as I always save my data in a separate folder. I am sorry again for the gap. Thank you.

`/content/PPanGGOLiN/testingDataset/GenomeData
2021-03-04 12:32:48 main.py:l180 INFO Command: /usr/local/bin/ppanggolin workflow --anno organisms.gbff.list --fasta organisms.fasta.list
2021-03-04 12:32:48 main.py:l181 INFO PPanGGOLiN version: 1.1.141
2021-03-04 12:32:48 annotate.py:l338 INFO Reading organisms.gbff.list the list of organism files ...
100% 114/114 [00:33<00:00, 3.42file/s]
2021-03-04 12:33:21 writeBinaries.py:l481 INFO Writing genome annotations...
100% 114/114 [00:01<00:00, 69.04genome/s]
2021-03-04 12:33:24 writeBinaries.py:l530 INFO Done writing the pangenome. It is in file : ppanggolin_output_DATE2021-03-04_HOUR12.32.48_PID4467/pangenome.h5
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/ppanggolin-1.1.141-py3.7-linux-x86_64.egg/ppanggolin/pangenome.py", line 185, in getOrganism
return self._orgGetter[orgName]
KeyError: 'GCF_000010665.1_ASM1066v1_cds_from_genomic.fna'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/ppanggolin-1.1.141-py3.7-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 369, in getGeneSequencesFromFastas
org = pangenome.getOrganism(elements[0])
File "/usr/local/lib/python3.7/dist-packages/ppanggolin-1.1.141-py3.7-linux-x86_64.egg/ppanggolin/pangenome.py", line 187, in getOrganism
raise KeyError(f"{orgName} does not seem to be in your pangenome")
KeyError: 'GCF_000010665.1_ASM1066v1_cds_from_genomic.fna does not seem to be in your pangenome'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/ppanggolin", line 33, in <module>
sys.exit(load_entry_point('ppanggolin==1.1.141', 'console_scripts', 'ppanggolin')())
File "/usr/local/lib/python3.7/dist-packages/ppanggolin-1.1.141-py3.7-linux-x86_64.egg/ppanggolin/main.py", line 191, in main
ppanggolin.workflow.workflow.launch(args)
File "/usr/local/lib/python3.7/dist-packages/ppanggolin-1.1.141-py3.7-linux-x86_64.egg/ppanggolin/workflow/workflow.py", line 35, in launch
getGeneSequencesFromFastas(pangenome, args.fasta)
File "/usr/local/lib/python3.7/dist-packages/ppanggolin-1.1.141-py3.7-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 371, in getGeneSequencesFromFastas
raise KeyError(f"One of the genome in your '{fasta_file}' was not found in the pangenome. This might mean that the genome names between your annotation file and your fasta file are different.")
KeyError: "One of the genome in your 'organisms.fasta.list' was not found in the pangenome. This might mean that the genome names between your annotation file and your fasta file are different."`

@axbazin
Member

axbazin commented Mar 8, 2021

Hi,
No worries. This is quite puzzling. Can you send me the two list files 'organisms.gbff.list' and 'organisms.fasta.list'? You can add them to the issue as files.

Adelme

@shailabhr
Author

organisms.gbff.list.txt
organisms.fasta.list.txt

Thank you for your reply. I dragged the files here. I hope you get them. Thank you so much.

@axbazin
Member

axbazin commented Mar 8, 2021

Thank you, I see the problem.

Basically, you have what look like file extensions and some descriptions in the genome names (first column), which make the genome names different.

For example, the first line of organisms.fasta.list looks like this:

GCF_000010665.1_ASM1066v1_cds_from_genomic.fna FASTA/GCF_000010665.1_ASM1066v1_cds_from_genomic.fna NC_012796.1 NC_012797.1 NC_012795.1

And the first line of organisms.gbff.list looks like this:
GCF_000010665.1_ASM1066v1_genomic.gbff GBFF/GCF_000010665.1_ASM1066v1_genomic.gbff

I see that it's for the same genome (GCF_000010665.1), but since the entire string is different, the names are considered different. You should use 'GCF_000010665.1' or 'GCF_000010665.1_ASM1066v1', for example, as the genome name in both files, so that the program can actually know that those two lines are for the same genome.

This should be done for all genomes for ppanggolin to work. The names in the first column of both files need to be strictly identical.
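
A quick way to spot such mismatches before running the workflow is to compare the first columns of the two list files. The following is a small sketch, not a PPanGGOLiN feature: `genome_names` and `compare_genome_names` are hypothetical helpers, and the genome name is assumed to be the first whitespace-separated field on each line.

```python
def genome_names(list_lines):
    """Return the first column (genome name) of every non-empty line."""
    return [line.split(None, 1)[0] for line in list_lines if line.strip()]


def compare_genome_names(anno_lines, fasta_lines):
    """Return (names only in the annotation list, names only in the fasta list)."""
    anno = set(genome_names(anno_lines))
    fasta = set(genome_names(fasta_lines))
    return sorted(anno - fasta), sorted(fasta - anno)


# Example use (file names as in this thread):
# with open("organisms.gbff.list") as anno, open("organisms.fasta.list") as fasta:
#     only_anno, only_fasta = compare_genome_names(anno, fasta)
#     print(only_anno, only_fasta)
```

Both returned lists should be empty when the names are strictly identical in the two files.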

Adelme

@shailabhr
Author

Oh, I get it now. Thank you so much. I am correcting those names and I will try again. I had only matched the accession numbers and assumed the rest to be correct. I am going to check the entire list and try again. Thank you. I highly appreciate your help.
Thank you.

@shailabhr
Author

Hi, I could run it; however, the output is large and it could not produce the pangenome file. I think my effort to run it on Google Colab was not worth it. We recently got an Ubuntu system. I will try to run it locally so that there is no restriction on disk usage. Do you have instructions for installing it on an Ubuntu system? I would be very thankful. Thank you.

@axbazin
Member

axbazin commented Apr 5, 2021

Hi,
Conda is the easiest solution. You can find some instructions here: https://github.com/labgem/PPanGGOLiN/wiki/Installation
It requires conda to be installed, though.

@shailabhr
Author

Thank you for your reply. I installed all the requirements. After I run the command mamba install -c bioconda ppanggolin
it shows the following problem: Encountered problems while solving. Problem: package ppanggolin-v0.3.88-py37h516909a_0 requires pytables 3.5.*, but none of the providers can be installed
I tried pip install tables, and it says the requirement is already satisfied. I installed pytables 3.5.1 and tried again.
I also tried with mamba install -c bioconda ppanggolin but got the same problem.
Any suggestions please?
Thank you.

@axbazin
Member

axbazin commented Apr 5, 2021

The same problem was encountered in #60. It is due to the Python version of the installed conda.

The following commands solved the problem there, so they should work for you too:

mamba create -n ppanggolin_env python=3.7
mamba install -n ppanggolin_env -c bioconda -c conda-forge -c r ppanggolin=1.1.136

This will create the environment 'ppanggolin_env' and install ppanggolin in it.

Adelme

@shailabhr
Author

Thank you. It worked. I think I have installed it successfully. I see the ppanggolin_env folder. It might be a silly question, but I do not know where the folder is in which I can replace the fasta and gbff files. There is no testingDataset folder inside the ppanggolin_env folder. Can you please help me locate it? Thank you.

@axbazin
Member

axbazin commented Apr 5, 2021

The commands basically just installed the software. They did not download the testingDataset folder, which is not needed.

To use it, run conda activate ppanggolin_env. This will allow you to use ppanggolin.

You can place your gbff or fasta files anywhere you want, as long as the paths in your organisms.list are valid relative to your working directory.
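
If it helps, the paths in a list file can be verified from the directory where ppanggolin will be launched. This is a minimal sketch under stated assumptions: `missing_paths` is a hypothetical helper, and the file path is assumed to be the second whitespace-separated field of each line.

```python
from pathlib import Path


def missing_paths(list_lines, base_dir="."):
    """Return (genome name, path) pairs whose file does not exist.

    Hypothetical helper: relative paths are resolved against base_dir,
    which should be the directory ppanggolin is run from.
    """
    base = Path(base_dir)
    missing = []
    for line in list_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        name, rel_path = fields[0], fields[1]
        if not (base / rel_path).exists():
            missing.append((name, rel_path))
    return missing


# Example use (file name as in this thread):
# with open("organisms.gbff.list") as handle:
#     for name, path in missing_paths(handle):
#         print(f"{name}: cannot find {path}")
```

An empty result means every listed file was found from that working directory.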

@axbazin
Member

axbazin commented Nov 23, 2021

Hi,
I'll close this issue since things are more or less resolved.
Adelme

@axbazin axbazin closed this as completed Nov 23, 2021