Importing Anvio Pangenomic gene clusters into PPAnGGOLiN #56
Comments
I am attaching the files I am using: The .tsv file created from the The list files for the .fasta and .gff files: The .fasta and .gff files exported from Anvio: |
Hello, Adelme |
Hi, Yeah, I saw the issue with not having the '##sequence-region'. I will install the new version. Thanks @axbazin |
Apologies for the late response. I finally had a chance to take a look at the export-gff function Thank you very much for your patience and for investigating how to make these two tools talk. Please let me know if you run into other problems and I will be happy to work on them on our end. Using your experience and expertise, we can even implement an anvi'o script to export everything necessary from an anvi'o pangenome to be able to continue working with it in PPanGGOLiN. Best wishes, |
Hello, Looking more carefully at the files, there might be one last problem. To differentiate the genes within the cluster file, each gene must have a unique identifier across all of the gff files and in the cluster file. As it stands, it seems that Anvi'o assigns incremental gene identifiers from 1 to n in each genome, which makes it impossible to tell genes from different genomes apart. A possible solution would be to give unique IDs to each gene and generate the cluster file with those, but I am not very familiar with Anvi'o so I do not know how doable it is. Of course, the ideal solution for this time and the next ones would most likely be a script such as the one @meren suggested, which generates the gff and the tsv files with unique identifiers. Adelme |
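To illustrate the unique-ID idea, here is a minimal Python sketch, not an anvi'o or PPanGGOLiN feature. It assumes the gene cluster summary has already been read into rows carrying gene_cluster_id, genome_name and gene_callers_id, and it joins genome name and gene caller ID with "___", the separator visible in IDs such as KPL3070___757 later in this thread. The function names and the example rows are hypothetical.

```python
import csv
import io

# Assumption: "___" mirrors IDs such as KPL3070___757 shown later in this thread.
SEPARATOR = "___"


def make_unique_id(genome_name, gene_callers_id, sep=SEPARATOR):
    """Build an ID that is unique across genomes (hypothetical helper)."""
    return f"{genome_name}{sep}{gene_callers_id}"


def clusters_to_tsv(rows, out_handle):
    """Write a two-column TSV: gene family name, then unique gene ID."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    seen = set()
    for row in rows:
        new_id = make_unique_id(row["genome_name"], row["gene_callers_id"])
        if new_id in seen:
            # A repeated ID means two genes could not be told apart downstream.
            raise ValueError(f"duplicate gene ID: {new_id}")
        seen.add(new_id)
        writer.writerow([row["gene_cluster_id"], new_id])


# Made-up rows: both genomes use gene_callers_id 1, yet the IDs stay distinct.
example_rows = [
    {"gene_cluster_id": "GC_00000001", "genome_name": "GenomeA", "gene_callers_id": 1},
    {"gene_cluster_id": "GC_00000001", "genome_name": "GenomeB", "gene_callers_id": 1},
]
buffer = io.StringIO()
clusters_to_tsv(example_rows, buffer)
print(buffer.getvalue(), end="")
```

The duplicate check is there because the same ID appearing twice usually points to two genomes sharing one name.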
Thanks a lot for helping solve the compatibility between your tools. I updated Anvio to the main GitHub branch version and the issue with the gffs is solved. As @axbazin indicated I got an error about the gene IDs. From the
The output looks like this Clusters_Dpig_Anvio7.tsv.txt:
This is, the first column is the gene_cluster_id and the second is new_id (concatenated genome_name + gene_callers_id, separated by @meren, is there a way to export the .gffs using this new_id format instead of the gene_callers_id? Like:
Instead of:
|
A follow-up question for @meren once we figure out how to generate the right .gff files. PPanGGOLiN uses a graphical model and a statistical method to partition gene families into persistent, shell and cloud genomes. My next question would be how to import these partitions back into Anvio as bins. |
Hey @IsabelFE, can you please git pull and see if the GFF outputs look more useful? |
I did
The problem is that the Dpig_Prokka___hash2785db7a___1 format doesn't match on the |
The |
@axbazin, OK that's true, the matrix file has one row per gene cluster, not per individual gene. So the |
Yes, sorry. The naming is quite impractical as it is, but we chose it to match roary's file formatting, so that the file is compatible with other pangenomic tools that take roary's file format as input. I should probably improve our own wiki about this file to make it clearer. If you are looking for partition information for each gene separately, the projection files ( |
I can see that this requires a bit more work, @IsabelFE. The gene cluster summary output anvi'o generates is quite comprehensive, but it is missing gene start/stop positions. It is because when a genomes storage database is generated from a set of contigs databases, that information is discarded. If I change this:
to this:
You would have a temporary solution to link everything together since gene clusters file does contain the genome name and gene caller id for each gene cluster. Would you like me to do that? The real solution is to re-implement the anvi'o's genome storage database, which is something we have been meaning to do but we haven't allocated any time for it. |
@meren, I see the problem. If you change to As I understand |
I actually was generating the GFF output files using this command:
So it is supposed to be run on every genome (contigs db) separately. In that case |
I am running it on every genome (contigs db) separately using a loop:
I think that the problem is on the loop I used to create the contigs.db:
I am using -n Dpig_Prokka instead of -n $FILENAME, so all my genomes got named the same. 🤦🏻♀️ I think that I am going to need to run the whole analysis again... |
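The naming mistake described above can be caught before any database is built. This is a rough Python sketch, assuming each genome's FASTA file name (without its extension) is what should be passed as the project name. The helper name and the example paths are hypothetical; the sketch only computes and checks names and does not call anvi'o.

```python
from collections import Counter
from pathlib import Path


def project_names(fasta_paths):
    """Map each FASTA path to a project name taken from its file name.

    Hypothetical helper: raises ValueError if two genomes would share a name,
    which is the situation that produced identical gene IDs across genomes.
    """
    names = {str(path): Path(path).stem for path in fasta_paths}
    counts = Counter(names.values())
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f"project names are not unique: {duplicates}")
    return names


# Made-up paths for illustration only.
for path, name in project_names(
    ["genomes/KPL3070.fasta", "genomes/KPL3065.fasta"]
).items():
    print(f"{path} -> -n {name}")
```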
I ran the whole pipeline again and have the new contig databases with the correct names. Now my gffs look like this:
@meren could you please make Thanks for all the help!! |
Done :) |
Thanks @meren!! Now my gffs look like this:
And my tsv file for
However, I still got an error from PPAnGGOLiN:
@axbazin, it looks like it is not finding the clusters on the tsv file for any of the genes (49412). Is this because Anvi'o names every cluster with a unique ID (like |
This is my code and files for the PPanGGOLiN analysis
|
The fact that PPanGGOLiN uses gene IDs by default should not create problems. We do this to "have a unique ID" but anything could be used in theory. This is a little puzzling, I will look at it more closely as soon as I have access to a linux machine. |
Hello, It turned out that there was a slight problem in the log here. The number reported is the number of genes that were associated to a family, and not the opposite. I've fixed that in the master branch. Sorry for the inconvenience. There are only 175 genes without families, I've listed them in genes_without_families.txt Adelme |
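A check like the one described above can also be run outside PPanGGOLiN. The following is a minimal Python sketch under stated assumptions: gene IDs live in the ID attribute of CDS rows in the GFF files, and the second tab-separated column of the cluster file holds the gene ID. The function names and the inline data are hypothetical, and this is not how PPanGGOLiN itself does the matching.

```python
def gff_cds_ids(gff_lines):
    """Collect the ID attribute of every CDS feature in a GFF3 file."""
    ids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if "ID" in attributes:
            ids.add(attributes["ID"])
    return ids


def clustered_gene_ids(tsv_lines):
    """Collect gene IDs from the second column of the cluster file."""
    ids = set()
    for line in tsv_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            ids.add(parts[1])
    return ids


def genes_without_families(gff_lines, tsv_lines):
    """Return the sorted gene IDs present in the GFF but absent from the TSV."""
    return sorted(gff_cds_ids(gff_lines) - clustered_gene_ids(tsv_lines))


# Made-up data: the second gene has no family in the cluster file.
example_gff = [
    "##gff-version 3",
    "contig_1\tanvio\tCDS\t1\t300\t.\t+\t0\tID=G1___1",
    "contig_1\tanvio\tCDS\t400\t900\t.\t-\t0\tID=G1___2",
]
example_tsv = ["GC_00000001\tG1___1"]
print(genes_without_families(example_gff, example_tsv))
```

With real files, the same functions can be fed open file handles, one GFF at a time, and the results merged.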
Hello, I updated to the new version and ran it again:
|
That was a bad fix. It should be better now. |
Thanks @axbazin It worked!! 🥳 I have the whole PPanGGOLiN pipeline working. I was able to write all the outputs and also run the RGPs without any problem. There is just one last thing that could make the outputs even more useful. In the exported gffs from Anvio, the functional annotations are not present, therefore all the outputs in PPanGGOLiN have only the gene IDs. It would be much better to be able to explore the partition graph and the spots plots with the functional annotation on them. For example, this is one of the hotspots using the default PPanGGOLiN with Prokka annotated gffs: And this is a hotspot now, only with the gene IDs, but no annotations. For exploration purposes having the annotations will be great. @meren is there any way to export the gffs with functional annotations on them? This is not 100% needed, but it would be cool. A big thanks to both of you, @axbazin and @meren, both PPanGGOLiN and Anvi'o have been really useful for our research and now going from one output to the other is much easier since the gene IDs and cluster IDs are the same in both of them. 😊 |
I'm so glad to see you're making progress, @IsabelFE. Thank you, also, @axbazin. It is doable to export gene functional annotations in GFF files, @IsabelFE, and I can take a look at that. Would you mind submitting an issue for this on the anvi'o GitHub, mentioning the URL of this issue so there is context? Thank you, |
Thank you for your very clear and detailed bug reports and answers ! |
@axbazin, I think we closed this issue too soon... I found some issues with the annotations. When I look at the spot plots or at the matrix file output, genes have an ID like this: KPL3070_CDS_0757, but this ID does not correspond to the ID in the .gff files:
For example, KPL3070_CDS_0757, which based on the spot plot corresponds to COG2264, is not equal to KPL3070___757; it is in fact KPL3070___759 (which, as you can see, is COG2264). It looks like PPanGGOLiN is running the CDS search again instead of using the information from the Anvio .gff file. This goes back to my initial post in this issue:
It looks like PPanGGOLiN is generating again new IDs for each gene (the ones in the KPL3070_CDS_0757 format), instead of using the ones from the gff. Therefore, on the PPanGGOLiN output, the gene clusters IDs match the ones from Anvio, but if you look at the individual gene IDs, the numbers are off. Thanks again! |
The funny thing is that the info in the gephi table has the Anvio IDs, but in the matrix file you get the other type of IDs, which don't have the same gene_callers_id. I realized this because I am trying to draw a subgraph for some hotspots using Would it be possible to get the KPL3070___757 annotations on the spot plots? That way, when I find a spot of interest I can find the corresponding protein sequences. Best, |
In the case of provided annotation, PPanGGOLiN uses the ids in the genome files but also creates internal gene IDs (just in case there are genes with identical ids in different genomes AND PPanGGOLiN is the one performing the clustering afterwards). What I am guessing is happening here is that those internal IDs are reported instead of the user's. I will look at this ASAP. It is definitely possible to get the right labels on the plot. I notice also that there are graphical issues with margins when IDs are quite long, I'll see if I can do something about this too. |
Thanks for looking into this and reopening the issue. I see the point about the genome file IDs vs the PPanGGOLiN internal gene IDs. I also found another issue with the current annotations on the spot plots. For some genes, only the COG ID is plotted, but a COG ID can be assigned to more than one GC, so ideally having both the KPL3070___757 and the COG ID would be perfect. But I understand that from a graphical point of view this is too long. Therefore, having only the gene ID as KPL3070___757 (not the internal ID) would be the most useful. I could then easily search for the ID either in the gephi_table.xlsx or in the anvio summary table and find the COG ID or the other annotation info. Thanks! |
@axbazin while you fix this I moved ahead and manually found the flanking proteins for my spot of interest and ran
Do you have plans to add this functionality for externally clustered gene families? 🙏🏻 |
For the last thing that you reported, it is probably possible, though that might require some work. I'll see if I can do something about this in the near future. It would indeed be quite practical. |
Hello, In the latest commits I added a new option so that you can choose what is used in priority as the label in the figures. This will put gene IDs as labels:
If a label is indicated in the case of provided annotations like yours, it should always be reported in priority in the figures. As another example, this command will show gene names if they exist, or the gene ID if there are no gene names:
Also, to maybe make things a bit easier, I've updated the '--interest' option to add elements of interest to the figure file names, if you want to see all spots with a given annotation, for example:
should indicate which spots have genes with the annotation 'COG2264', either in the spot or as flanking persistent genes. This should work with family ID, gene ID, gene names and strings that can be found in the 'product' field of the annotation files. We're looking to update this spot-drawing feature in the future, as we're not entirely happy with how it currently renders, so it might evolve. Adelme |
Hi Adelme, This is really cool! It is great to be able to explore the plots with the name, family, or ID options. Makes going back and forward with Anvio much easier. Thanks, Isabel |
Hi, |
Hello @labgem and @merenlab (@meren),
I've been using both the PPanGGOLiN and Anvio pangenomic pipelines and really loving both tools. I've used both tools independently with Prokka annotated .gff files without a problem. But now I want to import Anvio clusters into PPanGGOLiN instead of using the default MMseqs2 clustering. The reason for this is that being able to visualize the same gene clusters with both methods would be really useful for our research. It is difficult to make sense of the data with 2 independent clustering methods since it is like comparing apples and oranges. We are able to make really cool observations with each method but we can't compare them with the other one.
In order to import the Anvio clustering into PPanGGOLiN I need:
A .tsv file listing the gene family names in the first column and, in the second column, the gene IDs used in the annotation files. Using anvi-summarize I got the info needed to generate the .tsv file.
The annotated genomes, with gene IDs that match the ones listed in the previous .tsv file. The problem is that the gene_callers_id values provided by Anvio don't match the original ones in the Prokka annotation. The Prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default, Prokka also annotates tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the Prokka output in order to be compatible with the pangenomic Anvio pipeline. While parsing, new gene_callers_id values were generated only for the ORFs that were imported into Anvio. I found out that anvi-get-sequences-for-gene-calls can be used to export new .fasta and .gff files with only the ORFs that match the gene IDs in the .tsv file. But I think there is an issue with the formatting of these .gff files: they are not compatible with the .gff files PPanGGOLiN expects.
I tried running:
ppanggolin annotate --anno Anvio7GenomesAnno.txt --fasta Anvio7GenomesFasta.txt
And I got an error that I am not sure if it is related to PPanGGOLiN or to the format of the .gff/.fasta files obtained from Anvio.
I hope someone from one of your teams can help me with this. It would be really cool to have both tools working on the same set of gene clusters.
Thanks a lot,
Isabel