Importing Anvio Pangenomic gene clusters into PPAnGGOLiN #56

IsabelFE · 2021-02-19T22:10:21Z

I've been using both PPAnGGOLiN and Anvio pangenomic pipelines and really loving both tools. I've used both tools independently with Prokka annotated .gff files without a problem. But now I want to import Anvio clusters into PPAnGGOLiN instead of using the default MMseqs2 clustering. The reason for this is that being able to visualize the same gene clusters with both methods would be really useful for our research. It is difficult to make sense of the data with 2 independent clustering methods since it is like comparing apples and oranges. We are able to make really cool observations with each method but we can't compare with the other one.

In order to import the Anvio clustering into PPanGGOLiN I need:

A .tsv file listing in the first column the gene family names, and in the second column the gene ID that is used in the annotation files. Using anvi-summarize I got the info needed to generate the .tsv file.
The annotated genomes with gene IDs that match the ones listed in the previous .tsv file. The problem is that the gene_callers_id provided by Anvio don't match the original ones in the Prokka annotation. The Prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default, Prokka annotates also tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the Prokka output in order to be compatible with the pangenomic Anvio pipeline. While parsing new gene_callers_id were generated only for the ORFs that were imported into Anvio. I found out that anvi-get-sequences-for-gene-calls can be used to export new .fasta and .gff files with only the ORFs that match the gene IDs on the .tsv file. But I think that there is an issue with the formating of these .gff files not being compatible with the expected .gff files on PPanGGOLiN

I tried running:
ppanggolin annotate --anno Anvio7GenomesAnno.txt --fasta Anvio7GenomesFasta.txt

And I got an error that I am not sure if it is related to PPanGGOLiN or to the format of the .gff/.fasta files obtained from Anvio.

2021-02-19 13:37:26 main.py:l180 INFO	Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin annotate --anno analysis_PPanGGOLiN/Anvio7GenomesAnno.txt --fasta analysis_PPanGGOLiN/Anvio7GenomesFasta.txt -o analysis_PPanGGOLiN/OutputFromAnvio7 --basename FromAnvio7
2021-02-19 13:37:26 main.py:l181 INFO	PPanGGOLiN version: 1.1.131
2021-02-19 13:37:26 annotate.py:l337 INFO	Reading analysis_PPanGGOLiN/Anvio7GenomesAnno.txt the list of organism files ...
  0%|                                                                                                                                 | 0/28 [00:00<?, ?file/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 325, in readAnnoFile
    return read_org_gff(organism_name, filename, circular_contigs, getSeq, pseudo)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 277, in read_org_gff
    if contig.name != gff_fields[GFF_seqname]:
UnboundLocalError: local variable 'contig' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 319, in launchReadAnno
    return readAnnoFile(*args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 327, in readAnnoFile
    raise Exception(f"Reading the gff3 file '{filename}' raised an error.")
Exception: Reading the gff3 file 'analysis_PPanGGOLiN/Anvio7_Exported_Genomes/Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff' raised an error.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 183, in main
    ppanggolin.annotate.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 428, in launch
    readAnnotations(pangenome, args.anno, cpu = args.cpu, pseudo = args.use_pseudo, show_bar=args.show_prog_bars)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 349, in readAnnotations
    for org, flag in p.imap_unordered(launchReadAnno, args):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/multiprocessing/pool.py", line 761, in next
    raise value
Exception: Reading the gff3 file 'analysis_PPanGGOLiN/Anvio7_Exported_Genomes/Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff' raised an error.
  0%|

I hope someone from one of your teams can help me with this. Really it would be really cool to have both tools on the same set of gene clusters.

Thanks a lot,

Isabel

The text was updated successfully, but these errors were encountered:

IsabelFE · 2021-02-19T22:16:14Z

I am attaching the files I am using:

The .tsv file created from the anvi-summarize info:
Clusters_Dpig_Anvio7.tsv.txt

The list files for the .fasta and .gff files:
Anvio7GenomesAnno.txt
Anvio7GenomesFasta.txt

The .fasta and .gff files exported from Anvio:
Anvio7_Exported_Genomes.zip

axbazin · 2021-02-19T22:39:12Z

Hello,
Your error "UnboundLocalError: local variable 'contig' referenced before assignment" is quite funnily related to an issue that was fixed today (the fact of not having '##sequence-region' at the beginning of the gff), so you'll need to install the github version of ppanggolin to have it work.
However, the genomes in the first gff that I opened "Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff" is extremely strange. All genes start on position 1 even though they are on the same dna molecule 'c_000000000001'. This does not sound right to me. Unless I am missreading something, there is probably a problem there.

Adelme

IsabelFE · 2021-02-19T22:59:57Z

Hi,

Yeah, I saw the issue with not having the '##sequence-region'. I will install the new version.
I also thought that the .gff exported from Anvio look weird, it is like they have the right end positions, but they all start on position 1.
@meren is this an issue with Anvio not exporting the gff properly?

Thanks @axbazin

meren · 2021-02-21T16:28:11Z

Dear @axbazin and @IsabelFE,

Apologies for the late response. I finally had a chance to take look at the export-gff function anvi-get-sequences-for-gene-calls is using and found a bug we recently introduced. I now have fixed the issue via merenlab/anvio@a9f7328 in our main branch, which means, if you are tracking the main branch, your new GFF files should look much better, @IsabelFE.

Thank you very much for your patience and investigating how to make these two tools talk. Please let me know if you run into other problems and I will be happy to work on them on our end. Using your experience and expertise, we can even implement an anvi'o script to export everything necessary from an anvi'o pangenome to be able to continue working with it in PPAnGGOLiN.

Best wishes,

axbazin · 2021-02-22T07:54:54Z

Hello,

Watching more carefully the files, there might be one last trouble. To be able to differentiate the genes within the cluster file, genes must have a unique identifier throughout all of the gff files and in the cluster file. As it stands it seems that Anvi'o gives incremental gene identifiers from 1 to n in each genome, which will be problematic in differentiating genes in between the different genomes.

A possible solution would be to give unique IDs to each gene and generate the cluster file with those, but I am not very familiar with Anvi'o so I do not know how doable it is.

Of course an ideal solution for this time and the next ones would most likely to have a script such as @meren suggested, that generates the gff and the tsv files with unique identifiers.

Adelme

meren · 2021-02-22T20:15:31Z

We certainly can fix that :) If I have access to some 'ideal input files' from @IsabelFE, @merenlab can use that as a model to develop a tool.

IsabelFE · 2021-02-22T22:01:09Z

Dear @axbazin and @meren,

Thanks a lot for helping solve the compatibility between your tools. I updated Anvio to the main GitHub branch version and the issue with the gffs is solved. As @axbazin indicated I got an error about the gene IDs.

From the anvi-summarize output I generated the .tsv file on R like this:

Dpig_Anvio7 <- read_delim("analysis_Anvio7/Pangenomic_Results_Dpig/Dpig-PAN-SUMMARY/PAN_DPIG_prokka_gene_clusters_summary.txt.gz", "\t")
Dpig_Anvio7 <- Dpig_Anvio7 %>% 
  unite(new_id, genome_name:gene_callers_id, sep = "_", remove = FALSE)
Clusters_Dpig_Anvio7 <- select(Dpig_Anvio7, c(gene_cluster_id, new_id))

The output looks like this Clusters_Dpig_Anvio7.tsv.txt:

GC_00000001 ATCC_51524_900
GC_00000001 KPL1914_1178
GC_00000001 KPL1914_475
GC_00000001 KPL1922_CDC39_95_1080
GC_00000001 KPL1922_CDC39_95_135

This is, the first column is the gene_cluster_id and the second is new_id (concatenated genome_name + gene_callers_id, separated by _) (Should I have separated them with something different from _??)

@meren, is there a way to export the .gffs using this new_id format instead of the gene_callers_id?

Like:

c_000000000001	.	CDS	1	1338	.	+	.	ID=ATCC_51524_1
c_000000000001	.	CDS	1569	2705	.	+	.	ID=ATCC_51524_2
c_000000000001	.	CDS	2921	3157	.	+	.	ID=ATCC_51524_3
c_000000000001	.	CDS	3160	4305	.	+	.	ID=ATCC_51524_4
c_000000000001	.	CDS	4308	6254	.	+	.	ID=ATCC_51524_5
c_000000000001	.	CDS	6349	9012	.	+	.	ID=ATCC_51524_6
c_000000000001	.	CDS	9159	9896	.	+	.	ID=ATCC_51524_7

Instead of:

c_000000000001	.	CDS	1	1338	.	+	.	ID=1
c_000000000001	.	CDS	1569	2705	.	+	.	ID=2
c_000000000001	.	CDS	2921	3157	.	+	.	ID=3
c_000000000001	.	CDS	3160	4305	.	+	.	ID=4
c_000000000001	.	CDS	4308	6254	.	+	.	ID=5
c_000000000001	.	CDS	6349	9012	.	+	.	ID=6
c_000000000001	.	CDS	9159	9896	.	+	.	ID=7

IsabelFE · 2021-02-22T22:16:47Z

A follow up question for @meren once we figure out how to generate the rigth gff. files. PPanGGOLiN uses a graphical model and a statistical method to partition gene families in persistent, shell and cloud genomes. My next question would be how to import these partitions back as bins in Anvio.
PPanGGOLiN output looks like this: matrix.csv.txt
Gene column has the new_id from the .tsv file before and Non-unique Gene name has the partition info (is this correct @axbazin?)

related to labgem/PPanGGOLiN#56

meren · 2021-02-23T00:19:29Z

Hey @IsabelFE, can you please git pull and see if the GFF outputs look more useful?

IsabelFE · 2021-02-23T01:22:48Z

I did pip install --upgrade git+git://github.com/merenlab/anvio.git as before but it didn't update.
Anyway, I saw you edited the dbops.py file, so got the new version from GitHub, and the new gffs look like this"

c_000000000001	.	CDS	1	1338	.	+	.	ID=Dpig_Prokka___hash2785db7a___1
c_000000000001	.	CDS	1569	2705	.	+	.	ID=Dpig_Prokka___hash2785db7a___2
c_000000000001	.	CDS	2921	3157	.	+	.	ID=Dpig_Prokka___hash2785db7a___3
c_000000000001	.	CDS	3160	4305	.	+	.	ID=Dpig_Prokka___hash2785db7a___4
c_000000000001	.	CDS	4308	6254	.	+	.	ID=Dpig_Prokka___hash2785db7a___5
c_000000000001	.	CDS	6349	9012	.	+	.	ID=Dpig_Prokka___hash2785db7a___6
c_000000000001	.	CDS	9159	9896	.	+	.	ID=Dpig_Prokka___hash2785db7a___7
c_000000000001	.	CDS	9992	11287	.	-	.	ID=Dpig_Prokka___hash2785db7a___8
c_000000000001	.	CDS	11549	12886	.	-	.	ID=Dpig_Prokka___hash2785db7a___9
c_000000000001	.	CDS	13205	13504	.	+	.	ID=Dpig_Prokka___hash2785db7a___10

The problem is that the Dpig_Prokka___hash2785db7a___1 format doesnt match on the anvi-summarize output that I will need to use to create the tsv file. Could a column with this richer IDs be added to the anvi-summarize output?

axbazin · 2021-02-23T08:08:05Z

The _ separator is perfectly fine, I've been using the same thing.
Yes @IsabelFE this is correct. For the matrix.csv file, the Gene field does store the cluster ID, and the Non-unique Gene name stores the predicted partition.

IsabelFE · 2021-02-23T14:40:55Z

The _ separator is perfectly fine, I've been using the same thing.
Yes @IsabelFE this is correct. For the matrix.csv file, the Gene field does store the cluster ID, and the Non-unique Gene name stores the predicted partition.

@axbazin, OK that's true, the matrix file has one row per gene cluster, not per individual gene. So the Gene field does store the cluster ID, not the unique gene ID as I indicated.

axbazin · 2021-02-23T14:53:43Z

Yes, sorry. The naming is quite impactical as it is, but we chose to have it as such to match roary's file formating to be compatible with other pangenomic tools that take roary's file format as input. I should probably improve our own wiki about this file to make it clearer.

If you are looking for a partition information for each gene separately, the projection files ( ppanggolin write -p pangenome.h5 --projection ) are one file per genome with some of information for each gene, including its family partition. (and its format is actually detailed in the wiki )

meren · 2021-02-23T15:05:52Z

I can see that this requires a bit more work, @IsabelFE.

The gene cluster summary output anvi'o generates is quite comprehensive, but it is missing gene start/stop positions. It is because when a genomes storage database is generated from a set of contigs databases, that information is discarded.

If I change this:

Dpig_Prokka___hash2785db7a___10

to this:

Dpig_Prokka___10

You would have a temporary solution to link everything together since gene clusters file does contain the genome name and gene caller id for each gene cluster. Would you like me to do that?

The real solution is to re-implement the anvi'o's genome storage database, which is something we have been meaning to do but we haven't allocated any time for it.

IsabelFE · 2021-02-23T15:13:56Z

@meren, I see the problem.

If you change to Dpig_Prokka___10 then is back to the original problem, genes will have a unique ID in each genome, but not across all genomes on a pangenome (Dpig_Prokka was the pangenome project ID in this case). Is it possible to have something like Dpig_Prokka__GenomeName_10 on the gff files? No need to change anything on the gene cluster summary output, I will just concatenate the columns with the genome_name and gene_callers_id on the anvi-summarize output.

As I understand hash2785db7a is just a random ID you are assigning to each genome, but the real genome_name could be used instead.

meren · 2021-02-23T15:41:16Z

@IsabelFE,

I actually was generating the GFF output files using this command:

anvi-get-sequences-for-gene-calls -c CONTIGS.db -o gene_calls.gff --export-gff

So it is supposed to be run on every genome (contigs db) separately. In that case Dpig_Prokka part will be matching to the genome_name column in the summary output. The hash is a unique identifier of the contigs database, but it is not listed in the gene cluster summary output.

IsabelFE · 2021-02-24T01:03:52Z

I am running it on every genome (contigs db) separately using a loop:

path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_o="analysis_PPanGGOLiN/Anvio7_Exported_Anno"

for file in $path_f/*.db; do
    FILENAME=`basename ${file%.*}`
    anvi-get-sequences-for-gene-calls -c $file --export-gff3 \
                                      -o $path_o/Anvio7_$FILENAME.gff
      
done

I think that the problem is on the loop I used to create the contigs.db:

path_f="analysis_Anvio7/Reformat_Dpig"
path_o="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_e="analysis_Anvio7/Parsed_prokka_Dpig"

for file in $path_f/*.fa; do
    FILENAME=`basename ${file%.*}`
    anvi-gen-contigs-database -f $file \
                              -o $path_o/$FILENAME.db \
                              --external-gene-calls $path_e/calls_prokka_$FILENAME.txt \
                              --ignore-internal-stop-codons \
                              -n Dpig_Prokka;
done

I am using -n Dpig_Prokka, instead of n $FILENAME, so all my genomes got named the same. 🤦🏻‍♀️

I think that I am going to need to run the whole analysis again...

IsabelFE · 2021-02-25T16:05:12Z

I ran the whole pipeline again and have the new contig databases with the correct names. Now my gffs look like this:

c_000000000001	.	CDS	3654	4955	.	-	.	ID=ATCC_51524___hash349ce8f2___1
c_000000000001	.	CDS	5083	6612	.	-	.	ID=ATCC_51524___hash349ce8f2___2
c_000000000001	.	CDS	6612	7370	.	-	.	ID=ATCC_51524___hash349ce8f2___3
c_000000000001	.	CDS	7460	8644	.	-	.	ID=ATCC_51524___hash349ce8f2___4

@meren could you please make anvi-get-sequences-for-gene-calls not to add the hash349ce8f2 part as you previously suggested?

Thanks for all the help!!

re: labgem/PPanGGOLiN#56

meren · 2021-02-25T19:56:36Z

Done :)

IsabelFE · 2021-02-26T16:00:57Z

Thanks @meren!!

Now my gffs look like this:

c_000000000001	.	CDS	3654	4955	.	-	.	ID=ATCC_51524___1
c_000000000001	.	CDS	5083	6612	.	-	.	ID=ATCC_51524___2
c_000000000001	.	CDS	6612	7370	.	-	.	ID=ATCC_51524___3
c_000000000001	.	CDS	7460	8644	.	-	.	ID=ATCC_51524___4
c_000000000001	.	CDS	8772	9767	.	-	.	ID=ATCC_51524___5

And my tsv file for ppanggolin cluster has compatible IDs:

GC_00000001 ATCC_51524___900
GC_00000001 KPL1914___1178
GC_00000001 KPL1914___475
GC_00000001 KPL1922_CDC39_95___1080
GC_00000001 KPL1922_CDC39_95___135

However, I still got an error from PPAnGGOLiN:

(PPanGGOLiN) isabelfe@x86_64-apple-darwin13 Dpig_Pangenomics % ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv

2021-02-26 10:47:31 main.py:l180 INFO	Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-02-26 10:47:31 main.py:l181 INFO	PPanGGOLiN version: 1.1.132
2021-02-26 10:47:31 readBinaries.py:l37 INFO	Getting the current pangenome's status
2021-02-26 10:47:31 readBinaries.py:l294 INFO	Reading pangenome annotations...
100%|█████████████████████████████████████████████████████████████████████████████████████| 49587/49587 [00:00<00:00, 201335.07gene/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:01<00:00, 18.40organism/s]
2021-02-26 10:47:33 cluster.py:l233 INFO	Reading analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv the gene families file ...
100%|███████████████████████████████████████████████████████████████████████████████| 1478457/1478457 [00:00<00:00, 2474952.77bytes/s]
Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 185, in main
    ppanggolin.cluster.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 288, in launch
    readClustering(pangenome, args.clusters, args.infer_singletons, args.force, show_bar=args.show_prog_bars)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 272, in readClustering
    raise Exception(f"Some genes ({nbGeneWtFam}) did not have an associated cluster. Either change your cluster file so that each gene has a cluster, or use the --infer_singletons option to infer a cluster for each non-clustered gene.")
Exception: Some genes (49412) did not have an associated cluster. Either change your cluster file so that each gene has a cluster, or use the --infer_singletons option to infer a cluster for each non-clustered gene.

@axbazin, it looks like it is not finding the clusters on the tsv file for any of the genes (49412). Is this because Anvi'o names every cluster with a unique ID (like GC_00000001), while PPanGGOLiN uses the gene ID of one of the members of the cluster to name the whole cluster (like ATCC_51524___900)?

IsabelFE · 2021-02-26T16:07:14Z

This is my code and files for the PPanGGOLiN analysis

ppanggolin annotate --anno analysis_PPanGGOLiN_Anvio7/Anvio7GenomesExported.gff.txt --fasta analysis_PPanGGOLiN_Anvio7/Anvio7GenomesReformatted.fa.txt -o analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7 --basename FromAnvio7

ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv

analysis_PPanGGOLiN_Anvio7.zip

axbazin · 2021-02-26T18:58:10Z

The fact that PPanGGOLiN uses gene IDs by default should not create problems. We do this to "have a unique ID" but anything could be used in theory.

This is a little puzzling, I will look at it more closely as soon as I have access to a linux machine.

axbazin · 2021-03-01T08:55:03Z

Hello,

It turned out that there was a slight problem in the log here. The number reported is the number of genes that were associated to a family, and not the opposite. I've fixed that in the master branch. Sorry for the inconvenience.

There are only 175 genes without families, I've listed them in genes_without_families.txt

Adelme

IsabelFE · 2021-03-01T13:59:19Z

Hello,

I update to the new version and run it again:

(PPanGGOLiN) isabelfe@x86_64-apple-darwin13 Dpig_Pangenomics % ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-03-01 08:53:53 main.py:l180 INFO	Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin cluster -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --clusters analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv
2021-03-01 08:53:53 main.py:l181 INFO	PPanGGOLiN version: 1.1.140
2021-03-01 08:53:53 readBinaries.py:l37 INFO	Getting the current pangenome's status
2021-03-01 08:53:53 readBinaries.py:l294 INFO	Reading pangenome annotations...
100%|███████████████████████████████| 49587/49587 [00:00<00:00, 111534.44gene/s]
100%|█████████████████████████████████████| 28/28 [00:01<00:00, 15.73organism/s]
2021-03-01 08:53:56 cluster.py:l235 INFO	Reading analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv the gene families file ...
  0%|                                            | 0/1478457 [00:00<?, ?bytes/s]Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 257, in readClustering
    nbGeneWithFam+=1
UnboundLocalError: local variable 'nbGeneWithFam' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 185, in main
    ppanggolin.cluster.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 290, in launch
    readClustering(pangenome, args.clusters, args.infer_singletons, args.force, show_bar=args.show_prog_bars)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/cluster/cluster.py", line 264, in readClustering
    raise Exception(f"line {lineCounter} of the file '{families_tsv_file.name}' raised an error.")
Exception: line 1 of the file 'analysis_PPanGGOLiN_Anvio7/Clusters_Dpig_Anvio7.tsv' raised an error.
  0%|                                | 29/1478457 [00:00<3:15:44, 125.89bytes/s]

axbazin · 2021-03-01T14:11:39Z

That was a bad fix. It should be better now.

IsabelFE · 2021-03-01T16:26:27Z

Thanks @axbazin It worked!! 🥳

I have the whole PPanGGOLiN pipeline working. I was able to write all the outputs and also run the RGPs without any problem.

There is just one last thing that could make the outputs even more useful. In the exported gffs from Anvio, the functional annotations are not present, therefore all the outputs in PPanGGOLiN have only the gene IDs. It would be much better to be able to explore the partition graph and the spots plots with the functional annotation on them.

For example, this is one of the hotspots using the default PPanGGOLiN with Prokka annotated gffs:

And this is a hotspot now, only with the gene IDs, but no annotations.

For exploration purposes having the annotations will be great. @meren is there any way to export the gffs with functional annotations on them? This is not 100% needed, but it would be cool.

A big thanks to both of you, @axbazin and @meren, both PPanGGOLiN and Anvi'o have been really useful for our research and now going from one output to the other is much easier since the gene IDs and cluster IDs are the same in both of them. 😊

meren · 2021-03-01T16:29:16Z

I'm so glad to see you're making progress, @IsabelFE. Thank you, also, @axbazin.

It is doable to export gene functional annotations in GFF files, @IsabelFE and I can take a look at that. Would you mind submitting an issue for this to anvi'o GitHub and in which mention the URL for this issue so there is context?

Thank you,

IsabelFE · 2021-03-01T17:12:41Z

@meren, I submitted a new issue to Anvi'o GitHub.
@axbazin, I think that you can close this issue if you want since everything works now to import from Anvi'o into PPanGGOLiN.

Thanks again to both of you.

axbazin · 2021-03-01T21:33:43Z

Thank you for your very clear and detailed bug reports and answers !

IsabelFE · 2021-03-16T14:16:18Z

@axbazin, I think we closed this issue too soon... I found some issued with the annotations.

When I look at the spot plots or at the matrix file output genes have an ID like this: KPL3070_CDS_0757, but this ID does not correspond with the ID on the .gff files:

c_000000000001	.	CDS	789846	792719	.	+	.	ID=KPL3070___757;Name=COG0860;db_xref=COG20:COG0860;product=N-acetylmuramoyl-L-alanine amidase (AmiC) (PDB:1JWQ)
c_000000000001	.	CDS	792855	793406	.	+	.	ID=KPL3070___758
c_000000000001	.	CDS	793384	794352	.	+	.	ID=KPL3070___759;Name=COG2264;db_xref=COG20:COG2264;product=Ribosomal protein L11 methylase PrmA (PrmA) (PDB:3GRZ)

For example, KPL3070_CDS_0757, which based on the spot plot corresponds to COG2264, is not equal to KPL3070___757, it is in fact KPL3070___759 (that as you can see it is COG2264). It looks like instead of using the information from the Anvio .gff file PPanGGOLiN is doing CDS search again. This goes to my initial post in this issue:

The problem is that the gene_callers_id provided by Anvio don't match the original ones in the Prokka annotation. The Prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default, Prokka annotates also tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the Prokka output in order to be compatible with the pangenomic Anvio pipeline. While parsing new gene_callers_id were generated only for the ORFs that were imported into Anvio.

It looks like PPanGGOLiN is generating again new IDs for each gene (the ones in the KPL3070_CDS_0757 format), instead of using the ones from the gff. Therefore, on the PPanGGOLiN output, the gene clusters IDs match the ones from Anvio, but if you look at the individual gene IDs, the numbers are off.

Thanks again!

IsabelFE · 2021-03-16T15:38:58Z

The funny thing is that the info on the gephi table has the Anvio IDs, but in the matrix file you get the other type of IDs that don't have the same gene_callers_id.
gephi_table.xlsx
matrix.xlsx

I realized this because I am trying to draw a subgraph for some hotspots using ppanggolin align and I need to provide the sequence of the flanking proteins. When searching for those sequences on the Anvio output I was searching for KPL3070_CDS_0757, this is genome_name=="KPL3070" & gene_callers_id=="757", but soon I realized that what I need is indeed genome_name=="KPL3070" & gene_callers_id=="759"

Could be possible to get the KPL3070___757 annotations on the spots plots? That way, when I find a spot of interest I can find the corresponding protein sequences.

Best,

axbazin · 2021-03-16T15:58:40Z

In the case of provided annotation, PPanGGOLiN uses the ids in the genome files but also creates internal gene IDs (just in case there are genes with identical ids in different genomes AND PPanGGOLiN is the one performing the clustering afterwards). What I am guessing is happening here is that those internal IDs are reported instead of the user's.

I will look at this ASAP. It is definitely possible to get the right labels on the plot. I notice also that there are graphical issues with margins when IDs are quite long, I'll see if I can do something about this too.

IsabelFE · 2021-03-16T16:12:44Z

Thanks for looking into this and reopening the issue. I see the point with the genome files IDs vs the PPanGGOLiN internal gene IDs.

I also found another issue with the current annotations on the spot plots. For some genes, only the COG ID is plotted, but a COG ID can be assigned to more than one GC, so ideally having both the KPL3070___757 and the COG ID will be perfect. But I understand that from a graphical point of view this is too long. Therefore, having only the gene ID as KPL3070___757 (not the internal IDs) is the most useful. I could easily search for the ID either on the gephi_table.xlsx or on the anvio summary table and found the COG ID or the other annotation info.

Thanks!

IsabelFE · 2021-03-16T18:29:25Z

@axbazin while you fix this I moved ahead and manually found the flanking proteins for my spot of interest and ran ppanggolin align. I got this error:

ppanggolin align -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --getinfo --draw_related --proteins analysis_PPanGGOLiN/Spots/Spot10.fasta -o analysis_PPanGGOLiN/OutputDefaultAnno/Spots/Spot10_output

2021-03-16 14:23:03 main.py:l180 INFO	Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin align -p analysis_PPanGGOLiN_Anvio7/OutputFromAnvio7/FromAnvio7.h5 --getinfo --draw_related --proteins analysis_PPanGGOLiN/Spots/Spot10.fasta -o analysis_PPanGGOLiN/OutputDefaultAnno/Spots/Spot10_output
2021-03-16 14:23:03 main.py:l181 INFO	PPanGGOLiN version: 1.1.141
2021-03-16 14:23:03 readBinaries.py:l37 INFO	Getting the current pangenome's status
Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 205, in main
    ppanggolin.align.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/align/alignOnPang.py", line 343, in launch
    align(pangenome = pangenome, proteinFile = args.proteins, output = args.output, tmpdir = args.tmpdir, identity = args.identity, coverage =args.coverage, defrag =args.defrag,cpu= args.cpu, getinfo =args.getinfo, draw_related = args.draw_related )
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/align/alignOnPang.py", line 309, in align
    raise Exception("Cannot use this function as your pangenome does not have gene families representatives associated to it. For now this works only if the clustering is realised by PPanGGOLiN.")
Exception: Cannot use this function as your pangenome does not have gene families representatives associated to it. For now this works only if the clustering is realised by PPanGGOLiN.

Do you have plans to add this functionality for externally clustered gene families? 🙏🏻

axbazin · 2021-03-17T09:55:48Z

For the last thing that you reported, it is probably possible, though that might require some work. I'll see if I can do something about this in the near futur. It would indeed be quite practical.

axbazin · 2021-03-19T10:18:27Z

Hello,

In the latest commits I added a new options so that you can choose what is used as label in the figures in priority. This will put gene IDs as labels:

ppanggolin spot -p pangenome.h5 --label_priority ID --draw_hotspots --output figures

If there is a label indicated in the case of provided annotations like yours, it should always be reported in priority in the figures
This should solve your first issue.

As another example, this command will display the behavior of showing gene names if they exist, or gene ID if there are no gene names:

ppanggolin spot -p pangenome.h5 --label_priority name,ID --draw_hotspots --output figures

Also, to maybe make things a bit easier, I've updated the '--interest' option to add elements of interest to the figure file names, if you want to see all spots with a given annotation, for example:

ppanggolin spot -p pangenome.h5 --label_priority ID --draw_hotspots --output figures --interest COG2264

should indicate which spots have genes with the annotation 'COG2264', either in the spot or as flanking persistent genes. This should work with family id, gene id, gene names and strings that can be found in the 'product' field of the annotation files.

We're looking to update this feature of drawing spots in the futur, as we're not entirely happy with how it renders currently, so it might evolve.

Adelme

IsabelFE · 2021-03-22T15:50:55Z

Hi Adelme,

This is really cool! It is great to be able to explore the plots with the name, family, or ID options. Makes going back and forward with Anvio much easier.
I love the --interest option, it is a really neat feature! I am already playing with it.

Thanks,

Isabel

axbazin · 2021-11-23T13:26:35Z

Hi,
I'll close this issue since problems have more or less been resolved.
Adelme

meren added a commit to merenlab/anvio that referenced this issue Feb 23, 2021

richer ID tags in GFF outputs.

7bd4efd

related to labgem/PPanGGOLiN#56

meren added a commit to merenlab/anvio that referenced this issue Feb 25, 2021

much cleaner.

20e42b6

re: labgem/PPanGGOLiN#56

IsabelFE mentioned this issue Mar 1, 2021

Add functional annotations to .gffs exported with anvi-get-sequences-for-gene-calls merenlab/anvio#1685

Closed

axbazin closed this as completed Mar 1, 2021

axbazin reopened this Mar 16, 2021

axbazin mentioned this issue Mar 17, 2021

add the possibility to use 'align' even when the clustering is not performed through PPanGGOLiN #58

Closed

axbazin mentioned this issue May 6, 2021

Unable to Render graph using RAST Annotations #44--FOLLOW UP #61

Closed

axbazin closed this as completed Nov 23, 2021

Importing Anvio Pangenomic gene clusters into PPAnGGOLiN #56

Importing Anvio Pangenomic gene clusters into PPAnGGOLiN #56

Comments

IsabelFE commented Feb 19, 2021 • edited Loading

IsabelFE commented Feb 19, 2021 • edited Loading

axbazin commented Feb 19, 2021

IsabelFE commented Feb 19, 2021 • edited Loading

meren commented Feb 21, 2021

axbazin commented Feb 22, 2021

meren commented Feb 22, 2021

IsabelFE commented Feb 22, 2021 • edited Loading

IsabelFE commented Feb 22, 2021

meren commented Feb 23, 2021

IsabelFE commented Feb 23, 2021

axbazin commented Feb 23, 2021

IsabelFE commented Feb 23, 2021

axbazin commented Feb 23, 2021 • edited Loading

meren commented Feb 23, 2021

IsabelFE commented Feb 23, 2021

meren commented Feb 23, 2021

IsabelFE commented Feb 24, 2021 • edited Loading

IsabelFE commented Feb 25, 2021

meren commented Feb 25, 2021

IsabelFE commented Feb 26, 2021

IsabelFE commented Feb 26, 2021

axbazin commented Feb 26, 2021

axbazin commented Mar 1, 2021

IsabelFE commented Mar 1, 2021

axbazin commented Mar 1, 2021

IsabelFE commented Mar 1, 2021 • edited Loading

meren commented Mar 1, 2021

IsabelFE commented Mar 1, 2021

axbazin commented Mar 1, 2021

IsabelFE commented Mar 16, 2021

IsabelFE commented Mar 16, 2021 • edited Loading

axbazin commented Mar 16, 2021 • edited Loading

IsabelFE commented Mar 16, 2021

IsabelFE commented Mar 16, 2021

axbazin commented Mar 17, 2021

axbazin commented Mar 19, 2021 • edited Loading

IsabelFE commented Mar 22, 2021

axbazin commented Nov 23, 2021

IsabelFE commented Feb 19, 2021 •

edited

Loading

IsabelFE commented Feb 19, 2021 •

edited

Loading

IsabelFE commented Feb 19, 2021 •

edited

Loading

IsabelFE commented Feb 22, 2021 •

edited

Loading

axbazin commented Feb 23, 2021 •

edited

Loading

IsabelFE commented Feb 24, 2021 •

edited

Loading

IsabelFE commented Mar 1, 2021 •

edited

Loading

IsabelFE commented Mar 16, 2021 •

edited

Loading

axbazin commented Mar 16, 2021 •

edited

Loading

axbazin commented Mar 19, 2021 •

edited

Loading