Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE REQUEST] Preserve prodigal metadata for anvi-export-gene-calls #2181

Closed
Ge0rges opened this issue Nov 28, 2023 · 10 comments
Closed

Comments

@Ge0rges
Copy link
Collaborator

Ge0rges commented Nov 28, 2023

The need

Identifying things like RBS motif, start codon, etc. can come in handy with gene-aware analyses. Such analyses may become more frequent in the future especially given the current effort

to soon implement a general framework for storing the coordinates of any type of genomic feature

#2152

The solution

From @ivagljiva on discord:

to modify what we store in the [contigs] DB, and then add a flag to anvi-export-gene-calls (ie, in the genes_in_contigs table) to include that information in the output.

Perhaps this should be relegated to the effort mentioned by @semiller10 in #2152, but I thought it pertinent to bring it up.

Beneficiaries

Anyone doing gene aware analyses.

@meren
Copy link
Member

meren commented Nov 28, 2023

Just a quick note as I'm passing through this: When we do this we need to think of a way that doesn't require a specific design that locks us in with Prodigal for gene calling. A generic design that can keep track of additional features for genes (or genomic regions, or nucleotides, or codons) that can also be populated from Prodigal output.

@Ge0rges Ge0rges changed the title [FEATURE REQUEST] Preserve prodigal metadata for anti-export-gene-calls [FEATURE REQUEST] Preserve prodigal metadata for anvi-export-gene-calls Jan 17, 2024
@Ge0rges
Copy link
Collaborator Author

Ge0rges commented Feb 21, 2024

Because I understand this exists in the context of broader changes that need be done (that are beyond my current mastery of the Anvi'o codebase), here is a temporary pseudo-solution for anyone who ends up here.

I wrote this script which essentially piggy backs on Anvi'o prodigal caller and response parser:

import sys
sys.path.insert(1, '/path/to/anvio/anvio/drivers')
from prodigal import Prodigal

class ExtendedProdigal(Prodigal):
    def __init__(self):
        super().__init__()

        self.available_parsers = {'v2.6.3': self.extended_parser,
                                  'v2.6.2': self.extended_parser,
                                  'v2.6.0': self.extended_parser}
        
        super().check_version()

    def extended_parser(self, defline):
        """parses this, but keep more information than the default anvio parser:

            204_10M_MERGED.PERFECT.gz.keep_contig_1720_7 # 7086 # 7262 # 1 # ID=3_7;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.294

        """
    
        hit = self._Prodigal__parser_1(defline)
        
        fields = defline.split()

        additional_attributes = fields[8].split(';')

        hit['start_codon'] = additional_attributes[2].split('=')[1]
        hit['rbs_motif'] = additional_attributes[3].split('=')[1]
        hit['rbs_spacer'] = additional_attributes[4].split('=')[1]
        hit['gc_cont'] = additional_attributes[5].split('=')[1]

        return hit

prodigal = ExtendedProdigal()
prodigal.num_threads = int(sys.argv[3])
gene_calls, amino_acids = prodigal.process(sys.argv[1], sys.argv[2])

This won't update your contigs database or otherwise modify any Anvi'o functionality, however if you call it with a FASTA as input, it will return the same dict anvi'o generates in addition to keeping the other prodigal outputs chosen here (e.g. 'gc_cont'). Example run command python prodigal_caller.py mags/Pelagibacter_r-contigs.fna output_folder 40 where the format is python script_name.py input_fasta_path output_folder_for_anvio thread_number.

@Ge0rges
Copy link
Collaborator Author

Ge0rges commented Feb 21, 2024

@meren sidenote: while working on this I noticed an important typo here. I believe that line should read v2.6.0. I didn't want to make a pull request just for that one typo though (but I can if you'd like!).

meren added a commit that referenced this issue Feb 23, 2024
@meren
Copy link
Member

meren commented Feb 23, 2024

Thanks for letting me know about the typo, @Ge0rges. I'm not sure how did it survive this long. I guess because no one is using Prodigal v2.6.0 anymore.

Your temporary workaround is masterful and beautiful. Regarding the original feature request: this has been a difficult one to address because it requires a change in the way we keep gene calls in our relevant table with the addition of a few new columns, which will likely add millions of additional data points to that table, increasing the contigs-db size by a lot while only being relevant to a fraction of the users.

A better solution would be to extend that table if anvi-gen-contigs-db receives a specific flag (i.e., --store-extended-gene-call-data) to store larger amount of information regarding gene calls, rather than doing it every time anvi-gen-contigs-db. But since this table will already go through a change soon with the eukaryotic gene calls, I've been sitting on my hands :)

@Ge0rges
Copy link
Collaborator Author

Ge0rges commented Feb 24, 2024

That makes sense. I was wondering if the revamped mentioned in #2152 is thought of to affect the contigs-db or to involve the creation of new type of artifact centered around genomes/MAGs? If the latter was the case, this feature could be relegated to that artifact rather than expanding on contigs-db.

@meren
Copy link
Member

meren commented Feb 26, 2024

I think it will have to be new, optional tables in contigs-db. We already have the code to mark nucleotide, codon/amino acid positions in contig sequences in contigs-db files, but they are not used outside of anvi'o structure currently. We will have to make them more accessible to mainstream programs :)

The best way to get these things done is to have a project in the lab that needs this solution to be in place. That's why there is a delay currently :(

@Ge0rges
Copy link
Collaborator Author

Ge0rges commented Jul 3, 2024

Hi @meren I've run into an issue with my workaround and wanted to see if you had any ideas. My goal is to get list of gene calls, and a list of functions, (for an anvio curated MAG, i.e. bin in my contigs DB/ a fasta file) that I can use separately and then downstream match the gene_callers_id to execute other analyses.

The issue with my workaround is that it produces an export of gene calls with gene_callers_id different to those that are already in the contigs db. But simultaneously, I need that extra metadata from the more extensive prodigal output parsing.

What would be the good way to go about this? I thought of modifying the function annotation commands to take in a fasta file as well but that seems pretty daunting and perhaps inefficient?

@meren
Copy link
Member

meren commented Jul 13, 2024

Dear @Ge0rges, I just sent a PR for anvio-dev (#2306), which replaces prodigal with pyrodigal. The PR also includes a new parameter for anvi-gen-contigs-database, --full-gene-calling-report, that I hope offers a partial solution to this issue. When an output file path is provided to --full-gene-calling-report, anvi-gen-contigs-database generates an additional TAB-delimited file that includes all data from the gene caller and looks like this:

gene_callers_id contig start stop direction partial partial_begin partial_end confidence gc_cont rbs_motif rbs_spacer score cscore rscore sscore start_type translation_table tscore uscore sequence translated_sequence
0 NC_009091 169 1327 f False False False 99.99 0.3013 153.96808116167358 151.14841116167358 -0.9874499999999999 2.8196700000000003 ATG 11 2.8971 0.9100200000000003 ATGGAAATTATTTGTAATCAAAATGAATTA(...) MEIICNQNELNNAIQLVSKAVASRPTHPIL(...)
1 NC_009091 1388 2036 f False False False 99.99 0.2885 GGA/GAG/AGG 5-10bp 81.87259573141999 71.58136573142 2.71875 10.291229999999999 ATG 11 2.8971 4.67538 ATGGTCCTTAATTATGGAAATGGTGAAAAT(...) MVLNYGNGENVWMHPPVHRILGWYSRPSNL(...)
2 NC_009091 2039 4379 f False False False 99.99 0.3260 GGA/GAG/AGG 5-10bp 352.2303246874159 347.0894946874159 2.71875 5.14083 ATG 11 2.8971 -0.47502 ATGATAAATCAAGAAAATAATGATCTATAT(...) MINQENNDLYDLNEALQVENLTLNDYEEIC(...)
3 NC_009091 4426 5887 f False False False 99.99 0.3347 194.19481790304437 188.61028790304437 -0.9874499999999999 5.584530000000002 ATG 11 2.8971 3.6748800000000017 ATGTGCGGAATAGTTGGAATCGTTTCTTCA(...) MCGIVGIVSSDDVNQQIYDSLLLLQHRGQD(...)
4 NC_009091 5883 8325 r False False False 99.99 0.2731 318.93104438253084 315.33533438253085 -0.9874499999999999 3.5957100000000004 ATG 11 2.8971 1.6860600000000003 ATGGATAAGAAAAACTTCACTTCTATCTCA(...) MDKKNFTSISLQEEMQRSYLEYAMSVIVGR(...)
5 NC_009091 8402 9266 r False False False 99.99 0.2719 99.52927145207937 93.12346145207937 -0.9874499999999999 6.405809999999998 ATG 11 2.8971 4.496159999999998 ATGAAAAAGTTTTTACAAAGAATACTCTGG(...) MKKFLQRILWISLISFYFLQIKKVQAIVPY(...)
6 NC_009091 9262 10219 r False False False 99.99 0.3239 107.44082411540528 104.45237411540528 -0.9874499999999999 2.9884500000000003 ATG 11 2.8971 1.0788000000000002 ATGATTAATAGGATTCAAGACAAAAAAGAA(...) MINRIQDKKEISKKLKERAIFEGFTIAGIA(...)
7 NC_009091 10365 11100 f False False False 99.99 0.3319 71.66956065705634 79.71184065705633 -0.9874499999999999 -8.042279999999998 TTG 11 -9.60915 2.55432 TTGGTTGAATCTAATCAAAATCAAGATTCC(...) MVESNQNQDSNLGSRLQQDLKNDLIAGLLV(...)
8 NC_009091 11103 11721 f False False False 99.99 0.2993 69.10757680331382 69.03014680331381 -0.9874499999999999 0.07743000000000055 ATG 11 2.8971 -1.8322199999999995 ATGCATAATAGATCTCTTTCTAGAGAATTA(...) MHNRSLSRELSLISLGLIKDKGDLKLNKFQ(...)
(...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...) (...)

The gene_callers_id column here will obviously match to the gene calls stored in the contigs-db file.

What do you think about this as a solution?

@Ge0rges
Copy link
Collaborator Author

Ge0rges commented Aug 9, 2024

Hi @meren sorry this slipped by me for too long (I was on vacation last month). Your solution is great. I had worked around my work around so I lost focus of this, but this is an elegant feature to add to Anvio. Thanks for working on it! Feel free to close this issue.

@meren
Copy link
Member

meren commented Aug 10, 2024

Perfect :) Thank you for the feedback, @Ge0rges. I'm closing the issue.

@meren meren closed this as completed Aug 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants