Add detection of pseudo genes #4

oschwengers · 2020-01-30T09:12:17Z

First hints for ideas:

oschwengers · 2022-01-27T12:13:02Z

idea for workflow:

start from CDS w/o annotations: either no hits at all or still hypothetical
re-search pseudogene candidates against Bakta DB using the following thresholds:

identity >=80%
query coverage >=80%
subject coverage >=40%

align subject AA sequence against translated CDS region plus N bp up-/downstream
look for frameshifts or stop codons
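
The candidate filter in the workflow above can be sketched in Python as follows. This is only an illustration of the three thresholds, not Bakta's actual code: the `Hit` container, its field names, and `is_pseudogene_candidate` are hypothetical, and coverage is assumed to be the aligned length divided by the full sequence length.

```python
from dataclasses import dataclass

# Thresholds as listed in the workflow idea above.
MIN_IDENTITY = 0.80
MIN_QUERY_COVERAGE = 0.80
MIN_SUBJECT_COVERAGE = 0.40


@dataclass
class Hit:
    """Hypothetical container for one protein alignment hit (not a Bakta class)."""
    query_id: str
    subject_id: str
    identity: float           # fraction of identical positions, 0..1
    query_length: int         # full length of the query protein
    subject_length: int       # full length of the database protein
    aligned_query_length: int
    aligned_subject_length: int


def is_pseudogene_candidate(hit: Hit) -> bool:
    """Return True if the hit passes all three thresholds."""
    query_coverage = hit.aligned_query_length / hit.query_length
    subject_coverage = hit.aligned_subject_length / hit.subject_length
    return (
        hit.identity >= MIN_IDENTITY
        and query_coverage >= MIN_QUERY_COVERAGE
        and subject_coverage >= MIN_SUBJECT_COVERAGE
    )
```

A hit that covers most of the query but only part of a much longer subject passes the filter, which is the pattern expected for a truncated gene.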

* Add initial pseudogene detection
* Fix type hints for Python 3.8
* Add pseudogene prediction tests
* Refactor pseudogene prediction functions
* Add pseudogene metrics always to the Genome Annotation Summary
* Reanable preliminary pseudogene-candidate cutoffs
* Set alignment length based on protein level as default
* Add minor improvements
* Add missing docstrings
* Refactor --pseudo argument to --skip-pseudo
* Add unused loss of stop codon test, comment loss of start codon
* Add unused 'loss of stop codon' case, comment 'loss of start codon'
* Refactor missed '--pseudo' argument to '--skip-pseudo'
* Refactor pseudogene constants
* Refactor unused code and inconsistencies
* Add test for get_elongated_cds()
* Fix get_elongated_cds() fill calculation
* Refactor constant variables
* Refactor subprocess.run(cmd)
* Refactor pseudogene variable names
* Add pseudogene count to annotation summary
* Change: drop unused pseudogene causes
* Refactor cause variable to causes

amvarani · 2022-08-05T20:50:16Z

Hi there,
So, how does the latest version of Bakta deal with pseudogenes?

oschwengers · 2022-08-15T09:24:17Z

Hi @amvarani, thanks for reaching out on this. We're actively working on a first pseudogene detection/annotation feature as a default step in the Bakta workflow. We're currently fixing the last little things and look forward to releasing a new version very soon (i.e. within the next few weeks).

There are different strategies for detecting pseudogenes. The most promising would be to use a closely related genome; however, as one is often not available in a de novo assembly/annotation workflow, we address this without external genome information.
At this point, Bakta looks for gene remnants that could not be annotated (hypothetical proteins) because their only Diamond hits have less than 80% subject coverage. In a second, more relaxed search, Bakta looks for decent hits against the database with subject coverages of less than 80%. These references are then blasted against the six-frame-translated CDS regions of the hypotheticals plus a 300 bp extension upstream and downstream. If Bakta finds conserved homology, it tries to detect indels/point mutations and start/stop codon events. If it finds any, the hypothetical gene is annotated as a pseudogene. Later, we'll extend this approach by taking into account spare genomic regions (w/o annotations), using some information to detect & annotate translational exceptions, and so on.
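
As a rough illustration of the extension and six-frame translation step described above, here is a minimal pure-Python sketch. It is not Bakta's implementation: all function names are hypothetical, coordinates are assumed to be 0-based and half-open, linear contigs are assumed (no wrap-around for circular replicons), and the codon table holds only the standard amino acid assignments, ignoring alternative start codons.

```python
from itertools import product

# Standard genetic code, in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def extended_region(contig: str, start: int, stop: int, flank: int = 300) -> str:
    """CDS region plus `flank` bp on both sides, clipped at the contig ends."""
    return contig[max(0, start - flank):min(len(contig), stop + flank)]


def six_frame_translations(seq: str) -> dict:
    """Map (strand, offset) to the translation of that reading frame."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(strand_seq[offset:])
    return frames


def stop_positions(protein: str) -> list:
    """Indices of stop codons ('*') in a translated frame."""
    return [i for i, aa in enumerate(protein) if aa == "*"]
```

A reference protein that aligns across a `*` in one frame would hint at an internal stop codon, and an alignment that switches between frames on the same strand would hint at a frameshift.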

I hope this answers your question. Best regards!

* Fix pseudogene position calculation #4
* Refactor pseudogene logging #4
* Change pseudogene tests to use new positioning logic
* Rewrite pseudogene positioning logic
* Add pseudogene stop codon point mutation cause
* Add pseudogene stop codon point mutation cause test
* Fix type hint TypeError
* Refactor variable names
* Refactor missed variable names

oschwengers added feature maybe labels Jun 14, 2021

oschwengers removed the maybe label Jan 17, 2022

oschwengers linked a pull request Jun 30, 2022 that will close this issue

Detect pseudogenes #115

Merged

oschwengers closed this as completed in #115 Jun 30, 2022

oschwengers added a commit that referenced this issue Aug 9, 2022

add functional annotations for pseudogenes #4

fdfddcb

oschwengers added a commit that referenced this issue Aug 9, 2022

fix pseudo-candidate object #4

a1210ae

oschwengers added a commit that referenced this issue Aug 17, 2022

refactor pseudogenes #4

f04d7f6

oschwengers added a commit that referenced this issue Aug 17, 2022

fix truncation of pseudogenes #4

4d511fe

oschwengers added a commit that referenced this issue Aug 17, 2022

disjoint analyses hypotheticals and pseudogenes #4

b5203d3

oschwengers self-assigned this Aug 25, 2022

oschwengers added this to the v1.5.0 milestone Aug 25, 2022

jhahnfeld added a commit to jhahnfeld/bakta that referenced this issue Aug 25, 2022

Fix pseudogene position calculation oschwengers#4

7cdf964

jhahnfeld added a commit to jhahnfeld/bakta that referenced this issue Aug 25, 2022

Refactor pseudogene logging oschwengers#4

d26299e

jhahnfeld mentioned this issue Aug 25, 2022

Fix pseudogene positions #122

Merged

oschwengers added a commit that referenced this issue Aug 29, 2022

add skip-pseudo option to CWL file #4

e945147

oschwengers added a commit that referenced this issue Aug 30, 2022

add pseudogene comments to readmde #4

836593e

oschwengers added a commit that referenced this issue Aug 30, 2022

rename pseudo-candidate to pseudo-inference #4

2cd684b

oschwengers added a commit that referenced this issue Aug 30, 2022

polish pseudogene descriptions #4

da87b68
