Add detection of pseudo genes #4

Closed
oschwengers opened this issue Jan 30, 2020 · 3 comments · Fixed by #115

Comments

@oschwengers
Owner

oschwengers commented Jan 30, 2020

First hints for ideas:

@oschwengers
Owner Author

idea for workflow:

  1. start from CDS w/o annotations: either no hits at all or still hypothetical
  2. re-search pseudogene candidates against Bakta DB using the following thresholds:
     • identity >=80%
     • query coverage >=80%
     • subject coverage >=40%
  3. align subject AA sequence against translated CDS region plus N bp up-/downstream
  4. look for frameshifts or stop codons
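
A minimal sketch of steps 2 and 4, assuming each database hit is already available as a record with percent identity and coverages. The `Hit` class, the constant names, and both helper functions are hypothetical and only illustrate the thresholds listed above; they are not taken from the Bakta code base, and frameshift detection is left out.

```python
from dataclasses import dataclass
from typing import List

# Thresholds from the workflow idea above (all in percent).
MIN_IDENTITY = 80.0
MIN_QUERY_COVERAGE = 80.0
MIN_SUBJECT_COVERAGE = 40.0


@dataclass
class Hit:
    """Hypothetical record for one database hit of a hypothetical CDS."""
    query_id: str
    subject_id: str
    identity: float
    query_coverage: float
    subject_coverage: float


def is_pseudogene_candidate(hit: Hit) -> bool:
    """Return True if the hit passes all three relaxed thresholds."""
    return (
        hit.identity >= MIN_IDENTITY
        and hit.query_coverage >= MIN_QUERY_COVERAGE
        and hit.subject_coverage >= MIN_SUBJECT_COVERAGE
    )


def select_candidates(hits: List[Hit]) -> List[Hit]:
    """Keep only the hits that qualify as pseudogene candidates."""
    return [hit for hit in hits if is_pseudogene_candidate(hit)]


def internal_stop_positions(protein: str) -> List[int]:
    """Return 0-based positions of stop symbols ('*') before the last residue."""
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]


hits = [
    Hit("cds_1", "ref_A", identity=92.0, query_coverage=95.0, subject_coverage=55.0),
    Hit("cds_2", "ref_B", identity=85.0, query_coverage=90.0, subject_coverage=30.0),
    Hit("cds_3", "ref_C", identity=70.0, query_coverage=99.0, subject_coverage=60.0),
]
candidates = select_candidates(hits)
```

Here only `cds_1` passes: `cds_2` fails the subject coverage cutoff and `cds_3` fails the identity cutoff. A translated region such as `"MK*LV*"` would be flagged by `internal_stop_positions` because of the `*` at position 2, while the terminal `*` is ignored.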

@oschwengers oschwengers linked a pull request Jun 30, 2022 that will close this issue
oschwengers pushed a commit that referenced this issue Jun 30, 2022
* Add initial pseudogene detection
* Fix type hints for Python 3.8
* Add pseudogene prediction tests
* Refactor pseudogene prediction functions
* Add pseudogene metrics always to the Genome Annotation Summary
* Re-enable preliminary pseudogene-candidate cutoffs
* Set alignment length based on protein level as default
* Add minor improvements
* Add missing docstrings
* Refactor --pseudo argument to --skip-pseudo
* Add unused loss of stop codon test, comment loss of start codon
* Add unused 'loss of stop codon' case, comment 'loss of start codon'
* Refactor missed '--pseudo' argument to '--skip-pseudo'
* Refactor pseudogene constants
* Refactor unused code and inconsistencies
* Add test for get_elongated_cds()
* Fix get_elongated_cds() fill calculation
* Refactor constant variables
* Refactor subprocess.run(cmd)
* Refactor pseudogene variable names
* Add pseudogene count to annotation summary
* Change: drop unused pseudogene causes
* Refactor cause variable to causes
@amvarani

amvarani commented Aug 5, 2022

Hi there,
So, how does the latest version of Bakta deal with pseudogenes?

@oschwengers
Owner Author

Hi @amvarani, thanks for reaching out on this. We're actively working on a first pseudogene detection/annotation feature as a default step in the Bakta workflow. We're currently fixing the last little things and look forward to releasing a new version very soon (i.e. within the next few weeks).

There are different strategies for detecting pseudogenes. The most promising would be to use a closely related genome. However, as such a genome is often not available in a de novo assembly/annotation workflow, we address this without external genome information.
At this point, Bakta looks for gene remnants that could not be annotated (hypothetical proteins) because their Diamond hits have less than 80% subject coverage. In a second, more relaxed search, Bakta looks for decent hits against the database with subject coverages of less than 80%. These references are then blasted against the 6-frame-translated CDS regions of the hypotheticals plus a 300 bp extension in the upstream and downstream directions. If Bakta finds a conserved homology, it tries to detect indels/mutations and start/stop codon events. If so, the hypothetical gene is annotated as a pseudogene. Later, we'll extend this approach by taking into account spare genomic regions (w/o annotations), using some information to detect & annotate translational exceptions, and so on.
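
The region extension and 6-frame translation described above can be sketched roughly as follows. This is an illustration only, assuming 0-based half-open coordinates, an uppercase ACGT sequence, and the amino acid assignments shared by the standard and bacterial genetic codes; all function names are hypothetical and not Bakta's own, and the actual search in Bakta may be delegated to an external aligner.

```python
from itertools import product
from typing import Dict, Tuple

# Codon table built from the NCBI ordering (bases T, C, A, G); '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extend_region(start: int, end: int, contig_length: int, flank: int = 300) -> Tuple[int, int]:
    """Extend a 0-based half-open CDS interval by `flank` bp, clamped to the contig."""
    return max(0, start - flank), min(contig_length, end + flank)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> Dict[Tuple[str, int], str]:
    """Translate all three offsets on both strands, keyed by (strand, offset)."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(strand_seq[offset:])
    return frames


contig = "ATGAAATAG"
region_start, region_end = extend_region(3, 6, len(contig))
frames = six_frame_translations(contig[region_start:region_end])
```

Because the toy contig is shorter than the flank, the extended region is clamped to the whole contig. Each of the six translations could then be compared against the reference protein to locate premature stops or frame changes.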

I hope this answers your question. Best regards!

oschwengers added a commit that referenced this issue Aug 17, 2022
oschwengers added a commit that referenced this issue Aug 17, 2022
@oschwengers oschwengers self-assigned this Aug 25, 2022
@oschwengers oschwengers added this to the v1.5.0 milestone Aug 25, 2022
jhahnfeld added a commit to jhahnfeld/bakta that referenced this issue Aug 25, 2022
jhahnfeld added a commit to jhahnfeld/bakta that referenced this issue Aug 25, 2022
oschwengers pushed a commit that referenced this issue Aug 30, 2022
* Fix pseudogene position calculation #4
* Refactor pseudogene logging #4
* Change pseudogene tests to use new positioning logic
* Rewrite pseudogene positioning logic
* Add pseudogene stop codon point mutation cause
* Add pseudogene stop codon point mutation cause test
* Fix type hint TypeError
* Refactor variable names
* Refactor missed variable names
oschwengers added a commit that referenced this issue Aug 30, 2022