feat: Add canonical_transcript() method #1196
- Add gene TypeAlias to base_params.py for type annotations
- Implement canonical_transcript() method in genome_features.py
  * Finds the transcript with the longest total exon length
  * Supports gene ID and gene name (case-insensitive) lookup
  * Includes comprehensive error handling and debug logging
- Add 11 comprehensive test cases
  * Basic functionality (ID and name lookup)
  * Error handling (non-existent genes, empty strings)
  * Edge cases (whitespace, case sensitivity, single-transcript genes)
  * Algorithm correctness verification
  * Multi-species support (ag3, af1, adir1)

All tests passing: 11/11 new + 24 existing = 35 total
Pull request overview
This PR adds a new canonical_transcript() API to AnophelesGenomeFeaturesData to identify a gene’s canonical transcript, defined as the transcript with the greatest total exon length, and introduces a gene parameter TypeAlias plus new tests covering the new behavior.
Changes:
- Add base_params.gene TypeAlias for consistent typing/documentation of gene identifiers.
- Implement AnophelesGenomeFeaturesData.canonical_transcript() with ID/name lookup and transcript selection by summed exon lengths.
- Add unit tests for canonical transcript lookup, error cases, and multi-species fixtures.
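The selection rule summarized above (pick the transcript whose exons sum to the greatest length) can be sketched in plain Python. This is an illustrative sketch only, not the code in this PR: `pick_canonical` and the `(transcript_id, start, end)` tuples are hypothetical stand-ins for the exon records the library reads from the genome annotation, and the coordinates are assumed to be 1-based and inclusive.

```python
from collections import defaultdict


def pick_canonical(exons):
    """Return the transcript ID with the greatest summed exon length.

    `exons` is an iterable of (transcript_id, start, end) tuples
    (hypothetical layout), with 1-based inclusive coordinates.
    """
    totals = defaultdict(int)
    for transcript_id, start, end in exons:
        # Inclusive coordinates: an exon spanning 100..199 covers 100 bases.
        totals[transcript_id] += end - start + 1
    if not totals:
        raise ValueError("no exons found")
    return max(totals, key=lambda tid: totals[tid])


# Made-up example data: T1 totals 150 bp, T2 totals 300 bp.
exons = [
    ("T1", 100, 199),
    ("T1", 300, 349),
    ("T2", 100, 399),
]
canonical = pick_canonical(exons)
```

In this sketch, ties go to whichever transcript was seen first; the PR text does not say how the real method resolves ties.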
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| malariagen_data/anoph/base_params.py | Introduces gene TypeAlias for gene identifier parameters. |
| malariagen_data/anoph/genome_features.py | Adds canonical_transcript() implementation and associated logging/error handling. |
| tests/anoph/test_genome_features.py | Adds test cases for canonical transcript behavior across fixtures/species. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Give me a few moments; let me implement the Copilot suggestions and test everything locally first.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Two critical bugs were introduced in commit 66607b7:

1. API misuse: attempted to pass a list to genome_feature_children(parent: str)
   - Changed: genome_feature_children(parent=transcript_ids, ...)
   - Result: SyntaxError/TypeError in all tests
2. Missing coordinate adjustment: removed +1 from exon length calculation
   - Changed: (end - start) instead of (end - start + 1)
   - Result: Incorrect transcript length calculations

This fix reverts to the original per-transcript iteration approach while preserving the critical +1 for 1-based inclusive coordinates.

Test Results:
- All 35 genome_features tests pass (11 canonical + 24 existing)
- All pre-commit checks pass (ruff, black, flake8)
- Zero regressions
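To see why the +1 matters, here is a small sketch of how dropping it can change which transcript is chosen, not just the reported lengths. The helper `total_exon_length` and both transcripts are made up for illustration; only the `(end - start + 1)` versus `(end - start)` distinction comes from the commit message.

```python
def total_exon_length(exons, inclusive=True):
    """Sum exon lengths for a list of (start, end) pairs.

    With 1-based inclusive coordinates each exon covers end - start + 1
    bases; inclusive=False reproduces the off-by-one calculation.
    """
    adjust = 1 if inclusive else 0
    return sum(end - start + adjust for start, end in exons)


# Hypothetical transcripts: A has three 10 bp exons, B has one 29 bp exon.
transcript_a = [(1, 10), (21, 30), (41, 50)]
transcript_b = [(1, 29)]

correct = {
    "A": total_exon_length(transcript_a),
    "B": total_exon_length(transcript_b),
}
buggy = {
    "A": total_exon_length(transcript_a, inclusive=False),
    "B": total_exon_length(transcript_b, inclusive=False),
}

longest_correct = max(correct, key=lambda tid: correct[tid])
longest_buggy = max(buggy, key=lambda tid: buggy[tid])
```

Because the error is one base per exon, it penalizes transcripts with many exons: here A is longest with the correct arithmetic (30 vs 29), while B wins once the +1 is dropped (27 vs 28).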
- Line 446: Replace 'key=transcript_lengths.get' with 'key=lambda tid: transcript_lengths[tid]' to resolve 'Argument key has incompatible type' type checker error
- Add .mypy_cache to .gitignore to prevent type checking cache files from being staged
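A minimal sketch of the pattern this commit describes, using a made-up `transcript_lengths` dictionary. A likely reason for the type checker complaint is that `dict.get` is annotated as possibly returning `None`, which is not an orderable key for `max`; indexing with `[]` always yields an `int`. That explanation is an assumption, since the commit message only quotes part of the error.

```python
# Hypothetical data standing in for the per-transcript totals.
transcript_lengths: dict[str, int] = {"T1": 150, "T2": 300, "T3": 275}

# Runs fine, but type checkers may reject it because .get can return None:
#   longest = max(transcript_lengths, key=transcript_lengths.get)

# Indexing always returns an int, so the key function type-checks.
longest = max(transcript_lengths, key=lambda tid: transcript_lengths[tid])

# An equivalent alternative that avoids the second lookup.
longest_via_items = max(transcript_lengths.items(), key=lambda kv: kv[1])[0]
```

Both forms behave the same at runtime; the choice is only about keeping the type checker satisfied.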
Hi @jonbrenas, I have fixed the broken commits that I merged directly from the Copilot suggestions, and I have tested everything locally.
Thanks @kunal-10-cloud, could you change the tests so that they work with the same kind of
Sure, give me a few moments; I will look into it, @jonbrenas.
@jonbrenas, I have made the relevant changes and tested them locally to check that everything works.
Summary
Implement canonical_transcript() method to retrieve the longest transcript for a given gene by total exon length.
Description
This PR addresses issue #794 by adding a new method to identify the canonical (longest) transcript for a gene in the Anopheles genome. The method supports gene lookup by both ID and name with case-insensitive matching and robust error handling.
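The lookup behavior described here (match by gene ID or by name, ignoring case and surrounding whitespace, with a clear error otherwise) might look roughly like the sketch below. Everything in it is hypothetical: `find_gene`, the record layout, and the gene IDs and names are invented for illustration, and the real method's matching rules and exception types may differ.

```python
def find_gene(genes, query):
    """Resolve a gene ID or gene name to a gene ID (hypothetical helper).

    `genes` is a list of dicts with an "ID" key and an optional "Name" key.
    Matching ignores case and surrounding whitespace; IDs are tried first.
    """
    if not isinstance(query, str) or not query.strip():
        raise ValueError("gene must be a non-empty string")
    needle = query.strip().lower()
    for gene in genes:
        if gene["ID"].lower() == needle:
            return gene["ID"]
    for gene in genes:
        name = gene.get("Name")
        if name and name.lower() == needle:
            return gene["ID"]
    raise ValueError(f"gene {query!r} not found")


# Made-up records; these are not real annotation entries.
genes = [
    {"ID": "GENE0001", "Name": "abc1"},
    {"ID": "GENE0002"},  # a gene without a name
]
by_id = find_gene(genes, "gene0002")
by_name = find_gene(genes, "  ABC1 ")
```

Trying IDs before names keeps the result unambiguous if some gene's name happens to equal another gene's ID; that ordering is a design choice of this sketch, not something the PR states.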
Changes
- malariagen_data/anoph/base_params.py: Added gene TypeAlias for consistent gene parameter typing
- malariagen_data/anoph/genome_features.py: Implemented canonical_transcript() method with:
- tests/anoph/test_genome_features.py: Added 11 test cases covering:
Technical Details
- genome_feature_children() method for robust exon handling (including multi-parent exons per issue "Accommodate exons with multiple parents in plot_transcript()" #334)
- @_check_types, @doc())
Testing
Checklist