feat: Add canonical_transcript() method #1196

Merged
jonbrenas merged 9 commits into malariagen:master from kunal-10-cloud:feature/gh794-canonical-transcript
Mar 23, 2026

kunal-10-cloud (Contributor) commented Mar 22, 2026

Summary

Implement canonical_transcript() method to retrieve the longest transcript for a given gene by total exon length.

Description

This PR addresses issue #794 by adding a new method to identify the canonical (longest) transcript for a gene in the Anopheles genome. The method supports gene lookup by both ID and name with case-insensitive matching and robust error handling.

Changes

  • malariagen_data/anoph/base_params.py: Added gene TypeAlias for consistent gene parameter typing
  • malariagen_data/anoph/genome_features.py: Implemented canonical_transcript() method with:
    • Gene ID and name-based lookup (case-insensitive)
    • Transcript selection by maximum total exon length
    • Comprehensive error handling with helpful messages
    • Debug-level logging for troubleshooting
  • tests/anoph/test_genome_features.py: Added 11 test cases covering:
    • Basic functionality (ID and name lookup)
    • Error handling (non-existent genes, empty strings)
    • Edge cases (whitespace handling, case insensitivity, single-transcript genes)
    • Algorithm correctness
    • Multi-species support (ag3, af1, adir1)
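
The lookup behaviour listed above (ID or name, case-insensitive, whitespace-tolerant, with an error for empty or unknown genes) can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: `GENES`, `find_gene_id`, the record layout, and the choice of `ValueError` are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical, hand-made gene records; real annotations come from a GFF file.
GENES: List[Dict[str, str]] = [
    {"ID": "GENE0001", "Name": "abc1"},
    {"ID": "GENE0002", "Name": "Xyz2"},
]


def find_gene_id(gene: str, genes: List[Dict[str, str]] = GENES) -> str:
    """Resolve a gene ID or gene name to a gene ID, ignoring case and padding."""
    query = gene.strip()
    if not query:
        # Assumed error type; the PR may raise something different.
        raise ValueError("Gene identifier must be a non-empty string.")
    folded = query.casefold()
    # Check IDs first, then names, so that an ID match always wins.
    for key in ("ID", "Name"):
        for record in genes:
            if record[key].casefold() == folded:
                return record["ID"]
    raise ValueError(f"Gene {query!r} not found by ID or name.")
```

With these records, `find_gene_id("  gene0002 ")` and `find_gene_id("XYZ2")` both resolve to the same gene, while an empty or unknown string raises an error with a descriptive message.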

Technical Details

  • The canonical transcript is defined as the transcript with the greatest total exon length in base pairs (the sum of its exon lengths)
  • Leverages the existing genome_feature_children() method for robust exon handling, including multi-parent exons (see issue #334, "Accommodate exons with multiple parents in plot_transcript()")
  • Integrated with existing codebase patterns and decorators (@_check_types, @doc())
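
As a minimal sketch of the selection rule described above, assuming exon coordinates are 1-based and inclusive as in GFF: the transcript IDs, the `EXONS` mapping, and both function names are hypothetical and stand in for whatever the real method obtains from the annotation data.

```python
from typing import Dict, List, Tuple

# Hypothetical exon coordinates (1-based, inclusive start and end) per transcript.
EXONS: Dict[str, List[Tuple[int, int]]] = {
    "GENE0001-RA": [(100, 199), (300, 349)],  # 100 + 50 = 150 bp
    "GENE0001-RB": [(100, 199), (300, 399)],  # 100 + 100 = 200 bp
    "GENE0001-RC": [(100, 149)],              # 50 bp
}


def total_exon_length(exons: List[Tuple[int, int]]) -> int:
    """Sum exon lengths; +1 because both start and end are included."""
    return sum(end - start + 1 for start, end in exons)


def pick_canonical(exons_by_transcript: Dict[str, List[Tuple[int, int]]]) -> str:
    """Return the transcript ID with the greatest total exon length."""
    if not exons_by_transcript:
        raise ValueError("No transcripts to choose from.")
    lengths = {
        tid: total_exon_length(exons)
        for tid, exons in exons_by_transcript.items()
    }
    # On a tie, max() keeps the first transcript in iteration order.
    return max(lengths, key=lambda tid: lengths[tid])


print(pick_canonical(EXONS))  # GENE0001-RB
```

A gene with a single transcript goes through the same path and simply returns that transcript.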

Testing

  • All 11 new tests passing
  • All 24 existing tests still passing (zero regressions)
  • Total: 35/35 tests passing

Checklist

  • Code follows project conventions
  • Pre-commit hooks pass (black, ruff, flake8)
  • Tests written and passing
  • Documentation added (docstring with parameters, returns, raises, examples)
  • No breaking changes

- Add gene TypeAlias to base_params.py for type annotations
- Implement canonical_transcript() method in genome_features.py
  * Finds the transcript with the longest total exon length
  * Supports gene ID and gene name (case-insensitive) lookup
  * Includes comprehensive error handling and debug logging
- Add 11 comprehensive test cases
  * Basic functionality (ID and name lookup)
  * Error handling (non-existent genes, empty strings)
  * Edge cases (whitespace, case sensitivity, single-transcript genes)
  * Algorithm correctness verification
  * Multi-species support (ag3, af1, adir1)

All tests passing: 11/11 new + 24 existing = 35 total

Copilot AI left a comment

Pull request overview

This PR adds a new canonical_transcript() API to AnophelesGenomeFeaturesData to identify a gene’s canonical transcript, defined as the transcript with the greatest total exon length, and introduces a gene parameter TypeAlias plus new tests covering the new behavior.

Changes:

  • Add base_params.gene TypeAlias for consistent typing/documentation of gene identifiers.
  • Implement AnophelesGenomeFeaturesData.canonical_transcript() with ID/name lookup and transcript selection by summed exon lengths.
  • Add unit tests for canonical transcript lookup, error cases, and multi-species fixtures.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| malariagen_data/anoph/base_params.py | Introduces gene TypeAlias for gene identifier parameters. |
| malariagen_data/anoph/genome_features.py | Adds canonical_transcript() implementation and associated logging/error handling. |
| tests/anoph/test_genome_features.py | Adds test cases for canonical transcript behavior across fixtures/species. |

Comment thread malariagen_data/anoph/genome_features.py Outdated
Comment thread malariagen_data/anoph/genome_features.py
Comment thread malariagen_data/anoph/genome_features.py Outdated
Comment thread tests/anoph/test_genome_features.py
Comment thread tests/anoph/test_genome_features.py Outdated
kunal-10-cloud and others added 3 commits March 23, 2026 00:06
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
kunal-10-cloud (Contributor, Author) commented

Give me a few moments. Let me implement the Copilot suggestions and test everything locally first.

kunal-10-cloud and others added 4 commits March 23, 2026 00:08
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Two critical bugs were introduced in commit 66607b7:

1. API Misuse: Attempted to pass a list to genome_feature_children(parent: str)
   - Changed: genome_feature_children(parent=transcript_ids, ...)
   - Result: SyntaxError/TypeError in all tests

2. Missing coordinate adjustment: Removed +1 from exon length calculation
   - Changed: (end - start) instead of (end - start + 1)
   - Result: Incorrect transcript length calculations

This fix reverts to the original per-transcript iteration approach while
preserving the critical +1 for 1-based inclusive coordinates.

Test Results:
- All 35 genome_features tests pass (11 canonical + 24 existing)
- All pre-commit checks pass (ruff, black, flake8)
- Zero regressions
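
To illustrate why the +1 matters, here is a small sketch with made-up coordinates. Each exon loses one base pair when the +1 is dropped, so transcripts with many exons are penalized more, and the chosen transcript can change. `exons_of` is a hypothetical stand-in for a lookup that, like genome_feature_children(), takes a single parent ID as a string; all data and names are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based and inclusive

# Hypothetical data chosen so that the off-by-one changes the winner.
EXONS_BY_TRANSCRIPT: Dict[str, List[Exon]] = {
    "T-single": [(1, 100)],                        # 100 bp
    "T-multi": [(1, 34), (101, 134), (201, 233)],  # 34 + 34 + 33 = 101 bp
}


def exons_of(parent: str) -> List[Exon]:
    """Stand-in for a child-feature lookup that accepts one parent ID."""
    return EXONS_BY_TRANSCRIPT[parent]


def inclusive_length(exon: Exon) -> int:
    start, end = exon
    return end - start + 1  # correct for inclusive coordinates


def naive_length(exon: Exon) -> int:
    start, end = exon
    return end - start  # drops one base pair per exon


def longest(transcript_ids: List[str], exon_length: Callable[[Exon], int]) -> str:
    totals: Dict[str, int] = {}
    for tid in transcript_ids:  # one lookup per transcript, never a list
        totals[tid] = sum(exon_length(exon) for exon in exons_of(tid))
    return max(totals, key=lambda t: totals[t])


ids = list(EXONS_BY_TRANSCRIPT)
print(longest(ids, inclusive_length))  # T-multi (101 bp vs 100 bp)
print(longest(ids, naive_length))      # T-single (99 bp vs 98 bp)
```
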
- Line 446: Replace 'key=transcript_lengths.get' with 'key=lambda tid: transcript_lengths[tid]'
  to resolve 'Argument key has incompatible type' type checker error
- Add .mypy_cache to .gitignore to prevent type checking cache files from being staged
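
The type checker complaint mentioned above is a well-known quirk of passing `dict.get` as a `key` function. The sketch below uses hypothetical data: `dict.get` is typed as possibly returning `None`, which is not orderable, so mypy typically rejects it even though it works at runtime, while the lambda indexes the dict and is typed as returning `int`.

```python
from typing import Dict

# Hypothetical transcript lengths in base pairs.
transcript_lengths: Dict[str, int] = {"T1": 150, "T2": 200, "T3": 50}

# Works at runtime, but mypy usually reports an incompatible "key" argument
# because dict.get may return None.
by_get = max(transcript_lengths, key=transcript_lengths.get)

# Indexing returns int, so this form satisfies the type checker.
by_lambda = max(transcript_lengths, key=lambda tid: transcript_lengths[tid])

print(by_get, by_lambda)  # T2 T2
```

Both forms select the same key at runtime; the change only affects static type checking.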
kunal-10-cloud (Contributor, Author) commented

Hi @jonbrenas, I have fixed the broken commits that came from merging the Copilot suggestions directly, and I have tested everything locally.
Could you please review it and let me know if any changes are required?

jonbrenas (Collaborator) commented

Thanks @kunal-10-cloud, could you change the tests so that they work with the same kind of cases as the other tests in the codebase?

kunal-10-cloud (Contributor, Author) commented

Sure, give me a few moments. I will look into it, @jonbrenas.

kunal-10-cloud (Contributor, Author) commented

@jonbrenas, I have made the relevant changes and verified locally that everything works.
Could you please take a look and let me know if anything else needs to be done?

@jonbrenas jonbrenas merged commit b0db6ed into malariagen:master Mar 23, 2026
8 checks passed