Fix Transcript.exons crash when GTF lacks exon_id attribute #331
Merged
Ensembl release 54 and some non-Ensembl GTFs (e.g. UCSC refseq/gencode)
omit the exon_id attribute. pyensembl's installer already treats the
column as optional (database.py:134), but Transcript.exons still
unconditionally SELECTed exon_id, crashing with
sqlite3.OperationalError: no such column: exon_id.
Transcript.exons now checks db.column_exists("exon", "exon_id") and
falls back to building Exon objects directly from the exon row with a
synthesized per-transcript ID of the form "<transcript_id>_exon_<n>".
Adds a regression test that builds an Ensembl-style GTF with
exon_number but no exon_id and verifies both exon ordering and
synthesized IDs.
Bumps to 2.6.7.
Summary
Some GTFs (Ensembl release 54 and earlier, plus non-Ensembl GTFs) omit the `exon_id` attribute. pyensembl's installer already treats that column as optional (see `database.py:134`), but `Transcript.exons` still unconditionally SELECTed `exon_id`, so any call on such a genome crashed with `sqlite3.OperationalError: no such column: exon_id`.

This was hit from pirlygenes: FN1 tests call `transcript.exons` and the local pyensembl cache (old release) has the exon table without the `exon_id` column.
`Transcript.exons` now checks `db.column_exists("exon", "exon_id")` and falls back to constructing Exon objects directly from the exon row, with a synthesized per-transcript ID of the form `"<transcript_id>_exon_<n>"`.
Exon objects returned from the fallback path carry the real contig/start/end/strand/gene coordinates from the GTF; only the id is synthetic.
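
The fallback described above could look roughly like the following sketch. It is not the code from this PR: the `Exon` namedtuple, the table schema (`seqname`, `start_pos`, `end_pos`), the `PRAGMA table_info` implementation of `column_exists`, and the choice of `<n>` as the 1-based position in `exon_number` order are all assumptions made for illustration.

```python
import sqlite3
from collections import namedtuple

# Hypothetical stand-in for pyensembl's Exon class; only a few fields.
Exon = namedtuple("Exon", "exon_id contig start end strand")


def column_exists(conn, table, column):
    """Return True if `table` has `column` (assumed helper, via PRAGMA)."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return any(row[1] == column for row in rows)


def exons_for_transcript(conn, transcript_id):
    """Return exons in exon_number order, synthesizing IDs when needed."""
    has_exon_id = column_exists(conn, "exon", "exon_id")
    # Select the real column only when it exists; otherwise a NULL placeholder.
    id_expr = "exon_id" if has_exon_id else "NULL"
    rows = conn.execute(
        f"SELECT {id_expr}, seqname, start_pos, end_pos, strand "
        "FROM exon WHERE transcript_id = ? ORDER BY exon_number",
        (transcript_id,),
    ).fetchall()
    exons = []
    for n, (exon_id, contig, start, end, strand) in enumerate(rows, start=1):
        if not has_exon_id:
            # Assumed numbering: position of the exon within the transcript.
            exon_id = f"{transcript_id}_exon_{n}"
        exons.append(Exon(exon_id, contig, start, end, strand))
    return exons


# Tiny in-memory table without an exon_id column (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exon (transcript_id TEXT, seqname TEXT, "
    "start_pos INT, end_pos INT, strand TEXT, exon_number INT)"
)
conn.executemany(
    "INSERT INTO exon VALUES (?, ?, ?, ?, ?, ?)",
    [("ENST0001", "2", 300, 400, "+", 2), ("ENST0001", "2", 100, 200, "+", 1)],
)
exons = exons_for_transcript(conn, "ENST0001")
print([e.exon_id for e in exons])
```

Checking for the column once per call keeps the common path (GTFs that do carry `exon_id`) unchanged, while the coordinates on the fallback path still come straight from the stored rows.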
Regression test builds a minimal Ensembl-style GTF with `exon_number` but no `exon_id` and verifies exon ordering and synthesized IDs.
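
As an illustration of what such a fixture might contain, here is a minimal sketch that builds Ensembl-style GTF text with `exon_number` but no `exon_id`. The helper names (`gtf_line`, `parse_attributes`) and the IDs are hypothetical, and a GTF that pyensembl can actually index may need further attributes (gene names, biotypes) that are not shown here.

```python
def gtf_line(feature, start, end, attrs, contig="1", strand="+"):
    """Format one 9-column GTF record; attrs is a dict of attribute pairs."""
    attr_text = " ".join(f'{key} "{value}";' for key, value in attrs.items())
    columns = [contig, "ensembl", feature, str(start), str(end),
               ".", strand, ".", attr_text]
    return "\t".join(columns)


def parse_attributes(line):
    """Parse the 9th GTF column back into a dict of attribute values."""
    attr_text = line.split("\t")[8]
    items = [item.strip() for item in attr_text.split(";") if item.strip()]
    pairs = (item.split(" ", 1) for item in items)
    return {key: value.strip('"') for key, value in pairs}


# Hypothetical gene and transcript IDs used only for this fixture.
base = {"gene_id": "ENSG_TEST", "transcript_id": "ENST_TEST"}
lines = [
    gtf_line("gene", 100, 400, {"gene_id": "ENSG_TEST"}),
    gtf_line("transcript", 100, 400, base),
    # Exon records carry exon_number but deliberately omit exon_id.
    gtf_line("exon", 100, 200, {**base, "exon_number": "1"}),
    gtf_line("exon", 300, 400, {**base, "exon_number": "2"}),
]
gtf_text = "\n".join(lines) + "\n"
exon_attrs = [parse_attributes(l) for l in lines if l.split("\t")[2] == "exon"]
```

A test built on a fixture like this can then assert that the exons come back in `exon_number` order and that every returned ID matches the synthesized pattern.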
Bumps to 2.6.7.
Test plan