Include transcript/protein version in the database #89

Closed
ccwang002 opened this issue Dec 6, 2018 · 8 comments

@ccwang002

I was using the EnsDb database of Ensembl release 90 from AnnotationHub AH57757, and I was wondering if EnsDb can include the transcript version in the database as well.

For example, there are 4 transcripts associated with a human gene GATA3,

> edb <- EnsDb('EnsDb.Hsapiens.v90.sqlite')
> transcripts(edb, filter = ~ gene_name == "GATA3")[, c('tx_id', 'gene_name')]
GRanges object with 4 ranges and 2 metadata columns:
                  seqnames          ranges strand |           tx_id   gene_name
                     <Rle>       <IRanges>  <Rle> |     <character> <character>
  ENST00000481743       10 8053604-8055553      + | ENST00000481743       GATA3
  ENST00000379328       10 8054693-8075198      + | ENST00000379328       GATA3
  ENST00000346208       10 8054806-8074890      + | ENST00000346208       GATA3
  ENST00000461472       10 8058399-8074064      + | ENST00000461472       GATA3
  -------
  seqinfo: 1 sequence from GRCh38 genome

Instead of showing only the transcript ID, like ENST00000481743 and ENST00000379328, it would be nice to have an option to display the transcript version as well, like ENST00000481743.2 and ENST00000379328.8. The full versioned ID is helpful when a project involves multiple versions of the Ensembl annotation, because it makes it easier to tell whether a transcript annotation has changed. Otherwise, the user has to go back to the transcript GTF to retrieve that information.
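
To illustrate the use case, here is a minimal base R sketch of how versioned IDs could be compared between two annotation releases. The helper `split_versioned_id` and the `release_b` values are hypothetical and made up for illustration; only the two `release_a` IDs come from the example above.

```r
# Split versioned Ensembl IDs such as "ENST00000481743.2" into the
# stable ID and the integer version. IDs without a ".<version>" suffix
# are not handled here and would yield NA for the version.
split_versioned_id <- function(ids) {
  data.frame(
    id = sub("\\.[0-9]+$", "", ids),
    version = as.integer(sub("^.*\\.([0-9]+)$", "\\1", ids)),
    stringsAsFactors = FALSE
  )
}

# Versioned IDs quoted in the example above.
release_a <- c("ENST00000481743.2", "ENST00000379328.8")
# Hypothetical IDs from a second release (invented for this sketch).
release_b <- c("ENST00000481743.2", "ENST00000379328.9")

a <- split_versioned_id(release_a)
b <- split_versioned_id(release_b)

# Join on the stable ID and keep transcripts whose version differs.
merged <- merge(a, b, by = "id", suffixes = c("_a", "_b"))
changed <- merged$id[merged$version_a != merged$version_b]
```

With the invented `release_b` values, `changed` holds the stable ID of the one transcript whose version number differs between the two vectors.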

Thanks again for making this tool.

@jorainer
Owner

jorainer commented Dec 7, 2018

Thanks for your feedback, @ccwang002. I am not storing the version information for the transcripts (or genes, exons, etc.) in the EnsDb databases because it should be fixed for a given Ensembl release. I thought that having different EnsDb databases for different Ensembl versions would suffice (hence skipping the transcript versions).

If you really require that information, I could add an additional column to the database. I would then, however, also have to update all EnsDb databases in AnnotationHub (just to explain why I am hesitant).

@jorainer
Owner

jorainer commented Dec 7, 2018

If we add this, we would have to be consistent and also add the gene_id_version. So:

  • add transcript_id_version column to transcript table.
  • add gene_id_version column to gene table.

In the Perl API we would have to use the ->stable_id_version() method to extract the respective ID with version appended.

@jorainer
Owner

jorainer commented Dec 7, 2018

OK, so I will implement this.

jorainer added a commit that referenced this issue Dec 7, 2018
- Modify the perl script to extract the gene and transcript IDs with version
  from Ensembl (issue #89).
- Add columns tx_id_version and gene_id_version to the transcript and gene
  database tables.

@jorainer
Owner

jorainer commented Dec 7, 2018

Done. I first have to create some EnsDbs to check that it works. Then I can go ahead and re-create all EnsDb databases on AnnotationHub; most likely I will do it for Ensembl version 94 first.

@jorainer
Owner

jorainer commented Dec 13, 2018

Updating the EnsDbs on AnnotationHub:

  • 94
  • 93
  • 92
  • 91
  • 90

@jorainer
Owner

@ccwang002, for the (checked) versions above I have already uploaded updated EnsDb databases to AnnotationHub. You should be able to use them right away. If you use these databases, you will get the additional columns tx_id_version and gene_id_version by default from the genes, transcripts, ... calls. You don't need to update ensembldb for that.
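
As a rough sketch of how the new column might be requested explicitly, the hypothetical helper below (`versioned_transcripts` is not part of ensembldb) first checks that the database has a tx_id_version column and then queries it. It assumes that ensembldb and AnnotationFilter are installed and that an updated database file is available locally; the file name in the usage comment is the one from the original report.

```r
# Return the transcripts of one gene together with their versioned IDs.
# Assumes `edb` is an EnsDb created after this change, so that the
# tx_id_version column exists in the transcript table.
versioned_transcripts <- function(edb, gene) {
  if (!"tx_id_version" %in% ensembldb::listColumns(edb, "tx")) {
    stop("this EnsDb has no tx_id_version column; use an updated database")
  }
  ensembldb::transcripts(
    edb,
    filter = AnnotationFilter::GeneNameFilter(gene),
    columns = c("tx_id", "tx_id_version", "gene_name")
  )
}

# Usage (requires ensembldb and a local copy of an updated database):
# edb <- ensembldb::EnsDb("EnsDb.Hsapiens.v90.sqlite")
# versioned_transcripts(edb, "GATA3")$tx_id_version
```

The explicit column check is only there to give a clear error with databases created before the change; with an updated EnsDb the columns are returned by default, as described above.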

@ccwang002
Author

@jotsetung Thank you very much for your help! I was able to get the id versions from the new EnsDbs.

By the way, great work on maintaining and developing ensembldb. It is easy to use and powerful.

@jorainer
Owner

jorainer commented Jan 7, 2019

Just an update: I've updated the EnsDbs for Ensembl versions 90 to 94 hosted on AnnotationHub. All of these now also contain the versioned gene and transcript IDs.
