Include transcript/protein version in the database #89

Closed
ccwang002 opened this Issue Dec 6, 2018 · 8 comments

ccwang002 commented Dec 6, 2018

I am using the EnsDb database for Ensembl release 90 from AnnotationHub (AH57757), and I was wondering whether EnsDb could include the transcript version in the database as well.

For example, there are 4 transcripts associated with the human gene GATA3:

> edb <- EnsDb('EnsDb.Hsapiens.v90.sqlite')
> transcripts(edb, filter = ~ gene_name == "GATA3")[, c('tx_id', 'gene_name')]
GRanges object with 4 ranges and 2 metadata columns:
                  seqnames          ranges strand |           tx_id   gene_name
                     <Rle>       <IRanges>  <Rle> |     <character> <character>
  ENST00000481743       10 8053604-8055553      + | ENST00000481743       GATA3
  ENST00000379328       10 8054693-8075198      + | ENST00000379328       GATA3
  ENST00000346208       10 8054806-8074890      + | ENST00000346208       GATA3
  ENST00000461472       10 8058399-8074064      + | ENST00000461472       GATA3
  -------
  seqinfo: 1 sequence from GRCh38 genome

Instead of just having the transcript ID, like ENST00000481743 and ENST00000379328, it would be nice to have an option to display the transcript version as well, like ENST00000481743.2 and ENST00000379328.8. Having the full versioned ID is helpful when a project involves multiple versions of the Ensembl annotation, because it makes it easier to tell whether any transcript annotation has changed. Otherwise, the user has to go back to the transcript GTF to retrieve that information.
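To illustrate the use case, here is a minimal base R sketch of how versioned IDs from two releases could be compared to spot changed transcripts. The two ID vectors are toy data: the versions in `rel_a` are the ones quoted above, while `rel_b` and the helper `split_version` are made up for illustration.

```r
# Toy versioned IDs for two releases. The versions in `rel_a` are the ones
# quoted in this report; everything in `rel_b` is invented for illustration.
rel_a <- c("ENST00000481743.2", "ENST00000379328.8")
rel_b <- c("ENST00000481743.2", "ENST00000379328.9")

# Hypothetical helper: split "ENST00000481743.2" into the stable ID and the
# integer version that follows the last dot.
split_version <- function(ids) {
    data.frame(
        tx_id = sub("\\.[0-9]+$", "", ids),
        version = as.integer(sub("^.*\\.", "", ids)),
        stringsAsFactors = FALSE
    )
}

# Join the two releases on the stable ID and keep only the transcripts whose
# version number differs between them.
both <- merge(split_version(rel_a), split_version(rel_b),
              by = "tx_id", suffixes = c("_a", "_b"))
changed <- both[both$version_a != both$version_b, ]
print(changed)
```

With the toy data above, only ENST00000379328 is reported as changed. Without the version suffix the two releases would look identical at the ID level.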

Thanks again for making this tool.

Owner

jotsetung commented Dec 7, 2018

Thanks for your feedback, @ccwang002. I am not storing the version information for the transcripts (and genes, exons, etc.) in the EnsDb databases because it should be fixed for a given Ensembl release. I thought that having different EnsDb databases for different Ensembl versions would suffice (hence skipping the transcript versions).

If you really require that information I could add an additional column to the database. I would however then also have to update all EnsDb databases in AnnotationHub (just to explain why I am hesitant).

jotsetung commented Dec 7, 2018

If we add this, we would have to be consistent and also add the gene_id_version. So:

  • add transcript_id_version column to transcript table.
  • add gene_id_version column to gene table.

In the Perl API we would have to use the ->stable_id_version() method to extract the respective ID with version appended.

jotsetung commented Dec 7, 2018

OK, so I will implement this.

jotsetung added a commit that referenced this issue Dec 7, 2018

Modify perl script to include transcript and gene versions
- Modify the perl script to extract the gene and transcript IDs with version
  from Ensembl (issue #89).
- Add columns tx_id_version and gene_id_version to the transcript and gene
  database tables.

jotsetung commented Dec 7, 2018

Done. I have to create some EnsDbs first to check that it works. Then I can go ahead and re-create all EnsDb databases on AnnotationHub; most likely I will (first) do it just for Ensembl version 94.

jotsetung commented Dec 13, 2018

Updating the EnsDbs on AnnotationHub:

  • 94
  • 93
  • 92
  • 91
  • 90

jotsetung commented Dec 21, 2018

@ccwang002, for the (checked) versions above I have already uploaded updated EnsDb databases to AnnotationHub. You should be able to use them right away. If you use these databases you will get the additional columns tx_id_version and gene_id_version by default with the genes, transcripts, ... calls. You don't need to update ensembldb for that.
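For anyone who wants to request the new columns explicitly, the following is a hedged sketch rather than an official recipe. It assumes the ensembldb and AnnotationFilter packages are installed; the wrapper name `versioned_transcripts` and the local file name in the usage comment are hypothetical. The `intersect()` with `listColumns()` is there so the call still works on an older EnsDb that was built before the version columns existed.

```r
# Sketch: return transcripts for one gene, including the versioned ID
# columns when the given EnsDb provides them.
# `versioned_transcripts` is a hypothetical helper name, not part of ensembldb.
versioned_transcripts <- function(edb, gene) {
    wanted <- c("tx_id", "tx_id_version", "gene_name", "gene_id_version")
    # listColumns() reports the columns available in this particular
    # database, so the version columns are dropped for older EnsDbs.
    available <- intersect(wanted, ensembldb::listColumns(edb))
    ensembldb::transcripts(
        edb,
        columns = available,
        filter = AnnotationFilter::GeneNameFilter(gene)
    )
}

# Usage (hypothetical local file name, following the pattern used in the
# original report):
# edb <- ensembldb::EnsDb("EnsDb.Hsapiens.v94.sqlite")
# versioned_transcripts(edb, "GATA3")
```

Using `GeneNameFilter` here is equivalent in intent to the `~ gene_name == "GATA3"` formula from the original report; it is simply easier to parameterize inside a function.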

ccwang002 commented Jan 3, 2019

@jotsetung Thank you very much for your help! I was able to get the id versions from the new EnsDbs.

By the way, great work maintaining and developing ensembldb. It is easy to use and powerful.

ccwang002 closed this Jan 3, 2019

jotsetung commented Jan 7, 2019

Just an update: I've updated the EnsDbs for Ensembl versions 90 to 94 hosted on AnnotationHub. All of these now also contain the versioned gene and transcript IDs.
