Best way to represent the protein data #20

Closed
jorainer opened this issue Oct 5, 2016 · 3 comments
Comments

@jorainer
Owner

jorainer commented Oct 5, 2016

Extracting the protein annotations as a data.frame or DataFrame is straightforward; the open question is what type of object would best represent the protein annotation.

The object should be something similar to a GRanges, possibly the Proteins class from the Pbase package (https://github.com/ComputationalProteomicsUnit/Pbase)?

I've got:

  • (Ensembl) protein ID with sequence.
  • 1:n mapping of protein ID to Uniprot ID.
  • n:m mapping between protein ID and protein domain ID, which also provides the position of the protein domain within the protein sequence.
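
To make the three relations above concrete, here is a minimal base R sketch. All IDs, sequences and column names (`protein_id`, `uniprot_id`, `protein_domain_id`, `prot_dom_start`, `prot_dom_end`) are made up for illustration and are not taken from the actual database schema.

```r
# Hypothetical example data; all IDs and sequences below are made up.
proteins <- data.frame(
  protein_id = c("ENSP_A", "ENSP_B"),
  protein_sequence = c("MKTAYIAKQR", "MSLLTEVETY"),
  stringsAsFactors = FALSE
)

# 1:n mapping: a protein can have zero, one or several Uniprot IDs.
uniprot <- data.frame(
  protein_id = c("ENSP_A", "ENSP_A"),
  uniprot_id = c("UP_1", "UP_2"),
  stringsAsFactors = FALSE
)

# n:m mapping: domain hits with their positions on the amino acid sequence.
domains <- data.frame(
  protein_id = c("ENSP_A", "ENSP_B", "ENSP_B"),
  protein_domain_id = c("PF_X", "PF_X", "PS_Y"),
  prot_dom_start = c(2L, 1L, 6L),
  prot_dom_end = c(5L, 4L, 9L),
  stringsAsFactors = FALSE
)

# Left joins keep proteins that have no Uniprot ID (uniprot_id is NA there).
flat <- merge(proteins, uniprot, by = "protein_id", all.x = TRUE)
flat <- merge(flat, domains, by = "protein_id", all.x = TRUE)
```

Flattening everything into one table repeats each sequence once per Uniprot ID and domain combination, which is one reason a dedicated object with per-sequence annotations looks attractive.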

@lgatto any suggestions/preferences here?

@lgatto
Contributor

lgatto commented Oct 7, 2016

I would have thought Proteins would have been the best choice, but given the circular dependency, this might not be an option. But maybe something more low-level will suffice, and we can then make use of it.

Do you want a single data structure for all the points above?

What do you mean by protein domain ID? Functional domains, or transcript exons start/end sites?

@jorainer
Owner Author

jorainer commented Oct 7, 2016

Regarding the protein domain ID: from Ensembl I get, for each protein-coding transcript, its translation, which is the protein (amino acid, AA) sequence along with its ID (the Ensembl protein_id). For each protein_id I can then fetch the Uniprot IDs (none, one or more) and all protein domains from the various sources (Pfam, prosite, Smart); these have start and end coordinates on the AA sequence of the protein. That's why I thought Proteins might be a good data structure, as it allows adding features to each AA sequence.

In a first version I will return protein results from the database as an AAStringSet with all additional annotations in the mcols.
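
A rough sketch of what such a return value could look like, assuming the Biostrings package is installed. The IDs, sequences and the column names `tx_id`, `uniprot_id` and `domains` are hypothetical; the actual layout of the mcols may well differ.

```r
library(Biostrings)  # also attaches S4Vectors and IRanges

# Made-up sequences, named by hypothetical Ensembl protein IDs.
prots <- AAStringSet(c(ENSP_A = "MKTAYIAKQR", ENSP_B = "MSLLTEVETY"))

# One mcols row per sequence.
mcols(prots) <- DataFrame(tx_id = c("ENST_A", "ENST_B"))

# List-like columns hold annotations with zero, one or more entries per protein.
mcols(prots)$uniprot_id <- CharacterList(c("UP_1", "UP_2"), character(0))
mcols(prots)$domains <- IRangesList(
  IRanges(start = 2, end = 5, names = "PF_X"),
  IRanges(start = c(1, 6), end = c(4, 9), names = c("PF_X", "PS_Y"))
)

# Amino acids covered by the first domain of the first protein.
dom <- mcols(prots)$domains[[1]]
dom_seq <- subseq(prots[[1]], start = start(dom)[1], end = end(dom)[1])
```

Keeping the domains as ranges next to each sequence avoids duplicating the sequence per annotation, at the cost of list-like columns in the mcols.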

@lgatto
Contributor

lgatto commented Oct 7, 2016

Fantastic! Let me know when this becomes available and I will update Pbase to make use of it.
