Obtain dataset for pretraining ProteinBERT #12

Closed
duongvtt96 opened this issue Mar 8, 2022 · 1 comment

Comments

@duongvtt96

Hi, I would like to pretrain the ProteinBERT model on a smaller dataset, such as the human proteins in UniRef90. I have searched the UniProtKB and UniRef90 websites but could not find out how to obtain an .xml.gz file containing GO annotations, similar to your uniref90.xml.gz file.
Could you explain how to get that kind of input file?

Thank you!

@ddofer
Collaborator

ddofer commented Mar 8, 2022

The easiest alternative would be to download just Swiss-Prot (most, but not all, human proteins are there) or UniRef50:
https://www.uniprot.org/downloads
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz
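
For anyone who wants to check what the Swiss-Prot file contains before pretraining, here is a minimal sketch (not part of ProteinBERT's own pipeline) that streams the XML and pulls out human entries with their GO annotations. It assumes the UniProtKB XML layout, where each `<entry>` has `<accession>`, `<organism>`, `<dbReference type="GO">` and `<sequence>` as direct children; the function names `iter_proteins` and `print_human_proteins` are made up for illustration. The UniRef XML files use a different layout, so this would need adapting for them.

```python
import gzip
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix so the exact namespace URI does not matter."""
    return tag.rsplit("}", 1)[-1]


def iter_proteins(xml_stream, taxon_id="9606"):
    """Yield (accession, sequence, GO ids) for entries of one organism.

    taxon_id is an NCBI Taxonomy id; 9606 is Homo sapiens.
    Assumes the UniProtKB XML layout described above.
    """
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        if local(elem.tag) != "entry":
            continue
        accession, sequence, go_ids, taxa = None, None, [], set()
        for child in elem:  # direct children of <entry> only
            name = local(child.tag)
            if name == "accession" and accession is None:
                accession = child.text  # first accession is the primary one
            elif name == "organism":
                taxa = {
                    ref.get("id")
                    for ref in child
                    if local(ref.tag) == "dbReference"
                    and ref.get("type") == "NCBI Taxonomy"
                }
            elif name == "dbReference" and child.get("type") == "GO":
                go_ids.append(child.get("id"))
            elif name == "sequence" and child.text:
                sequence = "".join(child.text.split())  # remove line breaks
        if taxon_id in taxa and sequence:
            yield accession, sequence, go_ids
        elem.clear()  # release the entry's children to limit memory use


def print_human_proteins(path="uniprot_sprot.xml.gz"):
    """Print accession, sequence length and GO ids for each human entry."""
    with gzip.open(path, "rb") as handle:
        for accession, sequence, go_ids in iter_proteins(handle):
            print(accession, len(sequence), ",".join(go_ids))
```

Calling `print_human_proteins()` in the directory holding the downloaded file prints one line per human protein. Parsing with `iterparse` avoids loading the whole file into memory at once.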

@ddofer closed this as completed Mar 8, 2022