Packaging dayhoff#10
samirchar commented on May 1, 2025
- Created Dockerfile
- Developed Readme
- Designed new generate.py with HF
- Wrote pyproject.toml for package distribution and removed the legacy setup.py
- Finalized requirements.txt
- Deleted default azure credentials
- Fixed a few minor bugs
- Set flash attention to default to False
| * **DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models | ||
| * Splits: train (5 GB) | ||
| * **BackboneRef**: structure-based synthetic protein dataset generated by sampling backbone structures from RFDiffusion and using them to design synthetic sequences. | ||
| * Splits: rfdiffusion_unfiltered (BBR-u; 3 GB), rfdiffusion_scrmsd (BBR-s; 3 GB), rfdiffusion_novelty (BBR-n; 3 GB) |
nit: can you add an example of how to load BBR-u vs. BBR-s? It's unclear from the examples provided.
Oh, never mind, I see the example below. Is there a way to rename "rfdiffusion_unfiltered" to just "bbr-u" in the split call?
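One way the short names could be supported without renaming the splits themselves is a small alias table on the caller's side. This is only a sketch: `BBR_SPLITS` and `resolve_split` are hypothetical names, and the config name `"backboneref"` in the commented usage is an assumption, not something confirmed by the README excerpt. The alias-to-split pairs are taken from the split list above.

```python
# Short paper-style aliases mapped to the split names listed in the README.
BBR_SPLITS = {
    "bbr-u": "rfdiffusion_unfiltered",
    "bbr-s": "rfdiffusion_scrmsd",
    "bbr-n": "rfdiffusion_novelty",
}


def resolve_split(name: str) -> str:
    """Return the dataset split name for a BBR alias (case-insensitive).

    Names that are not aliases are returned unchanged, so "train" or a
    full split name such as "rfdiffusion_scrmsd" still works.
    """
    return BBR_SPLITS.get(name.lower(), name)


# Hypothetical usage; the config name "backboneref" is an assumption:
# from datasets import load_dataset
# backboneref_u = load_dataset("microsoft/DayhoffDataset",
#                              name="backboneref",
#                              split=resolve_split("BBR-u"))
```

With this approach the split names on the Hub stay as they are, and only the README example would need the extra helper.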
| name="uniref90", | ||
| split = "train") | ||
|
|
||
| backboneref_both_filter = load_dataset("microsoft/DayhoffDataset", |
I would rename backboneref_both_filter to backboneref_n here for consistency with the paper's naming scheme.
|
|
||
| ## Analysis scripts | ||
|
|
||
| The following list briefly describes the functionality of the most important scripts used to produce the results of the paper: |
Have these been confirmed/cleaned? I didn't have a chance to review the PR, and I know we had extra files that were not being used at this point.
No! I added all of them and asked Kevin exactly this, and whether he could provide a brief description of each. It would be great if you could help me with this. I can start with some of the scripts I created and know!
I think some of them also require some basic cleanup, like removing comments.
| out_dir = os.path.join(args.out_fpath, args.model_name + '_' + str(total_steps) + "_" + task + '_t%.1f' %args.temp) | ||
| if RANK == 0: | ||
| os.makedirs(out_dir, exist_ok=True) | ||
| # if args.task == "sequence": |
nit: remove this commented-out code if it is not being used
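While cleaning up here, the string concatenation that builds `out_dir` could also be written as an f-string. A minimal sketch, assuming the same inputs as the line in the diff; `build_out_dir` is a hypothetical helper name, not something in the PR:

```python
import os


def build_out_dir(out_fpath, model_name, total_steps, task, temp):
    """Build the same output directory path as the concatenation in the diff."""
    # "{temp:.1f}" formats the temperature exactly like '_t%.1f' % temp.
    run_name = f"{model_name}_{total_steps}_{task}_t{temp:.1f}"
    return os.path.join(out_fpath, run_name)
```

The call site would then read `out_dir = build_out_dir(args.out_fpath, args.model_name, total_steps, task, args.temp)`, with the `os.makedirs` call on rank 0 left unchanged.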
I think we should delete this file; it is not needed in the repo.
Makes sense. Removed the reference in the README.
| from evodiff.utils import Tokenizer | ||
| from dayhoff.model import _get_hf_model, ARDiffusionModel | ||
| from dayhoff.constants import UL_ALPHABET_PLUS | ||
| # from torch.serialization import add_safe_globals |
nit: remove this commented-out import if it is not needed
| license = "MIT" | ||
| license-files = ["LICEN[CS]E*"] | ||
| authors = [ | ||
| { name = "Sarah A. Alamdari"}, |
Update what or how?
For consistency, it might make sense to make this an analysis .py script rather than a standalone example notebook.