Packaging dayhoff by samirchar · Pull Request #10 · microsoft/dayhoff

samirchar · 2025-05-01T14:13:10Z

Created Dockerfile
Wrote the README
Designed a new generate.py built on Hugging Face (HF)
Wrote pyproject.toml for package distribution and removed the legacy setup.py
Finalized requirements.txt
Deleted default Azure credentials
Fixed a few minor bugs
Set flash attention to default to False

sarahalamdari · 2025-05-02T14:11:51Z

+* **DayhoffRef**:  dataset of 16 million synthetic protein sequences generated by the Dayhoff models
+    * Splits: train (5 GB)
+* **BackboneRef**: structure-based synthetic protein dataset generated by sampling backbone structures from RFDiffusion and using them to design synthetic sequences.
+    * Splits: rfdiffusion_unfiltered (BBR-u; 3 GB), rfdiffusion_scrmsd (BBR-s; 3 GB), rfdiffusion_novelty (BBR-n; 3 GB)


nit: can you add an example of how to load BBR-u vs BBR-s? It's unclear from the examples provided.

Oh, never mind, I see the example below. Is there a way to rename "rfdiffusion_unfiltered" to just "bbr-u" in the split call?
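One way to get paper-style names in the split call without renaming anything on the Hub is a thin alias layer on the client side. The sketch below is illustrative only: `BACKBONEREF_SPLITS`, `resolve_split`, and `load_backboneref` are hypothetical helpers, and the `config_name="backboneref"` default is an assumption, since the diff shown here only confirms the `uniref90` config name and the three split names.

```python
# Hypothetical aliases: paper names -> split names listed in the README diff.
BACKBONEREF_SPLITS = {
    "bbr-u": "rfdiffusion_unfiltered",
    "bbr-s": "rfdiffusion_scrmsd",
    "bbr-n": "rfdiffusion_novelty",
}


def resolve_split(name: str) -> str:
    """Map a paper-style alias (case-insensitive) to its split name.

    Names that are not aliases are returned unchanged.
    """
    return BACKBONEREF_SPLITS.get(name.lower(), name)


def load_backboneref(split: str, config_name: str = "backboneref"):
    """Load one BackboneRef split; `config_name` is an assumed value."""
    # Imported lazily so the alias helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "microsoft/DayhoffDataset",
        name=config_name,
        split=resolve_split(split),
    )
```

With this in place, `load_backboneref("BBR-u")` and `load_backboneref("rfdiffusion_unfiltered")` would request the same split.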

sarahalamdari · 2025-05-02T14:12:45Z

+                  name="uniref90",
+                  split = "train")
+
+backboneref_both_filter = load_dataset("microsoft/DayhoffDataset",


I would rename backboneref_both_filter to backboneref_n here for consistency with the paper's naming scheme.

sarahalamdari · 2025-05-02T14:15:54Z

+
+## Analysis scripts
+
+The following list briefly describes the functionality of the most important scripts used to produce the results of the paper:


Have these been confirmed/cleaned? Didn't have a chance to review the PR and I know we had extra files not being used at this point

No! I added all of them and asked Kevin exactly this, and whether he could provide a brief description of each. It would be great if you could help me with this. I can start with some of the scripts I created and know!

I think some of them also need some basic cleanup, like removing comments.

sarahalamdari · 2025-05-02T14:16:17Z

+    out_dir = os.path.join(args.out_fpath, args.model_name + '_' + str(total_steps) + "_" + task + '_t%.1f' %args.temp)
+    if RANK == 0:
+        os.makedirs(out_dir, exist_ok=True)
+    # if args.task == "sequence":


nit: clean if not being used

sarahalamdari · 2025-05-02T14:16:57Z

I think we should delete this file; it's not needed in the repo.

Makes sense. Removed the reference in the README

sarahalamdari · 2025-05-02T14:17:08Z

 from evodiff.utils import Tokenizer
 from dayhoff.model import _get_hf_model, ARDiffusionModel
 from dayhoff.constants import UL_ALPHABET_PLUS
+# from torch.serialization import add_safe_globals


nit: clean if not needed

sarahalamdari · 2025-05-02T14:17:34Z

+license = "MIT"
+license-files = ["LICEN[CS]E*"]
+authors = [
+    { name = "Sarah A. Alamdari"},


need to update

Update what or how?

sarahalamdari · 2025-05-02T14:28:55Z

For consistency, it might make sense to place this as an analysis .py script rather than a standalone notebook example.

Samir Char added 11 commits May 1, 2025 12:33

restructure. Edit readme, reqs.txt and toml

8309531

restructure

530ec41

deleted old models from hub

d3d65cc

edit readme. modified generate.py. included assets folder

8ffdb60

added intended use

be95794

added docker, cleaned.py, edit readme

776bfe5

added docker and dataset sizes

f037aa8

uncommitted files

2130753

aligned requirements.txt with toml

00d344b

updated TOC in readme

4e33a02

added detail to datasets

da7bfb7

sarahalamdari reviewed May 2, 2025

Samir Char added 2 commits May 6, 2025 12:45

modified readme. cleaned comments from py

08de898

renamed rfdiffusion models

8d8344b

sarahalamdari merged commit 9b11378 into main May 6, 2025
3 checks passed

