Skip to content

Packaging dayhoff#10

Merged
sarahalamdari merged 13 commits into
mainfrom
packaging
May 6, 2025
Merged

Packaging dayhoff#10
sarahalamdari merged 13 commits into
mainfrom
packaging

Conversation

@samirchar
Copy link
Copy Markdown
Collaborator

  • Created Dockerfile
  • Developed Readme
  • Designed new generate.py with HF
  • Wrote project.toml for package distribution and removed legacy setup.py
  • Finalized requirements.txt
  • Deleted default azure credentials
  • Fixed a few minor bugs
  • Made defaulting flash attention = False

Comment thread README.md Outdated
* **DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models
* Splits: train (5 GB)
* **BackboneRef**: structure-based synthetic protein dataset generated by sampling backbone structures from RFDiffusion and using them to design synthetic sequences.
* Splits: rfdiffusion_unfiltered (BBR-u; 3 GB), rfdiffusion_scrmsd (BBR-s; 3 GB), rfdiffusion_novelty (BBR-n; 3 GB)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you add an example of how to load BBR-u vs BBR-s its unclear from the examples provided

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nevermind I see the example below, is there a way to rename "rfdiffusion_unfiltered" to just be bbr-u in the split call?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment thread README.md Outdated
name="uniref90",
split = "train")

backboneref_both_filter = load_dataset("microsoft/DayhoffDataset",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would make backboneref_both_filter backboneref_n here for consistency with paper naming scheme

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done!

Comment thread README.md Outdated
* **DayhoffRef**: dataset of 16 million synthetic protein sequences generated by the Dayhoff models
* Splits: train (5 GB)
* **BackboneRef**: structure-based synthetic protein dataset generated by sampling backbone structures from RFDiffusion and using them to design synthetic sequences.
* Splits: rfdiffusion_unfiltered (BBR-u; 3 GB), rfdiffusion_scrmsd (BBR-s; 3 GB), rfdiffusion_novelty (BBR-n; 3 GB)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nevermind I see the example below, is there a way to rename "rfdiffusion_unfiltered" to just be bbr-u in the split call?

Comment thread README.md Outdated
Comment thread README.md

## Analysis scripts

The following list briefly describes the functionality of the most important scripts used to produce the results of the paper:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have these been confirmed/cleaned? Didn't have a chance to review the PR and I know we had extra files not being used at this point

Copy link
Copy Markdown
Collaborator Author

@samirchar samirchar May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! I added all of them and asked Kevin exactly this, and if he could provide a brief description of each. It would be great if you can help me with this. I can start with some of the scripts i created and know!

I think some of them also require some basic clean up like removing comments

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted

Comment thread analysis/generate.py Outdated
out_dir = os.path.join(args.out_fpath, args.model_name + '_' + str(total_steps) + "_" + task + '_t%.1f' %args.temp)
if RANK == 0:
os.makedirs(out_dir, exist_ok=True)
# if args.task == "sequence":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: clean if not being used

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment thread assets/dayhoff_models.png
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think delete this file - not needed in repo

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Removed the reference in the README

Comment thread dayhoff/utils.py Outdated
from evodiff.utils import Tokenizer
from dayhoff.model import _get_hf_model, ARDiffusionModel
from dayhoff.constants import UL_ALPHABET_PLUS
# from torch.serialization import add_safe_globals
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: clean if not needed

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread pyproject.toml
license = "MIT"
license-files = ["LICEN[CS]E*"]
authors = [
{ name = "Sarah A. Alamdari"},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to update

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update what or how?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency might make sense to place this as an analysis .py script, rather than a standalone notebook examples

@sarahalamdari sarahalamdari merged commit 9b11378 into main May 6, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants