kalekundert/macromol_census

Macromolecule Census

Macromolecule Census is a set of tools for creating machine-learning datasets from macromolecular structure data, especially the data made available by the Protein Data Bank (PDB). These tools are meant to:

  • Filter for high-quality (e.g. high-resolution, low R-factor), low-redundancy (i.e. below a sequence identity cutoff) structures.
  • Make robust training/validation/test splits by accounting for domain-level structural similarities.
  • Store atomic coordinates in a compact, portable, standard format (SQLite).
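
The three goals above can be illustrated with a minimal sketch that uses only the Python standard library. This is not the Macromolecule Census API or its actual database schema: the table layout, the column names, the function names (`make_db`, `select_high_quality`, `assign_split`), the cutoff values, and the placeholder structure IDs are all assumptions made for illustration. The sketch also assumes that each structure has already been assigned a `cluster_id` by some upstream similarity clustering step.

```python
import hashlib
import sqlite3


def make_db():
    """Create an in-memory SQLite database with a hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE structure (
            pdb_id       TEXT PRIMARY KEY,
            resolution_A REAL,     -- lower is better
            r_free       REAL,     -- lower is better
            cluster_id   INTEGER   -- assumed to come from upstream clustering
        );
        CREATE TABLE atom (
            pdb_id  TEXT REFERENCES structure(pdb_id),
            element TEXT,
            x REAL, y REAL, z REAL
        );
    """)
    return db


def select_high_quality(db, max_resolution_A=2.5, max_r_free=0.25):
    """Return (pdb_id, cluster_id) pairs that pass illustrative quality cutoffs."""
    rows = db.execute(
        "SELECT pdb_id, cluster_id FROM structure "
        "WHERE resolution_A <= ? AND r_free <= ? "
        "ORDER BY pdb_id",
        (max_resolution_A, max_r_free),
    )
    return rows.fetchall()


def assign_split(cluster_id, val_frac=0.1, test_frac=0.1):
    """Map a whole cluster to one split, so similar structures stay together."""
    digest = hashlib.sha256(str(cluster_id).encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # deterministic value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


# Made-up example rows; the IDs and values are placeholders, not real PDB entries.
db = make_db()
db.executemany(
    "INSERT INTO structure VALUES (?, ?, ?, ?)",
    [("aaa1", 1.8, 0.21, 1), ("aaa2", 3.4, 0.31, 1), ("aaa3", 2.1, 0.23, 2)],
)
db.executemany(
    "INSERT INTO atom VALUES (?, ?, ?, ?, ?)",
    [("aaa1", "N", 11.1, 2.5, -3.0), ("aaa1", "C", 12.3, 3.1, -2.4)],
)

kept = select_high_quality(db)
splits = {pdb_id: assign_split(cluster_id) for pdb_id, cluster_id in kept}
```

Hashing the cluster identifier rather than the structure identifier means that every member of a cluster lands in the same split, which is the point of splitting on structural similarity rather than on individual entries.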