
Conversation

@cwognum (Collaborator) commented Mar 14, 2024

Changelogs

  • Made it possible to use slices in pointers (e.g. path/to/data.zarr/col#0:10).
  • Added the SDFConverter class to convert an SDF file to a Polaris Dataset.
  • Started the implementation of a DatasetFactory as a unified system for creating datasets.
  • Added the create_dataset_from_file utility method.
  • Added Adapters to convert datapoints during data loading. These can be saved alongside a dataset.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

This PR adds initial support for the MARCEL benchmark.

There were three critical changes needed to support MARCEL:

  • We needed to convert an SDF file to Zarr, which we ended up doing with RDKit's binary bytestrings. See also the discussion in rdkit/rdkit#7235 (Array-like structure for MolBlock IO).
  • We needed to allow for a variable number of conformers per datapoint, which we did by adding support for pointers with slice indexing (e.g. path/to/data.zarr/col#0:10).
  • We needed to allow users to specify an Adapter, which adapts the content saved in the Zarr file to a desired format during data loading (in this case, from a bytestring to an RDKit Mol object). A sketch tying these three pieces together follows below.
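
To make these pieces concrete, here is a minimal, self-contained sketch. It is not the Polaris implementation: the Zarr layout, the pointer parsing, and the bytes_to_mol adapter are illustrative stand-ins (assuming zarr v2's object-array API); only the bytestring serialization via RDKit and the path/to/data.zarr/col#start:stop pointer format come from this PR.

import zarr
from numcodecs import VLenBytes
from rdkit import Chem

# SDF/Mol -> Zarr via RDKit bytestrings: RDKit serializes a Mol to a compact
# binary bytestring and restores it losslessly; those bytestrings are what
# get stored in the Zarr archive.
mols = [Chem.MolFromSmiles(smi) for smi in ("CCO", "c1ccccc1", "CC(=O)O")]

root = zarr.open("example.zarr", mode="w")
col = root.create_dataset(
    "molecules", shape=(len(mols),), dtype=object, object_codec=VLenBytes()
)
for i, mol in enumerate(mols):
    col[i] = mol.ToBinary()

# Slice pointers: a pointer such as "example.zarr/molecules#0:2" references a
# variable-length slice of a column, e.g. all conformers of one datapoint.
pointer = "example.zarr/molecules#0:2"
path, _, index = pointer.partition("#")
zarr_path, _, column = path.rpartition("/")
start, stop = (int(part) for part in index.split(":"))
subset = zarr.open(zarr_path, mode="r")[column][start:stop]

# Adapters: conceptually, a conversion applied at load time; here it turns
# the stored bytestrings back into RDKit Mol objects.
def bytes_to_mol(raw: bytes) -> Chem.Mol:
    return Chem.Mol(raw)

loaded = [bytes_to_mol(raw) for raw in subset]
assert Chem.MolToSmiles(loaded[0]) == Chem.MolToSmiles(mols[0])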

In addition, I have long wanted to start on a system that makes it easier to create datasets. I'm quite happy with this as a first version. It's still a little crude, but I think it will be easy to extend as we continue to make adding datasets easier.

I added a tutorial and docstrings, but here is a short snippet:

from polaris.dataset import DatasetFactory
from polaris.dataset.converters import SDFConverter

# Create a new factory object
save_dst = "/path/to/destination.zarr
factory = DatasetFactory(zarr_root_path=save_dst)

# Register a converter for the SDF file format
factory.register_converter("sdf", SDFConverter())

# Process your SDF file
factory.add_from_file("/path/to/data.sdf")

# Build the dataset
dataset = factory.build()
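
For single-file cases, the create_dataset_from_file utility mentioned in the changelog wraps this flow in one call. A sketch, assuming it is importable from polaris.dataset and takes the input path plus a Zarr destination (the exact signature may differ):

from polaris.dataset import create_dataset_from_file

# Hypothetical one-liner equivalent of the factory flow above;
# the zarr_root_path keyword is an assumption.
dataset = create_dataset_from_file("/path/to/data.sdf", zarr_root_path="/path/to/destination.zarr")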

@cwognum cwognum added the feature label Mar 14, 2024
@cwognum cwognum requested a review from zhu0619 March 14, 2024 21:08
@cwognum cwognum self-assigned this Mar 14, 2024
@cwognum cwognum requested a review from jstlaurent March 14, 2024 21:08
@jstlaurent (Contributor) left a comment

I'm a bit out of my depth here, but that's some nice test coverage, and I like concepts like the pluggable converters.

@cwognum cwognum linked an issue Mar 15, 2024 that may be closed by this pull request
@cwognum cwognum merged commit 95734e5 into main Mar 18, 2024
@cwognum cwognum deleted the feat/sdf-to-zarr branch March 18, 2024 14:53

Development

Successfully merging this pull request may close these issues.

Add support for converting SDF to Zarr
