Conversation

@cwognum cwognum commented Feb 19, 2024

Changelogs

  • Adds featurization_fn to the Subset class and propagates this parameter to the get_train_test_split() method.
  • Adds test cases to ensure the different data formats of the Subset class function properly.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

Discussion

During user interviews, easy featurization of the inputs was a common feature request. This PR takes a first step in that direction.

Before:

# Before: Featurization is not part of the Polaris API
train, test = benchmark.get_train_test_split()

# Need to manually featurize the train set
X_train = np.array([dm.to_fp(smi) for smi, y in train])
model.fit(X_train, train.targets) 

# Need to manually repeat the featurization for the test set
X_test = np.array([dm.to_fp(smi) for smi in test])
y_pred = model.predict(X_test)

After:

# After: You can now specify a featurization function. 
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)

# No need to manually featurize the train or test inputs
model.fit(train.inputs, train.targets)
y_pred = model.predict(test.inputs)
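To make the intended behavior concrete, here is a minimal, hypothetical sketch of how a Subset-like container could apply a featurization function lazily on access. The class name `SimpleSubset` and its internals are illustrative assumptions, not the actual Polaris implementation.

```python
# Hypothetical sketch, NOT the actual Polaris Subset class: the featurization
# function is applied on access, so the raw inputs stay untouched.
from typing import Callable, Optional, Sequence


class SimpleSubset:
    def __init__(
        self,
        inputs: Sequence,
        targets: Sequence,
        featurization_fn: Optional[Callable] = None,
    ):
        self._inputs = inputs
        self.targets = list(targets)
        self._featurization_fn = featurization_fn

    @property
    def inputs(self):
        # Featurize lazily; without a function, return the raw inputs
        if self._featurization_fn is None:
            return list(self._inputs)
        return [self._featurization_fn(x) for x in self._inputs]

    def __iter__(self):
        # Yield (featurized input, target) pairs, matching the tuple
        # iteration shown in the "Before" example above
        return iter(zip(self.inputs, self.targets))
```

With this shape, `train.inputs` and iteration both return featurized values, which is what lets `model.fit(train.inputs, train.targets)` work without a manual featurization loop.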

What about the targets?

There could be benchmarks in which the target would have to be featurized too (e.g. a reaction product). I went down the rabbit hole of also adding a featurization function for the targets. However, this opened up other questions that were difficult to address.

  1. BenchmarkSpecification.evaluate() does not have a y_true parameter to minimize the chance of a user accidentally changing the test labels. However, if we allow a target transformation to be passed to the get_train_test_split() method, this means we need to keep track of that function to replicate the correct y_true values in evaluate().
  2. A related issue to (1) is that this could lead to ambiguity w.r.t. the test set parameters. What if someone splits a dataset three times with a different target transformation function each time? How would we know which one to use once the user calls evaluate()? We probably can't, so it seems important to throw an error if this happens.
  3. People could also (mis)use this to normalize the target values (e.g. min-max normalization or z-score normalization). Without an inverse transform, this would lead to a different range for some of the metrics (e.g. MAE, MSE). We could use (something similar to) scikit-learn's scalers (e.g. here), but how would we differentiate between target featurization (which does not need an inverse) and target normalization (which does need an inverse)?
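Point 3 above can be illustrated with a small example. This is a sketch of the normalization pitfall using a hand-rolled min-max transform (the source mentions scikit-learn's scalers; plain NumPy is used here to keep the snippet self-contained, and the 0.05 prediction offset is an invented assumption):

```python
# Illustration of point 3: metrics computed on normalized targets live on a
# different scale than the original labels unless an inverse transform exists.
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])

# Min-max normalize the targets, as a user might before training
lo, hi = y_true.min(), y_true.max()
y_scaled = (y_true - lo) / (hi - lo)

# Suppose the model's predictions in scaled space are uniformly off by 0.05
y_pred_scaled = y_scaled + 0.05

# MAE in scaled space looks deceptively small...
mae_scaled = np.mean(np.abs(y_scaled - y_pred_scaled))  # 0.05

# ...whereas inverting the transform reports the error in the original units
y_pred = y_pred_scaled * (hi - lo) + lo
mae_original = np.mean(np.abs(y_true - y_pred))  # 1.0
```

A pure featurization (e.g. encoding a reaction product) has no meaningful inverse, so the two use cases would need to be distinguished in the API.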

I can see ways around all of the above, but this is out-of-scope for now. We can revisit this once a benchmark actually needs it.

@cwognum cwognum added the feature (annotates any PR that adds new features; used in the release process) and test (annotates any PR that adds tests) labels Feb 19, 2024
@cwognum cwognum changed the title Easily featurize the transformation functions in the Subset class Easily featurize the inputs in the Subset class Feb 19, 2024
@cwognum cwognum requested a review from zhu0619 February 22, 2024 16:48
@zhu0619 zhu0619 left a comment


@cwognum Looks good to me!

I see the rabbit hole is getting deeper. I would say let's keep things as simple as possible for now, until we see an urgent need. In particular, the user should take care of any post-processing of the target values, whether transformation or scaling.
The evaluation inputs should always be in the original scale/format designed by the Polaris benchmark.


cwognum commented Mar 6, 2024

@zhu0619 Yeah, I agree. We will need to make informed choices moving forward to keep things manageable. We will not be able to support all use cases.

@cwognum cwognum merged commit cb7db9d into main Mar 6, 2024
@cwognum cwognum deleted the feat/featurization-support branch March 6, 2024 20:37