Conversation

@cwognum cwognum commented Feb 19, 2024

Changelogs

  • Adds featurization_fn to the Subset class and propagates this parameter to the get_train_test_split() method.
  • Adds test cases to ensure the different data formats of the Subset class function properly.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

Discussion

During user interviews, easy featurization of the inputs was a common feature request. This PR takes a first step in that direction.

Before:

# Before: Featurization is not part of the Polaris API
train, test = benchmark.get_train_test_split()

# Need to manually featurize the train set
X_train = np.array([dm.to_fp(smi) for smi, y in train])
model.fit(X_train, train.targets) 

# Need to manually repeat the featurization for the test set
X_test = np.array([dm.to_fp(smi) for smi in test])
y_pred = model.predict(X_test)

After:

# After: You can now specify a featurization function. 
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)

# No need to manually featurize the train or test inputs
model.fit(train.inputs, train.targets)
y_pred = model.predict(test.inputs)
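To make the intended behavior concrete, here is a minimal, hypothetical sketch of how a Subset-like container could apply a featurization function lazily on access. The class name `SimpleSubset` and its internals are illustrative assumptions, not the actual Polaris implementation.

```python
# Hypothetical sketch, NOT the actual Polaris Subset class: the featurization
# function is applied on access, so the raw inputs stay untouched.
from typing import Callable, Optional, Sequence


class SimpleSubset:
    def __init__(
        self,
        inputs: Sequence,
        targets: Sequence,
        featurization_fn: Optional[Callable] = None,
    ):
        self._inputs = inputs
        self.targets = list(targets)
        self._featurization_fn = featurization_fn

    @property
    def inputs(self):
        # Featurize lazily; without a function, return the raw inputs
        if self._featurization_fn is None:
            return list(self._inputs)
        return [self._featurization_fn(x) for x in self._inputs]

    def __iter__(self):
        # Yield (featurized input, target) pairs, matching the tuple
        # iteration shown in the "Before" example above
        return iter(zip(self.inputs, self.targets))
```

With this shape, `train.inputs` and iteration both return featurized values, which is what lets `model.fit(train.inputs, train.targets)` work without a manual featurization loop.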

What about the targets?

There could be benchmarks in which the target would have to be featurized too (e.g. a reaction product). I went down the rabbit hole of also adding a featurization function for the targets. However, this opened up other questions that were difficult to address.

  1. BenchmarkSpecification.evaluate() does not have a y_true parameter to minimize the chance of a user accidentally changing the test labels. However, if we allow a target transformation to be passed to the get_train_test_split() method, this means we need to keep track of that function to replicate the correct y_true values in evaluate().
  2. A related issue to (1) is that this could lead to ambiguity w.r.t. the test set parameters. What if someone splits a dataset three times with a different target transformation function each time? How would we know which one to use once the user calls evaluate()? We probably can't, so it seems important to throw an error if this happens.
  3. People could also (mis)use this to normalize the target values (e.g. min-max normalization or z-score normalization). Without an inverse transform, this would lead to a different range for some of the metrics (e.g. MAE, MSE). We could use (something similar to) scikit-learn's scalers (e.g. here), but how would we differentiate between target featurization (which does not need an inverse) and target normalization (which does need an inverse)?
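Point 3 above can be illustrated with a small example. This is a sketch of the normalization pitfall using a hand-rolled min-max transform (the source mentions scikit-learn's scalers; plain NumPy is used here to keep the snippet self-contained, and the 0.05 prediction offset is an invented assumption):

```python
# Illustration of point 3: metrics computed on normalized targets live on a
# different scale than the original labels unless an inverse transform exists.
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])

# Min-max normalize the targets, as a user might before training
lo, hi = y_true.min(), y_true.max()
y_scaled = (y_true - lo) / (hi - lo)

# Suppose the model's predictions in scaled space are uniformly off by 0.05
y_pred_scaled = y_scaled + 0.05

# MAE in scaled space looks deceptively small...
mae_scaled = np.mean(np.abs(y_scaled - y_pred_scaled))  # 0.05

# ...whereas inverting the transform reports the error in the original units
y_pred = y_pred_scaled * (hi - lo) + lo
mae_original = np.mean(np.abs(y_true - y_pred))  # 1.0
```

A pure featurization (e.g. encoding a reaction product) has no meaningful inverse, so the two use cases would need to be distinguished in the API.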

I can see ways around all of the above, but this is out-of-scope for now. We can revisit this once a benchmark actually needs it.

@cwognum cwognum added the feature (annotates any PR that adds new features; used in the release process) and test (annotates any PR that adds tests) labels Feb 19, 2024
@cwognum cwognum changed the title Easily featurize the transformation functions in the Subset class Easily featurize the inputs in the Subset class Feb 19, 2024
@cwognum cwognum requested a review from zhu0619 February 22, 2024 16:48
@zhu0619 zhu0619 left a comment


@cwognum Looks good to me!

I see the rabbit hole is getting deeper. I would say let's keep things as simple as possible for now, until we see an urgent need. In particular, the user should take care of any post-processing of the target values, whether transformation or scaling.
The evaluation inputs should always be in the original scale/format designed by the Polaris benchmark.


cwognum commented Mar 6, 2024

@zhu0619 Yeah, I agree. We will need to make informed choices moving forward to keep things manageable. We will not be able to support all use cases.

@cwognum cwognum merged commit cb7db9d into main Mar 6, 2024
@cwognum cwognum deleted the feat/featurization-support branch March 6, 2024 20:37