Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bypass toolkit when filtering by SMIRKS #503

Merged
merged 9 commits into from
Mar 10, 2023
Merged

Conversation

mattwthompson
Copy link
Member

@mattwthompson mattwthompson commented Mar 9, 2023

Description

This PR implements a suggestion in #502

Todos

Notable points that this PR has either accomplished or will accomplish.

  • Support filtering by SMARTS/SMIRKS(?) patterns with isotopes
  • Use RDKit directly in SMIRKS filtering
  • Only look for one match since the result is the same if one or many matches are found

Questions

  • Since we only care about whether or not there is any match, do we really need to return the matches themselves and care about the atom indices?

Status

  • Ready to go

@codecov
Copy link

codecov bot commented Mar 9, 2023

Codecov Report

Merging #503 (b453c9c) into main (20d206b) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head b453c9c differs from pull request most recent head 55187e3. Consider uploading reports for the commit 55187e3 to get more accurate results

Additional details and impacted files

@jthorton
Copy link
Contributor

jthorton commented Mar 10, 2023

Thanks @mattwthompson this looks great, I think your idea of just returning a bool for if there is a match or not would be enough here and keep it nice and simple!

uniquify=False, maxMatches=2**32 - 1, useChirality=True
)

return len(rdmol.GetSubstructMatches(qmol, **match_kwargs)) > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using GetSubstructMatch work here? I think the only difference is that this method returns a single match and so breaks after finding the first instance of a match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good - I don't know the RDKit API that well and didn't look here.

If you have one handy, could you share a script that uses this filter on a large dataset? You mentioned the 70k ThermoML compounds earlier. I should know my way around Evaluator better than this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure this script currently returns an empty dataset when it should contain a few records with properties of heavy water

import pandas
from openff.evaluator.datasets.curation.components import filtering, thermoml
from openff.evaluator.datasets.curation.workflow import CurationWorkflow, CurationWorkflowSchema

N_PROCESSES = 2

# download thermol and cache
thermoml_data_frame = thermoml.ImportThermoMLData.apply(
        pandas.DataFrame(),
        thermoml.ImportThermoMLDataSchema(cache_file_name="thermoml.csv"),
        N_PROCESSES,
    )

# filter the dataset to only keep heavy water
curation_schema = CurationWorkflowSchema(component_schemas=[filtering.FilterBySmirksSchema(smirks_to_include=["[2H]O[2H]"], allow_partial_inclusion=True)])

# pandas dataframe of properties
heavy_water_dataset = CurationWorkflow.apply(thermoml_data_frame, curation_schema, N_PROCESSES)

heavy_water_dataset.to_csv("heavy_water.csv")

@mattwthompson
Copy link
Member Author

There appear to be some performance impacts. Plucking a small data set and doing some basic filtering shows no change when using an inclusive SMARTS pattern and some speedup when using an exclusive one. I spent some time with the full ThermoML dataset but it's too large to run through several times.

In [1]: from openff.evaluator.datasets.curation.components import filtering, thermoml
/Users/mattthompson/software/openff-evaluator/openff/evaluator/protocols/yank.py:62: YankDeprecrationWarning: The YANK protocol has been deprecated and will be removed in a future release. Free energy calculations will be enabled by Perses in the future.
  warnings.warn(

In [2]: data_set = thermoml.ThermoMLDataSet.from_doi(
   ...:     "10.1016/j.fluid.2013.10.034",
   ...:     "10.1021/je1013476",
   ...: )

In [3]: %%timeit -n 1 -r 1
   ...: filtering.FilterBySmirks.apply(
   ...:     data_set,
   ...:     filtering.FilterBySmirksSchema(
   ...:         smirks_to_include=["C#C"], allow_partial_inclusion=True
   ...:     ),
   ...: )
   ...:
   ...:
# upstream/main:
312 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# This PR:
106 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [4]: %%timeit
   ...: filtering.FilterBySmirks.apply(
   ...:     data_set,
   ...:     filtering.FilterBySmirksSchema(
   ...:         smirks_to_include=["C"], allow_partial_inclusion=True
   ...:     ),
   ...: )
   ...:
   ...:
# upstream/main:
15 s ± 54.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This PR:
15.5 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A mostly-permissive pattern (C=O, hits ~95% of this data set) also shows a minor slowdown.

I'm not sure why this would be the case, I thought I was stripping out steps after copying the code over, not adding steps. I'm going to move forward with this, anyway, but we can revisit this later if the performance here is an issue.

@mattwthompson mattwthompson marked this pull request as ready for review March 10, 2023 20:58
@mattwthompson mattwthompson merged commit 27ef2a9 into main Mar 10, 2023
@mattwthompson mattwthompson deleted the filter-smirks-rdkit branch March 10, 2023 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants