-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bypass toolkit when filtering by SMIRKS #503
Conversation
updates: - [github.com/nbQA-dev/nbQA: 1.6.1 → 1.6.3](nbQA-dev/nbQA@1.6.1...1.6.3)
Codecov Report
Additional details and impacted files |
Thanks @mattwthompson this looks great, I think your idea of just returning a bool for if there is a match or not would be enough here and keep it nice and simple! |
uniquify=False, maxMatches=2**32 - 1, useChirality=True | ||
) | ||
|
||
return len(rdmol.GetSubstructMatches(qmol, **match_kwargs)) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would using GetSubstructMatch
work here? I think the only difference is that this method returns a single match and so breaks after finding the first instance of a match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good - I don't know the RDKit API that well and didn't look here.
If you have one handy, could you share a script that uses this filter on a large dataset? You mentioned the 70k ThermoML compounds earlier. I should know my way around Evaluator better than this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure this script currently returns an empty dataset when it should contain a few records with properties of heavy water
import pandas
from openff.evaluator.datasets.curation.components import filtering, thermoml
from openff.evaluator.datasets.curation.workflow import CurationWorkflow, CurationWorkflowSchema
N_PROCESSES = 2
# download thermol and cache
thermoml_data_frame = thermoml.ImportThermoMLData.apply(
pandas.DataFrame(),
thermoml.ImportThermoMLDataSchema(cache_file_name="thermoml.csv"),
N_PROCESSES,
)
# filter the dataset to only keep heavy water
curation_schema = CurationWorkflowSchema(component_schemas=[filtering.FilterBySmirksSchema(smirks_to_include=["[2H]O[2H]"], allow_partial_inclusion=True)])
# pandas dataframe of properties
heavy_water_dataset = CurationWorkflow.apply(thermoml_data_frame, curation_schema, N_PROCESSES)
heavy_water_dataset.to_csv("heavy_water.csv")
There appear to be some performance impacts. Plucking a small data set and doing some basic filtering shows no change when using an inclusive SMARTS pattern and some speedup when using an exclusive one. I spent some time with the full ThermoML dataset but it's too large to run through several times. In [1]: from openff.evaluator.datasets.curation.components import filtering, thermoml
/Users/mattthompson/software/openff-evaluator/openff/evaluator/protocols/yank.py:62: YankDeprecrationWarning: The YANK protocol has been deprecated and will be removed in a future release. Free energy calculations will be enabled by Perses in the future.
warnings.warn(
In [2]: data_set = thermoml.ThermoMLDataSet.from_doi(
...: "10.1016/j.fluid.2013.10.034",
...: "10.1021/je1013476",
...: )
In [3]: %%timeit -n 1 -r 1
...: filtering.FilterBySmirks.apply(
...: data_set,
...: filtering.FilterBySmirksSchema(
...: smirks_to_include=["C#C"], allow_partial_inclusion=True
...: ),
...: )
...:
...:
# upstream/main:
312 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# This PR:
106 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [4]: %%timeit
...: filtering.FilterBySmirks.apply(
...: data_set,
...: filtering.FilterBySmirksSchema(
...: smirks_to_include=["C"], allow_partial_inclusion=True
...: ),
...: )
...:
...:
# upstream/main:
15 s ± 54.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This PR:
15.5 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) A mostly-permissive pattern ( I'm not sure why this would be the case, I thought I was stripping out steps after copying the code over, not adding steps. I'm going to move forward with this, anyway, but we can revisit this later if the performance here is an issue. |
Description
This PR implements a suggestion in #502
Todos
Notable points that this PR has either accomplished or will accomplish.
Questions
Status