Bypass toolkit when filtering by SMIRKS #503

mattwthompson · 2023-03-09T19:17:13Z

Description

This PR implements a suggestion in #502

Todos

Notable points that this PR has either accomplished or will accomplish.

Support filtering by SMARTS/SMIRKS(?) patterns with isotopes
Use RDKit directly in SMIRKS filtering
Only look for one match since the result is the same if one or many matches are found

Questions

Since we only care about whether or not there is any match, do we really need to return the matches themselves and care about the atom indices?

Status

Ready to go

updates: - [github.com/nbQA-dev/nbQA: 1.6.1 → 1.6.3](nbQA-dev/nbQA@1.6.1...1.6.3)

openff/evaluator/datasets/curation/components/filtering.py

codecov · 2023-03-09T19:23:35Z

Codecov Report

Merging #503 (b453c9c) into main (20d206b) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head b453c9c differs from pull request most recent head 55187e3. Consider uploading reports for the commit 55187e3 to get more accurate results

Additional details and impacted files

…nto filter-smirks-rdkit

jthorton · 2023-03-10T09:05:19Z

Thanks @mattwthompson this looks great, I think your idea of just returning a bool for if there is a match or not would be enough here and keep it nice and simple!

jthorton · 2023-03-10T15:45:49Z

openff/evaluator/datasets/curation/components/filtering.py

+                uniquify=False, maxMatches=2**32 - 1, useChirality=True
+            )
+
+            return len(rdmol.GetSubstructMatches(qmol, **match_kwargs)) > 0


Would using GetSubstructMatch work here? I think the only difference is that this method returns a single match and so breaks after finding the first instance of a match.

Seems good - I don't know the RDKit API that well and didn't look here.

If you have one handy, could you share a script that uses this filter on a large dataset? You mentioned the 70k ThermoML compounds earlier. I should know my way around Evaluator better than this.

Sure this script currently returns an empty dataset when it should contain a few records with properties of heavy water

import pandas from openff.evaluator.datasets.curation.components import filtering, thermoml from openff.evaluator.datasets.curation.workflow import CurationWorkflow, CurationWorkflowSchema N_PROCESSES = 2 # download thermol and cache thermoml_data_frame = thermoml.ImportThermoMLData.apply( pandas.DataFrame(), thermoml.ImportThermoMLDataSchema(cache_file_name="thermoml.csv"), N_PROCESSES, ) # filter the dataset to only keep heavy water curation_schema = CurationWorkflowSchema(component_schemas=[filtering.FilterBySmirksSchema(smirks_to_include=["[2H]O[2H]"], allow_partial_inclusion=True)]) # pandas dataframe of properties heavy_water_dataset = CurationWorkflow.apply(thermoml_data_frame, curation_schema, N_PROCESSES) heavy_water_dataset.to_csv("heavy_water.csv")

mattwthompson · 2023-03-10T20:58:07Z

There appear to be some performance impacts. Plucking a small data set and doing some basic filtering shows no change when using an inclusive SMARTS pattern and some speedup when using an exclusive one. I spent some time with the full ThermoML dataset but it's too large to run through several times.

In [1]: from openff.evaluator.datasets.curation.components import filtering, thermoml
/Users/mattthompson/software/openff-evaluator/openff/evaluator/protocols/yank.py:62: YankDeprecrationWarning: The YANK protocol has been deprecated and will be removed in a future release. Free energy calculations will be enabled by Perses in the future.
  warnings.warn(

In [2]: data_set = thermoml.ThermoMLDataSet.from_doi(
   ...:     "10.1016/j.fluid.2013.10.034",
   ...:     "10.1021/je1013476",
   ...: )

In [3]: %%timeit -n 1 -r 1
   ...: filtering.FilterBySmirks.apply(
   ...:     data_set,
   ...:     filtering.FilterBySmirksSchema(
   ...:         smirks_to_include=["C#C"], allow_partial_inclusion=True
   ...:     ),
   ...: )
   ...:
   ...:
# upstream/main:
312 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# This PR:
106 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [4]: %%timeit
   ...: filtering.FilterBySmirks.apply(
   ...:     data_set,
   ...:     filtering.FilterBySmirksSchema(
   ...:         smirks_to_include=["C"], allow_partial_inclusion=True
   ...:     ),
   ...: )
   ...:
   ...:
# upstream/main:
15 s ± 54.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This PR:
15.5 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A mostly-permissive pattern (C=O, hits ~95% of this data set) also shows a minor slowdown.

I'm not sure why this would be the case, I thought I was stripping out steps after copying the code over, not adding steps. I'm going to move forward with this, anyway, but we can revisit this later if the performance here is an issue.

pre-commit-ci bot and others added 2 commits March 6, 2023 23:37

[pre-commit.ci] pre-commit autoupdate

46eae3b

updates: - [github.com/nbQA-dev/nbQA: 1.6.1 → 1.6.3](nbQA-dev/nbQA@1.6.1...1.6.3)

Bypass toolkit when filtering by SMIRKS

14cae85

mattwthompson commented Mar 9, 2023

View reviewed changes

openff/evaluator/datasets/curation/components/filtering.py Outdated Show resolved Hide resolved

mattwthompson added 3 commits March 9, 2023 13:29

Merge remote-tracking branch 'upstream/main' into filter-smirks-rdkit

184b29d

Merge remote-tracking branch 'upstream/pre-commit-ci-update-config' i…

3011409

…nto filter-smirks-rdkit

Skip failure handling

37079d6

Implement bool idea

5584a50

jthorton reviewed Mar 10, 2023

View reviewed changes

Just use GetSubstructMatch

ff660b5

Add useChirality=True to match toolkit

b453c9c

mattwthompson marked this pull request as ready for review March 10, 2023 20:58

Update release history

55187e3

mattwthompson merged commit 27ef2a9 into main Mar 10, 2023

mattwthompson deleted the filter-smirks-rdkit branch March 10, 2023 21:55

mattwthompson mentioned this pull request Mar 14, 2023

FilterBySmirks does not work with isotopes #502

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bypass toolkit when filtering by SMIRKS #503

Bypass toolkit when filtering by SMIRKS #503

mattwthompson commented Mar 9, 2023 •

edited

Loading

codecov bot commented Mar 9, 2023 •

edited

Loading

jthorton commented Mar 10, 2023 •

edited

Loading

jthorton Mar 10, 2023

mattwthompson Mar 10, 2023

jthorton Mar 10, 2023

mattwthompson commented Mar 10, 2023

Bypass toolkit when filtering by SMIRKS #503

Bypass toolkit when filtering by SMIRKS #503

Conversation

mattwthompson commented Mar 9, 2023 • edited Loading

Description

Todos

Questions

Status

codecov bot commented Mar 9, 2023 • edited Loading

Codecov Report

jthorton commented Mar 10, 2023 • edited Loading

jthorton Mar 10, 2023

Choose a reason for hiding this comment

mattwthompson Mar 10, 2023

Choose a reason for hiding this comment

jthorton Mar 10, 2023

Choose a reason for hiding this comment

mattwthompson commented Mar 10, 2023

mattwthompson commented Mar 9, 2023 •

edited

Loading

codecov bot commented Mar 9, 2023 •

edited

Loading

jthorton commented Mar 10, 2023 •

edited

Loading