
[BRAINSTORM] refactor to use MOF objects as inputs #421

Open
kjappelbaum opened this issue Dec 21, 2022 · 1 comment · May be fixed by #422
Labels
enhancement (New feature or request) · question (Further information is requested)

Comments

kjappelbaum (Owner) commented Dec 21, 2022

I'm thinking of switching to a different design pattern to make it easier to implement new featurizers, especially ones based on fragments, graphs, and building blocks.

I started introducing more and more conditionals on the input types for featurizers, be they graphs, molecules, or structures, which makes the code really complicated -- especially if we also want to cache as much as possible.

All of this would be much easier if the inputs to the featurizers were always MOF objects. I already started implementing MOF objects for mofchecker and moffragmentor, and they typically have (memoized) methods for computing graphs/fragments/hashes/...

If we were to use them, the object would simply "cache" the graph and the featurizer would no longer need to worry about it. Also, each featurizer could simply get the object attributes it needs (and the caller would not need to worry about what to provide, i.e., this logic would live in only one place).

In the long run, this might also make the implementation of the datasets easier since, again, the logic would live only on the structure object and not be partially re-implemented in the datasets.

Advantages

  • Fewer conditionals making the control flow in featurizers dependent on the input type
  • Less logic for caching --- the input objects can cache it (see the sketch after this list)
  • Common design pattern with mofchecker and moffragmentor. Could, perhaps, be factored out into its own very thin library that just provides the API (to be seen)
  • Multiple featurizers would also no longer need to handle so much logic about primitive structures
  • We could relatively easily serialize fragments and features in the same workflow without computational overhead
  • We could get rid of things such as MOFBBs objects
  • Issue #420 (make lru cache sizes customizable) would close itself because we would no longer need lru_cache
  • Featurizers such as RACs would no longer need their own nearest-neighbor search implementation
  • There would no longer be a need for a MOF base featurizer.
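
A minimal sketch of the caching point above, assuming a cached_property-based MOF object; get_structure_graph stands in for whatever graph-construction routine we end up using:

from functools import cached_property

class MOF:
    def __init__(self, structure):
        self.structure = structure

    @cached_property
    def structure_graph(self):
        # computed once on first access, then stored on the instance;
        # featurizers no longer need lru_cache or cache-size bookkeeping
        return get_structure_graph(self.structure)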

Disadvantages

  • We would need a very thin wrapper for interoperability with matminer, i.e., we would basically always extract the structure attribute and then pass this to pymatgen (see the sketch after this list)
  • It would be a breaking change for the current API: we would need to update all docs, tutorials, and the examples in the paper (however, now might be a good time to do it, as there is not that much use yet)
  • The BU featurizer would need to follow a different pattern and would still, in some way, need an annotation of what type the wrapped featurizer extracts
  • The BU featurizer still seems difficult to couple with a MultipleFeaturizer of mixed types. Probably, the recommended usage pattern would be to wrap BUFeaturizers in a MultipleFeaturizer if needed, so that each BUFeaturizer always operates on only one type, which we could then easily extract and pass to the wrapped featurizer. Caching of the fragments is no problem, as it is handled by the MOF object
  • Many featurizers do not only work on MOFs; in this sense, the naming (also of the library) is misleading
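
A sketch of the thin matminer wrapper mentioned above; the class name MatminerWrapper is illustrative, while DensityFeatures is a real matminer featurizer:

from matminer.featurizers.structure import DensityFeatures

class MatminerWrapper:
    def __init__(self, featurizer):
        self._featurizer = featurizer

    def featurize(self, mof):
        # matminer featurizers operate on pymatgen objects, so we just
        # extract the structure attribute from the MOF object
        return self._featurizer.featurize(mof.structure)

    def feature_labels(self):
        return self._featurizer.feature_labels()

wrapped = MatminerWrapper(DensityFeatures())
# feats = wrapped.featurize(mof)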

Usage example

from mofdscribe import MOF
# BUFeaturizer and LSOP imports omitted for brevity

mof = MOF.from_file("some_file.cif")
bu = BUFeaturizer(LSOP())
feats = bu.featurize(mof)

Implementation idea

from functools import cached_property

class MOF:
    def __init__(self, structure, structure_graph):
        self.structure = structure
        self.structure_graph = structure_graph

    @classmethod
    def from_structure(cls, structure):
        structure_graph = get_structure_graph(structure)
        return cls(structure, structure_graph)

    @classmethod
    def from_file(cls, filename):
        ...

    @cached_property
    def fragments(self):
        ...

    @cached_property
    def scaffold_hash(self):
        ...

The featurizers would have the following signatures

class GraphFeaturizer:
    def featurize(self, mof):
        structure_graph = mof.structure_graph
        return self._featurize(structure_graph)

class StructureFeaturizer:
    def featurize(self, mof):
        structure = mof.structure
        return self._featurize(structure)

MatminerFeaturizer = StructureFeaturizer

class SiteFeaturizer:
    def featurize(self, mof, i):
        structure = mof.structure
        structure_graph = mof.structure_graph
        return self._featurize(structure, structure_graph, i)

class BUFeaturizer:
    def __init__(self, featurizer):
        if isinstance(featurizer, MultipleFeaturizer):
            raise ValueError("featurizer must be a single featurizer")
        self._featurizer = featurizer

    def featurize(self, mof):
        fragments = mof.fragments
        ...

In the last case, for the BUFeaturizer, the featurizer can inspect whether it needs to get the molecules/structures/graphs from the fragments and then call the _featurize method of the featurizer passed in the constructor (see the sketch below).
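
A rough sketch of that dispatch, assuming a hypothetical _accepted_input_type annotation on the wrapped featurizer (none of these attribute names are fixed yet):

import numpy as np

class BUFeaturizer:
    def __init__(self, featurizer):
        self._featurizer = featurizer

    def featurize(self, mof):
        # hypothetical annotation: which representation the wrapped
        # featurizer expects ("molecule", "structure", or "graph")
        needed = self._featurizer._accepted_input_type
        features = []
        for fragment in mof.fragments:
            representation = getattr(fragment, needed)  # e.g. fragment.molecule
            features.append(self._featurizer._featurize(representation))
        return np.concatenate(features)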

The MOFMultipleFeaturizer should probably always loop over structures for featurize_many.

Also, if I did something like featurizer = MOFMultipleFeaturizer([BUFeaturizer(LSOP()), BUFeaturizer(Dimensionality()), RACS()]), it should work without problems, as all featurizers always accept MOF objects (see the sketch below).
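
A minimal sketch of what that could look like, with the structure-outer loop for featurize_many (error handling and parallelization omitted):

import numpy as np

class MOFMultipleFeaturizer:
    def __init__(self, featurizers):
        self.featurizers = featurizers

    def featurize(self, mof):
        # every wrapped featurizer accepts the same MOF object,
        # so no type dispatch is needed here
        return np.concatenate([f.featurize(mof) for f in self.featurizers])

    def featurize_many(self, mofs):
        # loop over structures in the outer loop so each MOF's cached
        # properties (graph, fragments, ...) are computed only once
        return np.vstack([self.featurize(mof) for mof in mofs])

The fragment-serialization workflow from the advantages list would then look like: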

from pathlib import Path
from fastcore.xtras import save_pickle
from mofdscribe import MOF

for file in files:
    mof = MOF.from_file(file)
    fragments = mof.fragments
    # fastcore's save_pickle takes the file name first, then the object
    save_pickle(Path(file).with_suffix(".pkl"), fragments)
    features = featurizer.featurize(mof)
kjappelbaum added the enhancement and question labels on Dec 21, 2022
kjappelbaum changed the title from "refactor to use MOF objects as inputs" to "[BRAINSTORM] refactor to use MOF objects as inputs" on Dec 21, 2022
kjappelbaum (Owner, Author) commented

Also, testing the implementation now, another benefit seems to be that I no longer need to think about the primitive option. For files, reducing to the primitive structure is the default; however, if users already have pymatgen.core.Structure objects, we assume that they know what they are doing.
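
A sketch of how this could look; get_primitive_structure is pymatgen's built-in, while the primitive keyword is illustrative:

from pymatgen.core import Structure

class MOF:
    @classmethod
    def from_file(cls, filename, primitive=True):
        structure = Structure.from_file(filename)
        if primitive:
            # default for files: reduce to the primitive cell
            structure = structure.get_primitive_structure()
        return cls.from_structure(structure)

    @classmethod
    def from_structure(cls, structure):
        # no primitive reduction: users passing Structure objects
        # are assumed to know what they are doing
        return cls(structure, get_structure_graph(structure))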
