
[BRAINSTORM] refactor to use MOF objects as inputs #421

Open
kjappelbaum opened this issue Dec 21, 2022 · 1 comment · May be fixed by #422
Labels
enhancement (New feature or request) · question (Further information is requested)

Comments

kjappelbaum (Owner) commented Dec 21, 2022

I'm thinking of switching to a different design pattern to make it easier to implement new featurizers, especially ones based on fragments, graphs, and building blocks.

I started introducing more and more conditionals on the input types for featurizers, be they graphs, molecules, or structures, which makes the code really complicated -- especially if we also want to cache as much as possible.

All of this would be much easier if the inputs to the featurizers were always MOF objects. I already started implementing MOF objects for mofchecker and moffragmentor, and they typically have (memoized) methods for computing graphs/fragments/hashes/...

If we were to use them, the object would simply "cache" the graph and the featurizer would no longer need to worry about it. Also, each featurizer could simply get the object attributes it needs (and the caller would not need to worry about what to provide, i.e., this logic would live in only one place).

In the long run, this might also make the implementation of the datasets easier since, again, the logic would live only on the structure object and not be partially re-implemented in the datasets.

Advantages

  • Fewer conditionals making the control flow in featurizers dependent on the input type
  • Less logic for caching --- the input objects can cache it (see the sketch after this list)
  • Common design pattern with mofchecker and moffragmentor. Could, perhaps, be factored out into its own very thin library that just provides the API (to be seen)
  • Multiple featurizers would also no longer need to handle so much logic about primitive structures
  • We could relatively easily serialize fragments and features in the same workflow without computational overhead
  • We could get rid of things such as MOFBBs objects
  • Issue #420 (make lru cache sizes customizable) would close itself because we would no longer need lru_cache
  • Featurizers such as RACs would no longer need their own nearest-neighbor search implementation
  • There would no longer be a need for a MOF base featurizer.
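
A minimal sketch of the caching point above, assuming a cached_property-based MOF object; get_structure_graph stands in for whatever graph-construction routine we end up using:

from functools import cached_property

class MOF:
    def __init__(self, structure):
        self.structure = structure

    @cached_property
    def structure_graph(self):
        # computed once on first access, then stored on the instance;
        # featurizers no longer need lru_cache or cache-size bookkeeping
        return get_structure_graph(self.structure)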

Disadvantages

  • We would need a very thin wrapper for interoperability with matminer, i.e., we would basically always extract the structure attribute and then pass this to pymatgen (see the sketch after this list)
  • It would be a breaking change for the current API: we would need to update all docs, tutorials, and the examples in the paper (however, now might be a good time to do it, as there is not that much use yet)
  • The BU featurizer would need to follow a different pattern and would still, in some way, need an annotation of what type the wrapped featurizer extracts
  • The BU featurizer still seems difficult to couple with a MultipleFeaturizer of mixed types. Probably, the recommended usage pattern would be to wrap BUFeaturizers in a MultipleFeaturizer if needed, so that each BUFeaturizer always operates on only one type, which we could then easily extract and pass to the wrapped featurizer. Caching of the fragments is no problem, as it is handled by the MOF object
  • Many featurizers do not only work on MOFs; in this sense, the naming (also of the library) is misleading
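
A sketch of the thin matminer wrapper mentioned above; the class name MatminerWrapper is illustrative, while DensityFeatures is a real matminer featurizer:

from matminer.featurizers.structure import DensityFeatures

class MatminerWrapper:
    def __init__(self, featurizer):
        self._featurizer = featurizer

    def featurize(self, mof):
        # matminer featurizers operate on pymatgen objects, so we just
        # extract the structure attribute from the MOF object
        return self._featurizer.featurize(mof.structure)

    def feature_labels(self):
        return self._featurizer.feature_labels()

wrapped = MatminerWrapper(DensityFeatures())
# feats = wrapped.featurize(mof)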

Usage example

from mofdscribe import MOF
# BUFeaturizer and LSOP imports omitted for brevity

mof = MOF.from_file("some_file.cif")
bu = BUFeaturizer(LSOP())
feats = bu.featurize(mof)

Implementation idea

from functools import cached_property

class MOF:
    def __init__(self, structure, structure_graph):
        self.structure = structure
        self.structure_graph = structure_graph

    @classmethod
    def from_structure(cls, structure):
        structure_graph = get_structure_graph(structure)
        return cls(structure, structure_graph)

    @classmethod
    def from_file(cls, filename):
        ...

    @cached_property
    def fragments(self):
        ...

    @cached_property
    def scaffold_hash(self):
        ...

The featurizers would have the following signatures

class GraphFeaturizer:
    def featurize(self, mof):
        structure_graph = mof.structure_graph
        return self._featurize(structure_graph)

class StructureFeaturizer:
    def featurize(self, mof):
        structure = mof.structure
        return self._featurize(structure)

MatminerFeaturizer = StructureFeaturizer

class SiteFeaturizer:
    def featurize(self, mof, i):
        structure = mof.structure
        structure_graph = mof.structure_graph
        return self._featurize(structure, structure_graph, i)

class BUFeaturizer:
    def __init__(self, featurizer):
        if isinstance(featurizer, MultipleFeaturizer):
            raise ValueError("featurizer must be a single featurizer")
        self._featurizer = featurizer

    def featurize(self, mof):
        fragments = mof.fragments
        ...

In the last case, for the BUFeaturizer, the featurizer can inspect whether it needs to get the molecules/structures/graphs from the fragments and then call the _featurize method of the featurizer passed in the constructor (see the sketch below).
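
A rough sketch of that dispatch, assuming a hypothetical _accepted_input_type annotation on the wrapped featurizer (none of these attribute names are fixed yet):

import numpy as np

class BUFeaturizer:
    def __init__(self, featurizer):
        self._featurizer = featurizer

    def featurize(self, mof):
        # hypothetical annotation: which representation the wrapped
        # featurizer expects ("molecule", "structure", or "graph")
        needed = self._featurizer._accepted_input_type
        features = []
        for fragment in mof.fragments:
            representation = getattr(fragment, needed)  # e.g. fragment.molecule
            features.append(self._featurizer._featurize(representation))
        return np.concatenate(features)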

The MOFMultipleFeaturizer should probably always loop over structures for featurize_many.

Also, if I did something like featurizer = MOFMultipleFeaturizer([BUFeaturizer(LSOP()), BUFeaturizer(Dimensionality()), RACS()]), it should work without problems, as all featurizers always accept MOF objects (see the sketch below).
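
A minimal sketch of what that could look like, with the structure-outer loop for featurize_many (error handling and parallelization omitted):

import numpy as np

class MOFMultipleFeaturizer:
    def __init__(self, featurizers):
        self.featurizers = featurizers

    def featurize(self, mof):
        # every wrapped featurizer accepts the same MOF object,
        # so no type dispatch is needed here
        return np.concatenate([f.featurize(mof) for f in self.featurizers])

    def featurize_many(self, mofs):
        # loop over structures in the outer loop so each MOF's cached
        # properties (graph, fragments, ...) are computed only once
        return np.vstack([self.featurize(mof) for mof in mofs])

The fragment-serialization workflow from the advantages list would then look like: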

from pathlib import Path
from fastcore.xtras import save_pickle
from mofdscribe import MOF

for file in files:
    mof = MOF.from_file(file)
    fragments = mof.fragments
    # fastcore's save_pickle takes the file name first, then the object
    save_pickle(Path(file).with_suffix(".pkl"), fragments)
    features = featurizer.featurize(mof)
kjappelbaum added the enhancement and question labels on Dec 21, 2022
kjappelbaum changed the title from "refactor to use MOF objects as inputs" to "[BRAINSTORM] refactor to use MOF objects as inputs" on Dec 21, 2022
kjappelbaum (Owner, Author) commented

Also, testing the implementation now, another benefit seems to be that I no longer need to think about the primitive option. For files, reducing to the primitive structure is the default; however, if users already have pymatgen.core.Structure objects, we assume that they know what they are doing.
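
A sketch of how this could look; get_primitive_structure is pymatgen's built-in, while the primitive keyword is illustrative:

from pymatgen.core import Structure

class MOF:
    @classmethod
    def from_file(cls, filename, primitive=True):
        structure = Structure.from_file(filename)
        if primitive:
            # default for files: reduce to the primitive cell
            structure = structure.get_primitive_structure()
        return cls.from_structure(structure)

    @classmethod
    def from_structure(cls, structure):
        # no primitive reduction: users passing Structure objects
        # are assumed to know what they are doing
        return cls(structure, get_structure_graph(structure))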
