Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hooks for before/after instantiating objects #170

Closed
Erotemic opened this issue Oct 7, 2022 · 17 comments · Fixed by #326
Closed

Add hooks for before/after instantiating objects #170

Erotemic opened this issue Oct 7, 2022 · 17 comments · Fixed by #326
Labels
enhancement New feature or request

Comments

@Erotemic
Copy link
Contributor

Erotemic commented Oct 7, 2022

🚀 Feature request

Add methods jsonargparse.ArgumentParser that let the user specify custom code that can be executed before / after an object is instantiated.

Motivation

I'm working on porting the ML system I'm working on to LightningCLI with jsonargparse.

Some of the design goals of this system are:

  • Remove magic numbers
  • Maintain a ledger of how data is derived
  • Remove the need for redundant specifications.

So far jsonargparse is helping a lot and I'm very impressed with the library. It's by far the best argparse alternative I've come across. However, I'm having an issue slotting it into my system, because I require that my data module introspects my underlying dataset and passes relevant parameters to the model. Unfortunately these are not available at initialization time. I have a MWE illustrating the issue:

import jsonargparse
import pathlib


class Data:
    def __init__(self, fpath=None):
        self.fpath = pathlib.Path(fpath)
        self.content = None

    def setup(self):
        self.content = self.fpath.read_text()
        return self


class Model:
    def __init__(self, content=None, param1=None):
        assert content is not None
        self.content = content


def main():
    parser = jsonargparse.ArgumentParser()
    parser.add_class_arguments(Data, nested_key='data', fail_untyped=False)
    parser.add_class_arguments(Model, nested_key='model', fail_untyped=False)

    parser.link_arguments('data.content', 'model.content', apply_on='instantiate')

    foo_fpath = pathlib.Path('foo.txt')
    foo_fpath.write_text('content of a file')

    config = parser.parse_args(args=['--data.fpath', str(foo_fpath)])
    instances = parser.instantiate_classes(config)


if __name__ == '__main__':
    main()

In the above example, the model cannot be instantiated because the content of the file hasn't been read by the Data class yet.

Pitch

Perhaps adding methods like:

  • parser.add_before_instantiate(key, fn): "key specifies the argument, and fn is passed the key, cls, and init_args"
  • parser.add_after_instantiate(key, fn): "key specifies the argument, and fn is passed the instantiated object"

But idk if these are the best options. This could also be done by adding before_instantiate and after_instantiate arguments to add_class_arguments. Or perhaps the before / after is shirked entirely and instantiate could be given as a function where the user is responsible for actually creating the object. I'm not super concerned about the details, I just want a way to control the creation of objects with more fidelity than is currently possible.

I would like to be able to do something like:

def main_ideal():
    parser = jsonargparse.ArgumentParser()
    parser.add_class_arguments(Data, nested_key='data', fail_untyped=False)
    parser.add_class_arguments(Model, nested_key='model', fail_untyped=False)

    # Proposed new feature
    parser.add_after_instantiate('data', lambda obj: obj.setup())

    parser.link_arguments('data.content', 'model.content', apply_on='instantiate')

    foo_fpath = pathlib.Path('foo.txt')
    foo_fpath.write_text('content of a file')

    config = parser.parse_args(args=['--data.fpath', str(foo_fpath)])
    instances = parser.instantiate_classes(config)

Where I can control some pre/post processing step after instantiation happens. That way, in this instance, I can ensure that the correct value is populated in data before link_arguments connects data.content to model.content.

Alternatives

The natural question is then: why don't I just call setup() in Data.__init__? The reason is that jsonargparse may not be the only object that needs to use Data. And in general I think it's a good design practice to keep as much work out of __init__ methods as possible (it makes debugging objects easier if you can always make an instance).

But what about a wrapper class that only jsonargparse gets to see? Well, now we are adding more boilerplate that jsonargparse is designed to avoid.

There is an alternative that I'm currently using, although I really don't like it. I'm effectively hacking link_arguments to do something similar enough to what I want. Namely:

    parser = jsonargparse.ArgumentParser()
    parser.add_class_arguments(Data, nested_key='data', fail_untyped=False)
    parser.add_class_arguments(Model, nested_key='model', fail_untyped=False)

    parser.link_arguments('data', 'model.content', apply_on='instantiate', compute_fn=lambda data: data.setup().content)

    foo_fpath = pathlib.Path('foo.txt')
    foo_fpath.write_text('content of a file')

    config = parser.parse_args(args=['--data.fpath', str(foo_fpath)])

    instances = parser.instantiate_classes(config)
    model = instances.model
    print(f'model.content={model.content}')

This abuses link arguments to get an instance of "data" before an instance of "model" is created and then applies the custom logic.

I think this is a clear anti-pattern, but unfortunately I don't see a viable alternative to it.

Discussion

I'm open to being the person who implements this PR, but I'd like to get feedback on the issue first. I'm pretty sure I'm not missing anything existing in jsonarparse that could solve my issue, but if I am, I'd love to be pointed to it and just close the issue. However, if this feature is not supported and deemed useful enough to add support for it, I think there should be discussion about how it is implemented.

Adding two methods add_after_instantiate and add_before_instantiate seems like it adds too much clutter to the top-level API. I'm wondering if this should be implemented as a more general hook system: e.g. add_hook(hook_name, hook_fn) where the first hooks added are for the instantiate cases I've mentioned.

Although I'd accept adding hook args to add_class_arguments as a solution to my issue, it isn't ideal in my case because I'd have to modify LightningCLI to gain access to that feature. Not the end of the world, but I think a method attached to the parser itself that lets you add these hooks would be more valuable.

The other option I'd be completely happy with is an add_instantiator that allows the allows the user to override exactly how instantiation happens for an object.

Summary

I'd like to customize instantiated objects either before and/or after they are created. Is this in the scope of jsonargparse? What are the thoughts of other users / maintainers?

@Erotemic Erotemic added the enhancement New feature or request label Oct 7, 2022
@mauvilsa
Copy link
Member

mauvilsa commented Oct 8, 2022

Thank you very much for the detailed proposal!

I will first comment about a potential after_instantiate hook. After a few thoughts, the only reason I could think of for doing something after instantiation is that it is going to be used in some argument link, which is the case described above. Otherwise, whatever needs to be done can be after the call to instantiate_classes finishes. Thus, it seems redundant to have after_instantiate hook and link arguments compute_fn.

I am not sure what you mean by abuses link arguments to get an instance of "data" before an instance of "model". This is how links applied on instantiation work, first data would be instantiated and then a compute_fn does whatever is needed to prepare an argument that model needs. There is nothing wrong by giving the instance of data to the compute_fn. In general I think that it is better for compute_fn to be a named function, so it could be something like:

def get_data_content(data):
    data.setup()
    return data.content

...

    parser.link_arguments('data', 'model.content', apply_on='instantiate', compute_fn=get_data_content)

With this, debugging is easy because a break point can be set inside get_data_content and the help of the CLI would show the link as get_data_content(data) --> model.content.

However, for this specific case I would recommend something a bit different. First, it might be convenient that calling setup multiple times does the work only once. And implement a property for content, i.e.:

class Data:
    def __init__(self, fpath=None):
        ...
        self._already_setup = False

    def setup(self):
        if self._already_setup:
            return
        self._content = ...
        ...
        self._already_setup = True

    @property
    def content(self):
        self.setup()
        return self._content

Then the link is created as parser.link_arguments('data.content', 'model.content', apply_on='instantiate'). The call to setup() is not done in Data.__init__ as you wanted. Even better, the class has more cohesion and simplifies the use of content because there is no need to remember that setup() must be called first.

@mauvilsa
Copy link
Member

mauvilsa commented Oct 8, 2022

Regarding a potential before_instantiate hook, there is no motivation for it in the description. You added it just for completeness, to have before and after, or do you have any real use case for it? I have in mind a use case that might need something like this, but I would like to hear about other possible use cases.

@Erotemic
Copy link
Contributor Author

Erotemic commented Oct 8, 2022

I am not sure what you mean by abuses link arguments

I call it an abuse because it's linking "data" to "model.content", even though the goal is to link "data.content" to "model.content". The function isn't just returning a modified value, it's determining the attribute that needs to be linked as well. I think this is conceptually inelegant, but that's just an opinion, I think the measurable objective criteria are that using link arguments like this:

  1. requires non-trivial boilerplate on in the user's code, and
  2. it's not obvious that "link_arguments" is currently the only way to accomplish customized setup of object (when there are dependencies).

Because I perceive a purpose of jsonargparse to be reducing boilerplate (otherwise why not just use stdlib argparse?) and because it's non-obvious that link_arguments is the way to accomplish this sort of behavior, I feel it's accurate to call the workaround an abuse.

With this, debugging is easy because a break point can be set inside get_data_content and the help of the CLI would show the link as get_data_content(data) --> model.content.

Debugging might be easier but if you have to do this for more than one property it becomes cumbersome. In the MWE I only included on property, but in my system I'm doing this for multiple properties. The workaround solution I came up with is:

            def data_value_getter(key):
                # Hack to call setup on the datamodule before linking args
                def get_value(data):
                    if not data.did_setup:
                        data.setup('fit')
                    return getattr(data, key)
                return get_value

which naively makes a help message like:

  get_value(data) --> model.init_args.dataset_stats [applied on instantiate]
                        Use --model.help CLASS_PATH for details.

And yes, I could modify the name of the returned function to make it more clear, and I probably will in the short-term, but I wan't to stress that is more boilerplate on the user-side, which makes the definition of the CLI less clean. And yes, I have augmented my data setup to include logic to be lazy on secondary calls, but again, that adds boilerplate on the userside.

And implement a property for content

The note on the propriety that computes content lazily is reasonable, and that might be a decent way to modify my code to fit more cleanly with jsonargparse, but I think the use-case of adding some form of instantiate hooks to jsonargparse is still well-motivated.

Consider the use case (which happens to be similar to the situation I find myself in) where a new engineer (not-me) on a team wants to move an existing system to jsonargparse. Demonstrating the benefits of jsonargparse to other team members (me) is much easier when it doesn't need to impact the way other code is written in order to slot itself in.

Regarding a potential before_instantiate hook hook, there is no motivation for it in the description

Yes, it's more for completion with the original idea of a before / after hook. Really either one of them would work for my use case. I'm less concerned on the exact details of the hook, and more about adding some conceptually clear method for handling the details of object instantiation, and one of my goals with this discussion is to determine what the best way to do that is, because I don't think my specific proposal is the best way to do it.

Something like parser.process_arguments(<name>, <func>) might be better, but again I don't have a perfectly clear idea of what it is. I feel like there is a method name that is conceptually related to "link_arguments", takes a parameter name and a function and is called to modify the value of that parameter, but I'm failing to come up with a name / signature that feels exactly right. I'm just going to brainstorm a few of them quickly:

  • parser.augment_argument
  • parser.add_argument_hook
  • parser.add_hook
  • parser.register_hook
  • parser.process_arguments
  • parser.add_after_instantiate

I feel like a full blown hook system is probably "the right way ™" to accomplish my and potentially many other use-cases, but without motivation for those other use-cases I realize that's a hard sell. But then again, that's a lot of what argparse is doing under the hood, so maybe it isn't that hard of a sell, although it is not the quickest path to addressing my particular problem, but it might be the most sustainable path for jsonargparse in general.

Now that I'm thinking about it, adding a hook system is about as much work as adding anything of the more specific alternatives. Assuming my arguments as to why jsonargparse would benefit from such a feature have been convincing, perhaps the best way forward would be to implement a hook system with exactly one event: after_instantiate, and then add more events if use cases arise?

@mauvilsa
Copy link
Member

Sorry, but I am not convinced. In the use case that you described, the best solution is to add a property to the Data class as I mentioned. If it is not wanted or not possible to improve the Data class, the compute_fn option already allows to do whatever is needed. The reason why compute_fn was added was precisely for cases like this. It is not just for barely modifying a value. If you think it is inelegant or a workaround, then so be it. That is the way to do it. If the documentation is not clear, then it can be improved. Adding a second way to do the same is redundant, makes the learning curve steeper, increases the maintenance cost of the project, etc.

jsonargparse has plenty of features that makes it worth using instead of argparse or even other parsing libraries. It is not just about reduction of boilerplate. Certainly jsonargparse can be used without impacting how the other code is written. But if it makes sense to change the other code, then it should be recommended to do so. In the Data case, changing content to a property which makes sure that setup is called only once makes the Data class better, completely independent from argument parsing or configuration files.

Also as I mentioned before, I do know of a use case that would require a hook before instantiation, and currently simply there is no way to do it. But that would not help in the use case you mentioned. Unless in that hook the class is instantiated, which would be useless since jsonargparse would instantiate again afterwards. Registering a custom instantiator, like the add_instantiator you mentioned could be. Though, this needs motivating use cases. The one that I have in mind only requires to manipulate the arguments prior to instantiation.

@Erotemic
Copy link
Contributor Author

Fair points, but the thing I'm getting hung up on is that the destination seems superfluous, I just want to postprocess and argument without worrying about what else it might be linked to. Yet, the maintenance cost is real, and I respect that.

Now that I'm framing it like this what about this, I'm thinking of an alternative.

What if jsonargparse is modified to allow the source and destination of link_arguments to be the same?

parser.link_arguments('data', 'data', compute_fn=...)

I half expected this to just work when I tried it, but it did fail due to cycles in the graph. However, the modification to simply ignore self-loops in this check and then respect those self-loops in further processing would be a much smaller modification than adding a second way to do something.

Does this sound any better?

@mauvilsa
Copy link
Member

There is a missing feature in link_arguments which is a target without a source. This only makes sense for links applied on parse. Take for example a class which has a when parameter. Someone might want to implement a CLI in which when is not configurable. So it makes sense to allow adding a link like:

parser.link_arguments(None, 'some.nested.key.when', compute_fn=two_days_from_now)

And in the help the link would be shown as:

two_days_from_now() --> some.nested.key.when       [applied on parse]

Does this sound any better?

It does sound better. If source can be None then it is not unreasonable to allow target to be None. Though in this case it would only make sense for links applied on instantiation. A link could be added as:

parser.link_arguments('data', None, compute_fn=call_setup)

And in the help the link would be shown as:

call_setup(data)       [applied on instantiate]

It is a bit strange that the method is called link_arguments and source or target can be None. But not a deal breaker I think. Also better from the learning curve perspective, since it is just an extension of the already existing link_arguments concept. Not a completely new topic which would be hooks.

@Erotemic
Copy link
Contributor Author

I was thinking about having target as None, but I think it makes more sense to force the user to always specify a target.
You're linking something to it. Having source as None does make sense because the compute_fn provides the value to be "linked in" to the target.

However, when target is None, I think it cause a conceptual break because in every other instance, the compute_fn never modifies the source based on it's return value. I think for this to make sense the target needs to be specified as itself.

It is a bit strange that the method is called link_arguments

The way I would think about it is that you are linking something to the target argument. It could be from a secondary source argument, or via a argumentless compute_fn.

I'll see if I can code up this feature and submit a PR.

@mauvilsa
Copy link
Member

mauvilsa commented Oct 13, 2022

Would be great if you contribute. But maybe don't implement just yet. Now I have some doubts. Your plan is to change the DirectedGraph class to not fail with self loops? If the graph has self-loops it will no longer be a directed acyclic graph and in general the topological sort will not do what is needed. I don't know of any standard graph algorithms that do what is needed in this case. So I do worry this will just mess up the check for cycles and in cases lead to incorrect instantiation order.

I fear that even having this in link_arguments, it unnecessarily increases the complexity and maintenance of the project. I still don't understand why are you so insistent on this. Your data_value_getter solution is perfectly fine and can be improved so that the generated functions have specific names. Function generators are less trivial than a normal function. But apart from this, why is there a need to have a second way to do what can be done with what already exists?

@mauvilsa
Copy link
Member

There is another reason to not rush into implementation. As I mentioned, I have a use case that requires to manipulate arguments before instantiation. Is is better to first have clarity on how that would be implemented to avoid redundancy of features.

@Erotemic
Copy link
Contributor Author

Your plan is to change the DirectedGraph class to not fail with self loops? If the graph has self-loops it will no longer be a directed acyclic graph and in general the topological sort will not do what is needed.

The topological sort would be run on the graph where self-loops are removed. The standard graph algorithm is just a composition of primitive operations: remove self-loops, run topological sort. That produces a node order, and then the original graph is iterated over where the self loops still exist. (I'm I have a good deal of experience working with graph algorithms, networkx specifically. I'm confident in this idea, so let me know if there is anything else that needs further explanation).

It unnecessarily increases the complexity and maintenance of the project. I still don't understand why are you so insistent on this.

The proposed feature is a natural extension of the current way that link_parameters works, and it results in a conceptually easy way to accomplish the task I'm trying to do. If I was teaching this to someone I would not want to tell them: well you have to make a nested closure that returns a function to do what you want, and you have to connect it to something, even if you don't directly mean to use that. I care a lot about making my CLI definition clear, and the combination of requiring that I connect data to something as well as requiring a decorated function generator is enough of a distraction where I feel any added complexity (and I would argue self-loop handling does not add that much) is warranted.

Your use case would also be supported in the PR I have in mind. Consider a link-arguments graph where a special node is marked as None. That will represent the None source, and if it is linked to a target node, then it will populate that node with None. However, if a compute_fn is specified, then you can return whatever you want (e.g. two_days_from_now).

The following rules capture all use cases discussed here:

  • For each link-edge look at the source node, map it to the target node, use a compute function if it is specified.
  • None as a source node is a special node that always exists and and always returns None.
  • None as a target node is a special node (distinct from the None source) that does not store the computed value anywhere.
  • Ignore self-loops and None nodes when computing the topological sort.

@mauvilsa
Copy link
Member

Good to hear that you have experience with graphs. Just note, a directed graph can have many topological orders. Right now any topological order will do. But with the possible link_arguments extension, the self loops always need to come before other edges in the node, regardless of the order in which edges are added.

The use case that I was saying is not the one from the two_days_from_now example. Also a clarification, that is for links applied on parse which do not use DirectedGraph. The use case that I was referring to is modifying parameters right before instantiation. Might be that to support this, it makes sense to have an add_instantiator or similar, which would also allow to add logic before and after instantiation. This would overlap with the extension of link_arguments which is what I want to avoid. So please, don't rush into implementation just yet.

If I was teaching this to someone I would not want to tell them: well you have to make a nested closure that returns a function to do what you want, and you have to connect it to something, even if you don't directly mean to use that.

If you were teaching this to someone, then teach them that the class should implement a property like I was saying before. The ease of use of a class such as Data, e.g. be able to access a property without having to worry about calling setup() first, is a concern of that class, not a concern of a CLI that uses that class. You know, separation of concerns principle. Not wanting to populate an attribute on __init__ and additionally disallowing to improve the class, are restrictions that limit significantly the scope of why the extension of link_arguments would be needed. Assuming these restrictions for teaching does not make sense to me. And if for some rare use case these restrictions are imposed, then the data_value_getter solution is fine. So I still don't see it, why is this so important?

@Erotemic
Copy link
Contributor Author

a directed graph can have many topological orders. Right now any topological order will do.

Yes, because the topological sort guarantees you see all ancestors of a node before you see a node. I can't think of any reason why non-uniqueness would matter. There's no way self-loops would mess up instantiation order.

referring to is modifying parameters right before instantiation

Ah, so you want to have access to all parameters before instantiating the object, and then effectively construct and return the object yourself? That does go beyond the current feature I had in mind. But such a feature would also work for what I want to do. I don't think it fits neatly into link arguments, which is effectively adding special edges in a graph where the nodes have already been determined. From a graph level, this feature would need to add in a new node to handle the arguments before passing them to the intended target.

So I still don't see it, why is this so important?

I'm feeling like you're focusing on the minimal working example a bit too much. I have to maintain a large system that requires backwards compatibility and sometimes changing other code for the better is not an option. I have to work with code from a lot of different people of various Python skill levels. Getting them to change the way they are doing things is not always easy. Instead it's much easier if I can just add in some post processing hooks and not have to redesign every class I'm given.

Furthermore, I think the idea stands on its own merits as a natural extension of what currently exists. Link arguments defines a attribute setting graph with an optional function on each edge. It is natural for a graph like that to have a null source / sink and to allow self loops. If I haven't satisfied you with my use case by now, then I don't think I will be able to. If you want to reject the idea that's fine. If I'm the only person who would ever use this, then I understand it's not worth the maintenance cost.

@mauvilsa
Copy link
Member

I am not rejecting the idea. I just don't want to rush into any decision. In fact I am very grateful about this discussion. When I have more time I will describe here the other use case. It would be great to hear your thoughts on it.

@mauvilsa
Copy link
Member

The use case that I was mentioning is the following. In LightningCLI currently the trainer group is configured to not instantiate, see cli.py#L124. This is needed because of cli.py#L531-L564. Unfortunately this has the consequence of not being possible to create an instantiation link with the trainer object as source.

One known use case for this would be a link from trainer.estimated_stepping_batches to a OneCycleLR learning rate scheduler. I think this was asked by @catalys1 in the pytorch-lightning slack.

@mauvilsa
Copy link
Member

mauvilsa commented Dec 1, 2022

Another use case can be Lightning-AI/pytorch-lightning#15427. The save_hyperparameters method can only see the parameters the __init__ gets. This is limited because it does not work with the dependency injection pattern, since what __init__ sees are instances of classes or functions (see models-with-multiple-submodules and Lightning-AI/pytorch-lightning#15869). There is no way for save_hyperparameters to know what are the parameters needed to get those. With a custom instantiator, it would be possible to provide save_hyperparameters with the actual parameters.

mauvilsa added a commit that referenced this issue Jul 18, 2023
mauvilsa added a commit that referenced this issue Aug 16, 2023
@mauvilsa
Copy link
Member

I created pull request #326 implementing an add_instantiator method. The original description of this issue suggested add_instantiator as one of the possibilities. However, there is an important distinction. In the original description, a parser key is given when adding a function. This has shortcomings:

  • It can be desired to add an instantiator for multiple (or all) parser arguments. Providing a key makes these use cases cumbersome.
  • When a class is being instantiated, it could be in a deeply nested context. Knowing the key for which this class corresponds to is complex and does not justify adding this complexity to the code.

The implemented add_instantiator expects a class type. The parser when choosing an instantiator function, it checks whether the class to instantiate is one of the classes, or optionally a subclass, corresponding to one of the instantiations.

@Erotemic would this cover your use case?

mauvilsa added a commit that referenced this issue Aug 16, 2023
@Erotemic
Copy link
Contributor Author

Yes I think it would. I know what the specific class I'm interested in is, so I'd be able to provide that a-priori.

mauvilsa added a commit that referenced this issue Aug 23, 2023
mauvilsa added a commit that referenced this issue Aug 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants