RFC: Caching field values #134

AdamHillier · 2020-04-12T14:56:07Z

Feature description and motivation

At the moment, field values on component instances behave much like instance attributes of generic Python class instances. One value exists per instance, and if it is mutable then an access after modification will return the same, modified value, e.g.:

@component
class A:
    foo: List[int] = Field(lambda: [1, 2, 3])

class B:
    def __init__(self):
        self.foo = [1, 2, 3]

a = A()
b = B()

assert a.foo == b.foo == [1, 2, 3]

a.foo.append(4)
b.foo.append(4)

assert a.foo == b.foo == [1, 2, 3, 4]

We could change this behaviour so that field values instead behave much more like @property values, i.e. the value is not 'cached' on the instance and instead re-generated on every access. See discussion here for a motivation of this different behaviour: larq/zoo#148 (comment).

Current implementation

For a full explanation of how components access field values, see the docstring of the _wrap_getattribute method in component.py:

zookeeper/zookeeper/core/component.py

Lines 116 to 144 in 2b8812a

    
               """ 
        
               The logic for this overriden `__getattribute__` is as follows: 
        
               During component instantiation, any values passed to `__init__` are stored 
        
               in a dict on the instance `__component_instantiated_field_values__`. This 
        
               means that a priori the `__dict__` of a component instance is empty (of 
        
               non-dunder attributes). 
        
               Field values can come from three sources, in descending order of priority: 
        
                 1) A value that was passed into `configure` (e.g. via the CLI), which is 
        
                    stored in the `__component_configured_field_values__` dict on the 
        
                    component instance or some parent component instance. 
        
                 2) A value that was passed in at instantiation, which is stored in the 
        
                    `__component_instantiated_field_values__` dict on the current component 
        
                    instance (but not any parent instance). 
        
                 3) A default value obtained from the `get_default` factory method of a 
        
                    field defined on the component class of the current instance if it has 
        
                    one, or otherwise from the factory of the field on the component class 
        
                    of the nearest parent component instance with a field of the same name, 
        
                    et cetera. 
        
               Once we find a field value from one of these three sources, we set the value 
        
               on the instance `__dict__` (i.e. we 'cache' it). 
        
               This means that if we find a value in the instance `__dict__` we can 
        
               immediately return it without worrying about checking the three cases above 
        
               in order. It also means that each look-up other than the first will incur no 
        
               substantial time penalty. 
        
               """

New implementation

It would be straightforward to implement @property-esqe behaviour for default values which are passed into fields, as mutable default values are already generated from lambdas, and there's no issue with immutable default values being cached .

However, it would be much more difficult to implement for values passed in through the CLI. Consider the configuration CLI argument foo=[1, 2, 3]. We receive this as a string, and parse it into a Python value (in this case a list) to be used as the value for the field foo. If we wanted to return a new instance of this list on each access of foo, we would either need to be able to deep-clone generic mutable objects, or we would have to hold on to the configuration value as a string, and re-parse it into a Python value each time.

It's an open question whether we are happy for the behaviour of default values vs cli-overriden values to be different.

The text was updated successfully, but these errors were encountered:

leonoverweel · 2020-04-14T07:39:14Z

Thanks for the detailed RFC!

It's an open question whether we are happy for the behaviour of default values vs cli-overriden values to be different.

I'd say no, because this could lead to some very hard to diagnose bugs for a user who is unaware of these implementation details.

If we wanted to return a new instance of this list on each access of foo, we would either need to be able to deep-clone generic mutable objects, or we would have to hold on to the configuration value as a string, and re-parse it into a Python value each time.

In general I'm usually in favor of having a bit of internal implementation ugliness as a way to support more intuitive/robust/harder-to-break behavior for the end user, so these options don't sound too bad to me. I'd probably go for the latter, since it sounds easiest.

(Same reasoning as favoring giving up some cleanliness/minimalism in model code to enable larq/zoo#61, for example.)

AdamHillier · 2020-04-14T09:38:35Z

In general I'm usually in favor of having a bit of internal implementation ugliness as a way to support more intuitive/robust/harder-to-break behavior for the end user, so these options don't sound too bad to me. I'd probably go for the latter, since it sounds easiest.

This is a very valid point. I guess then the question becomes which behaviour we actually want. We can also support both (we already have a keyword arg to Field - allow_missing - and we could add cache_value as well, in which case we would need to decide what the default is).

lgeiger · 2020-04-14T10:06:21Z

Thanks for the detailed explanation!

I didn't think of the CLI use case, but I also think it would be very confusing if we'd have different behaviours for default vs CLI values.

If we wanted to return a new instance of this list on each access of foo, we would either need to be able to deep-clone generic mutable objects, or we would have to hold on to the configuration value as a string, and re-parse it into a Python value each time.

Can we just wrap the value into a lambda function if it is mutable and store that function on the instance? That way the implementation logic is also quite consistent: We'd always store the argument of Field on the instance. If users pass in mutable values via the CLI, we'd implicitely wrap them into a lambda functions to be consistent with how the Python API works.

One other approach would be to keep the current behaviour (Field ≈ cached_property) and encourage users to not use Field in this case like this and do larq/zoo#149 instead.

we could add cache_value as well, in which case we would need to decide what the default is.

I thought about introducing something like a PartialField, but I think having a keyword argument is cleaner.

In general having

@Field
def input_quantizer(self):
    return lq.quantizers.SteSign(clip_value=1.25)

behave like @property is very intuitive for me, though the same argument could be made the other way around.

AdamHillier · 2020-04-14T10:12:45Z

If users pass in mutable values via the CLI, we'd implicitely wrap them into a lambda functions to be consistent with how the Python API works.

If we have a single mutable object then wrapping it inside a lambda won't solve the issue because each call to the lambda will return a reference to the same object. I think we would have to do the CLI string parsing inside the lambda.

AdamHillier · 2020-04-14T10:14:13Z

In general having
@Field
def input_quantizer(self):
    return lq.quantizers.SteSign(clip_value=1.25)
behave like @property is very intuitive for me, though the same argument could be made the other way around.

I completely agree. I think having cache_value default to false is a reasonable thing to do.

lgeiger · 2020-04-14T10:50:59Z

If we have a single mutable object then wrapping it inside a lambda won't solve the issue because each call to the lambda will return a reference to the same object.

Ah yeah makes sense, since we want to replicate this behaviour:

class A:
    @property
    def foo(self):
        return [1, 2, 3]

class B:
    foo = [1, 2, 3]

a = A()
b = B()

assert a.foo == b.foo == [1, 2, 3]

a.foo.append(4)
b.foo.append(4)

assert a.foo == [1, 2, 3]
assert b.foo == [1, 2, 3, 4]

I think we would have to do the CLI string parsing inside the lambda.

What about using copy.deepcopy in the lambda?

AdamHillier · 2020-04-14T10:56:55Z

What about using copy.deepcopy in the lambda?

Cool, I hadn't come across that before but it looks like it would work. Especially as the values that can be passed in through the CLI (i.e. parsed from a string) are a very small subset of possible Python objects, and no doubt exclude any 'difficult' ones which would be hard to clone.

AdamHillier · 2020-04-17T11:09:05Z

Okay, so the concrete proposal is that Field gains a cache_value keyword argument, defaulting to false. When set to true, Field accesses have their current behaviour. When set to false, Field accesses behave like @property in the way specified above.

Are we all happy to move forward with this?

lgeiger · 2020-04-17T11:22:06Z

Will the bahaviour of immutable values like numbers change depending on cache_value?

Are we all happy to move forward with this?

👍

AdamHillier · 2020-04-17T11:30:04Z

Will the bahaviour of immutable values like numbers change depending on cache_value?

Unless we require the immutable defaults to be passed in with lambdas we can't really not cache immutable values, but I don't think that matters?

lgeiger · 2020-04-17T11:35:42Z

Unless we require the immutable defaults to be passed in with lambdas we can't really not cache immutable values, but I don't think that matters?

I agree, just wanted to make sure that this applies only to values we pass via lambda functions.

AdamHillier · 2020-04-17T11:49:19Z

I agree, just wanted to make sure that this applies only to values we pass via lambda functions.

Makes sense, and yes to be clear this will apply to mutable defaults passed with lambda functions and we'll use the copy.deepcopy module to do the same for things like lists that are passed in through the CLI.

AdamHillier · 2020-04-27T09:01:04Z

Apologies to come back to this, but I have some further questions after partially implementing this.

It seems to me that behaviour could be quite confusing when you mix cached and not cached fields with parent/child components.

@component
class Child:
    foo: List[int] = Field(cache_value=False)

@component
class Parent:
    child: Child = ComponentField(Child)
    foo: List[int] = Field(lambda: [1, 2, 3], cache_value=True)

p = Parent()
configure(p, {})

assert p.foo == [1, 2, 3]
assert p.child.foo == [1, 2, 3]

p.foo.append(4)

assert p.foo == [1, 2, 3, 4]
assert p.child.foo == ???   # What should this be?

It seems to me that either behaviour here would be confusing. We either don't respect value inheritence, or we don't respect the fact that the child field iscache_value=False.

The solution would probably be to disallow cache_value=False if any 'parent field' has cache_value=True. But what happens if the parent is a super-class defined in another package, that can't be modified (especially as the default for a field is cache_value=True).

lgeiger · 2020-04-27T10:33:54Z

@AdamHillier What about not supporting cache_value=True at all, or a I missing something?

AdamHillier · 2020-04-27T10:39:39Z

@AdamHillier What about not supporting cache_value=True at all, or a I missing something?

I think we need to support caching values. For example, in Zoo it would be very weird (and potentially buggy) if a new optimizer was created for every access of self.optimizer.

Here is another Zoo example where having caching is important (because the default output directory includes the current time).

I think in general there will be cases where you don't want to do expensive computation every time you access something.

lgeiger · 2020-04-27T11:05:43Z

I think in general there will be cases where you don't want to do expensive computation every time you access something.

I agree, for the case of the optimizer it is easy to get around by assigning it to a temporary variable, but for other cases were you might want to access the same value from different methods it makes sense to still have this possibility.

What do you think about to adhere to the way Python handles imutable attributes, by default?

class Parent:
    foo =[1, 2, 3]

class Child(Parent):
    pass

c = Child()
p = Parent()

assert c.foo == [1, 2, 3]
assert p.foo == [1, 2, 3]

c.foo.append(4)

assert c.foo == [1, 2, 3, 4]
assert p.foo == [1, 2, 3, 4]

AdamHillier · 2020-04-27T17:07:28Z

I've gone back and forth on this several times now, API design is not easy :)

We all agree that having the ability to cache values is important. The proposal is to also have the ability to have values which aren't cached. The problem then lies in how the two interact, especially when you mix in field value inheritence and CLI-overriden values. There are a lot of corner cases, some of which we can resolve by disallowing some combinations of behaviour (e.g. disallowing cache_value=False if any 'parent field' has cache_value=True), but in general are hard to cover all of them. I am concerned that by allowing both behaviours we are introducing more points of confusion into an API that's already quite confusing.

I now feel like the best course of action is probably to keep the existing behaviour, where we always cache values, and discourage the use of Fields for anything other than the 'built-in' types of int/float/str/bool/... plus sets, dicts, and lists of those types. These are, co-incidentally, the subset of Python values which can be parsed by the CLI. I've actually always been uncomfortable with the use of Fields for more complicated values such as functions and the like, that's what @components and ComponentFields are for!

Regarding larq/zoo#148, I think we can go with @lgeiger's initial fix, in PR larq/zoo#149, and not use Fields for setting quantisers.

leonoverweel · 2020-04-28T13:09:28Z

This sounds good to me. I've merged #149.

I think maybe it'd be good to have some simple examples of how these Zookeeper concepts (Fields, @components, and ComponentFields) can/should be used in the context of an experiment/@task, perhaps alongside or replacing the more abstract child/parent examples we have in README.md now. I've still learned this mostly from pattern matching what I've seen others do across Zoo and other places. It'd be nice to have a reference of the most ~~Pythonic~~ Zookeperic ways of doing different things. Shall I make an issue for this?

AdamHillier · 2020-04-28T13:38:05Z

Yes, that's a good idea! There is an example experiment thing which I think is a good start, but we should make that more prominent and give more examples, you're right.

AdamHillier mentioned this issue Apr 12, 2020

QuickNet model and flip_ratio metric do not work together larq/zoo#148

Closed

AdamHillier self-assigned this Apr 17, 2020

leonoverweel mentioned this issue Apr 28, 2020

Docs: add walk-through examples of how to use Zookeeper concepts in experiments #137

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Caching field values #134

RFC: Caching field values #134

AdamHillier commented Apr 12, 2020

leonoverweel commented Apr 14, 2020 •

edited

Loading

AdamHillier commented Apr 14, 2020

lgeiger commented Apr 14, 2020

AdamHillier commented Apr 14, 2020

AdamHillier commented Apr 14, 2020

lgeiger commented Apr 14, 2020

AdamHillier commented Apr 14, 2020

AdamHillier commented Apr 17, 2020

lgeiger commented Apr 17, 2020

AdamHillier commented Apr 17, 2020

lgeiger commented Apr 17, 2020

AdamHillier commented Apr 17, 2020

AdamHillier commented Apr 27, 2020

lgeiger commented Apr 27, 2020

AdamHillier commented Apr 27, 2020 •

edited

Loading

lgeiger commented Apr 27, 2020

AdamHillier commented Apr 27, 2020

leonoverweel commented Apr 28, 2020 •

edited

Loading

AdamHillier commented Apr 28, 2020

RFC: Caching field values #134

RFC: Caching field values #134

Comments

AdamHillier commented Apr 12, 2020

Feature description and motivation

Current implementation

New implementation

leonoverweel commented Apr 14, 2020 • edited Loading

AdamHillier commented Apr 14, 2020

lgeiger commented Apr 14, 2020

AdamHillier commented Apr 14, 2020

AdamHillier commented Apr 14, 2020

lgeiger commented Apr 14, 2020

AdamHillier commented Apr 14, 2020

AdamHillier commented Apr 17, 2020

lgeiger commented Apr 17, 2020

AdamHillier commented Apr 17, 2020

lgeiger commented Apr 17, 2020

AdamHillier commented Apr 17, 2020

AdamHillier commented Apr 27, 2020

lgeiger commented Apr 27, 2020

AdamHillier commented Apr 27, 2020 • edited Loading

lgeiger commented Apr 27, 2020

AdamHillier commented Apr 27, 2020

leonoverweel commented Apr 28, 2020 • edited Loading

AdamHillier commented Apr 28, 2020

leonoverweel commented Apr 14, 2020 •

edited

Loading

AdamHillier commented Apr 27, 2020 •

edited

Loading

leonoverweel commented Apr 28, 2020 •

edited

Loading