Brick inputs and outputs are tagged with an application call #68
Conversation
    The application call can be used to attach to an apply call auxiliary variables (e.g. monitors or regularizers)
    that do not form part of the main computation graph.

    The application call object is created before the call to the application method and can be accessed by
The Parameters section is missing (see the documentation of any other class).
Fixed
I like the idea of having a separate … Please pay attention to PEP8 compliance; it can be checked with flake8.
        expression.tag.role = role
        self.auxiliary_variables.append(expression)

    def add_monitor(self, expression):
Just wondering if you have given any more thought to the semantics of adding monitoring channels: the variables need a bit more information to be able to be interpreted as a monitoring channel, e.g. do they need data, and if so, what operator should be used for the reduction (i.e. do we want the mean, max, etc. over batches)? I guess we should add that information as tags, and then search the ComputationGraph later in order to convert them into MonitorChannel instances (see my monitor_and_train branch), because we don't want the Brick class to rely directly on the MonitorChannel interface.
I guess an interface like this would be nice:
def apply(self, inputs):
    ...
    return outputs

@apply.monitor_channels
def apply_monitor_channels(self, inputs, outputs):
    rvals = {}
    rvals['mean_weights'] = create_monitor_variable(
        self.W.mean(), needs_data=False)
    rvals['shannon_entropy'] = create_monitor_variable(
        -tensor.sum(outputs * tensor.log(outputs)),
        needs_data=True, reduce=np.mean)
    return rvals
But it raises the question as to how we add monitor channels that rely on targets, but maybe those shouldn't be given by bricks in the first place?
Actually we did have a lengthy discussion with Jan about that. We converged at the following: in addition to inputs and outputs, the user can construct "auxiliary variables" as results of an application call. He can do it by using the ApplicationCall object, which is passed as an application_call parameter to the application method. The user can specify how to average this variable over multiple batches by changing the .tag:

penalty = one - another
penalty.tag.averaging_method = ArithmeticMean()
Seems like it is pretty much the same as what you mean, except for the data specification. In fact the way it is done in PyLearn2 is quite weird, and Jan has some nice ideas about complete and compromise-free separation of monitoring and validation from the main loop class.
We did not agree though on how exactly the variables that are subject to monitoring should be marked: variable.tag.role == MONITOR? variable.tag.is_monitor == True? hasattr(variable.tag, "averaging_method")?
The interface you propose does not allow monitoring intermediate results of the computation, does it?
Targets must be given to bricks: it is only in feedforward networks that you can first do fprop and then compute the cost. In the case of structured (e.g. sequential) outputs it is often not possible; see the cost method of the BaseSequenceGenerator.
In general I like the idea, but I wouldn't want the user to edit tags himself. I'd prefer ApplicationCall's add_monitor method to take the averaging_method as an argument instead (that way we can do error checking). Also, do we really need a new class for ArithmeticMean? Can't we just pass numpy.mean, numpy.max as functions? Basically, the user is just expected to pass a function which takes an array as input and returns a value (see the sketch below).
Also, averaging_method seems a bit of a bad name, because if we want to take the maximum or minimum, we're not averaging anymore. I guess something like reduction_method would strictly be more correct, although it sounds a bit abstract.
Agreed on the targets.
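A minimal sketch of what that add_monitor signature could look like (the default value, the validation, and the reduction_method name are illustrative assumptions, not the actual API):

import numpy

def add_monitor(self, expression, reduction_method=numpy.mean):
    """Register a monitored expression on this application call.

    reduction_method is any callable mapping an array to a scalar
    (numpy.mean, numpy.max, ...), so it can be error-checked here
    instead of having the user edit tags by hand.
    """
    if not callable(reduction_method):
        raise ValueError("reduction_method must be callable")
    expression.tag.reduction_method = reduction_method
    self.auxiliary_variables.append(expression)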
Hi, I want to separate two concepts which are orthogonal to each other:
- a way to return auxiliary variables from a brick.apply call, through the list application_call.auxiliary_variables;
- an aggregation scheme tag, which can be used on auxiliary as well as regular outputs to specify how to aggregate them (e.g. the final cost is a regular output; the aggregation scheme tells us how to automatically and correctly compute it on a validation set).

Ad 1: application_call.auxiliary_variables holds the auxiliary things computed by the brick. These can be monitors, additional costs, or e.g. asserts, so not all of them need an aggregation (reduction) scheme. The auxiliary variables have a suitable role tag which tells their use; I am open to suggestions about which roles we want. The role tags would be used e.g. to select them.

Ad 2: I tentatively started to refer to averaging_method as aggregation. Reduction is fine too; maybe even MinibatchReduction or MinibatchAggregation? That makes it even more explicit.
The user will not attach the tags himself. Here is my idea (my take on what we discussed with Dima).
The aggregation scheme tag is more than just a numpy reduction. Consider the average bits-per-word over the whole set: it is not the same as the average over batches of the average over words. Proper computation over a full data set requires keeping track of two values, the number of bits and the number of words. I propose to:
- define an aggregation scheme class with three methods:
  - allocation of shared-variable accumulators, plus a list of Theano updates that initialize them (the initial values must not depend on the data);
  - a list of Theano updates to run for each aggregated batch of data;
  - a finalization method that computes the final value from the accumulators;
- define convenience functions, such as aggregated_fraction(numerator, denominator), that return an expression with a proper aggregation_scheme tag (a sketch follows this list).

This scheme obviates the need for flagging whether we need data or not. Monitors of parameter properties, such as column norms, would simply compute the value in the initialization update and then do nothing during the loop over the data.
The scheme is very general, and it should be easy to e.g. compute micro- and macro-averaged class losses (http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization), etc.
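A rough sketch of how such a three-method scheme and its convenience function could look (all names here are placeholders for discussion, not a settled API):

import theano

class Fraction(object):
    """Aggregate numerator / denominator over a data set with two
    shared-variable accumulators, following the three methods above."""

    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator
        self.num_acc = theano.shared(0.0)
        self.denom_acc = theano.shared(0.0)

    def initialization_updates(self):
        # Initial values must not depend on the data.
        return [(self.num_acc, 0.0), (self.denom_acc, 0.0)]

    def accumulation_updates(self):
        # Run once for every aggregated batch.
        return [(self.num_acc, self.num_acc + self.numerator),
                (self.denom_acc, self.denom_acc + self.denominator)]

    def readout(self):
        # Final value, computed from the accumulators only.
        return self.num_acc / self.denom_acc

def aggregated_fraction(numerator, denominator):
    """Return an expression that carries its own aggregation scheme."""
    expression = numerator / denominator
    expression.tag.aggregation_scheme = Fraction(numerator, denominator)
    return expression

With this, something like err_rate = aggregated_fraction(num_errs, num_examples) would be computed correctly over a whole validation set even when the batch size varies.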
Sorry for the slow responses; I'm in the Netherlands now and have been busy with a lot of social obligations after not having seen people for a year.
I'm not convinced by the need for these complicated aggregation schemes. For very complicated cases one could simply write a monitor channel consisting of a function. I don't think we should build in support for these complicated cases directly, considering how relatively uncommon they are and how much they complicate the framework.
This sounds like exactly something Dima would accuse me of, but I think we shouldn't try to make things too general, because we lose the simplicity and clarity of the framework. I think that some rarer cases, even though valid use cases, should be implemented using extensions, or in this case callbacks, instead of cluttering the entire framework.
I am aware that expecting the user to write a callable function means that the computational efficiency could be sub-optimal (because the Theano computations can't be compiled together), but maybe this is worth it for the sake of simplicity.
I feel that in many cases, you can also solve this by some clever reshaping. If you want to calculate per-word statistics, you simply collapse the first and second axes so that you can take the average over that:

output_words = output.reshape((output.shape[0] * output.shape[1],
                               output.shape[2]))[mask.flatten().nonzero()]
# Averaging over the first axis will give you per-word statistics
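For concreteness, here is the same reshaping trick in plain numpy with made-up shapes (an illustration, not code from the PR):

import numpy as np

time_steps, batch_size, dim = 5, 3, 4
output = np.random.rand(time_steps, batch_size, dim)
mask = np.ones((time_steps, batch_size))
mask[3:, 1] = 0  # the second sequence is shorter than the others

# Collapse the time and batch axes, then drop the masked-out positions.
words = output.reshape(time_steps * batch_size, dim)[mask.flatten().nonzero()]
per_word_stats = words.mean(axis=0)  # averaged per word, not per batch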
Welcome back to Europe.
This PR was only for a more structured way of extracting auxiliary information out of the graph.
Stuff that you may want extracted includes:
- monitors
- auxiliary costs
The aggregating monitors will be in an extension. I like the idea of writing err_rate = fraction(num_errs, num_examples_in_batch) and knowing that during monitoring it will report the value truthfully even when batch_size varies from batch to batch. The mean over batches of the per-batch means de-emphasizes examples in large batches (see the numeric illustration below).
Now, to be able to do this by acting on a list of minibatch results, I have to:
- either return a binary matrix indicating correct samples, concatenate it over batches, and take the mean. This can work; you are just moving more memory around;
- or return two intermediate numbers, num_errs and num_examples, then sum them in the aggregation. My idea is to hide these two numbers in the aggregator.
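A quick numeric illustration of that bias (the numbers are made up): take a batch of 10 examples with 4 errors and a batch of 90 examples with 9 errors.

per_batch_means = [4.0 / 10, 9.0 / 90]      # [0.4, 0.1]
mean_of_means = sum(per_batch_means) / 2    # 0.25
true_error_rate = (4.0 + 9.0) / (10 + 90)   # 0.13

The mean of per-batch means weights both batches equally, so the 10 examples in the small batch count as much as the 90 examples in the large one.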
I will try to put my 2 cents into this hot discussion.
First of all, this pull request does not yet specify any particular way of attaching an aggregation scheme, and I hope @bartvm is fine that we have pushed these changes.
Second, I think @janchorowski tries to address an important application: it is indeed very painful to aggregate different quantities over several batches of varying size. And even if my batch size is always the same: what if I want to monitor, say, the ratio between the numbers of opened and closed gates? To do it correctly I should use an aggregation scheme.
I do not think that it would clutter the framework too much: the implementation amounts to allowing the Monitor (or whatever it is called) to ask the TrainingAlgorithm (or whatever it is called) to do some updates at every iteration. We need this anyway to monitor the gradient norm and things like that.
I would change the interface though; it is enough to have two methods (see the sketch below):
- reset: flush all the accumulators;
- get_updates: return the set of updates to be done at every training iteration to keep the accumulators consistent.
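In code, that two-method variant might look like this (a sketch following the naming in the comment above, not an existing class):

class AggregationScheme(object):
    """Hypothetical two-method interface: the Monitor calls reset()
    before a pass over the data, and get_updates() is handed to the
    TrainingAlgorithm to run at every training iteration."""

    def reset(self):
        """Flush all accumulators back to their initial values."""
        raise NotImplementedError

    def get_updates(self):
        """Return the Theano updates that keep the accumulators
        consistent across iterations."""
        raise NotImplementedError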
I'm aware this PR didn't decide on any interface; I just figured I'd continue the discussion here. I have no issues with the principle, but mostly with the interface, which I want to make sure stays simple for the basic cases of monitoring values that don't require data, or that are simply averaged over examples (which covers 99% of use cases). For these cases, I don't think you should need to provide an entire averaging scheme (fraction(num_errs, num_examples_in_batch)). This flexibility should be reserved for the exceptional cases where we average over elements (words) in temporal sequences (sentences), or where for some reason we want to average over batches of non-fixed size.
Otherwise, no problem with this PR; I like the design choice. There are a few tiny documentation issues, but I'll fix those when I have the time (no newline after the opening """, and the summary description after the """ cannot exceed one line).
Also, I think comments should only be used to explain non-obvious programming logic, and I guess we can use them to tag # TODO. I'd prefer that we do not merge questions or personal opinions as with this PR. Those will just linger and confuse people a year from now.
On 19.12.2014 16:45, Bart van Merriënboer wrote:
In blocks/bricks/__init__.py (#68, diff):

    The application call object is created before the call to the
    application method and can be accessed by specifying an
    application_call argument.
    """

    def __init__(self, brick, application):
        self.brick = brick
        self.application = application
        self.auxiliary_variables = []

    def add_auxiliary_variable(self, expression, role):
        # The copy destroys the name. I believe adding a role tag is
        # pretty harmless, so don't copy.
        # expression = expression.copy()
        expression.tag.role = role
        self.auxiliary_variables.append(expression)

    def add_monitor(self, expression):
I agree that we should keep things simple for simple use cases. What if we agree that the default aggregation semantics (i.e. when the attribute in the tag is not given) is simple averaging over batches, regardless of their relative sizes?
However, I am very much looking forward to a prototype from Jan, since my experience shows that with a code sample the discussion runs much more smoothly.
OK, I think we should have a document "For a New Contributor", where we should at least:
- discuss code style: uppercased constants, numpy-style docstrings, the length of the summary line;
- explain to the user how automatic testing works and why they should always take care of syntax and use flake8;
- show how to generate docs locally.

Bart, would you like to start such a document? I think it should be another page in the documentation.
Agreed, it is easier to start the discussion from code, so let's wait for @janchorowski's code. Although I'd argue that the default aggregation method should be example-wise, not batch-wise, which makes the most sense for things like likelihood.
I can make a start on a developer-guidelines document.
    ----------
    brick : object
        The brick whose application is called
No need for a blank line here.
Fixed
Nice Jan, I don't mind merging this code. But we have conflicts. I guess the right way to go is to merge master into your branch and update the pull request, but I am not sure.
""" | ||
A dummy class to keep track of brick roles | ||
""" | ||
cost = "cost" |
I think we should uppercase here.
Like all caps COST = "cost", or just Cost = "cost"?
All capitals are typically used for constants, e.g. logging.DEBUG.
OK, I merged master and uppercased the role names.
Cool, thanks for the contribution!
The application call can be used, e.g., to include auxiliary variables such as monitors, extra cost terms, etc.
Also: fixes to graph parsing.