Brick inputs and outputs are tagged with an application call #68
Conversation
    The application call can be used to attach to an apply call auxiliary variables (e.g. monitors or regularizers)
    that do not form part of the main computation graph.

    The application call object is created before the call to the application method and can be accessed by
The Parameters section is missing (see the documentation of any other class).
Fixed
I like the idea of having a separate … Please pay attention to PEP8 compliance; it can be checked with flake8.
        expression.tag.role = role
        self.auxiliary_variables.append(expression)

    def add_monitor(self, expression):
Just wondering if you have given any more thought to the semantics of adding monitoring channels: the variables need a bit more information to be able to be interpreted as a monitoring channel, e.g. do they need data, and if so, what operator should be used for the reduction (i.e. do we want the mean, max, etc. over batches)? I guess we should add that information as tags, and then search the ComputationGraph later in order to convert them into MonitorChannel instances (see my monitor_and_train branch), because we don't want the Brick class to rely directly on the MonitorChannel interface.
I guess an interface like this would be nice:
def apply(self, inputs):
    ...
    return outputs

@apply.monitor_channels
def apply_monitor_channels(self, inputs, outputs):
    rvals = {}
    rvals['mean_weights'] = create_monitor_variable(
        self.W.mean(), needs_data=False)
    rvals['shannon_entropy'] = create_monitor_variable(
        -tensor.sum(outputs * tensor.log(outputs)),
        needs_data=True, reduce=np.mean)
    return rvals
But it raises the question as to how we add monitor channels that rely on targets, but maybe those shouldn't be given by bricks in the first place?
Actually we did have a lengthy discussion with Jan about that. We converged at the following: in addition to inputs and outputs, the user can construct "auxiliary variables" as results of an application call. He can do it by using the ApplicationCall object, which is passed as an application_call parameter to the application method. The user can specify how to average this variable over multiple batches by changing the .tag:

penalty = one - another
penalty.tag.averaging_method = ArithmeticMean()
Seems like it is pretty much the same as what you mean, except for the data specification. In fact the way it is done in PyLearn2 is quite weird, and Jan has some nice ideas about complete and compromise-free separation of monitoring and validation from the main loop class.
We did not agree though on how exactly the variables that are subject to monitoring should be marked: variable.tag.role == MONITOR? variable.tag.is_monitor == True? hasattr(variable.tag, "averaging_method")?
The interface you propose does not allow monitoring intermediate results of the computation, does it?
Targets must be given to bricks: it is only in feedforward networks that you can first do fprop and then compute the cost. In the case of structured (e.g. sequential) outputs it is often not possible; see the cost method of the BaseSequenceGenerator.
In general I like the idea, but I wouldn't want the user to edit tags himself. I'd prefer ApplicationCall's add_monitor method to take the averaging_method as an argument instead (that way we can do error checking). Also, do we really need a new class for ArithmeticMean? Can't we just pass numpy.mean, numpy.max as functions? Basically, the user is just expected to pass a function which takes an array as input and returns a value (see the sketch below).
Also, averaging_method seems a bit of a bad name, because if we want to take the maximum or minimum, we're not averaging anymore. I guess something like reduction_method would strictly be more correct, although it sounds a bit abstract.
Agreed on the targets.
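A minimal sketch of what that add_monitor signature could look like (the default value, the validation, and the reduction_method name are illustrative assumptions, not the actual API):

import numpy

def add_monitor(self, expression, reduction_method=numpy.mean):
    """Register a monitored expression on this application call.

    reduction_method is any callable mapping an array to a scalar
    (numpy.mean, numpy.max, ...), so it can be error-checked here
    instead of having the user edit tags by hand.
    """
    if not callable(reduction_method):
        raise ValueError("reduction_method must be callable")
    expression.tag.reduction_method = reduction_method
    self.auxiliary_variables.append(expression)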
Hi, I want to separate two concepts which are orthogonal to each other:
- a way to return auxiliary variables from a brick.apply call, through the list application_call.auxiliary_variables;
- an aggregation scheme tag, which can be used on auxiliary as well as regular outputs to specify how to aggregate them (e.g. the final cost is a regular output; the aggregation scheme tells us how to automatically and correctly compute it on a validation set).

Ad 1: application_call.auxiliary_variables holds the auxiliary things computed by the brick. These can be monitors, additional costs, or e.g. asserts, so not all of them need an aggregation (reduction) scheme. The auxiliary variables have a suitable role tag which tells their use; I am open to suggestions about which roles we want. The role tags would be used e.g. to select them.

Ad 2: I tentatively started to refer to averaging_method as aggregation. Reduction is fine too; maybe even MinibatchReduction or MinibatchAggregation? That makes it even more explicit.
The user will not attach the tags himself. Here is my idea (my take on what we discussed with Dima).
The aggregation scheme tag is more than just a numpy reduction. Consider the average bits-per-word over the whole set: it is not the same as the average over batches of the average over words. Proper computation over a full data set requires keeping track of two values, the number of bits and the number of words. I propose to:
- define an aggregation scheme class with three methods:
  - allocation of shared-variable accumulators, plus a list of Theano updates that initialize them (the initial values must not depend on the data);
  - a list of Theano updates to run for each aggregated batch of data;
  - a finalization method that computes the final value from the accumulators;
- define convenience functions, such as aggregated_fraction(numerator, denominator), that return an expression with a proper aggregation_scheme tag (a sketch follows this list).

This scheme obviates the need for flagging whether we need data or not. Monitors of parameter properties, such as column norms, would simply compute the value in the initialization update and then do nothing during the loop over the data.
The scheme is very general, and it should be easy to e.g. compute micro- and macro-averaged class losses (http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization), etc.
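A rough sketch of how such a three-method scheme and its convenience function could look (all names here are placeholders for discussion, not a settled API):

import theano

class Fraction(object):
    """Aggregate numerator / denominator over a data set with two
    shared-variable accumulators, following the three methods above."""

    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator
        self.num_acc = theano.shared(0.0)
        self.denom_acc = theano.shared(0.0)

    def initialization_updates(self):
        # Initial values must not depend on the data.
        return [(self.num_acc, 0.0), (self.denom_acc, 0.0)]

    def accumulation_updates(self):
        # Run once for every aggregated batch.
        return [(self.num_acc, self.num_acc + self.numerator),
                (self.denom_acc, self.denom_acc + self.denominator)]

    def readout(self):
        # Final value, computed from the accumulators only.
        return self.num_acc / self.denom_acc

def aggregated_fraction(numerator, denominator):
    """Return an expression that carries its own aggregation scheme."""
    expression = numerator / denominator
    expression.tag.aggregation_scheme = Fraction(numerator, denominator)
    return expression

With this, something like err_rate = aggregated_fraction(num_errs, num_examples) would be computed correctly over a whole validation set even when the batch size varies.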
Sorry for the slow responses; I'm in the Netherlands now and have been busy with a lot of social obligations after not having seen people for a year.
I'm not convinced by the need for these complicated aggregation schemes. For very complicated cases one could simply write a monitor channel consisting of a function. I don't think we should build in support for these complicated cases directly, considering how relatively uncommon they are and how much they complicate the framework.
This sounds like exactly something Dima would accuse me of, but I think we shouldn't try to make things too general, because we lose the simplicity and clarity of the framework. I think that some rarer cases, even though valid use cases, should be implemented using extensions, or in this case callbacks, instead of cluttering the entire framework.
I am aware that expecting the user to write a callable function means that the computational efficiency could be sub-optimal (because the Theano computations can't be compiled together), but maybe this is worth it for the sake of simplicity.
I feel that in many cases, you can also solve this by some clever reshaping. If you want to calculate per-word statistics, you simply collapse the first and second axes so that you can take the average over that:

output_words = output.reshape((output.shape[0] * output.shape[1],
                               output.shape[2]))[mask.flatten().nonzero()]
# Averaging over the first axis will give you per-word statistics
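For concreteness, here is the same reshaping trick in plain numpy with made-up shapes (an illustration, not code from the PR):

import numpy as np

time_steps, batch_size, dim = 5, 3, 4
output = np.random.rand(time_steps, batch_size, dim)
mask = np.ones((time_steps, batch_size))
mask[3:, 1] = 0  # the second sequence is shorter than the others

# Collapse the time and batch axes, then drop the masked-out positions.
words = output.reshape(time_steps * batch_size, dim)[mask.flatten().nonzero()]
per_word_stats = words.mean(axis=0)  # averaged per word, not per batch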
Welcome back to Europe.
This PR was only for a more structured way of extracting auxiliary information out of the graph.
Stuff that you may want extracted includes:
- monitors
- auxiliary costs
The aggregating monitors will be in an extension. I like the idea of writing err_rate = fraction(num_errs, num_examples_in_batch) and knowing that during monitoring it will report the value truthfully even when batch_size varies from batch to batch. The mean over batches of the per-batch means de-emphasizes examples in large batches (see the numeric illustration below).
Now, to be able to do this by acting on a list of minibatch results, I have to:
- either return a binary matrix indicating correct samples, concatenate it over batches, and take the mean. This can work; you are just moving more memory around;
- or return two intermediate numbers, num_errs and num_examples, then sum them in the aggregation. My idea is to hide these two numbers in the aggregator.
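A quick numeric illustration of that bias (the numbers are made up): take a batch of 10 examples with 4 errors and a batch of 90 examples with 9 errors.

per_batch_means = [4.0 / 10, 9.0 / 90]      # [0.4, 0.1]
mean_of_means = sum(per_batch_means) / 2    # 0.25
true_error_rate = (4.0 + 9.0) / (10 + 90)   # 0.13

The mean of per-batch means weights both batches equally, so the 10 examples in the small batch count as much as the 90 examples in the large one.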
I will try to put my 2 cents into this hot discussion.
First of all, this pull request does not yet specify any particular way of attaching an aggregation scheme, and I hope @bartvm is fine that we have pushed these changes.
Second, I think @janchorowski tries to address an important application: it is indeed very painful to aggregate different quantities over several batches of varying size. And even if my batch size is always the same: what if I want to monitor, say, the ratio between the numbers of opened and closed gates? To do it correctly I should use an aggregation scheme.
I do not think that it would clutter the framework too much: the implementation amounts to allowing the Monitor (or whatever it is called) to ask the TrainingAlgorithm (or whatever it is called) to do some updates at every iteration. We need this anyway to monitor the gradient norm and things like that.
I would change the interface though; it is enough to have two methods (see the sketch below):
- reset: flush all the accumulators;
- get_updates: return the set of updates to be done at every training iteration to keep the accumulators consistent.
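In code, that two-method variant might look like this (a sketch following the naming in the comment above, not an existing class):

class AggregationScheme(object):
    """Hypothetical two-method interface: the Monitor calls reset()
    before a pass over the data, and get_updates() is handed to the
    TrainingAlgorithm to run at every training iteration."""

    def reset(self):
        """Flush all accumulators back to their initial values."""
        raise NotImplementedError

    def get_updates(self):
        """Return the Theano updates that keep the accumulators
        consistent across iterations."""
        raise NotImplementedError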
I'm aware this PR didn't decide on any interface; I just figured I'd continue the discussion here. I have no issues with the principle, but mostly with the interface, which I want to make sure stays simple for the basic cases of monitoring values that don't require data, or that are simply averaged over examples (which covers 99% of use cases). For these cases, I don't think you should need to provide an entire averaging scheme (fraction(num_errs, num_examples_in_batch)). This flexibility should be reserved for the exceptional cases where we average over elements (words) in temporal sequences (sentences), or where for some reason we want to average over batches of non-fixed size.
Otherwise, no problem with this PR; I like the design choice. There are a few tiny documentation issues, but I'll fix those when I have the time (no newline after the opening """, and the summary description after the """ cannot exceed one line).
Also, I think comments should only be used to explain non-obvious programming logic, and I guess we can use them to tag # TODO. I'd prefer that we do not merge questions or personal opinions as with this PR. Those will just linger and confuse people a year from now.
On 19.12.2014 16:45, Bart van Merriënboer wrote:
In blocks/bricks/__init__.py (#68, diff):

    The application call object is created before the call to the
    application method and can be accessed by specifying an
    application_call argument.
    """

    def __init__(self, brick, application):
        self.brick = brick
        self.application = application
        self.auxiliary_variables = []

    def add_auxiliary_variable(self, expression, role):
        # The copy destroys the name. I believe adding a role tag is
        # pretty harmless, so don't copy.
        # expression = expression.copy()
        expression.tag.role = role
        self.auxiliary_variables.append(expression)

    def add_monitor(self, expression):
I agree that we should keep things simple for simple use cases. What if we agree that the default aggregation semantics (i.e. when the attribute in the tag is not given) is simple averaging over batches, regardless of their relative sizes?
However, I am very much looking forward to a prototype from Jan, since my experience shows that with a code sample the discussion runs much more smoothly.
OK, I think we should have a document "For a New Contributor", where we should at least:
- discuss code style: uppercased constants, numpy-style docstrings, the length of the summary line;
- explain to the user how automatic testing works and why they should always take care of syntax and use flake8;
- show how to generate docs locally.

Bart, would you like to start such a document? I think it should be another page in the documentation.
Agreed, it is easier to start the discussion from code, so let's wait for @janchorowski's code. Although I'd argue that the default aggregation method should be example-wise, not batch-wise, which makes the most sense for things like likelihood.
I can make a start on a developer-guidelines document.
    ----------
    brick : object
        The brick whose application is called
No need for a blank line here.
Fixed
Nice Jan, I don't mind merging this code. But we have conflicts. I guess the right way to go is to merge master into your branch and update the pull request, but I am not sure.
""" | ||
A dummy class to keep track of brick roles | ||
""" | ||
cost = "cost" |
I think we should uppercase here.
Like all caps COST = "cost", or just Cost = "cost"?
All capitals are typically used for constants, e.g. logging.DEBUG.
OK, I merged master and uppercased the role names.
Cool, thanks for the contribution!
The application call can be used, e.g., to include auxiliary variables such as monitors, extra cost terms, etc.
Also: fixes to graph parsing.