Count number of floating point operations per layer #6203
Conversation
keras/engine/topology.py
Outdated
    output_area = np.prod(self.output_shape[1:-1])
    filter_area = np.prod(self.kernel_size)
    if output_area is not None:
        flops = input_filters * output_filters * filter_area * output_area
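For context, a rough sketch of what this formula counts for a hypothetical Conv2D layer; the names and numbers below are illustrative, not the PR's code:

```python
import numpy as np

# Hypothetical Conv2D: 3 input channels, 64 filters, 3x3 kernel, stride 1,
# 'same' padding on a 32x32 input, so the output feature map is 32x32x64.
input_filters = 3
output_filters = 64
filter_area = np.prod((3, 3))    # 9 taps per kernel window
output_area = np.prod((32, 32))  # 1024 output positions
flops = input_filters * output_filters * filter_area * output_area
print(flops)  # 1769472 multiplies (bias adds and activations not counted)
```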
Instead of this switch, this should be implemented as a recursive call to layer methods (with a reasonable default for the base layer, leveraging the ndim of the weights).
Actually, the base layer should either raise an error or return None, depending on whether we want every layer to provide this. Every layer in the Keras repo should provide this.
@fchollet I replaced the switch with a default based on the kernel shape. What do you think?
Any layer can override count_ops, but I think at the moment this handles most cases with no need for overriding.
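To make that approach concrete, here is a minimal standalone sketch of the kind of kernel-shape-based default being described; the function name and structure are illustrative, not the PR's actual code, and it assumes channels_last data format:

```python
import numpy as np
from keras import backend as K

def default_op_count(layer):
    """Illustrative default: kernel weight count times output positions.

    Covers Dense and Conv*D-style layers that expose a `kernel` weight;
    returns None otherwise, leaving room for per-layer overrides.
    Assumes channels_last, so the spatial dims are output_shape[1:-1].
    """
    if not hasattr(layer, 'kernel'):
        return None
    kernel_weights = int(np.prod(K.int_shape(layer.kernel)))
    # For Dense layers output_shape[1:-1] is empty and the product is 1.
    output_area = int(np.prod(layer.output_shape[1:-1]))
    return kernel_weights * output_area
```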
keras/engine/topology.py
Outdated
@@ -1229,6 +1229,45 @@ def count_params(self):
                             self.name + '.build(batch_input_shape)`.')
        return sum([K.count_params(p) for p in self.weights])

    def count_flops(self):
        """Count the total number of floating point operations for a _forward_ calculation of this layer.
Some of your lines are too long. We don't strictly enforce line length, but please do your best to follow PEP8.
@fchollet Fixed
keras/engine/topology.py
Outdated
@@ -1229,6 +1229,45 @@ def count_params(self):
                             self.name + '.build(batch_input_shape)`.')
        return sum([K.count_params(p) for p in self.weights])

    def count_flops(self):
flops is generally used for floating point operations per second. Since this is merely a flop count, maybe we need a different name. Maybe.
Maybe just count_ops ?
That sounds good. There may be confusion with the concept of op in TF though (e.g. a dot product). But maybe not a big deal.
Another possibility is count_flop (without the s, which may be confused with "per second"). I was concerned about confusion with op in TF, just as @fchollet said; that's why I didn't go for count_ops initially. Happy to rename it to whatever you prefer.
I vote count_ops since it is more consistent with count_params
I also don't think the confusion with TF ops is a big deal. As long as the docstring makes it very clear what is being counted.
One more note: I don't think op counting is a common enough use case that it should be displayed in model.summary().
@fchollet @farizrahman4u Okay, I renamed it to count_ops.
@fchollet I would argue in favor of having op counting in model.summary().
Op counts are a good predictor of speed, and they can be very unevenly distributed among layers (so it is not obvious where to optimize). People tend to optimize what they can see: optimizing param counts has been very easy with Keras, while optimizing op counts / speed has not been, until now.
Also, most papers seem to report both param counts and op counts as a measure of the "size" of the network.
If you are strongly against having op counting as the default in model.summary(), perhaps there could be an option to display it. I would really like it to be the default, though, with an option to hide it.
As we discussed before, this should be a layer method, implemented on every layer. You cannot come up with a unique implementation that will work for every layer; the current attempt would only return a result for a subset of layers, and would give the wrong answer in most cases... computing the proper answer requires knowing how each layer works.
                             'You can build it manually via: `' +
                             self.name + '.build(batch_input_shape)`.')
        ops = None
        if K.image_data_format() == 'channels_last':
So you always return None in mode channels_first? Why? That's not the expected behavior.
You're right. Just didn't have time to implement yet; will add that soon.
        ops = None
        if K.image_data_format() == 'channels_last':
            output_area = np.prod(self.output_shape[1:-1])
            output_area = int(output_area)
You are later multiplying with a float, so this cast has no effect on the output.
output_area is an int except when it is calculated as np.prod([]), which returns 1.0 as a float (this happens for Dense layers). It's somewhat surprising behavior from numpy; this cast is just a workaround for it.
It still has no effect: two lines later you are multiplying your int with a float, which returns a float. Hence casting to int has no effect on the final type.
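For reference, the numpy behavior being discussed can be reproduced directly:

```python
import numpy as np

print(np.prod([3, 3]))   # 9 (an integer)
print(np.prod([]))       # 1.0 -- the empty product comes back as float64
output_area = int(np.prod([]))
# The int cast is undone as soon as a float enters the product again:
print(type(output_area * np.prod([])))  # <class 'numpy.float64'>
```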
        if K.image_data_format() == 'channels_last':
            output_area = np.prod(self.output_shape[1:-1])
            output_area = int(output_area)
            if hasattr(self, 'kernel'):
This seems strangely ad-hoc
Additionally, this will give the wrong number for RNN layers.
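To make the RNN point concrete (an illustrative SimpleRNN, not code from the PR): each timestep applies both the input kernel (input_dim x units) and the recurrent kernel (units x units), so a rough multiply count is timesteps * (input_dim * units + units * units), which a single kernel-shape-times-output-area product does not capture.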
-        line_length = line_length or 65
-        positions = positions or [.45, .85, 1.]
+        line_length = line_length or 80
+        positions = positions or [.40, .72, .85, 1.]
Screen real estate is scarce, and op counting is too niche to be part of the model summary (as we previously discussed).
@fchollet Most notable deep learning papers I can think of report op counts. Why do you say it's niche?
The papers that report op count are those that are trying to make a point about op count efficiency, which is fairly niche. Most DL papers barely report their model architectures...
            output_area = np.prod(self.output_shape[1:-1])
            output_area = int(output_area)
            if hasattr(self, 'kernel'):
                ops = np.prod(self.kernel._keras_shape) * output_area
You should not rely on the private property _keras_shape, which may not be set.
I see, this should use K.int_shape(self.kernel) instead?
Yes, which will be defined at least for the built-in layers
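That is, roughly the following (an illustrative snippet, not the PR's code):

```python
import numpy as np
from keras import backend as K
from keras.layers import Dense, Input

# K.int_shape returns the static shape of a weight tensor as a plain tuple,
# so there is no need to read the private _keras_shape attribute.
inp = Input(shape=(10,))
layer = Dense(4)
layer(inp)  # builds the layer, creating layer.kernel
print(K.int_shape(layer.kernel))                 # (10, 4)
print(int(np.prod(K.int_shape(layer.kernel))))   # 40 weights in the kernel
```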
                             self.name + '.build(batch_input_shape)`.')
        ops = None
        if K.image_data_format() == 'channels_last':
            output_area = np.prod(self.output_shape[1:-1])
This formula is wrong for most convolution layer configs because it doesn't take into account padding and strides
@fchollet I've checked and it seems correct to me. The output area is what determines the number of positions at which the convolution has to be calculated; stride and padding affect that, but only by changing the output area. Could you show a counter-example? (for a convolutional layer)
The formula is wrong because np.prod(self.output_shape[1:-1]) is not the number of positions where the kernel is applied, it is the maximum number of positions.
@fchollet Does the output shape include padding? That may be the problem you're referring to. I tried to think of any other way the output shape would not be equal to the number of positions at which the kernel is applied, but as far as I can tell input stride and input padding are already taken into account by this. A specific counter-example would be greatly appreciated.
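One way to sanity-check this claim with a hypothetical config (stride 2, 'same' padding, channels_last); this is an illustrative snippet, not part of the PR:

```python
import numpy as np
from keras.layers import Conv2D, Input

# Stride and padding change the output shape, and the output shape in turn
# determines how many positions the kernel is applied at: here 16 * 16 = 256.
inp = Input(shape=(32, 32, 3))
conv = Conv2D(8, (3, 3), strides=(2, 2), padding='same')
conv(inp)
print(conv.output_shape)                      # (None, 16, 16, 8)
print(int(np.prod(conv.output_shape[1:-1])))  # 256 kernel positions
```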
Writing a new method for every layer would be a lot of work, too. And it would still be fairly brittle, because it would still fail for some layers. Overall, computing the right answer at the network level requires a lot of knowledge about the network, and is not feasible in the general case. We may be better off leaving this outside of Keras. Thoughts?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
How to deal with
So, is it provided now??
This PR partially implements the feature requested in #5625.
Opening it here for feedback on the implementation strategy.
Supported layers: Conv1D, Conv2D, Conv3D, Dense. Other layers return None, and don't count towards the total.
Behavior with different padding and strides seems okay.
Questions: