Count number of floating point operations per layer #6203

Closed
wants to merge 6 commits into from

Conversation

@aizvorski aizvorski commented Apr 10, 2017

This PR partially implements the feature requested in #5625.

Opening it here for feedback on the implementation strategy.

Supported layers: Conv1D, Conv2D, Conv3D, Dense. Other layers return None, and don't count towards the total.

Behavior with different padding and strides seems okay.
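For reference, the per-layer estimate reduces to something like the following standalone sketch (a simplification of the PR code, assuming channels_last data and counting one multiply-add as one op; the function name is illustrative, not from the diff):

import numpy as np

def conv_dense_op_count(output_shape, kernel_shape):
    # output_shape: layer output shape without the batch dimension,
    #   e.g. (H, W, C_out) for Conv2D or (units,) for Dense.
    # kernel_shape: kernel weight shape, e.g. (kh, kw, C_in, C_out) for Conv2D
    #   or (units_in, units_out) for Dense.
    output_area = int(np.prod(output_shape[:-1]))  # positions where the kernel is applied
    return int(np.prod(kernel_shape)) * output_area

# Conv2D, 3x3 kernel, 64 -> 128 filters, 56x56 output:
# 3*3*64*128 multiply-adds per position, 56*56 positions.
print(conv_dense_op_count((56, 56, 128), (3, 3, 64, 128)))  # 231211008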

Questions:

  • Should this go in the Layer class (as here), or in each individual layer implementation?
  • Should this rely on some knowledge about the layers (as here), or on backend features for counting flops? If backend, how?
  • Should a multiply-add be counted as one flop (as here) or two? Common usage seems to be one flop, e.g. the ResNet paper.
  • Should layers like pooling or activations even be counted at all? (How many flops is an ELU?) It seems that these are not usually counted; in any case, a few convolutions typically account for the bulk of the computation.

output_area = np.prod(self.output_shape[1:-1])
filter_area = np.prod(self.kernel_size)
if output_area is not None:
    flops = input_filters * output_filters * filter_area * output_area
Member

Instead of this switch, this should be implemented as a recursive call to layer methods (with a reasonable default for the base layer, leveraging the ndim of the weights).

Contributor

The base layer should actually raise an error, or return None, depending on whether we want every layer to provide this service. Every layer in the Keras repo should provide it.

Author

@aizvorski aizvorski Apr 23, 2017

@fchollet I replaced the switch with a default based on the kernel shape. What do you think?

Any layer can override count_ops, but I think at the moment this handles most cases with no need for overriding.
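Roughly, the shape that takes is a kernel-based default on the base layer that subclasses may override (an illustrative sketch, not the exact diff; the class name here is mine):

import numpy as np
from keras import backend as K

class OpCountDefault(object):
    # Stands in for the base Layer method proposed in this PR.
    def count_ops(self):
        # Default: one multiply-add per kernel weight per output position.
        # Layers without a kernel (pooling, activations, ...) return None
        # and are skipped when totalling.
        if not hasattr(self, 'kernel'):
            return None
        kernel_shape = K.int_shape(self.kernel)
        output_area = int(np.prod(self.output_shape[1:-1]))
        return int(np.prod(kernel_shape)) * output_area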

@@ -1229,6 +1229,45 @@ def count_params(self):
self.name + '.build(batch_input_shape)`.')
return sum([K.count_params(p) for p in self.weights])

def count_flops(self):
"""Count the total number of floating point operations for a _forward_ calculation of this layer.
Member

Some of your lines are too long. We don't strictly enforce line length, but please do your best to follow PEP8.

Author

@fchollet Fixed

@@ -1229,6 +1229,45 @@ def count_params(self):
self.name + '.build(batch_input_shape)`.')
return sum([K.count_params(p) for p in self.weights])

def count_flops(self):
Member

flops is generally used for floating point operations per second. Since this is merely a flop count, maybe we need a different name. Maybe.

Contributor

Maybe just count_ops ?

Member

That sounds good. There may be confusion with the concept of op in TF though (e.g. a dot product). But maybe not a big deal.

Author

@aizvorski aizvorski Apr 10, 2017

Another possibility is count_flop (without the s, which may be confused for "per second"). I was concerned about confusion with op in TF, just as @fchollet said; that's why I didn't go for count_ops initially. Happy to rename it to whatever you guys prefer.

Contributor

I vote count_ops since it is more consistent with count_params

Member

I also don't think the confusion with TF ops is a big deal. As long as the docstring makes it very clear what is being counted.

One more note: I don't think op counting is a common enough use case that it should be displayed in model.summary().

Author

@fchollet @farizrahman4u Okay, I renamed it to count_ops.

Author

@fchollet I would argue in favor of having op counting in model.summary().

Op counts are a good predictor of speed, and can be very unevenly distributed among layers (so it is not obvious where to optimize). People tend to optimize what they can see: optimizing param counts has been very easy with Keras, but op counts and speed not so much, until now.

Also, most papers seem to report both param counts and op counts as a measure of the "size" of the network.

If you are strongly against having op counting as a default in model.summary(), perhaps there could be an option to display it. I would really like it to be the default though, with an option to hide it.
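For what it's worth, even without changing model.summary(), the same numbers could be printed by a short external helper, assuming the per-layer count_ops method from this PR:

def print_op_counts(model):
    # Prints per-layer op counts next to the layer names; relies on the
    # count_ops method proposed here (layers returning None are skipped).
    total = 0
    for layer in model.layers:
        ops = layer.count_ops()
        if ops is not None:
            total += ops
        print('{:<30} {}'.format(layer.name, ops))
    print('Total ops: {}'.format(total))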

Member

@fchollet fchollet left a comment

As we discussed before, this should be a layer method, implemented on every layer. You cannot come up with a unique implementation that will work for every layer; the current attempt would only return a result for a subset of layers, and would give the wrong answer in most cases... computing the proper answer requires knowing how each layer works.

'You can build it manually via: `' +
self.name + '.build(batch_input_shape)`.')
ops = None
if K.image_data_format() == 'channels_last':
Member

So you always return None in mode channels_first? Why? That's not the expected behavior.

Author

@aizvorski aizvorski Apr 25, 2017

You're right. Just didn't have time to implement yet; will add that soon.
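With channels_first the spatial dims simply sit after the channel axis, so the missing branch is presumably along these lines (a sketch, not the eventual commit):

import numpy as np

def output_area_for(output_shape, data_format):
    # output_shape includes the batch dim, as layer.output_shape does.
    if data_format == 'channels_last':   # (batch, *spatial, channels)
        spatial = output_shape[1:-1]
    else:                                # 'channels_first': (batch, channels, *spatial)
        spatial = output_shape[2:]
    return int(np.prod(spatial))

print(output_area_for((None, 64, 56, 56), 'channels_first'))  # 3136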

ops = None
if K.image_data_format() == 'channels_last':
output_area = np.prod(self.output_shape[1:-1])
output_area = int(output_area)
Member

You are later multiplying it with a float, so this cast has no effect on the output type.

Author

@aizvorski aizvorski Apr 25, 2017

output_area is an int except when it is calculated as np.prod([]), which returns 1.0 as a float (for Dense layers). Somewhat surprising behavior from numpy; this is just a workaround for it.

Member

It does; two lines later you are multiplying your int with a float, which returns a float. Hence casting to int has no effect on the final type.
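For anyone following along, the numpy behaviour both comments are about, in a quick interpreter session:

>>> import numpy as np
>>> np.prod([])             # the empty product comes back as 1.0, a float
1.0
>>> int(np.prod([]))        # the cast being discussed
1
>>> int(np.prod([])) * 1.5  # int * float is a float again
1.5

So if anything later in the expression is a float, the cast would have to be applied to the final product instead.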

if K.image_data_format() == 'channels_last':
output_area = np.prod(self.output_shape[1:-1])
output_area = int(output_area)
if hasattr(self, 'kernel'):
Member

This seems strangely ad-hoc

Member

Additionally this will give the wrong number for RNN layers

line_length = line_length or 65
positions = positions or [.45, .85, 1.]
line_length = line_length or 80
positions = positions or [.40, .72, .85, 1.]
Member

Screen real estate is scarce, and op counting is too niche to be part of the model summary (as we previously discussed).

Author

@fchollet Most notable deep learning papers I can think of report op counts. Why do you say it's niche?

Member

The papers that report op count are those that are trying to make a point about op count efficiency, which is fairly niche. Most DL papers barely report their model architectures...

output_area = np.prod(self.output_shape[1:-1])
output_area = int(output_area)
if hasattr(self, 'kernel'):
ops = np.prod(self.kernel._keras_shape) * output_area
Member

You should not rely on the private property _keras_shape, which may not be set.

Author

@aizvorski aizvorski Apr 25, 2017

I see, this should use K.int_shape(self.kernel) instead?

Member

Yes, which will be defined at least for the built-in layers
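i.e. something like the following, using only the public API (a small illustration with a standalone Conv2D, assuming channels_last defaults; shapes are example values):

import numpy as np
from keras import backend as K
from keras.layers import Conv2D

layer = Conv2D(128, (3, 3))
layer.build((None, 56, 56, 64))            # channels_last input with 64 channels
kernel_shape = K.int_shape(layer.kernel)   # (3, 3, 64, 128), via the public helper
ops_per_position = int(np.prod(kernel_shape))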

self.name + '.build(batch_input_shape)`.')
ops = None
if K.image_data_format() == 'channels_last':
output_area = np.prod(self.output_shape[1:-1])
Member

This formula is wrong for most convolution layer configs because it doesn't take into account padding and strides

Author

@aizvorski aizvorski Apr 25, 2017

@fchollet I've checked and it seems correct to me. The output area is what determines the number of positions at which the convolution has to be calculated; stride and padding affect that, but only by changing the output area. Could you show a counter-example? (for a convolutional layer)

Member

The formula is wrong because np.prod(self.output_shape[1:-1]) is not the number of positions where the kernel is applied, it is the maximum number of positions.

Author

@fchollet Does the output shape include padding? That may be the problem you're referring to. I tried to think of any other way the output shape would not be equal to the number of positions at which the kernel is applied, but as far as I can tell input stride and input padding are already taken into account by it. A specific counter-example would be greatly appreciated.
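For concreteness, the rule that determines the per-dimension output length in Keras folds stride and padding in as follows; this is the quantity both comments are reasoning about:

def conv_output_length(input_length, kernel_size, padding, stride):
    # Mirrors keras.utils.conv_utils.conv_output_length for the
    # 'same' / 'valid' cases (dilation omitted for brevity).
    if padding == 'same':
        length = input_length
    else:  # 'valid'
        length = input_length - kernel_size + 1
    return (length + stride - 1) // stride

# 224-wide input, 3x3 kernel, stride 2, 'valid' padding -> 111 positions per row.
print(conv_output_length(224, 3, 'valid', 2))  # 111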

@fchollet
Member

Writing a new method for every layer would be a lot of work, too. And it would still be fairly brittle because it would still fail for some layers, in particular Lambda layers.

Overall, computing the right answer at the network level requires a lot of knowledge about the network, and is not feasible in the general case. We may be better off leaving this outside of Keras. Thoughts?

@stale stale bot added the stale label Sep 5, 2017

stale bot commented Sep 5, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot closed this Oct 5, 2017

cwzat commented Apr 28, 2018

How do you deal with input_shape = (None, None, None, 3)? Right now it cannot handle the None-typed dimensions. Thank you!
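With a fully-convolutional input shape like (None, None, None, 3) the spatial output dims are also None, so any such counter would presumably have to bail out rather than multiply through None. An illustrative guard (not part of this PR):

import numpy as np

def safe_output_area(output_shape):
    # output_shape includes the batch dim, e.g. (None, None, None, 128).
    spatial = output_shape[1:-1]
    if any(dim is None for dim in spatial):
        return None   # op count is undefined until concrete spatial dims are known
    return int(np.prod(spatial))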

ShuteLee commented May 1, 2019

So, is it provided now??
