Perform LMUFFT with raw convolution #42

Merged: 5 commits merged into master on Aug 16, 2021

Conversation

@hunse (Contributor) commented Jun 15, 2021

Add the ability to run the impulse response convolution as a raw convolution, rather than using the FFT. In practice, I've found that this can speed things up, though it also appears to require more CPU memory (which is surprising).

I also added a profiling test.

Based on #40.
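
For context, here's a minimal sketch of the two approaches to the impulse-response convolution. The shapes and helper name are hypothetical and purely illustrative, not the layer's actual implementation:

    import tensorflow as tf

    def impulse_conv(x, impulse, mode="fft"):
        # x: (batch, timesteps, 1) input signal; impulse: (timesteps, order) kernel.
        seq_len = impulse.shape[0]
        if mode == "fft":
            # Zero-pad to 2 * seq_len so circular convolution equals linear convolution,
            # multiply in the frequency domain, then invert and truncate.
            x_f = tf.signal.rfft(tf.transpose(x, [0, 2, 1]), fft_length=[2 * seq_len])
            h_f = tf.signal.rfft(tf.transpose(impulse), fft_length=[2 * seq_len])
            out = tf.signal.irfft(x_f * h_f[None], fft_length=[2 * seq_len])
            return tf.transpose(out, [0, 2, 1])[:, :seq_len]
        # "raw": causal 1D convolution. conv1d computes cross-correlation, so the
        # impulse response is reversed in time and the input is left-padded.
        kernel = impulse[::-1, None, :]  # (seq_len, in_channels=1, order)
        x_pad = tf.pad(x, [[0, 0], [seq_len - 1, 0], [0, 0]])
        return tf.nn.conv1d(x_pad, kernel, stride=1, padding="VALID")

The FFT path costs O(n log n) regardless of the impulse length, while the raw path scales with the kernel width, which is why truncating the impulse response (the last TODO item below) is where the raw approach could really shine.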

TODO:

  • Figure out why this uses more CPU memory in some of our larger models (I have an example). Update: I looked into this, but was unable to reproduce the problem. I was trying on different hardware, though (the GPU I originally ran on was unavailable), so it's possible that's part of it. That said, I don't see any reason that the raw conv approach should use more memory (unless perhaps there's poorly-implemented explicit padding), so I'm not too worried about this.
  • Have the profiling test assert something, or make it an optional test/benchmark
  • Look into using shorter impulse responses when we have shorter theta. This is where this approach could really shine.

@hunse (Contributor, Author) commented Jul 6, 2021

When testing this on my RTX 3060 with my specific model, I'm finding that the FFT implementation is faster than the raw conv implementation, so which one is best does seem to depend on the specific hardware/CUDA/TensorFlow combination. I'm hoping to test across more hardware soon, but I think for the foreseeable future we're looking at keeping both implementations around. The best option would be to autotune it, but that's probably a good chunk more work.

@hunse (Contributor, Author) commented Jul 14, 2021

I think this is ready to go. In the end, I had to add two ways of doing the raw convolution, since the one that's faster on GPUs (using NCHW format) doesn't work on CPU.
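
For illustration, a minimal sketch of the layout issue, with hypothetical shapes (not the layer's actual code): TensorFlow's NCHW ("channels_first") convolution kernels are generally only implemented on GPU, so a CPU path needs the NHWC layout.

    import tensorflow as tf

    batch, timesteps, order = 32, 1000, 4
    x_nhwc = tf.random.normal([batch, 1, timesteps, 1])   # (batch, H=1, W=time, C=1)
    kernel = tf.random.normal([1, timesteps, 1, order])   # (kH=1, kW=time, in=1, out=order)

    if tf.config.list_physical_devices("GPU"):
        # NCHW is typically only supported by the GPU convolution kernels.
        x_nchw = tf.transpose(x_nhwc, [0, 3, 1, 2])       # (batch, C, H, W)
        out = tf.nn.conv2d(x_nchw, kernel, strides=1, padding="VALID", data_format="NCHW")
    else:
        # On CPU, NCHW conv2d raises an error, so fall back to NHWC.
        out = tf.nn.conv2d(x_nhwc, kernel, strides=1, padding="VALID", data_format="NHWC")

This is roughly the distinction behind the "raw" and "raw_nchw" options.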

@hunse changed the title from "WIP: Perform LMUFFT with raw convolution" to "Perform LMUFFT with raw convolution" on Jul 14, 2021
@arvoelke (Contributor) left a comment:

Should add a changelog entry too.

(Several resolved review threads on keras_lmu/layers.py, keras_lmu/tests/test_benchmarks.py, and keras_lmu/tests/test_layers.py.)
@arvoelke (Contributor) left a comment:

LGTM. I left the one comment about the benchmarks open for when the next reviewer takes a look. It'll also need a changelog entry and the codecov failure resolved, but it looks good to me.

@@ -523,6 +523,16 @@ class LMUFFT(tf.keras.layers.Layer):
return_sequences : bool, optional
If True, return the full output sequence. Otherwise, return just the last
output in the output sequence.
conv_mode : "fft" or "raw" or "raw_nchw"
Member:

What kind of performance difference were you seeing between raw and raw_nchw? On my machine I don't see any. TensorFlow should automatically be swapping things around to use NCHW on the GPU (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/rewriter_config.proto#L60).
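
For anyone wanting to check whether that automatic rewrite is in play, the grappler layout optimizer can be toggled through the TF 2.x options (a sketch for comparison purposes, not something this PR uses):

    import tensorflow as tf

    # Disable grappler's automatic NHWC-to-NCHW layout rewriting so the two
    # layouts can be compared directly.
    tf.config.optimizer.set_experimental_options({"layout_optimizer": False})
    print(tf.config.optimizer.get_experimental_options())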

Member:

I was able to see a performance difference by making the benchmark a bit more compute intensive. But I think the difference was more to do with the other reshapes/transposes going on (which are different in the two conditions), rather than the actual convolution. I pushed a fixup that simplifies the raw implementation a bit, and on my machine raw is now faster than raw_nchw. It'd be good to check what results you see on your machine though.

Contributor Author:

The changes definitely make "raw" faster for me. I still see "raw_nchw" as slightly faster, but it's pretty minor. Interestingly, I'm now seeing the base "rnn" implementation as faster than both the "raw" implementations (and "fft" is even faster). Note: I had to drop the batch size to 16 since my GPU wasn't large enough to run with 32.

In light of these changes, I decided to re-test having the signal along the "H" dimension, and it's way faster for me. Double-check that it's the same for you, but since it removes the transposes, it makes sense that it's faster.
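
For reference, a minimal sketch of the two placements with hypothetical shapes (illustrative only, not the layer's code): time along "W" shapes the input as (batch, 1, time, 1), while time along "H" uses (batch, time, 1, 1).

    import tensorflow as tf

    batch, timesteps, order = 16, 1000, 4
    signal = tf.random.normal([batch, timesteps])
    impulse = tf.random.normal([timesteps, order])

    # Time along the "W" axis: input (batch, H=1, W=time, C=1),
    # kernel (kH=1, kW=time, in=1, out=order).
    x_w = signal[:, None, :, None]
    k_w = impulse[None, :, None, :]
    out_w = tf.nn.conv2d(x_w, k_w, strides=1, padding="VALID")

    # Time along the "H" axis: input (batch, H=time, W=1, C=1),
    # kernel (kH=time, kW=1, in=1, out=order).
    x_h = signal[:, :, None, None]
    k_h = impulse[:, None, None, :]
    out_h = tf.nn.conv2d(x_h, k_h, strides=1, padding="VALID")

Both compute the same sum over time; the difference is purely in which convolution layout (and hence which GPU kernels and surrounding transposes) gets used.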

Member:

The H convolution is still dramatically slower for me (>10x).

Contributor Author:

Wow, that's super weird that it's such a big difference. I guess it's either to do with the GPU, or with the CUDA/cuDNN/TensorFlow versions. That's annoying...

@drasmuss (Member) commented Aug 12, 2021:

Fixed the slowdown I was seeing by upgrading CUDA/cuDNN; I now see H as faster, same as you, so that's good news.

I'm kind of inclined to just get rid of the raw_nchw option. Dropping it would simplify the code and the user interface (and remove any potential GPU/CPU compatibility issues). I also increased the number of steps in the benchmark locally to reduce noise, and I now very consistently see raw being as fast as or faster than raw_nchw, so I don't think there's much value in supporting that extra option.

Contributor Author:

Yeah, I've seen maybe one or two cases where raw_nchw is marginally faster, but the difference is small, and plain "raw" is often as fast or faster for me as well. I'd be fine with removing it.

Contributor Author:

Out of curiosity, about what percentage speedup did you see between W and H? One of my GPUs showed about a 1.5x speedup (old time / new time), but the other showed no speedup.

Member:

It was relatively minor but consistent, on the order of a 10% speedup.

(Resolved review thread on keras_lmu/layers.py.)
steptimes = SteptimeLogger()
model.fit(x_train, y_train, epochs=2, batch_size=batch_size, callbacks=[steptimes])

step_time = np.mean(steptimes.batch_times[1:])
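
For context, SteptimeLogger isn't shown in this excerpt; here's a minimal sketch of what such a per-batch timing callback might look like (the actual implementation in the test file may differ):

    import timeit
    import tensorflow as tf

    class SteptimeLogger(tf.keras.callbacks.Callback):
        """Record the wall-clock duration of each training batch (sketch)."""

        def __init__(self):
            super().__init__()
            self.batch_times = []
            self._start = None

        def on_train_batch_begin(self, batch, logs=None):
            self._start = timeit.default_timer()

        def on_train_batch_end(self, batch, logs=None):
            self.batch_times.append(timeit.default_timer() - self._start)

The [1:] in the mean above drops the first batch, which includes one-time tracing/compilation overhead.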

Member:

Typically you use min for timing benchmarks (since all the error is positive).

Contributor Author:

Not all the batches do exactly the same thing; for example, the first batch takes much longer because of one-time overhead, which is why I drop it. I'm not sure whether the other batches have much overhead, but we'd lose that information by taking the min. Maybe that's not a problem, or maybe it's even a good thing, since it would better capture the time of the underlying computation, which is what we really care about (assuming there are no differences in overhead between the implementations).

My only other question is whether each batch is fully synchronous (i.e. whether the whole computation has to finish before on_batch_end is called). I think this should be the case, but it's the only other way I can think of that min could be a problem.

Anyway, I'm fine with making the change, and it should become apparent pretty quickly if there are any problems (i.e. if the qualitative results from min don't agree with those from mean).

(Resolved review threads on keras_lmu/tests/test_benchmarks.py and keras_lmu/layers.py.)
def test_performance(mode, capsys):
@pytest.mark.parametrize(
"mode, min_time, max_time",
[("rnn", 0.3, 0.4), ("fft", 0.02, 0.03), ("raw", 0.2, 0.3)],
Contributor Author:

Maybe just note somewhere what kind of GPU these come from (is it a K80 that we use on Azure?)

@hunse (Contributor, Author) commented Aug 13, 2021

Fixups look good to me. When you've got all the tests passing, feel free to merge.

hunse and others added 5 commits on August 13, 2021 at 19:52.

Commit message excerpt: For some reason, the latest release of matplotlib causes a version of numpy to be installed during the build process that conflicts with the version installed at run time.
@tbekolay (Member) commented:

New commits lgtm 👍

@drasmuss merged commit ac00184 into master on Aug 16, 2021.
@drasmuss deleted the lmu-fft-improvements branch on Aug 16, 2021 at 14:45.