[activations] pytorch-1.11+ Tanh Gelu Approximation #15397
Hi @stas00, may I take a look into this? Below is a list of to-do's I can think of, but of course, there could be more.
What do you think?
By all means, @jaketae - thank you! The most important thing is numerical backward compatibility. Since
I definitely agree numerical BC is key here. I think we can have extensive tests using (1) random tensors as input, and (2) full model forward and backward passes. I assume we'll also have to check across devices and dtypes. Should this issue be tabled until PyTorch 1.11 is released? IIRC the current stable is 1.10.2. Alternatively, I could use the nightly build to get started.
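A minimal sketch of what such a parametrized check could look like — the test name, shapes, and the inline `gelu_new` reference are placeholders, and `F.gelu(..., approximate="tanh")` assumes a PyTorch build that ships the new kwarg:

```python
import math

import pytest
import torch
import torch.nn.functional as F


def gelu_new(x):
    # HF-style tanh approximation, written out so the sketch is self-contained
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))


@pytest.mark.parametrize("dtype", [torch.float32, torch.float64])
@pytest.mark.parametrize("device", ["cpu"] + (["cuda"] if torch.cuda.is_available() else []))
def test_tanh_gelu_matches_reference(dtype, device):
    x = torch.randn(16, 128, dtype=dtype, device=device, requires_grad=True)

    out_native = F.gelu(x, approximate="tanh")
    out_ref = gelu_new(x)
    torch.testing.assert_close(out_native, out_ref)

    # compare gradients from a simple backward pass as well
    grad_native, = torch.autograd.grad(out_native.sum(), x)
    grad_ref, = torch.autograd.grad(out_ref.sum(), x)
    torch.testing.assert_close(grad_native, grad_ref)
```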
It's your call: you can wait till pt-1.11 is released - probably in a month or so - or you can start with the nightly, get everything ready, and merge it once it's released.
PyTorch 1.11 is out! I could maybe get started on this if you haven't already, @stas00?
Go for it, Jaesung!
Upon more investigation, I realized 1.11 didn't ship with the hyperbolic tangent GELU. I'll use the nightly instead.
Apologies for the delay @stas00. Here is a quick update of where I am at the moment.

1. Understanding PyTorch's GELU Approximation

The implementation can be found here:

```cpp
if (approximate == GeluType::Tanh) {
  AT_DISPATCH_FLOATING_TYPES_AND(
      ScalarType::BFloat16, it.dtype(), "GeluKernelImpl", [&]() {
        using Vec = vec::Vectorized<scalar_t>;
        const Vec kBetaVec(scalar_t(M_SQRT2 * M_2_SQRTPI * 0.5));
        const Vec kKappaVec(scalar_t(0.044715));
        const Vec kOneVec(scalar_t(1));
        const Vec kPointFiveVec(scalar_t(0.5));
        cpu_kernel_vec(
            it,
            [](scalar_t x) {
              const scalar_t kBeta = M_SQRT2 * M_2_SQRTPI * 0.5;
              const scalar_t kKappa = 0.044715;
              auto x_cube = x * x * x;
              auto inner = kBeta * (x + kKappa * x_cube);
              return scalar_t(0.5) * x * (scalar_t(1) + std::tanh(inner));
            },
```

As noted in the docs, this boils down to `0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))`.

HF transformers has a number of GELU implementations, but the one which corresponds to this specific variant appears to be `gelu_new`.
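For comparison, that HF variant is the same formula written out in Python; a minimal sketch mirroring what lives in `transformers.activations` (not copied verbatim from it):

```python
import math

import torch


def gelu_new(x: torch.Tensor) -> torch.Tensor:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
```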
Hence, I investigated whether the output of `gelu_new` matches that of the new PyTorch kernel.

2. Preliminary Experiments

A simple first-level check might be to generate a random tensor and compare the output of the two functions. Here is the experiment, with a link to the Colab notebook.

```python
import torch
import torch.nn.functional as F
from numpy import average  # the notebook's average() helper is assumed to be numpy's

# gelu_new is HF's tanh-approximation implementation (see the sketch above);
# F.gelu(..., approximate="tanh") needs a build that ships the new kwarg (nightly at the time).
NUM_TRIALS = 1000
cpu_equal = []
cpu_allclose = []
for _ in range(NUM_TRIALS):
    x_cpu = torch.randn(3, 3)
    torch_cpu = F.gelu(x_cpu, approximate="tanh")
    hf_cpu = gelu_new(x_cpu)
    cpu_equal.append(torch.equal(torch_cpu, hf_cpu))
    cpu_allclose.append(torch.allclose(torch_cpu, hf_cpu, rtol=1e-6))
print(average(cpu_equal))
print(average(cpu_allclose))
```

The same experiment was conducted on GPU by replacing the CPU tensor with a CUDA one, again given an rtol of 1e-6.
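The GPU variant only changes where the tensors live; a sketch continuing from the snippet above, assuming a CUDA device is available:

```python
gpu_equal, gpu_allclose = [], []
for _ in range(NUM_TRIALS):
    x_gpu = torch.randn(3, 3, device="cuda")
    torch_gpu = F.gelu(x_gpu, approximate="tanh")
    hf_gpu = gelu_new(x_gpu)
    gpu_equal.append(torch.equal(torch_gpu, hf_gpu))
    gpu_allclose.append(torch.allclose(torch_gpu, hf_gpu, rtol=1e-6))
print(average(gpu_equal))
print(average(gpu_allclose))
```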
Computations seem to be more robust on the GPU.

3. Next Steps

Here is a non-exhaustive list of tasks.

Generally, my intuition is that replacing the current Python implementation with the native kernel will not be numerically identical. Unless there is a performance overhead we are concerned with, I do not see a compelling reason to make what could be a dangerous transition.
That's a great report, Jaesung! Thank you! Well, it can't be equal since it's an approximation, so it's really about deciding on the acceptable tolerance, and of course you want to experiment with much larger tensors than 3x3 to come up with conclusions. If I try with more realistic sizes I get 0 matches with close or equal:
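A sketch of that kind of check at a more transformer-like shape, reusing the imports and the `gelu_new` reference from the snippets above (the exact sizes are illustrative):

```python
x = torch.randn(8, 512, 1024)  # batch x sequence x hidden; illustrative sizes
torch_out = F.gelu(x, approximate="tanh")
hf_out = gelu_new(x)
print(torch.equal(torch_out, hf_out))                # exact match
print(torch.allclose(torch_out, hf_out, rtol=1e-6))  # match within rtol
print((torch_out - hf_out).abs().max())              # worst-case absolute difference
```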
@vadimkantorov, please have a look - the results are very different between the nightly fast version and the slow Python approximation function.
I guess we need to tag @rdspring1 who authored pytorch/pytorch#61439...
Hey @rdspring1, could you kindly look into this? In particular, we've observed that the number of close or equal matches between the nightly fast version and the slow Python approximation drops to zero at realistic tensor sizes.

In the meantime, @stas00, do you think there's something actionable on our end? Perhaps I could load a model and replace HF GELUs with the PyTorch approximation to see if model outputs differ (they most surely will). What I'm not sure about is how to quantify/analyze this difference, as it's difficult to say with certainty that an X amount of divergence is acceptable.
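A sketch of what that model-level comparison could look like; the model name and the attribute-swapping approach are illustrative (GPT-2's MLP blocks happen to store their activation under `.act`, other architectures differ), and it assumes a torch build with the `approximate` kwarg:

```python
import torch
from transformers import AutoModel, AutoTokenizer


class TanhGELU(torch.nn.Module):
    """PyTorch's fused tanh approximation, wrapped so it can replace a module attribute."""

    def forward(self, x):
        return torch.nn.functional.gelu(x, approximate="tanh")


model_name = "gpt2"  # illustrative: GPT-2 uses the "gelu_new" activation by default
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    baseline = model(**inputs).last_hidden_state

# swap each block's activation for the fused kernel and rerun the same inputs
for module in model.modules():
    if hasattr(module, "act"):
        module.act = TanhGELU()

with torch.no_grad():
    patched = model(**inputs).last_hidden_state

print((baseline - patched).abs().max())  # worst-case drift in the final hidden states
```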
Probably the creator of this feature would know the best expected tolerance - we will probably need to wait for their reply before we can proceed. Unless of course you'd like to dig into the source code and try to understand it yourself.
In my personal tests, I used

For pytorch internal testing, they use double, long, and complex128 for their numpy reference check. Here is the numpy tanh gelu reference implementation:

Here are the tolerances used in pytorch. For
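That reference boils down to the same formula evaluated with NumPy in double precision; a sketch (not the verbatim snippet from PyTorch's test suite):

```python
import numpy as np


def gelu_tanh_reference(x: np.ndarray) -> np.ndarray:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))), evaluated in float64
    x = x.astype(np.float64)
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3.0))
    return 0.5 * x * (1.0 + np.tanh(inner))
```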
🚀 Feature request
As kindly flagged by @vadimkantorov, pt-1.11 will have a fast Tanh Gelu approximation, as implemented in pytorch/pytorch#61439, so we could replace our manual implementation with the fast one when pt>=1.11 is detected.

For additional context, please see this thread: pytorch/pytorch#39853
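A sketch of how that detection could be wired up; the function names are placeholders, and the version cutoff is an assumption (as the thread later notes, the `approximate` kwarg did not actually make it into 1.11, so the gate may need to target 1.12 or a nightly):

```python
import math

import torch
from packaging import version


def _gelu_python(x: torch.Tensor) -> torch.Tensor:
    # current manual tanh approximation, kept as the fallback for older torch
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))


if version.parse(torch.__version__) >= version.parse("1.12"):
    def gelu_new(x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.gelu(x, approximate="tanh")  # fused native kernel
else:
    gelu_new = _gelu_python
```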