
Implementation - Padé Approximant in Log Softmax Layer #3685

Open · wants to merge 2 commits into base: master

Conversation

MarkFischinger
Contributor

Following our discussion in issue #3662, I've implemented the Padé approximant in the log softmax layer. Due to time constraints I haven't run the tests yet, but I plan to do so shortly and will update you with the results.

@MarkFischinger MarkFischinger changed the title Implementation - Pade Approximant in Log Softmax Layer Implementation - Padé Approximant in Log Softmax Layer Apr 10, 2024
Member

@shrit shrit left a comment


Would you use mlpack style for variables?

@MarkFischinger
Contributor Author

@shrit, thank you for pointing that out. The new commit includes the fix :)

@shrit
Member

shrit commented Apr 11, 2024

I approved this one too quickly; I did not see that the tests were not passing.
@MarkFischinger could you try to run the tests locally?
It would also be nice to compare the matrices generated by the original fast method and by Padé, because I think there is a good amount of difference between them; otherwise the tests would not have failed.

Member

@rcurtin rcurtin left a comment


Just preventing mlpack-bot from auto-approving until we get the fixes worked out. I'm guessing that the level of approximation might be too high, and things are not converging? (Maybe a threshold like x < 13 is needed?)

output.transform([padeApproximant](double x) {
  return padeApproximant(x);
});
Member


I think it would be a bit cleaner to just inline the whole approximant into the lambda, but it's up to you.
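Roughly something like this (untested sketch; I'm assuming the approximant body has the same form as the scaled version discussed later):

output.transform([](double x)
{
  // Same rational approximation of e^(-x), written inline in the lambda.
  const double num = 24 - 12 * x + 4 * x * x - x * x * x;
  const double den = 24 + 12 * x + 4 * x * x + x * x * x;
  return num / den;
});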

@MarkFischinger
Contributor Author

@shrit @rcurtin Sorry for the delay with the benchmarks; I needed to run a more detailed analysis to find an effective solution.
Here’s what I found:

MNIST simple

Old implementation:
Validation loss: 567.577
Duration: 79709 ms
Loss: 0.0331713
Accuracy: train = 98.463%, valid = 97.0707%

New implementation (scale 4, x < 13.0):
Validation loss: 563.98
Duration: 80412 ms
Loss: 0.049433
Accuracy: train = 98.3571%, valid = 96.9517%

The initial idea of adding only the x < 13.0 cutoff proved too broad, leading to uncontrolled error spikes from the large $X$ values I had been concerned about. The example in the discussion issue featured only small $X$ values, which the Padé approximation handles perfectly, but large values (above 8.0) do not work nearly as well. By scaling $X$ by $4$, however, I reduced the error to notably below that of the old version, as you can see in this graph:

[Graph: errors_and_time_4_4_fair — error and runtime comparison between the old implementation and the scaled Padé variant]

Despite the graph showing a seemingly doubled runtime, the actual difference in the CNN run is minor. Could this improvement be a viable option for implementation? What do you think?
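For context, the scaling step relies on the identity

$$e^{-x} = \left(e^{-x/s}\right)^{s},$$

so the approximant only ever sees the reduced argument $x/s$, where it stays accurate, and the result is raised back to the power $s$ afterwards: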

auto scaledPadeApproxExpMinusX = [](double x)
{
  if (x < 13.0)
  {
    // Shrink the argument so the approximant stays in its accurate range;
    // e^(-x) = (e^(-x/s))^s undoes the scaling afterwards.
    const double s = 4.0;
    const double xs = x / s;

    // Rational approximation of e^(-xs).
    const double numerator = 24 - 12 * xs + 4 * xs * xs - xs * xs * xs;
    const double denominator = 24 + 12 * xs + 4 * xs * xs + xs * xs * xs;

    const double pade = numerator / denominator;
    return std::pow(pade, s);
  }

  // For x >= 13, e^(-x) is small enough to treat as zero.
  return 0.0;
};

output.transform([scaledPadeApproxExpMinusX](double x) {
  return scaledPadeApproxExpMinusX(x);
});
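If anyone wants to double-check the accuracy outside of mlpack, a small standalone comparison against std::exp could look like this (throwaway sketch, not part of the PR):

#include <cmath>
#include <cstdio>

int main()
{
  // Same scaled approximant as above.
  auto scaledPadeApproxExpMinusX = [](double x)
  {
    if (x < 13.0)
    {
      const double s = 4.0;
      const double xs = x / s;
      const double num = 24 - 12 * xs + 4 * xs * xs - xs * xs * xs;
      const double den = 24 + 12 * xs + 4 * xs * xs + xs * xs * xs;
      return std::pow(num / den, s);
    }
    return 0.0; // e^(-x) is negligible for x >= 13.
  };

  // Print approximate vs. exact values across the range of interest.
  for (double x = 0.0; x <= 14.0; x += 2.0)
  {
    std::printf("x = %5.1f  pade = %.6e  exact = %.6e\n",
                x, scaledPadeApproxExpMinusX(x), std::exp(-x));
  }
  return 0;
}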

I think I will also test the algorithm on mnist_cnn soon.

@shrit
Member

shrit commented Apr 16, 2024

@MarkFischinger give it a try. What I find weird is that when we tested this separately, it was way faster, while here it looks much slower than the original one.
This is worth investigating.

@MarkFischinger
Contributor Author

Hey @shrit, I think the trouble we're seeing comes from the higher $X$ values in the MNIST examples. Originally, we only saw $X$ values above $4$ about $2.275$% of the time, based on our normal distribution setup with arma::mat output = arma::randn(1000, 1000, arma::distr_param(0, 2)). But the MNIST data frequently produced much higher $X$ values, which caused those spikes and ultimately broke the code. That's why I had to scale them down, which slowed our runtime a bit.

Here are some of the $X$ values from the MNIST example (output):

x = 8.78431
x = 23.1975
x = 22.0821
x = 16.2784
x = 18.5682
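
(As a sanity check on the $2.275$% figure above, here's a quick snippet to measure how often the threshold is exceeded under that randn setup; just a sketch, not part of the PR:)

#include <armadillo>
#include <cstdio>

int main()
{
  // Mean 0, standard deviation 2, as in the original test setup.
  arma::mat output = arma::randn(1000, 1000, arma::distr_param(0, 2));

  // Fraction of entries above 4; should land near P(Z > 2) ~= 0.02275.
  const double frac = (double) arma::accu(output > 4.0) / output.n_elem;
  std::printf("fraction above 4: %f\n", frac);
  return 0;
}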

I'm thinking that since lower $X$ values are more common and are handled better, maybe we should use the original Padé approximation for values up to, say, $4$, and keep our current scaled method as a fallback for anything higher (something like the sketch below). This way we handle the typical cases fast and still catch the outliers without problems. What do you think? Should I run some benchmarks on this mixed approach?
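
Roughly, the mixed approach I have in mind would look like this (untested sketch; the scale of $4$ and the x < 13.0 cutoff are carried over from above):

auto hybridApproxExpMinusX = [](double x)
{
  if (x <= 4.0)
  {
    // Plain approximant: accurate and cheap for the common small-x case.
    const double num = 24 - 12 * x + 4 * x * x - x * x * x;
    const double den = 24 + 12 * x + 4 * x * x + x * x * x;
    return num / den;
  }
  else if (x < 13.0)
  {
    // Fallback for larger x: the scaled variant from above.
    const double s = 4.0;
    const double xs = x / s;
    const double num = 24 - 12 * xs + 4 * xs * xs - xs * xs * xs;
    const double den = 24 + 12 * xs + 4 * xs * xs + xs * xs * xs;
    return std::pow(num / den, s);
  }
  return 0.0; // e^(-x) is negligible beyond the cutoff.
};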

@rcurtin
Member

rcurtin commented Apr 16, 2024

Yeah, a switch to the existing implementation at about x > 4 would probably do the trick for convergence too. I would be interested to see whether it would be faster as well, although to check that you'd need to ensure that the number of epochs used for training is constant (or just time a single epoch, that's fine too).

The scaling trick is definitely a good one for convergence, but I suspect the std::pow call is expensive and is what causes it to be slower.
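
If that is the culprit, one cheap thing to try (just a guess on my part, untested) would be to exploit the fact that the scale is a fixed integer, so std::pow(pade, 4.0) becomes two multiplications:

// Hypothetical replacement for std::pow(pade, s) with fixed s = 4:
// squaring twice computes pade^4 without the general pow() machinery.
const double p2 = pade * pade;
return p2 * p2;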

@MarkFischinger
Contributor Author

@rcurtin I did some backtesting, and unfortunately the results showed no or only minor improvements in runtime. Statistically, combining those two algorithms should reduce the error, but I'm still looking for faster implementations because I'm hopeful I can find a better solution. I'll update you as soon as possible, though my available time will be limited for the next few days due to exams :/
