Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added FFTMod #67

Merged
merged 50 commits into from Jun 24, 2019
Merged

Conversation

Chillee
Copy link
Collaborator

@Chillee Chillee commented Apr 23, 2019

I think I've seen a lot of people have this in their library. Allows us to perform FFT over arbitrary moduli without needing to do finicking with primes close to a power of 2 and such. Really, if we ever need more space, I don't think there's anywhere NTT can be used that FFT-MOD can't. An optimized FFT-MOD is only about twice as slow as an optimized NTT.

On the other hand, NTT is pretty cool, and it is seriously nice to not need to think about precision at all.

Here's an example exemplifying the precision issues (I'll turn it into a fuzz test later). Basically, as long as the coefficients don't surpass ~1e15 FFT is fine. Higher though, and you start to see precision issues.
https://ideone.com/lUrDCT

It is kind of ugly though :'(. I tried my best to shorten it, but it's still uglier than I'd like.

This is somewhat WIP - I still need to add a description about it. But I'd appreciate feedback on the necessity as well as the code itself.

@Chillee Chillee changed the title Added FFTMod as well Added FFTMod Apr 23, 2019
@simonlindholm
Copy link
Member

Here's an example exemplifying the precision issues (I'll turn it into a fuzz test later). Basically, as long as the coefficients don't surpass ~1e15 FFT is fine. Higher though, and you start to see precision issues.

So basically, if MOD ~ 1e9, we start seeing precision issues at around n ~ 1e6? That's uncomfortably small :( Especially so since this is at random inputs; the worst case may be, well, worse. I know I researched this before, and most theoretical guarantees were pretty weak -- the FFT currently pessimistically says the following: "For integers rounding works if $(|a| + |b|)\max(a, b) < ~ 10^9$, or in theory maybe $10^6$." And it would take an annoying amount of research to determine whether one can actually get closer to the worst case...

It is kind of ugly though :'(. I tried my best to shorten it, but it's still uglier than I'd like.

Yeah. :/ Is there any chance of shortening it by making it call the existing conv method?

One alternative, though it loses on code size, is to say that if NTT * 2 ~ FFT-MOD, then can't we run two NTT's and use CRT for the result for about the same performance? It avoids the precision problems, and with luck it might be relatively short since it can reuse a lot of existing KACTL code. (This is currently noted in the FFT description, but the implementation is left as an exercise to the reader.)

necessity

I don't know, it's tricky making these sorts of decisions... vague preference towards keeping just the NTT and existing FFT description, but it depends on code size and stuff... Though even if we don't include this, it's always nice to have a high-quality implementation that's not included in the pdf, which enterprising teams can uncomment if they think it's important to them.

@Chillee
Copy link
Collaborator Author

Chillee commented Apr 24, 2019

I'll check out how a CRT + NTT based FFTMod equivalent might look like.

@Chillee
Copy link
Collaborator Author

Chillee commented Apr 25, 2019

Unluckily, CRT+NTT is 3x slower. Even though our NTT is very fast, the bottleneck isn't the NTT, it's the CRT with 128 bit numbers.

The benchmarks look like this:

254ms elapsed [neal]
201ms elapsed [mine]

722ms elapsed [ntt crt]
    283ms elapsed [ntt portion]
    436ms elapsed [crt portion]

Which is a shame, since it's actually quite easy to implement.
Actually, implementing it in KACTL's fashion is actually fairly tricky, due to the need to template our mods. We would probably need to add a templated mod to modpow, to ntt, and to conv. In addition, we'd need to make CRT take in __int128_t (the best way would probably be a template for type as well).

@Chillee
Copy link
Collaborator Author

Chillee commented Apr 25, 2019

So, TL;DR: I think we should still include FFT-MOD. I'll think more about whether we can simplify it.

@Chillee
Copy link
Collaborator Author

Chillee commented Apr 25, 2019

Yeah. :/ Is there any chance of shortening it by making it call the existing conv method?

Unluckily, I doubt it. Not without significant runtime losses at least (probably 2x). The issue is that if we do a bunch of repeated calls to existing conv, we're gonna be redoing a bunch of FFT's (twice for each polynomial). A total of 8 FFT's for reusing existing conv vs the current 4 FFT's.

@Chillee
Copy link
Collaborator Author

Chillee commented Apr 25, 2019

So I tried it out and it would look look like this

template <int MOD> vector<ll> convMod(const vi &a, const vi &b) {
    if (a.empty() || b.empty()) return {};
    ll cut = sqrt(MOD);
    vector<double> ah(sz(a)), al(ah), bh(sz(b)), bl(bh);
    rep(i, 0, sz(a)) ah[i] = a[i] / cut, al[i] = a[i] % cut;
    rep(i, 0, sz(b)) bh[i] = b[i] / cut, bl[i] = b[i] % cut;
    auto h = conv(ah, bh), l = conv(al, bl);
    auto m1 = conv(ah, bl), m2 = conv(al, bh);
    vector<ll> res(h.size());
    for (int i = 0; i < res.size(); i++) {
        ll av = round(h[i]);
        ll bv = round(m1[i]) + round(m2[i]);
        ll cv = round(l[i]);
        av %= MOD, bv %= MOD, cv %= MOD;
        res[i] = (av * cut * cut + bv * cut + cv) % MOD;
    }
    return res;
}

Significantly shorter, but still kind of weighty. Also ~2x slower.

241ms elapsed [mine]
497ms elapsed [reuse]

@Chillee
Copy link
Collaborator Author

Chillee commented Apr 25, 2019

One thing I did was to move the root computation from the conv routines to the FFT routine itself. Computing the roots should only take a small percentage (~5%) of the total time needed, so this isn't much of a loss. In regular FFT, this should correspond to a 5% slowdown since we would be computing roots unnecessarily once. In FFT-MOD this should correspond to a 15% slowdown (I actually get 243 vs 212 ms).

I think this is a worthy tradeoff.

@simonlindholm
Copy link
Member

That seems reasonable. (I imagine it's a larger percentage for small n, though.) I think convMod should probably go in a separate file. Also, can you fix the -Wconversion warnings?

@Chillee Chillee force-pushed the fftmod branch 4 times, most recently from b886871 to 951f07d Compare April 26, 2019 18:38
@Chillee
Copy link
Collaborator Author

Chillee commented Apr 29, 2019

I did some proper benchmarking and it seems like recalculating the roots (with long double calculations) slows it down by 25%. If we're only using double it slows it down by ~10%.

I think that's decently substantial and worth fixing (if we can do it with just only a few lines).

Copy link
Member

@simonlindholm simonlindholm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code mostly looks good, still need to look into precision issues and description wordsmithing

content/numerical/FastFourierTransform.h Outdated Show resolved Hide resolved
content/numerical/FastFourierTransform.h Outdated Show resolved Hide resolved
content/numerical/FastFourierTransform.h Outdated Show resolved Hide resolved
@simonlindholm
Copy link
Member

https://github.com/simonlindholm/fft-precision/blob/master/fft-precision.md has a bunch of analysis of FFT precision. I'll try to merge this soon.

@simonlindholm simonlindholm merged commit b3be665 into kth-competitive-programming:master Jun 24, 2019
@Chillee
Copy link
Collaborator Author

Chillee commented Jun 24, 2019

:O. Awesome!

hockyy pushed a commit to hockyy/kactl that referenced this pull request Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants