perf improvements for findDomainID() #1286
Thanks for your contribution, your idea looks interesting. There is room for a few improvements in your code, which I will add as comments below. I will furthermore set up a small test program to make testing individual optimizations simpler for us in this PR.
As promised, I set up a very simple script aimed at generating realistic numbers. You can find it at http://dl6er.de/FTL/test.c. It doesn't have any dependencies, a simple I tried what you suggested here and an optimized form of it (PR #1286 runs). I furthermore added the currently used method of first byte comparisons ( Results without optimization:
With optimization this is a lot more tricky to measure, as the compiler sees that everything is constant and can simplify things a lot with fancy tricks. Hence, comparing higher than
We have seen that (a)
Thank you so much for attaching the test code, it was really helpful for breaking the ideas down further. I did some digging into why the unoptimized PR #1286 did not perform comparably. It looks like, without the optimizations, the packing step ends up generating a whole bunch of instructions that slow everything down. I could not work out the best way to get the
This ends up looking something like:
Of course, it's not really all that important in the scheme of things (I was just curious about it), and a hash-based pre-computation approach is an obvious winner.
I suspect there are some overheads that are not apparent through the test code. They might not necessarily be massive, but every little bit counts on Raspberry Pi sized hardware, I suppose! I'm curious to see what kind of performance I get in reality by removing all the compares except for one. As the time-memory trade-off approach looks much better, and if you're open to that, I'll commit the version I've implemented (same as the new hash-based one) after I've had it running for a few days to see how it goes and to collect some statistics on my Pi-hole.
FTL is supposed to run on a variety of platforms (more than just x86 and ARM) and inline assembler should be avoided. Even when one might argue that

For v6.0, I'm looking at a B-tree for all the data structures to hopefully gain even more speed. I don't want to go into detail here, but using a B-tree showed some improvements compared to a binary tree, as it is much shallower (more branches per node) and reaches the leaves much earlier (fewer steps). For instance, our current implementation has a depth of 3 branches from the root to the leaf nodes for 100,000 entries. So you need 4 lookups to get the domain instead of having to loop over all the 100,000 (until match). For a binary tree, this would still be log2(100,000) ≈ 16.6, so about 17 seeks. That is a lot better than looping over everything, but the self-balancing tree turns out to be better in terms of memory usage (in terms of fewer memory pages we have to get). As you said, all this counts on Raspberry Pi devices. It also means some overhead, but we can move this overhead outside of time-critical paths such as query resolving.

This new concept is so radically different (also involving shared databases between threads and forks) that it is fundamentally incompatible with the current code and, hence, lives inside a branch for v6.0. It is not fully tested yet and I have it only locally, but I'm pretty confident this will be how we can proceed in the future (if no major, unavoidable obstacles show up during late development).
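The depth comparison in this comment can be checked with a rough capacity calculation. The sketch below is only an illustration: `levels_needed` is a hypothetical helper, and the fan-out of 64 is an assumed value, not one taken from FTL. It counts how many levels a completely full tree needs before it can hold a given number of entries.

```c
#include <stdint.h>

/* Hypothetical helper: smallest number of levels such that
 * fanout^levels >= entries, i.e. the depth of a completely full tree.
 * Real B-tree nodes are usually not full, so actual depth can be larger. */
static unsigned int levels_needed(uint64_t entries, uint64_t fanout)
{
    if (fanout < 2)
        return 0;               /* a fan-out below 2 is not a tree */

    unsigned int levels = 0;
    uint64_t capacity = 1;      /* entries reachable with 0 levels */
    while (capacity < entries) {
        capacity *= fanout;     /* each extra level multiplies capacity */
        levels++;
    }
    return levels;
}
```

Under these assumptions, `levels_needed(100000, 2)` gives 17 for a balanced binary tree, while `levels_needed(100000, 64)` gives 3, which is in line with the depth of 3 quoted above for 100,000 entries.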
That sounds like a really interesting development for v6 and I hope to be able to try it out when it's ready :)
I have implemented the hash-based time-memory trade-off approach, for your critique.
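For readers following along, the general idea of a hash-based time-memory trade-off can be sketched as follows. This is not the code from this PR: the `domain_entry` struct, `hash_str`, and `find_domain` are hypothetical names, and 32-bit FNV-1a is only an example hash function. Each entry stores a hash computed once at insertion, so most non-matching entries are rejected with a single integer compare.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout, not FTL's actual struct */
struct domain_entry {
    uint32_t hash;      /* computed once when the entry is created */
    const char *name;
};

/* 32-bit FNV-1a, used here only as an example hash function */
static uint32_t hash_str(const char *s)
{
    uint32_t h = 2166136261u;           /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}

/* Returns the index of the matching entry, or -1 if there is none */
static int find_domain(const struct domain_entry *entries, int count,
                       const char *name)
{
    const uint32_t wanted = hash_str(name);   /* hashed once per lookup */
    for (int i = 0; i < count; i++) {
        if (entries[i].hash != wanted)
            continue;                         /* cheap integer reject */
        if (strcmp(entries[i].name, name) == 0)
            return i;                         /* confirm: hashes can collide */
    }
    return -1;
}
```

The extra 4 bytes per entry are the memory side of the trade-off. The final `strcmp()` is still needed because two different domains can share a hash.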
I'm not at home right now but from looking at the code, I already have three comments. It looks very good so far!
…isions Signed-off-by: stuXORstu <stuxorstu@gmail.com>
Thank you very much. This should definitely improve performance; we'll see by how much in real operation.
This pull request has been mentioned on Pi-hole Userspace. There might be relevant details there: https://discourse.pi-hole.net/t/pi-hole-ftl-v5-14-web-v5-11-and-core-v5-9-released/53529/1
By submitting this pull request, I confirm the following (please check boxes, e.g. [X]). Failure to fill the template will close your PR:
Please submit all pull requests against the development branch. Failure to do so will delay or deny your request.
How familiar are you with the codebase?: 2
The function findDomainID() is called a lot and contains a simple test to avoid
further iteration and more expensive calls within. This commit improves the
performance overall by optimizing the simple test to let fewer non-matching
domains through and to perform the actual comparison quickly too.
The original code tries a quick match on the first char of the domain before
continuing. In my testing this matches more often than expected, and the
code therefore continues with more expensive calls and further iteration.
This is because common "domains" tend to start with things like 'www.' or
'mail.' or 'img.' (etc., etc.), and certain first letters also seem to be
quite frequent in domains generally.
This patch tests whether 4 chars are the same, starting at an offset of
4 chars within the domain (to avoid 'www.' and similar). If the length of
the domain is less than 4 chars, it reverts to the original first-char-only
test.
The 4 chars are also packed into an int-sized register for comparison, which
means the comparison is just as fast as the single char test.
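A minimal sketch of the packed comparison described above, not the patch itself: `pack4` and `quick_match` are hypothetical names, and the sketch falls back to the first-char test whenever either domain is shorter than 8 bytes, which is an assumption made here so that reading 4 bytes at offset 4 always stays inside the string.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack the 4 bytes starting at offset 4 into one
 * 32-bit value. memcpy avoids unaligned reads and aliasing problems and
 * is typically compiled down to a single load when optimizing. */
static uint32_t pack4(const char *domain)
{
    uint32_t packed;
    memcpy(&packed, domain + 4, sizeof(packed));
    return packed;
}

/* Quick pre-test: false means the domains definitely differ,
 * true means a full strcmp() is still required to confirm. */
static bool quick_match(const char *a, size_t alen,
                        const char *b, size_t blen)
{
    /* Assumption of this sketch: need 8 bytes to read 4 at offset 4 */
    if (alen < 8 || blen < 8)
        return a[0] == b[0];        /* original first-char test */

    return pack4(a) == pack4(b);    /* one integer comparison */
}
```

With this sketch, "www.example.com" and "www.pi-hole.net" are rejected immediately even though both start with 'www.', whereas "www.example.com" and "img.example.com" still pass the quick test and need the full comparison.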
My testing of these two changes on my modest Pi 3 Model B Rev 2 (the 1GB RAM,
non-GbE Quad-core 1.2 GHz kind) showed an average 20% improvement in DNS lookup
latency (namebench showed an improvement from 103 ms to 83 ms using the Alexa
Top 200 domains).