perf improvements for findDomainID() #1286

stuXORstu · 2022-01-22T00:37:47Z

By submitting this pull request, I confirm the following (please check boxes, eg [X]) Failure to fill the template will close your PR:

Please submit all pull requests against the development branch. Failure to do so will delay or deny your request

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
I have considered, and confirmed that this submission will be valuable to others.
I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
I give this submission freely, and claim no ownership to its content.

How familiar are you with the codebase?:
2

The function findDomainID() is called a lot and contains a simple test to avoid
further iteration and more expensive calls within. This commit improves the
performance overall by optimizing the simple test to match more frequently and
perform the actual comparison quicker too.

The original code will try a quick match on the first char of the domain before
continuing, which from my testing actually appears to match more regularly
than expected, and therefor, continue with more expensive calls and further iteration.

This is because common "domains" tend to start with things like 'www.' or
'mail. or 'img.' (etc., etc.) and also the frequency of the first letters in
domains generally seems to be fairly common too.

This patch implements an approach to test whether 4 chars are the
same, offset 4 chars within the domain (to avoid 'www.' and similar), unless
the length of the domain is less than 4 chars and we revert to the original
first char only test.

The 4 chars are also packed into an int sized register for comparison which
means the comparison is just as fast as the single char test.

My testing of these two changes on my modest Pi 3 Model B Rev 2 (the 1GB RAM,
non-GbE Quad-core 1.2 GHz kind) showed an average 20% improvement in dns lookup
latency (namebench showed 103ms to 83ms improvement using Alexa Top 200
domains).

DL6ER

Thanks for your contribution, your idea looks interesting. There is room for a few improvements in your code, I will add this as comments below. I will furthermore setup up a small test program which will make testing individual optimizations simpler for us in this PR.

src/datastructure.c

DL6ER · 2022-01-22T11:17:07Z

As promised, I set up a very simple script aimed at generating realistic numbers. You can find it at http://dl6er.de/FTL/test.c. I doesn't have any dependencies, a simple gcc test.c will do the trick. The input data is the Top 1 Million Alexa domains you have to download separately and extract into the same folder (http://s3.amazonaws.com/alexa-static/top-1m.csv.zip).

I tried what you suggested here and an optimized form of it (PR #1286 runs). I furthermore added the currently used method of first byte comparisons (FTL master) and a non-optimized version that would call strcmp for every domain until a match is found. Furthermore, I added another variant of using a 32bit hash that was able to reduce the number of strcmp calls to one in all cases (New hash-based).

Results without optimization:

Reading top-1m.csv .... done (read 590427 domains)

Domain: culture1391.blogfa.com
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         534899          3.210               
FTL master           36066           2.302               
PR #1286             1               4.268               
PR #1286 (improved)  1               1.123               
New hash-based       1               1.097               

Domain: prospecllc.com
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         258569          0.960               
FTL master           15027           0.771               
PR #1286             3               1.330               
PR #1286 (improved)  3               0.529               
New hash-based       1               0.539               

Domain: eassos.com
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         156682          0.486               
FTL master           6556            0.417               
PR #1286             77              0.832               
PR #1286 (improved)  77              0.314               
New hash-based       1               0.309               

Domain: stud.kz
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         36908           0.103               
FTL master           3371            0.114               
PR #1286             3371            0.145               
PR #1286 (improved)  126             0.078               
New hash-based       1               0.075               

Domain: turkuazyazilim.com.tr
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         482289          1.692               
FTL master           34056           1.345               
PR #1286             1               2.436               
PR #1286 (improved)  1               0.974               
New hash-based       1               0.974

With optimization this is a lot more tricky to measure as the compiler sees that it can be simplifying things a lot as everything is constant and it can do a lot of fancy tricks. Hence, comparing higher than -O1 isn't fair (it even has to be -O1 -fno-ipa-pure-const):

Reading top-1m.csv .... done (read 590427 domains)

Domain: culture1391.blogfa.com
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         534899          2.319               
FTL master           36066           1.400               
PR #1286             1               1.858               
PR #1286 (improved)  1               0.593               
New hash-based       1               0.543               

Domain: prospecllc.com
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         258569          1.245               
FTL master           15027           0.571               
PR #1286             3               0.649               
PR #1286 (improved)  3               0.205               
New hash-based       1               0.156               

Domain: eassos.com
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         156682          0.421               
FTL master           6556            0.314               
PR #1286             77              0.376               
PR #1286 (improved)  77              0.082               
New hash-based       1               0.049               

Domain: stud.kz
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         36908           0.080               
FTL master           3371            0.041               
PR #1286             3371            0.050               
PR #1286 (improved)  126             0.020               
New hash-based       1               0.011               

Domain: turkuazyazilim.com.tr
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         482289          1.562               
FTL master           34056           1.056               
PR #1286             1               1.269               
PR #1286 (improved)  1               0.429               
New hash-based       1               0.405

We have seen that (a) strcmp isn't all that bad and the small time-memory tradeoff I added (4 bytes for each known domain) for the last two iterations pays out in this test. This all is not directly applicable to running inside FTL (by default optimized with -O3) but it gives us some basis for further discussion and thoughts.

stuXORstu · 2022-01-29T13:01:46Z

Thank you so much for attaching the test code, it was really helpful for breaking the ideas down further. I did some digging into why the unoptimized PR #1286 was not performant comparably. Looks like without the optimizations the packing step ends up generating a whole bunch of instructions that slow everything down. I could not work out the best way to get the -O1 -fno-ipa-pure-const compiled version to do something more reasonable and ended up doing a single inlined asm instruction to achieve the same thing.

int packedDomainPos, packedDomainString;

__asm ("LDR %[result], [%[input], #4]"
    : [result] "=r" (packedDomainPos)
    : [input] "r" (localDomain)
    );
__asm ("LDR %[result], [%[input], #4]"
    : [result] "=r" (packedDomainString)
    : [input] "r" (domainString)
);

if (packedDomainPos != packedDomainString)
    continue;

This ends up looking something like:

Domain: losangelestransfer.com (pos: 218049)
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         218050          5.156
FTL master           7602            3.812
PR #1286             14              3.717
PR #1286 (refactor)  14              3.731
PR #1286 (improved)  14              1.206
New hash-based       1               1.200

Domain: dadpardaz.com (pos: 244673)
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         244674          5.815
FTL master           10642           4.279
PR #1286             17              4.162
PR #1286 (refactor)  17              4.171
PR #1286 (improved)  17              1.342
New hash-based       1               1.333

Domain: digitalpart.ir (pos: 352841)
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         352842          8.408
FTL master           15865           6.249
PR #1286             42              6.036
PR #1286 (refactor)  42              6.022
PR #1286 (improved)  42              1.957
New hash-based       1               1.924

Domain: porkbun.com (pos: 7090)
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         7091            0.126
FTL master           369             0.063
PR #1286             1               0.051
PR #1286 (refactor)  1               0.051
PR #1286 (improved)  1               0.031
New hash-based       1               0.031

Domain: tilegrafimanews.gr (pos: 61670)
Method               Full compares   avg. time per run [ms]
-------------------  --------------  ----------------------
Full compare         61671           1.469
FTL master           4121            1.058
PR #1286             5               1.022
PR #1286 (refactor)  5               1.021
PR #1286 (improved)  5               0.299
New hash-based       1               0.300

Of course, it's not really all that important in the scheme of things - I was just curious about it - and a hash based pre-computation approach is an obvious winner.

stuXORstu · 2022-01-29T13:02:34Z

I suspect there are some overheads that are not apparent through the test code. They might not necessarily be massive but every little bit counts on raspberry pi sized hardware I suppose! I'm curious to see whether by removing all the compares, except for 1, what kind of performance I'm able to see in reality.

As the time-memory trade off approach looks much better, and if you're open to that, I'll commit the version I've implemented (same as the new hash-based) after I've had it running a few days and see how it goes and collect some statistics on my pi-hole.

DL6ER · 2022-01-29T14:50:02Z

FTL is supposed to run on a variety of platforms (more than just x86 and ARM) and inline assembler should be avoided. Even when one might argue that LDR will always be there. I agree that the hash-based approach is something we should go for now.

For v6.0, I'm looking at a B-tree for all the datastructure to hopefully gain even more speed. I don't want to go into detail here but using a B-tree showed some improvements - compared to a binary tree - as is much shallower (more branches) and leads to the leafs much earlier (fewer steps). For instance, our current implementation has a depth of 3 branches from the root to the leaf nodes for 100,000 entries. So you need 4 lookups to get the domain instead of having to loop over all the 100,000 (until match). For a binary tree, this would still be log2(100,000) = 16 seeks. A lot better than looping over everything, but the self-balancing tree turns out to be better in terms of memory usage (in terms of fewer memory pages we have to get). As you said, all this counts on Raspberry Pi devices. It also means some overhead but we can offload this overhead to run it outside of the time-critical things such as during query resolving.

This new concept is so radically different (also involving shared databases between threads and forks), that it is fundamentally incompatible with the current code and, hence, lives inside a branch for v6.0. It is not fully tested yet and I have it only local but I'm pretty confident this will be how we can proceed in the future (if no major, unavoidable obstacles show up during late development).

stuXORstu · 2022-02-04T14:06:20Z

That sounds like a really interesting development for v6 and hope to be able to try out when it's ready :)

stuXORstu · 2022-02-04T14:10:49Z

I have implemented the hash based time-memory trade off approach, for your critique.

DL6ER

I'm not at home right now but from looking at the code, I already have three comments. It looks very good so far!

src/datastructure.h

src/datastructure.c

src/datastructure.h

…isions Signed-off-by: stuXORstu <stuxorstu@gmail.com>

DL6ER

Thank you very much. This should definitely improve performance, we'll see by how much on real operation.

pralor-bot · 2022-02-12T20:10:06Z

This pull request has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/pi-hole-ftl-v5-14-web-v5-11-and-core-v5-9-released/53529/1

stuXORstu force-pushed the tweak/domainid_perf branch from 1f4b1c6 to b4bc4da Compare January 22, 2022 00:40

DL6ER requested changes Jan 22, 2022

View reviewed changes

src/datastructure.c Outdated Show resolved Hide resolved

src/datastructure.c Outdated Show resolved Hide resolved

src/datastructure.c Outdated Show resolved Hide resolved

DL6ER added the Discussion label Jan 30, 2022

DL6ER marked this pull request as draft January 30, 2022 10:37

stuXORstu closed this Feb 1, 2022

stuXORstu force-pushed the tweak/domainid_perf branch from beb4a16 to b197b69 Compare February 1, 2022 15:35

stuXORstu reopened this Feb 4, 2022

DL6ER reviewed Feb 4, 2022

View reviewed changes

src/datastructure.h Outdated Show resolved Hide resolved

src/datastructure.c Outdated Show resolved Hide resolved

src/datastructure.c Outdated Show resolved Hide resolved

stuXORstu force-pushed the tweak/domainid_perf branch from f300a9f to 02eed0e Compare February 6, 2022 11:06

DL6ER reviewed Feb 6, 2022

View reviewed changes

src/datastructure.h Outdated Show resolved Hide resolved

Updates for tweak/domainid_perf to implement better hash based compar…

8410af3

…isions Signed-off-by: stuXORstu <stuxorstu@gmail.com>

stuXORstu force-pushed the tweak/domainid_perf branch from 02eed0e to 8410af3 Compare February 6, 2022 12:36

DL6ER marked this pull request as ready for review February 6, 2022 20:58

DL6ER self-requested a review February 6, 2022 20:58

DL6ER removed the Discussion label Feb 6, 2022

DL6ER approved these changes Feb 8, 2022

View reviewed changes

DL6ER merged commit 2e479f6 into pi-hole:development Feb 8, 2022

renovate bot mentioned this pull request Mar 5, 2022

Update helm values pihole/pihole to v2022 cterence/homelab-gitops#28

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf improvements for findDomainID() #1286

perf improvements for findDomainID() #1286

stuXORstu commented Jan 22, 2022

DL6ER left a comment

DL6ER commented Jan 22, 2022

stuXORstu commented Jan 29, 2022

stuXORstu commented Jan 29, 2022

DL6ER commented Jan 29, 2022

stuXORstu commented Feb 4, 2022

stuXORstu commented Feb 4, 2022

DL6ER left a comment

DL6ER left a comment

pralor-bot commented Feb 12, 2022

perf improvements for findDomainID() #1286

perf improvements for findDomainID() #1286

Conversation

stuXORstu commented Jan 22, 2022

DL6ER left a comment

Choose a reason for hiding this comment

DL6ER commented Jan 22, 2022

stuXORstu commented Jan 29, 2022

stuXORstu commented Jan 29, 2022

DL6ER commented Jan 29, 2022

stuXORstu commented Feb 4, 2022

stuXORstu commented Feb 4, 2022

DL6ER left a comment

Choose a reason for hiding this comment

DL6ER left a comment

Choose a reason for hiding this comment

pralor-bot commented Feb 12, 2022