How can I do 1.2 billion extractions faster? #175
Comments
I don't think so. This scale is uncharted territory. 😃 |
Can you share the rough code here? I would expect this in-memory, CPU-oriented splitting and traversal to parallelize well. |
This is the code I ran.

from itertools import chain, islice
from multiprocessing import Pool
import socket
import struct
import sys
import uuid

from tldextract import extract as tld_ex

# Convert a dotted-quad IP string to a 32-bit integer.
ip2int = lambda x: struct.unpack("!I", socket.inet_aton(x))[0]

def group(num_items, iterable):
    """Yield the iterable in chunks of num_items."""
    iterable = iter(iterable)
    while True:
        chunk = islice(iterable, num_items)
        try:
            first_element = next(chunk)
        except StopIteration:
            return
        yield chain((first_element,), chunk)

def extract(manifest):
    """Append 'int_ip,second_level_domain' to the given output file."""
    file_, line = manifest
    try:
        ip, domain = line.split(',')
        with open(file_, 'a') as handle:
            handle.write('%d,%s\n' % (ip2int(ip), tld_ex(domain)[1]))
    except Exception:
        pass  # skip malformed lines

file_names = ['out.%s.csv' % str(uuid.uuid4())[:6]
              for _ in range(0, 16)]
pool = Pool(16)

while True:
    payload = [(file_names[x], sys.stdin.readline().strip())
               for x in range(0, 16)]
    if not any(line for _, line in payload):
        break  # stop once stdin is exhausted
    pool.map(extract, payload) |
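For reference, a minimal sketch of one alternative to calling pool.map on 16 lines at a time: stream stdin through imap_unordered with a large chunksize so each worker task amortizes the inter-process overhead. The function name extract_domain and the chunk size are illustrative, not part of tldextract, and the ip2int conversion and per-file output from the code above are omitted for brevity.

import sys
from multiprocessing import Pool

from tldextract import extract as tld_ex

def extract_domain(line):
    """Return 'ip,second_level_domain' for an 'ip,domain' line, or None on malformed input."""
    try:
        ip, domain = line.rstrip('\n').split(',')
        return '%s,%s' % (ip, tld_ex(domain)[1])
    except ValueError:
        return None

if __name__ == '__main__':
    # chunksize batches many lines per worker task, so the pool is not paid
    # its IPC overhead for every handful of lines. Results arrive in
    # arbitrary order, which is fine for this job.
    with Pool() as pool:
        for result in pool.imap_unordered(extract_domain, sys.stdin, chunksize=10000):
            if result is not None:
                sys.stdout.write(result + '\n')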
I like these questions. Reviewing the core algorithm, I thought it was O(k), where k is the number of parts of the input string. I thought we'd be bounded by set lookups. To your point, maybe there are set lookups it shouldn't try? It could exit sooner? Where is it spending its time in those 23.4s? |
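To make the O(k) shape concrete, here is a simplified sketch of the longest-match-first, set-lookup style of suffix matching being described. It is illustrative only, not tldextract's actual implementation, and SUFFIXES is a tiny stand-in for the ~7K-entry PSL set.

# Candidate suffixes are tried longest first; each try is one set lookup,
# so a hostname with many labels pays for many lookups.
SUFFIXES = {"com", "net", "uy", "net.uy", "uk", "co.uk"}

def split_suffix(hostname):
    labels = hostname.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            return ".".join(labels[:i]), candidate
    return hostname, ""

print(split_suffix("node-nnf.pool-1-1.dynamic.totinternet.net"))
# -> ('node-nnf.pool-1-1.dynamic.totinternet', 'net'), after 5 set lookups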
That's a good question. I haven't run it through a flame graph or anything; I guess that would be a good place to start. If it were the regex causing most of the load, do you think it would be possible to have two TLD lists, one with all 7K TLDs and one with the top 50 or so? Would something like that cause a lot of edge-case issues? |
See what the problem is first. Again, my thinking is it's input bound, not PSL length bound. We'll see. I've had decent luck with Python's builtin profiler. It dumps a text spreadsheet. I'm not up to date on Python state of the art. A flame graph would be nice to share. |
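For what it's worth, a minimal sketch of profiling that kind of hot loop with the builtin cProfile module; the hostname is just an example (it is the one used in the flame graph test later in the thread).

# Profile one million extractions and print the top functions by cumulative time.
import cProfile
import pstats

from tldextract import extract as tld_ex

def hot_loop():
    for _ in range(1000000):
        tld_ex('kd111239172113.au-net.ne.jp')  # example hostname

cProfile.run('hot_loop()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)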
I've managed to produce a flame graph, but GitHub requires the SVG to be ZIP-compressed before it can be uploaded. I'll leave the setup notes I used to produce this graph here in case they're of any use. This was all run on Ubuntu 16.04.2 LTS.

$ sudo apt update
$ sudo apt install \
autoconf \
automake \
autotools-dev \
g++ \
git \
libtool \
make \
pkg-config \
python-dev \
virtualenv
$ virtualenv ~/.test
$ source ~/.test/bin/activate
$ pip install tldextract
$ vi ~/test.py

from tldextract import extract as tld_ex
h = 'kd111239172113.au-net.ne.jp'
_ = [tld_ex(h) for x in range(0, 1000000)]

$ git clone https://github.com/brendangregg/FlameGraph ~/flamegraph
$ git clone https://github.com/uber/pyflame ~/pyflame
$ cd ~/pyflame
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ cd ~
$ pyflame -s 10 -r 0.01 -o ~/perf.data -t python ~/test.py
$ ~/flamegraph/flamegraph.pl ~/perf.data > ~/perf.svg |
Hi! I'm somewhat familiar with this process. Looking into things on my laptop here (2.7 GHz Intel Core i7), I get the following performance to start:
Writing the following function to remove the extra processing related to URLs and punycode:
gives
So, half the time is gone, but it's still almost 4x slower. When I have implemented this in the past, I have used a trie-type structure mapping the TLDs from right to left, so visually something like
So, parsing the domain from right to left: if you see 'com' you hit the end of the tree and know you have a TLD, but if you see 'uk', you need to check whether the next part is 'co' or 'ac', etc. This looks something like this:
I basically ignored the *. and ! rules for now; I don't think they are hard to handle, but I'm not familiar with them. Re-running the test gives:
This gives the biggest win for long domain names. Both are still O(k), but going from right to left means k ends up being 1 or 2 in the most common cases. Before:
After:
|
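The code and timing blocks from that comment didn't survive in this copy, but a minimal sketch of a right-to-left suffix trie along the lines described might look like the following. It is illustrative rather than the comment's exact code, and the *. and ! rules are ignored, as noted above.

# Build a trie keyed by labels read right to left, then walk the hostname the
# same way; matching stops after one or two labels for common TLDs like 'com'.
TLDS = ["com", "net", "uy", "net.uy", "uk", "co.uk", "ac.uk"]

def build_trie(tlds):
    root = {}
    for tld in tlds:
        node = root
        for label in reversed(tld.split(".")):
            node = node.setdefault(label, {})
        node["$"] = True  # marks a complete public suffix
    return root

TRIE = build_trie(TLDS)

def split_suffix(hostname):
    labels = hostname.split(".")
    node, matched = TRIE, 0
    # Walk right to left; stop as soon as a label is not in the trie.
    for i, label in enumerate(reversed(labels), 1):
        if label not in node:
            break
        node = node[label]
        if "$" in node:
            matched = i
    suffix = ".".join(labels[len(labels) - matched:])
    rest = ".".join(labels[:len(labels) - matched])
    return rest, suffix

print(split_suffix("node-nnf.pool-1-1.dynamic.totinternet.net"))
# -> ('node-nnf.pool-1-1.dynamic.totinternet', 'net') after a single trie step
print(split_suffix("foo.example.co.uk"))
# -> ('foo.example', 'co.uk')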
That made it click for me. Yes, a trie going right to left would help for these pathological inputs (yet common inputs for certain use cases). I like that the optimization is not much more code. |
I also like the clear stats. Thank you for that effort. I didn't understand the flame graph. Nor my own cProfile spelunking. (The best I could blame was the years-old, obtuse chain of string manipulation, near the beginning of processing 1 input. It's a bit wasteful. I dreaded making it more obtuse in the name of maybe speeding up the OP's case. I guess that's still on the table, but we'll cross that bridge if we come to it.) |
And if one were hesitant about the change to a trie, another optimization against this pathological case would be to ignore any levels beyond the max level in the PSL. At the time of writing, that appears to be 5 levels, experimentally determined.

❯ rg '^[^/].*\..*\..*\..*\.' public_suffix_list.dat
10688:*.compute.amazonaws.com.cn
10693:cn-north-1.eb.amazonaws.com.cn
10694:cn-northwest-1.eb.amazonaws.com.cn
10717:*.elb.amazonaws.com.cn
10741:s3.cn-north-1.amazonaws.com.cn
10747:s3.dualstack.ap-northeast-1.amazonaws.com
10748:s3.dualstack.ap-northeast-2.amazonaws.com
10749:s3.dualstack.ap-south-1.amazonaws.com
10750:s3.dualstack.ap-southeast-1.amazonaws.com
10751:s3.dualstack.ap-southeast-2.amazonaws.com
10752:s3.dualstack.ca-central-1.amazonaws.com
10753:s3.dualstack.eu-central-1.amazonaws.com
10754:s3.dualstack.eu-west-1.amazonaws.com
10755:s3.dualstack.eu-west-2.amazonaws.com
10756:s3.dualstack.eu-west-3.amazonaws.com
10757:s3.dualstack.sa-east-1.amazonaws.com
10758:s3.dualstack.us-east-1.amazonaws.com
10759:s3.dualstack.us-east-2.amazonaws.com
11780:app.os.stg.fedoraproject.org
12185:users.scale.virtualcloud.com.br
12193:it1.eur.aruba.jenv-aruba.cloud
12240:cloud.jelastic.open.tim.it
❯ rg '^[^/].*\..*\..*\..*\..*\.' public_suffix_list.dat
(no output: no rule has more than 5 levels) |
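A rough sketch of that cap, assuming a maximum PSL depth of 5 labels. MAX_PSL_LABELS is a hypothetical constant, not an existing tldextract option.

# Only the last few labels can ever match a PSL rule; the deepest rule in
# public_suffix_list.dat is 5 labels (see the rg output above), so anything
# to the left of the trailing 5 labels must belong to the subdomain.
MAX_PSL_LABELS = 5

def candidate_suffixes(hostname):
    tail = hostname.split(".")[-MAX_PSL_LABELS:]
    # Longest candidate first, built only from the trailing labels, so a
    # 20-label PTR record costs no more set lookups than a 5-label one.
    return [".".join(tail[i:]) for i in range(len(tail))]

print(candidate_suffixes("node-nnf.pool-1-1.dynamic.totinternet.net"))
# -> ['node-nnf.pool-1-1.dynamic.totinternet.net',
#     'pool-1-1.dynamic.totinternet.net',
#     'dynamic.totinternet.net',
#     'totinternet.net',
#     'net']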
Note that even for smaller domains, tldextract seems to be significantly slower than the others I've tested, publicsuffix2 and publicsuffixlist:

from timeit import timeit
def test(length):
    domain = 'a.' * length + 'net.uy'
    tlde = timeit(f"psl('{domain}')", setup="from tldextract import TLDExtract; psl=TLDExtract(suffix_list_urls=None)")
    ps2 = timeit(f"psl.get_public_suffix('{domain}')", setup="from publicsuffix2 import PublicSuffixList; psl=PublicSuffixList()")
    psl = timeit(f"psl.privatesuffix('{domain}')", setup="from publicsuffixlist import PublicSuffixList; psl=PublicSuffixList()")
    print(f"{length}: tlde {tlde:.2} / ps2 {ps2:.2} / psl {psl:.2} ({domain})")

for i in range(7):
    test(i)

results in:
(That's in seconds for 1 million invocations. Also, thanks to @lufte for the original test code over in qutebrowser/qutebrowser#756 (comment).) |
I've found a library that uses a trie data structure like the one @JustinAzoff described. Might be useful for any future attempts to improve tldextract performance. |
Some of the discussed optimizations were contributed in #283 (and more in #284) and released in 3.4.2. I don't know if it addresses this whole ticket yet, but give it a try! Thank you @elliotwutingfeng! |
Thanks @john-kurkowski for the merges. I've re-run the benchmarks by @The-Compiler. For typical use cases, the performance gap between tldextract and the other libraries is much narrower now.

Specs: Linux x64, Ryzen 7 5800X

Python 3.10

0: tlde 1.7 / ps2 1.6 / psl 0.6 (net.uy)
1: tlde 1.9 / ps2 1.8 / psl 1.0 (a.net.uy)
2: tlde 2.0 / ps2 1.8 / psl 1.3 (a.a.net.uy)
3: tlde 2.0 / ps2 1.8 / psl 1.7 (a.a.a.net.uy)
4: tlde 2.1 / ps2 1.8 / psl 1.9 (a.a.a.a.net.uy)
5: tlde 2.1 / ps2 1.9 / psl 1.8 (a.a.a.a.a.net.uy)
6: tlde 2.1 / ps2 1.9 / psl 1.9 (a.a.a.a.a.a.net.uy)

Interestingly, tldextract now outperforms the other libraries on PyPy for long FQDNs. Unless PyPy is caching the results somehow; I'm not sure about that.

PyPy 7.3.11 with GCC 12.2.0 (Python 3.9)

0: tlde 0.22 / ps2 0.28 / psl 0.11 (net.uy)
1: tlde 0.24 / ps2 0.31 / psl 0.29 (a.net.uy)
2: tlde 0.29 / ps2 0.33 / psl 0.44 (a.a.net.uy)
3: tlde 0.28 / ps2 0.36 / psl 0.6 (a.a.a.net.uy)
4: tlde 0.33 / ps2 0.38 / psl 0.68 (a.a.a.a.net.uy)
5: tlde 0.31 / ps2 0.39 / psl 0.68 (a.a.a.a.a.net.uy)
6: tlde 0.36 / ps2 0.4 / psl 0.69 (a.a.a.a.a.a.net.uy) |
I haven't heard further on this issue, so I'll call it a win. |
I recently used this library to extract the second-level domain names from the 1.2 billion PTR records in Rapid7's Sonar database. As an example, it would extract totinternet from node-nnf.pool-1-1.dynamic.totinternet.net. I used a Pool of 16 threads on an Intel Core i5 4670K @ 3.40 GHz and it finished in just over a day. I suspect the pool count might have been too high for four cores.

I'm trying to see if there are any optimisations that could be made to speed this 1.2B-record job up. I compared running the extractor on a single hostname one million times versus doing a million trivial string splitting and index operations. I know the two operations do very different things, but I'm trying to understand what the floor looks like performance-wise when dealing with these sorts of data types in Python.
For the edge cases tldextract can support, it looks to be operating very quickly already. If that 23.4 seconds per one million lookups were to be maintained, it would still take ~8 hours to do 1.2 billion lookups.
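A rough sketch of that kind of floor comparison with timeit; the hostname is the one from the flame graph test above, and the split-and-index baseline is only illustrative.

from timeit import timeit

HOST = 'kd111239172113.au-net.ne.jp'

# One million full extractions.
extract_time = timeit(
    "tld_ex(h)",
    setup="from tldextract import extract as tld_ex; h=%r" % HOST,
    number=1000000,
)

# One million trivial split-and-index operations, as a rough floor for this
# kind of string handling in pure Python.
split_time = timeit("h.split('.')[1]", setup="h=%r" % HOST, number=1000000)

print("extract: %.1fs, split baseline: %.1fs" % (extract_time, split_time))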
I'm wondering if there is something that could speed up the library even further. There appear to be around 7K TLDs that are looked up against; is there a way the search space could be reduced here? Could more common TLDs be compared against sooner?
Would porting the library to Cython or C be feasible?
Is there some obvious optimisation I'm missing in this library's basic usage?