How can I do 1.2 billion extractions faster? #175
I recently used this library to extract the second-level domain names from the 1.2 billion PTR records in Rapid7's Sonar database. As an example, it would extract `au-net.ne.jp` from `kd111239172113.au-net.ne.jp`.
I'm trying to see if there are any optimisations that could be made to speed this 1.2B-record job up. I compared running the extractor on a single hostname one million times versus doing a million trivial string splitting and index operations. I know the two operations below do very different things but I'm trying to understand what the floor looks like performance-wise when dealing with these sorts of data types in Python.
```python
from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

_ = [h.split('.')[:-3] for x in range(0, 1000000)]  # 1.26 seconds
_ = [tld_ex(h) for x in range(0, 1000000)]          # 23.4 seconds
```
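As an aside, `timeit` gives a steadier number than timing a list comprehension by hand, since it avoids materialising a million-element result list. A sketch of the same measurement (swap the lambda for `lambda: tld_ex(h)` to time tldextract the same way):

```python
import timeit

h = 'kd111239172113.au-net.ne.jp'

# Time a million plain splits without building a list of the results.
seconds = timeit.timeit(lambda: h.split('.')[:-3], number=1000000)
print('%.2f seconds per million splits' % seconds)
```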
For the edge cases tldextract can support, it looks to be operating very quickly already. But even if that 23.4 seconds per one million lookups were maintained, it would still take ~8 hours to do 1.2 billion lookups.
I'm wondering if there is something that could speed up the library even further. There appears to be around 7K TLDs that are looked up against, is there a way the search space could be reduced here? Could more common TLDs be compared against sooner?
Would porting the library to Cython (or a C extension) be feasible?
Is there some obvious optimisation I'm missing in this library's basic usage?
This is the code I ran. (I've fixed two bugs from my first draft: `ip2int` returned a one-element tuple rather than an int, and the output file handle was never closed.)

```python
from itertools import chain, islice
from multiprocessing import Pool
import socket
import struct
import sys
import uuid

from tldextract import extract as tld_ex


def ip2int(ip):
    # Convert a dotted-quad IPv4 string to an unsigned 32-bit integer.
    return struct.unpack('!I', socket.inet_aton(ip))[0]


def group(num_items, iterable):
    # Yield successive chunks of num_items items from iterable.
    iterable = iter(iterable)
    while True:
        chunk = islice(iterable, num_items)
        try:
            first_element = next(chunk)
        except StopIteration:
            return
        yield chain((first_element,), chunk)


def extract(manifest):
    file_, line = manifest
    try:
        ip, domain = line.split(',')
        with open(file_, 'a') as out:
            out.write('%d,%s\n' % (ip2int(ip), tld_ex(domain)))
    except (ValueError, OSError):
        pass  # skip malformed lines


file_names = ['out.%s.csv' % str(uuid.uuid4())[:6] for _ in range(0, 16)]
pool = Pool(16)

while True:
    payload = [(file_names[x], sys.stdin.readline().strip())
               for x in range(0, 16)]
    pool.map(extract, payload)
```
I like these questions. Reviewing the core algorithm, I thought it was O(k), where k is the number of parts of the input string. I thought we'd be bounded by set lookups. To your point, maybe there are set lookups it shouldn't try? It could exit sooner?
Where is it spending its time in those 23.4s?
That's a good question. I haven't run it through a flame graph or anything, I guess that would be a good place to start.
If it were the regex causing most of the load, do you think it would be possible to have two TLD lists: one with all 7K TLDs and one with the top 50 or so? Would something like that cause a lot of edge-case issues?
See what the problem is first. Again, my thinking is it's input bound, not PSL length bound. We'll see.
I've had decent luck with Python's builtin profiler. It dumps a text spreadsheet. I'm not up to date on Python state of the art. A flame graph would be nice to share.
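For reference, the builtin profiler mentioned above can be driven from a few lines of code; this sketch profiles a stand-in function (substitute the million-call tldextract loop for `busy_work`) and prints the top functions by cumulative time:

```python
import cProfile
import io
import pstats


def busy_work(n=50000):
    # Stand-in for the million tld_ex() calls; profile any callable this way.
    return sum(len(str(i).split('.')) for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
print(stream.getvalue())
```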
I've managed to produce a flame graph; GitHub requires the SVG to be ZIP-compressed in order to upload it. I'll leave the setup notes I used to produce this graph here in case they're of any use. This was all run on Ubuntu 16.04.2 LTS.
```
$ sudo apt update
$ sudo apt install \
    autoconf \
    automake \
    autotools-dev \
    g++ \
    git \
    libtool \
    make \
    pkg-config \
    python-dev \
    virtualenv

$ virtualenv ~/.test
$ source ~/.test/bin/activate
$ pip install tldextract
```
```
$ vi ~/test.py
```

```python
from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

_ = [tld_ex(h) for x in range(0, 1000000)]
```
```
$ git clone https://github.com/brendangregg/FlameGraph ~/flamegraph
$ git clone https://github.com/uber/pyflame ~/pyflame

$ cd ~/pyflame
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
```
```
$ cd ~
$ pyflame -s 10 -r 0.01 -o ~/perf.data -t python ~/test.py
$ ~/flamegraph/flamegraph.pl ~/perf.data > ~/perf.svg
```
I'm somewhat familiar with this process. Looking into things on my laptop here (2.7 GHz Intel Core i7), I get the following performance to start:
Writing the following function removes the extra processing related to URLs and punycode:
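A minimal sketch of what such a stripped-down function could look like, assuming the suffixes are held in a plain set (`SUFFIXES` here is a tiny illustrative stand-in for the ~7K Public Suffix List entries tldextract loads, and wildcard/exception rules are ignored):

```python
# Illustrative stand-in for the full Public Suffix List data.
SUFFIXES = {'com', 'jp', 'ne.jp', 'uk', 'co.uk'}


def fast_extract(hostname):
    """Split hostname into (subdomain, domain, suffix), longest suffix first."""
    labels = hostname.split('.')
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in SUFFIXES:
            return ('.'.join(labels[:i - 1]) if i > 1 else '',
                    labels[i - 1] if i > 0 else '',
                    candidate)
    # No known suffix: treat the last label as the suffix-less domain.
    return ('.'.join(labels[:-1]), labels[-1], '')
```

This skips URL scheme stripping and IDNA/punycode handling entirely, which is fine for pre-cleaned PTR hostnames but not for general input.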
So that's half the time gone, but it's still almost 4x slower than the plain split.
When I have implemented this in the past, I have used a trie-type structure mapping the TLDs from right to left, so visually something like:
So, if you are parsing the domain from right to left and you see 'com', you hit the end of the tree and know you have a TLD; but if you see 'uk', you need to check whether the next part is 'co' or 'ac', etc.
This looks something like this:
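A minimal sketch of that reversed-label trie, with `SUFFIX_RULES` standing in for the real PSL data (wildcard and exception rules omitted, as noted below):

```python
# Tiny illustrative rule list; the real one has ~7K entries.
SUFFIX_RULES = ['com', 'jp', 'ne.jp', 'uk', 'co.uk', 'ac.uk']


def build_trie(rules):
    """Build a nested-dict trie keyed by labels read right to left."""
    root = {}
    for rule in rules:
        node = root
        for label in reversed(rule.split('.')):
            node = node.setdefault(label, {})
        node['$'] = True  # marks the end of a valid suffix
    return root


def split_suffix(hostname, trie):
    """Return (registrable part, suffix) by walking labels right to left."""
    labels = hostname.split('.')
    node = trie
    matched = 0  # number of trailing labels confirmed as suffix
    for depth, label in enumerate(reversed(labels), start=1):
        if label not in node:
            break
        node = node[label]
        if '$' in node:
            matched = depth
    cut = len(labels) - matched
    return '.'.join(labels[:cut]), '.'.join(labels[cut:])
```

The walk stops as soon as a label has no child in the trie, so for a long hostname only the last one or two labels are ever examined.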
I basically ignored the `*.` and `!` rules for now; I don't think they're hard to handle, but I'm not familiar with them.
Re-running the test gives:
This gives the biggest win for long domain names. They are still both O(k), but going from right to left means k ends up being 1 or 2 in the most common cases.
I also like the clear stats. Thank you for that effort. I didn't understand the flame graph. Nor my own cProfile spelunking. (The best I could blame was the years-old, obtuse chain of string manipulation, near the beginning of processing 1 input. It's a bit wasteful. I dreaded making it more obtuse in the name of maybe speeding up the OP's case. I guess that's still on the table, but we'll cross that bridge if we come to it.)