How can I do 1.2 billion extractions faster? #175

Open · 10 comments

marklit commented Sep 28, 2019

I recently used this library to extract the second-level domain names from the 1.2 billion PTR records in Rapid7's Sonar database. As an example, it would extract totinternet from node-nnf.pool-1-1.dynamic.totinternet.net. I used a multiprocessing Pool of 16 workers on an Intel Core i5-4670K @ 3.40 GHz and it finished in just over a day. I suspect the pool count might have been too high for four cores.

I'm trying to see if there are any optimisations that could be made to speed up this 1.2B-record job. I compared running the extractor on a single hostname one million times versus doing a million trivial string-splitting and index operations. I know the two operations below do very different things, but I'm trying to understand what the floor looks like performance-wise when dealing with these sorts of data in Python.

from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

_ = [h.split('.')[:-3] for x in range(0, 1000000)] # 1.26 seconds
_ = [tld_ex(h) for x in range(0, 1000000)] # 23.4 seconds
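(For anyone who wants to reproduce these numbers outside a notebook, a rough equivalent using the stdlib timeit module, with the same hostname and iteration count as above:)

import timeit

from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

# Time one million iterations of each operation, as in the comparison above.
split_secs = timeit.timeit(lambda: h.split('.')[:-3], number=1000000)
extract_secs = timeit.timeit(lambda: tld_ex(h), number=1000000)
print('split: %.2fs, tldextract: %.2fs' % (split_secs, extract_secs))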

For all the edge cases tldextract supports, it looks to be operating very quickly already. If that 23.4 seconds per one million lookups were maintained, it would still take roughly 8 hours to do 1.2 billion lookups (1,200 × 23.4 s ≈ 28,000 s ≈ 7.8 hours).

I'm wondering if there is something that could speed up the library even further. There appear to be around 7K TLDs that are looked up against. Is there a way the search space could be reduced here? Could more common TLDs be compared against sooner?

Would porting the library to Cython be feasible?

Is there some obvious optimisation I'm missing in this library's basic usage?
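One basic-usage tweak that is documented is constructing a single extractor up front with the live suffix-list fetch disabled, so every call uses the bundled snapshot. I doubt it changes the per-call cost much, but it removes a variable. A sketch (the suffix_list_urls keyword is taken from the README and may be spelled differently in other versions):

import tldextract

# One extractor for the whole job; disable the live PSL fetch so the snapshot
# bundled with the package is used. (suffix_list_urls per the README; the
# exact keyword may vary between versions.)
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=None)

def second_level(hostname):
    # e.g. 'node-nnf.pool-1-1.dynamic.totinternet.net' -> 'totinternet'
    return no_fetch_extract(hostname).domain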

john-kurkowski (Owner) commented Sep 30, 2019

Is there some obvious optimisation I'm missing in this library's basic usage?

I don't think so. This scale is uncharted territory. 😃

john-kurkowski (Owner) commented Sep 30, 2019

I used a multiprocessing Pool of 16 workers

Can you share the rough code here? I would expect this in-memory, CPU-oriented splitting and traversal to parallelize well.

marklit (Author) commented Sep 30, 2019

This is the code I ran.

from   itertools import chain, islice
from   multiprocessing import Pool
import socket
import struct
import sys
import uuid

from tldextract import extract as tld_ex


ip2int = lambda x: struct.unpack("!I", socket.inet_aton(x))[0]


def group(num_items, iterable):
    iterable = iter(iterable)

    while True:
        chunk = islice(iterable, num_items)

        try:
            first_element = next(chunk)
        except StopIteration:
            return

        yield chain((first_element,), chunk)


def extract(manifest):
    file_, line = manifest
    try:
        ip, domain = line.split(',')
        open(file_, 'a')\
            .write('%d,%s\n' % (ip2int(ip),
                                tld_ex(domain)[1]))
    except Exception:
        # Skip malformed lines rather than crashing the worker.
        pass


file_names = ['out.%s.csv' % str(uuid.uuid4())[:6]
              for _ in range(0, 16)]

pool = Pool(16)

while True:
    payload = [(file_names[x], sys.stdin.readline().strip())
               for x in range(0, 16)]

    # readline() returns '' at EOF, so stop once input is exhausted.
    if not payload[0][1]:
        break

    pool.map(extract, payload)
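In hindsight, feeding stdin through the pool in large chunks would probably parallelise better than mapping 16 lines per pool.map() call, and a single output file written from the parent avoids reopening a file per line. A Python 3 sketch (the file name and chunk size are illustrative):

import socket
import struct
import sys
from multiprocessing import Pool

from tldextract import extract as tld_ex


def ip2int(ip):
    return struct.unpack("!I", socket.inet_aton(ip))[0]


def extract(line):
    try:
        ip, domain = line.strip().split(',')
        return '%d,%s\n' % (ip2int(ip), tld_ex(domain)[1])
    except (ValueError, OSError):
        return None  # skip malformed lines


if __name__ == '__main__':
    # Workers sized to physical cores; a large chunksize amortises the cost
    # of handing work to them over IPC.
    with Pool(4) as pool, open('out.csv', 'w') as out:
        for row in pool.imap_unordered(extract, sys.stdin, chunksize=10000):
            if row:
                out.write(row)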

john-kurkowski (Owner) commented Sep 30, 2019

23.4 seconds per one million lookups … is there a way the search space could be reduced here? Could more common TLDs be compared against sooner?

I like these questions. Reviewing the core algorithm, I thought it was O(k), where k is the number of parts of the input string. I thought we'd be bounded by set lookups. To your point, maybe there are set lookups it shouldn't try? It could exit sooner?

Where is it spending its time in those 23.4s?

marklit (Author) commented Sep 30, 2019

That's a good question. I haven't run it through a flame graph or anything; I guess that would be a good place to start.

If it were the regex causing most of the load, do you think it would be possible to have two TLD lists, one with all 7K TLDs and one with the top 50 or so? Would something like that cause a lot of edge-case issues?
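To make that concrete, a rough sketch of the two-list idea (the set of "simple" TLDs is purely illustrative; the shortcut is only valid for TLDs that have no multi-label suffixes beneath them on the list actually in use, which would need checking):

from tldextract import extract as tld_ex

# Fast path: TLDs assumed to have no multi-label public suffixes beneath
# them, so "last label == TLD" is decisive. Illustrative only; verify against
# the real suffix list before relying on it.
COMMON_SIMPLE_TLDS = frozenset(['com', 'org', 'edu', 'gov'])


def two_tier_extract(hostname):
    labels = hostname.lower().split('.')
    if len(labels) >= 2 and labels[-1] in COMMON_SIMPLE_TLDS:
        return '.'.join(labels[:-2]), labels[-2], labels[-1]
    # Slow path: the full public-suffix lookup.
    result = tld_ex(hostname)
    return result.subdomain, result.domain, result.suffix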

john-kurkowski (Owner) commented Oct 1, 2019

See what the problem is first. Again, my thinking is it's input bound, not PSL length bound. We'll see.

I've had decent luck with Python's builtin profiler. It dumps a text spreadsheet. I'm not up to date on Python state of the art. A flame graph would be nice to share.
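For example, something like this against the micro-benchmark from the first comment (cProfile and pstats are both in the standard library):

import cProfile
import pstats

from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000000):
    tld_ex(h)
profiler.disable()

# Print the 20 functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)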

marklit (Author) commented Oct 2, 2019

I've managed to produce a flame graph; GitHub requires the SVG to be ZIP-compressed in order to upload it. I'll leave the setup notes I used to produce the graph here in case they're of any use. This was all run on Ubuntu 16.04.2 LTS.

$ sudo apt update
$ sudo apt install \
    autoconf \
    automake \
    autotools-dev \
    g++ \
    git \
    libtool \
    make \
    pkg-config \
    python-dev \
    virtualenv

$ virtualenv ~/.test
$ source ~/.test/bin/activate
$ pip install tldextract
$ vi ~/test.py
from tldextract import extract as tld_ex

h = 'kd111239172113.au-net.ne.jp'

_ = [tld_ex(h) for x in range(0, 1000000)]
$ git clone https://github.com/brendangregg/FlameGraph ~/flamegraph
$ git clone https://github.com/uber/pyflame ~/pyflame
$ cd ~/pyflame
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ cd ~
$ pyflame -s 10 -r 0.01 -o ~/perf.data -t python ~/test.py
$ ~/flamegraph/flamegraph.pl ~/perf.data > ~/perf.svg

JustinAzoff commented Jan 2, 2020

Hi!

I'm somewhat familiar with this process. Looking into things on my laptop (2.7 GHz Intel Core i7), I get the following performance to start:

In [2]: %time _ = [h.split('.')[:-3] for x in range(0, 1000000)]
CPU times: user 801 ms, sys: 76 ms, total: 877 ms
Wall time: 878 ms

In [3]: %time _=[tld_ex(h) for x in range(0, 1000000)]
CPU times: user 7.66 s, sys: 158 ms, total: 7.82 s
Wall time: 7.85 s

Writing the following function to remove the extra processing related to URLs and punycode:

from tldextract import TLDExtract

# Reuse the library's suffix lookup directly, skipping the URL and punycode
# handling in the public extract() wrapper. The default argument builds the
# stock suffix extractor once, at function definition time.
def ex(netloc, _ex=TLDExtract()._get_tld_extractor()):
    labels = netloc.split(".")
    suffix_index = _ex.suffix_index(labels)
    suffix = ".".join(labels[suffix_index:])
    subdomain = ".".join(labels[:suffix_index - 1]) if suffix_index else ""
    domain = labels[suffix_index - 1] if suffix_index else ""
    return subdomain, domain, suffix

gives

In [1]: %time _ = [ex(h) for x in range(0, 1000000)]
CPU times: user 3.23 s, sys: 99.8 ms, total: 3.33 s
Wall time: 3.34 s

so, half the time gone, but still almost 4x slower.

When I have implemented this in the past, I have used a trie-type structure mapping the TLDs from right to left, so visually something like:

com
uk
  co
  ac
  ..
jp
  ne
  ..
..

So, if you are parsing the domain from right to left and you see 'com', you hit the end of the tree and know you have a TLD; but if you see 'uk', you need to check whether the next part is 'co' or 'ac', etc.

In code, that looks something like this:

import collections
def tree():
    return collections.defaultdict(tree)

class _PublicSuffixListTLDExtractor(object):
    """Wrapper around this project's main algo for PSL
    lookups.
    """

    def __init__(self, tlds):
        # Build a trie of suffix labels keyed right to left,
        # e.g. 'co.uk' becomes trie['uk']['co'].
        trie = tree()
        for tld in tlds:
            loc = trie
            for part in reversed(tld.split(".")):
                part = part.lstrip('*!.')
                loc = loc[part]
        self.trie = trie

    def suffix_index(self, lower_spl):
        """Return the index of the first suffix label in lower_spl.
        Returns len(lower_spl) if no suffix is found.
        """
        length = len(lower_spl)
        loc = self.trie
        for i in range(length-1, -1, -1):
            key = lower_spl[i]
            if key not in loc:
                return i+1
            loc = loc[key]

        return length

I basically ignored the *. and ! entries for now; I don't think they are hard to handle, but I'm not familiar with them.
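Wiring the trie into the ex() helper from above looks roughly like this (assuming the stock extractor exposes its suffix list via a tlds attribute, which may differ between tldextract versions):

from tldextract import TLDExtract

# Build the trie from the same suffix list the stock extractor uses.
_stock = TLDExtract()._get_tld_extractor()
_trie_extractor = _PublicSuffixListTLDExtractor(_stock.tlds)


def ex(netloc, _ex=_trie_extractor):
    labels = netloc.split(".")
    suffix_index = _ex.suffix_index(labels)
    suffix = ".".join(labels[suffix_index:])
    subdomain = ".".join(labels[:suffix_index - 1]) if suffix_index else ""
    domain = labels[suffix_index - 1] if suffix_index else ""
    return subdomain, domain, suffix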

re-running the test gives:

In [6]: %time _ = [ex(h) for x in range(0, 1000000)]
CPU times: user 1.76 s, sys: 90.6 ms, total: 1.86 s
Wall time: 1.86 s

This gives the biggest win for long domain names. Both versions are still O(k), but going from right to left means the lookup stops just past the suffix, so k is effectively 1 or 2 labels in the most common cases.

before

In [2]: %time _ = [ex("a.really.long.domain.name.with.a.lot.of.parts.com") for x in range(0, 1000000)]
CPU times: user 9.92 s, sys: 152 ms, total: 10.1 s
Wall time: 10.1 s

after:

In [1]: %time _ = [ex("a.really.long.domain.name.with.a.lot.of.parts.com") for x in range(0, 1000000)]
CPU times: user 1.88 s, sys: 108 ms, total: 1.99 s
Wall time: 2 s

john-kurkowski (Owner) commented Jan 4, 2020

a.really.long.domain.name.with.a.lot.of.parts.com

That made it click for me. Yes, a trie going right to left would help for these pathological inputs (yet common inputs for certain use cases). I like that the optimization is not much more code.

john-kurkowski (Owner) commented Jan 4, 2020

I also like the clear stats. Thank you for that effort. I didn't understand the flame graph. Nor my own cProfile spelunking. (The best I could blame was the years-old, obtuse chain of string manipulation, near the beginning of processing 1 input. It's a bit wasteful. I dreaded making it more obtuse in the name of maybe speeding up the OP's case. I guess that's still on the table, but we'll cross that bridge if we come to it.)
