Combining SQL Coverage Data unacceptably slow #761

Closed
strichter opened this issue Jan 25, 2019 · 5 comments
Labels
bug

Comments


@strichter strichter commented Jan 25, 2019

Describe the bug
Generating a combined coverage report is about 50-60 times slower with SQL-based coverage data than with JSON.

To Reproduce
Use coverage master with SQL data storage. You need a sizable set of coverage data files to combine.

Additional context
We are combining about 100-200 coverage files into one large one. The resulting SQL coverage file is about 390MB. It used to be able to combine those files in about 1 minute. Now it takes somewhere between 50-60 minutes.

The reason for the inefficiency is obvious: files are combined through the regular coverage measurement recording APIs, which are inappropriate here, since they cause one transaction per arc. SQL should instead be used properly, with bulk reads and inserts.
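To illustrate the difference, here is a minimal, hypothetical sketch (a toy table, not coverage.py's actual schema) contrasting one commit per arc with a single bulk `executemany()` insert:

```python
import sqlite3
import time

# Illustrative data only: 10,000 fake (fromno, tono) arcs.
rows = [(i, i + 1) for i in range(10_000)]

def insert_per_row(conn):
    # One transaction per arc: commit after every single insert.
    for fromno, tono in rows:
        conn.execute('INSERT INTO arc (fromno, tono) VALUES (?, ?)', (fromno, tono))
        conn.commit()

def insert_bulk(conn):
    # One transaction for everything: a single bulk insert.
    with conn:
        conn.executemany('INSERT INTO arc (fromno, tono) VALUES (?, ?)', rows)

def timed(fn):
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE arc (fromno INTEGER, tono INTEGER)')
    t0 = time.time()
    fn(conn)
    elapsed = time.time() - t0
    conn.close()
    return elapsed

print('per-row:', timed(insert_per_row))
print('bulk:   ', timed(insert_bulk))
```

Even against an in-memory database the bulk variant is noticeably faster; against a file on disk, where each commit forces a sync, the gap grows dramatically.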

@strichter strichter added the bug label Jan 25, 2019

@strichter strichter commented Jan 25, 2019

This issue should really have label:5.0-pre-release


@strichter strichter commented Jan 25, 2019

Okay, some more information. It turns out that the new context-based coverage produces about 50-60 times as much data as the context-free coverage. Our arc table now has 9.5M entries, but there are only 172k unique (file_id, fromno, tono) entries.
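The blow-up comes from the same arc being recorded once per measurement context. A small hypothetical illustration (simplified schema, invented numbers, not coverage.py's real tables):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE arc (file_id INTEGER, context_id INTEGER, '
    'fromno INTEGER, tono INTEGER)')

# Two distinct arcs in one file, each recorded under 50 different contexts.
rows = [(1, ctx, f, t) for ctx in range(50) for (f, t) in [(10, 11), (11, 12)]]
conn.executemany('INSERT INTO arc VALUES (?, ?, ?, ?)', rows)

total = conn.execute('SELECT COUNT(*) FROM arc').fetchone()[0]
unique = conn.execute(
    'SELECT COUNT(*) FROM (SELECT DISTINCT file_id, fromno, tono FROM arc)'
).fetchone()[0]
print(total, unique)  # 100 rows in the table, but only 2 distinct arcs
```

The same ratio at scale: 9.5M arc rows collapsing to 172k distinct (file_id, fromno, tono) tuples.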

We have been working all morning on optimizing combining and have come up with the following script that runs in 2:30 minutes, so it is somewhere between 20-30 times faster than the current master solution.

import os
import sys
import time
from coverage import sqldata


def main(args=sys.argv[1:]):
    combined_fn = args[0]
    part_fns = args[1:]

    arcs = []
    lines = []

    t0 = t1 = time.time()
    # Read every arc and line row (joined with file path and context) from each part.
    for path in part_fns:
        data = sqldata.CoverageSqliteData(path)
        data.read()
        with data._connect() as conn:
            cur = conn.execute(
                    'SELECT file.path, context.context, arc.fromno, arc.tono '
                    'FROM arc '
                    'INNER JOIN file on file.id = arc.file_id '
                    'INNER JOIN context on context.id = arc.context_id ')
            arcs.extend(cur)
            cur.close()

            cur = conn.execute(
                    'SELECT file.path, context.context, line.lineno '
                    'FROM line '
                    'INNER JOIN file on file.id = line.file_id '
                    'INNER JOIN context on context.id = line.context_id ')
            lines.extend(cur)
            cur.close()

            print(f'{path}: {time.time()-t1}')
            t1 = time.time()

    print('Total Loading:', time.time()-t0)

    t2 = time.time()

    # Collect the distinct file paths and contexts seen across all parts.
    files = set()
    contexts = set()
    for row in arcs:
        files.add(row[0])
        contexts.add(row[1])

    for row in lines:
        files.add(row[0])
        contexts.add(row[1])

    print('Compute:', time.time()-t2)

    files = dict(enumerate(files))
    contexts = dict(enumerate(contexts))

    t3 = time.time()

    # Write the merged data into the requested combined file.
    data = sqldata.CoverageSqliteData(combined_fn)
    data.erase()

    with data._connect() as mconn:
        mconn.isolation_level = 'IMMEDIATE'

        mconn.executemany(
            'insert into file (id, path) values (?, ?)', files.items())
        mconn.executemany(
            'insert into context (id, context) values (?, ?)', contexts.items())

        file_ids = {v: k for k, v in files.items()}
        context_ids = {v: k for k, v in contexts.items()}

        # Remap rows to the new integer ids before bulk-inserting.
        arc_rows = (
            (file_ids[file], context_ids[context], fromno, tono)
            for file, context, fromno, tono in arcs
        )
        line_rows = (
            (file_ids[file], context_ids[context], lineno)
            for file, context, lineno in lines
        )

        t4 = time.time()
        mconn.executemany(
            'insert or ignore into arc (file_id, context_id, fromno, tono) values (?, ?, ?, ?)',
            arc_rows)
        mconn.executemany(
            'insert or ignore into line (file_id, context_id, lineno) values (?, ?, ?)',
            line_rows)

    print('Insert:', time.time()-t3)

    print('Total:', time.time()-t0)


if __name__ == '__main__':
    main()

We are working on a PR for this. But it will require a change in how combining is done, since we consider all parts at once instead of one at a time.
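The "all parts at once" idea can be sketched in a few lines (a hypothetical helper, not the actual PR code): assign global integer ids for file paths and contexts across every part, then remap each part's rows before one bulk insert.

```python
def remap(parts):
    """Remap rows from many part files onto shared integer ids.

    parts: list of lists of (path, context, fromno, tono) tuples,
    one inner list per part file. Returns the id maps and the
    remapped rows, ready for a single bulk INSERT.
    """
    file_ids, context_ids = {}, {}
    out = []
    for rows in parts:
        for path, context, fromno, tono in rows:
            # setdefault assigns the next free id on first sight.
            fid = file_ids.setdefault(path, len(file_ids))
            cid = context_ids.setdefault(context, len(context_ids))
            out.append((fid, cid, fromno, tono))
    return file_ids, context_ids, out

parts = [
    [('a.py', 'test_one', 1, 2)],
    [('a.py', 'test_two', 1, 2)],
]
file_ids, context_ids, rows = remap(parts)
print(file_ids, context_ids, rows)
```

Because every part contributes to the same id maps, duplicate paths and contexts collapse automatically, and the final write is one `executemany()` per table.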


@strichter strichter commented Jan 25, 2019

I have been able to integrate the above using the existing APIs. A few tests are still failing, but I am planning to create a clean PR for this issue.

https://github.com/Shoobx/coveragepy/tree/fast-combine


@strichter strichter commented Jan 26, 2019

PR: #765


@strichter strichter commented Jan 27, 2019

Speedup is really just about 2x. There was another bug that caused a slowdown for us.
