Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getindex(::Column) performance #252

Merged
merged 20 commits into from
Aug 22, 2022
Merged

getindex(::Column) performance #252

merged 20 commits into from
Aug 22, 2022

Conversation

willtebbutt
Copy link
Contributor

@willtebbutt willtebbutt commented Jul 31, 2022

Starts to address #250 . Entirely fixes the bug as described in that issue (Float64s, each call to getindex allocates nothing).

The changes include:

  1. introducing a layer of indirection (thanks @tpgillam for this suggestion) in the PQValue type to prevent the need to allocate when result (which is a mutable struct) is used to construct a PQValue.
  2. moving a lot of type information into the type of PQValue and Column so that it can be statically resolved.
  3. Replacing the parse_type closure with the ParseType{T} callable to force the compiler to specialise.

I need advice as to how to proceed (@iamed2 I would like to take you up on your offer in #250 :) ) :

  1. Have I gone about this in a sensible way -- I don't have the overview of the package's architecture that you do @iamed2, so I might be missing something obvious.
  2. how should I write performance tests for this? Is there a good place in runtests into which I should hook? It's a rather large file, so I'm not completely sure what a good entrance point would be.
  3. What types should I be testing for performance here? Clearly floats / ints / dates etc are good targets -- do we have a collection of examples columns instantiated somewhere in the tests that I could attach some performance tests to?

edit: interestingly, it actually looks like the indirection wasn't necessary in order to remove the allocations. It appears to be the case that everything is sorted by being slightly more assertive in the specialisations that the compiler is made / allowed to perform. Even more interestingly, it doesn't even look like we need to specialise on the typeof jl_result in PQValue.

src/parsing.jl Outdated Show resolved Hide resolved
src/parsing.jl Outdated Show resolved Hide resolved
src/parsing.jl Outdated Show resolved Hide resolved
src/parsing.jl Outdated Show resolved Hide resolved
src/tables.jl Outdated Show resolved Hide resolved
src/tables.jl Outdated Show resolved Hide resolved
src/tables.jl Outdated Show resolved Hide resolved
src/tables.jl Outdated Show resolved Hide resolved
src/tables.jl Outdated Show resolved Hide resolved
src/tables.jl Outdated Show resolved Hide resolved
@willtebbutt willtebbutt changed the title [WIP] getindex performance getindex(::Column) performance Aug 12, 2022
src/tables.jl Show resolved Hide resolved
src/parsing.jl Show resolved Hide resolved
@iamed2
Copy link
Collaborator

iamed2 commented Aug 12, 2022

Have I gone about this in a sensible way -- I don't have the overview of the package's architecture that you do @iamed2, so I might be missing something obvious.

This is excellent, I love how small and targeted it is now.

@iamed2
Copy link
Collaborator

iamed2 commented Aug 12, 2022

What types should I be testing for performance here? Clearly floats / ints / dates etc are good targets -- do we have a collection of examples columns instantiated somewhere in the tests that I could attach some performance tests to?

We should test with binary format and with text format, and with custom conversion functions (the conversions argument to execute) and custom types (type_map and column_types arguments to execute). We should also check with not_null and without. Testing with those types are good, as well as strings, and it might make sense to test with array types or interval types since those are a bit more complicated type-wise.

What you could do is create a single row with the data of types you care about (in text), then create a temporary table and insert that row N times. You can then select the contents of the table (several times for different args to execute) and measure parsing on each column or together. The test_data variables in the test file have examples of PostgreSQL literals and what they parse to.

how should I write performance tests for this? Is there a good place in runtests into which I should hook? It's a rather large file, so I'm not completely sure what a good entrance point would be.

Given that the benefits are so very huge, it could be worth just recording the memory estimate and time from a benchmark generated on this branch now and just testing that we haven't exceeded it by ~2-3x. You can add a new testset in the results section. Long term it would be good to set up a proper benchmark suite using https://juliaci.github.io/PkgBenchmark.jl/stable/.

@willtebbutt
Copy link
Contributor Author

willtebbutt commented Aug 17, 2022

What you could do is create a single row with the data of types you care about (in text), then create a temporary table and insert that row N times. You can then select the contents of the table (several times for different args to execute) and measure parsing on each column or together. The test_data variables in the test file have examples of PostgreSQL literals and what they parse to.

I'm actually not sure how to proceed with this. Could you provide an example?

@iamed2
Copy link
Collaborator

iamed2 commented Aug 17, 2022

There are examples of all the pieces in the nebulous runtests.jl so it's just a matter of pulling them together in the right way. Collecting them:

This creates a temporary table that will be destroyed when you close the connection. This particular table has one identity column (you can ignore it) and then one string column, for other types you can just add items to the column list replace varchar with e.g. timestamptz.

        conn = LibPQ.Connection("dbname=postgres user=$DATABASE_USER")

        result = execute(conn, """
            CREATE TEMPORARY TABLE libpqjl_test (
                id    bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
                my_string_column   varchar
            );
        """)

This code inserts a row (you can repeat the line to insert multiple rows; they will get a different id):

    result = execute(conn, "INSERT INTO libpqjl_test (my_string_column) VALUES (\$1)", ["foo"])

or

    result = execute(conn, "INSERT INTO libpqjl_test (my_string_column) VALUES ('foo')")

In the latter version, anything in this variable test_data in the first tuple element can be substituted for 'foo'. When SELECTing, it should parse as the second tuple element by default.

In this other test_data, we have a middle column. When you specify that type in a type_map or column_types argument, you should expect the third tuple element as the result when SELECTing.

I think that's a good place to start, I can explain the other bits once you've made it through the above, so that we have that shared context.

Happy to have a call or a pair-programming session to help you get started. But it's worth saying I don't think performance tests should prevent this MR from going in.

@willtebbutt
Copy link
Contributor Author

Thanks for the advice @iamed2 . I've added some regression tests (wound up only using one of the two test data sets). Let me know if you think I've missed anything!

test/runtests.jl Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test_bug/test_bug.jl Outdated Show resolved Hide resolved
test_bug/test_bug.jl Outdated Show resolved Hide resolved
test_bug/test_bug.jl Outdated Show resolved Hide resolved
test_bug/test_bug.jl Outdated Show resolved Hide resolved
test_bug/test_bug.jl Outdated Show resolved Hide resolved
willtebbutt and others added 2 commits August 19, 2022 08:07
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
@willtebbutt
Copy link
Contributor Author

willtebbutt commented Aug 19, 2022

Note: we should definitely squash + merge this one when it goes in, because I accidentally committed a bunch of larger files that we don't want in the history (just profiling-related stuff)

test/runtests.jl Outdated Show resolved Hide resolved
Co-authored-by: Eric Davies <iamed2@gmail.com>
@willtebbutt
Copy link
Contributor Author

Thanks @iamed2 . Will merge when CI passes

test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
@willtebbutt
Copy link
Contributor Author

Having said that, I don't have merge right.

@iamed2 pls squash + merge when you get a minute :) Thanks again for the review and helping me get this across the line!

@iamed2 iamed2 merged commit dca8d75 into master Aug 22, 2022
@iamed2 iamed2 deleted the wct/getindex-performance branch August 22, 2022 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants