
a less painful solution for doing a bulk insert #50

Closed
ExpandingMan opened this issue Sep 20, 2018 · 38 comments

@ExpandingMan
Contributor

Currently it is recommended that inserts be done with Data.stream!. The problem is that this currently performs a row-by-row insert, which is unbelievably slow (for me it was at least 10 minutes for 5e4 rows over a perfectly good connection). It seems that libpq itself does not provide us with many good alternatives.

One suggestion, thanks to Ibilli on Discourse, is to write a DataFrame to a buffer as CSV and upload it with a COPY statement. While horrific, this might be the best option, since a large insert might otherwise be completely impractical.
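
For concreteness, here is a minimal, untested sketch of that CSV-buffer-plus-COPY idea. The table name mytable is made up, and it assumes the LibPQ.CopyIn API described in the COPY section of the docs linked later in this thread:

using CSV, DataFrames, LibPQ

buf = IOBuffer()
CSV.write(buf, df)  # df is the DataFrame to upload
seekstart(buf)

# Send the buffered CSV (including its header line) through COPY ... FROM STDIN
copyin = LibPQ.CopyIn("COPY mytable FROM STDIN (FORMAT CSV, HEADER);", eachline(buf; keep=true))
execute(conn, copyin)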

Now, I realize that the posted solution involves both DataFrames and CSV and that these will not be added as dependencies. My intention in opening this issue was to discuss:

  1. Should some function be added to LibPQ to make implementing this easier? Perhaps something like bulkcopy(::IO) and bulkcopy(::String)? (As far as I can tell there's nothing like that here now.)
  2. Is there a better solution lurking around that I'm just not seeing? If so I'd be glad to make a PR to document it.

Thanks all.

@iamed2
Collaborator

iamed2 commented Sep 20, 2018

It should definitely be possible to do a row-by-row insert faster than that. Were you using Data.stream! to a Statement as shown in the example? Did you surround it in a transaction? What were the types of the parameters?

@ExpandingMan
Contributor Author

ExpandingMan commented Sep 20, 2018

Yes, I am doing Data.stream! to a Statement as in the example. I'm doing

Data.stream!(df, LibPQ.Statement, cnxn, str)

where df is a DataFrame, str is an insert statement of the same form as in the example (usually with about 10 columns).

I'm not sure how to surround it in a transaction, could you provide an example?

@iamed2
Collaborator

iamed2 commented Sep 20, 2018

Put execute(conn, "BEGIN") before and execute(conn, "COMMIT") after

@iamed2
Collaborator

iamed2 commented Sep 20, 2018

This tells PostgreSQL "record but don't actually make the changes to the db until I'm done"
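
A minimal sketch of that pattern, wrapping the Data.stream! call from above (mytable and its columns are placeholders):

execute(conn, "BEGIN;")
Data.stream!(df, LibPQ.Statement, conn,
    "INSERT INTO mytable (a, b) VALUES (\$1, \$2)")  # placeholder table/columns
execute(conn, "COMMIT;")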

@ExpandingMan
Contributor Author

That makes a hell of a lot of sense.

Sorry, this is probably due to my SQL inexperience. Still, if I get it working nicely I'll do a PR to improve the example, if you don't mind.

@iamed2
Collaborator

iamed2 commented Sep 20, 2018

Absolutely, I'd love better examples :)

@galenlynch
Contributor

I've also been struggling with slow bulk inserts. I didn't know that transactions had such a big impact on performance!

@johannspies

johannspies commented Sep 25, 2018

I use these functions:

using LibPQ
using LibPQ.libpq_c: PQputCopyData, PQputCopyEnd  # raw libpq COPY wrappers

# `dbcon()` is my own helper that returns a LibPQ.Connection.
function runexecutequery(query)
    conn = dbcon()
    result = execute(conn, query)
    clear!(result)
    close(conn)
end

"""
    runcopyquery(conn, query, list)

`conn` is the connection to PostgreSQL. `query` must name the table and the
fields to copy `list` into, e.g.

    "COPY a_uts (ut, id) FROM stdin;\n"

Each element of `list` must hold the field values in the same order as the
columns named in the query.
"""
function runcopyquery(conn, query, list)
    execute(conn, query)
    for fields in list
        buf = string(join(fields, '\t'), '\n')
        PQputCopyData(conn.conn, pointer(buf), Cint(length(buf)))
    end
    PQputCopyEnd(conn.conn, C_NULL)
end
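
A hypothetical usage sketch for the helper above, with made-up connection info and data:

conn = LibPQ.Connection("dbname=mydb")                 # made-up connection string
rows = [("WOS:000000001", 1), ("WOS:000000002", 2)]    # made-up (ut, id) pairs
runcopyquery(conn, "COPY a_uts (ut, id) FROM stdin;\n", rows)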

@iamed2
Collaborator

iamed2 commented Sep 25, 2018

This week I will:

  • Add a hint about transactions in the documentation, and an example
  • Implement a COPY function

@ExpandingMan
Contributor Author

Hm, Data.stream! is still incredibly slow for me, even when enclosing it in execute(cnxn, "BEGIN") and execute(cnxn, "COMMIT").

@iamed2
Collaborator

iamed2 commented Sep 26, 2018

What are the types of your DataFrame columns?

@ExpandingMan
Contributor Author

The one I just did that was pretty slow was (UUID, String, DateTime, Int, Int, Int).

@iamed2
Collaborator

iamed2 commented Sep 26, 2018

Can you try running LibPQ.parameter_pointers(LibPQ.string_parameters(row)) on each row of the DataFrame and getting the total time for that? That would really help narrow the issue down.

@ExpandingMan
Contributor Author

I'm not sure whether you mean the rows or the columns here (i.e. should I collect the rows into Vector{Any}?)

In the worst case scenario of doing this row-wise, it takes less than half a second, so I don't think this is the problem.

@iamed2
Collaborator

iamed2 commented Sep 27, 2018

In the worst case scenario of doing this row-wise, it takes less than half a second, so I don't think this is the problem.

How exactly are you doing this? I can compare that with the Data.stream methods to figure out where the shenanigans are happening.

Btw, the Data.stream! method is going row-wise as well.

@ExpandingMan
Contributor Author

I did

v = [Any[c for c in row] for row in eachrow(df)]
LibPQ.parameter_pointers.(LibPQ.string_parameters.(v))

Yes, I realize Data.stream! is row-wise. I'm using the tagged version of LibPQ so the logger really, really likes to remind me of that fact 😆

@iamed2
Collaborator

iamed2 commented Sep 27, 2018

Oh! Hmm. Maybe outputting the log on every row is actually what's causing it. Could you try running it again after Memento.setlevel!(getlogger(LibPQ), "warn")?

@ExpandingMan
Contributor Author

I'll just try running it on LibPQ master where we changed it to debug...

@ExpandingMan
Contributor Author

Even with the log output hidden and with the "BEGIN" and "COMMIT" statements, it took 4 minutes to upload 2.7e4 rows of the types I gave above.

@iamed2
Collaborator

iamed2 commented Sep 27, 2018

Alright I think I have enough to do a deep perf dive now.

@iamed2
Collaborator

iamed2 commented Sep 27, 2018

Thanks for running things for me!

@ExpandingMan
Contributor Author

Thanks for working on it!

@iamed2
Collaborator

iamed2 commented Sep 28, 2018

On Julia 1.0.1 I did:

julia> using LibPQ, DataFrames, DataStreams, UUIDs, Dates

julia> df = DataFrame(a=map(_->uuid4(),1:27000), b=fill("foo", 27000), c=fill(DateTime(2015,1,1,1,1,1), 27000), d=collect(1:27000), e=collect(1:27000), f=collect(1:27000))
27000×6 DataFrame
│ Row   │ a                                            │ b      │ c                   │ d     │ e     │ f     │
│       │ UUID                                         │ String │ DateTime            │ Int64 │ Int64 │ Int64 │
├───────┼──────────────────────────────────────────────┼────────┼─────────────────────┼───────┼───────┼───────┤
│ 1     │ UUID("cd3d8684-6c79-44af-bf53-0eba7e1842f3") │ foo    │ 2015-01-01T01:01:01 │ 1     │ 1     │ 1     │
│ 2     │ UUID("f0d6f314-fed1-4db6-8ad2-2dfe31ef7185") │ foo    │ 2015-01-01T01:01:01 │ 2     │ 2     │ 2     │
│ 3     │ UUID("71c36602-b9f8-4398-b3fe-9d37a8c0d931") │ foo    │ 2015-01-01T01:01:01 │ 3     │ 3     │ 3     │
│ 4     │ UUID("e924d31c-a055-455f-9140-e09a44540e3f") │ foo    │ 2015-01-01T01:01:01 │ 4     │ 4     │ 4     │
│ 5     │ UUID("e779d993-2cef-43dc-a747-d9076dc1e2ea") │ foo    │ 2015-01-01T01:01:01 │ 5     │ 5     │ 5     │
⋮
│ 26998 │ UUID("bd8ea0b1-4a4f-4ed6-899f-706185bc6ce8") │ foo    │ 2015-01-01T01:01:01 │ 26998 │ 26998 │ 26998 │
│ 26999 │ UUID("15b949b7-2441-4091-9c65-b26063d837b7") │ foo    │ 2015-01-01T01:01:01 │ 26999 │ 26999 │ 26999 │
│ 27000 │ UUID("7da5c207-fc73-47c9-86d4-63b7612da1b8") │ foo    │ 2015-01-01T01:01:01 │ 27000 │ 27000 │ 27000 │

julia> conn = LibPQ.Connection("dbname=postgres");

julia> execute(conn, "CREATE TABLE insert_perf (a varchar(37) PRIMARY KEY, b varchar, c timestamp, d bigint, e bigint, f bigint)");

julia> execute(conn, "BEGIN;");

julia> @time Data.stream!(df, LibPQ.Statement, conn, "INSERT INTO insert_perf (a, b, c, d, e, f) VALUES (\$1, \$2, \$3, \$4, \$5, \$6)")
  7.453611 seconds (13.25 M allocations: 642.813 MiB, 4.29% gc time)
PostgreSQL prepared statement named __libpq_stmt_0__ with query INSERT INTO insert_perf (a, b, c, d, e, f) VALUES ($1, $2, $3, $4, $5, $6)

julia> execute(conn, "COMMIT;");

What version of Julia are you running? How big are your strings?

@ExpandingMan
Contributor Author

ExpandingMan commented Nov 30, 2018

I'd forgotten about this issue, but I've never been able to actually resolve it. I don't have terribly long strings; the only difference I see is that you are using a string type instead of the uuid type (and I'm pretty sure that's not the issue, since in the past I've done tables without uuid).

It actually seems surprisingly tricky to run a local database for testing, so I haven't been able to confirm your result yet.

@iamed2
Collaborator

iamed2 commented Nov 30, 2018

It actually seems surprisingly tricky to run a local database for testing, so I haven't been able to confirm your result yet.

What platform? Maybe I can help

@ExpandingMan
Contributor Author

ExpandingMan commented Nov 30, 2018

Right now I'm trying to do it on an old Ubuntu 16.04 install.

Most of what I'm seeing online indicates that the binary for the server itself is called postgres, but I don't seem to have that, despite apparently having the server installed...

@iamed2
Collaborator

iamed2 commented Nov 30, 2018

You should be able to start and stop it with whatever service manager your OS has (I've seen service postgresql start and systemctl start postgresql.service). The binary should be at /usr/local/pgsql/bin/postgres, maybe that's not symlinked into /usr/local/bin on install? You could also try the EnterpriseDB version: https://www.enterprisedb.com/download-postgresql-binaries

@ExpandingMan
Contributor Author

The service always has status "loaded active exited", so I don't think it's actually running. I've tried restarting it a number of times and it doesn't seem to help. It's ok, I'll have to dig into that. I suppose the right thing to do would probably be to run a docker container anyway.

@ExpandingMan
Contributor Author

ExpandingMan commented Dec 3, 2018

Ok, setting up a local PostgreSQL server is surprisingly annoying (for one, the server is not even open on localhost by default, which is kind of weird).

Anyway, confirmed: I can do 10^5 rows in about 14 s (definitely not fast, but tolerable for my purposes). So I guess there must be some problem with my connection. I'm going to look into it more thoroughly and see if I can get to the bottom of it.

@amellnik

If anyone needs to spin up a quick local Postgres instance for testing I recommend using the official docker container:

docker run --rm --name local_postgres \
  -e POSTGRES_PASSWORD=something \
  -p 5432:5432 \
  postgres:10.5

or similar (installing and running it natively is a pain).

@galenlynch
Contributor

galenlynch commented Jan 4, 2019

Wow, I cannot believe how much faster COPY is than INSERT statements, even when the INSERTs were prepared statements run inside a transaction.

I just reworked a set of insert statements that load data into four interconnected tables and took around four hours for ~500,000 rows. Instead of inserting row by row, I now COPY the data into a temporary table and then split it from the temporary table into the final tables. Using the COPY statement on ~135,000 rows took twenty seconds instead of four hours.
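
For the record, a rough sketch of that staging-table pattern driven from LibPQ (every table and column name here is made up, and it leans on the CopyIn support in the documented COPY example mentioned below):

# Stage the raw rows in a temporary table
execute(conn, "CREATE TEMPORARY TABLE staging (a text, b bigint, c bigint)")

# `rows` is an iterable of tuples matching the staging columns
row_strings = (string(join(r, ','), '\n') for r in rows)
execute(conn, LibPQ.CopyIn("COPY staging FROM STDIN (FORMAT CSV);", row_strings))

# Split the staged data into the real tables with plain SQL
execute(conn, "BEGIN;")
execute(conn, "INSERT INTO table_one (a) SELECT DISTINCT a FROM staging")
execute(conn, "INSERT INTO table_two (a, b, c) SELECT a, b, c FROM staging")
execute(conn, "COMMIT;")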

Amazing, thank you so much!

@ExpandingMan
Contributor Author

My original problem was not even caused by this package, and there is now a COPY example in the documentation (thanks to whoever put it there!).

Seems appropriate to close this issue. Thanks all.

@Nosferican

Is there any way to make the transacted load! more efficient?

@iamed2
Collaborator

iamed2 commented Sep 5, 2019

Yes, but I don't have time to work on them. I will also be away for a month so I won't be able to review.

Take a look at http://initd.org/psycopg/docs/extras.html#fast-execution-helpers for some ideas. You may be able to implement some of these yourself without needing any changes to LibPQ.jl.
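
For instance, one of those helpers (psycopg2's execute_values) just batches many rows into a single multi-row VALUES statement, which you could mimic along these lines (a hypothetical helper, not part of LibPQ):

# Build "INSERT INTO t (c1, c2) VALUES ($1,$2),($3,$4),..." and execute it once
function insert_batch(conn, table, cols, rows)
    ncols = length(cols)
    placeholders = join(
        ("(" * join(("\$$(i + (r - 1) * ncols)" for i in 1:ncols), ",") * ")" for r in 1:length(rows)),
        ",",
    )
    sql = "INSERT INTO $table ($(join(cols, ","))) VALUES $placeholders"
    execute(conn, sql, collect(Iterators.flatten(rows)))
end

# e.g. insert_batch(conn, "insert_perf", ["d", "e", "f"], [(1, 1, 1), (2, 2, 2)])

Note that PostgreSQL caps the number of bind parameters in a single statement at 65535, so batches need to be sized accordingly.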

@Nosferican

Thanks for the response. That might be a bit too much to bite off in the next couple of months, but I can probably contribute cases and benchmarks in case someone else can look into it.

@Nosferican

Nosferican commented Nov 29, 2019

Is there a way to use LibPQ.jl to run psql's \copy command?

@iamed2
Collaborator

iamed2 commented Nov 29, 2019

Yes, that is what https://invenia.github.io/LibPQ.jl/dev/#COPY-1 does. Only copying data into the db though, not out.
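
For pulling rows in from a CSV file without psql, a sketch along the lines of the linked COPY docs (file path, table name, and options are placeholders):

lines = readlines("/path/to/file.csv"; keep=true)   # placeholder path; keep the trailing newlines
copyin = LibPQ.CopyIn("COPY mytable FROM STDIN (FORMAT CSV, HEADER, NULL 'na');", lines)
execute(conn, copyin)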

@Nosferican

I am not a superuser on that server, so I ended up using

run(`psql "user=$db_usr host=$db_host dbname=$db_name password=$db_pwd" -c "\copy $schema.$name FROM '$path' CSV HEADER NULL 'na';"`)

I thought it would be useful to share it here. I don't know whether that would be something worth implementing directly through LibPQ.jl.
