a less painful solution for doing a bulk insert #50
Comments
It should definitely be possible to do a row-by-row insert faster than that. Were you using a transaction?
Yes, I am doing `Data.stream!(df, LibPQ.Statement, cnxn, str)`, where I'm not sure how to surround it in a transaction. Could you provide an example?
Put `BEGIN;` before it and `COMMIT;` after it, i.e. wrap the whole stream in a single transaction.
This tells PostgreSQL "record, but don't actually make the changes to the db until I'm done."
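A minimal sketch of that pattern, assuming a connection `conn` and a DataFrame `df` (the table and column names are illustrative):

```julia
using LibPQ, DataStreams

execute(conn, "BEGIN;")
Data.stream!(df, LibPQ.Statement, conn,
             "INSERT INTO mytable (a, b) VALUES (\$1, \$2)")
execute(conn, "COMMIT;")
```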
That makes a hell of a lot of sense. Sorry, this is probably due to my SQL inexperience. Still, if I get it working nicely I'll do a PR to improve the example, if you don't mind.
Absolutely, I'd love better examples :)
I've also been struggling with slow bulk inserts. I didn't know that transactions had such a big impact on performance! I use these functions:

This week I will:
Hm. What are the types of your DataFrame columns?
The one I just did that was pretty slow was:
Can you try running `LibPQ.string_parameters` on your data and timing it?
I'm not sure whether you mean the rows or the columns here (i.e. should I collect the rows into vectors first?). In the worst case scenario of doing this row-wise, it takes less than half a second, so I don't think this is the problem.
How exactly are you doing this? I can compare that with the `Data.stream!` methods to figure out where the shenanigans are happening.
I did:

```julia
v = [Any[c for c in row] for row in eachrow(df)]
LibPQ.parameter_pointers.(LibPQ.string_parameters.(row))
```

Yes, I realize
Oh! Hmm. Maybe outputting the log on every row is actually what's causing it. You could try running it again after suppressing the log output.
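One way to hide it, assuming LibPQ routes its logging through Memento (worth checking against your LibPQ version):

```julia
using Memento, LibPQ

# Raise LibPQ's logger threshold so per-statement info messages are dropped.
setlevel!(getlogger(LibPQ), "error")
```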
I'll just try running it on LibPQ master, where that logging was changed.
Even with the log output hidden, with the "BEGIN" and "COMMIT" statements it took 4 minutes to upload the data.
Alright, I think I have enough to do a deep perf dive now.
Thanks for running things for me!
Thanks for working on it!
On Julia 1.0.1 I did:

```julia
julia> using LibPQ, DataFrames, DataStreams, UUIDs, Dates
julia> df = DataFrame(a=map(_->uuid4(),1:27000), b=fill("foo", 27000), c=fill(DateTime(2015,1,1,1,1,1), 27000), d=collect(1:27000), e=collect(1:27000), f=collect(1:27000))
27000×6 DataFrame
│ Row │ a │ b │ c │ d │ e │ f │
│ │ UUID │ String │ DateTime │ Int64 │ Int64 │ Int64 │
├───────┼──────────────────────────────────────────────┼────────┼─────────────────────┼───────┼───────┼───────┤
│ 1 │ UUID("cd3d8684-6c79-44af-bf53-0eba7e1842f3") │ foo │ 2015-01-01T01:01:01 │ 1 │ 1 │ 1 │
│ 2 │ UUID("f0d6f314-fed1-4db6-8ad2-2dfe31ef7185") │ foo │ 2015-01-01T01:01:01 │ 2 │ 2 │ 2 │
│ 3 │ UUID("71c36602-b9f8-4398-b3fe-9d37a8c0d931") │ foo │ 2015-01-01T01:01:01 │ 3 │ 3 │ 3 │
│ 4 │ UUID("e924d31c-a055-455f-9140-e09a44540e3f") │ foo │ 2015-01-01T01:01:01 │ 4 │ 4 │ 4 │
│ 5 │ UUID("e779d993-2cef-43dc-a747-d9076dc1e2ea") │ foo │ 2015-01-01T01:01:01 │ 5 │ 5 │ 5 │
│ 6 │ UUID("df5c5269-aad4-4b7c-a757-2270234aef5b") │ foo │ 2015-01-01T01:01:01 │ 6 │ 6 │ 6 │
│ 7 │ UUID("c77e1714-14a3-47e7-b5b1-85eb50fd2aee") │ foo │ 2015-01-01T01:01:01 │ 7 │ 7 │ 7 │
│ 8 │ UUID("9354e6e3-0703-4cc7-8ab1-e85361e15cf3") │ foo │ 2015-01-01T01:01:01 │ 8 │ 8 │ 8 │
│ 9 │ UUID("c89a57ac-c148-46df-b1a6-9bb92518d006") │ foo │ 2015-01-01T01:01:01 │ 9 │ 9 │ 9 │
│ 10 │ UUID("89de8e20-ef9f-481d-a9e7-4e806e554ed6") │ foo │ 2015-01-01T01:01:01 │ 10 │ 10 │ 10 │
│ 11 │ UUID("f89cf802-1b41-41a9-a1a8-fa19a4be7f32") │ foo │ 2015-01-01T01:01:01 │ 11 │ 11 │ 11 │
⋮
│ 26989 │ UUID("f097b5fb-d0a6-4795-a029-5c13298ae600") │ foo │ 2015-01-01T01:01:01 │ 26989 │ 26989 │ 26989 │
│ 26990 │ UUID("6b84d49f-c421-4cb6-8ded-e50cc63a868b") │ foo │ 2015-01-01T01:01:01 │ 26990 │ 26990 │ 26990 │
│ 26991 │ UUID("39dc324a-d202-4347-b52c-f8a31e9ff8f2") │ foo │ 2015-01-01T01:01:01 │ 26991 │ 26991 │ 26991 │
│ 26992 │ UUID("6ebd3252-4456-4702-bc14-5de78a27576f") │ foo │ 2015-01-01T01:01:01 │ 26992 │ 26992 │ 26992 │
│ 26993 │ UUID("0259b17a-0ddb-481f-8a91-f50631a62a11") │ foo │ 2015-01-01T01:01:01 │ 26993 │ 26993 │ 26993 │
│ 26994 │ UUID("6afd9759-88c1-4f53-a88f-98af4970c04c") │ foo │ 2015-01-01T01:01:01 │ 26994 │ 26994 │ 26994 │
│ 26995 │ UUID("b609d2eb-49b2-4772-a4e3-ddb3f5d57671") │ foo │ 2015-01-01T01:01:01 │ 26995 │ 26995 │ 26995 │
│ 26996 │ UUID("b147fce2-1293-48f1-9b8c-45dcf3ab4b7e") │ foo │ 2015-01-01T01:01:01 │ 26996 │ 26996 │ 26996 │
│ 26997 │ UUID("18ae6f38-3dd4-4027-8ff2-080332c880b6") │ foo │ 2015-01-01T01:01:01 │ 26997 │ 26997 │ 26997 │
│ 26998 │ UUID("bd8ea0b1-4a4f-4ed6-899f-706185bc6ce8") │ foo │ 2015-01-01T01:01:01 │ 26998 │ 26998 │ 26998 │
│ 26999 │ UUID("15b949b7-2441-4091-9c65-b26063d837b7") │ foo │ 2015-01-01T01:01:01 │ 26999 │ 26999 │ 26999 │
│ 27000 │ UUID("7da5c207-fc73-47c9-86d4-63b7612da1b8") │ foo │ 2015-01-01T01:01:01 │ 27000 │ 27000 │ 27000 │
julia> conn = LibPQ.Connection("dbname=postgres");
julia> execute(conn, "CREATE TABLE insert_perf (a varchar(37) PRIMARY KEY, b varchar, c timestamp, d bigint, e bigint, f bigint)");
julia> execute(conn, "BEGIN;");
julia> @time Data.stream!(df, LibPQ.Statement, conn, "INSERT INTO insert_perf (a, b, c, d, e, f) VALUES (\$1, \$2, \$3, \$4, \$5, \$6)")
7.453611 seconds (13.25 M allocations: 642.813 MiB, 4.29% gc time)
PostgreSQL prepared statement named __libpq_stmt_0__ with query INSERT INTO insert_perf (a, b, c, d, e, f) VALUES ($1, $2, $3, $4, $5, $6)
julia> execute(conn, "COMMIT;"); What version of Julia are you running? How big are your strings? |
I'd forgotten about this issue, but have never been able to actually resolve it. I don't have terribly long strings; the only difference I see is that you are using a string column type instead of `uuid`. It actually seems surprisingly tricky to run a local database for testing, so I haven't been able to confirm your result yet.
What platform? Maybe I can help.
Right now I'm trying to do it on an old Ubuntu 16.04 install. Most of what I'm seeing online seems to indicate that the binary for the server itself is called `postgres`.
You should be able to start and stop it with whatever service manager your OS has (I've seen `systemctl` and `service` used).
The service always has status
Ok, setting up a local PostgreSQL server is surprisingly annoying (for one, the server is not even open on the network by default). Anyway, confirmed: I get comparable timings.
If anyone needs to spin up a quick local Postgres instance for testing, I recommend using the official Docker container, e.g. `docker run --rm -e POSTGRES_PASSWORD=postgres -p 5432:5432 postgres` or similar (running it locally is a pain).
Wow, I cannot believe how much faster copy is than insert statements, even though the insert statements were prepared statements run inside a transaction. I just rewrote a set of insert statements that put data into four interconnected tables; loading ~500,000 rows took around four hours. The rewrite instead copies the data into a temporary table and then splits it from there into the final tables. On ~135,000 rows, the copy approach took twenty seconds instead of four hours. Amazing, thank you so much!
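A sketch of that staging pattern with LibPQ.jl; all table and column names here are illustrative, and `LibPQ.CopyIn` does the bulk load:

```julia
using LibPQ

conn = LibPQ.Connection("dbname=postgres")

# Stage the raw rows in a temporary table via COPY.
execute(conn, "CREATE TEMPORARY TABLE staging (a varchar, b bigint);")
row_strings = ("$a,$b\n" for (a, b) in rows)  # `rows`: your source iterable
execute(conn, LibPQ.CopyIn("COPY staging FROM STDIN (FORMAT csv);", row_strings))

# Split the staged data into the final tables server-side.
execute(conn, "INSERT INTO table_a (a) SELECT DISTINCT a FROM staging;")
execute(conn, "INSERT INTO table_b (b) SELECT b FROM staging;")
```

Note the naive `"$a,$b\n"` interpolation assumes values with no commas, quotes, or newlines; anything else needs proper CSV escaping.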
My original problem was not even caused by this package, and there is now a documented `COPY` option. Seems appropriate to close this issue. Thanks all.
Is there any way to make the transacted row-by-row inserts faster?
Yes, but I don't have time to work on them. I will also be away for a month, so I won't be able to review. Take a look at http://initd.org/psycopg/docs/extras.html#fast-execution-helpers for some ideas. You may be able to implement some of these yourself without needing any changes to LibPQ.jl.
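For anyone picking this up: a sketch of the multi-row VALUES idea from psycopg2's `execute_values`, built on plain LibPQ.jl calls (`batch_insert` is a hypothetical helper, not an existing API):

```julia
using LibPQ

# Send one INSERT carrying many VALUES tuples per round trip, all inside a
# single transaction. `table`, `cols`, and `rows` are caller-supplied.
function batch_insert(conn, table, cols, rows; batchsize=1000)
    ncols = length(cols)
    execute(conn, "BEGIN;")
    for chunk in Iterators.partition(rows, batchsize)
        # Build "($1,$2),($3,$4),..." with sequential parameter numbers.
        values = join(
            ("(" * join(("\$$((i - 1) * ncols + j)" for j in 1:ncols), ",") * ")"
             for i in 1:length(chunk)),
            ",",
        )
        sql = "INSERT INTO $table ($(join(cols, ","))) VALUES $values"
        execute(conn, sql, collect(Iterators.flatten(chunk)))
    end
    execute(conn, "COMMIT;")
end
```

Usage would look like `batch_insert(conn, "mytable", ["a", "b"], [[r.a, r.b] for r in eachrow(df)])`. PostgreSQL caps a statement at 65535 bind parameters, so keep `batchsize * length(cols)` under that.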
Thanks for the response. That might be a bit too much to bite off in the next couple of months, but I can probably contribute cases and benchmarks in case someone else can look into it.
Is there a way to use LibPQ.jl to get what psql's `\copy` does?
Yes, that is what https://invenia.github.io/LibPQ.jl/dev/#COPY-1 does. Only copying data into the db though, not out.
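For reference, the documented pattern looks roughly like this (connection string, table, and rows are placeholders):

```julia
using LibPQ

conn = LibPQ.Connection("dbname=postgres")

# Each element the iterator yields becomes a chunk of COPY data; here, one CSV line per row.
row_strings = ("$a,$b\n" for (a, b) in [("foo", 1), ("bar", 2)])

copyin = LibPQ.CopyIn("COPY mytable FROM STDIN (FORMAT csv);", row_strings)
execute(conn, copyin)
```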
I am not a superuser on that server, so I ended up using a workaround. I thought it would be useful to share it here. I don't know if that would be something worth implementing directly through LibPQ.jl.
Currently it is recommended that inserts are done with `Data.stream!`. The problem with this is that right now this performs a row-by-row insert, which is unbelievably slow (for me it was at least 10 minutes for 5e4 rows with a perfectly good connection). It seems that `libpq` itself does not provide us with many good solutions.

One suggestion, thanks to Ibilli on discourse, is to write a DataFrame to a buffer in the form of a CSV and upload it with a `COPY` statement. While horrific, this might be the best option, as it might otherwise be completely impractical to do a large insert.

Now, I realize that the posted solution involves both DataFrames and CSV and that these will not be added as dependencies. My intention in opening this issue was to discuss whether it would be worth adding something like `bulkcopy(::IO)` and `bulkcopy(::String)`. (As far as I can tell there's nothing like that here now.)

Thanks all.