
a less painful solution for doing a bulk insert #50

Closed
ExpandingMan opened this issue Sep 20, 2018 · 38 comments

@ExpandingMan
Contributor

Currently it is recommended that inserts be done with Data.stream!. The problem is that this currently performs a row-by-row insert, which is unbelievably slow (for me it was at least 10 minutes for 5e4 rows over a perfectly good connection). It seems that libpq itself does not provide us with many good alternatives.

One suggestion, thanks to Ibilli on Discourse, is to write a DataFrame to a buffer as CSV and upload it with a COPY statement. While horrific, this might be the best option, since a large insert might otherwise be completely impractical.
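
For concreteness, here is a minimal, untested sketch of that CSV-buffer-plus-COPY idea. The table name mytable is made up, and it assumes the LibPQ.CopyIn API described in the COPY section of the docs linked later in this thread:

using CSV, DataFrames, LibPQ

buf = IOBuffer()
CSV.write(buf, df)  # df is the DataFrame to upload
seekstart(buf)

# Send the buffered CSV (including its header line) through COPY ... FROM STDIN
copyin = LibPQ.CopyIn("COPY mytable FROM STDIN (FORMAT CSV, HEADER);", eachline(buf; keep=true))
execute(conn, copyin)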

Now, I realize that the posted solution involves both DataFrames and CSV and that these will not be added as dependencies. My intention in opening this issue was to discuss:

  1. Should some function be added to LibPQ to make implementing this easier? Perhaps something like bulkcopy(::IO) and bulkcopy(::String)? (As far as I can tell there's nothing like that here now.)
  2. Is there a better solution lurking around that I'm just not seeing? If so I'd be glad to make a PR to document it.

Thanks all.

@iamed2
Collaborator

iamed2 commented Sep 20, 2018

It should definitely be possible to do a row-by-row insert faster than that. Were you using Data.stream! to a Statement as shown in the example? Did you surround it in a transaction? What were the types of the parameters?

@ExpandingMan
Contributor Author

ExpandingMan commented Sep 20, 2018

Yes, I am doing Data.stream! to a Statement as in the example. I'm doing

Data.stream!(df, LibPQ.Statement, cnxn, str)

where df is a DataFrame, str is an insert statement of the same form as in the example (usually with about 10 columns).

I'm not sure how to surround it in a transaction, could you provide an example?

@iamed2
Collaborator

iamed2 commented Sep 20, 2018

Put execute(conn, "BEGIN") before and execute(conn, "COMMIT") after

@iamed2
Collaborator

iamed2 commented Sep 20, 2018

This tells PostgreSQL "record but don't actually make the changes to the db until I'm done"
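
A minimal sketch of that pattern, wrapping the Data.stream! call from above (mytable and its columns are placeholders):

execute(conn, "BEGIN;")
Data.stream!(df, LibPQ.Statement, conn,
    "INSERT INTO mytable (a, b) VALUES (\$1, \$2)")  # placeholder table/columns
execute(conn, "COMMIT;")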

@ExpandingMan
Contributor Author

That makes a hell of a lot of sense.

Sorry, this is probably due to my SQL inexperience. Still, if I get it working nicely I'll do a PR to improve the example, if you don't mind.

@iamed2
Collaborator

iamed2 commented Sep 20, 2018

Absolutely, I'd love better examples :)

@galenlynch
Contributor

I've also been struggling with slow bulk inserts. I didn't know that transactions had such a big impact on performance!

@johannspies

johannspies commented Sep 25, 2018

I use these functions:

using LibPQ
using LibPQ.libpq_c: PQputCopyData, PQputCopyEnd  # raw libpq COPY wrappers

# `dbcon()` is my own helper that returns a LibPQ.Connection.
function runexecutequery(query)
    conn = dbcon()
    result = execute(conn, query)
    clear!(result)
    close(conn)
end

"""
    runcopyquery(conn, query, list)

`conn` is the connection to PostgreSQL. `query` must name the table and the
fields to copy `list` into, e.g.

    "COPY a_uts (ut, id) FROM stdin;\n"

Each element of `list` must hold the field values in the same order as the
columns named in the query.
"""
function runcopyquery(conn, query, list)
    execute(conn, query)
    for fields in list
        buf = string(join(fields, '\t'), '\n')
        PQputCopyData(conn.conn, pointer(buf), Cint(length(buf)))
    end
    PQputCopyEnd(conn.conn, C_NULL)
end
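
A hypothetical usage sketch for the helper above, with made-up connection info and data:

conn = LibPQ.Connection("dbname=mydb")                 # made-up connection string
rows = [("WOS:000000001", 1), ("WOS:000000002", 2)]    # made-up (ut, id) pairs
runcopyquery(conn, "COPY a_uts (ut, id) FROM stdin;\n", rows)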

@iamed2
Collaborator

iamed2 commented Sep 25, 2018

This week I will:

  • Add a hint about transactions in the documentation, and an example
  • Implement a COPY function

@ExpandingMan
Contributor Author

Hm, Data.stream! is still incredibly slow for me, even when enclosing it in execute(cnxn, "BEGIN") and execute(cnxn, "COMMIT").

@iamed2
Collaborator

iamed2 commented Sep 26, 2018

What are the types of your DataFrame columns?

@ExpandingMan
Contributor Author

The one I just did that was pretty slow was (UUID, String, DateTime, Int, Int, Int).

@iamed2
Collaborator

iamed2 commented Sep 26, 2018

Can you try running LibPQ.parameter_pointers(LibPQ.string_parameters(row)) on each row of the DataFrame and getting the total time for that? That would really help narrow the issue down.

@ExpandingMan
Contributor Author

I'm not sure whether you mean the rows or the columns here (i.e. should I collect the rows into Vector{Any}?)

In the worst case scenario of doing this row-wise, it takes less than half a second, so I don't think this is the problem.

@iamed2
Collaborator

iamed2 commented Sep 27, 2018

In the worst case scenario of doing this row-wise, it takes less than half a second, so I don't think this is the problem.

How exactly are you doing this? I can compare that with the Data.stream methods to figure out where the shenanigans are happening.

Btw, the Data.stream! method is going row-wise as well.

@ExpandingMan
Contributor Author

I did

v = [Any[c for c in row] for row in eachrow(df)]
LibPQ.parameter_pointers.(LibPQ.string_parameters.(v))

Yes, I realize Data.stream! is row-wise. I'm using the tagged version of LibPQ so the logger really, really likes to remind me of that fact 😆

@iamed2
Collaborator

iamed2 commented Sep 27, 2018

Oh! Hmm. Maybe outputting the log on every row is actually what's causing it. Could you try running it again after Memento.setlevel!(getlogger(LibPQ), "warn")?

@ExpandingMan
Contributor Author

I'll just try running it on LibPQ master where we changed it to debug...

@ExpandingMan
Contributor Author

Even with the log output hidden and with the "BEGIN" and "COMMIT" statements, it took 4 minutes to upload 2.7e4 rows of the types I gave above.

@iamed2
Collaborator

iamed2 commented Sep 27, 2018

Alright I think I have enough to do a deep perf dive now.

@iamed2
Collaborator

iamed2 commented Sep 27, 2018

Thanks for running things for me!

@ExpandingMan
Contributor Author

Thanks for working on it!

@iamed2
Collaborator

iamed2 commented Sep 28, 2018

On Julia 1.0.1 I did:

julia> using LibPQ, DataFrames, DataStreams, UUIDs, Dates

julia> df = DataFrame(a=map(_->uuid4(),1:27000), b=fill("foo", 27000), c=fill(DateTime(2015,1,1,1,1,1), 27000), d=collect(1:27000), e=collect(1:27000), f=collect(1:27000))
27000×6 DataFrame
│ Row   │ a                                            │ b      │ c                   │ d     │ e     │ f     │
│       │ UUID                                         │ String │ DateTime            │ Int64 │ Int64 │ Int64 │
├───────┼──────────────────────────────────────────────┼────────┼─────────────────────┼───────┼───────┼───────┤
│ 1     │ UUID("cd3d8684-6c79-44af-bf53-0eba7e1842f3") │ foo    │ 2015-01-01T01:01:01 │ 1     │ 1     │ 1     │
│ 2     │ UUID("f0d6f314-fed1-4db6-8ad2-2dfe31ef7185") │ foo    │ 2015-01-01T01:01:01 │ 2     │ 2     │ 2     │
│ 3     │ UUID("71c36602-b9f8-4398-b3fe-9d37a8c0d931") │ foo    │ 2015-01-01T01:01:01 │ 3     │ 3     │ 3     │
│ 4     │ UUID("e924d31c-a055-455f-9140-e09a44540e3f") │ foo    │ 2015-01-01T01:01:01 │ 4     │ 4     │ 4     │
│ 5     │ UUID("e779d993-2cef-43dc-a747-d9076dc1e2ea") │ foo    │ 2015-01-01T01:01:01 │ 5     │ 5     │ 5     │
⋮
│ 26998 │ UUID("bd8ea0b1-4a4f-4ed6-899f-706185bc6ce8") │ foo    │ 2015-01-01T01:01:01 │ 26998 │ 26998 │ 26998 │
│ 26999 │ UUID("15b949b7-2441-4091-9c65-b26063d837b7") │ foo    │ 2015-01-01T01:01:01 │ 26999 │ 26999 │ 26999 │
│ 27000 │ UUID("7da5c207-fc73-47c9-86d4-63b7612da1b8") │ foo    │ 2015-01-01T01:01:01 │ 27000 │ 27000 │ 27000 │

julia> conn = LibPQ.Connection("dbname=postgres");

julia> execute(conn, "CREATE TABLE insert_perf (a varchar(37) PRIMARY KEY, b varchar, c timestamp, d bigint, e bigint, f bigint)");

julia> execute(conn, "BEGIN;");

julia> @time Data.stream!(df, LibPQ.Statement, conn, "INSERT INTO insert_perf (a, b, c, d, e, f) VALUES (\$1, \$2, \$3, \$4, \$5, \$6)")
  7.453611 seconds (13.25 M allocations: 642.813 MiB, 4.29% gc time)
PostgreSQL prepared statement named __libpq_stmt_0__ with query INSERT INTO insert_perf (a, b, c, d, e, f) VALUES ($1, $2, $3, $4, $5, $6)

julia> execute(conn, "COMMIT;");

What version of Julia are you running? How big are your strings?

@ExpandingMan
Contributor Author

ExpandingMan commented Nov 30, 2018

I'd forgotten about this issue, but I've never been able to actually resolve it. I don't have terribly long strings; the only difference I see is that you are using a string type instead of the uuid type (and I'm pretty sure that's not the issue, since in the past I've done tables without uuid).

It actually seems surprisingly tricky to run a local database for testing, so I haven't been able to confirm your result yet.

@iamed2
Collaborator

iamed2 commented Nov 30, 2018

It actually seems surprisingly tricky to run a local database for testing, so I haven't been able to confirm your result yet.

What platform? Maybe I can help

@ExpandingMan
Contributor Author

ExpandingMan commented Nov 30, 2018

Right now I'm trying to do it on an old Ubuntu 16.04 install.

Most of what I'm seeing online indicates that the binary for the server itself is called postgres, but I don't seem to have that, despite apparently having the server installed...

@iamed2
Collaborator

iamed2 commented Nov 30, 2018

You should be able to start and stop it with whatever service manager your OS has (I've seen service postgresql start and systemctl start postgresql.service). The binary should be at /usr/local/pgsql/bin/postgres, maybe that's not symlinked into /usr/local/bin on install? You could also try the EnterpriseDB version: https://www.enterprisedb.com/download-postgresql-binaries

@ExpandingMan
Contributor Author

The service always has status "loaded active exited", so I don't think it's actually running. I've tried restarting it a number of times and it doesn't seem to help. It's ok, I'll have to dig into that. I suppose the right thing to do would probably be to run a docker container anyway.

@ExpandingMan
Contributor Author

ExpandingMan commented Dec 3, 2018

Ok, setting up a local PostgreSQL server is surprisingly annoying (for one, the server is not even open on localhost by default, which is kind of weird).

Anyway, confirmed: I can do 10^5 rows in about 14 s (definitely not fast, but tolerable for my purposes). So I guess there must be some problem with my connection. I'm going to look into it more thoroughly and see if I can get to the bottom of it.

@amellnik

If anyone needs to spin up a quick local Postgres instance for testing I recommend using the official docker container:

docker run --rm --name local_postgres \
  -e POSTGRES_PASSWORD=something \
  -p 5432:5432 \
  postgres:10.5

or similar (installing and running it natively is a pain).

@galenlynch
Contributor

galenlynch commented Jan 4, 2019

Wow, I cannot believe how much faster COPY is than INSERT statements, even when the INSERTs were prepared statements run inside a transaction.

I just reworked a set of insert statements that load data into four interconnected tables and took around four hours for ~500,000 rows. Instead of inserting row by row, I now COPY the data into a temporary table and then split it from the temporary table into the final tables. Using the COPY statement on ~135,000 rows took twenty seconds instead of four hours.
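
For the record, a rough sketch of that staging-table pattern driven from LibPQ (every table and column name here is made up, and it leans on the CopyIn support in the documented COPY example mentioned below):

# Stage the raw rows in a temporary table
execute(conn, "CREATE TEMPORARY TABLE staging (a text, b bigint, c bigint)")

# `rows` is an iterable of tuples matching the staging columns
row_strings = (string(join(r, ','), '\n') for r in rows)
execute(conn, LibPQ.CopyIn("COPY staging FROM STDIN (FORMAT CSV);", row_strings))

# Split the staged data into the real tables with plain SQL
execute(conn, "BEGIN;")
execute(conn, "INSERT INTO table_one (a) SELECT DISTINCT a FROM staging")
execute(conn, "INSERT INTO table_two (a, b, c) SELECT a, b, c FROM staging")
execute(conn, "COMMIT;")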

Amazing, thank you so much!

@ExpandingMan
Contributor Author

My original problem was not even caused by this package, and there is now a COPY example in the documentation (thanks to whoever put it there!).

Seems appropriate to close this issue. Thanks all.

@Nosferican

Is there any way to make the transacted load! more efficient?

@iamed2
Collaborator

iamed2 commented Sep 5, 2019

Yes, but I don't have time to work on them. I will also be away for a month so I won't be able to review.

Take a look at http://initd.org/psycopg/docs/extras.html#fast-execution-helpers for some ideas. You may be able to implement some of these yourself without needing any changes to LibPQ.jl.
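
For instance, one of those helpers (psycopg2's execute_values) just batches many rows into a single multi-row VALUES statement, which you could mimic along these lines (a hypothetical helper, not part of LibPQ):

# Build "INSERT INTO t (c1, c2) VALUES ($1,$2),($3,$4),..." and execute it once
function insert_batch(conn, table, cols, rows)
    ncols = length(cols)
    placeholders = join(
        ("(" * join(("\$$(i + (r - 1) * ncols)" for i in 1:ncols), ",") * ")" for r in 1:length(rows)),
        ",",
    )
    sql = "INSERT INTO $table ($(join(cols, ","))) VALUES $placeholders"
    execute(conn, sql, collect(Iterators.flatten(rows)))
end

# e.g. insert_batch(conn, "insert_perf", ["d", "e", "f"], [(1, 1, 1), (2, 2, 2)])

Note that PostgreSQL caps the number of bind parameters in a single statement at 65535, so batches need to be sized accordingly.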

@Nosferican

Thanks for the response. That might be a bit too much to bite off in the next couple of months, but I can probably contribute cases and benchmarks in case someone else can look into it.

@Nosferican

Nosferican commented Nov 29, 2019

Is there a way to use LibPQ.jl to run psql's \copy command?

@iamed2
Collaborator

iamed2 commented Nov 29, 2019

Yes, that is what https://invenia.github.io/LibPQ.jl/dev/#COPY-1 does. Only copying data into the db though, not out.
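
For pulling rows in from a CSV file without psql, a sketch along the lines of the linked COPY docs (file path, table name, and options are placeholders):

lines = readlines("/path/to/file.csv"; keep=true)   # placeholder path; keep the trailing newlines
copyin = LibPQ.CopyIn("COPY mytable FROM STDIN (FORMAT CSV, HEADER, NULL 'na');", lines)
execute(conn, copyin)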

@Nosferican

I am not a superuser on that server, so I ended up using

run(`psql "user=$db_usr host=$db_host dbname=$db_name password=$db_pwd" -c "\copy $schema.$name FROM '$path' CSV HEADER NULL 'na';"`)

I thought it would be useful to share it here. I don't know whether that would be something worth implementing directly through LibPQ.jl.
