Performance issues #926

Closed

rpoelstra opened this issue Feb 25, 2021 · 9 comments

@rpoelstra
Hi @groue,

This is a continuation of our short discussion here.
I perform all inserts in a single transaction and my profile builds are in Release mode.

The first change I made to improve saving performance was to switch from save() to insert(). This saves me from fetching the primary key (which costs a query) and shaves off 10 s. I'm now at about 32 s for about 639k rows.
SQLite takes up only 11 seconds of this; the rest is bookkeeping by GRDB.
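For reference, here is roughly what the insert loop looks like now (a schematic sketch; dbQueue, records, and the record type are placeholders for my actual code, assuming a PersistableRecord type):

// One write transaction for the whole batch.
try dbQueue.write { db in
    for record in records {
        // insert() skips the extra primary key query that save() issues.
        try record.insert(db)
    }
}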
I notice two things:
- Initializing the DAO takes 9.56 s
- DAO.insertStatement(onConflict:) takes 7.5 s

The first one seems to be slow because a keyed encoder is used. As this call seems necessary for each and every record, I think the only way to speed it up is by using a different encoder. Is this something I can do in my code?

The second one spends time constructing the query and retrieving it from the cache, even though the queries are all going to be the same. Would it be possible to call insert() on an array, so that the cached query can be reused with minimal overhead?
This second call also gets the arguments from the persistence container in what seems to be a keyed way, so maybe speeding up point 1 also speeds up point 2.

Can we share the profile run data in private?

Environment

GRDB flavor(s): GRDB
GRDB version: 5.2.0
Installation method: SPM
Xcode version: 12.4
Swift version: 5
Platform(s) running GRDB: macOS
macOS version running Xcode: 11.2


groue commented Feb 25, 2021

Hello @rpoelstra, I was waiting for this issue ;-)

The first change I made to improve saving performance was to switch from save() to insert(). This saves me from fetching the primary key (which costs a query) and shaves off 10 s. I'm now at about 32 s for about 639k rows.

Good.

SQLite takes up only 11 seconds of this; the rest is bookkeeping by GRDB.
I notice two things:
- Initializing the DAO takes 9.56 s
- DAO.insertStatement(onConflict:) takes 7.5 s

The first one seems to be slow because a keyed encoder is used.

Well, we pay the price of dynamicity:

  1. The standard Encodable.encode(to:) has a cost.
  2. The GRDB EncodableRecord.encode(to:) has a cost.
  3. Both those methods allow each instance to encode into a different set of keys/columns.
  4. A distinct SQL query is thus built for each instance.
  5. This SQL query is turned into a prepared INSERT statement (even if those statements are cached, we still have to perform a dictionary lookup in order to search the cache).
  6. Dictionary lookups assign encoded values to the statement arguments.

This dynamicity is obviously handy for the application developer. It can also have a positive impact on application performance:

// Generates a different SQL request depending on the columns
// that were actually modified. If the player is not modified,
// the database is not hit at all, sparing both a write and a
// notification of active database observations.
try player.updateChanges(db) {
    $0.score = 1000
    $0.isCaptain = true
}

Huge batch inserts, unfortunately, are not on the winning side of this trade-off.

The second one spends time constructing the query and retrieving it from the cache, even though the queries are all going to be the same. Would it be possible to call insert() on an array, so that the cached query can be reused with minimal overhead?

Point (3) above answers your question: what if one record does not encode the same set of columns as the others? I do not want EncodableRecord.encode(to:) to raise a precondition failure when not all columns are defined: that would make it impossible to leverage SQLite features such as columns with default values, or generated columns.
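For illustration (a hypothetical schema and record; none of these names come from your project): a record may legitimately encode fewer columns than the table defines, and SQLite fills in the rest:

// Hypothetical table with a default-valued column:
try db.create(table: "place") { t in
    t.autoIncrementedPrimaryKey("id")
    t.column("name", .text).notNull()
    t.column("createdAt", .datetime).notNull().defaults(sql: "CURRENT_TIMESTAMP")
}

// This record does not encode createdAt at all. The generated
// INSERT only sets "name", and SQLite applies the default:
struct Place: Codable, PersistableRecord {
    var name: String
}
try Place(name: "Paris").insert(db)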

I think the only way to speed it up is by using a different encoder. Is this something I can do in my code?

Of course! Dynamicity is handy, but when it gets in the way, it's time to go straight to static.

For the purpose of demonstration, I'll insert batches of this record type:

struct Element: Codable, TableRecord {
    var id: Int64?
    var col1: String
    var col2: Int
    var col3: Date
}

The technique is to reuse a prepared statement:

extension Element {
    static func batchInsert(_ db: Database, elements: [Self]) throws {
        // Compile the INSERT statement once, then reuse it for every element.
        let statement = try db.makeUpdateStatement(sql: """
            INSERT INTO element (col1, col2, col3) VALUES (?, ?, ?)
            """)
        for e in elements {
            // setUncheckedArguments skips the validation that regular
            // argument assignment performs, for speed.
            statement.setUncheckedArguments([e.col1, e.col2, e.col3])
            try statement.execute()
        }
    }
}

SQLInterpolation can make the compiler complain when columns are renamed:

extension Element {
    static func batchInsert(_ db: Database, elements: [Self]) throws {
        // Stops compiling when columns are renamed or removed.
        // Does not spot missing columns, unfortunately.
        let statement = try db.makeUpdateStatement(literal: """
            INSERT INTO \(self) (
                \(CodingKeys.col1),
                \(CodingKeys.col2),
                \(CodingKeys.col3))
            VALUES (?, ?, ?)
            """)
        for e in elements {
            statement.setUncheckedArguments([e.col1, e.col2, e.col3])
            try statement.execute()
        }
    }
}

An alternative that deals with auto-incremented ids:

extension Element {
    static func batchInsert(_ db: Database, elements: [Self]) throws -> [Self] {
        let statement = try db.makeUpdateStatement(literal: """
            INSERT INTO \(self) (
                \(CodingKeys.col1),
                \(CodingKeys.col2),
                \(CodingKeys.col3))
            VALUES (?, ?, ?)
            """)
        var elements = elements
        for i in elements.indices {
            let e = elements[i]
            statement.setUncheckedArguments([e.col1, e.col2, e.col3])
            try statement.execute()
            // Harvest the auto-incremented id right after each insert.
            elements[i].id = db.lastInsertedRowID
        }
        return elements
    }
}
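Whichever variant you pick, call it from a single write transaction so the whole batch commits at once (a usage sketch; dbQueue stands in for your DatabaseQueue or DatabasePool):

let elements = (0..<1000).map { i in
    Element(id: nil, col1: "value \(i)", col2: i, col3: Date())
}
// One transaction for the whole batch:
let inserted = try dbQueue.write { db in
    try Element.batchInsert(db, elements: elements)
}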

You'll pick your favorite. Reusing a single prepared statement is as close to raw SQLite as GRDB can go, without any dynamic machinery or dictionary lookups.

Maybe we could design a new dedicated batch API with the precondition that all records encode the same set of keys. But it would still pay the price of dictionary lookups, so I'm not even sure it's worth it. This is just a feeling, not a hard decision.

groue added the support label Feb 25, 2021
@rpoelstra

Great! A single use of such a batch insert already reduced the time from 32 s to 15 s.

What's the difference between makeUpdateStatement and cachedUpdateStatement (except for the arguments possibly not being cleared for the latter)? If I want to call this batch function multiple times, for several batches of inserts, should I rather use the cached version?


groue commented Feb 25, 2021

Great! A single use of such a batch insert already reduced the time from 32 s to 15 s.

That's not tremendous, but that's better.

What's the difference between makeUpdateStatement and cachedUpdateStatement (except for the arguments possibly not being cleared for the latter)? If I want to call this batch function multiple times, for several batches of inserts, should I rather use the cached version?

makeUpdateStatement always compiles a new SQLite statement, while cachedUpdateStatement reuses the same instance. You may win a few more cycles.
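For example (a sketch along the lines of batchInsert above; note that a cached statement keeps its last arguments, hence the explicit set on each iteration):

extension Element {
    static func batchInsertCached(_ db: Database, elements: [Self]) throws {
        // Reuses an already compiled statement from db's statement cache
        // instead of compiling a new one on every call.
        let statement = try db.cachedUpdateStatement(sql: """
            INSERT INTO element (col1, col2, col3) VALUES (?, ?, ?)
            """)
        for e in elements {
            statement.setUncheckedArguments([e.col1, e.col2, e.col3])
            try statement.execute()
        }
    }
}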

@rpoelstra

Well, given that the lower limit is around 10 s (for the raw SQL inserts), I don't think it's too bad.
I've applied the batch insert trick in a few more places and managed to shave off a few more seconds, but I think I'm at the limit for now. Apart from SQLite itself, there are no more outliers in terms of time spent.

Many thanks for the superb support!

groue added the needs revisiting label (This issue was closed but it has unfinished business) Feb 26, 2021

groue commented Mar 18, 2021

I forgot to ship the makeUpdateStatement(literal:) method, unfortunately 😅

This will ship in the next version.

groue removed the needs revisiting label (This issue was closed but it has unfinished business) Mar 18, 2021

groue commented Apr 1, 2021

I updated the sample code with db.makeUpdateStatement(literal:) shipped in v5.7.0.

@filipealva

@groue how do these SQL statements deal with blob columns?

Does the unchecked argument for a blob column need to be passed as Data?


groue commented Jan 16, 2023

@filipealva, it looks like your question is not related to this issue. Would you mind opening another issue (and maybe expressing your question in more detail)?

@filipealva

@groue I'm doing some performance optimization; it's indirectly related to this issue because of the solution suggested here.

I opened a new issue: #1319
