Some issues with prefetching and related queries #106

Closed
jasonmyers opened this Issue Nov 20, 2014 · 3 comments

Projects

None yet

2 participants

@jasonmyers
Contributor

(Using pony-0.6, Python 3.3)

I'm seeing some extra queries I don't understand during prefetching and getting related entities. Setup is a simple Forum entity with possible related Topics:

from pony.orm import *

sql_debug(True)
db = Database('sqlite', ':memory:')

class Forum(db.Entity):
    topics = Set('Topic')

class Topic(db.Entity):
    forum = Required('Forum')

db.generate_mapping(create_tables=True)

with db_session:
    # Forum with a topic
    forum_a = Forum()
    Topic(forum=forum_a)

    # Another forum with a topic
    forum_b = Forum()
    Topic(forum=forum_b)

    # Forum with no topics
    forum_c = Forum()

with db_session:
    forums = select(forum for forum in Forum)

    # Print forums and their topics
    for forum in forums:
        print(forum, forum.topics)

I see the following queries

SELECT "forum"."id"
FROM "Forum" "forum"

SELECT "id", "forum"
FROM "Topic"
WHERE "forum" = ?
[1]

SELECT "id", "forum"
FROM "Topic"
WHERE "forum" IN (?, ?)
[2, 3]

Forum[1] TopicSet([Topic[1]])
Forum[2] TopicSet([Topic[2]])
Forum[3] TopicSet([])

I assume there are 2 extra Topic queries to get-related because you don't know the first time if I am going to ask for related topics from other Forums, but when pony hits the second forum.topics, it goes ahead and loads the rest (for [2, 3])?

If so, I figured I could add a .prefetch() then to tell Pony to just load Topics for all 3 Forums one query:

        forums = select(forum for forum in Forum).prefetch(Forum.topics)

and expected to see just one extra query, like

SELECT "id", "forum"
FROM "Topic"
WHERE "forum" IN (?, ?)
[1, 2, 3]

but instead see all of these queries now

SELECT "forum"."id"
FROM "Forum" "forum"

SELECT "id"
FROM "Forum"
WHERE "id" IN (?, ?, ?)
[1, 2, 3]

SELECT "id", "forum"
FROM "Topic"
WHERE "forum" = ?
[1]

SELECT "id"
FROM "Forum"
WHERE "id" IN (?, ?, ?)
[2, 1, 3]

SELECT "id", "forum"
FROM "Topic"
WHERE "forum" IN (?, ?)
[2, 3]

SELECT "id"
FROM "Forum"
WHERE "id" IN (?, ?, ?)
[3, 2, 1]

At least some of those are logically duplicates, so I'm not sure what's happening there. After further testing, if I add a single field into the Forum model (without even using it), like this:

class Forum(db.Entity):
    topics = Set('Topic')
    name = Optional(str)

The queries go back to the first example (loading [1], then [2, 3]). Adding the field to Topic doesn't fix the issue. In any cases I couldn't get it to prefetch all related with a single query.

Further, I have a second (and possibly related) issue. Setup is similar, except instead of Forum -> Topic, I have nested forums, so Forum -> Forum:

from pony.orm import *

sql_debug(True)
db = Database('sqlite', ':memory:')

class Forum(db.Entity):
    parent = Optional('Forum')
    children = Set('Forum')

db.generate_mapping(create_tables=True)

with db_session:
    # Forum with a child forum
    forum_a = Forum()
    Forum(parent=forum_a)

    # Another forum with a child forum
    forum_b = Forum()
    Forum(parent=forum_b)

    # Forum with no children
    forum_c = Forum()

with db_session:
    # Select only "top level" forums without a parent
    forums = select(forum for forum in Forum if forum.parent is None)

    # Print top level forums and their children
    for forum in forums:
        print(forum, forum.children)

And I see these queries

SELECT "forum"."id", "forum"."parent"
FROM "Forum" "forum"
WHERE "forum"."parent" IS NULL

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" = ?
[1]

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" IN (?, ?, ?)
[3, 2, 5]

Forum[1] ForumSet([Forum[2]])
Forum[3] ForumSet([Forum[4]])
Forum[5] ForumSet([])

Again 2 extra queries (without using prefetch), but in [3, 2, 5]: why is pony fetching sub-forums for 2 when Forum[2] is not in the initial results (since Forum[2].parent is not None)? I added .prefetch() to see what would happen

forums = select(forum for forum in Forum if forum.parent is None).prefetch(Forum.children)

and get:

SELECT "forum"."id", "forum"."parent"
FROM "Forum" "forum"
WHERE "forum"."parent" IS NULL

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" = ?
[1]

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" IN (?, ?, ?)
[3, 2, 5]

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" = ?
[4]

So now not only Forum[2] but Forum[4] children is being pre-fetched, when I expect a single extra query with just

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" IN (?, ?, ?)
[1, 3, 5]
@kozlovsky kozlovsky self-assigned this Nov 20, 2014
@kozlovsky kozlovsky added this to the 0.6.1 milestone Nov 20, 2014
@kozlovsky kozlovsky added the bug label Nov 20, 2014
@kozlovsky kozlovsky closed this in 469894b Nov 21, 2014
@kozlovsky
Contributor

I assume there are 2 extra Topic queries to get-related because you don't know the first time if I am going to ask for related topics from other Forums, but when pony hits the second forum.topics, it goes ahead and loads the rest (for [2, 3])?

Yes, exactly.

but instead see all of these queries now

Thanks for the reporting, these extra queries were caused by a bug. Roughly speaking, after the retrieving of just primary key column Pony does not believed that this was the entire object and tries to fetch it again. Now this behavior should be fixed, and no object should be fetched more then once.

I couldn't get it to prefetch all related with a single query

When the prefetch method was introduced, its purpose was not to reduce the number of queries. It was added to allow using of ORM objects outside of their db_session. Personally I think that it is better to always use ORM objects within the corresponding db_session. For example, when these objects are used to generate some HTML template, the generation should be done inside the db_session. But some people requested the possibility to use objects after their db_session has ended. In this case, we need to be sure that all necessary objects are loaded with all necessary sub-objects and relations. The prefetch method allows to guarantee this.

It is possible to optimize logic of prefetch method in the future in order to produce more effective queries, but currently prefetch just walk recursively and load all specified relations, using standard optimization heuristics which Pony uses for typical queries which user code generates.

So now not only Forum[2] but Forum[4] children is being pre-fetched

It is because the prefetching goes recursively. When you call .prefetch(Forum.children) that means that this collection should be loaded for all objects, in order to guarantee that for any object this attribute can be accessible after the db_session has ended. Actually this logic goes wrong for recursive relations, because this way all table rows can be pulled unexpectedly, but I just don't come up with a good API for this case yet. In some cases it is desirable to prefetch not only the top level of objects, but also its related sub-objects, and so forth.

Anyway, now after the major bug in query logic was fixed the number of queries should be much lower then you initially experienced even without using of prefetching. Tell me what do you think, and thanks again for the reporting!

@jasonmyers
Contributor

Thanks! The main bug appears fixed.

I created a couple of patches to reduce the queries further in my specific situation. I don't expect you to merge these necessarily, but here they are in any case (tests are passing, added a couple):

  1. https://github.com/jasonmyers/pony/compare/prefetch-all
    Bypass nplus1_threshold when prefetching, which combines the [1], [2, 3] queries into a single [1, 2, 3] query when using .prefetch()

  2. https://github.com/jasonmyers/pony/compare/prefetch-norecursive
    Add recursive flag to .prefetch() (defaults True), which if False will only prefetch one level deep for recursive objects

  3. https://github.com/jasonmyers/pony/compare/prefetch-all-norecursive
    Both of the above patches combined

Example output, given the same script as before:

from pony.orm import *
sql_debug(True)
db = Database('sqlite', ':memory:')

class Forum(db.Entity):
    parent = Optional('Forum')
    children = Set('Forum')

db.generate_mapping(create_tables=True)

with db_session:
    # Forum with a child forum
    forum_a = Forum()
    forum_aa = Forum(parent=forum_a)

    # Another forum with a child forum
    forum_b = Forum()
    forum_bb = Forum(parent=forum_b)

    # Forum with no children
    forum_c = Forum()

with db_session:
    # Select all "top level" forums without a parent
    forums = select(forum for forum in Forum if forum.parent is None).prefetch(Forum.children)
    # Print forums and their topics
    for forum in forums:
        print(forum, forum.children)

Without patches:

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" = ?
[1]

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" IN (?, ?, ?)
[3, 2, 5]

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" = ?
[4]

With prefetch-all:

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" IN (?, ?, ?)
[1, 3, 5]

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" IN (?, ?)
[2, 4]

With prefetch-all-norecursive and .prefetch(Forum.children, recursive=False):

SELECT "id", "parent"
FROM "Forum"
WHERE "parent" IN (?, ?, ?)
[1, 3, 5]

(These patches only affect queries that use prefetch() as a hint for reduced queries, and don't apply if .prefetch() isn't used)

@kozlovsky
Contributor

Thanks, the patches looks interesting, I'll think a bit and most likely merge them

@kozlovsky kozlovsky reopened this Nov 21, 2014
@kozlovsky kozlovsky added a commit that closed this issue Feb 20, 2015
@kozlovsky kozlovsky Pony ORM Release 0.6.1
* Closed #65: Now the select(), filter(), order_by(), page(), limit(), random() methods can be applied to collection attributes
* Closed #105: Now you can pass globals and locals to the select() function
* Improved inheritance support in queries: select(x for x in BaseClass if x.subclass_attr == y)
* Now it is possible to do db.insert(SomeEntity, column1=x, column2=y) instead of db.insert(SomeEntity._table_, column1=x, column2=y)
* Discriminator attribute can be part of the composite index
* Now it is possible to specify the attribute name instead of the attribute itself in composite index
* Query statistics: global_stats_lock is deprecated, just use global_stats property without any locking
* New load() method for entity instances which retrieves all unloaded attributes except collections
* New load() method for collections, e.g. customer.orders.load()
* Enhanced error message when descendant classes declare attributes with the same name
* Fixed #98: Composite index can include attributes of base entity
* Fixed #106: incorrect loading of object which consists of primary key only
* Fixed pony.converting.check_email()
* Prefetching bug fixed: if collection is already fully loaded it shouldn't be loaded again
* Deprecated Entity.order_by(..) method was removed. Use Entity.select().order_by(...) instead
* Various performance enhancements
* Multiple bugs were fixed
5282b2d
@kozlovsky kozlovsky closed this in 5282b2d Feb 20, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment