
Efficient lookup in many-many relationship #7

Closed
chnrxn opened this issue May 22, 2013 · 13 comments

chnrxn commented May 22, 2013

I have the following schema, where the relationship between Executable and Symbol is many-to-many.

    class File(db.Entity):
        loc = Required(str, unique=True)
        tim = Optional(datetime)

    class Executable(File):
        sym = Set("Symbol")

    class Symbol(db.Entity):
        sig = Required(str, 5000, encoding='utf-8')
        exe = Set(Executable)

A foreign-key table called Executable_Symbol would be created by Pony to store this relationship, but there seems to be no way to check whether a particular relationship exists via the ORM unless I drop down to raw SQL, i.e.

    eid, sid = exe.id, sym.id
    db.select('* from Executable_Symbol where executable=$eid and symbol=$sid')

I figured the natural way to do this through the ORM is: given a Symbol called sym and an Executable called exe, I can use the expression:

exe in sym.exe

But this seems to be very slow. In comparison, accessing the Executable_Symbol table using raw SQL is much faster, but dropping down to raw SQL is not very desirable. My application would perform this check a few hundred thousand times, so every bit of efficiency helps.

Is there a better way to do this?

thanks!

@ghost ghost assigned kozlovsky May 22, 2013
@kozlovsky (Member) commented:

Currently Pony assumes that collections are not very big. In that case, loading the entire collection with one query is typically more efficient than sending multiple queries to the database and loading collection items one by one.

In your case, the sym.exe collection is probably too big to be loaded when just one item is requested. For such big collections we plan to add a lazy=True option to the Set attribute definition; this option will prevent loading of the entire collection when a single item is requested. We will probably implement this option before the end of next week. Once this feature is added, the expression exe in sym.exe will generate a query similar to your raw SQL query.

If you need a quick solution before we implement this feature, you can add an intermediate entity SymbolPresence with a composite primary key, like this:

    class Executable(File):
        sym = Set("SymbolPresence")

    class Symbol(db.Entity):
        sig = Required(str, 5000, encoding='utf-8')
        exe = Set("SymbolPresence")

    class SymbolPresence(db.Entity):
        exe = Required(Executable)
        sym = Required(Symbol)
        PrimaryKey(exe, sym)

After this, you can check for the presence of a symbol in a file quickly:

    presence = SymbolPresence.get(exe=exe, sym=sym)  # Entity.get() takes keyword arguments
    if presence is not None:
        print 'file contains this symbol!'

But I don't like this solution very much, because many queries become more verbose than with a plain many-to-many relationship, as the sketch below illustrates. So it is probably better to stay with raw SQL until we add lazy loading of collection items.
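
To illustrate the extra verbosity (a hedged sketch; select here is pony.orm.select, and the entity names follow the definitions above):

    # plain many-to-many relationship from the original schema:
    files = select(e for e in Executable if sym in e.sym)

    # with the intermediate SymbolPresence entity, the same query needs an extra hop:
    files = select(sp.exe for sp in SymbolPresence if sp.sym == sym)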


chnrxn commented May 23, 2013

Thank you! The intermediate table works great! I had been struggling to create my own intermediate table but could not quite get it right. I had been debugging this for days, and finally saw that exe in sym.exe was loading the entire set, which is effectively a table scan.

I will abstract this check into another function, and will change it when lazy loading is implemented. I'm very happy with the way Pony reduces the verbosity of an application; I can't imagine writing SQL for this kind of task.

P.S. Another observation is that a large cache (>1000 iterations) degrades performance. When I get past around my 1000th symbol, the application actually slows down exponentially. commit() didn't really help as it didn't flush the cache (I know this is fixed), but flushing the cache once in a while actually gives consistently good performance. Maybe the cache should be limited in size, but finding a good general limit is tricky.

@kozlovsky (Member) commented:

That's interesting; the performance degradation is unexpected to me and definitely looks like a bug. I will look into this. I want to reproduce this slowdown; can you give some details of the situations that lead to the performance degradation?

  1. Does the slowdown take place only when working with many-to-many collections, or does the intermediate entity also lead to a slowdown?
  2. How exactly do you "flush the cache" to cure the slowdown?
  3. In which context do you use Pony? Is it a web framework like Flask or Bottle, or is it a desktop application?
  4. What do you mean by "iteration" in ">1000 iterations"? Is it a full request-response cycle, or just one query to the database?


chnrxn commented May 23, 2013

  1. I'm not able to pinpoint the root cause at the moment. But my sequential query pattern has very little locality for effective caching, since queries are almost never repeated, so the query cache can fill up pretty fast.
  2. I call commit() followed by flush().
  3. My code is a simple command-line application that indexes all the symbols in ELF archives and executables.
  4. I'm looping through all the symbols in an ELF binary (executable), inserting the binary and symbols into the database, and associating each symbol with the binary.

Pseudocode:

def get_symbols(_file):
    for i in xrange(num_symbols): #num_symbols is a global
        sig = '%s symbol (%d)' % (_file, i)
        d = dict(
            sig = sig,
            siz = len(sig),
            typ = 'T'
        )
        yield i,d

def action(_file):
    # get the exe!
    exe = Executable.query_or_insert(_file)
    for i,d in get_symbols(_file):
        # Insert (or retrieve the symbol)
        sym = Symbol.query_or_insert(**d)

        # associate Executable with Symbol
        FileSym.query_or_insert(fil=exe, sym=sym)

        if is_interval(i): # this will be true every 1000 iterations
            # print time since the last interval
            do_commit()

get_symbols() yields the symbols from _file one at a time. I simulated around 50k symbols in get_symbols() and printed out the time interval between every 1000 symbols/iterations.

When flush() is not called, the time interval between every 1000 iterations keeps increasing, like 8s, 12s, 14s, 20s.

When flush() is called, the time interval is consistent. It seems like flushing every 500-1000 iterations has the best effect.

I'm using the latest Pony from PyPI installed via pip (I have not yet tried the version from GitHub).


chnrxn commented May 23, 2013

Hmm, right after I wrote that last comment, I re-ran my test without doing any commit()/flush() and performance was very consistent. Whether the cache was cleared or not, the difference was not significant.

Sorry, I don't want to send you on a wild goose chase, but until I can narrow this down better, maybe you can focus on something else. Unless of course you really want to verify that this is not a bug.

For now, I'm very happy with the succinct code and the great performance!

Update: I believe this was because I re-ran the test without deleting the database, which already had all the data.

@kozlovsky (Member) commented:

OK. In any case, when you encounter performance problems, please tell us about them.

Actually, flush() doesn't clear the cache. Instead, it switches the session to pessimistic transaction mode and sends unsaved modifications to the database. If you call flush() right after commit(), there are no unsaved modifications at that moment, so flush() just turns off the optimistic mode for the next transaction and doesn't send anything to the database.
In your case there is no need for optimistic transactions, because (as far as I understand) there are no concurrent modifications of the database, so pessimistic mode is probably more efficient for you. You can turn off optimistic mode permanently right after the Database object is created:

    db = Database(...)
    db.optimistic = False

The cache is not cleared after commit() because programmers often want to continue working with the loaded objects after a commit. The cache is cleared in two situations:

  1. After rollback(): if you do rollback() immediately after commit(), the changes will be saved to the database and then the cache will be cleared. But this (rollback after commit) doesn't look very straightforward and is not intended to be used often.
  2. Upon exiting a function wrapped with the @with_transaction decorator. This decorator automatically does a commit (or a rollback if an exception was raised), clears the cache, and returns the connection to the connection pool. The @with_transaction decorator is the recommended way to do transaction and cache management with Pony.

If your action() function is wrapped with the @with_transaction decorator, any explicit commit()s within it become unnecessary.
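
A minimal sketch of that, reusing the helper names from your pseudocode (query_or_insert and get_symbols are your own helpers, not Pony API; I'm assuming with_transaction is importable from pony.orm in your Pony version):

    from pony.orm import with_transaction

    @with_transaction
    def action(_file):
        exe = Executable.query_or_insert(_file)        # author's helper
        for i, d in get_symbols(_file):
            sym = Symbol.query_or_insert(**d)          # author's helper
            FileSym.query_or_insert(fil=exe, sym=sym)  # author's helper
        # on exit: commit (or rollback on exception), cache cleared,
        # connection returned to the pool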


chnrxn commented May 23, 2013

You seem to be right about pessimistic mode being more efficient. It also seems that even querying/inserting into the intermediate table has some performance issues.

After re-running the test, varying a combination of these factors:

A. Setting db.optimistic = False
B. Doing commit() and flush() at regular intervals
C. Commenting out FileSym.query_or_insert(fil=exe, sym=sym)

Update: Each test was done with an empty DB. If the data already exists, then there would be no need to write to the DB at all.

Some timing information here: https://gist.github.com/chnrxn/5637879

I came up with some observations:

  1. Initially there is performance degradation when db.optimistic == True, commit() and flush() are not called, and FileSym.query_or_insert(fil=exe, sym=sym) is called (this was my original code).
  2. Setting db.optimistic = False, on its own, leads to consistently good performance, even when FileSym.query_or_insert(fil=exe, sym=sym) is called and commit()/flush() are not.
  3. Commenting out FileSym.query_or_insert(fil=exe, sym=sym), on its own, leads to consistently good performance, even with db.optimistic == True and without commit()/flush().
  4. Doing commit() and flush() every 500 symbols gives consistent performance, with db.optimistic == False and FileSym.query_or_insert(fil=exe, sym=sym) still being called.

I'm running the test on a Windows 7 laptop with Python 2.7, Pony 0.4.6, and SQLite. Eventually the actual program will run on Linux.

@kozlovsky (Member) commented:

Hi chnrxn!

We added a lazy option to collections. If you set lazy=True, the collection will not load all of its items when only one item is requested. This is useful for very big collections where only a small part of the collection is used. So you can define your entities as:

    class Executable(File):
        name = Required(unicode)
        symbols = Set("Symbol")

    class Symbol(db.Entity):
        signature = Required(str, 5000, encoding='utf-8')
        executables = Set(Executable, lazy=True)

With this entity definition, the expression e in symbol.executables will not load the entire symbol.executables collection, but will retrieve only one row from the intermediate table.

Note that today's PyPI release 0.4.7 doesn't have a fully working implementation of lazy collections due to a minor bug, so you should use the latest GitHub version, which should work as expected.

Another potential source of the performance problem may be incorrect handling of the database session. The current Pony version should fix this problem with the new @db_session decorator. Pony requires this decorator for each database interaction (besides the interactive shell), so that Pony knows when to close the database session. You can see the details here: http://doc.ponyorm.com/#db-session
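
For example, a hedged sketch of the membership check wrapped in @db_session (the function name and its arguments are hypothetical; entity names follow the definitions above):

    from pony.orm import db_session

    @db_session
    def file_has_symbol(exe_id, sym_id):
        # look the objects up by primary key inside the session
        exe = Executable[exe_id]
        sym = Symbol[sym_id]
        # with lazy=True this membership check retrieves a single row from
        # the intermediate table instead of loading the whole collection
        return exe in sym.executables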

It would be great if you could re-run your performance test (both in optimistic and pessimistic mode, and with lazy set to True and to False) and tell us if the performance problems are gone.


chnrxn commented Jun 19, 2013

Thank you! 👍 I did notice some other variation in performance, and I hope @db_session will help. Performance is really important for my app due to the huge dataset. In hindsight, I should have kept using my original design. :)

@kozlovsky (Member) commented:

If the dataset is huge, then it is important to have a non-unique index on each foreign key. Currently Pony doesn't create such indexes automatically; we'll fix this as soon as possible, probably next Tuesday.
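
In the meantime, one hedged workaround sketch is to create the missing index by hand with raw SQL right after generate_mapping() (the index and table names here assume Pony's default naming for the SymbolPresence entity and SQLite syntax; depending on the Pony version this may need to run inside a database session):

    db.generate_mapping(create_tables=True)

    # The composite primary key (exe, sym) already covers lookups by exe as a
    # prefix; the reverse direction needs its own non-unique index on sym.
    db.execute('CREATE INDEX IF NOT EXISTS idx_symbolpresence__sym '
               'ON "SymbolPresence" (sym)')
    db.commit()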

@kozlovsky (Member) commented:

Hi, we've implemented the automatic foreign-key index generation. The update is pushed to the GitHub repo, so you can try it out. We'd appreciate it if you could run your performance tests and let us know the results.


hfaran commented Jul 5, 2016

Hi @kozlovsky, I know this issue is now over 3 years old so I just wanted to clarify current Pony behaviour.

Let's say I have a Set which I expect may grow to over several million entries. If I do not create the Set attribute with lazy=True, will Pony attempt to load all of those several million entities into the cache?


kozlovsky commented Jul 6, 2016

Hi @hfaran, if a Set attribute does not have the lazy=True option, then Pony assumes that the collection is probably not too big and that it is more efficient to load all items together once some items are requested. Pony operates this way in order to reduce the total number of queries. For example, if you evaluate something like course1 in student1.courses and the courses collection is not lazy, then all items will be loaded. After that, course2 in student1.courses will not lead to a new query and will return the answer immediately.

If the lazy=True option is specified, then Pony will load only the items that are necessary to perform the operation. This way Pony minimizes the number of rows loaded rather than the number of queries. With lazy=True, course1 in student1.courses will load a single row only. Some operations will still load the full collection content, for example a loop over all items of the collection:

for c in student1.courses:
    print c.name

It will load all items in a single query. But you can restrict the number of loaded items by filtering:

for c in student1.courses.filter(lambda c: c.credits > 4):
    print c.name

or

for c in student1.courses.filter(lambda c: c.credits > 4) \
                         .order_by(lambda c: c.name).page(1, pagesize=10):
    print c.name

This way only the filtered items will be loaded. Also note that len(c.courses) loads all items, while c.courses.count() just calculates COUNT(*) in SQL.
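
A small sketch of that last point (student1 here is the same hypothetical Student instance used in the examples above):

n_lazy  = student1.courses.count()  # issues SELECT COUNT(*); loads no items
n_eager = len(student1.courses)     # loads every item of the collection first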
