Efficient lookup in many-many relationship #7
Comments
Currently Pony assumes that collections are not very big. In that case, loading the entire collection with one query is typically more efficient than sending multiple queries to the database and loading the collection items one by one. In your case, the collection of executable symbols is probably too big to be loaded when just one item is requested. For such big collections we plan to add an option. If you need a quick solution before we implement this feature, you can add an intermediate entity:

class Executable(File):
    sym = Set("SymbolPresence")

class Symbol(db.Entity):
    sig = Required(str, 5000, encoding='utf-8')
    exe = Set("SymbolPresence")

class SymbolPresence(db.Entity):
    exe = Required(Executable)
    sym = Required(Symbol)
    PrimaryKey(exe, sym)

After this, you can quickly check whether the symbol is present in the file:

presence = SymbolPresence.get(exe, sym)
if presence is not None:
    print 'file contains this symbol!'

But I don't like this solution, because many queries will be more verbose than with a plain many-to-many relationship. So it is probably better to stay with raw SQL until we add lazy loading of collection items.
Thank you! The intermediate table works great! I was struggling to create my own intermediate table, but I could not quite get it right. I have been debugging this for days, and finally saw that exe in sym.exe was loading the entire set, which is effectively a table scan. I will abstract this check into another function, and will change it when lazy loading is implemented. I'm very happy with the way Pony reduces the verbosity of an application; I can't imagine writing SQL for this kind of task.
P.S. Another observation is that a large cache (>1000 iterations) degrades performance. When I get past around my 1000th symbol, the application actually slows down exponentially. commit() didn't really help as it didn't flush the cache (I know this is fixed), but flushing the cache once in a while actually gives consistently good performance. Maybe the cache should be limited in size, but finding a good general limit is tricky.
That is interesting. The performance degradation was unexpected to me, and it definitely looks like a bug. I will look into this. I want to reproduce the slowdown; can you give some details of the situations that lead to the performance degradation?
Pseudocode:
get_symbols() yields a list of symbols from _file. I simulated around 50k symbols in get_symbols() and printed the time interval between every 1000 symbols/iterations. When flush() is not called, the interval between every 1000 iterations keeps increasing: 8s, 12s, 14s, 20s. When flush() is called, the interval is consistent. Flushing every 500-1000 iterations seems to have the best effect. I'm using the latest Pony from PyPI, installed via pip (I have not yet tried the version from GitHub).
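A minimal sketch of the kind of measurement described above, for anyone who wants to reproduce it. This is not the original test code: the `get_symbols` body, the `time_batches` helper, and its `process` and `after_batch` parameters are hypothetical names chosen for illustration. In the scenario from this thread, `process` would store one symbol through the ORM and `after_batch` would be the place to call flush() or commit().

```python
import time


def get_symbols(count=5000):
    """Hypothetical stand-in for the real get_symbols(): yields fake symbol names."""
    for i in range(count):
        yield "symbol_%d" % i


def time_batches(symbols, process, batch_size=1000, after_batch=None):
    """Run process(symbol) for every symbol and time each full batch.

    Returns a list with the number of seconds spent on each batch of
    batch_size symbols. If after_batch is given, it is called at the end
    of every full batch and its cost is included in that batch's time.
    """
    intervals = []
    started = time.perf_counter()
    for n, symbol in enumerate(symbols, start=1):
        process(symbol)
        if n % batch_size == 0:
            if after_batch is not None:
                after_batch()
            now = time.perf_counter()
            intervals.append(now - started)
            started = now
    return intervals


# Example run with an in-memory set standing in for the database.
seen = set()
intervals = time_batches(get_symbols(5000), seen.add, batch_size=1000)
for i, seconds in enumerate(intervals, start=1):
    print("batch %d: %.4fs" % (i, seconds))
```

If the per-batch times grow steadily in one configuration and stay flat in another, that points at the difference between the two configurations rather than at the data itself. Starting every run from an empty database keeps the runs comparable.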
Hmm, right after I wrote that last comment, I re-ran my test without doing any commit()/flush() and performance was very consistent. Whether the cache was cleared or not, the difference was not significant. Sorry, I don't want to send you on a wild goose chase, but until I can narrow this down better, maybe you can focus on something else. Unless of course you really want to verify that this is not a bug. For now, I'm very happy with the succinct code and the great performance!
Update: I believe this was because I re-ran the test without deleting the database, which already had all the data.
OK, anyway, when you encounter performance problems, please tell us about them. Actually, flush() doesn't clear the cache. Instead, it switches the session to the pessimistic transaction mode and sends unsaved modifications to the database. If you call flush() right after commit(), there are no unsaved modifications at that moment, so flush() just turns off the optimistic mode for the next transaction and doesn't send anything to the database.

db = Database(...)
db.optimistic = False

The cache is not cleared after
If your
You seem to be right about pessimistic mode being more efficient. It also seems that even querying/inserting into the intermediate table has some performance issue. After re-running the test, by varying a combination of these factors: A. Setting
Update: Each test was done with an empty DB. If the data already exists, then there would be no need to write to the DB at all. Some timing information here: https://gist.github.com/chnrxn/5637879
I came up with some observations:
I'm running the test on a Windows 7 laptop, Python 2.7, Pony 0.4.6, with SQLite. Eventually, the actual program will be running on Linux.
Hi chnrxn! We added

class Executable(File):
    name = Required(unicode)
    symbols = Set("Symbol")

class Symbol(db.Entity):
    signature = Required(str, 5000, encoding='utf-8')
    executables = Set(Executable, lazy=True)

With this entity definition the expression

Note that today's PyPI release 0.4.7 doesn't have a fully working implementation of lazy collections due to a minor bug, so you should check the latest GitHub version, which should work as expected.

Another potential source of performance problems may be incorrect handling of the database session. The current version of Pony should fix this problem with the new

It would be great if you could rerun your performance test (both in optimistic and pessimistic mode, and also with
Thank you! 👍 I did notice some other variation in performance, and I hope @db_session will help. Performance is really important for my app due to the huge dataset. In hindsight, I should have kept using my original design. :)
If the dataset is huge, then it is important to have a non-unique index for each foreign key. Currently Pony doesn't create such indexes automatically. We'll fix it as soon as possible, probably next Tuesday.
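To illustrate why such an index matters, here is a small sketch using Python's standard sqlite3 module rather than Pony. The table `symbol_presence`, its columns `exe` and `sym`, and the index name `idx_presence_sym` are assumptions made for this example; they are not necessarily what Pony generates. The composite primary key (exe, sym) helps lookups that start from `exe`, but a lookup by `sym` alone has nothing to search until a separate index on that column exists. EXPLAIN QUERY PLAN shows how SQLite intends to run the query before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical link table with a composite primary key on both foreign keys.
conn.execute(
    "CREATE TABLE symbol_presence ("
    " exe INTEGER NOT NULL,"
    " sym INTEGER NOT NULL,"
    " PRIMARY KEY (exe, sym))"
)


def query_plan(conn, sql):
    """Return the human-readable detail text of EXPLAIN QUERY PLAN for sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The detail text is the last column of every plan row.
    return " | ".join(row[-1] for row in rows)


# Find all executables that contain one particular symbol.
lookup = "SELECT exe FROM symbol_presence WHERE sym = 1"

plan_without_index = query_plan(conn, lookup)

# Non-unique index on the second foreign key column.
conn.execute("CREATE INDEX idx_presence_sym ON symbol_presence (sym)")
plan_with_index = query_plan(conn, lookup)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact wording of the plan depends on the SQLite version, but the first plan should describe a scan over the whole table (or over the primary key index), while the second should describe a search that uses `idx_presence_sym`.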
Hi, we've implemented automatic foreign key index generation. The update has been pushed to the GitHub repo, so you can try it out. We'd appreciate it if you could run your performance tests and let us know the results.
Hi @kozlovsky, I know this issue is now over 3 years old, so I just wanted to clarify the current Pony behaviour. Let's say I have a
Hi @hfaran, if a
If

for c in student1.courses:
    print c.name

It will load all items in a single query. But you can restrict the number of loaded items by using filtering:

for c in student1.courses.filter(lambda c: c.credits > 4):
    print c.name

or

for c in student1.courses.filter(lambda c: c.credits > 4) \
        .order_by(lambda c: c.name).page(1, pagesize=10):
    print c.name

This way only the filtered items will be loaded. Also note that
I have the following schema, where the relationship between Executable and Symbol is many-to-many.
A foreign-key table called Executable_Symbol would be created by Pony to store this relationship, but there seems to be no way to check whether a particular relationship exists via the ORM unless I drop down to raw SQL, i.e.
I figured the best way of doing this is that if I have a Symbol called sym, and an Executable called exe, I can use the expression:
But this seems to be very slow. In comparison, accessing the Executable_Symbol table using raw SQL is much faster, but dropping to raw SQL is not very desirable. My application would check this a few hundred thousand times, so every bit of efficiency would be useful.
Is there a better way to do this?
thanks!
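To make the comparison in the question concrete, here is a minimal sketch using Python's standard sqlite3 module rather than Pony. The table name Executable_Symbol comes from the question; the column names `executable` and `symbol`, the sample ids, and both helper functions are assumptions made for illustration, and the table Pony actually generates may be laid out differently. The first helper mimics loading the whole collection and testing membership in Python; the second asks the database about a single pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout of the link table; the real column names may differ.
conn.execute(
    "CREATE TABLE Executable_Symbol ("
    " executable INTEGER NOT NULL,"
    " symbol INTEGER NOT NULL,"
    " PRIMARY KEY (executable, symbol))"
)
# Executables 1..1000 all contain symbol 7; executable 1 also contains symbol 8.
conn.executemany(
    "INSERT INTO Executable_Symbol (executable, symbol) VALUES (?, ?)",
    [(exe_id, 7) for exe_id in range(1, 1001)] + [(1, 8)],
)


def has_symbol_by_loading(conn, exe_id, sym_id):
    """Fetch every executable for the symbol, then test membership in Python."""
    rows = conn.execute(
        "SELECT executable FROM Executable_Symbol WHERE symbol = ?", (sym_id,)
    ).fetchall()
    return exe_id in {row[0] for row in rows}


def has_symbol(conn, exe_id, sym_id):
    """Ask the database about one (executable, symbol) pair only."""
    row = conn.execute(
        "SELECT 1 FROM Executable_Symbol"
        " WHERE executable = ? AND symbol = ? LIMIT 1",
        (exe_id, sym_id),
    ).fetchone()
    return row is not None


print(has_symbol_by_loading(conn, 500, 7), has_symbol(conn, 500, 7))
print(has_symbol_by_loading(conn, 500, 8), has_symbol(conn, 500, 8))
```

Both helpers give the same answer, but the first transfers one row per executable that contains the symbol, while the second returns at most one row and can be answered from the primary key index. That difference grows with the size of the collection, which matters when the check runs a few hundred thousand times.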