Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to fetch the user categories for a book to proxy ... #316

Merged
merged 3 commits into from Oct 25, 2014
Merged

Conversation

cbhaley
Copy link
Contributor

@cbhaley cbhaley commented Oct 24, 2014

...metadata. Add a formatter function to use it.

@kovid: I am not completely happy with this implementation. The changes to cache.py are required to avoid infinite recursion. The method user_categories_for_book calls field_for() which calls composite_for(). The composite_for method created a new ProxyMetadata object and then attempts to evaluate the composite column in question. If that column was the one with a template calling user_categories_for_book, it recurses because the new ProxyMetadata object doesn't see the original's cache.

As you can see, I put a stop to the recursion by passing in the instance of proxy_metadata for the field_for method to use. I thing that makes me nervous with this approach is that it changes a fundamental API. I considered trying to use more basic APIs in user_categories_for_book, but that has the drawbacks of code cloning and being more difficult.

If you have an alternate idea to avoid the recursion, I am all ears.

BTW: the fact that composite_for creates a new ProxyMetadata instance raises the question of whether the caches are as effective as they should/could be. It seems that the value will be cached twice and then discarded once.

@kovidgoyal
Copy link
Owner

It's been a while since I looked at this code, so this is mostly of the top off my head, but it seems that the least disruptive, would be to have user_categories_for_book() check if the field is composite itself, and if it is, call get_value_with_cache with a single PM instance common to all such calls, and otherwise use fast_field_for()

…oxy metadata. Add a formatter function to use it."

This reverts commit 29e29fd.
@kovidgoyal
Copy link
Owner

My logic for the above preventing recursion is that composite_getter() in db.lazy adds the current field into cache with a dummy value before calling the formatter, so it should not call user_cagtegories_for_books again, but I may be wrong.

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 24, 2014

It works if I pass in the proxy_metadata object. It doesn't work if I create the proxy_metadata object in user_categories_for_books(). The reason is that the cache that composite_getter is using is not the cache of the new PM object, so it evaluates the template again.

The good thing about your suggestion is that no other API in cache.py is touched.

I ended up adding the current PM object as a fourth parameter to all the getters. This adds a "push" of a single object instance onto the stack at call time, which is faster than any test to see if the getter is user_categories_getter.

@kovidgoyal kovidgoyal merged commit 77abf61 into kovidgoyal:master Oct 25, 2014
@kovidgoyal
Copy link
Owner

I refactored your patch to:

  1. Make the API work for bulk retrieval of user categories
  2. Reduce the amount of changes in lazy.py
  3. Update the testsuite to reflect the fact that PM objects now return non-empty user_categories

Please check that everything still works.

@kovidgoyal
Copy link
Owner

Regarding (2) above, I agree that adding an extra param to all getters
is faster for the case of getting user_categories, but it is slower for
getting all other attributes, and the latter is more common. It also
makes the patch footprint smaller.

Also, I'm not quite clear as to why constructing the PM object in
user_categories_for_books does not work. If that is indeed the case,
then the API should be changed to make proxy_metadata_map non-optional.

EDIT: To expand on my confusion, if constructing the PM object in user_categories_for_books() does not work, it would imply that this sequence, works:

pm = db.get_proxy_metadata(book_id)
db.user_categories_for_books(book_id, pm)

but this one does not

db.user_categories_for_books(book_id)

Since the first thing user_categories_for_books() does is construct the PM object, that doesn't make sense to me.

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

Two issues here.

First, constructing the PM object in user_categories_for_books(). This does not work because the caches are local to the PM instance. Consider the following sequence:

  • Formatter does mi._proxy_metadata.user_categories. This is using the PM instance embedded in the mi instance.
  • PM determines that the cache is empty for user_categories. It calls db.user_categories_for_books()
  • db.ucfb() must get the value for every custom column mentioned in a user category. Assume that the composite column being evaluated by the formatter is so mentioned.
  • db.ucfb creates a new instance of PM that has empty caches, then calls db.get_value_with_cache()
  • db.get_value_with_cache() invokes (eventually) the formatter, which invokes new_proxy_metadata.user_categories. The cache is empty, so the new_PM object recursively calls db.user_categories_for_books()

Infinite recursion. There is no termination condition.

Passing the pm instance prevents the recursion because that forces using a common cache between the various processing objects. So yes, proxy_metadata_map must not be optional.

Regarding performance: an if requires 2 to 4 instructions (load, cmp, branch), where the instructions can be complex if the comparison is something like a string. A parameter requires a single instruction (push). Thus a parameter is faster if the if test is done for 50% of the accesses. Given your changes, I suspect that most accesses are of fields in SIMPLE_GETTER, so the if will be frequently executed.

I think it would be better to remove "user_categories" from SIMPLE_GETTER and do the test before fetching field metadata. I will make that change so you can see what I mean. No problem if you use the change or not.

@kovidgoyal
Copy link
Owner

On Fri, Oct 24, 2014 at 11:14:35PM -0700, Charles Haley wrote:

Two issues here.

First, constructing the PM object in user_categories_for_books(). This does not work because the caches are local to the PM instance. Consider the following sequence:

  • Formatter does mi._proxy_metadata.user_categories. This is using the PM instance embedded in the mi instance.
  • PM determines that the cache is empty for user_categories. It calls db.user_categories_for_books()
  • db.ucfb() must get the value for every custom column mentioned in a user category. Assume that the composite column being evaluated by the formatter is so mentioned.
  • db.ucfb creates a new instance of PM that has empty caches, then calls db.get_value_with_cache()
  • db.get_value_with_cache() invokes (eventually) the formatter, which invokes new_proxy_metadata.user_categories. The cache is empty, so the new_PM object recursively calls db.user_categories_for_books()

Infinite recursion. There is no termination condition.

The part that confuses me, is how are these two calls different:

pm = db.get_proxy_metadata()
db.ucfb(pm)

and

db.ucfb()

The latter constructs a new PM object at the start. So if the latter is
going to recurse infinitely, the former must also recurse infinitely.

In other words, why does the use case of calling ucfb() from inside a
template formatter function not recurse infinitely.

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

Currently it works like neither of your examples. Backing up a bit, we see that the formatter is handed a pm instance obtained from somewhere. It uses that pm instance to fetch user_categories. PM.user_categories_getter does not create a new pm instance but instead passes itself to db.ucfb. In the course of processing, db.ucfb will use this same pm instance to evaluate any composite columns, meaning that the cache in the pm object contains data, stopping the recursion. In effect we have this sequence:

  • create pm instance
  • formatter(pm)
  • pm.user_categories
  • db.ucfb(pm)
  • field.evaluate_custom_column(pm)
  • formatter(pm). return cached value from original pm instance.

It works because the pm is created outside the recursion.

It will infinitely recurse if a new PM instance is created and used anywhere inside the recursion cycle. Your examples create such an instance, so neither of them will work.

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

BTW: the cached value returned by the formatter is probably the recursive composite field text, but I haven't bothered to check that.

@kovidgoyal
Copy link
Owner

Maybe I'm just being dense, but

The call chain involving the template function is

  1. pm = db.get_proxy_metadata()
  2. pm.get('#compositecol') # where compsite call has the template {user_categories()}
  3. this calls pm.user_categories via some template evaluation calls that I wont trace here
  4. this calls db.user_categories_for_books() with the pm instance from (1)

At stage (4), the PM instance passed into ucfb() is exactly the same
as one that would be created via a fresh call to db.get_proxy_metadata()

Therefore, if instead, at this point no pm instance had been passed in,
there should be no problem.

To put it another way,

The template function has to pass in a PM instance, that much is clear.
The reason for it is that template functions can be called
recursively by each other.

But do other, non-recursive users of this API have to also pass in a PM
instance? If so, why?

@kovidgoyal
Copy link
Owner

In other words, can the following call ever lead to infinite recursion:

db.user_categories_for_books(book_id)

given that the user_categories() template function is written so that it
passes in a PM instance when it calls ucfb()

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

It isn't true that the PM instance is exactly the same as a new one. Looking at pm.composite_getter() we see that the cache for the current pm instance is filled in with

  • cache[field] = 'RECURSIVE_COMPOSITE FIELD (Metadata) ' + field
    and we also see that it passes itself (the current pm instance) to the formatter, meaning that the formatter will see the cached value. The next time that the column is evaluated, that text will be returned and db.user_categories_for_books() will not be called.

I am not sure what you mean by "other, non-recursive users".

Regarding your second comment "can the following call ever lead to infinite recursion?", the answer is yes. If a user category makes reference to a composite column that uses the user_categories() template function then there will be infinite recursion unless the pm instance is passed along the chain. This happens because the check for recursive composite columns is defeated by creating a new pm instance.

@kovidgoyal
Copy link
Owner

On Sat, Oct 25, 2014 at 12:42:30AM -0700, Charles Haley wrote:

It isn't true that the PM instance is exactly the same as a new one. Looking at pm.composite_getter() we see that the cache for the current pm instance is filled in with

  • cache[field] = 'RECURSIVE_COMPOSITE FIELD (Metadata) ' + field
    and we also see that it passes itself (the current pm instance) to the formatter, meaning that the formatter will see the cached value. The next time that the column is evaluated, that text will be returned and db.user_categories_for_books() will not be called.

That I can buy, but if that is the explanation for why it does not
recurse, then wont it recurse even in the following case:

pm = db.proxy_metadata(book_id)
db.ucfb(book_id, pm)

And that means this API effectively can only be used in a very special
circumstance, namely, when evaluating a composite column.

And if that is the case, I dont think this API should be exposed in the
cache object at all, as it is unsafe. Instead, it should be marked as
internal and used by the PM object.

Makes sense?

@kovidgoyal
Copy link
Owner

Thinking about it, it cant even be used in the PM object, since

db.get_proxy_metadata(book_id).user_categories

can recurse infinitely. Some I guess it will have to be used only in the template function and the PM object will need to be changed back to returning an empty user_categories object.

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

Your example will not infinitely recurse. db.ucfb(pm) will attempt to evaluate the composite column. Further down we discover that the column has never been evaluated, which will recursively call db.ucfb(pm) where pm is the original pm. At this point the cache has been prepared and the new call to db.ucfb will not recurse again.

I have no problem with marking db.ucfb as internal.

Re last comment: I don't see why your example will infinitely recurse. The new PM instance will be passed to db.ucfb. Another new PM instance will not be created, so the cache will work.

@kovidgoyal
Copy link
Owner

Because, your explanation for why it does not recurse when called from the template function is that composite_getter() populates the cache

db.get_proxy_metadata(book_id).user_categories

never calls composite_getter

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

Yes it does, assuming that the composite column is involved in the user category. field.get_value_with_cache() calls mi.get(colkey) which will call composite_getter.

@kovidgoyal
Copy link
Owner

On Sat, Oct 25, 2014 at 01:07:51AM -0700, Charles Haley wrote:

Yes it does, assuming that the composite column is involved in the user category. field.get_value_with_cache() calls mi.get(colkey) which will call composite_getter.

But it does not call it at stage (4) of my function call trace above.

Anyway, I think this discussion will be better served with an example
demonstrating infinite recursion, that way you wont have to convince me,
I could just see for myself. Can you construct a metadata.db that
recurses infinitely when calling

db.ucfb(book_id)

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

This program runs fine.
def init_cache(library_path):
from calibre.db.backend import DB
from calibre.db.cache import Cache
backend = DB(library_path)
cache = Cache(backend)
cache.init()
return cache

import math
from collections import defaultdict

cache = init_cache(library_path = 'd:/cbh_data/calibre.git/library.test_small')
series_info = defaultdict(dict)

for id_ in cache.all_book_ids():
pm = cache.get_proxy_metadata(id_)
print(pm.get('user_categories'))

@kovidgoyal
Copy link
Owner

On Sat, Oct 25, 2014 at 01:12:20AM -0700, Charles Haley wrote:

This program runs fine.

And does

for id_ in cache.all_book_ids():
db.ucfb(id_)

fail?

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

I have a metadata.db that fails if ucfb() creates a new PM instance instead of using the one passed in as an argument. I don't see a way to attach files here so I will email it separately.

@kovidgoyal
Copy link
Owner

On Sat, Oct 25, 2014 at 01:33:29AM -0700, Charles Haley wrote:

I have a metadata.db that fails if ucfb() creates a new PM instance instead of using the one passed in as an argument. I don't see a way to attach files here so I will email it separately.

That will help, thanks. That will allow me to investigate the cause an
decide the best place for the API accordingly, without pestering you :)

@kovidgoyal
Copy link
Owner

Oh and note, my claim is that db.ucfb(book_id) will not recurse infintely. Not that db.ucfb(book_id, pm) will not recurse infinitely even if ucfb() ignores pm. I agree that the latter will recurse infinitely.

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

How are the two different? If pm is not passed as an argument then ucfb must create an instance. If ucfb ignores the argument then it must create an instance.

@kovidgoyal
Copy link
Owner

On Sat, Oct 25, 2014 at 01:42:32AM -0700, Charles Haley wrote:

How are the two different? If pm is not passed as an argument then ucfb must create an instance. If ucfb ignores the argument then it must create an instance.

Because, in one case subsequent calls to ucfb() from evaluating the
template all use the passed in PM, in the other case they do not.

With the metadata.db you sent me, I see no infinite recursions with the
following:

calibre-debug -c "from calibre.library import db; db = db('/t').new_api; import pprint; pprint.pprint(db.user_categories_for_books(db.all_book_ids()))"

This is with the code from my master branch.

If I change ucfb() to ignore proxy_metadata_map, I get an infinite
recursion.

Therefore, I conclude that proxy_metadata_map can indeed be optional.

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

Re: "And does

for id_ in cache.all_book_ids():
db.ucfb(id_)

fail?"

I think I see at least some of what you are saying.

The short answer is "no, it doesn't fail". The longer answer is that it doesn't fail because it only creates a new PMM if the one passed in is None, which will never happen except if ucfb() is called by someone other than proxy_metadata.

Re your latest comment: I think we are in agreement.

@kovidgoyal
Copy link
Owner

On Sat, Oct 25, 2014 at 01:52:24AM -0700, Charles Haley wrote:

Re your latest comment: I think we are in agreement.

So are you OK with the code as is in master, or is there anything you
would like to see changed?

@cbhaley
Copy link
Contributor Author

cbhaley commented Oct 25, 2014

I changed the latest pull request to contain only the changes to the formatter function. Those changes should be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants