Implement index backend classes for CDXJ files from local, web, and IPFS sources #637
@ibnesayeed I suggest continuing the discussion of a backend class interface in this separate issue.
```python
memento_set = MementoSet(...)

for memento in memento_set:
    print(memento)
```

But this is syntactic sugar only, obviously. And it repeats what
What if the backend is an IPLD DAG object, which is quite possible in my opinion? You would have to traverse it every time you do a search, and such optimizations matter. Thus, I would still suggest using more atomic operations like those I mentioned in my original comment.
@anatoly-scherbakov Here are my enumerated comments corresponding to your points:
The problem with atomic operations is the overhead of repeating common routines. If a backend works better with atomic operations, then it is simply a matter of writing a wrapper function that calls those privately defined atomic functions to consolidate the API response.
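As a rough, hedged sketch of that wrapper idea (the class and method names below are illustrative, not a proposed API):

```python
class CDXJFileBackend:
    """Illustrative file-based backend that keeps its atomic routines private."""

    def _lookup_urir(self, urir):
        # Atomic routine: yield every raw CDXJ record for the given URI-R.
        ...

    def _closest_to_datetime(self, records, datetime14):
        # Atomic routine: pick the record whose datetime is closest to the request.
        ...

    def closest_memento(self, urir, datetime14):
        # Public API: compose the private atomic routines into one response,
        # so the common plumbing runs once rather than once per atomic call.
        records = list(self._lookup_urir(urir))
        if not records:
            return None
        return self._closest_to_datetime(records, datetime14)
```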
Of course. A chaining interface does not imply the strategy of streaming everything and applying filters after the fact, though. For example, if you have seen the Django ORM or Peewee ORM, they build an object lazily as you add more and more calls to the chain. Then, when the time comes, the whole chain is evaluated, meaning it is translated into SQL, which is then sent to the database.
I do not believe this is necessary. Instead of
But I agree with you that the navigation use case is important and it makes sense to optimize it. I suggest leaving the question of optimization to the person who will implement the particular backend. This boils down to something like this: https://gist.github.com/anatoly-scherbakov/7e17ce45e9aa197d184e83810d914d2a This is a base backend implementation. Every individual backend (Redis/MySQL/file/IPFS DAG/...) can implement every method the way it prefers. In the default case,
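For illustration only (an assumption about the Gist's shape, not the Gist itself), a base backend with generic defaults that individual backends can override might look like:

```python
from typing import Iterator


class BaseBackend:
    """Generic defaults; each concrete backend overrides what it can optimize."""

    def records(self) -> Iterator[dict]:
        # Every backend must at least be able to iterate its raw records.
        raise NotImplementedError

    def by_original_uri(self, uri: str) -> Iterator[dict]:
        # Default: a linear scan, fine for a small local CDXJ file.
        return (r for r in self.records() if r.get('original_uri') == uri)


class RedisBackend(BaseBackend):
    def by_original_uri(self, uri: str) -> Iterator[dict]:
        # Override: ask a Redis index directly instead of scanning everything.
        ...
```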
I am aware of lazy method chain evaluation in ORMs (my first encounter with them was in Ruby, long before Django introduced them) and in some other places such as TensorFlow. Perhaps those places are a good fit for it because the possibilities there are endless and they need to perform look-ahead optimizations before executing composed operations. It would be an unnecessarily big hammer in this case, where we are dealing with just one type of data with an almost fixed schema. Another downside of lazy evaluation is an explicit call to something like

That said, if we are not providing an optional filter lambda and want any filtering on the records to be done outside, then it is perhaps equally good to use the dunder method; no need to define

I don't mind defining atomic methods in the base class and then composing them, but my only concern is that we will end up defining many individual methods in each backend that we will never use. That's why I was thinking we could define them as private methods and only expose those that we actually need. In the IPWB replay we really need two of them, as I described earlier, but if you have a use case for a few others, there is no harm in exposing them as well. One way to resolve it would be to rename and redefine

I liked your rough outline of the class in the Gist, though I have a few concerns that can be resolved later. One such concern is about calling something a cursor which it is not. A cursor in datasets is an iterator of an arbitrary number of items in each block, with a pointer to the next block, which usually ensures that all the data will be traversed in some order without any repetitions. Also, in your
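For reference, a minimal sketch of a cursor in that sense, with illustrative names: blocks of records plus a pointer to the next block, so a full traversal visits every record exactly once:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CursorPage:
    records: List[dict]          # an arbitrary number of items in this block
    next_cursor: Optional[str]   # pointer to the next block; None when exhausted


def traverse_all(backend):
    """Follow next_cursor pointers so every record is visited exactly once."""
    results, cursor = [], None
    while True:
        page = backend.fetch_page(cursor)  # hypothetical backend method
        results.extend(page.records)
        if page.next_cursor is None:
            return results
        cursor = page.next_cursor
```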
Not necessarily. You mentioned iterators before; this is an excellent example of lazy evaluation in Python. Evaluation may happen in

```python
mementos = MementoSet.from_local_file('/home/user/example.cdxj')

for memento in mementos.by_uri('https://example.com/mypage'):
    print(memento)
```

On many backends, this interface also helps implement pagination on the backend side (so as not to fetch all records from the DB at once but fetch them lazily, chunk by chunk, using cursors under the hood). With the same code, you can fetch all the items by

```python
list(mementos.by_uri('https://example.com/mypage'))
```

Obviously, functions like
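To illustrate the "cursors under the hood" point, a hedged sketch of how `by_uri()` might yield records chunk by chunk; `_fetch_chunk()` is a hypothetical helper, not part of any proposed API:

```python
class MementoSet:
    def by_uri(self, uri):
        """Yield matching records lazily, one backend chunk at a time."""
        cursor = None
        while True:
            # _fetch_chunk() returns (records, next_cursor); next_cursor is None at the end.
            records, cursor = self._fetch_chunk(uri, cursor)
            yield from records
            if cursor is None:
                break
```

Wrapping the call in `list()` forces full evaluation, while a plain `for` loop stops fetching as soon as the caller stops iterating.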
This happens because
A user of this library will probably write down something like 1 Jan 1900 as a pre-archive time constant in their code. This means this particular piece of the API induces extra complexity into the code that uses it.
I agree this name is not descriptive enough. I struggled with it when writing that snippet but did not come up with anything better. Do you have a better idea? A
My point exactly.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class MementoSet:
    original_uri: Optional[str] = None

    def by_original_uri(self, uri: Optional[str]) -> 'MementoSet':
        return dataclasses.replace(
            self,
            original_uri=uri,
        )

    def __iter__(self):
        ...
```

The only thing it does is return another `MementoSet`. The evaluation only happens in `__iter__`.
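Purely as an illustration (an assumption, since the actual implementation is not shown here), the elided `__iter__` body for a local-file backend might look like this; `LocalFileMementoSet` and `_read_cdxj_records()` are hypothetical names:

```python
class LocalFileMementoSet(MementoSet):
    def __iter__(self):
        # Evaluate the chain lazily: walk the raw CDXJ records and apply
        # whatever filter values this frozen instance has accumulated.
        for record in self._read_cdxj_records():  # hypothetical raw-record iterator
            if self.original_uri is not None and record['original_uri'] != self.original_uri:
                continue
            yield record
```

A backend that can push the filter down (Redis, SQL, an IPFS DAG index) would override `__iter__` instead of scanning.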
While I listed all the atomic requirements earlier, the actual API, based on the usage requirements, can be much simpler and more efficient with only two public functions:
Note: The value corresponding to each key in the dictionary returned by the second function can be a list of record objects, if we plan to support collisions.
In addition to these, the index API may also provide an overall iterator with a filter argument to iterate over all the matching records for the sake of record pinning or MementoMap generation.
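A hedged sketch of such an iterator; the function name, the `Record` alias, and the dict-shaped records are assumptions rather than a settled API:

```python
from typing import Callable, Iterable, Iterator, Optional

Record = dict  # placeholder for whatever record object the index yields


def iter_matching(records: Iterable[Record],
                  matcher: Optional[Callable[[Record], bool]] = None) -> Iterator[Record]:
    """Iterate over all index records, optionally filtered by a predicate.

    Handy for bulk tasks such as pinning every referenced IPFS object or
    generating a MementoMap, where no per-URI-R lookup is needed.
    """
    for record in records:
        if matcher is None or matcher(record):
            yield record
```

For example, `iter_matching(index_records, lambda r: r.get('mime') == 'text/html')` could drive HTML-only pinning.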
Originally posted by @ibnesayeed in #604 (comment)