Implement index backend classes for CDXJ files from local, web, and IPFS sources #637
@ibnesayeed I suggest continuing the discussion of a backend class interface in this separate issue.
```python
memento_set = MementoSet(...)

for memento in memento_set:
    print(memento)
```

But this is syntactic sugar only, obviously. And it repeats what
What if the backend is an IPLD DAG object, which is quite possible in my opinion? You would have to traverse it every time you do a search, and such optimizations matter. Thus, I would still suggest using more atomic operations like those I mentioned in my original comment.
@anatoly-scherbakov Here are my enumerated comments corresponding to your points:
The problem with atomic operations is the overhead of repeating common routines. If a backend works better with atomic operations, then it is simply a matter of writing a wrapper function that calls those privately defined atomic functions to consolidate the API response.
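As a rough, hedged sketch of that wrapper idea (the class and method names below are illustrative, not a proposed API):

```python
class CDXJFileBackend:
    """Illustrative file-based backend that keeps its atomic routines private."""

    def _lookup_urir(self, urir):
        # Atomic routine: yield every raw CDXJ record for the given URI-R.
        ...

    def _closest_to_datetime(self, records, datetime14):
        # Atomic routine: pick the record whose datetime is closest to the request.
        ...

    def closest_memento(self, urir, datetime14):
        # Public API: compose the private atomic routines into one response,
        # so the common plumbing runs once rather than once per atomic call.
        records = list(self._lookup_urir(urir))
        if not records:
            return None
        return self._closest_to_datetime(records, datetime14)
```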
Of course. A chaining interface does not imply the strategy of streaming everything and applying filters after the fact, though. For example, if you have seen the Django ORM or Peewee ORM, they build an object lazily as you add more and more calls to the chain. Then, when the time comes, the whole chain is evaluated, meaning it is translated into SQL, which is then sent to the database.
I do not believe this is necessary. Instead of
But I agree with you that the navigation use case is important and it makes sense to optimize it. I suggest leaving the question of optimization to the person who will implement the particular backend. This boils down to something like this: https://gist.github.com/anatoly-scherbakov/7e17ce45e9aa197d184e83810d914d2a This is a base backend implementation. Every individual backend (Redis/MySQL/file/IPFS DAG/...) can implement every method the way it prefers. In the default case,
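For illustration only (an assumption about the Gist's shape, not the Gist itself), a base backend with generic defaults that individual backends can override might look like:

```python
from typing import Iterator


class BaseBackend:
    """Generic defaults; each concrete backend overrides what it can optimize."""

    def records(self) -> Iterator[dict]:
        # Every backend must at least be able to iterate its raw records.
        raise NotImplementedError

    def by_original_uri(self, uri: str) -> Iterator[dict]:
        # Default: a linear scan, fine for a small local CDXJ file.
        return (r for r in self.records() if r.get('original_uri') == uri)


class RedisBackend(BaseBackend):
    def by_original_uri(self, uri: str) -> Iterator[dict]:
        # Override: ask a Redis index directly instead of scanning everything.
        ...
```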
I am aware of lazy method chain evaluation in ORMs (my first encounter with them was in Ruby, long before Django introduced them) and in some other places such as TensorFlow. Perhaps those places are a good fit for it because the possibilities there are endless and they need to perform look-ahead optimizations before executing composed operations. It would be an unnecessarily big hammer in this case, where we are dealing with just one type of data with an almost fixed schema. Another downside of lazy evaluation is an explicit call to something like

That said, if we are not providing an optional filter lambda and want any filtering on the records to be done outside, then it is perhaps equally good to use the dunder method; no need to define

I don't mind defining atomic methods in the base class and then composing them, but my only concern is that we will end up defining many individual methods in each backend that we will never use. That's why I was thinking we could define them as private methods and only expose those that we actually need. In the IPWB replay we really need two of them, as I described earlier, but if you have a use case for a few others, there is no harm in exposing them as well. One way to resolve it would be to rename and redefine

I liked your rough outline of the class in the Gist, though I have a few concerns that can be resolved later. One such concern is about calling something a cursor which it is not. A cursor in datasets is an iterator of an arbitrary number of items in each block, with a pointer to the next block, which usually ensures that all the data will be traversed in some order without any repetitions. Also, in your
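For reference, a minimal sketch of a cursor in that sense, with illustrative names: blocks of records plus a pointer to the next block, so a full traversal visits every record exactly once:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CursorPage:
    records: List[dict]          # an arbitrary number of items in this block
    next_cursor: Optional[str]   # pointer to the next block; None when exhausted


def traverse_all(backend):
    """Follow next_cursor pointers so every record is visited exactly once."""
    results, cursor = [], None
    while True:
        page = backend.fetch_page(cursor)  # hypothetical backend method
        results.extend(page.records)
        if page.next_cursor is None:
            return results
        cursor = page.next_cursor
```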
Not necessarily. You mentioned iterators before; this is an excellent example of lazy evaluation in Python. Evaluation may happen in

```python
mementos = MementoSet.from_local_file('/home/user/example.cdxj')

for memento in mementos.by_uri('https://example.com/mypage'):
    print(memento)
```

On many backends, this interface also helps implement pagination on the backend side (so as not to fetch all records from the DB at once but fetch them lazily, chunk by chunk, using cursors under the hood). With the same code, you can fetch all the items by

```python
list(mementos.by_uri('https://example.com/mypage'))
```

Obviously, functions like
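To illustrate the "cursors under the hood" point, a hedged sketch of how `by_uri()` might yield records chunk by chunk; `_fetch_chunk()` is a hypothetical helper, not part of any proposed API:

```python
class MementoSet:
    def by_uri(self, uri):
        """Yield matching records lazily, one backend chunk at a time."""
        cursor = None
        while True:
            # _fetch_chunk() returns (records, next_cursor); next_cursor is None at the end.
            records, cursor = self._fetch_chunk(uri, cursor)
            yield from records
            if cursor is None:
                break
```

Wrapping the call in `list()` forces full evaluation, while a plain `for` loop stops fetching as soon as the caller stops iterating.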
This happens because
A user of this library will probably write down something like 1 Jan 1900 as a pre-archive time constant in their code. This means this particular piece of the API induces extra complexity into the code that uses it.
I agree this name is not descriptive enough. I struggled with it when writing that snippet but did not come up with anything better. Do you have a better idea? A
My point exactly.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class MementoSet:
    original_uri: Optional[str] = None

    def by_original_uri(self, uri: Optional[str]) -> 'MementoSet':
        return dataclasses.replace(
            self,
            original_uri=uri,
        )

    def __iter__(self):
        ...
```

The only thing it does is return another `MementoSet`. The evaluation only happens in `__iter__`.
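Purely as an illustration (an assumption, since the actual implementation is not shown here), the elided `__iter__` body for a local-file backend might look like this; `LocalFileMementoSet` and `_read_cdxj_records()` are hypothetical names:

```python
class LocalFileMementoSet(MementoSet):
    def __iter__(self):
        # Evaluate the chain lazily: walk the raw CDXJ records and apply
        # whatever filter values this frozen instance has accumulated.
        for record in self._read_cdxj_records():  # hypothetical raw-record iterator
            if self.original_uri is not None and record['original_uri'] != self.original_uri:
                continue
            yield record
```

A backend that can push the filter down (Redis, SQL, an IPFS DAG index) would override `__iter__` instead of scanning.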
While I listed all the atomic requirements earlier, the actual API, based on the usage requirements, can be much simpler and more efficient with only two public functions:
Note: The value corresponding to each key in the dictionary returned by the second function can be a list of record objects, if we plan to support collisions.
In addition to these, the index API may also provide an overall iterator with a filter argument to iterate over all the matching records for the sake of record pinning or MementoMap generation.
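A hedged sketch of such an iterator; the function name, the `Record` alias, and the dict-shaped records are assumptions rather than a settled API:

```python
from typing import Callable, Iterable, Iterator, Optional

Record = dict  # placeholder for whatever record object the index yields


def iter_matching(records: Iterable[Record],
                  matcher: Optional[Callable[[Record], bool]] = None) -> Iterator[Record]:
    """Iterate over all index records, optionally filtered by a predicate.

    Handy for bulk tasks such as pinning every referenced IPFS object or
    generating a MementoMap, where no per-URI-R lookup is needed.
    """
    for record in records:
        if matcher is None or matcher(record):
            yield record
```

For example, `iter_matching(index_records, lambda r: r.get('mime') == 'text/html')` could drive HTML-only pinning.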
Originally posted by @ibnesayeed in #604 (comment)