Fuzzy Matching Improvements / POST requests #80

Closed
ikreymer opened this issue Jan 27, 2021 · 9 comments · Fixed by #83
Labels: enhancement (New feature or request)

@ikreymer
Collaborator

ikreymer commented Jan 27, 2021

warc2zim now has a set of fuzzy matching rules (https://github.com/openzim/warc2zim/blob/master/src/warc2zim/main.py#L75)
which are a subset of the larger ruleset in wabac.js (https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js#L8)

(pywb also has rules in python that are mostly aligned with the wabac.js rules https://github.com/webrecorder/pywb/blob/master/pywb/rules.yaml)

Having this many different rule sets is definitely a maintenance concern, so perhaps we should at least try to have warc2zim use the wabac.js rules, since wabac.js is used for replay. These rules could easily be exposed as a JSON file that is loaded in a similar way to sw.js.

Ideally, warc2zim would not need any rules and wabac.js could just read from the ZIM using its existing rules, but this is not possible for two reasons:

  • It is not possible to do a prefix query from the client. One of the ways that wabac.js handles replay is to do a prefix search, usually by query string, e.g. if an archived URL is https://example.com/?A=B&_=1234 and there is a request for
    https://example.com/?A=B&_=1235, it can do a prefix search for https://example.com/? and find the best match.
    This approach allows for finding the best matching URL from multiple possible captures.

Since prefix querying is not possible when loading from ZIM, the alternative is a custom canonicalization option, which wabac.js also supports. We create a 'fake redirect', e.g.:

https://example.fuzzy.replayweb.page/?A=B which redirects to https://example.com/?A=B&_=1234 in the ZIM
Then, when wabac.js encounters https://example.com/?A=B&_=1235, it also maps to https://example.fuzzy.replayweb.page/?A=B, and so is able to do the lookup.

This does work, but it is less flexible than the prefix search, as there is only one possible match.
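To make the canonicalization idea concrete, here is a minimal Python sketch. It is not the wabac.js or warc2zim implementation: the `fuzzy_key` helper, the choice of `_` as the only ignored parameter, and the host rewriting are assumptions that simply mirror the example.com URLs above. The real rules are regex-based and site-specific.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "_" is the only cache-busting parameter to drop.
IGNORED_PARAMS = {"_"}
# Assumption: host suffix taken from the example above.
FUZZY_SUFFIX = ".fuzzy.replayweb.page"


def fuzzy_key(url: str) -> str:
    """Map a URL to a hypothetical canonical 'fuzzy' URL."""
    parts = urlsplit(url)
    # Keep only the query parameters that identify the resource.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # example.com -> example.fuzzy.replayweb.page (illustrative mapping only)
    host = parts.hostname.rsplit(".", 1)[0] + FUZZY_SUFFIX
    return urlunsplit((parts.scheme, host, parts.path, urlencode(kept), ""))


# At conversion time: one redirect entry per canonical key.
archived = "https://example.com/?A=B&_=1234"
redirects = {fuzzy_key(archived): archived}

# At replay time: the request canonicalizes to the same key.
requested = "https://example.com/?A=B&_=1235"
resolved = redirects.get(fuzzy_key(requested))
```

Because the redirect table holds exactly one target per key, a second capture with the same canonical key would overwrite the first, which is the loss of flexibility described above.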

  • A further complication arises when POST request data is needed to look up the URL, which is unfortunately now the case with YouTube (as mentioned in Videos missing webrecorder/browsertrix-crawler#4). Since WARCs contain the POST data,
    wabac.js can take the POST data, especially if it is query-string (form) encoded or JSON, and add it as part of the URL query.

For example, let's say two requests go to the same URL and can only be distinguished by the POST data, which contains {"videoid": "A"}.
A combined URL after reading the request and response can then be: https://example.com/?_=1234&__post_json_data={"videoid": "A"}, and the previous prefix search for https://example.com/? can find the best match.
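The prefix lookup can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the captures are held in a sorted in-memory list, `prefix_matches` and `best_match` are hypothetical helpers, and `difflib.SequenceMatcher` is a stand-in for whatever scoring wabac.js actually applies.

```python
from bisect import bisect_left
from difflib import SequenceMatcher


def prefix_matches(sorted_urls, prefix):
    """Return every URL in a sorted list that starts with the prefix."""
    matches = []
    # All URLs sharing a prefix are contiguous in sorted order.
    for url in sorted_urls[bisect_left(sorted_urls, prefix):]:
        if not url.startswith(prefix):
            break
        matches.append(url)
    return matches


def best_match(sorted_urls, requested):
    """Pick the capture most similar to the requested URL, or None."""
    # Search by everything up to and including the "?".
    prefix = requested.split("?", 1)[0] + "?"
    candidates = prefix_matches(sorted_urls, prefix)
    if not candidates:
        return None
    # Stand-in similarity score; the real ranking is an assumption here.
    return max(
        candidates,
        key=lambda url: SequenceMatcher(None, url, requested).ratio(),
    )


captures = sorted([
    'https://example.com/?_=1234&__post_json_data={"videoid": "A"}',
    'https://example.com/?_=1234&__post_json_data={"videoid": "B"}',
    'https://example.org/?_=1234',
])
found = best_match(
    captures, 'https://example.com/?_=1235&__post_json_data={"videoid": "A"}'
)
```

The point of the sketch is that several captures can share a prefix and the client chooses among them, which a single redirect per key cannot do.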

For ZIMs, we'll need to do more work, though. The POST request must now also be parsed, and a 'fake' redirect URL generated, probably something like https://example.fuzzy.replayweb.page/?__post_json_data={"videoid": "A"}.

This is doable and can be added, but I just wanted to raise awareness, as this means creating (and continuing to maintain) a slightly different fuzzy matching scheme for ZIMs than the one that exists for WARCs in wabac.js. The only possible alternatives, it seems, would be to allow for:

  • providing a prefix API that could return a list of all ZIM records by prefix, eg: https://example.com?
  • storing the WARC request data in case of POSTs

This issue is now coming up with YouTube, as YouTube is making POST requests to the same URL where the only difference is in the POST data (mentioned in webrecorder/browsertrix-crawler#4). The existing POST handling + prefix system means that replayweb.page is able to replay this new YouTube player in WARCs + WACZ, but not ZIMs.

Let me know if this makes sense, or I can elaborate further.

@ikreymer
Collaborator Author

A quick update: with the latest commit, this version now works with the latest Vimeo videos.

Youtube will still require POST request handling, as mentioned above. The simplest solution is to add that, as that's what was done in pywb and wabac.js, and zim replay requires a modified system as discussed above.

What this involves is looking at the WARC request record: if it is a POST request and the content-type is either JSON or form encoding, the POST request data is added to the URL as a query. Then, the special rule is applied to add a fuzzy matching redirect.
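A rough Python sketch of that step is shown here. It is not the code from pywb, wabac.js, or warc2zim: `append_post_data` is a hypothetical helper, and flattening top-level JSON keys into sorted query parameters is an assumption made to keep the example deterministic.

```python
import json
from urllib.parse import parse_qsl, urlencode


def append_post_data(url: str, method: str, content_type: str, body: bytes) -> str:
    """Fold a POST body into the URL query; return the URL unchanged otherwise."""
    if method.upper() != "POST":
        return url
    # Drop parameters such as "; charset=utf-8" from the content type.
    mime = content_type.split(";", 1)[0].strip().lower()
    text = body.decode("utf-8", errors="replace")

    if mime == "application/json":
        try:
            data = json.loads(text)
        except ValueError:
            return url
        if not isinstance(data, dict):
            return url
        # Assumption: only top-level keys are flattened; nested values are
        # re-serialized as JSON strings.
        pairs = [
            (key, value if isinstance(value, str) else json.dumps(value, sort_keys=True))
            for key, value in sorted(data.items())
        ]
    elif mime == "application/x-www-form-urlencoded":
        pairs = parse_qsl(text, keep_blank_values=True)
    else:
        return url

    if not pairs:
        return url
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(pairs)


combined = append_post_data(
    "https://example.com/?_=1234",
    "POST",
    "application/json; charset=utf-8",
    b'{"videoid": "A"}',
)
```

The combined URL would then be passed through the fuzzy rule to produce the redirect entry, so that two POSTs to the same endpoint with different bodies end up under different keys.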

@rgaudin
Member

rgaudin commented Jan 29, 2021

Thanks for all the details. Trying to understand exactly what each option would imply in terms of changes and maintenance. Surely having prefix search in ZIM (in readers actually; libzim does provide this feature) saves duplication and possible bugs, but it might mean changing every ZIM reader in an out-of-spec way…

Will try to look at the code to understand the other option better.

@ikreymer
Collaborator Author

The immediate solution for youtube is to add the POST request mapping. Thinking about it more, there is no way around that, even with prefix support.

Here's an example of the latest conversion function, which now handles both form and JSON data:
https://github.com/webrecorder/wabac.js/blob/main/src/utils.js#L105

Without adding the POST data, we would end up with duplicate URLs like https://www.youtube.com/youtubei/v1/player?key=<some key> for each video, and can only store one in the ZIM.

So we should probably implement something like the above function in warc2zim.

@rgaudin
Member

rgaudin commented Jan 29, 2021

OK, thanks, it's a bit clearer now. Discussed this with @kelson42 and we confirm it's not possible to provide a prefix search API at the moment as this is too big of a concept change for the format/reader.

So we'll go with your other option. Let's discuss on Slack if/how we can split the workload and maybe refactor those pieces so that it's easier to maintain. We should anyway have a better understanding of the replayer parts at play. We've kinda neglected it since it was maintained in wabac.js.

@rgaudin
Member

rgaudin commented Feb 15, 2021

@ikreymer what's the status of the video-replay-fixes branch? Should we merge that in?

@stale

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@Popolechien

Hi @ikreymer any update on this?

@stale

stale bot commented Jan 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jan 3, 2022
@ikreymer
Collaborator Author

ikreymer commented Jan 5, 2022

This is being addressed by #83
