Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove A/ prefix in paths #104

Closed
kelson42 opened this issue Jan 16, 2023 · 9 comments
Closed

Remove A/ prefix in paths #104

kelson42 opened this issue Jan 16, 2023 · 9 comments
Labels
enhancement New feature or request question Further information is requested
Milestone

Comments

@kelson42
Copy link
Contributor

This is not necessary anymore since libzim7 and complexify without need the code.

This implies as well to remove the hack with mime-type "text/html;raw=true" and properly put things in:

  • title index
  • title Xapian index
  • fulltext Xapian index

Maybe something should be fixed in Webac as well

@kelson42 kelson42 added the enhancement New feature or request label Jan 16, 2023
@rgaudin
Copy link
Member

rgaudin commented Jan 16, 2023

This is not necessary anymore since libzim7 and complexify without need the code.

It is still necessary to have two entries per resources: one with the payload, one with the headers. We use A/path and H/path currently.

We could remove A/ but it requires editing the headers-retrieving code so that's why it was left when we switch to new libzim. We should still remove it so URLs are nice and not confusing with old, namespace-era ZIMs.

raw=true was awaiting kiwix/libkiwix#671 We can remove it now.

index manipulation is not a scraper's responsibility. All we can do is specify the FRONT_ARTICLE hints, which we do.
It seems that we rely on mimetype (default) as we don't set it on payload entries. Code is most likely wrong as I can see non-html in random results (which uses front=True)

@rgaudin rgaudin changed the title Remove URL with A (and all the context of namespaces) Remove A/ prefix in paths Jan 16, 2023
@kelson42
Copy link
Contributor Author

This is not necessary anymore since libzim7 and complexify without need the code.

It is still necessary to have two entries per resources: one with the payload, one with the headers. We use A/path and H/path currently.

We could remove A/ but it requires editing the headers-retrieving code so that's why it was left when we switch to new libzim. We should still remove it so URLs are nice and not confusing with old, namespace-era ZIMs.

Thanks for the clarification. Agree.

raw=true was awaiting kiwix/libkiwix#671 We can remove it now.

Would be great if not too complicated.

index manipulation is not a scraper's responsibility. All we can do is specify the FRONT_ARTICLE hints, which we do. It seems that we rely on mimetype (default) as we don't set it on payload entries. Code is most likely wrong as I can see non-html in random results (which uses front=True)

Default (newest) libzim behaviour should allow to do this properly based indeed on the indication if we deal with a front-article or not. But AFAIK this is not done so far in warc2zim, because at the time we made the warc2zim it was not possible.

@rgaudin
Copy link
Member

rgaudin commented Jan 16, 2023

Default (newest) libzim behaviour should allow to do this properly based indeed on the indication if we deal with a front-article or not. But AFAIK this is not done so far in warc2zim, because at the time we made the warc2zim it was not possible.

As mentioned, it's done already. We are specifying FRONT_ARTICLE=False for our static files and for the Headers Entries. For the rest of the Entries (everything), we don't specify hints expecting the default behavior to apply (if HTML > is_front) but there's probably a bug around that as even non-html are returned in random.

@Jaifroid
Copy link

Jaifroid commented Feb 1, 2023

We had a long discussion of the issue of removing these pseudo-namespaces or prefixes in #99. At the time, it was decided to keep both C/A/ and C/H/ I think partly for symmetry, although I had argued and had made some wrong assumptions in my original code that we would be switching A/ to C/ (instead of C/A/) when switching warc2zim from the Type 0 to Type 1 ZIM schema. I since added support for C/A of course.

I'm happy for the /A to be removed, if you decide to. Just be aware that I'll need to test a sample ZIM and will probably have to detect and set the correct prefix to use in the backend of KJSWL, before these are released to the general public.

I'll be very glad to see the end of text/html;raw=true. This caused me a LOT of headaches, as several of our regexes were testing for ^text/html$ in the backend and failing. I had no idea this was a hack introduced into these ZIMs.

@rgaudin
Copy link
Member

rgaudin commented Feb 1, 2023

@Jaifroid we've decided we don't want to get into this at the moment. One important piece of code actually depends on it and it would be really fragile not to. The following piece of JS is injected before </head> in all A/xx text-html entries.

<script>
// No SW Fallback check: hit if loaded via direct link (no SW installed or not supported)
if (window.top === window && !window._WBWombat) {
if (!navigator.serviceWorker || !navigator.serviceWorker.controller) {
// finds '/A/' followed by a domain name with a .
var inx = window.location.href.search(/[/]A[/][^/]+[.]/);
var prefix = window.location.href.slice(0, inx);
prefix += "/A/index.html#redirect=" + encodeURIComponent(window.location.href.slice(inx + 3));
setTimeout(() => {
window.location.href = prefix;
}, 100);
// SW installed but not available (probably hard-refresh): just refresh again
} else if (navigator.serviceWorker && navigator.serviceWorker.controller) {
setTimeout(() => {
window.location.reload();
}, 100);
}
}
</script>

Also, yes, we are dropping the ;raw=true. Will be in the next release (due Friday).

@Jaifroid
Copy link

Jaifroid commented Feb 1, 2023

@Jaifroid we've decided we don't want to get into this at the moment.

OK, thanks for the update. Maybe, in due course, we can think of other solutions than injecting that script into every article.

Thanks for the information about ;raw=true.

@rgaudin
Copy link
Member

rgaudin commented Feb 2, 2023

Yes, that's part of the “where should the SW be?” discussion

@kelson42 kelson42 added the question Further information is requested label Apr 24, 2023
@kelson42 kelson42 added this to the 1.6.0 milestone Apr 24, 2023
@kelson42
Copy link
Contributor Author

@rgaudin @mgautierfr Can we update this ticket please considering the Zimit2. I kind of think this should be done for Zimit2 but I don't have all the rationals.

@rgaudin
Copy link
Member

rgaudin commented Nov 25, 2023

I'd say we close this as wontfix since we are using different paths for zimit2 which makes it less relevant to this ticket's history which was SW related

@kelson42 kelson42 closed this as not planned Won't fix, can't repro, duplicate, stale Nov 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants