Out of scope homepage redirect #138

Open
rgaudin opened this issue Jun 23, 2022 · 7 comments

@rgaudin
Member

rgaudin commented Jun 23, 2022

Zimit 1.x, following #76, had a mechanism to ensure that, should the passed URL redirect to an out-of-scope domain, the process would halt early, as it would otherwise result in a barely usable ZIM (homepage not in the ZIM).

With improvements to browsertrix-crawler, --scope has been removed in favor of a --scopeType that can be:

  • page: a single URL
  • page-spa: same as page, plus any fragment link to that URL
  • prefix (default): any URL that shares the same prefix up to the last /
  • host: any URL that shares the same prefix up to the first /
  • domain: any URL on the same domain or on any of its subdomains (matched against the non-www. form if www. was present). ⚠️ uses the URL port on every domain.
  • any: anything
  • custom: uses --include and --exclude (regexp)

Note that, except for page, which is a single URL, the others automatically include both http and https variants of matches.
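To make the scope types listed above concrete, here is a rough Python sketch of how they could be read. It is an approximation written from the descriptions in this comment only, not a port of the browsertrix-crawler implementation; the function name `in_scope` and its exact matching rules are assumptions, and the custom type is left out.

```python
from urllib.parse import urlsplit


def in_scope(seed, candidate, scope_type="prefix"):
    """Approximate the scope types described above (hypothetical helper)."""
    s, c = urlsplit(seed), urlsplit(candidate)

    if scope_type == "any":
        return True
    if scope_type == "page":
        # Single URL: exact match only.
        return candidate == seed
    if scope_type == "page-spa":
        # Same URL, ignoring the fragment.
        return c._replace(fragment="") == s._replace(fragment="")

    # The remaining types accept both http and https variants.
    if c.scheme not in ("http", "https"):
        return False

    if scope_type == "prefix":
        seed_path = s.path or "/"
        # Keep everything up to and including the last "/".
        prefix = s.netloc.lower() + seed_path[: seed_path.rfind("/") + 1]
        return (c.netloc.lower() + (c.path or "/")).startswith(prefix)
    if scope_type == "host":
        return c.netloc.lower() == s.netloc.lower()
    if scope_type == "domain":
        base = s.hostname or ""
        if base.startswith("www."):
            base = base[4:]
        host = c.hostname or ""
        same_domain = host == base or host.endswith("." + base)
        # Assumption: the seed's port must match on every domain.
        return same_domain and c.port == s.port

    raise ValueError(f"scope type not covered by this sketch: {scope_type}")
```

With this reading, a seed of https://example.com/docs/index.html under prefix accepts http://example.com/docs/a.html but rejects https://example.com/blog/.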

There's no documentation, but here's the implementation.


With this new, complex scope mechanism, we had to remove our feature that checked whether the redirected-to homepage is out of scope, as keeping it would require us to duplicate that whole scope code in zimit. Instead, a warning is displayed if the homepage is a redirection.

Question: is that enough? Do we want a different behavior? Should we duplicate that whole scope matching logic in order to fail early when the target homepage is out of scope?

@stale

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Sep 21, 2022
@Jaifroid

Jaifroid commented Feb 2, 2023

What are the practical consequences? That those creating Zimfarm recipes or running Zimit will have to define the scope carefully? Are we getting scrapes that are too small (or too big) as a result of this change? Is this at all related to the appearance in dev of ZIM files that are too small and contain little more than a landing page?

@stale stale bot removed the stale label Feb 2, 2023
@rgaudin
Member Author

rgaudin commented Feb 6, 2023

No existing recipe would be affected because they were passing with the previous check so they don't have a redirect to an out-of-scope URL.

I imagine that recipes/requests with such a redirect would complete successfully within seconds and create a tiny ZIM, but we should test the scenario to be sure.

@kelson42 kelson42 added this to the 2.0.0 milestone Apr 24, 2023
@kelson42
Contributor

kelson42 commented Nov 4, 2023

I have difficulty judging the level of impact of this ticket/bug/problem. Can someone help me?

@rgaudin
Member Author

rgaudin commented Nov 4, 2023

I don't think I can be more clear than the explanation above.
Maybe reading the source code would help?

zimit/zimit.py

Lines 470 to 490 in c98e450

if actual_url.geturl() != url.geturl():
    if scope in (None, "any"):
        return actual_url.geturl()
    print(
        "[WARN] Your URL ({0}) redirects to {1} which {2} on same "
        "first-level domain. Depending on your scopeType ({3}), "
        "your homepage might be out-of-scope. Please check!".format(
            url.geturl(),
            actual_url.geturl(),
            "is"
            if get_fld(url.geturl()) == get_fld(actual_url.geturl())
            else "is not",
            scope,
        )
    )
    return actual_url.geturl()
return url.geturl()
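
For anyone who wants to try the behavior of the excerpt above without running zimit, here is a self-contained sketch of the same decision. It is not the code from zimit.py: the function name `check_redirect`, the plain string arguments (the excerpt works on parsed URL objects), and the `naive_fld` stand-in for `get_fld` are all assumptions made for illustration, and `naive_fld` is wrong for suffixes such as .co.uk.

```python
from urllib.parse import urlsplit


def naive_fld(url):
    """Hypothetical stand-in for get_fld: keeps the last two host labels."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def check_redirect(url, actual_url, scope, get_fld=naive_fld):
    """Return (url_to_crawl, warning_or_None), mirroring the excerpt's branches."""
    if actual_url == url:
        # No redirect: keep the requested URL.
        return url, None
    if scope in (None, "any"):
        # Everything is in scope, so the redirect target is silently accepted.
        return actual_url, None
    same_fld = get_fld(url) == get_fld(actual_url)
    warning = (
        "[WARN] Your URL ({0}) redirects to {1} which {2} on same "
        "first-level domain. Depending on your scopeType ({3}), "
        "your homepage might be out-of-scope. Please check!".format(
            url, actual_url, "is" if same_fld else "is not", scope
        )
    )
    # The crawl continues with the redirect target; nothing fails here.
    return actual_url, warning


if __name__ == "__main__":
    target, warning = check_redirect(
        "https://example.com/", "https://www.example.org/home", "prefix"
    )
    print(target)
    print(warning)
```

The point to notice is that every branch returns a URL: an out-of-scope redirect only produces a log line, which is the behavior this issue questions.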

@kelson42
Contributor

kelson42 commented Nov 5, 2023

@rgaudin Sorry, my question was not specific enough. I mean the quantitative impact. Do we have a lot of scrapes impacted, or only a few each year?

@rgaudin
Member Author

rgaudin commented Nov 6, 2023

No idea, and we can't really know: this information is just a warning in the logs.
I don't think it would be many, as users (at least ours) tend to copy-paste the URL from a running browser, so redirections are most likely already resolved.
