Out of scope homepage redirect #138

rgaudin · 2022-06-23T09:32:47Z

Zimit 1.x, following #76, had a mechanism to ensure that, should the passed URL redirect to an out-of-scope domain, the process would halt early, as it would otherwise result in a barely usable ZIM (homepage not in ZIM).

With improvements to browsertrix-crawler, --scope has been removed in favor of a --scopeType that can be:

page: single URL
page-spa: same as page, plus any fragment link to that URL
prefix (default): any URL that shares the same prefix up to the last /
host: any URL that shares the same prefix up to the first /
domain: any URL on the same domain or on any subdomain (matched against the non-www. domain if www. was present). ⚠️ uses the URL port on every domain.
any: anything
custom: uses --include and --exclude (regexp)

Note that except for page that is a single URL, others automatically include both http and https variants of matches.
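To give an idea of what duplicating the scope logic would involve, here is a minimal Python sketch that follows the descriptions in the list above only. It is not browsertrix-crawler's implementation: `in_scope` and its parameters are hypothetical names, the `custom` type is left out, and the port handling for `domain` is one reading of the warning above.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str, scope_type: str = "prefix") -> bool:
    """Approximate scopeType matching, based on the descriptions above only."""
    s, c = urlsplit(seed), urlsplit(candidate)

    if scope_type == "any":
        return True
    if scope_type == "page":
        # Single URL: exact match, no http/https variants.
        return candidate == seed

    # All other types accept both http and https variants.
    if c.scheme not in ("http", "https"):
        return False

    if scope_type == "page-spa":
        # Same URL, ignoring the fragment.
        return (c.netloc, c.path, c.query) == (s.netloc, s.path, s.query)
    if scope_type == "prefix":
        # Everything up to (and including) the last "/" of the seed path.
        prefix = s.path[: s.path.rfind("/") + 1]
        return c.netloc == s.netloc and c.path.startswith(prefix)
    if scope_type == "host":
        return c.netloc == s.netloc
    if scope_type == "domain":
        # Assumption: strip a leading "www." and require the same port.
        base = s.hostname or ""
        if base.startswith("www."):
            base = base[4:]
        host = c.hostname or ""
        return c.port == s.port and (host == base or host.endswith("." + base))

    raise ValueError(f"unsupported scope type in this sketch: {scope_type}")
```

With such a helper, the removed early check would amount to calling `in_scope(url, redirected_url, scope_type)` on the homepage and halting when it returns `False`.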

There's no documentation, but here's the implementation.

With this new, complex scope mechanism, we had to remove our feature that checked if the redirected-to homepage is out-of-scope as it would require us to duplicate that whole scope code in zimit. Instead, a warning is displayed if the homepage is a redirection.

Question: is that enough? Do we want a different behavior? Should we duplicate that whole scope matching logic to fail early should target homepage be out-of-scope?


stale · 2022-09-21T03:09:19Z

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Jaifroid · 2023-02-02T17:09:23Z

What are the practical consequences? That those creating Zimfarm recipes or running Zimit will have to define the scope carefully? Are we getting scrapes that are too small (or too big) as a result of this change? Is this at all related to the appearance of ZIM files in dev that are too small and contain little more than a landing page?

rgaudin · 2023-02-06T11:02:50Z

No existing recipe would be affected because they were passing with the previous check so they don't have a redirect to an out-of-scope URL.

I imagine that recipes/requests with such a redirect would complete successfully within seconds and create a tiny ZIM, but we should test the scenario to be sure.

kelson42 · 2023-11-04T17:33:37Z

I have difficulty judging the level of impact of this ticket/bug/problem. Can someone help me?

rgaudin · 2023-11-04T19:05:37Z

I don't think I can be more clear than the explanation above.
Maybe reading the source code would help?

zimit/zimit.py

Lines 470 to 490 in c98e450

    if actual_url.geturl() != url.geturl():
        if scope in (None, "any"):
            return actual_url.geturl()
        print(
            "[WARN] Your URL ({0}) redirects to {1} which {2} on same "
            "first-level domain. Depending on your scopeType ({3}), "
            "your homepage might be out-of-scope. Please check!".format(
                url.geturl(),
                actual_url.geturl(),
                "is"
                if get_fld(url.geturl()) == get_fld(actual_url.geturl())
                else "is not",
                scope,
            )
        )
        return actual_url.geturl()
    return url.geturl()

kelson42 · 2023-11-05T16:40:10Z

@rgaudin Sorry, my question was not specific enough. I mean the quantitative impact. Do we have a lot of scrapes impacted or only a few each year?

rgaudin · 2023-11-06T13:24:10Z

No idea and we can't really know: this information is just a warning in the logs.
I don't think it would be many, as users (at least ours) tend to copy-paste URLs from a running browser, so redirections are most likely already resolved.
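Since the only trace is the warning in the logs, one rough way to estimate the impact would be to scan the available logs for that message. The sketch below assumes the message format of the `print()` call quoted above; `find_redirect_warnings` is a hypothetical helper, and how the log text is collected is left open.

```python
import re

# Matches the "[WARN] Your URL (...) redirects to ..." line printed by zimit.
REDIRECT_WARNING = re.compile(
    r"\[WARN\] Your URL \((?P<url>.+?)\) redirects to (?P<target>.+?) "
    r"which (?P<same_fld>is not|is) on same first-level domain"
)


def find_redirect_warnings(log_text: str) -> list:
    """Return (url, redirect target, same first-level domain?) tuples."""
    return [
        (m["url"], m["target"], m["same_fld"] == "is")
        for m in REDIRECT_WARNING.finditer(log_text)
    ]
```

Entries whose last field is `False` redirect to another first-level domain and are the most likely to have an out-of-scope homepage.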

rgaudin added the question label Jun 23, 2022

stale bot added the stale label Sep 21, 2022

stale bot removed the stale label Feb 2, 2023

kelson42 added this to the 2.0.0 milestone Apr 24, 2023

rgaudin mentioned this issue Nov 4, 2023

Create documentation for content editors about zimit offliner openzim/zimfarm#860

Open
