Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: static-checks: Try multiple user agents #8592

Conversation

jodh-intel
Copy link
Contributor

@jodh-intel jodh-intel commented Dec 6, 2023

Make the URL checker cycle through a list of user agent values until we
hit one the remote server is happy with.

This is required since, unfortunately, we really, really want to check
these URLs, but some sites block clients based on their User-Agent
(UA) request header value. And of course, each site is different and can
change its behaviour at any time.

Our strategy therefore is to try various UA's until we find one the
server accepts:

  • No explicit UA (use curl's default)
  • Explicitly no UA.
  • A blank UA.
  • Partial UA values for various CLI tools.
  • Partial UA values for various console web browsers.
  • Partial UA for Emacs's built-in browser.
  • The existing UA which is used as a "last ditch" attempt where the UA implies multiple platforms and browser.

Notes:

  • The "partial UA" values specify specify the UA "product" but not the
    UA "product version": we specify foo and not foo/1.2.3). We do
    this since most sites tested appear to not care about the version.
    This is as expected given that the version is strictly optional (see [*]).

  • We now treat URLs that the server reports as HTTP 401, HTTP 402 or
    HTTP 403 as valid. See the comments in the code.

  • We now log all errors and display a summary on error in addition to
    the simple list of the URLs we believe to be invalid. This should make
    future debugging simpler.

[*] - https://www.rfc-editor.org/rfc/rfc9110#section-10.1.5

Fixes: #8553.

Signed-off-by: James O. D. Hunt james.o.hunt@intel.com

@katacontainersbot katacontainersbot added the size/large Task of significant size label Dec 6, 2023
@jodh-intel
Copy link
Contributor Author

When I run this locally, all Kata doc URLs are reported as valid apart from https://developer.amd.com/sev/ which returns HTTP 301 ("Moved Permanently"). That should be fine since we request following redirects and in fact curl handles the URL, however it returns 92 ("Stream error in HTTP/2 framing layer"). What do we do about this? 😕

Interestingly, wget does manage to check that URL, but only after I specified a blank UA for it so it appears that site is also filtering on the UA.

I really hope we don't need to cycle between curl and wget 😭

Copy link
Contributor

@cmaf cmaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jodh-intel thanks for fixing this! I confirmed it works locally with the exception of https://developer.amd.com/sev/.

I am going to port 2 commits to the tests repo, the curl and multiple user agents one (d20822c and 404162c).

Note: We will not be maintaining 2 separate static checks scripts beyond these changes because it will be running from GHA after #8595. However, I am porting them in favor of getting #6826 merged and in case testing takes longer than expected.

@cmaf cmaf added the safe-to-test Add to PR after manually reviewing to allow certain extra checks to run label Dec 6, 2023
tests/static-checks.sh Outdated Show resolved Hide resolved
tests/static-checks.sh Outdated Show resolved Hide resolved
tests/static-checks.sh Outdated Show resolved Hide resolved
@jodh-intel jodh-intel force-pushed the static-checks-try-multiple-user-agents branch from d20822c to f2ee761 Compare December 8, 2023 10:37
@jodh-intel
Copy link
Contributor Author

@amshinde, @gkurz - Branch updated to treat 401 and 402 as valid, but retry with 403. ptal! ;)

@jodh-intel
Copy link
Contributor Author

/test

tests/static-checks.sh Outdated Show resolved Hide resolved
Only attempt to build the markdown checker if it doesn't already exist.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Break up a long line as little to make it easier to read.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Check that the `check_url()` parameters have a value.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Declare and then define a couple of variables separately.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Split the call to `curl` in the URL checker out into a new
`run_url_check_cmd()` function to make `check_url()` slightly clearer.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove some extraneous whitespace.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Removed the Azure Portal URL (https://portal.azure.com) since this
causes problems with our static checks script: that URL returns HTTP 403
("Forbidden") when queried using command-line tools like `curl(1)`,
which is used by the static check script.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Make the URL checker cycle through a list of user agent values until we
hit one the remote server is happy with.

This is required since, unfortunately, we really, really want to check
these URLs, but some sites block clients based on their `User-Agent`
(UA) request header value. And of course, each site is different and can
change its behaviour at any time.

Our strategy therefore is to try various UA's until we find one the
server accepts:

- No explicit UA (use `curl`'s default)
- Explicitly no UA.
- A blank UA.
- Partial UA values for various CLI tools.
- Partial UA values for various console web browsers.
- Partial UA for Emacs's built-in browser.
- The existing UA which is used as a "last ditch" attempt where the UA implies multiple platforms and browser.

> **Notes:**
>
> - The "partial UA" values specify specify the UA "product" but not the
>   UA "product version": we specify `foo` and not `foo/1.2.3`). We do
>   this since most sites tested appear to not care about the version.
>   This is as expected given that the version is strictly optional (see `[*]`).
>
> - We now log all errors and display an error summary if none of the UAs
>   worked, in addition to the simple list of the URLs we believe to be
>   invalid. This should make future debugging simpler.

`[*]` - https://www.rfc-editor.org/rfc/rfc9110#section-10.1.5

Fixes: kata-containers#8553.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
@jodh-intel jodh-intel force-pushed the static-checks-try-multiple-user-agents branch from f2ee761 to 5d085a3 Compare December 8, 2023 18:03
@jodh-intel jodh-intel requested a review from a team as a code owner December 8, 2023 18:03
@jodh-intel
Copy link
Contributor Author

@gkurz, @amshinde - branch updated (and simplified) - ptal! ;)

/test

Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

branch updated (and simplified) - ptal! ;)

Simpler indeed. :-) Thanks @jodh-intel !

@jodh-intel jodh-intel merged commit 2a35541 into kata-containers:main Dec 11, 2023
164 of 174 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test safe-to-test Add to PR after manually reviewing to allow certain extra checks to run size/large Task of significant size
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Make URL checker use a list of user agents
5 participants