Change User-Agent to common crawler format #1224

Closed
bwbroersma opened this issue Jan 11, 2024 · 9 comments · Fixed by #1257

Comments

@bwbroersma
Collaborator

Currently internetnl/1.0 is used. This is not ideal: it's not a common format, and since the move to Docker others can easily spin up their own instance, so the UA should at least contain a correct link for contacting the server/person doing the crawling.

As mentioned before in #363 (comment) and #1042 (comment), I would prefer to change this to a common bot user-agent format, like the ones listed on MDN.

The more standardized and accepted User-Agent is Mozilla/5.0 (compatible; SoftwareName/0.1.2; +https://internet.nl/) where the last + part could be the deployed instance (for a protected batch server another public page could be used, plus maybe include some #user-id-token, I've seen monitoring systems that do this). The + part should be configurable, but could default to the current instance domain variable already used.

So I suggest for us:
Mozilla/5.0 (compatible; internetnl/1.8.3; +https://internet.nl/about/)
Ideally we would even set up a 'bot' page like http://www.google.com/bot.html.
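
A minimal sketch of how such a string could be assembled from a version and a configurable contact URL. The function name `build_user_agent` and its parameters are hypothetical and not taken from the internet.nl codebase; the example values are the ones suggested above.

```python
def build_user_agent(version: str, contact_url: str, product: str = "internetnl") -> str:
    """Return a crawler-style User-Agent: Mozilla/5.0 (compatible; product/version; +url).

    Hypothetical helper for illustration only.
    """
    for name, value in (("product", product), ("version", version), ("contact_url", contact_url)):
        # Parentheses would break the surrounding comment, ";" is used as the
        # separator inside it, and whitespace/control characters are rejected
        # to keep each part a single unambiguous word.
        if not value or any(ch in "();" or ch.isspace() for ch in value):
            raise ValueError(f"invalid {name}: {value!r}")
    return f"Mozilla/5.0 (compatible; {product}/{version}; +{contact_url})"


print(build_user_agent("1.8.3", "https://internet.nl/about/"))
```

A self-hosted instance would pass its own public contact page as `contact_url` instead of the internet.nl one.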


The RFC 1945 - 10.5 User-Agent is not strict:

User-Agent     = "User-Agent" ":" 1*( product | comment )

3.7 Product Tokens defines:

product         = token ["/" product-version]
product-version = token

2.2 Basic Rules defines the comment as:

comment        = "(" *( ctext | comment ) ")"
ctext          = <any TEXT excluding "(" and ")">

A string of text is parsed as a single word if it is quoted using double-quote marks.

quoted-string  = ( <"> *(qdtext) <"> )

qdtext         = <any CHAR except <"> and CTLs,
                 but including LWS>
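
To make the grammar above concrete, here is a rough checker for `1*( product | comment )`. It is a simplified reading of RFC 1945, not a full HTTP header parser: it assumes items may be separated by spaces or tabs, uses the RFC's tspecials list to decide what a token is, and the name `is_valid_user_agent` is made up for this sketch.

```python
# tspecials from RFC 1945 section 2.2; a token is 1*<any CHAR except CTLs or tspecials>
TSPECIALS = set('()<>@,;:\\"/[]?={} \t')


def _skip_token(s: str, i: int) -> int:
    """Return the index just past the token starting at i (equals i if there is none)."""
    j = i
    while j < len(s) and 32 < ord(s[j]) < 127 and s[j] not in TSPECIALS:
        j += 1
    return j


def _skip_comment(s: str, i: int) -> int:
    """s[i] is '('. Return the index past the matching ')', or -1 if unbalanced."""
    depth = 0
    while i < len(s):
        if s[i] == "(":
            depth += 1  # comments may nest
        elif s[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1


def is_valid_user_agent(value: str) -> bool:
    """Check value against User-Agent = 1*( product | comment ), simplified."""
    # Control characters (other than tab as whitespace) are never allowed.
    if any((ord(c) < 32 and c != "\t") or ord(c) == 127 for c in value):
        return False
    i, items, n = 0, 0, len(value)
    while i < n:
        if value[i] in " \t":
            i += 1
            continue
        if value[i] == "(":
            i = _skip_comment(value, i)
            if i < 0:
                return False
        else:
            j = _skip_token(value, i)  # product = token ["/" product-version]
            if j == i:
                return False
            i = j
            if i < n and value[i] == "/":
                j = _skip_token(value, i + 1)
                if j == i + 1:
                    return False
                i = j
        items += 1
    return items > 0


assert is_valid_user_agent("internetnl/1.0")
assert is_valid_user_agent("Mozilla/5.0 (compatible; internetnl/1.8.3; +https://internet.nl/about/)")
```

Both the current and the proposed value pass, because the URL with its `:` and `/` characters sits inside a comment, where everything except parentheses is allowed.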
@bwbroersma bwbroersma added infrastructure-docker discuss Requires further team discussion and decisions labels Jan 11, 2024
@mdavids

mdavids commented Jan 12, 2024

I agree. Relevant reading, perhaps, here:
https://en.wikipedia.org/wiki/User-Agent_header#User_agent_spoofing
This may mean that less-popular browsers are not sent complex content (even though they might be able to deal with it correctly) or, in extreme cases, refused all content.

@bwbroersma
Collaborator Author

bwbroersma commented Jan 12, 2024

If ignoring this:

It's just these two locations:

headers["User-Agent"] = "internetnl/1.0"

conn.putheader("User-Agent", "internetnl/1.0")
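
One way to keep these two call styles in sync is a single shared constant. The sketch below is illustrative only: `USER_AGENT`, `default_headers` and `put_user_agent` are hypothetical names, and the value is the one suggested earlier in this issue.

```python
import http.client
from typing import Dict, Optional

# Hypothetical central constant; in a Django project this could live in settings.py.
USER_AGENT = "Mozilla/5.0 (compatible; internetnl/1.8.3; +https://internet.nl/about/)"


def default_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Headers dict for callers that set headers["User-Agent"] themselves."""
    headers = dict(extra or {})
    headers["User-Agent"] = USER_AGENT  # always wins over a value passed in extra
    return headers


def put_user_agent(conn: http.client.HTTPConnection) -> None:
    """For callers that build the request manually with conn.putheader(...)."""
    conn.putheader("User-Agent", USER_AGENT)
```

With this, changing the User-Agent format later means touching one line instead of two call sites.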

Remaining questions are:

  • What version can safely be used? VERSION is in x.y.z format for releases, but PRs and test versions are different:
    VERSION = get_version(version_scheme="release-branch-semver")
  • What should forks do? Add an extra product specifier, or just a suffix, or just change internetnl altogether?
  • Any thoughts / pros and cons about including a #user-id-token in the batch calls?

@baknu
Contributor

baknu commented Jan 17, 2024

Latest RFC on User-Agent header: https://www.rfc-editor.org/rfc/rfc9110.html#name-user-agent

@baknu
Contributor

baknu commented Jan 17, 2024

Question: What User-Agent header are other test tools using?

@bwbroersma
Collaborator Author

bwbroersma commented Jan 17, 2024

| Tool | User-Agent |
| --- | --- |
| W3C Markup Validation Service | W3C_Validator/1.3 http://validator.w3.org/services (IPv6) and Validator.nu/LV http://validator.w3.org/services (IPv4) |
| W3C CSS Validation Service | Jigsaw/2.3.0 W3C_CSS_Validator_JFouffa/2.0 (See <http://validator.w3.org/services>) |
| SSL Labs - Test SSL | SSL Labs (https://www.ssllabs.com/about/assessment.html). Plus query parameter: ?SSL_Labs_Renegotiation_Test=User_Agent_May_Not_Show |
| Security Headers | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 SecurityHeaders. Plus Referer: https://securityheaders.com/ |
| Hardenize | Hardenize (https://www.hardenize.com) and Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Hardenize |

@baknu
Contributor

baknu commented Jan 17, 2024

Thanks! See also: https://udger.com/resources/ua-list/crawlers

@mdavids

mdavids commented Jan 18, 2024

Oh, cool, we're on that list: https://udger.com/resources/ua-list/bot-detail?bot=internetnl#id131933

@bwbroersma
Collaborator Author

A governmental agency has asked for priority on this issue: currently the IPv4/IPv6 comparison fails because the User-Agent internetnl/1.0 results in a 401, which counts as a failure because of #1226.
Funny thing is, I always use Mozilla/5.0 when requests sent without a User-Agent are blocked, and this magic Mozilla/5.0 also works on this 'hardened' system.

@bwbroersma bwbroersma added this to the v1.9 milestone Jan 26, 2024
@bwbroersma
Collaborator Author

bwbroersma commented Jan 26, 2024

For the record: I'm proposing to put internetnl and the version string in the comment field only.

Decided with @baknu:

  • move UA to central settings.py
  • use plain semver, so just v1.8.1.dev26-g7426760 (the g prefix marks the git hash, 7426760, and dev26 is the number of commits ahead; why in this case it's v1.8.1 and not v1.8.3 I don't know)
  • use the instance URL; if branded, use +https://internet.nl/about/.

Note again, internet.nl does not always send a User-Agent, which is a separate bug:

bwbroersma added a commit to bwbroersma/Internet.nl that referenced this issue Jan 31, 2024
bwbroersma added a commit to bwbroersma/Internet.nl that referenced this issue Jan 31, 2024
@bwbroersma bwbroersma removed the discuss Requires further team discussion and decisions label Jan 31, 2024
mxsasha pushed a commit that referenced this issue Feb 20, 2024
mxsasha pushed a commit that referenced this issue Mar 26, 2024
mxsasha pushed a commit that referenced this issue Mar 26, 2024
mxsasha pushed a commit that referenced this issue Mar 26, 2024