Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit check-url behavior and provide User-Agent a custom default value #229

Merged
merged 4 commits into from
Oct 26, 2023

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Oct 23, 2023

Fix #228
Fix #227
Fix #230

Changes:

  • User-Agent now has a default value (based on my Firefox on Mac/OS, not sure this is the best one)
  • The same UA is passed to Python requests to check URL and to browsertrix crawler
    • This UA is based on the concatenation of CLI flags --userAgent (always has a value due to default) + optional --userAgentSuffix + optional --adminEmail ; these 2 last values are automatically prefixed by a space, no need to provide one (or if someone does, he will have two spaces which is not an issue)
    • Except when a --mobileDevice is passed, then UA is only used for Python requests
    • If both --mobileDevice and --userAgent are passed, a warning is displayed
    • This UA is always used, even when a --mobileDevice is passed
  • In check_url:
    • we do a GET instead of a HEAD
    • we close the connection

NOTA: this won't be totally functional until webrecorder/browsertrix-crawler#419 is fixed (once it is fixed and used by us, there is no change needed except adapt the test case which has been made a little too permissive due to this bug)

@benoit74 benoit74 self-assigned this Oct 23, 2023
@benoit74 benoit74 force-pushed the user_agent branch 3 times, most recently from 8484df1 to 8907a71 Compare October 23, 2023 11:40
@benoit74 benoit74 changed the title User-Agent has a default and is used for check_url Revisit check-url behavior and provide User-Agent a custom default value Oct 23, 2023
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but I think custom UA should apply to the crawler even when using --mobileDevice. I don't see a use case for specifying it only for our python check.
Is that a crawler limitation?

@benoit74
Copy link
Collaborator Author

No, it's not a crawler limitation, it is something I even had to "force" somehow with specific code.
Since it is covered by a test case, I thought that the most important thing when passing a --mobileDevice is to set the proper UA.
But you know the business way better than I do, so I do not mind to always use our custom UA even when a mobileDevice is set.
From my culture, aside the UA do you know what else is changed when passing a --mobileDevice to Pupetteer ?

@rgaudin
Copy link
Member

rgaudin commented Oct 23, 2023

Mostly the apparent device sizes (resolution).

That's used mostly to crawl mobile versions of website or to have the appropriate media queries when using it for screenshots.
Modern websites also serve different sized pictures based on media queries

@benoit74
Copy link
Collaborator Author

I have adapted the code to always use the UA, even when mobile device is set. I updated the first comment.
This change allowed me to discover a bug in crawler, see NOTA in first comment (probably not a big concern, at least we can proceed without this fix).

@benoit74 benoit74 requested a review from rgaudin October 24, 2023 18:49
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ; please split the long line

test/integration.py Outdated Show resolved Hide resolved
@benoit74
Copy link
Collaborator Author

I split the line, I let you merge if everything is OK.

@rgaudin rgaudin merged commit e0a4d3f into main Oct 26, 2023
2 checks passed
@rgaudin rgaudin deleted the user_agent branch October 26, 2023 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants