Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate rewritten srcset image urls; handle urls with commas #269

Merged
merged 6 commits into from Jan 5, 2018

Conversation

rebeccacremona
Copy link
Contributor

@rebeccacremona rebeccacremona commented Dec 18, 2017

Part 1: This PR offers a tiny tweak, the result of investigating why certain images at https://lil.law.harvard.edu, like the logo, aren't currently being captured/played back.

It turns out that websites using srcset to implement responsive images normally define a list of sources separated by both commas and whitespace. The whitespace is necessary since URLs can contain commas in the path section without being percent-encoded (see related issue #238). Without the whitespace, browsers will/may interpret the list of sources as one giant url.

The LIL src html:

<img src="/assets/images/header-splash-logo-ani@1x.gif" srcset="/assets/images/header-splash-logo-ani@1x.gif, /assets/images/header-splash-logo-ani@2x.gif 2x" alt="Library Innovation Lab" class="img-full-width">

currently results in the rewritten html:

<img src="/live/im_/https://lil.law.harvard.edu/assets/images/header-splash-logo-ani@1x.gif" srcset="/live/mp_/https://lil.law.harvard.edu/assets/images/header-splash-logo-ani@1x.gif,/live/mp_/https://lil.law.harvard.edu/assets/images/header-splash-logo-ani@2x.gif 2x" alt="Library Innovation Lab" class="img-full-width">

which results in the request (404 Not Found):

::1 - - [2017-12-18 11:12:18] "GET /live/mp_/https://lil.law.harvard.edu/assets/images/header-splash-logo-ani@1x.gif,/live/mp_/https://lil.law.harvard.edu/assets/images/header-splash-logo-ani@2x.gif HTTP/1.1" 404 6790 0.137927

By joining with ", " instead of "," these images are captured and played back as expected.

@rebeccacremona
Copy link
Contributor Author

(this reverses 037fca5 and c421b1c)

@rebeccacremona rebeccacremona changed the title Separate rewritten srcset image urls Separate rewritten srcset image urls; handle urls with commas Jan 2, 2018
@rebeccacremona
Copy link
Contributor Author

Part 2: per offline discussion, address #238 as well, insofar as possible.

Perfect splitting of URL lists in srcset is likely impossible due to ambiguities in the spec: URL paths may contain commas, but in certain circumstances, URLs in srcsets may be joined by simple commas, rather than commas and spaces. "/foo,bar/baz" is therefore ambiguous: a single url, or two relative urls (/foo,bar and /baz") joined by a comma?

This commit represents our best effort so far to split via a regex, rather than implementing the token-based parsing recommended in the spec (see "parse the value of the element’s srcset attribute as follows" in http://w3c.github.io/html/semantics-embedded-content.html#element-attrdef-img-srcset)

@rebeccacremona
Copy link
Contributor Author

rebeccacremona commented Jan 2, 2018

The tests pass locally. The failure in Travis appears unrelated to this PR. I suspect it's a package version issue; investigating.

@rebeccacremona
Copy link
Contributor Author

Confirmed: a fresh clone of the repo + a fresh virtualenv fails in the same way as the travis ci build for this PR.

@rebeccacremona
Copy link
Contributor Author

Revised per offline code review. Thanks @ikreymer!

@ikreymer ikreymer merged commit d3b379e into webrecorder:develop Jan 5, 2018
@ikreymer
Copy link
Member

ikreymer commented Jan 5, 2018

Looks great, thanks @rebeccacremona for this thorough PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants