Wayback doesn't scrape/rewrite srcset urls correctly #137

Open
nightpool opened this issue Jan 27, 2017 · 1 comment

nightpool commented Jan 27, 2017

Let me know if this isn't the right repo, but I ran into an issue when testing archival features on http://www.goodbyetohalos.com/

Like many webcomics using WordPress nowadays, Goodbye to Halos uses the HTML5 srcset attribute to serve different image sizes to different devices:

<img
    width="800" height="1200" 
    src="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg"
    class="attachment-full size-full" alt=""
    srcset="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg 800w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-480x720.jpg 480w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-96x144.jpg 96w"
    sizes="(max-width: 800px) 100vw, 800px"
    data-webcomic-parent="837"
>

So far, so good. However, after crawling these pages with Wayback, only the src URL is scraped and rewritten. The srcset URLs are left untouched, so the image on the archived page is still served from the original server:

<img
    width="800" height="1200"
    src="/web/20170127042412im_/http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg"
    class="attachment-full size-full" alt=""
    srcset="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg 800w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-480x720.jpg 480w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-96x144.jpg 96w"
    sizes="(max-width: 800px) 100vw, 800px"
    data-webcomic-parent="837"
>

This is easy to spot here because the original site doesn't use https, so the un-rewritten srcset URLs lead to a broken image in the Wayback Machine view:

[screenshot of the broken image]

The correct behavior here is that every image listed in srcset should be scraped and rewritten, not just the one in src. (In this case they're just resizings, but in theory they could be completely different images; nothing prevents that.)
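
As a rough illustration of the expected rewrite (this is not Wayback's actual code), here is a minimal Python sketch. The names `parse_srcset`, `rewrite_srcset`, and `WAYBACK_PREFIX` are hypothetical, the prefix is simply copied from the rewritten src above, and the sketch assumes the srcset URLs are absolute:

```python
# Hypothetical replay prefix, copied from the rewritten src in the example above.
WAYBACK_PREFIX = "/web/20170127042412im_/"


def parse_srcset(value):
    """Split a srcset value into (url, descriptor) pairs.

    Simplified from the HTML parsing rules: a candidate is a URL (a run of
    non-whitespace), optionally followed by a descriptor such as "800w" or
    "2x", and candidates are separated by commas.
    """
    candidates = []
    pos, n = 0, len(value)
    while pos < n:
        # Skip whitespace and separator commas before the next candidate.
        while pos < n and (value[pos].isspace() or value[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # The URL runs until the next whitespace character.
        start = pos
        while pos < n and not value[pos].isspace():
            pos += 1
        url = value[start:pos]
        descriptor = ""
        if url.endswith(","):
            # "a.jpg, b.jpg 2x": the comma ends a candidate with no descriptor.
            url = url.rstrip(",")
        else:
            # Everything up to the next comma is the descriptor.
            start = pos
            while pos < n and value[pos] != ",":
                pos += 1
            descriptor = value[start:pos].strip()
        candidates.append((url, descriptor))
    return candidates


def rewrite_srcset(value, prefix=WAYBACK_PREFIX):
    """Prefix every URL in a srcset value, keeping its descriptor."""
    parts = []
    for url, descriptor in parse_srcset(value):
        # Assumption: url is absolute. Relative URLs would first need to be
        # resolved against the page URL.
        parts.append(f"{prefix}{url} {descriptor}".strip())
    return ", ".join(parts)
```

Applied to the srcset above, `rewrite_srcset` would return all three candidates with the `/web/20170127042412im_/` prefix and their `800w`, `480w`, and `96w` descriptors intact. The crawler side would need the same `parse_srcset` step to discover the extra URLs to fetch.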

Thanks! Let me know if you need more information, or want me to whip up a more minimal test case.
