Wayback doesn't scrape/rewrite srcset urls correctly #137

nightpool · 2017-01-27T04:36:26Z

Let me know if this isn't the right repo, but I ran into an issue when testing archival features on http://www.goodbyetohalos.com/

Like many webcomics using WordPress nowadays, Goodbye to Halos uses the HTML5 srcset attribute to serve different image sizes to different devices:

<img
    width="800" height="1200" 
    src="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg"
    class="attachment-full size-full" alt=""
    srcset="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg 800w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-480x720.jpg 480w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-96x144.jpg 96w"
    sizes="(max-width: 800px) 100vw, 800px"
    data-webcomic-parent="837"
>

So far, so good. However, after crawling/scraping these with Wayback, only the src URL is scraped and rewritten, so the image on the archived page is still served from the original server:

<img
    width="800" height="1200"
    src="/web/20170127042412im_/http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg"
    class="attachment-full size-full" alt=""
    srcset="http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108.jpg 800w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-480x720.jpg 480w,
            http://www.goodbyetohalos.com/wp-content/uploads/2017/01/WEB_ch1_108-96x144.jpg 96w"
    sizes="(max-width: 800px) 100vw, 800px"
    data-webcomic-parent="837"
>

This is very obvious because the original site doesn't use HTTPS, so it leads to a broken image in the Wayback Machine view.

Obviously, the correct behavior here is that all of the images should be scraped (in this case they're just resizings, but in theory they could be completely different images—nothing prevents that) and rewritten.
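
To make the expected rewriting concrete, here is a minimal Python sketch of what I mean. It is not OpenWayback's actual code: `parse_srcset`, `rewrite_srcset`, and `ARCHIVE_PREFIX` are hypothetical names, the prefix is just copied from the rewritten src above, and it assumes every candidate URL is already absolute (no resolution of relative URLs, no handling of parentheses in descriptors).

```python
# Hypothetical replay prefix, modeled on the rewritten src in the example above.
ARCHIVE_PREFIX = "/web/20170127042412im_/"


def parse_srcset(value):
    """Split a srcset value into (url, descriptor) pairs.

    Simplified from the HTML srcset parsing rules: a URL is a run of
    non-whitespace characters, and the descriptor is whatever follows
    up to the next comma.
    """
    candidates = []
    pos, n = 0, len(value)
    while pos < n:
        # Skip whitespace and commas between candidates.
        while pos < n and (value[pos].isspace() or value[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # Collect the URL: everything up to the next whitespace.
        start = pos
        while pos < n and not value[pos].isspace():
            pos += 1
        url = value[start:pos]
        descriptor = ""
        if url.endswith(","):
            # A trailing comma ends the candidate with no descriptor.
            url = url.rstrip(",")
        else:
            # Otherwise the descriptor runs up to the next comma.
            end = value.find(",", pos)
            if end == -1:
                end = n
            descriptor = value[pos:end].strip()
            pos = end
        if url:
            candidates.append((url, descriptor))
    return candidates


def rewrite_srcset(value, prefix=ARCHIVE_PREFIX):
    """Prefix every candidate URL, keeping its width/density descriptor."""
    parts = []
    for url, descriptor in parse_srcset(value):
        parts.append(f"{prefix}{url} {descriptor}".strip())
    return ", ".join(parts)
```

The same list returned by `parse_srcset` could also be handed to the crawler, so that each candidate image is fetched and not only the one in src.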

Thanks! Let me know if you need more information, or want me to whip up a more minimal test case.

davidar · 2017-05-08T05:58:08Z

iipc/openwayback@8a86e99

nightpool mentioned this issue Feb 5, 2017

heritrix doesn't scrape rewrite srcset urls correctly internetarchive/heritrix3#177

Closed
