Design Doc: Prefetching page resources

Collected by Joshua Marantz from an email thread among arvind, ramani, mdw, & cameron.

November 29, 2011

Objective

Chrome supports <link rel=subresource href=...>, which initiates a normal-priority fetch for the resource named in the href attribute. We wish to exploit this feature to improve web page latency (PLT & TTFR, i.e. Page Load Time & Time To First Render).

See http://dev.chromium.org/spdy/link-headers-and-server-hint/link-rel-subresource

Non-Objective

Prefetching resources for pages the user is likely to navigate to. There’s a separate effort underway toward that goal. The two efforts should be aware of one another and share infrastructure as needed, but this objective is much simpler.

Discussion

According to Arvind, this feature only works as an HTML tag, not as an HTTP header, despite the doc. Moreover, it seems like a better idea to put it in HTML anyway, since the tag will then be compressed along with the rest of the page even without SPDY.

Note that browsers already scan HTML for resource references before rendering, so adding subresource hints for resources loaded directly from HTML, at least from the first packet of HTML, does not make sense. We want to include resources that

  • are indirectly loaded (via javascript or css files)
  • don’t appear until lower in the document

In other words, the resources that are easy to discover in the first PSA *flush window* are not useful to link as subresources.

New rewriters such as defer_js try to push resources lower on the page so they don’t block rendering. In fact, one of the potential drawbacks of defer_js is that the browser won’t even initiate fetches for javascript until much later. Depending on the page, this can have negative ramifications for the user experience even if it accelerates PLT or TTFR. Inserting subresource links for deferred javascript may resolve this issue and make defer_js an unambiguous win.

To implement this feature, we must maintain a cache of HTML->ResourceList mappings. To be useful in a streaming context, we must emit the link tags before our own HTML parser has seen the references; and indeed many of the resources are not discoverable by static analysis of HTML and CSS at all, even if we are not streaming.
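
As a rough illustration of the mapping (all names here are hypothetical, and a real implementation would sit on top of PSA's existing cache infrastructure rather than an in-process map), it might look like:

```cpp
// Minimal sketch of the HTML -> ResourceList mapping described above.
#include <map>
#include <string>
#include <utility>
#include <vector>

// Ordered list of resource URLs discovered for one HTML page.
using ResourceList = std::vector<std::string>;

class ResourceListCache {
 public:
  // Replaces the cached resource list for an HTML URL after a page view.
  void Put(const std::string& html_url, ResourceList resources) {
    cache_[html_url] = std::move(resources);
  }

  // Returns the previously collected list, or nullptr on a miss
  // (e.g. the first time a page is seen).
  const ResourceList* Get(const std::string& html_url) const {
    auto it = cache_.find(html_url);
    return it == cache_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, ResourceList> cache_;
};

// Builds the <link rel=subresource> tags that would be injected near the
// top of the HTML on a subsequent page view.
std::string SubresourceLinks(const ResourceListCache& cache,
                             const std::string& html_url) {
  std::string out;
  if (const ResourceList* resources = cache.Get(html_url)) {
    for (const std::string& url : *resources) {
      out += "<link rel=\"subresource\" href=\"" + url + "\">\n";
    }
  }
  return out;
}
```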

Discovering the Resource list

There are several possibilities, listed here with pros & cons. Note that in every case we will be populating a cache for use on subsequent requests. We will never be able to dynamically determine the correct resources for the current page due to streaming and external CSS.

PSA Static Parsing

A new filter is implemented that tracks URLs parsed in HTML and CSS. At the end of the document, the collection of URLs is written to a cache. On subsequent page views, a cache lookup is initiated to discover page resources. These resources are emitted as link rel=subresource tags early in the document. The HTML-based collection can order the resources in a reasonable way, although it’s hard to tell exactly what ordering would be best for the browser without integrating with a critical path analysis.
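
A minimal sketch of such a filter follows. The start/end-document hooks loosely mirror a streaming HTML filter's events, but none of these names are the real PSA filter API; this is a standalone illustration.

```cpp
// Rough sketch of the static-parsing filter (hypothetical names).
// One instance handles one HTML document.
#include <map>
#include <string>
#include <vector>

using ResourceListCache = std::map<std::string, std::vector<std::string>>;

class CollectSubresourcesFilter {
 public:
  CollectSubresourcesFilter(const std::string& html_url,
                            ResourceListCache* cache)
      : html_url_(html_url), cache_(cache) {}

  // Called before any HTML is emitted: inject the hints collected on a
  // previous view of this page, since the current view's resources are
  // not yet known (we are streaming).
  std::string StartDocument() {
    std::string links;
    auto it = cache_->find(html_url_);
    if (it != cache_->end()) {
      for (const std::string& url : it->second) {
        links += "<link rel=\"subresource\" href=\"" + url + "\">\n";
      }
    }
    return links;
  }

  // Called for each resource URL seen while parsing HTML (src/href
  // attributes) or while processing CSS (@import, url(...)).
  void RecordUrl(const std::string& resource_url) {
    collected_.push_back(resource_url);
  }

  // Called at the end of the document: store what we saw so the next
  // view of this page can emit the hints up front.
  void EndDocument() { (*cache_)[html_url_] = collected_; }

 private:
  std::string html_url_;
  ResourceListCache* cache_;
  std::vector<std::string> collected_;
};
```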

Headless Browser Service

A headless browser can be used to periodically update a cache from HTML URL to ResourceList. This presents a compelling client-side view of what resources are needed. There is already a plan in place to use a headless browser for identifying AFT images, so this plan may integrate well.

PSA Referer Tracking

In this idea we exploit our existing server-side presence to incrementally build the transitive closure of resources referenced from HTML URLs. This has a few engineering challenges but appears straightforward in principle. It catches all the requests made from the browser back to the mod_pagespeed server, but will not catch references to resources on other domains.
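
A sketch of the bookkeeping, assuming every request the browser sends back through the server is reported with its Referer header, and that a referring resource (e.g. a CSS file) is seen before the resources it loads; class and method names are hypothetical:

```cpp
// Referer tracking sketch: attribute every served resource to the HTML
// page it ultimately came from, walking back through intermediate
// resources such as CSS files.
#include <map>
#include <set>
#include <string>

class RefererTracker {
 public:
  // Called for each request served: `url` is the fetched resource,
  // `referer` is the value of its Referer header.
  void Record(const std::string& url, const std::string& referer) {
    // Either the referer is itself an HTML page, or it is a sub-resource
    // (CSS/JS) we have already attributed to some HTML page.
    auto it = owner_.find(referer);
    const std::string& page = (it != owner_.end()) ? it->second : referer;
    resources_[page].insert(url);
    // Remember which page this resource belongs to, so anything it loads
    // in turn (e.g. images referenced from a CSS file) is attributed to
    // the same page.
    owner_[url] = page;
  }

  // Resources to emit as <link rel=subresource> on the next view of `page`.
  const std::set<std::string>& ResourcesFor(const std::string& page) const {
    static const std::set<std::string> kEmpty;
    auto it = resources_.find(page);
    return it == resources_.end() ? kEmpty : it->second;
  }

 private:
  std::map<std::string, std::string> owner_;                // resource -> HTML page
  std::map<std::string, std::set<std::string>> resources_;  // page -> resources
};
```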

Independent of which strategy we use for collecting URLs, we may want to augment, order, or filter the information based on PSA filters. For example, defer_js should "vote" for the preloading of deferred javascript files. Resources that are loaded below the fold or after onload should be demoted to the bottom of the list. Resource-rich pages may benefit more if not all the resources are preloaded; experimentation is necessary.
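
One possible way to combine such votes, sketched with a hypothetical ResourceHint structure; the actual promotion/demotion rules and the cap on the number of hints are exactly the things that would need the experimentation mentioned above:

```cpp
// Sketch of ordering and trimming the hint list based on filter votes.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct ResourceHint {
  std::string url;
  bool deferred_js = false;           // promoted by defer_js
  bool below_fold_or_onload = false;  // demoted
};

std::vector<std::string> OrderHints(std::vector<ResourceHint> hints,
                                    size_t max_hints) {
  // Deferred JS first, ordinary resources next, below-the-fold or
  // post-onload resources last.
  std::stable_sort(hints.begin(), hints.end(),
                   [](const ResourceHint& a, const ResourceHint& b) {
                     auto rank = [](const ResourceHint& h) {
                       if (h.deferred_js) return 0;
                       if (h.below_fold_or_onload) return 2;
                       return 1;
                     };
                     return rank(a) < rank(b);
                   });
  std::vector<std::string> out;
  for (const ResourceHint& h : hints) {
    if (out.size() >= max_hints) break;  // don't preload everything
    out.push_back(h.url);
  }
  return out;
}
```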

| Strategy | Pros | Cons |
| --- | --- | --- |
| PSA static parsing | Easy to implement and integrate. Access to rewritten resource names. Works in PSS & mod_pagespeed. | Misses JS-loaded resources, which are arguably the most important. |
| Headless browser | Resource priority may also be extractable, helping us order the links. We want to get AFT info for images anyway and webkit_headless seems like the best way to get that, so we might as well get the resource list. Includes all resources, not just same-domain. | mod_pagespeed doesn't have a headless browser; maybe phantomJS? |
| PSA referer tracking | Access to rewritten resource names. Can be integrated into both PSS & mod_pagespeed. | Several-week engineering effort. Would only collect same-domain or proxied resources. |

Conclusion

This looks like a compelling idea, but I think it will need to be tuned. There are a few competing ways to implement it. Of course, this feature is useful only for Chrome (and arguably FF via link rel=prefetch).
