Web Cache - Everything you need to know
In one of the previous posts, I discussed HTTP and where it stands at this point. This one is going to be specifically about caching.
As users, we easily get frustrated by buffering videos, images that take seconds to load, and pages that get stuck because the content is still being loaded. Loading resources from a cache is much faster than fetching them from the origin server. It reduces latency, speeds up the loading of resources, decreases the load on the server, and cuts down bandwidth costs.
What is web cache? It is something that sits somewhere between the client and the server, continuously watching requests and their responses and looking for any responses that can be cached, so that less time is consumed when the same request is made again.
Note that this image is just to give you an idea. Depending upon the type of cache, the place where it is implemented could vary. More on this later.
Before we get into further details, let me give you an overview of the terms that will be used further in the article:
- Client could be your browser or any application requesting the server for some resource
- Origin Server, the source of truth, houses all the content required by the client and is responsible for fulfilling the client requests.
- Stale Content is the cached but expired content
- Fresh Content is the content available in cache that hasn't expired yet
- Cache Validation is the process of contacting the server to check the validity of the cached content and get it updated for when it is going to expire
- Cache Invalidation is the process of removing any stale content available in the cache
A web cache can be shared or private depending upon the location where it exists. The caching locations are the browser cache, the proxy cache and the reverse proxy cache.
You might have noticed that when you click the back button in your browser it takes less time to load the page than the time that it took during the first load; this is the browser cache in play. Browser cache is the most common location for caching and browsers usually reserve some space for it.
A browser cache is limited to just one user and unlike other caches, it can store the "private" responses. More on it later.
Unlike the browser cache, which serves a single user, proxy caches may serve hundreds of different users accessing the same content. They are usually implemented at a broader level, for example by ISPs or other independent entities.
Reverse Proxy Cache
A reverse proxy cache, or surrogate cache, is implemented close to the origin servers by the server administrators in order to reduce the load on the server. This is unlike proxy caches, which are implemented by ISPs and the like to reduce bandwidth usage in a network.
Although you can control the reverse proxy caches (since they are implemented by you on your server), you cannot avoid or control browser and proxy caches. And if your website is not configured to use these caches properly, it will still be cached using whatever defaults are set on these caches.
So, how do we control the web cache? Whenever the server emits a response, it is accompanied by HTTP headers that tell the caches whether and how to cache that response. The content provider is the one who has to make sure to return proper HTTP headers to instruct the caches on how to cache the content.
Before HTTP/1.1 and the introduction of Cache-Control, there was the Expires header, which is simply a timestamp telling the caches how long some content should be considered fresh. The value of this header is an absolute expiry date, which has to be in GMT. Below is a sample header:
Expires: Mon, 13 Mar 2017 12:22:00 GMT
It should be noted that the date should not be more than a year in the future, and if the date format is invalid, the content will be considered stale. Also, the clock on the cache has to be in sync with the clock on the server, otherwise the desired results might not be achieved.
Although the Expires header is still valid and widely supported by caches, preference should be given to its HTTP/1.1 successor, i.e. Cache-Control.
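To make the freshness rule concrete, here is a minimal Python sketch of how a cache might evaluate an Expires value. The function name is_fresh_by_expires is made up for illustration, and real caches handle more edge cases than this; the only behavior taken from the text above is that an unparseable date counts as stale.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh_by_expires(expires_value: str, now: datetime) -> bool:
    """Return True while `now` is earlier than the Expires timestamp.

    An unparseable date is treated as already expired (stale).
    """
    try:
        expires_at = parsedate_to_datetime(expires_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return False
    if expires_at.tzinfo is None:
        # No usable zone information in the header; assume GMT.
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    return now < expires_at


sample = "Mon, 13 Mar 2017 12:22:00 GMT"
before = datetime(2017, 3, 13, 12, 0, tzinfo=timezone.utc)
after = datetime(2017, 3, 13, 12, 30, tzinfo=timezone.utc)
print(is_fresh_by_expires(sample, before), is_fresh_by_expires(sample, after))
```

Note that `now` comes from the cache's own clock, which is why the clock synchronization mentioned above matters.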
Another one from the old, pre-HTTP/1.1 days is Pragma. Everything that it could do is now possible using the Cache-Control header described below. However, one thing I would like to point out is that you might see Pragma: no-cache being used here and there in the hope of stopping the response from being cached. It might not necessarily work, as the HTTP specification discusses it for request headers and makes no mention of it for response headers. The Cache-Control header should be used to control caching instead.
Cache-Control specifies how long and in what manner the content should be cached. This family of headers was introduced in HTTP/1.1 to overcome the limitations of the Expires header.
The value of the Cache-Control header is composite, i.e. it can contain multiple directives. Let's look at the possible values that this header may contain.
Setting the cache to private means that the content will not be cached by any of the proxies and will only be cached by the client (i.e. the browser).
Having said that, don't let it fool you into thinking that setting this header will make your data any more secure; you still have to use SSL for that purpose.
If set to public, apart from being cached by the client, the content can also be cached by proxies serving many other users.
no-store specifies that the content is not to be cached by any of the caches
no-cache indicates that the content can be cached, but the cached content has to be re-validated with the server (using ETag, for example) before being served. That is, there is still a request to the server, but it is for validation and not to download the content again.
Cache-Control: max-age=3600, no-cache, public
max-age specifies the number of seconds for which the content will be cached. For example, if the Cache-Control header looks like below:
Cache-Control: max-age=3600, public
it would mean that the content is publicly cacheable and will be considered stale after 3600 seconds, i.e. 60 minutes.
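As a rough illustration of how a cache could read these directives, the Python sketch below splits a Cache-Control value into its parts and applies the max-age rule. The names parse_cache_control and is_fresh are hypothetical, and splitting on commas is a simplification that ignores quoted values containing commas.

```python
def parse_cache_control(header_value: str) -> dict:
    """Split a Cache-Control value into a {directive: value} mapping.

    Directives without a value (e.g. "public") map to None.
    """
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


def is_fresh(age_seconds: int, header_value: str) -> bool:
    """A response is fresh while its age is below max-age."""
    max_age = parse_cache_control(header_value).get("max-age")
    return max_age is not None and age_seconds < int(max_age)


header = "max-age=3600, public"
print(parse_cache_control(header))
print(is_fresh(1800, header), is_fresh(3600, header))
```

With the header from the example, a copy that is 30 minutes old is still fresh, while one that has reached 3600 seconds is stale.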
In s-maxage, the s- prefix stands for shared. This directive specifically targets shared caches. Like max-age, it takes the number of seconds for which something is to be cached. If present, it overrides max-age and the Expires header for shared caches.
Cache-Control: s-maxage=3600, public
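The precedence between s-maxage, max-age and Expires can be sketched as a small Python function. The name freshness_lifetime, the shape of the directives mapping and the expires_in parameter (seconds left until the Expires date) are assumptions made for this illustration only.

```python
from typing import Dict, Optional


def freshness_lifetime(
    directives: Dict[str, str],
    shared_cache: bool,
    expires_in: Optional[int] = None,
) -> Optional[int]:
    """Pick the freshness lifetime in seconds, or None if nothing applies."""
    # s-maxage only applies to shared caches, where it wins over everything else.
    if shared_cache and "s-maxage" in directives:
        return int(directives["s-maxage"])
    # max-age takes precedence over the older Expires header.
    if "max-age" in directives:
        return int(directives["max-age"])
    # Fall back to whatever the Expires header implied, if anything.
    return expires_in


directives = {"s-maxage": "3600", "max-age": "60"}
print(freshness_lifetime(directives, shared_cache=True))
print(freshness_lifetime(directives, shared_cache=False))
```

The same response is therefore kept for an hour by a proxy but only for a minute by the browser.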
It might sometimes happen that, when you have network problems and the content cannot be retrieved from the server, the browser serves stale content without validation. must-revalidate avoids that. If this directive is present, stale content cannot be served in any case and the data must be re-validated with the server before being served.
Cache-Control: max-age=3600, public, must-revalidate
proxy-revalidate is similar to must-revalidate, but it applies only to shared or proxy caches. In other words, proxy-revalidate is to must-revalidate as s-maxage is to max-age. But why did they not call it s-revalidate? I have no idea; if you have any clue, please leave a comment below.
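The difference between the two revalidation directives can be expressed as one small decision. This is only a sketch: the function name may_serve_stale_on_error is invented here, and it covers just the case described above, where the content is stale and the server cannot be reached.

```python
from typing import Set


def may_serve_stale_on_error(directives: Set[str], shared_cache: bool) -> bool:
    """Decide whether a stale copy may be served when the origin is unreachable."""
    # must-revalidate forbids serving stale content in every cache.
    if "must-revalidate" in directives:
        return False
    # proxy-revalidate says the same, but only to shared (proxy) caches.
    if shared_cache and "proxy-revalidate" in directives:
        return False
    return True


print(may_serve_stale_on_error({"max-age", "proxy-revalidate"}, shared_cache=True))
print(may_serve_stale_on_error({"max-age", "proxy-revalidate"}, shared_cache=False))
```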
You can combine these directives in different ways to achieve different caching behaviors; however, public and private are mutually exclusive.
If you specify both no-store and no-cache, no-store will be given precedence over no-cache.
; If specified both
Cache-Control: no-store, no-cache
; Below will be considered
Cache-Control: no-store
As for private/public, for any unauthenticated requests the cache is considered public, and for any authenticated ones the cache is considered private.
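Put together, the rules for whether a cache may keep a response at all could look like the Python sketch below. The function name may_store and its parameters are hypothetical, and it models only the directives discussed so far, not the complete set of rules in the HTTP specification.

```python
from typing import Set


def may_store(directives: Set[str], shared_cache: bool, authenticated: bool = False) -> bool:
    """Decide whether a cache may store a response with these directives."""
    # no-store wins over everything else, including no-cache.
    if "no-store" in directives:
        return False
    # A private (browser) cache may store anything that is not no-store.
    if not shared_cache:
        return True
    # Shared caches must skip responses marked private.
    if "private" in directives:
        return False
    # Responses to authenticated requests default to private unless marked public.
    if authenticated and "public" not in directives:
        return False
    return True


print(may_store({"no-store", "no-cache"}, shared_cache=False))
print(may_store({"private"}, shared_cache=True))
print(may_store({"public"}, shared_cache=True, authenticated=True))
```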
Up until now we only discussed how the content is cached and how long the cached content is to be considered fresh, but we did not discuss how the client validates it with the server. Below we discuss the headers used for this purpose.
ETag, or "entity tag", was introduced in the HTTP/1.1 specs. An ETag is just a unique identifier that the server attaches to a resource. This ETag is later used by the client to make conditional HTTP requests stating "give me this resource if its ETag is not the same as the ETag that I have", and the content is downloaded only if the ETags do not match.
Method by which ETag is generated is not specified in the HTTP docs and usually some collision-resistant hash function is used to assign etags to each version of a resource. There could be two types of etags i.e. strong and weak
ETag: "j82j8232ha7sdh0q2882" - Strong Etag
ETag: W/"j82j8232ha7sdh0q2882" - Weak Etag (prefixed with `W/`)
A strong validating ETag means that two resources are exactly the same and there is no difference between them at all, while a weak ETag means that two resources, although not strictly the same, can be considered the same. Weak ETags might be useful for dynamic content, for example.
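Since the specification leaves the generation method open, the following Python sketch shows just one possible approach: hashing the response body. The choice of SHA-256, the truncation to 20 characters and all three function names are assumptions for illustration. The two comparison helpers mirror the strong and weak notions described above.

```python
import hashlib


def make_etag(body: bytes, weak: bool = False) -> str:
    """Build a quoted ETag from a hash of the body (one possible scheme)."""
    tag = '"' + hashlib.sha256(body).hexdigest()[:20] + '"'
    return "W/" + tag if weak else tag


def strong_match(a: str, b: str) -> bool:
    """Strong comparison: identical tags, and neither of them is weak."""
    return not a.startswith("W/") and not b.startswith("W/") and a == b


def weak_match(a: str, b: str) -> bool:
    """Weak comparison: tags are equal once any W/ prefix is dropped."""
    def strip(tag: str) -> str:
        return tag[2:] if tag.startswith("W/") else tag

    return strip(a) == strip(b)


strong = make_etag(b"<h1>Hello</h1>")
weak = make_etag(b"<h1>Hello</h1>", weak=True)
print(strong_match(strong, strong), strong_match(strong, weak), weak_match(strong, weak))
```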
Now you know what ETags are, but how does the browser make this request? By making a request to the server while sending the available ETag in the If-None-Match header.
Consider this scenario: you opened a web page which loaded a logo image with a caching period of 60 seconds and an ETag of abc123xyz. After about 30 minutes you reload the page. The browser will notice that the logo, which was fresh for 60 seconds, is now stale; it will trigger a request to the server, sending the ETag of the stale logo image in the If-None-Match header. The server will then compare this ETag with the ETag of the current version of the resource. If both ETags match, the server will send back a 304 Not Modified response, which tells the client that the copy it has is still good, and it will be considered fresh for another 60 seconds. If the ETags do not match, i.e. the logo has likely changed, the client will be sent the new logo, which it will use to replace the stale logo that it has.
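The server side of the logo scenario can be sketched in a few lines of Python. The function names and the (status, headers, body) return shape are assumptions made for this example and do not belong to any particular framework.

```python
from typing import Dict, Optional, Tuple


def opaque_tag(etag: str) -> str:
    """Drop the weak prefix so that tags can be compared weakly."""
    etag = etag.strip()
    return etag[2:] if etag.startswith("W/") else etag


def conditional_get(
    current_etag: str, body: bytes, if_none_match: Optional[str]
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET that may carry If-None-Match."""
    headers = {"ETag": current_etag}
    if if_none_match is not None:
        # The header may list several tags separated by commas, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or opaque_tag(current_etag) in map(opaque_tag, candidates):
            # The client's copy is still good, so no body is sent.
            return 304, headers, b""
    return 200, headers, body


print(conditional_get('"abc123xyz"', b"logo-bytes", '"abc123xyz"')[0])
print(conditional_get('"abc123xyz"', b"logo-bytes", '"older-tag"')[0])
```

The first call models the unchanged logo and the second one a logo that has been replaced since the browser cached it.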
The server might include the Last-Modified header, indicating the date and time at which some content was last modified.
Last-Modified: Wed, 15 Mar 2017 12:30:26 GMT
When the content gets stale, the client will make a conditional request that includes the last modified date it has in a header called If-Modified-Since. The server compares that date with the current Last-Modified date of the content; if the content has not been modified since, the server responds with 304 Not Modified and the cached copy is considered fresh for another n seconds. If the content has been modified after that date, it is sent again by the server and replaces the copy that the client has.
If-Modified-Since: Wed, 15 Mar 2017 12:30:26 GMT
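The date comparison behind this check is simple enough to sketch in Python. The function name not_modified_since is hypothetical, and treating an unparseable date as "modified" is a conservative assumption of this example.

```python
from email.utils import parsedate_to_datetime


def not_modified_since(last_modified: str, if_modified_since: str) -> bool:
    """Return True when the server may answer 304 Not Modified."""
    try:
        resource_time = parsedate_to_datetime(last_modified)
        client_time = parsedate_to_datetime(if_modified_since)
        # Unchanged if the resource is no newer than the client's date.
        return resource_time <= client_time
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: send the full content again.
        return False


client_date = "Wed, 15 Mar 2017 12:30:26 GMT"
print(not_modified_since("Wed, 15 Mar 2017 12:30:26 GMT", client_date))
print(not_modified_since("Thu, 16 Mar 2017 08:00:00 GMT", client_date))
```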
You might be wondering now: what if the cached content has both Last-Modified and ETag assigned to it? Well, in that case the client sends both in its conditional request, i.e. If-None-Match along with If-Modified-Since. Under the current HTTP specification the ETag takes precedence: a server that receives If-None-Match ignores If-Modified-Since, so there is no re-downloading of the resource if the ETag matches, and the content has to be downloaded again if it does not. The Last-Modified date serves as a fallback for servers and caches that do not understand ETags.
Where do I start?
Now that we have got everything covered, let us put everything in perspective and see how you can use this information.
Before we get into the possible caching strategies, let me add the fact that most servers, including Apache and Nginx, allow you to implement your caching policy through the server configuration so that you don't have to juggle headers in your code.
For example, if you are using Apache and you have your static content placed at /static, you can put the below .htaccess file in that directory to make all the content in it be cached for a year:
# Cache everything for a year
Header set Cache-Control "max-age=31536000, public"
You can further use the filesMatch directive to add conditionals and use a different caching strategy for different kinds of files, e.g.
# Cache any images for one year
<filesMatch "\.(png|jpg|jpeg|gif)$">
    Header set Cache-Control "max-age=31536000, public"
</filesMatch>
# Cache any CSS and JS files for a month
<filesMatch "\.(css|js)$">
    Header set Cache-Control "max-age=2628000, public"
</filesMatch>
Or, if you don't want to use the .htaccess file, you can modify Apache's configuration file httpd.conf. The same goes for Nginx: you can add the caching information in the location or server block.
There is no golden rule or set standard for what your caching policy should look like; each application is different and you have to find what suits your application best. However, just to give you a rough idea:
- You can have aggressive caching (e.g. cache for a year) on any static content and use fingerprinted filenames (e.g. style.ju2i90.css) so that the cache is automatically bypassed whenever the files are updated, since a changed file gets a new name and therefore a new URL. Also, you should not cross the upper limit of one year, as it might not be honored.
- Look and decide whether you even need caching for dynamic content and, if yes, how long it should be. For example, an RSS feed of a blog could be cached for a few hours, but there shouldn't be any caching for inventory items in an ERP.
- Always add the validators (preferably ETags) in your response.
- Pay attention while choosing the visibility (private or public) of the cached content. Make sure that you do not accidentally cache any user-specific or sensitive content in any public proxies. When in doubt, do not use cache at all.
- Test and monitor the caching headers being served by your site. You can use the browser console or curl -I http://some-url.com for that purpose.
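For the fingerprinted filenames mentioned in the first point, one possible scheme is to derive the fingerprint from the file content, so that the name changes only when the content does. The Python sketch below assumes a SHA-256 hash truncated to 8 characters; both choices, and the function name fingerprinted_name, are illustrative only, and build tools usually do this for you.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename: str, content: bytes, length: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:length]
    path = PurePosixPath(filename)
    # style.css becomes style.<digest>.css; the directory part is kept as-is.
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


old_name = fingerprinted_name("static/style.css", b"body { color: black; }")
new_name = fingerprinted_name("static/style.css", b"body { color: navy; }")
print(old_name)
print(new_name)
print(old_name != new_name)
```

Because the updated stylesheet is served under a different URL, caches treat it as a new resource and the year-long max-age on the old one no longer matters.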
And that about wraps it up. If you have any comments or feedback, feel free to leave a comment below.
If you find any typos or blatant lies, please contribute by updating the article