Snapshoter

The Problem

Highly JS-intensive websites like those built with AngularJS tend to be penalized in SEO terms. Search engine crawlers are generally not able (or not trusting enough) to execute client-side JS in order to understand what the end-result DOM would be.

As a massive AngularJS app, Virgin America's site serves as a good example. When users visit a URL such as https://www.virginamerica.com/book/ow/a1/nyc_sfo, they expect a visual result similar to this:

NYC-SFO Calendar

In fact, this rendering is dynamically generated by an AngularJS app. If one opens the code inspector and searches for the keyword 'November', the following DOM is visible:

...
<header class="month__header">
    <h1 bo-text="month.monthYear">November 2014</h1>
</header>
...

However, a simple HTTP request like the one below yields the basic, JS-untouched version of the HTML, in which no reference to 'November' can be found:

curl https://www.virginamerica.com/book/ow/a1/nyc_sfo

This latter method is the traditional one used by crawlers, and therefore your AngularJS (or otherwise JS-heavy) website will not be indexed properly.

The Solution

The solution proposed by Google and adopted by most search engines is described in the Escaped Fragment Full Specification. A further detailed how-to can be found in Google's AJAX Crawling - Getting Started guide.

In practice, this specification allows developers to let the crawler know that a site is rendered via JS. The crawler is then able to call a modified URL that can be trapped by the server in order to provide a static, previously rendered version of the website back to the crawler.

There are two mechanisms developers can use to let the crawler know that the site needs this treatment: using hashbangs in the URLs or adding a fragment meta tag.

The first method, hashbangs, means converting URLs like this:

www.example.com/ajax.html#key=value

into this:

www.example.com/ajax.html#!key=value

The crawler will automatically understand that this site is specially rendered and will convert the request to:

www.example.com/ajax.html?_escaped_fragment_=key=value

The second method, adding a fragment meta tag, means including a special meta tag in the head of the HTML of your page. The meta tag takes the following form:

<meta name="fragment" content="!">

Regardless of the chosen method, you need to set up your server to handle requests for URLs that contain _escaped_fragment_.

Suppose you would like to get www.example.com/index.html#!key=value indexed. Your part of the agreement is to provide the crawler with an HTML snapshot of this URL, so that the crawler sees the content. How will your server know when to return an HTML snapshot instead of a regular page? The answer lies in the URL that is requested: the crawler modifies each AJAX URL as described above.
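If your site happens to be served by Node.js itself, this trap can be sketched in a few lines. The example below is illustrative only (Express and the request module are assumed, and the snapshoter.example.com host is hypothetical); an Apache-based setup is described later in this document:

var express = require('express');
var request = require('request');

var app = express();

// Trap escaped fragment requests and hand them over to Snapshoter;
// everything else falls through to the regular app.
app.use(function (req, res, next) {
  var fragment = req.query._escaped_fragment_;
  if (fragment === undefined) return next();
  request('http://snapshoter.example.com/snapshots?uri=' + encodeURIComponent(fragment))
    .pipe(res);
});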

Snapshoter's Architecture

Snapshoter takes the heavy lifting off your HTTP servers. It:

  1. Treats all _escaped_fragment_ calls that your HTTP servers receive
  2. Generates static snapshots of the dynamic pages on the fly
  3. Caches snapshots for fast responses
  4. Updates cached snapshots in a queued manner
  5. Offers queue management facilities

The following diagram represents the architectural components of Snapshoter:

Architectural Diagram

The next section describes each architectural component in detail.

Components Described

Your own HTTP Server(s)

These servers must be set up (e.g. with Apache's mod_rewrite) to rewrite and context-switch requests when the _escaped_fragment_ URL is identified. Such requests must be processed by Snapshoter's HTTP Entrypoint.

For example, considering that Virgin America's website is using the fragment meta tag method, the following URL:

https://www.virginamerica.com/book/ow/a1/nyc_sfo

will be converted by crawlers to:

https://www.virginamerica.com/?_escaped_fragment_=book%2Fow%2Fa1%2Fnyc_sfo

Your HTTP server must detect this request and encapsulate (not redirect) it as a request to your Snapshoter HTTP Entrypoint installation (e.g. residing at snapshoter.example.com):

http://snapshoter.example.com/snapshots?uri=book%2Fow%2Fa1%2Fnyc_sfo

More details about how to set up your domain can be found in the settings section below.

Snapshoter's HTTP Entrypoint

Snapshoter's HTTP Entrypoint is the main component of Snapshoter. It serves as an HTTP facade to the outside world and orchestrates most of the logic for taking snapshots of dynamic websites.

It takes in URLs such as:

http://snapshoter.example.com/snapshots?uri=book%2Fow%2Fa1%2Fnyc_sfo

This entrypoint will check whether there is a cached version of the snapshot; if not, it will dynamically generate the snapshot and save it for later use.
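In outline, the logic is cache-first. A minimal sketch of the idea follows; takeSnapshot() is a hypothetical helper standing in for the actual headless rendering (the real implementation lives in snapshot-app/src):

var express = require('express');
var redis = require('redis');

var app = express();
var redisClient = redis.createClient();
var baseUrl = process.env.BASE_URL;

app.get('/snapshots', function (req, res) {
  var uri = req.query.uri;
  redisClient.get('snapshot:' + uri, function (err, cached) {
    if (!err && cached) return res.send(cached);       // cache hit: answer immediately
    takeSnapshot(baseUrl + uri, function (err, html) { // cache miss: render on the fly
      if (err) return res.status(500).end();
      redisClient.set('snapshot:' + uri, html);        // store for later requests
      res.send(html);
    });
  });
});

app.listen(3000);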

Redis

Snapshoter uses Redis for its cache as well as for controlling the queue of cache updates.

Queue Worker

A queue worker is a process (or server) that works through queued-up cache-update jobs and updates the corresponding snapshots accordingly.
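Since the queue UI is built on Kue, the worker plausibly processes Kue jobs along these lines (a sketch; the job type, payload shape, and takeSnapshot() helper are assumptions, not the actual source):

var kue = require('kue');
var redis = require('redis');

var queue = kue.createQueue();
var redisClient = redis.createClient();

// Work through queued cache-update jobs, one at a time.
queue.process('refresh-snapshot', function (job, done) {
  takeSnapshot(job.data.url, function (err, html) {
    if (err) return done(err);
    redisClient.set('snapshot:' + job.data.uri, html, done); // refresh the cached copy
  });
});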

Queue UI

In order to manage the queue, an optional queue management UI is offered as part of Snapshoter's package.

Setting up Redis (snapshot-redis)

A vanilla Redis server is required by Snapshoter. A sample redis.conf is provided for reference.

All sample Redis files are found on snapshot-redis.

If you use Docker, a Dockerfile is provided in the snapshot-redis folder. The build_image.sh script triggers a named image build, and refresh_container.sh stops, deletes, recreates, and runs a container from it.

You should also refer to the Dockerfile for dependencies.

Setting up the HTTP Entrypoint (snapshot-app)

Snapshoter's HTTP entrypoint is a Node.js app.

Make sure you have Node.js and npm installed. Refer to http://nodejs.org/ for specific details for your target environment.

If you use Docker, a Dockerfile is provided in the snapshot-app folder. The build_image.sh script triggers a named image build, and refresh_container.sh stops, deletes, recreates, and runs a container from it.

Other dependencies may also apply depending on your setup. Refer to the Dockerfile for details.

For Debian Jessie, for instance, you may need to run the following:

$ apt-get -y update && \
    apt-get install -y \
    libfreetype6 \
    libfontconfig \
    nodejs \
    npm \
    git

IMPORTANT: some Linux distributions (such as Debian Jessie) ship the Node.js binary as nodejs. This is incompatible with Snapshoter. If that's your case, make sure to symlink it with:

sudo ln /usr/bin/nodejs /usr/bin/node

The server runs on port 3000 so make sure to remap your ports when running the container or the process on the server.

The triggering command is:

node src/index.js

Before running the server, set up the following environment variables:

export CACHE_LIFETIME=300000
export BASE_URL=https://www.virginamerica.com/

CACHE_LIFETIME is the maximum allowed age (in milliseconds) of a cached snapshot (e.g. 300000 equals 5 minutes). BASE_URL is the base URL that will preface all requests to the dynamic portion of the webserver.
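Internally, the app can be expected to read this configuration roughly as follows (a sketch, not the actual source):

// Sketch: reading the configuration from the environment.
var cacheLifetime = parseInt(process.env.CACHE_LIFETIME, 10); // max snapshot age, in milliseconds
var baseUrl = process.env.BASE_URL;                           // prefix for requests to the dynamic site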

If you are using Docker, this container will need to be linked to the Redis container under the alias redis such as:

docker run \
  --name snapshot-app \
  --link snapshot-redis:redis \
  -p 3000:3000 \
  -e CACHE_LIFETIME=300000 \
  -e BASE_URL=https://www.virginamerica.com/ \
  -d \
  snapshot-app

Notice the --link snapshot-redis:redis. This will link this container with the redis one.

Also notice the port mapping on 3000. Remember to set this up according to your environment.

If you are not using Docker, then make sure to specify the following environment variables:

export REDIS_PORT_6379_TCP_PORT=49161
export REDIS_PORT_6379_TCP_ADDR=192.168.59.103

Where REDIS_PORT_6379_TCP_ADDR should point to the IP address of your Redis instance and REDIS_PORT_6379_TCP_PORT to its TCP port.
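These names follow Docker's legacy container-link convention (alias redis, port 6379), which is why the linked Docker setup above needs no extra configuration. A client built from them would look roughly like this sketch, assuming node_redis:

var redis = require('redis');

// Under Docker, --link snapshot-redis:redis injects these variables automatically;
// outside Docker, export them by hand as shown above.
var client = redis.createClient(
  process.env.REDIS_PORT_6379_TCP_PORT,
  process.env.REDIS_PORT_6379_TCP_ADDR
);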

Make sure to have all Node.js dependencies sorted before running the entrypoint process:

npm install

For further questions, refer to the respective Dockerfile.

Setting up Worker (snapshot-worker)

The worker is another instance of the main Snapshoter app, but with a different script to be spawned:

node src/worker.js

In practice, the worker has the same requirements as the HTTP entrypoint in terms of Linux system and dependencies.

If you use Docker, the worker runs from the same image as the HTTP entrypoint; the Dockerfile is provided in the snapshot-app folder. The build_image.sh script triggers a named image build, and refresh_container.sh stops, deletes, recreates, and runs a container from it.

Other dependencies may also apply depending on your setup. Refer to the Dockerfile for details.

For Debian Jessie, for instance, you may need to run the following:

$ apt-get -y update && \
    apt-get install -y \
    libfreetype6 \
    libfontconfig \
    nodejs \
    npm \
    git

IMPORTANT: some Linux distributions (such as Debian Jessie) ship the Node.js binary as nodejs. This is incompatible with Snapshoter. If that's your case, make sure to symlink it with:

sudo ln /usr/bin/nodejs /usr/bin/node

Before running the worker, set up the following environment variables:

export CACHE_LIFETIME=300000
export BASE_URL=https://www.virginamerica.com/

CACHE_LIFETIME is the maximum allowed age (in milliseconds) of a cached snapshot (e.g. 300000 equals 5 minutes). BASE_URL is the base URL that will preface all requests to the dynamic portion of the webserver.

If you are using the provided Dockerfile, you'll be able to start the container with:

docker run \
  --name snapshot-worker \
  --link snapshot-redis:redis \
  -d \
  snapshot-app \
  node src/worker.js

Notice the --link snapshot-redis:redis. This will link this container with the redis one.

If you are not using Docker, then make sure to specify the following environment variables:

export REDIS_PORT_6379_TCP_PORT=49161
export REDIS_PORT_6379_TCP_ADDR=192.168.59.103

Where REDIS_PORT_6379_TCP_ADDR should point to the IP address of your Redis instance and REDIS_PORT_6379_TCP_PORT to its TCP port.

Make sure to have all Node.js dependencies sorted before running the worker process:

npm install

For further questions, refer to the respective Dockerfile.

Setting up Queue UI (snapshot-kue-ui)

The Queue UI is a Node.js app.

Make sure you have Node.js and npm installed. Refer to http://nodejs.org/ for specific details for your target environment.

If you use Docker, a Dockerfile is provided in the snapshot-kue-ui folder. The build_image.sh script triggers a named image build, and refresh_container.sh stops, deletes, recreates, and runs a container from it.

Other dependencies may also apply depending on your setup. Refer to the Dockerfile for details.

For Debian Jessie, for instance, you may need to run the following:

$ apt-get -y update && \
    apt-get install -y \
    nodejs \
    npm \
    git

IMPORTANT: some Linux distributions (such as Debian Jessie) ship the Node.js binary as nodejs. This is incompatible with Snapshoter. If that's your case, make sure to symlink it with:

sudo ln /usr/bin/nodejs /usr/bin/node

The server runs on port 3000 so make sure to remap your ports when running the container or the process on the server.

The triggering command is:

node src/index.js

Before running the server, set up the following environment variables:

export KUE_USERNAME=admin
export KUE_PASSWORD=password

KUE_USERNAME and KUE_PASSWORD are the username and password the UI will use to authenticate users.
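Kue ships with an embeddable Express app for queue management, so the UI plausibly boils down to something like the sketch below (the basic-auth-connect middleware is an assumption; any HTTP basic auth middleware would do):

var kue = require('kue');
var express = require('express');
var basicAuth = require('basic-auth-connect'); // assumed middleware choice

kue.createQueue(); // connect Kue to Redis before mounting its UI

var app = express();
app.use(basicAuth(process.env.KUE_USERNAME, process.env.KUE_PASSWORD));
app.use(kue.app); // Kue's built-in queue management UI
app.listen(3000);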

If you are using Docker, this container will need to be linked to the Redis container under the alias redis such as:

docker run \
  --name snapshot-kue-ui \
  --link snapshot-redis:redis \
  -p 3001:3000 \
  -e KUE_USERNAME=admin \
  -e KUE_PASSWORD=password \
  -d \
  snapshot-kue-ui

Notice the --link snapshot-redis:redis. This will link this container with the redis one.

Also notice the port mapping (3001 on the host to 3000 in the container). Remember to set this up according to your environment.

If you are not using Docker, then make sure to specify the following environment variables:

export REDIS_PORT_6379_TCP_PORT=49161
export REDIS_PORT_6379_TCP_ADDR=192.168.59.103

Where REDIS_PORT_6379_TCP_ADDR should point to the IP address of your Redis instance and REDIS_PORT_6379_TCP_PORT to its TCP port.

Make sure to have all Node.js dependencies sorted before running the UI process:

npm install

For further questions, refer to the respective Dockerfile.

Setting up your Domain (i.e. mod_rewrite)

Your server will have to redirect those requests identified as coming from crawlers to Snapshoter.

The way to achieve this in Apache is to use Apache's mod_rewrite module.

You can set the rewrite rules either in the virtualhost configuration for the site or the .htaccess file that sits at the root of the server directory.

Here is an example of how to configure this using Apache's mod_rewrite:

RewriteEngine On
Options +FollowSymLinks
RewriteCond %{REQUEST_URI}  ^/$
RewriteCond %{QUERY_STRING} ^_escaped_fragment_=/?(.*)$
RewriteRule ^(.*)$ /snapshots/?uri=%1 [NC,L,NE]

Prior to that you may need to enable the appropriate modules:

$ a2enmod rewrite
$ a2enmod proxy
$ a2enmod proxy_http

You may also need to reload Apache:

$ sudo /etc/init.d/apache2 reload

This configuration will rewrite escaped fragment requests such as:

http://www.example.com/?_escaped_fragment_=book%2Fow%2Fa1%2Fnyc_sfo

to:

http://www.example.com/snapshots/?uri=book%2Fow%2Fa1%2Fnyc_sfo

The only thing left on the HTTP server side is to deal with context switching/mapping so that /snapshots is mapped to the port Snapshoter is listening on (refer to your Akamai, AWS, or other hosting vendor for environment-specific directions on how to achieve this).
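With Apache, for instance, this mapping can be as simple as a reverse proxy rule (the host and port below are illustrative; this is what the proxy and proxy_http modules enabled above are for):

ProxyPass /snapshots http://snapshoter.example.com:3000/snapshots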

Advanced Settings

There are cases where the static pages may need some extra treatment before being sent to crawlers. Take the title, meta, or canonical link tags, for example: if you are generating static snapshots of a single-page app, chances are these will all be the same across pages. However, this is not what you want to send to crawlers for SEO purposes.

If a file called processor-data.js is found in the snapshot-app/src path, then Snapshoter will load a post-processor tool and set it up with the contents of processor-data.js. This file must follow the structure below:

module.exports = function (processor) {
  processor
    .global(function ($, path, meta) {
      $('link[rel=canonical]').attr('href','http://www.virginamerica.com' + path);
    })
    .when(/^\/book/, function ($, path, meta) {
      $('title').text(
        'Book Flights,Hotels,Car Rentals &amp; More | Virgin America'
      );
      $('meta[name=description]').attr(
        'content',
        [
          'Book your next flight,hotel or vacation package with Virgin America. ',
          'Find low-fare plane tickets or bundle your reservation with a car ',
          'rental to save even more.'
        ].join('')
      );
    });
};

What is happening here? Check this out:

  1. The global directive will be triggered for all requests
  2. In this case, the function handler changes the canonical link to something more dynamic
  3. Each handler function gets three parameters ($, path and meta)
  4. $ is a parsed DOM version of the HTML. It is parsed using cheerio and supports its simplified selectors (see more at Cheerio's website)
  5. path is the path of the original fragment that triggered this snapshot
  6. meta is the request object from expressjs. It can be used to enrich the HTML any way you see fit
  7. The when directive takes a regex as its first argument before the handler function
  8. Each request is matched against the regex of the when directives and if matched, the handler function is called
  9. Multiple when directives may be used
  10. The example above changes the title tag and the meta description tag when the path is /book followed by anything
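For reference, the dispatch implied by this API can be pictured along these lines (a sketch only; the shape of the registrations object is assumed, and the real implementation lives in snapshot-app):

var cheerio = require('cheerio');

// Sketch: run every global handler, then every when() handler whose
// regex matches the request path, over the rendered snapshot.
function postProcess(html, path, meta, registrations) {
  var $ = cheerio.load(html);
  registrations.globals.forEach(function (handler) {
    handler($, path, meta);
  });
  registrations.whens.forEach(function (entry) {
    if (entry.regex.test(path)) entry.handler($, path, meta);
  });
  return $.html();
}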

Testing your Installation

Your installation can be tested by following the steps below:

  1. Keep a console output open on the snapshot-app main HTTP process/container
  2. Keep a console output open on the snapshot-worker process/container
  3. Create a valid URL to your domain and convert it manually to its escaped fragment form according to the spec (see the example after this list)
  4. You should see snapshot-app logging the request and indicating its response strategy
  5. Keep refreshing the page to see the cached response
  6. Make sure that a queued job is triggered when your specified cache age expires
  7. Then check snapshot-worker's console to see the cache being updated
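For step 3, assuming the hashbang method, a URL such as:

http://www.example.com/#!book/ow/a1/nyc_sfo

would be converted manually to:

http://www.example.com/?_escaped_fragment_=book%2Fow%2Fa1%2Fnyc_sfo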