Generate Sitemaps on read-only filesystems like Heroku

Ben Richardson edited this page Apr 10, 2016 · 19 revisions

On read-only filesystems (like Heroku), sitemaps are generated into a temporary directory (or any directory with write access) and then uploaded to a remote server.

Using Fog

As of 2012-07-12, SitemapGenerator includes some other adapters which you can use if you prefer not to use CarrierWave. SitemapGenerator::S3Adapter uses fog-aws. You just need to set a few environment variables to configure your S3 key, bucket, etc., namely: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, FOG_PROVIDER, FOG_DIRECTORY, FOG_REGION and FOG_PATH_STYLE. Take a look at this issue for more information.

The S3Adapter now supports configurable options so you don't have to use environment variables. The options are:

  • :aws_access_key_id
  • :aws_secret_access_key
  • :fog_provider
  • :fog_directory
  • :fog_region
  • :fog_path_style

Pass them in when you initialize your adapter. You can see the code in this issue.

If you omit the access key and secret access key options, the adapter will attempt to use the local IAM profile.
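For example, on an EC2 instance with an IAM role attached, a minimal sketch might omit the keys entirely (the bucket name and region below are placeholders):

```ruby
# config/sitemap.rb
# Sketch: no key options are passed, so the adapter falls back to the
# instance's IAM profile for credentials.
SitemapGenerator::Sitemap.adapter = SitemapGenerator::S3Adapter.new(
  fog_provider:  'AWS',
  fog_directory: 'your-bucket',   # placeholder bucket name
  fog_region:    'us-west-2'      # placeholder region
)
```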

An easy way to configure Fog is to set these environment variables:

AWS_ACCESS_KEY_ID=XXX
AWS_SECRET_ACCESS_KEY=XXX
FOG_PROVIDER=AWS
FOG_DIRECTORY=your-bucket
FOG_REGION=us-west-2

Alternatively, you can pass in some or all of those values when you create your S3Adapter in the sitemap.rb configuration file:

    SitemapGenerator::Sitemap.adapter = SitemapGenerator::S3Adapter.new(
      fog_provider: 'AWS',
      aws_access_key_id: <your-access-key-id>,
      aws_secret_access_key: <your-access-key>,
      fog_directory: <your-bucket>,
      fog_region: <your-aws-region e.g. us-west-2>
    )

Once you have Fog working, add the following to the sitemap.rb configuration file:

# Set the host name for URL creation
SitemapGenerator::Sitemap.default_host = "http://example.com"
# Pick a place safe to write the files
SitemapGenerator::Sitemap.public_path = 'tmp/'
# Store on S3 using Fog (pass in configuration values as shown above if needed)
SitemapGenerator::Sitemap.adapter = SitemapGenerator::S3Adapter.new
# Inform the map cross-linking where to find the other maps
SitemapGenerator::Sitemap.sitemaps_host = "http://#{ENV['FOG_DIRECTORY']}.s3.amazonaws.com/"
# Pick a namespace within your bucket to organize your maps
SitemapGenerator::Sitemap.sitemaps_path = 'sitemaps/'
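With those settings in place, the create block goes in the same file. A minimal sketch (the paths below are placeholders for your application's own routes):

```ruby
# config/sitemap.rb (continued)
SitemapGenerator::Sitemap.create do
  # Placeholder paths; replace with your application's routes.
  add '/home', changefreq: 'daily', priority: 0.9
  add '/contact_us', changefreq: 'monthly'
end
```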

If your bucket is in a region other than the default, your sitemaps_host must include the region. For example, for a bucket named your-bucket in the us-west-2 region, the sitemaps_host would be http://s3-us-west-2.amazonaws.com/your-bucket/
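The rule above can be sketched as a small helper (this method is purely illustrative and not part of SitemapGenerator):

```ruby
# Hypothetical helper: build a region-aware sitemaps_host for a
# path-style S3 URL. us-east-1 is the default region and needs no
# region prefix in the hostname.
def s3_sitemaps_host(bucket, region)
  if region == 'us-east-1'
    "http://s3.amazonaws.com/#{bucket}/"
  else
    "http://s3-#{region}.amazonaws.com/#{bucket}/"
  end
end

s3_sitemaps_host('your-bucket', 'us-west-2')
# => "http://s3-us-west-2.amazonaws.com/your-bucket/"
```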

Using CarrierWave

SitemapGenerator can use CarrierWave to support uploading to the Amazon S3 store, the Rackspace Cloud Files store, MongoDB's GridFS, or basically whatever else CarrierWave supports.

Include the CarrierWave gem

# Gemfile
gem 'sitemap_generator', '2.0.1.pre1'  # at time of writing
gem 'carrierwave'
gem 'fog-aws' # if you're using S3

Configure Sitemap Generator

Here is an example sitemap file. It generates sitemaps into tmp/sitemaps/. Note that we set the sitemaps_host to the hostname of the server that will be hosting our sitemaps. The full path to the sitemaps then becomes the remote host + the sitemaps path + the sitemap filename. We set the adapter to a SitemapGenerator::WaveAdapter, which is a CarrierWave::Uploader::Base subclass.

SitemapGenerator::Sitemap.default_host = "http://www.example.com"
SitemapGenerator::Sitemap.sitemaps_host = "http://s3.amazonaws.com/sitemap-generator/"
SitemapGenerator::Sitemap.public_path = 'tmp/'
SitemapGenerator::Sitemap.sitemaps_path = 'sitemaps/'
SitemapGenerator::Sitemap.adapter = SitemapGenerator::WaveAdapter.new
SitemapGenerator::Sitemap.create do
  add 'hello_world!'
  add 'another'
end

Configure CarrierWave

In this example we are uploading to S3 using Fog. (I didn't have any success using the s3 storage option.) The fog_directory is your S3 bucket name.

# config/initializers/carrierwave.rb
CarrierWave.configure do |config|
  config.cache_dir = "#{Rails.root}/tmp/"
  config.storage = :fog
  config.permissions = 0666
  config.fog_credentials = {
    :provider               => 'AWS',
    :aws_access_key_id      => 'your key',
    :aws_secret_access_key  => 'your secret',
  }
  config.fog_directory  = 'bucket name'
end

With all that in place, you should be able to run rake sitemap:refresh and have your sitemaps generated and uploaded!

After running my test with my bucket 'sitemap-generator', my sitemaps were successfully uploaded to https://s3.amazonaws.com/sitemap-generator/sitemaps/sitemap1.xml.gz and https://s3.amazonaws.com/sitemap-generator/sitemaps/sitemap_index.xml.gz.

To make sure that your sitemaps are found by the search engines, include the link to the sitemap_index.xml.gz file in your robots.txt file by adding the following line:

Sitemap: http://s3.amazonaws.com/sitemap-generator/sitemaps/sitemap_index.xml.gz

And that should be it! This is still in beta and is not well tested at this time.

Troubleshooting

If you encounter problems, first check the tmp/ directory and make sure the sitemap files were generated correctly (matching the rake output). Then make sure that your S3 bucket is made public and check for any response messages from CarrierWave.

From Issue #69 - If you were already using CarrierWave for uploads, make sure to note this line in the carrierwave.rb initializer above:

config.storage = :fog

CarrierWave examples commonly set the storage value in the uploader, like this:

class AvatarUploader < CarrierWave::Uploader::Base
  storage :fog
end

However, in order for sitemap uploads to work properly, this value must be set in the carrierwave.rb initializer.
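A sketch of how the two settings can coexist (AvatarUploader and its :file storage are illustrative): the initializer sets the global default that WaveAdapter inherits, while other uploaders can still override it per class.

```ruby
# config/initializers/carrierwave.rb
CarrierWave.configure do |config|
  config.storage = :fog   # global default; SitemapGenerator::WaveAdapter relies on this
end

# app/uploaders/avatar_uploader.rb
class AvatarUploader < CarrierWave::Uploader::Base
  storage :file           # per-class override; does not affect sitemap uploads
end
```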