Search engine harvesting

If you want your site to be harvested by search engines, you will need to consider the effect this will have on server load. Excessive crawling can have a negative impact on search performance for all users.

Sitemap

It is a good idea to create a sitemap that tells crawlers which pages you want to be harvested. The SitemapGenerator gem can be used to regenerate the sitemap periodically and ping search engines to trigger new harvests. Here is an example implementation, developed for the Danish Research Database, that creates links to all relevant documents in Solr.

SitemapGenerator::Sitemap.create do
  # We set a boolean value in our environment files to prevent generation in staging or development
  break unless Rails.application.config.sitemap[:generate]

  # Add static pages
  # This is quite primitive - could perhaps be improved by querying the Rails routes in the about namespace
  ['', 'search-and-get', 'data', 'faq'].each do |page|
    add "/about/#{page}"
  end

  # Add single record pages, paging through the whole index with a Solr cursor
  cursor_mark = '*'
  loop do
    # On newer Blacklight versions use Blacklight.default_index.connection instead of Blacklight.solr
    response = Blacklight.solr.get('/solr/blacklight/select', params: { # you may need to change the request handler
      'q'          => '*:*',                     # all docs
      'fl'         => 'id',                      # we only need the ids (request a different field here if your show routes use another identifier)
      'fq'         => '',                        # optional filter query
      'cursorMark' => cursor_mark,               # the cursor mark handles deep paging
      'rows'       => ENV['BATCH_SIZE'] || 1000,
      'sort'       => 'id asc'                   # cursor paging requires a sort on the unique key
    })

    response['response']['docs'].each do |doc|
      add "/catalog/#{doc['id']}"
    end

    break if response['nextCursorMark'] == cursor_mark # the cursor stops advancing once the result set is exhausted

    cursor_mark = response['nextCursorMark']
  end
end
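
For context, the create block above typically lives in config/sitemap.rb (the file generated by the gem's rake sitemap:install task). It assumes two pieces of supporting configuration: SitemapGenerator's default_host, which the gem needs in order to build absolute URLs, and the config.sitemap hash read by the guard at the top, which is an application-specific convention rather than part of the gem. A minimal sketch of both:

# config/sitemap.rb, above the create block: the canonical host used for absolute URLs
SitemapGenerator::Sitemap.default_host = 'https://example.org'

# config/environments/production.rb (and staging/development equivalents),
# inside the Rails.application.configure block
config.sitemap = { generate: true } # set generate: false where the sitemap should not be built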

It is a good idea to trigger sitemap generation via cron at times of low activity (for example at the weekend) so that generating the sitemap, and the harvesting it triggers, doesn't impact human users. For example:

0 2 * * 6 cd <app_root> && RAILS_ENV=production /usr/bin/bundle exec rake sitemap:clean sitemap:refresh

Robots.txt

If you expose a sitemap listing the pages you do want to be harvested, it is also a good idea to tell crawlers which pages you do not want to be harvested. Some crawlers will construct URLs for search results pages, leading to a potentially infinite number of crawl targets, so you should include a robots.txt file that disallows search results pages. Here is an example:

# robots.txt
# Load the sitemap if it is present
<%- if File.exist? "#{Rails.root}/public/sitemap.xml.gz" -%>
Sitemap: <%= "#{root_url :locale => nil}sitemap.xml.gz" %>
<%- end -%>
User-agent: *
Disallow: /catalog? # blocks search results pages
Disallow: /catalog.html? # sometimes they use .html to get searches, Sneaky Google!
Disallow: /catalog/facet # blocks facet pages
Disallow: /catalog/range_limit
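
Note that this example is an ERB template, whereas public/robots.txt in a stock Rails application is served as a static file and never passes through ERB. One way to evaluate it per request is to remove the static file, save the template as a view, and route /robots.txt to a small controller action. A minimal sketch, assuming a template at app/views/pages/robots.text.erb and a hypothetical PagesController (neither is provided by Blacklight):

# config/routes.rb
get '/robots.txt' => 'pages#robots', defaults: { format: 'text' }

# app/controllers/pages_controller.rb
class PagesController < ApplicationController
  # Renders app/views/pages/robots.text.erb, evaluating the ERB conditions on each request
  def robots
    render layout: false
  end
end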

Metadata

Google Scholar uses Highwire Press tags for parsing academic metadata. For example:

<meta name="citation_title" content="Association between regional cerebral blood flow during hypoglycemia and genetic and phenotypic traits of the renin-angiotensin system" />
<meta name="citation_author" content="Lise Grimmeshave Bie-Olsen" />
<meta name="citation_author" content="Ulrik Pedersen-Bjergaard" />
<meta name="citation_author" content="Troels Wesenberg Kjaer" />
<meta name="citation_author" content="Markus Nowak Lonsdale" />
<meta name="citation_author" content="Ian Law" />
<meta name="citation_author" content="Birger Thorsteinsson" />
<meta name="citation_publication_date" content="2009" />
<meta name="citation_journal_title" content="Journal of Cerebral Blood Flow and Metabolism" />
<meta name="citation_language" content="eng" />
<meta name="citation_doi" content="10.1038/jcbfm.2009.94" />
<meta name="citation_issn" content="1559-7016" />
<meta name="citation_issn" content="0271-678x" />