5639 googlebot guide #5675

Merged 4 commits on Mar 21, 2019
22 changes: 21 additions & 1 deletion doc/sphinx-guides/source/_static/util/robots.txt
@@ -1,6 +1,26 @@
User-agent: *
# Note: In its current form, this sample robots.txt makes the site
# accessible to all crawler bots (specified as "User-agent: *").
# It instructs the bots to access and index the dataverse and dataset pages;
# it also tells them to stay away from all other pages (the "Disallow: /" line)
# and not to follow any search links on a dataverse page.
# It is possible to specify different access rules for different bots.
# For example, if you only want the site accessed by Googlebot and
# want to keep all other bots away, uncomment the following two lines:
#Disallow: /
#User-agent: Googlebot
Allow: /$
Allow: /dataverse.xhtml
Allow: /dataset.xhtml
Allow: /dataverse/
Allow: /sitemap/
# The following lines are for the Facebook, Twitter, and LinkedIn preview bots:
Allow: /api/datasets/:persistentId/thumbnail
Allow: /javax.faces.resource/images/
# Comment out the following TWO lines if you DON'T MIND the bots crawling the search API links on dataverse pages:
Disallow: /dataverse/*?q
Disallow: /dataverse/*/search
Disallow: /
# The Crawl-delay specification *may* be honored by *some* bots.
# It is *definitely* ignored by Googlebot (and they never promised to
# recognize it - it is not mentioned in their documentation).
Crawl-delay: 20
15 changes: 14 additions & 1 deletion doc/sphinx-guides/source/installation/config.rst
@@ -640,6 +640,14 @@ Ensure robots.txt Is Not Blocking Search Engines
For a public production Dataverse installation, you most likely want search agents to be able to index published pages (i.e., pages that are visible to an unauthenticated user).
Polite crawlers usually respect the `Robots Exclusion Standard <https://en.wikipedia.org/wiki/Robots_exclusion_standard>`_; we have provided an example of a production robots.txt :download:`here </_static/util/robots.txt>`.

We **strongly recommend** using the crawler rules in the sample robots.txt linked above. Note that they make the dataverse and dataset pages accessible to search engine bots, but discourage the bots from actually crawling the site by following the search links (facets and such) on dataverse pages. Such crawling is very inefficient in terms of system resources, and it often produces confusing search results for the end users of the search engines (for example, when partial search results are indexed as individual pages).

The recommended solution instead is to point the bots directly to the dataset and dataverse pages that need to be indexed, by advertising them via an explicit sitemap (see the next section for details on how to generate the sitemap).

You can of course modify your own robots.txt to suit your specific needs. If you don't want your datasets to be indexed at all, you can tell the bots to stay away from your site completely. But, as noted above, keep in mind that only the good, "polite" bots honor these rules! Adding a "Disallow" rule to robots.txt does not actually block anyone from accessing your site - it is a suggestion only, and a rogue bot can and will violate it. If your site is being overloaded by what looks like heavy automated crawling, you may have to resort to blocking that traffic by other means - for example, via rewrite rules in Apache, or even at the firewall.
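A minimal sketch of the firewall approach using ``iptables`` (the address range below is a placeholder; substitute whatever range actually shows up in your access logs):

.. code-block:: bash

    # Placeholder address range - drop all traffic from an abusive crawler's network
    iptables -A INPUT -s 192.0.2.0/24 -j DROP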

(See the sample robots.txt file linked above for comments on how to set up different "Allow" and "Disallow" rules for different crawler bots.)

You have a couple of options for putting an updated robots.txt file into production. If you are fronting Glassfish with Apache as recommended above, you can place robots.txt in the root of the directory specified in your ``VirtualHost`` and add to your Apache config a ``ProxyPassMatch`` line like the one below to prevent Glassfish from serving the version of robots.txt that is embedded in the Dataverse war file:

.. code-block:: text
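
    # The exact rule may differ in your setup; the idea is to stop Apache from
    # proxying requests for /robots.txt to Glassfish, so that the robots.txt
    # placed in your VirtualHost document root is served instead:
    ProxyPassMatch ^/robots.txt$ !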
@@ -666,10 +674,15 @@ This will create or update a file in the following location unless you have cust

On an installation of Dataverse with many datasets, creating or updating the sitemap can take a while. You can check Glassfish's server.log file for "BEGIN updateSiteMap" and "END updateSiteMap" lines to know when the process started and stopped, and to see any errors in between.
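For example, a quick way to find those lines (the log path below assumes a default Glassfish domain; adjust it for your installation):

.. code-block:: bash

    # Path is an assumption - point this at your Glassfish domain's server.log
    grep -E "(BEGIN|END) updateSiteMap" /usr/local/glassfish4/glassfish/domains/domain1/logs/server.log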

https://demo.dataverse.org/sitemap.xml is the sitemap URL for the Dataverse Demo site, and yours should be similar.

Once the sitemap has been generated and placed in the domain docroot directory, it becomes available to outside callers at <YOUR_SITE_URL>/sitemap/sitemap.xml; it is also accessible at <YOUR_SITE_URL>/sitemap.xml (via a *pretty-faces* rewrite rule). Some search engines will be able to find it at this default location. Others, **including Google**, need to be **specifically instructed** to retrieve it.
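A quick way to confirm that the sitemap is being served (the host name below is a placeholder; use your own site URL):

.. code-block:: bash

    # Print the beginning of the generated sitemap to verify it is reachable
    curl -s https://dataverse.example.edu/sitemap.xml | head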

One way to submit your sitemap URL to Google is through their "Search Console" (https://search.google.com/search-console). In order to use the console, you will need to authenticate yourself as the owner of your Dataverse site. Various authentication methods are provided, but if you are already using Google Analytics, the easiest way is to use that account. Make sure you are logged in to Google with the account that has edit permission on your Google Analytics property; go to the Search Console and enter the root URL of your Dataverse server, then choose Google Analytics as the authentication method. Once logged in, click "Sitemaps" in the menu on the left. (todo: add a screenshot?) Consult `Google's "submit a sitemap" instructions`_ for more information, and/or similar instructions for other search engines.

.. _Google's "submit a sitemap" instructions: https://support.google.com/webmasters/answer/183668


Putting Your Dataverse Installation on the Map at dataverse.org
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

24 changes: 3 additions & 21 deletions src/main/webapp/robots.txt
@@ -1,23 +1,5 @@
# Consult the guide on how to open your Dataverse installation to crawling by
# search engine bots:
# /guides/installation/config.html#letting-search-engines-crawl-your-installation
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /
#Crawl-delay: 120
#Disallow: /dvn/faces/javax.faces.resource
#Disallow: /dvn/OAIHandler
#Disallow: /dvn/faces/ContactUsPage.xhtml
#Disallow: /dvn/dv/*/faces/ContactUsPage.xhtml
#Disallow: /dvn/faces/study/TermsOfUsePage.xhtml
#Disallow: /dvn/faces/subsetting/SubsettingPage.xhtml
#Disallow: /dvn/dv/*/faces/subsetting/SubsettingPage.xhtml
#Disallow: /dvn/FileDownload/
#Disallow: /FileDownload/
#Disallow: /dvn/dv/*/FileDownload/
#Disallow: /dvn/resources/
#Disallow: /dvn/api/
#


# Created initially using: http://www.mcanerin.com/EN/search-engine/robots-txt.asp
# Verified using: http://tool.motoricerca.info/robots-checker.phtml