Support for sitemaps #4261 #5084

Merged
merged 22 commits into from
Oct 4, 2018
22 commits
d2ccf59
add doc stub for sitemaps #4261
pdurbin Sep 24, 2018
afe3d0f
stub out sitemap code and tests #4261
pdurbin Sep 24, 2018
61231b2
write sitemap to docroot #4261
pdurbin Sep 25, 2018
45bde32
serve from /sitemap.xml #4261
pdurbin Sep 25, 2018
e436355
add datasets to sitemap #4261
pdurbin Sep 26, 2018
125d163
add test to assert that XML is well formed #4261
pdurbin Sep 27, 2018
f3d9b31
validate sitemap against the schema #4261
pdurbin Sep 27, 2018
d96f8cb
add dataverses to sitemap #4261
pdurbin Sep 27, 2018
9934fdf
fix test (dv must be published to appear in sitemap) #4261
pdurbin Sep 27, 2018
b742ce4
consistent "lastmod" for dataverses and datasets #4261
pdurbin Sep 28, 2018
b0c9401
add todo to support more than 50,000 URLs in sitemap #4261
pdurbin Sep 28, 2018
63d7a4f
improve docs, explain what's in sitemap, cron #4261
pdurbin Sep 28, 2018
6eecbf1
Merge branch 'develop' into 4261-sitemap #4261
pdurbin Sep 28, 2018
e73595f
Merge branch '5122-fix-netbeans-compat' into 4261-sitemap #4261
pdurbin Oct 2, 2018
bd54ba0
add BEGIN and END lines to log #4261
pdurbin Oct 2, 2018
c4116a1
explain that logos and sitemaps are written per server #4261
pdurbin Oct 2, 2018
b574f27
stage sitemap before writing to final file #4261
pdurbin Oct 2, 2018
3b9bbf1
add validation to main routine, s/copy/move/ #4261
pdurbin Oct 2, 2018
11f6fca
make async, report error if staged file exists #4261
pdurbin Oct 2, 2018
d3531c5
Merge branch 'develop' into 4261-sitemap #4261
pdurbin Oct 3, 2018
c41fc16
Merge branch 'develop' into 4261-sitemap #4261
pdurbin Oct 3, 2018
c80dc43
typo: wrong directory for sitemap was documented #4261
pdurbin Oct 4, 2018
6 changes: 5 additions & 1 deletion doc/sphinx-guides/source/installation/advanced.rst
@@ -10,7 +10,11 @@ Advanced installations are not officially supported but here we are at least doc
Multiple Glassfish Servers
--------------------------

The main thing to know about running multiple Glassfish servers is that only one can be the dedicated timer server, as explained in the :doc:`/admin/timers` section of the Admin Guide.
You should be conscious of the following when running multiple Glassfish servers.

- Only one Glassfish server can be the dedicated timer server, as explained in the :doc:`/admin/timers` section of the Admin Guide.
- When users upload a logo for their dataverse using the "theme" feature described in the :doc:`/user/dataverse-management` section of the User Guide, these logos are stored only on the Glassfish server the user happened to be on when uploading the logo. By default these logos are written to the directory ``/usr/local/glassfish4/glassfish/domains/domain1/docroot/logos``.
- When a sitemap is created by a Glassfish server it is written to the filesystem of just that Glassfish server. By default the sitemap is written to the directory ``/usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap``.

Detecting Which Glassfish Server a User Is On
+++++++++++++++++++++++++++++++++++++++++++++
22 changes: 22 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -423,6 +423,9 @@ Out of the box, Dataverse attempts to block search engines from crawling your in
Letting Search Engines Crawl Your Installation
++++++++++++++++++++++++++++++++++++++++++++++

Ensure robots.txt Is Not Blocking Search Engines
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a public production Dataverse installation, you will probably want search engines to be able to index published pages (that is, pages that are visible to an unauthenticated user).
Polite crawlers usually respect the `Robots Exclusion Standard <https://en.wikipedia.org/wiki/Robots_exclusion_standard>`_; we have provided an example of a production robots.txt :download:`here </_static/util/robots.txt>`.

@@ -437,6 +440,25 @@ For more of an explanation of ``ProxyPassMatch`` see the :doc:`shibboleth` secti

If you are not fronting Glassfish with Apache you'll need to prevent Glassfish from serving the robots.txt file embedded in the war file by overwriting robots.txt after the war file has been deployed. The downside of this technique is that you will have to remember to overwrite robots.txt in the "exploded" war file each time you deploy the war file, which probably means each time you upgrade to a new version of Dataverse. Furthermore, since the version of Dataverse is always incrementing and the version can be part of the file path, you will need to be conscious of where on disk you need to replace the file. For example, for Dataverse 4.6.1 the path to robots.txt may be ``/usr/local/glassfish4/glassfish/domains/domain1/applications/dataverse-4.6.1/robots.txt`` with the version number ``4.6.1`` as part of the path.

Creating a Sitemap and Submitting it to Search Engines
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Search engines have an easier time indexing content when you provide them a sitemap. The Dataverse sitemap includes URLs to all published dataverses and all published datasets that are not harvested or deaccessioned.

Create or update your sitemap by adding the following curl command to cron to run nightly or as you see fit:

``curl -X POST http://localhost:8080/api/admin/sitemap``

This will create or update a file in the following location unless you have customized your installation directory for Glassfish:

``/usr/local/glassfish4/glassfish/domains/domain1/docroot/sitemap/sitemap.xml``

On an installation of Dataverse with many datasets, the creation or updating of the sitemap can take a while. You can check Glassfish's server.log file for "BEGIN updateSiteMap" and "END updateSiteMap" lines to know when the process started and stopped and any errors in between.

https://demo.dataverse.org/sitemap.xml is the sitemap URL for the Dataverse Demo site and yours should be similar. Submit your sitemap URL to Google by following `Google's "submit a sitemap" instructions`_ or similar instructions for other search engines.

.. _Google's "submit a sitemap" instructions: https://support.google.com/webmasters/answer/183668

Putting Your Dataverse Installation on the Map at dataverse.org
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

31 changes: 31 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/api/SiteMap.java
@@ -0,0 +1,31 @@
package edu.harvard.iq.dataverse.api;

import edu.harvard.iq.dataverse.sitemap.SiteMapServiceBean;
import edu.harvard.iq.dataverse.sitemap.SiteMapUtil;
import javax.ejb.EJB;
import javax.ejb.Stateless;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Stateless
@Path("admin/sitemap")
public class SiteMap extends AbstractApiBean {

    @EJB
    SiteMapServiceBean siteMapSvc;

    @POST
    @Produces(MediaType.APPLICATION_JSON)
    public Response updateSiteMap() {
        boolean stageFileExists = SiteMapUtil.stageFileExists();
        if (stageFileExists) {
            return error(Response.Status.BAD_REQUEST, "Sitemap cannot be updated because staged file exists.");
        }
        siteMapSvc.updateSiteMap(dataverseSvc.findAll(), datasetSvc.findAll());
        return ok("Sitemap update has begun. Check logs for status.");
    }

}
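
The endpoint above checks for the staged file and then hands the work to an asynchronous bean, so two requests arriving close together could in principle both pass the check. A minimal sketch of one way to narrow that window, assuming the staged file itself is used as a lock: `Files.createFile` performs the existence check and the creation as a single atomic step. The class `StagedFileClaim` and the method `tryClaim` are hypothetical names and are not part of this PR.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of the PR.
class StagedFileClaim {

    /**
     * Tries to create the staged file. Returns true if this caller created it,
     * false if it already existed (another update is running, or a previous
     * run failed and left the file behind).
     */
    static boolean tryClaim(Path stagedPath) throws IOException {
        try {
            // The existence check and the creation happen as one atomic operation.
            Files.createFile(stagedPath);
            return true;
        } catch (FileAlreadyExistsException ex) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sitemap");
        Path staged = dir.resolve("sitemap.xml.staged");
        System.out.println(tryClaim(staged)); // first caller
        System.out.println(tryClaim(staged)); // second caller, file already exists
    }
}
```

If something like this were adopted, the writer in `SiteMapUtil.updateSiteMap` would need to overwrite the claimed file instead of treating its existence as an error, so this is a design option to weigh and not a drop-in change.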
@@ -0,0 +1,17 @@
package edu.harvard.iq.dataverse.sitemap;

import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.Dataverse;
import java.util.List;
import javax.ejb.Asynchronous;
import javax.ejb.Stateless;

@Stateless
public class SiteMapServiceBean {

    @Asynchronous
    public void updateSiteMap(List<Dataverse> dataverses, List<Dataset> datasets) {
        SiteMapUtil.updateSiteMap(dataverses, datasets);
    }

}
225 changes: 225 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/sitemap/SiteMapUtil.java
@@ -0,0 +1,225 @@
package edu.harvard.iq.dataverse.sitemap;

import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.Dataverse;
import edu.harvard.iq.dataverse.DvObjectContainer;
import edu.harvard.iq.dataverse.util.SystemConfig;
import edu.harvard.iq.dataverse.util.xml.XmlValidator;
import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

public class SiteMapUtil {

    private static final Logger logger = Logger.getLogger(SiteMapUtil.class.getCanonicalName());

    static final String SITEMAP_FILENAME_FINAL = "sitemap.xml";
    static final String SITEMAP_FILENAME_STAGED = "sitemap.xml.staged";

    /**
     * TODO: Handle more than 50,000 entries in the sitemap.
     *
     * (As of this writing Harvard Dataverse only has ~3000 dataverses and
     * ~30,000 datasets.)
     *
     * "each Sitemap file that you provide must have no more than 50,000 URLs"
     * https://www.sitemaps.org/protocol.html
     *
     * Consider using a third party library: "One sitemap can contain a maximum
     * of 50,000 URLs. (Some sitemaps, like Google News sitemaps, can contain
     * only 1,000 URLs.) If you need to put more URLs than that in a sitemap,
     * you'll have to use a sitemap index file. Fortunately, WebSitemapGenerator
     * can manage the whole thing for you."
     * https://github.com/dfabulich/sitemapgen4j
     */
    public static void updateSiteMap(List<Dataverse> dataverses, List<Dataset> datasets) {

        logger.info("BEGIN updateSiteMap");

        String sitemapPathString = getSitemapPathString();
        String stagedSitemapPathAndFileString = sitemapPathString + File.separator + SITEMAP_FILENAME_STAGED;
        String finalSitemapPathAndFileString = sitemapPathString + File.separator + SITEMAP_FILENAME_FINAL;

        Path stagedPath = Paths.get(stagedSitemapPathAndFileString);
        if (Files.exists(stagedPath)) {
            logger.warning("Unable to update sitemap! The staged file from a previous run already existed. Delete " + stagedSitemapPathAndFileString + " and try again.");
            return;
        }

        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();

Review comment (Contributor): It seems like this generation could benefit from schema validation via DocumentBuilderFactory.setSchema(schema)?
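
On the suggestion above: `DocumentBuilderFactory.setSchema` affects parsers created from the factory, so it validates documents as they are parsed, whereas this method builds its DOM in memory with `newDocument()`. That may be why the PR validates the staged file instead. A rough sketch of validating the in-memory DOM directly with `javax.xml.validation`, under two assumptions: elements are created with `createElementNS` (the PR uses `createElement` plus an `xmlns` attribute), and `TINY_XSD` is a simplified stand-in for the official sitemap.xsd. The class name, method names and the example URL are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical sketch; not part of the PR.
class InMemorySitemapValidation {

    static final String NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Simplified stand-in for the real sitemap.xsd (assumption, for illustration).
    static final String TINY_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\""
            + " targetNamespace=\"" + NS + "\" xmlns=\"" + NS + "\""
            + " elementFormDefault=\"qualified\">"
            + "<xs:element name=\"urlset\"><xs:complexType><xs:sequence>"
            + "<xs:element name=\"url\" maxOccurs=\"unbounded\"><xs:complexType><xs:sequence>"
            + "<xs:element name=\"loc\" type=\"xs:anyURI\"/>"
            + "<xs:element name=\"lastmod\" type=\"xs:string\" minOccurs=\"0\"/>"
            + "</xs:sequence></xs:complexType></xs:element>"
            + "</xs:sequence></xs:complexType></xs:element>"
            + "</xs:schema>";

    // Builds a one-entry sitemap DOM; includeLoc=false leaves out the required <loc>.
    static Document buildDocument(boolean includeLoc) throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document document = factory.newDocumentBuilder().newDocument();
        Element urlSet = document.createElementNS(NS, "urlset");
        document.appendChild(urlSet);
        Element url = document.createElementNS(NS, "url");
        urlSet.appendChild(url);
        if (includeLoc) {
            Element loc = document.createElementNS(NS, "loc");
            loc.setTextContent("https://example.org/dataverse/example");
            url.appendChild(loc);
        }
        return document;
    }

    // Validates the DOM tree itself, without writing it to disk first.
    static boolean isValid(Document document) throws SAXException, IOException {
        SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = schemaFactory.newSchema(new StreamSource(new StringReader(TINY_XSD)));
        try {
            schema.newValidator().validate(new DOMSource(document));
            return true;
        } catch (SAXException ex) {
            // With no custom ErrorHandler, validation errors surface as SAXException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid(buildDocument(true)));  // complete entry
        System.out.println(isValid(buildDocument(false))); // entry missing <loc>
    }
}
```

Validating before writing would avoid leaving a rejected staged file on disk, at the cost of switching the element creation to the namespace-aware DOM calls.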

        DocumentBuilder documentBuilder = null;
        try {
            documentBuilder = documentBuilderFactory.newDocumentBuilder();
        } catch (ParserConfigurationException ex) {
            logger.warning("Unable to update sitemap! ParserConfigurationException: " + ex.getLocalizedMessage());
            return;
        }
        Document document = documentBuilder.newDocument();

        Element urlSet = document.createElement("urlset");
        urlSet.setAttribute("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");
        urlSet.setAttribute("xmlns:xhtml", "http://www.w3.org/1999/xhtml");
        document.appendChild(urlSet);

        for (Dataverse dataverse : dataverses) {
            if (!dataverse.isReleased()) {
                continue;
            }
            Element url = document.createElement("url");
            urlSet.appendChild(url);

            Element loc = document.createElement("loc");
            String dataverseAlias = dataverse.getAlias();
            loc.appendChild(document.createTextNode(SystemConfig.getDataverseSiteUrlStatic() + "/dataverse/" + dataverseAlias));
            url.appendChild(loc);

            Element lastmod = document.createElement("lastmod");
            lastmod.appendChild(document.createTextNode(getLastModDate(dataverse)));
            url.appendChild(lastmod);
        }

        for (Dataset dataset : datasets) {
            if (!dataset.isReleased()) {
                continue;
            }
            if (dataset.isHarvested()) {
                continue;
            }
            // The deaccessioned check is last because it has to iterate through dataset versions.
            if (dataset.isDeaccessioned()) {
                continue;
            }
            Element url = document.createElement("url");
            urlSet.appendChild(url);

            Element loc = document.createElement("loc");
            String datasetPid = dataset.getGlobalId().asString();
            loc.appendChild(document.createTextNode(SystemConfig.getDataverseSiteUrlStatic() + "/dataset.xhtml?persistentId=" + datasetPid));
            url.appendChild(loc);

            Element lastmod = document.createElement("lastmod");
            lastmod.appendChild(document.createTextNode(getLastModDate(dataset)));
            url.appendChild(lastmod);
        }

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = null;
        try {
            transformer = transformerFactory.newTransformer();
        } catch (TransformerConfigurationException ex) {
            logger.warning("Unable to update sitemap! TransformerConfigurationException: " + ex.getLocalizedMessage());
            return;
        }
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        DOMSource source = new DOMSource(document);
        File directory = new File(sitemapPathString);
        if (!directory.exists()) {
            directory.mkdir();
        }

        boolean debug = false;
        if (debug) {
            logger.info("Writing sitemap to console/logs");
            StreamResult consoleResult = new StreamResult(System.out);
            try {
                transformer.transform(source, consoleResult);
            } catch (TransformerException ex) {
                logger.warning("Unable to print sitemap to the console: " + ex.getLocalizedMessage());
            }
        }

        logger.info("Writing staged sitemap to " + stagedSitemapPathAndFileString);
        StreamResult result = new StreamResult(new File(stagedSitemapPathAndFileString));
        try {
            transformer.transform(source, result);
        } catch (TransformerException ex) {
            logger.warning("Unable to update sitemap! Unable to write staged sitemap to " + stagedSitemapPathAndFileString + ". TransformerException: " + ex.getLocalizedMessage());
            return;
        }

        logger.info("Checking staged sitemap for well-formedness. The staged file is " + stagedSitemapPathAndFileString);
        try {
            XmlValidator.validateXmlWellFormed(stagedSitemapPathAndFileString);
        } catch (Exception ex) {
            logger.warning("Unable to update sitemap! Staged sitemap file is not well-formed XML! The exception for " + stagedSitemapPathAndFileString + " is " + ex.getLocalizedMessage());
            return;
        }

        logger.info("Checking staged sitemap against XML schema. The staged file is " + stagedSitemapPathAndFileString);
        URL schemaUrl = null;
        try {
            schemaUrl = new URL("https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd");
        } catch (MalformedURLException ex) {
            // This URL is hard coded and it's fine. We should never get MalformedURLException so we just swallow the exception and carry on.
        }
        try {
            XmlValidator.validateXmlSchema(stagedSitemapPathAndFileString, schemaUrl);
        } catch (SAXException | IOException ex) {
            logger.warning("Unable to update sitemap! Exception caught while checking XML staged file (" + stagedSitemapPathAndFileString + ") against XML schema: " + ex.getLocalizedMessage());
            return;
        }

        Path finalPath = Paths.get(finalSitemapPathAndFileString);
        // The staged file is moved (not copied), so the log messages say "moving".
        logger.info("Moving staged sitemap from " + stagedSitemapPathAndFileString + " to " + finalSitemapPathAndFileString);
        try {
            Files.move(stagedPath, finalPath, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException ex) {
            logger.warning("Unable to update sitemap! Unable to move staged sitemap from " + stagedSitemapPathAndFileString + " to " + finalSitemapPathAndFileString + ". IOException: " + ex.getLocalizedMessage());
            return;
        }

        logger.info("END updateSiteMap");
    }

    private static String getLastModDate(DvObjectContainer dvObjectContainer) {
        // TODO: Decide if YYYY-MM-DD is enough. https://www.sitemaps.org/protocol.html
        // says "The date of last modification of the file. This date should be in W3C Datetime format.
        // This format allows you to omit the time portion, if desired, and use YYYY-MM-DD."
        return new SimpleDateFormat("yyyy-MM-dd").format(dvObjectContainer.getModificationTime());
    }

    public static boolean stageFileExists() {
        String sitemapPathString = getSitemapPathString();
        String stagedSitemapPathAndFileString = sitemapPathString + File.separator + SITEMAP_FILENAME_STAGED;
        Path stagedPath = Paths.get(stagedSitemapPathAndFileString);
        if (Files.exists(stagedPath)) {
            logger.warning("Unable to update sitemap! The staged file from a previous run already existed. Delete " + stagedSitemapPathAndFileString + " and try again.");
            return true;
        }
        return false;
    }

    private static String getSitemapPathString() {
        String sitemapPathString = "/tmp";
        // i.e. /usr/local/glassfish4/glassfish/domains/domain1
        String domainRoot = System.getProperty("com.sun.aas.instanceRoot");
        if (domainRoot != null) {
            // Note that we write to a directory called "sitemap" but we serve just "/sitemap.xml" using PrettyFaces.
            sitemapPathString = domainRoot + File.separator + "docroot" + File.separator + "sitemap";
        }
        return sitemapPathString;

    }
}
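
The TODO in `updateSiteMap` notes that a single sitemap file may hold at most 50,000 URLs and that larger installations need a sitemap index. As a rough illustration of that idea without a third-party library, the sketch below splits a list of URLs into chunks and builds an index that points at one file per chunk. The class `SitemapChunker`, its methods, and the `sitemap1.xml`, `sitemap2.xml`, ... file naming are assumptions made for this example; nothing here is part of the PR, and the site URL is assumed to contain no characters that need XML escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of the PR.
class SitemapChunker {

    // Limit quoted in the TODO, from https://www.sitemaps.org/protocol.html
    static final int MAX_URLS_PER_SITEMAP = 50_000;

    /** Splits the URLs into consecutive chunks of at most maxPerFile entries. */
    static <T> List<List<T>> chunk(List<T> urls, int maxPerFile) {
        if (maxPerFile < 1) {
            throw new IllegalArgumentException("maxPerFile must be at least 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < urls.size(); start += maxPerFile) {
            int end = Math.min(start + maxPerFile, urls.size());
            chunks.add(new ArrayList<>(urls.subList(start, end)));
        }
        return chunks;
    }

    /** Builds a sitemap index with one <sitemap> entry per chunk (file naming is an assumption). */
    static String buildIndex(String siteUrl, int chunkCount) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (int i = 1; i <= chunkCount; i++) {
            sb.append("  <sitemap><loc>")
                    .append(siteUrl).append("/sitemap/sitemap").append(i).append(".xml")
                    .append("</loc></sitemap>\n");
        }
        sb.append("</sitemapindex>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 120_001; i++) {
            urls.add("https://example.org/dataset.xhtml?persistentId=doi:10.5072/FK2/" + i);
        }
        List<List<String>> chunks = chunk(urls, MAX_URLS_PER_SITEMAP);
        System.out.println(chunks.size()); // 3
        System.out.print(buildIndex("https://example.org", chunks.size()));
    }
}
```

Each chunk would then be written through the same stage, validate and move steps the PR already uses for the single file.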
1 change: 1 addition & 0 deletions src/main/webapp/WEB-INF/glassfish-web.xml
@@ -11,5 +11,6 @@
  <property name="alternatedocroot_1" value="from=/guides/* dir=./docroot"/>
  <property name="alternatedocroot_2" value="from=/dataexplore/* dir=./docroot"/>
  <property name="alternatedocroot_logos" value="from=/logos/* dir=./docroot"/>
  <property name="alternatedocroot_sitemap" value="from=/sitemap/* dir=./docroot"/>
  <parameter-encoding default-charset="UTF-8"/>
</glassfish-web-app>
5 changes: 5 additions & 0 deletions src/main/webapp/WEB-INF/pretty-config.xml
@@ -17,4 +17,9 @@
        <view-id value="/search/advanced.xhtml" />
    </url-mapping>

    <url-mapping id="sitemap">
        <pattern value="/sitemap.xml" />
        <view-id value="/sitemap/sitemap.xml" />
    </url-mapping>

</pretty-config>
23 changes: 23 additions & 0 deletions src/test/java/edu/harvard/iq/dataverse/api/SiteMapIT.java
@@ -0,0 +1,23 @@
package edu.harvard.iq.dataverse.api;

import com.jayway.restassured.RestAssured;
import org.junit.BeforeClass;
import org.junit.Test;
import com.jayway.restassured.response.Response;

public class SiteMapIT {

    @BeforeClass
    public static void setUpClass() {
        RestAssured.baseURI = UtilIT.getRestAssuredBaseUri();
    }

    @Test
    public void testSiteMap() {
        Response response = UtilIT.sitemapUpdate();
        response.prettyPrint();
        Response download = UtilIT.sitemapDownload();
        download.prettyPrint();
    }

}
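
The integration test above prints both responses but asserts nothing about them. A minimal sketch of a helper that a test could use to check the downloaded body, assuming the sitemap XML is available as a string: it parses the XML with a namespace-aware parser and returns every `<loc>` value. The class `SitemapParsing`, the method `extractLocs` and the sample URLs are hypothetical and not part of this PR.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical test helper; not part of the PR.
class SitemapParsing {

    static final String NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** Parses sitemap XML and returns the text of each <loc>; throws if the XML is not well formed. */
    static List<String> extractLocs(String sitemapXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document document = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
        NodeList locNodes = document.getElementsByTagNameNS(NS, "loc");
        List<String> locs = new ArrayList<>();
        for (int i = 0; i < locNodes.getLength(); i++) {
            locs.add(locNodes.item(i).getTextContent().trim());
        }
        return locs;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"" + NS + "\">"
                + "<url><loc>https://example.org/dataverse/example</loc><lastmod>2018-10-04</lastmod></url>"
                + "<url><loc>https://example.org/dataset.xhtml?persistentId=doi:10.5072/FK2/EXAMPLE</loc></url>"
                + "</urlset>";
        for (String loc : extractLocs(sample)) {
            System.out.println(loc);
        }
    }
}
```

Because the update runs asynchronously, a test that calls the update endpoint and immediately downloads `/sitemap.xml` may still see the previous file, so any assertion built on this helper would probably need to wait or retry.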
10 changes: 10 additions & 0 deletions src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java
@@ -1612,6 +1612,16 @@ static Response clearMetricCache() {
        return requestSpecification.delete("/api/admin/clearMetricsCache");
    }

    static Response sitemapUpdate() {
        return given()
                .post("/api/admin/sitemap");
    }

    static Response sitemapDownload() {
        return given()
                .get("/sitemap.xml");
    }

    @Test
    public void testGetFileIdFromSwordStatementWithNoFiles() {
        String swordStatementWithNoFiles = "<feed xmlns=\"http://www.w3.org/2005/Atom\">\n"