Tim L edited this page Jun 25, 2013 · 127 revisions

This GitHub wiki documents the technology behind the Linked Data aggregation site http://healthdata.tw.rpi.edu.

Jim, Tim, and Alvaro presented an overview of the healthdata project on Jan 23, 2013.

We won! (February 20, 2013)


The Department of Health and Human Services (HHS) is hosting a series of Developer Challenges to "establish learning communities that collaboratively evolve and mature the utility and usability of a broad range of health and human service data (1)."

The first challenge that HHS is hosting is Metadata, which requests "the application of existing voluntary consensus standards for metadata common to all open government data, and invites new designs for health domain specific metadata to classify datasets in our growing catalog, creating entities, attributes and relations that form the foundations for better discovery, integration and liquidity (2)."

We think this HHS developer challenge is a perfect place to apply the principles and technologies of the semantic web and linked data -- and in particular the tools that we've been developing at Tetherless World Constellation to do just this sort of thing.

This GitHub repository contains the code, configurations, and documentation that implement the data aggregation site http://healthdata.tw.rpi.edu. Others can contribute to this repository, or adapt it for their own purposes.


This section briefly describes the parts of twc-healthdata and the steps taken to put them together. The starting point for the HHS challenge is their CKAN instance at http://hub.healthdata.gov, which lists their 410 datasets (as of June 2013).

Discovery. First, we needed to programmatically access the dataset listing. Although CKAN provides RDF descriptions of individual datasets, it does not provide one for the full dataset listing (for details, see Accessing CKAN listings). So, we created a SADI service to provide an RDF description of all datasets within a given CKAN instance. The service reuses the Dublin Core, DCAT, and DataFAQs vocabularies to describe the datasets -- no new terms were needed. Because the public is not able to modify HHS's listings in http://hub.healthdata.gov, we installed our own instance of CKAN and mirrored their entries (for details, see Mirroring a Source CKAN Instance). This allows us to not only cleanse their original metadata, but also to augment it in new and useful ways. Of course, the first addition that we made to our listings was a W3C PROV link to the original dataset listing. For example, http://healthdata.tw.rpi.edu/hub/dataset/2008-basic-stand-alone-carrier has a prov_alternateOf annotation with the value http://hub.healthdata.gov/dataset/2008-basic-stand-alone-carrier.
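The mirror-to-original relationship can be pictured as a simple URL rewrite. The helper below is purely illustrative (it is not part of the repository; the real link is the prov_alternateOf annotation stored in RDF on each dataset page):

```python
def original_for(mirrored_url):
    """Map a dataset URI on the healthdata.tw.rpi.edu mirror back to the
    hub.healthdata.gov original recorded as its prov:alternateOf.
    Illustrative helper only -- the real annotation lives in the RDF."""
    return mirrored_url.replace(
        "http://healthdata.tw.rpi.edu/hub/dataset/",
        "http://hub.healthdata.gov/dataset/")

print(original_for(
    "http://healthdata.tw.rpi.edu/hub/dataset/2008-basic-stand-alone-carrier"))
```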

Access. Once we were able to programmatically access the dataset listings and dataset metadata, we could start retrieving the data file downloads. We reused the principles established by csv2rdf4lod-automation to organize the datasets according to "SDV" -- their Source organization, Dataset identifier, and the dataset Version. Because csv2rdf4lod-automation did not previously have the ability to retrieve its datasets from a CKAN listing, we created a new script to automate the creation of a csv2rdf4lod-automation conversion directory. An important design element for this script is that it does not actually retrieve the datasets; it simply creates DCAT descriptions of the CKAN datasets and situates them within the csv2rdf4lod-automation directory structure (for more details, see Mirroring a Source CKAN Instance). For example, the script created the source/hub-healthdata-gov/2008-basic-stand-alone-carrier/dcat.ttl file in this GitHub repository to describe how to access the dataset http://healthdata.tw.rpi.edu/hub/dataset/2008-basic-stand-alone-carrier. Again, the DCAT and DataFAQs vocabularies were reused for the dataset access description, in addition to reusing the vocabularies VoID, PROV, and csv2rdf4lod's conversion. This kind of description can be recognized by any DCAT system, including csv2rdf4lod-automation's cr-retrieve.sh, which is discussed further at Retrieving CKAN's Dataset Distribution Files.
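The "SDV" layout places each dataset's access description at a predictable path. The helper below is a hypothetical sketch of that convention (the layout itself is real; the function is not part of the repository):

```python
def dcat_path(source_id, dataset_id):
    """Path of a dataset's DCAT access description within the repository's
    SDV-style layout: source organization, then dataset identifier.
    Hypothetical helper illustrating the convention."""
    return "source/%s/%s/dcat.ttl" % (source_id, dataset_id)

print(dcat_path("hub-healthdata-gov", "2008-basic-stand-alone-carrier"))
```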

Integration. Once we were able to retrieve the datasets' files, we could transform the tabular structures into RDF and Linked Data using csv2rdf4lod-automation. By default, the RDF structures created from the tables are not as well structured as we would like, since they do not initially interconnect across datasets, i.e. they neither reuse common subject URIs nor existing vocabulary. "Enhancement parameters" are used to specify how good RDF structures should be created from the tabular structures (in fact, enhancement parameters are themselves encoded in RDF using the conversion vocabulary). Currently, human time and effort are needed to craft the enhancement parameters for each dataset, since it requires a significant amount of interpretation, background knowledge, and domain expertise. Fortunately, once the enhancement parameters are created, they can be reused by others to reproduce the same useful RDF structures -- or they can be reapplied to updated versions of the same dataset. Developing, sharing, applying, and evaluating enhancement parameters for each dataset so that they align with other datasets is a basis for -- as HHS asked -- establishing learning communities that collaboratively evolve and mature the utility and usability of a broad range of health and human service data. Contributing to the utility and usability of health data can be as simple as committing an enhancement parameters file for others to use (we provide a tutorial on how to do it here).
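To make the idea concrete, here is a deliberately simplified stand-in for what an enhancement parameter does: map raw column headers to shared predicate URIs so that two datasets mapping the same columns produce interconnectable triples. The real parameters are much richer and are encoded in RDF with the conversion vocabulary; the predicate URIs below are invented for illustration:

```python
# Invented, simplified stand-in for csv2rdf4lod enhancement parameters:
# a mapping from raw column header to a shared predicate URI.
ENHANCEMENT = {
    "State": "http://example.org/vocab#state",
    "Year":  "http://example.org/vocab#year",
}

def enhance(subject, row, params=ENHANCEMENT):
    """Emit (subject, predicate, value) triples for the mapped columns,
    so datasets that map the same columns reuse the same predicates."""
    return [(subject, params[col], val)
            for col, val in row.items() if col in params]

triples = enhance("http://example.org/row/1", {"State": "NY", "Misc": "x"})
print(triples)
```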

Discovery (again). Although enhancement parameters provide the means to exchange and reapply incremental improvements to our 350 aggregated datasets, we can facilitate their creation by generating naive recommendations for which datasets should interconnect. Human curators can use these recommendations to focus on which enhancement parameters should be developed. The recommendations are derived by analyzing the raw, naive conversions that csv2rdf4lod creates when it is not given enhancement parameters. Recommendations are shown on dataset pages to help users find potential connections as they navigate the site. The recommendations include datasets that share common raw objects, common raw property labels, common entities, common terminology, and common DROID file formats (candidate additions: location, dc:subject, suggested ontology) (for details, see The Benefits of Mass Raw Conversions). To help others discover the Linked Data produced at healthdata.tw.rpi.edu, we use the Linked Data convention to list our "bubble" on the datahub.io CKAN.
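One plausible way to rank such recommendations -- an assumption for illustration, not the site's exact algorithm -- is the overlap of raw property labels between two naive conversions:

```python
def interconnection_score(labels_a, labels_b):
    """Jaccard overlap of two datasets' raw property labels from their
    naive conversions. A higher score suggests the pair is a better
    candidate for interconnection via enhancement parameters.
    (Illustrative only; not the site's exact recommendation algorithm.)"""
    a, b = set(labels_a), set(labels_b)
    return len(a & b) / len(a | b) if a | b else 0.0

score = interconnection_score(["state", "year", "count"],
                              ["state", "recall date"])
print(score)
```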

Access (again). The twc-healthdata aggregation site maximizes the liquidity of its holdings by providing many forms of access to its interconnected RDF graph. First, every RDF conversion performed by csv2rdf4lod-automation is available on the web as a dump file, and the dump files are also described in RDF using the VoID vocabulary (specifically, void:dataDump). This allows users to "get all of the data" from a particular dataset, in cases where they wish to process it in bulk or rehost it in their own site. In addition, all datasets are described in the conventional location http://healthdata.tw.rpi.edu/void.ttl, so it is easy to find out what datasets exist on the site. Certain metadata are also aggregated and published, to facilitate their access (for details, see Using VoID for Accessibility and Monitoring Incremental Integration). This includes caches of the CKAN catalog, Aggregated VoID, and Aggregated DCAT.
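The shape of a void:dataDump statement is simple; the sketch below emits one in N-Triples (the site itself publishes Turtle, e.g. at http://healthdata.tw.rpi.edu/void.ttl, and the URIs passed in here are the caller's assumptions):

```python
def data_dump_triple(dataset_uri, dump_url):
    """Build one void:dataDump statement in N-Triples, linking a dataset
    to its downloadable dump file. Sketch of the triple's shape only."""
    return "<%s> <http://rdfs.org/ns/void#dataDump> <%s> ." % (
        dataset_uri, dump_url)

t = data_dump_triple("http://example.org/dataset/d1",
                     "http://example.org/dump/d1.ttl.gz")
print(t)
```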

The Virtuoso SPARQL Endpoint is loaded with the same dump files that are mentioned in the VoID metadata, so that others can query across datasets and build Linked Data applications without having to retrieve the dump files themselves. The graph names in the SPARQL endpoint correspond to the dataset URI, so that it is easier to determine what datasets are being queried (for example, the void:Dataset http://purl.org/twc/health/source/hub-healthdata-gov/dataset/food-recalls is loaded in GRAPH <http://purl.org/twc/health/source/hub-healthdata-gov/dataset/food-recalls> {}). In addition, the provenance of an endpoint's named graph load from a dump file is captured and available in the endpoint. LODSPeaKr serves all RDF in the SPARQL Endpoint as content-negotiated Linked Data, and also creates the human-readable website using a set of model/views that is available in this GitHub repository.
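Because each graph name equals the dataset URI, scoping a query to one dataset is just a GRAPH clause. A hypothetical helper that builds such a query string:

```python
def dataset_query(dataset_uri, limit=10):
    """Build a SPARQL query scoped to one dataset's named graph.
    The graph name equals the dataset URI by the site's convention.
    Hypothetical helper; any SPARQL client could send this query."""
    return ("SELECT ?s ?p ?o WHERE { GRAPH <%s> { ?s ?p ?o } } LIMIT %d"
            % (dataset_uri, limit))

q = dataset_query(
    "http://purl.org/twc/health/source/hub-healthdata-gov/dataset/food-recalls")
print(q)
```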

Technical Details. VM Installation Notes provides the technical details for how http://healthdata.tw.rpi.edu was deployed, and Automation describes how it keeps itself running and up to date as new datasets, recommendations, and enhancements appear.