timrdf edited this page Oct 1, 2012 · 13 revisions

This GitHub wiki documents the technology behind the Linked Data aggregation site http://healthdata.tw.rpi.edu.

This page covers the preferred modeling strategy for our health data project. Please read this and keep it handy if you would like to enhance any datasets. If you're wondering why we're going through this trouble, please read Modeling Motivations and Notes.

Introduction

The modeling here has a few specific goals. One is to reuse existing vocabularies as much as possible. This includes general vocabularies, such as PROV, Dublin Core, VoID, DCAT, and RDF Data Cube, as well as domain-specific ones like SIO, FMA, SNOMED, and OBO. This page details how to use these vocabularies, how to link to other datasets like DBpedia, how to interlink entities across datasets, and how to provide contextualization for entities.

Identity

Most of these datasets are about things that change over time. In fact, many of them apply the same methodology at many different points in time. This makes it possible to compare information about the same entity across time and space, in different roles, or as its (non-essential) types change. We therefore need to be able to express things like "Connecticut in 1990" or "Barack Obama as President" and link them to resources with a more general context ("Connecticut" or "Barack Obama", respectively) without asserting a mathematical equality between them, as owl:sameAs would.

To link resources that denote the same entity in different contexts, we require prov:specializationOf and prov:alternateOf. If one resource narrows the context of the other (for instance, "Connecticut in 1990" narrows "Connecticut"), use prov:specializationOf from the narrower resource to the broader one. If the contexts differ but neither subsumes the other (for instance, "Connecticut in 1990" and "Connecticut in 1991"), use prov:alternateOf. Reserve owl:sameAs for true mathematical identity. For instance, if you import a database but change the URI prefixes, the new resources can be considered owl:sameAs their counterparts in the original database, provided that no change in context has occurred.

  • Reserve owl:sameAs for mathematical identity only.
  • Use prov:specializationOf when two resources denote the same entity, but one resource denotes the entity in a narrower context that is fully subsumed by the broader context.
  • To link two resources that denote the same thing but without context subsumption, use prov:alternateOf.
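
As a rough illustration, the three relations could be written in Turtle as below. The ex: and mirror: namespaces and the resource names are hypothetical, chosen only to mirror the examples in the text; they are not URIs used by this project.

```turtle
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix ex:     <http://example.org/id/> .          # hypothetical namespace
@prefix mirror: <http://mirror.example.org/id/> .   # hypothetical re-prefixed import

# Narrower context, fully subsumed by the broader one.
ex:connecticut-1990 prov:specializationOf ex:connecticut .
ex:connecticut-1991 prov:specializationOf ex:connecticut .

# Same entity in two contexts, neither of which subsumes the other.
ex:connecticut-1990 prov:alternateOf ex:connecticut-1991 .

# Same resource republished under a new prefix, with no change of context.
mirror:connecticut owl:sameAs ex:connecticut .
```

Note that no owl:sameAs link appears between the 1990 and 1991 resources: a reasoner would otherwise merge every statement made about either year onto both.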

Handling footnotes and other annotations

This use is especially important with the hospital-compare data, where footnotes are attached to particular statements rather than to the entity those statements describe. When additional information is needed about a specific statement, as opposed to a property or an entity, we create a further specialization of the entity and set the measure dimension. This mirrors the technique discussed in http://www.w3.org/TR/vocab-data-cube/#dsd-mm-dim in the RDF Data Cube Vocabulary, except that we continue to maintain identity with a further constraint, since qb:measureType is a contextualizing dimension. This lets us add statements that apply only to a particular statement. For instance, we need to add a footnote to many assertions that have missing values:

:provider/010005 dcterms:identifier "010005" ;
    prov:generalizationOf [
        qb:measureValue hai:HAI_1_SIR ;
        hai:HAI_1_SIR "Not Available" ;
        health:hasAnnotation footnote:1 ;
    ] .

The template also repeats the assertion directly on the provider, but because the value "Not Available" does not parse as a number, that copy is skipped. An entry with a non-missing value looks like this:

:provider/010006 dcterms:identifier "010006" ;
    prov:generalizationOf [
        qb:measureValue hai:HAI_1_SIR ;
        hai:HAI_1_SIR "0.62"
    ] ;
    hai:HAI_1_SIR "0.62"^^xsd:decimal .

This is of course redundant, as there is no footnote, but it follows the same pattern: the provider generalizes a more contextualized resource, and that resource carries the additional information. The enhancement parameters, for reference, look like this:

:annotatedValue
   a conversion:ImplicitBundle;
   conversion:property_name prov:generalizationOf; # Can also be a URI, e.g. dcterms:title.
.

health:hasAnnotation owl:inverseOf ao:annotatesResource.

<http://purl.org/twc/health/source/hub-healthdata-gov/dataset/hospital-compare/version/2012-Jul-17/conversion/enhancement/1>
   conversion:conversion_process [
      conversion:enhance [
         ov:csvCol          2;
         ov:csvHeader       "msr_cd";
         conversion:range_template "[/sd]measures/hai/[.]";
         conversion:bundled_by :annotatedValue;
         conversion:equivalent_property qb:measureValue;
         conversion:range   rdfs:Resource;
         conversion:object_search [
            conversion:regex     ".*";
            conversion:predicate "[/sd]measures/hai/[.]";
            conversion:object    "[#3]";
         ];
      ];
      conversion:enhance [
         ov:csvCol          3;
         ov:csvHeader       "scr";
         conversion:equivalent_property "[/sd]measures/hai/[#2]";
         conversion:interpret [
            conversion:symbol        "";
            conversion:interpretation conversion:null; 
         ];
         conversion:range   xsd:decimal;
      ];
      conversion:enhance [
         ov:csvCol          4;
         conversion:bundled_by :annotatedValue;
         ov:csvHeader       "footnote";
         conversion:equivalent_property health:hasAnnotation;
         conversion:range_template "[/sd]footnote/[.]";
         conversion:range   rdfs:Resource;
         conversion:interpret [
            conversion:symbol        "";
            conversion:interpretation conversion:null; 
         ];
      ];
   ];
.

Linking to external resources

Linking to external resources using the above strategy is highly recommended, especially for things other than classes and properties. InstanceHub contains many useful identifiers. For the biomedical domain, Bio2RDF and http://identifiers.org contain imports from many bioinformatics databases, especially around molecular biology, diseases, and drugs. Please consult these databases if there's a chance to map the data to them. This is especially useful when identifiers or names in the data can be used to generate the URIs in those databases directly. Finally, many useful biomedical vocabularies are published to BioPortal.

Interlinking within twc-healthdata

This section notes URI schemes for existing entities. Please attempt to push back as much non-essential context as you can when designing the URIs. Follow Einstein's dictum, paraphrased as "Everything should be made as simple as possible, but no simpler."

Healthcare Providers

Medicare has a provider ID number that it uses in many datasets to identify particular healthcare providers. It is often referred to as "provider ID" or similar. These IDs apply to hospitals, clinics, doctors, and many others. Assuming you're looking at a provider ID column, the URI pattern is:

http://logd.tw.rpi.edu/id/medicare-gov/provider/[medicare provider id]
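
For example, substituting the provider ID 010005 that appears earlier on this page into the pattern would give the resource below. This is only a sketch of the pattern; that this exact URI appears in the published data is an assumption.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

# Provider URI built from the pattern above, using provider ID 010005.
<http://logd.tw.rpi.edu/id/medicare-gov/provider/010005>
    dcterms:identifier "010005" .
```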

States

If the state name is spelled out, to refer to the decontextualized state, use:

http://logd.tw.rpi.edu/id/us/state/[state name]
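
As a sketch, a dataset-specific resource for "Connecticut in 1990" could point at the decontextualized state like this. The ex: resource is hypothetical, and the exact spelling and capitalization of the state name in the URI is an assumption; check it against the InstanceHub data before relying on it.

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/id/> .   # hypothetical dataset namespace

# A contextualized resource linked to the decontextualized state.
ex:connecticut-1990
    prov:specializationOf <http://logd.tw.rpi.edu/id/us/state/Connecticut> .
```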

If it's a two-letter state code, use conversion:links_via with the InstanceHub states and territories file:

 conversion:links_via <http://logd.tw.rpi.edu/source/twc-rpi-edu/file/instance-hub-us-states-and-territories/version/2011-Apr-09/conversion/instance-hub-us-states-and-territories.csv.e1.ttl>;

Counties

If the county name is spelled out, to refer to the decontextualized county, use:

http://logd.tw.rpi.edu/id/us/state/[state name]/county/[county name]

Is-a Versus Has-a

Since we are dealing with graph databases, it is just as easy to pile up types for a resource as it is to pile up attributes. If the column you're looking to convert is a category of the entity represented in the row, find a class to map to (or create one) rather than making up a property and a resource to link to. For instance, the hospital-compare dataset has a column that says whether or not the hospital provides emergency services. This column maps to classes like this:

      conversion:enhance [
         conversion:equivalent_property rdf:type;
         conversion:range   rdfs:Resource;
         conversion:interpret [
            conversion:symbol "Yes";
            conversion:interpretation health:EmergencyServiceHospital;
         ];
         conversion:interpret [
            conversion:symbol "No";
            conversion:interpretation health:NonEmergencyServiceHospital;
         ];
      ];
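
With this enhancement, a row whose value is "Yes" would be expected to yield a type assertion along the following lines. The health: namespace IRI shown is a placeholder (the real one is not given on this page), the choice of provider is hypothetical, and the commented-out property is an invented example of the has-a style to avoid.

```turtle
@prefix health: <http://example.org/vocab/health/> .   # placeholder namespace IRI

# Is-a: the column value becomes a class membership.
<http://logd.tw.rpi.edu/id/medicare-gov/provider/010005>
    a health:EmergencyServiceHospital .

# Has-a (discouraged here): a made-up property pointing at a made-up value.
# <http://logd.tw.rpi.edu/id/medicare-gov/provider/010005>
#     health:providesEmergencyServices "Yes" .
```

Typing the resource lets queries select emergency-service hospitals with a plain rdf:type pattern instead of matching on a literal.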

Vocabularies to Reuse

This list will be expanded. Please reuse these vocabularies if possible.

General-purpose Vocabularies

  • PROV-O: provenance, plus relations for identity, change of state over time, time, and location.
  • VoID: descriptions of RDF datasets.
  • DCAT: descriptions of non-RDF datasets.
  • vCard: contact information for people and organizations.
  • Dublin Core Terms: descriptions, titles, authors, partOf/hasPart.

Domain-specific Vocabularies

Classes and properties should be taken, if possible, from pre-existing biomedical ontologies. Here are some existing mappings that should be used if possible:

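@prefix obo:  <http://purl.obolibrary.org/obo/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

obo:PATO_0000384 a owl:Class ;
    rdfs:label "male" .

obo:PATO_0000383 a owl:Class ;
    rdfs:label "female" .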

Building Classes and Properties

Healthdata vs dataset classes and properties