-
Notifications
You must be signed in to change notification settings - Fork 2
Home
The Data Cube vocabulary (QB) is a proposal by the W3C Government Linked Data Working Group that allows to publish statistical data on the web using the RDF (Resource Description Framework) standard. The QB4OLAP vocabulary is an extension to the Data Cube vocabulary that allows to represent OLAP cubes in RDF, and also allows to implement OLAP operators (such as Roll-up, Slice, and Dice) as SPARQL queries directly on this RDF representation.
QB4OLAP: a Vocabulary for Business Intelligence over the Semantic Web
Last update: 2012-07-08 Editors: Lorena Etcheverry (Instituto de Computación, Universidad de la República, Uruguay) Alejandro A. Vaisman (Department of Computer & Decision Engineering (CoDE), Université Libre de Bruxelles, Belgium)
Abstract
The Data Cube vocabulary (QB) is a proposal by the W3C Government Linked Data Working Group that allows to publish statistical data on the web using the RDF (Resource Description Framework) standard. The QB4OLAP vocabulary is an extension to the Data Cube vocabulary that allows to represent OLAP cubes in RDF, and also allows to implement OLAP operators (such as Roll-up, Slice, and Dice) as SPARQL queries directly on this RDF representation. Status of this document
This is an editor's draft without any formal standing. It is not endorsed by any organization. Anything in this document is still subject to change at this point. The editors seek feedback on the document. Please send any comments to the project's Google Group.
Table of Contents
1. Introduction
1.1 Why the QB Vocabulary is not enough to publish OLAP data?
1.2 The QB4OLAP Vocabulary
1.3 RDF and Linked Data
1.4 Audience and scope
2. OLAP Data Cubes
2.1 OLAP operators
3. An example
4. Outline of the vocabulary
4.1 Vocabulary index
5. Creating data structure definitions
5.1 Levels and rollup hierarchies
5.2 Measures and aggregate functions
5.3 Examples
5.4 Data structure definition
6. Dimension instances and datasets
6.1 Level members and rollup relations
6.2 Observations
7. OLAP operators over QB4OLAP
A. Namespaces used in this document
B. Vocabulary reference
B.1 Levels and Level Members
B.2 Aggregate Functions
C. References
- Introduction
Business intelligence (BI) comprises a collection of techniques used for extracting and analyzing business data, to support decision-making. As part of the BI machinery, On-Line Analytical Processing (OLAP) tools and algorithms allow querying large multidimensional databases called data warehouses (DW).
Since the mid 90's, DW and BI applications have been built to consolidate enterprise business data, allowing taking timely and informed decisions based on these data. The availability of enormous amounts of data from different domains is calling for a shift in the way DW and BI practices are being carried out. It is becoming clear that the traditional approach, where day-to-day business data produced in an organization is collected, cleansed and consolidated in a DW for data analysis, needs to be revised. We believe that the Semantic Web (SW) will most likely be an scenario where OLAP-style data analysis will be crucial in the near future.
Therefore, SW technologies will be needed to model, manipulate, and share multidimensional data. To achieve this, the definition of a precise vocabulary allowing representing adequately OLAP data on the SW is required. Over these vocabulary, multidimensional models and OLAP operators can be defined. The (W3C candidate) Data Cube vocabulary follows a model initially devised for analyzing statistical data, which does not cover all the needs of a vocabulary oriented to support BI analysis on the SW.
The present work is oriented to cover such need. Concretely, we propose the QB4OLAP vocabulary which adds to QB the capability of representing dimension levels, level members, rollup relations between levels and level members, and associating aggregate functions to measures. The QB4OLAP vocabulary is compatible with QB, in the sense that QB4OLAP cube schemas can be built on top of data cube instances (observations) already published using QB. Existing applications, or applications that do not require OLAP style-analysis, can still use the QB schema and instances. Therefore, the cost of adding OLAP capabilities to existing datasets is the cost of building the new schema, in other words, the cost of building the analysis dimensions. Conversely, cubes built over QB4OLAP from scratch can be transformed into QB cubes in order to be exploited by existing applications supporting the latter. As in the case above, the cube instances remain unchanged. 1.1 Why the QB Vocabulary is not enough to publish OLAP data?
Although QB can define the structure of a cube (via the Data Structure Definition), it does not provide a mechanism to represent an OLAP dimension structure (i.e., the dimension levels and the relationships between levels). However, QB allows representing hierarchical relationships between level members in the dimension instances, using the SKOS vocabulary.
Also, the QB vocabulary does not provide direct support for OLAP operations. In spite of this, OLAP operations can be defined over a structure based on QB, although in a limited way. For example, Roll-Up is not supported since dimension levels are not modeled, neither are the aggregate functions for each measure (the latter also prevents support for the Slice operator). The same issues apply to Drill-down. Finally, Dice is partially supported by QB, given that the FO formula σ can only involve cube measures (again, because of the lack of support of dimension levels) It is worth noting that Slices, as defined in QB, represent subsets of observations. Moreover, they are not defined as operators over an existing cube (as in OLAP), but as new structures and new instances (observations) that may be considered as the application of constraints over an existing dataset. 1.2 The QB4OLAP Vocabulary
The QB4OLAP vocabulary proposes to extend QB in order to support the following concepts, defined in classic MD models for OLAP and not modeled in QB:
Dimension structure: the structure of a dimension is defined in terms of levels, which are hierarchically organized through rollup relations.
Dimension instances: level instances are called level members, and there is a relation between level members from different levels. In QB4OLAP the relationship between level members (from most specific to more general concepts) is modeled using skos:broader property
Aggregate functions: aggregate functions are used to compute measure aggregate values when performing OLAP operations (e.g: Roll-Up)
1.3 RDF and Linked Data 1.4 Audience and scope 2. OLAP Data Cubes
In OLAP, data are organized as hypercubes whose axes are called dimensions. Each point in this multidimensional space is mapped into one or more spaces of measures, representing facts that are analyzed along the cube’s dimensions. Dimensions are structured in hierarchies that allow analysis at different aggregation levels. The actual values in a dimension level are called members, which can also have properties or attributes. Members in a dimension level must have a corresponding member in the upper level in the hierarchy, and this correspondence is defined through so-called rollup functions. 2.1 OLAP operators
A well-known set of operations is defined over cubes. We present some of these operations next. They are based on the recently proposed Cube Algebra [CA].
Roll-Up summarizes data at a higher level in a dimension hierarchy. It receives a cube C , a dimension D ∈ C , a dimension level lu ∈ D such that ll →∗ lu , and a set of aggregate functions f , Roll-Up(C, D, lu, f) returns a new cube C′ where measures are aggregated along D up to the level lu. Analogously, Drill-Down disaggregates previously summarized data, and can be considered the inverse of Roll-Up. Note that this requires to store the aggregation path.
Slice receives a cube C,and a dimension D ∈ C, and removes D from the cube. Measure values in the cube are aggregated along dimension D up to level 'All' before removing the dimension, using the aggregate functions associated with each measure.
Dice receives a cube C , and a first order formula σ over measures and levels in C, and returns a new cube C′ such that the elements in C′ are the ones that satisfy σ. 3. An example
In order to illustrate the use of the QB4OLAP vocabulary we present a small example, extracted from StatsWales report number 028727 and Household projections by district, England, 1991-2033, which contains household projections (in thousands) by district in the UK. We want to analyze the household projection (a measure) along the dimensions geographic location and date. Let us call these dimensions geoDim and dateDim, respectively. Dimension geoDim is organized in a hierarchy of levels: Unitary Authority (UA), Government Office Region (GOR) and Country, while dimension dateDim has only one level: year. We can build an OLAP cube householdCS with who dimensions (geoDim and dateDim) and one measure (household). The following table shows a small instance of the cube householdCS. Country GOR UA 2006 2007 2008 England South East Milton Keynes 92 94 96 Reading 58 58 60 South West Bournemouth 71 72 73 Wales Wales Cardiff 132.1 134.2 136.7 Newport 58.7 59 69.6
OLAP operators can be applied to these data. For example, we would like to obtain the total household for each Government Office Region in each year. This corresponds to the application of the Roll-Up operator on the geoDim dimension, up to the GOR level. The aggregate function to compute the values is associated with each measure in the schema definition, in this case the sum function, and it is used to compute the household projection for each GOR. The table below shows the results of applying the Roll-Up operator. Year Country GOR 2006 2007 2008 England South East 150 152 156 South West 71 72 73 Wales Wales 190.8 193.2 196.3 4. Outline of the vocabulary
The following picture shows classes, properties and instances in the QB4OLAP vocabulary (grey background). Classes and properties from QB are also included (white background). Vocabulary outline 4.1. Vocabulary index Classes and properties from QB
Classes: qb:Attachable qb:AttributeProperty qb:CodedProperty qb:ComponentProperty qb:ComponentSet qb:ComponentSpecification qb:DataSet qb:DataStructureDefinition qb:DimensionProperty qb:MeasureProperty qb:Observation qb:Slice qb:SliceKey
Properties: qb:attribute qb:codeList qb:component qb:componentAttachment qb:componentProperty qb:componentRequired qb:concept qb:dataSet qb:dimension qb:measure qb:measureDimension qb:measureType qb:observation qb:order qb:slice qb:sliceKey qb:sliceStructure qb:structure qb:subSlice New classes, properties and instances
Classes: qb4o:LevelProperty qb4o:LevelMember qb4o:AggregateFunction
Properties: qb4o:level qb4o:inLevel qb4o:inDimension qb4o:parentLevel qb4o:hasAggregateFunction
Instances: qb4o:sum qb4o:avg qb4o:count qb4o:min qb4o:max 5. Creating data structure definitions
As in QB, QB4OLAP uses the qb:DataStructureDefinition class to define the structure of data sets. While in QB the structure of a data set is defined in terms of dimensions, measures and attributes, in QB4OLAP the structure is defined using levels, measures and attributes. The declaration of the structure of the data set has several benefits, some of them where already presented in [QB]:
to assist in the integration process of different data sets
to provide a self-contained description of the contents of the data set, which allows for example to implement client applications and operators
to enable the verification that the data set instances match the expected structure
5.1. Levels and rollup hierarchies
The QB4OLAP vocabulary represents the levels, attributes and measures as RDF properties. To represent levels, it adds qb4o:LevelProperty as a sub-class to the QB abstract qb:ComponentProperty class. As presented in Section 2, levels are hierarchically organized within a dimension. For each level, its parent in the rollup hierarchy is declared using the qb4o:parentLevel property. Levels are linked to its corresponding dimension via the qb4o:inDimension property.
The semantics of data structure definitions may be enriched using the skos:closeMatch property to link QB4OLAP concepts with concepts already defined in external vocabularies (e.g: in our running example the year level in the geoDim dimension can be related to the year concept in DBpedia). 5.2. Measures and aggregate functions
Aggregate functions are used to compute the aggregate value of measures, for example when performing a Roll-Up. These functions can be expressed in QB4OLAP using instances of the qb4o:AggregateFunction class, and each measure can be associated with an aggregate function via the qb4o:hasAggregateFunction property. This is also considered part of the data structure definition.
Five aggregate functions are defined in QB4OLAP:
SUM: returns the numeric value obtained by adding a set of numeric values.
AVG: returns the arithmetic mean of a set of numeric values.
COUNT: returns the number of elements in a set of elements (the cardinality of the set).
MIN: returns the minimum element in a set of elements, where a partial order is defined.
MAX: returns the maximum element in a set of elements, where a partial order is defined.
5.3. Example
In our running example there are two dimensions to represent: geographic location (geoDim) and time (dateDim). There is a single measure which corresponds to household projection, measured in thousands of households, and to aggregate measure values the sum function has to be applied. We define the following concepts:
Geographic Location. We define the geoDim dimension and the three levels in its hierarchy, linking them to its related concepts in the Ordnance Survey Administrative Geography Ontology.
eg:geoDim a qb:DimensionProperty.
eg:unitaryAuthority a qb4o:LevelProperty; qb4o:inDimension eg:geoDim; qb4o:parentLevel eg:governmentOfficeRegion; skos:closeMatch adgeo:UnitaryAuthority.
eg:governmentOfficeRegion a qb4o:LevelProperty; qb4o:inDimension eg:geoDim; qb4o:parentLevel eg:country; skos:closeMatch adgeo:GovernmentOfficeRegion.
eg:country a qb4o:LevelProperty; qb4o:inDimension eg:geoDim; skos:closeMatch adgeo:CountryInUk.
Time. We define the timeDim dimension and the only level in its hierarchy, linking them to its related concepts in the DBPedia.
eg:dateDim a qb4o:DimensionProperty.
eg:year a qb4o:LevelProperty; qb4o:inDimension eg:dateDim; skos:closeMatch db:Year.
Measure. This measure property will link each observation to the measure value.
eg:household a qb:MeasureProperty; skos:closeMatch db:Household.
5.4 Data structure definition
We need to declare a qb:DataStructureDefinition resource to define the structure of the data set, which in turn will reference a set of qb:ComponentSpecification resources. The qb:DataStuctureDefinition will be reusable across other data sets with the same structure. Component specifications allow, not only to state that a certain qb:ComponentProperty property is related to a data set structure, but also to specify properties of this relationship. For example, the qb4o:hasAggregateFunction property allows to specify which aggregation function corresponds to a certain measure in a data set structure.
The data structure definition of our running example can be declared by:
eg:householdCS a qb:DataStructureDefinition; qb:component [qb4o:level eg:unitaryAuthority]; qb:component [qb4o:level eg:year]; qb:component [qb:measure eg:household; qb4o:hasAggregateFunction qb4o:sum].
Notice that, instead of stating that a data set structure is defined by a set of dimensions (as in QB), in QB4OLAP a set of levels is declared. These levels define the granularity level of all the observations in the data set. 6. Dimension instances and data sets. 6.1 Level members and rollup relations
As already mentioned the QB4OLAP vocabulary structures dimensions in terms of dimension levels. The instances that populate dimension levels are called level members, and they are represented as instances of type qb4o:LevelMember. Rollup relations relate level members from one level to its correspondent level member in upper levels in the hierarchy. These relations are modeled using the skos:broader property. The example below shows three level members: the "Borough of Reading", which is a unitary authority and is contained in the "South East" government office region of "England". In this case, the resources that describe each of these level members belong to external vocabularies (from Data.gov.uk), but could be defined locally. Level members are linked to its corresponding level in the dimension structure of the data set using the qb4o:inLevel property.
ns0:00mc qb4o:inLevel eg:unitaryAuthority; rdfs:label "The Borough of Reading@en"; skos:broader ns1:J.
ns1:J qb4o:inLevel eg:governmentOfficeRegion; rdfs:label "South East@en" ; skos:broader ns2:921.
ns2:921 qb4o:inLevel eg:country; rdfs:label "England@en".
6.2 Observations
A DataSet is a collection of observations or facts that correspond to a given data structure definition. Each observation represents a point in the multidimensional space formed by dimensions, and for each of these points a set of measure values is recorded. For each data set, a resource is created and typed as qb:DataSet, and it is linked to the corresponding data structure definition using the qb:structure property. Each observation is represented as an instance of type qb:Observation and is linked to the containing data set using the qb:dataSet property.
The example below shows the definition of a data set corresponding to our running example and an observation, which corresponds to the Borough of Reading in 2007 from the table presented in Section 3.
eg:dataset_hh a qb:DataSet; rdfs:label "Household in UK"@en; qb:structure eg:householdCS.
eg:o1 a qb:Observation; qb:dataSet eg:dataset_hh ; eg:unitaryAuthority ns0:00mc ; eg:year db:2007; eg:household 58.
- OLAP operators over QB4OLAP
OLAP operators can be implemented over cubes specified using QB4OLAP vocabulary as SPARQL 1.1 queries. Moreover, provided that the vocabulary models all the structural metadata of the cube (namely: levels, level member, their relationship, rollup relations between level members, hierarchical relationships between levels, aggregate functions and their relationship with measures), these SPARQL queries can be automatically generated.
Since OLAP cubes are represented as RDF graphs, and the result of any OLAP operator must be also a cube, OLAP operators have to be implemented as SPARQL CONSTRUCT queries. Below we present, as an example, the SPARQL query that implements the Roll-Up operator which result was presented in Section 3.
CONSTRUCT { ?id a qb:Observation . ?id qb:dataSet eg:dataset-hh1 . ?id eg:year ?year . ?id eg:governmentOfficeRegion ?gor . ?id eg:household ?sumHhold } WHERE{{ SELECT ?gor ?year (SUM(?hhold) AS ?sumHhold) (iri(fn:concat("http://example.org/hhold#", "hholdGOR","", fn:substring-after(?gor,"http://example.org/hhold#"),"", fn:substring-after(?year,"http://example.org/hhold#"))) AS ?id) WHERE { ?o qb:dataSet eg:dataset-hh . ?o eg:year ?year . ?o eg:household ?hhold . ?o eg:unitaryAuthority ?ua . ?ua skos:broader ?gor . ?gor qb4o:inLevel eg:governmentOfficeRegion } GROUP BY ?gor ?year}}
A. Namespaces used in this document PREFIX URI VOCABULARY rdf http://www.w3.org/1999/02/22-rdf-syntax-ns# RDF core rdfs http://www.w3.org/2000/01/rdf-schema# RDF Schema skos http://www.w3.org/2004/02/skos/core# Simple Knowledge Organization System qb http://purl.org/linked-data/cube# The Data Cube vocabulary db http://dpedia.org/resource/ DBpedia adgeo http://statistics.data.gov.uk/def/administrative-geography/ Administrative Geography ontology from Data.gov.uk ns0 http://statistics.data.gov.uk/doc/local-authority-district/ An ontology describing all local authority districts in the UK. ns1 http://statistics.data.gov.uk/id/government-office-region/ An ontology describing all government office regions in the UK. ns2 http://statistics.data.gov.uk/id/country An ontology describing all countries in the UK. qbp http://purl.org/olap# The QBPlus vocabulary B. Vocabulary reference B.1 Levels and Level Members
Class: qb4o:LevelProperty Sub class of: qb:ComponentProperty The class of components which represent the levels of a dimension. Class: qb4o:LevelMember Sub class of: skos:Concept The class of components which represent the level members. Property: qb4o:level ( qb4o:LevelProperty -> qb:ComponentProperty ) An alternative to qb:ComponentProperty which makes explicit that the component is a level. Property: qb4o:inDimension ( qb4o:LevelProperty -> qb:DimensionProperty ) Indicates to which dimension the level belongs. Property: qb4o:inLevel ( qb4o:LevelMember -> qb4o:LevelProperty ) Indicates to which level the level member belongs. Property: qb4o:parentLevel ( qb4o:LevelProperty -> qb4o:LevelProperty ) Indicates which is the parent level of each level.
B.2 Aggregate Functions
Class: qb4o:AggregateFunction Sub class of: rdfs:Class The class of components which represent aggregate functions that are applied to compute measure aggregate values. Property: qb4o:hasAggregateFunction ( qb:ComponentSpecification -> qb4o:AggregateFunction ) Indicates which aggregate function has to be applied to obtain measure aggregate values, for a certain measure in a cube. Instance: qb4o:sum Instance of: qb4o:AggregateFunction An aggregate function that returns the numeric value obtained by adding a set of numeric values.. Instance: qb4o:avg Instance of: qb4o:AggregateFunction An aggregate function that returns the number of elements in a set of elements (the cardinality of the set). Instance: qb4o:count Instance of: qb4o:AggregateFunction An aggregate function that returns the arithmetic mean of a set of numeric values. Instance: qb4o:min Instance of: qb4o:AggregateFunction An aggregate function that returns the minimum element in a set of elements, where a partial order is defined. Instance: qb4o:max Instance of: qb4o:AggregateFunction An aggregate function that returns the maximum element in a set of elements, where a partial order is defined.
C. References [CA] C. Ciferri, R. Ciferri, L. I. Gómez, M. Schneider, A.A. Vaisman, and E. Zimányi. Cube Algebra: A Generic User-Centric Model and Query Language for OLAP Cubes. IIJDWM (to appear), 2012. [HHP] Household projections by district, England, 1991-2033,URL: http://opendatacommunities.org/id/dataset/housing/household-projections [LD] Linked Data, URL: http://linkeddata.org/ [OS-GEO] Ordnance Survey Administrative Geography Ontology v1, URL: http://www.ordnancesurvey.co.uk/ontology/admingeo.owl [QB] Richard Cyganiak, Dave Reynolds. The RDF Data Cube Vocabulary. 05 April 2012. W3C Candidate. URL: http://www.w3.org/TR/2012/WD-vocab-data-cube-20120405/ [RDF-PRIMER] Frank Manola; Eric Miller. RDF Primer. 10 February 2004. W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ [SKOS-PRIMER] Antoine Isaac; Ed Summers. SKOS Simple Knowledge Organization System Primer. 18 August 2009. W3C Note. URL: http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/ [TURTLE-TR] Eric Prud'hommeaux, Gavin Carothers. Turtle:Terse RDF Triple Language. 09 August 2011. W3C Working Draft. URL: http://www.w3.org/TR/2011/WD-turtle-20110809/