Discretization of numeric literals in RDF via SPARQL

discretize-sparql

discretize-sparql is a command-line tool to discretize numeric values in RDF datasets via SPARQL Update operations. Discretization (also known as binning) converts continuous numeric values into discrete intervals. This is typically useful for data mining tools that operate on categorical data. For example, discretization is required for association rule mining with EasyMiner, outlier detection with FPM, or tensor factorization with RESCAL.

This tool wraps the EasyMiner-Discretization library.

Usage

Use a released executable or compile using Leiningen and lein-binplus:

git clone https://github.com/jindrichmynarz/discretize-sparql.git
cd discretize-sparql
lein bin

Run the created executable with --help to list the command-line parameters:

target/discretize_sparql --help

The tool supports the following parameters:

  • -e, --endpoint: URL of the SPARQL endpoint to retrieve data from. The endpoint must allow SPARQL Update operations.
  • -a, --auth: Endpoint's authorization written as username:password. The tool currently supports HTTP Digest authentication, which is used by Virtuoso.
  • -u, --update: Path to a file containing the SPARQL Update operation. See more about this below.
  • -m, --method: Method of discretization to use. The supported methods are equidistance, equifrequency, and equisize. Equidistant discretization creates intervals of the same size. Equifrequent discretization creates intervals with approximately the same number of members. Equisize discretization creates intervals based on minimum support.
  • -b, --bins: Number of bins (intervals) to generate. Required for equidistance and equifrequency methods.
  • -s, --min-support: Minimum support required for a generated interval. Required for equisize method.
  • -g, --graph: IRI or URN of the named graph to which intervals will be loaded.
  • -p, --page-size (default = 10000): Number of results to fetch in one request.
  • --parallel (default = false): Execute SPARQL queries in parallel.
  • --strict (default = false): Fail if not all discretized values are numeric.
  • -h, --help (default = false): Display help information.
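
To make the difference between the equidistance and equifrequency methods concrete, here is a minimal Python sketch. It is not the EasyMiner-Discretization implementation; the function names are hypothetical, and intervals are assumed to be left-closed and right-open. The equisize method is left out because its exact procedure is not described here.

```python
from bisect import bisect_right


def equidistant_cut_points(values, bins):
    """Inner cut points of `bins` intervals that all have the same width."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins
    return [lowest + i * width for i in range(1, bins)]


def equifrequent_cut_points(values, bins):
    """Inner cut points chosen so each interval holds roughly len(values) / bins members."""
    ordered = sorted(values)
    count = len(ordered)
    return [ordered[(i * count) // bins] for i in range(1, bins)]


def assign_bin(value, cut_points):
    """Index of the left-closed, right-open interval that contains `value`."""
    return bisect_right(cut_points, value)


# Skewed sample data: one outlier dominates the range.
prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
equidistant = equidistant_cut_points(prices, 2)
equifrequent = equifrequent_cut_points(prices, 2)
```

On this skewed sample, the equidistant split places nine of the ten values in the first interval, while the equifrequent split places five values in each. Repeated values can produce duplicate cut points in the equifrequent sketch, which a real implementation would have to handle.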

The most important parameter is --update, which provides a path to a SPARQL Update operation that defines the input and output data for a discretization task. This operation employs wishful thinking: you write it as if the intervals already existed. It must use the variable ?value, bound in the WHERE clause to the numeric values to discretize, and the variable ?interval, which stands for the intervals generated by the tool. You're free to do what you want with these variables in the operation. Perhaps you want to delete the ?value and insert ?interval in its place. Or you may insert ?interval alongside the original values. Let's have a look at an example of such an operation:

PREFIX pc:     <http://purl.org/procurement/public-contracts#>
PREFIX schema: <http://schema.org/>

WITH <http://linked.opendata.cz/resource/dataset/isvz.cz>
DELETE {
  ?resource schema:price ?value .
}
INSERT {
  ?resource schema:price ?interval .
}
WHERE {
  [] pc:estimatedPrice ?resource .
  ?resource schema:priceCurrency "CZK" ;
    schema:price ?value .
}

The WHERE clause in this operation selects estimated prices (pc:estimatedPrice) in Czech crowns (schema:priceCurrency "CZK"). The INSERT clause inserts the generated ?interval, while the DELETE clause deletes the original ?value.

Under the hood, this operation is rewritten to SELECT queries that retrieve the values to discretize and a SPARQL Update operation that converts the intervals.
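
The retrieval half of this rewriting can be pictured with a short Python sketch. It assumes the values are fetched page by page using LIMIT and OFFSET, with the --page-size parameter as the page length, and that the graph given by WITH becomes the FROM graph. The function name and the exact shape of the generated query are hypothetical, not taken from the tool's source.

```python
def paged_select_queries(where_clause, graph, page_size=10000, pages=2):
    """Yield SELECT queries that page through the ?value bindings."""
    for page in range(pages):
        yield (
            "SELECT ?value\n"
            f"FROM <{graph}>\n"
            f"WHERE {{\n{where_clause}\n}}\n"
            "ORDER BY ?value\n"  # a stable order keeps the pages disjoint
            f"LIMIT {page_size}\n"
            f"OFFSET {page * page_size}"
        )


# WHERE clause of the example operation, with prefixes expanded by hand.
where_clause = (
    "  [] <http://purl.org/procurement/public-contracts#estimatedPrice> ?resource .\n"
    '  ?resource <http://schema.org/priceCurrency> "CZK" ;\n'
    "    <http://schema.org/price> ?value ."
)
queries = list(
    paged_select_queries(
        where_clause, "http://linked.opendata.cz/resource/dataset/isvz.cz"
    )
)
```

In practice the number of pages is not known in advance, so a client would keep requesting pages until one comes back with fewer than page_size results.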

The generated intervals are represented as instances of schema:QuantitativeValue. The bounds of the intervals are described using schema:minValue for the lower bound and schema:maxValue for the upper bound. Classes from the SemanticScience Integrated Ontology are used to determine whether the bounds are open or closed. The intervals are identified with UUID-based URNs. Here's an example:

@prefix schema: <http://schema.org/> .
@prefix sio:    <http://semanticscience.org/resource/SIO_> .

<urn:uuid:4E98F3EE-2861-4A4B-A39C-487A7018165E> a schema:QuantitativeValue,
                                                  sio:001254, # Left-closed interval
                                                  sio:001252 ; # Right-open interval
  schema:minValue 0 ;
  schema:maxValue 1000 .
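
As an illustration of this representation, the following Python sketch serializes one interval in the same shape as the example above. It covers only the left-closed, right-open case, because those are the only two SIO classes shown here. The function name is hypothetical and the output is plain string formatting, not the tool's own serializer.

```python
import uuid

PREFIXES = (
    "@prefix schema: <http://schema.org/> .\n"
    "@prefix sio:    <http://semanticscience.org/resource/SIO_> .\n"
)


def interval_to_turtle(lower, upper, identifier=None):
    """Serialize the left-closed, right-open interval [lower, upper) as Turtle."""
    identifier = identifier or uuid.uuid4()  # random UUID for the URN
    return (
        f"<urn:uuid:{identifier}> a schema:QuantitativeValue,\n"
        "    sio:001254, # Left-closed interval\n"
        "    sio:001252 ; # Right-open interval\n"
        f"  schema:minValue {lower} ;\n"
        f"  schema:maxValue {upper} .\n"
    )


turtle = PREFIXES + "\n" + interval_to_turtle(0, 1000)
```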

The intervals are loaded into the named graph provided via the --graph parameter. If this parameter is missing, the tool attempts to guess a named graph to load the intervals into. It uses the graph specified by WITH, USING, or in the INSERT clause. If no graph is found, the tool asks you to provide it explicitly via --graph.
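
The fallback order can be sketched in Python as follows. This is a rough approximation based on regular expressions, with hypothetical names. It assumes graphs are written as full IRIs in angle brackets, and it ignores prefixed names and comments. The tool itself presumably inspects the parsed operation instead.

```python
import re

# Checked in order: WITH, then USING, then GRAPH inside the INSERT clause.
GRAPH_PATTERNS = [
    re.compile(r"\bWITH\s+<([^>]+)>", re.IGNORECASE),
    re.compile(r"\bUSING\s+(?:NAMED\s+)?<([^>]+)>", re.IGNORECASE),
    re.compile(r"\bINSERT\s*\{\s*GRAPH\s+<([^>]+)>", re.IGNORECASE),
]


def guess_graph(update):
    """Return the first graph IRI found in the operation, or None if there is none."""
    for pattern in GRAPH_PATTERNS:
        match = pattern.search(update)
        if match:
            return match.group(1)
    return None  # the caller should then require an explicit --graph


example_update = """
PREFIX schema: <http://schema.org/>

WITH <http://linked.opendata.cz/resource/dataset/isvz.cz>
DELETE { ?resource schema:price ?value . }
INSERT { ?resource schema:price ?interval . }
WHERE  { ?resource schema:price ?value . }
"""
guessed = guess_graph(example_update)
```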

Caveats

Virtuoso has an issue with keeping the precision of xsd:decimal. As a result of the precision loss, some decimal numbers may end up not being discretized. In order to avoid this issue, the tool rounds the bounds of intervals to the maximum decimal precision supported by Virtuoso in xsd:decimal.
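
The rounding step might look like the Python sketch below. The number of fraction digits is a hypothetical placeholder, not Virtuoso's actual limit, and the rounding mode is also an assumption.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Hypothetical placeholder; substitute the precision your endpoint supports.
MAX_FRACTION_DIGITS = 10


def round_bound(bound, fraction_digits=MAX_FRACTION_DIGITS):
    """Round an interval bound to a fixed number of fraction digits."""
    quantum = Decimal(1).scaleb(-fraction_digits)  # e.g. 1E-10
    return Decimal(str(bound)).quantize(quantum, rounding=ROUND_HALF_EVEN)


rounded = round_bound("0.123456789012345")
```

Passing the bound as a string, or converting it with str, avoids importing binary floating-point error into the Decimal value before it is rounded.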

Acknowledgements

Development of this tool was supported by the H2020 project no. 645833 (OpenBudgets.eu).

License

Copyright © 2017 Jindřich Mynarz

Distributed under the Eclipse Public License, version 1.0.
