
Helio Materialiser for users


Helio materialiser is able to generate (synchronously or asynchronously) an RDF dataset containing data translated from a set of heterogeneous data sources, relying on its own mappings or on existing mappings like RML or Wot-Mappings. The Helio materialiser is responsible for integrating a set of heterogeneous sources and translating their data into RDF; the Helio publisher is then responsible for publishing the data as Linked Data. The materialiser and the publisher require two main inputs in order to function properly: a set of mappings (mandatory) and a configuration file (optional). Additionally, both components can be fed with plugins that extend the regular functionality of Helio, for instance, allowing Helio to retrieve data from new sources (like blockchains, or web APIs that produce enhanced data).

Download the latest Helio Materialiser release; the jar supports any of the following arguments:

  • --mappings= (mandatory): specifies the directory where the mapping files are located (bear in mind that Helio will put together all the mapping files, creating a single one in memory). The mapping files must be expressed in any supported format; additionally, different supported formats can be used at the same time to feed Helio. Currently, three mapping languages are supported: the Helio mappings, RML, and Wot-Mappings.
  • --write= (optional): specifies a file in which the generated RDF will be written; the output format will be Turtle.
  • --close (optional): tells Helio to shut down the process after the data is generated, stopping the asynchronous Data Sources and background threads.
  • --config= (optional): specifies a file in which advanced configuration parameters can be found.
  • --clear (optional): flushes the Helio cache before generating the new RDF, removing any RDF generated previously.

In order to run Helio, the following command can be run:

java -jar materialiser-X.X.X.jar --mappings=./helio-mappings --write=output.ttl --close

Notice that the X.X.X must be replaced with the correct version of Helio, which is the one in the name of the downloaded jar. Notice also that the Helio command does not die after its execution; the reason is the asynchronous Data Sources, which are updated when required, and therefore Helio keeps a set of processes alive. In order to tell Helio to shut down after the generation of the RDF, the argument --close must be used.

In the next section the Helio Mappings are explained, and in the one after, the advanced configuration of Helio. Bear in mind that the previous command could be executed without the write option; this would not generate an output file. Although this may seem useless, since Helio can use an existing triple store, running Helio without the write option will make the generated RDF be injected into that triple store (see the sketch below). Have a look at the advanced configuration section in order to learn what can be done outside the regular use of Helio.
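
For instance, a minimal invocation without the write option could look as follows; the directory ./helio-mappings and the file ./helio-config.json are illustrative names rather than values required by Helio:

java -jar materialiser-X.X.X.jar --mappings=./helio-mappings --config=./helio-config.json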

Compatibility with other languages: Helio ships with different mapping parsers, and extension mechanisms to include new mappings. Currently, Helio supports its own mappings, explained below, and also RML and Wot-Mappings. As a result, any of those mappings can be provided to Helio.

Mappings in Helio are built upon 3 concepts:

  • Data Source: defines where data comes from regardless of the data format (data provider), and how data must be treated depending on its format regardless of where it comes from (data handler). Additionally, Data Sources can be defined as synchronous, in which case their RDF is generated any time a data request is issued, or asynchronous, in which case their RDF is generated on a timer. Materialisation tools like RMLMapper can be integrated as a data provider, as well as query translation tools that always provide the same data (the query that they translate is static). Check the plugins section in order to create new data providers or handlers.

  • Resource Rules: define how data is translated from the original format into RDF

  • Linking Rules: define fuzzy rules to link the resources within the RDF data generated

Each of these concepts can be defined in the same JSON document encoding the mapping, or each can be defined in a different file; Helio will read the different files and put them together (see the sketch below).
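
As an illustrative sketch (the file names datasources.json and resource-rules.json are arbitrary choices; Helio only requires the files to be placed in the directory passed via --mappings=), the Data Sources and the Resource Rules could live in two separate documents:

datasources.json:

{
    "datasources": [ ... ]
}

resource-rules.json:

{
    "resource_rules": [ ... ]
}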

Defining a Data Source

Helio expects a Data Source to be submitted as a JSON document. Let's use the following Data Source example to illustrate how to define a Data Source:

{
    "datasources": [
        {
            "id": "EPW Values datasource",
            "refresh" : 60000,
            "handler": { "type": "JsonHandler", "iterator" : "$.epw[*]" },
            "provider": { "type": "FileProvider", "file" : "./data/ESP_CE_Ceuta.603200_TMYx.json" }
        }
    ]
}

As can be observed, the key datasources encodes an array of Data Sources; for the sake of this example, only one was defined. The keys for a Data Source are:

  • id uniquely identifies this Data Source
  • refresh (optional) if this key is specified, it configures the Data Source as asynchronous and its data will be translated every specified number of milliseconds (in the example above, every 60000 milliseconds). If this key is not present, the Data Source is configured as synchronous and will generate the RDF any time a synchronous data update is required (see the sketch after this list).
  • handler that defines the type of data handler to be used and its inputs
  • provider that defines the type of the data provider and its inputs.
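
For instance, a minimal synchronous Data Source could simply omit the refresh key; the following sketch reuses the file and iterator from the example above:

{
    "datasources": [
        {
            "id": "EPW Values datasource (synchronous)",
            "handler": { "type": "JsonHandler", "iterator" : "$.epw[*]" },
            "provider": { "type": "FileProvider", "file" : "./data/ESP_CE_Ceuta.603200_TMYx.json" }
        }
    ]
}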

The list of available Data Providers and Data Handlers, along with their expected inputs, is given in the tables below. Consider that in the tables some values are between brackets [ ]; in these cases the user must replace the whole expression (brackets included) with a valid input. For instance, the specification of the JsonHandler { "type" : "JsonHandler", "iterator" : "[A valid Json Path]"} has the expression [A valid Json Path]; to instantiate the specification a user should replace [A valid Json Path] with a valid Json Path, for instance, { "type" : "JsonHandler", "iterator" : "$.epw[*]"}.

Defining Resource Rules

Helio expects a Resource Rule to be submitted as a JSON document. Let's use the following Resource Rule example to illustrate how to define a Resource Rule:

{
  "resource_rules": [
    {
      "id": "DryBulbTemperature",
      "datasource_ids": [
        "EPW Values datasource"
      ],
      "subject": "https://bimerr.iot.linkeddata.es/def/weather#DryBulbTemperature",
      "properties": [
        {
          "predicate": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
          "object": "http://w3id.org/saref#Measurement",
          "is_literal": "False",
        },{
          "predicate": "https://saref.etsi.org/core/relatesToMeasurement",
          "object": "{$.value}",
          "is_literal": "True",
          "datatype" : "http://www.w3.org/2001/XMLSchema#nonNegativeInteger" 
        },{
          "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
          "object": "Measured at {$.city}, {$.Year}-{$.Month}-{$.Day} at {$.Hour}",
          "is_literal": "True",
          "lang" : "en"
        }
      ]
    }
  ]
}

As can be observed, the key resource_rules encodes an array of Resource Rules; for the sake of this example, only one was defined. The mandatory keys for a Resource Rule are:

  • id which uniquely identifies this Resource Rule
  • datasource_ids defines an array containing the ids of the Data Sources from which data will be taken to be translated into RDF
  • subject is an evaluable expression, in which references to data are encoded between { and } and functions are expressed between [ and ]
  • properties is an array of translation rules.

The translation rules are JSON documents that specify a predicate and an object related to the previously specified subject. The mandatory keys for the translation rules are:

  • predicate and object are evaluable expressions, in which the reference to data is encoded between { and } and functions are expressed between [ and ]
  • is_literal is a boolean that specifies whether the triple formed by the predicate and the object refers to a data type (literal) or an object type (relationship to another URI).
  • Optionally, datatype allows defining the datatype of a literal, and lang allows defining the language in which the literal is expressed.

NOTE: prefixes are not allowed in the mappings

Notice that the subject, predicate, and object are evaluable expressions. These expressions can be constant, like https://bimerr.iot.linkeddata.es/def/weather#DryBulbTemperature; contain any number of data references, like https://bimerr.iot.linkeddata.es/weather/{$.city}/{$.Year}-{$.Month}-{$.Day}-{$.Hour}; or contain functions, like https://bimerr.iot.linkeddata.es/weather/[lower({$.city})]/[trim({$.Year})]-[trim({$.Month})]-[hash(concat({$.Day},{$.Hour}))]. The list of available functions and their inputs is specified here; consider that function names are case insensitive, and thus the functions CONCAT and concat will both work.
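
As an illustrative sketch (the Resource Rule id is hypothetical, and the data references and the lower function are those shown above), a subject and a label combining data references and functions could look as follows:

{
  "resource_rules": [
    {
      "id": "DryBulbTemperatureObservation",
      "datasource_ids": [ "EPW Values datasource" ],
      "subject": "https://bimerr.iot.linkeddata.es/weather/[lower({$.city})]/{$.Year}-{$.Month}-{$.Day}-{$.Hour}",
      "properties": [
        {
          "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
          "object": "Measured at [lower({$.city})], {$.Year}-{$.Month}-{$.Day} at {$.Hour}",
          "is_literal": "True",
          "lang": "en"
        }
      ]
    }
  ]
}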

Defining Linking Rules

Once one or more Resource Rules are defined, Helio accepts Linking Rules as a JSON document to link two or more subjects generated from the same, or different, Resource Rules. Let's use the following mapping example to illustrate how to define a Linking Rule:

{
    "datasources": [
        {
            "id": "Linking test 1",
            "handler": { "type": "JsonHandler", "iterator" : "$[*]" },
            "provider": { "type": "FileProvider", "file" : "./data-1.json"}
        },{
            "id": "Linking test 2",
            "handler": { "type": "JsonHandler", "iterator" : "$[*]" },
            "provider": { "type": "FileProvider", "file" : "./data-2.json"}
        }
    ],
    "resource_rules": [
        {
            "id" : "Test Linking 1",
            "datasource_ids" : ["Linking test 1"],
            "subject" : "https://example.test.es/{$.key}",
            "properties" : [
                {
                    "predicate" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", 
                    "object" : "https://example.test.org/Test",
                    "is_literal" : "False" 
                },{
                    "predicate" : "http://www.example.org/ontology#key", 
                    "object" : "{$.key}",
                    "is_literal" : "True" 
                },{
                    "predicate" : "http://www.example.org/ontology#number", 
                    "object" : "{$.number}",
                    "is_literal" : "True",
                    "datatype" : "http://www.w3.org/2001/XMLSchema#nonNegativeInteger" 
                },{
                    "predicate" : "http://www.example.org/ontology#text", 
                    "object" : "{$.text}",
                    "is_literal" : "True",
                    "lang" : "en" 
                }
            ]
        },{
            "id" : "Test Linking 2",
            "datasource_ids" : ["Linking test 2"],
            "subject" : "https://linking.test.es/{$.name}",
            "properties" : [
                {
                    "predicate" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", 
                    "object" : "https://www.linking.org/ontology#Country",
                    "is_literal" : "False" 
                },{
                    "predicate" : "http://www.linking.org/ontology#countryName", 
                    "object" : "{$.name}",
                    "is_literal" : "True",
                    "lang" : "en" 
                }
            ]
        }
    ],
    "link_rules" : [
        {
            "condition" : "S({$.name}) = T({$.text})",
            "source" : "Test Linking 2",
            "target" : "Test Linking 1",
            "predicate" : "http://www.w3.org/2002/07/owl#sameAs"
        }
    ]

}

As can be observed, the Linking Rule has 4 mandatory keys and one optional key:

  • predicate contains the URI that will link the subjects that fulfil the linking condition
  • source and target contain the id of a Resource Rule each (in this case Test Linking 2 and Test Linking 1, respectively)
  • condition defines an evaluable condition that must be met in order to link the subjects from the source and target. Notice that in the condition, the data values coming from the source must be enclosed in the function S(), e.g., S({$.name}). Similarly, for the data values coming from the target, the function T() should be used.
  • inverse (optional) contains the URI that will link the subjects that fulfil the linking condition in the opposite direction of predicate

The condition can use any function from the H2 list, or from the Helio implementations of string similarities or string transformations. The previous example, using a fuzzy similarity function and the inverse predicate, is the following:

{
    "link_rules" : [
        {
            "condition" : "cosine(S({$.name}), T({$.text}))>0.7",
            "source" : "Test Linking 2",
            "target" : "Test Linking 1",
            "predicate" : "http://ex.org/ontology#linkedTo",
            "predicate" : "http://ex.org/ontology#linkedBy"
        }
    ]

}

That would potentially generate the triples <sub1> <http://ex.org/ontology#linkedTo> <sub2> and <sub2> <http://ex.org/ontology#linkedBy> <sub1>.

Using the argument --config=, Helio can be provided with a file containing parameters that configure advanced features of Helio. An example of the file that can be passed as argument is the following:

{
    "base_uri" : "http://helio.linkeddata.es/",
    "plugins" : [ ... ],
    "threads" : {
        "injecting_data" : 100,
        "splitting_data" : 100,
        "linking_data" : 100
    },
    "repository" : {
        "type" : "RDF4JMemoryCache",
        "id" : "helio-storage",
        "configuration" : "./repositories-conf/sparql-repository.ttl"
    }
}

All the fields in the file are optional, and thus a partial version of the previous file can be provided as input (see the sketch after the following list). The functionalities related to each field are the following:

  • base_uri defines the URI that Helio takes as base when generating the RDF; the default is http://helio.linkeddata.es/, however in some cases it can be convenient to modify the base URI.
  • plugins is an array containing the configuration of Helio plugins. The way in which the plugins must be configured is specified in the Helio Plugins repository.
  • threads defines the number of threads used by Helio when generating the RDF. There are three kinds of threads: the injecting threads, which insert the generated RDF into the memory of Helio; the splitting threads, which define the number of data chunks handled in parallel by Helio; and the linking threads, which Helio uses to apply the linking rules.
  • repository allows defining the internal repository used by Helio. The repository must be provided with the name of a valid implementation of the MaterialiserCache; check the list of available implementations to know what each one needs in order to be configured. In this example the type provided is RDF4JMemoryCache, and it is configured with an RDF4J Repository Template allocated in ./repositories-conf/sparql-repository.ttl.
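
For instance, since all the fields are optional, a minimal partial configuration that only overrides the base URI could look as follows (http://example.org/helio/ is an illustrative value):

{
    "base_uri" : "http://example.org/helio/"
}
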
| DataProvider | Description | Specification |
| --- | --- | --- |
| FileProvider | Retrieves data from a local file | { "type" : "FileProvider", "file" : "[A valid file path]"} |
| URLProvider | Retrieves data from a URL using http, https, ftp, or the file protocol | { "type" : "URLProvider", "url" : "[A valid url]"} |
| HttpProvider | Retrieves data through the HTTP protocol using GET or POST | { "type" : "HttpProvider", "url" : "[A valid url]", "method" : "[GET or POST]", "headers" : [(Optional) A valid Json with http headers]} |
| InMemoryProvider | It can only be used when Helio is a code dependency; allows passing a stream of data as input | It must be initialised using the constructor new InMemoryProvider(PipedInputStream pipedData) |

| DataHandler | Description | Specification |
| --- | --- | --- |
| JsonHandler | Allows interacting with Json documents using Json Paths | { "type" : "JsonHandler", "iterator" : "[A valid Json Path]"} |
| CsvHandler | Allows interacting with CSV documents; columns can be accessed using their name or their position | { "type" : "CsvHandler", "separator" : "[A valid column separator]", "delimitator" : "[(Optional) a valid text separator]", "has_headers" : "[(Optional) true by default; it must be set to false if the CSV does not have the column names in the first row]"} |
| XmlHandler | Allows interacting with Xml documents using XPaths | { "type" : "XmlHandler", "iterator" : "[A valid XPath]"} |
| RegexHandler | Allows interacting with unstructured documents using regular expressions | { "type" : "RegexHandler", "iterator" : "[A valid regex]"} |
| HtmlHandler | Allows interacting with Html documents using the expressions supported by JSoup | { "type" : "HtmlHandler", "iterator" : "[A valid JSoup expression]"} |
| RDFHandler | Allows interacting with RDF documents | {"type" : "RDFHandler", "format" : "[A valid format, check the available ones]"} |
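
As an illustrative sketch of instantiating these specifications (the id, the URL, and the separator are hypothetical values), a Data Source combining an HttpProvider with a CsvHandler could look as follows:

{
    "datasources": [
        {
            "id": "CSV over HTTP datasource",
            "handler": { "type": "CsvHandler", "separator" : ",", "has_headers" : "true" },
            "provider": { "type": "HttpProvider", "url" : "https://example.org/data.csv", "method" : "GET" }
        }
    ]
}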

The following functions can be invoked in the mappings, either as cleaning functions within the Resource Rules or in the Linking Rules:

| Repository | Description | Details |
| --- | --- | --- |
| RDF4JMemoryCache | This repository is an RDF4J repository implementation. By default it stores the information in a native repository with persistence; however, it can be configured with any valid RDF4J Repository Template by pointing to such a file under the configuration key. Already-built, ready-to-use templates are allocated in the Helio GitHub | { "type" : "RDF4JMemoryCache", "id" : "helio-storage", "configuration" : "./repositories-conf/sparql-repository.ttl"} |