Resolution strategy configuration

Jan Michelfeit edited this page Jan 26, 2016 · 2 revisions

A resolution strategy defines how to resolve data conflicts that arise when integrating data from multiple sources. The strategy can be defined in <ResolutionStrategy> and <DefaultStrategy> tags in an LD-FusionTool configuration file (see examples).

Resolution strategy is defined by three configurable options: conflict resolution function, expected value cardinality, and aggregation error strategy.
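
As a rough mental model, the three options can be pictured as one record per strategy. The sketch below is plain Python, not LD-FusionTool code: the class names, field names, and the params dictionary are hypothetical, and only the option values (AVG, SINGLEVALUED, IGNORE, and so on) are taken from this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class Cardinality(Enum):
    MANYVALUED = "MANYVALUED"
    SINGLEVALUED = "SINGLEVALUED"


class AggregationErrorStrategy(Enum):
    IGNORE = "IGNORE"
    RETURN_ALL = "RETURN_ALL"


@dataclass
class ResolutionStrategy:
    # Hypothetical container; the field names are illustrative only.
    function: str                              # e.g. "AVG", "BEST", "ALL"
    cardinality: Cardinality
    error_strategy: AggregationErrorStrategy
    params: dict = field(default_factory=dict)  # function parameters, if any


# A strategy that averages prices and drops values that cannot be averaged.
price_strategy = ResolutionStrategy(
    function="AVG",
    cardinality=Cardinality.SINGLEVALUED,
    error_strategy=AggregationErrorStrategy.IGNORE,
)
```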

Conflict Resolution Functions

A conflict resolution function resolves conflicting values. For example, multiple sources can contain information about a product price; a resolution function can resolve the price values by averaging them, selecting the maximum or the latest value, or including all prices in the result.

Possible values:

Deciding resolution functions choose one or more values from the input. They cannot produce any value that is not present in the input.

  • ALL - all distinct values are included in the output
  • ALLBEST - returns the value with the highest quality score; if multiple values have the same score, returns all the top values
  • ANY - returns a single arbitrary value
  • BEST - returns the value with the highest quality score; if multiple values have the same score, returns the first top value
  • BEST_SOURCE - returns a value from the named graph with the highest source quality score
  • CERTAIN - if there is a single distinct value, returns the value; otherwise returns no values
  • FILTER - returns numerical values falling into the given range; the range can be specified by optional min and max parameters
  • LONGEST - returns the value with the longest lexical representation
  • MAX - returns the maximum literal value (see comparing values)
  • MAX_SOURCE_METADATA - returns a value from the named graph with maximal value of a given property; the property is specified by predicate parameter
  • MIN - returns the minimum literal value (see comparing values)
  • MIN_SOURCE_METADATA - returns a value from the named graph with minimal value of a given property; the property is specified by predicate parameter
  • NONE - returns all values; unlike ALL, preserves duplicate values
  • CHOOSE_SOURCE - returns values from the given named graph; the named graph is specified by source parameter
  • SHORTEST - returns the value with the shortest lexical representation
  • TOPN - returns n values with the highest quality score; n is specified by the n parameter
  • THRESHOLD - returns values with the quality score above the given threshold; the threshold is specified by the threshold parameter
  • VOTE - returns the most frequently occurring value
  • WEIGHTED_VOTE - returns the most frequently occurring value where occurrences are weighted by source quality scores
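
To make the behavior of some deciding functions concrete, here is a minimal Python sketch. It is not the LD-FusionTool implementation: the function names are hypothetical, values are plain strings, quality scores are assumed to be supplied as (value, score) pairs, and tie-breaking by first occurrence is an assumption wherever this page does not specify it.

```python
from collections import Counter


def resolve_all(values):
    """ALL: every distinct value, in first-seen order."""
    return list(dict.fromkeys(values))


def resolve_none(values):
    """NONE: all values, duplicates preserved."""
    return list(values)


def resolve_certain(values):
    """CERTAIN: the value if there is exactly one distinct value, else nothing."""
    distinct = resolve_all(values)
    return distinct if len(distinct) == 1 else []


def resolve_vote(values):
    """VOTE: the most frequent value (assumed tie-break: first seen)."""
    return [Counter(values).most_common(1)[0][0]] if values else []


def resolve_best(scored):
    """BEST: the first value with the highest score."""
    return [max(scored, key=lambda pair: pair[1])[0]] if scored else []


def resolve_allbest(scored):
    """ALLBEST: every value that shares the highest score."""
    if not scored:
        return []
    top = max(score for _, score in scored)
    return [value for value, score in scored if score == top]


def resolve_topn(scored, n):
    """TOPN: the n values with the highest scores."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [value for value, _ in ranked[:n]]


def resolve_threshold(scored, threshold):
    """THRESHOLD: values whose score is above the threshold."""
    return [value for value, score in scored if score > threshold]


prices = ["10", "12", "10", "9.50"]
scored_prices = [("10", 0.9), ("12", 0.9), ("9.50", 0.4)]
```

With these inputs, resolve_all(prices) keeps three values, resolve_certain(prices) returns nothing because the sources disagree, and resolve_best(scored_prices) and resolve_allbest(scored_prices) differ only in how they treat the tie between "10" and "12". None of these functions can return a value that was not in the input.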

Mediating resolution functions may produce values not included in the input.

  • AVG - returns the numerical average of values
  • SUM - returns the sum of values
  • CONCAT - returns the concatenation of lexical representations of values, separated by value given in optional parameter separator (defaults to ;)
  • MEDIAN - returns the median value
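
The mediating functions can be sketched the same way. The Python below is illustrative only and assumes the values have already been parsed as numbers; the function names are hypothetical. The median of an even number of values is computed here as the mean of the two middle values, which is how Python's statistics.median behaves and not necessarily what LD-FusionTool does.

```python
from statistics import mean, median


def resolve_avg(values):
    """AVG: numerical average of the values."""
    return mean(values)


def resolve_sum(values):
    """SUM: sum of the values."""
    return sum(values)


def resolve_concat(values, separator=";"):
    """CONCAT: lexical representations joined by the separator (default ';')."""
    return separator.join(str(value) for value in values)


def resolve_median(values):
    """MEDIAN: the middle value (mean of the two middle values if even count)."""
    return median(values)


prices = [10, 12, 17]
```

For these prices, AVG yields 13 and SUM yields 39, neither of which appears in the input; that is what distinguishes mediating functions from deciding ones.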

Special resolution functions.

  • DEPENDENT_RESOURCE - expects the values to be resources, and resolves their properties recursively (currently only resolution to depth 1 is supported); more details

Notes:

  • Source quality score. The source quality score is determined from the named graph quality score and the average quality of the corresponding data publisher. By default, the properties specifying scores are odcs:score for named graph quality and odcs:publisherScore for publisher quality, and the publisher of a named graph is specified by odcs:publishedBy, where the odcs: prefix stands for http://opendata.cz/infrastructure/odcleanstore/.
  • Parameters. Some resolution functions have parameters. These are specified by the <Param> element in the configuration XML.
  • Comparing values. Resolution functions that compare literal values use different kinds of comparison depending on the data types of the literals. LD-FusionTool supports comparison for strings (xsd:string or no datatype; lexicographical), numerical values (xsd:int, xsd:long, ...), time (xsd:time), date (xsd:dateTime, xsd:gYearMonth, ...), and booleans (xsd:boolean). If multiple types of values are present, the most frequent type is used and the remaining values are processed according to the aggregation error strategy.
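
The "most frequent type wins" rule from the last note can be sketched as follows. This is a simplified Python illustration, not the tool's code: literals are modeled as (lexical value, datatype) pairs, the datatype-to-kind table is hypothetical and lists only the types named above, and ties between kinds are broken by first occurrence as an assumption.

```python
from collections import Counter

# Hypothetical mapping from datatype to comparison kind; None means no datatype.
KIND_BY_DATATYPE = {
    None: "string", "xsd:string": "string",
    "xsd:int": "numeric", "xsd:long": "numeric",
    "xsd:time": "time",
    "xsd:dateTime": "date", "xsd:gYearMonth": "date",
    "xsd:boolean": "boolean",
}


def split_by_dominant_kind(literals):
    """Return (dominant kind, literals of that kind, remaining literals)."""
    if not literals:
        return None, [], []
    kinds = [KIND_BY_DATATYPE.get(datatype, "unknown") for _, datatype in literals]
    dominant = Counter(kinds).most_common(1)[0][0]
    accepted = [lit for lit, kind in zip(literals, kinds) if kind == dominant]
    rejected = [lit for lit, kind in zip(literals, kinds) if kind != dominant]
    return dominant, accepted, rejected


literals = [("7", "xsd:int"), ("12", "xsd:long"), ("abc", None)]
kind, accepted, rejected = split_by_dominant_kind(literals)

# Numeric comparison picks 12 as the maximum; a lexicographical comparison
# of the same lexical values would pick "7" instead.
numeric_max = max(accepted, key=lambda lit: int(lit[0]))
```

Here the two numeric literals outnumber the single string, so MAX would compare numerically, and ("abc", None) is left to the aggregation error strategy.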

Expected Value Cardinality

Expected value cardinality affects how quality is computed for resolved quads produced by LD-FusionTool.

Possible values:

  • MANYVALUED - it is valid for the respective property to have multiple values (e.g., authors of a paper)
  • SINGLEVALUED - the respective property should have a single value (e.g., a birth date); quality value will be decreased if multiple values are present (depending on their similarity/difference)

Aggregation Error Strategy

Aggregation error strategy determines how values that cannot be processed by the given conflict resolution function should be treated. For example, string values cannot be averaged using the AVG resolution function. Such values can be either discarded or included in the result.

Possible values:

  • IGNORE - discard values that are not accepted by the resolution function
  • RETURN_ALL - include all values that are not accepted by the resolution function in the result
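
The difference between the two strategies can be illustrated with AVG applied to a mix of numeric and non-numeric values. This Python sketch is an assumption-laden illustration rather than the tool's implementation: the function name and the error_strategy parameter are hypothetical, and "cannot be processed" is approximated by "cannot be parsed as a float".

```python
from statistics import mean


def resolve_avg(values, error_strategy="IGNORE"):
    """Average the numeric values; treat the rest per the error strategy."""
    numeric, rejected = [], []
    for value in values:
        try:
            numeric.append(float(value))
        except (TypeError, ValueError):
            rejected.append(value)  # not accepted by AVG
    result = [mean(numeric)] if numeric else []
    if error_strategy == "RETURN_ALL":
        result.extend(rejected)     # pass rejected values through unchanged
    return result


values = ["10", "14", "n/a"]
ignored = resolve_avg(values, "IGNORE")
returned = resolve_avg(values, "RETURN_ALL")
```

With IGNORE the result contains only the average of "10" and "14"; with RETURN_ALL the unparseable "n/a" is kept alongside it.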