POI Conflation

Hootenanny can perform point to point (POI to POI) conflation, merging the geometry and attribution of the input datasets using the Unify approach. The Unify POI to POI conflation code uses the Hootenanny JavaScript interface described in [HootJavaScriptOverview] to find, label and merge match candidate features. Because this code deals with a broader range of POI conditions, conflating POI datasets using Unify will invariably produce reviews that the user must manually resolve using the workflow described in [hootuser], Reviewing Conflicts. Further background on Unifying POI conflation can be found in [hootuser], Unifying POI Conflation. For more details on the POI to POI conflation algorithm, see PoiToPoi.

For background on leveraging the Hootenanny JavaScript interface to call functions using Node.js, see [HootJavaScriptOverview].

Several distance thresholds can be configured if you are comfortable making changes to JavaScript. To change them, modify the tables found in $HOOT_HOME/rules/Poi.js. Any changes made to Poi.js apply to all jobs launched from the user interface as well as from the command line.

The PLACES algorithm formerly supported in Hootenanny was designed specifically to handle GeoNames style data. On GeoNames data it is very fast, but it suffers from a few drawbacks:

  • The dated base implementation we use does not consider POI type when matching features.

  • The search radius of a feature is based primarily on the source data and not on the POI type. For example, cities don’t inherently get a larger search radius than hotels.

  • No reviews are generated for users to evaluate. This is great if you want a fully automated system, but can lead to some hard-to-discover errors.

These issues become even more apparent when you apply the PLACES conflation routines to data for which they were not intended. For example, OSM inputs tend to have very dense POI information that relies heavily on type tags and in some cases has no name tags at all.

The following sections describe the POI to POI conflation design, the test setup and finally the test results.

Design

This section covers the POI to POI design by going into details about finding match candidates, labeling match candidates and merging matched features.

The POI to POI conflation code uses the Hootenanny JavaScript interface (See JavaScript Overview in [hootuser] for details). The script contains the business logic for conflating POIs. The code can be found and modified in the Hootenanny source tree under rules/Poi.js.

The conflation flow is similar to the other conflation operations and happens in a single pass of the data with other Unifying routines. The high level steps are:

  1. Filter all elements to only the POI types.

  2. Find match candidates based on the search radius of the POIs.

  3. Label the POI relationships based on an additive score.

  4. Merge POIs using the user specified tag merging routine.

The following sections will detail each of the steps above.

Filter Elements to Only POIs

Hootenanny takes a fairly liberal view on what should be considered a POI. This overly optimistic approach may increase the number of unnecessary reviews but helps to reduce errors.

To be considered a POI the following criteria must be met:

  • The element must be a node. Lines and Polygons are not Points of Interest.

  • The node must either have a name, or be in a poi category as defined in Hootenanny’s schema files.

The poi category is defined in Hootenanny’s conf/schema directory. See the Schema Files section in [hootuser] for details. Any element that contains at least one tag with the poi category is considered to be a POI type. If a tag is not explicitly defined as being in the poi category it is assumed to be outside the poi category.

Some examples of nodes that are considered to be a POI type are below:

  • Highway rest areas and highway service stops

  • All amenity tags such as hotels, restaurants, and concert halls.

  • All natural tags such as cliff, beach and water.

  • Some man-made nodes such as electrical towers, power lines, and gantries.

Some examples of nodes that are not considered to be POIs are below:

  • Stop signs, stop lights, speed bumps.

  • Entrances to buildings.
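
The two criteria in this section can be sketched in a few lines of JavaScript. This is only an illustration, not the code in rules/Poi.js: the element shape (`type`, `tags`), the `POI_CATEGORY` set and both function names are hypothetical, and the real poi category is read from the schema files rather than hard coded. It also assumes that "has a name" means a non-empty `name` tag.

```javascript
// Hypothetical stand-in for the schema's poi category. Entries are either a
// tag key ("amenity") or a key=value pair ("highway=rest_area").
const POI_CATEGORY = new Set([
  "amenity",
  "natural",
  "highway=rest_area",
  "highway=services",
]);

// True if at least one tag falls in the (hypothetical) poi category.
function inPoiCategory(tags) {
  return Object.entries(tags).some(
    ([key, value]) =>
      POI_CATEGORY.has(key) || POI_CATEGORY.has(`${key}=${value}`)
  );
}

// An element is treated as a POI if it is a node and it either has a name
// or carries a tag in the poi category.
function isPoi(element) {
  if (element.type !== "node") return false;
  const name = element.tags.name;
  const hasName = typeof name === "string" && name.length > 0;
  return hasName || inPoiCategory(element.tags);
}
```

With this sketch a node tagged amenity=pub or a node with only a name passes the filter, while a way with the same tags or a node tagged only highway=stop does not.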

Find Match Candidates

A match candidate is a pair of POIs that needs to have a relationship assigned. If a pair of POIs is not considered a match candidate then it will never be marked as a match or review. Two POIs are a match candidate if they come from two different inputs and the distance between the two nodes is less than or equal to either of their search radii.

The search radius is determined by the following, in order of priority:

  1. The 'search.radius.poi' configuration option (turned off by default).

  2. The element’s circular error tag, if the circular error is larger than the configured search radius for the feature.

  3. The customizable search radii passed in as either a JSON string or a file in the 'poi.search.radii' configuration option.

Some example search radii are below:

  • historic - 200m

  • place=city - 5000m

  • place=village - 3000m

  • amenity - 200m
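
The priority order for the search radius can be sketched as follows, using the example radii above. The table layout, the `overrideRadius` and `defaultRadius` fields and the function names are all assumptions made for illustration; the actual format accepted by poi.search.radii and the way search.radius.poi is switched off are not shown in this document. Picking the largest matching table entry is also an assumption.

```javascript
// Example radii from the list above, in meters. The layout is hypothetical.
const EXAMPLE_RADII = [
  { key: "place", value: "city", radius: 5000 },
  { key: "place", value: "village", radius: 3000 },
  { key: "historic", radius: 200 },
  { key: "amenity", radius: 200 },
];

// Look up the configured radius for a set of tags. An entry without a value
// matches any value of its key. Falls back to defaultRadius if nothing matches.
function tableRadius(tags, table, defaultRadius) {
  const hits = table
    .filter((e) => e.key in tags && (e.value === undefined || tags[e.key] === e.value))
    .map((e) => e.radius);
  return hits.length > 0 ? Math.max(...hits) : defaultRadius;
}

// Resolve the search radius in the priority order described in the text.
function searchRadius(tags, circularError, options) {
  // 1. An explicit override wins when it is set (modeled here as a number).
  if (typeof options.overrideRadius === "number") return options.overrideRadius;
  const configured = tableRadius(tags, options.table, options.defaultRadius);
  // 2. The circular error is used when it exceeds the configured radius.
  if (circularError > configured) return circularError;
  // 3. Otherwise the configured radius for the feature type applies.
  return configured;
}
```

For example, a place=city node with a 15m circular error resolves to 5000m, while an amenity node with a 500m circular error resolves to 500m.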

Figure 1. Search Radius Example (a simple example of overlapping search radii)

In the Search Radius Example, points A and B are from the first input and point C is from the second data set. Because A and B come from the same input, they will never be match candidates. A’s search radius covers the point C, so A and C are match candidates. The search radii of B and C overlap, but B’s search radius does not cover the node C and C’s search radius does not cover the node B, so B and C are not match candidates.
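
The candidate test can be sketched as below. The coordinates and radii are made up to mimic the situation in the figure, distances are plain planar distances in meters, and the field and function names are hypothetical rather than taken from Hootenanny.

```javascript
// Planar distance between two points, in the same units as the coordinates.
function distance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Two POIs are match candidates if they come from different inputs and at
// least one of them lies within the other's search radius.
function isMatchCandidate(a, b) {
  if (a.input === b.input) return false;
  const d = distance(a, b);
  return d <= a.radius || d <= b.radius;
}

// Hypothetical layout echoing the figure: A and B from input 1, C from input 2.
const A = { input: 1, x: 0, y: 0, radius: 200 };
const B = { input: 1, x: 500, y: 0, radius: 150 };
const C = { input: 2, x: 180, y: 0, radius: 200 };

console.log(isMatchCandidate(A, C)); // A's radius reaches C
console.log(isMatchCandidate(B, C)); // circles overlap, but neither node is reached
console.log(isMatchCandidate(A, B)); // same input
```

Note that the test compares the distance against each radius separately. Comparing against the sum of the radii would wrongly accept B and C.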

Label Based on Additive Scores

The relationship between two points is established with a simple additive score. Generally the score starts at zero and is increased as evidence of a match is found. Strong evidence of a match increases the score by 1 and weak evidence increases the score by 0.5. There are some exceptions that will be covered at the end of this section.

Some examples of evidence include:

  • Strong Evidence

      ◦ POI types are similar (e.g. amenity=bus_station is similar to transport=station)

      ◦ Additional attributes are similar (e.g. cuisine, power, sport, or art_type)

      ◦ Very similar names

  • Weak Evidence

      ◦ Somewhat similar names

      ◦ Points are close together (<= 20% of the search radii)

In some special cases the scores are manipulated based on type. For instance if one or both of the nodes have no POI type attributes then the closeness and the name are given double the weight when scoring. If one of the POIs is a place and the other is not, then closeness and name are given half the weight. This pragmatic approach helps to give better performance and reduce unnecessary reviews at the expense of complexity. In the Future Work section we will discuss approaches to reducing or eliminating these exceptions.
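
A minimal sketch of the basic additive score follows. It covers only the strong and weak evidence listed above and deliberately leaves out the type-based weight adjustments, so it should not be read as the scoring code in rules/Poi.js. The shape of the `evidence` object is hypothetical, and the caller is assumed to have already decided which search radius applies.

```javascript
const STRONG = 1.0; // added for each piece of strong evidence
const WEAK = 0.5; // added for each piece of weak evidence

// evidence: {
//   similarType: boolean,
//   similarAttributes: boolean,
//   nameSimilarity: "very" | "somewhat" | "none",
//   distance: number,      // meters between the two points
//   searchRadius: number   // meters
// }
function additiveScore(evidence) {
  let score = 0;
  if (evidence.similarType) score += STRONG;
  if (evidence.similarAttributes) score += STRONG;
  if (evidence.nameSimilarity === "very") score += STRONG;
  else if (evidence.nameSimilarity === "somewhat") score += WEAK;
  // "Close together" means within 20% of the search radius.
  if (evidence.distance <= 0.2 * evidence.searchRadius) score += WEAK;
  return score;
}
```

For instance, two points with very similar names that are 10m apart under a 200m search radius score 1 + 0.5 = 1.5.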

Once the score has been established a simple threshold is used to establish the relationship:

  • If the score is <= 0.5 the relationship is a miss.

  • If the score is greater than 0.5 and less than 1.9 the relationship is a review.

  • If the score is >= 1.9 the relationship is a match.
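
The thresholds translate directly into a small function. The function name and the string labels are chosen here for illustration only.

```javascript
// Map an additive score to a relationship using the thresholds above.
function classify(score) {
  if (score >= 1.9) return "match";
  if (score > 0.5) return "review";
  return "miss";
}

// A few sample scores and the relationship each one falls into.
for (const score of [0, 0.5, 1.5, 2]) {
  console.log(score, classify(score));
}
```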

Comparing POI Types

Types are not always the same in the input data. One user may have an extraction guide that specifies a point should have the type of its primary use (e.g. amenity=bus_station), while another input may not have a specific bus station tag and simply tags the point as transport=station. Intuitively it is obvious that these two points could represent the same entity; however, that prior knowledge must be exposed to Hootenanny.

To do this Hootenanny uses schema files. The schema files use a graph to define that amenity=bus_station is similar to transport=station. This graph contains both isA and similarTo relationships. Details on how the graph works can be found in Calculating the Enumerated Score.
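
As a rough illustration of the idea only, the graph can be pictured as a list of typed edges. The edge list, field names and function below are hypothetical and do not reflect the schema file format or how Hootenanny scores paths through the graph. Only the bus station example comes from the text; the isA edge is included purely to show the second edge type, and treating similarTo as symmetric is an assumption.

```javascript
// Hypothetical, hand-written fragment of a schema graph.
const EDGES = [
  { from: "amenity=bus_station", to: "transport=station", type: "similarTo" },
  { from: "amenity=bus_station", to: "amenity", type: "isA" },
];

// True if the two tags are joined by a direct similarTo edge in either direction.
function areDirectlySimilar(tagA, tagB, edges) {
  return edges.some(
    (e) =>
      e.type === "similarTo" &&
      ((e.from === tagA && e.to === tagB) || (e.from === tagB && e.to === tagA))
  );
}
```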

Example Scores

The table below lists a handful of examples as well as the associated scores and relationships.

Table 1. Example POI Scores
Tags 1 | Tags 2 | Score | Reasons
place=locality, historic=ruins, name:fr=Khirbat Masuh, int_name=Khirbat Māsūh;Khirbat Masuh | place=populated, alt_name=Khirbat Masuh;Khirbat Māsūh;Masuh;Māsūh;maswh, name=Māsūh | 1.5 Review | very similar names, very close together
place=village, name:en=Al Maks | name=Al Maks, amenity=pub | 0.5 Miss | very similar names, very close together, no place match
barrier=toll_booth | building=guardhouse | 1.5 Review | very close together, similar POI type
name=Georg-Brauchle-Ring, railway=subway_entrance | station=light_rail, name=U-BAHN-GEORG-BRAUCHLE-RING | 2 Match | very similar names, similar POI type
name=Izbat Hawd an Nada, place=village | name=Izbat as Sab’in, place=populated | 0 Miss | None given by routine (Izbat is a common word so it is given a low weight when comparing names)

Merging POIs

After relationships have been determined the system then determines how to apply said relationships. The simplest cases are when a point is only involved in a single relationship with no overlap between relationships. E.g. A matches only B and B matches only A. In this case the two points will be merged as expected.

However, if there are overlapping matches Hootenanny makes no attempt to determine which match is most appropriate, but marks all the overlapping matches as needing to be reviewed by the user. This does increase the number of reviews in some dense regions, but avoids some unnecessary errors in the process.

The first input is always used as the reference geometry.

The tags are merged using the default tag merging routine. Unless otherwise specified the default tag merging routine is averaging.
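
The handling of overlapping matches can be sketched as below. Matches are represented as pairs of hypothetical POI identifiers, and the function only sorts them into pairs that can be merged and pairs that become reviews. Geometry selection and tag merging are left out.

```javascript
// matches: array of [idFromInput1, idFromInput2] pairs labeled as matches.
// Returns the pairs that can be merged directly and the pairs that must be
// reviewed because one of their POIs takes part in more than one match.
function resolveMatches(matches) {
  // Count how many matches each POI id is involved in.
  const counts = new Map();
  for (const [a, b] of matches) {
    counts.set(a, (counts.get(a) || 0) + 1);
    counts.set(b, (counts.get(b) || 0) + 1);
  }
  const merge = [];
  const review = [];
  for (const pair of matches) {
    const overlaps = pair.some((id) => counts.get(id) > 1);
    (overlaps ? review : merge).push(pair);
  }
  return { merge, review };
}
```

Given the matches A-B, C-D and C-E, only A-B is merged. Both matches involving C are turned into reviews because no attempt is made to pick the better one.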

Test Setup

To evaluate the performance of automatically conflated results manually matched data is used. The manually matched data was translated into the OSM schema before matching and all non-POI features were removed (e.g. buildings polygons and roads). One data set is designated as the primary and the other as the secondary. The primary data set gets a unique identifier applied to each feature as a "REF1" tag. Then an analyst goes through all the features in the secondary dataset and assigns tags to define the relationships to the corresponding primary input features. The associated tags are listed below:

  • REF2 - This tag signifies matches and can contain either the value of a single REF1 UID or none.

  • REVIEW - If a feature in the secondary data set should be reviewed against zero or more features then this tag is used. A feature may need to be reviewed if there isn’t enough information or the match is ambiguous. This tag will be populated with a semi-colon delimited list of REF1 UIDs.
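
Reading the expected relationships back out of a secondary feature might look like the sketch below. It assumes that "none" appears as the literal string none (or that the tag is simply absent), which the description above does not state explicitly. The function name and the returned shape are hypothetical.

```javascript
// tags: the tags of one feature from the secondary data set.
// Returns the REF1 UID it is expected to match (or null) and the list of
// REF1 UIDs it is expected to be reviewed against.
function expectedRelationships(tags) {
  const match = tags.REF2 && tags.REF2 !== "none" ? tags.REF2 : null;
  const review = tags.REVIEW
    ? tags.REVIEW.split(";")
        .map((uid) => uid.trim())
        .filter((uid) => uid.length > 0 && uid !== "none")
    : [];
  return { match, review };
}
```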

The data sets used are varied in source and region, but for simplicity some data sets are used multiple times.

Table 2. POI Test Data Sources
Test | Region        | Source 1 | Source 2     | Input 1 POI Count | Input 2 POI Count | Approximate Area (km²)
1    | Munich        | OSM      | NAVTEQ       | 32414             | 2297              | 500
2    | Egypt         | OSM      | GeoNames.org | 9017              | 6654              | 10500
3    | Egypt         | OSM      | MGCP         | 9017              | 186066            | 10500
4    | Jordan        | OSM      | MGCP         | 2691              | 59126             | 500
5    | Jordan        | OSM      | GeoNames.org | 2691              | 1322              | 500
6    | Washington DC | OSM      | GeoNames.org | 15700             | 4246              | 140
7    | Jordan        | MGCP     | GeoNames.org | 1322              | 330               | 500

All test results presented were run with Hootenanny v0.2.17-76-g140396e. An iterative approach was used to improve performance against the data sets provided. As the tests were run, areas that caused errors were identified and improved.

POI Test Flow
digraph G
{
  rankdir = LR;
  node [shape=record,width=2,height=1,style=filled,fillcolor="#e7e7f3"];
  conflate [label = "Automatically\nConflate"];
  improve [label = "Improve\nAlgorithm"];
  evaluate [label = "Evaluate\nResults"];
  "Manually\nMatch Data" ->
  conflate -> evaluate
  evaluate -> improve
  improve:s -> conflate:s
}

In this test setup our testing data is used to improve the algorithm. This creates a biased test scenario, but still provides useful information. When new regions are evaluated in the future the test results are almost certain to vary based on the POI types and data quality that is provided. In other words, your mileage will vary. Input data sets can differ a great deal, so to get accuracy values for a new dataset, evaluate a small test region drawn from that data.

Test Results

The test results are presented in the tables below. Note that the tables below represent the categorization of relationships between POIs (not the number of merged POIs). As such the number of POIs that do not match (miss) is very high and omitted from the tables.

The number of reviews also seems quite high, but in reality reviewing a single POI pair is relatively quick at about 12-20 seconds per review.

Table 3. POI Aggregated Confusion Matrix
Rows are the conflation outcome; columns are the expected relationship.

outcome \ expected | miss | match | review
miss               | -    | 269   | 43
match              | 283  | 4053  | 12
review             | 0    | 2998  | 155

Table 4. POI Test Results
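Columns 2-9 give counts by outcome and expected relationship (exp.).

Test Name | miss outcome: match exp. | miss outcome: review exp. | match outcome: miss exp. | match outcome: match exp. | match outcome: review exp. | review outcome: miss exp. | review outcome: match exp. | review outcome: review exp. | Wrong | Correct | Unn. Review
1     | 25  | 7  | 62  | 488  | 8  | 0 | 625  | 116 | 7.7%  | 45.4% | 47.0%
2     | 69  | 9  | 106 | 851  | 3  | 0 | 477  | 15  | 12.2% | 56.6% | 31.2%
3     | 16  | 7  | 0   | 17   | 0  | 0 | 153  | 20  | 10.8% | 17.4% | 71.8%
4     | 20  | 19 | 8   | 814  | 1  | 0 | 134  | 2   | 4.8%  | 81.8% | 13.4%
5     | 25  | 0  | 26  | 483  | 0  | 0 | 156  | 0   | 7.4%  | 70.0% | 22.6%
6     | 111 | 1  | 78  | 1157 | 0  | 0 | 1371 | 2   | 7.0%  | 42.6% | 50.4%
7     | 3   | 0  | 3   | 243  | 0  | 0 | 82   | 0   | 1.8%  | 73.4% | 24.8%
Total | 269 | 43 | 283 | 4053 | 12 | 0 | 2998 | 155 | 7.8%  | 53.9% | 38.4%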
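
The percentage columns are not defined explicitly in the text, but the published numbers are consistent with the following reading: a relationship is Wrong when the outcome is a miss or a match that disagrees with the expected value, Correct when outcome and expected value agree, and an unnecessary review when a review is produced for an expected match or miss. The sketch below recomputes the Total row under that assumption. The object layout and function name are hypothetical.

```javascript
// counts[outcome][expected], taken from the aggregated confusion matrix.
const totals = {
  miss: { match: 269, review: 43 },
  match: { miss: 283, match: 4053, review: 12 },
  review: { miss: 0, match: 2998, review: 155 },
};

// Summarize a confusion matrix as percentages rounded to one decimal place.
function summarize(c) {
  const wrong = c.miss.match + c.miss.review + c.match.miss + c.match.review;
  const correct = c.match.match + c.review.review;
  const unnecessary = c.review.miss + c.review.match;
  const total = wrong + correct + unnecessary;
  const pct = (n) => Math.round((1000 * n) / total) / 10;
  return {
    wrong: pct(wrong),
    correct: pct(correct),
    unnecessaryReview: pct(unnecessary),
  };
}

console.log(summarize(totals));
// → { wrong: 7.8, correct: 53.9, unnecessaryReview: 38.4 }
```

The same function applied to the counts for a single test reproduces that test's row, which is a quick way to check results from a new region.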

Future Work

In this section we discuss some areas of possible improvement.

In many cases Hootenanny relies very heavily on name comparisons for making matches. The PLACES team found great promise in using skip-grams for name comparison. We would like to investigate using skip-grams to improve performance (Note: skip-grams have been implemented, I believe, but I am not sure they have ever been tested with POI to POI - BDW).

Also, Hootenanny uses a global dictionary of word frequencies to determine the relevance of a word in a name. However, if you’re looking at the word "Pennsylvania" in Pittsburgh the weight should be low, but the same word in Indonesia would have a very high weight. We would like to investigate using a weight that dynamically changes with the region (Note: There seems to be evidence this has been done for POI to POI, but that needs to be verified - BDW).

In building and road matching we have had good success training supervised models for matching data. We would like to explore using the same techniques for matching POIs.

Generating POI training data can be a time-consuming process. To increase efficiency the user could be guided through the matching process with the UI. This would dramatically speed up the creation of training data, but there is the possibility that false negatives (matches that Hootenanny misses) will be dropped from the data.