Hootenanny provides the ability to perform point to point, i.e. POI to POI, conflation to merge input dataset geometry and attribution using the Unify approach. The Unify POI to POI conflation code uses the Hootenanny JavaScript interface described in [HootJavaScriptOverview] to find, label and merge match candidate features. Because this code deals with a broader range of POI conditions, conflating POI datasets using Unify will invariably produce reviews that the user must manually confirm using the workflow described in the [hootuser], Reviewing Conflicts. Further background on Unifying POI conflation can be found in [hootuser], Unifying POI Conflation. For more details on the POI to POI conflation algorithm, see PoiToPoi
For background on leveraging the Hootenanny JavaScript interface to call functions using Node.js, see [HootJavaScriptOverview].
There are a number of configurable distance thresholds that can be specified if
you are comfortable making changes to JavaScript. Please see
$HOOT_HOME/rules/Poi.js
to modify the tables found in the script. Any
changes made to Poi.js
will apply to all jobs launched in the user
interface as well as command line.
The PLACES algorithm formerly supported in Hootenanny was designed specifically to handle GeoNames style data. In the case of GeoNames it is very fast, but suffers from a few draw-backs.
-
The dated base implementation we use does not consider POI type when matching features.
-
The search radius of a feature is based primarily on the source data and not on the POI type. For example cities don’t inherently get a larger search radius than hotels.
-
No reviews are generated for users to evaluate. This is great if you want a fully automated system, but can lead to some hard to discover errors.
These issues become even more apparent when you try to apply the PLACES conflation routines to data for which it was not intended. For example OSM inputs tend to have very dense POI information that rely heavily on the type tags and in some cases have no name tags at all.
The following sections describe the POI to POI conflation design, the test setup and finally the test results.
This section covers the POI to POI design by going into details about finding match candidates, labeling match candidates and merging matched features.
The POI to POI conflation code uses the Hootenanny JavaScript
interface (See JavaScript Overview in [hootuser] for details). The script contains
the business logic for conflating POIs. The code can be found and modified in the Hootenanny source
tree under rules/Poi.js
.
The conflation flow is similar to the other conflation operations and happens in a single pass of the data with other Unifying routines. The high level steps are:
-
Filter all element to only the POI types.
-
Find match candidates based on the search radius of the POIs.
-
Label the POI relationships based on an additives score.
-
Merge POIs using the user specified tag merging routine.
The following sections will detail each of the steps above.
Filter Elements to Only POIs
Hootenanny takes a fairly liberal view on what should be considered a POI. This overly optimistic approach may increase the number of unnecessary reviews but helps to reduce errors.
To be considered a POI the following criteria must be met:
-
The element must be a node. Lines and Polygons are not Points of Interest.
-
The node must either have a name, or be in a
poi
category as defined in Hootenanny’s schema files.
The poi
category is defined in Hootenanny’s conf/schema
directory. See the
Schema Files section in [hootuser] for details. Any element that contains at
least one tag with the poi
category is considered to be a POI type. If a tag
is not explicitly defined as being in the poi
category it is assumed to be
outside the poi
category.
Some examples of nodes that are considered to be a POI type are below:
-
Highway rest areas and highway service stops
-
All
amenity
tags such as hotels, restaurants, and concert halls. -
All
natural
tags such as cliff, beach and water. -
Some man made nodes such as electrical towers, power lines, and gantries.
Some examples of nodes that are not considered to be POIs are below:
-
Stop signs, stop lights, speed bumps.
-
Entrances to buildings.
Find Match Candidates
A match candidate is a pair of POIs that need to have a relationship assigned. If a pair of POIs are not considered a match candidate then they will never be marked as a match or review. Two POIs are a match candidate if the distance between the two nodes is less than or equal to either of the search radii and they come from two different inputs.
The search radius is determined in order of priority
1) by the 'search.radius.poi' configuration option (turned off by default)
2) by the elements circular error tag if the circular error is larger than the configured search
radius for the feature
3) by the aforementioned customizable search radii passed in as either a JSON string or file in
the configuration option: poi.search.radii
Some example search radii are below:
-
historic
- 200m -
place=city
- 5000m -
place=village
- 3000m -
amenity
- 200m
In Search Radius Example points A & B are points from the first input and point C is from the second data set. For this reason A & B will never be match candidates. A’s search radius overlaps the point C so A & C are match candidates. B and C have search radii that overlap, but B’s search radius doesn’t overlap the node C, neither does C’s search radius overlap the node B. For this reason B & C are not match candidates.
Label Based on Additive Scores
Establishing a relationship between two points is a simple additive score. Generally the score starts at zero and as evidence is found of a match the score is increased. Strong evidence of a match increases the score by 1 and weak evidence increases the score by 0.5. There are some exceptions that will be covered at the end of this section.
Some examples of evidence include:
-
Strong Evidence
-
POI types are similar (e.g.
amenity=bus_station
is similar totransport=station
) -
Additional attributes are similar (e.g.
cuisine
,power
,sport
, orart_type
) -
Very similar names
-
Weak Evidence
-
Somewhat similar names
-
Points are close together (< = 20% of the search radii)
In some special cases the scores are manipulated based on type. For instance if
one or both of the nodes have no POI type attributes then the closeness and the
name are given double the weight when scoring. If one of the POIs is a place
and the other is not, then closeness and name are given half the weight. This
pragmatic approach helps to give better performance and reduce unnecessary
reviews at the expense of complexity. In the Future
Work section we will discuss approaches to reducing or eliminating these
exceptions.
Once the score has been established a simple threshold is used to establish the relationship:
-
If the score < = 0.5 the relationship is a miss.
-
If the score is between 0.5 and 1.9 the relationship is a review.
-
If the score is >= 1.9 the relationship is a match.
Comparing POI Types
Types are not always the same in the input data. One user may have an extraction
guide that specifies a point should have the type of its primary use (e.g.
amenity=bus_station
), another input may not have a specific bus station tag
and it is simply tagged as a transport=station
. Intuitively it is obvious that
these two points could represent the same entity, however that prior knowledge
must be exposed to Hootenanny.
To do this Hootenanny uses schema files. The schema files define that an
amenity=bus_station
is similar to a transport=station
with a graph. This
graph contains both isA
and similarTo
relationships. Details on how the
graph works can be found in Calculating the
Enumerated Score.
Example Scores
The table below lists a handful of examples as well as the associated scores and relationships.
Tags 1 | Tags 2 | Score | Reasons |
---|---|---|---|
place=locality, historic=ruins, name:fr=Khirbat Masuh, int_name=Khirbat Māsūh;Khirbat Masuh |
place=populated, alt_name=Khirbat Masuh;Khirbat Māsūh;Masuh;Māsūh;maswh name=Māsūh |
1.5 Review |
very similar names, very close together |
place=village, name:en=Al Maks |
name=Al Maks, amenity=pub |
0.5 Miss |
very similar names, very close together, no place match |
barrier=toll_booth |
building=guardhouse |
1.5 Review |
very close together, similar POI type |
name=Georg-Brauchle-Ring, railway=subway_entrance |
station=light_rail, name=U-BAHN-GEORG-BRAUCHLE-RING |
2 Match |
very similar names, similar POI type |
name=Izbat Hawd an Nada, place=village |
name=Izbat as Sab’in, place=populated |
0 Miss |
None given by routine (Izbat is a common word so it is given a low weight when comparing names) |
Merging POIs
After relationships have been determined the system then determines how to apply said relationships. The simplest cases are when a point is only involved in a single relationship with no overlap between relationships. E.g. A matches only B and B matches only A. In this case the two points will be merged as expected.
However, if there are overlapping matches Hootenanny makes no attempt to determine which match is most appropriate, but marks all the overlapping matches as needing to be reviewed by the user. This does increase the number of reviews in some dense regions, but avoids some unnecessary errors in the process.
The first input is always used as the reference geometry.
The tags are merged using the default tag merging routine. Unless otherwise specified the default tag merging routine is averaging.
To evaluate the performance of automatically conflated results manually matched data is used. The manually matched data was translated into the OSM schema before matching and all non-POI features were removed (e.g. buildings polygons and roads). One data set is designated as the primary and the other as the secondary. The primary data set gets a unique identifier applied to each feature as a "REF1" tag. Then an analyst goes through all the features in the secondary dataset and assigns tags to define the relationships to the corresponding primary input features. The associated tags are listed below:
-
REF2 - This tag signifies matches and can contain either the value of a single REF1 UID or
none
. -
REVIEW - If a feature in the secondary data set should be reviewed against zero or more features then this tag is used. A feature may need to be reviewed if there isn’t enough information or the match is ambiguous. This tag will be populated with a semi-colon delimited list of REF1 UIDs.
The data sets used are varied in source and region, but for simplicity some data sets are used multiple times.
Test | Region | Source 1 | Source 2 | Input 1 POI Count | Input 2 POI Count | Approximate Area (km2) |
---|---|---|---|---|---|---|
1 |
Munich |
OSM |
NAVTEQ |
32414 |
2297 |
500 |
2 |
Egypt |
OSM |
GeoNames.org |
9017 |
6654 |
10500 |
3 |
Egypt |
OSM |
MGCP |
9017 |
186066 |
10500 |
4 |
Jordan |
OSM |
MGCP |
2691 |
59126 |
500 |
5 |
Jordan |
OSM |
GeoNames.org |
2691 |
1322 |
500 |
6 |
Washington DC |
OSM |
GeoNames.org |
15700 |
4246 |
140 |
7 |
Jordan |
MGCP |
GeoNames.org |
1322 |
330 |
500 |
All test results presented were run with Hootenanny v0.2.17-76-g140396e. An iterative approach was used to improve performance against the data sets provided. As the tests were run areas that caused errors were identified and improved.
digraph G { rankdir = LR; node [shape=record,width=2,height=1,style=filled,fillcolor="#e7e7f3"]; conflate [label = "Automatically\nConflate"]; improve [label = "Improve\nAlgorithm"]; evaluate [label = "Evaluate\nResults"]; "Manually\nMatch Data" -> conflate -> evaluate evaluate -> improve improve:s -> conflate:s }
In this test setup our testing data is used to improve the algorithm. This creates a biased test scenario, but still provides useful information. When new regions are evaluated in the future the test results are almost certain to vary based on the POI types and data quality that is provided. In other words — your mileage will vary. There can be a great deal of variance in input data sets. To get accuracy values over a new dataset a small test region should be evaluated to obtain values specific to your data set.
The test results are presented in the tables below. Note that the tables below represent the categorization of relationships between POIs (not the number of merged POIs). As such the number of POIs that do not match (miss) is very high and omitted from the tables.
The number of reviews also seems quite high, but in reality reviewing a single POI pair is relatively quick at about 12-20 seconds per review.
expected | ||||
---|---|---|---|---|
miss |
match |
review |
||
miss |
- |
269 |
43 |
|
outcome |
match |
283 |
4053 |
12 |
review |
0 |
2998 |
155 |
Test Name | miss outcome | match outcome | review outcome | Wrong | Correct | Unn. Review | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
match exp. |
review exp. |
miss exp. |
match exp. |
review exp. |
miss exp. |
match exp. |
review exp. |
||||
1 |
25 |
7 |
62 |
488 |
8 |
0 |
625 |
116 |
7.7% |
45.4% |
47.0% |
2 |
69 |
9 |
106 |
851 |
3 |
0 |
477 |
15 |
12.2% |
56.6% |
31.2% |
3 |
16 |
7 |
0 |
17 |
0 |
0 |
153 |
20 |
10.8% |
17.4% |
71.8% |
4 |
20 |
19 |
8 |
814 |
1 |
0 |
134 |
2 |
4.8% |
81.8% |
13.4% |
5 |
25 |
0 |
26 |
483 |
0 |
0 |
156 |
0 |
7.4% |
70.0% |
22.6% |
6 |
111 |
1 |
78 |
1157 |
0 |
0 |
1371 |
2 |
7.0% |
42.6% |
50.4% |
7 |
3 |
0 |
3 |
243 |
0 |
0 |
82 |
0 |
1.8% |
73.4% |
24.8% |
Total |
269 |
43 |
283 |
4053 |
12 |
0 |
2998 |
155 |
7.8% |
53.9% |
38.4% |
In this section we discuss some areas of possible improvement.
In many cases Hoot relies very heavily on name comparisons for making matches. Great promise was found by the PLACES team in using skip-grams for name comparison. We would like to investigate using skip-grams to improve performance (Note: skip-grams have been implemented, I believe, but not sure they’ve ever been tested with POI to POI - BDW).
Also, Hootenanny uses a global dictionary of word frequencies to determine the relevance of a word in a name. However, if you’re looking at the word "Pennsylvania" in Pittsburgh the weight should be low, but the same word in Indonesia would have a very high weight. We would like to investigate using a weight that dynamically changes with the region (Note: There seems to be evidence this has been done for POI to POI, but that needs to be verified - BDW).
In building and road matching we have had good success training supervised models for matching data. We would like to explore using the same techniques for matching POIs.
Generating POI training data can be a time consuming process. To increase efficiency the user could be guided through the matching process with the UI. This would dramatically speed up the process of creating training data, but there is the possibility that false negatives (matches that Hootenanny misses) will be dropped from the data.