Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improving inter-layer deduplication #1557

Open
missinglink opened this issue Sep 13, 2021 · 1 comment
Open

improving inter-layer deduplication #1557

missinglink opened this issue Sep 13, 2021 · 1 comment
Labels

Comments

@missinglink
Copy link
Member

missinglink commented Sep 13, 2021

The existing helper/diffPlaces::isLayerDifferent() function provides the ability to deduplicate between disparate layers in certain conditions.

It provides an Array called $layerPreferences which indicates which layers can be considered synonymous for deduplication purposes.

I think this this array may be causing a lot of the confusion when working with the deduplication code and potentially creating errors, the reason for this is that the list contains 9 terms and considers them all to synonymous with each other, so neighbourhood and empire can be deduped, for example, which seems wrong to me.

The dedupe.js logic is pretty complicated, once it finds a set of 'potential matches' then it selects a 'master' record and, considering the others as dupes, adds them to a skip list which excludes them from results.

Since the layerPreferences matching is quite broad, many places are added to the skip list which shouldn't really be there!

I had a quick play around with the code but didn't settle on a solution yet, I think the main improvement would be to have some sort of layer equivalency matrix which allows us to be more specific about exactly which layers on the layerPreferences lists can be combined which which other ones.

Eg:

diff --git a/helper/diffPlaces.js b/helper/diffPlaces.js
index 0094f212..b9804063 100644
--- a/helper/diffPlaces.js
+++ b/helper/diffPlaces.js
@@ -20,6 +20,25 @@ const layerPreferences = [
   'empire'
 ];

+const layerEquivalencyMatrix = [
+  [0,1,1,1,1,1,1,1,1],
+  [1,0,1,1,1,1,1,1,1],
+  [1,1,0,1,1,1,1,1,1],
+  [1,1,1,0,1,1,1,1,1],
+  [1,1,1,1,0,1,1,1,1],
+  [1,1,1,1,1,0,1,1,1],
+  [1,1,1,1,1,1,0,1,1],
+  [1,1,1,1,1,1,1,0,1],
+  [1,1,1,1,1,1,1,1,0]
+];
+
+const layersAreEquivalent = (layer1, layer2) => {
+  return _.get(layerEquivalencyMatrix, [
+    _.indexOf(layerPreferences, layer1),
+    _.indexOf(layerPreferences, layer2)
+  ]) === 1
+}
+
 /**
  * Compare the layer properties if they exist.
  * Returns false if the objects are the same, else true.
@@ -31,8 +50,8 @@ function isLayerDifferent(item1, item2){
         ( item2.layer === 'venue' || !_.includes( canonicalLayers, item2.layer ) ) ){
       return false;
     }
-    // consider some layers to be synonymous
-    if( _.includes( layerPreferences, item1.layer ) && _.includes( layerPreferences, item2.layer ) ){
+    // consider some layers to be equivalent
+    if (layersAreEquivalent(item1.layer, item2.layer)){
       return false;
     }
     return true;
@orangejulius
Copy link
Member

Agreed, this would definitely be helpful. I think the list of "equivalent" layers has slowly grown well beyond what's reasonable.

I know there are lots of inter-layer deduplication cases to consider, but hopefully we could considerably reduce the scope of inter-layer deduping with a change like you proposed. The "widest" required deduping I can recall is deduping Luxembourg across locality, localadmin, region, and country, which is considerably more constrained than what we technically now allow for.

In terms of implementation, I'd just request we don't actually have a literal matrix of 0's and 1's (hopefully that doesn't need to be said 😆).

Maybe we keep things simple with an array of layer pairs that are equivalent, so something like this:

const equivalent_layers = [
  [ 'neighbourhood', 'locality'],
  [ 'locality', 'localadmin' ],
  // etc
];

We'd have to ensure our code handles things in either order (A locality is equivalent to a localadmin, and a localadmin is equivalent to a locality), I imagine, unless there's a reason we don't want that (I can't think of one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants