The principal function of nomenklatura is to de-duplicate entity data, i.e. to find matches between different datasets (or within one dataset) and then merge multiple records in a useful way.
The first problem to solve on that path is to decide which entities might be the same, out of the full pool. For two datasets of 200,000 and 5,000,000 entities respectively, a full cross-comparison would mean 200,000 × 5,000,000 = 1,000,000,000,000 pairwise checks. To avoid that, we have implemented a faster way to find potential matches.
This mechanism - blocking - works by constructing an in-memory search index of the whole dataset. An entity like {'id': 'Q7747', 'name': 'Vladimir Putin'} gets turned into inverted index entries ({'vladimir': ['Q7747'], 'putin': ['Q7747']}), and we then generate pairwise scores for all entities by counting up how many index terms they have in common (while weighting the index terms using tf/idf). You can see this implemented here:
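For illustration, here is a minimal, self-contained sketch of that idea in plain Python. All names (`tokenize`, the toy `entities` dict) are hypothetical; nomenklatura's actual index is more elaborate than this:

```python
import math
from collections import defaultdict
from itertools import combinations

# Toy dataset; in practice this would be millions of entities.
entities = {
    "Q7747": "Vladimir Putin",
    "Q567": "Angela Merkel",
    "Q999": "Wladimir Putin",
}

def tokenize(name: str) -> list[str]:
    return name.lower().split()

# Inverted index: token -> list of entity IDs containing that token.
index: dict[str, list[str]] = defaultdict(list)
for entity_id, name in entities.items():
    for token in tokenize(name):
        index[token].append(entity_id)

# Score candidate pairs by their shared tokens, weighted by inverse
# document frequency so that rare tokens count for more.
num_entities = len(entities)
scores: dict[tuple[str, str], float] = defaultdict(float)
for token, entity_ids in index.items():
    idf = math.log(num_entities / len(entity_ids))
    for left, right in combinations(sorted(set(entity_ids)), 2):
        scores[(left, right)] += idf

# Pairs sharing rare tokens (here "putin") float to the top.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Note that only entities sharing at least one token ever produce a pair, which is what lets blocking sidestep the full n × m comparison.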
We need to improve the memory footprint and speed of doing this. I think it could be worthwhile exploring:
- Using numpy to store the inverted index and to generate the pairwise scoring as a numpy array (see the sketch after this list)
- Using more caching in the index tokenizer to speed it up
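On the numpy idea, one possible direction is sketched below. This is a hypothetical illustration using scipy.sparse (which would be a new dependency, not something nomenklatura currently uses): represent the index as a sparse entity-by-term matrix, so that all pairwise scores fall out of a single sparse matrix product. For the caching point, wrapping the tokenizer's normalization step in functools.lru_cache would be one low-effort option.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical sketch: a sparse entity-by-term matrix standing in for
# the inverted index. rows/cols would be filled in while tokenizing.
num_entities, num_terms = 4, 3
rows = np.array([0, 0, 1, 2, 2, 3])   # entity indices
cols = np.array([0, 1, 1, 0, 2, 2])   # term indices
weights = np.ones(len(rows))

matrix = csr_matrix((weights, (rows, cols)),
                    shape=(num_entities, num_terms))

# IDF weight per term (column): rarer terms get a higher weight.
doc_freq = matrix.getnnz(axis=0)
idf = np.log(num_entities / doc_freq)
matrix = matrix.multiply(idf.reshape(1, -1)).tocsr()

# Pairwise scores in one shot: entry (i, j) sums the squared IDF
# weights of the terms shared by entities i and j.
pairwise = matrix @ matrix.T

# Candidate pairs are the nonzero off-diagonal entries.
coo = pairwise.tocoo()
for i, j, score in zip(coo.row, coo.col, coo.data):
    if i < j:
        print(int(i), int(j), round(float(score), 3))
```

Because the product stays sparse, memory scales with the number of actually-overlapping pairs rather than with the square of the entity count.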
For context: once the entities are blocked, a scoring mechanism is then used to decide which ones to either auto-match, or present to a user for arbitration.