Make blocking mechanism more efficient #68

Closed
pudo opened this issue Apr 17, 2022 · 2 comments

Comments

@pudo
Member

pudo commented Apr 17, 2022

The principal function of nomenklatura is to de-duplicate entity data, i.e. to find matches between different datasets (or within one dataset) and then merge multiple records in a useful way.

The first problem to solve on that path is to decide which entities might be the same, out of the full pool. If you think of two datasets sized 200,000 and 5,000,000 entities respectively, that's 1,000,000,000,000 comparisons. To avoid that, we have implemented a faster way to find potential matches.

This mechanism, blocking, works by constructing an in-memory search index of the whole dataset: an entity such as {'id': 'Q7747', 'name': 'Vladimir Putin'} gets turned into inverted index entries ({'vladimir': ['Q7747'], 'putin': ['Q7747']}). We then generate pairwise scores for all entities by counting how many index terms they have in common, weighting the index terms using TF-IDF. You can see this implemented here:
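
To illustrate the idea (this is not the nomenklatura code itself), here is a minimal Python sketch of blocking with an inverted index. The function names `tokenize`, `build_index` and `score_pairs`, the sample entities other than Q7747, and the plain IDF weighting are all assumptions made for the example; the real implementation may tokenize and weight terms differently.

```python
import math
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(entities):
    # Inverted index: token -> set of entity IDs containing that token.
    index = defaultdict(set)
    for entity in entities:
        for token in tokenize(entity["name"]):
            index[token].add(entity["id"])
    return index


def score_pairs(index, total):
    # Only entities sharing at least one token are ever paired, which is
    # what avoids comparing every entity against every other entity.
    scores = defaultdict(float)
    for token, ids in index.items():
        if len(ids) < 2:
            continue
        # IDF-style weight (an assumption): rarer tokens count for more.
        weight = math.log(total / len(ids))
        for left, right in combinations(sorted(ids), 2):
            scores[(left, right)] += weight
    return scores


# Illustrative data; only Q7747 comes from the text above.
entities = [
    {"id": "Q7747", "name": "Vladimir Putin"},
    {"id": "x-1", "name": "PUTIN Vladimir"},
    {"id": "x-2", "name": "Vladimir Kara-Murza"},
    {"id": "x-3", "name": "Angela Merkel"},
]
index = build_index(entities)
scores = score_pairs(index, len(entities))
best = max(scores, key=scores.get)
```

In this sketch `best` is the pair that shares both tokens, and entities with no token in common never appear in `scores` at all. Note that a very common token still produces a large number of pairs, which is one source of the cost discussed below.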

We need to improve the memory footprint and speed of doing this. I think it could be worthwhile exploring:

  • Using numpy to store the inverted index and to generate the pairwise scoring as a numpy array
  • Using more caching in the index tokenizer to speed it up
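
A rough sketch of what these two ideas could look like together, purely as an assumption about one possible approach: the tokenizer is memoized with `functools.lru_cache`, and the index is held as an entity-by-token numpy matrix so that all pairwise scores come out of a single matrix product. The names `tokenize`, `names`, `matrix` and the sample data are hypothetical.

```python
from functools import lru_cache

import numpy as np


@lru_cache(maxsize=100_000)
def tokenize(text: str) -> tuple:
    # Cached tokenizer: repeated names are only tokenized once.
    # Returns a tuple because cached values should be immutable.
    return tuple(text.lower().split())


# Illustrative data (hypothetical IDs and names).
names = {
    "a": "Vladimir Putin",
    "b": "PUTIN Vladimir",
    "c": "Angela Merkel",
    "d": "Vladimir Putin",
}
ids = list(names)
vocab = sorted({t for n in names.values() for t in tokenize(n)})
col = {token: i for i, token in enumerate(vocab)}

# Entity x token matrix: 1.0 where the entity contains the token.
matrix = np.zeros((len(ids), len(vocab)), dtype=np.float32)
for row, entity_id in enumerate(ids):
    for token in tokenize(names[entity_id]):
        matrix[row, col[token]] = 1.0

# IDF-style weight per token (an assumption about the weighting scheme).
doc_freq = matrix.sum(axis=0)
idf = np.log(len(ids) / doc_freq)

# scores[i, j] = sum of the weights of tokens shared by entities i and j.
scores = (matrix * idf) @ matrix.T
np.fill_diagonal(scores, 0.0)  # ignore self-matches
```

A dense matrix like this would not fit in memory for millions of entities; at that scale a sparse representation would be needed, so treat this only as a demonstration of the vectorized scoring step.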

For context: once the entities are blocked, a scoring mechanism is then used to decide which ones to either auto-match, or present to a user for arbitration.

@pudo
Member Author

pudo commented Jun 13, 2024

@jbothma we could close this, too :)

@jbothma
Contributor

jbothma commented Jun 13, 2024

We've added an index class TantivyIndex using Tantivy for blocking #149

jbothma closed this as completed Jun 13, 2024