The principal function of nomenklatura is to de-duplicate entity data, i.e. to find matches between different datasets (or within one dataset) and then merge multiple records in a useful way.
The first problem to solve on that path is to decide which entities might be the same, out of the full pool. For two datasets of 200,000 and 5,000,000 entities respectively, a full cross-comparison would mean 200,000 × 5,000,000 = 1,000,000,000,000 pairwise checks. To avoid that, we have implemented a faster way to find potential matches.
This mechanism - blocking - works by constructing an in-memory search index of the whole dataset. An entity like {'id': 'Q7747', 'name': 'Vladimir Putin'} gets turned into inverted index entries ({'vladimir': ['Q7747'], 'putin': ['Q7747']}), and we then generate pairwise scores for all entities by counting up how many index terms they have in common (while weighting the index terms using tf/idf). You can see this implemented here:
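For illustration, here is a minimal, self-contained sketch of that idea in plain Python. All names (`tokenize`, the toy `entities` dict) are hypothetical; nomenklatura's actual index is more elaborate than this:

```python
import math
from collections import defaultdict
from itertools import combinations

# Toy dataset; in practice this would be millions of entities.
entities = {
    "Q7747": "Vladimir Putin",
    "Q567": "Angela Merkel",
    "Q999": "Wladimir Putin",
}

def tokenize(name: str) -> list[str]:
    return name.lower().split()

# Inverted index: token -> list of entity IDs containing that token.
index: dict[str, list[str]] = defaultdict(list)
for entity_id, name in entities.items():
    for token in tokenize(name):
        index[token].append(entity_id)

# Score candidate pairs by their shared tokens, weighted by inverse
# document frequency so that rare tokens count for more.
num_entities = len(entities)
scores: dict[tuple[str, str], float] = defaultdict(float)
for token, entity_ids in index.items():
    idf = math.log(num_entities / len(entity_ids))
    for left, right in combinations(sorted(set(entity_ids)), 2):
        scores[(left, right)] += idf

# Pairs sharing rare tokens (here "putin") float to the top.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Note that only entities sharing at least one token ever produce a pair, which is what lets blocking sidestep the full n × m comparison.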
We need to improve the memory footprint and speed of doing this. I think it could be worthwhile exploring:
- Using numpy to store the inverted index and to generate the pairwise scoring as a numpy array (see the sketch after this list)
- Using more caching in the index tokenizer to speed it up
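On the numpy idea, one possible direction is sketched below. This is a hypothetical illustration using scipy.sparse (which would be a new dependency, not something nomenklatura currently uses): represent the index as a sparse entity-by-term matrix, so that all pairwise scores fall out of a single sparse matrix product. For the caching point, wrapping the tokenizer's normalization step in functools.lru_cache would be one low-effort option.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical sketch: a sparse entity-by-term matrix standing in for
# the inverted index. rows/cols would be filled in while tokenizing.
num_entities, num_terms = 4, 3
rows = np.array([0, 0, 1, 2, 2, 3])   # entity indices
cols = np.array([0, 1, 1, 0, 2, 2])   # term indices
weights = np.ones(len(rows))

matrix = csr_matrix((weights, (rows, cols)),
                    shape=(num_entities, num_terms))

# IDF weight per term (column): rarer terms get a higher weight.
doc_freq = matrix.getnnz(axis=0)
idf = np.log(num_entities / doc_freq)
matrix = matrix.multiply(idf.reshape(1, -1)).tocsr()

# Pairwise scores in one shot: entry (i, j) sums the squared IDF
# weights of the terms shared by entities i and j.
pairwise = matrix @ matrix.T

# Candidate pairs are the nonzero off-diagonal entries.
coo = pairwise.tocoo()
for i, j, score in zip(coo.row, coo.col, coo.data):
    if i < j:
        print(int(i), int(j), round(float(score), 3))
```

Because the product stays sparse, memory scales with the number of actually-overlapping pairs rather than with the square of the entity count.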
For context: once the entities are blocked, a scoring mechanism is then used to decide which ones to either auto-match, or present to a user for arbitration.