Tantivy index #149

jbothma · 2024-06-02T12:35:05Z

No description provided.

pudo · 2024-06-03T06:13:22Z

To my mind, one of the next steps here could be swapping out the tokenizer for a custom routine. The NK tokenizer does a ton of things that may not be needed at all, and I'd rather rely on how the tantivy people did it. We might want to keep transliteration and a few bits of normalisation in place, but on a per-type basis :)
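As a rough sketch of what "per-type" normalisation could look like, assuming plain Python with only the standard library: the type names (`name`, `identifier`), the function names, and the accent-folding stand-in for transliteration are all hypothetical and not taken from this PR or from tantivy.

```python
import unicodedata


def fold_ascii(text: str) -> str:
    """Crude stand-in for transliteration: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(value: str) -> str:
    # Names: fold accents, casefold, and collapse runs of whitespace.
    return " ".join(fold_ascii(value).casefold().split())


def normalize_identifier(value: str) -> str:
    # Identifiers: keep only letters and digits, then uppercase.
    return "".join(ch for ch in value if ch.isalnum()).upper()


# Hypothetical mapping from a property type to its normaliser.
NORMALIZERS = {
    "name": normalize_name,
    "identifier": normalize_identifier,
}


def normalize(type_name: str, value: str) -> str:
    # Types without an entry pass through unchanged, leaving the rest
    # of the work to the index's own tokenizer.
    return NORMALIZERS.get(type_name, lambda v: v)(value)
```

Types with no entry in the mapping are left untouched, which matches the idea of relying on the index's built-in tokenisation by default and only normalising where a type needs it.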

jbothma · 2024-06-03T08:06:35Z

Agreed.

I just want to spend a little time working out why some of the entities we're not matching are missing: whether they're dropping out because they never come up as candidates, or because of a bug somewhere. But I definitely want to make better use of the tokenisation available in tantivy.

It's giving results as if only one term needs to match, and multiple matches aren't scoring high

jbothma added 2 commits June 2, 2024 12:55

Tantivy index with basic test

90d2ae8

Type issues

a6c3520

jbothma mentioned this pull request Jun 2, 2024

Local enricher using tantivy index opensanctions/opensanctions#907

Merged

jbothma added 4 commits June 3, 2024 17:16

Reuse existing index for quicker dev iteration

c115243

Install tantivy for CI tests

a3e5601

Inline tokeniser for type sadness, to customise soon

d64ecd3

Merge branch 'main' into tantivy-index

c09ee38

jbothma force-pushed the tantivy-index branch from 4890044 to c09ee38 Compare June 7, 2024 14:27

jbothma added 5 commits June 7, 2024 17:36

Rough tantivish tokenisation and querying

b031a30

fail

a3172db

Give up on TermSetQuery

e3f2409

It's giving results as if only one term needs to match, and multiple matches aren't scoring high
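The symptom described in this commit message can be illustrated with a toy model in plain Python. This is not tantivy's API or scoring formula; the function names, documents, and scores are made up for illustration. It only contrasts a "match if any term is present, with a constant score" query against one that rewards each additional matching term.

```python
def any_term_score(query_terms, doc_terms):
    """Constant score as soon as at least one query term is present."""
    return 1.0 if set(query_terms) & set(doc_terms) else 0.0


def summed_term_score(query_terms, doc_terms):
    """One point per distinct query term found in the document."""
    return float(len(set(query_terms) & set(doc_terms)))


# Hypothetical documents, already tokenised.
docs = {
    "full": ["vladimir", "putin"],
    "partial": ["vladimir", "ivanov"],
    "none": ["angela", "merkel"],
}
query = ["vladimir", "putin"]

# With set-membership scoring, the full and partial matches tie.
any_scores = {key: any_term_score(query, terms) for key, terms in docs.items()}

# With per-term scoring, the document matching both terms ranks first.
summed_scores = {key: summed_term_score(query, terms) for key, terms in docs.items()}
```

Under the first function a candidate sharing one token is indistinguishable from one sharing all of them, which is the behaviour the message complains about; the second separates them.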

Exclude url fields

0552066

Fix type issues

e8e08bf

jbothma marked this pull request as ready for review June 10, 2024 13:36

jbothma merged commit 03559d3 into main Jun 12, 2024
5 checks passed

jbothma deleted the tantivy-index branch June 12, 2024 11:24

jbothma mentioned this pull request Jun 13, 2024

Make blocking mechanism more efficient #68

Closed
