Tantivy index #149

Merged
merged 11 commits into from
Jun 12, 2024

Conversation

jbothma
Contributor

@jbothma jbothma commented Jun 2, 2024

No description provided.

@pudo
Member

pudo commented Jun 3, 2024

To my mind, one of the next steps here could be swapping out the tokenizer for a custom routine. The NK tokenizer does a ton of things that may not at all be needed, and I'd rather rely on how the tantivy people did it. We might want to keep transliteration and a few bits of normalisation in place, but more on a per-type basis :)
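For illustration, a minimal sketch of what "normalisation on a per-type basis" could look like. This is not the code from this PR or from tantivy: the type names (`name`, `identifier`), the function names, and the diacritic-stripping stand-in for transliteration are all assumptions made up for the example.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip diacritics via NFKD decomposition (a crude stand-in for real transliteration)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize_name(value: str) -> list[str]:
    """Names: fold diacritics, casefold, split on anything that is not a word character."""
    folded = ascii_fold(value).casefold()
    return [tok for tok in re.split(r"[^\w]+", folded) if tok]


def tokenize_identifier(value: str) -> list[str]:
    """Identifiers: keep a single token, uppercased, with separators removed."""
    compact = "".join(ch for ch in value.upper() if ch.isalnum())
    return [compact] if compact else []


# Hypothetical mapping from property type to tokenizer.
TOKENIZERS = {
    "name": tokenize_name,
    "identifier": tokenize_identifier,
}


def tokenize(type_name: str, value: str) -> list[str]:
    """Pick the tokenizer for the given type, falling back to name-style splitting."""
    return TOKENIZERS.get(type_name, tokenize_name)(value)
```

With this sketch, `tokenize("name", "José Müller-Schmidt")` splits into three folded tokens, while `tokenize("identifier", "gb 123-456")` keeps the value as one compact token, so only the types that benefit from heavier normalisation pay for it.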

@jbothma
Contributor Author

jbothma commented Jun 3, 2024

Agreed.

I just want to spend a little time working out why some of the entities we fail to match are missing: whether they drop out because they never come up as candidates, or because of a bug somewhere. But I definitely want to make better use of the tokenisation available in tantivy.

It's giving results as if only one term needs to match, and candidates that match multiple terms aren't scoring high.
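To make the symptom concrete, here is a toy sketch of the difference between "any term may match" and "all terms must match". It is an assumption-laden illustration in plain Python, not tantivy's query parser or its scoring: the `search` function, the `require_all` flag, and the fraction-of-terms score are all invented for the example.

```python
def search(query: str, docs: dict[str, str], require_all: bool = False) -> list[tuple[str, float]]:
    """Rank docs by the fraction of distinct query terms they contain.

    require_all=False: a single shared term is enough to be returned (disjunction).
    require_all=True: every query term must be present (conjunction).
    """
    query_terms = set(query.lower().split())
    if not query_terms:
        return []
    results = []
    for doc_id, text in docs.items():
        hits = len(query_terms & set(text.lower().split()))
        if hits == 0:
            continue  # no overlap at all, never a candidate
        if require_all and hits < len(query_terms):
            continue  # conjunction: drop partial matches
        results.append((doc_id, hits / len(query_terms)))
    # Highest score first; tie-break on id so the order is deterministic.
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

In disjunctive mode a document sharing only one term with the query still comes back as a candidate, so whether full matches stand out depends entirely on how much the score rewards additional matching terms.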
@jbothma jbothma marked this pull request as ready for review June 10, 2024 13:36
@jbothma jbothma merged commit 03559d3 into main Jun 12, 2024
5 checks passed
@jbothma jbothma deleted the tantivy-index branch June 12, 2024 11:24