Language and Locale #19

Closed
silvadenisaraujo opened this issue Aug 29, 2020 · 4 comments

@silvadenisaraujo

Dear committers,

I would like to use txtai for search queries, but my content is not in English. Are there parameters that can be provided to improve the results based on language and locale?

Thanks,

@davidmezzetti
Member

There are additional models available on Hugging Face's model hub.

There are also additional models provided by the Sentence Transformers team. #17 has an example of how to use one of these.

@silvadenisaraujo
Author

Thanks for the answer, I will try a couple of different models from the hub!

Adding to that, do you have an example of how to pass a tokenizer, so that the default is not used at:

self.tokenizer = tokenizer if tokenizer else Tokenizer

@davidmezzetti
Member

davidmezzetti commented Aug 29, 2020

For the tokenizer, you can pass any object that implements the method:

def tokenize(text)

In this method, you can split the text as you see fit. It could be as simple as text.split(). There is also a lower-level abstraction called Pipeline, which you may find more useful if you want to skip the tokenization/content filtering process altogether.
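
A minimal sketch of the idea, for illustration only. The names UnicodeTokenizer, SplitTokenizer and Indexer are hypothetical and are not part of txtai; Indexer is just a stand-in for whichever class holds the assignment quoted above, and it assumes the tokenizer is called as tokenizer.tokenize(text).

```python
import re


class SplitTokenizer:
    """Hypothetical default: plain whitespace splitting."""

    @staticmethod
    def tokenize(text):
        return text.split()


class UnicodeTokenizer:
    """Hypothetical custom tokenizer for non-English text."""

    @staticmethod
    def tokenize(text):
        # \w is Unicode-aware in Python 3, so accented letters stay inside
        # their tokens; punctuation and whitespace act as separators.
        return [token for token in re.split(r"\W+", text.lower()) if token]


class Indexer:
    """Hypothetical stand-in for the class that stores the tokenizer."""

    def __init__(self, tokenizer=None):
        # Same pattern as the quoted line: use the default when none is passed.
        self.tokenizer = tokenizer if tokenizer else SplitTokenizer

    def tokens(self, text):
        # Assumption: the tokenizer is only ever used through tokenize(text).
        return self.tokenizer.tokenize(text)


if __name__ == "__main__":
    print(Indexer().tokens("Olá, mundo!"))
    print(Indexer(UnicodeTokenizer).tokens("Olá, mundo!"))
```

Because only tokenize(text) is required, either a class with a static method or an instance with a bound method should work under this assumption.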

@silvadenisaraujo
Author

Thanks a lot 👍
