Language and Locale #19

silvadenisaraujo · 2020-08-29T10:18:54Z

Dear committers,

I would like to use txtai for search queries, but my content is not in English. Are there parameters that can be provided to improve the results based on language and locale?

Thanks,

davidmezzetti · 2020-08-29T11:59:35Z

There are additional models available on Hugging Face's model hub.

There are also additional models provided by the Sentence Transformers team. #17 has an example of how to use one of these.

silvadenisaraujo · 2020-08-29T12:49:02Z

Thanks for the answer, I will try a couple of different models from the hub!

Adding to that, do you have an example of how to pass a tokenizer, so that the default is not used at:

txtai/src/python/txtai/extractor.py

Line 33 in c85256f

self.tokenizer = tokenizer if tokenizer else Tokenizer

davidmezzetti · 2020-08-29T13:52:58Z

For the tokenizer, you can pass any object that implements the method:

def tokenize(text)

In this method, you can split the text as you see fit. It could be as simple as text.split(). There is also a lower-level abstraction called Pipeline, which you may find more useful if you want to skip the tokenization/content filtering process altogether.
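As a rough illustration of the tokenizer contract described above, here is a minimal sketch of an object exposing a tokenize(text) method. The class name SimpleTokenizer and the regex-based splitting rule are assumptions made for this example and are not part of txtai; the only requirement taken from the thread is the tokenize(text) method itself.

```python
import re


class SimpleTokenizer:
    """Hypothetical tokenizer for illustration; not part of txtai."""

    @staticmethod
    def tokenize(text):
        # Lowercase the text, then split on runs of non-word characters.
        # In Python 3, \W is Unicode-aware for str patterns, so accented
        # letters stay inside their tokens.
        return [token for token in re.split(r"\W+", text.lower()) if token]


if __name__ == "__main__":
    print(SimpleTokenizer.tokenize("Olá, como você está?"))
```

Because tokenize is a static method, either the class or an instance of it satisfies the contract, which matches how the default Tokenizer is referenced in the line quoted earlier in the thread. How the object is handed to the extractor should be checked against the txtai version in use.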

silvadenisaraujo · 2020-08-29T13:56:46Z

Thanks a lot 👍

silvadenisaraujo closed this as completed Aug 29, 2020
