
Different Language Support #39

Closed
ByUnal opened this issue Nov 10, 2020 · 5 comments

@ByUnal

ByUnal commented Nov 10, 2020

Hi there, this is very beautiful work. I want to use this API for languages other than English. How can I use models for other languages, e.g. from https://huggingface.co/models?search=turkish or other sources?
Can anyone help me with this?

@davidmezzetti
Member

Thank you for the support.

The best place to start is this notebook: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb

The index in the notebook above uses sentence-transformers. This link has a list of all the sentence transformer models available: https://huggingface.co/models?search=sentence-transformers

The following is an example modification using a multilingual model:

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})

If this doesn't work well, another model to try: sentence-transformers/LaBSE

Then change the sections text below to a couple of examples in the target language you want to experiment with.

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

Finally change the queries to the target language as well:

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):

I would just iterate over a couple of different models until you find one that works well. Let me know how it works out.
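
For reference, here is a minimal end-to-end sketch putting the pieces above together. It assumes the index/search flow from the introductory notebook linked above; the sections and queries shown are the English examples and should be swapped for target-language text.

from txtai.embeddings import Embeddings

# Multilingual model suggested above; any other candidate path can be swapped in
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})

# Replace with a couple of examples in the target language
sections = ["US tops 5 million confirmed virus cases",
            "Maine man wins $1M from $25 lottery ticket"]

# Build the index over (id, text, tags) tuples
embeddings.index([(uid, text, None) for uid, text in enumerate(sections)])

# Replace the queries with the target language as well
for query in ("feel good story", "health"):
    # search returns a list of (id, score) tuples, best match first
    uid = embeddings.search(query, 1)[0][0]
    print("%-20s %s" % (query, sections[uid]))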

@ByUnal
Author

ByUnal commented Nov 10, 2020

I appreciate your feedback and concern. I will try your recommendation as soon as I'm available, and then I will let you know.

@ByUnal
Author

ByUnal commented Nov 13, 2020

Hello again. I have now tried the transformers models you suggested, both of them for Turkish. The first one worked in some cases, but it was not efficient. The second one didn't work. I need Turkish language support for transformers.

@davidmezzetti
Member

Another one to try: sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking

Otherwise you can also try any of the generic Turkish transformer models: https://huggingface.co/models?search=turkish

The multilingual models are the ones that should support multiple languages. I suspect that a model trained specifically for Turkish, and on an NLI/STS-B-like task for Turkish, would work best.

txtai uses the sentence-transformers library to build transformer-based sentence embeddings. I would suggest trying as many of these models as you can to see if any of them work at an acceptable level for your task.
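
As a rough sketch of that comparison loop (the candidate list below is just the models mentioned in this thread, and the sections/queries are placeholders for your own Turkish text):

from txtai.embeddings import Embeddings

# Models mentioned above; extend with any result from
# https://huggingface.co/models?search=turkish
candidates = ["sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens",
              "sentence-transformers/LaBSE",
              "sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking"]

# Replace with Turkish sections and queries to evaluate
sections = ["..."]
queries = ["..."]

for path in candidates:
    # Build an index with this model and run the test queries against it
    embeddings = Embeddings({"method": "transformers", "path": path})
    embeddings.index([(uid, text, None) for uid, text in enumerate(sections)])

    print(path)
    for query in queries:
        uid = embeddings.search(query, 1)[0][0]
        print("%-20s %s" % (query, sections[uid]))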

davidmezzetti added a commit that referenced this issue Dec 28, 2020
@davidmezzetti
Member

The issue should now be resolved. Tokenization can be disabled by setting the config option:

Embeddings({"method": "transformers", "path": "/path/to/model", "tokenize": False})
