
Different Language Support #39

Closed
ByUnal opened this issue Nov 10, 2020 · 5 comments

@ByUnal

ByUnal commented Nov 10, 2020

Hi there, this is very beautiful work. I want to use this API for languages other than English. How can I use models for other languages, e.g. from https://huggingface.co/models?search=turkish or other sources?
Can anyone help me with this?

@davidmezzetti
Member

Thank you for the support.

The best place to start is this notebook: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb

The index in the notebook above uses sentence-transformers. This link has a list of all the sentence transformer models available: https://huggingface.co/models?search=sentence-transformers

The following is an example modification using a multilingual model:

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})

If this doesn't work well, another model to try: sentence-transformers/LaBSE

Then change the sections text below to a couple of examples in the target language you want to experiment with.

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

Finally change the queries to the target language as well:

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):

I would just iterate over a couple of different models until you find one that works well. Let me know how it works out.
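
For reference, here is a minimal end-to-end sketch putting the pieces above together. It assumes the index/search flow from the introductory notebook linked above; the sections and queries shown are the English examples and should be swapped for target-language text.

from txtai.embeddings import Embeddings

# Multilingual model suggested above; any other candidate path can be swapped in
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})

# Replace with a couple of examples in the target language
sections = ["US tops 5 million confirmed virus cases",
            "Maine man wins $1M from $25 lottery ticket"]

# Build the index over (id, text, tags) tuples
embeddings.index([(uid, text, None) for uid, text in enumerate(sections)])

# Replace the queries with the target language as well
for query in ("feel good story", "health"):
    # search returns a list of (id, score) tuples, best match first
    uid = embeddings.search(query, 1)[0][0]
    print("%-20s %s" % (query, sections[uid]))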

@ByUnal
Author

ByUnal commented Nov 10, 2020

I appreciate your feedback and concern. I will try your recommendation as soon as I'm available, and then I will let you know.

@ByUnal
Author

ByUnal commented Nov 13, 2020

Hello again. I have now tried the transformers models you suggested, both of them for Turkish. The first one worked in some cases, but it was not efficient. The second one didn't work. I need Turkish language support for transformers.

@davidmezzetti
Member

Another one to try: sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking

Otherwise you can also try any of the generic Turkish transformer models: https://huggingface.co/models?search=turkish

The multilingual models are the ones that should support multiple languages. I suspect that a model trained specifically for Turkish, and on an NLI/STS-B-like task for Turkish, would work best.

txtai uses the sentence-transformers library to build transformer-based sentence embeddings. I would suggest trying as many of these models as you can to see if any of them work at an acceptable level for your task.
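
As a rough sketch of that comparison loop (the candidate list below is just the models mentioned in this thread, and the sections/queries are placeholders for your own Turkish text):

from txtai.embeddings import Embeddings

# Models mentioned above; extend with any result from
# https://huggingface.co/models?search=turkish
candidates = ["sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens",
              "sentence-transformers/LaBSE",
              "sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking"]

# Replace with Turkish sections and queries to evaluate
sections = ["..."]
queries = ["..."]

for path in candidates:
    # Build an index with this model and run the test queries against it
    embeddings = Embeddings({"method": "transformers", "path": path})
    embeddings.index([(uid, text, None) for uid, text in enumerate(sections)])

    print(path)
    for query in queries:
        uid = embeddings.search(query, 1)[0][0]
        print("%-20s %s" % (query, sections[uid]))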

davidmezzetti added a commit that referenced this issue Dec 28, 2020
@davidmezzetti
Member

The issue should now be resolved. Tokenization can be disabled by setting the config option:

Embeddings({"method": "transformers", "path": "/path/to/model", "tokenize": False})
