Different Language Support #39
Comments
Thank you for the support. The best place to start is this notebook: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb

The index in the notebook above uses sentence-transformers. This link has a list of all the sentence transformer models available: https://huggingface.co/models?search=sentence-transformers

The following is an example modification using a multi-lingual model:

    # Create embeddings model, backed by sentence-transformers & transformers
    embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})

If this doesn't work well, another model to try: sentence-transformers/LaBSE

Then change the sections text below to a couple examples in the target language you want to experiment with.

    sections = ["US tops 5 million confirmed virus cases",
                "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
                "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
                "The National Park Service warns against sacrificing slower friends in a bear attack",
                "Maine man wins $1M from $25 lottery ticket",
                "Make huge profits without work, earn up to $100,000 a day"]

Finally change the queries to the target language as well:

    for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):

I would just iterate over a couple different models until you find one that works well. Let me know how it works out. |
I appreciate your feedback and concern. I will try your recommendation as soon as I'm available and then let you know. |
Hello again, I have now tried the transformers that you suggested. I tried both of them for Turkish. The first one worked in some cases, but it is not efficient. The second one didn't work. I need Turkish language support for transformers. |
Another one to try: sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking Otherwise you can also try any of the generic Turkish transformer models: https://huggingface.co/models?search=turkish Those multilingual models are the ones that should support multiple languages. I suspect a model trained specifically for Turkish, on an NLI/STSB-like task in Turkish, would work best. txtai uses the sentence-transformers library to build transformer-based sentence embeddings. I would suggest trying as many of the models as you can to see if any of them work at an acceptable level for your task. |
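One way to make "try several models and see which works" less of a guessing game is to score each candidate on a handful of queries where you already know which section should win. The following is only a rough sketch of that idea, not anything built into txtai: `top1_accuracy`, `rank_models`, and the two toy scorers are hypothetical names made up for illustration. A real scorer would wrap an embeddings model loaded from one of the paths above and return one similarity score per section. That wiring is left out here because the exact return shape of the similarity call may differ between library versions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A scorer takes (query, sections) and returns one similarity score per section.
# Hypothetical interface: wrap whatever model you are evaluating to match it.
Scorer = Callable[[str, Sequence[str]], List[float]]


def top1_accuracy(scorer: Scorer,
                  sections: Sequence[str],
                  labelled_queries: Sequence[Tuple[str, int]]) -> float:
    """Fraction of queries whose best-scoring section is the expected one."""
    hits = 0
    for query, expected in labelled_queries:
        scores = scorer(query, sections)
        # Index of the highest score (first index wins on ties)
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == expected)
    return hits / len(labelled_queries)


def rank_models(scorers: Dict[str, Scorer],
                sections: Sequence[str],
                labelled_queries: Sequence[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Evaluate every candidate and sort from best to worst accuracy."""
    results = [(name, top1_accuracy(scorer, sections, labelled_queries))
               for name, scorer in scorers.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy stand-ins so the sketch runs without downloading any model.
def word_overlap_scorer(query: str, sections: Sequence[str]) -> List[float]:
    """Counts shared lowercase words; NOT a real embedding model."""
    words = set(query.lower().split())
    return [float(len(words & set(s.lower().split()))) for s in sections]


def constant_scorer(query: str, sections: Sequence[str]) -> List[float]:
    """Scores every section the same, so it always picks the first one."""
    return [0.0 for _ in sections]


sections = ["Maine man wins $1M from $25 lottery ticket",
            "US tops 5 million confirmed virus cases"]
# (query, index of the section that should match); replace with target-language data
labelled_queries = [("lottery ticket", 0), ("virus cases", 1)]

ranking = rank_models({"overlap": word_overlap_scorer, "constant": constant_scorer},
                      sections, labelled_queries)
for name, accuracy in ranking:
    print(f"{name}: {accuracy:.2f}")
```

With sections and labelled queries written in Turkish, the same loop would give a rough per-model number to compare instead of eyeballing individual results.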
Issue should now be resolved. Tokenization can be disabled by setting the config option: Embeddings({"method": "transformers", "path": "/path/to/model", "tokenize": False}) |
Hi there, this is very beautiful work. I want to use this API for languages other than English. How can I use models for other languages from https://huggingface.co/models?search=turkish or other sources?
Can anyone help me with this one?