[TASK] Another language support for GoLLIE (more specifically Vietnamese) #9

Open
NoAtmosphere0 opened this issue Oct 30, 2023 · 3 comments

Comments

@NoAtmosphere0

Hi GoLLIE research team, I am part of a group of Vietnamese university students who will present your paper at an upcoming seminar in our "Introduction to Natural Language Processing" course. Our task is to summarize and explain the contents of your paper to our fellow students and lecturers.

To make it easier to understand for our classmates, we are interested in training GoLLIE using Vietnamese datasets. If it's possible, we would greatly appreciate it if you could provide us with some instructions on how to proceed with this. We sincerely enjoyed reading your paper and believe that it would greatly benefit our presentation.

Here are some datasets for the named-entity-recognition subtask that I found on Hugging Face:

We would be extremely grateful for any guidance or assistance with this effort. Please feel free to reach out if you have any questions or require more information from us. We are more than willing to cooperate to make this collaboration successful.

@ikergarcia1996
Member

Hi @NoAtmosphere0!

I believe the easiest way to achieve this would be to fine-tune one of the GoLLIE checkpoints on a Vietnamese dataset. Wikiann and Polyglot NER seem like the best candidates, since they use the same labels as CoNLL03. To fine-tune your model with either of these datasets, you should:

  1. Duplicate the CoNLL03 config and create a new JSON config for Wikiann or Polyglot from it: https://github.com/hitz-zentroa/GoLLIE/blob/main/configs/data_configs/conll03_config.json. Replace the values of "train_file", "dev_file", and "test_file" with the paths to your datasets in .conll format (.tsv).
  2. Modify the generate data script: https://github.com/hitz-zentroa/GoLLIE/blob/main/bash_scripts/generate_data.sh. Delete all the config files listed there and add the ones you created in step 1. Then run the script.
  3. Modify the GoLLIE7B config file: https://github.com/hitz-zentroa/GoLLIE/blob/main/configs/model_configs/GoLLIE-7B_CodeLLaMA.yaml. Remove all the tasks and add the ones you have just created. Change the model from codellama/CodeLlama-7b-hf to HiTZ/GoLLIE-7B.
  4. Once training finishes, the output folder will contain the new LoRA adapters for GoLLIE. You can load them with the load_model function found here: https://github.com/hitz-zentroa/GoLLIE/blob/main/src/model/load_model.py.
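
Step 1 could be scripted roughly as follows. This is only a sketch under stated assumptions, not code from the GoLLIE repository: `write_conll_tsv` and `derive_config` are hypothetical helper names, the two-column `token<TAB>tag` layout is an assumption about what ".conll format (.tsv)" means here, and "train_file", "dev_file", and "test_file" are assumed to be top-level keys of the JSON config. All file paths are placeholders.

```python
import json
import tempfile
from pathlib import Path

FILE_KEYS = ("train_file", "dev_file", "test_file")


def write_conll_tsv(sentences, path):
    """Write (tokens, tags) pairs as token<TAB>tag lines, one blank line per sentence.

    ASSUMPTION: a two-column, tab-separated layout. Compare it with the
    CoNLL03 files the original config points to before relying on it.
    """
    with open(path, "w", encoding="utf-8") as f:
        for tokens, tags in sentences:
            if len(tokens) != len(tags):
                raise ValueError("tokens and tags must have the same length")
            for token, tag in zip(tokens, tags):
                f.write(f"{token}\t{tag}\n")
            f.write("\n")  # blank line marks the end of a sentence


def derive_config(base_config_path, new_config_path, train_file, dev_file, test_file):
    """Copy a data config and replace only the three dataset paths.

    ASSUMPTION: the three keys are top-level entries of the JSON object.
    Every other entry is copied unchanged.
    """
    with open(base_config_path, encoding="utf-8") as f:
        config = json.load(f)

    missing = [key for key in FILE_KEYS if key not in config]
    if missing:
        raise KeyError(f"base config has no top-level keys: {missing}")

    config.update(dict(zip(FILE_KEYS, (train_file, dev_file, test_file))))

    with open(new_config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4, ensure_ascii=False)
    return config


if __name__ == "__main__":
    # Placeholder demo in a temporary directory; replace with real paths.
    workdir = Path(tempfile.mkdtemp())
    train_path = workdir / "vi_train.tsv"
    write_conll_tsv([(["Hà", "Nội", "đẹp"], ["B-LOC", "I-LOC", "O"])], train_path)

    # Stand-in for a downloaded copy of conll03_config.json.
    base_path = workdir / "base_config.json"
    base_path.write_text(
        json.dumps({"train_file": "a", "dev_file": "b", "test_file": "c"}),
        encoding="utf-8",
    )
    new_config = derive_config(
        base_path,
        workdir / "vi_config.json",
        str(train_path),
        str(workdir / "vi_dev.tsv"),
        str(workdir / "vi_test.tsv"),
    )
    print(new_config)
```

Copying the base config and overwriting only the three paths keeps every other setting identical to CoNLL03, which matters here because the datasets were picked for sharing its labels.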

A significant concern here is how proficient LLaMA2/CodeLLaMA is in Vietnamese. The model might not handle that language well, and unfortunately, there is only a limited selection of multilingual LLMs available.

@NoAtmosphere0
Author

Hi @ikergarcia1996!

Thank you for your prompt response and helpful instructions. We will follow the steps that you have outlined in your response to train GoLLIE and also keep in mind your concerns about the proficiency of LLaMA2/CodeLLaMA in Vietnamese.

We will leave this issue open to keep you updated on our progress, and we will let you know if we have any questions or need further assistance. Thanks again for your support!

@brunoalano

@NoAtmosphere0 Did you make any progress on that?
