Burra - Somos NLP Hackathon 2023

This repo contains the code submitted by @zolastro, @luisetex, @coronadoh and @maxserras to the Somos NLP Hackathon 2023.

The main objective of this project was to clean and improve the Somos Alpaca ES dataset so that it can be used to instruction-tune Llama in Spanish.

Context

One of the goals of the Somos NLP Hackathon 2023 was to train an Alpaca model in Spanish. To that end, the organizers translated the original Alpaca dataset, which can be found here.

The original corpus had multiple issues, such as:

- Hallucinations caused by unprocessable inputs (images, URLs, etc.)
- Inconsistencies such as empty outputs
- Wrong answers, mostly to math problems

Also, the corpus has undergone almost no evaluation for bias, hate speech and similar problems.

In addition, once the corpus was translated to Spanish, we found other issues, such as:

- Inconsistencies in the translation
- Sentences that were not translated
- Inability to propagate labels from the original EN corpus due to the lack of alignment
- Mixed sentences

The goal of this project was to clean the dataset and improve it to train a Llama model in Spanish.

Outcomes

The main outcomes delivered by the Burra team are:

- SetFit model to detect unprocessable samples (images, URLs, etc.)
  - Same model using the paraphrase-multilingual-mpnet-v2 model, see Model Card
- LangID filtering algorithm to detect samples that were not correctly translated to Spanish
- Alignment algorithm between the Alpaca EN dataset and the Alpaca ES dataset
- Evaluation of a Bias Detection algorithm and a Hate Speech Detection algorithm over the Alpaca EN dataset, with the results propagated to the Alpaca ES dataset
- Dataset available in a HuggingFace Space and on HF Datasets
- This repo, containing the experiments and code used
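
The README does not say which language identifier the LangID filter relies on, so the following is only a minimal sketch of the idea. A tiny stopword-count heuristic stands in for a real language-identification model; the stopword lists, the function names and the `output` field name are all assumptions made for illustration, not the code in this repo.

```python
import re

# Hand-picked stopword lists (assumption: illustrative only, far from complete).
ES_STOPWORDS = {"el", "la", "los", "las", "de", "que", "y", "en", "un", "una",
                "por", "con", "para", "se", "del", "al", "lo", "como"}
EN_STOPWORDS = {"the", "of", "and", "to", "in", "is", "that", "for", "with",
                "on", "an", "it", "this", "are", "be", "not", "by", "from"}


def guess_language(text):
    """Return 'es', 'en' or 'unknown' by comparing stopword counts."""
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    es_hits = sum(token in ES_STOPWORDS for token in tokens)
    en_hits = sum(token in EN_STOPWORDS for token in tokens)
    if es_hits == en_hits:
        return "unknown"
    return "es" if es_hits > en_hits else "en"


def find_untranslated(samples):
    """Return the samples whose 'output' text looks English rather than Spanish."""
    return [s for s in samples if guess_language(s["output"]) == "en"]
```

Samples flagged this way would still need a manual check, since short or mixed-language texts can easily fool a heuristic this simple.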

Other experiments & Future work

In addition, we tried some other lines of work and research that we would like to share:

- Using ChatGPT 3.5 to identify inconsistent samples:
  - We tried some prompt engineering, which can be found in the `src/prompting` folder.
  - Some results are available at the HF Argilla Space under the name "somos-alpaca-es-analysis-chatgpt3_5_turbo".
- First trials of fine-tuning the Bertin LLaMA: `src/llama/fine_tune_llama.ipynb`
  - Disclaimer: we dedicated almost zero time to this, so we don't guarantee that either the script or the results are correct.

Other tasks remain as future work:

- Generate more training samples using other open-licensed LLMs
- Improve the prompting
- Train self-correction / reflection over the Alpaca ES dataset
- Train another SetFit model for math problem detection

How to use this repo

Before running anything, add the src folder to the PYTHONPATH:

export PYTHONPATH=$PYTHONPATH:$(pwd)/src

Install the dependencies with pip:

pip install -r requirements.txt

Then create the config/envs.json file with the environment variables the project needs, such as:

- HF_TOKEN: token to access HuggingFace datasets, push models, etc.
- HUB_DATASET_NAME: name of the HF dataset to sync the changes to
- OPENAI_TOKEN: token for the OpenAI client
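
The exact layout of config/envs.json is not described here, so the sketch below assumes a flat JSON object with one entry per variable. It writes a placeholder file and checks that the three keys listed above are present; the helper names `write_template` and `load_envs` are hypothetical, and the real project may expect additional keys.

```python
import json
from pathlib import Path

# Keys named in this README; the list may be incomplete (assumption).
REQUIRED_KEYS = ("HF_TOKEN", "HUB_DATASET_NAME", "OPENAI_TOKEN")


def write_template(path="config/envs.json"):
    """Write a placeholder config file to be filled in by hand."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    template = {key: "<fill me in>" for key in REQUIRED_KEYS}
    target.write_text(json.dumps(template, indent=2), encoding="utf-8")
    return target


def load_envs(path="config/envs.json"):
    """Load the JSON config and fail early if a listed key is missing."""
    with Path(path).open(encoding="utf-8") as handle:
        envs = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in envs]
    if missing:
        raise KeyError(f"Missing keys in {path}: {', '.join(missing)}")
    return envs
```

Since the file holds secrets, it is worth making sure it stays out of version control.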

After that, all the entry points follow the same pattern:

$ python main.py "command"

Where the commands can be:

- "train-unprocessable-setfit": train the SetFit model to detect unprocessable samples
- "predict-unprocessable-setfit": predict the unprocessable samples using the SetFit model
- "train-predict-setfit": train and predict the unprocessable samples using the SetFit model
- "save": save the progress from the Argilla Space to the HF Dataset
- "align": align the Alpaca EN dataset with the Alpaca ES dataset
- "enrich": enrich the Alpaca ES dataset with the results of the Bias Detection and Hate Speech Detection models
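
The contents of main.py are not shown in this README. As a rough illustration of the single-entrypoint pattern described above, a dispatcher could look like the sketch below. The command names come from the list above; `build_parser`, `run` and the `handlers` mapping are hypothetical names, and the real script may be organized differently.

```python
import argparse

# Command names taken from the README; descriptions shortened.
COMMANDS = {
    "train-unprocessable-setfit": "train the SetFit model",
    "predict-unprocessable-setfit": "predict with the SetFit model",
    "train-predict-setfit": "train, then predict",
    "save": "save Argilla progress to the HF dataset",
    "align": "align the Alpaca EN and ES datasets",
    "enrich": "add bias and hate speech results to Alpaca ES",
}


def build_parser():
    """Build a parser that accepts exactly one of the known commands."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("command", choices=sorted(COMMANDS), help="task to run")
    return parser


def run(argv, handlers):
    """Parse argv and call the handler registered for the chosen command."""
    args = build_parser().parse_args(argv)
    return handlers[args.command]()
```

With this layout, an unknown command is rejected by argparse before any handler runs, which keeps typos from silently doing nothing.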
